Abstract
A feature flag evaluation service handling ~500k req/min across 300 Django pods was hitting hard scaling limits: cascading failures from database pressure, fragile connection pooling, complex timeouts, and ballooning costs. When 10x growth projections made it clear the architecture wouldn't hold, the team rewrote the service in Rust in under three quarters. This talk covers the full arc – the decisions, the execution, the results, and the consequences:
- Choosing Rust: Evaluating FastAPI, Node.js/HyperExpress, Go, and Rust/axum – and why Rust won
- Architecture: Moving evaluation from SQL into application code, layered caching, parallel flag computation, and replacing PgBouncer with native connection pooling
- Execution: How a small team shipped quickly via incremental rollout and testing against production traffic
- Results: p99 latency down from 904 ms to 85 ms, compute costs down 68%
- Beyond: New failure modes, operational surprises at scale, and the ways "simpler architecture" introduces its own complexity
This talk aims to go beyond the typical rewrite narrative – covering not just why and how, but what happens when the dust settles and you're the one carrying the pager.