Scaling to 100 Million Users: What We Learned
When you're handling billions of messages for millions of users, you learn things that tutorials don't teach. Here are the lessons that cost us sleep, so they don't have to cost you yours.
Start with the Boring Stuff
Before we talk about distributed systems and message queues, let's talk about the stuff that actually matters early on:
1. Your Database Indexes
In our experience, 90% of early scaling problems are database problems, and 90% of database problems are index problems.
We've seen startups spend weeks optimising code when a single index would have fixed everything. Before you reach for caching, before you think about microservices, profile your queries.
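To make that concrete, here's a minimal sketch of the workflow we mean, assuming PostgreSQL and the psycopg2 driver; the messages table and its columns are hypothetical:

```python
# A minimal query-profiling sketch, assuming PostgreSQL with psycopg2.
# The "messages" table and its columns are hypothetical.
import psycopg2

conn = psycopg2.connect("dbname=app user=app")  # adjust the DSN for your setup
conn.autocommit = True  # CREATE INDEX CONCURRENTLY can't run inside a transaction

with conn.cursor() as cur:
    # Step 1: see what the planner actually does with your hottest query.
    cur.execute("""
        EXPLAIN ANALYZE
        SELECT id, body, created_at
        FROM messages
        WHERE conversation_id = %s
        ORDER BY created_at DESC
        LIMIT 50
    """, (12345,))
    for (line,) in cur.fetchall():
        print(line)  # a "Seq Scan" on a large table is the smell to look for

    # Step 2: if it is a sequential scan, a composite index usually fixes it.
    # CONCURRENTLY avoids locking writes while the index builds.
    cur.execute("""
        CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_messages_conv_created
        ON messages (conversation_id, created_at DESC)
    """)
```

Re-run the EXPLAIN afterwards: if the plan has switched from a sequential scan to an index scan, you've fixed the query without touching a line of application code.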
2. Connection Pooling
When you go from 100 to 10,000 concurrent users, your database connections become a bottleneck. Set up connection pooling early. It's boring, it's unglamorous, and it'll save you a 3am incident response.
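Here's one hedged example of what that looks like with SQLAlchemy; the connection URL and pool numbers are placeholders you'd size against your own database's max_connections, not against your traffic:

```python
# A minimal pooling sketch, assuming SQLAlchemy on top of PostgreSQL.
# The URL and every pool number below are placeholders.
from sqlalchemy import create_engine, text

engine = create_engine(
    "postgresql+psycopg2://app:secret@db.internal/app",
    pool_size=10,        # steady-state connections kept open per process
    max_overflow=20,     # extra connections allowed during bursts
    pool_timeout=5,      # fail fast instead of queueing forever
    pool_pre_ping=True,  # drop dead connections before handing them out
    pool_recycle=1800,   # recycle connections older than 30 minutes
)

# Every checkout reuses a pooled connection instead of opening a new one.
with engine.connect() as conn:
    total = conn.execute(text("SELECT count(*) FROM messages")).scalar_one()
```

Remember that every application process gets its own pool, so ten web workers with pool_size=10 is already a hundred connections, which is why a server-side pooler such as PgBouncer often sits in front of the database as well.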
3. Observability From Day One
You can't fix what you can't see. We've worked with teams who built elaborate systems but couldn't answer basic questions like "what's our p99 latency?" or "which endpoints are slowest?"
Set up proper logging, metrics, and tracing before you need them. It's much harder to retrofit.
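As one illustrative sketch (not the only way to do it), here's per-endpoint latency instrumentation with prometheus_client; the endpoint name and port are placeholders:

```python
# A minimal metrics sketch using prometheus_client; the endpoint name and
# port are placeholders. A histogram like this is what lets you answer
# "what's our p99 latency?" and "which endpoints are slowest?".
import time
from prometheus_client import Histogram, start_http_server

REQUEST_LATENCY = Histogram(
    "http_request_duration_seconds",
    "Request latency by endpoint",
    ["endpoint"],
)

def handle_request(endpoint: str) -> None:
    # Time the handler and record the observation under its endpoint label.
    with REQUEST_LATENCY.labels(endpoint=endpoint).time():
        time.sleep(0.05)  # stand-in for real work

if __name__ == "__main__":
    start_http_server(9100)  # Prometheus scrapes /metrics on this port
    while True:
        handle_request("/api/messages")
```

With a histogram in place, the p99 becomes a one-line query against your metrics store instead of an archaeology project.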
The Scaling Journey
Here's roughly how things went for our largest platforms:
0 to 10,000 users
- Single server
- Simple architecture
- Focus on product, not infrastructure
10,000 to 100,000 users
- Add a load balancer
- Separate database server
- Introduce caching (Redis, usually; see the cache-aside sketch after this list)
- Start monitoring properly
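As promised above, here's a minimal cache-aside sketch using redis-py; the key naming, the TTL, and the fetch_profile_from_db helper are all hypothetical stand-ins:

```python
# A minimal cache-aside sketch with redis-py; key names, TTL, and the
# fetch_profile_from_db helper are hypothetical.
import json
import redis

cache = redis.Redis(host="cache.internal", port=6379, decode_responses=True)

def fetch_profile_from_db(user_id: int) -> dict:
    # Stand-in for the real database query.
    return {"id": user_id, "name": "example"}

def get_user_profile(user_id: int) -> dict:
    key = f"user:profile:{user_id}"
    cached = cache.get(key)
    if cached is not None:
        return json.loads(cached)  # cache hit: the database never sees this request

    profile = fetch_profile_from_db(user_id)     # cache miss: go to the source
    cache.setex(key, 300, json.dumps(profile))   # keep it for 5 minutes
    return profile
```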
100,000 to 1,000,000 users
- Read replicas for the database (a read/write routing sketch follows this list)
- CDN for static assets
- Message queues for background work
- Consider microservices (but be careful)
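Here's the read/write routing idea from the list above in miniature, again with SQLAlchemy; the hostnames are placeholders, and the caveat about replication lag in the comment is the part that actually matters:

```python
# A minimal read/write routing sketch with SQLAlchemy; hostnames are
# placeholders. Writes go to the primary, read-only queries to a replica.
from sqlalchemy import create_engine, text

primary = create_engine("postgresql+psycopg2://app@db-primary.internal/app")
replica = create_engine("postgresql+psycopg2://app@db-replica.internal/app")

def engine_for(readonly: bool):
    # Route by intent. Replicas lag behind the primary, so any
    # read-your-own-writes path should still hit the primary.
    return replica if readonly else primary

with engine_for(readonly=True).connect() as conn:
    unread = conn.execute(
        text("SELECT count(*) FROM messages WHERE read_at IS NULL")
    ).scalar_one()
```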
1,000,000+ users
- Multi-region deployment
- Sophisticated caching strategies
- Event-driven architecture
- Dedicated infrastructure team
The Mistakes We Made
Let's be honest about what didn't work:
Premature Optimisation
We once spent a month building a custom caching solution before we had 1,000 users. Don't do this. Solve problems when they become problems.
Microservices Too Early
Microservices solve specific problems - usually organisational ones. If you have a small team, a monolith is probably faster and simpler. We've seen startups drown in microservice complexity when a well-structured monolith would have served them for years.
Underestimating Data Growth
Messages add up fast. Billions of them. We underestimated storage and backup requirements by orders of magnitude. Plan for 10x more data than you think you'll have.
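To see how quickly it compounds, here's a back-of-envelope calculation; every number in it is an assumption, which is rather the point:

```python
# Back-of-envelope storage estimate; every input here is an assumption,
# so the output gets multiplied by a safety factor.
messages_per_user_per_day = 40
avg_message_bytes = 1_000          # body plus metadata plus index overhead
users = 10_000_000
retention_days = 365
safety_factor = 10                 # the "plan for 10x" rule above

raw_bytes = users * messages_per_user_per_day * avg_message_bytes * retention_days
planned_bytes = raw_bytes * safety_factor

print(f"raw:     {raw_bytes / 1e12:,.0f} TB")      # ~146 TB
print(f"planned: {planned_bytes / 1e12:,.0f} TB")  # ~1,460 TB

# And that's before replicas and backups multiply it again.
```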
What Actually Worked
Horizontal Scaling
Build stateless services from the start. It's much easier to add servers than to refactor for statelessness later.
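A hedged sketch of what "stateless" means in practice: session state lives in Redis (or any shared store) rather than in server memory, so any instance can serve any request. The key names and TTL here are illustrative:

```python
# A minimal "keep state out of the process" sketch. Because sessions live
# in Redis instead of server memory, adding servers is just adding servers;
# no sticky sessions, no server affinity. Key names and TTL are placeholders.
import json
import uuid
import redis

sessions = redis.Redis(host="cache.internal", decode_responses=True)

def create_session(user_id: int) -> str:
    token = uuid.uuid4().hex
    sessions.setex(f"session:{token}", 86_400, json.dumps({"user_id": user_id}))
    return token  # the client holds only the token; any instance can resolve it

def load_session(token: str) -> dict | None:
    data = sessions.get(f"session:{token}")
    return json.loads(data) if data else None
```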
Async Everything
Anything that can be done asynchronously should be. Send email? Queue it. Process an image? Queue it. Generate a report? Queue it.
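As a minimal illustration, here's the shape of it using a Redis list as the queue; the queue name, payload, and send_email helper are hypothetical, and in production you'd likely reach for a proper broker (Celery, SQS, RabbitMQ), but the pattern is the same: the request handler only enqueues and returns.

```python
# A minimal producer/worker sketch using a Redis list as the queue; queue
# name, payload shape, and the send_email helper are hypothetical.
import json
import redis

broker = redis.Redis(host="queue.internal", decode_responses=True)

def enqueue_email(to: str, template: str) -> None:
    # Called from the request path: O(1), no SMTP round-trip, no waiting.
    broker.lpush("jobs:email", json.dumps({"to": to, "template": template}))

def worker_loop() -> None:
    # Runs in a separate process; a slow mail provider never blocks a user request.
    while True:
        _, raw = broker.brpop("jobs:email")
        job = json.loads(raw)
        send_email(job["to"], job["template"])

def send_email(to: str, template: str) -> None:
    print(f"sending {template} to {to}")  # stand-in for the real SMTP/API call
```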
Graceful Degradation
When things go wrong (and they will), fail gracefully. Show cached data, disable non-essential features, communicate with users. A degraded experience beats no experience.
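One hedged sketch of the pattern: serve the last good copy when the live call fails, and an empty-but-honest response when there's nothing cached. The fetch_timeline call and key names are illustrative:

```python
# A minimal degradation sketch: fall back to the last good copy on failure,
# and flag the response as stale so the UI can say so. Names are hypothetical.
import json
import redis

cache = redis.Redis(host="cache.internal", decode_responses=True)

def fetch_timeline(user_id: int) -> list:
    raise TimeoutError("upstream timed out")  # stand-in failure for the sketch

def get_timeline(user_id: int) -> dict:
    try:
        timeline = fetch_timeline(user_id)  # happy path: fresh data
        cache.setex(f"timeline:{user_id}", 600, json.dumps(timeline))
        return {"items": timeline, "stale": False}
    except Exception:
        stale = cache.get(f"timeline:{user_id}")
        if stale is not None:
            # Degraded: old data, clearly flagged, instead of an error page.
            return {"items": json.loads(stale), "stale": True}
        return {"items": [], "stale": True}  # worst case: empty, but not broken
```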
The Non-Technical Stuff
Technical scaling is half the battle. The other half:
- On-call rotations: Someone needs to be available. Make it sustainable.
- Runbooks: Document how to fix common problems. At 3am, you won't remember.
- Post-mortems: Learn from incidents. Blame the system, not people.
Working on something that needs to scale? We'd love to help.