Scaling to 100 Million Users: What We Learned
When you're handling billions of messages for millions of users, you learn things that tutorials don't teach. Here are the lessons that cost us sleep, so they don't have to cost you yours.
Start with the Boring Stuff
Before we talk about distributed systems and message queues, let's talk about the stuff that actually matters early on:
1. Your Database Indexes
In our experience, 90% of early scaling problems are database problems, and 90% of database problems are index problems.
We've seen startups spend weeks optimising code when a single index would have fixed everything. Before you reach for caching, before you think about microservices, profile your queries.
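To make that concrete, here's a minimal sketch of the workflow we mean, assuming PostgreSQL and the psycopg2 driver; the messages table and its columns are hypothetical:

```python
# A minimal query-profiling sketch, assuming PostgreSQL with psycopg2.
# The "messages" table and its columns are hypothetical.
import psycopg2

conn = psycopg2.connect("dbname=app user=app")  # adjust the DSN for your setup
conn.autocommit = True  # CREATE INDEX CONCURRENTLY can't run inside a transaction

with conn.cursor() as cur:
    # Step 1: see what the planner actually does with your hottest query.
    cur.execute("""
        EXPLAIN ANALYZE
        SELECT id, body, created_at
        FROM messages
        WHERE conversation_id = %s
        ORDER BY created_at DESC
        LIMIT 50
    """, (12345,))
    for (line,) in cur.fetchall():
        print(line)  # a "Seq Scan" on a large table is the smell to look for

    # Step 2: if it is a sequential scan, a composite index usually fixes it.
    # CONCURRENTLY avoids locking writes while the index builds.
    cur.execute("""
        CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_messages_conv_created
        ON messages (conversation_id, created_at DESC)
    """)
```

Re-run the EXPLAIN afterwards: if the plan has switched from a sequential scan to an index scan, you've fixed the query without touching a line of application code.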
2. Connection Pooling
When you go from 100 to 10,000 concurrent users, your database connections become a bottleneck. Set up connection pooling early. It's boring, it's unglamorous, and it'll save you a 3am incident response.
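Here's one hedged example of what that looks like with SQLAlchemy; the connection URL and pool numbers are placeholders you'd size against your own database's max_connections, not against your traffic:

```python
# A minimal pooling sketch, assuming SQLAlchemy on top of PostgreSQL.
# The URL and every pool number below are placeholders.
from sqlalchemy import create_engine, text

engine = create_engine(
    "postgresql+psycopg2://app:secret@db.internal/app",
    pool_size=10,        # steady-state connections kept open per process
    max_overflow=20,     # extra connections allowed during bursts
    pool_timeout=5,      # fail fast instead of queueing forever
    pool_pre_ping=True,  # drop dead connections before handing them out
    pool_recycle=1800,   # recycle connections older than 30 minutes
)

# Every checkout reuses a pooled connection instead of opening a new one.
with engine.connect() as conn:
    total = conn.execute(text("SELECT count(*) FROM messages")).scalar_one()
```

Remember that every application process gets its own pool, so ten web workers with pool_size=10 is already a hundred connections, which is why a server-side pooler such as PgBouncer often sits in front of the database as well.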
3. Observability From Day One
You can't fix what you can't see. We've worked with teams who built elaborate systems but couldn't answer basic questions like "what's our p99 latency?" or "which endpoints are slowest?"
Set up proper logging, metrics, and tracing before you need them. It's much harder to retrofit.
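As one illustrative sketch (not the only way to do it), here's per-endpoint latency instrumentation with prometheus_client; the endpoint name and port are placeholders:

```python
# A minimal metrics sketch using prometheus_client; the endpoint name and
# port are placeholders. A histogram like this is what lets you answer
# "what's our p99 latency?" and "which endpoints are slowest?".
import time
from prometheus_client import Histogram, start_http_server

REQUEST_LATENCY = Histogram(
    "http_request_duration_seconds",
    "Request latency by endpoint",
    ["endpoint"],
)

def handle_request(endpoint: str) -> None:
    # Time the handler and record the observation under its endpoint label.
    with REQUEST_LATENCY.labels(endpoint=endpoint).time():
        time.sleep(0.05)  # stand-in for real work

if __name__ == "__main__":
    start_http_server(9100)  # Prometheus scrapes /metrics on this port
    while True:
        handle_request("/api/messages")
```

With a histogram in place, the p99 becomes a one-line query against your metrics store instead of an archaeology project.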
The Scaling Journey
Here's roughly how things went for our largest platforms:
0 to 10,000 users
- Single server
- Simple architecture
- Focus on product, not infrastructure
10,000 to 100,000 users
- Add a load balancer
- Separate database server
- Introduce caching (Redis, usually; see the cache-aside sketch after this list)
- Start monitoring properly
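As promised above, here's a minimal cache-aside sketch using redis-py; the key naming, the TTL, and the fetch_profile_from_db helper are all hypothetical stand-ins:

```python
# A minimal cache-aside sketch with redis-py; key names, TTL, and the
# fetch_profile_from_db helper are hypothetical.
import json
import redis

cache = redis.Redis(host="cache.internal", port=6379, decode_responses=True)

def fetch_profile_from_db(user_id: int) -> dict:
    # Stand-in for the real database query.
    return {"id": user_id, "name": "example"}

def get_user_profile(user_id: int) -> dict:
    key = f"user:profile:{user_id}"
    cached = cache.get(key)
    if cached is not None:
        return json.loads(cached)  # cache hit: the database never sees this request

    profile = fetch_profile_from_db(user_id)     # cache miss: go to the source
    cache.setex(key, 300, json.dumps(profile))   # keep it for 5 minutes
    return profile
```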
100,000 to 1,000,000 users
- Read replicas for the database (a read/write routing sketch follows this list)
- CDN for static assets
- Message queues for background work
- Consider microservices (but be careful)
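Here's the read/write routing idea from the list above in miniature, again with SQLAlchemy; the hostnames are placeholders, and the caveat about replication lag in the comment is the part that actually matters:

```python
# A minimal read/write routing sketch with SQLAlchemy; hostnames are
# placeholders. Writes go to the primary, read-only queries to a replica.
from sqlalchemy import create_engine, text

primary = create_engine("postgresql+psycopg2://app@db-primary.internal/app")
replica = create_engine("postgresql+psycopg2://app@db-replica.internal/app")

def engine_for(readonly: bool):
    # Route by intent. Replicas lag behind the primary, so any
    # read-your-own-writes path should still hit the primary.
    return replica if readonly else primary

with engine_for(readonly=True).connect() as conn:
    unread = conn.execute(
        text("SELECT count(*) FROM messages WHERE read_at IS NULL")
    ).scalar_one()
```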
1,000,000+ users
- Multi-region deployment
- Sophisticated caching strategies
- Event-driven architecture
- Dedicated infrastructure team
The Mistakes We Made
Let's be honest about what didn't work:
Premature Optimisation
We once spent a month building a custom caching solution before we had 1,000 users. Don't do this. Solve problems when they become problems.
Microservices Too Early
Microservices solve specific problems - usually organisational ones. If you have a small team, a monolith is probably faster and simpler. We've seen startups drown in microservice complexity when a well-structured monolith would have served them for years.
Underestimating Data Growth
Messages add up fast. Billions of them. We underestimated storage and backup requirements by orders of magnitude. Plan for 10x more data than you think you'll have.
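To see how quickly it compounds, here's a back-of-envelope calculation; every number in it is an assumption, which is rather the point:

```python
# Back-of-envelope storage estimate; every input here is an assumption,
# so the output gets multiplied by a safety factor.
messages_per_user_per_day = 40
avg_message_bytes = 1_000          # body plus metadata plus index overhead
users = 10_000_000
retention_days = 365
safety_factor = 10                 # the "plan for 10x" rule above

raw_bytes = users * messages_per_user_per_day * avg_message_bytes * retention_days
planned_bytes = raw_bytes * safety_factor

print(f"raw:     {raw_bytes / 1e12:,.0f} TB")      # ~146 TB
print(f"planned: {planned_bytes / 1e12:,.0f} TB")  # ~1,460 TB

# And that's before replicas and backups multiply it again.
```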
What Actually Worked
Horizontal Scaling
Build stateless services from the start. It's much easier to add servers than to refactor for statelessness later.
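A hedged sketch of what "stateless" means in practice: session state lives in Redis (or any shared store) rather than in server memory, so any instance can serve any request. The key names and TTL here are illustrative:

```python
# A minimal "keep state out of the process" sketch. Because sessions live
# in Redis instead of server memory, adding servers is just adding servers;
# no sticky sessions, no server affinity. Key names and TTL are placeholders.
import json
import uuid
import redis

sessions = redis.Redis(host="cache.internal", decode_responses=True)

def create_session(user_id: int) -> str:
    token = uuid.uuid4().hex
    sessions.setex(f"session:{token}", 86_400, json.dumps({"user_id": user_id}))
    return token  # the client holds only the token; any instance can resolve it

def load_session(token: str) -> dict | None:
    data = sessions.get(f"session:{token}")
    return json.loads(data) if data else None
```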
Async Everything
Anything that can be done asynchronously should be. Send email? Queue it. Process an image? Queue it. Generate a report? Queue it.
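As a minimal illustration, here's the shape of it using a Redis list as the queue; the queue name, payload, and send_email helper are hypothetical, and in production you'd likely reach for a proper broker (Celery, SQS, RabbitMQ), but the pattern is the same: the request handler only enqueues and returns.

```python
# A minimal producer/worker sketch using a Redis list as the queue; queue
# name, payload shape, and the send_email helper are hypothetical.
import json
import redis

broker = redis.Redis(host="queue.internal", decode_responses=True)

def enqueue_email(to: str, template: str) -> None:
    # Called from the request path: O(1), no SMTP round-trip, no waiting.
    broker.lpush("jobs:email", json.dumps({"to": to, "template": template}))

def worker_loop() -> None:
    # Runs in a separate process; a slow mail provider never blocks a user request.
    while True:
        _, raw = broker.brpop("jobs:email")
        job = json.loads(raw)
        send_email(job["to"], job["template"])

def send_email(to: str, template: str) -> None:
    print(f"sending {template} to {to}")  # stand-in for the real SMTP/API call
```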
Graceful Degradation
When things go wrong (and they will), fail gracefully. Show cached data, disable non-essential features, communicate with users. A degraded experience beats no experience.
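One hedged sketch of the pattern: serve the last good copy when the live call fails, and an empty-but-honest response when there's nothing cached. The fetch_timeline call and key names are illustrative:

```python
# A minimal degradation sketch: fall back to the last good copy on failure,
# and flag the response as stale so the UI can say so. Names are hypothetical.
import json
import redis

cache = redis.Redis(host="cache.internal", decode_responses=True)

def fetch_timeline(user_id: int) -> list:
    raise TimeoutError("upstream timed out")  # stand-in failure for the sketch

def get_timeline(user_id: int) -> dict:
    try:
        timeline = fetch_timeline(user_id)  # happy path: fresh data
        cache.setex(f"timeline:{user_id}", 600, json.dumps(timeline))
        return {"items": timeline, "stale": False}
    except Exception:
        stale = cache.get(f"timeline:{user_id}")
        if stale is not None:
            # Degraded: old data, clearly flagged, instead of an error page.
            return {"items": json.loads(stale), "stale": True}
        return {"items": [], "stale": True}  # worst case: empty, but not broken
```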
The Non-Technical Stuff
Technical scaling is half the battle. The other half:
- On-call rotations: Someone needs to be available. Make it sustainable.
- Runbooks: Document how to fix common problems. At 3am, you won't remember.
- Post-mortems: Learn from incidents. Blame the system, not people.
Working on something that needs to scale? We'd love to help.