Scaling to 100 Million Users: What We Learned

Cooply Team · 28 November 2024 · 8 min read

When you're handling billions of messages for millions of users, you learn things that tutorials don't teach. Here are the lessons that cost us sleep, so they don't have to cost you yours.

Start with the Boring Stuff

Before we talk about distributed systems and message queues, let's talk about the stuff that actually matters early on:

1. Your Database Indexes

90% of early scaling problems are database problems. And 90% of database problems are index problems.

We've seen startups spend weeks optimising code when a single index would have fixed everything. Before you reach for caching, before you think about microservices, profile your queries.
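
To make that concrete, here's a rough sketch of the profiling step against Postgres using psycopg2. The messages table, the conversation_id column, and the connection string are made up for the example; the pattern is what matters.

  import psycopg2

  # Hypothetical connection string and schema, for illustration only.
  conn = psycopg2.connect("dbname=app user=app")
  conn.autocommit = True  # CREATE INDEX CONCURRENTLY cannot run inside a transaction
  cur = conn.cursor()

  # Step 1: ask the database how it actually executes the slow query.
  # A "Seq Scan" over a large table is the classic sign of a missing index.
  cur.execute(
      "EXPLAIN ANALYZE SELECT * FROM messages WHERE conversation_id = %s", (42,)
  )
  for (plan_line,) in cur.fetchall():
      print(plan_line)

  # Step 2: if the plan shows a sequential scan on the filtered column, add the index.
  cur.execute(
      "CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_messages_conversation_id "
      "ON messages (conversation_id)"
  )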

2. Connection Pooling

When you go from 100 to 10,000 concurrent users, your database connections become a bottleneck. Set up connection pooling early. It's boring, it's unglamorous, and it'll save you a 3am incident response.
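
A minimal sketch of what that can look like with SQLAlchemy. The DSN and pool numbers here are placeholders to tune against your own database's connection limit, not recommendations.

  from sqlalchemy import create_engine, text

  # One pooled engine, created once at startup and shared by the whole process.
  engine = create_engine(
      "postgresql+psycopg2://app:secret@db-host/app",  # hypothetical DSN
      pool_size=10,        # connections kept open per process
      max_overflow=5,      # extra connections allowed under burst load
      pool_timeout=30,     # seconds to wait for a free connection before giving up
      pool_pre_ping=True,  # detect and replace stale connections automatically
  )

  def fetch_user(user_id):
      # Each call borrows a connection from the pool and returns it on exit.
      with engine.connect() as conn:
          return conn.execute(
              text("SELECT id, name FROM users WHERE id = :id"), {"id": user_id}
          ).one_or_none()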

3. Observability From Day One

You can't fix what you can't see. We've worked with teams who built elaborate systems but couldn't answer basic questions like "what's our p99 latency?" or "which endpoints are slowest?"

Set up proper logging, metrics, and tracing before you need them. It's much harder to retrofit.
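
As one way to get there, here's a sketch using the Prometheus Python client to record request latency per endpoint, which is what lets you answer the p99 question later. The metric name, endpoint label, and port are assumptions for the example.

  import time
  from prometheus_client import Histogram, start_http_server

  # One histogram, labelled by endpoint, is enough to answer "what's our p99?"
  # via histogram_quantile() once Prometheus scrapes /metrics.
  REQUEST_LATENCY = Histogram(
      "http_request_duration_seconds",
      "Time spent handling a request",
      ["endpoint"],
  )

  def timed(endpoint, handler):
      # Wrap any request handler so its latency is recorded per endpoint.
      start = time.monotonic()
      try:
          return handler()
      finally:
          REQUEST_LATENCY.labels(endpoint=endpoint).observe(time.monotonic() - start)

  start_http_server(9100)  # expose /metrics on port 9100 for Prometheus to scrape
  timed("/health", lambda: "ok")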

The Scaling Journey

Here's roughly how things went for our largest platforms:

0 to 10,000 users

  • Single server
  • Simple architecture
  • Focus on product, not infrastructure

10,000 to 100,000 users

  • Add a load balancer
  • Separate database server
  • Introduce caching (Redis, usually; see the sketch after this list)
  • Start monitoring properly
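
On the caching point, here's a minimal cache-aside sketch with redis-py. The key format, the five-minute TTL, and the load_profile_from_db callback are assumptions for the example.

  import json
  import redis

  cache = redis.Redis(host="localhost", port=6379, db=0)  # hypothetical Redis instance

  def get_profile(user_id, load_profile_from_db):
      # Cache-aside: try Redis first, fall back to the database, then populate the cache.
      key = f"profile:{user_id}"
      cached = cache.get(key)
      if cached is not None:
          return json.loads(cached)
      profile = load_profile_from_db(user_id)      # your existing database query
      cache.setex(key, 300, json.dumps(profile))   # keep it for 5 minutes
      return profile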

100,000 to 1,000,000 users

  • Read replicas for the database (see the routing sketch after this list)
  • CDN for static assets
  • Message queues for background work
  • Consider microservices (but be careful)
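
Read replicas mostly come down to routing: writes go to the primary, and reads that can tolerate a little replication lag go to a replica. Here's a sketch with two SQLAlchemy engines, assuming hypothetical hostnames.

  from sqlalchemy import create_engine, text

  # Hypothetical hostnames; in practice these come from configuration.
  primary = create_engine("postgresql+psycopg2://app@db-primary/app")
  replica = create_engine("postgresql+psycopg2://app@db-replica-1/app")

  def run_read(sql, params=None):
      # Reads that don't need to see their own writes go to the replica.
      with replica.connect() as conn:
          return conn.execute(text(sql), params or {}).fetchall()

  def run_write(sql, params=None):
      # Anything that changes data, or must read its own writes, goes to the primary.
      with primary.begin() as conn:
          conn.execute(text(sql), params or {})

  recent = run_read("SELECT id FROM messages ORDER BY created_at DESC LIMIT 50")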

1,000,000+ users

  • Multi-region deployment
  • Sophisticated caching strategies
  • Event-driven architecture
  • Dedicated infrastructure team

The Mistakes We Made

Let's be honest about what didn't work:

Premature Optimisation

We once spent a month building a custom caching solution before we had 1,000 users. Don't do this. Solve problems when they become problems.

Microservices Too Early

Microservices solve specific problems - usually organisational ones. If you have a small team, a monolith is probably faster and simpler. We've seen startups drown in microservice complexity when a well-structured monolith would have served them for years.

Underestimating Data Growth

Messages add up fast. Billions of them. We underestimated storage and backup requirements by orders of magnitude. Plan for 10x more data than you think you'll have.

What Actually Worked

Horizontal Scaling

Build stateless services from the start. It's much easier to add servers than to refactor for statelessness later.
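
In practice, "stateless" means nothing a request needs lives in one server's memory. Session data, for example, goes to a shared store so any server behind the load balancer can handle the next request. Here's a sketch of that idea using Redis for sessions; the key format and one-hour TTL are assumptions.

  import json
  import uuid
  import redis

  sessions = redis.Redis(host="localhost", port=6379, db=1)  # hypothetical shared store

  def create_session(user_id):
      # Stored in Redis rather than a per-process dict, so any server
      # behind the load balancer can serve this session's next request.
      session_id = str(uuid.uuid4())
      sessions.setex(f"session:{session_id}", 3600, json.dumps({"user_id": user_id}))
      return session_id

  def load_session(session_id):
      raw = sessions.get(f"session:{session_id}")
      return json.loads(raw) if raw is not None else None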

Async Everything

Anything that can be done asynchronously should be. Send email? Queue it. Process an image? Queue it. Generate a report? Queue it.
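
Here's a sketch of the "queue it" pattern using Celery with a Redis broker. The broker URL, retry settings, and task body are assumptions; any queue you already run will do the same job.

  from celery import Celery

  # One Celery app backed by a Redis broker (the URL is a placeholder).
  app = Celery("tasks", broker="redis://localhost:6379/0")

  @app.task(bind=True, max_retries=3, default_retry_delay=30)
  def send_welcome_email(self, user_id):
      try:
          # Hypothetical body: in a real task this would call your mail provider.
          print(f"sending welcome email to user {user_id}")
      except Exception as exc:
          # Retry later instead of failing the web request that queued this.
          raise self.retry(exc=exc)

  # In the web request handler: enqueue and return to the user immediately.
  # send_welcome_email.delay(user_id=42)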

Graceful Degradation

When things go wrong (and they will), fail gracefully. Show cached data, disable non-essential features, communicate with users. A degraded experience beats no experience.
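
As one concrete shape this can take, the sketch below serves the last cached copy when the live call fails and flags the response as stale so the UI can say so. The cache layout and the fetch_live_feed callback are assumptions.

  import json
  import logging
  import redis

  cache = redis.Redis(host="localhost", port=6379, db=0)  # hypothetical cache
  log = logging.getLogger("feed")

  def get_feed(user_id, fetch_live_feed):
      key = f"feed:{user_id}"
      try:
          feed = fetch_live_feed(user_id)          # the normal, fresh path
          cache.setex(key, 600, json.dumps(feed))  # keep a stale copy for bad days
          return {"feed": feed, "stale": False}
      except Exception:
          log.exception("live feed failed, serving cached copy")
          cached = cache.get(key)
          if cached is not None:
              # Degraded but usable: show the last known feed and say so in the UI.
              return {"feed": json.loads(cached), "stale": True}
          # Nothing cached either: disable the feature rather than erroring the whole page.
          return {"feed": [], "stale": True}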

The Non-Technical Stuff

Technical scaling is half the battle. The other half:

  • On-call rotations: Someone needs to be available. Make it sustainable.
  • Runbooks: Document how to fix common problems. At 3am, you won't remember.
  • Post-mortems: Learn from incidents. Blame the system, not people.

Working on something that needs to scale? We'd love to help.
