Scale to get big

News from the conference room: this is one of a series of blog posts in which blogging experts briefly review key Tech4Africa 2010 talks and panels from Days 1 and 2.

Day 1

There’s going to be a flood of data as more people connect more often with their mobiles. Says Joe Stump (co-founder of SimpleGeo, previously main dev bod at Digg): “Each smartphone has six-plus sensors, and it’s not long before they add barometers and temperature sensors and more. Data production is following Moore’s Law.”

He did a simple calculation, working out what would happen if you tagged the phones (just time and location) of all 500 million Facebook users once every minute.
Just this little addition would add 37.2GB of data every minute to the piles that already need to be crunched.
He asks: “How are we going to store, scale and serve this mess?”
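His arithmetic is easy to sanity-check. Here is a minimal back-of-envelope sketch; the per-record size is reverse-engineered from his figure (he never stated it), and the rest – one tag per user per minute – comes straight from the talk:

```python
# Back-of-envelope check of Stump's figure. The per-record size is implied,
# not something he stated: 37.2GB spread over 500 million records works out
# to roughly 75 bytes each (user ID + timestamp + lat/long + overhead).

users = 500_000_000                 # Facebook users at the time
per_minute = 37.2e9                 # bytes per minute, his figure

print(f"implied record size: {per_minute / users:.1f} bytes")  # ~74.4

per_day = per_minute * 60 * 24      # one tag per user per minute, all day
print(f"per day: {per_day / 1e12:.1f} TB")                     # ~53.6 TB
```

That is roughly 54TB a day from one trivial feature.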
His main point is: scaling != performance.
Performance is more about I/O than about your choice of language. Choose Ruby, choose PHP – it makes little difference to a large-scale system, he insists.
Mostly, scaling is a specialisation.
“The more traffic you get, the more specialised your infrastructure needs to be,” he says. The key is automation – components should be able to be started, called or attached automatically. Use the cloud, but treat everything in the cloud as ephemeral: it can and will just disappear. Expect it.
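What treating everything as ephemeral looks like in practice is a control loop that expects nodes to vanish. Here is a minimal sketch under that assumption; provision_node and is_alive are hypothetical stand-ins for a real cloud provider’s API:

```python
import random

# A minimal sketch of treating cloud nodes as ephemeral: a supervisor loop
# that assumes any node can vanish at any time and replaces it without a
# human in the loop. provision_node() and is_alive() are hypothetical
# stand-ins for real cloud-provider API calls.

def provision_node(pool: set) -> None:
    node_id = f"node-{random.randrange(1_000_000):06d}"
    pool.add(node_id)
    print(f"provisioned {node_id}")

def is_alive(node_id: str) -> bool:
    # Placeholder health check; in reality you'd ping the instance.
    return random.random() > 0.2    # pretend some checks find a dead node

def supervise(pool: set, desired_size: int) -> None:
    # Drop nodes that have disappeared, then top the pool back up.
    for node in [n for n in pool if not is_alive(n)]:
        print(f"lost {node}, replacing")
        pool.discard(node)
    while len(pool) < desired_size:
        provision_node(pool)

pool: set = set()
for _ in range(3):                  # a few supervision cycles
    supervise(pool, desired_size=4)
```

The automation notices the loss and repairs it on its own – exactly the “bits attach themselves automatically” point.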

He discussed the two approaches to scaling – namely out, and up.
If you scale out, you spread load across lots of boxes. If you scale up, you buy a bigger, faster box: less complex infrastructure, but a really powerful box can cost millions of bucks – only workable if your service is already making big money.
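To make scaling out concrete, here is a minimal sketch with a hypothetical fleet of four database boxes: hash a key to pick a shard, so load spreads across many cheap machines instead of piling onto one expensive one.

```python
import hashlib

# A minimal sketch of scaling out: spread records across many cheap boxes
# by hashing a key to pick a shard. The box names are hypothetical.

BOXES = ["db-01", "db-02", "db-03", "db-04"]

def shard_for(key: str) -> str:
    digest = hashlib.md5(key.encode()).hexdigest()
    return BOXES[int(digest, 16) % len(BOXES)]

# Each user lands deterministically on one box.
for user_id in ["alice", "bob", "carol", "dave"]:
    print(user_id, "->", shard_for(user_id))
```

(Real systems tend to use consistent hashing rather than a plain modulus, so that adding a box doesn’t remap every key.)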

Other gems of wisdom:
* Partition your data from the very beginning.
* Make use of queues – they’re a vital part of keeping the user experience consistent.
* Caching is critical, especially in support of queues: write a record to cache while the queue processes it, so the user experience stays smooth (see the sketch after this list).
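Here is a minimal sketch of that queue-plus-cache pattern, with dicts standing in for the cache and the durable store (all the names are hypothetical):

```python
import queue
import threading
import time

# A minimal sketch of the queue-plus-cache pattern: write the record to a
# cache immediately so the user sees it, and let a background worker drain
# the queue and do the slow, durable write.

cache = {}                   # fast store the web tier reads from
db = {}                      # slow, durable store
work = queue.Queue()

def submit(record_id, data):
    cache[record_id] = data  # user-visible right away
    work.put((record_id, data))

def worker():
    while True:
        record_id, data = work.get()
        time.sleep(0.1)      # pretend the durable write is slow
        db[record_id] = data
        work.task_done()

threading.Thread(target=worker, daemon=True).start()

submit("comment:1", "first!")
print("read from cache:", cache["comment:1"])  # consistent UX immediately
work.join()                                    # durable write catches up
print("eventually in db:", db["comment:1"])
```

The user never waits on the slow write; the queue absorbs the spike, and the cache keeps the experience consistent in the meantime.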

These are lessons learned from long years of worrying about problems like: how do you handle an object such as Digg’s front-page story when it’s getting millions of hits?
His other key advice is about people:
“It takes a lot of people to build, scale and maintain infrastructure – you will grow from one or two to 15 or more.” The human management issues become tricky here: “The first two or three devs on board are going to question every decision management makes.”
A good thought: “Look for a trait in developers: laziness. You want someone who looks for a quicker, better way.”

As your site (and dev team) grows, he advises lowering the barriers to entry for more junior devs. “Get your codebase to a position where you don’t need to hire a Jedi. Jedis are rare. Jedis are expensive.”
He recommends breaking teams up: four to six people work well; at eight it starts breaking down. Get a Jedi, and make them the team leader. Note: team leader, not manager – they should act more like a sports team’s captain. Create frameworks (authentication, error handling) to lower barriers to entry as new coders come on (a sketch follows below).
And use code repositories. Full stop.
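The frameworks point is easy to picture in code. A minimal sketch, assuming a hypothetical dict-shaped request: authentication and error handling live in one decorator, so a junior dev writing an endpoint only writes the happy path.

```python
import functools

# A minimal sketch of an in-house framework that lowers the barrier to
# entry: auth and error handling are written once, in the decorator, so
# endpoint authors never have to get them right themselves. The request
# shape and the endpoint are hypothetical.

class ApiError(Exception):
    pass

def endpoint(func):
    @functools.wraps(func)
    def wrapper(request: dict) -> dict:
        if not request.get("token"):          # auth handled in one place
            return {"status": 401, "error": "not authenticated"}
        try:
            return {"status": 200, "body": func(request)}
        except ApiError as exc:               # error handling in one place
            return {"status": 400, "error": str(exc)}
    return wrapper

@endpoint
def get_profile(request: dict) -> dict:
    # A junior dev only writes the happy path.
    return {"user": request["token"], "bio": "hello"}

print(get_profile({"token": "alice"}))   # {'status': 200, ...}
print(get_profile({}))                   # {'status': 401, ...}
```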

He is very passionate about promoting ownership in the codebase, so that individuals work on three or four areas and take responsibility for them.
“As you scale and your code bases grow, from 50,000 lines of code to 400,000 lines, no-one can be effective across the whole base,” he says.
Before you start, design the software – don’t just start coding. He is a big fan of stubbing out the API on a whiteboard.
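In practice, that whiteboard stub might end up as something like this minimal sketch; the functions and signatures are invented for illustration, not anyone’s real API:

```python
# A minimal sketch of stubbing out an API before writing any real code:
# agree on names, arguments and return shapes first, so the whole team can
# review (or mock) the surface while the bodies are still empty. All the
# functions here are hypothetical.

def create_user(email: str, password: str) -> int:
    """Create a user and return the new user ID."""
    raise NotImplementedError

def get_timeline(user_id: int, limit: int = 50) -> list[dict]:
    """Return the most recent stories for a user."""
    raise NotImplementedError

def post_story(user_id: int, url: str, title: str) -> int:
    """Submit a story and return its ID."""
    raise NotImplementedError
```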

When it comes to testing, automation is good, and use several methods. If you fix something, run the test against the old version first and make sure it fails; apply the patch, and make sure it now passes.
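That fail-first discipline looks something like this minimal sketch; the function and its bug are invented for illustration:

```python
# A minimal sketch of the fail-first loop: the test must fail against the
# buggy code, then pass once the patch is applied.

def total_price(items):
    return sum(items[:-1])        # BUG: silently drops the last item

def total_price_fixed(items):
    return sum(items)             # the patched version

def test_total_price(fn):
    assert fn([10, 20, 30]) == 60

try:
    test_total_price(total_price)        # must FAIL against the old code
except AssertionError:
    print("old version fails the test, as it should")

test_total_price(total_price_fixed)      # must pass after the patch
print("patched version passes")
```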
Documentation: build time for it into your planning. Even when old and stale, it adds historical context, perhaps helping you understand later why you made a particular decision.
Do peer reviews. “I’ve never sat in on a peer review and not seen at least one show-stopping bug.”

There are a number of ways to scale using powerful technologies. “When I left Digg we were handling 37,000 requests a second,” he says. Now at SimpleGeo, he runs 15 nodes in one Cassandra cluster and 12 nodes in another.
The numbers will go up (if you are even remotely successful). The technology is getting faster and faster, handling volumes that would have been unthinkable before. “You can get 1500 writes a second on a decent SQL box. A couple of years ago if you asked me if I’d need that, I would have laughed,” says Stump. Right now he is putting 5,000 to 7,000 writes/sec on a Cassandra cluster.

Most South African web developers, even those working for relative giants like news24.com, see only a fraction of these volumes – but one thing is sure: Africa is developing its Internet community fast, and it won’t be long before servers talking to thousands of users are talking to millions.

Roger Hislop
www.sentientbeing.co.za
@d0dja