When you sign up for a web hosting account you have to select between a myriad of server configurations with different memory sizes and processor speeds. How do you know what to choose? If you are just running a blog or a product sales site then I can tell you that in almost all cases it doesn’t matter. Even the smallest web server will handle the load of a new web site. But what if your web site becomes the next YouTube overnight? Then it doesn’t matter either. A single server will never handle that amount of traffic.
What if you’re John Reese, you have this brilliant idea called BlogRush, and you have the marketing resources to make it an overnight hit? How do you decide beforehand how many servers you need?
I’ve architected web sites for Fortune 500 companies where sites regularly receive 5 million pageviews per day. In coming up with the architecture and the hardware requirements one of my first questions for my client is how much traffic do you expect in the next year. Typically they have no idea. This could be because they have poor web stats for their existing site or they are launching a genuinely new service. That’s when it’s time to make educated guesses and use a thought process something like this:
Let’s use BlogRush as our hypothetical example. Assume that 10,000 blogs sign up initially. Given the value proposition of receiving free traffic, I’m guessing that 99% of the blogs that sign up have very little traffic, say 100 pageviews per day. But some of the more popular bloggers like ShoeMoney, John Chow, DoshDosh, Yaro Starak, Rosalind Gardner, Mike Filsaime and Terry Dean are also going to sign up. Assume that these larger blogs, the remaining 1%, each contribute on average 10,000 pageviews per day. The total load on the BlogRush servers would then be close to 2 million requests to show their widget per day (9,900 × 100 plus 100 × 10,000). Since the blogger audience is global we can assume that traffic is spread about evenly across the day, giving us an average of 23 requests per second.
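The back-of-the-envelope estimate above can be written out as a few lines of Python. This is only a sketch of the arithmetic; the function name and its parameters are made up for illustration, and the inputs are the guesses from the paragraph above, not measured BlogRush numbers.

```python
SECONDS_PER_DAY = 24 * 60 * 60

def estimate_load(total_blogs, small_percent, small_views, large_views):
    """Return (requests per day, average requests per second).

    small_percent is the share of low-traffic blogs, as a whole percent.
    All other blogs are treated as large blogs.
    """
    small_blogs = total_blogs * small_percent // 100
    large_blogs = total_blogs - small_blogs
    daily = small_blogs * small_views + large_blogs * large_views
    # Assumes traffic is spread evenly over the day, as in the text.
    return daily, daily / SECONDS_PER_DAY

# The guesses from the text: 10,000 blogs, 99% small at 100 pageviews/day,
# the rest at 10,000 pageviews/day.
daily, per_second = estimate_load(10_000, 99, 100, 10_000)
print(f"{daily:,} requests/day, about {per_second:.0f} requests/second")
```

Changing any one guess shows how sensitive the estimate is. If 2% of the blogs turn out to be large instead of 1%, the load nearly doubles.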
A single well-tuned web server can easily serve 23 requests per second, if the content is static. The problem is that the content served by BlogRush is very dynamic. The headlines to be served to each blog have to be gathered, the number of credits each blog has accumulated has to be tracked and the number of links displayed for each blog has to be subtracted from the accumulated credits. The calculations get significantly more complex due to credits being tracked through 10 referral levels. And all this has to be done 23 times per second. Not a trivial problem.
One way that very large sites handle their traffic is through aggressive caching. You pay a service like Akamai $10k+ per month and 90% of your traffic problem goes away. But for this to work the majority of the content needs to be cacheable, and we have already determined that BlogRush does not fit that profile.
Let’s look at the next tier of servers. After a request hits the web server most of the heavy lifting is going to be done by the application or database servers. Since the functionality of BlogRush is very data intensive I would implement most of it in a few stored procedures inside the database.
This brings us to scaling the database server. There are basically two approaches here: Use one very large server to serve all requests and then have one mirrored standby server in case the first one goes down. This is the “big iron” approach and it’s pretty expensive. The other approach is to use several smaller database servers. This is seldom done in reality because few applications are suited to distributing the load across several databases that are not actively linked.
At first thought the BlogRush application seems to fall in this latter “not possible” category since you presumably need to keep track of advertisement credits in one central location. But assume for a moment that the credits are spread out across several unlinked databases that are updated independently of each other. Sure, a given blog could run out of credits in one database while there are credits remaining in others. Over time that shouldn’t matter; all credits will accumulate and be used correctly.
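To make the idea of unlinked credit databases concrete, here is a small simulation sketch. Everything in it is hypothetical: the CreditShard class stands in for one independent database, requests are assigned to shards at random, and one credit buys one displayed link. It says nothing about how BlogRush actually stores credits.

```python
import random

class CreditShard:
    """Stands in for one independent database; it never talks to the others."""

    def __init__(self):
        self.balances = {}

    def earn(self, blog_id, credits=1):
        self.balances[blog_id] = self.balances.get(blog_id, 0) + credits

    def try_spend(self, blog_id, credits=1):
        # Only the local balance is checked, so a blog can be refused here
        # while it still has credits on another shard.
        if self.balances.get(blog_id, 0) < credits:
            return False
        self.balances[blog_id] -= credits
        return True

def simulate(num_shards=8, earned=1000, spend_attempts=5000, seed=42):
    rng = random.Random(seed)
    shards = [CreditShard() for _ in range(num_shards)]
    # Credits are recorded on whichever shard happened to handle the request.
    for _ in range(earned):
        rng.choice(shards).earn("blog-1")
    # Spend attempts also land on random shards.
    spent = sum(rng.choice(shards).try_spend("blog-1")
                for _ in range(spend_attempts))
    remaining = sum(s.balances.get("blog-1", 0) for s in shards)
    return spent, remaining

spent, remaining = simulate()
print(f"spent={spent}, remaining={remaining}")
```

No credit is ever lost or counted twice: credits spent plus credits remaining always equals credits earned. Individual spend attempts can fail on a shard that has run dry, but with enough traffic the remaining balances are used up as well.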
So given the 10 servers BlogRush reportedly has, I would dedicate one to the members’ pages that you see when you log in to your account, one server to constantly polling new headlines from all blog feeds, and the remaining eight servers to each running a web server and a database. A load balancer and firewall sit in front of it all to direct traffic to the least utilized server. As traffic grows you just add more servers.
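The "least utilized server" rule can be sketched in a few lines. This is a toy in-memory model, assuming that utilization is measured as the number of requests currently in flight; the class and server names are invented for the example, and a real load balancer would be a dedicated appliance or proxy rather than application code.

```python
class LeastUtilizedBalancer:
    """Toy model: route each request to the server with the fewest active requests."""

    def __init__(self, server_names):
        self.active = {name: 0 for name in server_names}

    def acquire(self):
        # Pick the server with the lowest in-flight count (ties go to the
        # server that was registered first).
        name = min(self.active, key=self.active.get)
        self.active[name] += 1
        return name

    def release(self, name):
        # Call when the request has finished.
        self.active[name] -= 1

    def add_server(self, name):
        # Growing the pool: a new server starts idle and attracts traffic first.
        self.active[name] = 0

balancer = LeastUtilizedBalancer([f"web-db-{i}" for i in range(1, 9)])
first = balancer.acquire()
second = balancer.acquire()
balancer.release(first)
balancer.add_server("web-db-9")
```

Because each of the eight servers carries its own web server and database, adding a ninth only requires registering it with the balancer.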
According to Mike Filsaime, John Reese sees BlogRush becoming a $100 million company. But he has a very sizable traffic problem that goes along with a hugely successful business, and it’s growing exponentially. If the number of blogs that sign up to BlogRush increases by a factor of 10, then the load on the servers will increase more than 10 times. That’s what I call a scalability problem. A quite interesting one.
Note that I don’t have any direct insight into the systems or operations behind BlogRush. These are just my educated guesses based on my 10 years of experience as a technical architect for some rather large web sites.