YouTube was founded in February 2005 with a small initial team: two developers, two architects, two network engineers, and one DBA. From there it grew incredibly fast into what we know it as today. And it wasn’t just user popularity that grew; the architecture and underlying technology went through serious overhauls as well.
YouTube is the world’s 4th most popular website, and the number one site for video worldwide. It serves around 1 billion views and several terabytes of data every day, which works out to roughly 11,574 views per second.
How can YouTube handle such a large amount of video traffic without a noticeable performance lag? Some of the answers come from the High Scalability blog, which covers industry best practices.
It’s no surprise that it’s all powered by open-source technologies. Here is a glimpse of the platform architecture they deploy:
Linux (earlier SuSE, now multiple flavors)
MySQL (v5.x, highly customized, running on Google’s proprietary clusters)
psyco, a dynamic python->C compiler
lighttpd for video instead of Apache
Use of GFS (Google File system)
Today, the average service time for a single page on YouTube is around 70 ms. Under such a large amount of traffic, loading a single page with the technologies above (ignoring GFS) could make ordinary servers take several minutes per request. How does YouTube distribute and scale this? Of course, it’s all about optimization: the hardest thing to do.
Load Balancing and Caching: From the beginning until recently, YouTube used NetScaler. This has since been phased over to Google’s proprietary infrastructure, including GFS.
Web Server: Apache with mod_fastcgi.
Request Processor: Python-based application server.
Optimizations: Use psyco, a dynamic python->C compiler that uses a JIT compiler approach to optimize inner loops.
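As a rough sketch, enabling psyco was typically a one-liner at program start; the hot inner loop below stands in for the kind of code it accelerated. The guard around the import is my addition, since psyco is a Python 2-era package that may not be installed:

```python
# Hedged sketch: psyco, when available, JIT-compiles Python functions
# to machine code. The try/except lets the same code run without it.
try:
    import psyco
    psyco.full()  # optimize all functions, including hot inner loops
except ImportError:
    pass  # psyco unavailable; run the plain interpreter

def inner_loop(n):
    # A CPU-bound loop of the kind psyco speeds up most.
    total = 0
    for i in range(n):
        total += i * i
    return total
```

psyco also offered `psyco.bind(func)` to compile only selected functions, which kept memory overhead lower than `psyco.full()`.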
CPU-intensive tasks: All heavy tasks, such as video/audio encoding and encryption, are implemented as C extensions.
Pre-processing: Some blocks of a dynamic page can be a big overhead at runtime, e.g. Related Videos. Such HTML is pre-generated and cached.
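The idea can be sketched as follows; the fragment names and cache layout here are illustrative assumptions, not YouTube’s actual code:

```python
# Hypothetical sketch: pre-generate an expensive page fragment
# ("Related Videos") ahead of time instead of rendering it per request.
html_cache = {}  # (fragment, video_id) -> pre-rendered HTML

def prerender_related(video_id, related_titles):
    # Done offline or on change, not on the request path.
    items = "".join("<li>%s</li>" % t for t in related_titles)
    html_cache[("related", video_id)] = "<ul>%s</ul>" % items

def render_page(video_id):
    # At request time the fragment is a dictionary lookup, not a render.
    return "<html>...%s...</html>" % html_cache[("related", video_id)]
```

The win is that the per-request cost drops from template rendering plus database queries to a single cache lookup.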
Caching and Persistence: YouTube makes extensive use of object serialization and persistence into the database, especially for Python objects. Frequently used indexes are cached, which helps searches a great deal.
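A minimal sketch of persisting serialized Python objects in a database, assuming a simple key/blob table (the schema and helper names are mine, not YouTube’s):

```python
import pickle
import sqlite3

# Hypothetical sketch: serialize a Python object and persist it as a
# database blob, so a later request can load it without recomputation.
conn = sqlite3.connect(":memory:")  # stand-in for the real database
conn.execute("CREATE TABLE obj_cache (key TEXT PRIMARY KEY, blob BLOB)")

def save(key, obj):
    conn.execute("INSERT OR REPLACE INTO obj_cache VALUES (?, ?)",
                 (key, pickle.dumps(obj)))

def load(key):
    row = conn.execute(
        "SELECT blob FROM obj_cache WHERE key = ?", (key,)).fetchone()
    return pickle.loads(row[0]) if row else None
```

Caching a frequently used search index this way turns a rebuild into a single fetch-and-deserialize.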
1. Web Components:
For handling changes, they employ agents that watch for changes, pre-calculate results, and push them to all systems. This machinery has become highly complex and smart, especially after the acquisition by Google.
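The watch-precalculate-push pattern described above can be sketched like this; the class and callback names are illustrative assumptions:

```python
# Hypothetical sketch of a change-watching agent: observe a value,
# pre-calculate derived data when it changes, push it to subscribers.
class ChangeAgent:
    def __init__(self, compute):
        self.compute = compute      # the pre-calculation step
        self.subscribers = []       # downstream systems to notify
        self.last = None

    def notify(self, new_value):
        if new_value == self.last:
            return                  # ignore non-changes
        self.last = new_value
        derived = self.compute(new_value)   # pre-calculate once
        for push in self.subscribers:       # fan out to all systems
            push(derived)

received = []
agent = ChangeAgent(compute=lambda v: v * 2)
agent.subscribers.append(received.append)
agent.notify(21)
```

The key property is that expensive derivation happens once per change, not once per consumer.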
2. Serving Videos
High Availability & Speed: Each video is stored in duplicate within its hosting mini-cluster. Every video is held and served by more than one machine at a time, to multiple users; in other words, more disks serving the content means more speed for each user. If a machine goes down, others can take over. And for extreme cases, to avoid data loss, there are multiple online backups, both scheduled and triggered.
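A toy sketch of replica selection with failover, under the assumption (mine, for illustration) of a simple replica map and a set of unhealthy hosts:

```python
import random

# Hypothetical sketch: each video lives on several machines in a
# mini-cluster; any healthy replica can serve, peers cover failures.
replicas = {"video123": ["host-a", "host-b", "host-c"]}  # assumed layout
down = {"host-b"}  # machines currently marked unhealthy

def pick_server(video_id):
    healthy = [h for h in replicas[video_id] if h not in down]
    if not healthy:
        # extreme case: all replicas lost, fall back to backups
        raise RuntimeError("no replica available; restore from backup")
    return random.choice(healthy)  # spread load across disks
```

Random choice among healthy replicas is the simplest load-spreading policy; a real system would weight by current load.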
Content Delivery: The lighttpd web server (not Apache) is used for serving video. This cuts much of the overhead that Apache carries.
Popular content is moved to a CDN (content delivery network). CDNs help a lot by replicating content across multiple geographic areas, so there is a higher probability that the content is close to the user, with fewer hops, and hence faster. Less popular content (say, 1–40 views per day) uses YouTube servers at various local sites instead. This has one drawback, the long-tail effect: when lots of such videos are being played, caching does practically no good. To tackle this, YouTube tunes the RAID controllers and memory on each machine to optimize random disk access, thereby allowing more files to be accessed concurrently.
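The routing decision described above can be sketched as a simple threshold check; the cutoff constant is an assumption drawn from the 1–40 views/day figure in the text:

```python
# Hypothetical routing sketch: popular videos go to the CDN,
# long-tail videos are served from YouTube's own colo sites.
CDN_THRESHOLD = 40  # assumed cutoff (views/day), per the 1-40 range above

def pick_serving_tier(views_per_day):
    if views_per_day > CDN_THRESHOLD:
        return "cdn"   # replicated close to users, few network hops
    return "colo"      # local servers tuned for random disk access
```

In practice the policy would be dynamic (based on recent request rates) rather than a fixed daily-view threshold.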
Note: Most of the CDNs are now dedicated and internal to YouTube.
3. Serving Thumbnails
One might ask what the point is of considering thumbnails separately when they are tiny compared to videos. Surprisingly, they are difficult to serve efficiently. Since there are four thumbnails for each video, there are four times as many thumbnails as videos. To handle all of these, they are hosted on just a few machines.
The Challenge: Since the images are small, serving them results in lots of disk seeks and problems with inode caches and page caches at the OS level.
Storing lots of small files in a file system is still not a good idea. A high number of requests per second can wreak havoc on an HDD, since a single web page can contain up to 50 thumbnails. With the inode cache thrashing, Apache performed badly. Putting squid (a reverse proxy) in front of Apache was one solution; it worked for YouTube for a while, but as load increased, performance eventually dropped from 300 requests per second to 20. They tried lighttpd, but its single-threaded mode stalled, and multi-process mode caused trouble as well, because each process kept a separate cache.
Solution: YouTube uses Google’s BigTable, a distributed data store:
– Avoids the small-file problem, because it clumps files together.
– Fast and fault-tolerant: it assumes it is working over an unreliable network, and hence uses acknowledgements.
– Lower latency, because it uses a distributed multilevel cache that works across different colocation sites.
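The "clumps files together" point is the crux for thumbnails. A minimal sketch of the idea, with an invented `Clump` class (not BigTable's actual API): pack many small images into one large blob and keep an offset index, so a read is one seek into a big file rather than a per-file inode lookup.

```python
# Hypothetical sketch of clumping small files into one large blob.
class Clump:
    def __init__(self):
        self.blob = bytearray()  # one big contiguous store
        self.index = {}          # name -> (offset, length)

    def add(self, name, data):
        # Append the small file and remember where it landed.
        self.index[name] = (len(self.blob), len(data))
        self.blob += data

    def get(self, name):
        # One indexed read replaces a per-file open/seek.
        off, length = self.index[name]
        return bytes(self.blob[off:off + length])
```

With thousands of thumbnails per clump, the file system sees a handful of large files instead of millions of tiny ones, which is exactly what keeps the inode and page caches effective.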