What Powers YouTube

YouTube was founded in February 2005 with an initial army of 2 developers, 2 architects, 2 network engineers, and 1 DBA. It grew incredibly fast into what we know it as today.

It wasn’t just user popularity that grew; the architecture and underlying technology went through serious overhauls as well.

YouTube is the world’s 4th most popular website, and no. 1 for video. It serves around 1 billion views and several terabytes of data every day, which works out to roughly 11,574 views per second.
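
The per-second figure is just the daily total divided by the number of seconds in a day. A quick check in Python (the function name is only illustrative):

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def views_per_second(views_per_day: int) -> float:
    """Average request rate implied by a daily view count."""
    return views_per_day / SECONDS_PER_DAY


# 1 billion views a day, as quoted above
average_rate = views_per_second(1_000_000_000)
print(round(average_rate))
```

This is an average; peak traffic would be higher than the daily mean.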

How can YouTube handle such a large amount of video traffic without a noticeable performance lag? Some of the answers come from High Scalability, which covers industry best practices.

It’s no surprise that it is largely powered by open source technologies. Here is a glimpse of the platform architecture they deploy:

Apache (for html, javascript, css)

Python

Linux (earlier SuSE and now multiple flavors)

MySQL (v5.x, highly customized by Google proprietary clusters)

psyco, a dynamic python->C compiler

lighttpd for video instead of Apache

GFS (Google File System)

Today, the average service time for a single page on YouTube is around 70 ms. Under this much traffic, ordinary servers running the technologies above (ignoring GFS) could take several minutes to answer a single request. How does YouTube distribute and scale this? It is all about optimization, which is the hardest thing to do.

1. Web Components

Load Balancing and Caching: From the beginning until recently, YouTube used NetScaler. This has now been phased over to Google’s proprietary file clusters, GFS.

Web Server: Apache with mod_fastcgi.

Request Processor: Python-based application server.

Optimizations: Use psyco, a dynamic python->C compiler that uses a JIT compiler approach to optimize inner loops.

CPU-intensive tasks: For all heavy tasks, such as video/audio encoding and encryption, C extensions are used.

Pre-processing: A few blocks on dynamic pages, such as Related Videos, can be a big overhead at runtime. Such HTML is pre-generated and cached.
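
The idea of pre-generating a page block can be sketched as follows. This is a minimal illustration and not YouTube's code: the function names, the in-memory dict used as a cache, and the HTML shape are all assumptions made for the example.

```python
import html
from typing import Dict, List

# Hypothetical in-process cache: video id -> ready-to-serve HTML fragment.
_fragment_cache: Dict[str, str] = {}


def render_related_videos(video_id: str, related_titles: List[str]) -> str:
    """Build the (expensive) Related Videos block as an HTML string."""
    items = "".join(f"<li>{html.escape(title)}</li>" for title in related_titles)
    return f'<ul class="related" data-video="{html.escape(video_id)}">{items}</ul>'


def pregenerate(video_id: str, related_titles: List[str]) -> None:
    """Run offline or on change, so the request path never renders the block."""
    _fragment_cache[video_id] = render_related_videos(video_id, related_titles)


def get_related_fragment(video_id: str) -> str:
    """Request path: a plain lookup, with an empty block as the fallback."""
    return _fragment_cache.get(video_id, "")


pregenerate("abc", ["Cats & dogs"])
fragment = get_related_fragment("abc")
```

The request handler only does a dictionary lookup; all the rendering cost is paid ahead of time.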

Caching, Persistence: YouTube makes extensive use of object serialization and persistence into the database, especially for Python objects. Frequently used indexes are cached, which is a great help in searches.
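
As a rough sketch of what serializing Python objects into a database can look like, the example below pickles an object into a table. It assumes `sqlite3` as a stand-in for the MySQL setup described in this post, and the table and function names are made up for illustration.

```python
import pickle
import sqlite3
from typing import Any, Optional


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open a store; sqlite3 here is only a stand-in for a real database."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS objects ("
        "obj_key TEXT PRIMARY KEY, payload BLOB NOT NULL)"
    )
    return conn


def save_object(conn: sqlite3.Connection, obj_key: str, obj: Any) -> None:
    """Serialize a Python object and persist it under obj_key."""
    payload = pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL)
    conn.execute(
        "INSERT OR REPLACE INTO objects (obj_key, payload) VALUES (?, ?)",
        (obj_key, payload),
    )
    conn.commit()


def load_object(conn: sqlite3.Connection, obj_key: str) -> Optional[Any]:
    """Load and deserialize an object, or return None if it is absent."""
    row = conn.execute(
        "SELECT payload FROM objects WHERE obj_key = ?", (obj_key,)
    ).fetchone()
    # Only unpickle data you wrote yourself: pickle can execute code on load.
    return None if row is None else pickle.loads(row[0])


conn = open_store()
save_object(conn, "index:cats", {"video_ids": [1, 2, 3]})
restored = load_object(conn, "index:cats")
```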

For handling changes, they employ agents that watch for changes, pre-calculate the results, and send them to all systems. This has become highly complex and smart, especially after the merger with Google.
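
A toy version of such an agent, assuming each "system" is just a dict-like cache: the value is computed once when a change arrives and then pushed everywhere. The class and its methods are hypothetical, written only to illustrate the pattern.

```python
from typing import Callable, Dict, List


class ChangeAgent:
    """Hypothetical agent: pre-calculates once, then fans the result out."""

    def __init__(self, precompute: Callable[[str], str]) -> None:
        self._precompute = precompute
        self._targets: List[Dict[str, str]] = []

    def register(self, cache: Dict[str, str]) -> None:
        """Add a system (here, a plain dict) that should receive updates."""
        self._targets.append(cache)

    def on_change(self, key: str, raw_value: str) -> None:
        """Compute the derived value a single time and push it to every target."""
        derived = self._precompute(raw_value)
        for cache in self._targets:
            cache[key] = derived


web1: Dict[str, str] = {}
web2: Dict[str, str] = {}
agent = ChangeAgent(precompute=str.upper)  # str.upper stands in for real work
agent.register(web1)
agent.register(web2)
agent.on_change("video:42:title", "funny cats")
```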

2. Serving Videos

High Availability & Speed: Each video is stored in duplicate within its hosting mini-cluster, so it is stored and served by more than one machine at a time, to multiple users. In other words, more disks serving content means more speed for each user. If a machine goes down, others can take over. And for extreme cases, to avoid data loss, there are multiple online backups, both scheduled and triggered.
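
The failover behavior can be sketched as picking any healthy machine from the set holding a video. This is a simplified illustration; the machine names and the health set are assumptions, and a real system would track health through monitoring.

```python
import random
from typing import Iterable, Optional, Set


def pick_replica(
    replicas: Iterable[str],
    healthy: Set[str],
    rng: Optional[random.Random] = None,
) -> str:
    """Choose one healthy machine among those that hold the video."""
    rng = rng or random.Random()
    candidates = [machine for machine in replicas if machine in healthy]
    if not candidates:
        raise RuntimeError("no healthy replica available for this video")
    # Spreading requests at random lets every disk share the load.
    return rng.choice(candidates)


replicas = ["disk-a", "disk-b", "disk-c"]  # hypothetical machine names
# disk-a is down, so only disk-b or disk-c can be chosen.
chosen = pick_replica(replicas, healthy={"disk-b", "disk-c"})
```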

Content Delivery: The lighttpd web server (and not Apache) is used for serving video. This removes a lot of the overhead that Apache comes with.

Popular content is moved to a CDN (content delivery network). CDNs help a lot by replicating content across multiple geographical areas. With such a strategy, there is a higher probability that the content is closer to the user, with fewer hops, and hence faster. Less popular content (say 1-40 views per day), on the other hand, is served from YouTube servers in various local sites instead. There is one drawback to this: the long tail effect. When lots of such videos are being played, caching does almost no good. To tackle this, YouTube tunes the RAID controllers and memory on each machine to optimize random disk access, thereby allowing more files to be accessed at once.
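
To see why caching stops helping on long-tail traffic, here is a small simulation with a least-recently-used cache. It is only a sketch: the cache size, the workloads, and the choice of LRU are assumptions for the example, not details from YouTube.

```python
from collections import OrderedDict
from typing import Iterable


def hit_rate(requests: Iterable[str], capacity: int) -> float:
    """Fraction of requests served from an LRU cache of the given capacity."""
    cache: "OrderedDict[str, bool]" = OrderedDict()
    hits = 0
    total = 0
    for key in requests:
        total += 1
        if key in cache:
            hits += 1
            cache.move_to_end(key)  # mark as most recently used
        else:
            cache[key] = True
            if len(cache) > capacity:
                cache.popitem(last=False)  # evict the least recently used
    return hits / total if total else 0.0


# A few hot videos requested over and over: nearly every request hits.
popular = ["a", "b", "c"] * 100
# 300 different rarely-watched videos: every request misses.
long_tail = [f"v{i}" for i in range(300)]

popular_rate = hit_rate(popular, capacity=3)
long_tail_rate = hit_rate(long_tail, capacity=3)
```

With the hot set, only the first three requests miss; with the long tail, nothing is ever requested twice, so the cache never gets a chance to help and every read falls through to disk.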

Note: Most of the CDNs are now dedicated and internal to YouTube.

3. Serving Thumbnails

One might ask what the point is of considering thumbnails separately when they are such a small part compared to videos. Surprisingly, they are difficult to serve efficiently. There are 4 thumbnails for each video, so there are 4 times as many thumbnails as videos. They are hosted on just a few machines.

The Challenge: Since the images are small, serving them results in lots of disk seeks and problems with inode caches and page caches at the OS level.
Storing lots of files in a file system is still not a good idea. A high number of requests per second can wreak havoc on a hard disk, since a single web page can show up to 50 thumbnails. Due to bad inode caching, Apache performs badly. Putting squid (a reverse proxy) in front of Apache can be one solution. This worked for YouTube for a while, but as load increased, performance eventually decreased, going from 300 requests/second to 20. They tried lighttpd, but in single-threaded mode it stalled. Multiprocess mode caused trouble as well, because each process would keep a separate cache.

Solution: YouTube uses Google’s BigTable, a distributed data store:

– Avoids the small file problem because it clumps files together.

– Fast and fault tolerant. It assumes it is working on an unreliable network and hence operates with acknowledgements (acks).

– Lower latency because it uses a distributed multilevel cache. This cache works across different colocation sites.
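
The first point, clumping small files together, can be illustrated with a toy store that appends every thumbnail to one large blob and keeps an index of offsets. This is not how BigTable is implemented; it is only a sketch of the general idea, with a made-up class name and an in-memory buffer standing in for a large file on disk.

```python
import io
from typing import Dict, Tuple


class PackedStore:
    """Toy store: many small items live inside one blob plus an offset index."""

    def __init__(self) -> None:
        self._blob = io.BytesIO()  # stands in for one large file on disk
        self._index: Dict[str, Tuple[int, int]] = {}  # name -> (offset, length)

    def put(self, name: str, data: bytes) -> None:
        """Append the item to the end of the blob and remember where it is."""
        offset = self._blob.seek(0, io.SEEK_END)
        self._blob.write(data)
        self._index[name] = (offset, len(data))

    def get(self, name: str) -> bytes:
        """One seek and one read, instead of a per-file inode lookup."""
        offset, length = self._index[name]
        self._blob.seek(offset)
        return self._blob.read(length)


store = PackedStore()
store.put("video1_thumb0.jpg", b"\xff\xd8first")
store.put("video1_thumb1.jpg", b"\xff\xd8second")
thumb = store.get("video1_thumb1.jpg")
```

The file system sees a single file no matter how many thumbnails are stored, which sidesteps the inode and page cache pressure described above.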

4. Databases

This is crucial. YouTube has come a long way in database-level improvements:

Database partitioning on their custom MySQL database is based on month, categories, and tags.
The strategy employed splits the database into shards, with users assigned to different shards. As the adoption of Google’s technologies within YouTube accelerated, Google’s advanced caching mechanisms over MySQL clusters came into the picture, which is what makes YouTube so powerful.
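
A minimal sketch of assigning users to shards, assuming a simple hash-modulo scheme. The post does not say how YouTube maps users to shards, so both the scheme and the function name are illustrative.

```python
import zlib


def shard_for_user(user_id: str, num_shards: int) -> int:
    """Map a user id to a shard number in the range [0, num_shards)."""
    if num_shards <= 0:
        raise ValueError("num_shards must be positive")
    # crc32 is stable across processes and machines, unlike Python's built-in
    # hash() for strings, which is randomized per process by default.
    return zlib.crc32(user_id.encode("utf-8")) % num_shards


shard = shard_for_user("alice", num_shards=8)
```

Every lookup for the same user lands on the same shard, so reads and writes for that user touch only one database. A plain modulo scheme does force data to move when the shard count changes, which is one reason real deployments often use a lookup table or consistent hashing instead.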

All said and done

There is probably a lot more I could say about YouTube, if only I knew more or could find out more. If you know something that I missed, feel free to add to or argue with these points; the purpose is to collect best practices.


4 thoughts on “What Powers YouTube”

Joao Canudo
October 19, 2009 at 7:13 pm
Great, because google knows how to make it right. 🙂
Brian salet
October 28, 2009 at 2:52 am
well, I`m out of words.
how did you manage to gather that much info?
high speed scanner
September 7, 2010 at 1:32 pm
i just want to say thanks for your information
העלאת מסת שריר והורדת אחוזי שומן
July 26, 2019 at 6:17 pm
Thank you for some other great article. The place else may
anybody get that kind of information in such an ideal
means of writing? I’ve a presentation subsequent week, and I
am at the search for such information.