mukesh

Mobile

Monday, 30 April 2012

How the Google File System Works

Google is a multi-billion dollar company. It's one of the big power players on the World Wide Web and beyond. The company relies on a distributed computing system to provide users with the infrastructure they need to access, create and alter data. Surely Google buys state-of-the-art computers and servers to keep things running smoothly, right?
Wrong. The machines that power Google's operations aren't cutting-edge power computers with lots of bells and whistles. In fact, they're relatively inexpensive machines running on Linux operating systems. How can one of the most influential companies on the Web rely on cheap hardware? It's due to the Google File System (GFS), which capitalizes on the strengths of off-the-shelf servers while compensating for any hardware weaknesses. It's all in the design.
Google uses the GFS to organize and manipulate huge files and to allow application developers the research and development resources they require. The GFS is unique to Google and isn't for sale. But it could serve as a model for file systems for organizations with similar needs.
Some GFS details remain a mystery to anyone outside of Google. For example, Google doesn't reveal how many computers it uses to operate the GFS. In official Google papers, the company only says that there are "thousands" of computers in the system (source: Google). But despite this veil of secrecy, Google has made much of the GFS's structure and operation public knowledge.

Google File System Basics

Google developers routinely deal with large files that can be difficult to manipulate using a traditional computer file system. The size of the files drove many of the decisions programmers had to make for the GFS's design. Another big concern was scalability, which refers to the ease of adding capacity to the system. A system is scalable if it's easy to increase the system's capacity. The system's performance shouldn't suffer as it grows. Google requires a very large network of computers to handle all of its files, so scalability is a top concern.
Because the network is so huge, monitoring and maintaining it is a challenging task. While developing the GFS, programmers decided to automate as much of the administrative duties required to keep the system running as possible. This is a key principle of autonomic computing, a concept in which computers are able to diagnose problems and solve them in real time without the need for human intervention. The challenge for the GFS team was to not only create an automatic monitoring system, but also to design it so that it could work across a huge network of computers.
The key to the team's designs was the concept of simplification. They came to the conclusion that as systems grow more complex, problems arise more often. A simple approach is easier to control, even when the scale of the system is huge.
Based on that philosophy, the GFS team decided that users would have access to basic file commands. These include commands like open, create, read, write and close files. The team also included a couple of specialized commands: append and snapshot. They created the specialized commands based on Google's needs. Append allows clients to add information to an existing file without overwriting previously written data. Snapshot is a command that creates quick copy of a computer's contents.
Files on the GFS tend to be very large, usually in the multi-gigabyte (GB) range. Accessing and manipulating files that large would take up a lot of the network's bandwidth. Bandwidth is the capacity of a system to move data from one location to another. The GFS addresses this problem by breaking files up into chunks of 64 megabytes (MB) each. Every chunk receives a unique 64-bit identification number called a chunk handle. While the GFS can process smaller files, its developers didn't optimize the system for those kinds of tasks.
By requiring all the file chunks to be the same size, the GFS simplifies resource application. It's easy to see which computers in the system are near capacity and which are underused. It's also easy to port chunks from one resource to another to balance the workload across the system.

Google File System Architecture

Google organized the GFS into clusters of computers. A cluster is simply a network of computers. Each cluster might contain hundreds or even thousands of machines. Within GFS clusters there are three kinds of entities: clients, master servers and chunkservers.
In the world of GFS, the term "client" refers to any entity that makes a file request. Requests can range from retrieving and manipulating existing files to creating new files on the system. Clients can be other computers or computer applications. You can think of clients as the customers of the GFS.
The master server acts as the coordinator for the cluster. The master's duties include maintaining an operation log, which keeps track of the activities of the master's cluster. The operation log helps keep service interruptions to a minimum -- if the master server crashes, a replacement server that has monitored the operation log can take its place. The master server also keeps track of metadata, which is the information that describes chunks. The metadata tells the master server to which files the chunks belong and where they fit within the overall file. Upon startup, the master polls all the chunkservers in its cluster. The chunkservers respond by telling the master server the contents of their inventories. From that moment on, the master server keeps track of the location of chunks within the cluster.
There's only one active master server per cluster at any one time (though each cluster has multiple copies of the master server in case of a hardware failure). That might sound like a good recipe for a bottleneck -- after all, if there's only one machine coordinating a cluster of thousands of computers, wouldn't that cause data traffic jams? The GFS gets around this sticky situation by keeping the messages the master server sends and receives very small. The master server doesn't actually handle file data at all. It leaves that up to the chunkservers.
Chunkservers are the workhorses of the GFS. They're responsible for storing the 64-MB file chunks. The chunkservers don't send chunks to the master server. Instead, they send requested chunks directly to the client. The GFS copies every chunk multiple times and stores it on different chunkservers. Each copy is called a replica. By default, the GFS makes three replicas per chunk, but users can change the setting and make more or fewer replicas if desired.

Using the Google File System

File requests follow a standard work flow. A read request is simple -- the client sends a request to the master server to find out where the client can find a particular file on the system. The server responds with the location for the primary replica of the respective chunk. The primary replica holds a lease from the master server for the chunk in question.
­
If no replica currently holds a lease, the master server designates a chunk as the primary. It does this by comparing the IP address of the client to the addresses of the chunkservers containing the replicas. The master server chooses the chunkserver closest to the client. That chunkserver's chunk becomes the primary. The client then contacts the appropriate chunkserver directly, which sends the replica to the client.
Write requests are a little more complicated. The client still sends a request to the master server, which replies with the location of the primary and secondary replicas. The client stores this information in a memory cache. That way, if the client needs to refer to the same replica later on, it can bypass the master server. If the primary replica becomes unavailable or the replica changes, the client will have to consult the master server again before contacting a chunkserver.
The client then sends the write data to all the replicas, starting with the closest replica and ending with the furthest one. It doesn't matter if the closest replica is a primary or secondary. Google compares this data delivery method to a pipeline.
Once the replicas receive the data, the primary replica begins to assign consecutive serial numbers to each change to the file. Changes are called mutations. The serial numbers instruct the replicas on how to order each mutation. The primary then applies the mutations in sequential order to its own data. Then it sends a write request to the secondary replicas, which follow the same application process. If everything works as it should, all the replicas across the cluster incorporate the new data. The secondary replicas report back to the primary once the application process is over.
At that time, the primary replica reports back to the client. If the process was successful, it ends here. If not, the primary replica tells the client what happened. For example, if one secondary replica failed to update with a particular mutation, the primary replica notifies the client and retries the mutation application several more times. If the secondary replica doesn't update correctly, the primary replica tells the secondary replica to start over from the beginning of the write process. If that doesn't work, the master server will identify the affected replica as garbage.

Other Google File System Functions

Apart from the basic services the GFS provides, there are a few special functions that help keep the system running smoothly. While designing the system, the GFS developers knew that certain issues were bound to pop up based upon the system's architecture. They chose to use cheap hardware, which made building a large system a cost-effective process. It also meant that the individual computers in the system wouldn't always be reliable. The cheap price tag went hand-in-hand with computers that have a tendency to fail.
The GFS developers built functions into the system to compensate for the inherent unreliability of individual components. Those functions include master and chunk replication, a streamlined recovery process, rebalancing, stale replica detection, garbage removal and checksumming.
While there's only one active master server per GFS cluster, copies of the master server exist on other machines. Some copies, called shadow masters, provide limited services even when the primary master server is active. Those services are limited to read requests, since those requests don't alter data in any way. The shadow master servers always lag a little behind the primary master server, but it's usually only a matter of fractions of a second. The master server replicas maintain contact with the primary master server, monitoring the operation log and polling chunkservers to keep track of data. If the primary master server fails and cannot restart, a secondary master server can take its place.
The GFS replicates chunks to ensure that data is available even if hardware fails. It stores replicas on different machines across different racks. That way, if an entire rack were to fail, the data would still exist in an accessible format on another machine. The GFS uses the unique chunk identifier to verify that each replica is valid. If one of the replica's handles doesn't match the chunk handle, the master server creates a new replica and assigns it to a chunkserver.
The master server also monitors the cluster as a whole and periodically rebalances the workload by shifting chunks from one chunkserver to another. All chunkservers run at near capacity, but never at full capacity. The master server also monitors chunks and verifies that each replica is current. If a replica doesn't match the chunk's identification number, the master server designates it as a stale replica. The stale replica becomes garbage. After three days, the master server can delete a garbage chunk. This is a safety measure -- users can check on a garbage chunk before it is deleted permanently and prevent unwanted deletions.
To prevent data corruption, the GFS uses a system called checksumming. The system breaks each 64 MB chunk into blocks of 64 kilobytes (KB). Each block within a chunk has its own 32-bit checksum, which is sort of like a fingerprint. The master server monitors chunks by looking at the checksums. If the checksum of a replica doesn't match the checksum in the master server's memory, the master server deletes the replica and creates a new one to replace it.

Google File System Hardware

Google says little about the hardware it currently uses to run the GFS other than it's a collection of off-the-shelf, cheap Linux servers. But in an official GFS report, Google revealed the specifications of the equipment it used to run some benchmarking tests on GFS performance. While the test equipment might not be a true representation of the current GFS hardware, it gives you an idea of the sort of computers Google uses to handle the massive amounts of data it stores and manipulates.
The test equipment included one master server, two master replicas, 16 clients and 16 chunkservers. All of them used the same hardware with the same specifications, and they all ran on Linux operating systems. Each had dual 1.4 gigahertz Pentium III processors, 2 GB of memory and two 80 GB hard drives. In comparison, several vendors currently offer consumer PCs that are more than twice as powerful as the servers Google used in its tests. Google developers proved that the GFS could work efficiently using modest equipment.
The network connecting the machines together consisted of a 100 megabytes-per-second (Mbps) full-duplex Ethernet connection and two Hewlett Packard 2524 network switches. The GFS developers connected the 16 client machines to one switch and the other 19 machines to another switch. They linked the two switches together with a one gigabyte-per-second (Gbps) connection.
By lagging behind the leading edge of hardware technology, Google can purchase equipment and components at bargain prices. The structure of the GFS is such that it's easy to add more machines at any time. If a cluster begins to approach full capacity, Google can add more cheap hardware to the system and rebalance the workload. If a master server's memory is overtaxed, Google can upgrade the master server with more memory. The system is truly scalable.
How did Google decide to use this system? Some credit Google's hiring policy. Google has a reputation for hiring computer science majors right out of graduate school and giving them the resources and space they need to experiment with systems like the GFS. Others say it comes from a "do what you can with what you have" mentality that many computer system developers (including Google's founders) seem to possess. In the end, Google probably chose the GFS because it's geared to handle the kinds of processes that help the company pursue its stated goal of organizing the world's information.


No comments:

Post a Comment