Google is a multi-billion dollar company. It's one of the big power players on the World Wide Web and beyond. The company relies on a distributed computing system to provide users with the infrastructure they need to access, create and alter data. Surely Google buys state-of-the-art computers and servers to keep things running smoothly, right?
Wrong.
 The machines that power Google's operations aren't cutting-edge power 
computers with lots of bells and whistles. In fact, they're relatively 
inexpensive machines running on Linux operating systems. How can one of the most influential companies on the Web rely on cheap hardware? It's due to the Google File System (GFS),
 which capitalizes on the strengths of off-the-shelf servers while 
compensating for any hardware weaknesses. It's all in the design.
Google
uses the GFS to organize and manipulate huge files and to give application developers the research and development resources they require. The GFS is unique to Google and isn't for sale, but it could
serve as a model for file systems for organizations with similar needs.
Some
 GFS details remain a mystery to anyone outside of Google. For example, 
Google doesn't reveal how many computers it uses to operate the GFS. In 
official Google papers, the company only says that there are "thousands"
 of computers in the system (source: Google). But despite this veil of secrecy, Google has made much of the GFS's structure and operation public knowledge.
Google File System Basics
Google
 developers routinely deal with large files that can be difficult to 
manipulate using a traditional computer file system. The size of the 
files drove many of the decisions programmers had to make for the GFS's 
design. Another big concern was scalability, which refers to how easily capacity can be added to a system. A scalable system can grow without its performance suffering. Google requires a very large network of computers to handle all of its files, so scalability is a top concern.
Because
 the network is so huge, monitoring and maintaining it is a challenging 
task. While developing the GFS, programmers decided to automate as much 
of the administrative duties required to keep the system running as 
possible. This is a key principle of autonomic computing,
 a concept in which computers are able to diagnose problems and solve 
them in real time without the need for human intervention. The challenge
 for the GFS team was to not only create an automatic monitoring system,
 but also to design it so that it could work across a huge network of 
computers.
The key to the team's designs was the concept of 
simplification. They came to the conclusion that as systems grow more 
complex, problems arise more often. A simple approach is easier to 
control, even when the scale of the system is huge.
Based on that philosophy, the GFS team decided that users would have access to basic file commands. These include commands like open, create, read, write and close files. The team also included a couple of specialized commands: append and snapshot.
 They created the specialized commands based on Google's needs. Append 
allows clients to add information to an existing file without 
overwriting previously written data. Snapshot is a command that creates 
a quick, low-cost copy of a file or directory tree.
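To make that interface concrete, here is a minimal sketch of what a GFS-style client interface might look like. The class and method names (GFSClient, append, snapshot and so on) are hypothetical illustrations of the commands described above, not Google's actual API.

# A minimal, hypothetical sketch of a GFS-style client interface.
# The names below are illustrative; Google's real client library is not public.
class GFSClient:
    def create(self, path):
        """Create a new, empty file at the given path."""
    def open(self, path):
        """Open an existing file and return a handle to it."""
    def read(self, handle, offset, length):
        """Read length bytes starting at offset."""
    def write(self, handle, offset, data):
        """Write data at offset, overwriting what was there."""
    def close(self, handle):
        """Release the file handle."""
    # The two specialized commands described above:
    def append(self, handle, data):
        """Add data to the end of the file without touching existing bytes."""
    def snapshot(self, src_path, dst_path):
        """Make a quick, low-cost copy of a file or directory tree."""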
Files on the GFS tend to be 
very large, usually in the multi-gigabyte (GB) range. Accessing and 
manipulating files that large would take up a lot of the network's bandwidth. Bandwidth is the capacity of a system to move data
 from one location to another. The GFS addresses this problem by 
breaking files up into chunks of 64 megabytes (MB) each. Every chunk 
receives a unique 64-bit identification number called a chunk handle. While the GFS can process smaller files, its developers didn't optimize the system for those kinds of tasks.
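As a rough illustration, the sketch below splits a local file into fixed-size 64 MB chunks and gives each one a random 64-bit identifier, mimicking the chunk handles described above. It is a simplified stand-in, not Google's code; in the real system the master server hands out globally unique handles.

import secrets

CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB, the fixed chunk size described above

def split_into_chunks(path):
    """Yield (chunk_handle, chunk_bytes) pairs for a local file.
    The 64-bit handle is generated randomly here purely for illustration;
    in GFS the master server assigns the handles."""
    with open(path, "rb") as f:
        while True:
            data = f.read(CHUNK_SIZE)
            if not data:
                break
            handle = secrets.randbits(64)  # unique 64-bit chunk handle
            yield handle, data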
By
requiring all the file chunks to be the same size, the GFS simplifies resource allocation. It's easy to see which computers in the system are near capacity and which are underused. It's also easy to move chunks from one chunkserver to another to balance the workload across the system.
Google File System Architecture
Google organized the GFS into clusters of computers. A cluster is simply a network of computers. Each cluster might contain hundreds or even thousands of machines. Within GFS clusters there are three kinds of entities: clients, master servers and chunkservers.
In
 the world of GFS, the term "client" refers to any entity that makes a 
file request. Requests can range from retrieving and manipulating 
existing files to creating new files on the system. Clients can be other
 computers or computer applications. You can think of clients as the 
customers of the GFS.
The master server acts as the coordinator for the cluster. The master's duties include maintaining an operation log,
 which keeps track of the activities of the master's cluster. The 
operation log helps keep service interruptions to a minimum -- if the 
master server crashes, a replacement server that has monitored the 
operation log can take its place. The master server also keeps track of metadata,
 which is the information that describes chunks. The metadata tells the 
master server to which files the chunks belong and where they fit within
 the overall file. Upon startup, the master polls all 
the chunkservers in its cluster. The chunkservers respond by telling the
 master server the contents of their inventories. From that moment on, 
the master server keeps track of the location of chunks within the 
cluster.
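In data-structure terms, the master's metadata boils down to two maps: which chunks make up each file, and which chunkservers currently hold each chunk. The sketch below, with hypothetical names and made-up values, shows how the second map could be rebuilt by polling chunkservers at startup, as described above.

# Hypothetical sketch of the master's in-memory metadata: one map from file
# name to its ordered chunk handles, one from chunk handle to the
# chunkservers currently holding a replica. All values are made up.
file_to_chunks = {
    "/logs/crawl.log": [0x1A2B3C4D5E6F7081, 0x1A2B3C4D5E6F7082],
}
chunk_locations = {}  # chunk handle -> list of chunkserver addresses

def poll_chunkservers(chunkservers):
    """Rebuild chunk locations at startup by asking every chunkserver what it
    stores. chunkservers maps an address to that server's inventory."""
    chunk_locations.clear()
    for address, inventory in chunkservers.items():
        for handle in inventory:
            chunk_locations.setdefault(handle, []).append(address)

# Example: two chunkservers report their inventories at startup.
poll_chunkservers({
    "chunkserver-01:7000": [0x1A2B3C4D5E6F7081],
    "chunkserver-02:7000": [0x1A2B3C4D5E6F7081, 0x1A2B3C4D5E6F7082],
})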
There's only one active master server per cluster at any 
one time (though each cluster has multiple copies of the master server 
in case of a hardware failure). That might sound like a good recipe for a
 bottleneck -- after all, if there's only one machine coordinating a 
cluster of thousands of computers, wouldn't that cause data
 traffic jams? The GFS gets around this sticky situation by keeping the 
messages the master server sends and receives very small. The master 
server doesn't actually handle file data at all. It leaves that up to 
the chunkservers.
Chunkservers are the workhorses of the GFS. 
They're responsible for storing the 64-MB file chunks. The chunkservers 
don't send chunks to the master server. Instead, they send requested 
chunks directly to the client. The GFS copies every chunk multiple times
 and stores it on different chunkservers. Each copy is called a replica.
 By default, the GFS makes three replicas per chunk, but users can 
change the setting and make more or fewer replicas if desired.
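A simple way to picture replica placement is choosing several distinct chunkservers for each chunk, with three as the default. The sketch below does exactly that; it ignores the rack-awareness discussed later and is only an illustration, with made-up server names.

import random

def place_replicas(chunk_handle, chunkservers, replica_count=3):
    """Pick replica_count distinct chunkservers to hold copies of a chunk.
    Three replicas is the GFS default; callers can ask for more or fewer."""
    if replica_count > len(chunkservers):
        raise ValueError("not enough chunkservers for the requested replicas")
    return random.sample(chunkservers, replica_count)

servers = ["cs-01", "cs-02", "cs-03", "cs-04", "cs-05"]
print(place_replicas(0x1A2B, servers))                    # default: three replicas
print(place_replicas(0x1A2B, servers, replica_count=2))   # fewer, if desired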
Using the Google File System
File requests follow a standard work flow. A read request is simple 
-- the client sends a request to the master server to find out where the
 client can find a particular file on the system. The server responds 
with the location for the primary replica of the respective chunk. The 
primary replica holds a lease from the master server for the chunk in question.
If no replica currently holds a lease, the master server 
designates one of the replicas as the primary. It does this by comparing the IP
address of the client to the addresses of the chunkservers containing 
the replicas. The master server
 chooses the chunkserver closest to the client. That chunkserver's chunk
 becomes the primary. The client then contacts the appropriate 
chunkserver directly, which sends the replica to the client.
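Putting the read path together: the client asks the master where the chunk lives, the master picks (or reuses) a primary, and the client then fetches the data straight from that chunkserver. The sketch below follows the article's description of choosing the chunkserver "closest" to the client; the distance and fetch functions are stand-ins, and all names are hypothetical.

def choose_primary(client_addr, replica_addrs, distance):
    """Pick the replica 'closest' to the client, per the description above.
    distance is a stand-in for whatever proximity measure is used."""
    return min(replica_addrs, key=lambda addr: distance(client_addr, addr))

# Tiny stand-in "master state" so the read path below can actually run.
chunk_locations = {0xABCD: ["cs-01", "cs-02", "cs-03"]}
lease_holders = {}  # chunk handle -> address of the primary replica

def read_chunk(client_addr, chunk_handle, fetch, distance):
    """Simplified read path: look up replica locations, pick a primary if no
    replica holds a lease, then read the chunk directly from that chunkserver."""
    replicas = chunk_locations[chunk_handle]
    primary = lease_holders.get(chunk_handle)
    if primary is None:
        primary = choose_primary(client_addr, replicas, distance)
        lease_holders[chunk_handle] = primary
    return fetch(primary, chunk_handle)  # data flows from chunkserver to client

# Example with toy stand-ins for network distance and data fetching.
toy_distance = lambda a, b: abs(hash(a) - hash(b))
toy_fetch = lambda server, handle: f"<chunk {handle:#x} from {server}>"
print(read_chunk("client-07", 0xABCD, toy_fetch, toy_distance))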
Write
 requests are a little more complicated. The client still sends a 
request to the master server, which replies with the location of the 
primary and secondary replicas. The client stores this information in a 
memory cache. That way, if the client needs to refer to the same replica
 later on, it can bypass the master server. If the primary replica 
becomes unavailable or the replica changes, the client will have to 
consult the master server again before contacting a chunkserver.
The
 client then sends the write data to all the replicas, starting with the
 closest replica and ending with the furthest one. It doesn't matter if 
the closest replica is a primary or secondary. Google compares this data delivery method to a pipeline.
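The data push can be pictured as a chain: the client hands the data to the nearest replica, which forwards it to the next, and so on. The sketch below orders the replicas by a stand-in distance measure and forwards along that chain; all names and the toy send function are illustrative.

def pipeline_push(client_addr, replicas, data, distance, send):
    """Forward write data along a chain of replicas, nearest to farthest,
    mimicking the pipelined delivery described above. distance and send are
    stand-ins for real network operations."""
    chain = sorted(replicas, key=lambda addr: distance(client_addr, addr))
    sender = client_addr
    for receiver in chain:
        send(sender, receiver, data)  # each hop forwards the data to the next replica
        sender = receiver
    return chain  # the order in which the data traveled

# Toy example so the sketch can be run end to end.
log_send = lambda src, dst, data: print(f"{src} -> {dst}: {len(data)} bytes")
toy_distance = lambda a, b: abs(hash(a) - hash(b))
pipeline_push("client-07", ["cs-03", "cs-01", "cs-02"], b"new record", toy_distance, log_send)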
Once
 the replicas receive the data, the primary replica begins to assign 
consecutive serial numbers to each change to the file. Changes are 
called mutations. The serial numbers instruct the 
replicas on how to order each mutation. The primary then applies the 
mutations in sequential order to its own data. Then it sends a write 
request to the secondary replicas, which follow the same application 
process. If everything works as it should, all the replicas across the 
cluster incorporate the new data. The secondary replicas report back to 
the primary once the application process is over.
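Mutation ordering can be sketched with a counter on the primary: each incoming change gets the next serial number, the primary applies it to its own data, and the secondaries apply mutations in the same numbered order and report back. The classes below are a hypothetical illustration of that idea, not Google's implementation.

class SecondaryReplica:
    """Applies mutations strictly in serial-number order."""
    def __init__(self):
        self.expected = 1
        self.data = bytearray()

    def apply(self, serial, mutation):
        if serial != self.expected:        # out-of-order mutation: refuse it
            return False
        self.data.extend(mutation)
        self.expected += 1
        return True                        # acknowledge success to the primary

class PrimaryReplica:
    """Hypothetical sketch of how a primary might order mutations."""
    def __init__(self, secondaries):
        self.serial = 0
        self.data = bytearray()
        self.secondaries = secondaries     # list of SecondaryReplica objects

    def apply_mutation(self, mutation):
        self.serial += 1                   # assign the next serial number
        self.data.extend(mutation)         # apply to the primary's own data first
        acks = [s.apply(self.serial, mutation) for s in self.secondaries]
        return all(acks)                   # success only if every secondary applied it

primary = PrimaryReplica([SecondaryReplica(), SecondaryReplica()])
print(primary.apply_mutation(b"record-1"))  # True: all replicas applied the mutation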
At that time, 
the primary replica reports back to the client. If the process was 
successful, it ends here. If not, the primary replica tells the client 
what happened. For example, if one secondary replica fails to apply a particular mutation, the primary replica notifies the client and retries the mutation several more times. If the secondary replica still doesn't apply it correctly, the primary tells that replica to start over from the beginning of the write process. If that doesn't work, the master server identifies the affected replica as garbage.
Other Google File System Functions
Apart from the basic services the GFS provides, there are a few 
special functions that help keep the system running smoothly. While 
designing the system, the GFS developers knew that certain issues were 
bound to pop up based upon the system's architecture. They chose to use 
cheap hardware, which made building a large system a cost-effective 
process. It also meant that the individual computers in the system wouldn't always be reliable. The cheap price tag went hand-in-hand with computers that have a tendency to fail.
The
 GFS developers built functions into the system to compensate for the 
inherent unreliability of individual components. Those functions include
 master and chunk replication, a streamlined recovery process, 
rebalancing, stale replica detection, garbage removal and checksumming.
While there's only one active master server per GFS cluster, copies of the master server exist on other machines. Some copies, called shadow masters,
provide limited services even when the primary master server is down.
 Those services are limited to read requests, since those requests don't
 alter data in any way. The shadow master servers always lag a little 
behind the primary master server, but it's usually only a matter of 
fractions of a second. The master server replicas maintain contact with 
the primary master server, monitoring the operation log and polling 
chunkservers to keep track of data. If the primary master server fails 
and cannot restart, a secondary master server can take its place.
The GFS replicates chunks to ensure that data is available even if 
hardware fails. It stores replicas on different machines across 
different racks. That way, if an entire rack were to 
fail, the data would still exist in an accessible format on another 
machine. The GFS uses the unique chunk identifier to verify that each 
replica is valid. If one of the replica's handles doesn't match the 
chunk handle, the master server creates a new replica and assigns it to a
 chunkserver.
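Rack-aware placement can be pictured as "never put two replicas of the same chunk in the same rack unless there's no other choice." The helper below is a hypothetical sketch of that rule, with made-up rack and server names.

def place_across_racks(chunkservers_by_rack, replica_count=3):
    """Choose replica_count chunkservers, spreading them across racks so that a
    whole-rack failure still leaves an accessible replica. Purely illustrative."""
    racks = {rack: list(servers) for rack, servers in chunkservers_by_rack.items()}
    placement = []
    while len(placement) < replica_count:
        progressed = False
        # Take one server from each rack that still has unused servers.
        for rack, servers in racks.items():
            if servers and len(placement) < replica_count:
                placement.append(servers.pop(0))
                progressed = True
        if not progressed:
            raise ValueError("not enough chunkservers for the requested replicas")
    return placement

print(place_across_racks({"rack-a": ["cs-01", "cs-02"],
                          "rack-b": ["cs-03"],
                          "rack-c": ["cs-04"]}))  # one replica per rack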
The master server also monitors the cluster as a 
whole and periodically rebalances the workload by shifting chunks from 
one chunkserver to another. All chunkservers run at near capacity, but 
never at full capacity. The master server also monitors chunks and 
verifies that each replica is current. If a replica has missed mutations and fallen behind the current version of its chunk, the master server designates it as a stale replica. The stale replica becomes garbage. After three days, the
master server can delete a garbage chunk. This is a safety measure -- 
users can check on a garbage chunk before it is deleted permanently and 
prevent unwanted deletions.
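Stale-replica detection and the three-day grace period can be sketched together: a replica whose version lags the chunk's current version is marked as garbage with a timestamp, and only garbage older than three days becomes eligible for deletion. The version numbers and names below are illustrative assumptions, not Google's internals.

import time

GRACE_PERIOD = 3 * 24 * 60 * 60  # three days, in seconds

garbage = {}  # (chunk handle, chunkserver) -> time the replica was marked stale

def check_replica(handle, server, replica_version, current_version):
    """Mark a replica as garbage if its version lags the chunk's current version."""
    if replica_version < current_version:
        garbage[(handle, server)] = time.time()

def collectable(now=None):
    """Return replicas that have been garbage for more than three days and can
    now be deleted. Until then, they can still be inspected or rescued."""
    now = time.time() if now is None else now
    return [key for key, marked in garbage.items() if now - marked > GRACE_PERIOD]

check_replica(0xABCD, "cs-02", replica_version=6, current_version=7)  # stale replica
print(collectable(now=time.time() + 4 * 24 * 60 * 60))  # eligible after four days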
To prevent data corruption, the GFS 
uses a system called checksumming. The system breaks each 64 MB chunk 
into blocks of 64 kilobytes (KB). Each block within a chunk has its own 
32-bit checksum, which is sort of like a fingerprint.
Chunkservers verify these checksums whenever data is read. If a block's checksum doesn't match the value stored for it, the chunkserver reports the corrupted replica to the master server, which creates a new replica from a healthy copy and then has the corrupted one deleted.
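Per-block checksumming is easy to sketch with a 32-bit CRC: compute one checksum per 64 KB block when data is written, then recompute and compare on every read. The code below is a simplified stand-in; the exact checksum algorithm GFS uses isn't specified at this level of detail.

import zlib

BLOCK_SIZE = 64 * 1024  # 64 KB blocks within each 64 MB chunk

def block_checksums(chunk_data):
    """Compute a 32-bit checksum (CRC-32 here) for every 64 KB block."""
    return [zlib.crc32(chunk_data[i:i + BLOCK_SIZE])
            for i in range(0, len(chunk_data), BLOCK_SIZE)]

def verify(chunk_data, stored_checksums):
    """Recompute checksums on read and report any corrupted block indexes."""
    current = block_checksums(chunk_data)
    return [i for i, (a, b) in enumerate(zip(current, stored_checksums)) if a != b]

data = bytearray(b"x" * (3 * BLOCK_SIZE))
stored = block_checksums(data)
data[BLOCK_SIZE + 10] = 0          # simulate corruption in the second block
print(verify(data, stored))        # -> [1]: block 1 no longer matches its checksum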
Google File System Hardware
Google says little about the hardware it currently uses to run the GFS other than that it's a collection of inexpensive, off-the-shelf Linux
 servers. But in an official GFS report, Google revealed the 
specifications of the equipment it used to run some benchmarking tests 
on GFS performance. While the test equipment might not be a true 
representation of the current GFS hardware, it gives you an idea of the 
sort of computers Google uses to handle the massive amounts of data it 
stores and manipulates.
The test equipment included one master 
server, two master replicas, 16 clients and 16 chunkservers. All of them
 used the same hardware with the same specifications, and they all ran 
on Linux operating systems. Each had dual 1.4 gigahertz Pentium III processors, 2 GB of memory and two 80 GB hard drives.
 In comparison, several vendors currently offer consumer PCs that are 
more than twice as powerful as the servers Google used in its tests. 
Google developers proved that the GFS could work efficiently using 
modest equipment.
The network connecting the machines consisted of a 100 megabit-per-second (Mbps) full-duplex Ethernet
 connection and two Hewlett Packard 2524 network switches. The GFS 
developers connected the 16 client machines to one switch and the other 
19 machines to another switch. They linked the two switches together 
with a one gigabit-per-second (Gbps) connection.
By lagging 
behind the leading edge of hardware technology, Google can purchase 
equipment and components at bargain prices. The structure of the GFS is 
such that it's easy to add more machines at any time. If a cluster 
begins to approach full capacity, Google can add more cheap hardware to 
the system and rebalance the workload. If a master server's memory is overtaxed, Google can upgrade the master server with more memory. The system is truly scalable.
How
 did Google decide to use this system? Some credit Google's hiring 
policy. Google has a reputation for hiring computer science majors right
 out of graduate school and giving them the resources and space they 
need to experiment with systems like the GFS. Others say it comes from a
 "do what you can with what you have" mentality that many computer 
system developers (including Google's founders) seem to possess. In the 
end, Google probably chose the GFS because it's geared to handle the 
kinds of processes that help the company pursue its stated goal of 
organizing the world's information.
