Google Architecture Overview
In this section, we give a high-level overview of how the whole system works, as pictured in Figure 1. Further sections discuss the applications and data structures not mentioned here. Most of Google is implemented in C or C++ for efficiency and can run on either Solaris or Linux.
In Google, web crawling (the downloading of web pages) is done by several distributed crawlers. A URLserver sends lists of URLs to be fetched to the crawlers. The fetched web pages are then sent to the storeserver, which compresses them and stores them in a repository. Every web page has an associated ID number called a docID, which is assigned whenever a new URL is parsed out of a web page. The indexing function is performed by the indexer and the sorter. The indexer performs a number of functions. It reads
the repository, uncompresses the documents, and parses them. Each document is converted into a set of
word occurrences called hits. The hits record the word, position in document, an approximation of font
size, and capitalization. The indexer distributes these hits into a set of "barrels", creating a partially
sorted forward index. The indexer performs another important function. It parses out all the links in
every web page and stores important information about them in an anchors file. This file contains
enough information to determine where each link points from and to, and the text of the link.
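The indexer's conversion of documents into hits can be sketched roughly as follows. This is a minimal illustration, not the actual implementation: the `Hit` fields follow the description above, but the tokenizer, the `NUM_BARRELS` value, the modulo rule for choosing a barrel, and the function names are all assumptions made for the example.

```python
import re
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    word_id: int       # ID of the word in the lexicon
    position: int      # word position within the document
    capitalized: bool  # whether the occurrence starts with a capital letter
    font_size: int     # approximation of font size (hypothetical scale)


NUM_BARRELS = 4  # hypothetical; the text does not give a barrel count


def word_id_for(word, lexicon):
    """Return the wordID for a word, assigning a new one if unseen."""
    return lexicon.setdefault(word.lower(), len(lexicon))


def index_document(doc_id, text, lexicon, barrels, font_size=1):
    """Turn one document into hits and distribute them into barrels.

    Each barrel maps docID -> list of hits, i.e. a forward index.
    Choosing the barrel by wordID modulo NUM_BARRELS is an assumption.
    """
    for position, token in enumerate(re.findall(r"[A-Za-z0-9]+", text)):
        wid = word_id_for(token, lexicon)
        hit = Hit(wid, position, token[0].isupper(), font_size)
        barrels[wid % NUM_BARRELS][doc_id].append(hit)


lexicon = {}
barrels = [defaultdict(list) for _ in range(NUM_BARRELS)]
index_document(1, "Google indexes the Web", lexicon, barrels)
index_document(2, "the web is large", lexicon, barrels)
```

Within each barrel the hits stay grouped by docID, which is what makes the result a forward index rather than an inverted one.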
The URLresolver reads the anchors file and converts relative URLs into absolute URLs and in turn into
docIDs. It puts the anchor text into the forward index, associated with the docID that the anchor points
to. It also generates a database of links which are pairs of docIDs. The links database is used to compute
PageRanks for all the documents.
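As a rough sketch of what the URLresolver does, the snippet below resolves relative links against the page they came from, maps URLs to docIDs, and collects link pairs and anchor text. The record layout `(source_url, href, anchor_text)`, the function name `resolve_anchors`, and the example URLs are assumptions for illustration; only `urllib.parse.urljoin` is a real library call.

```python
from urllib.parse import urljoin


def resolve_anchors(anchors, doc_ids):
    """Resolve anchor records into docID link pairs and anchor text.

    anchors: iterable of (source_url, href, anchor_text) tuples
             (a hypothetical layout for the anchors file).
    doc_ids: dict mapping absolute URL -> docID; new URLs get new IDs.
    """
    links = []        # (source docID, target docID) pairs
    anchor_text = {}  # target docID -> list of anchor texts pointing to it
    for source_url, href, text in anchors:
        target_url = urljoin(source_url, href)  # relative -> absolute
        src = doc_ids.setdefault(source_url, len(doc_ids))
        dst = doc_ids.setdefault(target_url, len(doc_ids))
        links.append((src, dst))
        anchor_text.setdefault(dst, []).append(text)
    return links, anchor_text


doc_ids = {"http://example.com/a/index.html": 0}
anchors = [
    ("http://example.com/a/index.html", "../b/page.html", "page B"),
    ("http://example.com/a/index.html", "http://example.org/", "example org"),
]
links, anchor_text = resolve_anchors(anchors, doc_ids)
```

Note that the anchor text is stored under the docID of the page the link points to, not the page that contains the link, matching the description above.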
The sorter takes the barrels, which are sorted by docID (this is a simplification, see Section 4.2.5), and
re-sorts them by wordID to generate the inverted index. This is done in place so that little temporary
space is needed for this operation. The sorter also produces a list of wordIDs and offsets into the
inverted index. A program called DumpLexicon takes this list together with the lexicon produced by the
indexer and generates a new lexicon to be used by the searcher. The searcher is run by a web server and
uses the lexicon built by DumpLexicon together with the inverted index and the PageRanks to answer
queries.
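The sorter's step from forward index to inverted index can be illustrated with a small sketch. Here a barrel is assumed to be a flat list of `(doc_id, word_id, position)` tuples, and the names `sort_barrel` and `lookup` are hypothetical. Python's `list.sort` reorders the list itself, which only loosely mirrors the in-place sorting described above, and ranking by PageRank is left out.

```python
def sort_barrel(barrel):
    """Re-sort a barrel by wordID and return wordID -> offset.

    barrel: list of (doc_id, word_id, position) tuples, initially
            ordered by docID (a hypothetical flat layout).
    The returned offsets play the role of the list of wordIDs and
    offsets into the inverted index.
    """
    barrel.sort(key=lambda entry: (entry[1], entry[0], entry[2]))
    offsets = {}
    for i, (_, word_id, _) in enumerate(barrel):
        offsets.setdefault(word_id, i)  # remember the first entry per wordID
    return offsets


def lookup(barrel, offsets, word_id):
    """Return the docIDs containing word_id, in docID order."""
    start = offsets.get(word_id)
    if start is None:
        return []
    docs = []
    for doc_id, wid, _ in barrel[start:]:
        if wid != word_id:
            break  # reached the entries of the next wordID
        if not docs or docs[-1] != doc_id:
            docs.append(doc_id)
    return docs


barrel = [(1, 7, 0), (1, 3, 1), (2, 3, 0), (2, 9, 1), (3, 7, 4)]
offsets = sort_barrel(barrel)
```

After sorting, all entries for a given wordID are contiguous, so a searcher only needs the offset from the lexicon to find every document containing a word.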






