Online
 
Thursday, 20 November 2008
 
 

SEO - Web Crawler | Print |  E-Mail
 
SEO - Web Crawler Running a web crawler is a challenging task. There are tricky performance and reliability issues and
even more importantly, there are social issues. Crawling is the most fragile application since it involves
interacting with hundreds of thousands of web servers and various name servers which are all beyond
the control of the system.
In order to scale to hundreds of millions of web pages, Google has a fast distributed crawling system. A
single URLserver serves lists of URLs to a number of crawlers (we typically ran about 3). Both the
URLserver and the crawlers are implemented in Python. Each crawler keeps roughly 300 connections
open at once. This is necessary to retrieve web pages at a fast enough pace. At peak speeds, the system
can crawl over 100 web pages per second using four crawlers. This amounts to roughly 600K per second
of data. A major performance stress is DNS lookup. Each crawler maintains a its own DNS cache so it
does not need to do a DNS lookup before crawling each document. Each of the hundreds of connections
can be in a number of different states: looking up DNS, connecting to host, sending request, and
receiving response. These factors make the crawler a complex component of the system. It uses
asynchronous IO to manage events, and a number of queues to move page fetches from state to state.
It turns out that running a crawler which connects to more than half a million servers, and generates tens
of millions of log entries generates a fair amount of email and phone calls. Because of the vast number
of people coming on line, there are always those who do not know what a crawler is, because this is the
first one they have seen. Almost daily, we receive an email something like, "Wow, you looked at a lot of
pages from my web site. How did you like it?" There are also some people who do not know about the
robots exclusion protocol, and think their page should be protected from indexing by a statement like,
"This page is copyrighted and should not be indexed", which needless to say is difficult for web crawlers
to understand. Also, because of the huge amount of data involved, unexpected things will happen. For
example, our system tried to crawl an online game. This resulted in lots of garbage messages in the
middle of their game! It turns out this was an easy problem to fix. But this problem had not come up
until we had downloaded tens of millions of pages. Because of the immense variation in web pages and
servers, it is virtually impossible to test a crawler without running it on large part of the Internet.
Invariably, there are hundreds of obscure problems which may only occur on one page out of the whole
web and cause the crawler to crash, or worse, cause unpredictable or incorrect behavior. Systems which
access large parts of the Internet need to be designed to be very robust and carefully tested. Since large
complex systems such as crawlers will invariably cause problems, there needs to be significant resources
devoted to reading the email and solving these problems as they come up.


Indexing the Web
Parsing -- Any parser which is designed to run on the entire Web must handle a huge array of
possible errors. These range from typos in HTML tags to kilobytes of zeros in the middle of a tag,
non-ASCII characters, HTML tags nested hundreds deep, and a great variety of other errors that
challenge anyone’s imagination to come up with equally creative ones. For maximum speed,
instead of using YACC to generate a CFG parser, we use flex to generate a lexical analyzer which
we outfit with its own stack. Developing this parser which runs at a reasonable speed and is very
robust involved a fair amount of work.

Indexing Documents into Barrels -- After each document is parsed, it is encoded into a number
of barrels. Every word is converted into a wordID by using an in-memory hash table -- the lexicon.
New additions to the lexicon hash table are logged to a file. Once the words are converted into
wordID’s, their occurrences in the current document are translated into hit lists and are written into
the forward barrels. The main difficulty with parallelization of the indexing phase is that the
lexicon needs to be shared. Instead of sharing the lexicon, we took the approach of writing a log of
all the extra words that were not in a base lexicon, which we fixed at 14 million words. That way
multiple indexers can run in parallel and then the small log file of extra words can be processed by
one final indexer.

Sorting -- In order to generate the inverted index, the sorter takes each of the forward barrels and
sorts it by wordID to produce an inverted barrel for title and anchor hits and a full text inverted
barrel. This process happens one barrel at a time, thus requiring little temporary storage. Also, we
parallelize the sorting phase to use as many machines as we have simply by running multiple
sorters, which can process different buckets at the same time. Since the barrels don’t fit into main
memory, the sorter further subdivides them into baskets which do fit into memory based on
wordID and docID. Then the sorter, loads each basket into memory, sorts it and writes its contents
into the short inverted barrel and the full inverted barrel.

 

Read Other Article:"SEO - Pagerank "

 Links to this article:"SEO - Web Crawler"

This entry was posted on . You can follow any responses to this entry through the RSS 2.0 feed. You can leave a comment.
Users' Comments (0)

Comment an article
  Name
  E-mail
   Title
Available characters: 4000
 Notify me of follow-up comments
This image contains a scrambled text, it is using a combination of colors, font size, background, angle in order to disallow computer to automate reading. You will have to reproduce it to post on my homepage
Enter what you see:

No comment posted

Jumbo Coklat
 
Top! Top!