What Are Search Engine Crawlers?
The definition of a Web crawler, directly from Wikipedia:
A Web crawler is a computer program that browses the World Wide Web in a methodical, automated manner. Other terms for Web crawlers are ants, automatic indexers, bots, and worms or Web spider, Web robot.
This process is called Web crawling or spidering. Many sites, in particular search engines, use spidering as a means of providing up-to-date data. Web crawlers are mainly used to create a copy of all the visited pages for later processing by a search engine that will index the downloaded pages to provide fast searches. Crawlers can also be used for automating maintenance tasks on a Web site, such as checking links or validating HTML code. Also, crawlers can be used to gather specific types of information from Web pages, such as harvesting e-mail addresses (usually for spam).
A Web crawler is one type of bot, or software agent. In general, it starts with a list of URLs to visit, called the seeds. As the crawler visits these URLs, it identifies all the hyperlinks in the page and adds them to the list of URLs to visit, called the crawl frontier. URLs from the frontier are recursively visited according to a set of policies.
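To make the seeds and crawl frontier idea concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions: the tiny FAKE_WEB dictionary stands in for real pages (nothing is downloaded), the crawl function and its max_pages limit are made-up names, and the only "policy" applied is first-in, first-out order with no repeat visits. The loop replaces the recursion mentioned in the definition, which is a common way to write it.

```python
from collections import deque

# Hypothetical in-memory "web": each URL maps to the links found on that page.
FAKE_WEB = {
    "http://www.sample.com/": [
        "http://www.sample.com/a.html",
        "http://www.sample.com/b.html",
    ],
    "http://www.sample.com/a.html": [
        "http://www.sample.com/b.html",
        "http://www.sample.com/c.html",
    ],
    "http://www.sample.com/b.html": ["http://www.sample.com/"],
    "http://www.sample.com/c.html": [],
}


def crawl(seeds, get_links, max_pages=100):
    """Visit URLs starting from the seeds, growing the frontier as links are found."""
    frontier = deque(seeds)   # URLs waiting to be visited (the crawl frontier)
    seen = set(seeds)         # URLs already queued, so nothing is visited twice
    visited = []              # URLs in the order they were actually visited
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        visited.append(url)
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return visited


# The lambda plays the role of "download the page and extract its links".
order = crawl(["http://www.sample.com/"], lambda url: FAKE_WEB.get(url, []))
print(order)
```

A real crawler would swap the lambda for code that fetches each page and would add politeness rules, such as honoring robots.txt and pausing between requests.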
So the Web crawler is the tool that visits your site and traverses as much of it as it can, making that information available to search engines such as Google or Bing so they can find your content and rank it properly. The Web crawler is your friend, and you want to make its visit as easy as possible. If you don’t help the crawler, it falls back on generic rules that may hurt your search ranking or leave parts of your site unsearchable. It may also waste time crawling system files such as CSS files, or other files that are meaningless to the purpose of your site.
Make the Robots Happy and Productive – Use “robots.txt” Files
If you believe in Christmas and Santa, then you know what I mean when I say you wouldn’t go to bed without leaving some cookies and milk for Santa to put him in a good mood while visiting your house. A simple and standard file called “robots.txt” is what the Robots or Crawlers look for the moment they get to your site. Imagine their disappointment that you didn’t even think of them by leaving a simple robots.txt file.
How the Robots Work
Seriously speaking, the crawler or robot is looking for objects on your site: files, folders, and web links. Upon reaching the root of your site, the robot looks for the robots.txt file and uses it to decide which parts of your site it may crawl, which makes the most of the visit. The crawler then works out which objects are web pages, files, or folders, and which of them may be indexed. The search engine builds its indexes from the pages the robot was allowed to fetch, so that internet searchers can find them quickly. Most web pages contain links to other pages, and a spider normally picks them up in the order they appear in the page, from top to bottom, much like reading a book.
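Here is a rough Python sketch of how a spider might pull the links out of a page in reading order. It uses the standard html.parser module; the LinkCollector class, the sample page text, and the www.sample.com address are made up for illustration, and real crawlers are far more thorough than this.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag, in the order the tags appear."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Turn relative links into full URLs based on the page address.
                    self.links.append(urljoin(self.base_url, value))


# Hypothetical page content; a real crawler would download this.
page = (
    '<html><body><a href="/about.html">About</a> '
    '<p>text <a href="news/today.html">News</a></p></body></html>'
)
collector = LinkCollector("http://www.sample.com/index.html")
collector.feed(page)
print(collector.links)
```

The links come out in the same order they appear in the page source, which is the "reading a book" behavior described above.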
Making your robots.txt File
robots.txt is a plain text file, not HTML, and it is placed at the root of your web site. There are books written on the subject of web crawlers and the use of robots.txt files, but here is a simple start:
Location & naming
1. Name it robots.txt, all lowercase, *not* robot.txt, Robot.Txt, or spider.txt
2. Add rules to the text file, save it, and place a copy at the root of your web site. Many sites document the rule formats.
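Because the name and location are fixed, a robot can work out where your robots.txt lives from any page address on your site. A small Python sketch of that idea follows; the robots_url function is a made-up helper and the sample address is only an example.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url):
    """Return the robots.txt address for the site that page_url belongs to."""
    parts = urlsplit(page_url)
    # Keep the scheme and host, replace the path, and drop query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_url("http://www.sample.com/chat/room1.html?user=42"))
```

If the file sits anywhere else, or has a different name, the robot simply will not find it.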
Example 1 – Disallow All robots for specific folders and files
Make a list of everything on your site you DON’T want robots/spiders to visit and put it in the file like this. Note: you could replace the wildcard in the User-agent line with the name of a specific robot that you want to ban.
# robots.txt for http://www.sample.com/
User-agent: *
Disallow: /chat/ # Online chat files
Disallow: /testsite/ # This is a test area
Disallow: /login.html # This is an admin file
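If you want to check rules like these before uploading them, Python ships a robots.txt reader in its standard library (urllib.robotparser). The sketch below feeds it the example rules as text and asks whether a few addresses may be fetched. The robot name "ExampleBot" and the page paths are made up for the test, and real search engine robots may interpret some rules a little differently than this parser does.

```python
from urllib.robotparser import RobotFileParser

# The example rules, pasted in as text so nothing has to be downloaded.
RULES = """\
# robots.txt for http://www.sample.com/
User-agent: *
Disallow: /chat/ # Online chat files
Disallow: /testsite/ # This is a test area
Disallow: /login.html # This is an admin file
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())

# "ExampleBot" is a hypothetical robot name; the wildcard rule applies to it.
for path in ["/", "/chat/room1.html", "/testsite/", "/login.html", "/products.html"]:
    url = "http://www.sample.com" + path
    print(path, parser.can_fetch("ExampleBot", url))
```

The three disallowed areas should come back as False and everything else as True.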
I hope this gives you a basic understanding of robots.txt and how Web crawlers work. The information here only scratches the surface of what you can do with the robots.txt file. There are tons of sites focusing on it entirely, so I won’t bother reinventing the wheel; I just wanted to get you started.
The DisplacedGuy (a.k.a. Rich Bianco)
P.S. My daughter Heather is taking ownership of another blog called Otown411 as she wants to help the family situation with me being unemployed. She sees me working day and night and was willing to try to get Otown411 up and running. The site is targeted towards people looking to visit or vacation in the Orlando area, and we would offer insider tips about maximizing the vacation since we live here and know all the ins and outs. It would be great if you could stop over and give her some motivation to stick with it. She is like me and will be checking visitor stats constantly, which is the motivating part. I’m banking on the time I’m spending being an investment… fingers crossed. If any of the sponsor sites on this site are appropriate, please consider visiting them.