Sysadmins use the robots.txt file to give instructions about their site to Googlebot and other web bots. This mechanism is called the Robots Exclusion Protocol.
Crawling is the process by which Google and other search engines discover new and updated pages to be added to the Google index.
The program that does the fetching is called Googlebot (also known as a robot, bot, or spider). Googlebot uses an algorithmic process: computer programs determine which sites to crawl, how often, and how many pages to fetch from each site.
Googlebot visits a website at regular intervals [determined largely by how frequently the webmaster updates the site's contents] and indexes the site. Based on the content, the bot determines the authority of your website and how important it is with respect to other sites on the web.
New sites, changes to existing sites, and dead links are noted and used to update the Google index.
The primary purpose of the robots.txt file is to restrict search engine robots' access to parts of your website. The file resides in the root-level directory of the website, and most search engine bots look for robots.txt and take instructions from it before crawling the site.
For example, if a robot wants to visit the page http://example.com/home.html, it will first check for the existence of the file at the root of the site: http://example.com/robots.txt
1. How Search Engines Locate robots.txt
To find the file, a robot strips the path component from the URL (everything from the first single slash onward) and puts ‘/robots.txt’ in its place:
http://example.com/home/index.html -> http://example.com/robots.txt
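To make this mapping concrete, here is a minimal sketch in Python (assuming only the standard urllib module) that derives the robots.txt location from any page URL in the same way:

# Minimal sketch: derive a site's robots.txt URL from any page URL by
# dropping the path and appending /robots.txt.
from urllib.parse import urlsplit, urlunsplit

def robots_txt_url(page_url):
    parts = urlsplit(page_url)  # scheme, netloc, path, query, fragment
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))

print(robots_txt_url("http://example.com/home/index.html"))
# prints: http://example.com/robots.txt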
The following points should be kept in mind while using robots.txt.
- The file must be named exactly robots.txt (all lowercase).
- Malware robots that scan for security vulnerabilities, and email address harvesters used by spammers, will simply ignore robots.txt.
- The robots.txt file is publicly available, so we should not use it to hide sensitive information; anyone can download it directly, as the short sketch after this list shows.
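For instance, a few lines of Python are enough to read any site's robots.txt (a minimal sketch; example.com is a placeholder domain that may not actually serve such a file):

# Sketch: robots.txt is world-readable, so anyone can simply fetch it.
from urllib.request import urlopen

with urlopen("http://example.com/robots.txt", timeout=10) as response:
    print(response.read().decode("utf-8", errors="replace"))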
2. Prevent Indexing of the Whole Site, Including the Home Page
User-agent: *
Disallow: /
The forward slash ‘/’ represents the root level of the website.
‘User-agent: *’ applies the rule to all bots; we can target a specific bot by replacing ‘*’ with that bot's name.
Examples of Google's bots are Googlebot and Googlebot-News.
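As a quick way to verify rules like these, Python's standard urllib.robotparser can evaluate them (a minimal sketch; the URLs and the Googlebot-News variant are illustrative assumptions):

# Minimal sketch: check the rules above with the standard robots.txt parser.
from urllib.robotparser import RobotFileParser

# Rule set that blocks every bot from the whole site.
block_all = RobotFileParser()
block_all.parse(["User-agent: *", "Disallow: /"])
print(block_all.can_fetch("Googlebot", "http://example.com/home.html"))  # False

# Variant that blocks only Googlebot-News; other bots are unaffected.
block_one = RobotFileParser()
block_one.parse(["User-agent: Googlebot-News", "Disallow: /"])
print(block_one.can_fetch("Googlebot-News", "http://example.com/home.html"))  # False
print(block_one.can_fetch("SomeOtherBot", "http://example.com/home.html"))    # True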
3. Restrict a Specific Folder
We won't generally restrict access to the whole website; more often we restrict bots from a specific directory. For that, each restriction is listed on its own line, introduced by the keyword ‘Disallow’:
User-agent: *
Disallow: /admin
Here we are restricting access to the folder ‘admin’, including all of the contents inside that folder.
4. Restrict a Specific File
To restrict access to an individual file, the syntax is similar:
User-agent: *
Disallow: /Secure/myfile.html
Here we are not restricting all of the contents of the folder ‘Secure’, only the file ‘myfile.html’.
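To see how the folder rule and the file rule behave for a well-behaved crawler, here is a small check with Python's standard urllib.robotparser (a sketch only; the URLs are made-up examples):

# Sketch: the folder rule blocks everything beneath it,
# while the file rule blocks only that single page.
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /admin",               # blocks /admin and everything beneath it
    "Disallow: /Secure/myfile.html",  # blocks only this one file
])

print(rp.can_fetch("*", "http://example.com/admin/users.html"))    # False
print(rp.can_fetch("*", "http://example.com/Secure/myfile.html"))  # False
print(rp.can_fetch("*", "http://example.com/Secure/other.html"))   # True
print(rp.can_fetch("*", "http://example.com/index.html"))          # True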
5. Advanced Pattern Matching in robots.txt
Pattern matching can be used, for example, to restrict access to all dynamically generated URLs that contain a ‘?’:
User-agent: *
Disallow: /*?
6. Wildcard Patterns for Folders
We can also use the ‘*’ wildcard to restrict access to a whole set of folders.
For example, to restrict these folders: image-jpg, image-png, image-ico
The rule below restricts access to all folders whose names start with ‘image’:
User-agent: *
Disallow: /image*/
7. Restrict Files of Specific Extension
To restrict access to, or prevent caching of, files with a specific extension:
User-agent: *
Disallow: /*.php$
The ‘$’ means the rule matches only URLs that end with the given extension (here, .php).
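Python's built-in urllib.robotparser does not understand the ‘*’ and ‘$’ wildcards, so as a rough illustration only, the sketch below hand-translates such rules into regular expressions to show which URLs a wildcard-aware crawler would treat as blocked (the rules are the examples from above, the URLs are made up, and this is an approximation of the idea, not Google's actual matcher):

# Rough sketch: approximate wildcard matching for the patterns used above.
import re

def rule_to_regex(rule):
    pattern = re.escape(rule).replace(r"\*", ".*")  # '*' matches any run of characters
    if pattern.endswith(r"\$"):
        pattern = pattern[:-2] + "$"                # trailing '$' anchors the end of the URL
    return re.compile("^" + pattern)

rules = ["/*?", "/image*/", "/*.php$"]
urls = ["/search?q=robots", "/image-png/logo.png", "/index.php", "/index.php?x=1"]

for rule in rules:
    rx = rule_to_regex(rule)
    for url in urls:
        if rx.match(url):
            print(rule, "blocks", url)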
By using the above examples, we will be able to create a well-formatted robots.txt that prevents web bots from indexing the contents of the site.
Additional References: How Google Search Works