The Robots Exclusion Protocol lets web crawlers know which URLs on your website they may access. The crawling instructions are given in a file named robots.txt, which is located in the root folder of the website on the web server.
In other words, robots.txt is key to the success of a website, because it lets you keep bots from crawling unnecessary URLs.
Path of robots.txt
The correct place to keep robots.txt is the root folder of the website on the web server. If you keep the file in any other location, bots will not be able to find it, so your rules will not be applied and a crawl error may be reported.
Bots have a rule built into their logic: they consult the robots.txt file before crawling the website, and they look for the file only in the root location of the web server.
The file can be accessed from the browser by adding /robots.txt to the root address of the site, in the format https://<your-domain>/robots.txt:
Check out my robots.txt
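Because the file always lives at the root, its address can be derived from any page URL of the site. The snippet below is a minimal sketch of that idea in Python; the function name robots_txt_url and the example.com address are made up for illustration.

```python
from urllib.parse import urlsplit, urlunsplit

def robots_txt_url(page_url: str) -> str:
    """Build the robots.txt address for the site that serves page_url."""
    parts = urlsplit(page_url)
    # Keep scheme and host, replace the path, drop query string and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))

# Hypothetical page URL, used only for illustration.
print(robots_txt_url("https://www.example.com/blog/post.html?id=7"))
# prints https://www.example.com/robots.txt
```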
Things to keep in mind
Anyone can view this file, so it is good practice not to put any confidential information in it, for example an email ID, name, or phone number.
Understanding Bots and robots.txt
You would like your content to be available in search as soon as you hit publish. The question is how to make that happen. This is where web robots come into the picture. Every search engine has its own bots for this job. There are around 302 legitimate bots according to the official robotstxt website.
Generally, these bots visit your website about every third day, although the exact frequency varies by site and search engine. You can also submit a new URL to a search engine yourself using its webmaster tools.
When a web robot comes to check a new URL, it first goes through the robots.txt file. This file tells the bot which URLs are accessible for crawling.
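The check described above can be sketched with Python's standard urllib.robotparser module. This is only an illustration: the rules, the bot name ExampleBot, and the example.com URLs are hypothetical, and real crawlers may implement the check differently.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real bot would download it from the site root.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

def crawlable(urls, robots_txt, bot_name="ExampleBot"):
    """Return only the URLs that the given rules let bot_name crawl."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in urls if parser.can_fetch(bot_name, url)]

# Hypothetical list of newly published URLs.
new_urls = [
    "https://www.example.com/blog/new-post.html",
    "https://www.example.com/private/draft.html",
]
print(crawlable(new_urls, ROBOTS_TXT))
# prints ['https://www.example.com/blog/new-post.html']
```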
Tags of robots.txt
You can do almost everything by understanding just three tags of this file.
User-agent: the bot name is given under this tag. If you put ‘*’, the rules that follow apply to all bots.
Disallow: any URL that should not be crawled is given under this tag.
For example, a Disallow rule for shippingOrder means that bots will not crawl any URL starting with shippingOrder, including all of its child URLs.
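The effect of such a rule can be tried out with Python's standard urllib.robotparser module. The rules below are an illustrative reconstruction, not the original example: the exact path /shippingOrder, the bot name ExampleBot, and the example.com URLs are assumptions.

```python
from urllib.robotparser import RobotFileParser

# Illustrative rules; the exact path used in the original example is assumed.
rules = [
    "User-agent: *",
    "Disallow: /shippingOrder",
]

parser = RobotFileParser()
parser.parse(rules)

def is_allowed(path, bot_name="ExampleBot"):
    """Check a path on a hypothetical site against the rules above."""
    return parser.can_fetch(bot_name, "https://www.example.com" + path)

print(is_allowed("/shippingOrder"))               # False: matches the rule itself
print(is_allowed("/shippingOrder/details.html"))  # False: child URL of the blocked path
print(is_allowed("/products.html"))               # True: no rule matches
```

Note that this parser matches by prefix, so any path that merely begins with /shippingOrder is blocked as well.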
Allow: with this tag you give the crawler a specific condition under which it may crawl.
If you put an Allow rule for order.html just above the Disallow rule, the effect is that order.html will be crawled and the rest of the disallowed URLs will be ignored. This applies when you use both rules together in the same file.
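The interaction between the two tags can be sketched with Python's standard urllib.robotparser module, which applies the first rule that matches. The paths below are assumptions (the text only names order.html and shippingOrder), and ExampleBot and example.com are hypothetical.

```python
from urllib.robotparser import RobotFileParser

def check(rules, path, bot_name="ExampleBot"):
    """Parse the rules and report whether bot_name may crawl path."""
    parser = RobotFileParser()
    parser.parse(rules)
    return parser.can_fetch(bot_name, "https://www.example.com" + path)

# Allow placed just above Disallow (paths are assumed for illustration).
allow_first = [
    "User-agent: *",
    "Allow: /shippingOrder/order.html",
    "Disallow: /shippingOrder",
]
# Same rules in the opposite order.
disallow_first = [
    "User-agent: *",
    "Disallow: /shippingOrder",
    "Allow: /shippingOrder/order.html",
]

print(check(allow_first, "/shippingOrder/order.html"))     # True
print(check(allow_first, "/shippingOrder/list.html"))      # False
print(check(disallow_first, "/shippingOrder/order.html"))  # False with this parser
```

The last line shows why the order matters for a first-match parser like this one. Some crawlers choose the most specific (longest) matching rule instead, so for them the order of the two lines may make no difference.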
You can get detailed information from the official robotstxt website.
To conclude, robots.txt is a crucial file that contains the logic for handling web robots. Let me know your thoughts and views in the comment section.