What Is robots.txt and How Does a robots.txt File Work?
➲ What is robots.txt?
A robots.txt file tells search engine crawlers which URLs the crawler can access on your site. This is used mainly to avoid overloading your site with requests; it is not a mechanism for keeping a web page out of Google. To keep a web page out of Google, block indexing with noindex or password-protect the page.
When a search engine visits your website, it looks for robots.txt in your root directory (e.g., https://www.example.com/robots.txt) to see if there are any special rules.
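As a minimal illustration (the blocked path and sitemap URL here are hypothetical, not rules from any real site), a robots.txt file might look like this:

```txt
# Applies to all crawlers
User-agent: *
Disallow: /admin/

# Optional: point crawlers at the sitemap
Sitemap: https://www.example.com/sitemap.xml
```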
➲ Why is robots.txt used?
The robots.txt file is used to give instructions to web robots, such as search engine crawlers, about which locations within the website they are allowed, or not allowed, to crawl and index.
That said, there are three main reasons you’d want to use a robots.txt file:
Block Non-Public Pages: Sometimes, you have pages on your site that you don’t want indexed. For example, you might have a staging version of a page, a login page, or an internal search results page. These pages need to exist, but you don’t want random people landing on them. In this case, you’d use robots.txt to block these pages from search engine crawlers and bots.
Maximize Crawl Budget: If you’re having trouble getting all of your pages indexed, you might have a crawl budget problem. By blocking unimportant pages with robots.txt, Googlebot can spend more of your crawl budget on the pages that actually matter.
Prevent Search Engine Indexing of Resources: Using meta directives can work just as well as robots.txt for preventing pages from getting indexed. However, meta directives don’t work well for multimedia resources, like PDFs and images. That’s where robots.txt comes into play.
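A robots.txt file covering the three cases above could be sketched like this (all paths are made-up examples; note that the `*` and `$` wildcards are supported by major crawlers such as Googlebot and Bingbot, but are not guaranteed for every bot):

```txt
User-agent: *

# 1. Block non-public pages
Disallow: /login/
Disallow: /staging/
Disallow: /search-results/

# 2. Preserve crawl budget by skipping low-value filter URLs
Disallow: /products/filter

# 3. Keep multimedia resources (e.g., PDFs) from being crawled
Disallow: /downloads/*.pdf$
```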
➲ How does a robots.txt file work?
A robots.txt file is just a text file with no HTML markup (hence the .txt extension). The robots.txt file is hosted on the web server just like any other file on the website. In fact, the robots.txt file for any given website can typically be viewed by typing the full URL for the homepage and then adding /robots.txt, like https://www.cloudflare.com/robots.txt. The file isn't linked to anywhere else on the site, so users aren't likely to stumble upon it, but most web crawler bots will look for this file first before crawling the rest of the site.
While a robots.txt file provides instructions for bots, it can't actually enforce the instructions. A good bot, such as a web crawler or a news feed bot, will attempt to visit the robots.txt file first before viewing any other pages on a domain, and will follow the instructions. A bad bot will either ignore the robots.txt file or will process it in order to find the webpages that are forbidden.
A web crawler bot will follow the most specific set of instructions in the robots.txt file. If there are contradictory commands in the file, the bot will follow the more granular command.
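You can experiment with how rules are matched using Python's standard-library `urllib.robotparser`. The sketch below uses made-up rules and URLs; be aware that major crawlers like Googlebot resolve conflicts by picking the most specific (longest) matching rule, while Python's parser applies rules in file order, so the rules here are ordered to give the same result either way:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: the Allow rule is more granular than the Disallow.
rules = """\
User-agent: *
Allow: /private/help.html
Disallow: /private/
""".splitlines()

rp = RobotFileParser()
rp.parse(rules)

# The more granular Allow wins for the help page...
print(rp.can_fetch("*", "https://www.example.com/private/help.html"))   # True
# ...while everything else under /private/ stays blocked.
print(rp.can_fetch("*", "https://www.example.com/private/report.html")) # False
```

In a real crawler you would call `rp.set_url("https://www.example.com/robots.txt")` followed by `rp.read()` to fetch the live file instead of parsing a string.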
One important thing to note is that all subdomains need their own robots.txt file. For instance, while www.cloudflare.com has its own file, all the Cloudflare subdomains (blog.cloudflare.com, community.cloudflare.com, etc.) need their own as well.
➲ What is a user agent? What does 'User-agent: *' mean?
Any person or program active on the Internet will have a "user agent," or an assigned name. For human users, this includes information like the browser type and the operating system version but no personal information; it helps websites show content that's compatible with the user's system. For bots, the user agent (theoretically) helps website administrators know what kind of bots are crawling the site.
In a robots.txt file, website administrators are able to provide specific instructions for specific bots by writing different instructions for bot user agents. For instance, if an administrator wants a certain page to show up in Google search results but not Bing searches, they could include two sets of commands in the robots.txt file: one set preceded by "User-agent: Bingbot" and one set preceded by "User-agent: Googlebot".
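For instance, such a file could contain groups like the following (a hypothetical sketch; the blocked path is made up):

```txt
# Bing may not crawl this page
User-agent: Bingbot
Disallow: /special-offer.html

# Google may crawl everything (an empty Disallow allows all)
User-agent: Googlebot
Disallow:

# All other bots fall back to this group
User-agent: *
Disallow: /admin/
```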
Administrators can also include "User-agent: *" in the robots.txt file, as Cloudflare does. The asterisk represents a "wild card" user agent: the instructions in that group apply to every bot that is not matched by a more specific group, rather than to any one specific bot.
Common search engine bot user agent names include:
Google:
- Googlebot
- Googlebot-Image (for images)
- Googlebot-News (for news)
- Googlebot-Video (for video)
Bing:
- Bingbot
- MSNBot-Media (for images and video)
Baidu:
- Baiduspider
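These names slot directly into the user-agent groups shown earlier. For example, to keep Google's image crawler out of a photo directory while leaving all other bots unaffected (the path is hypothetical):

```txt
User-agent: Googlebot-Image
Disallow: /photos/
```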
