We have a development server at dev.example.com that is being indexed by Google. We use AWS Lightsail to duplicate the development server to our production environment in its entirety, so the same robots.txt file is served on both dev.example.com and example.com.
Google's robots.txt documentation doesn't explicitly state whether root domains can be specified. Can I add domain-specific rules to the robots.txt file? For example, is this acceptable:
User-agent: *
Disallow: https://dev.example.com/
User-agent: *
Allow: https://example.com/
Sitemap: https://example.com/sitemap.xml
To be clear, I know this could be handled with .htaccess rewrite rules; my question is specifically about robots.txt.
No, you can't specify a domain in robots.txt. Disallow: https://dev.example.com/ is not valid. Page 6 of the robots.txt exclusion standard says that a Disallow line should contain a "path" rather than a full URL that includes the domain.
Each host name (domain or subdomain) has its own robots.txt file. So to prevent Googlebot from crawling https://dev.example.com/ you would need to serve https://dev.example.com/robots.txt with the content:
User-agent: *
Disallow: /
At the same time, you would need to serve a different file from https://example.com/robots.txt, perhaps with the content:
User-agent: *
Disallow:
Sitemap: https://example.com/sitemap.xml
If the same code base powers both your dev and production servers, you will need to generate the content of robots.txt conditionally, based on whether the site is running in production or not.
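As a minimal sketch, assuming the sites happen to be served by a Python/Flask application (the question doesn't say what actually powers them, so adapt this to your framework), the robots.txt route can branch on the requesting host name:

from flask import Flask, Response, request

app = Flask(__name__)

# robots.txt body for the dev host: block all crawling
DEV_ROBOTS = "User-agent: *\nDisallow: /\n"

# robots.txt body for production: allow crawling and advertise the sitemap
PROD_ROBOTS = "User-agent: *\nDisallow:\n\nSitemap: https://example.com/sitemap.xml\n"

@app.route("/robots.txt")
def robots():
    # request.host is the Host header sent by the client,
    # e.g. "dev.example.com" or "example.com"
    body = DEV_ROBOTS if request.host.startswith("dev.") else PROD_ROBOTS
    return Response(body, mimetype="text/plain")

The same idea works in any stack, including an .htaccess RewriteRule keyed on the HTTP_HOST variable; the point is simply that each host name returns its own robots.txt body.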
Alternatively, you could allow Googlebot to crawl both, but include a <link rel=canonical href=...> tag on every page that points to the URL of that page on the live site (for example, a page at https://dev.example.com/about would point to https://example.com/about). See How to use rel='canonical' properly.