Google on Monday released its robots.txt parsing and matching library as open source, in the hope that making the code public will help web developers settle on a consistent way to write rules for web crawlers.
The C++ library powers Googlebot, the company's web crawler, in accordance with the Robots Exclusion Protocol (REP), a scheme that lets website owners declare how software that visits their sites to index them should behave. REP specifies how directives can be placed in a text file, robots.txt, to tell visiting crawlers like Googlebot which website resources may be visited and which may be indexed.
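Python's standard library ships a basic REP parser, `urllib.robotparser`, which is enough to illustrate how a crawler applies these directives. A minimal sketch (the example.com URLs and the rules themselves are illustrative, not taken from any real site):

```python
from urllib import robotparser

# A tiny illustrative robots.txt: block everything under /private/ for all agents.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

rp = robotparser.RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

# A crawler checks each URL against the rules before fetching it.
print(rp.can_fetch("Googlebot", "https://example.com/index.html"))      # True
print(rp.can_fetch("Googlebot", "https://example.com/private/x.html"))  # False
```

Note that Python's parser and Google's C++ library do not agree on every corner case, which is precisely the ambiguity the standardization effort aims to eliminate.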
In the 25 years since Martijn Koster, creator of the first web search engine, devised the rules, REP has been widely adopted by web publishers, but it was never blessed as an official internet standard.
"[S]ince its inception, the REP hasn't been updated to cover today's corner cases," explained a trio of Googlers, Henner Zeller, Lizzi Harvey, and Gary Illyes, in a blog post. "This is a challenging problem for website owners because the ambiguous de facto standard made it difficult to write the rules correctly."
For example, differences in the way text editors handle newline characters on different operating systems can prevent robots.txt files from working as expected.
Google's library goes out of its way to make those files less fragile. For example, it includes code to accept five different misspellings of the "disallow" directive in robots.txt.
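The blog post doesn't enumerate which misspellings the library tolerates, but the idea is straightforward: normalize near-miss directive names before matching. A hypothetical sketch, with made-up variants:

```python
# Hypothetical typo-tolerant directive parsing; the actual list of
# variants accepted by Google's C++ library may differ.
DISALLOW_VARIANTS = {"disallow", "dissallow", "dissalow", "disalow", "diasallow"}

def parse_directive(line: str) -> tuple[str, str]:
    """Split a robots.txt line into (key, value), normalizing typos of 'disallow'."""
    key, _, value = line.partition(":")
    key = key.strip().lower()
    if key in DISALLOW_VARIANTS:
        key = "disallow"
    return key, value.strip()

print(parse_directive("Dissallow: /tmp/"))  # ('disallow', '/tmp/')
print(parse_directive("User-agent: *"))     # ('user-agent', '*')
```

Accepting sloppy input is a deliberate design choice: a strict parser that silently drops a misspelled Disallow line would expose content the site owner clearly intended to hide.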
To make REP implementations more consistent, Google is pushing for REP to become an Internet Engineering Task Force standard. It has published a draft proposal in the hope that anyone concerned with such things will weigh in on what's needed.
The latest draft expands robots.txt beyond HTTP to any URI-based transfer protocol, including FTP and CoAP. Other changes include a requirement that developers parse only the first 500 kibibytes of a robots.txt file, to minimize demands on servers, and a maximum caching time of 24 hours, unless the robots.txt file is inaccessible.
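The 500 kibibyte cap means a compliant crawler can simply truncate whatever it downloads before parsing. A minimal sketch of that rule (the function name is ours, not from the draft):

```python
# Per the draft, crawlers need only consider the first 500 KiB of robots.txt.
MAX_ROBOTS_BYTES = 500 * 1024

def clamp_robots_body(body: bytes) -> bytes:
    """Return at most the first 500 KiB of a fetched robots.txt body."""
    return body[:MAX_ROBOTS_BYTES]

oversized = b"x" * (2 * 1024 * 1024)          # a 2 MiB file
print(len(clamp_robots_body(oversized)))      # 512000
```

Rules beyond the cutoff are simply ignored, which gives site owners an incentive to keep important directives near the top of the file.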
The trio of Googlers point out that RFC stands for "request for comments," and insist they really do want to hear from developers who have thoughts on improving the standard.
"As we work to give web creators the controls they need to tell us how much of their information they want to make available to Googlebot, and by extension, eligible to appear in Search, we have to make sure we get this right," they said.
The Chocolate Factory is not always so solicitous of input. Last month, the ad giant decided to champion the adoption of a "toast" notification element for the web.
The reaction to Google's toast proposal has been skeptical. Developers worry that Chrome's dominant market share, amplified by Microsoft's recent adoption of the Chromium open source project as the basis for its Edge browser, establishes Google's technical decisions as de facto standards. The company has so much influence over the web, they fret, that it doesn't have to check with the web community.
"It feels like a beneficial idea designed by Google, approved by Google, and dropped on the web without any consideration for others," said developer Terence Eden last month.
Dave Cramer, a developer who works for book publisher Hachette, edits the EPUB specification, and contributes to the CSS Working Group, voiced a similar complaint about Google's habit of presenting new web technology before engaging outsiders.
"There seems to have been no discussion with other browser vendors or standards bodies before the intent to implement," he said in a GitHub post. "Why is this a problem? Google is asking for comments on a solution, not on how to solve the problem." ®