Robots to the rescue
June 26th 2024
Much of the web has been hoovered up by now, but with AI training still ongoing, it’s still worth putting some measures in place to stop our sites from being used as fodder. While there might not be a reliable way to prevent our sites from being accessed by these scrapers, there’s no harm in adding scripts that might help. I don’t understand it all well enough, but I will gladly follow the lead of some of the smart web people out there ツ
This post will serve as an archive of useful articles and links, as well as examples of how to address this issue, or simply as food for thought on the subject of AI in relation to our own sites.
Useful links, examples & articles
- AI / LLM User-Agents: Blocking Guide, as seen 23/01/25, recommended by David
- Set Up Your Robots.txt, as seen on 01/07/2024, Dark Visitors
- Consent, LLM scrapers, and poisoning the well, published 26/06/2024, Eric Bailey
- Blockin’ bots., published 12/04/2024, Ethan Marcotte
- robots.txt, as seen 10/2023, Seirdy
- ai.robots.txt, as seen 09/2023, a community effort to identify and block AI crawlers
- Google allows sites to opt out of training its LLMs for GenAI, as seen 09/2023, Jon Henshaw
+ disallow-genai-bots.txt
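To make the idea behind these links a little more concrete, here is a minimal sketch of what a blocking robots.txt can look like, plus a small Python check that the rules say what we intend. The three user-agent names (GPTBot, Google-Extended, CCBot) are only illustrative picks and nowhere near a complete list; the guides above keep far more current ones. The helper name `is_allowed` and the example.com URL are made up for this example. And as noted earlier, robots.txt is a polite request: a scraper that ignores it will not be stopped by any of this.

```python
from urllib.robotparser import RobotFileParser

# Example robots.txt content. The agent names are illustrative only,
# not a complete or current list of AI crawlers.
ROBOTS_TXT = """\
User-agent: GPTBot
User-agent: Google-Extended
User-agent: CCBot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(robots_txt: str, user_agent: str,
               url: str = "https://example.com/") -> bool:
    """Return True if robots_txt permits user_agent to fetch url.

    Hypothetical helper: it parses the rules from a string, so nothing
    is fetched over the network.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # Print the verdict for the listed crawlers and an ordinary browser agent.
    for agent in ("GPTBot", "Google-Extended", "CCBot", "Mozilla/5.0"):
        print(agent, is_allowed(ROBOTS_TXT, agent))
```

The text inside `ROBOTS_TXT` is what would go into a robots.txt file at the root of a site. The check only confirms how a well-behaved parser reads the rules, not how any particular crawler actually behaves.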
New icons for human created content
Just like we did in the old days ツ When web standards were not yet a default state of the web, we often added little icons to our sites to state that they were built following web standards with valid code. This practice has now all but disappeared ~ and it seems a new trend is rising. This is a screenshot (taken June 2024) of no-ai-icon.com offering different versions for a little icon badge which we can use to indicate that our content is made by a human without any AI tools involved ツ nice ツ
