Blocking GPTBot in the top 1 million websites
Following the announcement from OpenAI that its web crawler, GPTBot, will be used to gather training data for ChatGPT, many in the SEO community have been discussing blocking the bot in robots.txt.
I decided to check just how many websites were actually following these directions.
I use ChatGPT frequently, for everything from simplifying work processes to finding new BBQ recipes, and right now I have no plans to block GPTBot from learning from my or my clients' content. ChatGPT is a significant development with incredible potential, so I'm happy to contribute to it in whatever small way I can. That said, I can see why people are worried about the language model learning from their content and sharing their insights without directing people back to the website; then again, that's basically what Google has been trying to do for a while with featured snippets.
This project aims to shed some light on the uptake of the User-agent: GPTBot / Disallow: / directive in robots.txt files across the 1 million most popular websites on the internet.
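For reference, the directive being counted looks like this in a robots.txt file:

```
User-agent: GPTBot
Disallow: /
```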
The results will be updated monthly to analyse changes in the blocking of GPTBot.
Why do I think this will be interesting?
It’s a fairly simple project and not too time-consuming, and it’s always good to have some numbers to go along with changes in the SEO industry. There are often hysterical outbursts whenever a new thing comes along – remember Mobilegeddon? I like to treat SEO news with a healthy pinch of scepticism.
Steps taken to get the data:
- Download list of top 1 Million websites according to the Majestic Million
- Remove all data except the list of domains
- Ensure the domains list has no trailing slashes
- Add the suffix: /robots.txt to the end of each domain
- Create a Python script to scrape all websites, following redirects for http://, https://, www., etc.
- Allow up to 10 seconds for a timeout in the case of slow sites.
- Search for the text GPTBot in any capitalisation, e.g. GPTBot, gptbot, GPTbot.
- Add results to the .csv, returning: 0 if the text doesn’t exist (website not blocking GPTBot), 1 if the text does exist (website blocking GPTBot), and x if the robots.txt isn’t accessible or doesn’t exist.
- Output results for websites ranked 1-100, 1-1,000, 1-10,000, 1-100,000, and 1-1,000,000
- Run script monthly to analyse changes in behaviour towards GPTBot
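The steps above can be sketched roughly as follows. This is an illustration rather than the exact script used; the function names and the classification helper are invented for the example, and it uses only the Python standard library.

```python
import csv
import urllib.error
import urllib.request

TIMEOUT = 10  # seconds allowed before giving up on slow sites


def classify(robots_text):
    """Return 1 if GPTBot appears in any capitalisation, else 0."""
    return 1 if "gptbot" in robots_text.lower() else 0


def check_domain(domain):
    """Fetch a domain's robots.txt, following redirects.

    Returns 1 (GPTBot mentioned), 0 (not mentioned), or "x" if the
    file is unreachable or doesn't exist.
    """
    url = f"https://{domain}/robots.txt"
    try:
        with urllib.request.urlopen(url, timeout=TIMEOUT) as resp:
            return classify(resp.read().decode("utf-8", errors="replace"))
    except (urllib.error.URLError, TimeoutError, ValueError):
        return "x"


def run(domains, out_path="results.csv"):
    """Write one row per domain to a CSV: domain, result code."""
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["domain", "gptbot"])
        for domain in domains:
            writer.writerow([domain, check_domain(domain)])
```

Counting the 1s within each ranking slice (top 100, top 1,000, and so on) then gives the blocking percentages reported below.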
Key takeaways from the first scrape (September 27, 2023)
- Of the top 100 and top 1,000 websites, 14.0% and 14.2% respectively were blocking GPTBot at the time of crawling
- The figure roughly halves for the top 10,000 websites, to 7.46%
- There is a further significant drop in the top 100,000 websites, down to 3.14%
- Blocking is low across the full top 1,000,000 websites: only 1.1% have added GPTBot to their robots.txt disallow list.
Noteworthy domains in the top 100 blocking GPTBot
Noteworthy domains in the top 1000 blocking GPTBot
Data collection method
The data used was the Majestic Million, which changes frequently and is built on Majestic’s proprietary ranking system, so it is not a perfect representation of the 1 million most visited domains.
robots.txt not discovered
A significant number of domains either didn’t have a robots.txt file, or my crawler couldn’t find it. I added these to the “not blocking GPTBot” category.
I’m just collecting the data and reporting the numbers; I don’t know why sites are or aren’t blocking GPTBot. This is intended as an interesting project, so interpret the results as you wish.
Differences in results with other similar studies
Other studies have shown different results. Originality.ai, for example, scraped the top 1000 websites and reported that 26% of websites are blocking GPTBot.
When I checked their data to see why their results differed so much from mine, I noticed that they count domains blocking GPTBot, ChatGPT-User, CCBot, and Anthropic AI, whereas my data only includes domains specifically blocking GPTBot. Looking at their GPTBot column alone, their data aligns closely with the results from this research.
You can download the data here.