BHL Traffic Challenges
In the past several weeks, we’ve seen a large, disruptive increase in traffic to the BHL main website. This blog post is meant to summarize the event, the effect on BHL and its servers, our response, and what we have planned should it happen again.
The problems experienced on BHL’s website are not related in any way to the transition away from the Smithsonian. We believe the timing of this activity and the resulting downtime to be purely coincidental.
In early June, the Technical Team observed increased load on the full-text-search server, though it remained within acceptable limits and BHL performance was unaffected. Since this is not unusual for short periods of time and the server continued handling traffic, we weren’t too concerned.
In the week of 7 June, we started to see that BHL was sometimes slower to respond. It was then that we noticed that traffic to the search server was dramatically higher than normal. A quick analysis of the server logs found that BHL was serving more than 10 times the usual number of searches. More importantly, none of this activity was appearing in Google Analytics, which typically records only visitors whose browsers run its JavaScript tracking code. It was clear that we were dealing with traffic from a bot. The bot wasn’t simply performing searches; it was also loading other parts of the site, usually calling each part with a different taxonomic name each time.
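For readers who want to run a similar check on their own site, the log analysis described above can be sketched roughly as follows. This is a minimal illustration, not the script we used: it assumes access logs in the common Apache/NGINX "combined" layout, and the function names and the `baseline` parameter are made up for the example. Your server's log format may differ.

```python
import re
from collections import Counter
from datetime import datetime

# Assumed "combined" log layout, e.g.
# 203.0.113.5 - - [10/Jun/2025:12:00:01 +0000] "GET /search?q=x HTTP/1.1" 200 ...
# Adjust the pattern if your server logs in a different format.
LOG_LINE = re.compile(r'\[(?P<ts>[^\]]+)\] "(?P<method>[A-Z]+) (?P<path>\S+)')


def searches_per_day(lines, prefix="/search"):
    """Count requests per calendar day whose path is the search URL."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.search(line)
        if not match:
            continue  # skip lines that do not look like access-log entries
        path = match.group("path")
        is_search = (
            path == prefix
            or path.startswith(prefix + "?")
            or path.startswith(prefix + "/")
        )
        if is_search:
            stamp = datetime.strptime(match.group("ts"), "%d/%b/%Y:%H:%M:%S %z")
            counts[stamp.date().isoformat()] += 1
    return dict(counts)


def days_over_baseline(counts, baseline, factor=10):
    """Return the days whose count exceeds `factor` times the usual daily level."""
    return sorted(day for day, n in counts.items() if n > baseline * factor)
```

Calling `searches_per_day(open("access.log"))` and passing the result to `days_over_baseline` with your own typical daily count would flag the days that look like the spike described here. The file name is a placeholder.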
What is most surprising is that this was not API traffic. BHL’s API provides rich data to anyone who uses it, data that is meant to be consumed by software. This bot was loading the BHL web pages as a person would using their computer’s web browser. The results it got were less computer-friendly than what it would have received from the API!
Unfortunately, the traffic was coming from all over the world and the bot was clearly not identifying itself as GoogleBot, BingBot or even OpenAI’s GPTBot. Slowing it down or blocking it entirely was proving to be a challenge. We activated rate limits on the web server, but the distributed nature of the bot prevented that from being effective. We investigated implementing Cloudflare Turnstile, but configuring that on short notice is a challenge. We thought about slowing down all queries to the server, but we knew that would affect all BHL users. In the end, we found that the bot’s requests had a sort of “fingerprint” that was different from that of regular users. We were prepared to start limiting traffic using the fingerprint when... the traffic stopped. On 19 June, BHL started once again responding with its usual speed.
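To illustrate why rate limits keyed on the client address did little against a distributed bot, while a fingerprint could have worked, here is a minimal sketch. It is not what we deployed. The choice of request headers used as a fingerprint is purely hypothetical (this post does not describe the real one), and `fingerprint`, `over_limit`, and the request layout are names invented for the example.

```python
import hashlib
from collections import Counter

# Hypothetical fingerprint ingredients; the real fingerprint is not described
# in this post and may have used entirely different signals.
FINGERPRINT_HEADERS = ("user-agent", "accept", "accept-language", "accept-encoding")


def fingerprint(headers):
    """Hash a fixed set of header values into a short, comparable token."""
    normalized = {name.lower(): value.strip() for name, value in headers.items()}
    material = "\n".join(normalized.get(name, "") for name in FINGERPRINT_HEADERS)
    return hashlib.sha256(material.encode("utf-8")).hexdigest()[:16]


def over_limit(requests, key, limit):
    """Return the keys whose request count in this window exceeds `limit`.

    `requests` is an iterable of dicts such as
    {"ip": "198.51.100.7", "headers": {"User-Agent": "..."}}.
    """
    counts = Counter(key(request) for request in requests)
    return {k for k, n in counts.items() if n > limit}


def by_ip(request):
    return request["ip"]


def by_fingerprint(request):
    return fingerprint(request["headers"])
```

With a bot that sends one request from each of many addresses, `over_limit(window, by_ip, limit)` flags nothing, because no single address crosses the threshold. If those requests share the same header pattern, `over_limit(window, by_fingerprint, limit)` groups them together and the cluster stands out. The trade-off is that real visitors with an identical header pattern would be caught too, so the fingerprint needs to be chosen carefully.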
Looking at the graph of how many times the /search URL is called every day, we see some surprising numbers before traffic dropped on 19 June.
In general, BHL’s web infrastructure is able to easily support the current level of approximately 200,000 to 300,000 searches per day. It is even possible to support up to about 400,000 searches per day, but that’s reaching the limit of what our infrastructure can handle. This is evidence that the hardware and software we have dedicated to search is very robust and well-suited to a busy site.
We haven’t performed an extensive analysis of the sources of this traffic, but at a glance they do seem to be spread across a variety of locations around the world. While the traffic may have been centered on certain countries or networks (such as Google Cloud), we have no definitive evidence that it came from any one service.
One note: In the second chart, 6 June and 15 June show a much lower number of searches compared to the days before and after. These dips were not due to the bots slowing down. They were caused by BHL itself performing regular weekly and monthly tasks at the same time the bots were busy. This combination of activities overwhelmed the system, and BHL was not responding at all.
When we look at the news, we find that our experience mirrors that of other cultural heritage sites, such as those reported in The Register:
Claburn, Thomas. (17 June 2025) Bots are overwhelming websites with their hunger for AI data. The Register.
…and also from smaller organizations such as our colleagues at FromThePage:
Brumfield, Sara. (20 June 2025) Bot Traffic, AI Training, and Infrastructure Strain. FromThePage Blog.
Knowing we are not alone in this challenge is comforting, but it doesn’t make the problem go away. We continue to closely monitor BHL’s website performance and remain committed to keeping the website and API up and running as much as possible while we explore the best ways to mitigate the effects on BHL.