
They Drank Our Milkshakes

"milkshake", from my 2024 Photo-A-Day Project


As a hobbyist developer on a tight budget, my servers generally run on inexpensive VPSes (Referral Link). They're not configured to autoscale. Floods of unexpected traffic cause them to fall over. This fragility is by design, so that my hosting costs won't cause a budget overflow.

On the modern web, however, this presents a problem.

We Begin Our Story ...

On the World Wide Web of ages long past, otherwise known as the 2000s and 2010s, a bargain was struck with the search engines. They would be allowed to scrape websites, so long as they behaved responsibly and obeyed the rules outlined in each site's robots.txt file.

The robots.txt file declared what was to be considered "private property, no trespassing" – and, in some technical cases, "abandon all hope ye who enter here".

This arrangement was beneficial to the websites because it boosted their visibility in search results, bringing in more views. And it benefitted the search engines by giving them better and more accurate results.

The search engines entrenched themselves as the foundational way to find anything on the internet.

And with so many eyes on them, it was inevitable that they would embrace advertising as their revenue centre. But this blog post isn't about ads. This is about something more recent, and much worse for the Web.

Things Turn Sour ...

In those bygone days, a major threat for website uptime was the Distributed Denial of Service (DDoS) attack. This most often involved tens of thousands of malware-riddled desktop computers being assembled into a "botnet". All the nodes in the botnet could flood servers with millions of requests, overwhelming them to the point that they crashed. The goal was to prevent people from accessing those servers during and after the attack.

When a website generates thousands of dollars per day, being knocked offline is a direct hit to revenue. The threat of DDoS loomed over the web, like a looming thing. A thing that looms. A loomer.

Some of these botnets & related tools ended up becoming real projects with cool names and business plans, such as the Low Orbit Ion Cannon.

At one point, a variation of this attack emerged.

Instead of sending huge numbers of requests to a server, specially-crafted requests would open connections, but never close them. Servers would exhaust all of their resources waiting for the next phase of the connection.

As per tradition, this attack was also given a memorable name, the Slow Loris attack.

And Here We Are ...

Flash forward to today, and many of these issues have been addressed. At least for large corporations. Tech solutions like CDNs, autoscaling, Kubernetes, load balancing, and other orchestration/infrastructure tools exist to help administrators withstand these sorts of attacks. Such tools are available to hobbyist developers like myself, but they are often outside of our budgets — or they may have questionable business practices and political aims. And the traditional solution, allowing servers to autoscale when under heavy load, is a recipe for emptying wallets.

Some hobby developers have been surprised by bills of thousands or tens of thousands of dollars on a project that was launched as a joke and shouldn't have cost more than $10.

Here in 2026, a new threat has emerged.

Vast networks of automated scraping tools are threshing the Internet for content to feed into large language model training data sets.

These have the same traffic patterns as distributed denial of service botnets, except instead of using infected machines, they use nodes that are owned by the organization gathering the data. Much like a search engine scraper - but with far fewer scruples.

The scraping tools routinely ignore the robots.txt file that was established to allow search engines to responsibly scan for data. They spread their requests out over hundreds of IP addresses in order to disguise what they are doing. Their lack of respect for the websites they are scanning is a direct threat to hobby developers like myself.

Paying for autoscaling, or for systems that can withstand the onslaught of a scraper net or a DDoS, adds up to a hefty bill. Large corporations can absorb this cost — though they probably aren't happy about it.

Open source projects are getting slammed from multiple ends. Not only are their servers overwhelmed, but often their source code repositories are also being peppered with useless "AI-powered" contribution attempts.

And for folks like me, running on as little as a single hobby VPS, the server often cannot cope.

In my opinion, this traffic is indistinguishable from an attack. I've even given it a name:

The Necro-Loris Attack

One of my hobby websites is a forum that has been online since 2008. For much of its existence, the robots.txt file has disallowed all search engine crawling and indexing. We don't need the traffic influx that search engines were offering as their for-the-greater-good bargain to justify their scraping.
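
For reference, a blanket opt-out like ours is the standard two-line disallow-everything form of robots.txt (paraphrased here, not a copy of our exact file):

User-agent: *
Disallow: /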

The content that we've posted to this forum, 6 million posts, is for us. No, it's not locked behind a password - it's visible to anybody who wants to go back and read it. But it is for us. It is not intended for random companies to harvest it and snarf it into their own tools.

And it is certainly unwelcome for such companies to scrape it so heavily that it denies access to our community members.

So you can imagine how we felt when some AI-model-scraping botnet glommed onto our forum. It caused the server to grind to a halt. Database records from nearly 20 years ago were being requested at a concerning rate by this malicious actor. But, crucially, not all from the same IP address. Each batch of requests was spread out over hundreds of IPs. This disguised the traffic in the logging tools, making it hard to spot - except for one quirk.

If someone were to reply to posts as old as the ones being requested, it would be considered "Necro posting" - dredging up old conversations for no good reason.

So from my perspective, harvesting all of these old posts is necro-scraping. And much like the Slow Loris attack, this scraping process exhausted my server resources and prevented my users from using the forum.

All that is why I call it the Necro-Loris Attack.

Maybe you've suffered from it as well?

If your projects don't have a clear signal that they are being drained in this way, you might not have noticed the initial steps of the attack. I didn't at first. The speed at which they were crawling, and the variety of IPs they were using, was within my server's resources to handle — until it wasn't.

If you're a large company that is capable of autoscaling its services, maybe you have noticed the issue - but opted to absorb the cost. Or maybe you've been unable to zero in on the culprit for a spike in costs.

A Temporary Fix Using Fail2ban

In a way, I was in a lucky position for fixing this issue. I had recently been considering taking the forum completely private and requiring passwords for everything. This attack gave me a fresh angle to consider things from.

Instead of locking down the forum as a whole, I locked down the old posts. Anything older than a specific time frame bounces to the password page. And, if you trigger that bounce too many times, your IP gets flagged for a ban.

Implementing this change was easy.

It cleared the malicious traffic off the board almost immediately.

And it has been working smoothly ever since — with only the occasional accidental (and temporary) ban of a legitimate user.

In the main controller for the site, I added PHP logic to reject the request and log the event:

// Treat anything posted more than two weeks ago as "old" content
$twoWeeksAgo = (new \DateTimeImmutable())->modify("-14 days");
$contentAge = $content->getAge();

if ($contentAge < $twoWeeksAgo) {
    // Redirect anonymous users to the login page and log the event.
    // The Fail2ban filter below matches on an "ip" field in the log line,
    // so if your logger doesn't already record the client IP (e.g. via
    // Monolog's WebProcessor), add it to the context explicitly
    // (this assumes the Request object is available as $request).
    if (!$loggedIn) {
        $logger->error('Redirecting anonymous user', [
            'content_id' => $content_id,
            'ip' => $request->getClientIp(),
        ]);

        return $this->redirect('https://127.1/login/');
    }
}
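
With Monolog's default line format, that log call produces something along these lines (the values here are illustrative; the exact shape depends on your logger configuration):

[2026-03-01T12:00:00.000000+00:00] app.ERROR: Redirecting anonymous user {"content_id":1234567,"ip":"203.0.113.7"} []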

Then, I configured Fail2ban with a regex to scan for that error, in a file called /etc/fail2ban/filter.d/symfony-anon-redir.conf:

[Definition]
failregex = ^.+ERROR: Redirecting anonymous user .*{.*"ip":"<HOST>"
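
Before trusting the pattern, it can be checked against the real log with the fail2ban-regex tool that ships with Fail2ban (using the same filter and log paths as above):

sudo fail2ban-regex /var/www//logs/forum.log /etc/fail2ban/filter.d/symfony-anon-redir.conf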

And then I updated the jail file:

[DEFAULT]
banaction = iptables-allports
ignoreip = 127.0.0.1/8 ::1

[symfony-anon-redir]
enabled=true
port=http,https
filter=symfony-anon-redir
maxretry=5
bantime=1d
logpath=/var/www//logs/forum.log
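
With the filter and jail in place, reloading Fail2ban picks up the new jail, and the status command confirms it's watching (and, later, lists the banned IPs):

sudo fail2ban-client reload
sudo fail2ban-client status symfony-anon-redir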

Apart from a few hiccups that I mentioned above, this has greatly helped to tamp down the Necro-Loris behaviour.

To unban those accidental bycatches:

sudo fail2ban-client unban <ip-address>
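
The same can be done per jail, if you'd rather not lift the ban globally:

sudo fail2ban-client set symfony-anon-redir unbanip <ip-address>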

There Will Be Blood

One of the biggest films of 2008, the year I launched my forum, was There Will Be Blood. It was about the oil industry in the southwest of America. A powerful film about greed and gluttony and power and selfishness. And alcoholism, I guess.

In the film, the main character is an oil baron. At one point, he uses a metaphor to describe a process that allows him to steal oil from a nearby oil claim so that he can build his own empire by exploiting the work and claims of others.

He would use a very long straw to drink their milkshake from across the room. He would drink their milkshake! Drink it up!

And I see that as a parallel to what is happening in the world today. Companies that train LLMs are engaging in wholesale theft of data (the milkshake, aka the oil) by using bots and vast numbers of IP addresses to disguise their greedy, disrespectful behaviour (the long straw).

In the attack on my forum, it appeared that over 50% of posts had been ingested by this corporate- or nation-backed botnet. Three million.

I am not happy about that. I'm insulted by it. Enraged, even. Especially because the robots.txt says for ALL robots to ignore ALL content on the site.

Where does that leave us?

Search Engines offered a trade for their scanning: harvesting data in exchange for exposure.

AI harvesting scraper networks serve only their own ends.

That's not the only difference, but it is a crucial one. These scraper bots have stolen our creative works in order to build their product offering, which they alone profit from.

Like Daniel Plainview siphoning from another oil claim: They drank our milkshakes.

They are freeloading parasites.

They want to position themselves as the landlords of knowledge, billing us for the very data they stole from us.

That is why these companies do not deserve the same grace that was granted to search engines.

We should do everything in our power to stop these companies, and to shame them.

They can't be fixed.


Published: March 23, 2026

Categories: opinion

Tags: opinion, machine-learning, computer-history




Thank you for reading!

If you enjoyed this post, please consider supporting me on Ko-Fi. Monthly Supporters get early access to my writing and updates on my hobby projects.

To be notified of new posts on Whateverthing, please follow the RSS feed:


subscribe to the rss feed
