dammIT

A rantbox by Michiel Scholten

Afraid to Git


Trees with flowering branches

I have been running a Git server for over a decade now (before that it was Mercurial, and before that I rocked SVN, dating back at least twenty years). It has hosted my private repositories and served as a mirror for my GitHub projects.

More recently though, I have been shifting away from GitHub, and I have started two open source projects that do not have a repository on GitHub at all, just on my Gitea instance.

Notice how I did not link to any repositories or servers in the above paragraph? There are a few reasons.

Xe Iaso put it best when they introduced their Anubis tool and explained its workings:

A majority of the AI scrapers are not well-behaved, and they will ignore your robots.txt, ignore your User-Agent blocks, and ignore your X-Robots-Tag headers. They will scrape your site until it falls over, and then they will scrape it some more. They will click every link on every link on every link viewing the same pages over and over and over and over. Some of them will even click on the same link multiple times in the same second. It's madness and unsustainable.

Xe has been mentioned on Ars Technica and TechCrunch in light of AI crawlers harassing websites and infrastructure, and of the tool they wrote in response, which effectively and intentionally blocks a lot of those crawlers.

Part of the README of Anubis:

Installing and using this will likely result in your website not being indexed by some search engines. This is considered a feature of Anubis, not a bug.

This is a bit of a nuclear response, but AI scraper bots scraping so aggressively have forced my hand. I hate that I have to do this, but this is what we get for the modern Internet because bots don't conform to standards like robots.txt, even when they claim to.

These web crawlers are dumb and malicious: they ignore the aforementioned robots.txt and, for example, hammer every git blame, history and diff link they can find, bringing down Git services and running up network traffic bills of thousands of dollars.
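
Asking nicely is still the formal first step, for whatever it is worth. A robots.txt along the lines of the sketch below is what a standard-conforming crawler would respect; the bot names and the Gitea-style blame/commits/compare paths are assumptions you would adjust for your own forge, and the scrapers described above simply ignore the whole file, which is exactly the problem.

```
# Hypothetical robots.txt for a self-hosted Gitea instance.
# Well-behaved crawlers honour this; the AI scrapers described above do not.

# Keep everyone out of the expensive, endlessly deep views.
User-agent: *
Disallow: /*/*/blame/
Disallow: /*/*/commits/
Disallow: /*/*/compare/

# Ask the known AI crawlers to stay away entirely.
User-agent: GPTBot
User-agent: ClaudeBot
User-agent: CCBot
User-agent: Google-Extended
User-agent: Bytespider
Disallow: /
```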

I mean, to quote a small part of the Ars Technica article linked above:

GNOME sysadmin Bart Piotrowski shared on Mastodon that only about 3.2 percent of requests (2,690 out of 84,056) passed their challenge system, suggesting the vast majority of traffic was automated. KDE's GitLab infrastructure was temporarily knocked offline by crawler traffic originating from Alibaba IP ranges, according to LibreNews, citing a KDE Development chat.

And as recently as last Thursday, the Linux kernel's kernel.org was put behind Anubis too, for the exact same reasons.

However, this leads to a little conundrum I am currently wrestling with in private: I would really like to move away from GitHub, as it is owned by Microsoft and is extensively scraped by the OpenAI bots and most likely every other 'AI' bot in existence, and I like hosting my own work in my own little garden, out of reach and ownership of certain states and/or big tech.

That is part of why I am putting my own Git server to more use, but as sketched out above, this puts me at risk of effectively being Slashdotted by runaway, misbehaving bots. Of course, I can just #YOLO it and find out if it is really going to be a problem, but when that (inevitably?) happens, how am I going to save my little garden from being trampled into oblivion, other than also deploying an instance of Anubis or similar countermeasures? Why do I and other tech enthusiasts even need to consider this?
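
If it does come to that, Anubis is not the only lever either; even something as blunt as per-IP rate limiting in front of Gitea buys some breathing room. A minimal nginx sketch, assuming Gitea listens on localhost port 3000 (the zone name, the rates and the hostname are made up for illustration):

```
# Hypothetical nginx snippet: throttle each client IP before requests reach Gitea.
limit_req_zone $binary_remote_addr zone=gitea_limit:10m rate=5r/s;

server {
    listen 80;                      # TLS termination omitted for brevity
    server_name git.example.org;    # placeholder hostname

    location / {
        # Allow short bursts, answer the rest with 429 instead of melting the backend.
        limit_req zone=gitea_limit burst=20 nodelay;
        limit_req_status 429;

        proxy_pass http://127.0.0.1:3000;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
    }
}
```

Unlike Anubis, this does not tell humans and bots apart, and crawlers spread across entire IP ranges will still slip through, but it at least caps what any single address can do to the box.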

Friends of mine are considering going entirely offline with their version control hosting, their weblogs and their other websites, just to escape this hell of misbehaving scrapers and of LLMs mindlessly absorbing everyone's work and regurgitating it as their own. They would then resort to Tailscaling it all together over private connections and sharing their own articles and interesting links only in some private chatrooms, which would effectively mean going dark with all kinds of interesting write-ups.

It makes me reconsider even writing about the nifty little tool I wrote recently to fix a problem I had, one which could help others as well. But do I want to link to my Git server with the code and examples, putting it at risk of being reduced to smoke?

It is amazing what you can create with those LLM tools, be it the text-based ones or the ones generating imagery. However, as things currently stand, they are also destroying the internet down to its very soul.
