Every single person who has written a book is happy when others read it. They might be less enthusiastic about printing a million copies and shipping them to random people at their own expense.
- developers upset with the threat of losing their jobs or making their jobs more dreadful
- craftspeople upset with the rise in slop
- teachers upset with the consequences of students using LLMs irresponsibly
- and most recently, webmasters upset that LLM services are crawling their servers irresponsibly
Maybe the LLMs don't seem so hostile if you don't fall into those categories? I understand some pro-LLM sentiments, like content creators trying to gain visibility, or developers finding productivity gains. But I think that for many people here, the cons outweigh the pros, and little acts of resistance like this "poisoning the well" resonate with them. https://chronicles.mad-scientist.club/cgi-bin/guestbook.pl is another example.
Yes, it matters a lot.
You know of authors by name because you read their works under their name. This has allowed them to profit (not necessarily in direct monetary value) and publish more works. Chucking everything into an LLM takes the profit from individual authors and puts it into the pockets of gigacorporations.
Not to mention the fact that the current generation of LLMs will straight-up hallucinate things, sometimes turning the message you're trying to send on its head.
Then there's the question of copyright. I can't pirate a movie, but Facebook can pirate whole libraries, build an LLM, and sell it, and that's OK? I'd have much less of an issue if this were done ethically.
In simpler terms, it comes down to the "you made this? ... I made this" meme.
Now, if your "content" is garbage that takes longer to publish than to write, I can see your point of view.
But for the authors who write articles that people actually want to read, because they're interesting and well written, it's like robbery.
Unlike humans, you can't say that LLMs create new things from what they read. LLMs just summarize and repeat, using algorithms to evaluate which word should come next.
Meanwhile humans… Oscar Wilde — 'I have spent most of the day putting in a comma and the rest of the day taking it out.'
I prefer this approach because it specifically targets problematic behavior without impacting clients who don't engage in it.
People, yes. Well-behaved crawlers that follow established methods to prevent overload and obey established restrictions like robots.txt, yes. Bots that ignore all that and hammer my site dozens of times a second, no.
I don't see the odds of someone finding my site through an LLM being high enough to put up with all the bad behavior. In my own use of LLMs, they only occasionally include links, and even more rarely do I click on one. The chance that an LLM is going to use my content and include a link to it, and that the answer will leave something out so the person needs to click the link for more, seems very remote.
The first isn't worth arguing against: it's the idea that LLM vendors ignore your robots.txt file even when they clearly state that they'll obey it: https://platform.openai.com/docs/bots
Since LLM skeptics frequently characterize all LLM vendors as dishonest mustache-twirling cartoon villains, there's little point in trying to convince them that companies sometimes actually do what they say they're doing.
The bigger misconception though is the idea that LLM training involves indiscriminately hoovering up every inch of text that the lab can get hold of, quality be damned. As far as I can tell that hasn't been true since the GPT-3 era.
Building a great LLM is entirely about building a high quality training set. That's the whole game! Filtering out garbage articles full of spelling mistakes is one of many steps a vendor will take in curating that training data.
But it's certainly also true that anyone feeding the scrapings to an LLM will filter them first. It's very naive of this author to think that his ad-lib-spun prose won't get detected and filtered out long before it's used for training. Even the pre-LLM internet had endless pages of this sort of thing from aspiring SEO spammers. Yes, you're wasting a bit of the scraper's resources, but you can bet they're already accounting for that waste.
Of course. But those aren't the ones that explicitly say "here is how to block us in robots.txt"
The exact quote from the article that I'm pushing back on here is:
"If you think these crawlers respect robots.txt then you are several assumptions of good faith removed from reality."
Which appears directly below this:
User-agent: GPTBot
Disallow: /
Facebook attempted to legally acquire massive amounts of textual training data for their LLM development project. They discovered that acquiring this data in an aboveboard manner would be in part too expensive [0], and in part simply not possible [1]. Rather than either doing without this training data or generating new training data, Facebook decided to just pirate it.
Regardless of whether you agree with my expectations, I hope you'll understand why I expect many-to-most companies in this section of the industry to publicly assert that they're behaving ethically, but do all sorts of shady shit behind the scenes. There's so much money sloshing around, and the penalties for doing intensely anti-social things in pursuit of that money are effectively nonexistent.
[0] because of the expected total cost of licensing fees
[1] in part because some copyright owners refused to permit the use, and in part because some copyright owners were impossible to contact for a variety of reasons
That's why I care so much about differentiating between the shady stuff that they DO and the stuff that they don't. Saying "we will obey your robots.txt file" and lying about it is a different category of shady. I care about that difference.
(In theory the former is supposed to be a capital-C criminal offence -- felony copyright infringement.)
That's testable, and you can find content "protected" by robots.txt regurgitated by LLMs. In practice it doesn't matter whether that happens through companies lying or through some third party scraping your content and then getting scraped.
> Since a user requested the fetch, this fetcher generally ignores robots.txt rules.
And it clearly undermines simonw's stance:
> it's the idea that LLM vendors ignore your robots.txt file even when they clearly state that they'll obey it
They explicitly document that they do not obey robots.txt for that form of crawling (user-triggered, data is not gathered for training.)
Their documentation is very clear: https://docs.perplexity.ai/guides/bots
There's plenty to criticize AI companies for. I think it's better to stick to things that are true.
Their documentation could say Sam Altman is the queen of England. It wouldn’t make it true. OpenAI has been repeatedly caught lying about respecting robots.txt.
https://www.businessinsider.com/openai-anthropic-ai-ignore-r...
https://web.archive.org/web/20250802052421/https://mailman.n...
https://www.reddit.com/r/AskProgramming/comments/1i15gxq/ope...
The Business Insider one is a paywalled rehash of this Reuters story https://www.reuters.com/technology/artificial-intelligence/m... - which was itself a report based on some data-driven PR by a startup, TollBit, who sell anti-scraping technology. Here's that report: https://tollbit.com/bots/24q4/
I downloaded a copy and found it actually says "OpenAI respects the signals provided by content owners via robots.txt allowing them to disallow any or all of its crawlers". I don't know where the idea that TollBit say OpenAI don't obey robots.txt comes from.
The second one is someone saying that their site which didn't use robots.txt was aggressively crawled.
The third one claims to prove OpenAI are ignoring robots.txt but shows request logs for user-agent ChatGPT-User which is NOT the same thing as GPTBot, as documented on https://platform.openai.com/docs/bots
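For reference, OpenAI documents these as separate user agents, so a site that wants to opt out of both training crawls and user-triggered fetches has to name each one explicitly. A sketch of such a robots.txt, based on the bot names in the linked OpenAI docs:

```
User-agent: GPTBot
Disallow: /

User-agent: ChatGPT-User
Disallow: /
```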
For example, who's the "user" that "ask[s] Perplexity a question" here? Putting on my software engineer hat, with its urge for automation: it could very well be that Perplexity maintains a list of all the sites that block the PerplexityBot user agent via robots.txt rules. Such a list would help with crawl optimization, but it could also be used later: have an employee ask Perplexity the right question, and the site gets re-crawled with the Perplexity‑User user agent anyway (the one that ignores robots.txt rules). Call it the work of the QA department.
Unless we worked at such a company in a senior position, we'd never really know. And the existing violations of trust, in regard to copyrighted works alone(!), are reason enough to keep a certain default mistrust of young companies that are already valued in the billions, and of how they handle their most basic resources.
If I am writing for entertainment value, I see no problem with blocking all AI agents - the goal of text is to be read by humans after all.
For technical texts, one might want to block AI agents as well - they often omit critical parts and hallucinate. If you want your "DON'T DO THIS" sections to be read, better block them.
I think that distinction is lost on a lot of people, which is understandable.
Even if the large LLM vendors respect it, there's enough venture capital going around that plenty of smaller vendors are attempting to train their own LLMs and they'll take every edge they can get, robots.txt be damned.
So, uh... where's all the extra traffic coming from?
"OpenAI uses the following robots.txt tags to enable webmasters to manage how their sites and content work with AI."
Then for GPT it says:
"GPTBot is used to make our generative AI foundation models more useful and safe. It is used to crawl content that may be used in training our generative AI foundation models. Disallowing GPTBot indicates a site’s content should not be used in training generative AI foundation models."
What are you seeing here that I'm missing?
I hadn't backlinked the site anywhere and was just testing, so I hadn't thought to put up a robots.txt. They must have found me through my cert registration.
After I put up my robots.txt (with explicit UA blocks instead of wildcards; I heard some crawlers ignore the wildcard), I found that after a day or so the scraping stopped completely. The only hits I get now are vulnerability scanners, or random spiders taking just the homepage.
I know my site is of no consequence, but for those claiming OpenAI et al ignore robots.txt I would really like to see some evidence. They are evil and disrespectful and I'm gutted they stole my code for profit, but I'm still sceptical of these claims.
Cloudflare have done lots of work here and have never mentioned crawlers ignoring robots.txt:
https://blog.cloudflare.com/control-content-use-for-ai-train...
I don't think the content I produce is worth that much, and I'm glad if it can serve anyone, but I find the idea of poisoning the well amusing.
What some bots do is they first scrape the whole site, then look at which parts are covered by robots.txt, and then store that portion of the website under an “ignored” flag.
This way, if your robots.txt changes later, they don’t have to scrape the whole site again, they can just turn off the ignored flag.
You're also under the blatantly wrong misconception that people are worried about their data, when in fact they're worried about the load from a poorly configured crawler.
The crawler will scrape the whole website on a regular interval anyway, so what is the point of this "optimization" that optimizes for highly infrequent events?
Wat. Blocklisting IPs is not very technical (for someone running a website who knows and cares about crawling) and is definitely not intensive. Fetch the IP list, add it to a blocklist, repeat daily with a cronjob.
Would take an LLM (heh) 10 seconds to write you the necessary script.
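A rough sketch of what that daily job could look like, turning a list of crawler CIDRs into nftables `add element` commands. The CIDRs here are placeholder sample data, and `crawler_block` is an assumed set name; a real job would curl a published blocklist instead of writing one inline:

```shell
# Sketch: turn a list of crawler CIDRs into nftables "add element"
# commands. The CIDRs are placeholder sample data; a real cron job
# would fetch a published blocklist here instead.
cat > crawler-ips.txt <<'EOF'
203.0.113.0/24
# comments and blank lines are skipped

198.51.100.0/24
EOF

: > blocklist.nft
while IFS= read -r cidr; do
  # skip blanks and comment lines
  case "$cidr" in ''|\#*) continue ;; esac
  printf 'add element inet filter crawler_block { %s }\n' "$cidr" >> blocklist.nft
done < crawler-ips.txt

# A daily cron entry would then load the result, e.g.:
#   nft -f blocklist.nft
```

The heavy lifting (maintaining the list of bad netblocks) is someone else's problem; the local side really is a dozen lines of shell.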
A more tongue-in-cheek point: all scripts take an LLM ~10 seconds to write; that doesn't mean they're right, though.
Probably more effective at poisoning the dataset if one has the resources to run it.
One of the many pressing issues is that people believe that ownership of content should be absolute, that hammer makers should be able to dictate what is made with hammers they sell. This is absolutely poison as a concept.
Content belongs to everyone. Creators of content have a limited term, limited right to exploit that content. They should be protected from perfect reconstruction and sale of that content, and nothing else. Every IP law counter to that is toxic to culture and society.
My thoughts will have more room for nuance when they stop abusing the hell out of my resources they’re “borrowing”.
I wrote a little script where I throw in an IP and it generates a Caddy IP-matcher block with an “abort” rule for every netblock in that IP’s ASN. I’m sure there are more elegant ways to share my work with the world while blocking the scoundrels, but this is kind of satisfying for the moment.
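For the curious, the generated Caddy config presumably looks something like a named `remote_ip` matcher plus an `abort` directive; the netblocks below are placeholders, not any real ASN's ranges:

```
@blocked_asn {
	remote_ip 203.0.113.0/24 198.51.100.0/22
}
abort @blocked_asn
```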
Various LLM-training scrapers were absolutely crippling my tiny (~125 weekly unique users) browser game until I put its Wiki behind a login wall. There is no possible way they could see any meaningful return from doing so.
The last one I blocked was hitting my site 24 times/second, and a lot of them were the same CSS file over and over.
I did say this
>They should be protected from perfect reconstruction and sale
But I don't even really believe in that much, so go nuts.
> dont even really believe in that much
If I write a book, let’s say it’s a really good book, and self-publish it, you’re saying you think it’s totally kosher for Amazon to take that book, make a copy, and then make it a best seller (because they have vastly better marketing and sales tools), while putting their own name in as author?
That seems, to you, like a totally fine and desirable thing? That literally all content should only ever be monetized by the biggest corporations who can throw their weight around and shut everyone else out?
Or is this maybe a completely half-baked load of nonsense that sounded better around the metaphorical bong circle?
Come on, now.
As soon as any work became popular, anyone could undercut Amazon. If you really think that Amazon is in a position where they can charge significant money for something others can provide for much less, then you are talking about an anticompetitive monopoly.
If that's the case the problem is not with copyright, it's lack of competition. The situation we have now is just one where copyright means they can't publish just anything, but Amazon can always acquire the rights to something and apply those same resources to make it a best seller. They don't care if the book is great or not. They just want to be able to sell it. Being able to be the only producer of the thing incentives making the thing that they own popular, not the thing that is good. Having the option to pick what succeeds puts them in a dominant negotiating position so they can acquire rights cheaply.
I guess if that were the case, though, it would be easy to spot things that were popular even though they seemingly lack merit or any real reason for success other than a strong marketing department. It would really suck in that world. Not only would there be talented people making good works and earning little money, but most people would not even get to see what they had created. For many creatives, that would be the worst part of it.
For printed books, economies of scale work in their favor as well - if it costs them $1.20 to manufacture/store/ship a paperback, and me $1.50, how am I supposed to undercut them?
The problem is when you steal my content, repackage it, and resell it. At that point, my content doesn't belong to everyone, or even to me, but to you.
* I'd have no problem with OpenAI, the non-profit developing open source AI models and governance models, scraping everyone's web pages and using it for the public good.
* I have every problem with OpenAI, the sketchy for-profit, stealing content from my web page so their LLMs can regenerate my content for proprietary products, cutting me out-of-the-loop.
You’re conflating and confusing two different concepts. “Content” is not a tool. Content is like a meal, it’s a finished product meant to be consumed; a tool, like a hammer, is used to create something else, the content which will then be consumed. You’re comparing a JPEG to Photoshop.
You can remix content, but to do that you use a tool and the result is related but different content.
> Content belongs to everyone.
Even if we conceded that point, that still wouldn’t excuse the way in which these companies are going about getting the content, hammering every page of every website with badly-behaved scrapers. They are bringing websites down and increasing costs for their maintainers, meaning other people have limited or no access to it. If “content belongs to everyone”, then they don’t have the right to prevent everyone else from accessing it.
I agree current copyright law is toxic and harmful to culture and society, but that doesn’t make what these companies are doing acceptable. The way to counter a bad system is not to shit on it from a different angle.
I was looking at another thread about how Wikipedia was the best thing on the internet. But they only got their head start by taking a copy of Encyclopaedia Britannica, and everything else is a
And now that the corpus is collected, what difference does a blog post make? Does it nudge the dial of comprehension 0.001% in a better direction? How many blog posts over how many weeks would make a difference?
"Starting in 2006, much of the still-useful text in the 1911 Encyclopaedia was adapted and absorbed into Wikipedia. Special focus was given to topics that had no equivalent in Wikipedia at the time. The process we used is outlined at the end of this page."
Wikipedia started in 2001. Looks like they absorbed a bunch of out-of-copyright Britannica 1911 content five years later.
There are still 13,000 pages on Wikipedia today that are tagged as deriving from that project: https://en.m.wikipedia.org/wiki/Template:EB1911
If guys like this have their way, AI will remain stupid and limited and we will all be worse off for it.
Or, shorter: I hold you in as much contempt as you hold me.
> If guys like this have their way, AI will remain stupid and limited
AI doesn’t have a right to my stuff just because it’ll suck without it.
Anyway, a big problem people have isn't "AI bad", it's "AI crawlers bad", because they eat up a huge chunk of your bandwidth by behaving badly for content you intend to serve to humans.
Like why? Don't you want people to read your content? Does it really matter that meat bags find out about your message to the world through your own website or through an LLM?
Meanwhile, the rest of the world is trying to figure out how to deliberately get their stuff INTO as many LLMs as fast as possible.