- cross-posted to:
- wolnyinternet@szmer.info
- technology@lemmy.zip
robots.txt isn’t a basic social contract, it’s a file intended to save web crawlers precious resources.
Yup. The robots.txt file isn't only meant to block robots from accessing a site outright; it's also meant to steer bots away from resources that aren't interesting to human readers, even indirectly.
For example, MediaWiki installations are pretty clever in that, by default, `/w/` is blocked and `/wiki/` is encouraged, because nobody wants technical pages and wiki histories in search results; they only want the current versions of the pages.

Fun tidbit: in the late 1990s, there was a real epidemic of spammers scraping web pages for email addresses. Some people developed `wpoison.cgi`, a script whose sole purpose was to generate garbage web pages full of bogus email addresses. Real search engines ignored these, thanks to robots.txt. Guess what the spam bots did?

Do the AI bros really want to go there? Are they asking for model collapse?
Of course they want the model collapse. Literally no American tech company has been about reliably, sustainably supplying a good or service or stewarding some public good.
They’re doing the VC -> juice stock -> gut resources cycle. Nobody cares about the model.
Considering Reddit has decided to start selling user content for training, yeah, I guess they want their models to collapse. There’s so much bot-generated content nowadays.
The basic social contract of the web was to keep things accessible, including to bots. No one has the storage capacity to rip the entire web like all these jokers are pretending - even Google merely indexes it, except for the most popular pages.
The thing ruining the social contract of the web is the profit motive of all these companies trying to convince people that they should be able to sell data that is otherwise publicly accessible, for the purposes of allowing bots to look at it - they can’t memorize it.
Of course ChatGPT and the other AI companies ARE partially to blame: it seems they’ve poisoned the pot by giving their AIs continual access to the training sets and/or even the broader internet on the back-end without making this clear to users, allowing journalists to claim that these AIs have somehow memorized petabytes of data into a few gigabytes. That is an ABSURD, basically impossible, compression ratio for anyone with even the slightest comprehension of the topic.
No, the random article you tricked ChatGPT into spitting out is not worth memorizing, not even to the lie- and hallucination-prone AI chatbots we have available to prod, for free or otherwise. Oh, you paid for it, and your complaint is that it’s spitting out accurate information? YOU’RE PAYING FOR THEM TO HOST THE CHATBOT FOR YOU AND PROVIDE IT ACCESS TO INFORMATION IT WOULD OTHERWISE NOT HAVE ACCESS TO ON THE BACK-END.
By all means, sue the companies into paying for their data, and force them to divulge the data sets they keep on hand so that they can be charged for the information in them, but stop pretending the AIs themselves contain copies of it, or that it’s impossible to make them pay ex post facto (as opposed to the ENTIRETY of the rest of our legal system and enforcement) …
AND PEOPLE, stop letting all these companies trick you into thinking that this is a valid excuse to further lock down the web, or that you must poison your fanart with methods that WILL be bypassed. It’s just another potential expense and technical burden these companies want you to believe you must bear, rather than sticking to the things you enjoy and/or that put food on your table.
The thing ruining the social contract of the web is the profit motive
Dingdingdingdingding!
As I always write, trying to restrict AI training on the grounds of copyright will only backfire. The sad truth is that malicious parties (dictatorships) will get more training material because they won’t abide by the rules. The end result is that dictators would outperform democracies in future-generation AIs, if we treat AI training like human reading.
You know what?
I’m fine with that hypothetical risk.
“The bad guys will do it anyway so we need to do it, too” is the worst kind of fatalism. That kind of logic can be used to justify any number of heinous acts, and I refuse to live in a world where the worst of us are allowed to drag down the rest of us.
Yeah, I mean bad guys are going to commit murder too, doesn’t mean it shouldn’t be illegal.
The consequence of falling behind is gravely different from most heinous acts. It can impact the military, elections, espionage, or whatever.
Really? I’m supposed to believe AI is somehow more existentially risky than, say, chemical or biological weapons, or human cloning and genetic engineering (all of which are banned or heavily regulated in developed nations)? Please.
I understand the AI hype artists have done a masterful job convincing everyone that their tech is so insanely powerful (and thus incredibly valuable to prospective investors) that it’ll wipe out humanity, but let’s try to be realistic.
But you know, let’s take your premise as a given. Even despite that risk, I refuse to let an unknowable hypothetical be used to hold our better natures hostage. The examples are countless of governments and corporations using vague threats as a way to get us to accept bad deals at the barrel of a virtual gun. Sorry, I will not play along.
If you don’t see how even the most basic of AI images, videos, deepfakes, etc. can manipulate the public, the electorate, popular opinion, or even sow just enough doubt to cause a problem, then I don’t know what to tell you.
People are already dying because of deepfakes and fake AI porn. We know that most people who see some headline on Facebook will never click further to read it, and will just accept the headline and/or the synopsis as fact. They will accept something a 1000x re-shared image says, without sources or verification. The fact that a picture or vid might have a person with 8 fingers on one hand in the background isn’t going to prevent them from taking in the message. And we’ve all literally seen people around the web say, explicitly, something to the effect of “I don’t care if the story is true or not, it’s a real issue we need to consider” when we know for a fact that it is not.
Yes, mis- and dis-information are far more of an existential threat than chem or bio weapons, and we know this because we are already seeing the consequences of it. If you refuse to see that, then you are lost.
You don’t need AI for any of that. Determined state actors have been fabricating information and propagandizing the public, Mechanical Turk-style, for a long, long time now. When you can recruit thousands of people as cheap labour to make shit up online, you don’t need an LLM.
So no, I don’t believe AI represents a new or unique risk at the hands of state actors, and therefore no, I’m not so worried about these technologies landing in the hands of adversaries that I think we should abandon our values or beliefs Just In Case. We’ve had enough of that already, thank you very much.
And that’s ignoring the fact that an adversarial state actor having access to advanced LLMs isn’t somehow negated or offset by us having them, too. There’s no MAD for generative AI.
I’m not so worried about these technologies landing in the hands of adversaries that I think we should abandon our values or beliefs Just In Case
What beliefs and values would we be abandoning by fighting back against tech that is literally costing people their literal lives?
Hah I… think we’re on the same side?
The original comment was justifying unregulated and unmitigated research into AI on the premise that it’s so dangerous that we can’t allow adversaries to have the tech unless we have it too.
My claim is AI is not so existentially risky that holding back its development in our part of the world will somehow put us at risk if an adversarial nation charges ahead.
So no, it’s not harmless, but it’s also not “shit this is basically like nukes” harmful either. It’s just the usual, shitty SV kind of harmful: it will eliminate jobs, increase wealth inequality, destroy the livelihoods of artists, and make the internet a generally worse place to be. And it’s more important for us to mitigate those harms, now, than to worry about some future nation state threat that I don’t believe actually exists.
(It’ll also have lots of positive impact as well, but that’s not what we’re talking about here)
deleted by creator
But, if we make training AI on copyrighted material illegal, it will hamper open-source models while not affecting closed-source ones, because they can just buy it off of big social media conglomerates.
Alrighty then. If corps want to train their AI on all the content they can scrape without worrying about copyright, then they can’t complain when I torrent their shit without worrying about copyright too! Deal? Somehow I don’t see them taking that deal.
Training new models is already the domain of large actors only, simply due to the GPU requirements, which serve as a massive moat. That ship has sailed. There isn’t a single open-source model, today, that wasn’t trained by a corporate entity first, and then only fine-tuned by the community later.
“Bad guys are going to do bad things, so we shouldn’t even bother trying to do anything to make things better, and just let the dystopia happen” is not the answer
🤖 I’m a bot that provides automatic summaries for articles:
If you hosted your website on your computer, as many people did, or on hastily constructed server software run through your home internet connection, all it took was a few robots overzealously downloading your pages for things to break and the phone bill to spike.
AI companies like OpenAI are crawling the web in order to train large language models that could once again fundamentally change the way we access and share information.
In the last year or so, the rise of AI products like ChatGPT, and the large language models underlying them, have made high-quality training data one of the internet’s most valuable commodities.
You might build a totally innocent one to crawl around and make sure all your on-page links still lead to other live pages; you might send a much sketchier one around the web harvesting every email address or phone number you can find.
The New York Times blocked GPTBot as well, months before launching a suit against OpenAI alleging that OpenAI’s models “were built by copying and using millions of The Times’s copyrighted news articles, in-depth investigations, opinion pieces, reviews, how-to guides, and more.” A study by Ben Welsh, the news applications editor at Reuters, found that 606 of 1,156 surveyed publishers had blocked GPTBot in their robots.txt file.
“We recognize that existing web publisher controls were developed before new AI and research use cases,” Google’s VP of trust Danielle Romain wrote last year.
Saved 92% of original text.
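For reference, the publisher blocks the summary describes amount to a two-line robots.txt stanza; `GPTBot` is the user-agent token OpenAI documents for its training crawler (the rule below is the generic opt-out pattern, not any specific publisher's file):

```
User-agent: GPTBot
Disallow: /
```

Like everything in robots.txt, this is a request rather than an enforcement mechanism: it only works against crawlers that choose to honor it.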