After a machine learning librarian released and then deleted a dataset of one million Bluesky posts, several other, even bigger datasets have appeared in its place—including one of almost 300 million non-anonymized posts.
Your Mastodon and Lemmy (and all other ActivityPub-talkin’ platforms) posts certainly are. I’m not sure it’s even technically possible to have federation without being open to AI ETLs. A centralized platform, maybe, but I expect this is the price we pay for decentralization.
Yeah. A public internet means a public internet, for good and for ill. People have been trained to see the internet as private, and we’re now reaping what was sown, and people really hate the harvest.
I mean, personally I did block all the AI scrapers I could find on my instance, about a month ago. There were a lot, mostly unscrupulous, some big names included. I probably should look at the logs to see what’s new.
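For what it’s worth, the blocking itself is easy to do at the web server. Here’s a minimal nginx sketch — the `map`/`return` directives are standard nginx, but the User-Agent patterns are just illustrative examples, not a complete or current list; check your own access logs for what’s actually hitting you:

```nginx
# Mark requests whose User-Agent matches known AI crawler strings.
# (map blocks go in the http context; the list below is illustrative.)
map $http_user_agent $ai_scraper {
    default        0;
    ~*GPTBot       1;
    ~*CCBot        1;
    ~*ClaudeBot    1;
    ~*Bytespider   1;
}

server {
    # ... your existing instance config ...
    if ($ai_scraper) {
        return 403;  # refuse flagged crawlers outright
    }
}
```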
The amount of traffic was quite significant too. My theory is that they expect legislation soon, so instead of playing nice and slow like traditional crawlers do, they’re vacuuming up everything as fast as they can.
But you’re right. Everyone would need to do it, to make a difference.
How does that help? My personal instance currently has a database of several million posts thanks to the various Mastodon relays. I don’t need to scrape your instance to sell your posts. I don’t, of course, but it’d be easy for some company to create friendlycutekittens.social and just start collecting posts. Do you really have time to audit every instance you federate with?
But they aren’t. They’re not after ActivityPub specifically; they’re scraping the whole internet, most of them using clear bot User-Agents. So I routinely block their bots, because the AI ones are usually hitting you multiple times a second, non-stop. If they started making fake ActivityPub nodes they wouldn’t be scraping as a bot, and they’d want specifically fediverse data. Important to note here, though: an ActivityPub node doesn’t “collect” data. It subscribes (to Mastodon users/hashtags, or communities) and then gets new data delivered to it. So they wouldn’t get the old stuff.
Having said that, I’ve seen some obvious bots using genuine browser user agents on IP addresses from certain very large Chinese companies. For those I just blocked their whole AS number.
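Spotting those “browser” bots mostly comes down to request rate. A toy sketch of the idea — the entry format and threshold are made up, so adapt it to however your server writes its access logs:

```python
# Sketch: flag IPs that hit the server more than `threshold` times in any
# one second, even when they send a genuine-looking browser User-Agent.
from collections import Counter

def find_hammering_ips(entries, threshold=3):
    """entries: iterable of (ip, unix_second) pairs parsed from an access log.
    Returns the set of IPs exceeding `threshold` requests in a single second."""
    per_second = Counter()
    for ip, second in entries:
        per_second[(ip, second)] += 1
    return {ip for (ip, _), n in per_second.items() if n > threshold}

# Example: one IP making 5 requests in the same second stands out.
entries = [("203.0.113.7", 100)] * 5 + [("198.51.100.2", 100), ("198.51.100.2", 101)]
print(find_hammering_ips(entries))  # → {'203.0.113.7'}
```

From there you can look the offenders up and, as above, block the whole AS if they cluster in one network.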
So most modern ActivityPub servers backfill threads and profiles. My single-user instance processes 30,000 notes a day. If I was actually trying, I’m sure it’d be easy to grab much more while appearing well behaved.
That’s not how ActivityPub (at least on Lemmy/*bin servers) works. So far as I’ve ever seen, there is no API within ActivityPub that allows for this. (Specific to the Lemmy/*bin implementations, there’s the API the browser/apps use, which must provide this, but that’s not ActivityPub.) It actually looks to be cleverly designed to prevent it. It might look like backfilling is happening, because old stuff appears, but there are reasons for that.
Here’s how it works, from my experience (I did some work on federation in kbin a year or so ago):
Instance A subscribes to community B hosted on Instance C.
Instance C notes this and does nothing. No previous content is sent, only future activities will be.
User on Instance D already subscribed to community B upvotes a comment on a post in community B.
Instance D sends the activity to Instance C.
Instance C sends the activity to Instance A.
Instance A gets the notice of the upvote, but realises it has no context for the upvote. But luckily the upvote has the comment ID of the comment that it was related to. So, now Instance A makes a request for the comment from Instance C.
Instance A receives the response from Instance C. But it turns out that comment was in reply to another comment. But the comment contains the ID of the parent comment. So Instance A requests that comment (and any parent comments until it gets the parent post).
By now, Instance A has the information about the like, plus all comments from the liked comment up to the post. These are saved to the database and will appear on the local system.
For each of the likes, comments, and posts, if the author isn’t known locally, their profile will also be fetched from their instance and stored locally.
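The walk above can be sketched in a few lines. Here `remote` stands in for the HTTP fetches from Instance C, and the object shapes are heavily simplified — this illustrates the chain of `inReplyTo` lookups, not real ActivityPub objects:

```python
# Sketch of the backfill walk: a received Like only carries the id of the
# liked object, so the instance fetches it, then follows inReplyTo links
# upward until it reaches the root post (or something it already knows).
def backfill(like, remote, local_store):
    """Fetch the liked object and every ancestor not yet known locally."""
    object_id = like["object"]
    while object_id is not None and object_id not in local_store:
        obj = remote[object_id]           # in reality: a signed HTTP GET
        local_store[object_id] = obj
        object_id = obj.get("inReplyTo")  # None once we reach the root post
    return local_store

# A toy remote instance: a comment replying to a comment on a post.
remote = {
    "post/1":    {"content": "root post", "inReplyTo": None},
    "comment/1": {"content": "a reply", "inReplyTo": "post/1"},
    "comment/2": {"content": "reply to the reply", "inReplyTo": "comment/1"},
}
store = backfill({"type": "Like", "object": "comment/2"}, remote, {})
print(sorted(store))  # → ['comment/1', 'comment/2', 'post/1']
```

Note the direction: you can always walk up to a parent, but nothing here lets you enumerate children.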
And so old posts and comments will begin to appear as activities linked to them happen. But there isn’t a method to ask for “all the posts in community X” using ActivityPub. I remember because I was specifically looking for this a year or so ago. It lets you see the parent object but not any children.
Maybe Mastodon etc. does it differently? No idea.
And all of this is moot, because if I block a User Agent, or an AS number/IP block, they’re not getting anything either by ActivityPub or by scraping, unless they change User Agent, AS number, or both.
Not necessarily expecting any legislation; it might be the simple inequality of you having an instance, while they have a bunch of datacenters.
What’s 1TB/s more or less, a rounding error?
Big names scrape the whole web all the time; best case scenario, they’ll have an optimized scraper for federated networks; worst case, they’ll scrape as they would any other website and not even notice the difference.
I don’t think they’re optimising much at all. I think it’s likely just a modified web crawler but without the kind of throttling normal search engine crawlers use. They’re following links recursively. Then probably some basic parsing or even parsing with AI to prepare the data to make another AI model.
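The difference from a well-behaved crawler is basically one sleep call. A toy sketch — the `fetch` callback and the delay value are placeholders:

```python
# A "polite" crawler honours a per-host delay between requests, as search
# engine bots conventionally do. The behaviour described above amounts to
# running this loop with no delay at all.
import time

def crawl(urls, fetch, delay_per_host=1.0):
    """Fetch each URL, pausing between requests to the same host."""
    last_hit = {}
    for url in urls:
        host = url.split("/")[2]
        wait = last_hit.get(host, 0) + delay_per_host - time.monotonic()
        if wait > 0:
            time.sleep(wait)  # delete this line and you get the firehose behaviour
        last_hit[host] = time.monotonic()
        fetch(url)
```

Search crawlers also honour robots.txt and crawl-delay hints on top of this; by the accounts above, the AI ones skip all of it.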
In a practical sense you’re completely right. However, in a legal sense, I’m not sure that implementing ActivityPub on your website and not restricting federation means you can no longer impose legal conditions on access to the data your website is hosting. I’m not sure the open nature of the protocol settles the legal question by itself.
To be extra clear: I’m not making any kind of claims here. I’m only saying that I’m not sure the answer is a simple one.
I’m sure you’re allowed to impose legal conditions on your data, but the AI folks have very clearly shown they don’t care and would prefer to just fight it out in court years and years later, if ever.
Legislation is so far behind these issues, I expect AP to be replaced by whatever comes next before legal considerations have any impact. And what’s Joe Smallserver going to do? Sue Google?
I agree with your theory, but: while in theory, theory is the same as practice, in practice, it isn’t.
From a copyright point of view… the rights to each piece of content belong to its owner… but each owner is sharing that content with an instance, with the intent of it getting re-shared to further instances.
In a strict sense, most instances are in breach of copyright law: they don’t require users to agree to an EULA specifying how the content will be used, they don’t require federated instances to agree to the same terms, they don’t make end users agree to the terms of other instances, and they generally allow users to submit someone else’s content (see: memes) without the owner’s authorization, then share and re-share it across the federated network. A fully “copyright compliant” protocol would need to have these things baked in from the beginning… which would make joining the federated network a royal PITA.
With the current approach of “like, chill bro”… anyone can set up an instance, federate with any target instance (or with an instance federated with a target), and save all the data without any consequences. Receiving federated data carries an implicit consent to process that data, and certainly nothing prevents arbitrary processing.
Scraping the web endpoint of an instance carries the rules set by the EULA of that endpoint… which tends to be nonexistent, or at best that of the least restrictive instance offering that federated data.
And all of that is before we get to scrapers simply ignoring any requirements.
What??? But… I posted a disclaimer and license! How can they slap???
Tech bros have never understood concepts like personal property, privacy, or stopping when asked.
great context! thanks for pointing that out
And not just AI datasets but the CIA AI datasets… 🤣🤣
That’s why I never talk about that time in refurbished Mexico.
What happens in Mexico, stays in Mexico.
Probably even easier than places like Twitter, as you can set up a server and others will even push all the data to you.