Come on... it's fine that you haven't followed the story, there's a lot going on, but the snotty condescension is very frustrating:
> These newly unredacted documents reveal exchanges between Meta employees unearthed in the discovery process, like a Meta engineer telling a colleague that they hesitated to access LibGen data because “torrenting from a [Meta-owned] corporate laptop doesn’t feel right”. They also allege that internal discussions about using LibGen data were escalated to Meta CEO Mark Zuckerberg (referred to as "MZ" in the memo handed over during discovery) and that Meta's AI team was "approved to use" the pirated material.
> “By downloading through the bit torrent protocol, Meta knew it was facilitating further copyright infringement by acting as a distribution point for other users of pirated books,” the amended complaint notes.
> “Put another way, by opting to use a bit torrent system to download LibGen’s voluminous collection of pirated books, Meta ‘seeded’ pirated books to other users worldwide.”
It is possible to (ab)use the bittorrent ecosystem and download without sharing at all. I don't know if this is what Meta did, or not.
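Most clients expose settings that throttle uploads to zero. As a sketch (not a claim about what Meta actually ran), the key names below are assumptions based on Transmission's settings.json; whether zero-upload leeching legally avoids "distribution" is exactly what is in dispute:

    import json

    # Sketch: a Transmission-style settings fragment that effectively disables
    # seeding. Key names are assumptions about Transmission's settings.json;
    # other clients expose equivalent knobs.
    leech_settings = {
        "speed-limit-up": 0,             # cap upload speed at 0 kB/s
        "speed-limit-up-enabled": True,
        "ratio-limit": 0,                # stop the torrent at a 0.0 share ratio
        "ratio-limit-enabled": True,
    }
    print(json.dumps(leech_settings, indent=2))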
However, since this is a civil case, they don't have to prove beyond reasonable doubt that Meta seeded torrents. If they did use torrents, the presumption would be that they used a regular bittorrent client with regular settings, and it would be on Meta to show they didn't do that.
Meta can show this with testimony. (Employee: “I opened the settings and disabled sharing.”)
This is a difficult theory for the plaintiffs to prevail on, since they would have no evidence of their own to contradict Meta’s testimony to keep the issue in play. Which is why they’re asking for client logs - and good luck with that.
I was (partly) responsible for obtaining recordings for a Very Large Online Streaming Service(tm). Sometimes the studios would send us trucks filled with CDs. Sometimes they didn't have any easily accessible copies of the albums and would tell us to just "get it however..." which often involved SoulSeek, Limewire, etc.
We were not smart about it. We just found the stuff and hit download. To the point where there were days the corp Internet was saturated from too many P2P clients running.
I am trying to imagine the legal contortions required for the US Supreme Court to relieve Meta of copyright infringement liability for participating in a bit torrent cloud (and thereby facilitating "piracy" by others) in this case, while upholding liability for ordinary people using bit torrent.
Not a lawyer, but I could see an argument that Meta’s use is transformative whereas just pirating to watch something is not. Not asserting that myself, just saying it seems a possible avenue.
The issue with bittorrent isn't so much that you are acquiring material but that you are also distributing it. There are cases where downloading copyrighted material is legal. But distributing it without consent never is, and is generally punished much worse.
They have been appointed by the president who Zuckerberg stood beside at the inauguration of the age of grift. Legal specifics don't feel very relevant anymore.
While Meta's use of copyrighted material might actually fall under fair use, I wonder about the implications of having to use the whole source material for training purposes...
Let's say I quote some key parts of a copyrighted book in a way that complies with fair use for a work of mine. In order to find the quoted parts I have to read the whole book first. To read the book I need to acquire it. If it was simply pirated, wouldn't that technically be the main issue, not the fair use part in their service?
I am an absolute layman when it comes to the subject of law and just thinking loudly.
It seems to me that admitting to using pirated works could be more problematic in itself, regardless of the resulting fair use, when it is clear that the whole content had to be consumed / processed to get to the result.
The mind boggles. Are the plaintiffs jumping to the conclusion that Meta must have used BitTorrent, based on the idea that whenever someone pirates anything anywhere using the Internet, it's always done with BitTorrent? Or is there actual evidence for this?
There were comments published somewhere in the early days where it was specifically mentioned they used one of the big torrent files. That's where the authors got their idea from, I guess.
I see a silver lining here: If Meta and/or Google's lawyers can successfully demonstrate in court that piracy does not cause harm, it would nullify copyright infringement laws, making piracy legal for everyone.
Meta isn't arguing that, though. They are arguing their use is one of the loopholes in copyright law where they aren't liable for the damages. Even them succeeding would only demonstrate that LLM training is transformative, and would not impact the common uses of piracy for average folk.
I would also be stunned if they make that argument. There is almost undeniably some number of dollars Google/Meta would have paid for the data. It may be less than publishers would want, but I don't think anyone would actually believe Google/Meta saying "if the data wasn't free, we just wouldn't have done AI".
You know, I actually don't think so. Gabe Newell famously said piracy is a distribution problem, so a court would likely have to acknowledge inadequate distribution methods hampering AI development. This would set a great precedent for consumer piracy, especially for old media that isn't sold anymore. It may not be a criminal offence if best efforts aren't being made by the original copyright holders to distribute.
Yup. As a full on IP abolitionist, I'm super excited by this. Information wants to be free. LLM providers training on things that folks don't want them to is a feature, not a bug. The tears of those mad about this are delicious and will ultimately be drowned out in the rain. Luddites and Copyright Trolls should be annihilated from the body politic with extreme prejudice.
If you need the logs, doesn't that prove the point that the AI is not a derivative work?
Like if you can't figure out which works were used to create the AI just by looking, it's hard to argue that they "copied" the work. Copyright is not a general prohibition on using the copyrighted work, only the creativity contained within.
I asked ChatGPT about a design pattern the other day and it plagiarized a paragraph verbatim, without attribution, from a textbook I'm also reading (Design Patterns).
It isn't difficult to show copyright infringement in these models. The assumption should be that copyright infringement has occurred until proven otherwise.
Just the fact that they are indiscriminately web scraping proves that. Just because it is publicly and (monetarily) freely available doesn't mean it isn't copyrighted.
This is why the "AI learns from materials just like a human does so it's not copyright infringement" argument always bothered me. A person won't recite full pages of word-for-word copies [1] from their head when you ask them something.
When I first tried Copilot, I asked it to write a simple algorithm like FizzBuzz and it ripped off a random repo verbatim, complete with Korean comments and typos. Image models will also happily generate near-identical [2] copies (usually with some added noise) of copyrighted images with the right prompt.
Copyright infringement and plagiarism aren't the same thing.
A human reproducing a paragraph word for word in an educational context would probably not be considered copyright infringement (although lack of attribution might be problematic). In the US anyways. The US is somewhat unique in having very broad fair use when it comes to material used in an educational context, much broader than most other countries.
One of the factors going into determining fair use is whether the use is commercial.
Another factor is the effect on the market of the original product.
Non-attribution + commercial use + affecting the marketability of the original product (which is what LLMs do) seems unlikely to be considered fair use by any existing precedent.
Claiming that Meta distributed pirated works is still a copyright claim, but you're correct that it's seemingly irrelevant to the fair use argument (which the article acknowledges).
Define "figure out" and "looking" for a LLM, a bundle of pseudo-nerual pathways driven by parameters we number in the billions for sufficiently large models.
No. Because you can't tell by inspecting the weights, and it can be hard to tell AIUI if the capability to generate the output is present, but suppressed by a safety mechanism, or doesn't exist at all.
They have already proven that copyrighted data was used for training but got struck down in court. The reason why they're asking for the torrent logs is because Meta torrenting the pirated data means they probably seeded and thus distributed it, which has a much greater impact legally than just downloading.
Has anyone thought about orphaned books? Training on orphaned books might open them up to be reintegrated into culture instead of dying off unused and forgotten. Copyright kills works by making them irreproducible when the authors are not to be found.
I am not sure you have to use torrents to pirate books. Pdfdrive is likely much more effective than torrents. Torrents are best for large assets or those that are highly policed by copyright authorities, but for smaller things torrents have little benefit.
What happens if I input 10 news headlines from different news sources into an AI prompt and publish and sell the resulting AI output. Is this copyright infringement?
If that's contractually-enforceable in their terms-of-service... then I have my own terms-of-service proposal that I've been kicking around here for several weeks, a kind of GPL-inspired poison-pill:
> If the Visitor uses copyrighted material from this site (Hereafter: Site-Content) to train a Generative AI System, in consideration the Visitor grants the Site Owner an irrevocable, royalty-free, worldwide license to use and re-license any output or derivative works created from that trained Generative AI System. (Hereafter: Generated Content.)
> If the Visitor re-trains their Generative AI System to remove use of the Site-Content, the Visitor is responsible for notifying the Site Owner of which Generated Content is no longer subject to the above consideration. The Visitor shall indemnify the Site-Owner for any usage or re-licensing of Generated Content that occurs prior to the Site-Owner receiving adequate notice.
_________
IANAL, but in short: "If you exploit my work to generate stuff, then I get to use or give-away what you made too. If you later stop exploiting my work and forget to tell me, then that's your problem."
Yes, we haven't managed to eradicate a two-tiered justice system where the wealthy and powerful get to break the rules... But still, it would be cool to develop some IP-lawyer-vetted approach like this for anyone to use, some boilerplate ToS and agree-button implementation guidelines.
I still don't think this has legs, precisely because of this case.
They accessed the material through piracy. They never accepted a TOS. They will probably get away with acquiring the material however they liked because of fair use.
The technicality is that they redistributed the material because of seeding, which is a no-no.
That said, you might find inspiration in Midjourney's ToS. Anyone paying less than a Business plan agrees that anyone else on the platform can sample your output and your prompt.
While this won't work too well when the access is indirect via a piracy or a "rogue contractor", it can be applicable to the web-crawlers the companies are directly running.
It's incredibly hypocritical too. They have become rich by training on valuable data produced by others. Yet others are not allowed to train on valuable data produced by them.
More recently they train on a mix of synthetic and organic text, like the Phi-4 and o1 / o3 models. Original copyrighted text can be safely replaced with synthetic standins.
Google is currently being sued by journalist Jill Leovy for illegally downloading and using her book "Ghettoside" to train Google's LLMs [1].
However, her book is currently stored, indexed and available as a snippet on Google Books [2]. That use case has been established in the courts to be fair use. Additionally, Google has made deals with publishers and the Author's Guild as well.
So many questions! Did Google use its own book database to train Gemini? Even if they got the book file in an illegal way, does the fact that they already have it legally negate the issue? Does resolving all the legal issues related to Google Books immunize them from these sorts of suits? Legally, is training an LLM the same as indexing and providing snippets? I wonder if OpenAI, Meta and the rest will be able to use Google Books as a precedent? Could Google license its model to other companies to immunize them?
Google's decade-long Books battle could produce major dividends in the AI space. But I'm not a lawyer.
Specifically, the new allegations in this article revolve around their use of BitTorrent, and that they thereby re-distributed the works — this would still be illegal even if their use of the works as training data for the LLMs itself is ruled to be "fair use".
I'm allowed to take the script of a play out of a library, and learn it (I'm less sure about the right to then perform it). I'm generally allowed to make photocopies for research purposes; libraries even have photocopiers available for public use (with notices about copyright law right by them). But unless it's very old, I'm not allowed to sell (or even give away) complete photocopies of the entire play.
People breaking the first rule wasn’t enough for me to crack into the scene. The weird two-paid-services thing required to use it effectively—a search service of some kind, and your actual content provider—and the jankiness of the software and sites involved were enough to get me to give up, after spending some money but making no meaningful progress toward pirating anything.
I started my piracy journey on Napster. I’ve done all the other biggies. I’ve done off-the-beaten-path stuff like IRC piracy channels. Private trackers. I have a soft spot for Windowmaker and was dumb enough to run Gentoo so long that I got kinda good at the “scary” deep parts of Linux sysadmin. I can deal with fiddliness and allegedly-ugly UI.
It seemed obvious to me for a long time before modern LLM training that any sort of training of machine intelligence would have to rely on pirated content. There's just no other viable alternative for efficiently acquiring large quantities of text data. Buying millions of ebooks online would take a lot of effort; downloading data from publishers isn't something that can be done efficiently (even assuming tech companies negotiated and threw money at them); the only efficient way to access large volumes of media is piracy. The media ecosystem doesn't allow anything else.
I don’t follow the “millions of ebooks are hard” line of thinking.
If Meta (or anyone) had approached publishers with a "we want to buy one copy of every book you publish" offer, that doesn't seem technically or commercially difficult.
Certainly Amazon would find that extremely easy.
Buying a book to read and incorporating its text in a product are two different things. Even if they bought the book, imo it would be illegal.
There are situations where you are allowed to incorporate the text in your product (fair use).
The million dollar question is if this counts.
Maybe it is, maybe it isn't. The courts will decide.
> Maybe it is, maybe it isn't. The courts will decide.
This offhandedly seems to dismiss the cost of achieving legal clarity for using a book - a cost that will far eclipse the cost of the book itself.
In that light, it seems like an underweighted statement.
What they will decide is that it is simultaneously not piracy, because it is not read by a human, and not copyright infringement, because it's just like a human learning by reading a book.
Those are both copyright infringement, since we already have MAI Systems Corp. v. Peak Computer, Inc.
I'd like to see them try to argue Cartoon Network, LP v. CSC Holdings, Inc. applies to their corpus.
I really hope you'll be right -
But the first one is a human using things. It's big guy vs. little guy.
The precedent is there: Google already "reads" every page on the internet and ingests it into its systems, has for decades, and has survived lawsuits to do so.
I think they'd ask why they'd want those millions of books. The publishers don't have to sell, and would be unlikely to, if they thought something like copyright violation was the goal.
Which would be fair. It’s not up to the tech oligopoly to dictate who gets to follow which laws.
We're talking about 19th century laws. I feel bad for the judge. Normally it would be for Congress to figure this shit out but yeah they haven't been doing their job for years.
> There's just no other viable alternative for efficiently acquiring large quantities of text data. [...] take a lot of effort [...] isn't a thing that can be done efficiently [...] only efficient way to access large volumes of media is piracy
Hypothetical: If the only way we could build AGI would be to somehow read everyone's brain at least once, would it be worth just ignoring everyone's wish regarding privacy one time to suck up this data and have AGI moving forward?
It’s a fun hypothetical and not an obvious answer, to me at least.
But it’s not at all a similar dilemma to “should we allow the IP empire-building of the 1900’s to claim ownership over the concept of learning from copyrighted material”.
> It’s a fun hypothetical and not an obvious answer, to me at least.
As I wrote it out, I didn't know what I thought either.
But now, some sleep later, I feel like the answer is pretty clearly "No, not worth it", at least for myself.
Our exclusive control over access to our minds is our essential form of self-determination, and what it means to be an individual in society. Cross that boundary (forcefully, no less) and it's probably one of the worst ways you could violate a human.
Besides, I'm personally not hugely into the whole "aggregate benefits could outweigh individual harms" mindset utilitarians tend to employ; it feels like it misses thinking about the humans involved.
Anyways, sorry if the question upset some people, it wasn't meant to paint any specific picture but a thought experiment more or less, as we inch closer to scarier stuff being possible.
Even most morally inclined people tend to overestimate the value of immediate benefits, and underestimate the eventual (especially delayed, unknown) harms.
If you don't, your geopolitical adversary might be the first to build AGI.
So in this scenario I could see it become necessary from a military perspective.
Wouldn't it be a bad thing, even if it didn't require any privacy invasion?
If it matched human intellectual productivity capacity, that ensures that human intelligence will no longer get you more money than it takes to run some GPUs, so it would presumably become optional.
Could this agi cure cancer, and would it be in the hands of the public? Then sure, otherwise nah.
> in the hands of the public
Would you trust a businessman on that?
Nope, they haven’t earned an ounce.
How about a politician?
Eh, a transparent organization of elected officials with short term limits and strong public oversight, a little bit. A very smart ai would represent power, and so every mechanism we’ve used as humans to guard against power misuse would be the ones I’d want to see here.
at least I can fire my politicians.
> would it be worth just ignoring everyone's wish regarding privacy one time to suck up this data and have AGI moving forward?
Sure, if we all get a stake of ownership in it.
If some private company is going to be the main beneficiary, no, and hell no.
> Sure, if we all get a stake of ownership in it.
But we do, in the sense that benefits flow to the prompter, not the AI developers. The person comes with a problem, AI generates responses, they stand to benefit because it was their problem, the AI provider makes cents per million tokens.
AI benefits follow the owners of problems. That person might have started a project or taken a turn in their life as a result; the benefit is unquantifiable.
LLMs are like Linux, they empower everyone, and benefits are tied to usage not development.
We've seen this kind of system before. It was called sharecropping, and it was terrible.
The price will be ratcheted up, such that the majority of the economic surplus will go to the owner of AGI - with pricing tiered to the query you're asking it. The more economic utility the user will derive from making the query, the more the AGI's owner will charge them.
Are you claiming that the right to use a product or service implies a sort of ownership of it? If it's free to use, I suppose that makes some sense. If you're saying that the right to purchase use of it implies a level of ownership, that's just prima facie absurd.
no
Ah geeze, I come to this site to see the horrors of the sociopaths at the root of the terrible technologies that are destroying the planet I live on.
The fact that this is an active question is depressing.
The suspicion that, if it were possible, some tech bro would absolutely do it (and smugly justify it to themselves using Roko's Basilisk or something) makes me actually angry.
I get that you're just asking a hypothetical. If I asked "Hypothetical: what if we just killed all the technologists" you'd rightly see me as a horrible person.
Damn. This site and its people. What an experience.
Would the average person even be against it? I am the most passionately pro-privacy person that I know, but I think it is a good question, because society at large seems to not value privacy in the slightest. I think your outrage is probably unusual on a population level.
They don't value it because they think companies are not abusing this power too much. Little do they know…
When I talk to people it seems like they know, but they just don't care. They even think their phones are listening to their conversations to target ads.
It is well known that people change how they act when they know they are being watched. Even if they can't see it, just the threat of surveillance is enough to make people change their behavior.
I say it is no different than the people who are claiming they don't care. They absolutely do care, but at this point, saying "no" makes you the odd one with obviously something to hide, so they do this from a place of duress.
Unfortunately, I feel we are not too far from people finally snapping and going off the deep end because it's so pervasive and in-your-face that there is seemingly no escape left.
https://news.ycombinator.com/item?id=20207348
I agree that was the intent of the analogy but it’s not a great one. The idea that Disney, who has perverted IP laws globally for almost a century, should have equivalent ownership over their over-extracted copyrighted works to the same degree I have privacy for the thoughts in my own head? Really?
What's with the unnecessary straw man? Who said any of that?
Fuck no
Given how much copyrighted content I can remember? To the extent that what AI do is *inherently* piracy (and not just *also* piracy as an unforced error, as this case apparently is), a brain scan would also be piracy.
kind of too close to reality more than anyone knows :)
tbh human rights are all an illusion, especially if you are at the bottom of society like me. No way I will survive, so if a part of me survives as training data I guess that's better than nothing?
imo the only way this could happen is a global collaboration without telling anyone. The AGI would know everything about all humans, but its existence would have to be kept a secret, at least for the first n generations. So it would lead to life being gamified without anyone knowing; it would be eugenics, but on a global scale.
So many would be culled, but the AGI would know how to make it look normal to prevent resistance from forming: a war here, a war there, a law passed here, etc. So copyright being ignored kind of makes sense.
Jesus Christ
sadly he supports the AGI, eugenics and human sacrifice lol my pastor told me he gave him 6 real estate holdings
IMO, if the AI were more sample-efficient (a long-standing problem that predates LLMs), they would be able to learn from purely open-licensed content, which I think Wikipedia (CC-BY-SA) would be an example of? I think they'd even pass the share-alike requirements, given Meta are giving away the model weights?
https://en.wikipedia.org/wiki/Wikipedia:Copyrights
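The mechanics of sourcing that corpus are simple; a minimal sketch, assuming the Hugging Face datasets library and its wikimedia/wikipedia dump (dataset name and config are assumptions, check what's current):

    # Pulling an openly licensed (CC-BY-SA) corpus; names are assumptions.
    from datasets import load_dataset

    wiki = load_dataset("wikimedia/wikipedia", "20231101.en", split="train")
    print(len(wiki), "articles; first title:", wiki[0]["title"])
    # CC-BY-SA still requires attribution and share-alike for derived works.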
Alternatively, if they trained the model on synthetic data, filtered to avoid duplication, then no copyrighted material would be seen by the model. For example, turn an article into QA pairs, or summarize across multiple sources of text.
[1] https://arxiv.org/abs/2404.03502
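A hedged sketch of what such a pipeline could look like; generate() is a stand-in for any LLM API, not a real library call, and the prompts are purely illustrative:

    # Hypothetical synthetic-data pipeline: the trained model only ever sees
    # derived QA pairs and cross-source summaries, never the source text.
    def generate(prompt: str) -> str:
        raise NotImplementedError("stand-in for an LLM API call")

    def to_qa_pairs(article: str) -> str:
        # Ask for questions and answers covering the ideas, not the wording.
        return generate("Write 5 Q/A pairs covering the ideas in:\n" + article)

    def cross_source_summary(sources: list[str]) -> str:
        # Summarizing across sources dilutes any single work's expression.
        return generate("Summarize points of agreement and disagreement:\n"
                        + "\n---\n".join(sources))

    # Before training, dedup the synthetic text against the sources
    # (e.g. n-gram overlap filtering) to catch verbatim leakage.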
You can get knowledge collapse.
But even though there are counter-examples (e.g. learning Go from self-play based only on the rules), IMO — based on how often people were already doing this with buggy human-written software[0][1] — most people don't think about this in the right way and will therefore treat these things as magical oracles when they shouldn't.
No silver bullets.
[0] https://en.wikipedia.org/wiki/British_Post_Office_scandal
[1] https://en.wikipedia.org/wiki/Computer_says_no
Since this is Wikipedia, it could even satisfy the attribution requirements (though most CC-licensed corpora require attributing the individual authors).
> Buying millions of ebooks online would take a lot of effort
I don't understand.
Facebook and Google spend billions on training LLMs. Buying 1M ebooks at $50 each would only cost $50M.
They also have >100k engineers. If they shard the ebook buying across their workforce, everyone has to buy 10 ebooks, which will be done in 10 minutes.
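The arithmetic does check out, taking the figures assumed above at face value:

    # Rough check of the parent's numbers; the $50 price is an assumption.
    books, price_usd, engineers = 1_000_000, 50, 100_000
    print(f"total cost: ${books * price_usd / 1e6:.0f}M")   # $50M
    print(f"books per engineer: {books // engineers}")      # 10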
Google also operates a book store, like Amazon. Both could process a one-off to pay their authors, and then draw from their own backend.
> Buying millions of ebooks online would take a lot of effort
Let me put that into perspective:
- Googling "how many books exist" gives me ~150 million, no idea how accurate but let's use that. - Meta had a net profit of ~40 billion USD in 2023. - That could be an potential investment of ~250 USD per book acquisition.
That sounds like a ludicrously high budget to me. So yeah, Meta could very well pay. It would still not be ethical to slurp up all that content into their slop machine but there is zero justification to pirate it all, with these kinds of money involved.
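A quick sanity check of those bullets, with every figure as rough as the comment says:

    # One year of Meta net profit spread across every book said to exist.
    books_in_existence = 150_000_000     # the ~150M Googled estimate
    meta_net_profit_2023 = 40e9          # USD, approximate
    print(f"${meta_net_profit_2023 / books_in_existence:.0f} per book")  # ~$267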
For my thesis I trained a classifier on text from internal messaging systems and forums from a large consultancy company.
Most universities have had their own corpora to work with, for example: the Brown Corpus, the British National Corpus, and the Penn Treebank.
Similar corpora exist for images and video, usually created in association with national broadcasting services. News video is particularly interesting because they usually contain closed captions, which allows for multi-modal training.
Google has scans from Google Books, as well as all the ebooks it sells on the Play Store.
Wouldn't that still be piracy? They own the rights of distribution, but do they (or Amazon) have the rights to use said books for LLM training? And what rights would those even be?
> but do they (or Amazon) have the rights to use said books for LLM training?
The real question is: does copyright grant authors the right to control whether their work is used for LLM training?
It's not obvious what the answer is.
If authors don't have that right to begin with then there is no way amazon could buy it off them.
It’s a good question. Textbook companies especially would be pretty enthusiastic about a new “right to learn” monetization strategy. And imagine how lucrative it would be if you could prove some major artist didn’t copy your work, but learned from your work. The whole chain of scientific and artistic development could be monetized in perpetuity.
I think this is a dangerous road with little upside for anyone outside of IP aggregators.
It means they have existing relationships/contacts to reach out to for negotiating the rights for other usages of that content. I think it negates (for the case of Google/Apple/Amazon who all sell ebooks) the claim made that efficiently acquiring the digital texts wouldn't be possible.
Literally no rights agreement covers LLMs. They cover reproduction of the work, but LLMs don't obviously do this; i.e., that the model transiently runs an algorithm over the text is superficially no different to the use of any other classifier or scoring system, like those already used by law firms looking to sue people for sharing torrents.
> They cover reproduction of the work, but LLMs don't obviously do this
LLMs are much smaller than their training sets; there is no space to memorize the training data. They might memorize small snippets but never full books. They are the worst infringement tools ever made: why replicate Harry Potter by LLM, which is slow, expensive and lossy, when you could download the book so much more easily?
A second argument is that using the LLM blends a new intent into the process, that of the prompter. This can render the outputs transformative. And most LLM interactions are one-time use, like a scratch pad not like a finished work.
The lossy compression argument is interesting.
How many bits of entropy in Harry Potter?
How many bits of entropy in a lossy-compressed abridgement that is nevertheless enough, when reconstituted, to constitute a copyright infringement of Harry Potter?
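A back-of-envelope answer, with every constant a rough assumption (series length, average word length, compressor performance):

    # How small can Harry Potter get? All figures are rough assumptions.
    words = 1_100_000                    # approx. words across the series
    chars = words * 6                    # ~6 characters per word incl. spaces
    for name, bits_per_char in [("raw", 8), ("gzip-ish", 2.5), ("LM-level", 1.0)]:
        print(f"{name}: {chars * bits_per_char / 8 / 1e6:.1f} MB")
    # raw ~6.6 MB, generic compression ~2 MB, near-entropy-of-English ~0.8 MB.
    # An abridgement that still infringes could plausibly be smaller still.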
The latter is absolutely small enough to fit in an LLM, although how close it would get to the original work is debatable. The question is whether copyright is violated:
1) inherently by the model operator, during the training.
2) by the model/model owner, as part of the generation.
3) by the user, in making the model so so and then reproducing the result.
my personal perspectives
1) straight up copying. download a bunch of copyrighted stuff -> making a copy. no way out of this one.
2) a derivative work can be/is being generated here. very grey area — what counts as a “derivative” work? read about robin thicke blurred lines court case for a rollercoaster of a time about derivative musical works.
3) making the model so so? do you mean getting an output and user copying the result? that’s copying the derivative work, which, depends on whatever copyright agreement happens once a derivative work claim is sorted out.
that’s based on my 5 years of music copyright experience, although it was about ten years ago now so might be some stuff i’ve got wrong there.
You can ensure a model trains on transformative not derivative synthetic texts, for example, by asking for summary, or turning it into QA pairs, or doing contrastive synthesis across multiple copyrighted works. This will ensure the resulting model will never regurgitate the training set because it has not seen it. This approach only takes abstract ideas from copyrighted sources, protecting their specific expression.
If abstract ideas were protectable what would stop a LLM from learning not from the original source but from social commentary and follow up works? We can't ask people not to reproduce ideas they read about. But on the other hand, protecting abstractions would kneecap creativity both in humans and AI.
That's an interesting argument, which makes the case for "it's what you make it do, not what it can do, which constitutes a violation" a little stronger IMO.
1) It's definitely copying, but that doesn't necessarily mean the end product is itself a copyright violation. (And that remains true even where some of the steps to make it were themselves violations).
2) Agreed! Where this becomes interesting with LLMs is that, as with people, they can have the capacity to produce a derivative work even without having seen the original.
For example, an LLM that had "read" enough reviews of Harry Potter might be able to produce a reasonable stab at the book (at least enough so for the law to consider it a derivative) without ever having consumed the work itself or direct derivatives.
3) It's more of a tool-use and intent argument. One might make the argument that an LLM is a machine, not a set of content/data, and that the liability for what it does sits firmly with the user/operator, not those who made it. If I use a typewriter to copy Harry Potter - or a weapon to hurt or kill someone - in neither case does the machine or its maker have any liability there.
do those classifiers read copyrighted material? i thought they simply joined the swarm and seeded (reproduction with permission)
youtube, etc. classifiers definitely do read others' material though.
Leveraging their position in one market to get a leg up on another market? No idea if it would stick, but that would be one fun antitrust lawsuit right there.
Fun fact: it’s only illegal to leverage a monopoly in one market to advance another. It’s perfectly legal for Coke to leverage their large but not monopolistic soft drink empire to advance their bottled water entries.
Sure. The whole thing hinges on whether Google has a monopoly on whatever Google Books' market is (hence why I doubted it would stick). But given that some people seem to define "market" broadly enough to conclude that Apple has a monopoly on iPhones...
There are some efforts to do fully open LLMs including their training data. Allen AI released their model (OLMo) and the data used for training the model under permissive licenses.
https://allenai.org/
> In the most recent fiscal year, Alphabet's net income amounted to 73.7 billion U.S. dollars
Absolutely no way. Yup.
> Buying millions of ebooks online would take a lot of effort, downloading data from publishers isn't a thing that can be done efficiently
Oh no, it takes effort and can't be done efficiently, poor Google!
How can this possibly be an excuse? This is such a detached SV Zuckerberg "move fast and break things"-like take.
There's just no way for a lot of people to efficiently get out of poverty without kidnapping and ransoming someone; it would take a lot of effort.
copyright piracy isn't theft, try proving damages for a better argument
Not my point, never said it is. Substitute that example with another criminal act.
Edit: Changed it just for you
copyright infringement is a civil charge, silly guy. no offense, but there aren't many ways to defend its existence in current form without resorting to hyperbolic nonsense and looking silly in the process. so it's not a 'crime', and you have to prove damages for a civil offense, so... you'd need to prove that AI caused damages, or how that is materially different from other algorithms like Google scanning documents to provide the core utility for their service.
> The media ecosystem doesn't allow anything else.
Uh, pardon? For a mere $10MM, you can get almost all of the Taylor & Francis' catalogue. They'll pressure their authors to finish their books early for free [0].
I think you can obtain all the training material for a mere rounding error in your books, if you're Meta, or Microsoft, or similar.
Well, the authors will not be notified or compensated, and their opinion on the matter won't be asked anyway, but this is "all for capit^H^H^H^H^H research".
[0]: https://mathstodon.xyz/@johncarlosbaez/113221679747517432
Why would machine intelligence need an entire humanity's worth of data to be machine intelligence? It seems like only a training method that is really poor would need that much data.
what about something decentralized? each person trains a model on their own piece of data and somehow that gets aggregated into one giant model
This approach is used in Federated Learning where participants want to collaboratively train a model without sharing raw training data.
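A minimal federated-averaging (FedAvg) sketch with a toy linear model; everything here is illustrative rather than any particular framework's API:

    import numpy as np

    # Each participant fits on private data; only weights are shared, never
    # the raw data. The server averages the returned weight vectors.
    def local_update(weights, X, y, lr=0.01, steps=10):
        w = weights.copy()
        for _ in range(steps):
            grad = 2 * X.T @ (X @ w - y) / len(y)   # mean-squared-error gradient
            w -= lr * grad
        return w

    rng = np.random.default_rng(0)
    true_w = np.array([2.0, -1.0])
    clients = []
    for _ in range(5):                              # five participants
        X = rng.normal(size=(50, 2))
        y = X @ true_w + 0.1 * rng.normal(size=50)
        clients.append((X, y))                      # stays on the client

    global_w = np.zeros(2)
    for _ in range(20):                             # broadcast, train, average
        updates = [local_update(global_w, X, y) for X, y in clients]
        global_w = np.mean(updates, axis=0)

    print("learned:", global_w, "target:", true_w)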
are there any companies working on it?
Was thinking: if I train my model on my private docs, for instance finance, how does one prevent the model from sharing that data verbatim?
How do you feel about a business saying, "Paying people is hard. You should work for free."?
AI mega corporations are not entitled to easy and cheap access to data they don't own. If it's a hard problem, too bad. If the stakes are as high as they're all claiming then it should be no problem for them to do this right.
> not entitled to easy and cheap access to data they don't own
This is not copyright as we know it. Copyright protects against copying, not accessing data. You can still compile statistics off data you don't own. The models are like a compressed version of the originals, so compressed you can't retrieve more than a few snippets of original text. Newer models train on filtered synthetic text, which is one step removed from the protected expression in the copyrighted works. Should abstractions be protected by copyright?
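To make the "statistics" point concrete, a toy sketch (my own example, not anything from the case): bigram counts are derived from a text, but the text itself can't be reconstructed from them beyond short fragments.

    from collections import Counter

    def bigram_stats(text):
        """Count adjacent word pairs; the output is aggregate statistics."""
        words = text.lower().split()
        return Counter(zip(words, words[1:]))

    sample = "the cat sat on the mat and the cat slept"
    print(bigram_stats(sample).most_common(3))
    # [(('the', 'cat'), 2), ...] -- frequencies, not the original expression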
However, in order to get to the compressed state, the original data has to be processed in some way as a whole. This requires a copy of the material to be available. If that copy was obtained illegally, what are the implications?
Sam Altman bought some of GPT's training data from a Chinese army cyber group.
1. Sam Altman was removed from OpenAI due to his ties to a Chinese cyber army group.
2. OpenAI had been using data from D2 to train its AI models.
3. The Chinese government raised concerns about this arrangement with the Biden administration.
4. The NSA launched an investigation, which confirmed OpenAI's use of D2 data.
5. Satya Nadella ordered Altman's removal after being informed of the findings.
6. Altman refused to disclose this information to the OpenAI board.
Source: https://www.teamblind.com/post/I-know-why-Sam-Altman-was-fir...
I guess Sam then hired top NSA guy to buy favor with the natsec community.
I wonder who protects Sam up top and why aren't they protecting Zuck? Is Sam just better at bribes and manipulation?
I find it highly implausible that Meta doesn't have the resources to obtain these legally. They could have reached out to a publisher and asked to purchase ebooks in bulk - and if that publisher says no, tough shit. The media ecosystem doesn't exist for Big Tech to extract value from it!
"It would take a lot of effort to do it legally" is a pathetic excuse for a company of Meta's size.
> I find it highly implausible that Meta doesn't have the resources to obtain these legally. They could have reached out to a publisher and ask to purchase ebooks in bulk - and if that publisher says no, tough shit
They could also simply buy controlling stakes in publishers. For scale comparison, Meta is spending upwards of $30B per year on AI, and the recent sale of Simon & Schuster that didn't go through was for a mere $2.2B.
I don't think it would actually be that simple.
Surely the author only licenses the copyright to the publisher for hardback, paperback and ebook, with an agreed-upon royalty rate?
And if someone wants the rights for some other purpose, like translation or making a film or producing merchandise, they have to go to the author and negotiate additional rights?
Meta giving a few billion to authors would probably mend a lot of hearts, though.
Explain why release-group tags get generated in some videos, then.
They are not saying Meta didn't use pirated content, just that it has the resources not to, if it chooses.
> if that publisher says no, tough shit

> "It would take a lot of effort to do it legally" is a pathetic excuse for a company of Meta's size.
I totally agree. But since when has that stopped companies like Meta? These big companies are built on breaking/skirting the rules.
Perhaps they did and got told no and decided to take it anyway?
Defending themselves with technicalities and expensive lawyers may be financially viable.
Zero ethics but what would we expect from them?
Who is "them"? Like, who in the Meta business reporting line made this decision, then how did they communicate it to the engineers who would've been necessary to implement it, particularly at scale?
While it's plausible someone downloaded a bunch of torrents and tossed them in the training directory... again, under whose authority? Like, if this happened, it would potentially be one overzealous data scientist. Hardly "them".
People lean on collective pronouns to avoid actually thinking about the mechanics of human enterprise and you get extremely absurd conclusions.
(It is not outside the bounds of the thinkable that an org could in fact have a very bad culture like this, but I know people who work for Meta who also have law degrees - they're well aware of the potential problems.)
Come on... it's fine that you haven't followed the story, there's a lot going on, but the snotty condescension is very frustrating:
https://www.wired.com/story/new-documents-unredacted-meta-co...

Recent and related. Others?
Zuckerberg appeared to know Llama trained on Libgen - https://news.ycombinator.com/item?id=42759546 - Jan 2025 (73 comments)
Zuckerberg approved training Llama on LibGen [pdf] - https://news.ycombinator.com/item?id=42673628 - Jan 2025 (191 comments)
Zuckerberg Approved AI Training on Pirated Books, Filings Say - https://news.ycombinator.com/item?id=42651007 - Jan 2025 (54 comments)
> “By downloading through the bit torrent protocol, Meta knew it was facilitating further copyright infringement by acting as a distribution point for other users of pirated books,” the amended complaint notes.
> “Put another way, by opting to use a bit torrent system to download LibGen’s voluminous collection of pirated books, Meta ‘seeded’ pirated books to other users worldwide.”
It is possible to (ab)use the bittorrent ecosystem and download without sharing at all. I don't know if this is what Meta did, or not.
However, since this is a civil case, they don't have to prove beyond reasonable doubt that Meta seeded torrents. If they did use torrents, the presumption would be that they used a regular bittorrent client with regular settings, and it would be on Meta to show they didn't do that.
Meta can show this with testimony. (Employee: “I opened the settings and disabled sharing.”)
This is a difficult theory for the plaintiffs to prevail on, since they would have no evidence of their own to contradict Meta’s testimony to keep the issue in play. Which is why they’re asking for client logs - and good luck with that.
I am not commenting on any legal mechanics. Just technical details.
Hypothetically you could just not seed.
Right, that's what I'm talking about. I.e. https://github.com/pmoor/bitthief and similar.
That is probably exactly what they did if they were smart about it.
I was (partly) responsible for obtaining recordings for a Very Large Online Streaming Service(tm). Sometimes the studios would send us trucks filled with CDs. Sometimes they didn't have any easily accessible copies of the albums and would tell us to just "get it however..." which often involved SoulSeek, Limewire, etc.
We were not smart about it. We just found the stuff and hit download. To the point where there were days the corp Internet was saturated from too many P2P clients running.
I am trying to imagine the legal contortions required for the US Supreme Court to relieve Meta of copyright infringement liability for participating in a bit torrent cloud (and thereby facilitating "piracy" by others) in this case, while upholding liability for ordinary people using bit torrent.
Would love if any lawyers here can speculate.
Not a lawyer, but I could see an argument that Meta’s use is transformative whereas just pirating to watch something is not. Not asserting that myself, just saying it seems a possible avenue.
The issue with bittorrent isn't so much that you are acquiring material but that you are also distributing it. There are cases where downloading copyrighted material is legal. But distributing it without consent never is, and is generally punished much worse.
You can turn off uploading in some torrent clients, such as Transmission.
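For the curious, most client APIs let you choke uploads to effectively zero. A rough sketch with libtorrent's Python bindings (assuming libtorrent >= 1.2, where lt.session() accepts a settings dict; whether the swarm tolerates a non-reciprocating peer is a separate question):

    import libtorrent as lt

    ses = lt.session({
        # Caveat: in libtorrent a rate limit of 0 means *unlimited*,
        # so a 1 byte/sec cap is the closest thing to "uploads off".
        "upload_rate_limit": 1,
    })

If memory serves, Transmission exposes the same idea through the "speed-limit-up" / "speed-limit-up-enabled" keys in its settings.json.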
The use might (might!) be transformative, but the work is copyrighted. How Meta copied it is at issue. Was the way they acquired it illegal?
After all, Google Books did not acquire their books through torrents. They got physical books.
That's the argument they are using, but they were likely seeding.
If anything, should the seeding be proven, there will be a lot of entities seeking restitution.
It's a technicality, but it's better than breaking fair use to appease authors.
They have been appointed by the president who Zuckerberg stood beside at the inauguration of the age of grift. Legal specifics don't feel very relevant anymore.
While Meta's use of copyrighted material might actually fall under fair use, I wonder about the implications of having to use the whole source material for training purposes...
Let's say I quote some key parts of a copyrighted book in a way that complies with fair use for a work of mine. In order to find the quoted parts, I have to read the whole book first. To read the book, I need to acquire it. If it was simply pirated, wouldn't that technically be the main issue, not the fair use part in their service? I am an absolute layman when it comes to the subject of law and am just thinking out loud. It seems to me that admitting to using pirated works could be more problematic in itself, regardless of the resulting fair use, when it is clear that the whole content had to be consumed / processed to get to the result.
The mind boggles. Are the plaintiffs jumping to the conclusion that Meta must have used BitTorrent, based on the idea that whenever someone pirates anything anywhere using the Internet, it's always done with BitTorrent? Or is there actual evidence for this?
There was employee communication expressing that it felt odd to use a torrent client on company computers. [1]
[1] https://timesofindia.indiatimes.com/technology/tech-news/whe...
There were comments published somewhere in the early days where it was specifically mentioned they used one of the big torrent files. That's where the authors got their idea from, I guess.
I see a silver lining here: If Meta and/or Google's lawyers can successfully demonstrate in court that piracy does not cause harm, it would nullify copyright infringement laws, making piracy legal for everyone.
Meta isn't arguing that, though. They are arguing their use falls under one of the loopholes in copyright law where they aren't liable for the damages. Even a win for them would only demonstrate that LLM training is transformative; it would not impact the common uses of piracy for average folk.
I would also be stunned if they make that argument. There is almost undeniably some number of dollars Google/Meta would have paid for the data. It may be less than publishers would want, but I don't think anyone would actually believe Google/Meta saying "if the data wasn't free, we just wouldn't have done AI".
This would be poetic, but it's not gonna happen. It will be legal for big corps, but not for you and me.
You know, I actually don't think so. Gabe Newell famously said piracy is a distribution problem, so a court would likely have to acknowledge inadequate distribution methods hampering AI development. This sets a great precedent for consumer piracy, especially for old media that isn't sold anymore. It may not be a criminal offence if best efforts aren't being made by the original copyright holders to distribute.
Their argument will be that piracy only applies to humans IMO. They're just doing what Google has been allowed to do for decades.
Yup. As a full on IP abolitionist, I'm super excited by this. Information wants to be free. LLM providers training on things that folks don't want them to is a feature, not a bug. The tears of those mad about this are delicious and will ultimately be drowned out in the rain. Luddites and Copyright Trolls should be annihilated from the body politic with extreme prejudice.
wtf
If you need the logs, doesn't that prove the point that the AI is not a derivative work?
Like, if you can't figure out which works were used to create the AI just by looking, it's hard to argue that they "copied" the work. Copyright is not a general prohibition on using the copyrighted work, only on the creativity contained within.
I asked ChatGPT about a design pattern the other day and it plagiarized a paragraph verbatim, without attribution, from a textbook I'm also reading (Design Patterns).
It isn't difficult to show copyright infringement in these models. The assumption should be that copyright infringement has occurred until proven otherwise.
The fact that they are indiscriminately web-scraping alone proves that. Just because something is publicly and (monetarily) freely available doesn't mean it isn't copyrighted.
This is why the "AI learns from materials just like a human does so it's not copyright infringement" argument always bothered me. A person won't recite full pages of word-for-word copies [1] from their head when you ask them something.
When I first tried Copilot, I asked it to write a simple algorithm like FizzBuzz and it ripped off a random repo verbatim, complete with Korean comments and typos. Image models will also happily generate near-identical [2] copies (usually with some added noise) of copyrighted images with the right prompt.
[1] https://bair.berkeley.edu/blog/2020/12/20/lmmem/
[2] https://www.theregister.com/2023/02/06/uh_oh_attackers_can_e...
Copyright infringement and plagiarism aren't the same thing.
A human reproducing a paragraph word for word in an educational context would probably not be considered copyright infringement (although the lack of attribution might be problematic). In the US, anyway. The US is somewhat unique in having very broad fair use when it comes to material used in an educational context, much broader than most other countries.
One of the factors going into determining fair use is whether the use is commercial.
Another factor is the effect on the market of the original product.
Non-attribution + commercial use + affecting the marketability of the original product (which is what LLMs do) seems unlikely to be considered fair use by any existing precedent.
That being said IANAL.
Claiming that Meta distributed pirated works is still a copyright claim, but you're correct that it's seemingly irrelevant to the fair use argument (which the article acknowledges).
Define "figure out" and "looking" for a LLM, a bundle of pseudo-nerual pathways driven by parameters we number in the billions for sufficiently large models.
No. Because you can't tell by inspecting the weights, and AIUI it can be hard to tell whether the capability to generate the output is present but suppressed by a safety mechanism, or doesn't exist at all.
They have already proven that copyrighted data was used for training, but that claim got struck down in court. The reason they're asking for the torrent logs is that Meta torrenting the pirated data means it probably seeded, and thus distributed, the data, which has a much greater legal impact than just downloading.
Has anyone thought about orphaned books? Training on orphaned books might open them up to being reintegrated into culture instead of dying off unused and forgotten. Copyright kills works by making them irreproducible when the authors are not to be found.
I look forward to the showdown between big tech and big copyright.
I am not sure you have to use torrents to pirate books. Pdfdrive is likely much more effective than torrents. Torrents are best for large assets, or those that are highly policed by copyright authorities, but for smaller things torrents have little benefit.
They have an email from a Meta employee seeking clarification because it "felt wrong" to torrent the books.
I think if you're downloading hundreds of thousands to millions of books you'll be dealing with some pretty large archives.
edit: books3.tar.gz alone is 37GB and claimed to have 197,000 titles in plain text.
A publisher's entire library of books is a large asset.
As long as they seed, it's fine by me.
What happens if I input 10 news headlines from different news sources into an AI prompt and then publish and sell the resulting AI output? Is this copyright infringement?
Try using any of the big players' models to train your own, and see how quickly they remember how much they value copyright.
You mean OpenAI's infamous "you shall not train on the output of our model" clause?
If that's contractually-enforceable in their terms-of-service... then I have my own terms-of-service proposal that I've been kicking around here for several weeks, a kind of GPL-inspired poison-pill:
> If the Visitor uses copyrighted material from this site (Hereafter: Site-Content) to train a Generative AI System, in consideration the Visitor grants the Site Owner an irrevocable, royalty-free, worldwide license to use and re-license any output or derivative works created from that trained Generative AI System. (Hereafter: Generated Content.)
> If the Visitor re-trains their Generative AI System to remove use of the Site-Content, the Visitor is responsible for notifying the Site Owner of which Generated Content is no longer subject to the above consideration. The Visitor shall indemnify the Site-Owner for any usage or re-licensing of Generated Content that occurs prior to the Site-Owner receiving adequate notice.
_________
IANAL, but in short: "If you exploit my work to generate stuff, then I get to use or give-away what you made too. If you later stop exploiting my work and forget to tell me, then that's your problem."
Yes, we haven't managed to eradicate a two-tiered justice system where the wealthy and powerful get to break the rules... But still, it would be cool to develop some IP-lawyer-vetted approach like this for anyone to use, some boilerplate ToS and agree-button implementation guidelines.
I still don't think this has legs, precisely because of this case.
They accessed the material through piracy. They never accepted a TOS. They will probably get away with acquiring the material however they liked because of fair use.
The technicality is that they redistributed the material by seeding, which is a no-no.
That said, you might find inspiration in Midjourney's TOS. Anyone paying for less than a Business plan agrees that anyone else on the platform can sample their output and their prompt.
While this won't work too well when the access is indirect, via piracy or a "rogue contractor", it can be applicable to the web crawlers the companies run directly.
It's incredibly hypocritical too. They have become rich by training on valuable data produced by others. Yet others are not allowed to train on valuable data produced by them.
How do other LLMs like Claude deal with this?
You don't talk about Fight Club...
Everyone uses "pirated" content, but some are better at hiding it and/or not talking about it.
There is no other way to do it.
More recently they train on a mix of synthetic and organic text, like the Phi-4 and o1 / o3 models. Original copyrighted text can be safely replaced with synthetic standins.
I think this only works to a certain degree; they will still use as much data as they can to train the models.
Synthetic data will not replace original data like books. Synthetic data works very well for math.
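For what it's worth, the standin pipeline is simple to sketch. Assuming a hypothetical teacher() placeholder for whatever model call you have available (not a real API), the idea is to train on restatements rather than on the protected expression itself:

    def teacher(prompt: str) -> str:
        raise NotImplementedError("plug in your own model call here")

    def synthetic_standin(passage: str) -> str:
        """Restate a passage's ideas in fresh wording before training on it."""
        prompt = (
            "Rewrite the following passage in entirely new wording, "
            "keeping the ideas but none of the original phrasing:\n\n"
            + passage
        )
        return teacher(prompt)

    # training_corpus = [synthetic_standin(p) for p in source_passages]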
So here's a related thought...
Google is currently being sued by journalist Jill Leovy for illegally downloading and using her book "Ghettoside" to train Google's LLMs [1].
However, her book is currently stored, indexed, and available as a snippet on Google Books [2]. That use case has been established in the courts to be fair use. Additionally, Google has made deals with publishers and the Authors Guild as well.
So many questions! Did Google use its own book database to train Gemini? Even if they got the book file in an illegal way, does the fact that they already have it legally negate the issue? Does resolving all the legal issues related to Google Books immunize them from these sorts of suits? Legally, is training an LLM the same as indexing and providing snippets? I wonder if OpenAI, Meta and the rest will be able to use Google Books as a precedent? Could Google license its model to other companies to immunize them?
Google's decade-long Books battle could produce major dividends in the AI space. But I'm not a lawyer.
1. https://www.bloomberglaw.com/public/desktop/document/LeovyvG...
2. https://books.google.com/books?id=bZXtAQAAQBAJ
What's the lesson, hire contractors?
The lesson is "move fast and break things is much less fun when we have to pay for things we broke".
What things did they break by downloading books?
The law.
Specifically, the new allegations in this article revolve around their use of BitTorrent, and the claim that they thereby re-distributed the works; this would still be illegal even if their use of the works as training data for the LLMs is itself ruled to be "fair use".
I'm allowed to take the script of a play out of a library and learn it (I'm less sure about the right to then perform it). I'm generally allowed to make photocopies for research purposes; libraries even have photocopiers available for public use (with notices about copyright law right by them). But unless it's very old, I'm not allowed to sell (or even give away) complete photocopies of the entire play.
The lesson is what it always is: don't talk about the illegal things we are doing in written form.
It's possible their friends in government will make this all go away if they ask nicely enough.
Would $1m suffice? https://www.bbc.com/news/articles/c8j9e1x9z2xo
That's the ante; gotta place the next wager.
The best ROI for the money is probably purchasing a SCOTUS justice.
Why not both?
Yeah I had a Facebook account until today.
This whole copyright thing reminds me of when Mark Zuckerberg was mad that someone posted photos of the interior of his house or something.
Wonder if Meta is running a one-way Usenet host. Much better than torrents.
The first rule of Usenet is: you do not talk about Usenet
People breaking the first rule wasn’t enough for me to crack into the scene. The weird two-paid-services thing required to use it effectively—a search service of some kind, and your actual content provider—and the jankiness of the software and sites involved were enough to get me to give up, after spending some money but making no meaningful progress toward pirating anything.
I started my piracy journey on Napster. I’ve done all the other biggies. I’ve done off-the-beaten-path stuff like IRC piracy channels. Private trackers. I have a soft spot for Windowmaker and was dumb enough to run Gentoo so long that I got kinda good at the “scary” deep parts of Linux sysadmin. I can deal with fiddliness and allegedly-ugly UI.
Usenet piracy defeated me.
Working as intended! The arrs make everything a lot easier.
If it was meant to be kept secret, it probably shouldn't have been put on the AOL home portal in 1994.
Sorry, forgot.