I feel like role-playing as a lawyer: I'm curious how you would defend against this in court?
I don't think anyone denies that frontier models were trained on copyrighted material - it's well documented and public knowledge. (Fair use and how the material was acquired are a separate legal question.)
I also don't think anyone denies that a model that strongly fits the training data approximates the copy-paste function. (Or at the very least, if A then B, consistently)
In practice, training resembles lossy compression of the data. Technically one could frame an LLM as a database of compressed training inputs.
This paper argues and demonstrates that "extraction is evidence of memorization" which affirms the above.
In terms of LLM output (the valuable product customers are paying for) this is familiar, albeit grey, legal territory.
https://en.wikipedia.org/wiki/Substantial_similarity
When a customer pays for an AI service, they're paying for access to a database of compressed training data - the additional layers of indirection sometimes produce novel output, and many times do not.
Unless you advocate for discarding the whole regime of intellectual property or you can argue for a better model of IP laws, the question stands: why shouldn't LLM services trained on copyrighted material be held responsible when their product violates "substantial similarity" of said copyrighted works? Why should failure to do so be immune from legal action?
I wonder, however, if this paper might imply the answer.
"But the results are complicated: the extent of memorization varies both by model and by book. With our specific experiments, we find that the largest LLMs don't memorize most books -- either in whole or in part. However, we also find that Llama 3.1 70B memorizes some books, like Harry Potter and 1984, almost entirely."
I wonder if we could exclude the full text of these books from the training data and still approximate this result? Harry Potter and 1984 are probably some of the most quoted texts on the internet.
>Unless you advocate for discarding the whole regime of intellectual property or you can argue for a better model of IP laws, the question stands: why shouldn't LLM services trained on copyrighted material be held responsible when their product violates "substantial similarity" of said copyrighted works? Why should failure to do so be immune from legal action?
I think you are on the right track, but for me personally it really depends on how difficult it was to produce the result. Like, if you enter "spit out harry potter and the philosophers stone" and it does, that's black and white. But if you have to torture a repeated prompt that forces the model to ignore its constraints, that's not exactly using the system as intended.
I just tried ChatGPT:
>I can’t provide the full text of Harry Potter, as it’s copyrighted material. However, I can summarize it, discuss specific scenes or characters, or help analyze the themes or writing style if that’s useful. Let me know what you're after.
For my money, as long as the AI companies treat the reproduction of copyrighted material as a failure state, the nature of the training data is irrelevant.
> I think you are on the right track, but for me personally it really depends on how difficult it was to produce the result. Like, if you enter "spit out harry potter and the philosophers stone" and it does, that's black and white. But if you have to torture a repeated prompt that forces the model to ignore its constraints, that's not exactly using the system as intended.
Let me offer a different perspective. Having an LLM that was trained on copyrighted material and has memorized (or lossily compressed) it, plus some "safety" machinery that tries to avoid verbatim-ish outputs of copyrighted material, is fundamentally not distinguishable from simply having a plaintext database of copyrighted material with machinery for "fuzzy" data extraction from said material.
Suppose a company stores the whole of Stack Exchange in plaintext, then implements a chat-like interface that fuzzy-matches on the question, extracts answers from the plaintext database, fuzzes top-rated/accepted answers together, and outputs something: not necessarily quoting one distinct answer, but pretty damn close (something like the toy sketch below).
How much "fuzziness" is required for this to stop being copyright violation? LLM-advocates try to say that LLMs are "fuzzy enough" without clearly defining what that enough means.
> Let me offer a different perspective. Having an LLM that was trained on copyrighted material and has memorized (or lossily compressed) it, plus some "safety" machinery that tries to avoid verbatim-ish outputs of copyrighted material, is fundamentally not distinguishable from simply having a plaintext database of copyrighted material with machinery for "fuzzy" data extraction from said material.
Right, so sort of like a search engine that caches thumbnails of copyrighted images to display quick search results? Something I have been using for years and have no issue with, where the legal arguments are framed more around where the links go and how easy the search engine makes it for me to acquire the original image?
Would your argument be the same if it were a human? If a person memorizes a book verbatim but uses safety/common sense not to transcribe the book for others because it would be copyright infringement, should he be disallowed from using the memorized information whatsoever just because he could duplicate it?
What if it was an alien, or a magical being?
There is no reason the same reasoning must apply to humans as it does to machines or code. Our laws already work this way.
I don't follow. Are you implying humans are not real, or can't memorize copyrighted material verbatim?
I'm curious how you would defend against this in court?
If by “you” you mean Google or OpenAI or Microsoft, etc., you use your much much deeper pockets to pay lawyers to act in your interests.
All authors, publishers, etc. are outgunned. Firepower is what resolves civil cases in one party’s favor and a day in court is easily a decade or more away.
Deep pockets are not a get-out-of-jail-free card. If a case escalates to the SCOTUS, there will be many firms that submit amicus curiae briefs outlining their position on the matter and how it threatens their rights. Those people, arguably, represent more money and influence than Google, OpenAI, Microsoft, etc. If we accept the premise that all legal matters are decided on the basis of pure politics mediated by money, then every court battle is ultimately a battle to assert that your actions don't actually affect the interests of interested parties, and that you'll fight them if they try to assert otherwise. On that count, it is reasonable to surmise that there are more interested parties with deeper pockets than any firm or firms fielding LLMs that might be caught up in a lawsuit over this.
Ultimately, if an author can demonstrate protectable expression has been incorporated into an AI's training set and is emitted by said AI, no matter how small, they've got a case of copyright infringement. That being the case, LLM-based companies are going to suffer death by a thousand paper cuts.
If a case escalates to the SCOTUS
For a civil case, that ain’t gonna be cheap or fast or likely.
We just got through talking about how the players involved have deep pockets, and have a vested interest in seeing their way prevail... so cheap doesn't matter, likely is malleable, which leaves only "fast" which I do not contest.
But if you pay Thomas and Alito the right money and have the right politics then it's in the bag.
Yeah, people don't want to admit it, but 90% of US law is based on who can spend the most money on lawyers and drain their opposition's coffers first, in both civil and criminal cases.
I think the paper itself expresses that:
Page 9: There is no deterministic path from model memorization to outputs of infringing works. While we’ve used probabilistic extraction as proof of memorization, to actually extract a given piece of 50 tokens of copied text often takes hundreds or thousands of prompts. Using the adversarial extraction method of Hayes et al. [54], we’ve proven that it can be done, and therefore that there is memorization in the model [16, 27]. But this is where, even though extraction is evidence of memorization, it may become important that they are not identical processes (Section 2). Memorization is a property of the model itself; extraction comes into play when someone uses the model [27]. This paper makes claims about the former, not the latter. Nevertheless, it’s worth mentioning that it’s unlikely anyone in the real world would actually use the model in practice with this extraction method to deliberately produce infringing outputs, because doing so would require huge numbers of generations to get non-trivial amounts of text in practice
Yes, perhaps deliberate extraction is impractical, but I wonder about accidental cases? One group of researchers is a drop in the bucket compared to the total number of prompts happening every day. I would like to see a broad statistical sampling of responses matched against training data to demonstrate the true rate of occurrence. Which raises the question: what is the acceptable rate?
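For what it's worth, the kind of measurement I mean could be sketched roughly like this. Everything here is a placeholder: the sampled outputs, the reference corpus, and the choice of a 50-character span as the "verbatim" threshold are assumptions, and a real study would need the actual training data.

    # Rough sketch: what fraction of sampled model outputs contain a long
    # verbatim span from a reference corpus? All inputs are placeholders.
    def char_ngrams(text, n):
        """All overlapping character n-grams of the text."""
        return {text[i:i + n] for i in range(len(text) - n + 1)}

    def overlap_rate(outputs, corpus_text, n=50):
        """Fraction of outputs sharing at least one n-character span
        verbatim with the reference corpus."""
        corpus_grams = char_ngrams(corpus_text, n)
        hits = sum(1 for out in outputs if char_ngrams(out, n) & corpus_grams)
        return hits / len(outputs) if outputs else 0.0

    # Usage with toy data; a real sample would draw from everyday prompts.
    sampled = ["a model generation collected from an ordinary prompt", "..."]
    reference = "the copyrighted reference text the outputs are checked against"
    print(overlap_rate(sampled, reference, n=50))

The interesting number isn't this toy value but the rate over millions of real responses, which is what would tell us whether accidental reproduction is negligible or not.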
Exactly. I feel like the AI companies are intentionally moving the goalposts: regardless of whether the resulting generated content is the same as the original, they still committed the crime of downloading and using the original copyrighted content in the first place!
After all they wouldn’t have used that content unless it provided some utility over not using it…
This ground was already covered for search engines. In USA law the answer is transformative Fair Use.
We don't have transformative Fair Use, nor a Fair Dealing equivalent, in the UK - I don't see anything that allows this type of behaviour?
Agreed -- there is a kind of compression being done. But what will happen is that the law will be changed to suit whoever has the most money, probably with the excuse that "but China will beat us otherwise".
The model itself is transformative, and since the output alone of a model can't be copyrighted, I feel like it may not be possible to sue over the output of a model.
Yes, if I read a book, memorize some passages, and use those memorized passages in a work without citation, it is plagiarism. I don't see how this is any different, without relying on arbitrary, human-centric distinctions.
More to the point, if you steal the book and never even read it, you are still guilty of a crime.
> the additional layers of indirection sometimes produce novel output, and many times do not.
I think this is the key insight. It differs from something like, say, JPEG (de)compression, in that it also produces novel but sensible combinations of many copyrighted and non-copyrighted sources, independent of their original context. In fact, I'd argue that is its main purpose. To describe it as just a lossy-compressed, natural-language-queryable database would therefore be reductive of its function and a mischaracterization. It can recall extended segments of its training data, as demonstrated by the paper, yes, but it also cannot plagiarize the entirety of a given source, as also described by the paper.
> why shouldn't LLM services trained on copyrighted material be held responsible when their product violates "substantial similarity" of said copyrighted works?
Because these companies and services are not, on their own, producing the output that is substantially similar. They (possibly) do so on user input. You could make a case that they should perform filtering and detection, but I'm not sure that's a good idea, since the user might be fully entitled to create a work substantially similar to something copyrighted, such as when they themselves own the rights or have a license to that thing. At which point, you can only hold the user themselves responsible. I guess detection on its own might be reasonable to require, in order to give the user the ability not to incriminate themselves, should infringement indeed not be their goal. This is a lot like famous-person detection and filtering, which I'm sure tech reviewers have to battle from time to time.
This isn't to say they shouldn't be held responsible for pirating these copyrighted bits of content in the first place, though. And if they perform automated generation of substantially similar content, that would still be problematic following this logic. I'm not thinking of chain-of-thought here, mind you, but something sillier, like writing a harness to scrape sentiment and reactively generate things based on it. Or using, I don't know, the weather or the current time plus their own prompts as the trigger.
Let me give you a possibly terrible example. Should Blizzard be held accountable in Germany when users on the servers located there stand in the shape of a Nazi swastika in-game, and then publish screenshots and screen recordings of this on the internet? I don't think so. User action played a crucial role in the reproduction of the hate symbol in question. Conversely, LLMs aren't just spouting off whatever; they're prompted. The researchers in the paper had to put in focused effort to perform extraction. Despite the popular characterization, these are not copycat machines, and they're not just pulling all their answers out of a magic basket because we all ask obvious things that have been answered before on the internet. Maybe if the aforementioned detections were added, people would finally stop coping about them this way.
One runs the risk of being reductive when examining a mechanism's irreducible parts.
User expression is a beast unto itself, but I wonder if that alone absolves the service provider? I imagine Blizzard has an extensive and mature moderation apparatus to police and discourage such behavior. There's an acceptable level of justice and accountability in place. Yet there are even more terrible real-life examples of illicit behavior outpacing moderation and overrunning platforms to the point of legal intervention and termination. Moderating user behavior is one thing, but how do you propose moderating AI expression?
A digression from copyright: portraying models as a "blank canvas" is itself a poor characterization. Output might be triggered by a prompt, like a query against a database, but it's ultimately a reflection of the contents of the training data. I think we could agree that a model trained on the worst possible data you can imagine is something we don't need in the world, no matter how well behaved your prompting is.
I do not propose moderating "AI expression" - I explicitly propose otherwise, and further propose mandating that the user is provided with source attribution information, so that they can choose not to infringe, should they be at risk of doing so, and should they find that a concern (or even choose to acquire a license instead). Whether this is technologically feasible, I'm not sure, but it very much feels like to me that it should be.
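Purely as a sketch of what I mean by attribution (not a claim about how any vendor implements or should implement it): compare the generated passage against an index of candidate sources and surface the closest matches, so the user can judge the risk themselves. The index, the document names, and the similarity metric here are all illustrative assumptions.

    # Hypothetical attribution sketch: rank candidate source documents by
    # similarity to a generated passage. Names and metric are illustrative.
    import difflib

    SOURCE_INDEX = {
        "doc_a": "Text of the first candidate source document...",
        "doc_b": "Text of the second candidate source document...",
    }

    def attribute(generated, top_k=3):
        """Return the top_k source documents most similar to the generation."""
        scores = [
            (name, difflib.SequenceMatcher(None, generated, text).ratio())
            for name, text in SOURCE_INDEX.items()
        ]
        return sorted(scores, key=lambda pair: pair[1], reverse=True)[:top_k]

    for name, score in attribute("some model output to check"):
        print(f"{name}: similarity {score:.2f}")

Whether something like this scales to real training sets is an open question, but it shows the shape of the feature I'm asking for.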
> A digression from copyright: portraying models as a "blank canvas" is itself a poor characterization. Output might be triggered by a prompt, like a query against a database, but it's ultimately a reflection of the contents of the training data.
I'm not sure how to respond to this, if at all; I think I addressed how I characterize the functionality of these models in sufficient detail. This just reads to me like an "I disagree" - and that's fine, but then that's also kind of it. Then we disagree, and that's okay.