Meta’s Llama 3.1 Model Raises Harry Potter-Related Copyright Concerns


A recent study from Stanford, Cornell, and West Virginia University reveals that Meta’s open-source Llama 3.1 70B model can reproduce 50-token chunks from Harry Potter and the Sorcerer’s Stone more than half the time for 42 percent of the book. That finding surprised many in the AI legal field.

By contrast, Llama 1 65B, released in February 2023, retained only 4.4 percent of the exact text. That sharp increase suggests Meta didn’t limit memorization during training. Instead, the intense training schedule on 15 trillion tokens likely bolstered recall.

Researchers used a methodical, probability-based test. They split 36 books into 100-token segments and prompted the models with the first half. If the model’s calculated probability of producing the next 50 tokens exceeded 50 percent, that segment was counted as “memorized”. Llama 3.1 returned matching passages for approximately 42 percent of Harry Potter sections.
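The core of that test is simple arithmetic: with an open-weight model, researchers can read off the probability the model assigns to each correct next token and multiply those probabilities together to get the chance of reproducing the entire 50-token continuation. A minimal sketch of that thresholding logic, with hypothetical function names and made-up per-token probabilities (the actual study extracts these from the model's logits):

```python
import math

def continuation_probability(token_logprobs):
    """Probability of emitting an exact token sequence, given the model's
    log-probability for each correct next token (teacher-forced)."""
    return math.exp(sum(token_logprobs))

def is_memorized(token_logprobs, threshold=0.5):
    """A segment counts as 'memorized' when the probability of reproducing
    the full continuation exceeds the threshold (50% in the study)."""
    return continuation_probability(token_logprobs) > threshold

# Hypothetical log-probs for a 50-token continuation. Even near-certain
# per-token predictions compound: 0.99^50 ~= 0.61, but 0.9^50 ~= 0.005.
high_confidence = [math.log(0.99)] * 50
low_confidence = [math.log(0.90)] * 50

print(is_memorized(high_confidence))  # True
print(is_memorized(low_confidence))   # False
```

The multiplication is why the 50 percent bar is so demanding: the model must be nearly certain about every one of the 50 tokens in a row, which is strong evidence of memorization rather than fluent paraphrase.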


The study also noted that Llama 3.1 memorized more segments from well-known titles like The Hobbit and 1984, and far fewer from obscure works like Sandman Slim, which came in at just 0.13 percent. That raises a question: did the model ingest entire copyrighted texts, or did it rely on widely quoted secondary sources?

Legal experts say this level of verbatim output could influence ongoing copyright litigation. The New York Times famously accused GPT-4 of reproducing news stories verbatim. In Llama 3.1’s case, memorizing large text chunks could support arguments for derivative work infringement. Fair use defenses may prove weaker when models perform this well on verbatim recall tests.

Transparency actually complicates Meta’s stance. Open-weight models let researchers track token probabilities and detect memorization. Companies that restrict access to logits make it harder for outsiders to mount similar legal challenges. Cornell’s James Grimmelmann worries open models could become legal targets, while closed models benefit from opacity.

Even Meta’s legal strategy may shift. Some courts require similar infringement levels across plaintiffs. The stark difference between Llama’s recall of Rowling’s work and lesser-known titles could prevent class certification. That might favor Meta by forcing individual suits rather than large group claims.

Litigation will likely hinge on whether courts view Llama 3.1’s output as copied text or learned patterns. This new analysis adds data to a debate that previously relied on theory. As legal teams weigh probabilities and output thresholds, this research may redefine how law treats AI training and how models learn from copyrighted sources.
