You do the work. Big Tech takes the model.

Community Article · Published May 11, 2026

Let's change that.

The AI industry depends on two categories of human work it prefers to rename.

The first category is the work that already existed. It includes novels, journalism, scientific papers, code repositories, academic textbooks, forum posts, lyrics, legal research, encyclopedias, personal essays, tutorials, art criticism, and the millions of pages people built and published on the internet. This work trains base models.

The second category is the work created after the scrape. It includes annotation, ranking, red-teaming, moderation, preference labeling, instruction writing, expert feedback, and exposure to the worst material the internet contains. This work turns a raw base model into something a customer can use and a company can sell.

The industry calls the first category "publicly available data." It calls the second "human feedback." Both labels hide the issue. Public access is not permission. Human feedback is not automatically fair pay, informed consent, psychological safety, or transparent labor.

Urro's position is simple. Modern AI has been built by converting other people's work into private model capability. The source material is renamed "data." The people who clean, rank, and repair the model are hidden behind contractors. The finished system is then treated as proprietary property.

This is not only a story about instruction tuning. Instruction tuning makes the contradiction easier to see because workers are closer to the surface. They write, rank, moderate, and repair model behavior. But the deeper problem begins earlier. Base models are trained on mass collections of human-created text, much of it scraped without permission, and then post-trained into products that inherit the same ethical debt.

Urro is building toward a general-purpose, instruction-tuned model. That link matters. But the standard has to apply to the full pipeline, including pretraining, post-training, evaluation, distillation, and deployment.

The evidence is no longer circumstantial

The AI industry's copyright posture used to depend on ambiguity. Companies scraped the web, called it fair use, and argued that model training was transformative enough to require no permission from the people whose work was transformed.

Discovery has made that position harder to defend.

Meta is the clearest example. In Kadrey v. Meta, court records showed that Meta downloaded books from shadow libraries, including LibGen and Anna's Archive. Reporting by Cybernews, Wired, and Rolling Stone described internal records alleging at least 81.7 terabytes of pirated data, escalation to Mark Zuckerberg, and approval to use LibGen for Llama 3.

A newer publisher lawsuit filed on 5 May 2026 goes further. Elsevier, Cengage, Hachette, Macmillan, McGraw Hill, and Scott Turow allege that Meta and Zuckerberg used millions of copyrighted books, journal articles, and other works to train Llama, describing the conduct as "one of the most massive infringements of copyrighted materials in history" (publisher complaint, Reuters, AP). The complaint alleges that Zuckerberg personally authorized the conduct. Meta says it will fight the case and argues that AI training can fall under fair use.

The point is not that every allegation has already been proven. The point is that the industry's training data story has moved from suspicion to documents, depositions, complaints, and judicial orders.

Anthropic's record is also revealing. In Bartz v. Anthropic, Judge William Alsup ruled that training on lawfully acquired books could qualify as fair use, but he separated that question from Anthropic's acquisition and retention of pirated copies. He wrote that buying a lawful copy later would not absolve the company of having stolen the earlier copy. Anthropic later agreed to a proposed $1.5 billion settlement covering hundreds of thousands of books, widely described as the largest copyright settlement of its kind (NPR, AP, Kluwer Copyright Blog).

Anthropic's book settlement did not end its copyright exposure. Music publishers, including Universal Music Group, Concord, and ABKCO, have pursued separate claims alleging unauthorized use of song lyrics and other works. In 2026, those publishers sought more than $3 billion and asked the court to rule that Anthropic infringed their copyrights (TechCrunch, Music Business Worldwide, CourtListener).

OpenAI faces a parallel record. In The New York Times v. OpenAI, the Times alleges that OpenAI and Microsoft used its journalism without authorization to train and operate AI products. Discovery disputes have covered deleted datasets, preserved outputs, and large sets of ChatGPT conversations (NPR, TechCrunch, Hollywood Reporter, NYSD opinion). The statutory risk is large because U.S. copyright law can allow enhanced statutory damages of up to $150,000 per work for willful infringement.

The first major fair use rejection in an AI training case came from Thomson Reuters v. Ross Intelligence. On 11 February 2025, Judge Stephanos Bibas held that Ross's use of Westlaw headnotes to train a competing legal research tool was not fair use (court opinion, Loeb & Loeb, Davis Wright Tremaine). Ross was not a generative AI system, so the ruling does not settle every LLM case. It does show that AI training is not outside copyright law.

By May 2026, AI copyright litigation had become a broad legal field, not a fringe complaint. The Copyright Alliance and independent trackers such as chatgptiseatingtheworld.com have documented dozens of active cases. The common question is no longer whether AI companies used copyrighted material. The harder question is which uses courts will tolerate, which uses require licensing, and which uses were piracy all along.

The industry chose cost over consent

The AI industry's fallback argument is necessity. The claim is that web-scale text is essential to language understanding, that licensing at that scale is impossible, and that no modern language model can be built without unlicensed web corpora.

That claim is increasingly hard to defend.

On 5 June 2025, a 27-author research team released The Common Pile v0.1, an 8-terabyte corpus of public-domain and openly licensed text. The dataset includes books, code, papers, encyclopedias, educational material, transcripts, and other sources. The team trained two 7-billion-parameter base models, Comma v0.1-1T and Comma v0.1-2T, and reported competitive performance against budget-matched models trained on unlicensed data. The Comma v0.1-2T model card describes a 7B base model trained on 2T tokens from the Common Pile.

The Common Pile does not solve every problem. It is a base-model dataset. It also carries license-metadata uncertainty and is subject to the usual practical limits of large-scale data collection. But it proves an important technical point. Capable base models can be trained without treating the whole internet as free input.

Other efforts point in the same direction. Common Corpus assembled roughly two trillion tokens from uncopyrighted or permissively licensed sources. KL3M focuses on copyright-clean training resources, much of it from U.S. federal public-domain material. C4C is cited in the Common Corpus report as a 228-billion-token open-license corpus. These projects are not enough on their own to replace every commercial training pipeline. They are enough to show that the impossibility argument is false.

This matters for IBM Granite. Granite looked, at first, like a serious candidate for a cleaner model family. IBM emphasizes enterprise governance, permissive model licenses, and data screening. But Granite 3.1's model card says its base model was trained on a mix of open-source and proprietary data from domains including web, code, academic sources, books, and math (Granite 3.1 model card). IBM's Granite 3.0 report names FineWeb and DCLM as examples in its web pipeline (Granite 3.0 report). FineWeb is derived from Common Crawl (FineWeb dataset card, FineWeb paper).

Common Crawl is not a license. Its terms of use require users to comply with applicable law and recognize that crawled content may be subject to separate rights. An archive of the web does not become a rights-cleared dataset just because it is convenient for pretraining. Common Crawl may be useful infrastructure. It is not consent from the authors, publishers, developers, teachers, forum users, or artists whose work appears inside it.

IBM researchers represented to Urro that web corpora are required for basic language understanding. Urro disagrees. Comma 7B is one counterexample. More broadly, the licensed-data field is young, underfunded, and already producing usable base models. The real distinction is not possible versus impossible. It is cheaper versus harder.

Licensing markets also exist. The Financial Times announced a licensing agreement with OpenAI on 29 April 2024 (Financial Times, Reuters). Vox Media and The Atlantic announced OpenAI licensing deals on 29 May 2024 (Variety, VentureBeat). Reddit licensed data to Google in a deal reportedly worth about $60 million annually (Reuters). Shutterstock generated substantial AI licensing revenue (Bloomberg, PetaPixel).

These deals are imperfect. Licensing a platform's data does not automatically compensate every user whose work made that platform valuable. But the deals show that payment is possible. Big companies license when the counterparty has enough power to force the issue. They call permission impossible when the rights holders are fragmented, individual, poor, or unlikely to sue.

Common Crawl became the cover story

"Publicly available" became the industry's favorite phrase because it sounds factual and harmless. A page can be publicly reachable and still copyrighted. A forum post can be crawlable and still unlicensed for commercial model training. A PDF can be indexed by a web crawler and still belong to its author or publisher.

Common Crawl sits at the center of that ambiguity. It is often presented as a neutral research archive, and in many contexts it is. The issue is what AI companies do with it. When a model developer says it used Common Crawl, it has not answered the licensing question. It has named the acquisition channel.

Researchers have warned about this problem directly. A FAccT paper on Common Crawl described it as a major source for generative AI training data and analyzed the legal, representational, and governance problems that follow from using it at scale (ACM FAccT). News publishers have also challenged Common Crawl's role in commercial AI pipelines (News Media Alliance).

The EU AI Act is beginning to make this harder to hide. The European Commission released a mandatory template for public summaries of general-purpose AI training data in 2025 (European Commission, WilmerHale, Plume Law). The point of these summaries is basic accountability. "Web data" is not enough. Providers need to describe data types, sources, and collection methods.

Transparency is still weak. Stanford's 2025 Foundation Model Transparency Index found that disclosure remains uneven and, in several areas, declining (Stanford Report, FMTI paper). That is the problem. The companies building systems that affect work, search, education, law, and culture still disclose less about their data than a food label discloses about cereal.

The other hidden dataset is labor

Copyright is the easier half to document. It leaves files, licenses, docket entries, discovery disputes, and settlement records. Labor is harder to see because the system is designed to keep workers separated from the companies that benefit.

Modern AI systems are not built from pretraining data alone. They require annotation, preference ranking, safety labeling, red-teaming, expert review, and moderation. The industry calls this human feedback. Workers describe unstable contracts, low pay, exposure to disturbing content, and silence about the ultimate client.

The most cited example remains OpenAI's use of workers in Kenya through Sama. A TIME investigation published on 18 January 2023 reported that workers were paid less than $2 per hour to label toxic content, including graphic sexual violence, child abuse, bestiality, murder, suicide, and torture, for systems connected to ChatGPT safety (TIME, Business & Human Rights Resource Centre, Vice). The Guardian later reported that Kenyan moderators described inadequate warning and insufficient psychological support (The Guardian).

This pattern did not stay in the Global South. A UNI Global Union report on U.S. AI data workers found that 86% worried about meeting basic needs, one quarter relied on public assistance, and median annual earnings were roughly $22,620 (UNI Global Union). The transcript of Karen Hao's investigation (video below) describes workers connected to major AI systems, including ChatGPT and Gemini, who reported unstable contracts, project hopping, sudden pay cuts, and exposure to violent AI-generated material (YouTube source).

Scale AI and its related platforms show how the model works. The company has faced multiple wage and misclassification lawsuits, including claims over unpaid training time, minimum wage violations, and overtime (TechCrunch, Computerworld, AlgorithmWatch). Reports also described a U.S. Department of Labor investigation into possible wage violations (opentools.ai).

The larger labor chain remains opaque. SOMO reported on 31 March 2026 that Amazon, Google, and Meta refused in 2025 to disclose which human annotation services they used to develop AI models, making working conditions difficult to assess (SOMO). Privacy International has documented the role of data labelers behind powerful LLM training datasets (Privacy International). The Institute for Human Rights and Business describes content moderation as a new factory floor of exploitation (IHRB).

The pattern is consistent. AI companies need human judgment. They acquire it through contractors. Workers often do not know who the final client is. The client gets model improvement. The worker gets a task queue, a nondisclosure agreement, and the risk.

Video: Karen Hao's investigation into hidden AI data workers.
Video: 60 Minutes on Kenyan workers reviewing traumatic AI training content.

The distillation double standard

AI companies know that model outputs have value. Their terms of service prove it.

OpenAI's terms prohibit users from automatically extracting output and from using output to develop models that compete with OpenAI (OpenAI terms). Anthropic's help center says Claude outputs may not be used to train models competitive with Anthropic, including general-purpose chatbots and open-ended text generation models (Claude Help Center).

Anthropic has made this a public issue. On 23 February 2026, it said it had detected "industrial-scale" distillation campaigns by DeepSeek, Moonshot, and MiniMax, involving more than 16 million Claude exchanges through about 24,000 fraudulent accounts (Anthropic, Reuters). Anthropic described self-distillation as legitimate when a company distills its own models, while describing competitor extraction as illicit.

There is a real contractual issue here. Fraudulent accounts, evasion, and breach of service terms are not defensible practices. The hypocrisy is elsewhere.

The same companies that treat their outputs as protected assets trained on human works whose owners were not asked. They prohibit competitors from using their model behavior as training signal, while arguing that authors, artists, publishers, developers, and forum users cannot make the same objection to their own work.

This is not a principled theory of intellectual labor. It is a theory of ownership that starts when the model company becomes the owner.

That is why self-distillation matters for Urro. If a model is trained on a licensed foundation, its own outputs can become part of a cleaner improvement loop. If the seed model is polluted, self-distillation can reproduce the pollution. The technique is not ethical by itself. Its value depends on the provenance boundary around the system.

Recent work makes this more than a slogan. Self-Distilled Reasoner studies on-policy self-distillation, where a model learns from its own rollouts under different information conditions. UniSD proposes a unified self-distillation framework for large language models. Skill-SD applies skill-conditioned self-distillation to multi-turn agents. There are also cautionary results. A 2026 paper asks why self-distillation can degrade reasoning and reports that it can suppress uncertainty and harm out-of-distribution performance (Hugging Face Papers).

Self-distillation is not a shortcut around ethics. It is a way to reduce dependence on proprietary teachers, scraped instruction traces, and large-scale new labeling once the clean foundation exists.
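
To make the loop concrete, here is a minimal sketch of the harvesting step: a model generates responses to prompts drawn from permissioned sources, a filter keeps the usable ones, and the surviving pairs are written out as supervised fine-tuning data for the same model. This is an illustration of the general technique, not Urro's actual pipeline; the model name, prompts, and `keep()` filter below are hypothetical placeholders, and a real loop would add deduplication, decontamination, and human review inside the same provenance boundary.

```python
# Minimal self-distillation harvesting sketch (illustrative assumptions only:
# "your-org/licensed-base-7b", the prompts, and keep() are placeholders).
import json
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "your-org/licensed-base-7b"  # hypothetical licensed foundation model

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, torch_dtype=torch.bfloat16)
model.eval()

# Prompts drawn from the same licensed or public-domain sources as pretraining.
prompts = [
    "Summarize the argument of this public-domain essay: ...",
    "Explain the following permissively licensed code snippet: ...",
]

def keep(response: str) -> bool:
    # Placeholder quality gate; a real loop would use task-specific checks
    # or human review rather than a length heuristic.
    return len(response.split()) > 20

records = []
for prompt in prompts:
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        output_ids = model.generate(
            **inputs, max_new_tokens=256, do_sample=True, temperature=0.8
        )
    # Strip the prompt tokens so only the model's own continuation is kept.
    response = tokenizer.decode(
        output_ids[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
    )
    if keep(response):
        records.append({"prompt": prompt, "response": response})

# The filtered pairs become supervised fine-tuning data for the same model,
# so the improvement loop never leaves the licensed foundation.
with open("self_distillation_sft.jsonl", "w") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")
```

The sketch stops at writing a JSONL file rather than fine-tuning inline; the point is the boundary, not the trainer. Whatever supervised fine-tuning setup then consumes that file, every token in it traces back to the licensed foundation and its own outputs.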

Talkie-1930 proves one point and exposes another

Urro's internal study found that, as of 2025, there was no public general-purpose, instruction-tuned LLM trained only on licensed, public-domain, or otherwise clearly permissioned data.

As of 2026, the closest public contender may be talkie-1930-13b-it. The model card describes it as a 13B instruction-tuned model post-trained from talkie-1930-13b-base, which was trained on 260B tokens of pre-1931 English text. The project says the instruction-tuned model avoids modern chat transcripts and instruction-tuning data, using pre-1931 reference works instead (talkie project, GitHub).

That is important. Talkie shows that a clean instruction-tuned model is not impossible. But Talkie is intentionally niche. It is historical, pre-modern, and not a contemporary general-purpose assistant. Its value is proof of concept, not replacement.

That leaves the gap Urro cares about. The world still lacks a modern, useful, general-purpose instruction-tuned model whose full pipeline is built on licensed data and defensible labor practices.

What the standard should be

The AI industry does not need more vague moral branding. It needs operational standards.

Do not train on work you do not have the right to use. Public access is not a license. A web crawl is not consent.

Do not hide data acquisition behind neutral infrastructure. Common Crawl names a source channel, not a rights-cleared corpus.

Do not claim necessity when alternatives exist. Common Pile, Common Corpus, KL3M, C4C, publisher licensing deals, and public-domain partnerships all show that cleaner paths exist.

Do not call contractors "human feedback" when they are workers doing model labor. They should know the client, the task, the risks, the pay, and the use of their work.

Do not outsource trauma. A safety pipeline that depends on low-paid workers reading abuse, gore, and sexual violence is not morally clean because the finished chatbot is polite.

Do not prohibit competitors from distilling your outputs while treating everyone else's work as a free training substrate.

Do not call this inevitable. It is a business choice.

What Urro is building

Urro is an AI research organization working toward a different model pipeline. The goal is not an open-weight model with a vague data story. The goal is a modern, general-purpose, instruction-tuned model whose training data, structured tasks, feedback data, evaluations, and improvement loops stay inside a defensible provenance boundary.

That means using author-licensed, public-domain, and compatible-license content. It means excluding incompatible noncommercial and no-derivatives material when the intended use or release terms require it. It means treating pretraining and post-training as one ethical chain rather than separate branding layers.

It also means treating labor as part of the model's provenance. If humans write, label, rank, review, or red-team training data, they are not an invisible implementation detail. They are workers. They should be paid fairly, informed about the work, protected from avoidable harm, and allowed to decline harmful tasks without punishment.

Self-distillation is part of this strategy because it can amplify a clean foundation without constantly returning to the same extractive sources. It cannot make a dirty model clean. It can help a clean model improve without copying the industry's worst habits.

The alternative is not easy. It requires more curation, slower data acquisition, better documentation, narrower licensing decisions, and more expensive labor. That is the point. Ethical AI is not the cheaper version of exploitative AI. It is a different production model.

Progress does not come from scale alone. It comes from people.

People should not have to surrender their work, wages, or psychological wellbeing so that Big Tech can take the model.


The point here is simple.
This is not only about legality.
It is about the basic moral floor any decent industry should meet.

Do not steal people's work.
Do not hide the workers.
Do not outsource the harm.

Do not turn other people's lives, writing, judgment, and trauma into private infrastructure, then call it innovation.

Don't steal people's shit.
It was wrong before AI. It is still wrong now.

Can we at least agree on that?


This article was written by a human.


Sources

Copyright litigation and training data

Licensed and public-domain data alternatives

Common Crawl, Granite, and model transparency

Annotation labor and hidden AI work

Distillation and self-distillation

Talkie-1930
