Marin is not open source: trained on CC BY-NC data

#1
by JLouisBiz - opened

I'd like to offer a factual correction on the licensing claims.

The free software movement was started by Richard Stallman in the 1980s with the GNU Project and the Free Software Foundation. The four essential freedoms are: run the program for any purpose, study and change it (requiring source code), redistribute copies, and distribute modified versions. The term "open source" was coined in 1998, and the Open Source Initiative was founded that year to pitch the same freedoms to businesses. The Open Source AI Definition 1.0, published by OSI, makes clear what constitutes an open source AI system.

Marin claims to be "open-source foundation models" and "the best open-source model" in their marketing. Yet the model was trained on facebook/natural_reasoning, which is licensed CC BY-NC — non-commercial. That alone disqualifies it from being open source. A model derived from non-commercially licensed data cannot be freely used commercially, regardless of how small that dataset fraction is.

I needed less than one minute to find this. For a project to make such confident "open source" claims while training on NC-licensed data shows reckless disregard for the free software definition. It misleads developers, creates legal risk for downstream users, and dilutes the meaning of a term that took decades to establish.

Organizations that recklessly misuse "open source" just to attract users should adopt proper guidelines. The OSI's Open Source AI Definition (https://opensource.org/ai/open-source-ai-definition) is the standard. If you're not meeting it, don't claim you are.

This isn't pedantry. It's accountability to the freedom that GNU started. It's sad to see otherwise.

The Marin Project org
edited 15 days ago

The model itself is released under the Apache 2.0 license, which is OSI-approved (https://opensource.org/license/apache-2.0). Natural Reasoning is released under a CC BY-NC license, without SA or ND restrictions. To the best of my knowledge, and based on the published guidance from Creative Commons, this is within the terms of the license.

That being said, the term open-source in AI certainly is a bit noisy. As far as I know, the only open model (and a great one!) which fully omits CC-by-NC data is https://huggingface.co/common-pile/comma-v0.1-2t and I'd highly suggest it if this is a concern for you or your org.

As an aside, a Pangram check suggests your comment here is entirely AI-generated. While we welcome discussion and use AI ourselves, it's generally good practice to flag agent-generated content up front! We personally include instructions in our Agents.md for Marin telling our agents to flag themselves as 🤖 in any GitHub discussions.

WillHeld changed discussion status to closed

Dismissing my comment with speculation about machine learning generation, rather than addressing the licensing facts, only confirms that I've found a genuine weak point in your "open source" claims. It also suggests the project does not truly prioritize user freedom — a dangerous stance in the LLM space.

Let me be clear: I am systematically tracking projects that market themselves as "open source" while undermining the freedoms established by the GNU Project and the Free Software Foundation since the 1980s. Your project is now on that list, unless you correct course.

CC BY‑NC explicitly prohibits commercial use. Training a model — even one released under Apache 2.0 — on NC-licensed data is itself a commercial use of that data. The Creative Commons FAQ states that "NonCommercial" means not "primarily intended for or directed toward commercial advantage." Many legal experts and open source advocates would argue that training a model for business or developer use clearly qualifies.

Taking works that are specifically copyrighted to prevent users' freedom does not entitle you to change their license or republish them under different terms. Yes, these are complex legal questions still awaiting court rulings. But responsible projects do not hide behind that uncertainty — they proactively follow the spirit and letter of free software.

Also note: accusing my comment of being generated by an LLM is an ad hominem attack. It does not rebut a single licensing fact. Even if the text were machine-assisted, the core factual claim remains entirely valid: training on CC BY-NC data disqualifies a model from being called open source under OSI's Open Source AI Definition.

I originally recommended your model to other organizations because you marketed it as truly open source. When I came to verify those claims, I found them misleading. Your reaction — closing the discussion instead of engaging — does not reflect healthy organizational policies in the social environment around free software.

Proposal:

I invite you to hire or consult a lawyer who specializes in open source and machine learning licensing — someone who understands the complexity, respects the history of software freedom, and is welcoming to community feedback. A lawyer who recognizes that a long‑term free software user pointing out a problem is not an attack, but a contribution.

If you genuinely want to build trust, do not dismiss contributions. Correct your marketing or change your data sources. Until then, I cannot in good conscience recommend Marin as an open source model.

This is not pedantry. It is accountability to the freedoms that GNU started. It is sad to see otherwise — but it is not too late to fix.

thanks for the insight 4o