new

Get trending papers in your email inbox!

Subscribe

Daily Papers

byAK and the research community

Mar 19

Hidden Licensing Risks in the LLMware Ecosystem

Large Language Models (LLMs) are increasingly integrated into software systems, giving rise to a new class of systems referred to as LLMware. Beyond traditional source-code components, LLMware embeds or interacts with LLMs that depend on other models and datasets, forming complex supply chains across open-source software (OSS), models, and datasets. However, licensing issues emerging from these intertwined dependencies remain largely unexplored. Leveraging GitHub and Hugging Face, we curate a large-scale dataset capturing LLMware supply chains, including 12,180 OSS repositories, 3,988 LLMs, and 708 datasets. Our analysis reveals that license distributions in LLMware differ substantially from traditional OSS ecosystems. We further examine license-related discussions and find that license selection and maintenance are the dominant concerns, accounting for 84% of cases. To understand incompatibility risks, we analyze license conflicts along supply chains and evaluate state-of-the-art detection approaches, which achieve only 58% and 76% F1 scores in this setting. Motivated by these limitations, we propose LiAgent, an LLM-based agent framework for ecosystem-level license compatibility analysis. LiAgent achieves an F1 score of 87%, improving performance by 14 percentage points over prior methods. We reported 60 incompatibility issues detected by LiAgent, 11 of which have been confirmed by developers. Notably, two conflicted LLMs have over 107 million and 5 million downloads on Hugging Face, respectively, indicating potentially widespread downstream impact. We conclude with implications and recommendations to support the sustainable growth of the LLMware ecosystem.

  • 8 authors
·
Feb 11

Anatomy of a Machine Learning Ecosystem: 2 Million Models on Hugging Face

Many have observed that the development and deployment of generative machine learning (ML) and artificial intelligence (AI) models follow a distinctive pattern in which pre-trained models are adapted and fine-tuned for specific downstream tasks. However, there is limited empirical work that examines the structure of these interactions. This paper analyzes 1.86 million models on Hugging Face, a leading peer production platform for model development. Our study of model family trees -- networks that connect fine-tuned models to their base or parent -- reveals sprawling fine-tuning lineages that vary widely in size and structure. Using an evolutionary biology lens to study ML models, we use model metadata and model cards to measure the genetic similarity and mutation of traits over model families. We find that models tend to exhibit a family resemblance, meaning their genetic markers and traits exhibit more overlap when they belong to the same model family. However, these similarities depart in certain ways from standard models of asexual reproduction, because mutations are fast and directed, such that two `sibling' models tend to exhibit more similarity than parent/child pairs. Further analysis of the directional drifts of these mutations reveals qualitative insights about the open machine learning ecosystem: Licenses counter-intuitively drift from restrictive, commercial licenses towards permissive or copyleft licenses, often in violation of upstream license's terms; models evolve from multi-lingual compatibility towards english-only compatibility; and model cards reduce in length and standardize by turning, more often, to templates and automatically generated text. Overall, this work takes a step toward an empirically grounded understanding of model fine-tuning and suggests that ecological models and methods can yield novel scientific insights.

  • 3 authors
·
Aug 9, 2025 4

LiCoEval: Evaluating LLMs on License Compliance in Code Generation

Recent advances in Large Language Models (LLMs) have revolutionized code generation, leading to widespread adoption of AI coding tools by developers. However, LLMs can generate license-protected code without providing the necessary license information, leading to potential intellectual property violations during software production. This paper addresses the critical, yet underexplored, issue of license compliance in LLM-generated code by establishing a benchmark to evaluate the ability of LLMs to provide accurate license information for their generated code. To establish this benchmark, we conduct an empirical study to identify a reasonable standard for "striking similarity" that excludes the possibility of independent creation, indicating a copy relationship between the LLM output and certain open-source code. Based on this standard, we propose LiCoEval, to evaluate the license compliance capabilities of LLMs, i.e., the ability to provide accurate license or copyright information when they generate code with striking similarity to already existing copyrighted code. Using LiCoEval, we evaluate 14 popular LLMs, finding that even top-performing LLMs produce a non-negligible proportion (0.88% to 2.01%) of code strikingly similar to existing open-source implementations. Notably, most LLMs fail to provide accurate license information, particularly for code under copyleft licenses. These findings underscore the urgent need to enhance LLM compliance capabilities in code generation tasks. Our study provides a foundation for future research and development to improve license compliance in AI-assisted software development, contributing to both the protection of open-source software copyrights and the mitigation of legal risks for LLM users.

  • 4 authors
·
Aug 5, 2024