AI for Scientific Discovery Won't Work Without Fixing How We Collaborate.
My co-author @cgeorgiaw and I just published a paper challenging a core assumption: that the main barriers to AI in science are technical. They're not. They're social.
Key findings:
- The "AI Scientist" myth delays progress: Waiting for AGI devalues human expertise and obscures science's real purpose, which is cultivating understanding, not just producing outputs.
- Wrong incentives: Datasets have 100x longer impact than models, yet data curation is undervalued.
- Broken collaboration: Domain scientists want understanding. ML researchers optimize performance. Without shared language, projects fail.
- Fragmentation costs years: Harmonizing just 9 cancer files took 329 hours.
Why this matters: Upstream bottlenecks like efficient PDE solvers could accelerate discovery across multiple sciences. CASP mobilized a community around protein structure, enabling AlphaFold. We need this for dozens of challenges.
Thus, we're launching Hugging Science! A global community addressing these barriers through collaborative challenges, open toolkits, education, and community-owned infrastructure. Please find all the links below!
Smol course has a distinctive approach to teaching post-training, so I'm posting about how it's different from other post-training courses, including the LLM course that's already available.
In short, the smol course is more direct than any of the other courses and is intended for semi-pro post-trainers.
- It's a minimal set of instructions on the core parts.
- It's intended to bootstrap real projects you're working on.
- The material hands over to existing documentation for details.
- Likewise, it hands over to the LLM course for basics.
- Assessment is based on a leaderboard, without requiring you to read all the material.
The course builds on smol course v1, which was the fastest way to learn to train your custom AI models. It now has:
- A leaderboard for students to submit models to
- Certification based on exams and leaderboards
- Prizes based on leaderboards
- Up-to-date content on TRL and SmolLM3
- Deep integration with the Hub's compute for model training and evaluation
We will release chapters every few weeks, so you can follow the org to stay updated.
The open source AI community is just made of people who are passionate and care about their work. So we thought it would be cool to share our favourite icons of the community with a fun award.
Winners get free Hugging Face Pro subscriptions, merchandise, or compute credits for the Hub.
This is a new initiative to recognise and celebrate the incredible work being done by community members. It's all about inspiring more collaboration and innovation in the world of machine learning and AI.
They're highlighting contributors in four key areas:
- model creators: building and sharing innovative and state-of-the-art models.
- educators: sharing knowledge through posts, articles, demos, and events.
- tool builders: creating the libraries, frameworks, and applications that we all use.
- community champions: supporting and mentoring others in forums.
Know someone who deserves recognition? Nominate them by opening a post in the Hugging Face community forum.
New blog post alert! "What is the Hugging Face Community Building?", with @yjernite and @irenesolaiman
What 1.8 Million Models Reveal About Open Source Innovation: Our latest deep dive into the Hugging Face Hub reveals patterns that challenge conventional AI narratives:
Models become platforms for innovation: Qwen, Llama, and Gemma models have spawned entire ecosystems of specialized variants. Looking at derivative works shows community adoption better than any single metric.
Datasets reveal the foundation layer:
- The most downloaded datasets are evaluation benchmarks (MMLU, SQuAD, GLUE)
- Universities and research institutions dominate foundational data
- Domain-specific datasets thrive across finance, healthcare, robotics, and science
- Open actors provide the datasets that power most AI development
Research institutions lead the charge: AI2 (Allen Institute) emerges as one of the most active contributors, alongside significant activity from IBM, NVIDIA, and international organizations. The open source ecosystem spans far beyond Big Tech.
Interactive exploration tools: We've built several tools to help you discover patterns!
- ModelVerse Explorer: organizational contributions
- DataVerse Explorer: dataset patterns
- Organization HeatMap: activity over time
- Base Model Explorer: model family trees
- Semantic Search: find models by capability
Academic research is thriving: Researchers are already producing valuable insights, including recent work at FAccT 2025: "The Brief and Wondrous Life of Open Models." We've also made hub datasets, weekly snapshots, and other data available for your own analysis.
The bottom line: AI development is far more distributed, diverse, and collaborative than popular narratives suggest. Real innovation happens through community collaboration across specialized domains.
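For anyone who wants to try the "derivative works" idea on their own data, here is a minimal sketch of counting the derivatives in a model family tree. It assumes you already have (model, base model) pairs from whatever source you use. The records and model IDs below are invented for illustration and do not refer to real repositories, and this is not the method behind the blog post or the explorer tools.

```python
from collections import defaultdict

# Hypothetical records: (model_id, base_model_id), with None marking a root model.
# All IDs are made up for illustration only.
RECORDS = [
    ("org-a/base", None),
    ("user-1/base-finetune", "org-a/base"),
    ("user-2/base-quantized", "org-a/base"),
    ("user-3/finetune-merge", "user-1/base-finetune"),
    ("org-b/base", None),
    ("user-4/other-adapter", "org-b/base"),
]


def build_children(records):
    """Map each base model to the list of models derived directly from it."""
    children = defaultdict(list)
    for model_id, base in records:
        if base is not None:
            children[base].append(model_id)
    return children


def count_descendants(root, children):
    """Count direct and indirect derivatives of `root` with an iterative DFS."""
    seen = set()
    stack = [root]
    while stack:
        node = stack.pop()
        for child in children.get(node, []):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return len(seen)


def derivative_counts(records):
    """Return {root_model: number of derivatives} for every root model."""
    children = build_children(records)
    roots = [model_id for model_id, base in records if base is None]
    return {root: count_descendants(root, children) for root in roots}


if __name__ == "__main__":
    counts = derivative_counts(RECORDS)
    for root, n in sorted(counts.items(), key=lambda kv: -kv[1]):
        print(f"{root}: {n} derivative(s)")
```

Counting indirect derivatives (a fine-tune of a fine-tune) is a design choice here. Counting only direct children would give a narrower picture of how far a base model has spread.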