rg-preview
community
AI & ML interests
None defined yet.
MaziyarPanahi
posted an update about 1 month ago
Post
2316
Training mRNA Language Models Across 25 Species for $165
We built an end-to-end protein AI pipeline covering structure prediction, sequence design, and codon optimization. After comparing multiple transformer architectures for codon-level language modeling, CodonRoBERTa-large-v2 emerged as the clear winner with a perplexity of 4.10 and a Spearman CAI correlation of 0.40, significantly outperforming ModernBERT. We then scaled to 25 species, trained 4 production models in 55 GPU-hours, and built a species-conditioned system that no other open-source project offers. Complete results, architectural decisions, and runnable code below.
https://huggingface.co/blog/OpenMed/training-mrna-models-25-species
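For readers unfamiliar with the Spearman CAI correlation metric mentioned above: it is a rank correlation between per-sequence model scores and Codon Adaptation Index values. A minimal dependency-free sketch of that computation (the score and CAI values below are made-up placeholders, not the post's data):

```python
def rank(values):
    # Assign average 1-based ranks, handling ties
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average rank across the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    # Spearman correlation = Pearson correlation of the rank vectors
    rx, ry = rank(x), rank(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

model_scores = [0.12, 0.55, 0.33, 0.80, 0.41]  # hypothetical per-sequence likelihoods
cai_values   = [0.31, 0.62, 0.45, 0.90, 0.40]  # hypothetical CAI values
print(round(spearman(model_scores, cai_values), 2))  # 0.9
```

In practice a library such as scipy's `spearmanr` would be used; the point is only that the 0.40 figure measures rank agreement, not absolute score agreement.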
MaziyarPanahi
posted an update about 2 months ago
Post
2264
We annotated 119K medical images with two frontier VLMs (Qwen 3.5, Kimi K2.5), cross-validated at 93% agreement, and produced 110K training records, all for under $500. Fine-tuning 3 small models (2-3B params) improved all benchmarks: best model reaches +15.0% average exact match.
Everything is open-sourced: datasets, adapters, and code.
https://huggingface.co/blog/OpenMed/synthvision
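The 93% cross-validation figure boils down to exact-match agreement between the two annotating VLMs. A minimal sketch of that check (the per-image labels below are invented examples, not the project's data):

```python
def agreement(labels_a, labels_b):
    # Fraction of records where both annotating models produced the same label
    assert len(labels_a) == len(labels_b)
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)

# Hypothetical per-image labels from the two annotators
qwen_labels = ["pneumonia", "normal", "fracture", "normal"]
kimi_labels = ["pneumonia", "normal", "effusion", "normal"]
print(agreement(qwen_labels, kimi_labels))  # 0.75
```

Records where the two models disagree would be dropped or re-reviewed, which is how 119K annotated images shrink to 110K training records.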
MaziyarPanahi
posted an update 2 months ago
Post
4863
DNA, mRNA, proteins, AI. I spent the last year going deep into computational biology as an ML engineer. This is Part I of what I found.
In 2024, AlphaFold won the Nobel Prize in Chemistry.
By 2026, the open-source community had built alternatives that outperform it.
That's the story I find most interesting about protein AI right now. Not just the science (which is incredible), but the speed at which open-source caught up. Multiple teams, independently, reproduced and then exceeded AlphaFold 3's accuracy with permissive licenses. The field went from prediction to generation: we're not just modeling known proteins anymore, we're designing new ones.
I spent months mapping this landscape for ML engineers. What the architectures actually are (spoiler: transformers and diffusion models), which tools to use for what, and which ones you can actually ship commercially.
New post on the Hugging Face blog: https://huggingface.co/blog/MaziyarPanahi/protein-ai-landscape
Hope you all enjoy!
MaziyarPanahi
authored a paper 3 months ago
MaziyarPanahi
posted an update 3 months ago
Post
2428
Announcing: OpenMed Multilingual PII Detection Models
Today I am releasing 105 open-source models for Personally Identifiable Information (PII) detection in French, German, and Italian.
All Apache 2.0 licensed. Free for commercial use. No restrictions.
Performance:
- French: 97.97% F1 (top model)
- German: 97.61% F1 (top model)
- Italian: 97.28% F1 (top model)
All top-10 models per language exceed 96% F1
Coverage:
55+ PII entity types per language
Native ID formats: NSS (French), Sozialversicherungsnummer (German), Codice Fiscale (Italian)
Language-specific address, phone, and name patterns
Training Data:
French: 49,580 samples
German: 42,250 samples
Italian: 40,944 samples
Why Multilingual?
European healthcare operates in European languages. Clinical notes, patient records, and medical documents are generated in French, German, Italian, and other languages.
Effective de-identification requires:
- Native language understanding, not translation
- Local ID format recognition: each country has unique patterns
- Cultural context awareness: names, addresses, and formats vary
These models deliver production-ready accuracy without requiring data to leave your infrastructure or language.
HIPAA & GDPR Compliance
Built for US and European privacy regulations:
- On-premise deployment: Process data locally with zero external dependencies
- Data sovereignty: No API calls, no cloud services, no cross-border transfers
- Air-gapped capable: Deploy in fully isolated environments if required
- Regulatory-grade accuracy: Supporting Expert Determination standards
HIPAA and GDPR compliance across languages, without compliance gaps.
Use Cases
- Hospital EHR systems: Automated patient record de-identification
- Clinical research: Multilingual dataset preparation for studies
- Insurance companies: Claims processing across
https://huggingface.co/collections/OpenMed/multilingual-pii-and-de-identification
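To use token-classification models like these, the per-token BIO tags must be merged into entity spans before redaction. A minimal sketch of that aggregation step (the tokens and tags below are invented; transformers' pipeline `aggregation_strategy` can do this for you in practice):

```python
def merge_bio(tokens, tags):
    """Merge BIO-tagged tokens into (entity_type, text) spans."""
    spans, current = [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                spans.append(current)
            current = (tag[2:], token)  # start a new entity span
        elif tag.startswith("I-") and current and tag[2:] == current[0]:
            current = (current[0], current[1] + " " + token)  # extend the span
        else:
            if current:
                spans.append(current)
            current = None  # "O" tag or inconsistent I- tag closes the span
    if current:
        spans.append(current)
    return spans

# Invented French clinical-note fragment with PII-style tags
tokens = ["Mme", "Dupont", ",", "NSS", "1850575123456", ",", "Paris"]
tags   = ["B-NAME", "I-NAME", "O", "B-ID", "I-ID", "O", "B-CITY"]
print(merge_bio(tokens, tags))
```

Each returned span can then be replaced with a placeholder such as [NAME] or [ID] during de-identification.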
MaziyarPanahi
posted an update 3 months ago
Post
1336
From Golden Gate Bridge to Broken JSON: Why Anthropic's SAE Steering Fails for Structured Output
I ran 6 experiments trying to use Anthropic's SAE steering for JSON generation.
- Base model: 86.8% valid JSON
- Steering only: 24.4%
- Fine-tuned: 96.6%
- FSM constrained: 100%
Steering is for semantics, not syntax.
https://huggingface.co/blog/MaziyarPanahi/sae-steering-json
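The valid-JSON percentages above come down to a simple parse check over a batch of generations. A sketch of how such a rate might be measured (the sample outputs are invented):

```python
import json

def valid_json_rate(outputs):
    # Fraction of generations that parse as JSON
    ok = 0
    for text in outputs:
        try:
            json.loads(text)
            ok += 1
        except json.JSONDecodeError:
            pass
    return ok / len(outputs)

# Hypothetical model generations: a trailing comma and free text both fail
samples = ['{"a": 1}', '{"a": 1,}', '[1, 2, 3]', 'not json']
print(valid_json_rate(samples))  # 0.5
```

An FSM-constrained decoder reaches 100% by construction: it masks out any next token that would leave the JSON grammar, so the check above can never fail.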
MaziyarPanahi
posted an update 3 months ago
Post
4089
Day 8/8: OpenMed Medical Reasoning Dataset Release - THE GRAND FINALE
Today I complete my 8-day release series with Medical-Reasoning-SFT-Mega.
The largest open medical reasoning dataset, combining 7 state-of-the-art AI models with fair distribution deduplication.
THE 7 SOURCE MODELS (Original Sample Counts):
1. Trinity-Mini: 810,284 samples
2. Qwen3-Next-80B: 604,249 samples
3. GPT-OSS-120B: 506,150 samples
4. Nemotron-Nano-30B: 444,544 samples
5. GLM-4.5-Air: 225,179 samples
6. MiniMax-M2.1: 204,773 samples
7. Baichuan-M3-235B: 124,520 samples
TOTAL BEFORE DEDUPLICATION: 2,919,699 samples
TOKEN COUNTS:
- Content tokens: 2.22 Billion
- Reasoning tokens: 1.56 Billion
- Total tokens: 3.78 Billion
- Samples with chain-of-thought: 100%
Quick Start:
from datasets import load_dataset
ds = load_dataset("OpenMed/Medical-Reasoning-SFT-Mega")
All datasets Apache 2.0 licensed. Free for research and commercial use.
Thank you for following OpenMed's release series. I can't wait to see what you build.
OpenMed/Medical-Reasoning-SFT-Mega
OpenMed/Medical-Reasoning-SFT-GPT-OSS-120B-V2
OpenMed/Medical-Reasoning-SFT-Trinity-Mini
OpenMed/Medical-Reasoning-SFT-GLM_4.5_Air
OpenMed/Medical-Reasoning-SFT-MiniMax-M2.1
OpenMed/Medical-Reasoning-SFT-Qwen3-Next-80B
OpenMed/Medical-Reasoning-SFT-Nemotron-Nano-30B
OpenMed/Medical-Reasoning-SFT-Baichuan-M3-235B
https://huggingface.co/collections/OpenMed/medical-datasets
mlabonne
authored 2 papers 4 months ago
Post
10340
New family of 1B models just dropped!
> LiquidAI/LFM2.5-1.2B-Base: 10T → 28T tokens
> LiquidAI/LFM2.5-1.2B-Instruct: new large-scale multi-stage RL
> LiquidAI/LFM2.5-1.2B-JP: our most polite model
> LiquidAI/LFM2.5-VL-1.6B: multi-image multilingual
> LiquidAI/LFM2.5-Audio-1.5B: 8x faster, no quality loss
Super proud of this release
MaziyarPanahi
posted an update 4 months ago
Post
3781
OpenMed 2025 Year in Review: 6 Months of Open Medical AI
I'm thrilled to share what the OpenMed community has accomplished since our July 2025 launch!
The Numbers
29,700,000 downloads. Thank you!
- 481 total models (475 medical NER models + 6 fine-tuned LLMs)
- 475 medical NER models in the OpenMed organization
- 6 fine-tuned LLMs in openmed-community
- 551,800 PyPI downloads of the [openmed package](https://pypi.org/project/openmed/)
- 707 followers on HuggingFace (you!)
- 97 GitHub stars on the [toolkit repo](https://github.com/maziyarpanahi/openmed)
Top Models by Downloads
1. OpenMed/OpenMed-NER-PharmaDetect-SuperClinical-434M: 147,305 downloads
2. OpenMed/OpenMed-NER-ChemicalDetect-ElectraMed-33M: 126,785 downloads
3. OpenMed/OpenMed-NER-BloodCancerDetect-TinyMed-65M: 126,465 downloads
Model Categories
Our 481 models cover comprehensive medical domains:
- Disease Detection (~50 variants)
- Pharmaceutical Detection (~50 variants)
- Oncology Detection (~50 variants)
- Genomics/DNA Detection (~80 variants)
- Chemical Detection (~50 variants)
- Species/Organism Detection (~60 variants)
- Protein Detection (~50 variants)
- Pathology Detection (~50 variants)
- Blood Cancer Detection (~30 variants)
- Anatomy Detection (~40 variants)
- Zero-Shot NER (GLiNER-based)
OpenMed NER: Open-Source, Domain-Adapted State-of-the-Art Transformers for Biomedical NER Across 12 Public Datasets (2508.01630)
https://huggingface.co/collections/OpenMed/medical-and-clinical-ner
https://huggingface.co/collections/OpenMed/zeroshot-medical-and-clinical-ner
OpenMed/Medical-Reasoning-SFT-GPT-OSS-120B
Post
8439
LiquidAI/LFM2-8B-A1B just dropped!
8.3B params with only 1.5B active/token
> Quality ≈ 3–4B dense, yet faster than Qwen3-1.7B
> MoE designed to run on phones/laptops (llama.cpp / vLLM)
> Pre-trained on 12T tokens → strong math/code/IF
Post
3889
New drop of tiny task-specific models!
Want to do data extraction, translation, RAG, tool use, or math on a Raspberry Pi? We got you covered!
These tiny models were fine-tuned to perform narrow tasks extremely well, making them competitive with much larger models.
You can deploy them today on-device or even on GPUs for big data operations!
LiquidAI/liquid-nanos-68b98d898414dd94d4d5f99a
Post
6987
Liquid just released two 450M and 1.6B param VLMs!
They're super fast and leverage SigLIP2 NaFlex encoders to handle native resolutions without distortion. It's ideal for on-device deployment in constrained environments like phones.
It's available today on Hugging Face, with inference and fine-tuning Colab notebooks.
LiquidAI/LFM2-VL-450M
LiquidAI/LFM2-VL-1.6B
MaziyarPanahi
authored 2 papers 9 months ago
MaziyarPanahi
posted an update 10 months ago
Post
13043
Breaking news in Clinical AI: Introducing the OpenMed NER Model Discovery App on Hugging Face
OpenMed is back! Finding the right biomedical NER model just became as precise as a PCR assay!
I'm thrilled to unveil my comprehensive OpenMed Named Entity Recognition Model Discovery App that puts 384 specialized biomedical AI models at your fingertips.
Why This Matters in Healthcare AI:
Traditional clinical text mining required hours of manual model evaluation. My Discovery App instantly connects researchers, clinicians, and data scientists with the exact NER models they need for their biomedical entity extraction tasks.
What You Can Discover:
- Pharmacological Models: extract "chemical compounds", "drug interactions", and "pharmaceutical" entities from clinical notes
- Genomics & Proteomics: identify "DNA sequences", "RNA transcripts", "gene variants", "protein complexes", and "cell lines"
- Pathology & Disease Detection: recognize "pathological formations", "cancer types", and "disease entities" in medical literature
- Anatomical Recognition: map "anatomical systems", "tissue types", "organ structures", and "cellular components"
- Clinical Entity Extraction: detect "organism species", "amino acids", "protein families", and "multi-tissue structures"
Advanced Features:
- Intelligent Entity Search: find models by specific biomedical entities (e.g., "Show me models detecting CHEM + DNA + Protein")
- Domain-Specific Filtering: browse by Oncology, Pharmacology, Genomics, Pathology, Hematology, and more
- Model Architecture Insights: compare BERT, RoBERTa, and DeBERTa implementations
- Real-Time Search: auto-filtering as you type, no search buttons needed
- Clinical-Grade UI: beautiful, intuitive interface designed for medical professionals
Ready to revolutionize your biomedical NLP pipeline?
Try it now: OpenMed/openmed-ner-models
Built with: Gradio, Transformers, Advanced Entity Mapping
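The entity search the app exposes can be thought of as filtering model metadata by supported entity types. A rough sketch of that idea (the catalog below is invented for illustration; the real app derives this information from the model cards):

```python
def find_models(catalog, required_entities):
    # Return models whose supported entities cover all requested ones
    want = set(required_entities)
    return [name for name, entities in catalog.items() if want <= set(entities)]

# Invented catalog mapping model names to supported entity types
catalog = {
    "OpenMed-NER-PharmaDetect": ["CHEM", "DRUG"],
    "OpenMed-NER-GenomeDetect": ["DNA", "RNA", "Protein"],
    "OpenMed-NER-SuperClinical": ["CHEM", "DNA", "Protein"],
}
print(find_models(catalog, ["CHEM", "DNA", "Protein"]))  # ['OpenMed-NER-SuperClinical']
```

The subset test (`want <= set(entities)`) is what makes multi-entity queries like "CHEM + DNA + Protein" work: a model matches only if it covers every requested type.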
Post
5766
Based on a new hybrid architecture, these 350M, 700M, and 1.2B models are both fast and performant, ideal for on-device deployment.
I recommend fine-tuning them to power your next edge application. We already provide Colab notebooks to guide you. More to come soon!
Blog post: https://www.liquid.ai/blog/liquid-foundation-models-v2-our-second-series-of-generative-ai-models
Models: LiquidAI/lfm2-686d721927015b2ad73eaa38
dvilasuero
posted an update 11 months ago
Post
3424
Super excited to launch Hugging Face Sheets: Spreadsheets meet AI and unstructured data.
A few months ago, we started imagining new ways to build and transform datasets with the latest open-source models.
Today, I'm thrilled to introduce our first step in this direction.
In a nutshell:
- Effortlessly run prompts and models over your data.
- Agentic search for accuracy and real-time information.
- Familiar, minimalistic interface for interacting with data.
- Human feedback 2.0: your input directly improves generated data.
- Access hundreds of open models and leading inference providers.
Go to this space to try it out!
aisheets/sheets
Leave your questions below, we're just getting started!