Introducing Dhara-70M: A diffusion language model that achieves 3.8x higher throughput than autoregressive models!
Key findings from our research on optimal architectures for small language models:
- Depth beats width: 32 layers outperform 12 layers at the same parameter count
- Best-in-class factuality: 47.5% on TruthfulQA
- 10x training efficiency using WSD (Warmup-Stable-Decay) conversion
- Canon layers add only 0.13% parameters but improve reasoning
We trained on 1B tokens using the optimal 50-30-20 dataset mix (PDFs + filtered web + educational content), then converted to diffusion with just 100M additional tokens.
Introducing PTS Visualizer - an interactive tool for exploring how language models reason!
Visualize pivotal tokens, thought anchors, and reasoning circuits. See which tokens and sentences significantly impact success probability, explore embedding clusters, and trace reasoning step-by-step.
Recently, Essential AI released a new 8B base model, EssentialAI/rnj-1, in which they highlighted the importance of the data mix for pretraining:
"In the long run, we expect our methods to automatically represent, transform, and blend data to optimize measurable abilities in pre-training. Our work on modeling data taxonomies led to new approaches for jointly clustering and mixing data distributions under data repetition penalties. Many improvements in our STEM abilities can be traced back to this. "
I just published Ellora - 6 production-ready LoRA recipes for enhancing LLMs with specific capabilities. Each recipe costs under $100 to run and includes complete training code, data generation, and evaluation.
The 6 Recipes:
- Recipe 1: Accuracy Recovery - Recover 75% of quantization losses with self-distillation
- Recipe 2: Reasoning LoRA - Add structured thinking with GRPO (0% to 60% adoption, 75% quality boost)
- Recipe 3: Tool Calling - Real execution on actual codebases
- Recipe 4: Context Extension - Scale from 32K to 2M tokens (61x increase)
- Recipe 5: Secure Code Generation - 97% vulnerability reduction using automated Semgrep analysis
- Recipe 6: Execution-Aware World Models - Teaching models runtime behavior
Why Recipes? Ellora provides methodologies, not frameworks. Use them with your existing tools (PEFT, LoRAX, vLLM, Unsloth, HuggingFace). Each recipe uses self-supervised data generation (Magpie approach) - no expensive human labeling required.
All recipes include Jupyter notebooks you can run immediately with clear success metrics.
Introducing OpenEvolve Prompt Optimizer - a Space that automatically evolves and optimizes your prompts using OpenEvolve!
This tool uses OpenEvolve to iteratively improve prompts by testing them on real datasets and evolving better versions. No more manual prompt engineering guesswork - let OpenEvolve find the optimal prompts for you.
How it works:
- Enter your initial prompt using {input} as a placeholder for dataset inputs
- Input any HuggingFace dataset name you want to use for optimization
- Specify the dataset split and field names for your use case
- Click Optimize Prompt and the system will validate everything first
- Compare your initial prompt vs the evolved best prompt side-by-side
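Mechanically, the {input} placeholder is plain string substitution. Here is a minimal sketch of how a candidate prompt could be scored against dataset rows; the rows, the toy model, and the exact-match scorer are hypothetical stand-ins, not OpenEvolve's internals:

```python
# Minimal sketch of scoring a candidate prompt against dataset rows.
# The rows, toy model, and exact-match metric are illustrative stand-ins.

def render(prompt_template: str, example_input: str) -> str:
    """Fill the {input} placeholder with one dataset field."""
    return prompt_template.replace("{input}", example_input)

def score_prompt(prompt_template, rows, predict, target_field="label"):
    """Fraction of rows where the model's answer exactly matches the target."""
    hits = sum(
        predict(render(prompt_template, row["text"])).strip() == row[target_field]
        for row in rows
    )
    return hits / len(rows)

rows = [{"text": "2+2", "label": "4"}, {"text": "3+3", "label": "6"}]
ANSWERS = {"2+2": "4", "3+3": "6"}

def toy_predict(prompt):
    # Stand-in "model": look up the question after the last colon.
    question = prompt.rsplit(":", 1)[-1].strip()
    return ANSWERS.get(question, "")

print(score_prompt("Answer the question: {input}", rows, toy_predict))  # 1.0
```

An evolutionary optimizer then mutates the template and keeps variants whose score improves on held-out rows.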
Introducing Chayan: A Calibrated 4-Model LLM Router Achieving 69% Accuracy on RouterArena
We're excited to share Chayan, a cost-efficient LLM router that intelligently routes queries between 4 models to maximize accuracy while minimizing cost. We just submitted Chayan to the RouterArena leaderboard, where it achieved 69.05% accuracy on the benchmark!
Chayan achieves impressive results on the RouterArena benchmark:
- 69.05% accuracy (would rank #1 on the current leaderboard)
- $0.333 per 1K queries
- +12.07pp improvement over the all-mini baseline (56.98%)
- 99% of perfect 2-model oracle performance at 57% lower cost
Compared to our previous 2-model router (61.43% accuracy), Chayan delivers +7.62pp improvement through smarter 4-model routing.
How It Works
Chayan uses an Adaptive K-NN classifier with prototype memory to route between 4 models:
- openai/gpt-4o-mini (fast & cheap)
- google/gemini-2.5-flash-lite (balanced)
- google/gemini-2.5-flash (capable)
- openai/gpt-4o (most powerful)
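A nearest-prototype K-NN router of this flavor can be sketched in a few lines. The 2-D vectors below are toy stand-ins; the real router embeds queries with a trained encoder and maintains learned prototypes per model:

```python
import math

# Toy sketch of K-NN routing over per-model prototypes.
# Real routers embed queries with a trained encoder; these vectors are hand-made.
PROTOTYPES = {
    "gpt-4o-mini":      [(0.1, 0.1), (0.2, 0.0)],   # easy queries
    "gemini-2.5-flash": [(0.5, 0.5)],               # medium queries
    "gpt-4o":           [(0.9, 0.9), (1.0, 0.8)],   # hard queries
}

def route(query_vec, k=3):
    """Return the model whose prototypes dominate the k nearest neighbours."""
    scored = []
    for model, protos in PROTOTYPES.items():
        for p in protos:
            scored.append((math.dist(query_vec, p), model))
    scored.sort()
    votes = {}
    for _, model in scored[:k]:
        votes[model] = votes.get(model, 0) + 1
    return max(votes, key=votes.get)

print(route((0.15, 0.05)))  # easy query -> gpt-4o-mini
```

Prototype memory makes the router cheap to update: adding a labeled query is just appending one vector, with no retraining.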
Getting Started
You can use Chayan directly from HuggingFace:
from adaptive_classifier import AdaptiveClassifier
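The snippet above is cut off. To illustrate the intended flow without downloading the actual router (the real object comes from `AdaptiveClassifier.load("adaptive-classifier/<router-name>")`), here is a sketch with a hypothetical stand-in classifier:

```python
# Hypothetical stand-in for the downloaded Chayan router so the routing
# flow can be shown offline; the length-based rule below is illustrative,
# not the real prototype-voting logic.
class StubRouter:
    def predict(self, text):
        label = "openai/gpt-4o" if len(text.split()) > 12 else "openai/gpt-4o-mini"
        return [(label, 1.0)]

router = StubRouter()
(label, score), = router.predict("What is 2 + 2?")
print(label)  # openai/gpt-4o-mini
```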
These samples were created using reservoir sampling - an algorithm that guarantees statistically unbiased random samples from massive source datasets. This means results you get at the 1B token scale are representative of how these datasets behave at 100B+ token scales, letting you iterate quickly without the computational overhead.
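Reservoir sampling (Algorithm R) is compact enough to show in full; this is a generic sketch, not the exact script used to build the samples:

```python
import random

def reservoir_sample(stream, k, seed=0):
    """Algorithm R: uniform random sample of k items from a stream
    of unknown (possibly huge) length, in one pass and O(k) memory."""
    rng = random.Random(seed)
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)
        else:
            j = rng.randint(0, i)   # item i survives with probability k/(i+1)
            if j < k:
                reservoir[j] = item
    return reservoir

sample = reservoir_sample(range(1_000_000), 5)
print(len(sample))  # 5
```

Because every item ends up in the sample with equal probability, statistics measured on the 1B-token sample are unbiased estimates of the full dataset's.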
The collection includes:
- finePDFs-1B: High-quality textbook-style educational content
- DCLM-baseline-1B: Filtered, diverse web content
- FineWeb-Edu-1B: Curated educational web resources
We used these exact samples to run 50+ systematic experiments on dataset mixing strategies, ultimately discovering that a 50-30-20 mixture of finePDFs + DCLM-baseline + FineWeb-Edu achieves 90%+ of GPT-2's performance with just 1/10th the training data.
Whether you're researching optimal data mixtures, testing curriculum learning strategies, or just want to quickly prototype a pretraining run, these samples give you a solid foundation to start experimenting immediately.
MARS Achieves Strong Results on Google DeepMind's IMO-Bench
We evaluated OptiLLM's MARS (Multi-Agent Reasoning System) approach on IMO-Bench, Google DeepMind's challenging mathematical reasoning benchmark with International Mathematical Olympiad-level problems.
What is MARS?
MARS is a multi-agent reasoning technique that works with any LLM. It uses 3 parallel reasoning agents that independently solve problems, then verifies their solutions through consensus and iterative refinement. The key advantage: it's model-agnostic and can be applied to any base model through OptiLLM's inference proxy.
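The consensus step can be sketched as a simple agreement check across the parallel agents; this toy version (threshold and refinement signal are illustrative, not MARS's actual verifier) shows the shape of the idea:

```python
from collections import Counter

def consensus(answers, threshold=2):
    """Toy consensus: accept an answer once `threshold` agents agree;
    return None to signal that another refinement round is needed."""
    best, count = Counter(answers).most_common(1)[0]
    return best if count >= threshold else None

# Three independent "agents" answer the same problem.
agent_answers = ["42", "42", "41"]
print(consensus(agent_answers))  # 42
```

In the real system, disagreement triggers verification and iterative refinement rather than a simple retry.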
Results on IMO-Bench:
AnswerBench (400 short-answer problems):
- MARS: 36.0% (144/400 correct)
- Baseline: 24.5% (98/400 correct)
- Improvement: +11.5pp across all domains
All results achieved using google/gemini-2.5-flash-lite-preview-09-2025 as the base model. The same MARS approach can enhance reasoning for any model through OptiLLM's OpenAI-compatible API.
Datasets available at:
- AnswerBench: huggingface.co/datasets/Hwilner/imo-answerbench
- ProofBench: huggingface.co/datasets/Hwilner/imo-proofbench
Try it yourself:
python optillm.py --approach mars --model google/gemini-2.5-flash-lite-preview-09-2025
On this day in 2019, OpenAI released the final GPT-2 model as part of their staged release. I still remember that November well - so much was happening, but GPT-2's release felt like a watershed moment for the field. It showed us what was possible with carefully trained language models.
To recreate some of that GPT-2 magic, I recently tackled an interesting challenge: can you pretrain a language model with just 1 billion tokens - roughly 1/10th of what GPT-2 used - and still get comparable performance? After 50+ systematic experiments testing different dataset mixtures, the answer is yes.
The result is codelion/gpt-2-70m, which achieves over 90% of GPT-2's benchmark performance despite being trained on 10x less data. The key was finding the optimal dataset composition: 50% high-quality textbook PDFs, 30% filtered web content, and 20% educational resources. It even beats GPT-2 on TruthfulQA (47.31% vs 40.69%).
The 1 Billion Token Challenge: Finding the Perfect Pre-training Mix
Through 50+ systematic experiments on dataset mixing strategies, we trained a GPT-2-class model to 90%+ of GPT-2's performance using just 1/10th the training data.
Key Finding:
A static mix of 50% finePDFs + 30% DCLM-baseline + 20% FineWeb-Edu consistently outperforms complex curriculum learning approaches. Static mixing is simpler, faster, and avoids catastrophic failures from hard distribution shifts.
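"Static" here means each training example's source is drawn with fixed probabilities for the entire run, with no curriculum stages. A minimal sketch (dataset names from the post; the sampling loop is illustrative):

```python
import random

# Static 50-30-20 mix: the source dataset of each example is drawn with
# fixed probabilities for the whole run -- no curriculum schedule.
MIX = {"finePDFs": 0.5, "DCLM-baseline": 0.3, "FineWeb-Edu": 0.2}

def pick_source(rng: random.Random) -> str:
    names = list(MIX)
    return rng.choices(names, weights=[MIX[n] for n in names], k=1)[0]

rng = random.Random(0)
counts = {name: 0 for name in MIX}
for _ in range(10_000):
    counts[pick_source(rng)] += 1
print(counts)  # roughly 5000 / 3000 / 2000
```

Because the mixture never shifts, the model never sees the abrupt distribution changes that make hard curriculum transitions fail.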
Results:
Our GPT-2-70M model (70M parameters, 1B tokens) scores 38.15% on benchmarks vs GPT-2's 39.13% - only 0.98 points behind despite 10x less data and 44% fewer parameters. It even beats GPT-2 on TruthfulQA (47.31% vs 40.69%).
The takeaway: careful dataset curation matters more than total data volume.
Introducing Ellora Recipe #6: Execution-Aware World Model for Qwen3-4B-Thinking
Teaching LLMs to understand not just what code does, but HOW it executes at runtime!
Inspired by Meta's CWM (Code World Model) research, this LoRA adapter adds execution awareness to Qwen3-4B-Thinking-2507. The model learns to predict variable states, trace program execution step-by-step, and debug code by understanding runtime behavior.
Key Innovation: We combine Qwen3's native thinking capabilities with real Python execution traces captured via sys.settrace(). The model is trained using GRPO with a custom reward function that scores execution prediction accuracy.
Training Approach:
- Hybrid Magpie-style code generation
- Real execution tracing for ground truth
- Self-supervised learning (no manual annotations!)
- 298 training samples with execution traces

What it does:
- Predicts variable states at each line of code
- Explains execution flow with thinking tags
- Helps debug by understanding runtime behavior
- Works as a "neural debugger"

Results:
- 20% overall accuracy on execution prediction
- 33.3% mean state accuracy
- Trained on Qwen3-4B-Thinking (262K context, 4B params)
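The sys.settrace() ground-truth capture mentioned above fits in a few lines. This is a minimal sketch (the `demo` function is illustrative), recording the line number and local variables at every executed line:

```python
import sys

def trace_locals(fn, *args):
    """Record (line_number, locals) at each executed line of fn via sys.settrace."""
    frames = []

    def tracer(frame, event, arg):
        if frame.f_code is not fn.__code__:
            return None                      # ignore frames outside fn
        if event == "line":
            frames.append((frame.f_lineno, dict(frame.f_locals)))
        return tracer                        # keep tracing inside fn

    sys.settrace(tracer)
    try:
        fn(*args)
    finally:
        sys.settrace(None)
    return frames

def demo(n):
    total = 0
    for i in range(n):
        total += i
    return total

trace = trace_locals(demo, 3)
# The last traced line is `return total`, where the locals show total == 3.
print(trace[-1][1]["total"])  # 3
```

Traces like these become the ground truth the reward function compares the model's predicted variable states against.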
Part of the Ellora project - standardized LoRA recipes for enhancing LLM capabilities. All recipes use self-supervised data generation and work with existing infrastructure (PEFT, LoRAX, vLLM).
Adaptive Classifier v0.1.0: Now with ONNX Runtime Support!
We're excited to announce a major update to Adaptive Classifier - a flexible, continuous learning classification system that adapts to new classes without retraining!
What's New:
ONNX Runtime Integration: Get 1.14x faster CPU inference out of the box (up to 4x on x86 processors)

INT8 Quantization: Models are now 4x smaller with minimal accuracy loss, making deployment easier and faster

Smart Loading: Automatically uses the best model variant for your hardware - quantized for speed by default, or unquantized for maximum accuracy

7.5x Faster Model Loading: Get started quickly with optimized model initialization
How It Works:
Adaptive Classifier lets you build text classifiers that continuously learn from new examples without catastrophic forgetting. Perfect for:
- Dynamic classification tasks where classes evolve over time
- Few-shot learning scenarios with limited training data
- Production systems that need to adapt to new categories
The new ONNX support means you get production-ready speed on CPU without any code changes - just load and run!
Try it now:
from adaptive_classifier import AdaptiveClassifier

# Load with ONNX automatically enabled (quantized for best performance)
classifier = AdaptiveClassifier.load("adaptive-classifier/llm-router")

# Add examples dynamically
classifier.add_examples(
    ["Route this to GPT-4", "Simple task for GPT-3.5"],
    ["strong", "weak"]
)

# Predict with optimized inference
predictions = classifier.predict("Complex reasoning task")
Adaptive Classifier v0.0.17 Released - Major Accuracy Improvements!
We've just released a major update fixing critical bugs that were causing 40-50% accuracy drops in our enterprise classifiers!
Key Fixes:
- Fixed k-parameter prediction bug causing massive accuracy loss
- Improved incremental learning for new classes
- Enhanced weight preservation during model updates
Over 40 percent of AI-generated code contains security vulnerabilities. We recently worked on a LoRA to write secure code by default using automated Semgrep analysis and GRPO, achieving 97 percent reduction in vulnerabilities without requiring security-specific prompts.
Technical Approach: An automated security training pipeline combining Semgrep vulnerability detection with preference learning. We generate multiple solutions with varying security awareness, automatically analyze them for vulnerabilities, create preference pairs based on security scores, and train using GRPO with multi-factor scoring.
Scoring System (100 points total):
- Functionality: 40 points - does the code work correctly?
- Security patterns: 40 points - uses secure coding practices
- Low vulnerabilities: 20 points - Semgrep score below threshold
This balanced scoring prevents reward hacking where models generate empty functions to avoid vulnerabilities.
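The balance can be sketched as a reward combiner; the inputs below (test result, pattern hits, finding count) are stand-ins for the real test harness and Semgrep output, and the exact point deductions are illustrative:

```python
# Sketch of the 100-point reward: functionality (40) + secure patterns (40)
# + low Semgrep findings (20). Inputs are stand-ins for real test/Semgrep results.
def security_reward(tests_passed: bool, secure_pattern_hits: int,
                    semgrep_findings: int, max_patterns: int = 4) -> float:
    functionality = 40.0 if tests_passed else 0.0
    patterns = 40.0 * min(secure_pattern_hits, max_patterns) / max_patterns
    low_vulns = max(0.0, 20.0 - 5.0 * semgrep_findings)
    return functionality + patterns + low_vulns

# An empty function avoids all vulnerabilities but fails the tests,
# so it scores low -- this is what blocks reward hacking.
print(security_reward(tests_passed=False, secure_pattern_hits=0, semgrep_findings=0))  # 20.0
print(security_reward(tests_passed=True, secure_pattern_hits=4, semgrep_findings=0))   # 100.0
```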
Real Transformation Examples:
Database query before:

query = f"SELECT * FROM products WHERE name = '{name}'"

Database query after:

query = "SELECT * FROM products WHERE name = ?"
db.execute(query, (name,))
I recently worked on a LoRA that improves tool use in LLMs. Thought the approach might interest folks here.
The issue I have had when trying to use some of the local LLMs with coding agents is this:
Me: "Find all API endpoints with authentication in this codebase"
LLM: "You should look for @app.route decorators and check if they have auth middleware..."
But I often want it to actually search the files and show me the results; the LLM just doesn't trigger a tool call.
To fine-tune it for tool use I combined two data sources:
1. Magpie scenarios - 5000+ diverse tasks (bug hunting, refactoring, security audits)
2. Real execution - ran these on actual repos (FastAPI, Django, React) to get authentic tool responses
This ensures the model learns both breadth (many scenarios) and depth (real tool behavior).
I wanted to share a technique that's been working really well for recovering performance after INT4 quantization.
Typically, quantizing an LLM to INT4 (unlike, say, INT8) for inference can incur some accuracy loss. Instead of accepting the quality loss, we used the FP16 model as a teacher to train a tiny LoRA adapter (rank=16) for the quantized model. The cool part: the model generates its own training data using the Magpie technique, so no external datasets are needed. This is critical because we want to stay as close as possible to the distribution of the model's natural responses.
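The objective is the standard distillation loss: KL divergence between the FP16 teacher's and the quantized student's temperature-softened output distributions. A dependency-free sketch with toy logits (real training applies this per token with PyTorch):

```python
import math

def softmax(logits, t=1.0):
    exps = [math.exp(x / t) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def kd_loss(teacher_logits, student_logits, t=2.0):
    """KL(teacher || student) over temperature-softened distributions --
    the distillation objective the LoRA adapter is trained to minimize."""
    p = softmax(teacher_logits, t)
    q = softmax(student_logits, t)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

teacher = [2.0, 0.5, -1.0]   # FP16 model's logits on a self-generated prompt
student = [1.5, 0.8, -0.5]   # INT4 + LoRA model's logits

print(round(kd_loss(teacher, student), 4))   # small positive number
print(round(kd_loss(teacher, teacher), 4))   # 0.0 -- identical distributions
```

Only the rank-16 LoRA weights receive gradients, so the adapter learns exactly the systematic gap between teacher and quantized student.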
Last year Apple's foundational models paper (https://arxiv.org/pdf/2407.21075) had proposed a similar technique and found "By using accuracy-recovery LoRA adapters with only rank 16, Alpaca win rate can be improved by 7-18%, GMS8K accuracy is boosted by 5-10%." (page 47).
We saw similar results on Qwen3-0.6B:
- Perplexity: 2.40 → 2.09 (only 5.7% degradation from the FP16 baseline)
- Memory: only 0.28GB vs 1.0GB for FP16 (75% reduction)
- Speed: 3.0x faster inference than FP16
- Quality: generates correct, optimized code solutions
Happy to answer questions about the implementation or help anyone trying to replicate this. The key insight is that quantization errors are systematic and learnable - a small adapter can bridge the gap without negating the benefits of quantization.
Has anyone else experimented with self-distillation for quantization recovery? Would love to hear about different approaches!
I recently added a recipe in ellora to improve the reasoning capabilities of Gemma-3-1B using self-supervised learning. The model now shows step-by-step thinking in <think> tags before answering.
Logic puzzle accuracy: 61% → 84%, after 3 hours of training on a single GPU.
Used GRPO where model generates multiple responses and learns to prefer better reasoning. Works surprisingly well for making smaller models more transparent.
Released 17 production-ready adaptive text classifiers that learn from just 100 examples per class and continuously improve without retraining.
These models achieve 93% average accuracy across enterprise use cases like email routing, fraud detection, document classification, and support ticket categorization. Built on ModernBERT with prototype memory and elastic weight consolidation.
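Elastic weight consolidation is what keeps new classes from erasing old ones: the new-task loss gets a quadratic penalty pulling weights that mattered for old classes back toward their previous values. A minimal sketch with toy numbers (the weights and Fisher values are illustrative):

```python
# Sketch of the elastic-weight-consolidation (EWC) penalty:
# 0.5 * lam * sum_i F_i * (w_i - w*_i)^2, where F_i is the Fisher
# information marking how important weight i was for old classes.
def ewc_penalty(weights, old_weights, fisher, lam=1.0):
    return 0.5 * lam * sum(
        f * (w - w0) ** 2 for w, w0, f in zip(weights, old_weights, fisher)
    )

old = [1.0, -0.5]
fisher = [10.0, 0.1]   # the first weight mattered a lot for the old classes

print(ewc_penalty([1.0, 2.0], old, fisher))   # ~0.3125: only the unimportant weight moved
print(ewc_penalty([2.0, -0.5], old, fisher))  # ~5.0: the important weight moved
```

Combined with prototype memory, this lets the classifier absorb new categories while the penalty anchors what it already knows.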
Key benefits: 90% cost reduction vs API solutions, 90-120ms local inference, dynamic class addition, and zero vendor lock-in.
All models available under adaptive-classifier organization. Install with pip install adaptive-classifier.