---
license: cc0-1.0
datasets:
- PleIAs/common_corpus
- isaacus/mteb-GovReport
- sedthh/gutenberg_english
- wikimedia/wikipedia
language:
- en
---
# LibreModel I (0.96B)

## Model Description

LibreModel I is a 960M-parameter language model trained exclusively on copyright-free, public-domain data using a novel 4-phase curriculum learning approach. The model demonstrates that competitive language models can be built without relying on copyrighted content, making AI development more accessible and legally clear.

**Key innovation:** the first model to use curriculum learning with exclusively public-domain data, showing that copyright-free training can achieve competitive results at a fraction of typical training costs (~$500 total budget).
## Model Details

- **Model Type:** Causal language model (GPT-style, decoder-only)
- **Parameters:** 960M (0.96B)
- **Architecture:** Llama-style (Hugging Face `LlamaConfig`) with optimizations
- **Context Length:** 3,072 tokens
- **Vocabulary Size:** 128,256 (Llama 3 tokenizer)
- **Training Tokens:** 19.2B (Chinchilla-optimal)
- **Training Cost:** ~$500 using AWS spot instances
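
The 19.2B-token figure is consistent with the common Chinchilla heuristic of roughly 20 training tokens per parameter (the 20x factor is the usual rule of thumb, not a number taken from this card):

```python
# Chinchilla heuristic: ~20 training tokens per model parameter
params = 0.96e9          # 960M parameters
tokens = 20 * params     # 19.2e9 -> 19.2B tokens, matching the stated budget
print(f"{tokens / 1e9:.1f}B tokens")
```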
### Architecture Features

- **Layers:** 22 transformer layers
- **Attention Heads:** 24 total, 8 key-value heads (3:1 GQA)
- **Hidden Size:** 1,536
- **Sink Tokens:** 4 persistent context tokens for improved long-range attention
- **Optimizations:** Flash Attention 2, gradient checkpointing, bf16 mixed precision
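
As an illustrative sketch only, the hyperparameters above map onto a Hugging Face `LlamaConfig` roughly as follows. Values not stated on this card (feed-forward size, RoPE settings) are left at library defaults, and the 4 attention-sink tokens are a training-time detail not expressed in the standard config:

```python
from transformers import LlamaConfig

# Sketch of the published hyperparameters; unstated values (e.g. intermediate
# size, RoPE theta) are left at the transformers defaults.
config = LlamaConfig(
    vocab_size=128_256,            # Llama 3 tokenizer
    hidden_size=1_536,
    num_hidden_layers=22,
    num_attention_heads=24,        # query heads
    num_key_value_heads=8,         # 3:1 grouped-query attention (GQA)
    max_position_embeddings=3_072,
)
```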
## 4-Phase Curriculum Training

### Phase 1: Foundation (0-8%)
- 70% Project Gutenberg (literature, classics)
- 30% Government Reports (analytical structure)

### Phase 2: Diversification (8-20%)
- 50% Project Gutenberg
- 45% Wikipedia (factual knowledge)
- 5% Government Reports

### Phase 3: Advanced Reasoning (20-40%)
- 40% Project Gutenberg
- 30% Harvard Legal Cases (logical reasoning)
- 30% Wikipedia

### Phase 4: Optimization (40-100%)
- 40% Project Gutenberg
- 30% Wikipedia
- 30% OpenGovernment (diverse analytical content)
**Note:** The Harvard legal data was dropped after the 40% mark due to persistent training instabilities and replaced with OpenGovernment data, which trained more stably while preserving similar reasoning-oriented content. The full schedule is sketched as a sampling-weight function below.
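
A minimal sketch of the curriculum as dataset-sampling weights keyed on training progress; the function and dataset keys are illustrative, not the actual training code:

```python
def phase_mix(progress: float) -> dict[str, float]:
    """Return dataset sampling weights for training progress in [0, 1].

    Illustrative reconstruction of the 4-phase curriculum; not the real code.
    """
    if progress < 0.08:   # Phase 1: Foundation
        return {"gutenberg": 0.70, "gov_reports": 0.30}
    if progress < 0.20:   # Phase 2: Diversification
        return {"gutenberg": 0.50, "wikipedia": 0.45, "gov_reports": 0.05}
    if progress < 0.40:   # Phase 3: Advanced Reasoning
        return {"gutenberg": 0.40, "harvard_legal": 0.30, "wikipedia": 0.30}
    # Phase 4: Optimization (Harvard legal swapped for OpenGovernment)
    return {"gutenberg": 0.40, "wikipedia": 0.30, "opengovernment": 0.30}
```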
## Training Data Sources (100% Public Domain)

- **Project Gutenberg:** Classical literature, philosophy, and science texts
- **Wikipedia:** Encyclopedia articles and factual content
- **Government Documents:** Policy papers, reports, and legal documents
- **OpenGovernment:** Diverse government publications and analyses

**Total:** ~19.2B tokens across all phases, with careful curation to ensure public-domain status.
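
For reference, two of the Hugging Face datasets listed in the card metadata can be streamed as in the sketch below; the `train` split and the Wikipedia config are assumptions, not values taken from this card:

```python
from datasets import load_dataset

# Streaming loads of two source datasets listed in the card metadata.
# The "train" split and the "20231101.en" Wikipedia config are assumptions.
gutenberg = load_dataset("sedthh/gutenberg_english", split="train", streaming=True)
wikipedia = load_dataset("wikimedia/wikipedia", "20231101.en", split="train", streaming=True)

print(next(iter(gutenberg)).keys())
```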
This is a base model and is not yet ready for general use. Post-training begins at the end of the month, and the post-trained weights will be uploaded once complete.

GGUF builds are available at https://github.com/openconstruct/libremodel/releases
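
A minimal sketch of running one of those GGUF builds locally with the `llama-cpp-python` bindings; the file name is a placeholder for whichever GGUF you download from the releases page:

```python
from llama_cpp import Llama

# Placeholder file name; substitute the GGUF downloaded from the releases page.
llm = Llama(model_path="libremodel-i.Q8_0.gguf", n_ctx=3072)

# Base model: expect raw text continuation, not instruction following.
out = llm("The history of public domain literature begins", max_tokens=64)
print(out["choices"][0]["text"])
```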