jerrimu committed · Commit 003ffb9 · verified · 1 parent: 0e9037c

Update README.md

Files changed (1): README.md +61 −1
datasets:
- wikimedia/wikipedia
language:
- en
---
# LibreModel I (0.96B)

## Model Description

LibreModel I is a 960M-parameter language model trained exclusively on copyright-free, public-domain data using a novel 4-phase curriculum-learning approach. It demonstrates that competitive language models can be built without relying on copyrighted content, making AI development more accessible and legally unambiguous.

**Key innovation:** the first model to use curriculum learning with exclusively public-domain data, showing that copyright-free training can achieve competitive results at a fraction of typical training costs ($500 total budget).
## Model Details

- **Model type:** Causal language model (GPT-style)
- **Parameters:** 960M (0.96B)
- **Architecture:** LlamaConfig with optimizations
- **Context length:** 3,072 tokens
- **Vocabulary size:** 128,256 (LLaMA 3 tokenizer)
- **Training tokens:** 19.2B (Chinchilla-optimal)
- **Training cost:** ~$500 using AWS spot instances
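The "Chinchilla-optimal" token budget follows the commonly cited rule of thumb of roughly 20 training tokens per model parameter; a quick sanity check of the figure above:

```python
# Chinchilla rule of thumb: ~20 training tokens per parameter.
params = 0.96e9          # 960M parameters
tokens_per_param = 20    # Hoffmann et al. (2022) heuristic
optimal_tokens = params * tokens_per_param
print(f"{optimal_tokens / 1e9:.1f}B tokens")  # 19.2B tokens
```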
### Architecture Features

- **Layers:** 22 transformer layers
- **Attention heads:** 24 query heads, 8 key-value heads (3:1 GQA)
- **Hidden size:** 1,536
- **Sink tokens:** 4 persistent context tokens for improved long-range attention
- **Optimizations:** Flash Attention 2, gradient checkpointing, bf16 mixed precision
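The figures above map onto a LLaMA-style configuration roughly as follows. This is a sketch as a plain config dict, not the model's actual `config.json`: the feed-forward `intermediate_size` is not stated in this card, and the sink-token mechanism is not a stock config field, so neither is shown.

```python
# Sketch of a LLaMA-style config matching the card's figures.
# NOTE: this is illustrative; it is not the released config.json.
config = {
    "vocab_size": 128256,             # LLaMA 3 tokenizer
    "hidden_size": 1536,
    "num_hidden_layers": 22,
    "num_attention_heads": 24,        # query heads
    "num_key_value_heads": 8,         # 3:1 grouped-query attention
    "max_position_embeddings": 3072,  # context length
}

# Derived quantities implied by the numbers above.
head_dim = config["hidden_size"] // config["num_attention_heads"]           # 64
gqa_ratio = config["num_attention_heads"] // config["num_key_value_heads"]  # 3
```

With 24 query heads over a hidden size of 1,536, each head attends in a 64-dimensional subspace, and every group of 3 query heads shares one key-value head.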
34
+ 4-Phase Curriculum Training
35
+ Phase 1: Foundation (0-8%)
36
+
37
+ 70% Project Gutenberg (literature, classics)
38
+ 30% Government Reports (analytical structure)
39
+
40
+ Phase 2: Diversification (8-20%)
41
+
42
+ 50% Project Gutenberg
43
+ 45% Wikipedia (factual knowledge)
44
+ 5% Government Reports
45
+
46
+ Phase 3: Advanced Reasoning (20-40%)
47
+
48
+ 40% Project Gutenberg
49
+ 30% Harvard Legal Cases (logical reasoning)
50
+ 30% Wikipedia
51
+
52
+ Phase 4: Optimization (40-100%)
53
+
54
+ 40% Project Gutenberg
55
+ 30% Wikipedia
56
+ 30% OpenGovernment (diverse analytical content)
57
+
58
+ Note: Harvard legal data was eliminated after 40% due to persistent training instabilities, replaced with OpenGovernment data for better stability while maintaining reasoning patterns.
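The phase schedule above can be sketched as a progress-indexed mixture. This is a minimal illustration: the phase boundaries and weights come from the lists above, but the helper functions themselves are hypothetical, not the project's training code.

```python
import random

# (progress upper bound, {dataset: sampling weight}) for each phase,
# taken from the curriculum description above.
CURRICULUM = [
    (0.08, {"gutenberg": 0.70, "gov_reports": 0.30}),
    (0.20, {"gutenberg": 0.50, "wikipedia": 0.45, "gov_reports": 0.05}),
    (0.40, {"gutenberg": 0.40, "harvard_legal": 0.30, "wikipedia": 0.30}),
    (1.00, {"gutenberg": 0.40, "wikipedia": 0.30, "opengov": 0.30}),
]

def mixture(progress: float) -> dict:
    """Return the dataset mixture at a point in training (0.0 to 1.0)."""
    for upper, weights in CURRICULUM:
        if progress <= upper:
            return weights
    return CURRICULUM[-1][1]

def sample_source(progress: float) -> str:
    """Pick the dataset to draw the next batch from."""
    weights = mixture(progress)
    return random.choices(list(weights), weights=list(weights.values()))[0]
```

For example, `mixture(0.05)` returns the Phase 1 weights, while any progress past 40% draws from the Phase 4 mixture.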
59
+ Training Data Sources (100% Public Domain)
60
+
61
+ Project Gutenberg: Classical literature, philosophy, science texts
62
+ Wikipedia: Encyclopedia articles and factual content
63
+ Government Documents: Policy papers, reports, legal documents
64
+ OpenGovernment: Diverse government publications and analyses
65
+
66
+ Total: ~19.2B tokens across all phases, with careful curation to ensure public domain status.
This is a base model and is not yet ready for end use. Post-training begins 09/20, and we will upload the post-trained model once it is complete.

GGUFs can be found at https://github.com/openconstruct/libremodel/releases