---
license: cc0-1.0
datasets:
- PleIAs/common_corpus
- isaacus/mteb-GovReport
- sedthh/gutenberg_english
- wikimedia/wikipedia
language:
- en
---
# LibreModel I (0.96B)

## Model Description

LibreModel I is a 960M-parameter language model trained exclusively on copyright-free, public-domain data using a novel 4-phase curriculum learning approach. The model demonstrates that competitive language models can be built without relying on copyrighted content, making AI development more accessible and legally clear.

**Key innovation:** the first model to use curriculum learning with exclusively public-domain data, showing that copyright-free training can achieve competitive results at a fraction of typical training costs (~$500 total budget).
## Model Details

- **Model Type:** Causal language model (GPT-style, decoder-only)
- **Parameters:** 960M (0.96B)
- **Architecture:** Llama-style (Hugging Face `LlamaConfig`) with optimizations
- **Context Length:** 3,072 tokens
- **Vocabulary Size:** 128,256 (Llama 3 tokenizer)
- **Training Tokens:** 19.2B (Chinchilla-optimal)
- **Training Cost:** ~$500 using AWS spot instances
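
The 19.2B-token figure is consistent with the common Chinchilla heuristic of roughly 20 training tokens per parameter (the 20x factor is the usual rule of thumb, not a number taken from this card):

```python
# Chinchilla heuristic: ~20 training tokens per model parameter
params = 0.96e9          # 960M parameters
tokens = 20 * params     # 19.2e9 -> 19.2B tokens, matching the stated budget
print(f"{tokens / 1e9:.1f}B tokens")
```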
### Architecture Features

- **Layers:** 22 transformer layers
- **Attention Heads:** 24 total, 8 key-value heads (3:1 GQA)
- **Hidden Size:** 1,536
- **Sink Tokens:** 4 persistent context tokens for improved long-range attention
- **Optimizations:** Flash Attention 2, gradient checkpointing, bf16 mixed precision
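
As an illustrative sketch only, the hyperparameters above map onto a Hugging Face `LlamaConfig` roughly as follows. Values not stated on this card (feed-forward size, RoPE settings) are left at library defaults, and the 4 attention-sink tokens are a training-time detail not expressed in the standard config:

```python
from transformers import LlamaConfig

# Sketch of the published hyperparameters; unstated values (e.g. intermediate
# size, RoPE theta) are left at the transformers defaults.
config = LlamaConfig(
    vocab_size=128_256,            # Llama 3 tokenizer
    hidden_size=1_536,
    num_hidden_layers=22,
    num_attention_heads=24,        # query heads
    num_key_value_heads=8,         # 3:1 grouped-query attention (GQA)
    max_position_embeddings=3_072,
)
```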
## 4-Phase Curriculum Training

### Phase 1: Foundation (0-8%)
- 70% Project Gutenberg (literature, classics)
- 30% Government Reports (analytical structure)

### Phase 2: Diversification (8-20%)
- 50% Project Gutenberg
- 45% Wikipedia (factual knowledge)
- 5% Government Reports

### Phase 3: Advanced Reasoning (20-40%)
- 40% Project Gutenberg
- 30% Harvard Legal Cases (logical reasoning)
- 30% Wikipedia

### Phase 4: Optimization (40-100%)
- 40% Project Gutenberg
- 30% Wikipedia
- 30% OpenGovernment (diverse analytical content)
**Note:** The Harvard legal data was dropped after the 40% mark due to persistent training instabilities and replaced with OpenGovernment data, which trained more stably while preserving similar reasoning-oriented content. The full schedule is sketched as a sampling-weight function below.
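
A minimal sketch of the curriculum as dataset-sampling weights keyed on training progress; the function and dataset keys are illustrative, not the actual training code:

```python
def phase_mix(progress: float) -> dict[str, float]:
    """Return dataset sampling weights for training progress in [0, 1].

    Illustrative reconstruction of the 4-phase curriculum; not the real code.
    """
    if progress < 0.08:   # Phase 1: Foundation
        return {"gutenberg": 0.70, "gov_reports": 0.30}
    if progress < 0.20:   # Phase 2: Diversification
        return {"gutenberg": 0.50, "wikipedia": 0.45, "gov_reports": 0.05}
    if progress < 0.40:   # Phase 3: Advanced Reasoning
        return {"gutenberg": 0.40, "harvard_legal": 0.30, "wikipedia": 0.30}
    # Phase 4: Optimization (Harvard legal swapped for OpenGovernment)
    return {"gutenberg": 0.40, "wikipedia": 0.30, "opengovernment": 0.30}
```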
## Training Data Sources (100% Public Domain)

- **Project Gutenberg:** Classical literature, philosophy, and science texts
- **Wikipedia:** Encyclopedia articles and factual content
- **Government Documents:** Policy papers, reports, and legal documents
- **OpenGovernment:** Diverse government publications and analyses

**Total:** ~19.2B tokens across all phases, with careful curation to ensure public-domain status.
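
For reference, two of the Hugging Face datasets listed in the card metadata can be streamed as in the sketch below; the `train` split and the Wikipedia config are assumptions, not values taken from this card:

```python
from datasets import load_dataset

# Streaming loads of two source datasets listed in the card metadata.
# The "train" split and the "20231101.en" Wikipedia config are assumptions.
gutenberg = load_dataset("sedthh/gutenberg_english", split="train", streaming=True)
wikipedia = load_dataset("wikimedia/wikipedia", "20231101.en", split="train", streaming=True)

print(next(iter(gutenberg)).keys())
```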
This is a base model and is not yet ready for general use. Post-training begins at the end of the month, and the post-trained weights will be uploaded once complete.

GGUF builds are available at https://github.com/openconstruct/libremodel/releases
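
A minimal sketch of running one of those GGUF builds locally with the `llama-cpp-python` bindings; the file name is a placeholder for whichever GGUF you download from the releases page:

```python
from llama_cpp import Llama

# Placeholder file name; substitute the GGUF downloaded from the releases page.
llm = Llama(model_path="libremodel-i.Q8_0.gguf", n_ctx=3072)

# Base model: expect raw text continuation, not instruction following.
out = llm("The history of public domain literature begins", max_tokens=64)
print(out["choices"][0]["text"])
```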