---
license: cc0-1.0
datasets:
- PleIAs/common_corpus
- isaacus/mteb-GovReport
- sedthh/gutenberg_english
- wikimedia/wikipedia
language:
- en
---

# LibreModel I (0.96B)

## Model Description
LibreModel I is a 960M parameter language model trained exclusively on copyright-free, public domain data using a novel 4-phase curriculum learning approach. This model demonstrates that competitive language models can be built without relying on copyrighted content, making AI development more accessible and legally clear.
**Key Innovation:** To our knowledge, this is the first model trained via curriculum learning on exclusively public-domain data, showing that copyright-free training can achieve competitive results at a fraction of typical training costs (roughly $500 total).
## Model Details

- **Model Type:** Causal Language Model (GPT-style)
- **Parameters:** 960M (0.96B)
- **Architecture:** Llama-style (Hugging Face `LlamaConfig`) with the optimizations listed below
- **Context Length:** 3,072 tokens
- **Vocabulary Size:** 128,256 (LLaMA 3 tokenizer)
- **Training Tokens:** 19.2B (≈20 tokens per parameter, Chinchilla-optimal)
- **Training Cost:** ~$500 using AWS spot instances

### Architecture Features

- **Layers:** 22 transformer layers
- **Attention Heads:** 24 query heads, 8 key-value heads (3:1 grouped-query attention)
- **Hidden Size:** 1,536
- **Sink Tokens:** 4 persistent context tokens for improved long-range attention (see the masking sketch below)
- **Optimizations:** Flash Attention 2, gradient checkpointing, bf16 mixed precision
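
Taken together, these numbers pin down most of a Hugging Face `LlamaConfig`. The sketch below is illustrative only: `intermediate_size` is not stated in this card and is a guess, and the sink-token mechanism is not part of the stock config.

```python
from transformers import LlamaConfig, LlamaForCausalLM

# Illustrative reconstruction of the stated geometry. Fields not listed in the
# card (notably intermediate_size) are assumptions, not the released values.
config = LlamaConfig(
    vocab_size=128_256,            # LLaMA 3 tokenizer
    hidden_size=1536,
    num_hidden_layers=22,
    num_attention_heads=24,        # 24 query heads
    num_key_value_heads=8,         # 3:1 grouped-query attention
    intermediate_size=4096,        # assumption: not stated in the card
    max_position_embeddings=3072,  # 3,072-token context
)

model = LlamaForCausalLM(config)
total = sum(p.numel() for p in model.parameters())
print(f"{total / 1e9:.2f}B parameters")  # ~0.95B under these assumptions
```

With `intermediate_size=4096` and untied embeddings this comes to roughly 0.95B parameters, consistent with the stated 0.96B.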

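The card does not specify how the sink tokens are implemented. A common reading is a StreamingLLM-style scheme in which the first few positions stay visible to every query even when a sliding window would evict them; here is a minimal sketch of that masking idea, under that assumption:

```python
import torch

def sink_attention_mask(seq_len: int, window: int, num_sinks: int = 4) -> torch.Tensor:
    """Boolean mask where entry (i, j) is True if query i may attend to key j.

    Causal attention restricted to a sliding window, except that the first
    `num_sinks` positions (the sink tokens) remain visible to every query.
    """
    i = torch.arange(seq_len).unsqueeze(1)  # query positions (column)
    j = torch.arange(seq_len).unsqueeze(0)  # key positions (row)
    causal = j <= i
    in_window = (i - j) < window
    is_sink = j < num_sinks
    return causal & (in_window | is_sink)
```

Whether the released model realizes this as an attention mask or as a KV-cache eviction policy is not stated in the card.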
## 4-Phase Curriculum Training
### Phase 1: Foundation (0-8% of training)

- 70% Project Gutenberg (literature, classics)
- 30% Government Reports (analytical structure)

### Phase 2: Diversification (8-20%)

- 50% Project Gutenberg
- 45% Wikipedia (factual knowledge)
- 5% Government Reports

### Phase 3: Advanced Reasoning (20-40%)

- 40% Project Gutenberg
- 30% Harvard Legal Cases (logical reasoning)
- 30% Wikipedia

### Phase 4: Optimization (40-100%)

- 40% Project Gutenberg
- 30% Wikipedia
- 30% OpenGovernment (diverse analytical content)

**Note:** Harvard legal-case data was dropped after the 40% mark because of persistent training instabilities; it was replaced with OpenGovernment data, which trained more stably while preserving similar analytical reasoning patterns. The full schedule is sketched in code below.
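
For concreteness, the schedule above can be written as a progress-indexed mixture table. This is an illustrative sketch, not the released training pipeline; the source names and the sampling mechanism are assumptions.

```python
import random

# (upper bound on training progress, {source: sampling weight})
PHASES = [
    (0.08, {"gutenberg": 0.70, "gov_reports": 0.30}),
    (0.20, {"gutenberg": 0.50, "wikipedia": 0.45, "gov_reports": 0.05}),
    (0.40, {"gutenberg": 0.40, "harvard_legal": 0.30, "wikipedia": 0.30}),
    (1.00, {"gutenberg": 0.40, "wikipedia": 0.30, "open_government": 0.30}),
]

def mixture_for(progress: float) -> dict:
    """Return the source-sampling weights at a training progress in [0, 1]."""
    for upper, weights in PHASES:
        if progress <= upper:
            return weights
    return PHASES[-1][1]

rng = random.Random(0)

def sample_source(progress: float) -> str:
    """Draw the source of the next training document at a given progress."""
    weights = mixture_for(progress)
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]

print(sample_source(0.05))  # early in training: dominated by "gutenberg"
```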
## Training Data Sources (100% Public Domain)

- **Project Gutenberg:** Classical literature, philosophy, and science texts
- **Wikipedia:** Encyclopedia articles and factual content
- **Government Documents:** Policy papers, reports, and legal documents
- **OpenGovernment:** Diverse government publications and analyses

**Total:** ~19.2B tokens across all phases, with careful curation to ensure public domain status.
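
The frontmatter of this card lists public Hugging Face datasets corresponding to these sources. As a rough illustration (not the actual preprocessed corpus), a Phase 4-style mixture could be streamed as below; the splits, configs, and column names are assumptions about those repositories.

```python
from datasets import load_dataset, interleave_datasets

def stream_text(repo, column, **kwargs):
    # Stream a dataset and reduce it to a single "text" column so that
    # differently-shaped sources can be interleaved.
    ds = load_dataset(repo, split="train", streaming=True, **kwargs)
    return ds.map(lambda ex: {"text": ex[column]}).select_columns(["text"])

mix = interleave_datasets(
    [
        stream_text("sedthh/gutenberg_english", "TEXT"),  # column name assumed
        stream_text("wikimedia/wikipedia", "text", name="20231101.en"),
        stream_text("isaacus/mteb-GovReport", "text"),    # schema assumed
    ],
    probabilities=[0.40, 0.30, 0.30],  # the Phase 4 weights
    seed=42,
)

print(next(iter(mix))["text"][:200])
```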


This is a base model and is not yet ready for general use. Post-training begins at the end of the month, and the post-trained checkpoints will be uploaded once complete.
GGUF builds are available at https://github.com/openconstruct/libremodel/releases
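
A minimal loading sketch using `llama-cpp-python`; the GGUF filename below is a placeholder, so substitute the actual file from the release page. Since this is a base model, prompt it for plain continuation rather than chat:

```python
from llama_cpp import Llama

# model_path is a placeholder; use the actual filename from the releases page.
llm = Llama(model_path="libremodel-i.Q8_0.gguf", n_ctx=3072)

out = llm("The history of the printing press begins", max_tokens=64, temperature=0.8)
print(out["choices"][0]["text"])
```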