jerrimu committed · Commit 003ffb9 · verified · 1 parent: 0e9037c

Update README.md

Files changed (1): README.md +61 −1
datasets:
- wikimedia/wikipedia
language:
- en
---
# LibreModel I (0.96B)

## Model Description

LibreModel I is a 960M-parameter language model trained exclusively on copyright-free, public-domain data using a novel 4-phase curriculum-learning approach. It demonstrates that competitive language models can be built without relying on copyrighted content, making AI development more accessible and legally unambiguous.

**Key innovation:** the first model to use curriculum learning with exclusively public-domain data, showing that copyright-free training can achieve competitive results at a fraction of typical training costs ($500 total budget).
## Model Details

- **Model type:** Causal language model (GPT-style)
- **Parameters:** 960M (0.96B)
- **Architecture:** LlamaConfig with optimizations
- **Context length:** 3,072 tokens
- **Vocabulary size:** 128,256 (LLaMA 3 tokenizer)
- **Training tokens:** 19.2B (Chinchilla-optimal)
- **Training cost:** ~$500 using AWS spot instances
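The "Chinchilla-optimal" token budget follows the commonly cited rule of thumb of roughly 20 training tokens per model parameter; a quick sanity check of the figure above:

```python
# Chinchilla rule of thumb: ~20 training tokens per parameter.
params = 0.96e9          # 960M parameters
tokens_per_param = 20    # Hoffmann et al. (2022) heuristic
optimal_tokens = params * tokens_per_param
print(f"{optimal_tokens / 1e9:.1f}B tokens")  # 19.2B tokens
```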
### Architecture Features

- **Layers:** 22 transformer layers
- **Attention heads:** 24 query heads, 8 key-value heads (3:1 GQA)
- **Hidden size:** 1,536
- **Sink tokens:** 4 persistent context tokens for improved long-range attention
- **Optimizations:** Flash Attention 2, gradient checkpointing, bf16 mixed precision
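The figures above map onto a LLaMA-style configuration roughly as follows. This is a sketch as a plain config dict, not the model's actual `config.json`: the feed-forward `intermediate_size` is not stated in this card, and the sink-token mechanism is not a stock config field, so neither is shown.

```python
# Sketch of a LLaMA-style config matching the card's figures.
# NOTE: this is illustrative; it is not the released config.json.
config = {
    "vocab_size": 128256,             # LLaMA 3 tokenizer
    "hidden_size": 1536,
    "num_hidden_layers": 22,
    "num_attention_heads": 24,        # query heads
    "num_key_value_heads": 8,         # 3:1 grouped-query attention
    "max_position_embeddings": 3072,  # context length
}

# Derived quantities implied by the numbers above.
head_dim = config["hidden_size"] // config["num_attention_heads"]           # 64
gqa_ratio = config["num_attention_heads"] // config["num_key_value_heads"]  # 3
```

With 24 query heads over a hidden size of 1,536, each head attends in a 64-dimensional subspace, and every group of 3 query heads shares one key-value head.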
34
+ 4-Phase Curriculum Training
35
+ Phase 1: Foundation (0-8%)
36
+
37
+ 70% Project Gutenberg (literature, classics)
38
+ 30% Government Reports (analytical structure)
39
+
40
+ Phase 2: Diversification (8-20%)
41
+
42
+ 50% Project Gutenberg
43
+ 45% Wikipedia (factual knowledge)
44
+ 5% Government Reports
45
+
46
+ Phase 3: Advanced Reasoning (20-40%)
47
+
48
+ 40% Project Gutenberg
49
+ 30% Harvard Legal Cases (logical reasoning)
50
+ 30% Wikipedia
51
+
52
+ Phase 4: Optimization (40-100%)
53
+
54
+ 40% Project Gutenberg
55
+ 30% Wikipedia
56
+ 30% OpenGovernment (diverse analytical content)
57
+
58
+ Note: Harvard legal data was eliminated after 40% due to persistent training instabilities, replaced with OpenGovernment data for better stability while maintaining reasoning patterns.
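The phase schedule above can be sketched as a progress-indexed mixture. This is a minimal illustration: the phase boundaries and weights come from the lists above, but the helper functions themselves are hypothetical, not the project's training code.

```python
import random

# (progress upper bound, {dataset: sampling weight}) for each phase,
# taken from the curriculum description above.
CURRICULUM = [
    (0.08, {"gutenberg": 0.70, "gov_reports": 0.30}),
    (0.20, {"gutenberg": 0.50, "wikipedia": 0.45, "gov_reports": 0.05}),
    (0.40, {"gutenberg": 0.40, "harvard_legal": 0.30, "wikipedia": 0.30}),
    (1.00, {"gutenberg": 0.40, "wikipedia": 0.30, "opengov": 0.30}),
]

def mixture(progress: float) -> dict:
    """Return the dataset mixture at a point in training (0.0 to 1.0)."""
    for upper, weights in CURRICULUM:
        if progress <= upper:
            return weights
    return CURRICULUM[-1][1]

def sample_source(progress: float) -> str:
    """Pick the dataset to draw the next batch from."""
    weights = mixture(progress)
    return random.choices(list(weights), weights=list(weights.values()))[0]
```

For example, `mixture(0.05)` returns the Phase 1 weights, while any progress past 40% draws from the Phase 4 mixture.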
59
+ Training Data Sources (100% Public Domain)
60
+
61
+ Project Gutenberg: Classical literature, philosophy, science texts
62
+ Wikipedia: Encyclopedia articles and factual content
63
+ Government Documents: Policy papers, reports, legal documents
64
+ OpenGovernment: Diverse government publications and analyses
65
+
66
+ Total: ~19.2B tokens across all phases, with careful curation to ensure public domain status.
This is a base model and is not yet ready for end use. Post-training begins 09/20, and we will upload the post-trained model once it is complete.

GGUFs can be found at https://github.com/openconstruct/libremodel/releases