algorythmtechnologies committed (verified) · Commit 82d25b9 · 1 parent: 6ce3b41

Update README.md

Files changed (1): README.md (+137, -130)
---
license: apache-2.0
language:
- en
base_model:
- algorythmtechnologies/suslm
---

# Supernova (25M) by AlgoRythm Technologies

**Enhanced AI Assistant with Tool Integration**

Supernova is a 25,000,000-parameter decoder-only Transformer, built from scratch, that uses the GPT-2 tokenizer (vocab size 50,257) and meets an exact parameter budget: 25,000,000 parameters, not one more.

**🚀 Enhanced with Advanced AI Capabilities:**
- **🧠 Advanced Reasoning Engine**: Multi-step problem solving, knowledge synthesis, domain expertise analysis
- **📊 Math Engine Integration**: Advanced mathematical computations, scientific calculations, engineering equations
- **🔍 Serper Web Search**: Real-time information, current events, factual queries
- **🎓 Multi-Domain Expertise**: Science, Technology, Medicine, Business, Humanities, Arts
- **⚡ Smart Tool Coordination**: Intelligent routing and chaining of multiple tools for complex queries
- **🔬 Sophisticated Analysis**: Context-aware responses with evidence synthesis and comprehensive reasoning

Key specs:
- Exact params: 25,000,000
- Tokenizer: GPT-2 (vocab_size = 50,257)
- d_model: 320
- n_layers: 6
- n_heads: 10 (head_dim = 32)
- n_positions: 4,748 (learned positional embeddings)
- MLP ratio: 4.0 (hidden_size = 4 × d_model)
- Weight tying: yes (LM head shares token embedding weights; no LM head bias)
- Dropout: configurable (default 0.1)

Why these numbers? With the GPT-2 vocab, tied output head, and learned positional embeddings fixed, n_positions = 4,748 is chosen so the total parameter count lands on exactly 25,000,000.

Parameter proof sketch (matches code):
- Token embeddings: 50,257 × 320 = 16,082,240
- Positional embeddings: 4,748 × 320 = 1,519,360
- Per block: 12·d² + 13·d = 12·(320²) + 13·320 = 1,228,800 + 4,160 = 1,232,960
- 6 blocks total: 7,397,760
- Final LayerNorm: 2·d = 640
- Total = 16,082,240 + 1,519,360 + 7,397,760 + 640 = 25,000,000

The verification script (supernova/verify_params.py) asserts this at runtime.
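
The arithmetic above can be checked in a few lines of Python. This is a standalone recomputation of the proof sketch, not the repo's verify_params.py:

```python
# Recompute the Supernova (25M) parameter budget from the specs above.
VOCAB, D, LAYERS, POS = 50_257, 320, 6, 4_748

tok_emb = VOCAB * D             # token embeddings (tied with the LM head, so counted once)
pos_emb = POS * D               # learned positional embeddings
# 12*d^2 = QKV (3d^2) + attn out (d^2) + MLP up (4d^2) + MLP down (4d^2)
# 13*d   = linear biases (3d + d + 4d + d) + two LayerNorms (2*2d)
per_block = 12 * D**2 + 13 * D
final_ln = 2 * D                # final LayerNorm (weight + bias)

total = tok_emb + pos_emb + LAYERS * per_block + final_ln
print(total)  # 25000000
```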

Brand behavior:
- The chat wrapper returns the AlgoRythm Tech – Company Profile & Vision text (branding/ALGORHYTHM_TECH_PROFILE.txt) whenever a prompt asks about AlgoRythm Tech, the company profile, or its vision.

Caution on scope:
- No single model "knows everything that happened in the world"; instead, this repo provides a scalable pipeline for training on broad, diverse, and massive text corpora. You control the data sources via a YAML config.

## Quickstart

1) Install dependencies (Windows PowerShell)
- Ensure Python 3.10+ is installed.
- Navigate to the project and install dependencies:

```powershell
cd C:\Users\sriaa\supernova
pip install -r requirements.txt
```

- If the PyTorch wheel needs a specific index (GPU/CPU), follow https://pytorch.org/get-started/locally/

2) Verify the exact parameter count and tokenizer vocabulary size

```powershell
python -m supernova.verify_params --config .\configs\supernova_25m.json
```

Expected output includes:
- vocab_size: 50257
- total_params: 25000000 (EXACT)

3) Prepare the data config (comprehensive knowledge training)
- For comprehensive coverage across all subjects:

```powershell
copy .\configs\comprehensive_data_sources.yaml .\configs\data_sources.yaml
```

- Or for a basic setup:

```powershell
copy .\configs\data_sources.example.yaml .\configs\data_sources.yaml
```

- Edit the file and enable/disable the sources you want. Many are large and require significant bandwidth.

4) Train (logs the gradient norm and uses a cosine LR schedule with warmup)

```powershell
# PowerShell line continuation is the backtick (cmd.exe would use ^)
python -m supernova.train `
  --config .\configs\supernova_25m.json `
  --data-config .\configs\data_sources.yaml `
  --seq-len 1024 `
  --batch-size 16 `
  --grad-accum 8 `
  --lr 3e-4 `
  --warmup-steps 2000 `
  --max-steps 100000 `
  --save-every 10000
```

Notes:
- The gradient norm is printed regularly (no clipping by default).
- Adjust batch size, gradient accumulation, and sequence length to your hardware.
- A cosine decay schedule with warmup is applied.
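
The schedule is simple to state: linear warmup to the base LR over the first 2,000 steps, then cosine decay to zero by max-steps. A minimal sketch of that shape (the exact floor and implementation in supernova.train may differ):

```python
import math

def lr_at(step, base_lr=3e-4, warmup_steps=2000, max_steps=100_000):
    """Linear warmup to base_lr, then cosine decay to zero."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps          # linear ramp from 0
    progress = (step - warmup_steps) / (max_steps - warmup_steps)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))  # cosine decay

print(lr_at(2000))     # 0.0003  (peak, end of warmup)
print(lr_at(100_000))  # 0.0     (end of training)
```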

5) Advanced chat with enhanced reasoning (brand-aware; post-training)

API keys are already configured in configs/api_keys.yaml:
- Math Engine: built-in SymPy-based mathematical computation (no API key needed)
- Serper: web search API configured

```powershell
# Advanced interactive chat with sophisticated reasoning
python .\chat_advanced.py --config .\configs\supernova_25m.json

# Single-prompt mode with advanced analysis
python .\chat_advanced.py --config .\configs\supernova_25m.json --prompt "Analyze the implications of artificial intelligence on healthcare from multiple perspectives"

# Basic enhanced chat (legacy)
python .\chat_enhanced.py --config .\configs\supernova_25m.json
```

Query routing:
- **🧐 Complex reasoning queries** → Multi-step analysis using the reasoning engine
- **📊 Mathematical queries** → Routed to the math engine for precise calculations
- **🔍 Current events/facts** → Routed to Serper for real-time web search
- **🏢 AlgoRythm Tech queries** → Returns the company profile
- **📚 Multi-domain questions** → Synthesizes expertise across scientific, technical, and academic fields
- **🎓 General knowledge** → Enhanced model generation with sophisticated context
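
A routing table like the one above can be approximated with simple intent matching. The real dispatch logic in chat_advanced.py is not shown here; this is only an illustrative keyword-based sketch, and every keyword list is an assumption:

```python
# Hypothetical sketch of query-to-tool routing; keyword lists are illustrative only.
ROUTES = {
    "math":   ("solve", "calculate", "integral", "equation"),
    "search": ("latest", "today", "news", "current"),
    "brand":  ("algorythm",),
}

def route(query: str) -> str:
    q = query.lower()
    for tool, keywords in ROUTES.items():
        if any(k in q for k in keywords):
            return tool
    return "model"  # fall back to plain model generation

print(route("Solve the equation x^2 = 4"))        # math
print(route("What happened in the news today?"))  # search
print(route("Tell me about AlgoRythm Tech"))      # brand
print(route("Explain photosynthesis"))            # model
```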

## Data sources (broad options)
- Included in configs/data_sources.example.yaml. Examples (enable selectively):
  - c4/en (Colossal Clean Crawled Corpus)
  - wikipedia/en
  - openwebtext
  - bookcorpusopen
  - the_pile

Notes:
- Review the license and terms of each dataset.
- You can add your own sources. The pipeline streams datasets and interleaves them by weight.
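
Weighted interleaving means each next example is drawn from source i with probability proportional to its weight. The pipeline's own implementation is not shown in this README; here is a minimal pure-Python sketch of the idea, with made-up source names and weights:

```python
import itertools
import random

def interleave(streams, weights, seed=0):
    """Yield (source, example) pairs, picking the next source with probability ~ weight."""
    rng = random.Random(seed)
    names = list(streams)  # weights are aligned with this order
    while True:
        name = rng.choices(names, weights=weights, k=1)[0]
        yield name, next(streams[name])

# Toy infinite "streams" standing in for real corpora.
streams = {"wikipedia": itertools.count(), "c4": itertools.count()}
sample = list(itertools.islice(interleave(streams, weights=[0.2, 0.8]), 1000))
picks = [name for name, _ in sample]
print(picks.count("c4") > picks.count("wikipedia"))  # True: c4 has 4x the weight
```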

## Training details
- Optimizer: AdamW (betas=(0.9, 0.95), weight_decay=0.1)
- LR schedule: cosine decay with warmup
- Gradient norm: computed and printed at every log step
- Mixed precision: optional (bf16/fp16) where available
- Checkpointing: periodic saves to the output directory
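
The logged gradient norm is the global L2 norm over all parameter gradients: the square root of the summed squares across every gradient tensor. A framework-free sketch of the computation (the training code itself operates on the model's actual gradient tensors):

```python
import math

def global_grad_norm(grads):
    """L2 norm over all gradient values, flattened across parameter tensors."""
    return math.sqrt(sum(g * g for tensor in grads for g in tensor))

# Two toy "gradient tensors": norm = sqrt(3^2 + 0^2 + 4^2) = 5.0
print(global_grad_norm([[3.0], [0.0, 4.0]]))  # 5.0
```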

## Brand profile
- File: branding/ALGORHYTHM_TECH_PROFILE.txt
- The chat wrapper uses this exact text for company-related queries.

## License
- Apache 2.0 (see LICENSE)

## Attribution
- Built by AlgoRythm Technologies.