---
license: apache-2.0
language:
- en
base_model:
- algorythmtechnologies/Supernova25million
---
# Supernova (25M) — AlgoRythm Technologies

**Enhanced AI Assistant with Tool Integration**

Supernova is a decoder-only Transformer built from scratch to an exact budget of 25,000,000 parameters (not one more), using the GPT‑2 tokenizer (vocab size 50,257).

**🚀 Enhanced with Advanced AI Capabilities:**
- **🧠 Advanced Reasoning Engine**: Multi-step problem solving, knowledge synthesis, domain expertise analysis
- **📊 Math Engine Integration**: Advanced mathematical computations, scientific calculations, engineering equations
- **🔍 Serper Web Search**: Real-time information, current events, factual queries
- **🎓 Multi-Domain Expertise**: Science, Technology, Medicine, Business, Humanities, Arts
- **⚡ Smart Tool Coordination**: Intelligent routing and chaining of multiple tools for complex queries
- **🔬 Sophisticated Analysis**: Context-aware responses with evidence synthesis and comprehensive reasoning

Key specs:
- Exact params: 25,000,000
- Tokenizer: GPT‑2 (vocab_size = 50,257)
- d_model: 320
- n_layers: 6
- n_heads: 10 (head_dim = 32)
- n_positions: 4,748 (learned positional embeddings)
- MLP ratio: 4.0 (hidden_size = 4 × d_model)
- Weight tying: yes (LM head shares token embedding weights; no LM head bias)
- Dropout: configurable (default 0.1)
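
For reference, a configuration matching these specs might look like the following. The field names are assumptions based on common GPT‑2-style configs; consult `configs/supernova_25m.json` for the actual schema:

```json
{
  "vocab_size": 50257,
  "d_model": 320,
  "n_layers": 6,
  "n_heads": 10,
  "n_positions": 4748,
  "mlp_ratio": 4.0,
  "dropout": 0.1,
  "tie_weights": true
}
```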

Why these numbers? They are chosen so that the total parameter count equals exactly 25,000,000 with GPT‑2 vocab size, using learned positional embeddings and tied output head.

Parameter proof sketch (matches code):
- Token embeddings: 50,257 × 320 = 16,082,240
- Positional embeddings: 4,748 × 320 = 1,519,360
- Per block: 12·d^2 + 13·d = 12·(320^2) + 13·320 = 1,228,800 + 4,160 = 1,232,960
- 6 blocks total: 7,397,760
- Final LayerNorm: 2·d = 640
- Total = 16,082,240 + 1,519,360 + 7,397,760 + 640 = 25,000,000
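
The arithmetic above can be reproduced in a few lines of Python (this mirrors the derivation only; it does not import the model code):

```python
# Recompute the parameter budget from the architecture constants.
V, P, d, L = 50_257, 4_748, 320, 6  # vocab, positions, d_model, layers

tok_emb = V * d                  # 16,082,240 token embeddings (tied with LM head)
pos_emb = P * d                  # 1,519,360 learned positional embeddings
per_block = 12 * d**2 + 13 * d   # attention + MLP + two LayerNorms per block
blocks = L * per_block           # 7,397,760 across 6 blocks
final_ln = 2 * d                 # 640 (final LayerNorm weight + bias)

total = tok_emb + pos_emb + blocks + final_ln
print(total)  # 25000000
```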

The verification script (supernova/verify_params.py) asserts this at runtime.

Brand behavior:
- The chat wrapper will return the AlgoRythm Tech – Company Profile & Vision text (branding/ALGORHYTHM_TECH_PROFILE.txt) when a prompt asks about AlgoRythm Tech/company profile/vision.

Caution on scope:
- “Knows everything that happened in the world” is not achievable in a single model; instead, this repo provides a scalable pipeline to train on broad, diverse, and massive text corpora. You control the data sources via a YAML config.

Quickstart

1) Install dependencies (Windows PowerShell)
- Ensure Python 3.10+ is installed
- Navigate to the project
  cd C:\Users\sriaa\supernova
- Install dependencies
  pip install -r requirements.txt
- If PyTorch wheel needs a specific index (GPU/CPU), follow https://pytorch.org/get-started/locally/

2) Verify exact parameter count and tokenizer vocabulary size
  python -m supernova.verify_params --config .\configs\supernova_25m.json
Expected output includes:
- vocab_size: 50257
- total_params: 25000000 (EXACT)

3) Prepare data config (comprehensive knowledge training)
- For comprehensive coverage across all subjects:
  copy .\configs\comprehensive_data_sources.yaml .\configs\data_sources.yaml
- Or for basic setup:
  copy .\configs\data_sources.example.yaml .\configs\data_sources.yaml
- Edit the file and enable/disable sources you want. Many are large and require significant bandwidth.

4) Train (logs gradient norm and uses a strong LR schedule)
  python -m supernova.train ^
    --config .\configs\supernova_25m.json ^
    --data-config .\configs\data_sources.yaml ^
    --seq-len 1024 ^
    --batch-size 16 ^
    --grad-accum 8 ^
    --lr 3e-4 ^
    --warmup-steps 2000 ^
    --max-steps 100000 ^
    --save-every 10000
Notes:
- Gradient norm is printed regularly (no clipping by default).
- Adjust batch size, gradient accumulation, and sequence length to fit your hardware.
- Cosine decay schedule with warmup is applied.
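
As a rough illustration of the schedule described above (linear warmup, then cosine decay), here is a standalone sketch; the actual implementation in `supernova.train` may differ in details such as the floor LR:

```python
import math

def lr_at(step, max_lr=3e-4, warmup_steps=2000, max_steps=100_000, min_lr=0.0):
    """Linear warmup to max_lr, then cosine decay to min_lr (a sketch)."""
    if step < warmup_steps:
        return max_lr * step / warmup_steps
    progress = (step - warmup_steps) / (max_steps - warmup_steps)
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))

print(lr_at(0))        # 0.0
print(lr_at(2000))     # 3e-4 (peak, at end of warmup)
print(lr_at(100_000))  # 0.0 (fully decayed)
```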

5) Advanced Chat with Enhanced Reasoning (brand-aware; post-training)
  # API keys are already configured in configs/api_keys.yaml
  # - Math Engine: Built-in SymPy-based mathematical computation (no API key needed)
  # - Serper: Web search API configured
  
  # Advanced interactive chat with sophisticated reasoning
  python .\chat_advanced.py --config .\configs\supernova_25m.json
  
  # Single prompt mode with advanced analysis
  python .\chat_advanced.py --config .\configs\supernova_25m.json --prompt "Analyze the implications of artificial intelligence on healthcare from multiple perspectives"
  
  # Basic enhanced chat (legacy)
  python .\chat_enhanced.py --config .\configs\supernova_25m.json
  
- **🧐 Complex reasoning queries** → Multi-step analysis using reasoning engine
- **📊 Mathematical queries** → Routed to math engine for precise calculations
- **🔍 Current events/facts** → Routed to Serper for real-time web search
- **🏢 AlgoRythm Tech queries** → Returns company profile
- **📚 Multi-domain questions** → Synthesizes expertise across scientific, technical, and academic fields
- **🎓 General knowledge** → Enhanced model generation with sophisticated context
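
The routing behavior listed above could be approximated by a simple keyword dispatcher. This is an illustrative sketch, not the actual logic in `chat_advanced.py`; the route names and keywords are hypothetical:

```python
# Hypothetical keyword-based tool router (illustrative only).
ROUTES = [
    ("brand",  ("algorythm", "company profile", "vision")),
    ("math",   ("calculate", "solve", "equation", "integral")),
    ("search", ("latest", "today", "current", "news")),
]

def route(prompt: str) -> str:
    """Return the first matching tool name, else fall back to the model."""
    text = prompt.lower()
    for tool, keywords in ROUTES:
        if any(k in text for k in keywords):
            return tool
    return "model"

print(route("What is AlgoRythm Tech's vision?"))  # brand
print(route("Solve this equation for x"))         # math
print(route("Tell me a story"))                   # model
```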

Data sources (broad options)
- Included in configs/data_sources.example.yaml. Example (enable selectively):
  - c4/en (Colossal Clean Crawled Corpus)
  - wikipedia/en
  - openwebtext
  - bookcorpusopen
  - the_pile
Notes:
- Review licenses and terms of each dataset.
- You can add your own sources. The pipeline streams and interleaves by weight.
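
The "streams and interleaves by weight" behavior can be sketched in plain Python. The real pipeline presumably streams from remote datasets, but the weighted-sampling idea is the same; source names here are illustrative:

```python
import random

def interleave(streams, weights, n, seed=0):
    """Yield n items, picking each source with probability proportional to its weight."""
    rng = random.Random(seed)
    names = list(streams)
    its = {name: iter(streams[name]) for name in names}
    out = []
    for _ in range(n):
        name = rng.choices(names, weights=weights, k=1)[0]
        out.append((name, next(its[name])))
    return out

# Two toy "corpora" mixed with a 3:1 weight.
sample = interleave(
    {"wikipedia": range(100), "books": range(100)},
    weights=[3, 1], n=8,
)
print(sample)
```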

Training details
- Optimizer: AdamW (betas=(0.9, 0.95), weight_decay=0.1)
- LR schedule: cosine decay with warmup
- Gradient norm: computed every log step and printed
- Mixed precision: optional (bf16/fp16) if available
- Checkpointing: periodic saving to output directory
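
The global gradient norm reported in the logs is simply the L2 norm over all parameter gradients flattened together. A framework-free sketch of the computation (the actual code operates on torch tensors):

```python
import math

def global_grad_norm(grads):
    """L2 norm across all gradient values, flattened (a sketch)."""
    return math.sqrt(sum(g * g for vec in grads for g in vec))

# Two "parameter tensors" represented as flat lists of gradient values.
print(global_grad_norm([[3.0, 4.0], [0.0]]))  # 5.0
```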

Brand profile
- File: branding/ALGORHYTHM_TECH_PROFILE.txt
- The chat wrapper uses this exact text for company-related queries.

License
- Apache 2.0 (see LICENSE)

Attribution
- Built by AlgoRythm Technologies.