--- |
|
|
license: apache-2.0 |
|
|
language: |
|
|
- en |
|
|
base_model: |
|
|
- algorythmtechnologies/Supernova25million |
|
|
--- |
|
|
# Supernova (25M) — AlgoRythm Technologies |
|
|
|
|
|
**Enhanced AI Assistant with Tool Integration** |
|
|
|
|
|
Supernova is a decoder-only Transformer built from scratch with exactly 25,000,000 parameters, using the GPT‑2 tokenizer (vocab size 50,257). The parameter budget is met exactly, down to the last parameter. |
|
|
|
|
|
**🚀 Enhanced with Advanced AI Capabilities:** |
|
|
- **🧠 Advanced Reasoning Engine**: Multi-step problem solving, knowledge synthesis, domain expertise analysis |
|
|
- **📊 Math Engine Integration**: Advanced mathematical computations, scientific calculations, engineering equations |
|
|
- **🔍 Serper Web Search**: Real-time information, current events, factual queries |
|
|
- **🎓 Multi-Domain Expertise**: Science, Technology, Medicine, Business, Humanities, Arts |
|
|
- **⚡ Smart Tool Coordination**: Intelligent routing and chaining of multiple tools for complex queries |
|
|
- **🔬 Sophisticated Analysis**: Context-aware responses with evidence synthesis and comprehensive reasoning |
|
|
|
|
|
Key specs: |
|
|
- Exact params: 25,000,000 |
|
|
- Tokenizer: GPT‑2 (vocab_size = 50,257) |
|
|
- d_model: 320 |
|
|
- n_layers: 6 |
|
|
- n_heads: 10 (head_dim = 32) |
|
|
- n_positions: 4,748 (learned positional embeddings) |
|
|
- MLP ratio: 4.0 (hidden_size = 4 × d_model) |
|
|
- Weight tying: yes (LM head shares token embedding weights; no LM head bias) |
|
|
- Dropout: configurable (default 0.1) |
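
The derived dimensions in the spec list can be checked with a few lines of Python. The dictionary below simply mirrors the specs above; the actual field names in configs/supernova_25m.json may differ:

```python
# Config dict mirroring the spec list above (field names are illustrative;
# the real configs/supernova_25m.json may use different keys).
config = {
    "vocab_size": 50_257,
    "d_model": 320,
    "n_layers": 6,
    "n_heads": 10,
    "n_positions": 4_748,
    "mlp_ratio": 4.0,
    "dropout": 0.1,
    "tie_weights": True,
}

head_dim = config["d_model"] // config["n_heads"]          # 320 / 10 = 32
mlp_hidden = int(config["mlp_ratio"] * config["d_model"])  # 4 * 320 = 1280

print(head_dim, mlp_hidden)  # 32 1280
```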
|
|
|
|
|
Why these numbers? They are chosen so that the total parameter count equals exactly 25,000,000 with GPT‑2 vocab size, using learned positional embeddings and tied output head. |
|
|
|
|
|
Parameter proof sketch (matches code): |
|
|
- Token embeddings: 50,257 × 320 = 16,082,240 |
|
|
- Positional embeddings: 4,748 × 320 = 1,519,360 |
|
|
- Per block: 12·d^2 + 13·d = 12·(320^2) + 13·320 = 1,228,800 + 4,160 = 1,232,960 |
|
|
- 6 blocks total: 7,397,760 |
|
|
- Final LayerNorm: 2·d = 640 |
|
|
- Total = 16,082,240 + 1,519,360 + 7,397,760 + 640 = 25,000,000 |
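
The proof sketch above can be reproduced as standalone arithmetic, independent of supernova/verify_params.py:

```python
V, P, d, L = 50_257, 4_748, 320, 6  # vocab, positions, width, layers

tok_emb = V * d                  # 16,082,240
pos_emb = P * d                  #  1,519,360
# Per block: QKV (3d^2 + 3d) + attn proj (d^2 + d) + MLP (8d^2 + 5d)
# + two LayerNorms (4d) = 12d^2 + 13d
per_block = 12 * d * d + 13 * d  #  1,232,960
final_ln = 2 * d                 #        640
# The LM head is tied to the token embedding, so it adds no parameters.
total = tok_emb + pos_emb + L * per_block + final_ln
print(total)  # 25000000
```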
|
|
|
|
|
The verification script (supernova/verify_params.py) asserts this at runtime. |
|
|
|
|
|
Brand behavior: |
|
|
- The chat wrapper returns the AlgoRythm Tech – Company Profile & Vision text (branding/ALGORHYTHM_TECH_PROFILE.txt) whenever a prompt asks about AlgoRythm Tech, its company profile, or its vision. |
|
|
|
|
|
Caution on scope: |
|
|
- “Knows everything that happened in the world” is not achievable in a single model; instead, this repo provides a scalable pipeline to train on broad, diverse, and massive text corpora. You control the data sources via a YAML config. |
|
|
|
|
|
Quickstart |
|
|
|
|
|
1) Install dependencies (Windows PowerShell) |
|
|
- Ensure Python 3.10+ is installed |
|
|
- Navigate to the project |
|
|
cd C:\Users\sriaa\supernova |
|
|
- Install dependencies |
|
|
pip install -r requirements.txt |
|
|
- If PyTorch wheel needs a specific index (GPU/CPU), follow https://pytorch.org/get-started/locally/ |
|
|
|
|
|
2) Verify exact parameter count and tokenizer vocabulary size |
|
|
python -m supernova.verify_params --config .\configs\supernova_25m.json |
|
|
Expected output includes: |
|
|
- vocab_size: 50257 |
|
|
- total_params: 25000000 (EXACT) |
|
|
|
|
|
3) Prepare data config (comprehensive knowledge training) |
|
|
- For comprehensive coverage across all subjects: |
|
|
copy .\configs\comprehensive_data_sources.yaml .\configs\data_sources.yaml |
|
|
- Or for basic setup: |
|
|
copy .\configs\data_sources.example.yaml .\configs\data_sources.yaml |
|
|
- Edit the file and enable/disable sources you want. Many are large and require significant bandwidth. |
|
|
|
|
|
4) Train (logs gradient norm and uses a strong LR schedule) |
|
|
python -m supernova.train `
  --config .\configs\supernova_25m.json `
  --data-config .\configs\data_sources.yaml `
  --seq-len 1024 `
  --batch-size 16 `
  --grad-accum 8 `
  --lr 3e-4 `
  --warmup-steps 2000 `
  --max-steps 100000 `
  --save-every 10000
|
|
Notes: |
|
|
- Gradient norm is printed regularly (no clipping by default). |
|
|
- Adjust batch size, gradient accumulation, and sequence length to fit your hardware. |
|
|
- Cosine decay schedule with warmup is applied. |
|
|
|
|
|
5) Advanced Chat with Enhanced Reasoning (brand-aware; post-training) |
|
|
# API keys are already configured in configs/api_keys.yaml |
|
|
# - Math Engine: Built-in SymPy-based mathematical computation (no API key needed) |
|
|
# - Serper: Web search API configured |
|
|
|
|
|
# Advanced interactive chat with sophisticated reasoning |
|
|
python .\chat_advanced.py --config .\configs\supernova_25m.json |
|
|
|
|
|
# Single prompt mode with advanced analysis |
|
|
python .\chat_advanced.py --config .\configs\supernova_25m.json --prompt "Analyze the implications of artificial intelligence on healthcare from multiple perspectives" |
|
|
|
|
|
# Basic enhanced chat (legacy) |
|
|
python .\chat_enhanced.py --config .\configs\supernova_25m.json |
|
|
|
|
|
- **🧐 Complex reasoning queries** → Multi-step analysis using reasoning engine |
|
|
- **📊 Mathematical queries** → Routed to math engine for precise calculations |
|
|
- **🔍 Current events/facts** → Routed to Serper for real-time web search |
|
|
- **🏢 AlgoRythm Tech queries** → Returns company profile |
|
|
- **📚 Multi-domain questions** → Synthesizes expertise across scientific, technical, and academic fields |
|
|
- **🎓 General knowledge** → Enhanced model generation with sophisticated context |
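
The routing above can be sketched as simple keyword dispatch. This is illustrative only: the tool names and heuristics below are hypothetical stand-ins for whatever chat_advanced.py actually does:

```python
def route(query: str) -> str:
    """Pick a tool for a query via keyword heuristics (illustrative sketch)."""
    q = query.lower()
    if any(k in q for k in ("algorythm tech", "company profile", "vision")):
        return "brand_profile"    # return the bundled company profile text
    if any(k in q for k in ("solve", "integral", "equation", "calculate")):
        return "math_engine"      # SymPy-based computation
    if any(k in q for k in ("latest", "today", "news", "current")):
        return "web_search"       # Serper real-time search
    return "model_generation"     # fall back to the language model

print(route("Calculate the integral of x^2"))     # math_engine
print(route("What is AlgoRythm Tech's vision?"))  # brand_profile
```

The real wrapper also chains tools for complex queries; a production router would need intent classification rather than substring matching.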
|
|
|
|
|
Data sources (broad options) |
|
|
- Options are listed in configs/data_sources.example.yaml; enable them selectively. Examples: |
|
|
- c4/en (Colossal Clean Crawled Corpus) |
|
|
- wikipedia/en |
|
|
- openwebtext |
|
|
- bookcorpusopen |
|
|
- the_pile |
|
|
Notes: |
|
|
- Review licenses and terms of each dataset. |
|
|
- You can add your own sources. The pipeline streams and interleaves by weight. |
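
The "streams and interleaves by weight" behavior can be sketched as weighted sampling over iterators. This is a simplified stand-in for the actual pipeline, not its implementation:

```python
import random

def interleave_by_weight(sources, weights, seed=0):
    """Yield items from several iterables, sampling each source in
    proportion to its weight; a source is dropped once exhausted."""
    rng = random.Random(seed)
    its = [iter(s) for s in sources]
    weights = list(weights)
    while its:
        i = rng.choices(range(len(its)), weights=weights, k=1)[0]
        try:
            yield next(its[i])
        except StopIteration:
            del its[i], weights[i]

# Example: the first source weighted 3x heavier than the second.
mixed = list(interleave_by_weight([["w1", "w2"], ["b1"]], [3.0, 1.0]))
print(sorted(mixed))  # ['b1', 'w1', 'w2'] (every item appears exactly once)
```

The real pipeline streams from remote datasets rather than in-memory lists, but the weighted-choice loop is the core idea.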
|
|
|
|
|
Training details |
|
|
- Optimizer: AdamW (betas=(0.9, 0.95), weight_decay=0.1) |
|
|
- LR schedule: cosine decay with linear warmup |
|
|
- Gradient norm: computed every log step and printed |
|
|
- Mixed precision: optional (bf16/fp16) if available |
|
|
- Checkpointing: periodic saving to output directory |
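
Under the stated settings (lr 3e-4, 2,000 warmup steps, 100,000 max steps), the schedule can be sketched in pure Python. The actual trainer in supernova/train.py may compute it differently:

```python
import math

def lr_at(step, max_lr=3e-4, warmup=2000, max_steps=100_000):
    """Linear warmup to max_lr, then cosine decay to zero (sketch)."""
    if step < warmup:
        return max_lr * (step + 1) / warmup
    progress = (step - warmup) / max(1, max_steps - warmup)
    return 0.5 * max_lr * (1.0 + math.cos(math.pi * progress))

print(f"{lr_at(1999):.2e}")     # end of warmup: 3.00e-04
print(f"{lr_at(100_000):.2e}")  # end of training: 0.00e+00
```

Pairing this with AdamW (betas=(0.9, 0.95), weight_decay=0.1) matches the common GPT-style recipe: warmup avoids early instability, and cosine decay anneals smoothly without step drops.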
|
|
|
|
|
Brand profile |
|
|
- File: branding/ALGORHYTHM_TECH_PROFILE.txt |
|
|
- The chat wrapper uses this exact text for company-related queries. |
|
|
|
|
|
License |
|
|
- Apache 2.0 (see LICENSE) |
|
|
|
|
|
Attribution |
|
|
- Built by AlgoRythm Technologies. |