Zandy-Wandy committed on
Commit ad00e79 · verified · 1 Parent(s): 0f8cbc9

Update README.md

  1. README.md +311 -289
README.md CHANGED
@@ -1,289 +1,311 @@
1
- # Vortex Scientific
2
-
3
- **Vortex Scientific** is a from-scratch AI model family designed for deep scientific reasoning. Built from the ground up with a novel hybrid state-space + attention architecture, optimized for consumer laptop hardware (Apple Silicon MacBooks and Nvidia 4060 laptop GPUs).
4
-
5
- ## 🌟 Features
6
-
7
- - **Novel Architecture**: Hybrid State-Space Model (SSM) + Local Attention blocks
8
- - **Science-Specialized**: Custom tokenizer, domain-aware gating, and specialized modules for equations, numerical reasoning, citations, and molecular structures
9
- - **Hardware Optimized**: Runs smoothly on 8GB VRAM (4060 laptop) and 16GB unified memory (MacBook Pro M2/M3)
10
- - **Two Model Sizes**:
11
-   - **Vortex-7B**: 7 billion parameters, fits in 8GB VRAM
12
-   - **Vortex-13B**: 13 billion parameters, fits in 16GB VRAM with quantization
13
- - **HuggingFace Compatible**: Full integration with `transformers` library
14
- - **From Scratch**: No base model: everything built bottom-up, including tokenizer and weights
15
-
16
- ## 🏗️ Architecture
17
-
18
- Vortex uses a two-block hybrid architecture:
19
-
20
- 1. **SSM-Only Blocks**: State-space layers for efficient long-context processing (O(n) complexity)
21
- 2. **Attention+Science Blocks**: Local windowed attention + science modules + SciGate FFN
22
-
23
- Layer ratios:
24
- - 7B: 60% SSM, 40% Attention (pattern: SSM, SSM, Attn, ...)
25
- - 13B: 50% SSM, 50% Attention (pattern: SSM, Attn, SSM, Attn, ...)
26
-
27
- ### Science Modules
28
-
29
- - **EquationModule**: LaTeX equation detection and structural understanding
30
- - **NumericalReasoningModule**: Digit-level encoding, scientific notation, unit awareness
31
- - **CitationModule**: Citation span detection, provenance tracking, confidence scoring
32
- - **MolecularModule**: Element embeddings, SMILES understanding, amino acid sequences
33
-
34
- ## 📦 Project Structure
35
-
36
- ```
37
- Vortex/
38
- ├── configs/
39
- │   ├── vortex_7b_config.py      # 7B model configuration
40
- │   ├── vortex_13b_config.py     # 13B model configuration
41
- │   └── training_config.py       # Training hyperparameters
42
- ├── models/
43
- │   ├── ssm_layer.py             # State-space layer
44
- │   ├── attention_layer.py       # Local windowed attention
45
- │   ├── scigate_ffn.py           # Science-gated feed-forward
46
- │   ├── vortex_model.py          # Main model class
47
- │   └── science_modules/         # Specialized science modules
48
- ├── tokenizer/
49
- │   └── vortex_tokenizer.py      # Custom BPE tokenizer with science vocab
50
- ├── data/
51
- │   ├── dataset_loader.py        # Open dataset loading (Pile, S2ORC, etc.)
52
- │   ├── quality_filter.py        # Multi-stage quality filtering
53
- │   ├── domain_classifier.py     # 7-domain classifier
54
- │   ├── deduplication.py         # MinHash LSH deduplication
55
- │   └── scraper.py               # Web scraping (arXiv, PubMed, etc.)
56
- ├── training/
57
- │   ├── trainer.py               # Main training loop
58
- │   ├── losses.py                # Science-aware loss functions
59
- │   └── curriculum.py            # Curriculum learning scheduler
60
- ├── inference/
61
- │   ├── cuda_optimize.py         # CUDA optimizations (Flash Attention, INT8)
62
- │   └── mps_optimize.py          # MPS optimizations for Apple Silicon
63
- ├── evaluation/                  # Science benchmarks (coming soon)
64
- ├── configuration_vortex.py      # HF config class
65
- ├── tokenization_vortex.py       # HF tokenizer wrapper
66
- ├── modeling_vortex.py           # HF model integration
67
- ├── train.py                     # Training entry point
68
- ├── inference/inference.py       # Inference entry point
69
- └── requirements.txt
70
- ```
71
-
72
- ## 🚀 Quick Start
73
-
74
- ### Installation
75
-
76
- ```bash
77
- # Clone and setup
78
- cd Vortex
79
- pip install -r requirements.txt
80
-
81
- # For CUDA optimizations
82
- pip install flash-attn
83
- pip install bitsandbytes
84
- ```
85
-
86
- ### Training
87
-
88
- ```bash
89
- # Train 7B model on CUDA
90
- python train.py \
91
- --model_size 7b \
92
- --device cuda \
93
- --data_dir ./data/processed \
94
- --output_dir ./checkpoints \
95
- --max_steps 100000
96
-
97
- # Train 13B model with INT8 quantization (for 8GB VRAM)
98
- python train.py \
99
- --model_size 13b \
100
- --device cuda \
101
- --quantization int8 \
102
- --data_dir ./data/processed \
103
- --output_dir ./checkpoints_13b
104
- ```
105
-
106
- ### Inference
107
-
108
- ```bash
109
- # Generate text with 7B model
110
- python inference/inference.py \
111
- --model_path ./checkpoints/latest.pt \
112
- --model_size 7b \
113
- --device cuda \
114
- --prompt "The equation E = mc^2 describes" \
115
- --max_new_tokens 100
116
-
117
- # Interactive mode
118
- python inference/inference.py \
119
- --model_path ./checkpoints/latest.pt \
120
- --model_size 7b \
121
- --device cuda \
122
- --interactive
123
-
124
- # On Apple Silicon (MPS)
125
- python inference/inference.py \
126
- --model_path ./checkpoints/latest.pt \
127
- --model_size 7b \
128
- --use_mps \
129
- --prompt "Explain quantum mechanics"
130
- ```
131
-
132
- ### HuggingFace Integration
133
-
134
- ```python
135
- from transformers import AutoModelForCausalLM, AutoTokenizer
136
-
137
- # Load model and tokenizer
138
- model = AutoModelForCausalLM.from_pretrained("./checkpoints")
139
- tokenizer = AutoTokenizer.from_pretrained("./checkpoints")
140
-
141
- # Generate
142
- input_text = "The energy of a photon is given by"
143
- inputs = tokenizer(input_text, return_tensors="pt")
144
- outputs = model.generate(**inputs, max_new_tokens=50)
145
- print(tokenizer.decode(outputs[0]))
146
- ```
147
-
148
- ## 📊 Data Pipeline
149
-
150
- 1. **Open Datasets**: Automatically download from HuggingFace (Pile, S2ORC, Math datasets, PubMed QA)
151
- 2. **Quality Filtering**: Multi-stage checks (length, language, equations, repetition, citations)
152
- 3. **Deduplication**: MinHash LSH for near-duplicate detection
153
- 4. **Domain Classification**: Classify into 7 science domains
154
- 5. **Tokenization**: Custom science-aware BPE tokenizer
155
- 6. **Sharding**: Write to Parquet with statistics
156
-
157
- ```python
158
- from data.dataset_loader import VortexDatasetLoader
159
- from data.quality_filter import ScienceQualityFilter
160
- from data.deduplication import MinHashLSH
161
-
162
- # Load and process data
163
- loader = VortexDatasetLoader()
164
- quality_filter = ScienceQualityFilter()
165
- lsh = MinHashLSH()
166
-
167
- # Stream datasets, filter, deduplicate, and shard
168
- for sample in loader.load_multiple_datasets(["pile_scientific", "automath"]):
169
-     if quality_filter.filter(sample["text"]):
170
-         lsh.add_document(sample["id"], sample["text"])
171
-         # Tokenize and save
172
- ```
173
-
174
- ## 🎯 Training Strategy
175
-
176
- ### Curriculum Learning
177
-
178
- Training progresses through 4 stages:
179
-
180
- 1. **Foundation** (0-20%): Basic science text, simple equations, definitions
181
- 2. **Domain** (20-50%): Domain-specific deep content per science area
182
- 3. **Reasoning** (50-80%): Scientific problem solving, multi-step derivations
183
- 4. **Integration** (80-100%): Cross-domain science, full dataset
184
-
185
- ### Science-Aware Loss
186
-
187
- ```python
188
- total_loss = (
189
- lm_loss * 1.0 # Standard next token prediction
190
- + equation_loss * 0.3 # Equation reconstruction accuracy
191
- + domain_loss * 0.1 # Domain classification head
192
- + citation_loss * 0.1 # Citation detection accuracy
193
- + numerical_loss * 0.2 # Numerical reasoning accuracy
194
- )
195
- ```
196
-
197
- ## ⚙️ Configuration
198
-
199
- ### 7B Config (VORTEX_7B_CONFIG)
200
-
201
- - `d_model`: 4096
202
- - `num_layers`: 32
203
- - `num_heads`: 32
204
- - `d_state`: 16
205
- - `ssm_ratio`: 0.6
206
- - `vocab_size`: 50000
207
- - `max_seq_len`: 16384
208
-
209
- ### 13B Config (VORTEX_13B_CONFIG)
210
-
211
- - `d_model`: 5120
212
- - `num_layers`: 40
213
- - `num_heads`: 40
214
- - `d_state`: 32
215
- - `ssm_ratio`: 0.5
216
- - `vocab_size`: 50000
217
- - `max_seq_len`: 16384
218
-
219
- ## 🔧 Hardware Targets
220
-
221
- ### Nvidia 4060 Laptop (8GB VRAM)
222
-
223
- - **7B**: BF16, no quantization, Flash Attention 2, torch.compile
224
- - **13B**: INT8 quantization, Flash Attention 2, torch.compile
225
- - Target TPS: 25-40 (7B), 15-25 (13B)
226
-
227
- ### Apple Silicon (M2/M3)
228
-
229
- - **7B on M3**: BF16 (via float16), SDPA, no compile
230
- - **13B on M3 Max**: BF16, unified memory, SDPA
231
- - Target TPS: 20-35 (7B), 12-20 (13B)
232
-
233
- ## 🧪 Science Domains
234
-
235
- 1. **Physics** (`[PHYS]`)
236
- 2. **Mathematics** (`[MATH]`)
237
- 3. **Chemistry** (`[CHEM]`)
238
- 4. **Biology** (`[BIO]`)
239
- 5. **Earth Science** (`[EARTH]`)
240
- 6. **Space Science** (`[SPACE]`)
241
- 7. **Zoology** (`[ZOO]`)
242
-
243
- Domain tags can be included in training data to guide the SciGate FFN routing.
244
-
245
- ## 📝 Tokenizer
246
-
247
- Custom BPE tokenizer with:
248
-
249
- - 40,000 base BPE tokens trained on scientific corpus
250
- - 10,000 science-specific tokens:
251
-   - 500 LaTeX math symbols (`\alpha`, `\sum`, `\int`, etc.)
252
-   - 118 chemical element symbols
253
-   - 200 SI and derived units
254
-   - 300 scientific abbreviations (DNA, RNA, ATP, etc.)
255
-   - 500 mathematical operators
256
-   - Amino acid codes
257
-   - Greek alphabet (α, β, γ, etc.)
258
- - Special tokens: `[EQUATION]`, `[CITATION]`, `[MOLECULE]`, `[FIGURE]`, `[TABLE]`, domain tags
259
-
260
- ## 🧪 Evaluation
261
-
262
- Science benchmarks across all 7 domains will be added. Planned benchmarks:
263
-
264
- - **Physics**: Feynman Questions, Physics GRE
265
- - **Math**: MATH dataset, GSM8K
266
- - **Chemistry**: Chemistry problem-solving, molecular property prediction
267
- - **Biology**: PubMed QA, bioinformatics tasks
268
- - **Earth Science**: Climate modeling questions
269
- - **Space Science**: Astronomy problem sets
270
- - **Zoology**: Species classification, ecological reasoning
271
-
272
- ## 📄 License
273
-
274
- This is a school science project. Code is provided for educational purposes.
275
-
276
- ## 🙏 Acknowledgments
277
-
278
- - **Mamba** (Gu et al.) for SSM architecture inspiration
279
- - **Flash Attention** (Dao et al.) for efficient attention
280
- - **HuggingFace** for transformers library
281
- - All open scientific data sources: arXiv, PubMed, S2ORC, etc.
282
-
283
- ## 📧 Contact
284
-
285
- For questions or issues, please open an issue on GitHub.
286
-
287
- ---
288
-
289
- **Built with ❤️ for scientific AI research**
1
+ ---
2
+ language:
3
+ - en
4
+ license: mit
5
+ tags:
6
+ - vortex
7
+ - science
8
+ - physics
9
+ - chemistry
10
+ - biology
11
+ - mathematics
12
+ - ssm
13
+ - mamba
14
+ - hybrid-architecture
15
+ - custom-tokenizer
16
+ - from-scratch
17
+ - matrix-corp
18
+ pipeline_tag: text-generation
19
+ library_name: transformers
20
+ model_type: vortex
21
+ ---
22
+
23
+ # Vortex Scientific
24
+
25
+ **Vortex Scientific** is a from-scratch AI model family designed for deep scientific reasoning. Built from the ground up with a novel hybrid state-space + attention architecture, optimized for consumer laptop hardware (Apple Silicon MacBooks and Nvidia 4060 laptop GPUs).
26
+
27
+ ## 🌟 Features
28
+
29
+ - **Novel Architecture**: Hybrid State-Space Model (SSM) + Local Attention blocks
30
+ - **Science-Specialized**: Custom tokenizer, domain-aware gating, and specialized modules for equations, numerical reasoning, citations, and molecular structures
31
+ - **Hardware Optimized**: Runs smoothly on 8GB VRAM (4060 laptop) and 16GB unified memory (MacBook Pro M2/M3)
32
+ - **Two Model Sizes**:
33
+   - **Vortex-7B**: 7 billion parameters, fits in 8GB VRAM
34
+   - **Vortex-13B**: 13 billion parameters, fits in 16GB VRAM with quantization
35
+ - **HuggingFace Compatible**: Full integration with `transformers` library
36
+ - **From Scratch**: No base model: everything built bottom-up, including tokenizer and weights
37
+
38
+ ## 🏗️ Architecture
39
+
40
+ Vortex uses a two-block hybrid architecture:
41
+
42
+ 1. **SSM-Only Blocks**: State-space layers for efficient long-context processing (O(n) complexity)
43
+ 2. **Attention+Science Blocks**: Local windowed attention + science modules + SciGate FFN
44
+
45
+ Layer ratios:
46
+ - 7B: 60% SSM, 40% Attention (pattern: SSM, SSM, Attn, ...)
47
+ - 13B: 50% SSM, 50% Attention (pattern: SSM, Attn, SSM, Attn, ...)
48
+
49
+ ### Science Modules
50
+
51
+ - **EquationModule**: LaTeX equation detection and structural understanding
52
+ - **NumericalReasoningModule**: Digit-level encoding, scientific notation, unit awareness
53
+ - **CitationModule**: Citation span detection, provenance tracking, confidence scoring
54
+ - **MolecularModule**: Element embeddings, SMILES understanding, amino acid sequences
55
+
56
+ ## 📦 Project Structure
57
+
58
+ ```
59
+ Vortex/
60
+ ├── configs/
61
+ │   ├── vortex_7b_config.py      # 7B model configuration
62
+ │   ├── vortex_13b_config.py     # 13B model configuration
63
+ │   └── training_config.py       # Training hyperparameters
64
+ ├── models/
65
+ │   ├── ssm_layer.py             # State-space layer
66
+ │   ├── attention_layer.py       # Local windowed attention
67
+ │   ├── scigate_ffn.py           # Science-gated feed-forward
68
+ │   ├── vortex_model.py          # Main model class
69
+ │   └── science_modules/         # Specialized science modules
70
+ ├── tokenizer/
71
+ │   └── vortex_tokenizer.py      # Custom BPE tokenizer with science vocab
72
+ ├── data/
73
+ │   ├── dataset_loader.py        # Open dataset loading (Pile, S2ORC, etc.)
74
+ │   ├── quality_filter.py        # Multi-stage quality filtering
75
+ │   ├── domain_classifier.py     # 7-domain classifier
76
+ │   ├── deduplication.py         # MinHash LSH deduplication
77
+ │   └── scraper.py               # Web scraping (arXiv, PubMed, etc.)
78
+ ├── training/
79
+ │   ├── trainer.py               # Main training loop
80
+ │   ├── losses.py                # Science-aware loss functions
81
+ │   └── curriculum.py            # Curriculum learning scheduler
82
+ ├── inference/
83
+ │   ├── cuda_optimize.py         # CUDA optimizations (Flash Attention, INT8)
84
+ │   └── mps_optimize.py          # MPS optimizations for Apple Silicon
85
+ ├── evaluation/                  # Science benchmarks (coming soon)
86
+ ├── configuration_vortex.py      # HF config class
87
+ ├── tokenization_vortex.py       # HF tokenizer wrapper
88
+ ├── modeling_vortex.py           # HF model integration
89
+ ├── train.py                     # Training entry point
90
+ ├── inference/inference.py       # Inference entry point
91
+ └── requirements.txt
92
+ ```
93
+
94
+ ## 🚀 Quick Start
95
+
96
+ ### Installation
97
+
98
+ ```bash
99
+ # Clone and setup
100
+ cd Vortex
101
+ pip install -r requirements.txt
102
+
103
+ # For CUDA optimizations
104
+ pip install flash-attn
105
+ pip install bitsandbytes
106
+ ```
107
+
108
+ ### Training
109
+
110
+ ```bash
111
+ # Train 7B model on CUDA
112
+ python train.py \
113
+ --model_size 7b \
114
+ --device cuda \
115
+ --data_dir ./data/processed \
116
+ --output_dir ./checkpoints \
117
+ --max_steps 100000
118
+
119
+ # Train 13B model with INT8 quantization (for 8GB VRAM)
120
+ python train.py \
121
+ --model_size 13b \
122
+ --device cuda \
123
+ --quantization int8 \
124
+ --data_dir ./data/processed \
125
+ --output_dir ./checkpoints_13b
126
+ ```
127
+
128
+ ### Inference
129
+
130
+ ```bash
131
+ # Generate text with 7B model
132
+ python inference/inference.py \
133
+ --model_path ./checkpoints/latest.pt \
134
+ --model_size 7b \
135
+ --device cuda \
136
+ --prompt "The equation E = mc^2 describes" \
137
+ --max_new_tokens 100
138
+
139
+ # Interactive mode
140
+ python inference/inference.py \
141
+ --model_path ./checkpoints/latest.pt \
142
+ --model_size 7b \
143
+ --device cuda \
144
+ --interactive
145
+
146
+ # On Apple Silicon (MPS)
147
+ python inference/inference.py \
148
+ --model_path ./checkpoints/latest.pt \
149
+ --model_size 7b \
150
+ --use_mps \
151
+ --prompt "Explain quantum mechanics"
152
+ ```
153
+
154
+ ### HuggingFace Integration
155
+
156
+ ```python
157
+ from transformers import AutoModelForCausalLM, AutoTokenizer
158
+
159
+ # Load model and tokenizer
160
+ model = AutoModelForCausalLM.from_pretrained("./checkpoints")
161
+ tokenizer = AutoTokenizer.from_pretrained("./checkpoints")
162
+
163
+ # Generate
164
+ input_text = "The energy of a photon is given by"
165
+ inputs = tokenizer(input_text, return_tensors="pt")
166
+ outputs = model.generate(**inputs, max_new_tokens=50)
167
+ print(tokenizer.decode(outputs[0]))
168
+ ```
169
+
170
+ ## 📊 Data Pipeline
171
+
172
+ 1. **Open Datasets**: Automatically download from HuggingFace (Pile, S2ORC, Math datasets, PubMed QA)
173
+ 2. **Quality Filtering**: Multi-stage checks (length, language, equations, repetition, citations)
174
+ 3. **Deduplication**: MinHash LSH for near-duplicate detection
175
+ 4. **Domain Classification**: Classify into 7 science domains
176
+ 5. **Tokenization**: Custom science-aware BPE tokenizer
177
+ 6. **Sharding**: Write to Parquet with statistics
178
+
179
+ ```python
180
+ from data.dataset_loader import VortexDatasetLoader
181
+ from data.quality_filter import ScienceQualityFilter
182
+ from data.deduplication import MinHashLSH
183
+
184
+ # Load and process data
185
+ loader = VortexDatasetLoader()
186
+ quality_filter = ScienceQualityFilter()
187
+ lsh = MinHashLSH()
188
+
189
+ # Stream datasets, filter, deduplicate, and shard
190
+ for sample in loader.load_multiple_datasets(["pile_scientific", "automath"]):
191
+     if quality_filter.filter(sample["text"]):
192
+         lsh.add_document(sample["id"], sample["text"])
193
+         # Tokenize and save
194
+ ```
195
+
196
+ ## 🎯 Training Strategy
197
+
198
+ ### Curriculum Learning
199
+
200
+ Training progresses through 4 stages:
201
+
202
+ 1. **Foundation** (0-20%): Basic science text, simple equations, definitions
203
+ 2. **Domain** (20-50%): Domain-specific deep content per science area
204
+ 3. **Reasoning** (50-80%): Scientific problem solving, multi-step derivations
205
+ 4. **Integration** (80-100%): Cross-domain science, full dataset
206
+
207
+ ### Science-Aware Loss
208
+
209
+ ```python
210
+ total_loss = (
211
+ lm_loss * 1.0 # Standard next token prediction
212
+ + equation_loss * 0.3 # Equation reconstruction accuracy
213
+ + domain_loss * 0.1 # Domain classification head
214
+ + citation_loss * 0.1 # Citation detection accuracy
215
+ + numerical_loss * 0.2 # Numerical reasoning accuracy
216
+ )
217
+ ```
218
+
219
+ ## ⚙️ Configuration
220
+
221
+ ### 7B Config (VORTEX_7B_CONFIG)
222
+
223
+ - `d_model`: 4096
224
+ - `num_layers`: 32
225
+ - `num_heads`: 32
226
+ - `d_state`: 16
227
+ - `ssm_ratio`: 0.6
228
+ - `vocab_size`: 50000
229
+ - `max_seq_len`: 16384
230
+
231
+ ### 13B Config (VORTEX_13B_CONFIG)
232
+
233
+ - `d_model`: 5120
234
+ - `num_layers`: 40
235
+ - `num_heads`: 40
236
+ - `d_state`: 32
237
+ - `ssm_ratio`: 0.5
238
+ - `vocab_size`: 50000
239
+ - `max_seq_len`: 16384
240
+
241
+ ## 🔧 Hardware Targets
242
+
243
+ ### Nvidia 4060 Laptop (8GB VRAM)
244
+
245
+ - **7B**: BF16, no quantization, Flash Attention 2, torch.compile
246
+ - **13B**: INT8 quantization, Flash Attention 2, torch.compile
247
+ - Target TPS: 25-40 (7B), 15-25 (13B)
248
+
249
+ ### Apple Silicon (M2/M3)
250
+
251
+ - **7B on M3**: BF16 (via float16), SDPA, no compile
252
+ - **13B on M3 Max**: BF16, unified memory, SDPA
253
+ - Target TPS: 20-35 (7B), 12-20 (13B)
254
+
255
+ ## 🧪 Science Domains
256
+
257
+ 1. **Physics** (`[PHYS]`)
258
+ 2. **Mathematics** (`[MATH]`)
259
+ 3. **Chemistry** (`[CHEM]`)
260
+ 4. **Biology** (`[BIO]`)
261
+ 5. **Earth Science** (`[EARTH]`)
262
+ 6. **Space Science** (`[SPACE]`)
263
+ 7. **Zoology** (`[ZOO]`)
264
+
265
+ Domain tags can be included in training data to guide the SciGate FFN routing.
266
+
267
+ ## 📝 Tokenizer
268
+
269
+ Custom BPE tokenizer with:
270
+
271
+ - 40,000 base BPE tokens trained on scientific corpus
272
+ - 10,000 science-specific tokens:
273
+   - 500 LaTeX math symbols (`\alpha`, `\sum`, `\int`, etc.)
274
+   - 118 chemical element symbols
275
+   - 200 SI and derived units
276
+   - 300 scientific abbreviations (DNA, RNA, ATP, etc.)
277
+   - 500 mathematical operators
278
+   - Amino acid codes
279
+   - Greek alphabet (α, β, γ, etc.)
280
+ - Special tokens: `[EQUATION]`, `[CITATION]`, `[MOLECULE]`, `[FIGURE]`, `[TABLE]`, domain tags
281
+
282
+ ## 🧪 Evaluation
283
+
284
+ Science benchmarks across all 7 domains will be added. Planned benchmarks:
285
+
286
+ - **Physics**: Feynman Questions, Physics GRE
287
+ - **Math**: MATH dataset, GSM8K
288
+ - **Chemistry**: Chemistry problem-solving, molecular property prediction
289
+ - **Biology**: PubMed QA, bioinformatics tasks
290
+ - **Earth Science**: Climate modeling questions
291
+ - **Space Science**: Astronomy problem sets
292
+ - **Zoology**: Species classification, ecological reasoning
293
+
294
+ ## 📄 License
295
+
296
+ This is a school science project. Code is provided for educational purposes.
297
+
298
+ ## 🙏 Acknowledgments
299
+
300
+ - **Mamba** (Gu et al.) for SSM architecture inspiration
301
+ - **Flash Attention** (Dao et al.) for efficient attention
302
+ - **HuggingFace** for transformers library
303
+ - All open scientific data sources: arXiv, PubMed, S2ORC, etc.
304
+
305
+ ## 📧 Contact
306
+
307
+ For questions or issues, please open an issue on GitHub.
308
+
309
+ ---
310
+
311
+ **Built with ❤️ for scientific AI research**
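
Note on the Training Strategy section of the README in this diff: it lists four curriculum stages by training progress and a weighted sum of five loss terms. The sketch below shows one way those two pieces could be expressed. It is not taken from `training/curriculum.py` or `training/losses.py`; the names `curriculum_stage`, `combine_losses`, `CURRICULUM_STAGES`, and `LOSS_WEIGHTS` are hypothetical, and treating each stage as a half-open interval (so exactly 20% already counts as the Domain stage) is an assumption, since the README does not say which stage owns a boundary.

```python
# Illustrative sketch only: the stage boundaries and loss weights are copied
# from the README, but every name below is hypothetical.

# (upper bound of progress, stage name); each stage covers [previous bound, this bound).
CURRICULUM_STAGES = [
    (0.20, "foundation"),
    (0.50, "domain"),
    (0.80, "reasoning"),
    (1.00, "integration"),
]

# Weights from the README's "Science-Aware Loss" snippet.
LOSS_WEIGHTS = {
    "lm": 1.0,
    "equation": 0.3,
    "domain": 0.1,
    "citation": 0.1,
    "numerical": 0.2,
}


def curriculum_stage(step: int, max_steps: int) -> str:
    """Return the curriculum stage name for a given training step."""
    if max_steps <= 0:
        raise ValueError("max_steps must be positive")
    # Clamp progress to [0, 1] so out-of-range steps still map to a stage.
    progress = min(max(step / max_steps, 0.0), 1.0)
    for upper_bound, name in CURRICULUM_STAGES:
        if progress < upper_bound:
            return name
    # progress == 1.0 falls through the loop; it belongs to the last stage.
    return CURRICULUM_STAGES[-1][1]


def combine_losses(losses: dict) -> float:
    """Weighted sum of the loss terms present in `losses`.

    Unknown keys raise KeyError so a misspelled loss name is not silently ignored.
    """
    return sum(LOSS_WEIGHTS[name] * value for name, value in losses.items())


if __name__ == "__main__":
    print(curriculum_stage(60_000, 100_000))
    print(combine_losses({"lm": 2.0, "equation": 1.0}))
```

With the `--max_steps 100000` value from the training example, steps 0 to 19,999 would fall in the Foundation stage under this half-open convention. Plain floats stand in for the loss values here; the same weighted sum applies unchanged to framework tensors.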