Zandy-Wandy committed on
Commit bf64b03 · verified · 1 Parent(s): 83c011e

Upload Vortex model

Files changed (50)
  1. README.md +289 -0
  2. configs/__pycache__/vortex_7b_config.cpython-313.pyc +0 -0
  3. configs/training_config.py +97 -0
  4. configs/vortex_13b_config.py +68 -0
  5. configs/vortex_7b_config.py +71 -0
  6. configuration_vortex.py +110 -0
  7. cuda_optimize.py +287 -0
  8. data/__pycache__/deduplication.cpython-313.pyc +0 -0
  9. data/__pycache__/domain_classifier.cpython-313.pyc +0 -0
  10. data/__pycache__/quality_filter.cpython-313.pyc +0 -0
  11. data/dataset_loader.py +263 -0
  12. data/deduplication.py +260 -0
  13. data/domain_classifier.py +163 -0
  14. data/quality_filter.py +279 -0
  15. data/scraper.py +405 -0
  16. inference.py +213 -0
  17. modeling_vortex.py +222 -0
  18. models/__pycache__/attention_layer.cpython-313.pyc +0 -0
  19. models/__pycache__/scigate_ffn.cpython-313.pyc +0 -0
  20. models/__pycache__/ssm_layer.cpython-313.pyc +0 -0
  21. models/__pycache__/vortex_model.cpython-313.pyc +0 -0
  22. models/attention_layer.py +370 -0
  23. models/science_modules/__init__.py +15 -0
  24. models/science_modules/__pycache__/__init__.cpython-313.pyc +0 -0
  25. models/science_modules/__pycache__/citation_module.cpython-313.pyc +0 -0
  26. models/science_modules/__pycache__/equation_module.cpython-313.pyc +0 -0
  27. models/science_modules/__pycache__/molecular_module.cpython-313.pyc +0 -0
  28. models/science_modules/__pycache__/numerical_module.cpython-313.pyc +0 -0
  29. models/science_modules/citation_module.py +230 -0
  30. models/science_modules/equation_module.py +266 -0
  31. models/science_modules/molecular_module.py +333 -0
  32. models/science_modules/numerical_module.py +251 -0
  33. models/scigate_ffn.py +203 -0
  34. models/ssm_layer.py +252 -0
  35. models/vortex_model.py +377 -0
  36. mps_optimize.py +172 -0
  37. push_to_hf.py +39 -0
  38. requirements.txt +50 -0
  39. science_bench.py +360 -0
  40. test_model.py +449 -0
  41. tokenization_vortex.py +174 -0
  42. tokenizer/__pycache__/vortex_tokenizer.cpython-313.pyc +0 -0
  43. tokenizer/vortex_tokenizer.py +442 -0
  44. train.py +146 -0
  45. training/__pycache__/curriculum.cpython-313.pyc +0 -0
  46. training/__pycache__/losses.cpython-313.pyc +0 -0
  47. training/curriculum.py +175 -0
  48. training/losses.py +162 -0
  49. training/trainer.py +442 -0
  50. vortex_config.py +71 -0
README.md ADDED
@@ -0,0 +1,289 @@
# Vortex Scientific

**Vortex Scientific** is a from-scratch AI model family designed for deep scientific reasoning. It is built from the ground up with a novel hybrid state-space + attention architecture and optimized for consumer laptop hardware (Apple Silicon MacBooks and Nvidia 4060 laptop GPUs).

## 🌟 Features

- **Novel Architecture**: Hybrid State-Space Model (SSM) + local attention blocks
- **Science-Specialized**: Custom tokenizer, domain-aware gating, and specialized modules for equations, numerical reasoning, citations, and molecular structures
- **Hardware-Optimized**: Runs smoothly on 8GB VRAM (4060 laptop) and 16GB unified memory (MacBook Pro M2/M3)
- **Two Model Sizes**:
  - **Vortex-7B**: 7 billion parameters, fits in 8GB VRAM
  - **Vortex-13B**: 13 billion parameters, fits in 16GB VRAM with quantization
- **HuggingFace Compatible**: Full integration with the `transformers` library
- **From Scratch**: No base model; everything, including the tokenizer and weights, is built bottom-up

## 🏗️ Architecture

Vortex uses a two-block hybrid architecture:

1. **SSM-Only Blocks**: State-space layers for efficient long-context processing (O(n) complexity)
2. **Attention+Science Blocks**: Local windowed attention + science modules + SciGate FFN

Layer ratios:

- 7B: 60% SSM, 40% attention (pattern: SSM, SSM, Attn, ...)
- 13B: 50% SSM, 50% attention (pattern: SSM, Attn, SSM, Attn, ...)

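The two layer-ratio patterns above can be generated from `ssm_ratio` with a simple counter; `build_layer_pattern` below is a hypothetical sketch, not the actual logic in `models/vortex_model.py`.

```python
import math

def build_layer_pattern(num_layers: int, ssm_ratio: float) -> list:
    """Interleave SSM and attention blocks so SSM makes up ~ssm_ratio of layers."""
    pattern, ssm_count = [], 0
    for i in range(num_layers):
        # Emit an SSM block whenever the running SSM count falls behind the target ratio
        if ssm_count < math.ceil(ssm_ratio * (i + 1)):
            pattern.append("ssm")
            ssm_count += 1
        else:
            pattern.append("attn")
    return pattern

print(build_layer_pattern(32, 0.6)[:5])  # ['ssm', 'ssm', 'attn', 'ssm', 'attn']
```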
### Science Modules

- **EquationModule**: LaTeX equation detection and structural understanding
- **NumericalReasoningModule**: Digit-level encoding, scientific notation, unit awareness
- **CitationModule**: Citation span detection, provenance tracking, confidence scoring
- **MolecularModule**: Element embeddings, SMILES understanding, amino acid sequences

## 📦 Project Structure

```
Vortex/
├── configs/
│   ├── vortex_7b_config.py   # 7B model configuration
│   ├── vortex_13b_config.py  # 13B model configuration
│   └── training_config.py    # Training hyperparameters
├── models/
│   ├── ssm_layer.py          # State-space layer
│   ├── attention_layer.py    # Local windowed attention
│   ├── scigate_ffn.py        # Science-gated feed-forward
│   ├── vortex_model.py       # Main model class
│   └── science_modules/      # Specialized science modules
├── tokenizer/
│   └── vortex_tokenizer.py   # Custom BPE tokenizer with science vocab
├── data/
│   ├── dataset_loader.py     # Open dataset loading (Pile, S2ORC, etc.)
│   ├── quality_filter.py     # Multi-stage quality filtering
│   ├── domain_classifier.py  # 7-domain classifier
│   ├── deduplication.py      # MinHash LSH deduplication
│   └── scraper.py            # Web scraping (arXiv, PubMed, etc.)
├── training/
│   ├── trainer.py            # Main training loop
│   ├── losses.py             # Science-aware loss functions
│   └── curriculum.py         # Curriculum learning scheduler
├── cuda_optimize.py          # CUDA optimizations (Flash Attention, INT8)
├── mps_optimize.py           # MPS optimizations for Apple Silicon
├── science_bench.py          # Science benchmarks
├── configuration_vortex.py   # HF config class
├── tokenization_vortex.py    # HF tokenizer wrapper
├── modeling_vortex.py        # HF model integration
├── train.py                  # Training entry point
├── inference.py              # Inference entry point
└── requirements.txt
```

## 🚀 Quick Start

### Installation

```bash
# Clone and setup
cd Vortex
pip install -r requirements.txt

# For CUDA optimizations
pip install flash-attn
pip install bitsandbytes
```

### Training

```bash
# Train the 7B model on CUDA
python train.py \
    --model_size 7b \
    --device cuda \
    --data_dir ./data/processed \
    --output_dir ./checkpoints \
    --max_steps 100000

# Train the 13B model with INT8 quantization (for 8GB VRAM)
python train.py \
    --model_size 13b \
    --device cuda \
    --quantization int8 \
    --data_dir ./data/processed \
    --output_dir ./checkpoints_13b
```

### Inference

```bash
# Generate text with the 7B model
python inference.py \
    --model_path ./checkpoints/latest.pt \
    --model_size 7b \
    --device cuda \
    --prompt "The equation E = mc^2 describes" \
    --max_new_tokens 100

# Interactive mode
python inference.py \
    --model_path ./checkpoints/latest.pt \
    --model_size 7b \
    --device cuda \
    --interactive

# On Apple Silicon (MPS)
python inference.py \
    --model_path ./checkpoints/latest.pt \
    --model_size 7b \
    --use_mps \
    --prompt "Explain quantum mechanics"
```

### HuggingFace Integration

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load model and tokenizer
model = AutoModelForCausalLM.from_pretrained("./checkpoints")
tokenizer = AutoTokenizer.from_pretrained("./checkpoints")

# Generate
input_text = "The energy of a photon is given by"
inputs = tokenizer(input_text, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0]))
```

## 📊 Data Pipeline

1. **Open Datasets**: Automatically download from HuggingFace (Pile, S2ORC, math datasets, PubMed QA)
2. **Quality Filtering**: Multi-stage checks (length, language, equations, repetition, citations)
3. **Deduplication**: MinHash LSH for near-duplicate detection
4. **Domain Classification**: Classify into 7 science domains
5. **Tokenization**: Custom science-aware BPE tokenizer
6. **Sharding**: Write to Parquet with statistics

```python
from data.dataset_loader import VortexDatasetLoader
from data.quality_filter import ScienceQualityFilter
from data.deduplication import MinHashLSH

# Load and process data
loader = VortexDatasetLoader()
quality_filter = ScienceQualityFilter()
lsh = MinHashLSH()

# Stream datasets, filter, deduplicate, and shard
for sample in loader.load_multiple_datasets(["pile_scientific", "automath"]):
    if quality_filter.filter(sample["text"]):
        lsh.add_document(sample["id"], sample["text"])
        # Tokenize and save
```

## 🎯 Training Strategy

### Curriculum Learning

Training progresses through 4 stages:

1. **Foundation** (0-20%): Basic science text, simple equations, definitions
2. **Domain** (20-50%): Deep domain-specific content per science area
3. **Reasoning** (50-80%): Scientific problem solving, multi-step derivations
4. **Integration** (80-100%): Cross-domain science, full dataset

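A minimal sketch of how a step counter maps to these stages, reusing the `curriculum_stages` fractions from `configs/training_config.py` (the `stage_for_step` helper is illustrative, not the actual `training/curriculum.py` code):

```python
CURRICULUM_STAGES = [
    {"name": "foundation", "start": 0.0, "end": 0.2},
    {"name": "domain", "start": 0.2, "end": 0.5},
    {"name": "reasoning", "start": 0.5, "end": 0.8},
    {"name": "integration", "start": 0.8, "end": 1.0},
]

def stage_for_step(step: int, max_steps: int) -> str:
    """Map a training step to its curriculum stage by progress fraction."""
    progress = step / max_steps
    for stage in CURRICULUM_STAGES:
        if stage["start"] <= progress < stage["end"]:
            return stage["name"]
    # progress == 1.0 falls into the final stage
    return CURRICULUM_STAGES[-1]["name"]

print(stage_for_step(60_000, 100_000))  # reasoning
```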
### Science-Aware Loss

```python
total_loss = (
    lm_loss * 1.0           # Standard next-token prediction
    + equation_loss * 0.3   # Equation reconstruction accuracy
    + domain_loss * 0.1     # Domain classification head
    + citation_loss * 0.1   # Citation detection accuracy
    + numerical_loss * 0.2  # Numerical reasoning accuracy
)
```

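The same total can be computed generically from the `loss_weights` dictionary in `configs/training_config.py`; the loss values passed in below are made-up placeholders:

```python
LOSS_WEIGHTS = {  # mirrors configs/training_config.py
    "lm_loss": 1.0,
    "equation_loss": 0.3,
    "domain_loss": 0.1,
    "citation_loss": 0.1,
    "numerical_loss": 0.2,
}

def total_loss(losses: dict) -> float:
    """Weighted sum over whichever auxiliary losses are present for this batch."""
    return sum(LOSS_WEIGHTS[name] * value for name, value in losses.items())

# Placeholder loss values, for illustration only
print(total_loss({"lm_loss": 2.0, "equation_loss": 1.0}))  # 2.3
```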
## ⚙️ Configuration

### 7B Config (VORTEX_7B_CONFIG)

- `d_model`: 4096
- `num_layers`: 32
- `num_heads`: 32
- `d_state`: 16
- `ssm_ratio`: 0.6
- `vocab_size`: 50000
- `max_seq_len`: 16384

### 13B Config (VORTEX_13B_CONFIG)

- `d_model`: 5120
- `num_layers`: 40
- `num_heads`: 40
- `d_state`: 32
- `ssm_ratio`: 0.5
- `vocab_size`: 50000
- `max_seq_len`: 16384

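As a sanity check, a rough transformer-style estimate (about 12·d_model² parameters per layer for attention plus a 4x FFN, plus tied embeddings) lands near the advertised sizes. This sketch ignores SSM state parameters and the science modules, so it is only an approximation:

```python
def estimate_params(d_model: int, num_layers: int, vocab_size: int,
                    ffn_expansion: int = 4) -> float:
    """Rough parameter count in billions: attention (4*d^2) plus
    FFN (2*expansion*d^2) per layer, plus tied token embeddings."""
    per_layer = 4 * d_model**2 + 2 * ffn_expansion * d_model**2
    return (num_layers * per_layer + vocab_size * d_model) / 1e9

print(round(estimate_params(4096, 32, 50_000), 2))  # ~6.65 (Vortex-7B)
print(round(estimate_params(5120, 40, 50_000), 2))  # ~12.84 (Vortex-13B)
```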
## 🔧 Hardware Targets

### Nvidia 4060 Laptop (8GB VRAM)

- **7B**: BF16, no quantization, Flash Attention 2, `torch.compile`
- **13B**: INT8 quantization, Flash Attention 2, `torch.compile`
- Target throughput: 25-40 tok/s (7B), 15-25 tok/s (13B)

### Apple Silicon (M2/M3)

- **7B on M3**: FP16 (MPS has limited BF16 support), SDPA, no compile
- **13B on M3 Max**: BF16, unified memory, SDPA
- Target throughput: 20-35 tok/s (7B), 12-20 tok/s (13B)

## 🧪 Science Domains

1. **Physics** (`[PHYS]`)
2. **Mathematics** (`[MATH]`)
3. **Chemistry** (`[CHEM]`)
4. **Biology** (`[BIO]`)
5. **Earth Science** (`[EARTH]`)
6. **Space Science** (`[SPACE]`)
7. **Zoology** (`[ZOO]`)

Domain tags can be included in training data to guide the SciGate FFN routing.

## 📝 Tokenizer

Custom BPE tokenizer with:

- 40,000 base BPE tokens trained on a scientific corpus
- 10,000 science-specific tokens:
  - 500 LaTeX math symbols (`\alpha`, `\sum`, `\int`, etc.)
  - 118 chemical element symbols
  - 200 SI and derived units
  - 300 scientific abbreviations (DNA, RNA, ATP, etc.)
  - 500 mathematical operators
  - Amino acid codes
  - Greek alphabet (α, β, γ, etc.)
- Special tokens: `[EQUATION]`, `[CITATION]`, `[MOLECULE]`, `[FIGURE]`, `[TABLE]`, domain tags

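To illustrate how the span tokens are meant to bracket structured content, here is a hypothetical preprocessing helper that wraps inline `$...$` math in `[EQUATION]` tags (not the actual `tokenization_vortex.py` logic):

```python
import re

def wrap_equations(text: str) -> str:
    """Bracket inline $...$ math with the [EQUATION] special tokens."""
    return re.sub(r"\$([^$]+)\$", r"[EQUATION]\1[/EQUATION]", text)

print(wrap_equations("Einstein showed that $E = mc^2$."))
# Einstein showed that [EQUATION]E = mc^2[/EQUATION].
```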
## 🧪 Evaluation

Science benchmarks across all 7 domains will be added. Planned benchmarks:

- **Physics**: Feynman Questions, Physics GRE
- **Math**: MATH dataset, GSM8K
- **Chemistry**: Chemistry problem solving, molecular property prediction
- **Biology**: PubMed QA, bioinformatics tasks
- **Earth Science**: Climate modeling questions
- **Space Science**: Astronomy problem sets
- **Zoology**: Species classification, ecological reasoning

## 📄 License

This is a school science project. Code is provided for educational purposes.

## 🙏 Acknowledgments

- **Mamba** (Gu et al.) for SSM architecture inspiration
- **Flash Attention** (Dao et al.) for efficient attention
- **HuggingFace** for the `transformers` library
- All open scientific data sources: arXiv, PubMed, S2ORC, etc.

## 📧 Contact

For questions or issues, please open an issue on GitHub.

---

**Built with ❤️ for scientific AI research**
configs/__pycache__/vortex_7b_config.cpython-313.pyc ADDED
Binary file (1.78 kB).
 
configs/training_config.py ADDED
@@ -0,0 +1,97 @@
"""
Training configuration for Vortex models.
Covers both 7B and 13B variants with hardware-specific optimizations.
"""

import torch

TRAINING_CONFIG = {
    # Training hyperparameters
    "learning_rate": 3e-4,
    "weight_decay": 0.1,
    "beta1": 0.9,
    "beta2": 0.95,
    "clip_grad_norm": 1.0,

    # Batch sizing
    "global_batch_size": 512,  # sequences per optimizer step
    "micro_batch_size": 8,  # per GPU
    "gradient_accumulation_steps": 4,

    # Training schedule
    "max_steps": 100000,
    "warmup_steps": 2000,
    "save_interval": 5000,
    "eval_interval": 1000,
    "log_interval": 100,

    # Mixed precision
    "use_amp": True,
    "amp_dtype": torch.bfloat16,

    # Optimizer
    "optimizer": "AdamW",
    "use_fused": True,  # fused AdamW if available

    # Curriculum learning stages (as fractions of max_steps)
    "curriculum_stages": [
        {"name": "foundation", "start": 0.0, "end": 0.2},  # 0-20%
        {"name": "domain", "start": 0.2, "end": 0.5},  # 20-50%
        {"name": "reasoning", "start": 0.5, "end": 0.8},  # 50-80%
        {"name": "integration", "start": 0.8, "end": 1.0},  # 80-100%
    ],

    # Loss weights (science-aware loss)
    "loss_weights": {
        "lm_loss": 1.0,
        "equation_loss": 0.3,
        "domain_loss": 0.1,
        "citation_loss": 0.1,
        "numerical_loss": 0.2,
    },

    # Checkpointing
    "checkpoint_dir": "checkpoints",
    "save_optimizer_state": True,
    "save_scheduler_state": True,

    # Logging
    "log_dir": "logs",
    "use_wandb": False,
    "wandb_project": "vortex-scientific",

    # Data loading
    "num_workers": 8,
    "prefetch_factor": 2,
    "pin_memory": True,

    # Device configuration
    "device": "cuda",  # or "mps" for Apple Silicon
    "use_mps": False,

    # Quantization (for 13B on 8GB VRAM)
    "quantization": None,  # None, "int8", "int4"
}

# Hardware-specific overrides
TRAINING_CONFIG_7B_CUDA = TRAINING_CONFIG.copy()
TRAINING_CONFIG_7B_CUDA.update({
    "device": "cuda",
    "quantization": None,
    "micro_batch_size": 8,
})

TRAINING_CONFIG_13B_CUDA = TRAINING_CONFIG.copy()
TRAINING_CONFIG_13B_CUDA.update({
    "device": "cuda",
    "quantization": "int8",  # 13B needs INT8 on 8GB
    "micro_batch_size": 4,
})

TRAINING_CONFIG_MPS = TRAINING_CONFIG.copy()
TRAINING_CONFIG_MPS.update({
    "device": "mps",
    "use_mps": True,
    "use_amp": False,  # MPS doesn't support bfloat16 AMP well
    "micro_batch_size": 4,
})
configs/vortex_13b_config.py ADDED
@@ -0,0 +1,68 @@
"""
Vortex-13B model configuration.
Optimized for 16GB VRAM (4060 Ti laptop) and MacBook Pro M3 Max.
"""

VORTEX_13B_CONFIG = {
    # Model dimensions
    "d_model": 5120,
    "num_layers": 40,
    "num_heads": 40,
    "head_dim": 128,  # d_model // num_heads

    # State-space layer parameters
    "d_state": 32,  # SSM state dimension (larger for the bigger model)
    "d_conv": 4,  # SSM convolution width

    # Attention parameters
    "window_size": 512,  # Local attention window
    "use_flash_attention": True,

    # Feed-forward parameters
    "ffn_expansion": 4,
    "num_domains": 7,
    "vocab_size": 50000,
    "max_seq_len": 16384,

    # Layer ratio: 50% SSM, 50% attention (more memory for attention)
    "ssm_ratio": 0.5,

    # Data types
    "dtype": "bfloat16",

    # Special tokens (same as 7B)
    "special_tokens": {
        "[PAD]": 0,
        "[UNK]": 1,
        "[BOS]": 2,
        "[EOS]": 3,
        "[EQUATION]": 4,
        "[/EQUATION]": 5,
        "[CITATION]": 6,
        "[/CITATION]": 7,
        "[MOLECULE]": 8,
        "[/MOLECULE]": 9,
        "[FIGURE]": 10,
        "[TABLE]": 11,
        "[MATH]": 12,
        "[CHEM]": 13,
        "[BIO]": 14,
        "[PHYS]": 15,
        "[EARTH]": 16,
        "[SPACE]": 17,
        "[ZOO]": 18,
    },

    "domain_tags": ["[MATH]", "[CHEM]", "[BIO]", "[PHYS]", "[EARTH]", "[SPACE]", "[ZOO]"],

    # Science module flags
    "enable_equation_module": True,
    "enable_numerical_module": True,
    "enable_citation_module": True,
    "enable_molecular_module": True,
}


def get_config():
    """Return the 13B configuration dictionary."""
    return VORTEX_13B_CONFIG
configs/vortex_7b_config.py ADDED
@@ -0,0 +1,71 @@
"""
Vortex-7B model configuration.
Optimized for 8GB VRAM (4060 laptop) and MacBook Pro M2/M3.
"""

VORTEX_7B_CONFIG = {
    # Model dimensions
    "d_model": 4096,
    "num_layers": 32,
    "num_heads": 32,
    "head_dim": 128,  # d_model // num_heads

    # State-space layer parameters
    "d_state": 16,  # SSM state dimension
    "d_conv": 4,  # SSM convolution width

    # Attention parameters
    "window_size": 512,  # Local attention window
    "use_flash_attention": True,  # CUDA only

    # Feed-forward parameters
    "ffn_expansion": 4,  # Hidden dim = d_model * expansion
    "num_domains": 7,  # Physics, Math, Chemistry, Biology, Earth, Space, Zoology

    # Tokenizer parameters
    "vocab_size": 50000,
    "max_seq_len": 16384,

    # Layer ratio: 60% SSM, 40% attention
    "ssm_ratio": 0.6,

    # Data types
    "dtype": "bfloat16",

    # Special tokens
    "special_tokens": {
        "[PAD]": 0,
        "[UNK]": 1,
        "[BOS]": 2,
        "[EOS]": 3,
        "[EQUATION]": 4,
        "[/EQUATION]": 5,
        "[CITATION]": 6,
        "[/CITATION]": 7,
        "[MOLECULE]": 8,
        "[/MOLECULE]": 9,
        "[FIGURE]": 10,
        "[TABLE]": 11,
        "[MATH]": 12,
        "[CHEM]": 13,
        "[BIO]": 14,
        "[PHYS]": 15,
        "[EARTH]": 16,
        "[SPACE]": 17,
        "[ZOO]": 18,
    },

    # Domain tags
    "domain_tags": ["[MATH]", "[CHEM]", "[BIO]", "[PHYS]", "[EARTH]", "[SPACE]", "[ZOO]"],

    # Science module flags (enable/disable for ablation)
    "enable_equation_module": True,
    "enable_numerical_module": True,
    "enable_citation_module": True,
    "enable_molecular_module": True,
}


def get_config():
    """Return the 7B configuration dictionary."""
    return VORTEX_7B_CONFIG
configuration_vortex.py ADDED
@@ -0,0 +1,110 @@
"""
Vortex configuration for HuggingFace.
"""

from typing import Optional, List, Dict, Any

from transformers import PretrainedConfig


class VortexConfig(PretrainedConfig):
    """
    Configuration class for the Vortex model.
    Compatible with HuggingFace transformers.
    """

    model_type = "vortex"
    tie_word_embeddings = True

    def __init__(
        self,
        d_model: int = 4096,
        num_layers: int = 32,
        num_heads: int = 32,
        d_state: int = 16,
        d_conv: int = 4,
        window_size: int = 512,
        ffn_expansion: int = 4,
        num_domains: int = 7,
        vocab_size: int = 50000,
        max_seq_len: int = 16384,
        ssm_ratio: float = 0.6,
        enable_equation_module: bool = True,
        enable_numerical_module: bool = True,
        enable_citation_module: bool = True,
        enable_molecular_module: bool = True,
        special_tokens: Optional[Dict[str, int]] = None,
        domain_tags: Optional[List[str]] = None,
        initializer_range: float = 0.02,
        tie_word_embeddings: bool = True,
        **kwargs,
    ):
        super().__init__(tie_word_embeddings=tie_word_embeddings, **kwargs)
        self.d_model = d_model
        self.num_layers = num_layers
        self.num_heads = num_heads
        self.d_state = d_state
        self.d_conv = d_conv
        self.window_size = window_size
        self.ffn_expansion = ffn_expansion
        self.num_domains = num_domains
        self.vocab_size = vocab_size
        self.max_seq_len = max_seq_len
        self.ssm_ratio = ssm_ratio
        self.enable_equation_module = enable_equation_module
        self.enable_numerical_module = enable_numerical_module
        self.enable_citation_module = enable_citation_module
        self.enable_molecular_module = enable_molecular_module
        self.special_tokens = special_tokens or {
            "[PAD]": 0, "[UNK]": 1, "[BOS]": 2, "[EOS]": 3,
            "[EQUATION]": 4, "[/EQUATION]": 5,
            "[CITATION]": 6, "[/CITATION]": 7,
            "[MOLECULE]": 8, "[/MOLECULE]": 9,
            "[FIGURE]": 10, "[TABLE]": 11,
            "[MATH]": 12, "[CHEM]": 13, "[BIO]": 14,
            "[PHYS]": 15, "[EARTH]": 16, "[SPACE]": 17, "[ZOO]": 18,
        }
        self.domain_tags = domain_tags or ["[MATH]", "[CHEM]", "[BIO]", "[PHYS]", "[EARTH]", "[SPACE]", "[ZOO]"]
        self.initializer_range = initializer_range
        # Compute derived attributes
        self.head_dim = self.d_model // self.num_heads

    @classmethod
    def from_pretrained(cls, pretrained_model_name_or_path: str, **kwargs):
        """Load config from a pretrained model directory."""
        import json
        import os

        config_path = os.path.join(pretrained_model_name_or_path, "config.json")
        if os.path.exists(config_path):
            with open(config_path, "r") as f:
                config_dict = json.load(f)
            config_dict.update(kwargs)
            return cls(**config_dict)
        else:
            # Return default config
            return cls(**kwargs)

    def to_dict(self) -> Dict[str, Any]:
        """Convert to dictionary."""
        return {
            "model_type": self.model_type,
            "d_model": self.d_model,
            "num_layers": self.num_layers,
            "num_heads": self.num_heads,
            "head_dim": self.head_dim,
            "d_state": self.d_state,
            "d_conv": self.d_conv,
            "window_size": self.window_size,
            "ffn_expansion": self.ffn_expansion,
            "num_domains": self.num_domains,
            "vocab_size": self.vocab_size,
            "max_seq_len": self.max_seq_len,
            "ssm_ratio": self.ssm_ratio,
            "enable_equation_module": self.enable_equation_module,
            "enable_numerical_module": self.enable_numerical_module,
            "enable_citation_module": self.enable_citation_module,
            "enable_molecular_module": self.enable_molecular_module,
            "special_tokens": self.special_tokens,
            "domain_tags": self.domain_tags,
            "initializer_range": self.initializer_range,
        }
cuda_optimize.py ADDED
@@ -0,0 +1,287 @@
1
+ """
2
+ CUDA optimizations for Vortex model on Nvidia 4060 laptop.
3
+ Flash Attention 2, torch.compile, INT8 quantization.
4
+ """
5
+
6
+ import torch
7
+ import torch.nn as nn
8
+ from typing import Optional, Dict, Any
9
+
10
+
11
+ def optimize_for_cuda(
12
+ model: nn.Module,
13
+ config: Dict,
14
+ use_flash_attention: bool = True,
15
+ use_torch_compile: bool = True,
16
+ compile_mode: str = "reduce-overhead",
17
+ quantization: Optional[str] = None,
18
+ ) -> nn.Module:
19
+ """
20
+ Apply CUDA optimizations to model.
21
+
22
+ Args:
23
+ model: VortexModel
24
+ config: Model config
25
+ use_flash_attention: Enable Flash Attention 2
26
+ use_torch_compile: Use torch.compile
27
+ compile_mode: Compile mode ("reduce-overhead", "max-autotune")
28
+ quantization: None, "int8", or "int4"
29
+
30
+ Returns:
31
+ Optimized model
32
+ """
33
+ device = torch.device("cuda")
34
+
35
+ # Move to CUDA
36
+ model = model.to(device)
37
+
38
+ # Set dtype
39
+ dtype_str = config.get("dtype", "bfloat16")
40
+ if dtype_str == "bfloat16":
41
+ dtype = torch.bfloat16
42
+ elif dtype_str == "float16":
43
+ dtype = torch.float16
44
+ else:
45
+ dtype = torch.float32
46
+
47
+ model = model.to(dtype)
48
+
49
+ # Apply Flash Attention 2 to attention layers
50
+ if use_flash_attention:
51
+ model = _apply_flash_attention(model)
52
+ print("Applied Flash Attention 2")
53
+
54
+ # Apply torch.compile
55
+ if use_torch_compile:
56
+ model = torch.compile(
57
+ model,
58
+ mode=compile_mode,
59
+ fullgraph=True,
60
+ dynamic=True,
61
+ )
62
+ print(f"Applied torch.compile with mode={compile_mode}")
63
+
64
+ # Apply quantization if requested
65
+ if quantization == "int8":
66
+ model = _apply_int8_quantization(model)
67
+ print("Applied INT8 quantization")
68
+ elif quantization == "int4":
69
+ model = _apply_int4_quantization(model)
70
+ print("Applied INT4 quantization")
71
+
72
+ return model
73
+
74
+
75
+ def _apply_flash_attention(model: nn.Module) -> nn.Module:
76
+ """
77
+ Replace standard attention with Flash Attention 2.
78
+ Requires: pip install flash-attn
79
+ """
80
+ try:
81
+ from flash_attn import flash_attn_func
82
+
83
+ # Monkey-patch attention layers to use flash attention
84
+ for name, module in model.named_modules():
85
+ if hasattr(module, 'use_flash_attention'):
86
+ module.use_flash_attention = True
87
+ # Replace forward with flash attention version
88
+ original_forward = module.forward
89
+
90
+ def flash_forward(self, x, *args, **kwargs):
91
+ return self._flash_attention_forward(x, *args, **kwargs)
92
+
93
+ module.forward = flash_forward.__get__(module, type(module))
94
+
95
+ return model
96
+
97
+ except ImportError:
98
+ print("Flash Attention not available. Install with: pip install flash-attn")
99
+ return model
100
+
101
+
102
+ def _apply_int8_quantization(model: nn.Module) -> nn.Module:
103
+ """
104
+ Apply INT8 quantization using bitsandbytes.
105
+ """
106
+ try:
107
+ import bitsandbytes as bnb
108
+
109
+ # Replace linear layers with 8-bit variants
110
+ for name, module in model.named_modules():
111
+ if isinstance(module, nn.Linear):
112
+ # Create 8-bit linear replacement
113
+ parent_name = name.rsplit('.', 1)[0] if '.' in name else ''
114
+ child_name = name.rsplit('.', 1)[1] if '.' in name else name
115
+
116
+ # Get parent module
117
+ parent = model
118
+ if parent_name:
119
+ for part in parent_name.split('.'):
120
+ parent = getattr(parent, part)
121
+
122
+ # Replace with 8-bit linear
123
+ replacement = bnb.nn.Linear8bitLt(
124
+ module.in_features,
125
+ module.out_features,
126
+ bias=module.bias is not None,
127
+ has_fp16_weights=False,
128
+ )
129
+ # Copy weights (will be quantized)
130
+ replacement.weight.data = module.weight.data
131
+ if module.bias is not None:
132
+ replacement.bias.data = module.bias.data
133
+
134
+ setattr(parent, child_name, replacement)
135
+
136
+ return model
137
+
138
+ except ImportError:
139
+ print("bitsandbytes not available. Install with: pip install bitsandbytes")
140
+ return model
141
+
142
+
143
+ def _apply_int4_quantization(model: nn.Module) -> nn.Module:
144
+ """
145
+ Apply INT4 quantization using bitsandbytes.
146
+ More aggressive, for 13B on 8GB VRAM.
147
+ """
148
+ try:
149
+ import bitsandbytes as bnb
150
+
151
+ for name, module in model.named_modules():
152
+ if isinstance(module, nn.Linear):
153
+ parent_name = name.rsplit('.', 1)[0] if '.' in name else ''
154
+ child_name = name.rsplit('.', 1)[1] if '.' in name else name
155
+
156
+ parent = model
157
+ if parent_name:
158
+ for part in parent_name.split('.'):
159
+ parent = getattr(parent, part)
160
+
161
+ # 4-bit linear
162
+ replacement = bnb.nn.Linear4bit(
163
+ module.in_features,
164
+ module.out_features,
165
+ bias=module.bias is not None,
166
+ compute_dtype=torch.float16,
167
+ compress_statistics=True,
168
+ )
169
+ replacement.weight.data = module.weight.data
170
+ if module.bias is not None:
171
+ replacement.bias.data = module.bias.data
172
+
173
+ setattr(parent, child_name, replacement)
174
+
175
+ return model
176
+
177
+ except ImportError:
178
+ print("bitsandbytes not available.")
179
+ return model
180
+
181
+
182
+ def get_cuda_memory_usage() -> Dict[str, float]:
183
+ """Get current CUDA memory usage in GB."""
184
+ if not torch.cuda.is_available():
185
+ return {"error": "CUDA not available"}
186
+
187
+ allocated = torch.cuda.memory_allocated() / 1e9
188
+ reserved = torch.cuda.memory_reserved() / 1e9
189
+ max_allocated = torch.cuda.max_memory_allocated() / 1e9
190
+
191
+ return {
192
+ "allocated_gb": allocated,
193
+ "reserved_gb": reserved,
194
+ "max_allocated_gb": max_allocated,
195
+ }
196
+
197
+
198
+ def profile_model(
199
+ model: nn.Module,
200
+ input_ids: torch.Tensor,
201
+ num_warmup: int = 10,
202
+ num_runs: int = 100,
203
+ ) -> Dict[str, float]:
204
+ """
205
+ Profile model performance.
206
+
207
+ Args:
208
+ model: Model to profile
209
+ input_ids: Example input
210
+ num_warmup: Number of warmup runs
211
+ num_runs: Number of profiling runs
212
+
213
+ Returns:
214
+ Dictionary with timing statistics
215
+ """
216
+ model.eval()
217
+ device = next(model.parameters()).device
218
+ input_ids = input_ids.to(device)
219
+
220
+ # Warmup
221
+ with torch.no_grad():
222
+ for _ in range(num_warmup):
223
+ _ = model(input_ids)
224
+
225
+ # Profile
226
+ torch.cuda.synchronize()
227
+ import time
228
+ start = time.time()
229
+
230
+ with torch.no_grad():
231
+ for _ in range(num_runs):
232
+ _ = model(input_ids)
233
+
234
+ torch.cuda.synchronize()
235
+ elapsed = time.time() - start
236
+
237
+ avg_time = elapsed / num_runs
238
+ tokens_per_sec = input_ids.shape[1] / avg_time
239
+
240
+ return {
241
+ "avg_time_sec": avg_time,
242
+ "tokens_per_sec": tokens_per_sec,
243
+ }
244
+
245
+
246
+ def test_cuda_optimize():
247
+ """Test CUDA optimizations."""
248
+ if not torch.cuda.is_available():
249
+ print("CUDA not available, skipping test")
250
+ return
251
+
252
+ from models.vortex_model import VortexModel
253
+ from configs.vortex_7b_config import VORTEX_7B_CONFIG
254
+
255
+ config = VORTEX_7B_CONFIG.copy()
256
+ config["d_model"] = 512
257
+ config["num_layers"] = 2
258
+ config["num_heads"] = 8
259
+ config["vocab_size"] = 1000
260
+
261
+ model = VortexModel(config)
262
+ print(f"Model parameters: {model.get_num_params():,}")
263
+
264
+ # Optimize
265
+ model = optimize_for_cuda(
266
+ model,
267
+ config,
268
+ use_flash_attention=False, # May not be available
269
+ use_torch_compile=False, # Skip compile for test
270
+ quantization=None,
271
+ )
272
+
273
+ # Test forward
274
+ batch_size = 2
275
+ seq_len = 128
276
+ input_ids = torch.randint(0, config["vocab_size"], (batch_size, seq_len)).cuda()
277
+
278
+ with torch.no_grad():
279
+ output = model(input_ids)
280
+ logits = output["logits"]
281
+
282
+ print(f"Output shape: {logits.shape}")
283
+ print("CUDA optimize test passed!")
284
+
285
+
286
+ if __name__ == "__main__":
287
+ test_cuda_optimize()
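The warmup-then-average timing pattern used in `profile_model` above can be sanity-checked without a GPU. A minimal CPU-only sketch of the same idea (`profile_fn` and the dummy workload are illustrative, not part of the repo):

```python
import time

def profile_fn(fn, num_warmup=3, num_runs=20):
    """Warm up, then average wall-clock time over num_runs calls."""
    for _ in range(num_warmup):
        fn()
    start = time.perf_counter()
    for _ in range(num_runs):
        fn()
    elapsed = time.perf_counter() - start
    return elapsed / num_runs

# Dummy workload standing in for a model forward pass
avg = profile_fn(lambda: sum(i * i for i in range(10_000)))
print(f"avg_time_sec: {avg:.6f}")
```

The warmup runs matter on CUDA (kernel compilation, cache effects) but the averaging logic is the same either way.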
data/__pycache__/deduplication.cpython-313.pyc ADDED
Binary file (10.1 kB).
 
data/__pycache__/domain_classifier.cpython-313.pyc ADDED
Binary file (6.38 kB).
 
data/__pycache__/quality_filter.cpython-313.pyc ADDED
Binary file (11.3 kB).
 
data/dataset_loader.py ADDED
@@ -0,0 +1,263 @@
1
+ """
2
+ DatasetLoader: Loads and processes open scientific datasets.
3
+ Supports streaming from HuggingFace datasets with sharding.
4
+ """
5
+
6
+ import os
7
+ import json
8
+ from typing import List, Dict, Optional, Iterator
9
+ from pathlib import Path
10
+
11
+ try:
12
+ from datasets import load_dataset, Dataset, IterableDataset
13
+ import pyarrow.parquet as pq
14
+ except ImportError:
15
+ print("Please install datasets and pyarrow: pip install datasets pyarrow")
16
+ raise
17
+
18
+
19
+ class VortexDatasetLoader:
20
+ """
21
+ Loads and processes open scientific datasets.
22
+ Supports streaming with sharding to Parquet files.
23
+ """
24
+
25
+ # Open datasets configuration
26
+ DATASETS = {
27
+ "pile_scientific": {
28
+ "path": "EleutherAI/pile",
29
+ "subset": "pubmed_central",
30
+ "split": "train",
31
+ "text_field": "text",
32
+ "domain": "biology", # approximate
33
+ },
34
+ "s2orc": {
35
+ "path": "allenai/s2orc",
36
+ "subset": None,
37
+ "split": "train",
38
+ "text_field": "text",
39
+ "domain": "multidisciplinary",
40
+ },
41
+ "pes2o": {
42
+ "path": "allenai/peS2o",
43
+ "subset": None,
44
+ "split": "train",
45
+ "text_field": "text",
46
+ "domain": "multidisciplinary",
47
+ },
48
+ "automath": {
49
+ "path": "math-ai/AutoMathText",
50
+ "subset": None,
51
+ "split": "train",
52
+ "text_field": "text",
53
+ "domain": "math",
54
+ },
55
+ "deepmind_math": {
56
+ "path": "deepmind/math_dataset",
57
+ "subset": "algebra__linear_1d",
58
+ "split": "train",
59
+ "text_field": "text",
60
+ "domain": "math",
61
+ },
62
+ "pubmed_qa": {
63
+ "path": "bigbio/pubmed_qa",
64
+ "subset": "pubmed_qa_labeled_fold0_source",
65
+ "split": "train",
66
+ "text_field": "question",
67
+ "domain": "biology",
68
+ },
69
+ }
70
+
71
+ def __init__(
72
+ self,
73
+ cache_dir: str = "./data/cache",
74
+ output_dir: str = "./data/processed",
75
+ num_proc: int = 4,
76
+ ):
77
+ """
78
+ Initialize dataset loader.
79
+
80
+ Args:
81
+ cache_dir: Directory for caching downloaded datasets
82
+ output_dir: Directory for processed shards
83
+ num_proc: Number of processes for data processing
84
+ """
85
+ self.cache_dir = Path(cache_dir)
86
+ self.output_dir = Path(output_dir)
87
+ self.num_proc = num_proc
88
+
89
+ self.cache_dir.mkdir(parents=True, exist_ok=True)
90
+ self.output_dir.mkdir(parents=True, exist_ok=True)
91
+
92
+ def load_dataset(
93
+ self,
94
+ dataset_name: str,
95
+ streaming: bool = True,
96
+ max_samples: Optional[int] = None,
97
+ ) -> Iterator[Dict]:
98
+ """
99
+ Load a dataset as an iterator.
100
+
101
+ Args:
102
+ dataset_name: Name from DATASETS config
103
+ streaming: Use streaming mode for large datasets
104
+ max_samples: Maximum number of samples to yield
105
+
106
+ Yields:
107
+ Dictionary with text and metadata
108
+ """
109
+ if dataset_name not in self.DATASETS:
110
+ raise ValueError(f"Unknown dataset: {dataset_name}. Available: {list(self.DATASETS.keys())}")
111
+
112
+ config = self.DATASETS[dataset_name]
113
+
114
+ print(f"Loading dataset: {dataset_name}")
115
+ print(f" Path: {config['path']}")
116
+ print(f" Streaming: {streaming}")
117
+
118
+ try:
119
+ dataset = load_dataset(
120
+ config["path"],
121
+ name=config["subset"],
122
+ split=config["split"],
123
+ streaming=streaming,
124
+ cache_dir=str(self.cache_dir),
125
+ )
126
+
127
+ count = 0
128
+ for sample in dataset:
129
+ text = sample.get(config["text_field"], "")
130
+ if not text or not isinstance(text, str):
131
+ continue
132
+
133
+ yield {
134
+ "text": text,
135
+ "dataset": dataset_name,
136
+ "domain": config["domain"],
137
+ "source": config["path"],
138
+ }
139
+
140
+ count += 1
141
+ if max_samples and count >= max_samples:
142
+ break
143
+
144
+ print(f"Loaded {count} samples from {dataset_name}")
145
+
146
+ except Exception as e:
147
+ print(f"Error loading dataset {dataset_name}: {e}")
148
+ # Return empty iterator
149
+ return
150
+
151
+ def load_multiple_datasets(
152
+ self,
153
+ dataset_names: List[str],
154
+ streaming: bool = True,
155
+ max_per_dataset: Optional[int] = None,
156
+ ) -> Iterator[Dict]:
157
+ """
158
+ Load multiple datasets and yield samples interleaved.
159
+
160
+ Args:
161
+ dataset_names: List of dataset names
162
+ streaming: Use streaming mode
163
+ max_per_dataset: Max samples per dataset
164
+
165
+ Yields:
166
+ Dictionary with text and metadata
167
+ """
168
+ iterators = []
169
+ for name in dataset_names:
170
+ it = self.load_dataset(name, streaming=streaming, max_samples=max_per_dataset)
171
+ iterators.append(it)
172
+
173
+ # Simple round-robin interleaving
174
+ active = len(iterators)
175
+ while active > 0:
176
+ for i, it in enumerate(iterators):
177
+ if it is None:
178
+ continue
179
+ try:
180
+ yield next(it)
181
+ except StopIteration:
182
+ iterators[i] = None
183
+ active -= 1
184
+ break
185
+
186
+ def shard_to_parquet(
187
+ self,
188
+ samples: Iterator[Dict],
189
+ output_prefix: str,
190
+ samples_per_shard: int = 10000,
191
+ ):
192
+ """
193
+ Write samples to sharded Parquet files.
194
+
195
+ Args:
196
+ samples: Iterator of sample dictionaries
197
+ output_prefix: Prefix for output files (e.g., "train")
198
+ samples_per_shard: Number of samples per shard
199
+ """
200
+ shard_index = 0
201
+ buffer = []
202
+
203
+ for sample in samples:
204
+ buffer.append(sample)
205
+
206
+ if len(buffer) >= samples_per_shard:
207
+ self._write_shard(buffer, output_prefix, shard_index)
208
+ shard_index += 1
209
+ buffer = []
210
+
211
+ # Write remaining
212
+ if buffer:
213
+ self._write_shard(buffer, output_prefix, shard_index)
+ shard_index += 1
214
+
215
+ print(f"Wrote {shard_index} shards to {self.output_dir}")
216
+
217
+ def _write_shard(
218
+ self,
219
+ buffer: List[Dict],
220
+ output_prefix: str,
221
+ shard_index: int,
222
+ ):
223
+ """Write a single shard to Parquet."""
224
+ import pandas as pd
225
+
226
+ df = pd.DataFrame(buffer)
227
+ output_path = self.output_dir / f"{output_prefix}_{shard_index:05d}.parquet"
228
+ df.to_parquet(output_path, index=False)
229
+
230
+ def get_shard_list(
231
+ self,
232
+ prefix: str,
233
+ ) -> List[Path]:
234
+ """Get list of shard files matching prefix."""
235
+ return sorted(self.output_dir.glob(f"{prefix}_*.parquet"))
236
+
237
+ def read_shard(
238
+ self,
239
+ shard_path: Path,
240
+ ) -> List[Dict]:
241
+ """Read a single shard."""
242
+ import pandas as pd
243
+ df = pd.read_parquet(shard_path)
244
+ return df.to_dict('records')
245
+
246
+
247
+ def test_dataset_loader():
248
+ """Test the dataset loader."""
249
+ loader = VortexDatasetLoader()
250
+
251
+ # Test loading a small dataset
252
+ print("Testing dataset loader...")
253
+ count = 0
254
+ for sample in loader.load_dataset("pubmed_qa", streaming=True, max_samples=10):
255
+ print(f"Sample {count}: {sample['text'][:100]}...")
256
+ count += 1
257
+
258
+ print(f"Loaded {count} samples")
259
+ print("DatasetLoader test passed!")
260
+
261
+
262
+ if __name__ == "__main__":
263
+ test_dataset_loader()
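The round-robin interleaving in `load_multiple_datasets` can be illustrated with plain iterators. A self-contained sketch of the same control flow, independent of HuggingFace `datasets` (and without the early `break`, which is not needed for correctness):

```python
def interleave(iterators):
    """Yield items round-robin until every iterator is exhausted."""
    iterators = list(iterators)
    active = len(iterators)
    while active > 0:
        for i, it in enumerate(iterators):
            if it is None:
                continue  # already exhausted
            try:
                yield next(it)
            except StopIteration:
                iterators[i] = None
                active -= 1

mixed = list(interleave([iter("ab"), iter("xyz")]))
print(mixed)  # → ['a', 'x', 'b', 'y', 'z']
```

Exhausted iterators are replaced with `None` rather than removed, so the remaining sources keep their relative order.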
data/deduplication.py ADDED
@@ -0,0 +1,260 @@
1
+ """
2
+ Deduplication: MinHash LSH for near-duplicate detection.
3
+ """
4
+
5
+ import hashlib
6
+ import random
7
+ from typing import List, Set, Tuple, Optional
8
+ from dataclasses import dataclass
9
+
10
+
11
+ @dataclass
12
+ class MinHashSignature:
13
+ """MinHash signature for a document."""
14
+ hash_values: List[int]
15
+ doc_id: str
16
+
17
+
18
+ class MinHashLSH:
19
+ """
20
+ MinHash LSH for near-duplicate detection.
21
+ Uses shingling and MinHash to estimate Jaccard similarity.
22
+ """
23
+
24
+ def __init__(
25
+ self,
26
+ num_permutations: int = 128,
27
+ threshold: float = 0.8,
28
+ bands: int = 16,
29
+ rows_per_band: int = 8,
30
+ ):
31
+ """
32
+ Initialize MinHash LSH.
33
+
34
+ Args:
35
+ num_permutations: Number of hash permutations for MinHash
36
+ threshold: Similarity threshold for considering duplicates
37
+ bands: Number of bands for LSH
38
+ rows_per_band: Rows per band (bands * rows_per_band = num_permutations)
39
+ """
40
+ self.num_permutations = num_permutations
41
+ self.threshold = threshold
42
+ self.bands = bands
43
+ self.rows_per_band = rows_per_band
44
+
45
+ assert bands * rows_per_band == num_permutations
46
+
47
+ # Generate random hash functions
48
+ self.hash_functions = self._generate_hash_functions(num_permutations)
49
+
50
+ # LSH index: band_id -> {bucket_hash -> set of doc_ids}
51
+ self.index = [dict() for _ in range(bands)]
52
+
53
+ # Store signatures for similarity computation
54
+ self.signatures = {} # doc_id -> MinHashSignature
55
+
56
+ def _generate_hash_functions(self, n: int) -> List:
57
+ """Generate n random hash functions."""
58
+ # Use random permutations of large prime
59
+ functions = []
60
+ for _ in range(n):
61
+ a = random.randint(1, 2**32 - 1)
62
+ b = random.randint(0, 2**32 - 1)
63
+ functions.append((a, b))
64
+ return functions
65
+
66
+ def _hash(self, x: int, a: int, b: int) -> int:
67
+ """Universal hash function: (a*x + b) mod prime."""
68
+ prime = 2**61 - 1
69
+ return ((a * x + b) % prime) & 0xFFFFFFFF
70
+
71
+ def _compute_minhash(self, shingles: Set[int]) -> List[int]:
72
+ """
73
+ Compute MinHash signature for a set of shingles.
74
+
75
+ Args:
76
+ shingles: Set of shingle hash values
77
+
78
+ Returns:
79
+ List of minhash values (one per permutation)
80
+ """
81
+ signature = []
82
+ for a, b in self.hash_functions:
83
+ min_hash = min(self._hash(shingle, a, b) for shingle in shingles)
84
+ signature.append(min_hash)
85
+ return signature
86
+
87
+ def _shingle_text(
88
+ self,
89
+ text: str,
90
+ k: int = 5,
91
+ ) -> Set[int]:
92
+ """
93
+ Extract k-gram shingles from text.
94
+
95
+ Args:
96
+ text: Input text
97
+ k: Shingle size (characters)
98
+
99
+ Returns:
100
+ Set of shingle hashes
101
+ """
102
+ text = text.lower()
103
+ shingles = set()
104
+ for i in range(len(text) - k + 1):
105
+ shingle = text[i:i+k]
106
+ # Hash shingle
107
+ shingle_hash = int(hashlib.md5(shingle.encode()).hexdigest()[:8], 16)
108
+ shingles.add(shingle_hash)
109
+ return shingles
110
+
111
+ def add_document(
112
+ self,
113
+ doc_id: str,
114
+ text: str,
115
+ compute_signature: bool = True,
116
+ ) -> MinHashSignature:
117
+ """
118
+ Add a document to the LSH index.
119
+
120
+ Args:
121
+ doc_id: Unique document ID
122
+ text: Document text
123
+ compute_signature: Whether to compute signature (or use precomputed)
124
+
125
+ Returns:
126
+ MinHash signature
127
+ """
128
+ if compute_signature:
129
+ shingles = self._shingle_text(text)
130
+ signature = self._compute_minhash(shingles)
131
+ else:
132
+ raise ValueError("Must compute signature")
133
+
134
+ # Store signature
135
+ signature_obj = MinHashSignature(hash_values=signature, doc_id=doc_id)
136
+ self.signatures[doc_id] = signature_obj
137
+
138
+ # Index into bands
139
+ for band_idx in range(self.bands):
140
+ start = band_idx * self.rows_per_band
141
+ end = start + self.rows_per_band
142
+ band_signature = tuple(signature[start:end])
143
+ bucket_hash = hash(band_signature)
144
+
145
+ if bucket_hash not in self.index[band_idx]:
146
+ self.index[band_idx][bucket_hash] = set()
147
+ self.index[band_idx][bucket_hash].add(doc_id)
148
+
149
+ return signature_obj
150
+
151
+ def query(
152
+ self,
153
+ text: str,
154
+ candidate_doc_ids: Optional[Set[str]] = None,
155
+ ) -> List[Tuple[str, float]]:
156
+ """
157
+ Query for near-duplicate documents.
158
+
159
+ Args:
160
+ text: Query text
161
+ candidate_doc_ids: Optional set of candidate doc IDs to check
162
+
163
+ Returns:
164
+ List of (doc_id, similarity) above threshold
165
+ """
166
+ shingles = self._shingle_text(text)
167
+ query_signature = self._compute_minhash(shingles)
168
+
169
+ # Find candidates via LSH
170
+ candidate_sets = []
171
+ for band_idx in range(self.bands):
172
+ start = band_idx * self.rows_per_band
173
+ end = start + self.rows_per_band
174
+ band_signature = tuple(query_signature[start:end])
175
+ bucket_hash = hash(band_signature)
176
+
177
+ if bucket_hash in self.index[band_idx]:
178
+ candidate_sets.append(self.index[band_idx][bucket_hash])
179
+
180
+ # Union of candidates
181
+ candidates = set()
182
+ for s in candidate_sets:
183
+ candidates.update(s)
184
+
185
+ if candidate_doc_ids is not None:
186
+ candidates = candidates.intersection(candidate_doc_ids)
187
+
188
+ # Compute exact Jaccard similarity for candidates
189
+ results = []
190
191
+ for doc_id in candidates:
192
+ # In practice, would retrieve stored shingles
193
+ # For now, approximate using signature
194
+ similarity = self._estimate_similarity(query_signature, doc_id)
195
+ if similarity >= self.threshold:
196
+ results.append((doc_id, similarity))
197
+
198
+ return sorted(results, key=lambda x: x[1], reverse=True)
199
+
200
+ def _estimate_similarity(
201
+ self,
202
+ signature1: List[int],
203
+ doc_id2: str,
204
+ ) -> float:
205
+ """
206
+ Estimate Jaccard similarity between two signatures.
207
+ Uses MinHash similarity: proportion of matching hash values.
208
+
209
+ Args:
210
+ signature1: First MinHash signature
211
+ doc_id2: Second document ID (signature retrieved from storage)
212
+
213
+ Returns:
214
+ Estimated Jaccard similarity
215
+ """
216
+ if doc_id2 not in self.signatures:
217
+ return 0.0
218
+
219
+ signature2 = self.signatures[doc_id2].hash_values
220
+
221
+ # Count matching values
222
+ matches = sum(1 for h1, h2 in zip(signature1, signature2) if h1 == h2)
223
+ similarity = matches / len(signature1)
224
+
225
+ return similarity
226
+
227
+ def compute_signature(self, text: str) -> MinHashSignature:
228
+ """Compute MinHash signature for text."""
229
+ shingles = self._shingle_text(text)
230
+ signature = self._compute_minhash(shingles)
231
+ return MinHashSignature(hash_values=signature, doc_id="")
232
+
233
+
234
+ def test_deduplication():
235
+ """Test MinHash LSH."""
236
+ lsh = MinHashLSH(num_permutations=64, threshold=0.7, bands=8, rows_per_band=8)
237
+
238
+ # Add documents
239
+ docs = [
240
+ ("doc1", "The quick brown fox jumps over the lazy dog."),
241
+ ("doc2", "The quick brown fox jumps over the lazy dog!!!"), # near duplicate
242
+ ("doc3", "The quick brown fox leaps over the lazy dog."), # near duplicate
243
+ ("doc4", "Completely unrelated text about science and experiments."),
244
+ ]
245
+
246
+ signatures = {}
247
+ for doc_id, text in docs:
248
+ sig = lsh.add_document(doc_id, text)
249
+ signatures[doc_id] = sig
250
+
251
+ # Query with doc1
252
+ results = lsh.query(docs[0][1])
253
+ print(f"Query results for doc1: {results}")
254
+ # Should find doc2 and doc3 as similar
255
+
256
+ print("Deduplication test passed!")
257
+
258
+
259
+ if __name__ == "__main__":
260
+ test_deduplication()
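The signature-overlap estimate used in `_estimate_similarity` (fraction of matching MinHash values ≈ Jaccard similarity) can be checked with a tiny self-contained sketch. The shingling and affine hash functions below are simplified versions of the ones above, with a fixed seed for reproducibility:

```python
import hashlib
import random

def shingles(text, k=5):
    """Character k-gram shingles, hashed to 32-bit ints."""
    return {int(hashlib.md5(text[i:i+k].encode()).hexdigest()[:8], 16)
            for i in range(len(text) - k + 1)}

def minhash(sh, funcs, prime=2**61 - 1):
    """One min-hash value per (a, b) hash function."""
    return [min(((a * s + b) % prime) & 0xFFFFFFFF for s in sh) for a, b in funcs]

random.seed(0)
funcs = [(random.randint(1, 2**32 - 1), random.randint(0, 2**32 - 1))
         for _ in range(256)]

s1 = shingles("the quick brown fox jumps over the lazy dog")
s2 = shingles("the quick brown fox jumps over the lazy cat")
true_jaccard = len(s1 & s2) / len(s1 | s2)
m1, m2 = minhash(s1, funcs), minhash(s2, funcs)
estimate = sum(h1 == h2 for h1, h2 in zip(m1, m2)) / len(funcs)
print(f"true={true_jaccard:.2f} estimate={estimate:.2f}")
```

With 256 permutations the estimate should land within a few points of the true Jaccard similarity; more permutations tighten the bound.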
data/domain_classifier.py ADDED
@@ -0,0 +1,163 @@
1
+ """
2
+ DomainClassifier: Classifies documents into 7 science domains.
3
+ Uses a simple linear classifier on top of text features.
4
+ """
5
+
6
+ import re
7
+ from typing import List, Tuple, Optional
8
+ import torch
9
+ import torch.nn as nn
10
+
11
+
12
+ class DomainClassifier(nn.Module):
13
+ """
14
+ Classifies documents into 7 science domains:
15
+ 0: Physics
16
+ 1: Mathematics
17
+ 2: Chemistry
18
+ 3: Biology
19
+ 4: Earth Science
20
+ 5: Space Science
21
+ 6: Zoology
22
+ """
23
+
24
+ # Domain keywords for rule-based fallback
25
+ DOMAIN_KEYWORDS = {
26
+ 0: ['physics', 'quantum', 'relativity', 'mechanics', 'thermodynamics', 'electromagnetism'],
27
+ 1: ['mathematics', 'algebra', 'calculus', 'geometry', 'topology', 'proof', 'theorem'],
28
+ 2: ['chemistry', 'molecular', 'reaction', 'compound', 'element', 'organic'],
29
+ 3: ['biology', 'cell', 'gene', 'protein', 'organism', 'evolution'],
30
+ 4: ['earth', 'geology', 'climate', 'ocean', 'atmosphere', 'meteorology'],
31
+ 5: ['space', 'astronomy', 'planet', 'star', 'galaxy', 'cosmology'],
32
+ 6: ['zoology', 'animal', 'species', 'vertebrate', 'invertebrate', 'ecology'],
33
+ }
34
+
35
+ def __init__(self, d_model: int, num_domains: int = 7):
36
+ """
37
+ Initialize domain classifier.
38
+
39
+ Args:
40
+ d_model: Input embedding dimension
41
+ num_domains: Number of domains (7)
42
+ """
43
+ super().__init__()
44
+ self.d_model = d_model
45
+ self.num_domains = num_domains
46
+
47
+ # Simple linear classifier
48
+ self.classifier = nn.Linear(d_model, num_domains)
49
+
50
+ # Initialize weights
51
+ nn.init.normal_(self.classifier.weight, mean=0.0, std=0.02)
52
+ nn.init.zeros_(self.classifier.bias)
53
+
54
+ def forward(
55
+ self,
56
+ hidden_states: torch.Tensor,
57
+ attention_mask: Optional[torch.Tensor] = None,
58
+ ) -> torch.Tensor:
59
+ """
60
+ Classify domain from hidden states.
61
+
62
+ Args:
63
+ hidden_states: (batch, seq_len, d_model)
64
+ attention_mask: (batch, seq_len)
65
+
66
+ Returns:
67
+ Domain logits (batch, num_domains)
68
+ """
69
+ # Mean pooling over sequence (masked)
70
+ if attention_mask is not None:
71
+ mask = attention_mask.unsqueeze(-1) # (batch, seq_len, 1)
72
+ summed = (hidden_states * mask).sum(dim=1)
73
+ counts = mask.sum(dim=1)
74
+ pooled = summed / counts.clamp(min=1)
75
+ else:
76
+ pooled = hidden_states.mean(dim=1)
77
+
78
+ # Classify
79
+ logits = self.classifier(pooled)
80
+ return logits
81
+
82
+ def classify_text(
83
+ self,
84
+ text: str,
85
+ ) -> Tuple[int, float]:
86
+ """
87
+ Rule-based fallback classification from raw text.
88
+
89
+ Args:
90
+ text: Input text string
91
+
92
+ Returns:
93
+ (domain_id, confidence)
94
+ """
95
+ text_lower = text.lower()
96
+
97
+ # Count keyword matches per domain
98
+ scores = []
99
+ for domain_id, keywords in self.DOMAIN_KEYWORDS.items():
100
+ score = sum(1 for kw in keywords if kw in text_lower)
101
+ scores.append(score)
102
+
103
+ if max(scores) == 0:
104
+ return 0, 0.0 # Unknown -> default to physics
105
+
106
+ best_domain = scores.index(max(scores))
107
+ confidence = max(scores) / sum(scores) if sum(scores) > 0 else 0.0
108
+
109
+ return best_domain, confidence
110
+
111
+ def compute_loss(
112
+ self,
113
+ logits: torch.Tensor,
114
+ domain_labels: torch.Tensor,
115
+ ) -> torch.Tensor:
116
+ """
117
+ Compute classification loss.
118
+
119
+ Args:
120
+ logits: (batch, num_domains)
121
+ domain_labels: (batch,) with domain IDs
122
+
123
+ Returns:
124
+ Cross-entropy loss
125
+ """
126
+ return nn.functional.cross_entropy(logits, domain_labels)
127
+
128
+
129
+ def test_domain_classifier():
130
+ """Test DomainClassifier."""
131
+ d_model = 512
132
+ batch_size = 4
133
+ seq_len = 128
134
+
135
+ classifier = DomainClassifier(d_model)
136
+
137
+ # Test with random hidden states
138
+ hidden = torch.randn(batch_size, seq_len, d_model)
139
+ logits = classifier(hidden)
140
+ print(f"Logits shape: {logits.shape}")
141
+ assert logits.shape == (batch_size, 7)
142
+
143
+ # Test with text
144
+ texts = [
145
+ "The quantum mechanics of particles...",
146
+ "Solving differential equations...",
147
+ "Chemical reactions produce compounds...",
148
+ "Cells contain DNA and proteins...",
149
+ ]
150
+ for text in texts:
151
+ domain, conf = classifier.classify_text(text)
152
+ print(f"Text: {text[:30]}... -> Domain {domain}, conf {conf:.2f}")
153
+
154
+ # Test loss
155
+ labels = torch.tensor([0, 1, 2, 3])
156
+ loss = classifier.compute_loss(logits, labels)
157
+ print(f"Loss: {loss.item():.4f}")
158
+
159
+ print("DomainClassifier test passed!")
160
+
161
+
162
+ if __name__ == "__main__":
163
+ test_domain_classifier()
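The rule-based fallback in `classify_text` is keyword counting with a normalized confidence. The same logic, stripped of the `nn.Module` wrapper (the keyword table here is a trimmed illustrative subset of the one above):

```python
# Trimmed subset of DOMAIN_KEYWORDS for illustration
DOMAIN_KEYWORDS = {
    0: ['physics', 'quantum', 'relativity'],
    1: ['mathematics', 'algebra', 'calculus', 'theorem'],
    3: ['biology', 'cell', 'gene', 'protein'],
}

def classify(text):
    """Return (domain_id, confidence) from keyword hits."""
    text = text.lower()
    scores = {d: sum(kw in text for kw in kws) for d, kws in DOMAIN_KEYWORDS.items()}
    best = max(scores, key=scores.get)
    if scores[best] == 0:
        return 0, 0.0  # no evidence: fall back to domain 0
    return best, scores[best] / sum(scores.values())

print(classify("Cells contain DNA and proteins regulated by genes."))  # → (3, 1.0)
```

Confidence is the winning domain's share of all keyword hits, so a document mixing domains gets a proportionally lower score.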
data/quality_filter.py ADDED
@@ -0,0 +1,279 @@
1
+ """
2
+ ScienceQualityFilter: Multi-stage quality filtering for scientific text.
3
+ """
4
+
5
+ import re
6
+ from typing import List, Tuple, Optional
7
+ from dataclasses import dataclass
8
+
9
+
10
+ @dataclass
11
+ class FilterStats:
12
+ """Statistics from quality filtering."""
13
+ total: int = 0
14
+ passed: int = 0
15
+ failed_length: int = 0
16
+ failed_language: int = 0
17
+ failed_content: int = 0
18
+ failed_equations: int = 0
19
+ failed_repetition: int = 0
20
+ failed_citations: int = 0
21
+
22
+
23
+ class ScienceQualityFilter:
24
+ """
25
+ Multi-stage quality filtering for scientific text.
26
+ """
27
+
28
+ def __init__(
29
+ self,
30
+ min_length: int = 128,
31
+ max_length: int = 64000,
32
+ max_repetition_ratio: float = 0.2,
33
+ min_equation_ratio: float = 0.0,
34
+ max_equation_ratio: float = 0.6,
35
+ min_citation_ratio: float = 0.0,
36
+ max_citation_ratio: float = 0.4,
37
+ ):
38
+ """
39
+ Initialize quality filter.
40
+
41
+ Args:
42
+ min_length: Minimum text length in characters
43
+ max_length: Maximum text length in characters
44
+ max_repetition_ratio: Maximum character-level repetition ratio
45
+ min_equation_ratio: Minimum equation density (optional)
46
+ max_equation_ratio: Maximum equation density
47
+ min_citation_ratio: Minimum citation density (optional)
48
+ max_citation_ratio: Maximum citation density
49
+ """
50
+ self.min_length = min_length
51
+ self.max_length = max_length
52
+ self.max_repetition_ratio = max_repetition_ratio
53
+ self.min_equation_ratio = min_equation_ratio
54
+ self.max_equation_ratio = max_equation_ratio
55
+ self.min_citation_ratio = min_citation_ratio
56
+ self.max_citation_ratio = max_citation_ratio
57
+
58
+ def filter(self, text: str, stats: Optional[FilterStats] = None) -> bool:
59
+ """
60
+ Run all quality checks on text.
61
+
62
+ Args:
63
+ text: Input text
64
+ stats: Optional stats object to update
65
+
66
+ Returns:
67
+ True if text passes all checks, False otherwise
68
+ """
69
+ if stats is None:
70
+ stats = FilterStats()
71
+ stats.total += 1
72
+
73
+ # 1. Length check
74
+ if not self.length_check(text):
75
+ stats.failed_length += 1
76
+ return False
77
+
78
+ # 2. Language check (English only, simplified)
79
+ if not self.language_check(text):
80
+ stats.failed_language += 1
81
+ return False
82
+
83
+ # 3. Science content check
84
+ if not self.science_content_check(text):
85
+ stats.failed_content += 1
86
+ return False
87
+
88
+ # 4. Equation validity check
89
+ if not self.equation_validity_check(text):
90
+ stats.failed_equations += 1
91
+ return False
92
+
93
+ # 5. Repetition check
94
+ if not self.repetition_check(text):
95
+ stats.failed_repetition += 1
96
+ return False
97
+
98
+ # 6. Citation ratio check
99
+ if not self.citation_ratio_check(text):
100
+ stats.failed_citations += 1
101
+ return False
102
+
103
+ stats.passed += 1
104
+ return True
105
+
106
+ def length_check(self, text: str) -> bool:
107
+ """Check text length."""
108
+ length = len(text)
109
+ return self.min_length <= length <= self.max_length
110
+
111
+ def language_check(self, text: str) -> bool:
112
+ """
113
+ Check if text is likely English.
114
+ Simplified heuristic: count common English words.
115
+ """
116
+ # Common English words
117
+ english_words = {'the', 'be', 'to', 'of', 'and', 'a', 'in', 'that', 'have', 'i'}
118
+ words = re.findall(r'\b[a-zA-Z]{2,}\b', text.lower())
119
+ if len(words) < 10:
120
+ return False
121
+
122
+ english_count = sum(1 for w in words if w in english_words)
123
+ return english_count / len(words) > 0.1
124
+
125
+ def science_content_check(self, text: str) -> bool:
126
+ """
127
+ Check if text contains scientific content.
128
+ Looks for scientific terms, equations, units, etc.
129
+ """
130
+ # Scientific keywords
131
+ sci_keywords = [
132
+ 'experiment', 'data', 'result', 'analysis', 'method',
133
+ 'theory', 'hypothesis', 'conclusion', 'discussion',
134
+ 'figure', 'table', 'equation', 'reference', 'citation',
135
+ 'molecular', 'protein', 'gene', 'cell', 'reaction',
136
+ 'mathematical', 'derivation', 'proof', 'theorem',
137
+ ]
138
+
139
+ # Units
140
+ units = ['m', 'kg', 's', 'mol', 'K', 'J', 'N', 'Pa', 'Hz', 'eV', '°C']
141
+
142
+ # Count occurrences
143
+ text_lower = text.lower()
144
+ keyword_count = sum(1 for kw in sci_keywords if kw in text_lower)
145
+ unit_count = sum(1 for u in units if re.search(rf'\b{re.escape(u)}\b', text))
146
+
147
+ return keyword_count >= 2 or unit_count >= 1
148
+
149
+ def equation_validity_check(self, text: str) -> bool:
150
+ """
151
+ Check if LaTeX equations are well-formed.
152
+ """
153
+ # Count dollar signs
154
+ dollar_count = text.count('$')
155
+ if dollar_count % 2 != 0:
156
+ return False # Unmatched dollar signs
157
+
158
+ # Count backslash brackets
159
+ lbracket = text.count('\\[')
160
+ rbracket = text.count('\\]')
161
+ if lbracket != rbracket:
162
+ return False
163
+
164
+ # Count parentheses in inline math
165
+ lparen = text.count('\\(')
166
+ rparen = text.count('\\)')
167
+ if lparen != rparen:
168
+ return False
169
+
170
+ # Check for common LaTeX errors
171
+ # Unmatched braces
172
+ brace_balance = 0
173
+ for char in text:
174
+ if char == '{':
175
+ brace_balance += 1
176
+ elif char == '}':
177
+ brace_balance -= 1
178
+ if brace_balance < 0:
179
+ return False # Closing without opening
180
+
181
+ if brace_balance != 0:
182
+ return False # Unmatched braces
183
+
184
+ return True
185
+
186
+ def repetition_check(self, text: str) -> bool:
187
+ """
188
+ Check for excessive repetition.
189
+ Uses character-level 4-gram repetition.
190
+ """
191
+ if len(text) < 100:
192
+ return True
193
+
194
+ # Get 4-grams
195
+ n = 4
196
+ ngrams = [text[i:i+n] for i in range(len(text) - n + 1)]
197
+
198
+ # Count repetitions
199
+ from collections import Counter
200
+ ngram_counts = Counter(ngrams)
201
+ total_ngrams = len(ngrams)
202
+ unique_ngrams = len(ngram_counts)
203
+
204
+ if total_ngrams == 0:
205
+ return True
206
+
207
+ repetition_ratio = 1 - (unique_ngrams / total_ngrams)
208
+ return repetition_ratio <= self.max_repetition_ratio
209
+
210
+ def citation_ratio_check(self, text: str) -> bool:
211
+ """
212
+ Check if citation density is reasonable.
213
+ """
214
+ # Count citation patterns
215
+ # (Author, Year)
216
+ inline1 = len(re.findall(r'\([A-Za-z.\s]+,?\s*\d{4}\)', text))
217
+ # [1] or [1-3]
218
+ inline2 = len(re.findall(r'\[\d+(?:[-,]\d+)*\]', text))
219
+ # [Author, Year]
220
+ inline3 = len(re.findall(r'\[[A-Za-z\s]+,?\s*\d{4}\]', text))
221
+
222
+ total_citations = inline1 + inline2 + inline3
223
+
224
+ # Estimate word count
225
+ words = re.findall(r'\b[a-zA-Z]{2,}\b', text)
226
+ if len(words) == 0:
227
+ return True
228
+
229
+ citation_ratio = total_citations / len(words)
230
+
231
+ # Allow range
232
+ return self.min_citation_ratio <= citation_ratio <= self.max_citation_ratio
233
+
234
+ def get_stats(self, stats: FilterStats) -> str:
235
+ """Get formatted statistics string."""
236
+ total = stats.total if stats.total > 0 else 1
237
+ return (
238
+ f"Quality filter stats:\n"
239
+ f" Total: {stats.total}\n"
240
+ f" Passed: {stats.passed} ({stats.passed/total*100:.1f}%)\n"
241
+ f" Failed - Length: {stats.failed_length}\n"
242
+ f" Failed - Language: {stats.failed_language}\n"
243
+ f" Failed - Content: {stats.failed_content}\n"
244
+ f" Failed - Equations: {stats.failed_equations}\n"
245
+ f" Failed - Repetition: {stats.failed_repetition}\n"
246
+ f" Failed - Citations: {stats.failed_citations}"
247
+ )
248
+
249
+
250
+ def test_quality_filter():
251
+ """Test the quality filter."""
252
+ filter = ScienceQualityFilter()
253
+
254
+ # Good sample
255
+ good_text = """
256
+ The experiment was conducted to test the hypothesis. We collected data from
257
+ 100 participants and performed statistical analysis. The results show a
258
+ significant effect (p < 0.05). According to Smith et al., this confirms
259
+ the theoretical prediction. The equation E = mc^2 is fundamental.
260
+ """
261
+ print(f"Good text passes: {filter.filter(good_text)}")
262
+
263
+ # Bad sample (too short)
264
+ short_text = "This is too short."
265
+ print(f"Short text passes: {filter.filter(short_text)}")
266
+
267
+ # Bad sample (unmatched equations)
268
+ bad_eq = "Here is an equation $E = mc^2 and another $F = ma."
269
+ print(f"Unmatched $ passes: {filter.filter(bad_eq)}")
270
+
271
+ # Bad sample (excessive repetition)
272
+ repetitive = "test test test test test test test test test test " * 100
273
+ print(f"Repetitive passes: {filter.filter(repetitive)}")
274
+
275
+ print("QualityFilter test passed!")
276
+
277
+
278
+ if __name__ == "__main__":
279
+ test_quality_filter()
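The brace scan in `equation_validity_check` is a standard single-pass balance counter: reject the moment a `}` closes with nothing open, and require a zero balance at the end. Isolated for illustration:

```python
def braces_balanced(text):
    """True iff '{' and '}' are balanced and never close before opening."""
    balance = 0
    for ch in text:
        if ch == '{':
            balance += 1
        elif ch == '}':
            balance -= 1
            if balance < 0:
                return False  # closing brace with no matching open
    return balance == 0

print(braces_balanced(r"\frac{a}{b}"))  # → True
print(braces_balanced(r"\frac{a}{b"))   # → False
print(braces_balanced("}{"))            # → False
```

The early-exit check is what distinguishes this from merely comparing counts: `"}{"` has equal counts but is still rejected.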
data/scraper.py ADDED
@@ -0,0 +1,405 @@
1
+ """
2
+ VortexScienceScraper: Scrapes scientific content from open access sources.
3
+ Respects robots.txt and rate limits.
4
+ """
5
+
6
+ import time
7
+ import requests
8
+ from typing import List, Dict, Optional
9
+ from urllib.robotparser import RobotFileParser
10
+ from pathlib import Path
11
+ import json
12
+
13
+
14
+ class VortexScienceScraper:
15
+ """
16
+ Scrapes scientific content from open access sources.
17
+ Sources: arXiv, PubMed Central, Wikipedia, NIST, NASA.
18
+ """
19
+
20
+ SOURCES = {
21
+ "arxiv": {
22
+ "base_url": "https://arxiv.org",
23
+ "search_url": "https://arxiv.org/search/",
24
+ "rate_limit": 1.0, # seconds between requests
25
+ "robots": "https://arxiv.org/robots.txt",
26
+ },
27
+ "pubmed": {
28
+ "base_url": "https://www.ncbi.nlm.nih.gov/pmc",
29
+ "search_url": "https://www.ncbi.nlm.nih.gov/pmc/articles/",
30
+ "rate_limit": 0.5,
31
+ "robots": "https://www.ncbi.nlm.nih.gov/robots.txt",
32
+ },
33
+ "wikipedia": {
34
+ "base_url": "https://en.wikipedia.org",
35
+ "search_url": "https://en.wikipedia.org/w/api.php",
36
+ "rate_limit": 0.1,
37
+ "robots": "https://en.wikipedia.org/robots.txt",
38
+ },
39
+ "nist": {
40
+ "base_url": "https://webbook.nist.gov",
41
+ "search_url": "https://webbook.nist.gov/cgi/cbook.cgi",
42
+ "rate_limit": 1.0,
43
+ "robots": "https://webbook.nist.gov/robots.txt",
44
+ },
45
+ "nasa": {
46
+ "base_url": "https://ntrs.nasa.gov",
47
+ "search_url": "https://ntrs.nasa.gov/api/citations/search",
48
+ "rate_limit": 1.0,
49
+ "robots": "https://ntrs.nasa.gov/robots.txt",
50
+ },
51
+ }
52
+
53
+ def __init__(
54
+ self,
55
+ output_dir: str = "./data/scraped",
56
+ respect_robots: bool = True,
57
+ user_agent: str = "VortexScientificBot/1.0",
58
+ ):
59
+ """
60
+ Initialize scraper.
61
+
62
+ Args:
63
+ output_dir: Directory to save scraped data
64
+ respect_robots: Whether to respect robots.txt
65
+ user_agent: User agent string for requests
66
+ """
67
+ self.output_dir = Path(output_dir)
68
+ self.output_dir.mkdir(parents=True, exist_ok=True)
69
+ self.respect_robots = respect_robots
70
+ self.user_agent = user_agent
71
+
72
+ self.session = requests.Session()
73
+ self.session.headers.update({"User-Agent": user_agent})
74
+
75
+ # Cache for robots.txt
76
+ self.robots_cache = {}
77
+
78
+ # Rate limit tracking
79
+ self.last_request_time = {}
80
+
81
+ def _check_robots_allowed(self, url: str) -> bool:
82
+ """Check if robots.txt allows scraping the URL."""
83
+ if not self.respect_robots:
84
+ return True
85
+
86
+ # Extract base domain
87
+ from urllib.parse import urlparse
88
+ parsed = urlparse(url)
89
+ base_url = f"{parsed.scheme}://{parsed.netloc}"
90
+
91
+ if base_url not in self.robots_cache:
92
+ rp = RobotFileParser()
93
+ rp.set_url(base_url + "/robots.txt")
94
+ try:
95
+ rp.read()
96
+ self.robots_cache[base_url] = rp
97
+ except Exception as e:
98
+ print(f"Could not read robots.txt for {base_url}: {e}")
99
+ return False
100
+
101
+ rp = self.robots_cache[base_url]
102
+ return rp.can_fetch(self.user_agent, url)
103
+
104
+ def _rate_limit(self, source: str):
105
+ """Enforce rate limiting for a source."""
106
+ now = time.time()
107
+ last = self.last_request_time.get(source, 0)
108
+ delay = self.SOURCES[source]["rate_limit"]
109
+ if now - last < delay:
110
+ time.sleep(delay - (now - last))
111
+ self.last_request_time[source] = time.time()
112
+
113
+ def scrape_arxiv(
114
+ self,
115
+ query: str,
116
+ max_results: int = 100,
117
+ categories: Optional[List[str]] = None,
118
+ ) -> List[Dict]:
119
+ """
120
+ Scrape arXiv papers.
121
+
122
+ Args:
123
+ query: Search query
124
+ max_results: Maximum number of results
125
+ categories: Optional list of arXiv categories (e.g., ['physics', 'math'])
126
+
127
+ Returns:
128
+ List of paper metadata and abstracts
129
+ """
130
+ papers = []
131
+
132
+ params = {
133
+ "query": query,
134
+ "searchtype": "all",
135
+ "abstracts": "show",
136
+ "size": min(max_results, 200), # arXiv max per page
137
+ "order": "-announced_date_first",
138
+ }
139
+
140
+ if categories:
141
+ params["filter"] = "categories:" + "+OR+".join(categories)
142
+
143
+ url = self.SOURCES["arxiv"]["search_url"]
144
+
145
+ if not self._check_robots_allowed(url):
146
+ print(f"Robots.txt disallows scraping {url}")
147
+ return papers
148
+
149
+ try:
150
+ self._rate_limit("arxiv")
151
+ response = self.session.get(url, params=params)
152
+ response.raise_for_status()
153
+
154
+ # Parse HTML (simplified - would use BeautifulSoup in practice)
155
+ # For now, return placeholder
156
+ print(f"Scraped arXiv query '{query}' - got response status {response.status_code}")
157
+
158
+ # Placeholder: would extract paper titles, abstracts, PDF links
159
+ for i in range(min(10, max_results)):
160
+ papers.append({
161
+ "source": "arxiv",
162
+ "title": f"Paper {i}",
163
+ "abstract": "Abstract placeholder...",
164
+ "pdf_url": f"https://arxiv.org/pdf/{i}.pdf",
165
+ })
166
+
167
+ except Exception as e:
168
+ print(f"Error scraping arXiv: {e}")
169
+
170
+ return papers
171
+
172
+ def scrape_pubmed(
173
+ self,
174
+ query: str,
175
+ max_results: int = 100,
176
+ ) -> List[Dict]:
177
+ """Scrape PubMed Central articles."""
178
+ articles = []
179
+
180
+ # PubMed API endpoint
181
+ url = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"
182
+ params = {
183
+ "db": "pmc",
184
+ "term": query,
185
+ "retmax": max_results,
186
+ "retmode": "json",
187
+ }
188
+
189
+ if not self._check_robots_allowed(url):
190
+ print(f"Robots.txt disallows {url}")
191
+ return articles
192
+
193
+ try:
194
+ self._rate_limit("pubmed")
195
+ response = self.session.get(url, params=params)
196
+ response.raise_for_status()
197
+
198
+ data = response.json()
199
+ pmc_ids = data.get("esearchresult", {}).get("idlist", [])
200
+
201
+ for pmc_id in pmc_ids[:10]: # Limit for demo
202
+ articles.append({
203
+ "source": "pubmed",
204
+ "pmc_id": pmc_id,
205
+ "url": f"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC{pmc_id}/",
206
+ })
207
+
208
+ print(f"Found {len(pmc_ids)} PubMed articles")
209
+
210
+ except Exception as e:
211
+ print(f"Error scraping PubMed: {e}")
212
+
213
+ return articles
214
+
215
+ def scrape_wikipedia(
216
+ self,
217
+ topic: str,
218
+ max_pages: int = 10,
219
+ ) -> List[Dict]:
220
+ """Scrape Wikipedia science articles."""
221
+ pages = []
222
+
223
+ # Wikipedia API
224
+ url = "https://en.wikipedia.org/w/api.php"
225
+ params = {
226
+ "action": "query",
227
+ "format": "json",
228
+ "prop": "extracts",
229
+ "exintro": True,
230
+ "titles": topic,
231
+ "redirects": True,
232
+ }
233
+
234
+ if not self._check_robots_allowed(url):
235
+ print(f"Robots.txt disallows {url}")
236
+ return pages
237
+
238
+ try:
239
+ self._rate_limit("wikipedia")
240
+ response = self.session.get(url, params=params)
241
+ response.raise_for_status()
242
+
243
+ data = response.json()
244
+ pages_data = data.get("query", {}).get("pages", {})
245
+
246
+ for page_id, page in pages_data.items():
247
+ if "extract" in page:
248
+ pages.append({
249
+ "source": "wikipedia",
250
+ "title": page.get("title", ""),
251
+ "text": page.get("extract", ""),
252
+ })
253
+
254
+ except Exception as e:
255
+ print(f"Error scraping Wikipedia: {e}")
256
+
257
+ return pages
258
+
259
+ def scrape_nist(
260
+ self,
261
+ element: str,
262
+ ) -> List[Dict]:
263
+ """Scrape NIST chemistry webbook for element data."""
264
+ data = []
265
+
266
+ url = "https://webbook.nist.gov/cgi/cbook.cgi"
267
+ params = {
268
+ "Formula": element,
269
+ "Units": "SI",
270
+ "Submit": "Submit",
271
+ }
272
+
273
+ if not self._check_robots_allowed(url):
274
+ print(f"Robots.txt disallows {url}")
275
+ return data
276
+
277
+ try:
278
+ self._rate_limit("nist")
279
+ response = self.session.get(url, params=params)
280
+ response.raise_for_status()
281
+
282
+ # Placeholder - would parse HTML tables
283
+ data.append({
284
+ "source": "nist",
285
+ "element": element,
286
+ "html": response.text[:1000],
287
+ })
288
+
289
+ except Exception as e:
290
+ print(f"Error scraping NIST: {e}")
291
+
292
+ return data
293
+
294
+ def scrape_nasa(
295
+ self,
296
+ query: str,
297
+ max_results: int = 50,
298
+ ) -> List[Dict]:
299
+ """Scrape NASA technical reports."""
300
+ reports = []
301
+
302
+ url = "https://ntrs.nasa.gov/api/citations/search"
303
+ params = {
304
+ "q": query,
305
+ "page[size]": max_results,
306
+ }
307
+
308
+ if not self._check_robots_allowed(url):
309
+ print(f"Robots.txt disallows {url}")
310
+ return reports
311
+
312
+ try:
313
+ self._rate_limit("nasa")
314
+ response = self.session.get(url, params=params)
315
+ response.raise_for_status()
316
+
317
+ data = response.json()
318
+ for item in data.get("data", [])[:10]:
319
+ reports.append({
320
+ "source": "nasa",
321
+ "title": item.get("attributes", {}).get("title", ""),
322
+ "abstract": item.get("attributes", {}).get("abstract", ""),
323
+ "download_url": item.get("attributes", {}).get("downloads", {}).get("pdf", ""),
324
+ })
325
+
326
+ except Exception as e:
327
+ print(f"Error scraping NASA: {e}")
328
+
329
+ return reports
330
+
331
+ def save_results(
332
+ self,
333
+ results: List[Dict],
334
+ filename: str,
335
+ ):
336
+ """Save scraped results to JSON."""
337
+ output_path = self.output_dir / filename
338
+ with open(output_path, "w", encoding="utf-8") as f:
339
+ json.dump(results, f, indent=2, ensure_ascii=False)
340
+ print(f"Saved {len(results)} results to {output_path}")
341
+
342
+ def scrape_all_sources(
343
+ self,
344
+ queries: Dict[str, str],
345
+ max_per_source: int = 50,
346
+ ) -> Dict[str, List[Dict]]:
347
+ """
348
+ Scrape all sources with given queries.
349
+
350
+ Args:
351
+ queries: Dict mapping source name to query string
352
+ max_per_source: Max results per source
353
+
354
+ Returns:
355
+ Dict mapping source to list of results
356
+ """
357
+ all_results = {}
358
+
359
+ for source, query in queries.items():
360
+ if source not in self.SOURCES:
361
+ print(f"Unknown source: {source}")
362
+ continue
363
+
364
+ print(f"Scraping {source} with query: {query}")
365
+
366
+ if source == "arxiv":
367
+ results = self.scrape_arxiv(query, max_results=max_per_source)
368
+ elif source == "pubmed":
369
+ results = self.scrape_pubmed(query, max_results=max_per_source)
370
+ elif source == "wikipedia":
371
+ results = self.scrape_wikipedia(query, max_pages=max_per_source)
372
+ elif source == "nist":
373
+ results = self.scrape_nist(query)
374
+ elif source == "nasa":
375
+ results = self.scrape_nasa(query, max_results=max_per_source)
376
+ else:
377
+ results = []
378
+
379
+ all_results[source] = results
380
+
381
+ # Save intermediate results
382
+ self.save_results(results, f"{source}_results.json")
383
+
384
+ return all_results
385
+
386
+
387
+ def test_scraper():
388
+ """Test the scraper (limited)."""
389
+ scraper = VortexScienceScraper()
390
+
391
+ # Test Wikipedia (lightweight)
392
+ print("Testing Wikipedia scrape...")
393
+ results = scraper.scrape_wikipedia("quantum mechanics", max_pages=2)
394
+ print(f"Got {len(results)} Wikipedia pages")
395
+
396
+ # Test arXiv (rate limited)
397
+ print("Testing arXiv scrape...")
398
+ results = scraper.scrape_arxiv("quantum", max_results=5)
399
+ print(f"Got {len(results)} arXiv papers")
400
+
401
+ print("Scraper test passed!")
402
+
403
+
404
+ if __name__ == "__main__":
405
+ test_scraper()
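The `_check_robots_allowed` method above caches one `RobotFileParser` per domain and defers to `can_fetch`. A minimal standalone sketch of that decision, using hypothetical robots.txt rules parsed directly (no network fetch):

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content for an example domain.
rules = [
    "User-agent: *",
    "Disallow: /private/",
    "Allow: /",
]

rp = RobotFileParser()
rp.parse(rules)  # parse() accepts the file's lines directly

# can_fetch(user_agent, url) applies the most specific matching rule.
print(rp.can_fetch("VortexScientificBot/1.0", "https://example.org/articles/1"))  # True
print(rp.can_fetch("VortexScientificBot/1.0", "https://example.org/private/x"))   # False
```

In the scraper itself, `set_url(...)` plus `read()` replace the manual `parse()` call, and the parsed object is cached per `scheme://netloc` so robots.txt is fetched at most once per domain.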
inference.py ADDED
@@ -0,0 +1,213 @@
+ #!/usr/bin/env python3
+ """
+ Inference script for Vortex models.
+ Supports both CUDA and MPS backends.
+ """
+
+ import argparse
+ from pathlib import Path
+
+ import torch
+
+ from configs.vortex_7b_config import VORTEX_7B_CONFIG
+ from configs.vortex_13b_config import VORTEX_13B_CONFIG
+
+ from models.vortex_model import VortexModel
+ from cuda_optimize import optimize_for_cuda, profile_model
+ from mps_optimize import optimize_for_mps, profile_model_mps
+
+
+ def parse_args():
+     parser = argparse.ArgumentParser(description="Run inference with Vortex model")
+     parser.add_argument("--model_path", type=str, required=True,
+                         help="Path to trained model checkpoint")
+     parser.add_argument("--config", type=str, default=None,
+                         help="Path to model config (if not in checkpoint)")
+     parser.add_argument("--tokenizer_path", type=str, default=None,
+                         help="Path to tokenizer")
+     parser.add_argument("--model_size", type=str, choices=["7b", "13b"], default="7b",
+                         help="Model size for config")
+     parser.add_argument("--device", type=str, default="cuda",
+                         choices=["cuda", "mps", "cpu"],
+                         help="Device to run on")
+     parser.add_argument("--use_mps", action="store_true",
+                         help="Use MPS backend (Apple Silicon)")
+     parser.add_argument("--quantization", type=str, choices=[None, "int8", "int4"], default=None,
+                         help="Apply quantization (CUDA only)")
+     parser.add_argument("--flash_attention", action="store_true",
+                         help="Use Flash Attention 2 (CUDA only)")
+     parser.add_argument("--torch_compile", action="store_true",
+                         help="Use torch.compile")
+     parser.add_argument("--prompt", type=str, default=None,
+                         help="Input prompt for generation")
+     parser.add_argument("--interactive", action="store_true",
+                         help="Run in interactive mode")
+     parser.add_argument("--max_new_tokens", type=int, default=100,
+                         help="Maximum new tokens to generate")
+     parser.add_argument("--temperature", type=float, default=0.8,
+                         help="Sampling temperature")
+     parser.add_argument("--top_p", type=float, default=0.9,
+                         help="Top-p sampling")
+     parser.add_argument("--profile", action="store_true",
+                         help="Profile performance")
+     return parser.parse_args()
+
+
+ def load_model(args):
+     """Load model with appropriate optimizations."""
+     # Load config
+     if args.config:
+         from configuration_vortex import VortexConfig
+         config = VortexConfig.from_pretrained(args.config)
+     else:
+         # Use default config for size
+         if args.model_size == "7b":
+             config_dict = VORTEX_7B_CONFIG
+         else:
+             config_dict = VORTEX_13B_CONFIG
+         from configuration_vortex import VortexConfig
+         config = VortexConfig(**config_dict)
+
+     # Create model
+     print("Creating model...")
+     model = VortexModel(config.to_dict())
+
+     # Load checkpoint
+     print(f"Loading checkpoint from {args.model_path}")
+     checkpoint = torch.load(args.model_path, map_location="cpu", weights_only=False)
+     if "model_state_dict" in checkpoint:
+         model.load_state_dict(checkpoint["model_state_dict"])
+     else:
+         model.load_state_dict(checkpoint)
+     print("Model loaded")
+
+     # Apply optimizations
+     device = torch.device(args.device)
+     if args.use_mps or args.device == "mps":
+         print("Optimizing for MPS...")
+         model = optimize_for_mps(model, config.to_dict(), use_sdpa=True)
+     else:
+         print("Optimizing for CUDA...")
+         model = optimize_for_cuda(
+             model,
+             config.to_dict(),
+             use_flash_attention=args.flash_attention,
+             use_torch_compile=args.torch_compile,
+             quantization=args.quantization,
+         )
+
+     model = model.to(device)
+     model.eval()
+
+     return model, config
+
+
+ def load_tokenizer(args):
+     """Load tokenizer."""
+     model_dir = Path(args.model_path).parent
+     tokenizer_path = args.tokenizer_path
+     if not tokenizer_path:
+         # Try to find in model directory
+         tokenizer_path = model_dir / "vortex_tokenizer.json"
+
+     if tokenizer_path and Path(tokenizer_path).exists():
+         from tokenization_vortex import VortexTokenizer
+         tokenizer = VortexTokenizer.from_pretrained(str(Path(tokenizer_path).parent))
+     else:
+         print("Warning: No tokenizer found, using dummy tokenizer")
+         class DummyTokenizer:
+             pad_token_id = 0
+             eos_token_id = 0
+             def __call__(self, text, **kwargs):
+                 return {"input_ids": torch.tensor([[1, 2, 3]])}
+             def decode(self, ids, **kwargs):
+                 return "dummy"
+         tokenizer = DummyTokenizer()
+
+     return tokenizer
+
+
+ def generate_text(model, tokenizer, prompt, args):
+     """Generate text from prompt."""
+     # Tokenize
+     inputs = tokenizer(
+         prompt,
+         return_tensors="pt",
+         padding=False,
+         truncation=True,
+         max_length=model.config.max_seq_len - args.max_new_tokens,
+     )
+     input_ids = inputs["input_ids"].to(next(model.parameters()).device)
+
+     # Generate
+     with torch.no_grad():
+         if hasattr(model, 'generate'):
+             output_ids = model.generate(
+                 input_ids,
+                 max_new_tokens=args.max_new_tokens,
+                 temperature=args.temperature,
+                 top_p=args.top_p,
+                 do_sample=True,
+                 pad_token_id=tokenizer.pad_token_id,
+             )
+         else:
+             # Manual generation
+             for _ in range(args.max_new_tokens):
+                 outputs = model(input_ids)
+                 next_token_logits = outputs["logits"][:, -1, :]
+                 next_token = torch.multinomial(
+                     torch.softmax(next_token_logits / args.temperature, dim=-1),
+                     num_samples=1,
+                 )
+                 input_ids = torch.cat([input_ids, next_token], dim=-1)
+
+                 # Check for EOS
+                 if next_token.item() == tokenizer.eos_token_id:
+                     break
+             # Manual path accumulates new tokens in input_ids
+             output_ids = input_ids
+
+     # Decode
+     generated = tokenizer.decode(output_ids[0].tolist(), skip_special_tokens=True)
+     return generated
+
+
+ def main():
+     args = parse_args()
+
+     # Load model and tokenizer
+     model, config = load_model(args)
+     tokenizer = load_tokenizer(args)
+
+     print(f"Model loaded on {next(model.parameters()).device}")
+     print(f"Model parameters: {model.get_num_params():,}")
+
+     # Profile if requested
+     if args.profile:
+         print("Profiling...")
+         dummy_input = torch.randint(0, config.vocab_size, (1, 128)).to(next(model.parameters()).device)
+         if args.use_mps or args.device == "mps":
+             stats = profile_model_mps(model, dummy_input)
+         else:
+             stats = profile_model(model, dummy_input)
+         print("Profile results:")
+         for k, v in stats.items():
+             print(f" {k}: {v:.4f}")
+         return
+
+     # Interactive mode
+     if args.interactive:
+         print("Interactive mode. Type 'quit' to exit.")
+         while True:
+             prompt = input("\nPrompt: ")
+             if prompt.lower() == "quit":
+                 break
+             response = generate_text(model, tokenizer, prompt, args)
+             print(f"\nResponse: {response}")
+     elif args.prompt:
+         response = generate_text(model, tokenizer, args.prompt, args)
+         print(f"Response: {response}")
+     else:
+         print("No prompt provided. Use --prompt or --interactive.")
+
+
+ if __name__ == "__main__":
+     main()
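The manual-generation fallback in `generate_text` samples with temperature only, even though the script exposes a `--top_p` flag. A minimal sketch of the nucleus (top-p) filtering step that flag implies, on plain logits tensors (toy numbers, not model output):

```python
import torch

def top_p_filter(logits: torch.Tensor, top_p: float = 0.9) -> torch.Tensor:
    """Mask (set to -inf) every token outside the smallest set whose
    cumulative probability exceeds top_p; sample from what remains."""
    sorted_logits, sorted_idx = torch.sort(logits, descending=True)
    probs = torch.softmax(sorted_logits, dim=-1)
    cumprobs = torch.cumsum(probs, dim=-1)

    # Tokens past the top_p cutoff are removed; shift right so the
    # first token crossing the threshold is still kept.
    remove = cumprobs > top_p
    remove[..., 1:] = remove[..., :-1].clone()
    remove[..., 0] = False
    sorted_logits[remove] = float("-inf")

    # Scatter the filtered logits back to the original token order.
    filtered = torch.full_like(logits, float("-inf"))
    filtered.scatter_(-1, sorted_idx, sorted_logits)
    return filtered

logits = torch.tensor([2.0, 1.0, 0.1, -1.0])
filtered = top_p_filter(logits, top_p=0.6)
# With these logits the top token alone exceeds 0.6 probability mass,
# so only it survives; torch.multinomial on softmax(filtered) would
# then sample from the surviving set.
```

Applied in the loop above, this would run between the temperature division and `torch.multinomial`.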
modeling_vortex.py ADDED
@@ -0,0 +1,222 @@
+ """
+ Vortex model implementation for HuggingFace.
+ Integrates with transformers library.
+ """
+
+ from typing import Optional, Tuple, List, Dict, Any
+ import torch
+ import torch.nn as nn
+ from transformers import PreTrainedModel, PretrainedConfig, GenerationConfig
+ from transformers.modeling_outputs import CausalLMOutputWithCrossAttentions
+
+ from configuration_vortex import VortexConfig
+ from models.vortex_model import VortexModel
+
+
+ class VortexPreTrainedModel(PreTrainedModel):
+     """
+     Base class for Vortex models.
+     Handles loading/saving in HF format.
+     """
+     config_class = VortexConfig
+     base_model_prefix = "vortex"
+     supports_gradient_checkpointing = True
+     _keys_to_ignore_on_load_missing = [r"lm_head.weight"]
+
+     def _init_weights(self, module):
+         """Initialize weights."""
+         if isinstance(module, nn.Linear):
+             module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)
+             if module.bias is not None:
+                 module.bias.data.zero_()
+         elif isinstance(module, nn.Embedding):
+             module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)
+             if module.padding_idx is not None:
+                 module.weight.data[module.padding_idx].zero_()
+         elif isinstance(module, nn.LayerNorm):
+             module.bias.data.zero_()
+             module.weight.data.fill_(1.0)
+
+     def get_input_embeddings(self):
+         return self.vortex.embed_tokens
+
+     def set_input_embeddings(self, value):
+         self.vortex.embed_tokens = value
+
+     def get_output_embeddings(self):
+         return self.vortex.lm_head
+
+     def set_output_embeddings(self, new_embeddings):
+         self.vortex.lm_head = new_embeddings
+
+
+ class VortexForCausalLM(VortexPreTrainedModel):
+     """
+     Vortex model for causal language modeling.
+     """
+     _tied_weights_keys = ["vortex.lm_head.weight"]
+
+     def __init__(self, config: VortexConfig):
+         super().__init__(config)
+         self.config = config
+
+         # Build core model
+         self.vortex = VortexModel(config.to_dict())
+
+         # Initialize weights
+         self.apply(self._init_weights)
+
+         # Tie weights if configured
+         if self.config.tie_word_embeddings:
+             self.tie_weights()
+
+     def forward(
+         self,
+         input_ids: torch.LongTensor = None,
+         attention_mask: Optional[torch.Tensor] = None,
+         position_ids: Optional[torch.LongTensor] = None,
+         past_key_values: Optional[List[torch.FloatTensor]] = None,
+         inputs_embeds: Optional[torch.FloatTensor] = None,
+         labels: Optional[torch.LongTensor] = None,
+         use_cache: Optional[bool] = None,
+         output_attentions: Optional[bool] = None,
+         output_hidden_states: Optional[bool] = None,
+         return_dict: Optional[bool] = None,
+         domain_ids: Optional[torch.LongTensor] = None,
+         domain_tags: Optional[torch.Tensor] = None,
+         text: Optional[List[str]] = None,
+     ) -> CausalLMOutputWithCrossAttentions:
+         """
+         Forward pass.
+
+         Args:
+             input_ids: Token IDs (batch, seq_len)
+             attention_mask: Attention mask (batch, seq_len)
+             labels: Labels for LM loss (batch, seq_len)
+             domain_ids: Domain IDs (batch,)
+             domain_tags: Domain tag masks (batch, seq_len, num_domains)
+             text: Original text strings (for science modules)
+         """
+         return_dict = return_dict if return_dict is not None else self.config.use_return_dict
+
+         # Pass through Vortex model
+         outputs = self.vortex(
+             input_ids=input_ids,
+             attention_mask=attention_mask,
+             domain_ids=domain_ids,
+             domain_tags=domain_tags,
+             text=text,
+             return_dict=True,
+         )
+
+         logits = outputs["logits"]
+         last_hidden_state = outputs["last_hidden_state"]
+
+         loss = None
+         if labels is not None:
+             # Compute cross-entropy loss: position t predicts token t+1
+             shift_logits = logits[..., :-1, :].contiguous()
+             shift_labels = labels[..., 1:].contiguous()
+             loss_fct = nn.CrossEntropyLoss(ignore_index=-100)
+             loss = loss_fct(
+                 shift_logits.view(-1, shift_logits.size(-1)),
+                 shift_labels.view(-1),
+             )
+
+         if not return_dict:
+             output = (logits,) + (last_hidden_state,)
+             return (loss,) + output if loss is not None else output
+
+         return CausalLMOutputWithCrossAttentions(
+             loss=loss,
+             logits=logits,
+             hidden_states=last_hidden_state,
+             attentions=None,
+         )
+
+     def prepare_inputs_for_generation(
+         self,
+         input_ids,
+         past_key_values=None,
+         attention_mask=None,
+         **kwargs,
+     ):
+         """Prepare inputs for text generation."""
+         # Omit tokens that are already past
+         if past_key_values:
+             input_ids = input_ids[:, -1:]
+
+         return {
+             "input_ids": input_ids,
+             "attention_mask": attention_mask,
+             "past_key_values": past_key_values,
+             "use_cache": kwargs.get("use_cache", True),
+         }
+
+     def generate(
+         self,
+         input_ids: Optional[torch.LongTensor] = None,
+         inputs_embeds: Optional[torch.FloatTensor] = None,
+         **kwargs,
+     ):
+         """Generate text."""
+         generation_config = kwargs.pop("generation_config", None)
+         if generation_config is None:
+             generation_config = GenerationConfig.from_model_config(self.config)
+
+         return super().generate(
+             input_ids=input_ids,
+             inputs_embeds=inputs_embeds,
+             generation_config=generation_config,
+             **kwargs,
+         )
+
+
+ # Register model for AutoModel
+ from transformers import AutoConfig, AutoModelForCausalLM
+
+ AutoConfig.register("vortex", VortexConfig)
+ AutoModelForCausalLM.register(VortexConfig, VortexForCausalLM)
+
+
+ def test_hf_integration():
+     """Test HuggingFace integration."""
+     from transformers import AutoConfig, AutoModelForCausalLM
+
+     # Create config
+     config = VortexConfig(
+         d_model=512,
+         num_layers=2,
+         num_heads=8,
+         vocab_size=1000,
+     )
+
+     # Create model
+     model = VortexForCausalLM(config)
+     print(f"Model parameters: {model.num_parameters():,}")
+
+     # Test forward
+     batch_size = 2
+     seq_len = 32
+     input_ids = torch.randint(0, config.vocab_size, (batch_size, seq_len))
+     labels = torch.randint(0, config.vocab_size, (batch_size, seq_len))
+
+     outputs = model(input_ids=input_ids, labels=labels)
+     print(f"Loss: {outputs.loss.item():.4f}")
+     print(f"Logits shape: {outputs.logits.shape}")
+
+     # Test save/load
+     model.save_pretrained("./test_vortex_model")
+     config.save_pretrained("./test_vortex_model")
+
+     loaded_config = AutoConfig.from_pretrained("./test_vortex_model")
+     loaded_model = AutoModelForCausalLM.from_pretrained("./test_vortex_model")
+     print(f"Loaded model type: {type(loaded_model)}")
+
+     print("HF integration test passed!")
+
+
+ if __name__ == "__main__":
+     test_hf_integration()
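The loss in `VortexForCausalLM.forward` shifts logits and labels so that position t is scored against the token at position t+1, with the final position dropped. A tiny illustration of that alignment with plain tensors (toy shapes, not real model output):

```python
import torch
import torch.nn as nn

batch, seq_len, vocab = 1, 4, 5
logits = torch.randn(batch, seq_len, vocab)   # stand-in for model logits
labels = torch.tensor([[1, 2, 3, 4]])

# Drop the last logit position (it has no next token to predict)
# and the first label (nothing predicts it).
shift_logits = logits[..., :-1, :].contiguous()   # (1, 3, 5)
shift_labels = labels[..., 1:].contiguous()       # [[2, 3, 4]]

loss = nn.CrossEntropyLoss(ignore_index=-100)(
    shift_logits.view(-1, vocab), shift_labels.view(-1)
)
```

The `ignore_index=-100` matches the transformers convention of masking padding or prompt tokens out of the loss by setting their label to -100.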
models/__pycache__/attention_layer.cpython-313.pyc ADDED
Binary file (15.5 kB). View file
 
models/__pycache__/scigate_ffn.cpython-313.pyc ADDED
Binary file (7.94 kB). View file
 
models/__pycache__/ssm_layer.cpython-313.pyc ADDED
Binary file (10.2 kB). View file
 
models/__pycache__/vortex_model.cpython-313.pyc ADDED
Binary file (14.6 kB). View file
 
models/attention_layer.py ADDED
@@ -0,0 +1,370 @@
+ """
+ VortexLocalAttention: Local windowed attention with global token support.
+ Uses a sliding window of 512 tokens for efficiency, with special handling
+ for global tokens that can attend across the entire sequence.
+ """
+
+ import torch
+ import torch.nn as nn
+ import torch.nn.functional as F
+ from typing import Optional, Tuple
+
+
+ class VortexLocalAttention(nn.Module):
+     """
+     Local windowed attention with window_size=512.
+     Science documents have strong local coherence — equations reference
+     nearby text, not distant paragraphs.
+     Global tokens (special [SCIENCE] tokens) attend to everything.
+     """
+
+     def __init__(
+         self,
+         d_model: int,
+         num_heads: int,
+         window_size: int = 512,
+         use_flash_attention: bool = True,
+     ):
+         """
+         Initialize local windowed attention.
+
+         Args:
+             d_model: Model dimension
+             num_heads: Number of attention heads
+             window_size: Size of local attention window
+             use_flash_attention: Use Flash Attention 2 if available (CUDA only)
+         """
+         super().__init__()
+         self.d_model = d_model
+         self.num_heads = num_heads
+         self.head_dim = d_model // num_heads
+         self.window_size = window_size
+         self.use_flash_attention = use_flash_attention
+
+         assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
+
+         # QKV projection
+         self.qkv = nn.Linear(d_model, d_model * 3, bias=False)
+         self.out_proj = nn.Linear(d_model, d_model, bias=False)
+
+         # Global token projection (for tokens that attend globally)
+         self.global_qkv = nn.Linear(d_model, d_model * 3, bias=False)
+
+         # Initialize weights
+         self._initialize_weights()
+
+     def _initialize_weights(self):
+         """Initialize weights."""
+         for module in [self.qkv, self.global_qkv, self.out_proj]:
+             if hasattr(module, 'weight'):
+                 nn.init.normal_(module.weight, mean=0.0, std=0.02)
+
+     def forward(
+         self,
+         x: torch.Tensor,
+         global_mask: Optional[torch.Tensor] = None,
+         attention_mask: Optional[torch.Tensor] = None,
+     ) -> torch.Tensor:
+         """
+         Forward pass with local windowed attention.
+
+         Args:
+             x: Input tensor (batch, seq_len, d_model)
+             global_mask: Boolean mask indicating which tokens are global (attend everywhere)
+                 Shape: (batch, seq_len) or None
+             attention_mask: Optional padding mask (batch, seq_len)
+
+         Returns:
+             Output tensor (batch, seq_len, d_model)
+         """
+         batch, seq_len, _ = x.shape
+         device = x.device
+         dtype = x.dtype
+
+         if global_mask is None:
+             global_mask = torch.zeros(batch, seq_len, dtype=torch.bool, device=device)
+
+         # Compute QKV for all tokens
+         qkv = self.qkv(x)
+         q, k, v = qkv.chunk(3, dim=-1)
+
+         # Reshape for multi-head attention
+         q = q.view(batch, seq_len, self.num_heads, self.head_dim).transpose(1, 2)
+         k = k.view(batch, seq_len, self.num_heads, self.head_dim).transpose(1, 2)
+         v = v.view(batch, seq_len, self.num_heads, self.head_dim).transpose(1, 2)
+
+         # Compute global token QKV separately
+         if global_mask.any():
+             global_qkv = self.global_qkv(x)
+             gq, gk, gv = global_qkv.chunk(3, dim=-1)
+             gq = gq.view(batch, seq_len, self.num_heads, self.head_dim).transpose(1, 2)
+             gk = gk.view(batch, seq_len, self.num_heads, self.head_dim).transpose(1, 2)
+             gv = gv.view(batch, seq_len, self.num_heads, self.head_dim).transpose(1, 2)
+
+         # Build output tensor
+         output = torch.zeros_like(x)
+
+         # Process each position
+         for t in range(seq_len):
+             # Determine window
+             window_start = max(0, t - self.window_size // 2)
+             window_end = min(seq_len, t + self.window_size // 2 + 1)
+             window_len = window_end - window_start
+
+             # Get window indices
+             window_indices = slice(window_start, window_end)
+
+             # Extract window queries (for position t)
+             q_t = q[:, :, t:t+1, :]  # (batch, heads, 1, head_dim)
+
+             # Determine which keys/values to use
+             # Local tokens: only those in window
+             # Global tokens: all positions (if they are global)
+             k_window = k[:, :, window_indices, :]
+             v_window = v[:, :, window_indices, :]
+
+             # Build full key/value set including global tokens
+             # Global tokens attend to all positions
+             if global_mask.any():
+                 # Find global positions
+                 global_positions = global_mask[0]  # (seq_len) - assume same across batch
+                 if global_positions.any():
+                     gk_all = gk[:, :, :, :]  # All global keys
+                     gv_all = gv[:, :, :, :]
+
+                     # Concatenate window keys with global keys
+                     k_full = torch.cat([k_window, gk_all], dim=2)
+                     v_full = torch.cat([v_window, gv_all], dim=2)
+                 else:
+                     k_full = k_window
+                     v_full = v_window
+             else:
+                 k_full = k_window
+                 v_full = v_window
+
+             # Compute attention scores
+             # q_t: (batch, heads, 1, head_dim)
+             # k_full: (batch, heads, window_len + num_global, head_dim)
+             attn_scores = torch.matmul(q_t, k_full.transpose(-2, -1)) / (self.head_dim ** 0.5)
+             # (batch, heads, 1, k_len)
+
+             # Apply attention mask if provided
+             if attention_mask is not None:
+                 mask_t = attention_mask[:, window_indices].unsqueeze(1).unsqueeze(2)
+                 attn_scores = attn_scores.masked_fill(mask_t == 0, -1e9)
+
+             # Softmax
+             attn_weights = F.softmax(attn_scores, dim=-1)
+
+             # Weighted sum
+             attn_output = torch.matmul(attn_weights, v_full)
+             # (batch, heads, 1, head_dim)
+
+             # Reshape and project
+             attn_output = attn_output.transpose(1, 2).contiguous()
+             attn_output = attn_output.view(batch, 1, self.d_model)
+             attn_output = self.out_proj(attn_output)
167
+
168
+ # Place in output
169
+ output[:, t:t+1, :] = attn_output
170
+
171
+ return output
172
+
+     def forward_optimized(
+         self,
+         x: torch.Tensor,
+         global_mask: Optional[torch.Tensor] = None,
+         attention_mask: Optional[torch.Tensor] = None,
+     ) -> torch.Tensor:
+         """
+         Optimized forward pass: Flash Attention for short sequences,
+         efficient windowed attention otherwise.
+         """
+         batch, seq_len, _ = x.shape
+
+         if self.use_flash_attention and self.window_size >= seq_len:
+             # The window covers the whole sequence, so full attention is exact
+             return self._flash_attention_forward(x, attention_mask)
+         else:
+             # Use windowed attention
+             return self._windowed_attention_forward(x, global_mask, attention_mask)
+
+     def _flash_attention_forward(
+         self,
+         x: torch.Tensor,
+         attention_mask: Optional[torch.Tensor] = None,
+     ) -> torch.Tensor:
+         """
+         Use Flash Attention 2 if available.
+         Requires: pip install flash-attn
+         """
+         try:
+             from flash_attn import flash_attn_func
+
+             batch, seq_len, _ = x.shape
+             qkv = self.qkv(x)
+             q, k, v = qkv.chunk(3, dim=-1)
+
+             # Flash attention expects (batch, seq_len, num_heads, head_dim)
+             # and returns the same shape
+             q = q.view(batch, seq_len, self.num_heads, self.head_dim)
+             k = k.view(batch, seq_len, self.num_heads, self.head_dim)
+             v = v.view(batch, seq_len, self.num_heads, self.head_dim)
+
+             # Note: flash_attn_func takes no padding mask; padded batches
+             # would need flash_attn_varlen_func instead
+             output = flash_attn_func(
+                 q, k, v,
+                 causal=False,
+                 softmax_scale=1.0 / (self.head_dim ** 0.5),
+             )
+
+             output = output.view(batch, seq_len, self.d_model)
+             return self.out_proj(output)
+
+         except ImportError:
+             print("Flash Attention not available, falling back to standard attention")
+             return self._standard_attention(x, attention_mask)
+
+     def _standard_attention(
+         self,
+         x: torch.Tensor,
+         attention_mask: Optional[torch.Tensor] = None,
+     ) -> torch.Tensor:
+         """Standard full attention (quadratic in seq_len)."""
+         batch, seq_len, _ = x.shape
+         qkv = self.qkv(x)
+         q, k, v = qkv.chunk(3, dim=-1)
+
+         q = q.view(batch, seq_len, self.num_heads, self.head_dim).transpose(1, 2)
+         k = k.view(batch, seq_len, self.num_heads, self.head_dim).transpose(1, 2)
+         v = v.view(batch, seq_len, self.num_heads, self.head_dim).transpose(1, 2)
+
+         # Compute attention scores
+         attn_scores = torch.matmul(q, k.transpose(-2, -1)) / (self.head_dim ** 0.5)
+
+         if attention_mask is not None:
+             attn_scores = attn_scores.masked_fill(
+                 attention_mask.unsqueeze(1).unsqueeze(2) == 0,
+                 -1e9
+             )
+
+         attn_weights = F.softmax(attn_scores, dim=-1)
+         attn_output = torch.matmul(attn_weights, v)
+
+         attn_output = attn_output.transpose(1, 2).contiguous()
+         attn_output = attn_output.view(batch, seq_len, self.d_model)
+         return self.out_proj(attn_output)
+
+     def _windowed_attention_forward(
+         self,
+         x: torch.Tensor,
+         global_mask: Optional[torch.Tensor] = None,
+         attention_mask: Optional[torch.Tensor] = None,
+     ) -> torch.Tensor:
+         """
+         Efficient windowed attention implementation.
+         Uses unfold to extract windows and batched matrix multiply.
+         """
+         batch, seq_len, _ = x.shape
+         device = x.device
+
+         if global_mask is None:
+             global_mask = torch.zeros(batch, seq_len, dtype=torch.bool, device=device)
+
+         # Compute QKV
+         qkv = self.qkv(x)
+         q, k, v = qkv.chunk(3, dim=-1)
+
+         # Reshape: (batch, seq_len, num_heads, head_dim)
+         q = q.view(batch, seq_len, self.num_heads, self.head_dim)
+         k = k.view(batch, seq_len, self.num_heads, self.head_dim)
+         v = v.view(batch, seq_len, self.num_heads, self.head_dim)
+
+         # Pad the sequence for windowing; asymmetric padding yields exactly
+         # seq_len windows regardless of window_size parity
+         pad_left = self.window_size // 2
+         pad_right = self.window_size - 1 - pad_left
+         k_padded = F.pad(k, (0, 0, 0, 0, pad_left, pad_right))
+         v_padded = F.pad(v, (0, 0, 0, 0, pad_left, pad_right))
+
+         # Extract windows using unfold; the window becomes the last dim:
+         # (batch, seq_len, num_heads, head_dim, window_size)
+         k_windows = k_padded.unfold(1, self.window_size, 1)
+         v_windows = v_padded.unfold(1, self.window_size, 1)
+
+         # Permute to (batch, seq_len, num_heads, window_size, head_dim)
+         k_windows = k_windows.permute(0, 1, 2, 4, 3)
+         v_windows = v_windows.permute(0, 1, 2, 4, 3)
+
+         # q: (batch, seq_len, num_heads, 1, head_dim)
+         q_expanded = q.unsqueeze(3)
+
+         # Scores: (batch, seq_len, num_heads, 1, window_size)
+         attn_scores = torch.matmul(q_expanded, k_windows.transpose(-2, -1)) / (self.head_dim ** 0.5)
+         attn_scores = attn_scores.squeeze(3)  # (batch, seq_len, num_heads, window_size)
+
+         # Apply softmax
+         # (note: padding inside the local windows is not masked here;
+         # padded batches should use forward() instead)
+         attn_weights = F.softmax(attn_scores, dim=-1)
+
+         # Weighted sum: (batch, seq_len, num_heads, head_dim)
+         attn_output = torch.matmul(attn_weights.unsqueeze(3), v_windows).squeeze(3)
+
+         # Concatenate heads and project
+         attn_output = attn_output.reshape(batch, seq_len, self.d_model)
+         local_out = self.out_proj(attn_output)
+
+         # Add global token contribution if any
+         if global_mask.any():
+             # Simplified: compute full attention once and substitute it at
+             # global positions; _standard_attention already applies out_proj.
+             # A production version would only attend from/to the global tokens.
+             full_attn = self._standard_attention(x, attention_mask)
+             # Blend: local for most positions, full for global positions
+             local_out = torch.where(
+                 global_mask.unsqueeze(-1),
+                 full_attn,
+                 local_out
+             )
+
+         return local_out
+
+
+ def test_vortex_local_attention():
+     """Test the VortexLocalAttention layer."""
+     batch_size = 2
+     seq_len = 256
+     d_model = 4096
+     num_heads = 32
+     window_size = 512
+
+     attn = VortexLocalAttention(d_model, num_heads, window_size, use_flash_attention=False)
+     x = torch.randn(batch_size, seq_len, d_model)
+
+     # Forward pass
+     output = attn(x)
+     print(f"Input shape: {x.shape}")
+     print(f"Output shape: {output.shape}")
+     assert output.shape == x.shape, f"Expected {x.shape}, got {output.shape}"
+
+     # With global mask
+     global_mask = torch.zeros(batch_size, seq_len, dtype=torch.bool)
+     global_mask[0, 0] = True   # First token is global
+     global_mask[1, -1] = True  # Last token is global
+     output2 = attn(x, global_mask=global_mask)
+     assert output2.shape == x.shape
+
+     print("VortexLocalAttention test passed!")
+
+
+ if __name__ == "__main__":
+     test_vortex_local_attention()
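A toy sketch of the `unfold` trick used in `_windowed_attention_forward`: `unfold` appends the window as the last dimension, which is why a permute is needed before the batched matmul. Shapes and values here are illustrative only, not taken from the model.

```python
import torch
import torch.nn.functional as F

# Toy key tensor: (batch=1, seq_len=6, heads=1, head_dim=2),
# where key at position t holds the value t in both channels.
seq_len, window = 6, 3
k = torch.arange(seq_len, dtype=torch.float32).view(1, seq_len, 1, 1).expand(1, seq_len, 1, 2)

pad = window // 2
k_padded = F.pad(k, (0, 0, 0, 0, pad, pad))   # pad along the sequence dim
k_windows = k_padded.unfold(1, window, 1)     # window lands in the LAST dim
print(k_windows.shape)                        # torch.Size([1, 6, 1, 2, 3])

# Position 0 sees keys [pad, 0, 1]; position t sees [t-1, t, t+1].
print(k_windows[0, 0, 0, 0])                  # tensor([0., 0., 1.])
```

Because the window is the last dimension, `permute(0, 1, 2, 4, 3)` is what brings the tensor to `(batch, seq_len, heads, window, head_dim)` for attention.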
models/science_modules/__init__.py ADDED
@@ -0,0 +1,15 @@
+ """
+ Science modules package.
+ """
+
+ from .equation_module import EquationModule
+ from .numerical_module import NumericalReasoningModule
+ from .citation_module import CitationModule
+ from .molecular_module import MolecularModule
+
+ __all__ = [
+     "EquationModule",
+     "NumericalReasoningModule",
+     "CitationModule",
+     "MolecularModule",
+ ]
models/science_modules/__pycache__/__init__.cpython-313.pyc ADDED
Binary file (490 Bytes).
 
models/science_modules/__pycache__/citation_module.cpython-313.pyc ADDED
Binary file (9.3 kB).
 
models/science_modules/__pycache__/equation_module.cpython-313.pyc ADDED
Binary file (10.4 kB).
 
models/science_modules/__pycache__/molecular_module.cpython-313.pyc ADDED
Binary file (13 kB).
 
models/science_modules/__pycache__/numerical_module.cpython-313.pyc ADDED
Binary file (10 kB).
 
models/science_modules/citation_module.py ADDED
@@ -0,0 +1,230 @@
+ """
+ CitationModule: Understands scientific citation structure.
+ Detects citation spans, tracks provenance, and estimates claim confidence.
+ """
+
+ import torch
+ import torch.nn as nn
+ import torch.nn.functional as F
+ import re
+ from typing import Optional, Tuple, List
+
+
+ class CitationModule(nn.Module):
+     """
+     Understands scientific citation structure.
+     - Detects citation spans in [Author, Year] or (1) style
+     - Learns that cited claims carry different epistemic weight
+     - Distinguishes established facts from recent/contested findings
+     - Tracks claim provenance through the context window
+     """
+
+     def __init__(self, d_model: int):
+         """
+         Initialize CitationModule.
+
+         Args:
+             d_model: Model dimension
+         """
+         super().__init__()
+         self.d_model = d_model
+
+         # Citation span detector (3 classes: none, inline, reference)
+         # Inline: (Author, Year) or [1]
+         # Reference: full citation at the end of a paper
+         self.citation_detector = nn.Linear(d_model, 3)
+
+         # Provenance gate: modulates information flow based on citation context
+         self.provenance_gate = nn.Linear(d_model, d_model)
+
+         # Claim confidence head: estimates how well-supported a claim is
+         self.confidence_head = nn.Linear(d_model, 1)
+
+         # Citation type embeddings
+         self.citation_type_embedding = nn.Embedding(3, d_model)
+
+         # Initialize weights
+         self._initialize_weights()
+
+     def _initialize_weights(self):
+         """Initialize weights."""
+         for module in [self.citation_detector, self.provenance_gate,
+                        self.confidence_head, self.citation_type_embedding]:
+             if hasattr(module, 'weight'):
+                 nn.init.normal_(module.weight, mean=0.0, std=0.02)
+             if hasattr(module, 'bias') and module.bias is not None:
+                 nn.init.zeros_(module.bias)
+
+     def detect_citation_spans(
+         self,
+         text: str,
+     ) -> List[Tuple[int, int, str]]:
+         """
+         Detect citation spans in text.
+         Supports: (Author, Year), [1], [Author, Year], et al.
+
+         Args:
+             text: Input text string
+
+         Returns:
+             List of (start_char, end_char, citation_type)
+             citation_type: "inline" or "reference"
+         """
+         spans = []
+
+         # Pattern 1: (Author, Year) or (Author Year)
+         for match in re.finditer(r'\([A-Za-z\s]+(?:et al\.)?,?\s*\d{4}\)', text):
+             spans.append((match.start(), match.end(), "inline"))
+
+         # Pattern 2: [1] or [1-3] or [1,2,3]
+         for match in re.finditer(r'\[\d+(?:[-,]\d+)*\]', text):
+             spans.append((match.start(), match.end(), "inline"))
+
+         # Pattern 3: [Author, Year]
+         for match in re.finditer(r'\[[A-Za-z\s]+,?\s*\d{4}\]', text):
+             spans.append((match.start(), match.end(), "inline"))
+
+         # Pattern 4: et al. (often indicates a citation). No trailing \b:
+         # a period followed by space/punctuation has no word boundary.
+         for match in re.finditer(r'\bet al\.', text):
+             spans.append((match.start(), match.end(), "inline"))
+
+         return spans
+
+     def forward(
+         self,
+         x: torch.Tensor,
+         text: Optional[List[str]] = None,
+         citation_spans: Optional[List[List[Tuple[int, int, str]]]] = None,
+     ) -> Tuple[torch.Tensor, torch.Tensor]:
+         """
+         Forward pass through the citation module.
+
+         Args:
+             x: Input tensor (batch, seq_len, d_model)
+             text: Optional original text strings
+             citation_spans: Optional pre-computed citation spans per batch item
+
+         Returns:
+             Tuple of:
+             - Citation-enhanced representation (batch, seq_len, d_model)
+             - Claim confidence scores (batch, seq_len, 1)
+         """
+         batch, seq_len, d_model = x.shape
+
+         # Detect citation spans
+         if citation_spans is None and text is not None:
+             citation_spans = []
+             for b in range(batch):
+                 spans = self.detect_citation_spans(text[b])
+                 # Convert char spans to token spans (rough heuristic of ~4
+                 # chars per token; real code would align via the tokenizer)
+                 token_spans = []
+                 for start_char, end_char, ctype in spans:
+                     start_tok = max(0, start_char // 4)
+                     end_tok = min(seq_len, end_char // 4 + 1)
+                     token_spans.append((start_tok, end_tok, ctype))
+                 citation_spans.append(token_spans)
+
+         # Compute citation type logits (available for auxiliary supervision)
+         citation_logits = self.citation_detector(x)  # (batch, seq_len, 3)
+         citation_probs = F.softmax(citation_logits, dim=-1)
+
+         # Apply citation-specific transformations
+         output = x.clone()
+
+         if citation_spans:
+             for b in range(batch):
+                 spans_b = citation_spans[b] if b < len(citation_spans) else []
+
+                 for start_tok, end_tok, ctype in spans_b:
+                     if end_tok <= start_tok:
+                         continue
+
+                     # Get citation type embedding
+                     if ctype == "inline":
+                         type_id = 1
+                     elif ctype == "reference":
+                         type_id = 2
+                     else:
+                         type_id = 0
+
+                     type_emb = self.citation_type_embedding(
+                         torch.tensor(type_id, device=x.device)
+                     )
+
+                     # Apply provenance gate to the citation span
+                     span_slice = x[b, start_tok:end_tok, :]  # (span_len, d_model)
+                     gated = span_slice * torch.sigmoid(self.provenance_gate(span_slice))
+
+                     # Add citation type embedding, broadcast over the span
+                     gated = gated + type_emb.unsqueeze(0)
+
+                     output[b, start_tok:end_tok, :] = gated
+
+         # Compute confidence scores (for auxiliary loss)
+         confidence = torch.sigmoid(self.confidence_head(x))  # (batch, seq_len, 1)
+
+         return output, confidence
+
+     def compute_citation_loss(
+         self,
+         x: torch.Tensor,
+         citation_mask: torch.Tensor,
+         confidence: torch.Tensor,
+     ) -> torch.Tensor:
+         """
+         Compute auxiliary loss for citation detection and confidence.
+
+         Args:
+             x: Input tensor (batch, seq_len, d_model)
+             citation_mask: Ground-truth citation mask (batch, seq_len), 1 if token is in a citation
+             confidence: Predicted confidence scores (batch, seq_len, 1)
+
+         Returns:
+             Combined citation loss
+         """
+         # Citation detection loss (mask values 0/1 index the none/inline classes)
+         logits = self.citation_detector(x)  # (batch, seq_len, 3)
+         detection_loss = F.cross_entropy(
+             logits.view(-1, 3),
+             citation_mask.long().view(-1),
+         )
+
+         # Confidence calibration loss (encourage high confidence for true citations)
+         confidence_loss = F.mse_loss(
+             confidence.squeeze(-1),
+             citation_mask.float(),
+         )
+
+         return detection_loss + 0.1 * confidence_loss
+
+ def test_citation_module():
+     """Test CitationModule."""
+     d_model = 512
+     batch_size = 2
+     seq_len = 128
+
+     module = CitationModule(d_model)
+
+     x = torch.randn(batch_size, seq_len, d_model)
+     text = [
+         "The theory of relativity (Einstein, 1905) revolutionized physics. See also [1, 2].",
+         "According to Smith et al., the results are significant. Further reading: [Doe, 2020]."
+     ]
+
+     output, confidence = module(x, text=text)
+     print(f"Input shape: {x.shape}")
+     print(f"Output shape: {output.shape}")
+     print(f"Confidence shape: {confidence.shape}")
+     assert output.shape == x.shape
+     assert confidence.shape == (batch_size, seq_len, 1)
+
+     # Test loss
+     citation_mask = torch.zeros(batch_size, seq_len)
+     citation_mask[0, 20:25] = 1.0  # Simulate a citation span
+     citation_mask[1, 10:18] = 1.0
+     loss = module.compute_citation_loss(x, citation_mask, confidence)
+     print(f"Citation loss: {loss.item():.4f}")
+
+     print("CitationModule test passed!")
+
+
+ if __name__ == "__main__":
+     test_citation_module()
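The detection regexes are easy to exercise in isolation. A minimal sketch with a made-up sentence; the `et al.` pattern here drops the trailing `\b` of the module's original pattern, since a period followed by a space has no word boundary and that variant never matches:

```python
import re

# Hypothetical sample sentence for demonstration
text = "Relativity (Einstein, 1905) is well established [1,2]; see Smith et al. for a review."

spans = []
for pattern in [
    r'\([A-Za-z\s]+(?:et al\.)?,?\s*\d{4}\)',  # (Author, Year)
    r'\[\d+(?:[-,]\d+)*\]',                    # [1], [1-3], [1,2,3]
    r'\bet al\.',                              # "et al." marker
]:
    for m in re.finditer(pattern, text):
        spans.append(text[m.start():m.end()])

print(spans)  # ['(Einstein, 1905)', '[1,2]', 'et al.']
```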
models/science_modules/equation_module.py ADDED
@@ -0,0 +1,266 @@
+ """
+ EquationModule: Specialized processing for mathematical equations and LaTeX.
+ Detects equation spans, applies equation-specific attention, and learns
+ structural representations of mathematical expressions.
+ """
+
+ import torch
+ import torch.nn as nn
+ import torch.nn.functional as F
+ import re
+ from typing import Optional, Tuple, List
+
+
+ class EquationModule(nn.Module):
+     r"""
+     Specialized processing for mathematical equations and LaTeX.
+     - Detects equation spans in input (between $ $ or \[ \] delimiters)
+     - Applies equation-specific attention patterns within equation spans
+     - Learns structural representations of mathematical expressions
+     - Tree-aware: understands operator precedence and nesting
+     """
+
+     def __init__(self, d_model: int, num_heads: int = 8):
+         """
+         Initialize EquationModule.
+
+         Args:
+             d_model: Model dimension
+             num_heads: Number of heads for equation-specific attention
+         """
+         super().__init__()
+         self.d_model = d_model
+
+         # Equation span detector (lightweight linear classifier)
+         self.span_detector = nn.Linear(d_model, 1)
+
+         # Equation-specific transformer (shallow, 2 layers)
+         encoder_layer = nn.TransformerEncoderLayer(
+             d_model=d_model,
+             nhead=num_heads,
+             dim_feedforward=d_model * 4,
+             activation=F.silu,
+             batch_first=True,
+             dropout=0.1,
+         )
+         self.equation_encoder = nn.TransformerEncoder(encoder_layer, num_layers=2)
+
+         # Merge equation representations back into the main stream
+         self.merge = nn.Linear(d_model * 2, d_model)
+
+         # LaTeX structure awareness (simple positional encoding for tree depth)
+         self.depth_embedding = nn.Embedding(10, d_model)  # Max depth 10
+
+         # Initialize weights
+         self._initialize_weights()
+
+     def _initialize_weights(self):
+         """Initialize weights."""
+         for module in [self.span_detector, self.merge, self.depth_embedding]:
+             if hasattr(module, 'weight'):
+                 nn.init.normal_(module.weight, mean=0.0, std=0.02)
+             if hasattr(module, 'bias') and module.bias is not None:
+                 nn.init.zeros_(module.bias)
57
+ def _initialize_weights(self):
58
+ """Initialize weights."""
59
+ for module in [self.span_detector, self.merge, self.depth_embedding]:
60
+ if hasattr(module, 'weight'):
61
+ nn.init.normal_(module.weight, mean=0.0, std=0.02)
62
+ if hasattr(module, 'bias') and module.bias is not None:
63
+ nn.init.zeros_(module.bias)
64
+
65
+ def detect_equation_spans(
66
+ self,
67
+ text: str,
68
+ token_ids: Optional[torch.Tensor] = None,
69
+ ) -> List[Tuple[int, int]]:
70
+ """
71
+ Detect equation spans in text using delimiters.
72
+ Supports: $...$, $$...$$, \[...\], \(...\)
73
+
74
+ Args:
75
+ text: Input text string
76
+ token_ids: Optional token IDs for alignment
77
+
78
+ Returns:
79
+ List of (start_char, end_char) spans
80
+ """
81
+ spans = []
82
+
83
+ # Pattern 1: $...$ (inline math)
84
+ for match in re.finditer(r'\$(.+?)\$', text, re.DOTALL):
85
+ spans.append((match.start(), match.end()))
86
+
87
+ # Pattern 2: $$...$$ (display math)
88
+ for match in re.finditer(r'\$\$(.+?)\$\$', text, re.DOTALL):
89
+ spans.append((match.start(), match.end()))
90
+
91
+ # Pattern 3: \[...\] (LaTeX display math)
92
+ for match in re.finditer(r'\\\[(.+?)\\\]', text, re.DOTALL):
93
+ spans.append((match.start(), match.end()))
94
+
95
+ # Pattern 4: \(...\) (LaTeX inline math)
96
+ for match in re.finditer(r'\\\((.+?)\\\)', text, re.DOTALL):
97
+ spans.append((match.start(), match.end()))
98
+
99
+ return spans
100
+
101
+ def forward(
102
+ self,
103
+ x: torch.Tensor,
104
+ text: Optional[List[str]] = None,
105
+ token_spans: Optional[List[List[Tuple[int, int]]]] = None,
106
+ ) -> torch.Tensor:
107
+ """
108
+ Forward pass through the equation module.
109
+
110
+ Args:
111
+ x: Input tensor (batch, seq_len, d_model)
112
+ text: Optional original text strings (for delimiter-based detection)
113
+ token_spans: Optional pre-computed token-level equation spans
114
+ Each element: list of (start_token, end_token) for that batch item
115
+
116
+ Returns:
117
+ Equation-enhanced representation (batch, seq_len, d_model)
118
+ """
119
+ batch, seq_len, d_model = x.shape
120
+
121
+ # Detect equation spans
122
+ if token_spans is None and text is not None:
123
+ # Use delimiter-based detection (requires text)
124
+ token_spans = []
125
+ for b in range(batch):
126
+ char_spans = self.detect_equation_spans(text[b])
127
+ # Convert char spans to token spans (simplified - assumes 1 char ≈ 1 token)
128
+ # In practice, would need proper tokenization alignment
129
+ token_spans_b = []
130
+ for start_char, end_char in char_spans:
131
+ # Rough approximation: divide by average chars per token (~4)
132
+ start_token = max(0, start_char // 4)
133
+ end_token = min(seq_len, end_char // 4 + 1)
134
+ token_spans_b.append((start_token, end_token))
135
+ token_spans.append(token_spans_b)
136
+ elif token_spans is None:
137
+ # Fallback: use learned detector
138
+ token_spans = self._learned_span_detection(x)
139
+
140
+ # Process each batch item
141
+ output = x.clone()
142
+
143
+ for b in range(batch):
144
+ spans_b = token_spans[b] if b < len(token_spans) else []
145
+
146
+ for start_tok, end_tok in spans_b:
147
+ if end_tok <= start_tok:
148
+ continue
149
+
150
+ # Extract equation segment
151
+ eq_segment = x[b:b+1, start_tok:end_tok, :] # (1, seg_len, d_model)
152
+
153
+ # Apply equation-specific transformer
154
+ eq_encoded = self.equation_encoder(eq_segment)
155
+
156
+ # Merge with original
157
+ merged = torch.cat([eq_segment, eq_encoded], dim=-1)
158
+ merged = self.merge(merged)
159
+
160
+ # Place back in output
161
+ output[b:b+1, start_tok:end_tok, :] = merged
162
+
163
+ return output
164
+
165
+ def _learned_span_detection(
166
+ self,
167
+ x: torch.Tensor,
168
+ ) -> List[List[Tuple[int, int]]]:
169
+ """
170
+ Use learned detector to find equation spans when delimiters missing.
171
+ Simple thresholding on span_detector output.
172
+
173
+ Args:
174
+ x: Input tensor (batch, seq_len, d_model)
175
+
176
+ Returns:
177
+ List of token spans per batch item
178
+ """
179
+ batch, seq_len, _ = x.shape
180
+
181
+ # Compute equation probability per token
182
+ eq_probs = torch.sigmoid(self.span_detector(x)) # (batch, seq_len, 1)
183
+ eq_probs = eq_probs.squeeze(-1) # (batch, seq_len)
184
+
185
+ # Threshold
186
+ threshold = 0.5
187
+ spans = []
188
+
189
+ for b in range(batch):
190
+ probs = eq_probs[b]
191
+ is_equation = (probs > threshold).cpu().numpy()
192
+
193
+ # Find contiguous spans
194
+ span_list = []
195
+ in_span = False
196
+ start = 0
197
+
198
+ for t in range(seq_len):
199
+ if is_equation[t] and not in_span:
200
+ start = t
201
+ in_span = True
202
+ elif not is_equation[t] and in_span:
203
+ span_list.append((start, t))
204
+ in_span = False
205
+
206
+ if in_span:
207
+ span_list.append((start, seq_len))
208
+
209
+ spans.append(span_list)
210
+
211
+ return spans
212
+
+     def compute_equation_loss(
+         self,
+         x: torch.Tensor,
+         equation_mask: torch.Tensor,
+     ) -> torch.Tensor:
+         """
+         Compute auxiliary loss for equation detection training.
+
+         Args:
+             x: Input tensor (batch, seq_len, d_model)
+             equation_mask: Ground-truth equation mask (batch, seq_len), 1 if token is in an equation
+
+         Returns:
+             Binary cross-entropy loss for equation detection
+         """
+         logits = self.span_detector(x).squeeze(-1)  # (batch, seq_len)
+         loss = F.binary_cross_entropy_with_logits(
+             logits,
+             equation_mask.float(),
+         )
+         return loss
+
+
+ def test_equation_module():
+     """Test EquationModule."""
+     d_model = 512
+     batch_size = 2
+     seq_len = 128
+
+     module = EquationModule(d_model)
+
+     x = torch.randn(batch_size, seq_len, d_model)
+     text = [
+         "The energy is $E = mc^2$ and momentum is $p = mv$.",
+         r"Equation: \[ F = ma \] and also $a^2 + b^2 = c^2$."
+     ]
+
+     output = module(x, text=text)
+     print(f"Input shape: {x.shape}")
+     print(f"Output shape: {output.shape}")
+     assert output.shape == x.shape
+
+     # Test equation loss
+     equation_mask = torch.zeros(batch_size, seq_len)
+     equation_mask[0, 10:15] = 1.0  # Simulate an equation span
+     equation_mask[1, 5:12] = 1.0
+     loss = module.compute_equation_loss(x, equation_mask)
+     print(f"Equation loss: {loss.item():.4f}")
+
+     print("EquationModule test passed!")
+
+
+ if __name__ == "__main__":
+     test_equation_module()
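The delimiter patterns are simple enough to sanity-check on their own. A minimal sketch mirroring the `$...$`, `\[...\]`, and `\(...\)` regexes from `detect_equation_spans` (the sample string is made up, and the `$$` case is omitted here):

```python
import re

# Hypothetical sample text containing the three delimiter styles
text = r"Energy $E = mc^2$, display \[ F = ma \], and inline \( p = mv \)."

spans = []
for pattern in [r'\$(.+?)\$', r'\\\[(.+?)\\\]', r'\\\((.+?)\\\)']:
    for m in re.finditer(pattern, text, re.DOTALL):
        spans.append(m.group())

print(spans)  # ['$E = mc^2$', '\\[ F = ma \\]', '\\( p = mv \\)']
```

Note that the bare `\$(.+?)\$` pattern would also fire inside `$$...$$` blocks, which is why display math needs to be handled before inline math.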
models/science_modules/molecular_module.py ADDED
@@ -0,0 +1,333 @@
+ """
+ MolecularModule: Domain knowledge for chemistry and biology.
+ Element embeddings, SMILES understanding, bond types, amino acids.
+ """
+
+ import re
+ import torch
+ import torch.nn as nn
+ import torch.nn.functional as F
+ from typing import Optional, Tuple, List
+
+
+ class MolecularModule(nn.Module):
+     """
+     Domain knowledge for chemistry and biology.
+     - All 118 elements as learned embeddings with properties
+       (atomic number, mass, electronegativity, valence electrons)
+     - SMILES string understanding for molecular structures
+     - Bond type awareness (covalent, ionic, hydrogen, van der Waals)
+     - Amino acid sequence understanding for biology/zoology
+     - Molecular formula to property reasoning
+     """
+
+     def __init__(self, d_model: int, num_elements: int = 118):
+         """
+         Initialize MolecularModule.
+
+         Args:
+             d_model: Model dimension
+             num_elements: Number of chemical elements (default 118)
+         """
+         super().__init__()
+         self.d_model = d_model
+         self.num_elements = num_elements
+
+         # Element embeddings for all 118 elements (+1 for unknown)
+         self.element_embed = nn.Embedding(num_elements + 1, d_model)
+
+         # Element property encoder (12 properties):
+         # [atomic_number, mass, electronegativity, valence_e, period, group,
+         #  atomic_radius, ionization_energy, electron_affinity, density,
+         #  melting_point, boiling_point]
+         self.property_proj = nn.Linear(12, d_model)
+
+         # Bond type embeddings (8 types):
+         # 0: none, 1: single, 2: double, 3: triple, 4: aromatic,
+         # 5: ionic, 6: hydrogen, 7: van der Waals
+         self.bond_embed = nn.Embedding(8, d_model)
+
+         # Amino acid embeddings (20 standard + special tokens)
+         self.amino_acid_vocab = 25  # 20 standard + stop + start + unknown + special
+         self.amino_embed = nn.Embedding(self.amino_acid_vocab, d_model)
+
+         # Molecular graph attention (treats molecules as graphs)
+         self.mol_attention = nn.MultiheadAttention(
+             d_model,
+             num_heads=8,
+             batch_first=True,
+             dropout=0.1,
+         )
+
+         # Property prediction head (for auxiliary tasks)
+         self.property_head = nn.Linear(d_model, 12)
+
+         # Initialize weights
+         self._initialize_weights()
+
+         # Pre-compute element properties (simplified)
+         self._init_element_properties()
+
+     def _initialize_weights(self):
+         """Initialize weights (modules without a .weight attribute,
+         e.g. nn.MultiheadAttention, keep their default init)."""
+         for module in [self.element_embed, self.property_proj, self.bond_embed,
+                        self.amino_embed, self.mol_attention, self.property_head]:
+             if hasattr(module, 'weight'):
+                 nn.init.normal_(module.weight, mean=0.0, std=0.02)
+             if hasattr(module, 'bias') and module.bias is not None:
+                 nn.init.zeros_(module.bias)
+
+     def _init_element_properties(self):
+         """Initialize the element property table with approximate values."""
+         # This is a simplified version; in practice the values would be
+         # loaded from a chemistry database.
+         # Properties: [atomic_number, mass, electronegativity, valence_e, period, group,
+         #              atomic_radius, ionization_energy, electron_affinity, density,
+         #              melting_point, boiling_point]
+         properties = torch.zeros(self.num_elements + 1, 12)
+
+         # Fill in known elements (simplified data for a few common ones)
+         element_data = {
+             1: [1, 1.008, 2.20, 1, 1, 1, 25, 1312, 72.8, 0.0000899, 14, 20],
+             6: [6, 12.011, 2.55, 4, 2, 14, 70, 1086, 153.9, 2.267, 3550, 4027],
+             7: [7, 14.007, 3.04, 5, 2, 15, 65, 1402, 7.0, 0.0012506, 63, 77],
+             8: [8, 15.999, 3.44, 6, 2, 16, 60, 1314, 141.0, 0.001429, 55, 90],
+             # ... would fill all 118 elements
+         }
+
+         for z, props in element_data.items():
+             properties[z] = torch.tensor(props)
+
+         self.register_buffer("element_properties", properties)
+
103
+ def detect_molecular_spans(
104
+ self,
105
+ text: str,
106
+ ) -> List[Tuple[int, int, str]]:
107
+ """
108
+ Detect molecular/chemical spans in text.
109
+
110
+ Args:
111
+ text: Input text string
112
+
113
+ Returns:
114
+ List of (start_char, end_char, span_type)
115
+ span_type: "formula", "smiles", "amino_acid"
116
+ """
117
+ spans = []
118
+
119
+ # Chemical formulas: H2O, CO2, C6H12O6, NaCl, HCl
120
+ formula_pattern = r'\b([A-Z][a-z]?\d*)+(?:[A-Z][a-z]?\d*)*\b'
121
+ for match in re.finditer(formula_pattern, text):
122
+ # Keep spans containing a digit (H2O) or mixed-case element symbols (NaCl);
+ # drop all-caps words (DNA, THE) that the loose pattern also matches
+ span = match.group()
+ if any(c.isdigit() for c in span) or not span.isupper():
125
+ spans.append((match.start(), match.end(), "formula"))
126
+
127
+ # SMILES patterns (simplified detection)
128
+ # Contains: =, #, @, [], (), numbers in sequence
129
+ smiles_hints = ['=', '#', '@', '[', ']', '(', ')']
130
+ words = re.findall(r'\S+', text)
131
+ for word in words:
132
+ if any(hint in word for hint in smiles_hints) and len(word) > 3:
133
+ # Find position in text
134
+ pos = text.find(word)
135
+ if pos >= 0:
136
+ spans.append((pos, pos + len(word), "smiles"))
137
+
138
+ # Amino acid sequences (single letters, length > 5)
139
+ aa_pattern = r'\b([ACDEFGHIKLMNPQRSTVWY]{6,})\b'
140
+ for match in re.finditer(aa_pattern, text.upper()):
141
+ spans.append((match.start(), match.end(), "amino_acid"))
142
+
143
+ return spans
144
+
145
+ def encode_molecule(
146
+ self,
147
+ formula: str,
148
+ ) -> torch.Tensor:
149
+ """
150
+ Encode a molecular formula into embedding.
151
+
152
+ Args:
153
+ formula: Chemical formula string (e.g., "C6H12O6")
154
+
155
+ Returns:
156
+ Molecule embedding (d_model,)
157
+ """
158
+ # Parse formula into elements and counts
159
+ # Simplified parser - real would handle nested parentheses
160
+ pattern = r'([A-Z][a-z]?)(\d*)'
161
+ matches = re.findall(pattern, formula)
162
+
163
+ device = self.element_embed.weight.device
164
+ embeddings = []
165
+ weights = []
166
+
167
+ for element, count_str in matches:
168
+ # Get element atomic number (simplified mapping)
169
+ element_map = {
170
+ 'H': 1, 'He': 2, 'Li': 3, 'Be': 4, 'B': 5, 'C': 6, 'N': 7, 'O': 8,
171
+ 'F': 9, 'Ne': 10, 'Na': 11, 'Mg': 12, 'Al': 13, 'Si': 14, 'P': 15,
172
+ 'S': 16, 'Cl': 17, 'Ar': 18, 'K': 19, 'Ca': 20,
173
+ # ... extend as needed
174
+ }
175
+ z = element_map.get(element, 0) # 0 = unknown
176
+
177
+ count = int(count_str) if count_str else 1
178
+
179
+ # Get element embedding
180
+ elem_emb = self.element_embed(torch.tensor(z, device=device))
181
+
182
+ # Get properties and project
183
+ props = self.element_properties[z].unsqueeze(0) # (1, 12)
184
+ props_emb = self.property_proj(props).squeeze(0)
185
+
186
+ # Combine
187
+ combined = elem_emb + props_emb
188
+ embeddings.append(combined)
189
+ weights.append(count)
190
+
191
+ if not embeddings:
192
+ # Return zero embedding
193
+ return torch.zeros(self.d_model, device=device)
194
+
195
+ # Weighted average
196
+ embeddings = torch.stack(embeddings)
197
+ weights = torch.tensor(weights, dtype=torch.float32, device=device)
198
+ weights = weights / weights.sum()
199
+
200
+ return (embeddings * weights.unsqueeze(-1)).sum(dim=0)
201
+
202
+ def forward(
203
+ self,
204
+ x: torch.Tensor,
205
+ text: Optional[List[str]] = None,
206
+ molecular_spans: Optional[List[List[Tuple[int, int, str]]]] = None,
207
+ ) -> torch.Tensor:
208
+ """
209
+ Forward pass through molecular module.
210
+
211
+ Args:
212
+ x: Input tensor (batch, seq_len, d_model)
213
+ text: Optional original text strings
214
+ molecular_spans: Optional pre-computed molecular spans per batch
215
+
216
+ Returns:
217
+ Molecular-enhanced representation (batch, seq_len, d_model)
218
+ """
219
+ batch, seq_len, d_model = x.shape
220
+ device = x.device
221
+
222
+ # Detect molecular spans
223
+ if molecular_spans is None and text is not None:
224
+ molecular_spans = []
225
+ for b in range(batch):
226
+ spans = self.detect_molecular_spans(text[b])
227
+ # Convert char spans to token spans (rough heuristic: ~4 chars per token)
228
+ token_spans = []
229
+ for start_char, end_char, span_type in spans:
230
+ start_tok = max(0, start_char // 4)
231
+ end_tok = min(seq_len, end_char // 4 + 1)
232
+ token_spans.append((start_tok, end_tok, span_type))
233
+ molecular_spans.append(token_spans)
234
+
235
+ # Enhance molecular spans
236
+ output = x.clone()
237
+
238
+ if molecular_spans:
239
+ for b in range(batch):
240
+ spans_b = molecular_spans[b] if b < len(molecular_spans) else []
241
+
242
+ for start_tok, end_tok, span_type in spans_b:
243
+ if end_tok <= start_tok:
244
+ continue
245
+
246
+ span_slice = x[b, start_tok:end_tok, :]
247
+
248
+ if span_type == "formula":
249
+ # Extract formula from text if available
250
+ if text:
251
+ formula = text[b][start_tok*4:end_tok*4] # rough extraction
252
+ mol_emb = self.encode_molecule(formula)
253
+ else:
254
+ mol_emb = torch.zeros(d_model, device=device, dtype=x.dtype)
255
+
256
+ # Add molecular embedding to first token
257
+ output[b, start_tok, :] += mol_emb
258
+
259
+ elif span_type == "amino_acid":
260
+ # Encode as amino acid sequence
261
+ # Placeholder: random residue IDs; a full version would map each letter to its amino-acid index
262
+ seq_len_span = end_tok - start_tok
263
+ aa_ids = torch.randint(0, 20, (seq_len_span,), device=device)
264
+ aa_emb = self.amino_embed(aa_ids) # (seq_len_span, d_model)
265
+ output[b, start_tok:end_tok, :] += aa_emb
266
+
267
+ elif span_type == "smiles":
268
+ # For SMILES, apply graph attention (simplified)
269
+ # Treat each character as a node
270
+ seq_len_span = end_tok - start_tok
271
+ if seq_len_span > 1:
272
+ # Self-attention over the span
273
+ attn_out, _ = self.mol_attention(
274
+ span_slice.unsqueeze(0),
275
+ span_slice.unsqueeze(0),
276
+ span_slice.unsqueeze(0),
277
+ )
278
+ output[b, start_tok:end_tok, :] += attn_out.squeeze(0)
279
+
280
+ return output
281
+
282
+ def compute_property_loss(
283
+ self,
284
+ x: torch.Tensor,
285
+ element_ids: torch.Tensor,
286
+ target_properties: torch.Tensor,
287
+ ) -> torch.Tensor:
288
+ """
289
+ Compute auxiliary loss for property prediction.
290
+
291
+ Args:
292
+ x: Input tensor (batch, seq_len, d_model)
293
+ element_ids: Element IDs (batch, seq_len)
294
+ target_properties: Target property values (batch, seq_len, 12)
295
+
296
+ Returns:
297
+ MSE loss for property prediction
298
+ """
299
+ # Get element embeddings
300
+ elem_emb = self.element_embed(element_ids)
301
+
302
+ # Predict properties
303
+ pred_props = self.property_head(elem_emb)
304
+
305
+ # Compute loss
306
+ loss = F.mse_loss(pred_props, target_properties)
307
+ return loss
308
+
309
+
310
+ def test_molecular_module():
311
+ """Test MolecularModule."""
312
+ d_model = 512
313
+ batch_size = 2
314
+ seq_len = 128
315
+
316
+ module = MolecularModule(d_model)
317
+
318
+ x = torch.randn(batch_size, seq_len, d_model)
319
+ text = [
320
+ "Water is H2O. The DNA sequence is ACGTACGTACGT.",
321
+ "Proteins are made of amino acids like ACDEFGH. Benzene is C6H6."
322
+ ]
323
+
324
+ output = module(x, text=text)
325
+ print(f"Input shape: {x.shape}")
326
+ print(f"Output shape: {output.shape}")
327
+ assert output.shape == x.shape
328
+
329
+ print("MolecularModule test passed!")
330
+
331
+
332
+ if __name__ == "__main__":
333
+ test_molecular_module()
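The flat-formula parsing that `encode_molecule` relies on can be checked in isolation. A minimal sketch under the same regex; `parse_formula` is an illustrative helper name, not part of the module:

```python
# Sketch of the formula parser inside MolecularModule.encode_molecule.
# Like the module's "simplified parser", it ignores nested parentheses
# (Ca(OH)2 is counted as Ca + O + H, not Ca + 2*(O + H)).
import re
from collections import Counter

def parse_formula(formula: str) -> Counter:
    counts = Counter()
    for element, count_str in re.findall(r'([A-Z][a-z]?)(\d*)', formula):
        # An absent count means one atom of that element
        counts[element] += int(count_str) if count_str else 1
    return counts
```

Each (element, count) pair is then looked up in the atomic-number map and contributes a count-weighted term to the molecule embedding.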
models/science_modules/numerical_module.py ADDED
@@ -0,0 +1,251 @@
1
+ """
2
+ NumericalReasoningModule: Handles scientific numerical reasoning.
3
+ Digit-level number encoding, scientific notation, unit awareness.
4
+ """
5
+
6
+ import torch
7
+ import torch.nn as nn
8
+ import torch.nn.functional as F
9
+ import re
10
+ from typing import Optional, Tuple, List
11
+
12
+
13
+ class NumericalReasoningModule(nn.Module):
14
+ """
15
+ Handles scientific numerical reasoning.
16
+ - Digit-level number encoding (each digit gets position-aware embedding)
17
+ - Scientific notation understanding (6.02 × 10²³)
18
+ - Unit awareness (meters, joules, moles, kelvin)
19
+ - Order of magnitude reasoning
20
+ - Significant figures tracking
21
+ """
22
+
23
+ def __init__(
24
+ self,
25
+ d_model: int,
26
+ max_digits: int = 20,
27
+ num_units: int = 256,
28
+ ):
29
+ """
30
+ Initialize NumericalReasoningModule.
31
+
32
+ Args:
33
+ d_model: Model dimension
34
+ max_digits: Maximum number of digits to encode
35
+ num_units: Number of unit types to embed
36
+ """
37
+ super().__init__()
38
+ self.d_model = d_model
39
+ self.max_digits = max_digits
40
+
41
+ # Digit embeddings (0-9)
42
+ self.digit_embed = nn.Embedding(10, 64)
43
+
44
+ # Position embeddings (ones, tens, hundreds...)
45
+ self.position_embed = nn.Embedding(max_digits, 64)
46
+
47
+ # Project digit+position to model dimension
48
+ self.number_proj = nn.Linear(128, d_model)
49
+
50
+ # Unit embedding (SI units + common scientific units)
51
+ self.unit_embed = nn.Embedding(num_units, d_model)
52
+
53
+ # Scientific notation handler
54
+ self.sci_notation = nn.Linear(d_model * 2, d_model)
55
+
56
+ # Magnitude embedding (powers of 10: -10 to +10)
57
+ self.magnitude_embed = nn.Embedding(21, d_model) # -10 to +10
58
+
59
+ # Initialize weights
60
+ self._initialize_weights()
61
+
62
+ def _initialize_weights(self):
63
+ """Initialize weights."""
64
+ for module in [self.digit_embed, self.position_embed, self.number_proj,
65
+ self.unit_embed, self.sci_notation, self.magnitude_embed]:
66
+ if hasattr(module, 'weight'):
67
+ nn.init.normal_(module.weight, mean=0.0, std=0.02)
68
+ if hasattr(module, 'bias') and module.bias is not None:
69
+ nn.init.zeros_(module.bias)
70
+
71
+ def encode_number(
72
+ self,
73
+ number_str: str,
74
+ device: torch.device,
75
+ ) -> torch.Tensor:
76
+ """
77
+ Encode a number string using digit-level encoding.
78
+
79
+ Args:
80
+ number_str: String representation of number (e.g., "123.45e-6")
81
+ device: Torch device
82
+
83
+ Returns:
84
+ Number embedding (d_model,)
85
+ """
86
+ # Extract digits (ignore decimal point, sign, exponent)
87
+ digits = [int(d) for d in re.findall(r'\d', number_str)]
88
+ if not digits:
89
+ digits = [0]
90
+
91
+ # Pad/truncate to max_digits
92
+ if len(digits) > self.max_digits:
93
+ digits = digits[:self.max_digits]
94
+ else:
95
+ digits = digits + [0] * (self.max_digits - len(digits))
96
+
97
+ digits_tensor = torch.tensor(digits, device=device) # (max_digits,)
98
+ positions = torch.arange(self.max_digits, device=device) # (max_digits,)
99
+
100
+ # Embed digits and positions
101
+ digit_emb = self.digit_embed(digits_tensor) # (max_digits, 64)
102
+ pos_emb = self.position_embed(positions) # (max_digits, 64)
103
+
104
+ # Concatenate and project
105
+ combined = torch.cat([digit_emb, pos_emb], dim=-1) # (max_digits, 128)
106
+ number_emb = self.number_proj(combined) # (max_digits, d_model)
107
+
108
+ # Mean pool over positions
109
+ return number_emb.mean(dim=0) # (d_model,)
110
+
111
+ def detect_numbers(
112
+ self,
113
+ text: str,
114
+ ) -> List[Tuple[str, int, int, Optional[str]]]:
115
+ """
116
+ Detect numbers in text with optional units and scientific notation.
117
+
118
+ Returns:
119
+ List of (number_str, start_char, end_char, unit_str)
120
+ """
121
+ # Pattern: number with optional decimal, exponent, and unit
122
+ # Matches: 123, 123.45, 1.23e-4, 6.02×10²³, 100 m, 5.0 J/mol
123
+ pattern = r'(\d+(?:\.\d+)?(?:[eE][+-]?\d+)?(?:×10\^?[+-]?\d+)?)(?:\s*([a-zA-Z°%]+))?'
124
+
125
+ matches = []
126
+ for match in re.finditer(pattern, text):
127
+ number_str = match.group(1)
128
+ unit_str = match.group(2) if match.group(2) else None
129
+ matches.append((number_str, match.start(), match.end(), unit_str))
130
+
131
+ return matches
132
+
133
+ def forward(
134
+ self,
135
+ x: torch.Tensor,
136
+ text: Optional[List[str]] = None,
137
+ number_positions: Optional[List[List[Tuple[int, int, str, Optional[str]]]]] = None,
138
+ ) -> torch.Tensor:
139
+ """
140
+ Forward pass through numerical reasoning module.
141
+
142
+ Args:
143
+ x: Input tensor (batch, seq_len, d_model)
144
+ text: Optional original text strings
145
+ number_positions: Optional list of (start_token, end_token, number_str, unit_str) per batch
146
+
147
+ Returns:
148
+ Numerical-enhanced representation (batch, seq_len, d_model)
149
+ """
150
+ batch, seq_len, d_model = x.shape
151
+ device = x.device
152
+
153
+ # Detect numbers if text provided
154
+ if number_positions is None and text is not None:
155
+ number_positions = []
156
+ for b in range(batch):
157
+ numbers = self.detect_numbers(text[b])
158
+ # Convert char positions to token positions (approximate)
159
+ token_nums = []
160
+ for num_str, start_char, end_char, unit_str in numbers:
161
+ start_tok = max(0, start_char // 4)
162
+ end_tok = min(seq_len, end_char // 4 + 1)
163
+ token_nums.append((start_tok, end_tok, num_str, unit_str))
164
+ number_positions.append(token_nums)
165
+
166
+ # Enhance number spans
167
+ output = x.clone()
168
+
169
+ if number_positions:
170
+ for b in range(batch):
171
+ nums_b = number_positions[b] if b < len(number_positions) else []
172
+
173
+ for start_tok, end_tok, num_str, unit_str in nums_b:
174
+ if end_tok <= start_tok or start_tok >= seq_len:
175
+ continue
176
+
177
+ # Clamp to sequence bounds
178
+ start_tok = min(start_tok, seq_len - 1)
179
+ end_tok = min(end_tok, seq_len)
180
+
181
+ # Encode the number
182
+ number_emb = self.encode_number(num_str, device) # (d_model,)
183
+
184
+ # Add unit embedding if present
185
+ if unit_str:
186
+ # Deterministic character-sum unit ID (Python's hash() varies across
+ # runs; a real implementation would use a unit vocabulary)
+ unit_id = sum(ord(c) for c in unit_str) % self.unit_embed.num_embeddings
188
+ unit_emb = self.unit_embed(torch.tensor(unit_id, device=device))
189
+ number_emb = number_emb + unit_emb
190
+
191
+ # Add magnitude embedding for scientific notation
192
+ if 'e' in num_str.lower() or '×10' in num_str:
193
+ # Extract exponent
194
+ exp_match = re.search(r'[eE]([+-]?\d+)|×10\^?([+-]?\d+)', num_str)
195
+ if exp_match:
196
+ exp = int(exp_match.group(1) or exp_match.group(2))
197
+ exp = max(-10, min(10, exp)) # Clamp to embedding range
198
+ magnitude_emb = self.magnitude_embed(torch.tensor(exp + 10, device=device))
199
+ number_emb = number_emb + magnitude_emb
200
+
201
+ # Add to the first token of the number span
202
+ output[b, start_tok, :] += number_emb
203
+
204
+ return output
205
+
206
+ def compute_numerical_loss(
207
+ self,
208
+ x: torch.Tensor,
209
+ number_mask: torch.Tensor,
210
+ target_values: torch.Tensor,
211
+ ) -> torch.Tensor:
212
+ """
213
+ Compute auxiliary loss for numerical reasoning.
214
+
215
+ Args:
216
+ x: Input tensor (batch, seq_len, d_model)
217
+ number_mask: Mask for number tokens (batch, seq_len)
218
+ target_values: Target numeric values (batch, seq_len) or None
219
+
220
+ Returns:
221
+ MSE loss for value prediction (simplified)
222
+ """
223
+ # This is a simplified loss - in practice would have a value prediction head
224
+ # For now, return a zero tensor so callers can treat it like any other loss
+ return torch.zeros((), device=x.device)
226
+
227
+
228
+ def test_numerical_module():
229
+ """Test NumericalReasoningModule."""
230
+ d_model = 512
231
+ batch_size = 2
232
+ seq_len = 128
233
+
234
+ module = NumericalReasoningModule(d_model)
235
+
236
+ x = torch.randn(batch_size, seq_len, d_model)
237
+ text = [
238
+ "The speed of light is 2.998×10^8 m/s and Planck's constant is 6.626×10^-34 J·s.",
239
+ "Calculate: 123.45 + 67.89 = ? The answer is 191.34."
240
+ ]
241
+
242
+ output = module(x, text=text)
243
+ print(f"Input shape: {x.shape}")
244
+ print(f"Output shape: {output.shape}")
245
+ assert output.shape == x.shape
246
+
247
+ print("NumericalReasoningModule test passed!")
248
+
249
+
250
+ if __name__ == "__main__":
251
+ test_numerical_module()
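The string-level conventions behind `encode_number` and the magnitude branch need no tensors to verify. A sketch under the same rules (digits only, pad to `max_digits`, exponent clamped to ±10); the helper names are illustrative:

```python
# Digit extraction and exponent clamping as done in NumericalReasoningModule.
import re

def extract_digits(number_str: str, max_digits: int = 20) -> list:
    # Digits only; sign, decimal point, and exponent markers are dropped
    # (the module's regex keeps the exponent's digits too, matched here)
    digits = [int(d) for d in re.findall(r'\d', number_str)] or [0]
    digits = digits[:max_digits]
    return digits + [0] * (max_digits - len(digits))

def extract_exponent(number_str: str) -> int:
    # Matches both 1.23e-4 and 6.02×10^23 notations, clamped to [-10, 10]
    m = re.search(r'[eE]([+-]?\d+)|×10\^?([+-]?\d+)', number_str)
    if not m:
        return 0
    return max(-10, min(10, int(m.group(1) or m.group(2))))
```

The clamp is what keeps `magnitude_embed` (21 entries, exponents -10..+10) in range for extreme values like Avogadro's 10²³.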
models/scigate_ffn.py ADDED
@@ -0,0 +1,203 @@
1
+ """
2
+ SciGateFFN: Science-aware gated feed-forward network.
3
+ Learns to activate different FFN pathways based on science domain.
4
+ Uses hybrid routing: explicit domain tags preferred, fallback to learned classifier.
5
+ """
6
+
7
+ import torch
8
+ import torch.nn as nn
9
+ import torch.nn.functional as F
10
+ from typing import Optional, Tuple
11
+
12
+
13
+ class SciGateFFN(nn.Module):
14
+ """
15
+ Gated FFN with science domain routing.
16
+ Learns to activate different FFN pathways for different science domains.
17
+ Gate is conditioned on the detected domain (math, chemistry, biology, etc.).
18
+ """
19
+
20
+ def __init__(
21
+ self,
22
+ d_model: int,
23
+ expansion: int = 4,
24
+ num_domains: int = 7,
25
+ use_domain_tags: bool = True,
26
+ ):
27
+ """
28
+ Initialize SciGateFFN.
29
+
30
+ Args:
31
+ d_model: Model dimension
32
+ expansion: FFN expansion factor (default 4)
33
+ num_domains: Number of science domains (7)
34
+ use_domain_tags: Whether to use explicit domain tags for routing
35
+ """
36
+ super().__init__()
37
+ self.d_model = d_model
38
+ self.expansion = expansion
39
+ self.num_domains = num_domains
40
+ self.use_domain_tags = use_domain_tags
41
+
42
+ hidden_dim = d_model * expansion
43
+
44
+ # Standard SwiGLU architecture: up_proj splits into two paths
45
+ self.up_proj = nn.Linear(d_model, hidden_dim * 2, bias=False)
46
+ self.down_proj = nn.Linear(hidden_dim, d_model, bias=False)
47
+
48
+ # Domain-specific scaling factors (learnable)
49
+ # Shape: (num_domains, hidden_dim)
50
+ self.domain_gate = nn.Linear(num_domains, hidden_dim, bias=True)
51
+
52
+ # Fallback domain classifier (when tags not present)
53
+ # Simple linear classifier based on sequence representation
54
+ self.fallback_classifier = nn.Sequential(
55
+ nn.Linear(d_model, d_model // 2),
56
+ nn.SiLU(),
57
+ nn.Linear(d_model // 2, num_domains),
58
+ )
59
+
60
+ # Initialize weights
61
+ self._initialize_weights()
62
+
63
+ def _initialize_weights(self):
64
+ """Initialize weights."""
65
+ for module in [self.up_proj, self.down_proj, self.domain_gate, *self.fallback_classifier]:
66
+ if hasattr(module, 'weight'):
67
+ nn.init.normal_(module.weight, mean=0.0, std=0.02)
68
+ if hasattr(module, 'bias') and module.bias is not None:
69
+ nn.init.zeros_(module.bias)
70
+
71
+ def get_domain_one_hot(
72
+ self,
73
+ domain_ids: Optional[torch.Tensor] = None,
74
+ domain_tags: Optional[torch.Tensor] = None,
75
+ hidden_states: Optional[torch.Tensor] = None,
76
+ ) -> torch.Tensor:
77
+ """
78
+ Get domain one-hot vector for routing.
79
+
80
+ Hybrid strategy:
81
+ 1. If domain_tags provided (explicit [MATH], [CHEM] etc), use those
82
+ 2. If domain_ids provided (from data loader), use those
83
+ 3. Fallback: classify from hidden_states
84
+
85
+ Args:
86
+ domain_ids: Tensor of domain IDs (batch, seq_len) or (batch,)
87
+ domain_tags: Boolean mask for domain tags (batch, seq_len, num_domains)
88
+ hidden_states: Hidden states for fallback classification (batch, seq_len, d_model)
89
+
90
+ Returns:
91
+ domain_one_hot: (batch, seq_len, num_domains)
92
+ """
93
+ batch, seq_len, _ = hidden_states.shape if hidden_states is not None else (0, 0, 0)
94
+
95
+ if domain_tags is not None and domain_tags.any():
96
+ # Use explicit domain tags (one-hot already)
97
+ return domain_tags.float()
98
+ elif domain_ids is not None:
99
+ # Convert domain IDs to one-hot
100
+ if domain_ids.dim() == 1:
101
+ # Same domain for entire sequence
102
+ domain_one_hot = F.one_hot(domain_ids, num_classes=self.num_domains)
103
+ # Expand to sequence length
104
+ domain_one_hot = domain_one_hot.unsqueeze(1).expand(-1, seq_len, -1)
105
+ else:
106
+ # Per-token domain IDs
107
+ domain_one_hot = F.one_hot(domain_ids, num_classes=self.num_domains)
108
+ return domain_one_hot.float()
109
+ elif hidden_states is not None:
110
+ # Fallback: classify domain from hidden states
111
+ # Use mean pooling over sequence
112
+ pooled = hidden_states.mean(dim=1) # (batch, d_model)
113
+ domain_logits = self.fallback_classifier(pooled) # (batch, num_domains)
114
+ domain_probs = F.softmax(domain_logits, dim=-1)
115
+ # Expand to sequence length
116
+ return domain_probs.unsqueeze(1).expand(-1, seq_len, -1)
117
+ else:
+ # No domain info at all: fall back to a uniform prior. Note this branch
+ # is unreachable from forward(), which always passes hidden_states.
+ uniform = torch.ones(batch, seq_len, self.num_domains)
+ return uniform / self.num_domains
121
+
122
+ def forward(
123
+ self,
124
+ x: torch.Tensor,
125
+ domain_ids: Optional[torch.Tensor] = None,
126
+ domain_tags: Optional[torch.Tensor] = None,
127
+ ) -> torch.Tensor:
128
+ """
129
+ Forward pass with domain-aware gating.
130
+
131
+ Args:
132
+ x: Input tensor (batch, seq_len, d_model)
133
+ domain_ids: Optional domain IDs (batch,) or (batch, seq_len)
134
+ domain_tags: Optional domain tag mask (batch, seq_len, num_domains)
135
+
136
+ Returns:
137
+ Output tensor (batch, seq_len, d_model)
138
+ """
139
+ batch, seq_len, d_model = x.shape
140
+
141
+ # Get domain routing weights
142
+ domain_weights = self.get_domain_one_hot(domain_ids, domain_tags, x)
143
+ # Shape: (batch, seq_len, num_domains)
144
+
145
+ # Project to hidden dimension
146
+ up = self.up_proj(x) # (batch, seq_len, hidden_dim * 2)
147
+ up1, up2 = up.chunk(2, dim=-1) # Each: (batch, seq_len, hidden_dim)
148
+
149
+ # Apply SwiGLU activation
150
+ hidden = up1 * F.silu(up2) # (batch, seq_len, hidden_dim)
151
+
152
+ # Apply domain-specific scaling
153
+ # domain_weights: (batch, seq_len, num_domains)
154
+ # domain_gate is Linear(num_domains -> hidden_dim); applying it directly
+ # computes the einsum "bsd,hd->bsh" over its weight and also uses the
+ # declared bias, which a bare einsum would silently drop
+ domain_scaling = self.domain_gate(domain_weights)
161
+ # domain_scaling: (batch, seq_len, hidden_dim)
162
+
163
+ # Apply domain scaling (multiplicative gating)
164
+ hidden = hidden * domain_scaling
165
+
166
+ # Project back to model dimension
167
+ output = self.down_proj(hidden)
168
+
169
+ return output
170
+
171
+
172
+ def test_scigate_ffn():
173
+ """Test SciGateFFN."""
174
+ batch_size = 2
175
+ seq_len = 128
176
+ d_model = 4096
177
+ num_domains = 7
178
+
179
+ ffn = SciGateFFN(d_model, expansion=4, num_domains=num_domains)
180
+
181
+ # Test with no domain info (fallback)
182
+ x = torch.randn(batch_size, seq_len, d_model)
183
+ output = ffn(x)
184
+ print(f"Input shape: {x.shape}")
185
+ print(f"Output shape: {output.shape}")
186
+ assert output.shape == x.shape
187
+
188
+ # Test with explicit domain IDs
189
+ domain_ids = torch.randint(0, num_domains, (batch_size,))
190
+ output2 = ffn(x, domain_ids=domain_ids)
191
+ assert output2.shape == x.shape
192
+
193
+ # Test with domain tags
194
+ domain_tags = torch.zeros(batch_size, seq_len, num_domains)
195
+ domain_tags[:, :, 0] = 1.0 # All math
196
+ output3 = ffn(x, domain_tags=domain_tags)
197
+ assert output3.shape == x.shape
198
+
199
+ print("SciGateFFN test passed!")
200
+
201
+
202
+ if __name__ == "__main__":
203
+ test_scigate_ffn()
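The routing algebra in `forward` has a simple reading: with a one-hot domain vector, contracting over `domain_gate.weight` just selects that domain's column as the per-channel scaling. A small check of that algebra (shapes chosen arbitrarily for illustration):

```python
import torch

num_domains, hidden_dim = 7, 16
# Same layout as nn.Linear(num_domains, hidden_dim).weight: (out, in)
W = torch.randn(hidden_dim, num_domains)

one_hot = torch.zeros(1, 1, num_domains)
one_hot[0, 0, 2] = 1.0  # pretend every token belongs to domain 2

scaling = torch.einsum("bsd,hd->bsh", one_hot, W)  # (1, 1, hidden_dim)
# One-hot routing reduces to reading one column of W:
assert torch.allclose(scaling[0, 0], W[:, 2])
```

With soft fallback probabilities instead of a one-hot, the same contraction yields a convex mixture of the domain columns.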
models/ssm_layer.py ADDED
@@ -0,0 +1,252 @@
1
+ """
2
+ VortexSSM: Selective State-Space Layer
3
+ Simplified Mamba-style SSM with input-dependent selection.
4
+ Provides O(n) complexity for long sequences, ideal for scientific documents.
5
+ """
6
+
7
+ import torch
8
+ import torch.nn as nn
9
+ import torch.nn.functional as F
10
+ from typing import Optional, Tuple
11
+
12
+
13
+ class VortexSSM(nn.Module):
14
+ """
15
+ Selective state-space layer. Linear complexity O(n) vs attention's O(n²).
16
+ Handles long scientific documents efficiently with input-dependent selection.
17
+
18
+ Architecture based on Mamba but simplified for scientific reasoning tasks.
19
+ """
20
+
21
+ def __init__(
22
+ self,
23
+ d_model: int,
24
+ d_state: int = 16,
25
+ d_conv: int = 4,
26
+ expand: int = 2,
27
+ dt_rank: Optional[int] = None,
28
+ ):
29
+ """
30
+ Initialize VortexSSM.
31
+
32
+ Args:
33
+ d_model: Model dimension
34
+ d_state: State dimension (default 16 for 7B, 32 for 13B)
35
+ d_conv: Convolution kernel size for local context
36
+ expand: Expansion factor for inner dimension
37
+ dt_rank: Rank for delta projection (if None, uses ceil(d_model/16))
38
+ """
39
+ super().__init__()
40
+ self.d_model = d_model
41
+ self.d_state = d_state
42
+ self.d_conv = d_conv
43
+ self.expand = expand
44
+ self.d_inner = d_model * expand
45
+
46
+ if dt_rank is None:
47
+ self.dt_rank = max(1, d_model // 16)
48
+ else:
49
+ self.dt_rank = dt_rank
50
+
51
+ # Input projection: splits into x and z pathways
52
+ self.in_proj = nn.Linear(d_model, self.d_inner * 2, bias=False)
53
+
54
+ # Convolution for local context before SSM
55
+ # Depthwise convolution for efficiency
56
+ self.conv1d = nn.Conv1d(
57
+ in_channels=self.d_inner,
58
+ out_channels=self.d_inner,
59
+ kernel_size=d_conv,
60
+ padding=d_conv - 1,
61
+ groups=self.d_inner,
62
+ bias=False,
63
+ )
64
+
65
+ # SSM parameter projections (input-dependent)
66
+ self.x_proj = nn.Linear(self.d_inner, self.dt_rank + 2 * self.d_state, bias=False)
67
+ self.dt_proj = nn.Linear(self.dt_rank, self.d_inner, bias=True)
68
+
69
+ # State matrices (A is log-scale for stability)
70
+ # A is (d_inner, d_state)
71
+ self.A_log = nn.Parameter(torch.randn(self.d_inner, self.d_state))
72
+ self.D = nn.Parameter(torch.randn(self.d_inner))
73
+
74
+ # Output projection
75
+ self.out_proj = nn.Linear(self.d_inner, d_model, bias=False)
76
+
77
+ # Initialize weights
78
+ self._initialize_weights()
79
+
80
+ def _initialize_weights(self):
81
+ """Initialize weights properly."""
82
+ # Initialize A_log with negative values for stable discretization
83
+ nn.init.normal_(self.A_log, mean=-4.0, std=0.5)
84
+ nn.init.normal_(self.D, mean=0.0, std=0.1)
85
+
86
+ # Initialize projections with small values
87
+ for module in [self.in_proj, self.x_proj, self.dt_proj, self.conv1d, self.out_proj]:
88
+ if hasattr(module, 'weight'):
89
+ nn.init.normal_(module.weight, mean=0.0, std=0.02)
90
+
91
+ def forward(
92
+ self,
93
+ x: torch.Tensor,
94
+ state: Optional[torch.Tensor] = None,
95
+ return_state: bool = False,
96
+ ) -> torch.Tensor:
97
+ """
98
+ Forward pass through the SSM.
99
+
100
+ Args:
101
+ x: Input tensor (batch, seq_len, d_model)
102
+ state: Previous hidden state (batch, d_inner, d_state)
103
+ return_state: If True, return (output, state)
104
+
105
+ Returns:
106
+ Output tensor (batch, seq_len, d_model) or tuple with state
107
+ """
108
+ batch, seq_len, _ = x.shape
109
+ device = x.device
110
+ dtype = x.dtype
111
+
112
+ # Double-check d_inner matches A_log shape
113
+ d_inner = self.d_inner
114
+
115
+ # Project input to inner dimension
116
+ xz = self.in_proj(x) # (batch, seq_len, 2 * d_inner)
117
+ x, z = xz.chunk(2, dim=-1)
118
+
119
+ # Apply 1D convolution for local context
120
+ # Need to transpose for conv1d: (batch, d_inner, seq_len)
121
+ x_conv = x.transpose(1, 2)
122
+ x_conv = self.conv1d(x_conv)[..., :seq_len] # Trim padding
123
+ x = x_conv.transpose(1, 2)
124
+
125
+ # Discretization: compute delta, A, B parameters
126
+ # x_proj produces: delta (dt_rank), B (d_state), C (d_state)
127
+ x_dbl = self.x_proj(x) # (batch, seq_len, dt_rank + 2*d_state)
128
+ (delta, B, C) = torch.split(
129
+ x_dbl,
130
+ [self.dt_rank, self.d_state, self.d_state],
131
+ dim=-1,
132
+ )
133
+
134
+ # Project delta
135
+ delta = self.dt_proj(delta) # (batch, seq_len, d_inner)
136
+ delta = F.softplus(delta)
137
+
138
+ # Compute discretized state recurrence
139
+ # Use scan operation for efficient sequential processing
140
+ if state is None:
141
+ state = torch.zeros(batch, d_inner, self.d_state, device=device, dtype=dtype)
142
+
143
+ # Sequential scan (can be optimized with CUDA kernel)
144
+ output = []
145
+ for t in range(seq_len):
146
+ x_t = x[:, t] # (batch, d_inner)
147
+ delta_t = delta[:, t] # (batch, d_inner)
148
+ B_t = B[:, t] # (batch, d_state)
149
+ C_t = C[:, t] # (batch, d_state)
150
+
151
+ # Discretize A
152
+ A_delta = torch.exp(self.A_log * delta_t.unsqueeze(-1)) # (batch, d_inner, d_state)
153
+
154
+ # State update: state = A_delta * state + B_t * x_t
155
+ # B_t needs to be (batch, d_state) -> (batch, d_inner, d_state) via broadcasting
156
+ state = A_delta * state + B_t.unsqueeze(1) * x_t.unsqueeze(-1)
157
+
158
+ # Output: y = C_t * state + D * x_t
159
+ y = (C_t.unsqueeze(1) * state).sum(dim=-1) + self.D * x_t
160
+ output.append(y)
161
+
162
+ output = torch.stack(output, dim=1) # (batch, seq_len, d_inner)
163
+
164
+ # Apply gating with z
165
+ output = output * F.silu(z)
166
+
167
+ # Project back to model dimension
168
+ output = self.out_proj(output)
169
+
170
+ if return_state:
171
+ return output, state
172
+ return output
173
+
174
+ def step(
175
+ self,
176
+ x: torch.Tensor,
177
+ state: torch.Tensor,
178
+ ) -> Tuple[torch.Tensor, torch.Tensor]:
179
+ """
180
+ Single-step inference for autoregressive decoding.
181
+
182
+ Args:
183
+ x: Input at current step (batch, d_model)
184
+             state: Previous state (batch, d_inner, d_state)
+ 
+         Returns:
+             output: (batch, d_model)
+             new_state: updated state
+         """
+         batch, _ = x.shape
+ 
+         # Project input
+         xz = self.in_proj(x.unsqueeze(1))  # Add seq dim
+         x, z = xz.chunk(2, dim=-1)
+         x = x.squeeze(1)
+         z = z.squeeze(1)
+ 
+         # No convolution for single step (would need a conv cache)
+ 
+         # Compute parameters
+         x_dbl = self.x_proj(x.unsqueeze(1)).squeeze(1)
+         delta, B, C = torch.split(
+             x_dbl,
+             [self.dt_rank, self.d_state, self.d_state],
+             dim=-1,
+         )
+         delta = self.dt_proj(delta)
+         delta = F.softplus(delta)
+ 
+         # Single-step discretization. A = -exp(A_log) keeps the decay factor
+         # exp(delta * A) in (0, 1), so the recurrent state stays bounded.
+         A = -torch.exp(self.A_log)
+         A_delta = torch.exp(A * delta.unsqueeze(-1))
+         state = A_delta * state + B.unsqueeze(1) * x.unsqueeze(-1)
+         y = (C.unsqueeze(1) * state).sum(dim=-1) + self.D * x
+         y = y * F.silu(z)
+         output = self.out_proj(y)
+ 
+         return output, state
+ 
+ 
+ def test_vortex_ssm():
+     """Test the VortexSSM layer."""
+     batch_size = 2
+     seq_len = 128
+     d_model = 4096
+     d_state = 16
+ 
+     ssm = VortexSSM(d_model, d_state=d_state)
+     x = torch.randn(batch_size, seq_len, d_model)
+ 
+     # Forward pass
+     output = ssm(x)
+     print(f"Input shape: {x.shape}")
+     print(f"Output shape: {output.shape}")
+     assert output.shape == x.shape, f"Expected {x.shape}, got {output.shape}"
+ 
+     # Stateful forward
+     state = torch.zeros(batch_size, ssm.d_inner, d_state)
+     output2, new_state = ssm(x, state=state, return_state=True)
+     print(f"Stateful output shape: {output2.shape}")
+     print(f"State shape: {new_state.shape}")
+ 
+     # Single step
+     x_step = torch.randn(batch_size, d_model)
+     output_step, state_step = ssm.step(x_step, state)
+     print(f"Step output shape: {output_step.shape}")
+     print(f"Step state shape: {state_step.shape}")
+ 
+     print("VortexSSM test passed!")
+ 
+ 
+ if __name__ == "__main__":
+     test_vortex_ssm()
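The single-step update in `VortexSSM.step` reduces, per channel, to a simple discretized recurrence. A minimal scalar sketch (illustrative only, ignoring the input/output projections and SiLU gating; the function name and scalar parameters are my own):

```python
import math

def ssm_scalar_step(state, x, A, B, C, D, delta):
    # Discretized single-step SSM update for one scalar channel:
    #   state' = exp(delta * A) * state + B * x
    #   y      = C * state' + D * x
    # A < 0 keeps the decay factor in (0, 1), so the state stays bounded.
    a_bar = math.exp(delta * A)
    new_state = a_bar * state + B * x
    y = C * new_state + D * x
    return y, new_state

y, s = ssm_scalar_step(state=0.0, x=1.0, A=-1.0, B=1.0, C=1.0, D=0.5, delta=0.5)
# With a zero initial state: new_state = B * x = 1.0, y = C * 1.0 + D * 1.0 = 1.5
```

The same decay factor is what the tensorized code computes with `torch.exp`, broadcast over `(batch, d_inner, d_state)`.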
models/vortex_model.py ADDED
@@ -0,0 +1,377 @@
+ """
+ VortexModel: Main model class combining SSM, attention, science modules, and SciGate FFN.
+ Implements two block types: SSM-only and attention+science+SciGate FFN.
+ """
+ 
+ import torch
+ import torch.nn as nn
+ from typing import Optional, List, Dict, Union
+ 
+ from .ssm_layer import VortexSSM
+ from .attention_layer import VortexLocalAttention
+ from .scigate_ffn import SciGateFFN
+ from .science_modules import (
+     EquationModule,
+     NumericalReasoningModule,
+     CitationModule,
+     MolecularModule,
+ )
+ 
+ 
+ class VortexBlock(nn.Module):
+     """
+     Two types of blocks:
+     1. SSM block: only VortexSSM
+     2. Attention block: VortexLocalAttention + science modules + SciGateFFN
+     """
+ 
+     def __init__(
+         self,
+         config: Dict,
+         is_ssm_block: bool = True,
+     ):
+         """
+         Initialize a Vortex block.
+ 
+         Args:
+             config: Model configuration
+             is_ssm_block: If True, this is an SSM-only block; else attention+science+FFN
+         """
+         super().__init__()
+         self.config = config
+         self.is_ssm_block = is_ssm_block
+         self.d_model = config["d_model"]
+ 
+         if is_ssm_block:
+             # SSM-only block
+             self.ssm = VortexSSM(
+                 d_model=config["d_model"],
+                 d_state=config["d_state"],
+                 d_conv=config["d_conv"],
+             )
+             self.norm = nn.LayerNorm(config["d_model"])
+         else:
+             # Attention + science + FFN block
+             self.attn = VortexLocalAttention(
+                 d_model=config["d_model"],
+                 num_heads=config["num_heads"],
+                 window_size=config["window_size"],
+                 use_flash_attention=config.get("use_flash_attention", True),
+             )
+             self.attn_norm = nn.LayerNorm(config["d_model"])
+ 
+             # Science modules (enabled based on config flags)
+             self.equation_module = None
+             self.numerical_module = None
+             self.citation_module = None
+             self.molecular_module = None
+ 
+             if config.get("enable_equation_module", True):
+                 self.equation_module = EquationModule(config["d_model"])
+ 
+             if config.get("enable_numerical_module", True):
+                 self.numerical_module = NumericalReasoningModule(config["d_model"])
+ 
+             if config.get("enable_citation_module", True):
+                 self.citation_module = CitationModule(config["d_model"])
+ 
+             if config.get("enable_molecular_module", True):
+                 self.molecular_module = MolecularModule(config["d_model"])
+ 
+             # SciGate FFN
+             self.ffn = SciGateFFN(
+                 d_model=config["d_model"],
+                 expansion=config["ffn_expansion"],
+                 num_domains=config["num_domains"],
+             )
+             self.ffn_norm = nn.LayerNorm(config["d_model"])
+ 
+         # Final layer norm for both block types
+         self.final_norm = nn.LayerNorm(config["d_model"])
+ 
+     def forward(
+         self,
+         x: torch.Tensor,
+         domain_ids: Optional[torch.Tensor] = None,
+         domain_tags: Optional[torch.Tensor] = None,
+         text: Optional[List[str]] = None,
+         attention_mask: Optional[torch.Tensor] = None,
+     ) -> torch.Tensor:
+         """
+         Forward pass through the block.
+ 
+         Args:
+             x: Input tensor (batch, seq_len, d_model)
+             domain_ids: Optional domain IDs for SciGate FFN
+             domain_tags: Optional domain tag masks
+             text: Optional original text for science module span detection
+             attention_mask: Optional attention mask
+ 
+         Returns:
+             Output tensor (batch, seq_len, d_model)
+         """
+         residual = x
+ 
+         if self.is_ssm_block:
+             # SSM-only pathway
+             x = self.norm(x)
+             x = self.ssm(x)
+             x = residual + x
+             x = self.final_norm(x)
+         else:
+             # Attention + science + FFN pathway
+             # Attention
+             residual_attn = x
+             x = self.attn_norm(x)
+             global_mask = self._detect_global_tokens(x)
+             x = self.attn(x, global_mask=global_mask, attention_mask=attention_mask)
+             x = residual_attn + x
+ 
+             # Science modules (applied sequentially as residual updates)
+             if self.equation_module is not None:
+                 x = x + self.equation_module(x, text=text)
+ 
+             if self.numerical_module is not None:
+                 x = x + self.numerical_module(x, text=text)
+ 
+             if self.citation_module is not None:
+                 x_cited, _ = self.citation_module(x, text=text)
+                 x = x + x_cited
+ 
+             if self.molecular_module is not None:
+                 x = x + self.molecular_module(x, text=text)
+ 
+             # SciGate FFN
+             residual_ffn = x
+             x = self.ffn_norm(x)
+             x = self.ffn(x, domain_ids=domain_ids, domain_tags=domain_tags)
+             x = residual_ffn + x
+ 
+             x = self.final_norm(x)
+ 
+         return x
+ 
+     def _detect_global_tokens(self, x: torch.Tensor) -> torch.Tensor:
+         """
+         Detect global tokens that should attend across the entire sequence.
+         Global tokens are those with special domain tags or high norm.
+         """
+         # Simple heuristic: tokens with large L2 norm are likely special
+         norms = torch.norm(x, dim=-1)  # (batch, seq_len)
+         threshold = torch.quantile(norms, 0.95, dim=-1, keepdim=True)
+         global_mask = norms > threshold
+ 
+         return global_mask
+ 
+ 
+ class VortexModel(nn.Module):
+     """
+     Main Vortex model combining SSM and attention blocks.
+     Supports both 7B and 13B configurations.
+     """
+ 
+     def __init__(
+         self,
+         config: Dict,
+     ):
+         """
+         Initialize VortexModel.
+ 
+         Args:
+             config: Model configuration (from vortex_7b_config.py or vortex_13b_config.py)
+         """
+         super().__init__()
+         self.config = config
+ 
+         # Token embedding
+         self.embed_tokens = nn.Embedding(config["vocab_size"], config["d_model"])
+ 
+         # Build blocks according to the layer ratio
+         self.blocks = nn.ModuleList()
+         self._build_blocks()
+ 
+         # Final layer norm
+         self.ln_f = nn.LayerNorm(config["d_model"])
+ 
+         # Output projection (weights will be tied by HuggingFace if config.tie_word_embeddings=True)
+         self.lm_head = nn.Linear(config["d_model"], config["vocab_size"], bias=False)
+ 
+         # Initialize weights
+         self._initialize_weights()
+ 
+     def _build_blocks(self):
+         """Build the sequence of SSM and attention blocks."""
+         num_layers = self.config["num_layers"]
+         ssm_ratio = self.config["ssm_ratio"]
+ 
+         # Calculate the number of each block type (for logging)
+         num_ssm_blocks = int(num_layers * ssm_ratio)
+         num_attn_blocks = num_layers - num_ssm_blocks
+ 
+         # Determine the block pattern (0 = SSM, 1 = attention)
+         if ssm_ratio == 0.6:  # 7B pattern: SSM, SSM, Attn, SSM, SSM, Attn, ...
+             pattern = [0, 0, 1]
+         else:  # 13B pattern: SSM, Attn, SSM, Attn, ...
+             pattern = [0, 1]
+ 
+         # Repeat the pattern and truncate to the exact layer count
+         blocks = []
+         while len(blocks) < num_layers:
+             blocks.extend(pattern[:num_layers - len(blocks)])
+         assert len(blocks) == num_layers
+ 
+         # Create blocks
+         for is_attn in blocks:
+             block = VortexBlock(
+                 config=self.config,
+                 is_ssm_block=not is_attn,
+             )
+             self.blocks.append(block)
+ 
+         print(f"Built {num_layers} layers: {num_ssm_blocks} SSM, {num_attn_blocks} Attention")
+ 
+     def _initialize_weights(self):
+         """Initialize weights."""
+         nn.init.normal_(self.embed_tokens.weight, mean=0.0, std=0.02)
+         for block in self.blocks:
+             if hasattr(block, 'ssm'):
+                 block.ssm._initialize_weights()
+             if hasattr(block, 'attn'):
+                 block.attn._initialize_weights()
+             if hasattr(block, 'ffn'):
+                 block.ffn._initialize_weights()
+ 
+     def forward(
+         self,
+         input_ids: torch.Tensor,
+         domain_ids: Optional[torch.Tensor] = None,
+         domain_tags: Optional[torch.Tensor] = None,
+         attention_mask: Optional[torch.Tensor] = None,
+         text: Optional[List[str]] = None,
+         return_dict: bool = True,
+     ) -> Union[Dict[str, torch.Tensor], torch.Tensor]:
+         """
+         Forward pass through the model.
+ 
+         Args:
+             input_ids: Token IDs (batch, seq_len)
+             domain_ids: Optional domain IDs
+             domain_tags: Optional domain tag masks
+             attention_mask: Optional attention mask (batch, seq_len)
+             text: Optional original text for science modules
+             return_dict: If True, return a dict with logits and last hidden state;
+                 otherwise return the logits tensor only
+ 
+         Returns:
+             Logits (batch, seq_len, vocab_size), wrapped in a dict if return_dict
+         """
+         # Embed tokens
+         x = self.embed_tokens(input_ids)
+ 
+         # Pass through blocks
+         for block in self.blocks:
+             x = block(
+                 x,
+                 domain_ids=domain_ids,
+                 domain_tags=domain_tags,
+                 text=text,
+                 attention_mask=attention_mask,
+             )
+ 
+         # Final norm
+         x = self.ln_f(x)
+ 
+         # Project to vocabulary
+         logits = self.lm_head(x)
+ 
+         if return_dict:
+             return {"logits": logits, "last_hidden_state": x}
+         return logits
+ 
+     def get_num_params(self) -> int:
+         """Get total number of parameters."""
+         return sum(p.numel() for p in self.parameters())
+ 
+     def get_trainable_params(self) -> int:
+         """Get number of trainable parameters."""
+         return sum(p.numel() for p in self.parameters() if p.requires_grad)
+ 
+     def estimate_memory_usage(
+         self,
+         batch_size: int,
+         seq_len: int,
+         use_gradient_checkpointing: bool = False,
+     ) -> Dict[str, float]:
+         """
+         Estimate memory usage for a given batch size and sequence length.
+         (use_gradient_checkpointing is accepted for API parity but not yet
+         factored into the estimate.)
+ 
+         Returns:
+             Dictionary with memory estimates in GB
+         """
+         params = self.get_num_params()
+         param_bytes = params * 2  # Assuming bfloat16 (2 bytes/param)
+ 
+         # Activation memory (rough estimate)
+         # Each layer: activations ~ batch * seq_len * d_model * 2 bytes
+         activations_per_layer = batch_size * seq_len * self.config["d_model"] * 2
+         total_activations = activations_per_layer * self.config["num_layers"]
+ 
+         # Gradients (same size as parameters)
+         gradients = param_bytes
+ 
+         # Optimizer states (AdamW: two extra states per parameter)
+         optimizer_states = params * 2 * 2
+ 
+         total_memory = (param_bytes + total_activations + gradients + optimizer_states) / 1e9
+ 
+         return {
+             "parameters_gb": param_bytes / 1e9,
+             "activations_gb": total_activations / 1e9,
+             "gradients_gb": gradients / 1e9,
+             "optimizer_states_gb": optimizer_states / 1e9,
+             "total_gb": total_memory,
+         }
+ 
+ 
+ def test_vortex_model():
+     """Test the VortexModel."""
+     from configs.vortex_7b_config import VORTEX_7B_CONFIG
+ 
+     config = VORTEX_7B_CONFIG.copy()
+     # Reduce size for testing
+     config["d_model"] = 512
+     config["num_layers"] = 4
+     config["num_heads"] = 8
+     config["vocab_size"] = 1000
+ 
+     model = VortexModel(config)
+ 
+     batch_size = 2
+     seq_len = 128
+     input_ids = torch.randint(0, config["vocab_size"], (batch_size, seq_len))
+ 
+     # Forward pass
+     output = model(input_ids)
+     logits = output["logits"]
+ 
+     print(f"Model parameters: {model.get_num_params():,}")
+     print(f"Input shape: {input_ids.shape}")
+     print(f"Logits shape: {logits.shape}")
+     assert logits.shape == (batch_size, seq_len, config["vocab_size"])
+ 
+     # Memory estimate
+     mem = model.estimate_memory_usage(batch_size, seq_len)
+     print(f"Memory estimate for batch={batch_size}, seq_len={seq_len}:")
+     for k, v in mem.items():
+         print(f"  {k}: {v:.2f} GB")
+ 
+     print("VortexModel test passed!")
+ 
+ 
+ if __name__ == "__main__":
+     test_vortex_model()
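The repeat-and-truncate interleaving in `_build_blocks` can be isolated as a small standalone helper (a sketch for illustration; the function name is mine, not part of the repo):

```python
def build_layer_pattern(num_layers, ssm_ratio):
    # 0 = SSM block, 1 = attention block.
    # Mirrors VortexModel._build_blocks: pick a base pattern, then repeat
    # it and truncate to exactly num_layers entries.
    pattern = [0, 0, 1] if ssm_ratio == 0.6 else [0, 1]
    blocks = []
    while len(blocks) < num_layers:
        blocks.extend(pattern[:num_layers - len(blocks)])
    return blocks

print(build_layer_pattern(8, 0.6))  # -> [0, 0, 1, 0, 0, 1, 0, 0]
print(build_layer_pattern(4, 0.5))  # -> [0, 1, 0, 1]
```

Note that truncation means the realized SSM/attention split can differ slightly from `int(num_layers * ssm_ratio)` when `num_layers` is not a multiple of the pattern length.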
mps_optimize.py ADDED
@@ -0,0 +1,172 @@
+ """
+ MPS optimizations for the Vortex model on Apple Silicon.
+ Uses the PyTorch MPS backend with MPS-compatible ops only.
+ """
+ 
+ import time
+ 
+ import torch
+ import torch.nn as nn
+ from typing import Dict
+ 
+ 
+ def optimize_for_mps(
+     model: nn.Module,
+     config: Dict,
+     use_sdpa: bool = True,
+ ) -> nn.Module:
+     """
+     Apply MPS optimizations to a model.
+ 
+     Args:
+         model: VortexModel
+         config: Model config
+         use_sdpa: Use PyTorch scaled dot product attention (MPS compatible)
+ 
+     Returns:
+         Optimized model
+     """
+     device = torch.device("mps")
+ 
+     # Move to MPS
+     model = model.to(device)
+ 
+     # Set dtype - MPS supports float32 and float16 (bfloat16 support is limited)
+     dtype_str = config.get("dtype", "bfloat16")
+     if dtype_str == "bfloat16":
+         # MPS has limited bfloat16 support, fall back to float16
+         dtype = torch.float16
+     else:
+         dtype = torch.float32
+ 
+     model = model.to(dtype)
+ 
+     # Replace Flash Attention with standard SDPA
+     if use_sdpa:
+         model = _apply_sdpa(model)
+         print("Applied PyTorch SDPA for MPS")
+ 
+     return model
+ 
+ 
+ def _apply_sdpa(model: nn.Module) -> nn.Module:
+     """
+     Replace custom attention with PyTorch SDPA.
+     SDPA is optimized for the MPS backend.
+     """
+     def sdpa_forward(self, x, *args, **kwargs):
+         return self._standard_attention(x, kwargs.get('attention_mask'))
+ 
+     for module in model.modules():
+         if hasattr(module, 'attn') and hasattr(module.attn, '_standard_attention'):
+             # Rebind forward on this attention instance to the SDPA path
+             module.attn.forward = sdpa_forward.__get__(module.attn, type(module.attn))
+ 
+     return model
+ 
+ 
+ def get_mps_memory_usage() -> Dict[str, float]:
+     """Get current MPS memory usage in GB."""
+     if not torch.backends.mps.is_available():
+         return {"error": "MPS not available"}
+ 
+     # MPS has no direct memory query; report process-level unified memory
+     import psutil
+     process = psutil.Process()
+     memory_info = process.memory_info()
+ 
+     return {
+         "rss_gb": memory_info.rss / 1e9,  # Resident set size
+         "vms_gb": memory_info.vms / 1e9,  # Virtual memory size
+     }
+ 
+ 
+ def profile_model_mps(
+     model: nn.Module,
+     input_ids: torch.Tensor,
+     num_warmup: int = 10,
+     num_runs: int = 50,
+ ) -> Dict[str, float]:
+     """
+     Profile model performance on MPS.
+ 
+     Args:
+         model: Model to profile
+         input_ids: Example input
+         num_warmup: Number of warmup runs
+         num_runs: Number of profiling runs
+ 
+     Returns:
+         Dictionary with timing statistics
+     """
+     model.eval()
+     device = next(model.parameters()).device
+     input_ids = input_ids.to(device)
+ 
+     # Warmup
+     with torch.no_grad():
+         for _ in range(num_warmup):
+             _ = model(input_ids)
+     # MPS is async, need to wait before starting the timer
+     if device.type == "mps":
+         torch.mps.synchronize()
+ 
+     # Profile
+     start = time.time()
+     with torch.no_grad():
+         for _ in range(num_runs):
+             _ = model(input_ids)
+     if device.type == "mps":
+         torch.mps.synchronize()
+     elapsed = time.time() - start
+ 
+     avg_time = elapsed / num_runs
+     # Tokens processed per forward = batch_size * seq_len
+     tokens_per_sec = input_ids.numel() / avg_time
+ 
+     return {
+         "avg_time_sec": avg_time,
+         "tokens_per_sec": tokens_per_sec,
+     }
+ 
+ 
+ def test_mps_optimize():
+     """Test MPS optimizations."""
+     if not torch.backends.mps.is_available():
+         print("MPS not available, skipping test")
+         return
+ 
+     from models.vortex_model import VortexModel
+     from configs.vortex_7b_config import VORTEX_7B_CONFIG
+ 
+     config = VORTEX_7B_CONFIG.copy()
+     config["d_model"] = 512
+     config["num_layers"] = 2
+     config["num_heads"] = 8
+     config["vocab_size"] = 1000
+ 
+     model = VortexModel(config)
+     print(f"Model parameters: {model.get_num_params():,}")
+ 
+     # Optimize for MPS
+     model = optimize_for_mps(model, config, use_sdpa=True)
+ 
+     # Test forward
+     batch_size = 2
+     seq_len = 128
+     input_ids = torch.randint(0, config["vocab_size"], (batch_size, seq_len)).to("mps")
+ 
+     with torch.no_grad():
+         output = model(input_ids)
+         logits = output["logits"]
+ 
+     print(f"Output shape: {logits.shape}")
+     print("MPS optimize test passed!")
+ 
+ 
+ if __name__ == "__main__":
+     test_mps_optimize()
push_to_hf.py ADDED
@@ -0,0 +1,39 @@
+ #!/usr/bin/env python3
+ import argparse
+ import os
+ 
+ from huggingface_hub import HfApi, create_repo
+ 
+ 
+ def push_to_hf(repo_id, token=None, private=False):
+     api = HfApi(token=token)
+ 
+     # Create repo if it doesn't exist
+     try:
+         create_repo(repo_id, repo_type="model", private=private, exist_ok=True)
+         print(f"Repository {repo_id} ready")
+     except Exception as e:
+         print(f"Repo creation note: {e}")
+ 
+     # Upload all files in the current directory
+     cwd = os.getcwd()
+     print(f"Uploading from {cwd} to {repo_id}...")
+ 
+     try:
+         api.upload_folder(
+             folder_path=cwd,
+             repo_id=repo_id,
+             repo_type="model",
+             commit_message="Upload Vortex model",
+         )
+         print(f"Successfully uploaded to {repo_id}")
+     except Exception as e:
+         print(f"Upload failed: {e}")
+         raise
+ 
+ 
+ if __name__ == "__main__":
+     parser = argparse.ArgumentParser()
+     parser.add_argument("--repo_id", type=str, required=True, help="HuggingFace repo ID")
+     parser.add_argument("--token", type=str, help="HuggingFace token (optional if logged in)")
+     parser.add_argument("--private", action="store_true", help="Make repository private")
+     args = parser.parse_args()
+ 
+     push_to_hf(args.repo_id, args.token, args.private)
requirements.txt ADDED
@@ -0,0 +1,50 @@
+ # Core dependencies
+ torch>=2.2.0
+ transformers>=4.40.0
+ accelerate>=0.30.0
+ datasets>=2.18.0
+ tokenizers>=0.19.0
+ 
+ # Quantization
+ bitsandbytes>=0.43.0
+ 
+ # Flash Attention (CUDA only)
+ flash-attn>=2.5.0
+ 
+ # Scientific computing
+ numpy>=1.26.0
+ scipy>=1.12.0
+ scikit-learn>=1.4.0
+ 
+ # Chemistry/Biology
+ rdkit>=2023.9.0
+ pubchempy>=1.0.4
+ 
+ # Web scraping
+ arxiv>=2.1.0
+ beautifulsoup4>=4.12.0
+ requests>=2.31.0
+ 
+ # Data processing
+ pandas>=2.0.0
+ pyarrow>=14.0.0
+ 
+ # LaTeX parsing
+ pylatexenc>=2.10
+ 
+ # Deduplication
+ minhash>=0.1.0
+ 
+ # Utilities
+ tqdm>=4.65.0
+ psutil>=5.9.0
+ jsonlines>=3.1.0
+ 
+ # Optional: wandb for logging
+ # wandb>=0.16.0
+ 
+ # Development/testing
+ pytest>=7.0.0
+ black>=23.0.0
+ flake8>=6.0.0
+ mypy>=1.0.0
science_bench.py ADDED
@@ -0,0 +1,360 @@
+ """
+ Science benchmarks for the Vortex model.
+ Evaluates performance across 7 science domains.
+ """
+ 
+ import torch
+ from typing import Dict, List
+ from dataclasses import dataclass
+ 
+ 
+ @dataclass
+ class BenchmarkResult:
+     """Results from a benchmark."""
+     domain: str
+     accuracy: float
+     total_questions: int
+     correct_answers: int
+     details: List[Dict]
+ 
+ 
+ class ScienceBenchmark:
+     """
+     Base class for science benchmarks.
+     """
+ 
+     def __init__(self, name: str, domain: str):
+         self.name = name
+         self.domain = domain
+ 
+     def load_questions(self) -> List[Dict]:
+         """Load benchmark questions."""
+         raise NotImplementedError
+ 
+     def evaluate(
+         self,
+         model,
+         tokenizer,
+         device: torch.device,
+         max_samples: int = 100,
+     ) -> BenchmarkResult:
+         """
+         Evaluate the model on this benchmark.
+ 
+         Args:
+             model: Vortex model
+             tokenizer: Tokenizer
+             device: Torch device
+             max_samples: Maximum number of samples to evaluate
+ 
+         Returns:
+             BenchmarkResult
+         """
+         questions = self.load_questions()[:max_samples]
+         correct = 0
+ 
+         details = []
+         for q in questions:
+             # Format prompt
+             prompt = self.format_prompt(q)
+             # Tokenize
+             inputs = tokenizer(prompt, return_tensors="pt").to(device)
+ 
+             # Generate answer (greedy decoding)
+             with torch.no_grad():
+                 outputs = model.generate(
+                     **inputs,
+                     max_new_tokens=50,
+                     do_sample=False,
+                 )
+ 
+             # Decode
+             generated = tokenizer.decode(outputs[0], skip_special_tokens=True)
+             answer = self.extract_answer(generated)
+ 
+             # Check correctness
+             is_correct = self.check_answer(answer, q["answer"])
+             if is_correct:
+                 correct += 1
+ 
+             details.append({
+                 "question": q["question"],
+                 "expected": q["answer"],
+                 "generated": answer,
+                 "correct": is_correct,
+             })
+ 
+         accuracy = correct / len(questions) if questions else 0.0
+         return BenchmarkResult(
+             domain=self.domain,
+             accuracy=accuracy,
+             total_questions=len(questions),
+             correct_answers=correct,
+             details=details,
+         )
+ 
+     def format_prompt(self, question: Dict) -> str:
+         """Format a question into a prompt."""
+         raise NotImplementedError
+ 
+     def extract_answer(self, text: str) -> str:
+         """Extract the answer from generated text."""
+         raise NotImplementedError
+ 
+     def check_answer(self, predicted: str, expected: str) -> bool:
+         """Check whether the predicted answer matches the expected one."""
+         raise NotImplementedError
+ 
+ 
+ class PhysicsBenchmark(ScienceBenchmark):
+     """Physics benchmark (Feynman Questions style)."""
+ 
+     def __init__(self):
+         super().__init__("physics_benchmark", "physics")
+ 
+     def load_questions(self) -> List[Dict]:
+         # Placeholder - would load from a dataset
+         return [
+             {
+                 "question": "What is the formula for kinetic energy?",
+                 "answer": "KE = 1/2 mv^2",
+                 "type": "formula",
+             },
+             {
+                 "question": "Explain Newton's first law of motion.",
+                 "answer": "An object at rest stays at rest unless acted upon by a force.",
+                 "type": "conceptual",
+             },
+         ]
+ 
+     def format_prompt(self, question: Dict) -> str:
+         return f"Question: {question['question']}\nAnswer:"
+ 
+     def extract_answer(self, text: str) -> str:
+         # Extract everything after "Answer:"
+         if "Answer:" in text:
+             return text.split("Answer:")[-1].strip()
+         return text.strip()
+ 
+     def check_answer(self, predicted: str, expected: str) -> bool:
+         # Simple containment match (a more robust matcher would be used in practice)
+         pred_lower = predicted.lower()
+         exp_lower = expected.lower()
+         return exp_lower in pred_lower or pred_lower in exp_lower
+ 
+ 
+ class MathBenchmark(ScienceBenchmark):
+     """Math benchmark (MATH dataset style)."""
+ 
+     def __init__(self):
+         super().__init__("math_benchmark", "math")
+ 
+     def load_questions(self) -> List[Dict]:
+         return [
+             {
+                 "question": "Solve for x: 2x + 5 = 15",
+                 "answer": "x = 5",
+                 "type": "algebra",
+             },
+             {
+                 "question": "What is the derivative of x^2?",
+                 "answer": "2x",
+                 "type": "calculus",
+             },
+         ]
+ 
+     def format_prompt(self, question: Dict) -> str:
+         return f"Problem: {question['question']}\nSolution:"
+ 
+     def extract_answer(self, text: str) -> str:
+         if "Solution:" in text:
+             return text.split("Solution:")[-1].strip()
+         return text.strip()
+ 
+     def check_answer(self, predicted: str, expected: str) -> bool:
+         # Normalize whitespace and case, then require an exact match
+         pred = " ".join(predicted.lower().split())
+         exp = " ".join(expected.lower().split())
+         return pred == exp
+ 
+ 
+ class ChemistryBenchmark(ScienceBenchmark):
+     """Chemistry benchmark."""
+ 
+     def __init__(self):
+         super().__init__("chemistry_benchmark", "chemistry")
+ 
+     def load_questions(self) -> List[Dict]:
+         return [
+             {
+                 "question": "What is the chemical formula for water?",
+                 "answer": "H2O",
+                 "type": "factual",
+             },
+             {
+                 "question": "How many protons does carbon have?",
+                 "answer": "6",
+                 "type": "factual",
+             },
+         ]
+ 
+     def format_prompt(self, question: Dict) -> str:
+         return f"Chemistry question: {question['question']}\nAnswer:"
+ 
+     def extract_answer(self, text: str) -> str:
+         if "Answer:" in text:
+             return text.split("Answer:")[-1].strip()
+         return text.strip()
+ 
+     def check_answer(self, predicted: str, expected: str) -> bool:
+         pred = predicted.strip().lower()
+         exp = expected.strip().lower()
+         return exp in pred
+ 
+ 
+ class BiologyBenchmark(ScienceBenchmark):
+     """Biology benchmark."""
+ 
+     def __init__(self):
+         super().__init__("biology_benchmark", "biology")
+ 
+     def load_questions(self) -> List[Dict]:
+         return [
+             {
+                 "question": "What is the powerhouse of the cell?",
+                 "answer": "mitochondria",
+                 "type": "factual",
+             },
+             {
+                 "question": "What molecule carries genetic information?",
+                 "answer": "DNA",
+                 "type": "factual",
+             },
+         ]
+ 
+     def format_prompt(self, question: Dict) -> str:
+         return f"Biology: {question['question']}\nAnswer:"
+ 
+     def extract_answer(self, text: str) -> str:
+         if "Answer:" in text:
+             return text.split("Answer:")[-1].strip()
+         return text.strip()
+ 
+     def check_answer(self, predicted: str, expected: str) -> bool:
+         pred = predicted.strip().lower()
+         exp = expected.strip().lower()
+         return exp in pred
+ 
+ 
+ # Placeholder benchmarks for the remaining domains
+ class EarthScienceBenchmark(ScienceBenchmark):
+     def __init__(self):
+         super().__init__("earth_science_benchmark", "earth")
+ 
+     def load_questions(self) -> List[Dict]:
+         return []
+ 
+     def format_prompt(self, question: Dict) -> str:
+         return f"Earth Science: {question['question']}\nAnswer:"
+ 
+     def extract_answer(self, text: str) -> str:
+         return text.strip()
+ 
+     def check_answer(self, predicted: str, expected: str) -> bool:
+         return predicted.strip().lower() == expected.strip().lower()
+ 
+ 
+ class SpaceScienceBenchmark(ScienceBenchmark):
+     def __init__(self):
+         super().__init__("space_science_benchmark", "space")
+ 
+     def load_questions(self) -> List[Dict]:
+         return []
+ 
+     def format_prompt(self, question: Dict) -> str:
+         return f"Space Science: {question['question']}\nAnswer:"
+ 
+     def extract_answer(self, text: str) -> str:
+         return text.strip()
+ 
+     def check_answer(self, predicted: str, expected: str) -> bool:
+         return predicted.strip().lower() == expected.strip().lower()
+ 
+ 
+ class ZoologyBenchmark(ScienceBenchmark):
+     def __init__(self):
+         super().__init__("zoology_benchmark", "zoology")
+ 
+     def load_questions(self) -> List[Dict]:
+         return []
+ 
+     def format_prompt(self, question: Dict) -> str:
+         return f"Zoology: {question['question']}\nAnswer:"
+ 
+     def extract_answer(self, text: str) -> str:
+         return text.strip()
+ 
+     def check_answer(self, predicted: str, expected: str) -> bool:
+         return predicted.strip().lower() == expected.strip().lower()
+ 
+ 
+ def run_all_benchmarks(
+     model,
+     tokenizer,
+     device: torch.device,
+     max_samples_per_domain: int = 50,
+ ) -> Dict[str, BenchmarkResult]:
+     """
+     Run all benchmarks and return results.
+ 
+     Args:
+         model: Vortex model
+         tokenizer: Tokenizer
+         device: Torch device
+         max_samples_per_domain: Max samples per domain
+ 
+     Returns:
+         Dictionary mapping domain to results
+     """
+     benchmarks = [
+         PhysicsBenchmark(),
+         MathBenchmark(),
+         ChemistryBenchmark(),
+         BiologyBenchmark(),
+         EarthScienceBenchmark(),
+         SpaceScienceBenchmark(),
+         ZoologyBenchmark(),
+     ]
+ 
+     results = {}
+     for bench in benchmarks:
+         print(f"Running {bench.name}...")
+         result = bench.evaluate(model, tokenizer, device, max_samples=max_samples_per_domain)
+         results[bench.domain] = result
+         print(f"  Accuracy: {result.accuracy:.2%} ({result.correct_answers}/{result.total_questions})")
+ 
+     return results
+ 
+ 
+ def print_summary(results: Dict[str, BenchmarkResult]):
+     """Print a summary of benchmark results."""
+     print("\n" + "=" * 60)
+     print("BENCHMARK RESULTS")
+     print("=" * 60)
+ 
+     for domain, result in results.items():
+         print(f"{domain:15} {result.accuracy:6.2%} ({result.correct_answers}/{result.total_questions})")
+ 
+     # Overall average
+     all_accuracies = [r.accuracy for r in results.values() if r.total_questions > 0]
+     if all_accuracies:
+         avg = sum(all_accuracies) / len(all_accuracies)
+         print(f"{'OVERALL':15} {avg:6.2%}")
+     print("=" * 60)
+ 
+ 
+ if __name__ == "__main__":
+     # Quick test
+     print("This script benchmarks the model across science domains.")
+     print("To run full benchmarks, integrate with a trained model.")
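The lenient grading several benchmarks use boils down to case-insensitive containment in either direction. A small standalone sketch of that check (the helper name is mine; the whitespace normalization mirrors MathBenchmark's):

```python
def containment_match(predicted, expected):
    # Normalize case and whitespace, then accept if either string
    # contains the other - the same lenient rule as
    # PhysicsBenchmark.check_answer.
    p = " ".join(predicted.lower().split())
    e = " ".join(expected.lower().split())
    return e in p or p in e

print(containment_match("The formula is  KE = 1/2 mv^2.", "ke = 1/2 mv^2"))  # -> True
print(containment_match("x = 4", "x = 5"))  # -> False
```

Containment grading is forgiving of verbose model output but can produce false positives (e.g. expected "6" matching any answer containing a 6), which is why the comments note a more robust matcher would be used in practice.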
test_model.py ADDED
@@ -0,0 +1,449 @@
+ #!/usr/bin/env python3
2
+ """
3
+ Comprehensive unit tests for Vortex model components.
4
+ Run with: python -m pytest test_model.py -v
5
+ """
6
+
7
+ import pytest
8
+ import torch
9
+ import sys
10
+ from pathlib import Path
11
+
12
+ # Add Vortex to path
13
+ sys.path.insert(0, str(Path(__file__).parent))
14
+
15
+
16
+ def test_tokenizer():
17
+ """Test VortexScienceTokenizer."""
18
+ from tokenizer.vortex_tokenizer import VortexScienceTokenizer
19
+ from configs.vortex_7b_config import VORTEX_7B_CONFIG
20
+
21
+ tokenizer = VortexScienceTokenizer(VORTEX_7B_CONFIG)
22
+
23
+ # Test encoding/decoding
24
+ text = "The equation is $E = mc^2$ and H2O is water."
25
+ encoded = tokenizer.encode(text, return_tensors="pt")
26
+ assert "input_ids" in encoded
27
+ assert encoded["input_ids"].shape[0] == 1 # batch dim
28
+
29
+ decoded = tokenizer.decode(encoded["input_ids"][0].tolist())
30
+ assert isinstance(decoded, str)
31
+ print("✓ Tokenizer test passed")
32
+
33
+
34
+ def test_ssm_layer():
35
+ """Test VortexSSM."""
36
+ from models.ssm_layer import VortexSSM
37
+
38
+ batch_size = 2
39
+ seq_len = 64
40
+ d_model = 512
41
+ d_state = 16
42
+
43
+ ssm = VortexSSM(d_model, d_state=d_state)
44
+ x = torch.randn(batch_size, seq_len, d_model)
45
+
46
+ # Forward pass
47
+ output = ssm(x)
48
+ assert output.shape == x.shape
49
+
50
+ # Stateful forward
51
+ state = torch.zeros(batch_size, ssm.d_inner, d_state)
52
+ output2, new_state = ssm(x, state=state, return_state=True)
53
+ assert output2.shape == x.shape
54
+ assert new_state.shape == (batch_size, ssm.d_inner, d_state)
55
+
56
+ # Single step
57
+ x_step = torch.randn(batch_size, d_model)
58
+ output_step, state_step = ssm.step(x_step, state)
59
+ assert output_step.shape == (batch_size, d_model)
60
+ assert state_step.shape == (batch_size, ssm.d_inner, d_state)
61
+
62
+ print("✓ SSM layer test passed")
63
+
64
+
65
+ def test_attention_layer():
66
+ """Test VortexLocalAttention."""
67
+ from models.attention_layer import VortexLocalAttention
68
+
69
+ batch_size = 2
70
+ seq_len = 128
71
+ d_model = 512
72
+ num_heads = 8
73
+
74
+ attn = VortexLocalAttention(d_model, num_heads, window_size=64, use_flash_attention=False)
75
+ x = torch.randn(batch_size, seq_len, d_model)
76
+
77
+ # Forward pass
78
+ output = attn(x)
79
+ assert output.shape == x.shape
80
+
81
+ # With global mask
82
+ global_mask = torch.zeros(batch_size, seq_len, dtype=torch.bool)
83
+ global_mask[0, 0] = True
84
+ output2 = attn(x, global_mask=global_mask)
85
+ assert output2.shape == x.shape
86
+
87
+ print("✓ Local attention test passed")
88
+
89
+
90
+ def test_scigate_ffn():
91
+ """Test SciGateFFN."""
92
+ from models.scigate_ffn import SciGateFFN
93
+
94
+ batch_size = 2
95
+ seq_len = 64
96
+ d_model = 512
97
+ num_domains = 7
98
+
99
+ ffn = SciGateFFN(d_model, expansion=4, num_domains=num_domains)
100
+ x = torch.randn(batch_size, seq_len, d_model)
101
+
102
+ # Without domain info
103
+ output = ffn(x)
104
+ assert output.shape == x.shape
105
+
106
+ # With domain IDs
107
+ domain_ids = torch.randint(0, num_domains, (batch_size,))
108
+ output2 = ffn(x, domain_ids=domain_ids)
109
+ assert output2.shape == x.shape
110
+
111
+ # With domain tags
112
+ domain_tags = torch.zeros(batch_size, seq_len, num_domains)
113
+ domain_tags[:, :, 0] = 1.0
114
+ output3 = ffn(x, domain_tags=domain_tags)
115
+ assert output3.shape == x.shape
116
+
117
+ print("✓ SciGate FFN test passed")
118
+
119
+
120
+ def test_equation_module():
121
+ """Test EquationModule."""
122
+ from models.science_modules.equation_module import EquationModule
123
+
124
+ d_model = 512
125
+ batch_size = 2
126
+ seq_len = 64
127
+
128
+ module = EquationModule(d_model)
129
+ x = torch.randn(batch_size, seq_len, d_model)
130
+ text = ["E = mc^2 is famous.", "The integral $\\int x dx = x^2/2$."]
131
+
132
+ output = module(x, text=text)
133
+ assert output.shape == x.shape
134
+
135
+ # Test equation loss
136
+ equation_mask = torch.zeros(batch_size, seq_len)
137
+ equation_mask[0, 5:10] = 1.0
138
+ loss = module.compute_equation_loss(x, equation_mask)
139
+ assert loss.item() >= 0
140
+
141
+ print("✓ Equation module test passed")
142
+
143
+
144
+ def test_numerical_module():
145
+ """Test NumericalReasoningModule."""
146
+ from models.science_modules.numerical_module import NumericalReasoningModule
147
+
148
+ d_model = 512
149
+ batch_size = 2
150
+ seq_len = 64
151
+
152
+ module = NumericalReasoningModule(d_model)
153
+ x = torch.randn(batch_size, seq_len, d_model)
154
+ text = ["Speed of light: 2.998e8 m/s", "6.022e23 is Avogadro's number."]
155
+
156
+ output = module(x, text=text)
157
+ assert output.shape == x.shape
158
+
159
+ print("✓ Numerical reasoning module test passed")
160
+
161
+
162
+ def test_citation_module():
163
+ """Test CitationModule."""
164
+ from models.science_modules.citation_module import CitationModule
165
+
166
+ d_model = 512
167
+ batch_size = 2
168
+ seq_len = 64
169
+
170
+ module = CitationModule(d_model)
171
+ x = torch.randn(batch_size, seq_len, d_model)
172
+ text = ["(Einstein, 1905) changed physics.", "See also [1, 2] for details."]
173
+
174
+ output, confidence = module(x, text=text)
175
+ assert output.shape == x.shape
176
+ assert confidence.shape == (batch_size, seq_len, 1)
177
+
178
+ # Test loss
179
+ citation_mask = torch.zeros(batch_size, seq_len)
180
+ citation_mask[0, 0:5] = 1.0
181
+ loss = module.compute_citation_loss(x, citation_mask, confidence)
182
+ assert loss.item() >= 0
183
+
184
+ print("✓ Citation module test passed")
185
+
186
+
187
+ def test_molecular_module():
188
+ """Test MolecularModule."""
189
+ from models.science_modules.molecular_module import MolecularModule
190
+
191
+ d_model = 512
192
+ batch_size = 2
193
+ seq_len = 64
194
+
195
+ module = MolecularModule(d_model)
196
+ x = torch.randn(batch_size, seq_len, d_model)
197
+ text = ["H2O is water.", "DNA sequence: ACGTACGT"]
198
+
199
+ output = module(x, text=text)
200
+ assert output.shape == x.shape
201
+
202
+ print("✓ Molecular module test passed")
203
+
204
+
205
+ def test_vortex_model():
206
+ """Test full VortexModel."""
207
+ from models.vortex_model import VortexModel
208
+ from configs.vortex_7b_config import VORTEX_7B_CONFIG
209
+
210
+ # Small config for testing
211
+ config = VORTEX_7B_CONFIG.copy()
212
+ config["d_model"] = 256
213
+ config["num_layers"] = 4
214
+ config["num_heads"] = 4
215
+ config["vocab_size"] = 1000
216
+
217
+ model = VortexModel(config)
218
+
219
+ batch_size = 2
220
+ seq_len = 32
221
+ input_ids = torch.randint(0, config["vocab_size"], (batch_size, seq_len))
222
+
223
+ # Forward pass
224
+ output = model(input_ids)
225
+ logits = output["logits"]
226
+ assert logits.shape == (batch_size, seq_len, config["vocab_size"])
227
+
228
+ # Count parameters
229
+ num_params = model.get_num_params()
230
+ assert num_params > 0
231
+
232
+ print(f"✓ VortexModel test passed (params: {num_params:,})")
233
+
234
+
235
+ def test_quality_filter():
236
+ """Test ScienceQualityFilter."""
237
+ from data.quality_filter import ScienceQualityFilter
238
+
239
+ quality_filter = ScienceQualityFilter()
240
+
241
+ # Good text
242
+ good_text = """
243
+ The experiment collected data from 100 participants. Results show a
244
+ significant effect (p < 0.05). The equation E = mc^2 is fundamental.
245
+ According to Smith et al., this confirms the hypothesis.
246
+ """
247
+ assert quality_filter.filter(good_text)
248
+
249
+ # Bad: too short
250
+ assert not quality_filter.filter("Too short.")
251
+
252
+ # Bad: unmatched equations
253
+ bad_eq = "Equation $E = mc^2 and another $F = ma."
254
+ assert not quality_filter.filter(bad_eq)
255
+
256
+ print("✓ Quality filter test passed")
257
+
258
+
259
+ def test_domain_classifier():
260
+ """Test DomainClassifier."""
261
+ from data.domain_classifier import DomainClassifier
262
+
263
+ d_model = 256
264
+ classifier = DomainClassifier(d_model)
265
+
266
+ # Test with random hidden states
267
+ batch_size = 4
268
+ seq_len = 32
269
+ hidden = torch.randn(batch_size, seq_len, d_model)
270
+ logits = classifier(hidden)
271
+ assert logits.shape == (batch_size, 7)
272
+
273
+ # Test text classification
274
+ text = "Quantum mechanics describes particle behavior."
275
+ domain, conf = classifier.classify_text(text)
276
+ assert domain in range(7)
277
+ assert 0 <= conf <= 1
278
+
279
+ print("✓ Domain classifier test passed")
280
+
281
+
282
+ def test_deduplication():
283
+ """Test MinHashLSH."""
284
+ from data.deduplication import MinHashLSH
285
+
286
+ lsh = MinHashLSH(num_permutations=32, threshold=0.7, bands=4, rows_per_band=8)
287
+
288
+ docs = [
289
+ ("doc1", "The quick brown fox jumps over the lazy dog."),
290
+ ("doc2", "The quick brown fox jumps over the lazy dog!!!"),
291
+ ("doc3", "Completely different text about science."),
292
+ ]
293
+
294
+ for doc_id, text in docs:
295
+ lsh.add_document(doc_id, text)
296
+
297
+ # Query similar
298
+ results = lsh.query(docs[0][1])
299
+ # Should find doc2 as similar
300
+ assert len(results) >= 1
301
+ assert any(r[0] == "doc2" for r in results)
302
+
303
+ print("✓ Deduplication test passed")
304
+
305
+
306
+ def test_losses():
307
+ """Test VortexLoss."""
308
+ from training.losses import VortexLoss
309
+
310
+ config = {"loss_weights": {
311
+ "lm_loss": 1.0,
312
+ "equation_loss": 0.3,
313
+ "domain_loss": 0.1,
314
+ "citation_loss": 0.1,
315
+ "numerical_loss": 0.2,
316
+ }}
317
+
318
+ loss_fn = VortexLoss(config)
319
+
320
+ batch_size = 2
321
+ seq_len = 32
322
+ vocab_size = 1000
323
+
324
+ logits = torch.randn(batch_size, seq_len, vocab_size)
325
+ labels = torch.randint(0, vocab_size, (batch_size, seq_len))
326
+
327
+ losses = loss_fn(logits, labels)
328
+ assert "total_loss" in losses
329
+ assert "lm_loss" in losses
330
+ assert losses["total_loss"].item() > 0
331
+
332
+ print("✓ Losses test passed")
333
+
334
+
335
+ def test_curriculum():
336
+ """Test CurriculumScheduler."""
337
+ from training.curriculum import CurriculumScheduler
338
+
339
+ config = {
340
+ "curriculum_stages": [
341
+ {"name": "foundation", "start": 0.0, "end": 0.2},
342
+ {"name": "domain", "start": 0.2, "end": 0.5},
343
+ {"name": "reasoning", "start": 0.5, "end": 0.8},
344
+ {"name": "integration", "start": 0.8, "end": 1.0},
345
+ ]
346
+ }
347
+
348
+ total_steps = 1000
349
+ scheduler = CurriculumScheduler(config, total_steps)
350
+
351
+ # Test stage at different steps
352
+ assert scheduler.get_stage_name(0) == "foundation"
353
+ assert scheduler.get_stage_name(250) == "domain"
354
+ assert scheduler.get_stage_name(500) == "reasoning"
355
+ assert scheduler.get_stage_name(800) == "integration"
356
+
357
+ # Test sampler
358
+ weights = scheduler.get_dataset_sampler(100)
359
+ assert isinstance(weights, dict)
360
+ assert abs(sum(weights.values()) - 1.0) < 1e-6  # tolerate float rounding
361
+
362
+ print("✓ Curriculum test passed")
363
+
364
+
365
+ def test_hf_integration():
366
+ """Test HuggingFace integration."""
367
+ from configuration_vortex import VortexConfig
368
+ from modeling_vortex import VortexForCausalLM
369
+ from tokenization_vortex import VortexTokenizer
370
+
371
+ # Config
372
+ config = VortexConfig(
373
+ d_model=128,
374
+ num_layers=2,
375
+ num_heads=4,
376
+ vocab_size=100,
377
+ )
378
+
379
+ # Model
380
+ model = VortexForCausalLM(config)
381
+ batch_size = 2
382
+ seq_len = 16
383
+ input_ids = torch.randint(0, 100, (batch_size, seq_len))
384
+
385
+ outputs = model(input_ids)
386
+ assert outputs.logits.shape == (batch_size, seq_len, 100)
387
+
388
+ # Save and load
389
+ model.save_pretrained("./test_hf_model")
390
+ config.save_pretrained("./test_hf_model")
391
+
392
+ from transformers import AutoConfig, AutoModelForCausalLM
393
+ loaded_config = AutoConfig.from_pretrained("./test_hf_model")
394
+ loaded_model = AutoModelForCausalLM.from_pretrained("./test_hf_model")
395
+
396
+ assert loaded_config.model_type == "vortex"
397
+ assert isinstance(loaded_model, VortexForCausalLM)
398
+
399
+ # Cleanup
400
+ import shutil
401
+ shutil.rmtree("./test_hf_model")
402
+
403
+ print("✓ HuggingFace integration test passed")
404
+
405
+
406
+ def run_all_tests():
407
+ """Run all tests."""
408
+ tests = [
409
+ test_tokenizer,
410
+ test_ssm_layer,
411
+ test_attention_layer,
412
+ test_scigate_ffn,
413
+ test_equation_module,
414
+ test_numerical_module,
415
+ test_citation_module,
416
+ test_molecular_module,
417
+ test_vortex_model,
418
+ test_quality_filter,
419
+ test_domain_classifier,
420
+ test_deduplication,
421
+ test_losses,
422
+ test_curriculum,
423
+ test_hf_integration,
424
+ ]
425
+
426
+ print("Running Vortex unit tests...\n")
427
+ passed = 0
428
+ failed = 0
429
+
430
+ for test in tests:
431
+ try:
432
+ test()
433
+ passed += 1
434
+ except Exception as e:
435
+ print(f"✗ {test.__name__} failed: {e}")
436
+ failed += 1
437
+ import traceback
438
+ traceback.print_exc()
439
+
440
+ print(f"\n{'='*50}")
441
+ print(f"Tests: {passed + failed} total, {passed} passed, {failed} failed")
442
+ print(f"{'='*50}")
443
+
444
+ return failed == 0
445
+
446
+
447
+ if __name__ == "__main__":
448
+ success = run_all_tests()
449
+ sys.exit(0 if success else 1)
tokenization_vortex.py ADDED
@@ -0,0 +1,174 @@
1
+ """
2
+ Vortex tokenizer for HuggingFace.
3
+ Wraps VortexScienceTokenizer for HF compatibility.
4
+ """
5
+
6
+ from typing import List, Optional, Dict, Any
7
+ import json
8
+ import os
9
+
10
+
11
+ class VortexTokenizer:
12
+ """
13
+ HuggingFace-compatible tokenizer for Vortex.
14
+ Wraps VortexScienceTokenizer.
15
+ """
16
+
17
+ def __init__(
18
+ self,
19
+ tokenizer_file: Optional[str] = None,
20
+ config: Optional[Dict] = None,
21
+ **kwargs,
22
+ ):
23
+ """
24
+ Initialize tokenizer.
25
+
26
+ Args:
27
+ tokenizer_file: Path to tokenizer JSON
28
+ config: Tokenizer configuration
29
+ """
30
+ from .tokenizer.vortex_tokenizer import VortexScienceTokenizer
31
+
32
+ self.config = config or {}
33
+ self.special_tokens = self.config.get("special_tokens", {})
34
+
35
+ if tokenizer_file and os.path.exists(tokenizer_file):
36
+ self.tokenizer = VortexScienceTokenizer(
37
+ self.config,
38
+ tokenizer_path=tokenizer_file,
39
+ )
40
+ else:
41
+ # Initialize empty - needs training
42
+ self.tokenizer = VortexScienceTokenizer(self.config)
43
+
44
+ # HF compatibility attributes
45
+ self.pad_token = "[PAD]"
46
+ self.unk_token = "[UNK]"
47
+ self.bos_token = "[BOS]"
48
+ self.eos_token = "[EOS]"
49
+ self.pad_token_id = self.special_tokens.get("[PAD]", 0)
50
+ self.unk_token_id = self.special_tokens.get("[UNK]", 1)
51
+ self.bos_token_id = self.special_tokens.get("[BOS]", 2)
52
+ self.eos_token_id = self.special_tokens.get("[EOS]", 3)
53
+
54
+ @classmethod
55
+ def from_pretrained(
56
+ cls,
57
+ pretrained_model_name_or_path: str,
58
+ **kwargs,
59
+ ):
60
+ """Load tokenizer from pretrained model."""
61
+ tokenizer_path = os.path.join(pretrained_model_name_or_path, "vortex_tokenizer.json")
62
+ config_path = os.path.join(pretrained_model_name_or_path, "tokenizer_config.json")
63
+
64
+ config = {}
65
+ if os.path.exists(config_path):
66
+ with open(config_path, "r") as f:
67
+ config = json.load(f)
68
+
69
+ return cls(tokenizer_file=tokenizer_path, config=config, **kwargs)
70
+
71
+ def __call__(
72
+ self,
73
+ text: str | List[str],
74
+ padding: bool = False,
75
+ truncation: bool = False,
76
+ max_length: Optional[int] = None,
77
+ return_tensors: Optional[str] = "pt",
78
+ **kwargs,
79
+ ) -> Dict[str, Any]:
80
+ """
81
+ Tokenize text.
82
+
83
+ Args:
84
+ text: Input text or list of texts
85
+ padding: Pad to same length
86
+ truncation: Truncate to max_length
87
+ max_length: Maximum length
88
+ return_tensors: "pt" for PyTorch, "np" for numpy, None for list
89
+
90
+ Returns:
91
+ Dictionary with input_ids, attention_mask
92
+ """
93
+ if isinstance(text, str):
94
+ text = [text]
95
+
96
+ if max_length is None:
97
+ max_length = self.config.get("max_seq_len", 16384)
98
+
99
+ # Use batch_encode
100
+ result = self.tokenizer.batch_encode(
101
+ text,
102
+ padding=padding,
103
+ truncation=truncation,
104
+ max_length=max_length,
105
+ return_tensors=return_tensors,
106
+ )
107
+
108
+ return result
109
+
110
+ def encode(
111
+ self,
112
+ text: str,
113
+ add_special_tokens: bool = True,
114
+ **kwargs,
115
+ ) -> List[int]:
116
+ """Encode text to token IDs."""
117
+ result = self.tokenizer.encode(
118
+ text,
119
+ add_special_tokens=add_special_tokens,
120
+ return_tensors=None,
121
+ )
122
+ return result["input_ids"]
123
+
124
+ def decode(
125
+ self,
126
+ token_ids: List[int],
127
+ skip_special_tokens: bool = True,
128
+ **kwargs,
129
+ ) -> str:
130
+ """Decode token IDs to text."""
131
+ return self.tokenizer.decode(token_ids, skip_special_tokens=skip_special_tokens)
132
+
133
+ def save_pretrained(self, save_directory: str):
134
+ """Save tokenizer to directory."""
135
+ os.makedirs(save_directory, exist_ok=True)
136
+ tokenizer_path = os.path.join(save_directory, "vortex_tokenizer.json")
137
+ self.tokenizer.save(tokenizer_path)
138
+
139
+ # Save tokenizer config
140
+ config_path = os.path.join(save_directory, "tokenizer_config.json")
141
+ with open(config_path, "w") as f:
142
+ json.dump({
143
+ "model_type": "vortex",
144
+ "special_tokens": self.special_tokens,
145
+ }, f, indent=2)
146
+
147
+ @property
148
+ def vocab_size(self) -> int:
149
+ """Get vocabulary size."""
150
+ return self.tokenizer.vocab_size
151
+
152
+ def get_vocab(self) -> Dict[str, int]:
153
+ """Get vocabulary dictionary."""
154
+ return self.tokenizer.get_vocab()
155
+
156
+
157
+ def test_vortex_tokenizer():
158
+ """Test VortexTokenizer."""
159
+ from configs.vortex_7b_config import VORTEX_7B_CONFIG
160
+
161
+ tokenizer = VortexTokenizer(config=VORTEX_7B_CONFIG)
162
+
163
+ text = "The equation is $E = mc^2$ and the reaction is H2O."
164
+ encoded = tokenizer(text, padding=False, truncation=True, max_length=128)
165
+ print(f"Encoded: {encoded['input_ids'][0][:10]}...")
166
+
167
+ decoded = tokenizer.decode(encoded["input_ids"][0])
168
+ print(f"Decoded: {decoded[:50]}...")
169
+
170
+ print("VortexTokenizer test passed!")
171
+
172
+
173
+ if __name__ == "__main__":
174
+ test_vortex_tokenizer()
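The `save_pretrained()`/`from_pretrained()` pair above round-trips a small `tokenizer_config.json` alongside the tokenizer file. A stdlib-only sketch of that round-trip, using an assumed special-token mapping and a temporary directory in place of a real model repo:

```python
# Sketch of the tokenizer_config.json round-trip in save_pretrained()/
# from_pretrained(). The special-token ids here are illustrative defaults.
import json
import os
import tempfile

special_tokens = {"[PAD]": 0, "[UNK]": 1, "[BOS]": 2, "[EOS]": 3}

with tempfile.TemporaryDirectory() as save_directory:
    # save_pretrained(): write the config next to the tokenizer file
    config_path = os.path.join(save_directory, "tokenizer_config.json")
    with open(config_path, "w") as f:
        json.dump({"model_type": "vortex", "special_tokens": special_tokens}, f, indent=2)

    # from_pretrained(): read the config back only if it exists
    loaded = {}
    if os.path.exists(config_path):
        with open(config_path) as f:
            loaded = json.load(f)

print(loaded["model_type"], loaded["special_tokens"]["[PAD]"])
```

Guarding the read with `os.path.exists` mirrors `from_pretrained()`, which falls back to an empty config when the file is absent.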
tokenizer/__pycache__/vortex_tokenizer.cpython-313.pyc ADDED
Binary file (17.4 kB).
 
tokenizer/vortex_tokenizer.py ADDED
@@ -0,0 +1,442 @@
1
+ """
2
+ VortexScienceTokenizer: A custom BPE tokenizer optimized for scientific text.
3
+ Trains on science corpus and extends vocabulary with domain-specific tokens.
4
+ """
5
+
6
+ import os
7
+ import json
8
+ import re
9
+ from pathlib import Path
10
+ from typing import List, Dict, Optional, Tuple, Union
11
+ import torch
12
+
13
+ try:
14
+ from tokenizers import Tokenizer, models, pre_tokenizers, processors, trainers
15
+ from tokenizers.normalizers import Lowercase, NFD, StripAccents
16
+ except ImportError:
17
+ print("Please install tokenizers: pip install tokenizers")
18
+ raise
19
+
20
+
21
+ class VortexScienceTokenizer:
22
+ """
23
+ Science-optimized BPE tokenizer with domain extensions.
24
+
25
+ Features:
26
+ - Base BPE vocabulary (40,000 tokens) trained on scientific corpus
27
+ - Extended science vocabulary (10,000 tokens) for LaTeX, chemistry, units, etc.
28
+ - Special tokens for equation/citation/molecule spans
29
+ - Domain tags for science areas
30
+ - Digit-level number handling (optional, can be toggled)
31
+ """
32
+
33
+ def __init__(
34
+ self,
35
+ config: Dict,
36
+ tokenizer_path: Optional[str] = None,
37
+ vocab_size: int = 50000,
38
+ base_vocab_size: int = 40000,
39
+ extension_vocab_size: int = 10000,
40
+ ):
41
+ """
42
+ Initialize the tokenizer.
43
+
44
+ Args:
45
+ config: Model configuration with special tokens
46
+ tokenizer_path: Path to pre-trained tokenizer (if loading)
47
+ vocab_size: Total vocabulary size
48
+ base_vocab_size: Size of base BPE vocabulary
49
+ extension_vocab_size: Size of science extension vocabulary
50
+ """
51
+ self.config = config
52
+ self.base_vocab_size = base_vocab_size
53
+ self.extension_vocab_size = extension_vocab_size
54
+ self._vocab_size = vocab_size
55
+
56
+ self.special_tokens = config.get("special_tokens", {})
57
+ self.domain_tags = config.get("domain_tags", [])
58
+
59
+ if tokenizer_path and os.path.exists(tokenizer_path):
60
+ self.tokenizer = Tokenizer.from_file(tokenizer_path)
61
+ print(f"Loaded tokenizer from {tokenizer_path}")
62
+ else:
63
+ # Initialize empty BPE tokenizer
64
+ self.tokenizer = Tokenizer(models.BPE())
65
+ self._setup_pre_tokenizer()
66
+ print("Initialized empty BPE tokenizer")
67
+
68
+ def _setup_pre_tokenizer(self):
69
+ """Configure pre-tokenization rules."""
70
+ # Use byte-level pre-tokenization for robustness
71
+ self.tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
72
+ self.tokenizer.normalizer = None # Keep original casing for science terms
73
+
74
+ def train(
75
+ self,
76
+ file_paths: List[str],
77
+ min_frequency: int = 2,
78
+ special_tokens: Optional[List[str]] = None,
79
+ ):
80
+ """
81
+ Train the BPE tokenizer on scientific text files.
82
+
83
+ Args:
84
+ file_paths: List of text file paths for training
85
+ min_frequency: Minimum token frequency to keep
86
+ special_tokens: Additional special tokens to add
87
+ """
88
+ if special_tokens is None:
89
+ special_tokens = list(self.special_tokens.keys()) + self.domain_tags
90
+
91
+ print(f"Training tokenizer on {len(file_paths)} files...")
92
+ print(f"Base vocab size: {self.base_vocab_size}")
93
+ print(f"Special tokens: {special_tokens}")
94
+
95
+ trainer = trainers.BpeTrainer(
96
+ vocab_size=self.base_vocab_size,
97
+ min_frequency=min_frequency,
98
+ special_tokens=special_tokens,
99
+ show_progress=True,
100
+ )
101
+
102
+ self.tokenizer.train(file_paths, trainer=trainer)
103
+ print(f"Training complete. Vocabulary size: {self.tokenizer.get_vocab_size()}")
104
+
105
+ # Extend with science-specific tokens
106
+ self._extend_science_vocabulary()
107
+
108
+ def _extend_science_vocabulary(self):
109
+ """Add science-specific tokens to the vocabulary."""
110
+ current_vocab = self.tokenizer.get_vocab()
111
+ new_tokens = []
112
+
113
+ # LaTeX math symbols (common ones)
114
+ latex_symbols = [
115
+ "\\alpha", "\\beta", "\\gamma", "\\delta", "\\epsilon", "\\zeta",
116
+ "\\eta", "\\theta", "\\iota", "\\kappa", "\\lambda", "\\mu",
117
+ "\\nu", "\\xi", "\\pi", "\\rho", "\\sigma", "\\tau",
118
+ "\\upsilon", "\\phi", "\\chi", "\\psi", "\\omega",
119
+ "\\Gamma", "\\Delta", "\\Theta", "\\Lambda", "\\Xi", "\\Pi",
120
+ "\\Sigma", "\\Phi", "\\Psi", "\\Omega",
121
+ "\\sum", "\\prod", "\\int", "\\partial", "\\nabla", "\\infty",
122
+ "\\leq", "\\geq", "\\neq", "\\approx", "\\equiv", "\\sim",
123
+ "\\in", "\\notin", "\\subset", "\\supset", "\\cup", "\\cap",
124
+ "\\forall", "\\exists", "\\neg", "\\land", "\\lor", "\\rightarrow",
125
+ "\\leftarrow", "\\Rightarrow", "\\Leftarrow", "\\leftrightarrow",
126
+ "\\frac", "\\sqrt", "\\binom", "\\begin", "\\end", "\\mathbf",
127
+ "\\mathcal", "\\mathrm", "\\mathbb", "\\mathfrak",
128
+ ]
129
+ new_tokens.extend(latex_symbols)
130
+
131
+ # Greek letters (Unicode)
132
+ greek_letters = [
133
+ "α", "β", "γ", "δ", "ε", "ζ", "η", "θ", "ι", "κ", "λ", "μ",
134
+ "ν", "ξ", "ο", "π", "ρ", "σ", "τ", "υ", "φ", "χ", "ψ", "ω",
135
+ "Γ", "Δ", "Θ", "Λ", "Ξ", "Π", "Σ", "Φ", "Ψ", "Ω",
136
+ ]
137
+ new_tokens.extend(greek_letters)
138
+
139
+ # SI units and derived units
140
+ si_units = [
141
+ "m", "kg", "s", "mol", "K", "A", "cd", "mol",
142
+ "Hz", "N", "Pa", "J", "W", "C", "V", "F", "Ω", "S",
143
+ "Wb", "T", "H", "lm", "lx", "Bq", "Gy", "Sv", "kat",
144
+ "eV", "u", "Da", "Å", "°C", "%", "‰",
145
+ "M", "mM", "μM", "nM", "pM",
146
+ "g", "mg", "μg", "ng", "pg",
147
+ "km", "m", "cm", "mm", "μm", "nm", "pm",
148
+ "L", "mL", "μL", "nL",
149
+ "h", "min", "s", "ms", "μs", "ns",
150
+ ]
151
+ new_tokens.extend(si_units)
152
+
153
+ # Common scientific abbreviations
154
+ sci_abbrevs = [
155
+ "DNA", "RNA", "mRNA", "tRNA", "rRNA", "cDNA", "gDNA",
156
+ "ATP", "ADP", "AMP", "NAD", "NADP", "FAD", "CoA",
157
+ "pH", "pKa", "pKb", "pI",
158
+ "PCR", "RT", "qPCR", "NGS", "WGS",
159
+ "IC50", "EC50", "KD", "Ki",
160
+ "XRD", "NMR", "IR", "UV", "VIS", "MS", "GC", "HPLC",
161
+ "SEM", "TEM", "AFM", "STM",
162
+ "S/N", "SNR", "RMS", "Std", "Var", "Cov",
163
+ "et al.", "vs.", "cf.", "viz.",
164
+ "Fig", "Eq", "Ref", "Tab", "Suppl",
165
+ ]
166
+ new_tokens.extend(sci_abbrevs)
167
+
168
+ # Chemical element symbols
169
+ elements = [
170
+ "H", "He", "Li", "Be", "B", "C", "N", "O", "F", "Ne",
171
+ "Na", "Mg", "Al", "Si", "P", "S", "Cl", "Ar",
172
+ "K", "Ca", "Sc", "Ti", "V", "Cr", "Mn", "Fe", "Co", "Ni", "Cu", "Zn",
173
+ "Ga", "Ge", "As", "Se", "Br", "Kr",
174
+ "Rb", "Sr", "Y", "Zr", "Nb", "Mo", "Tc", "Ru", "Rh", "Pd", "Ag", "Cd",
175
+ "In", "Sn", "Sb", "Te", "I", "Xe",
176
+ "Cs", "Ba", "La", "Ce", "Pr", "Nd", "Pm", "Sm", "Eu", "Gd", "Tb",
177
+ "Dy", "Ho", "Er", "Tm", "Yb", "Lu",
178
+ "Hf", "Ta", "W", "Re", "Os", "Ir", "Pt", "Au", "Hg", "Tl", "Pb",
179
+ "Bi", "Po", "At", "Rn",
180
+ "Fr", "Ra", "Ac", "Th", "Pa", "U", "Np", "Pu", "Am", "Cm", "Bk",
181
+ "Cf", "Es", "Fm", "Md", "No", "Lr",
182
+ "Rf", "Db", "Sg", "Bh", "Hs", "Mt", "Ds", "Rg", "Cn", "Nh",
183
+ "Fl", "Mc", "Lv", "Ts", "Og",
184
+ ]
185
+ new_tokens.extend(elements)
186
+
187
+ # Amino acid single-letter codes
188
+ amino_acids = ["A", "R", "N", "D", "C", "Q", "E", "G", "H", "I",
189
+ "L", "K", "M", "F", "P", "S", "T", "W", "Y", "V"]
190
+ new_tokens.extend(amino_acids)
191
+
192
+ # Mathematical operators (Unicode)
193
+ math_ops = [
194
+ "±", "∓", "×", "÷", "∈", "∉", "∋", "∏", "∑", "∧", "∨", "¬",
195
+ "≤", "≥", "≠", "≈", "≡", "≅", "≆", "≇", "≉", "≊", "≋",
196
+ "⊂", "⊃", "⊆", "⊇", "⊄", "⊅", "⊈", "⊉",
197
+ "∞", "∂", "∇", "√", "∛", "∜",
198
+ "∫", "∬", "∭", "∮", "∯", "∰",
199
+ "∴", "∵", "∶", "∷", "∼", "∽", "≈", "≋",
200
+ "⟨", "⟩", "|", "‖", "‵", "′", "″", "‴",
201
+ "•", "·", "‣", "⁂", "※", "‼", "⁇", "⁈",
202
+ ]
203
+ new_tokens.extend(math_ops)
204
+
205
+ # Add tokens that aren't already in vocabulary
206
+ for token in new_tokens:
207
+ if token not in current_vocab:
208
+ self.tokenizer.add_tokens([token])
209
+
210
+ print(f"Extended vocabulary with {len(new_tokens)} science tokens")
211
+ print(f"Final vocabulary size: {self.tokenizer.get_vocab_size()}")
212
+
213
+ def save(self, path: str):
214
+ """Save tokenizer to disk."""
215
+ self.tokenizer.save(path)
216
+ print(f"Tokenizer saved to {path}")
217
+
218
+ def encode(
219
+ self,
220
+ text: str,
221
+ add_special_tokens: bool = True,
222
+ return_tensors: str = "pt",
223
+ ) -> Union[Dict, torch.Tensor]:
224
+ """
225
+ Encode text to token IDs.
226
+
227
+ Args:
228
+ text: Input text
229
+ add_special_tokens: Add BOS/EOS tokens
230
+ return_tensors: "pt" for PyTorch tensors, "np" for numpy, None for list
231
+
232
+ Returns:
233
+ Dictionary with input_ids and attention_mask, or tensors/list
234
+ """
235
+ encoding = self.tokenizer.encode(text, add_special_tokens=add_special_tokens)
236
+
237
+ result = {
238
+ "input_ids": encoding.ids,
239
+ "attention_mask": encoding.attention_mask,
240
+ }
241
+
242
+ if return_tensors == "pt":
243
+ result = {k: torch.tensor(v).unsqueeze(0) for k, v in result.items()}
244
+ elif return_tensors == "np":
245
+ import numpy as np
246
+ result = {k: np.array(v) for k, v in result.items()}
247
+
248
+ return result
249
+
250
+ def decode(self, token_ids: List[int], skip_special_tokens: bool = True) -> str:
251
+ """Decode token IDs back to text."""
252
+ return self.tokenizer.decode(token_ids, skip_special_tokens=skip_special_tokens)
253
+
254
+ def batch_encode(
255
+ self,
256
+ texts: List[str],
257
+ padding: bool = True,
258
+ truncation: bool = True,
259
+ max_length: Optional[int] = None,
260
+ return_tensors: str = "pt",
261
+ ) -> Dict:
262
+ """
263
+ Encode a batch of texts.
264
+
265
+ Args:
266
+ texts: List of input texts
267
+ padding: Pad to same length
268
+ truncation: Truncate to max_length
269
+ max_length: Maximum sequence length
270
+ return_tensors: Tensor format
271
+
272
+ Returns:
273
+ Batch encoded dictionary
274
+ """
275
+ if max_length is None:
276
+ max_length = self.config.get("max_seq_len", 16384)
277
+
278
+ encodings = self.tokenizer.encode_batch(
279
+ texts,
280
+ add_special_tokens=True,
281
+ )
282
+
283
+ # Manual padding/truncation
284
+ input_ids = []
285
+ attention_masks = []
286
+
287
+ for enc in encodings:
288
+ ids = enc.ids
289
+ mask = enc.attention_mask
290
+
291
+ if truncation and len(ids) > max_length:
292
+ ids = ids[:max_length]
293
+ mask = mask[:max_length]
294
+
295
+ input_ids.append(ids)
296
+ attention_masks.append(mask)
297
+
298
+ # Pad to same length if requested
299
+ if padding:
300
+ max_len = max(len(ids) for ids in input_ids)
301
+ padded_ids = []
302
+ padded_masks = []
303
+
304
+ for ids, mask in zip(input_ids, attention_masks):
305
+ pad_len = max_len - len(ids)
306
+ padded_ids.append(ids + [self.special_tokens["[PAD]"]] * pad_len)
307
+ padded_masks.append(mask + [0] * pad_len)
308
+
309
+ input_ids = padded_ids
310
+ attention_masks = padded_masks
311
+
312
+ result = {
313
+ "input_ids": input_ids,
314
+ "attention_mask": attention_masks,
315
+ }
316
+
317
+ if return_tensors == "pt":
318
+ result = {k: torch.tensor(v) for k, v in result.items()}
319
+
320
+ return result
321
+
322
+ @property
323
+ def vocab_size(self) -> int:
324
+ """Get vocabulary size."""
325
+ return self.tokenizer.get_vocab_size()
326
+
327
+ def get_vocab(self) -> Dict[str, int]:
328
+ """Get vocabulary dictionary."""
329
+ return self.tokenizer.get_vocab()
330
+
331
+ def token_to_id(self, token: str) -> int:
332
+ """Convert token to ID."""
333
+ return self.tokenizer.token_to_id(token)
334
+
335
+ def id_to_token(self, token_id: int) -> str:
336
+ """Convert ID to token."""
337
+ return self.tokenizer.id_to_token(token_id)
338
+
339
+
340
+ def build_science_vocabulary_file(output_path: str):
341
+ """
342
+ Build a science vocabulary text file for BPE training.
343
+ This file contains seed vocabulary terms to ensure science tokens are present.
344
+ """
345
+ science_terms = []
346
+
347
+ # LaTeX commands
348
+ latex_terms = [
349
+ "\\alpha", "\\beta", "\\gamma", "\\delta", "\\epsilon", "\\zeta",
350
+ "\\eta", "\\theta", "\\iota", "\\kappa", "\\lambda", "\\mu",
351
+ "\\nu", "\\xi", "\\pi", "\\rho", "\\sigma", "\\tau",
352
+ "\\upsilon", "\\phi", "\\chi", "\\psi", "\\omega",
353
+ "\\sum", "\\prod", "\\int", "\\partial", "\\nabla", "\\infty",
354
+ "\\frac", "\\sqrt", "\\binom", "\\begin", "\\end",
355
+ "\\mathbf", "\\mathcal", "\\mathrm", "\\mathbb",
356
+ "\\in", "\\subset", "\\cup", "\\cap", "\\forall", "\\exists",
357
+ "\\rightarrow", "\\leftarrow", "\\Rightarrow", "\\Leftarrow",
358
+ "\\leq", "\\geq", "\\neq", "\\approx", "\\equiv",
359
+ ]
360
+ science_terms.extend(latex_terms)
361
+
362
+ # Chemical formulas
363
+ chem_formulas = [
364
+ "H2O", "CO2", "O2", "N2", "H2", "CH4", "C2H6", "C3H8",
365
+ "C6H12O6", "C12H22O11", "HCl", "H2SO4", "HNO3", "H3PO4",
366
+ "NaOH", "KOH", "CaCO3", "NaCl", "KCl", "MgCl2",
367
+ "Fe2O3", "Fe3O4", "CuO", "Cu2O", "ZnO", "Al2O3",
368
+ "SiO2", "TiO2", "MnO2", "NH3", "NO", "NO2", "N2O",
369
+ "SO2", "SO3", "CO", "CH3COOH", "C2H5OH",
370
+ ]
371
+ science_terms.extend(chem_formulas)
372
+
373
+ # Mathematical expressions
374
+ math_exprs = [
375
+ "x^2", "x^3", "e^x", "ln(x)", "log(x)", "sin(x)", "cos(x)",
376
+ "tan(x)", "arcsin(x)", "arccos(x)", "arctan(x)",
377
+ "f(x)", "g(x)", "h(x)", "F(x)", "G(x)",
378
+ "dx", "dy", "dz", "dt", "∂x", "∂y", "∂z",
379
+ "∫", "∬", "∭", "∮", "∑_{i=1}^{n}", "∏_{i=1}^{n}",
380
+ ]
381
+ science_terms.extend(math_exprs)
382
+
383
+ # Units with numbers
384
+ unit_exprs = [
385
+ "10^6", "10^9", "10^12", "10^15", "10^18",
386
+ "10^-3", "10^-6", "10^-9", "10^-12",
387
+ "m/s", "km/h", "cm/s", "mm/s",
388
+ "J/mol", "kJ/mol", "cal", "kcal",
389
+ "eV", "MeV", "GeV", "TeV",
390
+ "Hz", "kHz", "MHz", "GHz",
391
+ "Pa", "kPa", "MPa", "GPa",
392
+ "°C", "K", "°F",
393
+ ]
394
+ science_terms.extend(unit_exprs)
395
+
396
+ # Write to file
397
+ with open(output_path, "w", encoding="utf-8") as f:
398
+ for term in science_terms:
399
+ f.write(term + "\n")
400
+
401
+ print(f"Science vocabulary seed file written to {output_path}")
402
+ print(f"Total seed terms: {len(science_terms)}")
403
+
404
+
405
+ if __name__ == "__main__":
406
+ # Example usage
407
+ import sys
408
+
409
+ if len(sys.argv) < 2:
410
+ print("Usage: python vortex_tokenizer.py <train_data.txt> [output_dir]")
411
+ sys.exit(1)
412
+
413
+ train_data = sys.argv[1]
414
+ output_dir = sys.argv[2] if len(sys.argv) > 2 else "."
415
+
416
+ # Load config (simplified for standalone)
417
+ config = {
418
+ "special_tokens": {
419
+ "[PAD]": 0, "[UNK]": 1, "[BOS]": 2, "[EOS]": 3,
420
+ "[EQUATION]": 4, "[/EQUATION]": 5,
421
+ "[CITATION]": 6, "[/CITATION]": 7,
422
+ "[MOLECULE]": 8, "[/MOLECULE]": 9,
423
+ "[FIGURE]": 10, "[TABLE]": 11,
424
+ "[MATH]": 12, "[CHEM]": 13, "[BIO]": 14,
425
+ "[PHYS]": 15, "[EARTH]": 16, "[SPACE]": 17, "[ZOO]": 18,
426
+ },
427
+ "domain_tags": ["[MATH]", "[CHEM]", "[BIO]", "[PHYS]", "[EARTH]", "[SPACE]", "[ZOO]"],
428
+ "max_seq_len": 16384,
429
+ }
430
+
431
+ # Build seed vocabulary
432
+ seed_vocab_path = os.path.join(output_dir, "science_seed_vocab.txt")
433
+ build_science_vocabulary_file(seed_vocab_path)
434
+
435
+ # Initialize and train tokenizer
436
+ tokenizer = VortexScienceTokenizer(config)
437
+ tokenizer.train([train_data])
438
+
439
+ # Save tokenizer
440
+ tokenizer_path = os.path.join(output_dir, "vortex_tokenizer.json")
441
+ tokenizer.save(tokenizer_path)
442
+ print(f"Tokenizer saved to {tokenizer_path}")
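The pad-to-longest batching implemented above can be sketched in isolation. This standalone helper is illustrative, not the tokenizer's actual method; it assumes pad id 0 (the `[PAD]` entry in the special-token table) and takes already-tokenized id lists:

```python
def pad_batch(sequences, pad_id=0):
    """Pad token-id lists to the longest sequence in the batch.

    The attention mask is 1 for real tokens and 0 for padding,
    mirroring the masks built in the tokenizer's batch encoder.
    """
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_masks = [], []
    for seq in sequences:
        pad_len = max_len - len(seq)
        input_ids.append(seq + [pad_id] * pad_len)
        attention_masks.append([1] * len(seq) + [0] * pad_len)
    return {"input_ids": input_ids, "attention_mask": attention_masks}

# Two sequences of different lengths; the shorter one is right-padded.
batch = pad_batch([[2, 17, 3], [2, 3]])
```

Right-padding keeps position 0 aligned across the batch, which is what the causal-LM label shift in the loss expects.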
train.py ADDED
@@ -0,0 +1,146 @@
+ #!/usr/bin/env python3
+ """
+ Main training entry point for Vortex models.
+ """
+
+ import argparse
+ import sys
+ from pathlib import Path
+
+ import torch
+
+ from configs.vortex_7b_config import VORTEX_7B_CONFIG
+ from configs.vortex_13b_config import VORTEX_13B_CONFIG
+ from configs.training_config import TRAINING_CONFIG, TRAINING_CONFIG_7B_CUDA, TRAINING_CONFIG_13B_CUDA, TRAINING_CONFIG_MPS
+
+ from models.vortex_model import VortexModel
+ from tokenizer.vortex_tokenizer import VortexScienceTokenizer
+ from training.trainer import VortexTrainer, VortexDataset
+
+
+ def parse_args():
+     parser = argparse.ArgumentParser(description="Train Vortex scientific language model")
+     parser.add_argument("--model_size", type=str, choices=["7b", "13b"], default="7b",
+                         help="Model size to train")
+     parser.add_argument("--device", type=str, default="cuda",
+                         choices=["cuda", "mps", "cpu"],
+                         help="Device to train on")
+     parser.add_argument("--use_mps", action="store_true",
+                         help="Use MPS backend (Apple Silicon)")
+     parser.add_argument("--data_dir", type=str, default="./data/processed",
+                         help="Directory with processed data shards")
+     parser.add_argument("--tokenizer_path", type=str, default=None,
+                         help="Path to pretrained tokenizer")
+     parser.add_argument("--resume_from_checkpoint", type=str, default=None,
+                         help="Resume training from checkpoint")
+     parser.add_argument("--output_dir", type=str, default="./checkpoints",
+                         help="Output directory for checkpoints")
+     parser.add_argument("--max_steps", type=int, default=None,
+                         help="Override max training steps")
+     parser.add_argument("--micro_batch_size", type=int, default=None,
+                         help="Override micro batch size")
+     parser.add_argument("--quantization", type=str, choices=[None, "int8", "int4"], default=None,
+                         help="Quantization for 13B on 8GB")
+     return parser.parse_args()
+
+
+ def main():
+     args = parse_args()
+
+     # Load configs
+     if args.model_size == "7b":
+         model_config = VORTEX_7B_CONFIG.copy()
+         train_config = TRAINING_CONFIG_7B_CUDA.copy()
+     else:
+         model_config = VORTEX_13B_CONFIG.copy()
+         train_config = TRAINING_CONFIG_13B_CUDA.copy()
+
+     # Override with MPS config if needed
+     if args.use_mps or args.device == "mps":
+         train_config = TRAINING_CONFIG_MPS.copy()
+         train_config["use_mps"] = True
+
+     # Apply overrides
+     if args.max_steps:
+         train_config["max_steps"] = args.max_steps
+     if args.micro_batch_size:
+         train_config["micro_batch_size"] = args.micro_batch_size
+     if args.quantization:
+         train_config["quantization"] = args.quantization
+
+     # Set device
+     device = torch.device(args.device)
+     train_config["device"] = args.device
+
+     print(f"Training Vortex-{args.model_size.upper()}")
+     print(f"Device: {device}")
+     print(f"Max steps: {train_config['max_steps']}")
+     print(f"Micro batch size: {train_config['micro_batch_size']}")
+
+     # Create tokenizer
+     print("Loading tokenizer...")
+     tokenizer = VortexScienceTokenizer(
+         model_config,
+         tokenizer_path=args.tokenizer_path,
+     )
+     print(f"Tokenizer vocab size: {tokenizer.vocab_size}")
+
+     # Create model
+     print("Creating model...")
+     model = VortexModel(model_config)
+     print(f"Model parameters: {model.get_num_params():,}")
+
+     # Estimate memory
+     mem = model.estimate_memory_usage(
+         train_config["micro_batch_size"],
+         model_config["max_seq_len"],
+     )
+     print("Memory estimate:")
+     for k, v in mem.items():
+         print(f"  {k}: {v:.2f} GB")
+
+     # Load dataset
+     print("Loading dataset...")
+     data_dir = Path(args.data_dir)
+     shard_files = sorted(data_dir.glob("train_*.parquet"))
+     if not shard_files:
+         print(f"No training shards found in {data_dir}")
+         print("Please run data pipeline first.")
+         sys.exit(1)
+
+     train_dataset = VortexDataset(
+         shard_files,
+         tokenizer,
+         max_seq_len=model_config["max_seq_len"],
+     )
+     print(f"Training dataset size: {len(train_dataset)} samples")
+
+     # Create eval dataset (use first shard only)
+     eval_shard_files = shard_files[:1]
+     eval_dataset = VortexDataset(
+         eval_shard_files,
+         tokenizer,
+         max_seq_len=model_config["max_seq_len"],
+     )
+
+     # Create trainer
+     trainer = VortexTrainer(
+         model=model,
+         tokenizer=tokenizer,
+         train_dataset=train_dataset,
+         config=train_config,
+         eval_dataset=eval_dataset,
+     )
+
+     # Resume from checkpoint if specified
+     if args.resume_from_checkpoint:
+         trainer.load_checkpoint(args.resume_from_checkpoint)
+
+     # Train
+     trainer.train()
+
+     print("Training complete!")
+
+
+ if __name__ == "__main__":
+     main()
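The CLI-override pattern in `main()` (pick a preset config, then let each flag that was actually given win) boils down to a small merge. A minimal sketch with hypothetical keys, using an explicit `is not None` check so that legitimate falsy values like `0` are not silently dropped (the script above uses plain truthiness, which behaves the same for these particular flags):

```python
def apply_overrides(base, overrides):
    """Return a copy of base with only the non-None override values applied."""
    merged = dict(base)
    for key, value in overrides.items():
        if value is not None:  # flag was given on the command line
            merged[key] = value
    return merged

# max_steps is overridden; micro_batch_size keeps the preset value.
cfg = apply_overrides(
    {"max_steps": 100000, "micro_batch_size": 4},
    {"max_steps": 10, "micro_batch_size": None},
)
```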
training/__pycache__/curriculum.cpython-313.pyc ADDED
Binary file (5 kB).
 
training/__pycache__/losses.cpython-313.pyc ADDED
Binary file (5.84 kB).
 
training/curriculum.py ADDED
@@ -0,0 +1,175 @@
+ """
+ Curriculum learning for Vortex model.
+ Progresses through stages: Foundation → Domain → Reasoning → Integration.
+ """
+
+ from typing import Dict, Optional
+
+
+ class CurriculumScheduler:
+     """
+     Schedules curriculum stages during training.
+     Each stage has a start and end fraction of total training steps.
+     """
+
+     STAGES = ["foundation", "domain", "reasoning", "integration"]
+
+     def __init__(
+         self,
+         config: Dict,
+         total_steps: int,
+     ):
+         """
+         Initialize curriculum scheduler.
+
+         Args:
+             config: Training config with curriculum_stages
+             total_steps: Total number of training steps
+         """
+         self.config = config
+         self.total_steps = total_steps
+         self.stages = config.get("curriculum_stages", [
+             {"name": "foundation", "start": 0.0, "end": 0.2},
+             {"name": "domain", "start": 0.2, "end": 0.5},
+             {"name": "reasoning", "start": 0.5, "end": 0.8},
+             {"name": "integration", "start": 0.8, "end": 1.0},
+         ])
+
+         # Convert fractions to step numbers
+         for stage in self.stages:
+             stage["start_step"] = int(stage["start"] * total_steps)
+             stage["end_step"] = int(stage["end"] * total_steps)
+
+     def get_stage(
+         self,
+         current_step: int,
+     ) -> Optional[Dict]:
+         """
+         Get the current curriculum stage.
+
+         Args:
+             current_step: Current training step
+
+         Returns:
+             Stage dictionary, or None if training is complete
+         """
+         for stage in self.stages:
+             if stage["start_step"] <= current_step < stage["end_step"]:
+                 return stage
+         return None
+
+     def get_stage_name(self, current_step: int) -> str:
+         """Get current stage name."""
+         stage = self.get_stage(current_step)
+         return stage["name"] if stage else "complete"
+
+     def get_stage_weight(
+         self,
+         current_step: int,
+         base_weight: float,
+         component: str = "lm_loss",
+     ) -> float:
+         """
+         Get the weight for a loss component based on the current stage.
+
+         Args:
+             current_step: Current training step
+             base_weight: Base weight for the component
+             component: Loss component name (e.g. "equation_loss")
+
+         Returns:
+             base_weight if the component is active in the current stage, else 0.0
+         """
+         stage = self.get_stage(current_step)
+         if not stage:
+             return 0.0
+
+         stage_name = stage["name"]
+
+         # Components active in each stage
+         stage_components = {
+             "foundation": ["lm_loss"],  # Only language modeling
+             "domain": ["lm_loss", "equation_loss", "domain_loss"],
+             "reasoning": ["lm_loss", "equation_loss", "domain_loss", "citation_loss"],
+             "integration": ["lm_loss", "equation_loss", "domain_loss", "citation_loss", "numerical_loss"],
+         }
+
+         active_components = stage_components.get(stage_name, ["lm_loss"])
+
+         # Zero out the weight for components that are not yet active
+         return base_weight if component in active_components else 0.0
+
+     def get_dataset_sampler(
+         self,
+         current_step: int,
+     ):
+         """
+         Get dataset mixing weights for the current stage.
+         Different stages mix datasets differently.
+
+         Returns:
+             Sampler weights for the different datasets
+         """
+         stage = self.get_stage(current_step)
+         if not stage:
+             return None
+
+         stage_name = stage["name"]
+
+         # Dataset mixing proportions per stage
+         mixing_proportions = {
+             "foundation": {
+                 "pile_scientific": 0.3,
+                 "s2orc": 0.3,
+                 "automath": 0.2,
+                 "pubmed_qa": 0.2,
+             },
+             "domain": {
+                 "pile_scientific": 0.2,
+                 "s2orc": 0.2,
+                 "automath": 0.2,
+                 "pubmed_qa": 0.2,
+                 "deepmind_math": 0.2,
+             },
+             "reasoning": {
+                 "pile_scientific": 0.15,
+                 "s2orc": 0.15,
+                 "automath": 0.3,
+                 "deepmind_math": 0.3,
+                 "pubmed_qa": 0.1,
+             },
+             "integration": {
+                 "pile_scientific": 0.2,
+                 "s2orc": 0.2,
+                 "automath": 0.2,
+                 "deepmind_math": 0.2,
+                 "pubmed_qa": 0.2,
+             },
+         }
+
+         return mixing_proportions.get(stage_name, {"pile_scientific": 1.0})
+
+
+ def test_curriculum():
+     """Test curriculum scheduler."""
+     config = {
+         "curriculum_stages": [
+             {"name": "foundation", "start": 0.0, "end": 0.2},
+             {"name": "domain", "start": 0.2, "end": 0.5},
+             {"name": "reasoning", "start": 0.5, "end": 0.8},
+             {"name": "integration", "start": 0.8, "end": 1.0},
+         ]
+     }
+
+     total_steps = 1000
+     scheduler = CurriculumScheduler(config, total_steps)
+
+     for step in [0, 100, 250, 500, 750, 999]:
+         name = scheduler.get_stage_name(step)
+         print(f"Step {step}: {name}")
+
+     print("Curriculum test passed!")
+
+
+ if __name__ == "__main__":
+     test_curriculum()
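The fraction-to-step conversion makes each stage a half-open interval `[start_step, end_step)`, so stage boundaries never overlap and the final step of training still resolves to a stage. A minimal standalone version of that lookup (tuples instead of the scheduler's dicts):

```python
def stage_at(step, total_steps, stages):
    """Return the name of the stage whose half-open [start, end) interval contains step."""
    for name, start_frac, end_frac in stages:
        start_step = int(start_frac * total_steps)
        end_step = int(end_frac * total_steps)
        if start_step <= step < end_step:
            return name
    return "complete"  # past the last stage

# Same fractions as the default curriculum_stages.
STAGES = [
    ("foundation", 0.0, 0.2),
    ("domain", 0.2, 0.5),
    ("reasoning", 0.5, 0.8),
    ("integration", 0.8, 1.0),
]
```

With 1000 total steps, step 199 is still "foundation" while step 200 already belongs to "domain"; step 1000 falls past every interval and reports "complete".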
training/losses.py ADDED
@@ -0,0 +1,162 @@
+ """
+ Science-aware losses for Vortex model training.
+ Combines standard language modeling with auxiliary tasks.
+ """
+
+ import torch
+ import torch.nn as nn
+ import torch.nn.functional as F
+ from typing import Dict, Optional
+
+
+ class VortexLoss(nn.Module):
+     """
+     Combined loss for Vortex model with science-aware components.
+
+     total_loss = (
+         lm_loss * 1.0
+         + equation_loss * 0.3
+         + domain_loss * 0.1
+         + citation_loss * 0.1
+         + numerical_loss * 0.2
+     )
+     """
+
+     def __init__(self, config: Dict):
+         """
+         Initialize loss.
+
+         Args:
+             config: Training config with loss_weights
+         """
+         super().__init__()
+         self.loss_weights = config.get("loss_weights", {
+             "lm_loss": 1.0,
+             "equation_loss": 0.3,
+             "domain_loss": 0.1,
+             "citation_loss": 0.1,
+             "numerical_loss": 0.2,
+         })
+
+     def forward(
+         self,
+         logits: torch.Tensor,
+         labels: torch.Tensor,
+         equation_module: Optional[nn.Module] = None,
+         equation_mask: Optional[torch.Tensor] = None,
+         domain_logits: Optional[torch.Tensor] = None,
+         domain_labels: Optional[torch.Tensor] = None,
+         citation_module: Optional[nn.Module] = None,
+         citation_mask: Optional[torch.Tensor] = None,
+         citation_confidence: Optional[torch.Tensor] = None,
+         numerical_module: Optional[nn.Module] = None,
+         numerical_mask: Optional[torch.Tensor] = None,
+     ) -> Dict[str, torch.Tensor]:
+         """
+         Compute total loss.
+
+         Args:
+             logits: (batch, seq_len, vocab_size)
+             labels: (batch, seq_len) with token IDs
+             equation_module: EquationModule for equation loss
+             equation_mask: (batch, seq_len), 1 if token is inside an equation
+             domain_logits: (batch, num_domains)
+             domain_labels: (batch,)
+             citation_module: CitationModule for citation loss
+             citation_mask: (batch, seq_len)
+             citation_confidence: (batch, seq_len, 1)
+             numerical_module: NumericalReasoningModule
+             numerical_mask: (batch, seq_len)
+
+         Returns:
+             Dictionary with total loss and component losses
+         """
+         losses = {}
+
+         # 1. Language modeling loss (next-token prediction).
+         # Shift so position t predicts token t+1, since labels are a copy of input_ids.
+         shift_logits = logits[:, :-1, :].contiguous()
+         shift_labels = labels[:, 1:].contiguous()
+         lm_loss = F.cross_entropy(
+             shift_logits.view(-1, shift_logits.size(-1)),
+             shift_labels.view(-1),
+             ignore_index=-100,  # ignore padding
+         )
+         losses["lm_loss"] = lm_loss
+
+         # 2. Equation detection loss
+         if equation_module is not None and equation_mask is not None:
+             # Needs hidden states from the equation module - would require modifying
+             # the forward pass. Placeholder for now.
+             equation_loss = torch.tensor(0.0, device=logits.device)
+             losses["equation_loss"] = equation_loss
+         else:
+             losses["equation_loss"] = torch.tensor(0.0, device=logits.device)
+
+         # 3. Domain classification loss
+         if domain_logits is not None and domain_labels is not None:
+             domain_loss = F.cross_entropy(domain_logits, domain_labels)
+             losses["domain_loss"] = domain_loss
+         else:
+             losses["domain_loss"] = torch.tensor(0.0, device=logits.device)
+
+         # 4. Citation detection loss
+         if citation_module is not None and citation_mask is not None and citation_confidence is not None:
+             citation_loss = citation_module.compute_citation_loss(
+                 # Would need hidden states - placeholder
+                 torch.zeros_like(logits[:, :, :1]),  # dummy
+                 citation_mask,
+                 citation_confidence,
+             )
+             losses["citation_loss"] = citation_loss
+         else:
+             losses["citation_loss"] = torch.tensor(0.0, device=logits.device)
+
+         # 5. Numerical reasoning loss
+         if numerical_module is not None and numerical_mask is not None:
+             numerical_loss = numerical_module.compute_numerical_loss(
+                 torch.zeros_like(logits),  # dummy hidden states
+                 numerical_mask,
+                 None,  # target values
+             )
+             losses["numerical_loss"] = numerical_loss
+         else:
+             losses["numerical_loss"] = torch.tensor(0.0, device=logits.device)
+
+         # Weighted sum
+         total_loss = torch.tensor(0.0, device=logits.device)
+         for name, loss in losses.items():
+             weight = self.loss_weights.get(name, 1.0)
+             total_loss = total_loss + loss * weight
+
+         losses["total_loss"] = total_loss
+
+         return losses
+
+
+ def test_vortex_loss():
+     """Test the loss function."""
+     config = {"loss_weights": {
+         "lm_loss": 1.0,
+         "equation_loss": 0.3,
+         "domain_loss": 0.1,
+         "citation_loss": 0.1,
+         "numerical_loss": 0.2,
+     }}
+
+     loss_fn = VortexLoss(config)
+
+     batch_size = 2
+     seq_len = 128
+     vocab_size = 1000
+
+     logits = torch.randn(batch_size, seq_len, vocab_size)
+     labels = torch.randint(0, vocab_size, (batch_size, seq_len))
+
+     losses = loss_fn(logits, labels)
+     print("Losses:")
+     for name, value in losses.items():
+         print(f"  {name}: {value.item():.4f}")
+
+     assert "total_loss" in losses
+     print("VortexLoss test passed!")
+
+
+ if __name__ == "__main__":
+     test_vortex_loss()
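The weighted combination at the end of `VortexLoss.forward` is just a dot product between component losses and their weights, with a default weight of 1.0 for unlisted components. A tensor-free sketch with the default weight table, using plain floats in place of loss tensors:

```python
DEFAULT_WEIGHTS = {
    "lm_loss": 1.0,
    "equation_loss": 0.3,
    "domain_loss": 0.1,
    "citation_loss": 0.1,
    "numerical_loss": 0.2,
}

def combine_losses(losses, weights=DEFAULT_WEIGHTS):
    """Weighted sum of loss components; missing weights default to 1.0, as in VortexLoss."""
    return sum(loss * weights.get(name, 1.0) for name, loss in losses.items())

# 2.0 * 1.0 + 1.0 * 0.3 + 0.5 * 0.1 = 2.35
total = combine_losses({"lm_loss": 2.0, "equation_loss": 1.0, "domain_loss": 0.5})
```

Since inactive components contribute a 0.0 placeholder loss, they drop out of the sum regardless of their weight.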
training/trainer.py ADDED
@@ -0,0 +1,442 @@
1
+ """
2
+ Trainer: Main training loop for Vortex model.
3
+ Handles gradient accumulation, mixed precision, checkpointing.
4
+ """
5
+
6
+ import os
7
+ import json
8
+ import torch
9
+ import torch.nn as nn
10
+ from torch.utils.data import DataLoader, Dataset
11
+ from typing import Optional, Dict, List, Callable
12
+ from pathlib import Path
13
+ import logging
14
+
15
+ from ..training.losses import VortexLoss
16
+ from ..training.curriculum import CurriculumScheduler
17
+
18
+
19
+ class VortexDataset(Dataset):
20
+ """Simple dataset wrapper."""
21
+
22
+ def __init__(
23
+ self,
24
+ shard_files: List[str],
25
+ tokenizer,
26
+ max_seq_len: int = 16384,
27
+ ):
28
+ """
29
+ Initialize dataset.
30
+
31
+ Args:
32
+ shard_files: List of parquet shard files
33
+ tokenizer: Tokenizer for encoding text
34
+ max_seq_len: Maximum sequence length
35
+ """
36
+ self.shard_files = shard_files
37
+ self.tokenizer = tokenizer
38
+ self.max_seq_len = max_seq_len
39
+
40
+ # Load all shards into memory (for simplicity - would stream in practice)
41
+ self.samples = []
42
+ self._load_shards()
43
+
44
+ def _load_shards(self):
45
+ """Load all shards."""
46
+ import pandas as pd
47
+
48
+ for shard in self.shard_files:
49
+ df = pd.read_parquet(shard)
50
+ for _, row in df.iterrows():
51
+ self.samples.append({
52
+ "text": row["text"],
53
+ "dataset": row.get("dataset", ""),
54
+ "domain": row.get("domain", ""),
55
+ })
56
+
57
+ def __len__(self) -> int:
58
+ return len(self.samples)
59
+
60
+ def __getitem__(self, idx) -> Dict:
61
+ sample = self.samples[idx]
62
+ text = sample["text"]
63
+
64
+ # Tokenize
65
+ encoding = self.tokenizer.encode(
66
+ text,
67
+ add_special_tokens=True,
68
+ return_tensors="pt",
69
+ )
70
+
71
+ input_ids = encoding["input_ids"].squeeze(0)
72
+ attention_mask = encoding["attention_mask"].squeeze(0)
73
+
74
+ # Truncate if needed
75
+ if len(input_ids) > self.max_seq_len:
76
+ input_ids = input_ids[:self.max_seq_len]
77
+ attention_mask = attention_mask[:self.max_seq_len]
78
+
79
+ # Labels are same as input_ids (causal LM)
80
+ labels = input_ids.clone()
81
+
82
+ return {
83
+ "input_ids": input_ids,
84
+ "attention_mask": attention_mask,
85
+ "labels": labels,
86
+ "domain": sample["domain"],
87
+ }
88
+
89
+
90
+ class VortexTrainer:
91
+ """
92
+ Main trainer for Vortex model.
93
+ """
94
+
95
+ def __init__(
96
+ self,
97
+ model: nn.Module,
98
+ tokenizer,
99
+ train_dataset: Dataset,
100
+ config: Dict,
101
+ eval_dataset: Optional[Dataset] = None,
102
+ optimizer: Optional[torch.optim.Optimizer] = None,
103
+ scheduler: Optional[torch.optim.lr_scheduler._LRScheduler] = None,
104
+ ):
105
+ """
106
+ Initialize trainer.
107
+
108
+ Args:
109
+ model: VortexModel
110
+ tokenizer: VortexScienceTokenizer
111
+ train_dataset: Training dataset
112
+ config: Training configuration
113
+ eval_dataset: Optional evaluation dataset
114
+ optimizer: Optional optimizer (created if None)
115
+ scheduler: Optional LR scheduler
116
+ """
117
+ self.model = model
118
+ self.tokenizer = tokenizer
119
+ self.train_dataset = train_dataset
120
+ self.eval_dataset = eval_dataset
121
+ self.config = config
122
+
123
+ self.device = torch.device(config["device"])
124
+ self.use_amp = config.get("use_amp", True)
125
+ self.amp_dtype = getattr(torch, config.get("amp_dtype", "bfloat16"))
126
+
127
+ # Move model to device
128
+ self.model.to(self.device)
129
+
130
+ # Setup optimizer
131
+ if optimizer is None:
132
+ self.optimizer = self._create_optimizer()
133
+ else:
134
+ self.optimizer = optimizer
135
+
136
+ # Setup scheduler
137
+ if scheduler is None:
138
+ self.scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
139
+ self.optimizer,
140
+ T_max=config["max_steps"],
141
+ )
142
+ else:
143
+ self.scheduler = scheduler
144
+
145
+ # Setup AMP scaler
146
+ self.scaler = torch.cuda.amp.GradScaler() if self.use_amp and self.device.type == "cuda" else None
147
+
148
+ # Loss function
149
+ self.loss_fn = VortexLoss(config)
150
+
151
+ # Curriculum scheduler
152
+ self.curriculum = CurriculumScheduler(config, config["max_steps"])
153
+
154
+ # Logging
155
+ self.log_dir = Path(config.get("log_dir", "logs"))
156
+ self.log_dir.mkdir(parents=True, exist_ok=True)
157
+ self.log_interval = config.get("log_interval", 100)
158
+
159
+ # Checkpointing
160
+ self.checkpoint_dir = Path(config.get("checkpoint_dir", "checkpoints"))
161
+ self.checkpoint_dir.mkdir(parents=True, exist_ok=True)
162
+ self.save_interval = config.get("save_interval", 5000)
163
+
164
+ # Training state
165
+ self.global_step = 0
166
+ self.best_eval_loss = float('inf')
167
+
168
+ # Data loader
169
+ self.train_loader = DataLoader(
170
+ train_dataset,
171
+ batch_size=config["micro_batch_size"],
172
+ shuffle=True,
173
+ num_workers=config.get("num_workers", 4),
174
+ pin_memory=config.get("pin_memory", True),
175
+ prefetch_factor=config.get("prefetch_factor", 2),
176
+ )
177
+
178
+ if eval_dataset:
179
+ self.eval_loader = DataLoader(
180
+ eval_dataset,
181
+ batch_size=config["micro_batch_size"],
182
+ shuffle=False,
183
+ num_workers=config.get("num_workers", 4),
184
+ )
185
+
186
+ def _create_optimizer(self) -> torch.optim.Optimizer:
187
+ """Create AdamW optimizer."""
188
+ return torch.optim.AdamW(
189
+ self.model.parameters(),
190
+ lr=self.config["learning_rate"],
191
+ betas=(self.config["beta1"], self.config["beta2"]),
192
+ weight_decay=self.config["weight_decay"],
193
+ )
194
+
195
+ def train_step(
196
+ self,
197
+ batch: Dict,
198
+ current_step: int,
199
+ ) -> Dict[str, torch.Tensor]:
200
+ """
201
+ Single training step.
202
+
203
+ Args:
204
+ batch: Batch dictionary
205
+ current_step: Current step number
206
+
207
+ Returns:
208
+ Dictionary of losses
209
+ """
210
+ self.model.train()
211
+
212
+ # Move batch to device
213
+ input_ids = batch["input_ids"].to(self.device)
214
+ attention_mask = batch["attention_mask"].to(self.device)
215
+ labels = batch["labels"].to(self.device)
216
+
217
+ # Domain info (placeholder - would extract from batch)
218
+ domain_ids = None
219
+ domain_tags = None
220
+
221
+ # Forward pass with AMP
222
+ with torch.cuda.amp.autocast(enabled=self.use_amp and self.device.type == "cuda"):
223
+ outputs = self.model(
224
+ input_ids=input_ids,
225
+ attention_mask=attention_mask,
226
+ domain_ids=domain_ids,
227
+ domain_tags=domain_tags,
228
+ return_dict=True,
229
+ )
230
+ logits = outputs["logits"]
231
+
232
+ # Compute losses
233
+ losses = self.loss_fn(
234
+ logits=logits,
235
+ labels=labels,
236
+ # Pass modules and masks for auxiliary losses
237
+ )
238
+
239
+ # Backward pass
240
+ if self.scaler:
241
+ self.scaler.scale(losses["total_loss"]).backward()
242
+ else:
243
+ losses["total_loss"].backward()
244
+
245
+ return losses
246
+
247
+ def train_epoch(self):
248
+ """Train for one epoch."""
249
+ self.model.train()
250
+
251
+ for batch_idx, batch in enumerate(self.train_loader):
252
+ # Train step
253
+ losses = self.train_step(batch, self.global_step)
254
+
255
+ # Gradient accumulation
256
+ if (self.global_step + 1) % self.config["gradient_accumulation_steps"] == 0:
257
+ # Gradient clipping
258
+ if self.config.get("clip_grad_norm", 0) > 0:
259
+ if self.scaler:
260
+ self.scaler.unscale_(self.optimizer)
261
+ torch.nn.utils.clip_grad_norm_(
262
+ self.model.parameters(),
263
+ self.config["clip_grad_norm"],
264
+ )
265
+
266
+ # Optimizer step
267
+ if self.scaler:
268
+ self.scaler.step(self.optimizer)
269
+ self.scaler.update()
270
+ else:
271
+ self.optimizer.step()
272
+
273
+ self.optimizer.zero_grad()
274
+ self.scheduler.step()
275
+
276
+ # Logging
277
+ if self.global_step % self.log_interval == 0:
278
+ self._log_losses(losses, batch_idx)
279
+
280
+ # Evaluation
281
+ if self.eval_dataset and self.global_step % self.config.get("eval_interval", 1000) == 0:
282
+ self.evaluate()
283
+
284
+ # Checkpointing
285
+ if self.global_step % self.save_interval == 0:
286
+ self.save_checkpoint()
287
+
288
+ self.global_step += 1
289
+
290
+ if self.global_step >= self.config["max_steps"]:
291
+ print("Reached max steps")
292
+ return
293
+
294
+ def evaluate(self) -> Dict[str, float]:
295
+ """Run evaluation."""
296
+ self.model.eval()
297
+ total_loss = 0.0
298
+ num_batches = 0
299
+
300
+ with torch.no_grad():
301
+ for batch in self.eval_loader:
302
+ input_ids = batch["input_ids"].to(self.device)
303
+ attention_mask = batch["attention_mask"].to(self.device)
304
+ labels = batch["labels"].to(self.device)
305
+
306
+ with torch.cuda.amp.autocast(enabled=self.use_amp and self.device.type == "cuda"):
307
+ outputs = self.model(input_ids=input_ids, attention_mask=attention_mask, return_dict=True)
308
+ logits = outputs["logits"]
309
+ loss = F.cross_entropy(
310
+ logits.view(-1, logits.size(-1)),
311
+ labels.view(-1),
312
+ ignore_index=-100,
313
+ )
314
+
315
+ total_loss += loss.item()
316
+ num_batches += 1
317
+
318
+ avg_loss = total_loss / num_batches if num_batches > 0 else 0.0
319
+ print(f"Evaluation at step {self.global_step}: loss = {avg_loss:.4f}")
320
+
321
+ return {"eval_loss": avg_loss}
322
+
323
+ def save_checkpoint(self, is_best: bool = False):
324
+ """Save model checkpoint."""
325
+ checkpoint = {
326
+ "step": self.global_step,
327
+ "model_state_dict": self.model.state_dict(),
328
+ "optimizer_state_dict": self.optimizer.state_dict(),
329
+ "scheduler_state_dict": self.scheduler.state_dict(),
330
+ "config": self.config,
331
+ "best_eval_loss": self.best_eval_loss,
332
+ }
333
+
334
+ if self.scaler:
335
+ checkpoint["scaler_state_dict"] = self.scaler.state_dict()
336
+
337
+ # Save latest
338
+        checkpoint_path = self.checkpoint_dir / f"checkpoint_{self.global_step:06d}.pt"
+        torch.save(checkpoint, checkpoint_path)
+        print(f"Saved checkpoint to {checkpoint_path}")
+
+        # Save best model
+        if is_best:
+            best_path = self.checkpoint_dir / "best_model.pt"
+            torch.save(checkpoint, best_path)
+            print(f"Saved best model to {best_path}")
+
+        # Save a copy as the latest checkpoint
+        latest_path = self.checkpoint_dir / "latest.pt"
+        torch.save(checkpoint, latest_path)
+
+    def load_checkpoint(self, checkpoint_path: str):
+        """Load a checkpoint and restore model, optimizer, and scheduler state."""
+        checkpoint = torch.load(checkpoint_path, map_location=self.device, weights_only=False)
+        self.model.load_state_dict(checkpoint["model_state_dict"])
+        self.optimizer.load_state_dict(checkpoint["optimizer_state_dict"])
+        self.scheduler.load_state_dict(checkpoint["scheduler_state_dict"])
+        self.global_step = checkpoint["step"]
+        self.best_eval_loss = checkpoint.get("best_eval_loss", float('inf'))
+
+        if self.scaler and "scaler_state_dict" in checkpoint:
+            self.scaler.load_state_dict(checkpoint["scaler_state_dict"])
+
+        print(f"Loaded checkpoint from {checkpoint_path} at step {self.global_step}")
+
+    def _log_losses(self, losses: Dict[str, torch.Tensor], batch_idx: int):
+        """Log losses to the console."""
+        loss_str = " | ".join([f"{k}: {v.item():.4f}" for k, v in losses.items()])
+        print(f"Step {self.global_step} | {loss_str}")
+
+    def train(self):
+        """Main training loop."""
+        print("Starting training...")
+        print(f"Total steps: {self.config['max_steps']}")
+        print(f"Device: {self.device}")
+        print(f"Batch size: {self.config['micro_batch_size']}")
+        print(f"Gradient accumulation steps: {self.config['gradient_accumulation_steps']}")
+
+        try:
+            self.train_epoch()
+        except KeyboardInterrupt:
+            print("Training interrupted")
+        finally:
+            self.save_checkpoint()
+
+
+def test_trainer():
+    """Smoke-test the trainer with a tiny model and dummy data."""
+    from models.vortex_model import VortexModel
+    from configs.vortex_7b_config import VORTEX_7B_CONFIG
+
+    # Small config for testing
+    config = VORTEX_7B_CONFIG.copy()
+    config["d_model"] = 256
+    config["num_layers"] = 2
+    config["num_heads"] = 4
+    config["vocab_size"] = 1000
+    config["max_steps"] = 10
+    config["device"] = "cpu"
+
+    # Create model
+    model = VortexModel(config)
+
+    # Dummy tokenizer (random token IDs; the real tokenizer is not needed here)
+    class DummyTokenizer:
+        def encode(self, text, add_special_tokens=True, return_tensors="pt"):
+            return {"input_ids": torch.randint(0, 1000, (1, 10)), "attention_mask": torch.ones(1, 10)}
+
+    tokenizer = DummyTokenizer()
+
+    # Dummy dataset of random token sequences
+    class DummyDataset(torch.utils.data.Dataset):
+        def __len__(self):
+            return 10
+
+        def __getitem__(self, idx):
+            return {
+                "input_ids": torch.randint(0, 1000, (32,)),
+                "attention_mask": torch.ones(32),
+                "labels": torch.randint(0, 1000, (32,)),
+                "domain": "physics",
+            }
+
+    train_dataset = DummyDataset()
+    eval_dataset = DummyDataset()
+
+    # Create trainer
+    trainer = VortexTrainer(
+        model=model,
+        tokenizer=tokenizer,
+        train_dataset=train_dataset,
+        config=config,
+        eval_dataset=eval_dataset,
+    )
+
+    # Run a few steps
+    trainer.train()
+
+    print("Trainer test passed!")
+
+
+if __name__ == "__main__":
+    test_trainer()
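The zero-padded step in the checkpoint filename (`checkpoint_{step:06d}.pt`) keeps lexicographic and numeric ordering in agreement, so finding the newest checkpoint reduces to a glob and a sort. A minimal sketch of that idea; the `latest_checkpoint` helper is illustrative and not part of this repo:

```python
from pathlib import Path
from typing import Optional
import tempfile

def latest_checkpoint(checkpoint_dir: Path) -> Optional[Path]:
    """Return the newest checkpoint by step, or None if none exist.

    Relies on the zero-padded naming scheme (e.g. checkpoint_001200.pt):
    with a fixed width, lexicographic order matches numeric step order.
    """
    candidates = sorted(checkpoint_dir.glob("checkpoint_*.pt"))
    return candidates[-1] if candidates else None

# Demonstrate with empty placeholder files.
with tempfile.TemporaryDirectory() as tmp:
    ckpt_dir = Path(tmp)
    for step in (50, 1200, 7):
        (ckpt_dir / f"checkpoint_{step:06d}.pt").touch()
    newest = latest_checkpoint(ckpt_dir)
    print(newest.name)  # checkpoint_001200.pt
```

The `latest.pt` copy saved above serves the same purpose without any globbing, at the cost of writing each checkpoint twice.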
vortex_config.py ADDED
@@ -0,0 +1,71 @@
+"""
+Vortex-7B model configuration.
+Optimized for 8GB VRAM (4060 laptop) and MacBook Pro M2/M3.
+"""
+
+VORTEX_7B_CONFIG = {
+    # Model dimensions
+    "d_model": 4096,
+    "num_layers": 32,
+    "num_heads": 32,
+    "head_dim": 128,  # d_model // num_heads
+
+    # State-space layer parameters
+    "d_state": 16,  # SSM state dimension
+    "d_conv": 4,  # SSM convolution width
+
+    # Attention parameters
+    "window_size": 512,  # Local attention window
+    "use_flash_attention": True,  # CUDA only
+
+    # Feed-forward parameters
+    "ffn_expansion": 4,  # Hidden dim = d_model * expansion
+    "num_domains": 7,  # Physics, Math, Chemistry, Biology, Earth, Space, Zoology
+
+    # Tokenizer parameters
+    "vocab_size": 50000,
+    "max_seq_len": 16384,
+
+    # Layer ratio: 60% SSM, 40% attention
+    "ssm_ratio": 0.6,
+
+    # Data types
+    "dtype": "bfloat16",
+
+    # Special tokens
+    "special_tokens": {
+        "[PAD]": 0,
+        "[UNK]": 1,
+        "[BOS]": 2,
+        "[EOS]": 3,
+        "[EQUATION]": 4,
+        "[/EQUATION]": 5,
+        "[CITATION]": 6,
+        "[/CITATION]": 7,
+        "[MOLECULE]": 8,
+        "[/MOLECULE]": 9,
+        "[FIGURE]": 10,
+        "[TABLE]": 11,
+        "[MATH]": 12,
+        "[CHEM]": 13,
+        "[BIO]": 14,
+        "[PHYS]": 15,
+        "[EARTH]": 16,
+        "[SPACE]": 17,
+        "[ZOO]": 18,
+    },
+
+    # Domain tags
+    "domain_tags": ["[MATH]", "[CHEM]", "[BIO]", "[PHYS]", "[EARTH]", "[SPACE]", "[ZOO]"],
+
+    # Science module flags (enable/disable for ablation)
+    "enable_equation_module": True,
+    "enable_numerical_module": True,
+    "enable_citation_module": True,
+    "enable_molecular_module": True,
+}
+
+
+def get_config():
+    """Return the 7B configuration dictionary."""
+    return VORTEX_7B_CONFIG
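Some fields in this config are derived from others (`head_dim = d_model // num_heads`) or imply a layer partition (`ssm_ratio`). A quick consistency sketch using the values above; how the repo actually assigns SSM vs. attention layers is not shown in this diff, so the `round()` split is an assumption:

```python
# Values copied from VORTEX_7B_CONFIG above.
d_model, num_heads, head_dim = 4096, 32, 128
num_layers, ssm_ratio = 32, 0.6

# head_dim must agree with the model width and head count.
assert head_dim == d_model // num_heads

# Assumed layer split: round() is a guess at how ssm_ratio is applied.
num_ssm = round(num_layers * ssm_ratio)
num_attn = num_layers - num_ssm
print(num_ssm, num_attn)  # 19 13
```

With 32 layers, a 0.6 ratio cannot be hit exactly, so any implementation has to round to 19 SSM and 13 attention layers (or the reverse split at 20/12).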