# Vivek Varikuti - Complete Profile & Project Portfolio
## Who I Am
- 22 years old, AI Engineer & Startup Founder
- GitHub: vivekvar-dl
- Email: domainluther1234@gmail.com
- Strong Python/PyTorch/LLM skills, deep transformer training experience
- Hardware: 1x NVIDIA H100 NVL 96GB on Azure (NC40ads H100 v5)
- CUDA 12.8, PyTorch 2.7.0+cu128, flash-attn 2.8.3 (FA2)
- Transformers 5.4.0
---
## Working Style
- No AI fluff. No menus of options. Make the decision and execute.
- Write like a human: no perfect grammar, no emojis, no "leveraging" or "seamless"
- Any public text must read like a tired developer typed it at 2am
- No co-authored-by Claude in git commits, so public contributions look fully human
- Verify before claiming. Test before shipping. Always run the actual code.
---
## Project 1: TurboQuant - KV Cache Compression
**What:** Implementing Google's TurboQuant paper (arXiv 2504.19874, Zandieh et al.) for KV cache compression during LLM inference.
**Why:** Compress KV cache ~4-7x on production LLMs to enable longer contexts and batching on H100 NVL (96GB).
**Location:** /home/azureuser/turboquant/
**Status:** Working prototype. Google hasn't released their code publicly, so this is one of the first working implementations.
**Core Method:** Mixed-precision quantization of KV cache. Profile each layer's activation norms, identify outlier layers that need full precision, quantize the rest. No retraining, no fine-tuning. Drop-in replacement.
**Key Discovery:** Layer 0 (and sometimes last layer) of Qwen models have anomalously large key norms (~16-50x median). These layers must be kept in BF16 (skip_layers). Auto-calibration function detects outlier layers.
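A minimal sketch of the outlier check described above, assuming per-layer key norms are already profiled. The function name `find_outlier_layers` and the 10x threshold are hypothetical, not taken from the actual calibration code. The toy profile reuses the Qwen2.5-7B median and outlier norms reported in this document.

```python
from statistics import median

def find_outlier_layers(layer_norms, ratio_threshold=10.0):
    """Return indices of layers whose key norm exceeds ratio_threshold x the median.

    ratio_threshold is an assumed value for illustration only.
    """
    med = median(layer_norms)
    return [i for i, norm in enumerate(layer_norms) if norm > ratio_threshold * med]

# Toy profile: 28 layers sitting at the reported median, with layers 0 and 27 spiking.
norms = [16.86] * 28
norms[0], norms[27] = 273.84, 239.91

skip_layers = find_outlier_layers(norms)
print(skip_layers)  # [0, 27]
```

The returned list would play the role of skip_layers: those layers stay in BF16 and everything else gets quantized.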
### Benchmark Results (H100 NVL 96GB)
#### Model Architecture Summary
| Model | Architecture | KV Heads | head_dim | Outlier Layers | Prefill Fidelity |
|-------|-------------|----------|---------|----------------|-----------------|
| Qwen2.5-7B | 28L, qwen2 | 4 | 128 | layers 0, 27 | exact |
| Llama-3.1-8B | 32L, llama | 8 | 128 | none | exact |
| Gemma-2-9B | 42L, gemma2 | 8 | 256 | none | exact |
| Phi-4-14B | 40L, phi3 | 10 | 128 | none | exact |
| Qwen2.5-32B | 64L, qwen2 | 8 | 128 | none | exact |
| Llama-3.3-70B | 80L, llama | 8 | 128 | N/A (disk full) | N/A |
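For context on the KV Heads and head_dim columns, here is a rough sketch of the theoretical uncompressed KV cache size. It assumes BF16 (2 bytes per element), batch size 1, and full attention on every layer, so it ignores sliding-window layers and any allocator overhead. The helper name `kv_cache_bytes` is made up for this example, and the result is a back-of-envelope number, not a measured VRAM figure.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   bytes_per_elem=2, batch_size=1):
    """Theoretical KV cache size: keys + values for every layer and position."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_elem * batch_size

# (layers, kv_heads, head_dim) copied from the architecture table above.
models = {
    "Qwen2.5-7B": (28, 4, 128),
    "Llama-3.1-8B": (32, 8, 128),
    "Gemma-2-9B": (42, 8, 256),
    "Phi-4-14B": (40, 10, 128),
    "Qwen2.5-32B": (64, 8, 128),
}

for name, (layers, kv_heads, head_dim) in models.items():
    mib = kv_cache_bytes(layers, kv_heads, head_dim, seq_len=8192) / 2**20
    print(f"{name}: {mib:.0f} MiB of BF16 KV cache at 8K context")
```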
#### Memory Savings at 8K Context
| Model | Default VRAM | TurboQuant VRAM | Saved | KV Cache Reduction |
|-------|-------------|----------------|-------|-------------------|
| Gemma-2-9B | 9.98 GB | 7.71 GB | 2,323 MB | ~59% |
| Qwen2.5-32B | 23.16 GB | 21.41 GB | 1,791 MB | ~47% |
| Phi-4-14B | 12.28 GB | 10.92 GB | 1,392 MB | ~44% |
| LLaMA-3.1-8B | 7.71 GB | 6.84 GB | 890 MB | ~44% |
| Qwen2.5-7B | 7.08 GB | 6.71 GB | 380 MB | ~44% |
#### Memory Savings Scaling (LLaMA-3.1-8B)
| Context Length | Default VRAM | TurboQuant VRAM | Saved |
|---------------|-------------|----------------|-------|
| 1K tokens | 6.00 GB | 5.91 GB | 93 MB |
| 4K tokens | 6.67 GB | 6.27 GB | 417 MB |
| 8K tokens | 7.71 GB | 6.84 GB | 890 MB |
#### Full Memory Data Per Model
**Qwen2.5-7B (5.45 GB model)**
- Layer norms: median 16.86, max 273.84 (layer 0), ratio 16.24x
- Outlier layers: 0 (norm 273.84), 27 (norm 239.91)
- 1K: 5.76 → 5.73 GB (37 MB saved)
- 4K: 6.27 → 6.10 GB (176 MB saved)
- 8K: 7.08 → 6.71 GB (380 MB saved)
**LLaMA-3.1-8B (5.68 GB model)**
- Layer norms: median 17.90, max 21.05 (layer 7), ratio 1.18x
- No outlier layers
- 1K: 6.00 → 5.91 GB (93 MB saved, output match)
- 4K: 6.67 → 6.27 GB (417 MB saved, output match)
- 8K: 7.71 → 6.84 GB (890 MB saved, output match)
**Gemma-2-9B (6.08 GB model)**
- Layer norms: median 17.82, max 21.28 (layer 25), ratio 1.19x
- No outlier layers
- 1K: 6.62 → 6.38 GB (244 MB saved)
- 4K: 7.96 → 6.89 GB (1,096 MB saved)
- 8K: 9.98 → 7.71 GB (2,323 MB saved)
**Phi-4-14B (9.10 GB model)**
- Layer norms: median 19.21, max 26.46 (layer 0), ratio 1.38x
- No outlier layers
- 1K: 9.75 → 9.61 GB (146 MB saved)
- 4K: 10.72 → 10.09 GB (650 MB saved)
- 8K: 12.28 → 10.92 GB (1,392 MB saved)
**Qwen2.5-32B (19.31 GB model)**
- Layer norms: median 16.09, max 37.82 (layer 0), ratio 2.35x
- No outlier layers
- 1K: 19.97 → 19.79 GB (186 MB saved)
- 4K: 21.23 → 20.42 GB (833 MB saved)
- 8K: 23.16 → 21.41 GB (1,791 MB saved)
**LLaMA-3.3-70B**: failed with "No space left on device"
#### Quality Verification
All models tested with 3 prompts: "Explain quantum computing", "Write a Python prime checker", "What causes northern lights?"
- Prefill logit difference: 0.0 across ALL models
- Same top-1 token prediction: 100% across ALL models
- Output coherence: 100%, both default and TurboQuant outputs fully coherent
- Token match rate varies (18-100%) due to natural autoregressive sampling divergence; both outputs equally valid
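A small sketch of how a token match rate like the one above could be computed. It assumes the metric is a position-wise comparison of generated token ids, and the actual benchmark script may define it differently. Both helper names are hypothetical. Because decoding is autoregressive, one early mismatch changes every later step, which is why the rate can drop sharply even when prefill logits are identical.

```python
def token_match_rate(reference_ids, candidate_ids):
    """Fraction of positions where two token-id sequences agree (over the shorter length)."""
    n = min(len(reference_ids), len(candidate_ids))
    if n == 0:
        return 0.0
    matches = sum(1 for a, b in zip(reference_ids, candidate_ids) if a == b)
    return matches / n

def first_divergence(reference_ids, candidate_ids):
    """Index of the first differing token, or None if the common prefix is identical."""
    for i, (a, b) in enumerate(zip(reference_ids, candidate_ids)):
        if a != b:
            return i
    return None

# Made-up token ids, just to exercise the helpers.
default_out = [5, 9, 2, 7, 7, 1, 4, 8]
turbo_out = [5, 9, 2, 7, 3, 1, 6, 8]

print(token_match_rate(default_out, turbo_out))  # 0.75
print(first_divergence(default_out, turbo_out))  # 4
```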
**Detailed quality per model:**
- Qwen2.5-7B: token match 39%, 3%, 54%; both coherent on all 3 prompts
- LLaMA-3.1-8B: token match 89.1%, 100%, 100%; 2/3 exact match
- Phi-4-14B: token match 100%, 44%, 100%; 2/3 exact match
- Gemma-2-9B: token match 100%, 100%, 18.8%; 2/3 exact match
- Qwen2.5-32B: token match 71%, 25%, 53%; both coherent on all 3 prompts
#### Infrastructure Notes
- Environment: torch 2.7.0+cu128, transformers 5.4.0, H100 NVL CUDA 12.8 (driver 570)
- PyTorch compiled for CUDA 13.0+ won't work, need cu128 wheel
- Core quantizer verified (MSE matches paper bounds)
- Cache integrates with HF Transformers v5.4.0 QuantizedLayer API
---
## Project 2: Parameter Golf Competition (OpenAI)
**What:** OpenAI competition: train the best language model within a 16MB artifact, 10 minutes on 8xH100.
**Metric:** Bits-per-byte (BPB) on FineWeb validation (62M tokens sp1024, 45.5M tokens sp4096)
**Timeline:** March 18 - April 30, 2026
**Current SOTA (merged):** 1.1194 BPB (PR #549, LeakyReLU^2 + TTT + Parallel Muon)
### Our Edge: sp4096 Vocabulary
- sp4096 tokens_per_byte: 0.3063 vs sp1024: 0.4149 → 26.2% fewer tokens
- Baseline A/B test (400 steps): sp4096 = 1.6208 BPB vs sp1024 = 1.7144 BPB → -5.5%
- #1 arch A/B test (400 steps, seed 42): sp4096+factored = 1.8693 BPB vs sp1024 = 2.0067 BPB → -6.8%
- Extrapolated SOTA: 1.1194 × 0.93 ≈ 1.04-1.06 BPB
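The arithmetic behind these numbers, as a quick sketch. It assumes the usual conversion where per-token cross-entropy in nats is turned into bits and scaled by tokens_per_byte. The helper names and the example loss value are made up for illustration and are not from train_gpt.py.

```python
import math

def bits_per_byte(loss_nats_per_token, tokens_per_byte):
    """Convert mean per-token cross-entropy (nats) into bits per byte."""
    return loss_nats_per_token / math.log(2) * tokens_per_byte

def relative_change(new, old):
    """Signed relative change of new vs old (negative means new is lower)."""
    return (new - old) / old

# Token-count reduction from the larger vocabulary.
print(f"fewer tokens: {1 - 0.3063 / 0.4149:.1%}")
# A/B deltas from the two 400-step runs.
print(f"baseline A/B: {relative_change(1.6208, 1.7144):+.1%}")
print(f"#1 arch A/B:  {relative_change(1.8693, 2.0067):+.1%}")
# Hypothetical loss value, only to show the conversion.
print(f"example BPB at 3.0 nats/token: {bits_per_byte(3.0, 0.3063):.4f}")
```

Fewer tokens per byte only helps BPB if per-token loss rises by less than the token count falls, which is what the A/B runs check.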
### Architecture
- 11L, 512d, 8H, 4KV, 3x MLP, LeakyReLU(0.5)^2
- Factored embeddings: tok_emb(4096x256) + embed_up(256→512) + embed_down(512→256)
- All tricks from #1 submission: XSA, Partial RoPE, LN Scale, SmearGate, BigramHash, EMA, TTT, GPTQ-lite
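Rough parameter count for the factored embedding versus a plain 4096x512 table. This sketch assumes no biases on the projections and that embed_down feeds a tied output head, which is a guess from the layer names. The function name `embedding_params` is hypothetical.

```python
def embedding_params(vocab, model_dim, bottleneck=None):
    """Weight count for a plain embedding table, or a factored one if bottleneck is set.

    Factored = tok_emb (vocab x bottleneck) + embed_up (bottleneck x model_dim)
             + embed_down (model_dim x bottleneck). Biases are ignored (assumption).
    """
    if bottleneck is None:
        return vocab * model_dim
    return vocab * bottleneck + bottleneck * model_dim + model_dim * bottleneck

full = embedding_params(4096, 512)
factored = embedding_params(4096, 512, bottleneck=256)
print(full, factored, full - factored)  # 2097152 1310720 786432
```

Under those assumptions the factored layout needs about 37.5% fewer embedding weights, which matters when the whole artifact has to fit in 16MB.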
### Key Files
- our_submission/train_gpt.py: modified #1 with sp4096 + factored embed + FA2 fallback
- our_submission/train_gpt_original.py: unmodified #1 with FA2 fallback
- train_sp4096.py: tokenizer training + data sharding script
- data/tokenizers/fineweb_4096_bpe.model: trained sp4096 tokenizer
- data/datasets/fineweb10B_sp4096/: 80 train shards + 1 val shard
### N-gram Cache: CONFIRMED FAKE
- 256M bucket experiment: collision-free hash tables give 1.11 BPB (no improvement)
- All sub-1.0 BPB claims are measurement artifacts from hash collisions
- Valid Dirichlet smoothing gives at most ~0.002-0.005 genuine improvement
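For reference, a minimal sketch of what a properly normalized Dirichlet-smoothed mix of n-gram counts and model probabilities looks like. This is the textbook formulation, assumed here for illustration. The alpha value and the function name are hypothetical, and this is not the code from the 256M bucket experiment.

```python
def dirichlet_smoothed_probs(token_counts, model_probs, alpha=4.0):
    """Blend n-gram counts for one context with model probabilities.

    p(token) = (count(token) + alpha * p_model(token)) / (total_count + alpha)
    The result sums to 1 as long as model_probs sums to 1 and the counts
    are exact counts for this context.
    """
    total = sum(token_counts)
    return [(c + alpha * p) / (total + alpha) for c, p in zip(token_counts, model_probs)]

# Toy 3-token vocabulary: the context was seen 4 times.
model_probs = [0.5, 0.3, 0.2]
token_counts = [3, 1, 0]

mixed = dirichlet_smoothed_probs(token_counts, model_probs)
print(mixed)
print(sum(mixed))  # should be 1.0 up to float error
```

Checking that the blended distribution still sums to 1 is a cheap sanity test before trusting any BPB number that comes out of a cache like this.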
### Next Steps
1. Medium fidelity run (10min 1xH100)
2. Int5 MLP quantization (saves ~1.86MB for artifact budget headroom)
3. Get 8xH100 access for final submission (compute grant or RunPod)
4. Temperature scaling, document-isolated TTT for extra gains
### Hardware
- Dev: 1xH100 NVL (Azure NC40ads H100 v5), 96GB VRAM, CUDA 12.8, PyTorch 2.9.1+cu128
- flash-attn 2.8.3 (FA2, not FA3)
- Final submission needs 8xH100
---
## Project 3: GSoC 2026 - DeepChem OLMo Wrapper
**What:** Adding OLMo-2 7B LLM support to DeepChem for molecular property prediction and SMILES generation.
**Org:** DeepChem (first time as a standalone org in GSoC 2026)
**Mentors:** Riya, Harindhar
**Deadline:** March 31, 2026 18:00 UTC (submitted)
### What Was Built
**PR #4913 (LIVE) - Bug Fix**
- Fixed ChemBERTa broken import for transformers 5.x
- `transformers.models.roberta.tokenization_roberta_fast` removed in 5.x
- 3 additions / 4 deletions
- https://github.com/deepchem/deepchem/pull/4913
**Issue #4912 (LIVE) - Compat Report**
- Broader transformers 5.x compatibility issues documented
- https://github.com/deepchem/deepchem/issues/4912
**OLMo Wrapper (LOCAL ONLY, not pushed)**
- Files at ~/olmo_draft/olmo.py and ~/olmo_draft/test_olmo.py
- Olmo2ForSequenceClassification: built from scratch (doesn't exist in HF)
- OLMo wrapper class extending HuggingFaceModel
- Added causal_lm task + generate() to base HuggingFaceModel
- 8/8 tests pass in 27 seconds on CPU
- Uses OLMo-2 (allenai/OLMo-2-1124-7B)
### Experiments Run
- BBBP classification: ROC-AUC 0.67 (random init, 12.9M params, 200 samples)
- ESOL regression: R² 0.37, MAE 1.27
- SMILES generation: 0% validity (proves pretraining is core challenge)
- Tokenization analysis: OLMo 0.9x tokens vs ChemBERTa, but fragments stereocenters
### Proposal
- ~/gsoc_proposal_final.md: human-written version
- ~/gsoc_proposal_content.md: raw technical reference
### Key Context
- PR #4907 by Aditya-ad48 also adds causal LM generation; complementary, not competing
- DeepChem wants small PRs (<50 lines) for new contributors
- rbharath is the main reviewer/maintainer
- Office hours MWF 9am PST
- Discord: https://discord.gg/RYTrUY8Ssn
---
## Project 4: Genesis - Artificial Life Simulation
**What:** Virtual world where blank GRU neural net agents evolve survival behaviors from scratch (foraging, water-seeking, communication) on H100 GPU using JAX.
**Location:** /home/azureuser/genesis/ (venv at ~/genesis_env/)
### World Setup
- 512x512 grid with Perlin noise terrain
- Food regrowth, water sources, day/night cycles, seasons
- 1000 agents with GRU brains (~82K params each)
- Tournament selection + Gaussian mutation (self-adaptive sigma)
- Agents start with zero knowledge and must learn to survive
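A plain-Python sketch of the tournament selection + self-adaptive Gaussian mutation loop listed above. It uses the standard evolution-strategy rule where each genome carries its own sigma and sigma is mutated log-normally before the weights. The tournament size, tau, and all function names are assumptions, and the real selection.py / mutation.py are JAX and likely batched.

```python
import math
import random

def tournament_select(population, fitness, k=3, rng=random):
    """Pick k random individuals and return the fittest one."""
    contenders = rng.sample(range(len(population)), k)
    winner = max(contenders, key=lambda i: fitness[i])
    return population[winner]

def mutate(genome, sigma, rng=random, tau=None, sigma_min=1e-5):
    """Self-adaptive Gaussian mutation: perturb sigma first, then the weights."""
    if tau is None:
        tau = 1.0 / math.sqrt(len(genome))  # common default learning rate (assumed)
    new_sigma = max(sigma_min, sigma * math.exp(tau * rng.gauss(0.0, 1.0)))
    child = [w + new_sigma * rng.gauss(0.0, 1.0) for w in genome]
    return child, new_sigma

# Toy population: 20 genomes of 8 weights, each paired with its own sigma.
rng = random.Random(0)
population = [([rng.uniform(-1, 1) for _ in range(8)], 0.1) for _ in range(20)]
fitness = [rng.random() for _ in population]

parent_genome, parent_sigma = tournament_select(population, fitness, k=3, rng=rng)
child_genome, child_sigma = mutate(parent_genome, parent_sigma, rng=rng)
print(len(child_genome), child_sigma)
```

Since sigma is inherited along with the weights, lineages whose step size suits the current fitness landscape tend to survive, so the mutation scale tunes itself without a schedule.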
### Status (2026-04-01)
Phases 1-3 complete. 500K step run finished successfully:
- 86 generations evolved
- Agents sustain avg age 3,742 steps, energy 0.98, hydration 0.79
- Signal entropy dropping (4.28 → 3.58), indicating early communication structure
- Simulation runs at ~1000 steps/s on H100 (JAX jit-compiled)
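A minimal sketch of a signal entropy measurement, assuming the continuous signals are first discretized into symbols. How emergence.py actually bins the 8-channel signals is not stated here, so the symbol encoding and the function name are hypothetical. Lower entropy means agents use a narrower, more structured set of signals.

```python
import math
from collections import Counter

def shannon_entropy_bits(symbols):
    """Shannon entropy (in bits) of the empirical distribution over symbols."""
    counts = Counter(symbols)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Uniform use of 8 symbols vs. a population that mostly emits symbol 0.
uniform_signals = list(range(8)) * 10
skewed_signals = [0] * 70 + list(range(1, 8)) + [1, 2, 3]

print(shannon_entropy_bits(uniform_signals))  # 3.0
print(shannon_entropy_bits(skewed_signals))
```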
### Key Fix
food_growth_rate raised from 0.005 to 0.02 and food_eat_amount lowered from 0.05 to 0.03 to prevent ecological collapse at high generations.
### Architecture
- World: grid.py, resources.py, environment.py, physics.py, observations.py, spatial.py
- Agent: body.py (metabolism), brain.py (GRU + vmap batched), actions.py
- Evolution: fitness.py, selection.py, mutation.py (self-adaptive sigma), population.py
- Communication: signals.py (8-channel, spatial attenuation, top-4 reception)
- Analysis: emergence.py (signal entropy, magnitude, R², diversity, clustering)
- Visualization: renderer.py (dashboard, world map, zoom views)
### Run Data
~/genesis/runs/run_20260401_111309/: metrics.csv (500 rows), emergence.csv (100 rows), 50 viz frames, 10 checkpoints + FINAL, config.json
### Next Steps
- Phase 4: TRIBE v2 integration, compare evolved GRU representations to human brain activity via RSA
- Phase 5: Scale to 5K+ agents, longer runs for 500+ generations
- Checkpoints at 50K intervals allow comparing brain representations across evolutionary time
---
## Project 5: TRIBE v2 - AI-Brain Loop
**What:** Closing the AI-brain comparison loop using Meta's TRIBE v2: comparing AI encoder representations to predicted brain activity to find architectural gaps.
**Location:** /home/azureuser/tribev2 (venv at /home/azureuser/tribev2_env)
### What's Built
- Full analysis script: /home/azureuser/tribev2/close_the_loop_v2.py
- 8 phases: load model → extract per-layer features → brain parcellation → layer-wise encoding → modality ablation → RSA → divergence mapping → visualization
- Multimodal stimulus: /home/azureuser/multimodal_stimulus.mp4
- Results: /home/azureuser/loop_results_v2/
- Runs with video (V-JEPA2) + audio (Wav2Vec-BERT) + text (LLaMA 3.2-3B)
### Status
LLaMA 3.2 access granted. Full 3-modality analysis pipeline complete. Brain-guided ViT training attempted 5 times, all failed.
### Why Attempts Failed
- Never had real brain targets: routed ViT-Small features through TRIBE v2's projector (trained for V-JEPA2), producing random outputs
- Evaluated on wrong metric (classification accuracy instead of robustness)
- Literature shows brain-guided training helps ROBUSTNESS (+3-8%), not classification accuracy
### What Would Actually Work (from RESEARCH_BRIEF.md)
1. Pre-compute real brain targets using TRIBE v2's full pipeline
2. Train student with classification + per-vertex Pearson correlation brain loss
3. Evaluate on corruption/adversarial robustness, shape bias, brain-score, NOT accuracy
4. Or: use real fMRI data (Natural Scenes Dataset) instead of TRIBE v2 predictions
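A plain-Python sketch of the combined objective from step 2: classification loss plus a per-vertex Pearson correlation brain loss. Real training would do this on tensors in PyTorch. The loss form (1 minus mean correlation), the 0.5 weight, and the function names are assumptions for illustration, not taken from RESEARCH_BRIEF.md.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    denom = math.sqrt(vx * vy)
    return cov / denom if denom > 0 else 0.0

def brain_loss(pred, target):
    """pred/target are [samples][vertices]; returns 1 - mean per-vertex Pearson r."""
    num_vertices = len(pred[0])
    rs = [
        pearson([row[v] for row in pred], [row[v] for row in target])
        for v in range(num_vertices)
    ]
    return 1.0 - sum(rs) / num_vertices

def total_loss(classification_loss, pred, target, brain_weight=0.5):
    """Classification loss plus weighted brain-alignment term (weight is assumed)."""
    return classification_loss + brain_weight * brain_loss(pred, target)

# Toy batch: 3 samples, 2 vertices.
target = [[1.0, 2.0], [2.0, 4.0], [3.0, 5.0]]
aligned = [[2.0, 1.0], [4.0, 3.0], [6.0, 4.0]]   # perfectly correlated per vertex
flipped = [[-v for v in row] for row in target]  # perfectly anti-correlated

print(brain_loss(aligned, target))  # close to 0.0
print(brain_loss(flipped, target))  # close to 2.0
```

Correlation is computed across samples separately for each vertex, so the term only rewards matching the pattern of responses, not their scale.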
### Key Infrastructure
- Training scripts: /home/azureuser/brain_guided/train_*.py
- UCF-101 dataset: /home/azureuser/brain_guided/data/UCF-101 (13K videos)
- Results: /home/azureuser/brain_guided/results_final/
---
## Project 6: Instagram Cinema
**What:** AI-generated cinematic videos using LTX-2.3 22B on ComfyUI for Instagram growth.
**Setup:** LTX-2.3 22B dev model running on H100 via ComfyUI, exposed via cloudflared tunnel.
**Format:** Instagram Reels, 9:16 portrait, 544x960
**Goal:** Create viral-quality cinematic content for Instagram Reels.
---
## Money-Making Strategy (April 2026)
### Sellable Assets
1. **TurboQuant**: working implementation nobody else has publicly. Lead magnet for consulting.
2. **Parameter Golf**: competition result (if top placement) = massive credibility signal
3. **Fine-tuning expertise**: proven on H100, multiple model families
4. **Inference optimization consulting**: directly from TurboQuant benchmarks
### Immediate Plan
- Path to 10L: Freelancing/consulting (fine-tuning + inference optimization)
- Path to 1Cr: Productized consulting at scale or AI startup
- Channel: X (Twitter) for distribution, direct DMs to founders for sales
### X (Twitter) Growth Strategy
- Account: 10 followers currently, Premium purchased (₹213.50/month with 50% off)
- Strategy: 70% replies (to bigger accounts), 30% original posts
- Target: 15 strategic replies/day to accounts with 100-5000 followers
- Post timing: 6:30 PM IST (9:00 AM ET) on Tue/Wed/Thu
- Pinned thread: TurboQuant benchmarks
- Goal: 500 followers in 4 weeks, first paid client in 2-4 weeks
### Cold Outreach Template
"I noticed you're using [X model]. I can cut your inference cost by 40%. Free 1-week proof. Interested?"
### Target Clients
- Indian startups using LLMs in production (inc42 AI list)
- US startups from YC directory (AI/ML category, S24/W25 batches)
- Anyone on Twitter complaining about GPU costs / inference scaling
- Companies with >$10K/month GPU spend