Vivek Varikuti - Complete Profile & Project Portfolio
Who I Am
- 22 years old, AI Engineer & Startup Founder
- GitHub: vivekvar-dl
- Email: domainluther1234@gmail.com
- Strong Python/PyTorch/LLM skills, deep transformer training experience
- Hardware: 1x NVIDIA H100 NVL 96GB on Azure (NC40ads H100 v5)
- CUDA 12.8, PyTorch 2.7.0+cu128, flash-attn 2.8.3 (FA2)
- Transformers 5.4.0
Working Style
- No AI fluff. No menus of options. Make the decision and execute.
- Write like a human - no perfect grammar, no emojis, no "leveraging" or "seamless"
- Any public text must read like a tired developer typed it at 2am
- No co-authored-by Claude in git commits - public contributions look fully human
- Verify before claiming. Test before shipping. Always run the actual code.
Project 1: TurboQuant - KV Cache Compression
What: Implementing Google's TurboQuant paper (arXiv 2504.19874, Zandieh et al.) for KV cache compression during LLM inference.
Why: Compress KV cache ~4-7x on production LLMs to enable longer contexts and batching on H100 NVL (96GB).
Location: /home/azureuser/turboquant/
Status: Working prototype. Google hasn't released their code publicly, so this is one of the first working implementations.
Core Method: Mixed-precision quantization of the KV cache. Profile each layer's activation norms, identify outlier layers that need full precision, quantize the rest. No retraining, no fine-tuning: it's a drop-in replacement.
Key Discovery: Layer 0 (and sometimes the last layer) of Qwen models has anomalously large key norms (~16-50x the median). These layers must be kept in BF16 (skip_layers). An auto-calibration function detects outlier layers; the idea is sketched below.
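A minimal sketch of that calibration step. The helper name and the ratio threshold are assumptions, not the actual code in /home/azureuser/turboquant/:

```python
import torch

def find_outlier_layers(per_layer_key_norms, ratio_threshold=4.0):
    """Flag layers whose key-activation norm sits far above the across-layer
    median; those stay in BF16 (skip_layers), the rest get quantized.
    ratio_threshold is an illustrative value, not the repo's."""
    norms = torch.tensor(per_layer_key_norms, dtype=torch.float32)
    ratios = norms / norms.median()
    return (ratios > ratio_threshold).nonzero().flatten().tolist()

# With the Qwen2.5-7B profile below (median ~16.9, layer 0 at ~273.8,
# layer 27 at ~239.9) this flags layers 0 and 27.
```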
Benchmark Results (H100 NVL 96GB)
Model Architecture Summary
| Model | Architecture | KV Heads | head_dim | Outlier Layers | Prefill Fidelity |
|---|---|---|---|---|---|
| Qwen2.5-7B | 28L, qwen2 | 4 | 128 | layers 0, 27 | exact |
| Llama-3.1-8B | 32L, llama | 8 | 128 | none | exact |
| Gemma-2-9B | 42L, gemma2 | 8 | 256 | none | exact |
| Phi-4-14B | 40L, phi3 | 10 | 128 | none | exact |
| Qwen2.5-32B | 64L, qwen2 | 8 | 128 | none | exact |
| Llama-3.3-70B | 80L, llama | 8 | 128 | N/A (disk full) | N/A |
Memory Savings at 8K Context
| Model | Default VRAM | TurboQuant VRAM | Saved | KV Cache Reduction |
|---|---|---|---|---|
| Gemma-2-9B | 9.98 GB | 7.71 GB | 2,323 MB | ~59% |
| Qwen2.5-32B | 23.16 GB | 21.41 GB | 1,791 MB | ~47% |
| Phi-4-14B | 12.28 GB | 10.92 GB | 1,392 MB | ~44% |
| LLaMA-3.1-8B | 7.71 GB | 6.84 GB | 890 MB | ~44% |
| Qwen2.5-7B | 7.08 GB | 6.71 GB | 380 MB | ~44% |
Memory Savings Scaling (LLaMA-3.1-8B)
| Context Length | Default VRAM | TurboQuant VRAM | Saved |
|---|---|---|---|
| 1K tokens | 6.00 GB | 5.91 GB | 93 MB |
| 4K tokens | 6.67 GB | 6.27 GB | 417 MB |
| 8K tokens | 7.71 GB | 6.84 GB | 890 MB |
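A rough way to see why the savings grow with context: the full-precision KV cache scales linearly with sequence length. This is a back-of-envelope sketch from the architecture table (BF16, ignores allocator overhead and everything outside the cache), so it won't match the total-VRAM deltas exactly:

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bytes_per_elem=2):
    # one K and one V tensor per layer, BF16 = 2 bytes per element
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_elem

# LLaMA-3.1-8B from the architecture table: 32L, 8 KV heads, head_dim 128
for ctx in (1024, 4096, 8192):
    print(ctx, round(kv_cache_bytes(32, 8, 128, ctx) / 2**30, 3), "GiB")
# linear growth in context is why the saved VRAM climbs from tens of MB
# at 1K tokens to hundreds of MB at 8K
```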
Full Memory Data Per Model
Qwen2.5-7B (5.45 GB model)
- Layer norms: median 16.86, max 273.84 (layer 0), ratio 16.24x
- Outlier layers: 0 (norm 273.84), 27 (norm 239.91)
- 1K: 5.76 → 5.73 GB (37 MB saved)
- 4K: 6.27 → 6.10 GB (176 MB saved)
- 8K: 7.08 → 6.71 GB (380 MB saved)
LLaMA-3.1-8B (5.68 GB model)
- Layer norms: median 17.90, max 21.05 (layer 7), ratio 1.18x
- No outlier layers
- 1K: 6.00 → 5.91 GB (93 MB saved, output match)
- 4K: 6.67 → 6.27 GB (417 MB saved, output match)
- 8K: 7.71 → 6.84 GB (890 MB saved, output match)
Gemma-2-9B (6.08 GB model)
- Layer norms: median 17.82, max 21.28 (layer 25), ratio 1.19x
- No outlier layers
- 1K: 6.62 → 6.38 GB (244 MB saved)
- 4K: 7.96 → 6.89 GB (1,096 MB saved)
- 8K: 9.98 → 7.71 GB (2,323 MB saved)
Phi-4-14B (9.10 GB model)
- Layer norms: median 19.21, max 26.46 (layer 0), ratio 1.38x
- No outlier layers
- 1K: 9.75 → 9.61 GB (146 MB saved)
- 4K: 10.72 → 10.09 GB (650 MB saved)
- 8K: 12.28 → 10.92 GB (1,392 MB saved)
Qwen2.5-32B (19.31 GB model)
- Layer norms: median 16.09, max 37.82 (layer 0), ratio 2.35x
- No outlier layers
- 1K: 19.97 → 19.79 GB (186 MB saved)
- 4K: 21.23 → 20.42 GB (833 MB saved)
- 8K: 23.16 → 21.41 GB (1,791 MB saved)
LLaMA-3.3-70B: failed with "No space left on device"
Quality Verification
All models tested with 3 prompts: "Explain quantum computing", "Write a Python prime checker", "What causes northern lights?"
- Prefill logit difference: 0.0 across ALL models (a sketch of the check follows the per-model detail)
- Same top-1 token prediction: 100% across ALL models
- Output coherence: 100% - both default and TurboQuant outputs fully coherent
- Token match rate varies (18-100%) due to natural autoregressive sampling divergence - both outputs equally valid
Detailed quality per model:
- Qwen2.5-7B: token match 39%, 3%, 54% - both coherent on all 3 prompts
- LLaMA-3.1-8B: token match 89.1%, 100%, 100% - 2/3 exact match
- Phi-4-14B: token match 100%, 44%, 100% - 2/3 exact match
- Gemma-2-9B: token match 100%, 100%, 18.8% - 2/3 exact match
- Qwen2.5-32B: token match 71%, 25%, 53% - both coherent on all 3 prompts
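Sketch of the kind of check behind the prefill numbers, assuming the compressed cache exposes the standard HF Cache interface; names here are illustrative, not the repo's actual test harness:

```python
import torch

@torch.no_grad()
def prefill_fidelity(model, input_ids, quantized_cache):
    """Compare prefill logits with the default cache vs. the TurboQuant
    cache: max absolute logit difference plus top-1 agreement at the
    last position."""
    ref = model(input_ids).logits
    out = model(input_ids, past_key_values=quantized_cache).logits
    max_diff = (ref - out).abs().max().item()
    top1_match = bool((ref[:, -1].argmax(-1) == out[:, -1].argmax(-1)).all())
    return max_diff, top1_match
```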
Infrastructure Notes
- Environment: torch 2.7.0+cu128, transformers 5.4.0, H100 NVL CUDA 12.8 (driver 570)
- PyTorch compiled for CUDA 13.0+ won't work - need the cu128 wheel
- Core quantizer verified (MSE matches paper bounds)
- Cache integrates with HF Transformers v5.4.0 QuantizedLayer API
Project 2: Parameter Golf Competition (OpenAI)
What: OpenAI competition - train the best language model within a 16MB artifact and 10 minutes on 8xH100.
Metric: Bits-per-byte (BPB) on FineWeb validation (62M tokens sp1024, 45.5M tokens sp4096)
Timeline: March 18 - April 30, 2026
Current SOTA (merged): 1.1194 BPB (PR #549, LeakyReLU^2 + TTT + Parallel Muon)
Our Edge: sp4096 Vocabulary
- sp4096 tokens_per_byte: 0.3063 vs sp1024: 0.4149 → 26.2% fewer tokens
- Baseline A/B test (400 steps): sp4096 = 1.6208 BPB vs sp1024 = 1.7144 BPB → -5.5%
- #1 arch A/B test (400 steps, seed 42): sp4096+factored = 1.8693 BPB vs sp1024 = 2.0067 BPB → -6.8%
- Extrapolated SOTA: 1.1194 × 0.93 ≈ 1.04-1.06 BPB (back-of-envelope after this list)
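The back-of-envelope behind those numbers: BPB is per-token cross-entropy converted to bits, scaled by tokens per byte, so a smaller tokens_per_byte helps as long as per-token loss doesn't rise proportionally. A sketch, not the official scorer:

```python
import math

def bits_per_byte(ce_loss_nats_per_token, tokens_per_byte):
    # BPB = (cross-entropy per token, in bits) x (tokens emitted per byte)
    return ce_loss_nats_per_token / math.log(2) * tokens_per_byte

# sp4096 emits 26.2% fewer tokens per byte (0.3063 vs 0.4149); each token
# must carry more information, but the 400-step A/Bs above still landed
# 5-7% lower in BPB overall.
print(round(1.1194 * 0.93, 3))  # ~1.041, the basis for the 1.04-1.06 range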
Architecture
- 11L, 512d, 8H, 4KV, 3x MLP, LeakyReLU(0.5)^2
- Factored embeddings: tok_emb(4096x256) + embed_up(256→512) + embed_down(512→256) - sketched after this list
- All tricks from #1 submission: XSA, Partial RoPE, LN Scale, SmearGate, BigramHash, EMA, TTT, GPTQ-lite
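Sketch of the factored embedding with the shapes from the bullet above; how the output head ties back into tok_emb is an assumption, the competition code may wire it differently:

```python
import torch.nn as nn

class FactoredEmbedding(nn.Module):
    """4096-token vocab embedded at rank 256, projected up to the 512-d
    residual stream; embed_down mirrors it on the output side."""
    def __init__(self, vocab=4096, rank=256, d_model=512):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab, rank)                 # 4096 x 256
        self.embed_up = nn.Linear(rank, d_model, bias=False)     # 256 -> 512
        self.embed_down = nn.Linear(d_model, rank, bias=False)   # 512 -> 256

    def embed(self, idx):
        return self.embed_up(self.tok_emb(idx))

    def logits(self, hidden):
        # project back to rank, then reuse tok_emb weights as the unembedding
        return self.embed_down(hidden) @ self.tok_emb.weight.t()
```

Back-of-envelope: this keeps the embedding side at roughly 1.3M params instead of ~2.1M for a full tied 4096x512 table, which matters under the 16MB artifact cap.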
Key Files
- our_submission/train_gpt.py - modified #1 with sp4096 + factored embed + FA2 fallback
- our_submission/train_gpt_original.py - unmodified #1 with FA2 fallback
- train_sp4096.py - tokenizer training + data sharding script
- data/tokenizers/fineweb_4096_bpe.model - trained sp4096 tokenizer
- data/datasets/fineweb10B_sp4096/ - 80 train shards + 1 val shard
N-gram Cache: CONFIRMED FAKE
- 256M bucket experiment: collision-free hash tables give 1.11 BPB (no improvement)
- All sub-1.0 BPB claims are measurement artifacts from hash collisions
- Valid Dirichlet smoothing gives at most ~0.002-0.005 genuine improvement
Next Steps
- Medium fidelity run (10min 1xH100)
- Int5 MLP quantization (saves ~1.86MB for artifact budget headroom)
- Get 8xH100 access for final submission (compute grant or RunPod)
- Temperature scaling, document-isolated TTT for extra gains
Hardware
- Dev: 1xH100 NVL (Azure NC40ads H100 v5), 96GB VRAM, CUDA 12.8, PyTorch 2.9.1+cu128
- flash-attn 2.8.3 (FA2, not FA3)
- Final submission needs 8xH100
Project 3: GSoC 2026 - DeepChem OLMo Wrapper
What: Adding OLMo-2 7B LLM support to DeepChem for molecular property prediction and SMILES generation.
Org: DeepChem (standalone org for the first time in GSoC 2026)
Mentors: Riya, Harindhar
Deadline: March 31, 2026 18:00 UTC (proposal submitted)
What Was Built
PR #4913 (LIVE) - Bug Fix
- Fixed ChemBERTa's broken import for transformers 5.x
- transformers.models.roberta.tokenization_roberta_fast was removed in 5.x
- 3 additions / 4 deletions (the shape of the fix is sketched below the link)
- https://github.com/deepchem/deepchem/pull/4913
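Roughly the pattern such a fix takes - import the fast tokenizer from the top-level package instead of the removed submodule path. This is a hedged sketch, not the literal diff in PR #4913:

```python
# hypothetical compat pattern, not the actual PR #4913 change
try:
    from transformers import RobertaTokenizerFast
except ImportError:
    # fall back to the old submodule path on older transformers releases
    from transformers.models.roberta.tokenization_roberta_fast import (
        RobertaTokenizerFast,
    )
```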
Issue #4912 (LIVE) - Compat Report
- Broader transformers 5.x compatibility issues documented
- https://github.com/deepchem/deepchem/issues/4912
OLMo Wrapper (LOCAL ONLY - not pushed)
- Files at ~/olmo_draft/olmo.py and ~/olmo_draft/test_olmo.py
- Olmo2ForSequenceClassification - built from scratch (doesn't exist in HF); the general shape is sketched after this list
- OLMo wrapper class extending HuggingFaceModel
- Added causal_lm task + generate() to base HuggingFaceModel
- 8/8 tests pass in 27 seconds on CPU
- Uses OLMo-2 (allenai/OLMo-2-1124-7B)
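The general shape of that from-scratch classifier: an OLMo-2 backbone with a linear head pooled at the last non-padded token. Names, pooling, and label count are assumptions; the local draft in ~/olmo_draft/olmo.py may differ:

```python
import torch
import torch.nn as nn
from transformers import AutoModel

class Olmo2SequenceClassifier(nn.Module):
    """Backbone + linear head, standing in for the missing
    Olmo2ForSequenceClassification (hypothetical sketch)."""
    def __init__(self, model_name="allenai/OLMo-2-1124-7B", num_labels=2):
        super().__init__()
        self.backbone = AutoModel.from_pretrained(model_name)
        self.score = nn.Linear(self.backbone.config.hidden_size, num_labels)

    def forward(self, input_ids, attention_mask):
        hidden = self.backbone(
            input_ids, attention_mask=attention_mask
        ).last_hidden_state
        # pool the last non-padded position, like HF's causal-LM classifiers
        last = attention_mask.sum(dim=-1) - 1
        pooled = hidden[torch.arange(hidden.size(0)), last]
        return self.score(pooled)
```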
Experiments Run
- BBBP classification: ROC-AUC 0.67 (random init, 12.9M params, 200 samples)
- ESOL regression: RΒ² 0.37, MAE 1.27
- SMILES generation: 0% validity (proves pretraining is core challenge)
- Tokenization analysis: OLMo 0.9x tokens vs ChemBERTa, but fragments stereocenters
Proposal
- ~/gsoc_proposal_final.md - human-written version
- ~/gsoc_proposal_content.md - raw technical reference
Key Context
- PR #4907 by Aditya-ad48 also adds causal LM generation - complementary, not competing
- DeepChem wants small PRs (<50 lines) for new contributors
- rbharath is the main reviewer/maintainer
- Office hours MWF 9am PST
- Discord: https://discord.gg/RYTrUY8Ssn
Project 4: Genesis - Artificial Life Simulation
What: Virtual world where blank GRU neural-net agents evolve survival behaviors from scratch (foraging, water-seeking, communication) on an H100 GPU using JAX.
Location: /home/azureuser/genesis/ (venv at ~/genesis_env/)
World Setup
- 512x512 grid with Perlin noise terrain
- Food regrowth, water sources, day/night cycles, seasons
- 1000 agents with GRU brains (~82K params each)
- Tournament selection + Gaussian mutation (self-adaptive sigma) - sketched after this list
- Agents start with zero knowledge - must learn to survive
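Sketch of the self-adaptive mutation step referenced above; tau and the array shapes are assumptions, mutation.py may differ:

```python
import jax
import jax.numpy as jnp

def mutate(key, params, sigma, tau=0.05):
    """Self-adaptive Gaussian mutation: each agent's own sigma drifts
    log-normally, then its flattened genome gets N(0, sigma) noise.
    params: (population, n_params), sigma: (population,)."""
    k_sigma, k_noise = jax.random.split(key)
    new_sigma = sigma * jnp.exp(tau * jax.random.normal(k_sigma, sigma.shape))
    noise = jax.random.normal(k_noise, params.shape) * new_sigma[:, None]
    return params + noise, new_sigma
```

This form is already batched over the population, so it jit-compiles alongside the vmapped GRU brains without extra plumbing.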
Status (2026-04-01)
Phases 1-3 complete. 500K step run finished successfully:
- 86 generations evolved
- Agents sustain avg age 3,742 steps, energy 0.98, hydration 0.79
- Signal entropy dropping (4.28 → 3.58), indicating early communication structure (entropy sketch after this list)
- Simulation runs at ~1000 steps/s on H100 (JAX jit-compiled)
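One way the entropy number could be computed; the binning and value range here are assumptions and emergence.py may normalize differently:

```python
import jax.numpy as jnp

def signal_entropy(signals, num_bins=32, value_range=(0.0, 1.0)):
    """Shannon entropy (bits) of emitted signal values, pooled over agents
    and the 8 channels. A drop (4.28 -> 3.58 here) means signals are
    concentrating on fewer distinct values."""
    hist, _ = jnp.histogram(signals.reshape(-1), bins=num_bins,
                            range=value_range)
    p = hist / jnp.maximum(hist.sum(), 1)
    p = jnp.where(p > 0, p, 1.0)  # log2(1) = 0, so empty bins contribute 0
    return float(-(p * jnp.log2(p)).sum())
```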
Key Fix
food_growth_rate bumped from 0.005 → 0.02 and food_eat_amount from 0.05 → 0.03 to prevent ecological collapse at high generations.
Architecture
- World: grid.py, resources.py, environment.py, physics.py, observations.py, spatial.py
- Agent: body.py (metabolism), brain.py (GRU + vmap batched), actions.py
- Evolution: fitness.py, selection.py, mutation.py (self-adaptive sigma), population.py
- Communication: signals.py (8-channel, spatial attenuation, top-4 reception)
- Analysis: emergence.py (signal entropy, magnitude, RΒ², diversity, clustering)
- Visualization: renderer.py (dashboard, world map, zoom views)
Run Data
~/genesis/runs/run_20260401_111309/ - metrics.csv (500 rows), emergence.csv (100 rows), 50 viz frames, 10 checkpoints + FINAL, config.json
Next Steps
- Phase 4: TRIBE v2 integration - compare evolved GRU representations to human brain activity via RSA
- Phase 5: Scale to 5K+ agents, longer runs for 500+ generations
- Checkpoints at 50K intervals allow comparing brain representations across evolutionary time
Project 5: TRIBE v2 - AI-Brain Loop
What: Closing the AI-brain comparison loop using Meta's TRIBE v2 - comparing AI encoder representations to predicted brain activity to find architectural gaps.
Location: /home/azureuser/tribev2 (venv at /home/azureuser/tribev2_env)
What's Built
- Full analysis script: /home/azureuser/tribev2/close_the_loop_v2.py
- 8 phases: load model → extract per-layer features → brain parcellation → layer-wise encoding → modality ablation → RSA (sketched after this list) → divergence mapping → visualization
- Multimodal stimulus: /home/azureuser/multimodal_stimulus.mp4
- Results: /home/azureuser/loop_results_v2/
- Runs with video (V-JEPA2) + audio (Wav2Vec-BERT) + text (LLaMA 3.2-3B)
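The RSA step from that pipeline, reduced to its core comparison; the real close_the_loop_v2.py works per layer and per parcel, this is just the shape of it:

```python
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr

def rsa_score(model_feats, brain_preds):
    """Representational similarity analysis: build a dissimilarity matrix
    over stimuli in each space, then rank-correlate the two.
    model_feats: (n_stimuli, d_model), brain_preds: (n_stimuli, n_vertices)."""
    rdm_model = pdist(model_feats, metric="correlation")
    rdm_brain = pdist(brain_preds, metric="correlation")
    rho, _ = spearmanr(rdm_model, rdm_brain)
    return rho
```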
Status
LLaMA 3.2 access granted. Full 3-modality analysis pipeline complete. Brain-guided ViT training attempted 5 times - all failed.
Why Attempts Failed
- Never had real brain targets - routed ViT-Small features through TRIBE v2's projector (trained for V-JEPA2), producing random outputs
- Evaluated on wrong metric (classification accuracy instead of robustness)
- Literature shows brain-guided training helps ROBUSTNESS (+3-8%), not classification accuracy
What Would Actually Work (from RESEARCH_BRIEF.md)
- Pre-compute real brain targets using TRIBE v2's full pipeline
- Train student with classification + per-vertex Pearson correlation brain loss (sketched after this list)
- Evaluate on corruption/adversarial robustness, shape bias, brain-score - NOT accuracy
- Or: use real fMRI data (Natural Scenes Dataset) instead of TRIBE v2 predictions
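Sketch of that combined loss; alpha and how the student's features get projected into vertex space are assumptions, not what RESEARCH_BRIEF.md specifies:

```python
import torch
import torch.nn.functional as F

def brain_guided_loss(logits, labels, student_vertices, brain_targets,
                      alpha=0.5):
    """Classification loss plus a per-vertex Pearson-correlation brain loss.
    student_vertices, brain_targets: (batch, n_vertices); brain_targets
    would be the pre-computed TRIBE v2 predictions described above."""
    cls_loss = F.cross_entropy(logits, labels)
    x = student_vertices - student_vertices.mean(dim=0, keepdim=True)
    y = brain_targets - brain_targets.mean(dim=0, keepdim=True)
    r = (x * y).sum(dim=0) / (x.norm(dim=0) * y.norm(dim=0) + 1e-8)
    brain_loss = 1.0 - r.mean()  # push per-vertex correlation toward 1
    return cls_loss + alpha * brain_loss
```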
Key Infrastructure
- Training scripts: /home/azureuser/brain_guided/train_*.py
- UCF-101 dataset: /home/azureuser/brain_guided/data/UCF-101 (13K videos)
- Results: /home/azureuser/brain_guided/results_final/
Project 6: Instagram Cinema
What: AI-generated cinematic videos using LTX-2.3 22B on ComfyUI for Instagram growth.
Setup: LTX-2.3 22B dev model running on H100 via ComfyUI, exposed via cloudflared tunnel.
Format: Instagram Reels β 9:16 portrait, 544x960
Goal: Create viral-quality cinematic content for Instagram Reels.
Money-Making Strategy (April 2026)
Sellable Assets
- TurboQuant - working implementation nobody else has publicly. Lead magnet for consulting.
- Parameter Golf - competition result (if top placement) = massive credibility signal
- Fine-tuning expertise - proven on H100, multiple model families
- Inference optimization consulting - directly from TurboQuant benchmarks
Immediate Plan
- Path to 10L: freelancing/consulting - fine-tuning + inference optimization
- Path to 1Cr: Productized consulting at scale or AI startup
- Channel: X (Twitter) for distribution, direct DMs to founders for sales
X (Twitter) Growth Strategy
- Account: 10 followers currently, Premium purchased (213.50/month with 50% off)
- Strategy: 70% replies (to bigger accounts), 30% original posts
- Target: 15 strategic replies/day to accounts with 100-5000 followers
- Post timing: 6:30 PM IST (9:00 AM EST) on Tue/Wed/Thu
- Pinned thread: TurboQuant benchmarks
- Goal: 500 followers in 4 weeks, first paid client in 2-4 weeks
Cold Outreach Template
"I noticed you're using [X model]. I can cut your inference cost by 40%. Free 1-week proof. Interested?"
Target Clients
- Indian startups using LLMs in production (inc42 AI list)
- US startups from YC directory (AI/ML category, S24/W25 batches)
- Anyone on Twitter complaining about GPU costs / inference scaling
- Companies with >$10K/month GPU spend