# Vivek Varikuti – Complete Profile & Project Portfolio

## Who I Am

- 22 years old, AI Engineer & Startup Founder
- GitHub: vivekvar-dl
- Email: domainluther1234@gmail.com
- Strong Python/PyTorch/LLM skills, deep transformer training experience
- Hardware: 1x NVIDIA H100 NVL 96GB on Azure (NC40ads H100 v5)
- CUDA 12.8, PyTorch 2.7.0+cu128, flash-attn 2.8.3 (FA2)
- Transformers 5.4.0
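
Quick env sanity check after any reinstall – nothing fancy, just confirms the wheels match the versions above:

```python
import torch
import flash_attn
import transformers

# should print 2.7.0+cu128 / 12.8 / NVIDIA H100 NVL / 2.8.3 / 5.4.0
print(torch.__version__, torch.version.cuda, torch.cuda.get_device_name())
print(flash_attn.__version__, transformers.__version__)
```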

---

## Working Style

- No AI fluff. No menus of options. Make the decision and execute.
- Write like a human – no perfect grammar, no emojis, no "leveraging" or "seamless"
- Any public text must read like a tired developer typed it at 2am
- No co-authored-by Claude in git commits – public contributions look fully human
- Verify before claiming. Test before shipping. Always run the actual code.

---

## Project 1: TurboQuant – KV Cache Compression

**What:** Implementing Google's TurboQuant paper (arXiv 2504.19874, Zandieh et al.) for KV cache compression during LLM inference.

**Why:** Compress the KV cache ~4-7x on production LLMs to enable longer contexts and bigger batches on the H100 NVL (96GB).

**Location:** /home/azureuser/turboquant/

**Status:** Working prototype. Google hasn't released their code publicly – this is one of the first working implementations.

**Core Method:** Mixed-precision quantization of the KV cache. Profile each layer's activation norms, identify outlier layers that need full precision, quantize the rest. No retraining, no fine-tuning – drop-in replacement.

**Key Discovery:** Layer 0 (and sometimes the last layer) of Qwen models has anomalously large key norms (~16-50x the median). These layers must be kept in BF16 (skip_layers). An auto-calibration function detects outlier layers.
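
The calibration logic reduces to something like this – a simplified sketch, not the exact code in the repo (the 10x threshold is illustrative; anything between the ~2.4x benign max and the ~16x Qwen outliers would separate them):

```python
import torch

def find_outlier_layers(key_norms, ratio_threshold=10.0):
    """Flag layers whose mean key-activation norm sits far above the median.

    key_norms[i] = mean L2 norm of layer i's keys from a short calibration
    forward pass. Flagged layers stay in BF16 (skip_layers); the rest get
    quantized.
    """
    norms = torch.tensor(key_norms)
    median = norms.median()
    return [i for i, n in enumerate(norms) if n / median > ratio_threshold]

# Qwen2.5-7B calibration: median 16.86, so layer 0 at 273.84 (16.2x) and
# layer 27 at 239.91 (14.2x) get flagged. Qwen2.5-32B's worst ratio is
# only 2.35x, so nothing gets flagged there.
```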

### Benchmark Results (H100 NVL 96GB)

#### Model Architecture Summary

| Model | Architecture | KV Heads | head_dim | Outlier Layers | Prefill Fidelity |
|-------|-------------|----------|----------|----------------|------------------|
| Qwen2.5-7B | 28L, qwen2 | 4 | 128 | layers 0, 27 | exact |
| Llama-3.1-8B | 32L, llama | 8 | 128 | none | exact |
| Gemma-2-9B | 42L, gemma2 | 8 | 256 | none | exact |
| Phi-4-14B | 40L, phi3 | 10 | 128 | none | exact |
| Qwen2.5-32B | 64L, qwen2 | 8 | 128 | none | exact |
| Llama-3.3-70B | 80L, llama | 8 | 128 | N/A (disk full) | N/A |

#### Memory Savings at 8K Context

| Model | Default VRAM | TurboQuant VRAM | Saved | KV Cache Reduction |
|-------|--------------|-----------------|-------|--------------------|
| Gemma-2-9B | 9.98 GB | 7.71 GB | 2,323 MB | ~59% |
| Qwen2.5-32B | 23.16 GB | 21.41 GB | 1,791 MB | ~47% |
| Phi-4-14B | 12.28 GB | 10.92 GB | 1,392 MB | ~44% |
| LLaMA-3.1-8B | 7.71 GB | 6.84 GB | 890 MB | ~44% |
| Qwen2.5-7B | 7.08 GB | 6.71 GB | 380 MB | ~44% |

#### Memory Savings Scaling (LLaMA-3.1-8B)

| Context Length | Default VRAM | TurboQuant VRAM | Saved |
|----------------|--------------|-----------------|-------|
| 1K tokens | 6.00 GB | 5.91 GB | 93 MB |
| 4K tokens | 6.67 GB | 6.27 GB | 417 MB |
| 8K tokens | 7.71 GB | 6.84 GB | 890 MB |

#### Full Memory Data Per Model

**Qwen2.5-7B (5.45 GB model)**
- Layer norms: median 16.86, max 273.84 (layer 0), ratio 16.24x
- Outlier layers: 0 (norm 273.84), 27 (norm 239.91)
- 1K: 5.76 → 5.73 GB (37 MB saved)
- 4K: 6.27 → 6.10 GB (176 MB saved)
- 8K: 7.08 → 6.71 GB (380 MB saved)

**LLaMA-3.1-8B (5.68 GB model)**
- Layer norms: median 17.90, max 21.05 (layer 7), ratio 1.18x
- No outlier layers
- 1K: 6.00 → 5.91 GB (93 MB saved, output match)
- 4K: 6.67 → 6.27 GB (417 MB saved, output match)
- 8K: 7.71 → 6.84 GB (890 MB saved, output match)

**Gemma-2-9B (6.08 GB model)**
- Layer norms: median 17.82, max 21.28 (layer 25), ratio 1.19x
- No outlier layers
- 1K: 6.62 → 6.38 GB (244 MB saved)
- 4K: 7.96 → 6.89 GB (1,096 MB saved)
- 8K: 9.98 → 7.71 GB (2,323 MB saved)

**Phi-4-14B (9.10 GB model)**
- Layer norms: median 19.21, max 26.46 (layer 0), ratio 1.38x
- No outlier layers
- 1K: 9.75 → 9.61 GB (146 MB saved)
- 4K: 10.72 → 10.09 GB (650 MB saved)
- 8K: 12.28 → 10.92 GB (1,392 MB saved)

**Qwen2.5-32B (19.31 GB model)**
- Layer norms: median 16.09, max 37.82 (layer 0), ratio 2.35x
- No outlier layers
- 1K: 19.97 → 19.79 GB (186 MB saved)
- 4K: 21.23 → 20.42 GB (833 MB saved)
- 8K: 23.16 → 21.41 GB (1,791 MB saved)

**LLaMA-3.3-70B** – failed with "No space left on device"

#### Quality Verification

All models tested with 3 prompts: "Explain quantum computing", "Write a Python prime checker", "What causes northern lights?"

- Prefill logit difference: 0.0 across ALL models (measured as sketched below)
- Same top-1 token prediction: 100% across ALL models
- Output coherence: 100% – both default and TurboQuant outputs fully coherent
- Token match rate varies (18-100%) due to natural autoregressive sampling divergence – both outputs equally valid
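
The prefill fidelity check is essentially this (a sketch, not the actual harness – `quant_cache` stands in for however the TurboQuant cache object gets constructed):

```python
import torch

@torch.no_grad()
def prefill_fidelity(model, input_ids, quant_cache):
    """Compare prefill logits with the default cache vs. the quantized one."""
    ref = model(input_ids).logits
    out = model(input_ids, past_key_values=quant_cache).logits
    max_diff = (ref - out).abs().max().item()          # 0.0 = exact prefill
    top1_same = bool((ref[:, -1].argmax(-1) == out[:, -1].argmax(-1)).all())
    return max_diff, top1_same
```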

**Detailed quality per model:**

- Qwen2.5-7B: token match 39%, 3%, 54% – both coherent on all 3 prompts
- LLaMA-3.1-8B: token match 89.1%, 100%, 100% – 2/3 exact match
- Phi-4-14B: token match 100%, 44%, 100% – 2/3 exact match
- Gemma-2-9B: token match 100%, 100%, 18.8% – 2/3 exact match
- Qwen2.5-32B: token match 71%, 25%, 53% – both coherent on all 3 prompts

#### Infrastructure Notes

- Environment: torch 2.7.0+cu128, transformers 5.4.0, H100 NVL CUDA 12.8 (driver 570)
- PyTorch compiled for CUDA 13.0+ won't work – need the cu128 wheel
- Core quantizer verified (MSE matches paper bounds)
- Cache integrates with HF Transformers v5.4.0 QuantizedLayer API

---

## Project 2: Parameter Golf Competition (OpenAI)

**What:** OpenAI competition – train the best language model within a 16MB artifact, 10 minutes on 8xH100.

**Metric:** Bits-per-byte (BPB) on FineWeb validation (62M tokens sp1024, 45.5M tokens sp4096)

**Timeline:** March 18 - April 30, 2026

**Current SOTA (merged):** 1.1194 BPB (PR #549, LeakyReLU^2 + TTT + Parallel Muon)

### Our Edge: sp4096 Vocabulary

- sp4096 tokens_per_byte: 0.3063 vs sp1024: 0.4149 → 26.2% fewer tokens
- Baseline A/B test (400 steps): sp4096 = 1.6208 BPB vs sp1024 = 1.7144 BPB → -5.5%
- #1 arch A/B test (400 steps, seed 42): sp4096+factored = 1.8693 BPB vs sp1024 = 2.0067 BPB → -6.8%
- Extrapolated SOTA: 1.1194 × 0.93 ≈ 1.04-1.06 BPB
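
For the record, the BPB arithmetic that makes the vocab swap attractive (standard conversion, nothing project-specific):

```python
import math

def bits_per_byte(nats_per_token: float, tokens_per_byte: float) -> float:
    # per-token cross-entropy converted to bits, scaled by token density
    return (nats_per_token / math.log(2)) * tokens_per_byte

# With 26.2% fewer tokens per byte, sp4096 can afford ~35% higher per-token
# loss (in nats) and still break even on BPB:
#   sp1024: 1.62 BPB -> loss ≈ 1.62 * ln(2) / 0.4149 ≈ 2.71 nats/token
#   sp4096: 1.62 BPB -> loss ≈ 1.62 * ln(2) / 0.3063 ≈ 3.67 nats/token
```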

### Architecture

- 11L, 512d, 8H, 4KV, 3x MLP, LeakyReLU(0.5)^2
- Factored embeddings: tok_emb(4096x256) + embed_up(256→512) + embed_down(512→256) – see sketch below
- All tricks from #1 submission: XSA, Partial RoPE, LN Scale, SmearGate, BigramHash, EMA, TTT, GPTQ-lite
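
How the factored embedding wires up, as a minimal PyTorch sketch – one assumption flagged: the output head here ties to tok_emb's weight, which the notes above don't actually state:

```python
import torch
import torch.nn as nn

class FactoredEmbedding(nn.Module):
    """tok_emb(4096x256) + embed_up(256->512) + embed_down(512->256).

    Assumption: output logits reuse tok_emb.weight (tied), so the whole
    in/out embedding costs ~1.3M params vs ~4.2M for untied full-rank.
    """
    def __init__(self, vocab=4096, rank=256, d_model=512):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab, rank)
        self.embed_up = nn.Linear(rank, d_model, bias=False)
        self.embed_down = nn.Linear(d_model, rank, bias=False)

    def embed(self, ids):    # (B, T) -> (B, T, 512)
        return self.embed_up(self.tok_emb(ids))

    def logits(self, h):     # (B, T, 512) -> (B, T, 4096)
        return self.embed_down(h) @ self.tok_emb.weight.T
```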

### Key Files

- our_submission/train_gpt.py – modified #1 with sp4096 + factored embed + FA2 fallback
- our_submission/train_gpt_original.py – unmodified #1 with FA2 fallback
- train_sp4096.py – tokenizer training + data sharding script
- data/tokenizers/fineweb_4096_bpe.model – trained sp4096 tokenizer
- data/datasets/fineweb10B_sp4096/ – 80 train shards + 1 val shard

### N-gram Cache: CONFIRMED FAKE

- 256M bucket experiment: collision-free hash tables give 1.11 BPB (no improvement)
- All sub-1.0 BPB claims are measurement artifacts from hash collisions
- Valid Dirichlet smoothing gives at most ~0.002-0.005 genuine improvement

### Next Steps

1. Medium fidelity run (10min 1xH100)
2. Int5 MLP quantization (saves ~1.86MB for artifact budget headroom)
3. Get 8xH100 access for final submission (compute grant or RunPod)
4. Temperature scaling, document-isolated TTT for extra gains

### Hardware

- Dev: 1xH100 NVL (Azure NC40ads H100 v5), 96GB VRAM, CUDA 12.8, PyTorch 2.9.1+cu128
- flash-attn 2.8.3 (FA2, not FA3)
- Final submission needs 8xH100

---

## Project 3: GSoC 2026 – DeepChem OLMo Wrapper

**What:** Adding OLMo-2 7B LLM support to DeepChem for molecular property prediction and SMILES generation.

**Org:** DeepChem (participating as a standalone org for the first time in GSoC 2026)
**Mentors:** Riya, Harindhar
**Deadline:** March 31, 2026 18:00 UTC (submitted)

### What Was Built

**PR #4913 (LIVE) – Bug Fix**
- Fixed ChemBERTa's broken import on transformers 5.x
- `transformers.models.roberta.tokenization_roberta_fast` was removed in 5.x
- 3 additions / 4 deletions
- https://github.com/deepchem/deepchem/pull/4913
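
Roughly the shape of the fix (not the literal 3-line diff – see the PR for that; the checkpoint name is just for illustration): import from the stable top-level namespace instead of the internal module path that 5.x dropped.

```python
# Before (breaks on transformers 5.x – the internal module is gone):
#   from transformers.models.roberta.tokenization_roberta_fast import RobertaTokenizerFast

# After – top-level import, stable across 4.x and 5.x:
from transformers import RobertaTokenizerFast

tokenizer = RobertaTokenizerFast.from_pretrained("seyonec/ChemBERTa-zinc-base-v1")
```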

**Issue #4912 (LIVE) – Compat Report**
- Broader transformers 5.x compatibility issues documented
- https://github.com/deepchem/deepchem/issues/4912

**OLMo Wrapper (LOCAL ONLY – not pushed)**
- Files at ~/olmo_draft/olmo.py and ~/olmo_draft/test_olmo.py
- Olmo2ForSequenceClassification – built from scratch, doesn't exist in HF (rough sketch below)
- OLMo wrapper class extending HuggingFaceModel
- Added causal_lm task + generate() to base HuggingFaceModel
- 8/8 tests pass in 27 seconds on CPU
- Uses OLMo-2 (allenai/OLMo-2-1124-7B)
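
The from-scratch classification head follows the standard HF causal-LM pattern (pool at the last real token). A minimal sketch of the idea, not the draft file itself:

```python
import torch
import torch.nn as nn
from transformers import AutoModel

class Olmo2ForSequenceClassification(nn.Module):
    """Minimal version: OLMo-2 backbone + linear score head, pooled at
    each sequence's last non-padding token (assumes right-padding)."""
    def __init__(self, name="allenai/OLMo-2-1124-7B", num_labels=2):
        super().__init__()
        self.backbone = AutoModel.from_pretrained(name)
        self.score = nn.Linear(self.backbone.config.hidden_size, num_labels, bias=False)

    def forward(self, input_ids, attention_mask):
        h = self.backbone(input_ids, attention_mask=attention_mask).last_hidden_state
        last = attention_mask.sum(dim=1) - 1            # index of last real token
        pooled = h[torch.arange(h.size(0), device=h.device), last]
        return self.score(pooled)                       # (B, num_labels)
```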

### Experiments Run

- BBBP classification: ROC-AUC 0.67 (random init, 12.9M params, 200 samples)
- ESOL regression: R² 0.37, MAE 1.27
- SMILES generation: 0% validity (proves pretraining is the core challenge)
- Tokenization analysis: OLMo uses 0.9x the tokens of ChemBERTa, but fragments stereocenters

### Proposal

- ~/gsoc_proposal_final.md – human-written version
- ~/gsoc_proposal_content.md – raw technical reference

### Key Context

- PR #4907 by Aditya-ad48 also adds causal LM generation – complementary, not competing
- DeepChem wants small PRs (<50 lines) from new contributors
- rbharath is the main reviewer/maintainer
- Office hours MWF 9am PST
- Discord: https://discord.gg/RYTrUY8Ssn

---

## Project 4: Genesis – Artificial Life Simulation

**What:** Virtual world where blank GRU neural net agents evolve survival behaviors from scratch – foraging, water-seeking, communication – on an H100 GPU using JAX.

**Location:** /home/azureuser/genesis/ (venv at ~/genesis_env/)

### World Setup

- 512x512 grid with Perlin noise terrain
- Food regrowth, water sources, day/night cycles, seasons
- 1000 agents with GRU brains (~82K params each)
- Tournament selection + Gaussian mutation (self-adaptive sigma – sketched below)
- Agents start with zero knowledge – must learn to survive
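
Self-adaptive means each genome carries its own mutation step-size that evolves along with the weights. Not the actual mutation.py, just the core idea (the tau value is illustrative):

```python
import jax
import jax.numpy as jnp

def mutate(key, genome, log_sigma, tau=0.1):
    """Self-adaptive Gaussian mutation: mutate the genome's own log
    step-size first, then perturb the weights with the new sigma."""
    k_sigma, k_noise = jax.random.split(key)
    new_log_sigma = log_sigma + tau * jax.random.normal(k_sigma, log_sigma.shape)
    noise = jnp.exp(new_log_sigma) * jax.random.normal(k_noise, genome.shape)
    return genome + noise, new_log_sigma
```

vmap this over per-agent keys/genomes and the whole population mutates in one batched call.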

### Status (2026-04-01)

Phases 1-3 complete. 500K step run finished successfully:
- 86 generations evolved
- Agents sustain avg age 3,742 steps, energy 0.98, hydration 0.79
- Signal entropy dropping (4.28 → 3.58) – indicating early communication structure
- Simulation runs at ~1000 steps/s on H100 (JAX jit-compiled)

### Key Fix

food_growth_rate bumped from 0.005 → 0.02 and food_eat_amount from 0.05 → 0.03 to prevent ecological collapse at high generations.

### Architecture

- World: grid.py, resources.py, environment.py, physics.py, observations.py, spatial.py
- Agent: body.py (metabolism), brain.py (GRU + vmap batched), actions.py
- Evolution: fitness.py, selection.py, mutation.py (self-adaptive sigma), population.py
- Communication: signals.py (8-channel, spatial attenuation, top-4 reception – toy sketch below)
- Analysis: emergence.py (signal entropy, magnitude, R², diversity, clustering)
- Visualization: renderer.py (dashboard, world map, zoom views)
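
The reception rule's gist: every agent hears its 4 nearest neighbors, attenuated by distance. A toy dense version (the real spatial.py presumably bins the grid instead of building the full O(N²) distance matrix):

```python
import jax.numpy as jnp
from jax import lax

def receive_signals(positions, signals, k=4):
    """positions: (N, 2), signals: (N, 8) -> received: (N, k, 8)."""
    d = jnp.linalg.norm(positions[:, None] - positions[None, :], axis=-1)
    d = d + jnp.eye(d.shape[0]) * 1e9           # don't hear yourself
    neg_d, idx = lax.top_k(-d, k)               # k nearest neighbors
    atten = 1.0 / (1.0 - neg_d)                 # = 1 / (1 + distance)
    return signals[idx] * atten[..., None]
```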

### Run Data

~/genesis/runs/run_20260401_111309/ – metrics.csv (500 rows), emergence.csv (100 rows), 50 viz frames, 10 checkpoints + FINAL, config.json

### Next Steps

- Phase 4: TRIBE v2 integration – compare evolved GRU representations to human brain activity via RSA
- Phase 5: Scale to 5K+ agents, longer runs for 500+ generations
- Checkpoints at 50K intervals allow comparing brain representations across evolutionary time

---

## Project 5: TRIBE v2 – AI-Brain Loop

**What:** Closing the AI-brain comparison loop using Meta's TRIBE v2 – comparing AI encoder representations to predicted brain activity to find architectural gaps.

**Location:** /home/azureuser/tribev2 (venv at /home/azureuser/tribev2_env)

### What's Built

- Full analysis script: /home/azureuser/tribev2/close_the_loop_v2.py
- 8 phases: load model → extract per-layer features → brain parcellation → layer-wise encoding → modality ablation → RSA (sketch below) → divergence mapping → visualization
- Multimodal stimulus: /home/azureuser/multimodal_stimulus.mp4
- Results: /home/azureuser/loop_results_v2/
- Runs with video (V-JEPA2) + audio (Wav2Vec-BERT) + text (LLaMA 3.2-3B)
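
The RSA phase is the textbook version of representational similarity analysis – build a dissimilarity matrix per representation over the same stimuli, then rank-correlate the two:

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr

def rsa(feats_a, feats_b):
    """feats_*: (n_stimuli, n_features). Returns Spearman rho between the
    two representations' dissimilarity structures (condensed RDMs)."""
    rdm_a = pdist(feats_a, metric="correlation")
    rdm_b = pdist(feats_b, metric="correlation")
    rho, _ = spearmanr(rdm_a, rdm_b)
    return rho
```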

### Status

LLaMA 3.2 access granted. Full 3-modality analysis pipeline complete. Brain-guided ViT training attempted 5 times – all failed.

### Why Attempts Failed

- Never had real brain targets – routed ViT-Small features through TRIBE v2's projector (trained for V-JEPA2), producing random outputs
- Evaluated on the wrong metric (classification accuracy instead of robustness)
- Literature shows brain-guided training helps ROBUSTNESS (+3-8%), not classification accuracy

### What Would Actually Work (from RESEARCH_BRIEF.md)

1. Pre-compute real brain targets using TRIBE v2's full pipeline
2. Train the student with classification + a per-vertex Pearson correlation brain loss (sketch below)
3. Evaluate on corruption/adversarial robustness, shape bias, brain-score – NOT accuracy
4. Or: use real fMRI data (Natural Scenes Dataset) instead of TRIBE v2 predictions
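
The brain loss in step 2 would look something like this (my sketch, not something already in the repo; `lambda_brain` is a made-up knob):

```python
import torch

def pearson_brain_loss(pred, target, eps=1e-8):
    """1 - Pearson r per vertex, averaged. pred/target: (batch, n_vertices)
    predicted vs. target brain activity across a batch of stimuli."""
    pred = pred - pred.mean(dim=0)
    target = target - target.mean(dim=0)
    r = (pred * target).sum(dim=0) / (pred.norm(dim=0) * target.norm(dim=0) + eps)
    return (1.0 - r).mean()

# total_loss = ce_loss + lambda_brain * pearson_brain_loss(student_out, brain_targets)
```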

### Key Infrastructure

- Training scripts: /home/azureuser/brain_guided/train_*.py
- UCF-101 dataset: /home/azureuser/brain_guided/data/UCF-101 (13K videos)
- Results: /home/azureuser/brain_guided/results_final/

---

## Project 6: Instagram Cinema

**What:** AI-generated cinematic videos using LTX-2.3 22B on ComfyUI for Instagram growth.

**Setup:** LTX-2.3 22B dev model running on the H100 via ComfyUI, exposed via a cloudflared tunnel.

**Format:** Instagram Reels – 9:16 portrait, 544x960

**Goal:** Create viral-quality cinematic content for Instagram Reels.

---

## Money-Making Strategy (April 2026)

### Sellable Assets

1. **TurboQuant** – working implementation nobody else has publicly. Lead magnet for consulting.
2. **Parameter Golf** – competition result (if top placement) = massive credibility signal
3. **Fine-tuning expertise** – proven on H100, multiple model families
4. **Inference optimization consulting** – directly from TurboQuant benchmarks

### Immediate Plan

- Path to 10L: Freelancing/consulting – fine-tuning + inference optimization
- Path to 1Cr: Productized consulting at scale or AI startup
- Channel: X (Twitter) for distribution, direct DMs to founders for sales

### X (Twitter) Growth Strategy

- Account: 10 followers currently, Premium purchased (213.50/month with 50% off)
- Strategy: 70% replies (to bigger accounts), 30% original posts
- Target: 15 strategic replies/day to accounts with 100-5000 followers
- Post timing: 6:30 PM IST (9:00 AM ET) on Tue/Wed/Thu
- Pinned thread: TurboQuant benchmarks
- Goal: 500 followers in 4 weeks, first paid client in 2-4 weeks

### Cold Outreach Template

"I noticed you're using [X model]. I can cut your inference cost by 40%. Free 1-week proof. Interested?"

### Target Clients

- Indian startups using LLMs in production (inc42 AI list)
- US startups from YC directory (AI/ML category, S24/W25 batches)
- Anyone on Twitter complaining about GPU costs / inference scaling
- Companies with >$10K/month GPU spend