# Vivek Varikuti — Complete Profile & Project Portfolio

## Who I Am
- 22 years old, AI Engineer & Startup Founder
- GitHub: vivekvar-dl
- Email: domainluther1234@gmail.com
- Strong Python/PyTorch/LLM skills, deep transformer training experience
- Hardware: 1x NVIDIA H100 NVL 96GB on Azure (NC40ads H100 v5)
- CUDA 12.8, PyTorch 2.7.0+cu128, flash-attn 2.8.3 (FA2)
- Transformers 5.4.0

---

## Working Style
- No AI fluff. No menus of options. Make the decision and execute.
- Write like a human — no perfect grammar, no emojis, no "leveraging" or "seamless"
- Any public text must read like a tired developer typed it at 2am
- No co-authored-by Claude in git commits — public contributions look fully human
- Verify before claiming. Test before shipping. Always run the actual code.

---

## Project 1: TurboQuant — KV Cache Compression

**What:** Implementing Google's TurboQuant paper (arXiv 2504.19874, Zandieh et al.) for KV cache compression during LLM inference.

**Why:** Compress the KV cache ~4-7x on production LLMs to enable longer contexts and bigger batches on the H100 NVL (96GB).

**Location:** /home/azureuser/turboquant/

**Status:** Working prototype. Google hasn't released their code publicly — this is one of the first working implementations.

**Core Method:** Mixed-precision quantization of the KV cache. Profile each layer's activation norms, identify outlier layers that need full precision, quantize the rest. No retraining, no fine-tuning — drop-in replacement.

**Key Discovery:** Layer 0 (and sometimes the last layer) of Qwen models has anomalously large key norms (~16-50x the median). These layers must be kept in BF16 (skip_layers). An auto-calibration function detects the outlier layers.
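The calibration step is just a norm profile over the KV cache. A minimal sketch of the idea, not the actual turboquant code: the function name, the 4x-median cutoff, the use of mean key L2 norm, and the cache indexing (legacy 4.x-style, may need tweaking on transformers 5.x) are all illustrative assumptions.

```python
# Sketch of the auto-calibration idea (illustrative, not the repo code).
# Assumptions: legacy HF cache indexing, mean key L2 norm, 4x median cutoff.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def calibrate_skip_layers(model_name, prompt, threshold=4.0):
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(
        model_name, torch_dtype=torch.bfloat16, device_map="cuda")
    inputs = tok(prompt, return_tensors="pt").to(model.device)

    with torch.no_grad():
        out = model(**inputs, use_cache=True)

    norms = []
    for layer_idx in range(model.config.num_hidden_layers):
        # keys: (batch, kv_heads, seq_len, head_dim); cache API varies by version
        keys, _values = out.past_key_values[layer_idx]
        norms.append(keys.float().norm(dim=-1).mean().item())

    median = torch.tensor(norms).median().item()
    skip = [i for i, n in enumerate(norms) if n > threshold * median]
    return skip, norms

# On Qwen2.5-7B this kind of profile flags layers 0 and 27 (norms ~16x the median);
# those stay in BF16, everything else gets quantized.
```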
### Benchmark Results (H100 NVL 96GB)

#### Model Architecture Summary

| Model | Architecture | KV Heads | head_dim | Outlier Layers | Prefill Fidelity |
|-------|--------------|----------|----------|----------------|------------------|
| Qwen2.5-7B | 28L, qwen2 | 4 | 128 | layers 0, 27 | exact |
| Llama-3.1-8B | 32L, llama | 8 | 128 | none | exact |
| Gemma-2-9B | 42L, gemma2 | 8 | 256 | none | exact |
| Phi-4-14B | 40L, phi3 | 10 | 128 | none | exact |
| Qwen2.5-32B | 64L, qwen2 | 8 | 128 | none | exact |
| Llama-3.3-70B | 80L, llama | 8 | 128 | N/A (disk full) | N/A |

#### Memory Savings at 8K Context

| Model | Default VRAM | TurboQuant VRAM | Saved | KV Cache Reduction |
|-------|--------------|-----------------|-------|--------------------|
| Gemma-2-9B | 9.98 GB | 7.71 GB | 2,323 MB | ~59% |
| Qwen2.5-32B | 23.16 GB | 21.41 GB | 1,791 MB | ~47% |
| Phi-4-14B | 12.28 GB | 10.92 GB | 1,392 MB | ~44% |
| LLaMA-3.1-8B | 7.71 GB | 6.84 GB | 890 MB | ~44% |
| Qwen2.5-7B | 7.08 GB | 6.71 GB | 380 MB | ~44% |

#### Memory Savings Scaling (LLaMA-3.1-8B)

| Context Length | Default VRAM | TurboQuant VRAM | Saved |
|----------------|--------------|-----------------|-------|
| 1K tokens | 6.00 GB | 5.91 GB | 93 MB |
| 4K tokens | 6.67 GB | 6.27 GB | 417 MB |
| 8K tokens | 7.71 GB | 6.84 GB | 890 MB |

#### Full Memory Data Per Model

**Qwen2.5-7B (5.45 GB model)**
- Layer norms: median 16.86, max 273.84 (layer 0), ratio 16.24x
- Outlier layers: 0 (norm 273.84), 27 (norm 239.91)
- 1K: 5.76→5.73 GB (37 MB saved)
- 4K: 6.27→6.10 GB (176 MB saved)
- 8K: 7.08→6.71 GB (380 MB saved)

**LLaMA-3.1-8B (5.68 GB model)**
- Layer norms: median 17.90, max 21.05 (layer 7), ratio 1.18x
- No outlier layers
- 1K: 6.00→5.91 GB (93 MB saved, output match)
- 4K: 6.67→6.27 GB (417 MB saved, output match)
- 8K: 7.71→6.84 GB (890 MB saved, output match)

**Gemma-2-9B (6.08 GB model)**
- Layer norms: median 17.82, max 21.28 (layer 25), ratio 1.19x
- No outlier layers
- 1K: 6.62→6.38 GB (244 MB saved)
- 4K: 7.96→6.89 GB (1,096 MB saved)
- 8K: 9.98→7.71 GB (2,323 MB saved)

**Phi-4-14B (9.10 GB model)**
- Layer norms: median 19.21, max 26.46 (layer 0), ratio 1.38x
- No outlier layers
- 1K: 9.75→9.61 GB (146 MB saved)
- 4K: 10.72→10.09 GB (650 MB saved)
- 8K: 12.28→10.92 GB (1,392 MB saved)

**Qwen2.5-32B (19.31 GB model)**
- Layer norms: median 16.09, max 37.82 (layer 0), ratio 2.35x
- No outlier layers
- 1K: 19.97→19.79 GB (186 MB saved)
- 4K: 21.23→20.42 GB (833 MB saved)
- 8K: 23.16→21.41 GB (1,791 MB saved)

**LLaMA-3.3-70B** — failed with "No space left on device"

#### Quality Verification

All models tested with 3 prompts: "Explain quantum computing", "Write a Python prime checker", "What causes northern lights?"
- Prefill logit difference: 0.0 across ALL models
- Same top-1 token prediction: 100% across ALL models
- Output coherence: 100% — both default and TurboQuant outputs fully coherent
- Token match rate varies (18-100%) due to natural autoregressive sampling divergence — both outputs equally valid

**Detailed quality per model:**
- Qwen2.5-7B: token match 39%, 3%, 54% — both coherent on all 3 prompts
- LLaMA-3.1-8B: token match 89.1%, 100%, 100% — 2/3 exact match
- Phi-4-14B: token match 100%, 44%, 100% — 2/3 exact match
- Gemma-2-9B: token match 100%, 100%, 18.8% — 2/3 exact match
- Qwen2.5-32B: token match 71%, 25%, 53% — both coherent on all 3 prompts

#### Infrastructure Notes
- Environment: torch 2.7.0+cu128, transformers 5.4.0, H100 NVL, CUDA 12.8 (driver 570)
- PyTorch wheels compiled for CUDA 13.0+ won't work — need the cu128 wheel
- Core quantizer verified (MSE matches the paper's bounds)
- Cache integrates with the HF Transformers v5.4.0 QuantizedLayer API

---

## Project 2: Parameter Golf Competition (OpenAI)

**What:** OpenAI competition — train the best language model within a 16MB artifact, 10 minutes on 8xH100.

**Metric:** Bits-per-byte (BPB) on FineWeb validation (62M tokens sp1024, 45.5M tokens sp4096)

**Timeline:** March 18 - April 30, 2026

**Current SOTA (merged):** 1.1194 BPB (PR #549, LeakyReLU^2 + TTT + Parallel Muon)

### Our Edge: sp4096 Vocabulary
- sp4096 tokens_per_byte: 0.3063 vs sp1024: 0.4149 → 26.2% fewer tokens (BPB conversion sketched below)
- Baseline A/B test (400 steps): sp4096 = 1.6208 BPB vs sp1024 = 1.7144 BPB → -5.5%
- #1 arch A/B test (400 steps, seed 42): sp4096+factored = 1.8693 BPB vs sp1024 = 2.0067 BPB → -6.8%
- Extrapolated SOTA: 1.1194 × 0.93 ≈ 1.04-1.06 BPB
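For reference, BPB is just per-token cross-entropy converted to bits and scaled by tokens per byte. A quick sketch under that assumption; the official eval code may differ in details, and the example loss value below is made up.

```python
# Assumed form of the metric (per-token cross-entropy in bits, times tokens/byte);
# the competition's official scorer may differ. The loss value here is invented.
import math

def bits_per_byte(mean_nll_nats: float, tokens_per_byte: float) -> float:
    """mean_nll_nats: mean cross-entropy per token (nats) on the val set."""
    bits_per_token = mean_nll_nats / math.log(2)   # nats -> bits
    return bits_per_token * tokens_per_byte

# Same per-token loss, two vocabularies:
print(bits_per_byte(3.66, 0.4149))  # ~2.19 BPB with sp1024
print(bits_per_byte(3.66, 0.3063))  # ~1.62 BPB with sp4096
```

In practice the per-token loss rises with the coarser vocab (each sp4096 token carries more information), which is why the measured gain is ~5-7%, not the full 26%.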
### Architecture
- 11L, 512d, 8H, 4KV, 3x MLP, LeakyReLU(0.5)^2
- Factored embeddings: tok_emb(4096x256) + embed_up(256→512) + embed_down(512→256)
- All tricks from the #1 submission: XSA, Partial RoPE, LN Scale, SmearGate, BigramHash, EMA, TTT, GPTQ-lite

### Key Files
- our_submission/train_gpt.py — modified #1 with sp4096 + factored embed + FA2 fallback
- our_submission/train_gpt_original.py — unmodified #1 with FA2 fallback
- train_sp4096.py — tokenizer training + data sharding script
- data/tokenizers/fineweb_4096_bpe.model — trained sp4096 tokenizer
- data/datasets/fineweb10B_sp4096/ — 80 train shards + 1 val shard

### N-gram Cache: CONFIRMED FAKE
- 256M bucket experiment: collision-free hash tables give 1.11 BPB (no improvement)
- All sub-1.0 BPB claims are measurement artifacts from hash collisions
- Valid Dirichlet smoothing gives at most ~0.002-0.005 genuine improvement

### Next Steps
1. Medium-fidelity run (10 min, 1xH100)
2. Int5 MLP quantization (saves ~1.86MB for artifact budget headroom)
3. Get 8xH100 access for final submission (compute grant or RunPod)
4. Temperature scaling, document-isolated TTT for extra gains

### Hardware
- Dev: 1xH100 NVL (Azure NC40ads H100 v5), 96GB VRAM, CUDA 12.8, PyTorch 2.9.1+cu128
- flash-attn 2.8.3 (FA2, not FA3)
- Final submission needs 8xH100

---

## Project 3: GSoC 2026 — DeepChem OLMo Wrapper

**What:** Adding OLMo-2 7B LLM support to DeepChem for molecular property prediction and SMILES generation.

**Org:** DeepChem (participating as a standalone org for the first time in GSoC 2026)

**Mentors:** Riya, Harindhar

**Deadline:** March 31, 2026 18:00 UTC (submitted)

### What Was Built

**PR #4913 (LIVE) — Bug Fix**
- Fixed ChemBERTa's broken import under transformers 5.x
- `transformers.models.roberta.tokenization_roberta_fast` was removed in 5.x
- 3 additions / 4 deletions
- https://github.com/deepchem/deepchem/pull/4913

**Issue #4912 (LIVE) — Compat Report**
- Broader transformers 5.x compatibility issues documented
- https://github.com/deepchem/deepchem/issues/4912

**OLMo Wrapper (LOCAL ONLY — not pushed)**
- Files at ~/olmo_draft/olmo.py and ~/olmo_draft/test_olmo.py
- Olmo2ForSequenceClassification — built from scratch (doesn't exist in HF)
- OLMo wrapper class extending HuggingFaceModel
- Added a causal_lm task + generate() to the base HuggingFaceModel
- 8/8 tests pass in 27 seconds on CPU
- Uses OLMo-2 (allenai/OLMo-2-1124-7B)

### Experiments Run
- BBBP classification: ROC-AUC 0.67 (random init, 12.9M params, 200 samples)
- ESOL regression: R² 0.37, MAE 1.27
- SMILES generation: 0% validity (proves pretraining is the core challenge)
- Tokenization analysis: OLMo uses 0.9x the tokens of ChemBERTa, but fragments stereocenters

### Proposal
- ~/gsoc_proposal_final.md — human-written version
- ~/gsoc_proposal_content.md — raw technical reference

### Key Context
- PR #4907 by Aditya-ad48 also adds causal LM generation — complementary, not competing
- DeepChem wants small PRs (<50 lines) from new contributors
- rbharath is the main reviewer/maintainer
- Office hours MWF 9am PST
- Discord: https://discord.gg/RYTrUY8Ssn

---

## Project 4: Genesis — Artificial Life Simulation

**What:** A virtual world where blank-slate GRU agents evolve survival behaviors from scratch — foraging, water-seeking, communication — on the H100 GPU using JAX.

**Location:** /home/azureuser/genesis/ (venv at ~/genesis_env/)

### World Setup
- 512x512 grid with Perlin noise terrain
- Food regrowth, water sources, day/night cycles, seasons
- 1000 agents with GRU brains (~82K params each)
- Tournament selection + Gaussian mutation with self-adaptive sigma (see the sketch below)
- Agents start with zero knowledge — must learn to survive
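The self-adaptive sigma works the same way as standard evolution strategies: mutate the step size first, then mutate the genome with the new step size. A rough jax.numpy sketch below; the function and argument names are made up for illustration, not what mutation.py actually looks like.

```python
# Generic self-adaptive Gaussian mutation sketch (illustrative, not the repo code).
import jax
import jax.numpy as jnp

def mutate(key, params, sigma, tau=0.1, sigma_min=1e-4):
    """params: flat genome (n,); sigma: this genome's step size (scalar)."""
    k_sigma, k_params = jax.random.split(key)
    # Log-normal self-adaptation of the step size, floored at sigma_min.
    new_sigma = jnp.clip(sigma * jnp.exp(tau * jax.random.normal(k_sigma)), sigma_min)
    # Gaussian perturbation of the genome with the new step size.
    new_params = params + new_sigma * jax.random.normal(k_params, params.shape)
    return new_params, new_sigma

# vmap over the population: one fused call instead of a Python loop over 1000 agents.
mutate_pop = jax.vmap(mutate, in_axes=(0, 0, 0, None, None))
# usage: keys = jax.random.split(master_key, 1000)
#        new_params, new_sigmas = mutate_pop(keys, pop_params, pop_sigmas, 0.1, 1e-4)
```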
### Status (2026-04-01)
Phases 1-3 complete. The 500K-step run finished successfully:
- 86 generations evolved
- Agents sustain avg age 3,742 steps, energy 0.98, hydration 0.79
- Signal entropy dropping (4.28→3.58) — early signs of communication structure
- Simulation runs at ~1000 steps/s on the H100 (JAX jit-compiled)

### Key Fix
food_growth_rate bumped from 0.005→0.02 and food_eat_amount from 0.05→0.03 to prevent ecological collapse at high generation counts.

### Architecture
- World: grid.py, resources.py, environment.py, physics.py, observations.py, spatial.py
- Agent: body.py (metabolism), brain.py (GRU + vmap batched), actions.py
- Evolution: fitness.py, selection.py, mutation.py (self-adaptive sigma), population.py
- Communication: signals.py (8-channel, spatial attenuation, top-4 reception)
- Analysis: emergence.py (signal entropy, magnitude, R², diversity, clustering)
- Visualization: renderer.py (dashboard, world map, zoom views)

### Run Data
~/genesis/runs/run_20260401_111309/ — metrics.csv (500 rows), emergence.csv (100 rows), 50 viz frames, 10 checkpoints + FINAL, config.json

### Next Steps
- Phase 4: TRIBE v2 integration — compare evolved GRU representations to human brain activity via RSA
- Phase 5: Scale to 5K+ agents, longer runs for 500+ generations
- Checkpoints at 50K-step intervals allow comparing brain representations across evolutionary time

---

## Project 5: TRIBE v2 — AI-Brain Loop

**What:** Closing the AI-brain comparison loop using Meta's TRIBE v2 — comparing AI encoder representations to predicted brain activity to find architectural gaps.

**Location:** /home/azureuser/tribev2 (venv at /home/azureuser/tribev2_env)

### What's Built
- Full analysis script: /home/azureuser/tribev2/close_the_loop_v2.py
- 8 phases: load model → extract per-layer features → brain parcellation → layer-wise encoding → modality ablation → RSA → divergence mapping → visualization
- Multimodal stimulus: /home/azureuser/multimodal_stimulus.mp4
- Results: /home/azureuser/loop_results_v2/
- Runs with video (V-JEPA2) + audio (Wav2Vec-BERT) + text (LLaMA 3.2-3B)

### Status
LLaMA 3.2 access granted. Full 3-modality analysis pipeline complete. Brain-guided ViT training attempted 5 times — all failed.

### Why the Attempts Failed
- Never had real brain targets — routed ViT-Small features through TRIBE v2's projector (trained for V-JEPA2), producing effectively random outputs
- Evaluated on the wrong metric (classification accuracy instead of robustness)
- The literature shows brain-guided training helps ROBUSTNESS (+3-8%), not classification accuracy

### What Would Actually Work (from RESEARCH_BRIEF.md)
1. Pre-compute real brain targets using TRIBE v2's full pipeline
2. Train the student with classification loss + a per-vertex Pearson correlation brain loss (sketched below)
3. Evaluate on corruption/adversarial robustness, shape bias, brain-score — NOT accuracy
4. Or: use real fMRI data (Natural Scenes Dataset) instead of TRIBE v2 predictions
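Item 2's brain loss would look roughly like this: per-vertex Pearson correlation across the batch, turned into a loss. Sketch only; the vertex weighting/masking and how it combines with the classification loss are assumptions, not what RESEARCH_BRIEF.md specifies.

```python
# Minimal per-vertex Pearson correlation brain loss (illustrative sketch).
import torch

def pearson_brain_loss(pred, target, eps=1e-8):
    """pred, target: (batch, n_vertices) predicted vs. precomputed brain targets.
    Returns 1 - mean per-vertex Pearson r, correlating across the batch."""
    pred = pred - pred.mean(dim=0, keepdim=True)        # center each vertex
    target = target - target.mean(dim=0, keepdim=True)
    r = (pred * target).sum(dim=0) / (
        pred.norm(dim=0) * target.norm(dim=0) + eps)    # (n_vertices,)
    return 1.0 - r.mean()

# assumed combined objective:
# loss = ce_loss(logits, labels) + lambda_brain * pearson_brain_loss(brain_pred, brain_target)
```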
### Key Infrastructure
- Training scripts: /home/azureuser/brain_guided/train_*.py
- UCF-101 dataset: /home/azureuser/brain_guided/data/UCF-101 (13K videos)
- Results: /home/azureuser/brain_guided/results_final/

---

## Project 6: Instagram Cinema

**What:** AI-generated cinematic videos using LTX-2.3 22B on ComfyUI for Instagram growth.

**Setup:** LTX-2.3 22B dev model running on the H100 via ComfyUI, exposed via a cloudflared tunnel.

**Format:** Instagram Reels — 9:16 portrait, 544x960

**Goal:** Create viral-quality cinematic content for Instagram Reels.

---

## Money-Making Strategy (April 2026)

### Sellable Assets
1. **TurboQuant** — a working implementation nobody else has publicly. Lead magnet for consulting.
2. **Parameter Golf** — a top placement in the competition would be a massive credibility signal
3. **Fine-tuning expertise** — proven on the H100 across multiple model families
4. **Inference optimization consulting** — backed directly by the TurboQuant benchmarks

### Immediate Plan
- Path to ₹10L: freelancing/consulting — fine-tuning + inference optimization
- Path to ₹1Cr: productized consulting at scale, or an AI startup
- Channel: X (Twitter) for distribution, direct DMs to founders for sales

### X (Twitter) Growth Strategy
- Account: 10 followers currently, Premium purchased (₹213.50/month with 50% off)
- Strategy: 70% replies (to bigger accounts), 30% original posts
- Target: 15 strategic replies/day to accounts with 100-5,000 followers
- Post timing: 6:30 PM IST (9:00 AM EST) on Tue/Wed/Thu
- Pinned thread: TurboQuant benchmarks
- Goal: 500 followers in 4 weeks, first paid client in 2-4 weeks

### Cold Outreach Template
"I noticed you're using [X model]. I can cut your inference cost by 40%. Free 1-week proof. Interested?"

### Target Clients
- Indian startups using LLMs in production (Inc42 AI list)
- US startups from the YC directory (AI/ML category, S24/W25 batches)
- Anyone on Twitter complaining about GPU costs / inference scaling
- Companies with >$10K/month GPU spend