# Vivek Varikuti — Complete Profile & Project Portfolio

## Who I Am

- 22 years old, AI Engineer & Startup Founder
- GitHub: vivekvar-dl
- Email: domainluther1234@gmail.com
- Strong Python/PyTorch/LLM skills, deep transformer training experience
- Hardware: 1x NVIDIA H100 NVL 96GB on Azure (NC40ads H100 v5)
- CUDA 12.8, PyTorch 2.7.0+cu128, flash-attn 2.8.3 (FA2)
- Transformers 5.4.0

---

## Working Style

- No AI fluff. No menus of options. Make the decision and execute.
- Write like a human — no perfect grammar, no emojis, no "leveraging" or "seamless"
- Any public text must read like a tired developer typed it at 2am
- No co-authored-by Claude in git commits — public contributions look fully human
- Verify before claiming. Test before shipping. Always run the actual code.

---

## Project 1: TurboQuant — KV Cache Compression

**What:** Implementing Google's TurboQuant paper (arXiv 2504.19874, Zandieh et al.) for KV cache compression during LLM inference.

**Why:** Compress KV cache ~4-7x on production LLMs to enable longer contexts and batching on H100 NVL (96GB).

**Location:** /home/azureuser/turboquant/

**Status:** Working prototype. Google hasn't released their code publicly — this is one of the first working implementations.

**Core Method:** Mixed-precision quantization of KV cache. Profile each layer's activation norms, identify outlier layers that need full precision, quantize the rest. No retraining, no fine-tuning — drop-in replacement.

**Key Discovery:** Layer 0 (and sometimes the last layer) of some Qwen models has anomalously large key norms (~16-50x median). These layers must stay in BF16 (skip_layers). An auto-calibration function detects the outlier layers.
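
A minimal sketch of how outlier detection like this could work, assuming one key norm per layer and a median-ratio rule. The function name and the 5.0 threshold are hypothetical, not taken from the project code. The test profile is synthetic, shaped like the Qwen2.5-7B numbers reported further down.

```python
from statistics import median


def find_outlier_layers(key_norms, ratio_threshold=5.0):
    """Return indices of layers whose key norm exceeds ratio_threshold x the median.

    ratio_threshold is a hypothetical default, not a value from the project.
    """
    med = median(key_norms)
    return [i for i, norm in enumerate(key_norms) if norm / med > ratio_threshold]


# Synthetic 28-layer profile: layers 0 and 27 carry the reported outlier norms,
# every other layer is set to the reported median (real layers vary more).
norms = [16.86] * 28
norms[0] = 273.84
norms[27] = 239.91

skip_layers = find_outlier_layers(norms)
print(skip_layers)  # [0, 27]
```

A median-based rule is used here because one or two huge layers barely move the median, while they would drag a mean upward.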

### Benchmark Results (H100 NVL 96GB)

#### Model Architecture Summary

| Model | Architecture | KV Heads | head_dim | Outlier Layers | Prefill Fidelity |
|-------|-------------|----------|---------|----------------|-----------------|
| Qwen2.5-7B | 28L, qwen2 | 4 | 128 | layers 0, 27 | exact |
| Llama-3.1-8B | 32L, llama | 8 | 128 | none | exact |
| Gemma-2-9B | 42L, gemma2 | 8 | 256 | none | exact |
| Phi-4-14B | 40L, phi3 | 10 | 128 | none | exact |
| Qwen2.5-32B | 64L, qwen2 | 8 | 128 | none | exact |
| Llama-3.3-70B | 80L, llama | 8 | 128 | N/A (disk full) | N/A |
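
The table columns are enough to estimate the unquantized KV cache size. This is a rough sketch using the standard formula (K and V, per layer, per KV head), assuming BF16 and that every layer caches the full context. It ignores sliding-window layers, activations and allocator overhead, so it will not line up exactly with measured VRAM.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   bytes_per_elem=2, batch=1):
    """Bytes held by K and V tensors across all layers (BF16 by default)."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_elem * batch


MIB = 1024 ** 2

# (layers, KV heads, head_dim) copied from the architecture table
configs = {
    "Qwen2.5-7B": (28, 4, 128),
    "Llama-3.1-8B": (32, 8, 128),
    "Gemma-2-9B": (42, 8, 256),
}

sizes_mib = {
    name: kv_cache_bytes(layers, heads, dim, seq_len=8192) / MIB
    for name, (layers, heads, dim) in configs.items()
}

for name, size in sizes_mib.items():
    print(f"{name}: {size:.0f} MiB of BF16 KV cache at 8K context")
```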

#### Memory Savings at 8K Context

| Model | Default VRAM | TurboQuant VRAM | Saved | KV Cache Reduction |
|-------|-------------|----------------|-------|-------------------|
| Gemma-2-9B | 9.98 GB | 7.71 GB | 2,323 MB | ~59% |
| Qwen2.5-32B | 23.16 GB | 21.41 GB | 1,791 MB | ~47% |
| Phi-4-14B | 12.28 GB | 10.92 GB | 1,392 MB | ~44% |
| LLaMA-3.1-8B | 7.71 GB | 6.84 GB | 890 MB | ~44% |
| Qwen2.5-7B | 7.08 GB | 6.71 GB | 380 MB | ~44% |

#### Memory Savings Scaling (LLaMA-3.1-8B)

| Context Length | Default VRAM | TurboQuant VRAM | Saved |
|---------------|-------------|----------------|-------|
| 1K tokens | 6.00 GB | 5.91 GB | 93 MB |
| 4K tokens | 6.67 GB | 6.27 GB | 417 MB |
| 8K tokens | 7.71 GB | 6.84 GB | 890 MB |
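
A quick way to read the scaling table is MB saved per 1K tokens of context. The numbers below are copied from the rows above and nothing is extrapolated.

```python
# (context length in K tokens, MB saved) from the scaling table
rows = [(1, 93), (4, 417), (8, 890)]

per_1k = {ctx: saved / ctx for ctx, saved in rows}

for ctx, rate in per_1k.items():
    print(f"{ctx}K context: {rate:.2f} MB saved per 1K tokens")
```

The rate climbs a little with context length, so the saving is not purely a fixed per-token amount in these measurements.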

#### Full Memory Data Per Model

**Qwen2.5-7B (5.45 GB model)**
- Layer norms: median 16.86, max 273.84 (layer 0), ratio 16.24x
- Outlier layers: 0 (norm 273.84), 27 (norm 239.91)
- 1K: 5.76→5.73 GB (37 MB saved)
- 4K: 6.27→6.10 GB (176 MB saved)
- 8K: 7.08→6.71 GB (380 MB saved)

**LLaMA-3.1-8B (5.68 GB model)**
- Layer norms: median 17.90, max 21.05 (layer 7), ratio 1.18x
- No outlier layers
- 1K: 6.00→5.91 GB (93 MB saved, output match)
- 4K: 6.67→6.27 GB (417 MB saved, output match)
- 8K: 7.71→6.84 GB (890 MB saved, output match)

**Gemma-2-9B (6.08 GB model)**
- Layer norms: median 17.82, max 21.28 (layer 25), ratio 1.19x
- No outlier layers
- 1K: 6.62→6.38 GB (244 MB saved)
- 4K: 7.96→6.89 GB (1,096 MB saved)
- 8K: 9.98→7.71 GB (2,323 MB saved)

**Phi-4-14B (9.10 GB model)**
- Layer norms: median 19.21, max 26.46 (layer 0), ratio 1.38x
- No outlier layers
- 1K: 9.75→9.61 GB (146 MB saved)
- 4K: 10.72→10.09 GB (650 MB saved)
- 8K: 12.28→10.92 GB (1,392 MB saved)

**Qwen2.5-32B (19.31 GB model)**
- Layer norms: median 16.09, max 37.82 (layer 0), ratio 2.35x
- No outlier layers
- 1K: 19.97→19.79 GB (186 MB saved)
- 4K: 21.23→20.42 GB (833 MB saved)
- 8K: 23.16→21.41 GB (1,791 MB saved)

**LLaMA-3.3-70B** — failed with "No space left on device"

#### Quality Verification

All models tested with 3 prompts: "Explain quantum computing", "Write a Python prime checker", "What causes northern lights?"

- Prefill logit difference: 0.0 across ALL models
- Same top-1 token prediction: 100% across ALL models
- Output coherence: 100% — both default and TurboQuant outputs fully coherent
- Token match rate varies widely (3-100%): once a single decoded token differs, the rest of the autoregressive sequence diverges. Both outputs are equally valid.
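
How "token match rate" is computed is not spelled out here. A minimal sketch, assuming it is a positional comparison of the two generated token id sequences. Both function names are hypothetical.

```python
def token_match_rate(tokens_a, tokens_b):
    """Fraction of positions with identical token ids, over the longer sequence."""
    longest = max(len(tokens_a), len(tokens_b))
    if longest == 0:
        return 1.0
    matches = sum(a == b for a, b in zip(tokens_a, tokens_b))
    return matches / longest


def first_divergence(tokens_a, tokens_b):
    """Index of the first differing position, or None if none differs."""
    for i, (a, b) in enumerate(zip(tokens_a, tokens_b)):
        if a != b:
            return i
    return None


default_out = [5, 9, 2, 7, 7, 1]   # made-up token ids
quantized_out = [5, 9, 2, 4, 8, 1]

rate = token_match_rate(default_out, quantized_out)
split = first_divergence(default_out, quantized_out)
print(f"match rate {rate:.2f}, first divergence at position {split}")
```

Reporting the first divergence next to the match rate helps separate "one early token flipped" from "outputs differ everywhere".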

**Detailed quality per model:**

- Qwen2.5-7B: token match 39%, 3%, 54%; both coherent on all 3 prompts
- LLaMA-3.1-8B: token match 89.1%, 100%, 100%; 2/3 exact match
- Phi-4-14B: token match 100%, 44%, 100%; 2/3 exact match
- Gemma-2-9B: token match 100%, 100%, 18.8%; 2/3 exact match
- Qwen2.5-32B: token match 71%, 25%, 53%; both coherent on all 3 prompts

#### Infrastructure Notes
- Environment: torch 2.7.0+cu128, transformers 5.4.0, H100 NVL CUDA 12.8 (driver 570)
- PyTorch wheels built for CUDA 13.0+ won't work with this setup (driver 570, CUDA 12.8); install the cu128 wheel
- Core quantizer verified (MSE matches paper bounds)
- Cache integrates with HF Transformers v5.4.0 QuantizedLayer API

---

## Project 2: Parameter Golf Competition (OpenAI)

**What:** OpenAI competition — train the best language model within a 16MB artifact, 10 minutes on 8xH100.

**Metric:** Bits-per-byte (BPB) on FineWeb validation (62M tokens sp1024, 45.5M tokens sp4096)

**Timeline:** March 18 - April 30, 2026

**Current SOTA (merged):** 1.1194 BPB (PR #549, LeakyReLU^2 + TTT + Parallel Muon)

### Our Edge: sp4096 Vocabulary

- sp4096 tokens_per_byte: 0.3063 vs sp1024: 0.4149 → 26.2% fewer tokens
- Baseline A/B test (400 steps): sp4096 = 1.6208 BPB vs sp1024 = 1.7144 BPB → -5.5%
- #1 arch A/B test (400 steps, seed 42): sp4096+factored = 1.8693 BPB vs sp1024 = 2.0067 BPB → -6.8%
- Extrapolated from SOTA, assuming the 5.5-6.8% gain carries over to full training: 1.1194 × (0.932 to 0.945) ≈ 1.04-1.06 BPB
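
The arithmetic behind these numbers, as a small sketch. `bits_per_byte` is the usual conversion from cross-entropy in nats per token. That the competition scorer computes BPB exactly this way is an assumption. The extrapolation simply applies the 400-step relative gains to the SOTA score, which may not hold at full training length.

```python
import math


def bits_per_byte(loss_nats_per_token, tokens_per_byte):
    """Convert mean cross-entropy (nats/token) into bits per byte of raw text."""
    return loss_nats_per_token / math.log(2) * tokens_per_byte


def relative_drop(new, old):
    """Fractional reduction going from old to new."""
    return (old - new) / old


token_reduction = relative_drop(0.3063, 0.4149)   # tokens per byte
baseline_gain = relative_drop(1.6208, 1.7144)     # baseline A/B, 400 steps
arch_gain = relative_drop(1.8693, 2.0067)         # #1 arch A/B, 400 steps

sota = 1.1194
extrapolated = (sota * (1 - arch_gain), sota * (1 - baseline_gain))

print(f"token reduction: {token_reduction:.1%}")
print(f"BPB gains: {baseline_gain:.1%} and {arch_gain:.1%}")
print(f"extrapolated range: {extrapolated[0]:.2f} to {extrapolated[1]:.2f} BPB")
```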

### Architecture

- 11L, 512d, 8H, 4KV, 3x MLP, LeakyReLU(0.5)^2
- Factored embeddings: tok_emb(4096x256) + embed_up(256→512) + embed_down(512→256)
- All tricks from #1 submission: XSA, Partial RoPE, LN Scale, SmearGate, BigramHash, EMA, TTT, GPTQ-lite
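
Parameter count for the factored embedding versus a plain 4096x512 table, as a rough sketch. It assumes bias-free projections, and the unfactored table is only a hypothetical comparison point. How these weights end up sized in the quantized 16MB artifact is not covered.

```python
def factored_embedding_params(vocab, d_embed, d_model):
    """Weights in tok_emb + embed_up + embed_down, assuming no biases."""
    tok_emb = vocab * d_embed        # 4096 x 256 lookup table
    embed_up = d_embed * d_model     # 256 -> 512 projection
    embed_down = d_model * d_embed   # 512 -> 256 projection
    return tok_emb + embed_up + embed_down


full = 4096 * 512                                  # unfactored table at model width
factored = factored_embedding_params(4096, 256, 512)
saved = full - factored

print(f"full: {full:,}  factored: {factored:,}  saved: {saved:,}")
```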

### Key Files

- our_submission/train_gpt.py — modified #1 with sp4096 + factored embed + FA2 fallback
- our_submission/train_gpt_original.py — unmodified #1 with FA2 fallback
- train_sp4096.py — tokenizer training + data sharding script
- data/tokenizers/fineweb_4096_bpe.model — trained sp4096 tokenizer
- data/datasets/fineweb10B_sp4096/ — 80 train shards + 1 val shard

### N-gram Cache: CONFIRMED FAKE

- 256M bucket experiment: collision-free hash tables give 1.11 BPB (no improvement)
- All sub-1.0 BPB claims are measurement artifacts from hash collisions
- Valid Dirichlet smoothing gives at most ~0.002-0.005 genuine improvement
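
For reference, the textbook form of Dirichlet smoothing: n-gram counts combined with the model's own distribution as the prior. Whether this matches the exact variant tested here is an assumption, and `alpha` is a hypothetical setting. The output stays a proper distribution over the whole vocabulary, which is what makes the resulting BPB a valid measurement.

```python
def dirichlet_mix(counts, model_probs, alpha=1.0):
    """Posterior-mean next-token distribution with the model as Dirichlet prior.

    counts[i] is how often token i followed the current context in the cache;
    model_probs is the LM distribution for the same context and must sum to 1.
    """
    total = sum(counts)
    return [(c + alpha * p) / (total + alpha) for c, p in zip(counts, model_probs)]


model_probs = [0.25, 0.25, 0.25, 0.25]   # toy 4-token vocabulary
mixed = dirichlet_mix([3, 1, 0, 0], model_probs, alpha=4.0)
unseen = dirichlet_mix([0, 0, 0, 0], model_probs, alpha=4.0)

print(mixed)    # [0.5, 0.25, 0.125, 0.125]
print(unseen)   # falls back to the model distribution
```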

### Next Steps

1. Medium fidelity run (10min 1xH100)
2. Int5 MLP quantization (saves ~1.86MB for artifact budget headroom)
3. Get 8xH100 access for final submission (compute grant or RunPod)
4. Temperature scaling, document-isolated TTT for extra gains

### Hardware

- Dev: 1xH100 NVL (Azure NC40ads H100 v5), 96GB VRAM, CUDA 12.8, PyTorch 2.9.1+cu128
- flash-attn 2.8.3 (FA2, not FA3)
- Final submission needs 8xH100

---

## Project 3: GSoC 2026 — DeepChem OLMo Wrapper

**What:** Adding OLMo-2 7B LLM support to DeepChem for molecular property prediction and SMILES generation.

**Org:** DeepChem (participating as a standalone org for the first time in GSoC 2026)
**Mentors:** Riya, Harindhar
**Deadline:** March 31, 2026 18:00 UTC (submitted)

### What Was Built

**PR #4913 (LIVE) — Bug Fix**
- Fixed ChemBERTa broken import for transformers 5.x
- `transformers.models.roberta.tokenization_roberta_fast` removed in 5.x
- 3 additions / 4 deletions
- https://github.com/deepchem/deepchem/pull/4913
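
One generic pattern for surviving a removed module path is to try candidate import locations in order. This is only an illustration and not what PR #4913 actually changed. `import_first_available` is a hypothetical helper, and the demo uses stdlib modules so it runs without transformers installed.

```python
import importlib


def import_first_available(candidates):
    """Return the first attribute that imports cleanly from (module_path, attr_name) pairs."""
    errors = []
    for module_path, attr_name in candidates:
        try:
            module = importlib.import_module(module_path)
            return getattr(module, attr_name)
        except (ImportError, AttributeError) as exc:
            errors.append(f"{module_path}.{attr_name}: {exc}")
    raise ImportError("no candidate could be imported:\n" + "\n".join(errors))


# Stand-in demo: the first path does not exist, the second does.
# In the ChemBERTa case the failing candidate would be
# transformers.models.roberta.tokenization_roberta_fast.
loader = import_first_available([
    ("json_module_that_does_not_exist", "loads"),
    ("json", "loads"),
])

print(loader('{"a": 1}'))
```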

**Issue #4912 (LIVE) — Compat Report**
- Broader transformers 5.x compatibility issues documented
- https://github.com/deepchem/deepchem/issues/4912

**OLMo Wrapper (LOCAL ONLY — not pushed)**
- Files at ~/olmo_draft/olmo.py and ~/olmo_draft/test_olmo.py
- Olmo2ForSequenceClassification — built from scratch (doesn't exist in HF)
- OLMo wrapper class extending HuggingFaceModel
- Added causal_lm task + generate() to base HuggingFaceModel
- 8/8 tests pass in 27 seconds on CPU
- Uses OLMo-2 (allenai/OLMo-2-1124-7B)

### Experiments Run

- BBBP classification: ROC-AUC 0.67 (random init, 12.9M params, 200 samples)
- ESOL regression: R² 0.37, MAE 1.27
- SMILES generation: 0% validity (proves pretraining is core challenge)
- Tokenization analysis: OLMo 0.9x tokens vs ChemBERTa, but fragments stereocenters
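
For clarity on the metrics quoted above, here are plain-Python definitions of ROC-AUC, MAE and R². This is a sketch of the standard formulas, not the evaluation code used in the experiments, which presumably went through DeepChem or scikit-learn.

```python
def roc_auc(labels, scores):
    """Probability that a random positive is scored above a random negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Toy inputs, not project data
auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
err = mae([3, -0.5, 2, 7], [2.5, 0.0, 2, 8])
fit = r2([3, -0.5, 2, 7], [2.5, 0.0, 2, 8])
print(auc, err, round(fit, 4))
```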

### Proposal

- ~/gsoc_proposal_final.md — human-written version
- ~/gsoc_proposal_content.md — raw technical reference

### Key Context

- PR #4907 by Aditya-ad48 also adds causal LM generation — complementary not competing
- DeepChem wants small PRs (<50 lines) for new contributors
- rbharath is the main reviewer/maintainer
- Office hours MWF 9am PST
- Discord: https://discord.gg/RYTrUY8Ssn

---

## Project 4: Genesis — Artificial Life Simulation

**What:** Virtual world where blank GRU neural net agents evolve survival behaviors from scratch — foraging, water-seeking, communication — on H100 GPU using JAX.

**Location:** /home/azureuser/genesis/ (venv at ~/genesis_env/)

### World Setup

- 512x512 grid with Perlin noise terrain
- Food regrowth, water sources, day/night cycles, seasons
- 1000 agents with GRU brains (~82K params each)
- Tournament selection + Gaussian mutation (self-adaptive sigma)
- Agents start with zero knowledge — must learn to survive
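
A minimal sketch of tournament selection plus self-adaptive Gaussian mutation, in plain Python rather than the project's JAX code. The log-normal sigma update is the common evolution-strategies form. `tau`, the tournament size and the flat-list genome are all assumptions.

```python
import math
import random


def tournament_select(fitness, k, rng):
    """Pick k random individuals and return the index of the fittest."""
    contenders = rng.sample(range(len(fitness)), k)
    return max(contenders, key=lambda i: fitness[i])


def mutate(genome, sigma, rng, tau=0.1):
    """Perturb sigma log-normally, then add Gaussian noise scaled by the new sigma."""
    new_sigma = sigma * math.exp(tau * rng.gauss(0.0, 1.0))
    child = [w + new_sigma * rng.gauss(0.0, 1.0) for w in genome]
    return child, new_sigma


rng = random.Random(0)
fitness = [0.2, 0.9, 0.5, 0.1]

# A tournament over the whole population always returns the best individual.
parent = tournament_select(fitness, k=len(fitness), rng=rng)
child, child_sigma = mutate([0.0] * 8, sigma=0.05, rng=rng)

print(parent, len(child), child_sigma > 0)
```

Since each child inherits its own sigma, lineages that mutate too aggressively or too timidly get filtered out by selection along with their weights.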

### Status (2026-04-01)

Phases 1-3 complete. 500K step run finished successfully:
- 86 generations evolved
- Agents sustain avg age 3,742 steps, energy 0.98, hydration 0.79
- Signal entropy dropping (4.28→3.58) — indicating early communication structure
- Simulation runs at ~1000 steps/s on H100 (JAX jit-compiled)
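
Signal entropy can be measured several ways, and the method used in emergence.py is not described here. A minimal sketch, assuming signals are discretized into symbols and entropy is Shannon entropy in bits.

```python
import math
from collections import Counter


def shannon_entropy_bits(symbols):
    """Shannon entropy (bits) of the empirical symbol distribution."""
    counts = Counter(symbols)
    total = len(symbols)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


uniform = shannon_entropy_bits([0, 1, 2, 3])   # four equally likely symbols
constant = shannon_entropy_bits([7] * 10)      # one symbol only

print(uniform, abs(constant))
```

Falling entropy means agents use fewer distinct signals more often. That fits emerging structure, but it would also show up if signalling simply collapsed, so it is worth reading next to the other emergence metrics.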

### Key Fix

food_growth_rate raised from 0.005→0.02 and food_eat_amount lowered from 0.05→0.03 to prevent ecological collapse at high generations.

### Architecture

- World: grid.py, resources.py, environment.py, physics.py, observations.py, spatial.py
- Agent: body.py (metabolism), brain.py (GRU + vmap batched), actions.py
- Evolution: fitness.py, selection.py, mutation.py (self-adaptive sigma), population.py
- Communication: signals.py (8-channel, spatial attenuation, top-4 reception)
- Analysis: emergence.py (signal entropy, magnitude, R², diversity, clustering)
- Visualization: renderer.py (dashboard, world map, zoom views)

### Run Data

~/genesis/runs/run_20260401_111309/ — metrics.csv (500 rows), emergence.csv (100 rows), 50 viz frames, 10 checkpoints + FINAL, config.json

### Next Steps

- Phase 4: TRIBE v2 integration — compare evolved GRU representations to human brain activity via RSA
- Phase 5: Scale to 5K+ agents, longer runs for 500+ generations
- Checkpoints at 50K intervals allow comparing brain representations across evolutionary time

---

## Project 5: TRIBE v2 — AI-Brain Loop

**What:** Closing the AI-brain comparison loop using Meta's TRIBE v2 — comparing AI encoder representations to predicted brain activity to find architectural gaps.

**Location:** /home/azureuser/tribev2 (venv at /home/azureuser/tribev2_env)

### What's Built

- Full analysis script: /home/azureuser/tribev2/close_the_loop_v2.py
- 8 phases: load model → extract per-layer features → brain parcellation → layer-wise encoding → modality ablation → RSA → divergence mapping → visualization
- Multimodal stimulus: /home/azureuser/multimodal_stimulus.mp4
- Results: /home/azureuser/loop_results_v2/
- Runs with video (V-JEPA2) + audio (Wav2Vec-BERT) + text (LLaMA 3.2-3B)
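
A minimal sketch of the RSA step in plain Python: build a dissimilarity matrix per representation (1 minus Pearson r between stimulus pairs) and correlate the upper triangles. The actual script may use Spearman correlation or a different distance, so the metric choices here are assumptions.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def rdm_upper(features):
    """Upper triangle of the dissimilarity matrix; features is [stimulus][dim]."""
    n = len(features)
    return [1.0 - pearson(features[i], features[j])
            for i in range(n) for j in range(i + 1, n)]


def rsa_score(feats_a, feats_b):
    """Similarity of two representational geometries over the same stimuli."""
    return pearson(rdm_upper(feats_a), rdm_upper(feats_b))


# Toy features for 4 stimuli (not project data)
model_feats = [[1, 2, 3, 4], [4, 3, 2, 1.5], [1, 3, 2, 5], [2, 2, 5, 1]]
rescaled = [[2 * v + 1 for v in row] for row in model_feats]

same = rsa_score(model_feats, model_feats)
scaled = rsa_score(model_feats, rescaled)
print(round(same, 6), round(scaled, 6))
```

RSA compares geometry rather than raw activations, so the two feature spaces can have different dimensionality as long as the stimuli line up.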

### Status

LLaMA 3.2 access granted. Full 3-modality analysis pipeline complete. Brain-guided ViT training attempted 5 times — all failed.

### Why Attempts Failed

- Never had real brain targets — routed ViT-Small features through TRIBE v2's projector (trained for V-JEPA2), producing random outputs
- Evaluated on wrong metric (classification accuracy instead of robustness)
- Literature shows brain-guided training helps ROBUSTNESS (+3-8%), not classification accuracy

### What Would Actually Work (from RESEARCH_BRIEF.md)

1. Pre-compute real brain targets using TRIBE v2's full pipeline
2. Train student with classification + per-vertex Pearson correlation brain loss
3. Evaluate on corruption/adversarial robustness, shape bias, brain-score — NOT accuracy
4. Or: use real fMRI data (Natural Scenes Dataset) instead of TRIBE v2 predictions
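
Step 2 above could look roughly like this. It is written in plain Python to show the math only, since real training would need the same computation in differentiable tensor ops. `brain_weight` and `eps` are hypothetical values.

```python
import math


def pearson(x, y, eps=1e-8):
    """Pearson correlation with a small eps to avoid division by zero."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (math.sqrt(vx * vy) + eps)


def brain_loss(pred, target):
    """1 - mean per-vertex Pearson r; pred and target are [batch][vertex] lists."""
    num_vertices = len(pred[0])
    rs = [
        pearson([row[v] for row in pred], [row[v] for row in target])
        for v in range(num_vertices)
    ]
    return 1.0 - sum(rs) / num_vertices


def total_loss(cls_loss, pred, target, brain_weight=0.1):
    """Classification loss plus weighted brain-alignment loss."""
    return cls_loss + brain_weight * brain_loss(pred, target)


# Toy batch of 3 samples x 2 vertices (not real brain targets)
target = [[0.1, 1.0], [0.4, 0.5], [0.9, 0.2]]
flipped = [[-v for v in row] for row in target]

aligned = brain_loss(target, target)
opposed = brain_loss(flipped, target)
print(round(aligned, 6), round(opposed, 6))
```

Correlation is taken across the batch for each vertex, so very small batches give noisy loss values.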

### Key Infrastructure

- Training scripts: /home/azureuser/brain_guided/train_*.py
- UCF-101 dataset: /home/azureuser/brain_guided/data/UCF-101 (13K videos)
- Results: /home/azureuser/brain_guided/results_final/

---

## Project 6: Instagram Cinema

**What:** AI-generated cinematic videos using LTX-2.3 22B on ComfyUI for Instagram growth.

**Setup:** LTX-2.3 22B dev model running on H100 via ComfyUI, exposed via cloudflared tunnel.

**Format:** Instagram Reels — 9:16 portrait, 544x960

**Goal:** Create viral-quality cinematic content for Instagram Reels.

---

## Money-Making Strategy (April 2026)

### Sellable Assets

1. **TurboQuant** — working implementation nobody else has publicly. Lead magnet for consulting.
2. **Parameter Golf** — competition result (if top placement) = massive credibility signal
3. **Fine-tuning expertise** — proven on H100, multiple model families
4. **Inference optimization consulting** — directly from TurboQuant benchmarks

### Immediate Plan

- Path to 10L: Freelancing/consulting — fine-tuning + inference optimization
- Path to 1Cr: Productized consulting at scale or AI startup
- Channel: X (Twitter) for distribution, direct DMs to founders for sales

### X (Twitter) Growth Strategy

- Account: 10 followers currently, Premium purchased (213.50/month with 50% off)
- Strategy: 70% replies (to bigger accounts), 30% original posts
- Target: 15 strategic replies/day to accounts with 100-5000 followers
- Post timing: 6:30 PM IST (9:00 AM EDT) on Tue/Wed/Thu
- Pinned thread: TurboQuant benchmarks
- Goal: 500 followers in 4 weeks, first paid client in 2-4 weeks

### Cold Outreach Template

"I noticed you're using [X model]. I can cut your inference cost by 40%. Free 1-week proof. Interested?"

### Target Clients

- Indian startups using LLMs in production (inc42 AI list)
- US startups from YC directory (AI/ML category, S24/W25 batches)
- Anyone on Twitter complaining about GPU costs / inference scaling
- Companies with >$10K/month GPU spend