
Vivek Varikuti — Complete Profile & Project Portfolio

Who I Am

  • 22 years old, AI Engineer & Startup Founder
  • GitHub: vivekvar-dl
  • Email: domainluther1234@gmail.com
  • Strong Python/PyTorch/LLM skills, deep transformer training experience
  • Hardware: 1x NVIDIA H100 NVL 96GB on Azure (NC40ads H100 v5)
  • CUDA 12.8, PyTorch 2.7.0+cu128, flash-attn 2.8.3 (FA2)
  • Transformers 5.4.0

Working Style

  • No AI fluff. No menus of options. Make the decision and execute.
  • Write like a human β€” no perfect grammar, no emojis, no "leveraging" or "seamless"
  • Any public text must read like a tired developer typed it at 2am
  • No co-authored-by Claude in git commits β€” public contributions look fully human
  • Verify before claiming. Test before shipping. Always run the actual code.

Project 1: TurboQuant — KV Cache Compression

What: Implementing Google's TurboQuant paper (arXiv 2504.19874, Zandieh et al.) for KV cache compression during LLM inference.

Why: Compress KV cache ~4-7x on production LLMs to enable longer contexts and batching on H100 NVL (96GB).

Location: /home/azureuser/turboquant/

Status: Working prototype. Google hasn't released their code publicly — this is one of the first working implementations.

Core Method: Mixed-precision quantization of the KV cache. Profile each layer's activation norms, identify outlier layers that need full precision, quantize the rest. No retraining, no fine-tuning — drop-in replacement.
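As a rough illustration of the drop-in idea, here is a minimal per-tensor symmetric int8 round trip. This is a sketch only, not the paper's actual quantizer (TurboQuant's scheme is more involved); it just shows why quantized layers need no retraining: quantize on write, dequantize on read, error bounded by the scale.

```python
def quantize_int8(x):
    """Symmetric per-tensor int8 quantization: one scale for the whole tensor."""
    scale = max(abs(v) for v in x) / 127.0 or 1.0  # fall back to 1.0 for all-zero input
    q = [round(v / scale) for v in x]
    return q, scale

def dequantize_int8(q, scale):
    """Map int8 codes back to floats; this is what attention reads at decode time."""
    return [v * scale for v in q]

# Round-trip a toy "key" vector; reconstruction error is bounded by scale/2.
keys = [0.12, -3.4, 1.7, 0.0, 2.9]
q, s = quantize_int8(keys)
recon = dequantize_int8(q, s)
err = max(abs(a - b) for a, b in zip(keys, recon))
assert err <= s / 2 + 1e-9
```

Per-tensor scaling like this is exactly what breaks on outlier layers: one huge key norm inflates the scale and destroys resolution for everything else, which is why those layers stay in BF16.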

Key Discovery: Layer 0 (and sometimes the last layer) of Qwen models has anomalously large key norms (~16-50x the median). These layers must be kept in BF16 (skip_layers). An auto-calibration function detects outlier layers.
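The calibration step reduces to a median-ratio test over per-layer key norms. A sketch with a hypothetical helper (the repo's actual auto-calibration may use a different threshold):

```python
from statistics import median

def find_outlier_layers(key_norms, ratio_threshold=8.0):
    """Flag layers whose key norm exceeds ratio_threshold x the median norm.

    key_norms: per-layer mean key-activation norms from a profiling pass.
    Returns layer indices to keep in BF16 (the skip_layers set).
    """
    med = median(key_norms)
    return [i for i, n in enumerate(key_norms) if n > ratio_threshold * med]

# Qwen2.5-7B profile from the benchmarks below: median ~16.86,
# layer 0 (273.84) and layer 27 (239.91) get flagged.
norms = [273.84] + [16.86] * 26 + [239.91]
assert find_outlier_layers(norms) == [0, 27]
```

On flat profiles like LLaMA-3.1-8B (max/median ratio 1.18x) this returns an empty list, matching the "no outlier layers" rows in the table.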

Benchmark Results (H100 NVL 96GB)

Model Architecture Summary

| Model | Architecture | KV Heads | head_dim | Outlier Layers | Prefill Fidelity |
|---|---|---|---|---|---|
| Qwen2.5-7B | 28L, qwen2 | 4 | 128 | layers 0, 27 | exact |
| Llama-3.1-8B | 32L, llama | 8 | 128 | none | exact |
| Gemma-2-9B | 42L, gemma2 | 8 | 256 | none | exact |
| Phi-4-14B | 40L, phi3 | 10 | 128 | none | exact |
| Qwen2.5-32B | 64L, qwen2 | 8 | 128 | none | exact |
| Llama-3.3-70B | 80L, llama | 8 | 128 | N/A (disk full) | N/A |

Memory Savings at 8K Context

| Model | Default VRAM | TurboQuant VRAM | Saved | KV Cache Reduction |
|---|---|---|---|---|
| Gemma-2-9B | 9.98 GB | 7.71 GB | 2,323 MB | ~59% |
| Qwen2.5-32B | 23.16 GB | 21.41 GB | 1,791 MB | ~47% |
| Phi-4-14B | 12.28 GB | 10.92 GB | 1,392 MB | ~44% |
| LLaMA-3.1-8B | 7.71 GB | 6.84 GB | 890 MB | ~44% |
| Qwen2.5-7B | 7.08 GB | 6.71 GB | 380 MB | ~44% |

Memory Savings Scaling (LLaMA-3.1-8B)

| Context Length | Default VRAM | TurboQuant VRAM | Saved |
|---|---|---|---|
| 1K tokens | 6.00 GB | 5.91 GB | 93 MB |
| 4K tokens | 6.67 GB | 6.27 GB | 417 MB |
| 8K tokens | 7.71 GB | 6.84 GB | 890 MB |
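The near-linear scaling tracks the standard KV-cache size formula: 2 (K and V) x layers x kv_heads x head_dim x seq_len x bytes_per_element. A worked check for LLaMA-3.1-8B at 8K in BF16 (an upper bound on the savings; measured VRAM deltas also include allocator overhead, so they won't match this exactly):

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """Size of a full K+V cache for one sequence, in bytes (BF16 = 2 bytes)."""
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_elem

# LLaMA-3.1-8B: 32 layers, 8 KV heads, head_dim 128
full_bf16 = kv_cache_bytes(32, 8, 128, 8192)
print(full_bf16 / 2**20)  # 1024.0 MiB at 8K context
```

At 1K the same formula gives 128 MiB, which is why the measured savings shrink so sharply at short contexts: the cache itself is small relative to the weights.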

Full Memory Data Per Model

Qwen2.5-7B (5.45 GB model)

  • Layer norms: median 16.86, max 273.84 (layer 0), ratio 16.24x
  • Outlier layers: 0 (norm 273.84), 27 (norm 239.91)
  • 1K: 5.76β†’5.73 GB (37 MB saved)
  • 4K: 6.27β†’6.10 GB (176 MB saved)
  • 8K: 7.08β†’6.71 GB (380 MB saved)

LLaMA-3.1-8B (5.68 GB model)

  • Layer norms: median 17.90, max 21.05 (layer 7), ratio 1.18x
  • No outlier layers
  • 1K: 6.00β†’5.91 GB (93 MB saved, output match)
  • 4K: 6.67β†’6.27 GB (417 MB saved, output match)
  • 8K: 7.71β†’6.84 GB (890 MB saved, output match)

Gemma-2-9B (6.08 GB model)

  • Layer norms: median 17.82, max 21.28 (layer 25), ratio 1.19x
  • No outlier layers
  • 1K: 6.62β†’6.38 GB (244 MB saved)
  • 4K: 7.96β†’6.89 GB (1,096 MB saved)
  • 8K: 9.98β†’7.71 GB (2,323 MB saved)

Phi-4-14B (9.10 GB model)

  • Layer norms: median 19.21, max 26.46 (layer 0), ratio 1.38x
  • No outlier layers
  • 1K: 9.75β†’9.61 GB (146 MB saved)
  • 4K: 10.72β†’10.09 GB (650 MB saved)
  • 8K: 12.28β†’10.92 GB (1,392 MB saved)

Qwen2.5-32B (19.31 GB model)

  • Layer norms: median 16.09, max 37.82 (layer 0), ratio 2.35x
  • No outlier layers
  • 1K: 19.97β†’19.79 GB (186 MB saved)
  • 4K: 21.23β†’20.42 GB (833 MB saved)
  • 8K: 23.16β†’21.41 GB (1,791 MB saved)

LLaMA-3.3-70B — failed with "No space left on device"

Quality Verification

All models tested with 3 prompts: "Explain quantum computing", "Write a Python prime checker", "What causes northern lights?"

  • Prefill logit difference: 0.0 across ALL models
  • Same top-1 token prediction: 100% across ALL models
  • Output coherence: 100% β€” both default and TurboQuant outputs fully coherent
  • Token match rate varies (18-100%) due to natural autoregressive sampling divergence β€” both outputs equally valid

Detailed quality per model:

  • Qwen2.5-7B: token match 39%, 3%, 54% — both coherent all 3 prompts
  • LLaMA-3.1-8B: token match 89.1%, 100%, 100% — 2/3 exact match
  • Phi-4-14B: token match 100%, 44%, 100% — 2/3 exact match
  • Gemma-2-9B: token match 100%, 100%, 18.8% — 2/3 exact match
  • Qwen2.5-32B: token match 71%, 25%, 53% — both coherent all 3 prompts

Infrastructure Notes

  • Environment: torch 2.7.0+cu128, transformers 5.4.0, H100 NVL CUDA 12.8 (driver 570)
  • PyTorch compiled for CUDA 13.0+ won't work β€” need cu128 wheel
  • Core quantizer verified (MSE matches paper bounds)
  • Cache integrates with HF Transformers v5.4.0 QuantizedLayer API

Project 2: Parameter Golf Competition (OpenAI)

What: OpenAI competition — train the best language model within a 16MB artifact, 10 minutes on 8xH100.

Metric: Bits-per-byte (BPB) on FineWeb validation (62M tokens sp1024, 45.5M tokens sp4096)

Timeline: March 18 - April 30, 2026

Current SOTA (merged): 1.1194 BPB (PR #549, LeakyReLU^2 + TTT + Parallel Muon)

Our Edge: sp4096 Vocabulary

  • sp4096 tokens_per_byte: 0.3063 vs sp1024: 0.4149 β†’ 26.2% fewer tokens
  • Baseline A/B test (400 steps): sp4096 = 1.6208 BPB vs sp1024 = 1.7144 BPB β†’ -5.5%
  • #1 arch A/B test (400 steps, seed 42): sp4096+factored = 1.8693 BPB vs sp1024 = 2.0067 BPB β†’ -6.8%
  • Extrapolated SOTA: 1.1194 Γ— 0.93 β‰ˆ 1.04-1.06 BPB

Architecture

  • 11L, 512d, 8H, 4KV, 3x MLP, LeakyReLU(0.5)^2
  • Factored embeddings: tok_emb(4096x256) + embed_up(256β†’512) + embed_down(512β†’256)
  • All tricks from #1 submission: XSA, Partial RoPE, LN Scale, SmearGate, BigramHash, EMA, TTT, GPTQ-lite

Key Files

  • our_submission/train_gpt.py β€” modified #1 with sp4096 + factored embed + FA2 fallback
  • our_submission/train_gpt_original.py β€” unmodified #1 with FA2 fallback
  • train_sp4096.py β€” tokenizer training + data sharding script
  • data/tokenizers/fineweb_4096_bpe.model β€” trained sp4096 tokenizer
  • data/datasets/fineweb10B_sp4096/ β€” 80 train shards + 1 val shard

N-gram Cache: CONFIRMED FAKE

  • 256M bucket experiment: collision-free hash tables give 1.11 BPB (no improvement)
  • All sub-1.0 BPB claims are measurement artifacts from hash collisions
  • Valid Dirichlet smoothing gives at most ~0.002-0.005 genuine improvement

Next Steps

  1. Medium fidelity run (10min 1xH100)
  2. Int5 MLP quantization (saves ~1.86MB for artifact budget headroom)
  3. Get 8xH100 access for final submission (compute grant or RunPod)
  4. Temperature scaling, document-isolated TTT for extra gains

Hardware

  • Dev: 1xH100 NVL (Azure NC40ads H100 v5), 96GB VRAM, CUDA 12.8, PyTorch 2.9.1+cu128
  • flash-attn 2.8.3 (FA2, not FA3)
  • Final submission needs 8xH100

Project 3: GSoC 2026 — DeepChem OLMo Wrapper

What: Adding OLMo-2 7B LLM support to DeepChem for molecular property prediction and SMILES generation.

Org: DeepChem (standalone first time in GSoC 2026)
Mentors: Riya, Harindhar
Deadline: March 31, 2026 18:00 UTC (submitted)

What Was Built

PR #4913 (LIVE) — Bug Fix

Issue #4912 (LIVE) — Compat Report

OLMo Wrapper (LOCAL ONLY — not pushed)

  • Files at ~/olmo_draft/olmo.py and ~/olmo_draft/test_olmo.py
  • Olmo2ForSequenceClassification β€” built from scratch (doesn't exist in HF)
  • OLMo wrapper class extending HuggingFaceModel
  • Added causal_lm task + generate() to base HuggingFaceModel
  • 8/8 tests pass in 27 seconds on CPU
  • Uses OLMo-2 (allenai/OLMo-2-1124-7B)

Experiments Run

  • BBBP classification: ROC-AUC 0.67 (random init, 12.9M params, 200 samples)
  • ESOL regression: RΒ² 0.37, MAE 1.27
  • SMILES generation: 0% validity (proves pretraining is core challenge)
  • Tokenization analysis: OLMo 0.9x tokens vs ChemBERTa, but fragments stereocenters

Proposal

  • ~/gsoc_proposal_final.md β€” human-written version
  • ~/gsoc_proposal_content.md β€” raw technical reference

Key Context

  • PR #4907 by Aditya-ad48 also adds causal LM generation β€” complementary not competing
  • DeepChem wants small PRs (<50 lines) for new contributors
  • rbharath is the main reviewer/maintainer
  • Office hours MWF 9am PST
  • Discord: https://discord.gg/RYTrUY8Ssn

Project 4: Genesis — Artificial Life Simulation

What: Virtual world where blank GRU neural net agents evolve survival behaviors from scratch — foraging, water-seeking, communication — on an H100 GPU using JAX.

Location: /home/azureuser/genesis/ (venv at ~/genesis_env/)

World Setup

  • 512x512 grid with Perlin noise terrain
  • Food regrowth, water sources, day/night cycles, seasons
  • 1000 agents with GRU brains (~82K params each)
  • Tournament selection + Gaussian mutation (self-adaptive sigma)
  • Agents start with zero knowledge β€” must learn to survive

Status (2026-04-01)

Phases 1-3 complete. 500K step run finished successfully:

  • 86 generations evolved
  • Agents sustain avg age 3,742 steps, energy 0.98, hydration 0.79
  • Signal entropy dropping (4.28β†’3.58) β€” indicating early communication structure
  • Simulation runs at ~1000 steps/s on H100 (JAX jit-compiled)

Key Fix

food_growth_rate bumped from 0.005→0.02, food_eat_amount 0.05→0.03 to prevent ecological collapse at high generations.

Architecture

  • World: grid.py, resources.py, environment.py, physics.py, observations.py, spatial.py
  • Agent: body.py (metabolism), brain.py (GRU + vmap batched), actions.py
  • Evolution: fitness.py, selection.py, mutation.py (self-adaptive sigma), population.py
  • Communication: signals.py (8-channel, spatial attenuation, top-4 reception)
  • Analysis: emergence.py (signal entropy, magnitude, RΒ², diversity, clustering)
  • Visualization: renderer.py (dashboard, world map, zoom views)

Run Data

~/genesis/runs/run_20260401_111309/ — metrics.csv (500 rows), emergence.csv (100 rows), 50 viz frames, 10 checkpoints + FINAL, config.json

Next Steps

  • Phase 4: TRIBE v2 integration β€” compare evolved GRU representations to human brain activity via RSA
  • Phase 5: Scale to 5K+ agents, longer runs for 500+ generations
  • Checkpoints at 50K intervals allow comparing brain representations across evolutionary time

Project 5: TRIBE v2 — AI-Brain Loop

What: Closing the AI-brain comparison loop using Meta's TRIBE v2 — comparing AI encoder representations to predicted brain activity to find architectural gaps.

Location: /home/azureuser/tribev2 (venv at /home/azureuser/tribev2_env)

What's Built

  • Full analysis script: /home/azureuser/tribev2/close_the_loop_v2.py
  • 8 phases: load model β†’ extract per-layer features β†’ brain parcellation β†’ layer-wise encoding β†’ modality ablation β†’ RSA β†’ divergence mapping β†’ visualization
  • Multimodal stimulus: /home/azureuser/multimodal_stimulus.mp4
  • Results: /home/azureuser/loop_results_v2/
  • Runs with video (V-JEPA2) + audio (Wav2Vec-BERT) + text (LLaMA 3.2-3B)

Status

LLaMA 3.2 access granted. Full 3-modality analysis pipeline complete. Brain-guided ViT training attempted 5 times — all failed.

Why Attempts Failed

  • Never had real brain targets β€” routed ViT-Small features through TRIBE v2's projector (trained for V-JEPA2), producing random outputs
  • Evaluated on wrong metric (classification accuracy instead of robustness)
  • Literature shows brain-guided training helps ROBUSTNESS (+3-8%), not classification accuracy

What Would Actually Work (from RESEARCH_BRIEF.md)

  1. Pre-compute real brain targets using TRIBE v2's full pipeline
  2. Train student with classification + per-vertex Pearson correlation brain loss
  3. Evaluate on corruption/adversarial robustness, shape bias, brain-score — NOT accuracy
  4. Or: use real fMRI data (Natural Scenes Dataset) instead of TRIBE v2 predictions

Key Infrastructure

  • Training scripts: /home/azureuser/brain_guided/train_*.py
  • UCF-101 dataset: /home/azureuser/brain_guided/data/UCF-101 (13K videos)
  • Results: /home/azureuser/brain_guided/results_final/

Project 6: Instagram Cinema

What: AI-generated cinematic videos using LTX-2.3 22B on ComfyUI for Instagram growth.

Setup: LTX-2.3 22B dev model running on H100 via ComfyUI, exposed via cloudflared tunnel.

Format: Instagram Reels — 9:16 portrait, 544x960

Goal: Create viral-quality cinematic content for Instagram Reels.


Money-Making Strategy (April 2026)

Sellable Assets

  1. TurboQuant — working implementation nobody else has publicly. Lead magnet for consulting.
  2. Parameter Golf — competition result (if top placement) = massive credibility signal
  3. Fine-tuning expertise — proven on H100, multiple model families
  4. Inference optimization consulting — directly from TurboQuant benchmarks

Immediate Plan

  • Path to 10L: Freelancing/consulting β€” fine-tuning + inference optimization
  • Path to 1Cr: Productized consulting at scale or AI startup
  • Channel: X (Twitter) for distribution, direct DMs to founders for sales

X (Twitter) Growth Strategy

  • Account: 10 followers currently, Premium purchased (213.50/month with 50% off)
  • Strategy: 70% replies (to bigger accounts), 30% original posts
  • Target: 15 strategic replies/day to accounts with 100-5000 followers
  • Post timing: 6:30 PM IST (9:00 AM EST) on Tue/Wed/Thu
  • Pinned thread: TurboQuant benchmarks
  • Goal: 500 followers in 4 weeks, first paid client in 2-4 weeks

Cold Outreach Template

"I noticed you're using [X model]. I can cut your inference cost by 40%. Free 1-week proof. Interested?"

Target Clients

  • Indian startups using LLMs in production (inc42 AI list)
  • US startups from YC directory (AI/ML category, S24/W25 batches)
  • Anyone on Twitter complaining about GPU costs / inference scaling
  • Companies with >$10K/month GPU spend