# Vivek Varikuti – Complete Profile & Project Portfolio

## Who I Am

- 22 years old, AI Engineer & Startup Founder
- GitHub: vivekvar-dl
- Email: domainluther1234@gmail.com
- Strong Python/PyTorch/LLM skills, deep transformer training experience
- Hardware: 1x NVIDIA H100 NVL 96GB on Azure (NC40ads H100 v5)
- CUDA 12.8, PyTorch 2.7.0+cu128, flash-attn 2.8.3 (FA2)
- Transformers 5.4.0
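
Quick env sanity check after any reinstall – nothing fancy, just confirms the wheels match the versions above:

```python
import torch
import flash_attn
import transformers

# should print 2.7.0+cu128 / 12.8 / NVIDIA H100 NVL / 2.8.3 / 5.4.0
print(torch.__version__, torch.version.cuda, torch.cuda.get_device_name())
print(flash_attn.__version__, transformers.__version__)
```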

---

## Working Style

- No AI fluff. No menus of options. Make the decision and execute.
- Write like a human – no perfect grammar, no emojis, no "leveraging" or "seamless"
- Any public text must read like a tired developer typed it at 2am
- No co-authored-by Claude in git commits – public contributions look fully human
- Verify before claiming. Test before shipping. Always run the actual code.

---

## Project 1: TurboQuant – KV Cache Compression

**What:** Implementing Google's TurboQuant paper (arXiv 2504.19874, Zandieh et al.) for KV cache compression during LLM inference.

**Why:** Compress the KV cache ~4-7x on production LLMs to enable longer contexts and bigger batches on the H100 NVL (96GB).

**Location:** /home/azureuser/turboquant/

**Status:** Working prototype. Google hasn't released their code publicly – this is one of the first working implementations.

**Core Method:** Mixed-precision quantization of the KV cache. Profile each layer's activation norms, identify outlier layers that need full precision, quantize the rest. No retraining, no fine-tuning – drop-in replacement.

**Key Discovery:** Layer 0 (and sometimes the last layer) of Qwen models has anomalously large key norms (~16-50x the median). These layers must be kept in BF16 (skip_layers). An auto-calibration function detects outlier layers.
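
The calibration logic reduces to something like this – a simplified sketch, not the exact code in the repo (the 10x threshold is illustrative; anything between the ~2.4x benign max and the ~16x Qwen outliers would separate them):

```python
import torch

def find_outlier_layers(key_norms, ratio_threshold=10.0):
    """Flag layers whose mean key-activation norm sits far above the median.

    key_norms[i] = mean L2 norm of layer i's keys from a short calibration
    forward pass. Flagged layers stay in BF16 (skip_layers); the rest get
    quantized.
    """
    norms = torch.tensor(key_norms)
    median = norms.median()
    return [i for i, n in enumerate(norms) if n / median > ratio_threshold]

# Qwen2.5-7B calibration: median 16.86, so layer 0 at 273.84 (16.2x) and
# layer 27 at 239.91 (14.2x) get flagged. Qwen2.5-32B's worst ratio is
# only 2.35x, so nothing gets flagged there.
```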

### Benchmark Results (H100 NVL 96GB)

#### Model Architecture Summary

| Model | Architecture | KV Heads | head_dim | Outlier Layers | Prefill Fidelity |
|-------|-------------|----------|----------|----------------|------------------|
| Qwen2.5-7B | 28L, qwen2 | 4 | 128 | layers 0, 27 | exact |
| Llama-3.1-8B | 32L, llama | 8 | 128 | none | exact |
| Gemma-2-9B | 42L, gemma2 | 8 | 256 | none | exact |
| Phi-4-14B | 40L, phi3 | 10 | 128 | none | exact |
| Qwen2.5-32B | 64L, qwen2 | 8 | 128 | none | exact |
| Llama-3.3-70B | 80L, llama | 8 | 128 | N/A (disk full) | N/A |

#### Memory Savings at 8K Context

| Model | Default VRAM | TurboQuant VRAM | Saved | KV Cache Reduction |
|-------|--------------|-----------------|-------|--------------------|
| Gemma-2-9B | 9.98 GB | 7.71 GB | 2,323 MB | ~59% |
| Qwen2.5-32B | 23.16 GB | 21.41 GB | 1,791 MB | ~47% |
| Phi-4-14B | 12.28 GB | 10.92 GB | 1,392 MB | ~44% |
| LLaMA-3.1-8B | 7.71 GB | 6.84 GB | 890 MB | ~44% |
| Qwen2.5-7B | 7.08 GB | 6.71 GB | 380 MB | ~44% |

#### Memory Savings Scaling (LLaMA-3.1-8B)

| Context Length | Default VRAM | TurboQuant VRAM | Saved |
|----------------|--------------|-----------------|-------|
| 1K tokens | 6.00 GB | 5.91 GB | 93 MB |
| 4K tokens | 6.67 GB | 6.27 GB | 417 MB |
| 8K tokens | 7.71 GB | 6.84 GB | 890 MB |

#### Full Memory Data Per Model

**Qwen2.5-7B (5.45 GB model)**
- Layer norms: median 16.86, max 273.84 (layer 0), ratio 16.24x
- Outlier layers: 0 (norm 273.84), 27 (norm 239.91)
- 1K: 5.76 → 5.73 GB (37 MB saved)
- 4K: 6.27 → 6.10 GB (176 MB saved)
- 8K: 7.08 → 6.71 GB (380 MB saved)

**LLaMA-3.1-8B (5.68 GB model)**
- Layer norms: median 17.90, max 21.05 (layer 7), ratio 1.18x
- No outlier layers
- 1K: 6.00 → 5.91 GB (93 MB saved, output match)
- 4K: 6.67 → 6.27 GB (417 MB saved, output match)
- 8K: 7.71 → 6.84 GB (890 MB saved, output match)

**Gemma-2-9B (6.08 GB model)**
- Layer norms: median 17.82, max 21.28 (layer 25), ratio 1.19x
- No outlier layers
- 1K: 6.62 → 6.38 GB (244 MB saved)
- 4K: 7.96 → 6.89 GB (1,096 MB saved)
- 8K: 9.98 → 7.71 GB (2,323 MB saved)

**Phi-4-14B (9.10 GB model)**
- Layer norms: median 19.21, max 26.46 (layer 0), ratio 1.38x
- No outlier layers
- 1K: 9.75 → 9.61 GB (146 MB saved)
- 4K: 10.72 → 10.09 GB (650 MB saved)
- 8K: 12.28 → 10.92 GB (1,392 MB saved)

**Qwen2.5-32B (19.31 GB model)**
- Layer norms: median 16.09, max 37.82 (layer 0), ratio 2.35x
- No outlier layers
- 1K: 19.97 → 19.79 GB (186 MB saved)
- 4K: 21.23 → 20.42 GB (833 MB saved)
- 8K: 23.16 → 21.41 GB (1,791 MB saved)

**LLaMA-3.3-70B** – failed with "No space left on device"

#### Quality Verification

All models tested with 3 prompts: "Explain quantum computing", "Write a Python prime checker", "What causes northern lights?"

- Prefill logit difference: 0.0 across ALL models (measured as sketched below)
- Same top-1 token prediction: 100% across ALL models
- Output coherence: 100% – both default and TurboQuant outputs fully coherent
- Token match rate varies (18-100%) due to natural autoregressive sampling divergence – both outputs equally valid
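
The prefill fidelity check is essentially this (a sketch, not the actual harness – `quant_cache` stands in for however the TurboQuant cache object gets constructed):

```python
import torch

@torch.no_grad()
def prefill_fidelity(model, input_ids, quant_cache):
    """Compare prefill logits with the default cache vs. the quantized one."""
    ref = model(input_ids).logits
    out = model(input_ids, past_key_values=quant_cache).logits
    max_diff = (ref - out).abs().max().item()          # 0.0 = exact prefill
    top1_same = bool((ref[:, -1].argmax(-1) == out[:, -1].argmax(-1)).all())
    return max_diff, top1_same
```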

**Detailed quality per model:**

- Qwen2.5-7B: token match 39%, 3%, 54% – both coherent on all 3 prompts
- LLaMA-3.1-8B: token match 89.1%, 100%, 100% – 2/3 exact match
- Phi-4-14B: token match 100%, 44%, 100% – 2/3 exact match
- Gemma-2-9B: token match 100%, 100%, 18.8% – 2/3 exact match
- Qwen2.5-32B: token match 71%, 25%, 53% – both coherent on all 3 prompts

#### Infrastructure Notes

- Environment: torch 2.7.0+cu128, transformers 5.4.0, H100 NVL CUDA 12.8 (driver 570)
- PyTorch compiled for CUDA 13.0+ won't work – need the cu128 wheel
- Core quantizer verified (MSE matches paper bounds)
- Cache integrates with HF Transformers v5.4.0 QuantizedLayer API

---

## Project 2: Parameter Golf Competition (OpenAI)

**What:** OpenAI competition – train the best language model within a 16MB artifact, 10 minutes on 8xH100.

**Metric:** Bits-per-byte (BPB) on FineWeb validation (62M tokens sp1024, 45.5M tokens sp4096)

**Timeline:** March 18 - April 30, 2026

**Current SOTA (merged):** 1.1194 BPB (PR #549, LeakyReLU^2 + TTT + Parallel Muon)

### Our Edge: sp4096 Vocabulary

- sp4096 tokens_per_byte: 0.3063 vs sp1024: 0.4149 → 26.2% fewer tokens
- Baseline A/B test (400 steps): sp4096 = 1.6208 BPB vs sp1024 = 1.7144 BPB → -5.5%
- #1 arch A/B test (400 steps, seed 42): sp4096+factored = 1.8693 BPB vs sp1024 = 2.0067 BPB → -6.8%
- Extrapolated SOTA: 1.1194 × 0.93 ≈ 1.04-1.06 BPB
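
For the record, the BPB arithmetic that makes the vocab swap attractive (standard conversion, nothing project-specific):

```python
import math

def bits_per_byte(nats_per_token: float, tokens_per_byte: float) -> float:
    # per-token cross-entropy converted to bits, scaled by token density
    return (nats_per_token / math.log(2)) * tokens_per_byte

# With 26.2% fewer tokens per byte, sp4096 can afford ~35% higher per-token
# loss (in nats) and still break even on BPB:
#   sp1024: 1.62 BPB -> loss ≈ 1.62 * ln(2) / 0.4149 ≈ 2.71 nats/token
#   sp4096: 1.62 BPB -> loss ≈ 1.62 * ln(2) / 0.3063 ≈ 3.67 nats/token
```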

### Architecture

- 11L, 512d, 8H, 4KV, 3x MLP, LeakyReLU(0.5)^2
- Factored embeddings: tok_emb(4096x256) + embed_up(256→512) + embed_down(512→256) – see sketch below
- All tricks from #1 submission: XSA, Partial RoPE, LN Scale, SmearGate, BigramHash, EMA, TTT, GPTQ-lite
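
How the factored embedding wires up, as a minimal PyTorch sketch – one assumption flagged: the output head here ties to tok_emb's weight, which the notes above don't actually state:

```python
import torch
import torch.nn as nn

class FactoredEmbedding(nn.Module):
    """tok_emb(4096x256) + embed_up(256->512) + embed_down(512->256).

    Assumption: output logits reuse tok_emb.weight (tied), so the whole
    in/out embedding costs ~1.3M params vs ~4.2M for untied full-rank.
    """
    def __init__(self, vocab=4096, rank=256, d_model=512):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab, rank)
        self.embed_up = nn.Linear(rank, d_model, bias=False)
        self.embed_down = nn.Linear(d_model, rank, bias=False)

    def embed(self, ids):    # (B, T) -> (B, T, 512)
        return self.embed_up(self.tok_emb(ids))

    def logits(self, h):     # (B, T, 512) -> (B, T, 4096)
        return self.embed_down(h) @ self.tok_emb.weight.T
```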

### Key Files

- our_submission/train_gpt.py – modified #1 with sp4096 + factored embed + FA2 fallback
- our_submission/train_gpt_original.py – unmodified #1 with FA2 fallback
- train_sp4096.py – tokenizer training + data sharding script
- data/tokenizers/fineweb_4096_bpe.model – trained sp4096 tokenizer
- data/datasets/fineweb10B_sp4096/ – 80 train shards + 1 val shard

### N-gram Cache: CONFIRMED FAKE

- 256M bucket experiment: collision-free hash tables give 1.11 BPB (no improvement)
- All sub-1.0 BPB claims are measurement artifacts from hash collisions
- Valid Dirichlet smoothing gives at most ~0.002-0.005 genuine improvement

### Next Steps

1. Medium fidelity run (10min 1xH100)
2. Int5 MLP quantization (saves ~1.86MB for artifact budget headroom)
3. Get 8xH100 access for final submission (compute grant or RunPod)
4. Temperature scaling, document-isolated TTT for extra gains

### Hardware

- Dev: 1xH100 NVL (Azure NC40ads H100 v5), 96GB VRAM, CUDA 12.8, PyTorch 2.9.1+cu128
- flash-attn 2.8.3 (FA2, not FA3)
- Final submission needs 8xH100

---

## Project 3: GSoC 2026 – DeepChem OLMo Wrapper

**What:** Adding OLMo-2 7B LLM support to DeepChem for molecular property prediction and SMILES generation.

**Org:** DeepChem (participating as a standalone org for the first time in GSoC 2026)
**Mentors:** Riya, Harindhar
**Deadline:** March 31, 2026 18:00 UTC (submitted)

### What Was Built

**PR #4913 (LIVE) – Bug Fix**
- Fixed ChemBERTa's broken import on transformers 5.x
- `transformers.models.roberta.tokenization_roberta_fast` was removed in 5.x
- 3 additions / 4 deletions
- https://github.com/deepchem/deepchem/pull/4913
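
Roughly the shape of the fix (not the literal 3-line diff – see the PR for that; the checkpoint name is just for illustration): import from the stable top-level namespace instead of the internal module path that 5.x dropped.

```python
# Before (breaks on transformers 5.x – the internal module is gone):
#   from transformers.models.roberta.tokenization_roberta_fast import RobertaTokenizerFast

# After – top-level import, stable across 4.x and 5.x:
from transformers import RobertaTokenizerFast

tokenizer = RobertaTokenizerFast.from_pretrained("seyonec/ChemBERTa-zinc-base-v1")
```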

**Issue #4912 (LIVE) – Compat Report**
- Broader transformers 5.x compatibility issues documented
- https://github.com/deepchem/deepchem/issues/4912

**OLMo Wrapper (LOCAL ONLY – not pushed)**
- Files at ~/olmo_draft/olmo.py and ~/olmo_draft/test_olmo.py
- Olmo2ForSequenceClassification – built from scratch, doesn't exist in HF (rough sketch below)
- OLMo wrapper class extending HuggingFaceModel
- Added causal_lm task + generate() to base HuggingFaceModel
- 8/8 tests pass in 27 seconds on CPU
- Uses OLMo-2 (allenai/OLMo-2-1124-7B)
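
The from-scratch classification head follows the standard HF causal-LM pattern (pool at the last real token). A minimal sketch of the idea, not the draft file itself:

```python
import torch
import torch.nn as nn
from transformers import AutoModel

class Olmo2ForSequenceClassification(nn.Module):
    """Minimal version: OLMo-2 backbone + linear score head, pooled at
    each sequence's last non-padding token (assumes right-padding)."""
    def __init__(self, name="allenai/OLMo-2-1124-7B", num_labels=2):
        super().__init__()
        self.backbone = AutoModel.from_pretrained(name)
        self.score = nn.Linear(self.backbone.config.hidden_size, num_labels, bias=False)

    def forward(self, input_ids, attention_mask):
        h = self.backbone(input_ids, attention_mask=attention_mask).last_hidden_state
        last = attention_mask.sum(dim=1) - 1            # index of last real token
        pooled = h[torch.arange(h.size(0), device=h.device), last]
        return self.score(pooled)                       # (B, num_labels)
```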

### Experiments Run

- BBBP classification: ROC-AUC 0.67 (random init, 12.9M params, 200 samples)
- ESOL regression: R² 0.37, MAE 1.27
- SMILES generation: 0% validity (proves pretraining is the core challenge)
- Tokenization analysis: OLMo uses 0.9x the tokens of ChemBERTa, but fragments stereocenters

### Proposal

- ~/gsoc_proposal_final.md – human-written version
- ~/gsoc_proposal_content.md – raw technical reference

### Key Context

- PR #4907 by Aditya-ad48 also adds causal LM generation – complementary, not competing
- DeepChem wants small PRs (<50 lines) from new contributors
- rbharath is the main reviewer/maintainer
- Office hours MWF 9am PST
- Discord: https://discord.gg/RYTrUY8Ssn

---

## Project 4: Genesis – Artificial Life Simulation

**What:** Virtual world where blank GRU neural net agents evolve survival behaviors from scratch – foraging, water-seeking, communication – on an H100 GPU using JAX.

**Location:** /home/azureuser/genesis/ (venv at ~/genesis_env/)

### World Setup

- 512x512 grid with Perlin noise terrain
- Food regrowth, water sources, day/night cycles, seasons
- 1000 agents with GRU brains (~82K params each)
- Tournament selection + Gaussian mutation (self-adaptive sigma – sketched below)
- Agents start with zero knowledge – must learn to survive
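
Self-adaptive means each genome carries its own mutation step-size that evolves along with the weights. Not the actual mutation.py, just the core idea (the tau value is illustrative):

```python
import jax
import jax.numpy as jnp

def mutate(key, genome, log_sigma, tau=0.1):
    """Self-adaptive Gaussian mutation: mutate the genome's own log
    step-size first, then perturb the weights with the new sigma."""
    k_sigma, k_noise = jax.random.split(key)
    new_log_sigma = log_sigma + tau * jax.random.normal(k_sigma, log_sigma.shape)
    noise = jnp.exp(new_log_sigma) * jax.random.normal(k_noise, genome.shape)
    return genome + noise, new_log_sigma
```

vmap this over per-agent keys/genomes and the whole population mutates in one batched call.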

### Status (2026-04-01)

Phases 1-3 complete. 500K step run finished successfully:
- 86 generations evolved
- Agents sustain avg age 3,742 steps, energy 0.98, hydration 0.79
- Signal entropy dropping (4.28 → 3.58) – indicating early communication structure
- Simulation runs at ~1000 steps/s on H100 (JAX jit-compiled)

### Key Fix

food_growth_rate bumped from 0.005 → 0.02 and food_eat_amount from 0.05 → 0.03 to prevent ecological collapse at high generations.

### Architecture

- World: grid.py, resources.py, environment.py, physics.py, observations.py, spatial.py
- Agent: body.py (metabolism), brain.py (GRU + vmap batched), actions.py
- Evolution: fitness.py, selection.py, mutation.py (self-adaptive sigma), population.py
- Communication: signals.py (8-channel, spatial attenuation, top-4 reception – toy sketch below)
- Analysis: emergence.py (signal entropy, magnitude, R², diversity, clustering)
- Visualization: renderer.py (dashboard, world map, zoom views)
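
The reception rule's gist: every agent hears its 4 nearest neighbors, attenuated by distance. A toy dense version (the real spatial.py presumably bins the grid instead of building the full O(N²) distance matrix):

```python
import jax.numpy as jnp
from jax import lax

def receive_signals(positions, signals, k=4):
    """positions: (N, 2), signals: (N, 8) -> received: (N, k, 8)."""
    d = jnp.linalg.norm(positions[:, None] - positions[None, :], axis=-1)
    d = d + jnp.eye(d.shape[0]) * 1e9           # don't hear yourself
    neg_d, idx = lax.top_k(-d, k)               # k nearest neighbors
    atten = 1.0 / (1.0 - neg_d)                 # = 1 / (1 + distance)
    return signals[idx] * atten[..., None]
```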

### Run Data

~/genesis/runs/run_20260401_111309/ – metrics.csv (500 rows), emergence.csv (100 rows), 50 viz frames, 10 checkpoints + FINAL, config.json

### Next Steps

- Phase 4: TRIBE v2 integration – compare evolved GRU representations to human brain activity via RSA
- Phase 5: Scale to 5K+ agents, longer runs for 500+ generations
- Checkpoints at 50K intervals allow comparing brain representations across evolutionary time

---

## Project 5: TRIBE v2 – AI-Brain Loop

**What:** Closing the AI-brain comparison loop using Meta's TRIBE v2 – comparing AI encoder representations to predicted brain activity to find architectural gaps.

**Location:** /home/azureuser/tribev2 (venv at /home/azureuser/tribev2_env)

### What's Built

- Full analysis script: /home/azureuser/tribev2/close_the_loop_v2.py
- 8 phases: load model → extract per-layer features → brain parcellation → layer-wise encoding → modality ablation → RSA (sketch below) → divergence mapping → visualization
- Multimodal stimulus: /home/azureuser/multimodal_stimulus.mp4
- Results: /home/azureuser/loop_results_v2/
- Runs with video (V-JEPA2) + audio (Wav2Vec-BERT) + text (LLaMA 3.2-3B)
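
The RSA phase is the textbook version of representational similarity analysis – build a dissimilarity matrix per representation over the same stimuli, then rank-correlate the two:

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr

def rsa(feats_a, feats_b):
    """feats_*: (n_stimuli, n_features). Returns Spearman rho between the
    two representations' dissimilarity structures (condensed RDMs)."""
    rdm_a = pdist(feats_a, metric="correlation")
    rdm_b = pdist(feats_b, metric="correlation")
    rho, _ = spearmanr(rdm_a, rdm_b)
    return rho
```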

### Status

LLaMA 3.2 access granted. Full 3-modality analysis pipeline complete. Brain-guided ViT training attempted 5 times – all failed.

### Why Attempts Failed

- Never had real brain targets – routed ViT-Small features through TRIBE v2's projector (trained for V-JEPA2), producing random outputs
- Evaluated on the wrong metric (classification accuracy instead of robustness)
- Literature shows brain-guided training helps ROBUSTNESS (+3-8%), not classification accuracy

### What Would Actually Work (from RESEARCH_BRIEF.md)

1. Pre-compute real brain targets using TRIBE v2's full pipeline
2. Train the student with classification + a per-vertex Pearson correlation brain loss (sketch below)
3. Evaluate on corruption/adversarial robustness, shape bias, brain-score – NOT accuracy
4. Or: use real fMRI data (Natural Scenes Dataset) instead of TRIBE v2 predictions
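
The brain loss in step 2 would look something like this (my sketch, not something already in the repo; `lambda_brain` is a made-up knob):

```python
import torch

def pearson_brain_loss(pred, target, eps=1e-8):
    """1 - Pearson r per vertex, averaged. pred/target: (batch, n_vertices)
    predicted vs. target brain activity across a batch of stimuli."""
    pred = pred - pred.mean(dim=0)
    target = target - target.mean(dim=0)
    r = (pred * target).sum(dim=0) / (pred.norm(dim=0) * target.norm(dim=0) + eps)
    return (1.0 - r).mean()

# total_loss = ce_loss + lambda_brain * pearson_brain_loss(student_out, brain_targets)
```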

### Key Infrastructure

- Training scripts: /home/azureuser/brain_guided/train_*.py
- UCF-101 dataset: /home/azureuser/brain_guided/data/UCF-101 (13K videos)
- Results: /home/azureuser/brain_guided/results_final/

---

## Project 6: Instagram Cinema

**What:** AI-generated cinematic videos using LTX-2.3 22B on ComfyUI for Instagram growth.

**Setup:** LTX-2.3 22B dev model running on the H100 via ComfyUI, exposed via a cloudflared tunnel.

**Format:** Instagram Reels – 9:16 portrait, 544x960

**Goal:** Create viral-quality cinematic content for Instagram Reels.

---

## Money-Making Strategy (April 2026)

### Sellable Assets

1. **TurboQuant** – working implementation nobody else has publicly. Lead magnet for consulting.
2. **Parameter Golf** – competition result (if top placement) = massive credibility signal
3. **Fine-tuning expertise** – proven on H100, multiple model families
4. **Inference optimization consulting** – directly from TurboQuant benchmarks

### Immediate Plan

- Path to 10L: Freelancing/consulting – fine-tuning + inference optimization
- Path to 1Cr: Productized consulting at scale or AI startup
- Channel: X (Twitter) for distribution, direct DMs to founders for sales

### X (Twitter) Growth Strategy

- Account: 10 followers currently, Premium purchased (213.50/month with 50% off)
- Strategy: 70% replies (to bigger accounts), 30% original posts
- Target: 15 strategic replies/day to accounts with 100-5000 followers
- Post timing: 6:30 PM IST (9:00 AM ET) on Tue/Wed/Thu
- Pinned thread: TurboQuant benchmarks
- Goal: 500 followers in 4 weeks, first paid client in 2-4 weeks

### Cold Outreach Template

"I noticed you're using [X model]. I can cut your inference cost by 40%. Free 1-week proof. Interested?"

### Target Clients

- Indian startups using LLMs in production (inc42 AI list)
- US startups from YC directory (AI/ML category, S24/W25 batches)
- Anyone on Twitter complaining about GPU costs / inference scaling
- Companies with >$10K/month GPU spend