Vivek Varikuti - Complete Profile & Project Portfolio
Who I Am
- 22 years old, AI Engineer & Startup Founder
- GitHub: vivekvar-dl
- Email: domainluther1234@gmail.com
- Strong Python/PyTorch/LLM skills, deep transformer training experience
- Hardware: 1x NVIDIA H100 NVL 96GB on Azure (NC40ads H100 v5)
- CUDA 12.8, PyTorch 2.7.0+cu128, flash-attn 2.8.3 (FA2)
- Transformers 5.4.0
Working Style
- No AI fluff. No menus of options. Make the decision and execute.
- Write like a human - no perfect grammar, no emojis, no "leveraging" or "seamless"
- Any public text must read like a tired developer typed it at 2am
- No co-authored-by Claude in git commits - public contributions look fully human
- Verify before claiming. Test before shipping. Always run the actual code.
Project 1: TurboQuant - KV Cache Compression
What: Implementing Google's TurboQuant paper (arXiv 2504.19874, Zandieh et al.) for KV cache compression during LLM inference.
Why: Compress KV cache ~4-7x on production LLMs to enable longer contexts and batching on H100 NVL (96GB).
Location: /home/azureuser/turboquant/
Status: Working prototype. Google hasn't released their code publicly, so this is one of the first working implementations.
Core Method: Mixed-precision quantization of the KV cache. Profile each layer's activation norms, identify outlier layers that need full precision, quantize the rest. No retraining, no fine-tuning: it's a drop-in replacement.
Key Discovery: Layer 0 (and sometimes the last layer) of Qwen models has anomalously large key norms (~16-50x the median). These layers must be kept in BF16 (skip_layers). An auto-calibration function detects outlier layers; the idea is sketched below.
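A minimal sketch of that calibration step. The helper name and the ratio threshold are assumptions, not the actual code in /home/azureuser/turboquant/:

```python
import torch

def find_outlier_layers(per_layer_key_norms, ratio_threshold=4.0):
    """Flag layers whose key-activation norm sits far above the across-layer
    median; those stay in BF16 (skip_layers), the rest get quantized.
    ratio_threshold is an illustrative value, not the repo's."""
    norms = torch.tensor(per_layer_key_norms, dtype=torch.float32)
    ratios = norms / norms.median()
    return (ratios > ratio_threshold).nonzero().flatten().tolist()

# With the Qwen2.5-7B profile below (median ~16.9, layer 0 at ~273.8,
# layer 27 at ~239.9) this flags layers 0 and 27.
```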
Benchmark Results (H100 NVL 96GB)
Model Architecture Summary
| Model | Architecture | KV Heads | head_dim | Outlier Layers | Prefill Fidelity |
|---|---|---|---|---|---|
| Qwen2.5-7B | 28L, qwen2 | 4 | 128 | layers 0, 27 | exact |
| Llama-3.1-8B | 32L, llama | 8 | 128 | none | exact |
| Gemma-2-9B | 42L, gemma2 | 8 | 256 | none | exact |
| Phi-4-14B | 40L, phi3 | 10 | 128 | none | exact |
| Qwen2.5-32B | 64L, qwen2 | 8 | 128 | none | exact |
| Llama-3.3-70B | 80L, llama | 8 | 128 | N/A (disk full) | N/A |
Memory Savings at 8K Context
| Model | Default VRAM | TurboQuant VRAM | Saved | KV Cache Reduction |
|---|---|---|---|---|
| Gemma-2-9B | 9.98 GB | 7.71 GB | 2,323 MB | ~59% |
| Qwen2.5-32B | 23.16 GB | 21.41 GB | 1,791 MB | ~47% |
| Phi-4-14B | 12.28 GB | 10.92 GB | 1,392 MB | ~44% |
| LLaMA-3.1-8B | 7.71 GB | 6.84 GB | 890 MB | ~44% |
| Qwen2.5-7B | 7.08 GB | 6.71 GB | 380 MB | ~44% |
Memory Savings Scaling (LLaMA-3.1-8B)
| Context Length | Default VRAM | TurboQuant VRAM | Saved |
|---|---|---|---|
| 1K tokens | 6.00 GB | 5.91 GB | 93 MB |
| 4K tokens | 6.67 GB | 6.27 GB | 417 MB |
| 8K tokens | 7.71 GB | 6.84 GB | 890 MB |
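A rough way to see why the savings grow with context: the full-precision KV cache scales linearly with sequence length. This is a back-of-envelope sketch from the architecture table (BF16, ignores allocator overhead and everything outside the cache), so it won't match the total-VRAM deltas exactly:

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bytes_per_elem=2):
    # one K and one V tensor per layer, BF16 = 2 bytes per element
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_elem

# LLaMA-3.1-8B from the architecture table: 32L, 8 KV heads, head_dim 128
for ctx in (1024, 4096, 8192):
    print(ctx, round(kv_cache_bytes(32, 8, 128, ctx) / 2**30, 3), "GiB")
# linear growth in context is why the saved VRAM climbs from tens of MB
# at 1K tokens to hundreds of MB at 8K
```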
Full Memory Data Per Model
Qwen2.5-7B (5.45 GB model)
- Layer norms: median 16.86, max 273.84 (layer 0), ratio 16.24x
- Outlier layers: 0 (norm 273.84), 27 (norm 239.91)
- 1K: 5.76 → 5.73 GB (37 MB saved)
- 4K: 6.27 → 6.10 GB (176 MB saved)
- 8K: 7.08 → 6.71 GB (380 MB saved)
LLaMA-3.1-8B (5.68 GB model)
- Layer norms: median 17.90, max 21.05 (layer 7), ratio 1.18x
- No outlier layers
- 1K: 6.00 → 5.91 GB (93 MB saved, output match)
- 4K: 6.67 → 6.27 GB (417 MB saved, output match)
- 8K: 7.71 → 6.84 GB (890 MB saved, output match)
Gemma-2-9B (6.08 GB model)
- Layer norms: median 17.82, max 21.28 (layer 25), ratio 1.19x
- No outlier layers
- 1K: 6.62 → 6.38 GB (244 MB saved)
- 4K: 7.96 → 6.89 GB (1,096 MB saved)
- 8K: 9.98 → 7.71 GB (2,323 MB saved)
Phi-4-14B (9.10 GB model)
- Layer norms: median 19.21, max 26.46 (layer 0), ratio 1.38x
- No outlier layers
- 1K: 9.75 → 9.61 GB (146 MB saved)
- 4K: 10.72 → 10.09 GB (650 MB saved)
- 8K: 12.28 → 10.92 GB (1,392 MB saved)
Qwen2.5-32B (19.31 GB model)
- Layer norms: median 16.09, max 37.82 (layer 0), ratio 2.35x
- No outlier layers
- 1K: 19.97 → 19.79 GB (186 MB saved)
- 4K: 21.23 → 20.42 GB (833 MB saved)
- 8K: 23.16 → 21.41 GB (1,791 MB saved)
LLaMA-3.3-70B: failed with "No space left on device"
Quality Verification
All models tested with 3 prompts: "Explain quantum computing", "Write a Python prime checker", "What causes northern lights?"
- Prefill logit difference: 0.0 across ALL models (a sketch of the check follows the per-model detail)
- Same top-1 token prediction: 100% across ALL models
- Output coherence: 100% - both default and TurboQuant outputs fully coherent
- Token match rate varies (18-100%) due to natural autoregressive sampling divergence - both outputs equally valid
Detailed quality per model:
- Qwen2.5-7B: token match 39%, 3%, 54% - both coherent on all 3 prompts
- LLaMA-3.1-8B: token match 89.1%, 100%, 100% - 2/3 exact match
- Phi-4-14B: token match 100%, 44%, 100% - 2/3 exact match
- Gemma-2-9B: token match 100%, 100%, 18.8% - 2/3 exact match
- Qwen2.5-32B: token match 71%, 25%, 53% - both coherent on all 3 prompts
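Sketch of the kind of check behind the prefill numbers, assuming the compressed cache exposes the standard HF Cache interface; names here are illustrative, not the repo's actual test harness:

```python
import torch

@torch.no_grad()
def prefill_fidelity(model, input_ids, quantized_cache):
    """Compare prefill logits with the default cache vs. the TurboQuant
    cache: max absolute logit difference plus top-1 agreement at the
    last position."""
    ref = model(input_ids).logits
    out = model(input_ids, past_key_values=quantized_cache).logits
    max_diff = (ref - out).abs().max().item()
    top1_match = bool((ref[:, -1].argmax(-1) == out[:, -1].argmax(-1)).all())
    return max_diff, top1_match
```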
Infrastructure Notes
- Environment: torch 2.7.0+cu128, transformers 5.4.0, H100 NVL CUDA 12.8 (driver 570)
- PyTorch compiled for CUDA 13.0+ won't work - need the cu128 wheel
- Core quantizer verified (MSE matches paper bounds)
- Cache integrates with HF Transformers v5.4.0 QuantizedLayer API
Project 2: Parameter Golf Competition (OpenAI)
What: OpenAI competition - train the best language model within a 16MB artifact and 10 minutes on 8xH100.
Metric: Bits-per-byte (BPB) on FineWeb validation (62M tokens sp1024, 45.5M tokens sp4096)
Timeline: March 18 - April 30, 2026
Current SOTA (merged): 1.1194 BPB (PR #549, LeakyReLU^2 + TTT + Parallel Muon)
Our Edge: sp4096 Vocabulary
- sp4096 tokens_per_byte: 0.3063 vs sp1024: 0.4149 → 26.2% fewer tokens
- Baseline A/B test (400 steps): sp4096 = 1.6208 BPB vs sp1024 = 1.7144 BPB → -5.5%
- #1 arch A/B test (400 steps, seed 42): sp4096+factored = 1.8693 BPB vs sp1024 = 2.0067 BPB → -6.8%
- Extrapolated SOTA: 1.1194 × 0.93 ≈ 1.04-1.06 BPB (back-of-envelope after this list)
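The back-of-envelope behind those numbers: BPB is per-token cross-entropy converted to bits, scaled by tokens per byte, so a smaller tokens_per_byte helps as long as per-token loss doesn't rise proportionally. A sketch, not the official scorer:

```python
import math

def bits_per_byte(ce_loss_nats_per_token, tokens_per_byte):
    # BPB = (cross-entropy per token, in bits) x (tokens emitted per byte)
    return ce_loss_nats_per_token / math.log(2) * tokens_per_byte

# sp4096 emits 26.2% fewer tokens per byte (0.3063 vs 0.4149); each token
# must carry more information, but the 400-step A/Bs above still landed
# 5-7% lower in BPB overall.
print(round(1.1194 * 0.93, 3))  # ~1.041, the basis for the 1.04-1.06 range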
Architecture
- 11L, 512d, 8H, 4KV, 3x MLP, LeakyReLU(0.5)^2
- Factored embeddings: tok_emb(4096x256) + embed_up(256→512) + embed_down(512→256) - sketched after this list
- All tricks from #1 submission: XSA, Partial RoPE, LN Scale, SmearGate, BigramHash, EMA, TTT, GPTQ-lite
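Sketch of the factored embedding with the shapes from the bullet above; how the output head ties back into tok_emb is an assumption, the competition code may wire it differently:

```python
import torch.nn as nn

class FactoredEmbedding(nn.Module):
    """4096-token vocab embedded at rank 256, projected up to the 512-d
    residual stream; embed_down mirrors it on the output side."""
    def __init__(self, vocab=4096, rank=256, d_model=512):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab, rank)                 # 4096 x 256
        self.embed_up = nn.Linear(rank, d_model, bias=False)     # 256 -> 512
        self.embed_down = nn.Linear(d_model, rank, bias=False)   # 512 -> 256

    def embed(self, idx):
        return self.embed_up(self.tok_emb(idx))

    def logits(self, hidden):
        # project back to rank, then reuse tok_emb weights as the unembedding
        return self.embed_down(hidden) @ self.tok_emb.weight.t()
```

Back-of-envelope: this keeps the embedding side at roughly 1.3M params instead of ~2.1M for a full tied 4096x512 table, which matters under the 16MB artifact cap.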
Key Files
- our_submission/train_gpt.py - modified #1 with sp4096 + factored embed + FA2 fallback
- our_submission/train_gpt_original.py - unmodified #1 with FA2 fallback
- train_sp4096.py - tokenizer training + data sharding script
- data/tokenizers/fineweb_4096_bpe.model - trained sp4096 tokenizer
- data/datasets/fineweb10B_sp4096/ - 80 train shards + 1 val shard
N-gram Cache: CONFIRMED FAKE
- 256M bucket experiment: collision-free hash tables give 1.11 BPB (no improvement)
- All sub-1.0 BPB claims are measurement artifacts from hash collisions
- Valid Dirichlet smoothing gives at most ~0.002-0.005 genuine improvement
Next Steps
- Medium fidelity run (10min 1xH100)
- Int5 MLP quantization (saves ~1.86MB for artifact budget headroom)
- Get 8xH100 access for final submission (compute grant or RunPod)
- Temperature scaling, document-isolated TTT for extra gains
Hardware
- Dev: 1xH100 NVL (Azure NC40ads H100 v5), 96GB VRAM, CUDA 12.8, PyTorch 2.9.1+cu128
- flash-attn 2.8.3 (FA2, not FA3)
- Final submission needs 8xH100
Project 3: GSoC 2026 - DeepChem OLMo Wrapper
What: Adding OLMo-2 7B LLM support to DeepChem for molecular property prediction and SMILES generation.
Org: DeepChem (standalone org for the first time in GSoC 2026)
Mentors: Riya, Harindhar
Deadline: March 31, 2026 18:00 UTC (proposal submitted)
What Was Built
PR #4913 (LIVE) - Bug Fix
- Fixed ChemBERTa's broken import for transformers 5.x
- transformers.models.roberta.tokenization_roberta_fast was removed in 5.x
- 3 additions / 4 deletions (the shape of the fix is sketched below the link)
- https://github.com/deepchem/deepchem/pull/4913
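Roughly the pattern such a fix takes - import the fast tokenizer from the top-level package instead of the removed submodule path. This is a hedged sketch, not the literal diff in PR #4913:

```python
# hypothetical compat pattern, not the actual PR #4913 change
try:
    from transformers import RobertaTokenizerFast
except ImportError:
    # fall back to the old submodule path on older transformers releases
    from transformers.models.roberta.tokenization_roberta_fast import (
        RobertaTokenizerFast,
    )
```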
Issue #4912 (LIVE) - Compat Report
- Broader transformers 5.x compatibility issues documented
- https://github.com/deepchem/deepchem/issues/4912
OLMo Wrapper (LOCAL ONLY - not pushed)
- Files at ~/olmo_draft/olmo.py and ~/olmo_draft/test_olmo.py
- Olmo2ForSequenceClassification - built from scratch (doesn't exist in HF); the general shape is sketched after this list
- OLMo wrapper class extending HuggingFaceModel
- Added causal_lm task + generate() to base HuggingFaceModel
- 8/8 tests pass in 27 seconds on CPU
- Uses OLMo-2 (allenai/OLMo-2-1124-7B)
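The general shape of that from-scratch classifier: an OLMo-2 backbone with a linear head pooled at the last non-padded token. Names, pooling, and label count are assumptions; the local draft in ~/olmo_draft/olmo.py may differ:

```python
import torch
import torch.nn as nn
from transformers import AutoModel

class Olmo2SequenceClassifier(nn.Module):
    """Backbone + linear head, standing in for the missing
    Olmo2ForSequenceClassification (hypothetical sketch)."""
    def __init__(self, model_name="allenai/OLMo-2-1124-7B", num_labels=2):
        super().__init__()
        self.backbone = AutoModel.from_pretrained(model_name)
        self.score = nn.Linear(self.backbone.config.hidden_size, num_labels)

    def forward(self, input_ids, attention_mask):
        hidden = self.backbone(
            input_ids, attention_mask=attention_mask
        ).last_hidden_state
        # pool the last non-padded position, like HF's causal-LM classifiers
        last = attention_mask.sum(dim=-1) - 1
        pooled = hidden[torch.arange(hidden.size(0)), last]
        return self.score(pooled)
```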
Experiments Run
- BBBP classification: ROC-AUC 0.67 (random init, 12.9M params, 200 samples)
- ESOL regression: RΒ² 0.37, MAE 1.27
- SMILES generation: 0% validity (proves pretraining is core challenge)
- Tokenization analysis: OLMo 0.9x tokens vs ChemBERTa, but fragments stereocenters
Proposal
- ~/gsoc_proposal_final.md - human-written version
- ~/gsoc_proposal_content.md - raw technical reference
Key Context
- PR #4907 by Aditya-ad48 also adds causal LM generation - complementary, not competing
- DeepChem wants small PRs (<50 lines) for new contributors
- rbharath is the main reviewer/maintainer
- Office hours MWF 9am PST
- Discord: https://discord.gg/RYTrUY8Ssn
Project 4: Genesis - Artificial Life Simulation
What: Virtual world where blank GRU neural-net agents evolve survival behaviors from scratch (foraging, water-seeking, communication) on an H100 GPU using JAX.
Location: /home/azureuser/genesis/ (venv at ~/genesis_env/)
World Setup
- 512x512 grid with Perlin noise terrain
- Food regrowth, water sources, day/night cycles, seasons
- 1000 agents with GRU brains (~82K params each)
- Tournament selection + Gaussian mutation (self-adaptive sigma) - sketched after this list
- Agents start with zero knowledge - must learn to survive
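Sketch of the self-adaptive mutation step referenced above; tau and the array shapes are assumptions, mutation.py may differ:

```python
import jax
import jax.numpy as jnp

def mutate(key, params, sigma, tau=0.05):
    """Self-adaptive Gaussian mutation: each agent's own sigma drifts
    log-normally, then its flattened genome gets N(0, sigma) noise.
    params: (population, n_params), sigma: (population,)."""
    k_sigma, k_noise = jax.random.split(key)
    new_sigma = sigma * jnp.exp(tau * jax.random.normal(k_sigma, sigma.shape))
    noise = jax.random.normal(k_noise, params.shape) * new_sigma[:, None]
    return params + noise, new_sigma
```

This form is already batched over the population, so it jit-compiles alongside the vmapped GRU brains without extra plumbing.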
Status (2026-04-01)
Phases 1-3 complete. 500K step run finished successfully:
- 86 generations evolved
- Agents sustain avg age 3,742 steps, energy 0.98, hydration 0.79
- Signal entropy dropping (4.28 → 3.58), indicating early communication structure (entropy sketch after this list)
- Simulation runs at ~1000 steps/s on H100 (JAX jit-compiled)
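One way the entropy number could be computed; the binning and value range here are assumptions and emergence.py may normalize differently:

```python
import jax.numpy as jnp

def signal_entropy(signals, num_bins=32, value_range=(0.0, 1.0)):
    """Shannon entropy (bits) of emitted signal values, pooled over agents
    and the 8 channels. A drop (4.28 -> 3.58 here) means signals are
    concentrating on fewer distinct values."""
    hist, _ = jnp.histogram(signals.reshape(-1), bins=num_bins,
                            range=value_range)
    p = hist / jnp.maximum(hist.sum(), 1)
    p = jnp.where(p > 0, p, 1.0)  # log2(1) = 0, so empty bins contribute 0
    return float(-(p * jnp.log2(p)).sum())
```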
Key Fix
food_growth_rate bumped from 0.005 → 0.02 and food_eat_amount from 0.05 → 0.03 to prevent ecological collapse at high generations.
Architecture
- World: grid.py, resources.py, environment.py, physics.py, observations.py, spatial.py
- Agent: body.py (metabolism), brain.py (GRU + vmap batched), actions.py
- Evolution: fitness.py, selection.py, mutation.py (self-adaptive sigma), population.py
- Communication: signals.py (8-channel, spatial attenuation, top-4 reception)
- Analysis: emergence.py (signal entropy, magnitude, RΒ², diversity, clustering)
- Visualization: renderer.py (dashboard, world map, zoom views)
Run Data
~/genesis/runs/run_20260401_111309/ - metrics.csv (500 rows), emergence.csv (100 rows), 50 viz frames, 10 checkpoints + FINAL, config.json
Next Steps
- Phase 4: TRIBE v2 integration - compare evolved GRU representations to human brain activity via RSA
- Phase 5: Scale to 5K+ agents, longer runs for 500+ generations
- Checkpoints at 50K intervals allow comparing brain representations across evolutionary time
Project 5: TRIBE v2 - AI-Brain Loop
What: Closing the AI-brain comparison loop using Meta's TRIBE v2 - comparing AI encoder representations to predicted brain activity to find architectural gaps.
Location: /home/azureuser/tribev2 (venv at /home/azureuser/tribev2_env)
What's Built
- Full analysis script: /home/azureuser/tribev2/close_the_loop_v2.py
- 8 phases: load model → extract per-layer features → brain parcellation → layer-wise encoding → modality ablation → RSA (sketched after this list) → divergence mapping → visualization
- Multimodal stimulus: /home/azureuser/multimodal_stimulus.mp4
- Results: /home/azureuser/loop_results_v2/
- Runs with video (V-JEPA2) + audio (Wav2Vec-BERT) + text (LLaMA 3.2-3B)
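The RSA step from that pipeline, reduced to its core comparison; the real close_the_loop_v2.py works per layer and per parcel, this is just the shape of it:

```python
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr

def rsa_score(model_feats, brain_preds):
    """Representational similarity analysis: build a dissimilarity matrix
    over stimuli in each space, then rank-correlate the two.
    model_feats: (n_stimuli, d_model), brain_preds: (n_stimuli, n_vertices)."""
    rdm_model = pdist(model_feats, metric="correlation")
    rdm_brain = pdist(brain_preds, metric="correlation")
    rho, _ = spearmanr(rdm_model, rdm_brain)
    return rho
```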
Status
LLaMA 3.2 access granted. Full 3-modality analysis pipeline complete. Brain-guided ViT training attempted 5 times - all failed.
Why Attempts Failed
- Never had real brain targets - routed ViT-Small features through TRIBE v2's projector (trained for V-JEPA2), producing random outputs
- Evaluated on wrong metric (classification accuracy instead of robustness)
- Literature shows brain-guided training helps ROBUSTNESS (+3-8%), not classification accuracy
What Would Actually Work (from RESEARCH_BRIEF.md)
- Pre-compute real brain targets using TRIBE v2's full pipeline
- Train student with classification + per-vertex Pearson correlation brain loss (sketched after this list)
- Evaluate on corruption/adversarial robustness, shape bias, brain-score - NOT accuracy
- Or: use real fMRI data (Natural Scenes Dataset) instead of TRIBE v2 predictions
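Sketch of that combined loss; alpha and how the student's features get projected into vertex space are assumptions, not what RESEARCH_BRIEF.md specifies:

```python
import torch
import torch.nn.functional as F

def brain_guided_loss(logits, labels, student_vertices, brain_targets,
                      alpha=0.5):
    """Classification loss plus a per-vertex Pearson-correlation brain loss.
    student_vertices, brain_targets: (batch, n_vertices); brain_targets
    would be the pre-computed TRIBE v2 predictions described above."""
    cls_loss = F.cross_entropy(logits, labels)
    x = student_vertices - student_vertices.mean(dim=0, keepdim=True)
    y = brain_targets - brain_targets.mean(dim=0, keepdim=True)
    r = (x * y).sum(dim=0) / (x.norm(dim=0) * y.norm(dim=0) + 1e-8)
    brain_loss = 1.0 - r.mean()  # push per-vertex correlation toward 1
    return cls_loss + alpha * brain_loss
```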
Key Infrastructure
- Training scripts: /home/azureuser/brain_guided/train_*.py
- UCF-101 dataset: /home/azureuser/brain_guided/data/UCF-101 (13K videos)
- Results: /home/azureuser/brain_guided/results_final/
Project 6: Instagram Cinema
What: AI-generated cinematic videos using LTX-2.3 22B on ComfyUI for Instagram growth.
Setup: LTX-2.3 22B dev model running on H100 via ComfyUI, exposed via cloudflared tunnel.
Format: Instagram Reels β 9:16 portrait, 544x960
Goal: Create viral-quality cinematic content for Instagram Reels.
Money-Making Strategy (April 2026)
Sellable Assets
- TurboQuant - working implementation nobody else has publicly. Lead magnet for consulting.
- Parameter Golf - competition result (if top placement) = massive credibility signal
- Fine-tuning expertise - proven on H100, multiple model families
- Inference optimization consulting - directly from TurboQuant benchmarks
Immediate Plan
- Path to 10L: freelancing/consulting - fine-tuning + inference optimization
- Path to 1Cr: Productized consulting at scale or AI startup
- Channel: X (Twitter) for distribution, direct DMs to founders for sales
X (Twitter) Growth Strategy
- Account: 10 followers currently, Premium purchased (213.50/month with 50% off)
- Strategy: 70% replies (to bigger accounts), 30% original posts
- Target: 15 strategic replies/day to accounts with 100-5000 followers
- Post timing: 6:30 PM IST (9:00 AM EST) on Tue/Wed/Thu
- Pinned thread: TurboQuant benchmarks
- Goal: 500 followers in 4 weeks, first paid client in 2-4 weeks
Cold Outreach Template
"I noticed you're using [X model]. I can cut your inference cost by 40%. Free 1-week proof. Interested?"
Target Clients
- Indian startups using LLMs in production (inc42 AI list)
- US startups from YC directory (AI/ML category, S24/W25 batches)
- Anyone on Twitter complaining about GPU costs / inference scaling
- Companies with >$10K/month GPU spend