# Vivek Varikuti - Complete Profile & Project Portfolio

## Who I Am

- 22 years old, AI Engineer & Startup Founder
- GitHub: vivekvar-dl
- Email: domainluther1234@gmail.com
- Strong Python/PyTorch/LLM skills, deep transformer training experience
- Hardware: 1x NVIDIA H100 NVL 96GB on Azure (NC40ads H100 v5)
- CUDA 12.8, PyTorch 2.7.0+cu128, flash-attn 2.8.3 (FA2)
- Transformers 5.4.0

---

## Working Style

- No AI fluff. No menus of options. Make the decision and execute.
- Write like a human: no perfect grammar, no emojis, no "leveraging" or "seamless"
- Any public text must read like a tired developer typed it at 2am
- No co-authored-by Claude in git commits, public contributions look fully human
- Verify before claiming. Test before shipping. Always run the actual code.

---

## Project 1: TurboQuant - KV Cache Compression

**What:** Implementing Google's TurboQuant paper (arXiv 2504.19874, Zandieh et al.) for KV cache compression during LLM inference.

**Why:** Compress KV cache ~4-7x on production LLMs to enable longer contexts and batching on H100 NVL (96GB).

**Location:** /home/azureuser/turboquant/

**Status:** Working prototype. Google hasn't released their code publicly, so this is one of the first working implementations.

**Core Method:** Mixed-precision quantization of KV cache. Profile each layer's activation norms, identify outlier layers that need full precision, quantize the rest. No retraining, no fine-tuning. Drop-in replacement.

**Key Discovery:** Layer 0 (and sometimes the last layer) of Qwen models has anomalously large key norms (~16-50x median). These layers must be kept in BF16 (skip_layers). An auto-calibration function detects outlier layers.
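
Rough sketch of what the outlier check could look like. The function name and the 5x threshold are assumptions for illustration, not the actual calibration code. The only grounding is that the 16.24x layer gets flagged and a 2.35x layer does not.

```python
from statistics import median

def find_outlier_layers(key_norms, ratio_threshold=5.0):
    """Return indices of layers whose key norm exceeds ratio_threshold x the median.

    ratio_threshold is an assumed value, not taken from the real calibration code.
    """
    med = median(key_norms)
    return [i for i, n in enumerate(key_norms) if n > ratio_threshold * med]

# Hypothetical 28-layer profile shaped like the Qwen2.5-7B numbers in this doc:
# most layers near the median of 16.86, layers 0 and 27 far above it.
norms = [273.84] + [16.86] * 26 + [239.91]
skip_layers = find_outlier_layers(norms)
print(skip_layers)  # [0, 27]
```

Layers returned here would stay in BF16, everything else gets quantized.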

### Benchmark Results (H100 NVL 96GB)

#### Model Architecture Summary

| Model | Architecture | KV Heads | head_dim | Outlier Layers | Prefill Fidelity |
|-------|-------------|----------|---------|----------------|-----------------|
| Qwen2.5-7B | 28L, qwen2 | 4 | 128 | layers 0, 27 | exact |
| Llama-3.1-8B | 32L, llama | 8 | 128 | none | exact |
| Gemma-2-9B | 42L, gemma2 | 8 | 256 | none | exact |
| Phi-4-14B | 40L, phi3 | 10 | 128 | none | exact |
| Qwen2.5-32B | 64L, qwen2 | 8 | 128 | none | exact |
| Llama-3.3-70B | 80L, llama | 8 | 128 | N/A (disk full) | N/A |

#### Memory Savings at 8K Context

| Model | Default VRAM | TurboQuant VRAM | Saved | KV Cache Reduction |
|-------|-------------|----------------|-------|-------------------|
| Gemma-2-9B | 9.98 GB | 7.71 GB | 2,323 MB | ~59% |
| Qwen2.5-32B | 23.16 GB | 21.41 GB | 1,791 MB | ~47% |
| Phi-4-14B | 12.28 GB | 10.92 GB | 1,392 MB | ~44% |
| LLaMA-3.1-8B | 7.71 GB | 6.84 GB | 890 MB | ~44% |
| Qwen2.5-7B | 7.08 GB | 6.71 GB | 380 MB | ~44% |

#### Memory Savings Scaling (LLaMA-3.1-8B)

| Context Length | Default VRAM | TurboQuant VRAM | Saved |
|---------------|-------------|----------------|-------|
| 1K tokens | 6.00 GB | 5.91 GB | 93 MB |
| 4K tokens | 6.67 GB | 6.27 GB | 417 MB |
| 8K tokens | 7.71 GB | 6.84 GB | 890 MB |
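
Savings grow roughly linearly with context because the unquantized cache itself does. A minimal sketch of the baseline BF16 cache size, assuming batch size 1, 2 bytes per element, and a plain per-layer K/V layout (no sliding window). Measured VRAM above also includes weights and activations, so these numbers are not directly comparable to the table.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   bytes_per_elem=2, batch_size=1):
    """Size of an unquantized KV cache: one K and one V tensor per layer."""
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem
    return per_token * seq_len * batch_size

MIB = 1024 ** 2

# LLaMA-3.1-8B row from the architecture table: 32 layers, 8 KV heads, head_dim 128
for tokens in (1024, 4096, 8192):
    size = kv_cache_bytes(32, 8, 128, tokens)
    print(f"{tokens:>5} tokens: {size / MIB:.0f} MiB")
```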

#### Full Memory Data Per Model

**Qwen2.5-7B (5.45 GB model)**
- Layer norms: median 16.86, max 273.84 (layer 0), ratio 16.24x
- Outlier layers: 0 (norm 273.84), 27 (norm 239.91)
- 1K: 5.76→5.73 GB (37 MB saved)
- 4K: 6.27→6.10 GB (176 MB saved)
- 8K: 7.08→6.71 GB (380 MB saved)

**LLaMA-3.1-8B (5.68 GB model)**
- Layer norms: median 17.90, max 21.05 (layer 7), ratio 1.18x
- No outlier layers
- 1K: 6.00→5.91 GB (93 MB saved, output match)
- 4K: 6.67→6.27 GB (417 MB saved, output match)
- 8K: 7.71→6.84 GB (890 MB saved, output match)

**Gemma-2-9B (6.08 GB model)**
- Layer norms: median 17.82, max 21.28 (layer 25), ratio 1.19x
- No outlier layers
- 1K: 6.62→6.38 GB (244 MB saved)
- 4K: 7.96→6.89 GB (1,096 MB saved)
- 8K: 9.98→7.71 GB (2,323 MB saved)

**Phi-4-14B (9.10 GB model)**
- Layer norms: median 19.21, max 26.46 (layer 0), ratio 1.38x
- No outlier layers
- 1K: 9.75→9.61 GB (146 MB saved)
- 4K: 10.72→10.09 GB (650 MB saved)
- 8K: 12.28→10.92 GB (1,392 MB saved)

**Qwen2.5-32B (19.31 GB model)**
- Layer norms: median 16.09, max 37.82 (layer 0), ratio 2.35x
- No outlier layers
- 1K: 19.97→19.79 GB (186 MB saved)
- 4K: 21.23→20.42 GB (833 MB saved)
- 8K: 23.16→21.41 GB (1,791 MB saved)

**LLaMA-3.3-70B**: failed with "No space left on device"

#### Quality Verification

All models tested with 3 prompts: "Explain quantum computing", "Write a Python prime checker", "What causes northern lights?"

- Prefill logit difference: 0.0 across ALL models
- Same top-1 token prediction: 100% across ALL models
- Output coherence: 100%, both default and TurboQuant outputs fully coherent
- Token match rate varies (3-100%) due to natural autoregressive sampling divergence. Both outputs equally valid

**Detailed quality per model:**

- Qwen2.5-7B: token match 39%, 3%, 54% (both coherent on all 3 prompts)
- LLaMA-3.1-8B: token match 89.1%, 100%, 100% (2/3 exact match)
- Phi-4-14B: token match 100%, 44%, 100% (2/3 exact match)
- Gemma-2-9B: token match 100%, 100%, 18.8% (2/3 exact match)
- Qwen2.5-32B: token match 71%, 25%, 53% (both coherent on all 3 prompts)
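
How the match rate was computed isn't written down here. One plausible definition is position-wise agreement between the two generated sequences, sketched below with made-up token IDs. Once one token differs, everything after it is conditioned on different text, so low match rates are expected even when both outputs are fine.

```python
def token_match_rate(ref_ids, test_ids):
    """Fraction of positions where two generated token sequences agree."""
    n = min(len(ref_ids), len(test_ids))
    if n == 0:
        return 0.0
    matches = sum(1 for a, b in zip(ref_ids, test_ids) if a == b)
    return matches / n

def first_divergence(ref_ids, test_ids):
    """Index of the first differing token, or None if no compared position differs."""
    for i, (a, b) in enumerate(zip(ref_ids, test_ids)):
        if a != b:
            return i
    return None

# Made-up token IDs, just to show the shape of the comparison
default_out = [12, 7, 99, 4, 4, 31, 8, 2]
quant_out = [12, 7, 99, 5, 4, 30, 9, 2]
print(token_match_rate(default_out, quant_out))  # 0.625
print(first_divergence(default_out, quant_out))  # 3
```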

#### Infrastructure Notes
- Environment: torch 2.7.0+cu128, transformers 5.4.0, H100 NVL CUDA 12.8 (driver 570)
- PyTorch compiled for CUDA 13.0+ won't work, need cu128 wheel
- Core quantizer verified (MSE matches paper bounds)
- Cache integrates with HF Transformers v5.4.0 QuantizedLayer API

---

## Project 2: Parameter Golf Competition (OpenAI)

**What:** OpenAI competition: train the best language model within a 16MB artifact, 10 minutes on 8xH100.

**Metric:** Bits-per-byte (BPB) on FineWeb validation (62M tokens sp1024, 45.5M tokens sp4096)

**Timeline:** March 18 - April 30, 2026

**Current SOTA (merged):** 1.1194 BPB (PR #549, LeakyReLU^2 + TTT + Parallel Muon)

### Our Edge: sp4096 Vocabulary

- sp4096 tokens_per_byte: 0.3063 vs sp1024: 0.4149 → 26.2% fewer tokens
- Baseline A/B test (400 steps): sp4096 = 1.6208 BPB vs sp1024 = 1.7144 BPB → -5.5%
- #1 arch A/B test (400 steps, seed 42): sp4096+factored = 1.8693 BPB vs sp1024 = 2.0067 BPB → -6.8%
- Extrapolated SOTA: 1.1194 × 0.93 ≈ 1.04-1.06 BPB
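
The arithmetic behind these numbers, as a quick check. `bits_per_byte` is the standard nats-to-BPB conversion and is an assumption about how the competition scores things, not copied from its eval code. The extrapolation assumes the relative gain seen at 400 steps carries over to a full run, which is not guaranteed.

```python
import math

def bits_per_byte(loss_nats_per_token, tokens_per_byte):
    """Convert mean cross-entropy (nats/token) into bits per byte of raw text."""
    return loss_nats_per_token / math.log(2) * tokens_per_byte

def rel_change(new, old):
    """Relative change of new vs old (negative means new is lower)."""
    return new / old - 1.0

token_reduction = 1.0 - 0.3063 / 0.4149
baseline_gain = rel_change(1.6208, 1.7144)
arch_gain = rel_change(1.8693, 2.0067)

print(f"fewer tokens: {token_reduction:.1%}")  # 26.2%
print(f"baseline A/B: {baseline_gain:+.1%}")   # -5.5%
print(f"#1 arch A/B:  {arch_gain:+.1%}")       # -6.8%

# Apply both measured gains to the merged SOTA to get the extrapolated range
low = 1.1194 * (1 + arch_gain)
high = 1.1194 * (1 + baseline_gain)
print(f"extrapolated: {low:.3f} to {high:.3f} BPB")
```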

### Architecture

- 11L, 512d, 8H, 4KV, 3x MLP, LeakyReLU(0.5)^2
- Factored embeddings: tok_emb(4096x256) + embed_up(256→512) + embed_down(512→256)
- All tricks from #1 submission: XSA, Partial RoPE, LN Scale, SmearGate, BigramHash, EMA, TTT, GPTQ-lite
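
Parameter count for the factored embedding vs a plain 4096x512 table, to show where the artifact budget goes. Assumes the two projections have no bias terms, which isn't stated here.

```python
VOCAB, D_EMBED, D_MODEL = 4096, 256, 512

# Factored path: narrow lookup table plus two projections (assumed bias-free)
tok_emb = VOCAB * D_EMBED        # 4096 x 256
embed_up = D_EMBED * D_MODEL     # 256 -> 512
embed_down = D_MODEL * D_EMBED   # 512 -> 256
factored = tok_emb + embed_up + embed_down

# Unfactored reference: a single 4096 x 512 table
full = VOCAB * D_MODEL

print(factored, full)                      # 1310720 2097152
print(f"{1 - factored / full:.1%} fewer")  # 37.5% fewer
```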

### Key Files

- our_submission/train_gpt.py: modified #1 with sp4096 + factored embed + FA2 fallback
- our_submission/train_gpt_original.py: unmodified #1 with FA2 fallback
- train_sp4096.py: tokenizer training + data sharding script
- data/tokenizers/fineweb_4096_bpe.model: trained sp4096 tokenizer
- data/datasets/fineweb10B_sp4096/: 80 train shards + 1 val shard

### N-gram Cache: CONFIRMED FAKE

- 256M bucket experiment: collision-free hash tables give 1.11 BPB (no improvement)
- All sub-1.0 BPB claims are measurement artifacts from hash collisions
- Valid Dirichlet smoothing gives at most ~0.002-0.005 genuine improvement

### Next Steps

1. Medium fidelity run (10min 1xH100)
2. Int5 MLP quantization (saves ~1.86MB for artifact budget headroom)
3. Get 8xH100 access for final submission (compute grant or RunPod)
4. Temperature scaling, document-isolated TTT for extra gains

### Hardware

- Dev: 1xH100 NVL (Azure NC40ads H100 v5), 96GB VRAM, CUDA 12.8, PyTorch 2.9.1+cu128
- flash-attn 2.8.3 (FA2, not FA3)
- Final submission needs 8xH100

---

## Project 3: GSoC 2026 - DeepChem OLMo Wrapper

**What:** Adding OLMo-2 7B LLM support to DeepChem for molecular property prediction and SMILES generation.

**Org:** DeepChem (standalone first time in GSoC 2026)
**Mentors:** Riya, Harindhar
**Deadline:** March 31, 2026 18:00 UTC (submitted)

### What Was Built

**PR #4913 (LIVE) - Bug Fix**
- Fixed ChemBERTa broken import for transformers 5.x
- `transformers.models.roberta.tokenization_roberta_fast` removed in 5.x
- 3 additions / 4 deletions
- https://github.com/deepchem/deepchem/pull/4913

**Issue #4912 (LIVE) - Compat Report**
- Broader transformers 5.x compatibility issues documented
- https://github.com/deepchem/deepchem/issues/4912

**OLMo Wrapper (LOCAL ONLY, not pushed)**
- Files at ~/olmo_draft/olmo.py and ~/olmo_draft/test_olmo.py
- Olmo2ForSequenceClassification: built from scratch (doesn't exist in HF)
- OLMo wrapper class extending HuggingFaceModel
- Added causal_lm task + generate() to base HuggingFaceModel
- 8/8 tests pass in 27 seconds on CPU
- Uses OLMo-2 (allenai/OLMo-2-1124-7B)

### Experiments Run

- BBBP classification: ROC-AUC 0.67 (random init, 12.9M params, 200 samples)
- ESOL regression: R² 0.37, MAE 1.27
- SMILES generation: 0% validity (proves pretraining is core challenge)
- Tokenization analysis: OLMo 0.9x tokens vs ChemBERTa, but fragments stereocenters

### Proposal

- ~/gsoc_proposal_final.md: human-written version
- ~/gsoc_proposal_content.md: raw technical reference

### Key Context

- PR #4907 by Aditya-ad48 also adds causal LM generation. Complementary, not competing
- DeepChem wants small PRs (<50 lines) for new contributors
- rbharath is the main reviewer/maintainer
- Office hours MWF 9am PST
- Discord: https://discord.gg/RYTrUY8Ssn

---

## Project 4: Genesis - Artificial Life Simulation

**What:** Virtual world where blank GRU neural net agents evolve survival behaviors from scratch (foraging, water-seeking, communication) on H100 GPU using JAX.

**Location:** /home/azureuser/genesis/ (venv at ~/genesis_env/)

### World Setup

- 512x512 grid with Perlin noise terrain
- Food regrowth, water sources, day/night cycles, seasons
- 1000 agents with GRU brains (~82K params each)
- Tournament selection + Gaussian mutation (self-adaptive sigma)
- Agents start with zero knowledge and must learn to survive
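
Minimal sketch of tournament selection plus self-adaptive Gaussian mutation, in plain Python instead of the JAX used in the real sim. Tournament size k=3, the log-normal sigma update, and tau = 1/sqrt(n) are textbook evolution-strategy defaults assumed here, not values read from selection.py or mutation.py.

```python
import math
import random

def tournament_select(population, fitness, k=3, rng=random):
    """Pick k random individuals and return the fittest one."""
    contenders = rng.sample(range(len(population)), k)
    best = max(contenders, key=lambda i: fitness[i])
    return population[best]

def mutate(genome, sigma, rng=random):
    """Log-normal self-adaptation: perturb sigma first, then use it on the weights."""
    tau = 1.0 / math.sqrt(len(genome))  # assumed learning rate for sigma
    new_sigma = sigma * math.exp(tau * rng.gauss(0.0, 1.0))
    child = [w + new_sigma * rng.gauss(0.0, 1.0) for w in genome]
    return child, new_sigma

rng = random.Random(0)
# Each individual carries its own sigma alongside its weights
population = [([rng.gauss(0, 1) for _ in range(8)], 0.1) for _ in range(20)]
fitness = [-sum(w * w for w in genome) for genome, _ in population]  # toy fitness

parent_genome, parent_sigma = tournament_select(population, fitness, k=3, rng=rng)
child_genome, child_sigma = mutate(parent_genome, parent_sigma, rng=rng)
print(len(child_genome), child_sigma > 0)
```

Because sigma is inherited and mutated with the genome, lineages with a useful step size tend to survive without any manual schedule.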

### Status (2026-04-01)

Phases 1-3 complete. 500K step run finished successfully:
- 86 generations evolved
- Agents sustain avg age 3,742 steps, energy 0.98, hydration 0.79
- Signal entropy dropping (4.28→3.58), indicating early communication structure
- Simulation runs at ~1000 steps/s on H100 (JAX jit-compiled)
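
For reference, a sketch of the entropy metric on discretized signals. How emergence.py bins the continuous channels, and whether it reports bits or nats, isn't recorded here, so both are assumptions. The point is only that entropy falls when agents stop emitting symbols uniformly.

```python
import math
from collections import Counter

def signal_entropy_bits(symbols):
    """Shannon entropy (bits) of a sequence of discrete signal symbols."""
    counts = Counter(symbols)
    total = len(symbols)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

uniform = list(range(16)) * 4                      # every symbol equally likely
skewed = [0] * 40 + list(range(1, 16)) + [1] * 9   # a few symbols dominate
print(signal_entropy_bits(uniform))                # 4.0
print(signal_entropy_bits(skewed) < signal_entropy_bits(uniform))  # True
```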

### Key Fix

food_growth_rate bumped from 0.005→0.02, food_eat_amount 0.05→0.03 to prevent ecological collapse at high generations.

### Architecture

- World: grid.py, resources.py, environment.py, physics.py, observations.py, spatial.py
- Agent: body.py (metabolism), brain.py (GRU + vmap batched), actions.py
- Evolution: fitness.py, selection.py, mutation.py (self-adaptive sigma), population.py
- Communication: signals.py (8-channel, spatial attenuation, top-4 reception)
- Analysis: emergence.py (signal entropy, magnitude, R², diversity, clustering)
- Visualization: renderer.py (dashboard, world map, zoom views)
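
One way the "spatial attenuation, top-4 reception" step in signals.py could work, sketched in plain Python. The exponential falloff, the falloff constant, and the strength measure are all guesses for illustration. Only the 8 channels and the top-4 cutoff come from the notes above.

```python
import math

def receive(receiver_pos, senders, top_k=4, falloff=10.0):
    """Blend the top_k strongest incoming signals after distance attenuation.

    senders: list of ((x, y), signal) where signal is a list of channel values.
    """
    weighted = []
    for pos, signal in senders:
        dist = math.dist(receiver_pos, pos)
        gain = math.exp(-dist / falloff)               # assumed attenuation curve
        strength = gain * max(abs(v) for v in signal)  # assumed ranking measure
        weighted.append((strength, gain, signal))
    weighted.sort(key=lambda t: t[0], reverse=True)
    heard = weighted[:top_k]
    channels = len(senders[0][1]) if senders else 0
    return [sum(gain * sig[c] for _, gain, sig in heard) for c in range(channels)]

# Six agents in a row, each broadcasting 1.0 on all 8 channels
senders = [((float(i), 0.0), [1.0] * 8) for i in range(1, 7)]
out = receive((0.0, 0.0), senders)
print(len(out))  # 8
```

Only the four nearest senders contribute here, so the two farthest agents are ignored entirely.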

### Run Data

~/genesis/runs/run_20260401_111309/: metrics.csv (500 rows), emergence.csv (100 rows), 50 viz frames, 10 checkpoints + FINAL, config.json

### Next Steps

- Phase 4: TRIBE v2 integration, compare evolved GRU representations to human brain activity via RSA
- Phase 5: Scale to 5K+ agents, longer runs for 500+ generations
- Checkpoints at 50K intervals allow comparing brain representations across evolutionary time

---

## Project 5: TRIBE v2 - AI-Brain Loop

**What:** Closing the AI-brain comparison loop using Meta's TRIBE v2: comparing AI encoder representations to predicted brain activity to find architectural gaps.

**Location:** /home/azureuser/tribev2 (venv at /home/azureuser/tribev2_env)

### What's Built

- Full analysis script: /home/azureuser/tribev2/close_the_loop_v2.py
- 8 phases: load model → extract per-layer features → brain parcellation → layer-wise encoding → modality ablation → RSA → divergence mapping → visualization
- Multimodal stimulus: /home/azureuser/multimodal_stimulus.mp4
- Results: /home/azureuser/loop_results_v2/
- Runs with video (V-JEPA2) + audio (Wav2Vec-BERT) + text (LLaMA 3.2-3B)
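
Sketch of the RSA phase in plain Python: build a dissimilarity matrix per system, then rank-correlate the two. Using 1 - Pearson as the distance and Spearman for the comparison is the common default and an assumption here. close_the_loop_v2.py may do it differently.

```python
import math

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)

def ranks(values):
    """Average ranks, so ties are handled the way Spearman expects."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            out[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return out

def rdm_upper(features):
    """Upper triangle of the dissimilarity matrix (1 - Pearson r) over stimuli."""
    n = len(features)
    return [1 - pearson(features[i], features[j])
            for i in range(n) for j in range(i + 1, n)]

def rsa_score(model_feats, brain_feats):
    """Spearman correlation between the two RDMs."""
    return pearson(ranks(rdm_upper(model_feats)), ranks(rdm_upper(brain_feats)))

# Toy data: 4 stimuli x 4 features; "brain" has the same geometry, rescaled
model = [[1.0, 2.0, 3.0, 4.0], [2.0, 1.0, 4.0, 3.0],
         [4.0, 3.0, 2.0, 1.0], [1.0, 3.0, 2.0, 4.0]]
brain = [[2 * v + 1 for v in row] for row in model]
print(round(rsa_score(model, brain), 6))  # 1.0
```

RSA only compares the pattern of pairwise distances, so the two feature spaces can have different dimensionality and scale.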

### Status

LLaMA 3.2 access granted. Full 3-modality analysis pipeline complete. Brain-guided ViT training attempted 5 times, all failed.

### Why Attempts Failed

- Never had real brain targets: routed ViT-Small features through TRIBE v2's projector (trained for V-JEPA2), producing random outputs
- Evaluated on wrong metric (classification accuracy instead of robustness)
- Literature shows brain-guided training helps ROBUSTNESS (+3-8%), not classification accuracy

### What Would Actually Work (from RESEARCH_BRIEF.md)

1. Pre-compute real brain targets using TRIBE v2's full pipeline
2. Train student with classification + per-vertex Pearson correlation brain loss
3. Evaluate on corruption/adversarial robustness, shape bias, brain-score. NOT accuracy
4. Or: use real fMRI data (Natural Scenes Dataset) instead of TRIBE v2 predictions
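
What the combined loss in step 2 could look like, written in plain Python to show the math. A real training loop would need this in a differentiable framework. `brain_weight=0.1` and the function names are placeholders, not values from RESEARCH_BRIEF.md.

```python
import math

def pearson(x, y, eps=1e-8):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (math.sqrt(vx * vy) + eps)  # eps guards against zero variance

def brain_loss(pred, target):
    """1 - mean Pearson r, with r computed per vertex across the batch.

    pred, target: nested lists of shape [batch][vertex].
    """
    n_vertices = len(pred[0])
    rs = []
    for v in range(n_vertices):
        rs.append(pearson([row[v] for row in pred], [row[v] for row in target]))
    return 1.0 - sum(rs) / n_vertices

def total_loss(cls_loss, pred, target, brain_weight=0.1):
    """Classification loss plus a weighted brain-alignment term."""
    return cls_loss + brain_weight * brain_loss(pred, target)

# Toy batch: 3 samples x 2 vertices; predictions are a rescaled copy of the targets
target = [[0.1, 0.9], [0.4, 0.5], [0.8, 0.2]]
perfect = [[2 * v for v in row] for row in target]
print(round(brain_loss(perfect, target), 4))  # 0.0
```

Pearson ignores scale and offset, so the student only has to match the pattern of responses per vertex, not their magnitude.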

### Key Infrastructure

- Training scripts: /home/azureuser/brain_guided/train_*.py
- UCF-101 dataset: /home/azureuser/brain_guided/data/UCF-101 (13K videos)
- Results: /home/azureuser/brain_guided/results_final/

---

## Project 6: Instagram Cinema

**What:** AI-generated cinematic videos using LTX-2.3 22B on ComfyUI for Instagram growth.

**Setup:** LTX-2.3 22B dev model running on H100 via ComfyUI, exposed via cloudflared tunnel.

**Format:** Instagram Reels, 9:16 portrait, 544x960

**Goal:** Create viral-quality cinematic content for Instagram Reels.

---

## Money-Making Strategy (April 2026)

### Sellable Assets

1. **TurboQuant** - working implementation nobody else has publicly. Lead magnet for consulting.
2. **Parameter Golf** - competition result (if top placement) = massive credibility signal
3. **Fine-tuning expertise** - proven on H100, multiple model families
4. **Inference optimization consulting** - directly from TurboQuant benchmarks

### Immediate Plan

- Path to 10L: Freelancing/consulting (fine-tuning + inference optimization)
- Path to 1Cr: Productized consulting at scale or AI startup
- Channel: X (Twitter) for distribution, direct DMs to founders for sales

### X (Twitter) Growth Strategy

- Account: 10 followers currently, Premium purchased (213.50/month with 50% off)
- Strategy: 70% replies (to bigger accounts), 30% original posts
- Target: 15 strategic replies/day to accounts with 100-5000 followers
- Post timing: 6:30 PM IST (9:00 AM EDT) on Tue/Wed/Thu
- Pinned thread: TurboQuant benchmarks
- Goal: 500 followers in 4 weeks, first paid client in 2-4 weeks

### Cold Outreach Template

"I noticed you're using [X model]. I can cut your inference cost by 40%. Free 1-week proof. Interested?"

### Target Clients

- Indian startups using LLMs in production (inc42 AI list)
- US startups from YC directory (AI/ML category, S24/W25 batches)
- Anyone on Twitter complaining about GPU costs / inference scaling
- Companies with >$10K/month GPU spend