speculative-decoding-cross-domain-analysis / EXPERIMENT_LOG.md

RyeCatcher

Experiment Execution Log

Experiment: Speculative Decoding Cross-Domain Analysis
Date: 2025-11-28
Status: Data collection complete, analysis in progress

Session Timeline

09:25 - Initial Setup

Original Goal: Analyze TiDAR (arXiv:2511.08923) draft rejection patterns
Planned: Options 1 (rejection analysis) + 5 (cross-domain) + 3 (ablation)
Created: Experiment planning system with templates
Created: Full 603-line experiment plan

09:26 - Phase 1+2 Execution (Options 1 & 5)

Started: Autonomous researcher with Gemini 3 Pro
Approach: Agent chose speculative decoding simulation (Qwen models)
- Rationale: TiDAR implementation not available
- Draft: Qwen2.5-0.5B
- Verifier: Qwen2.5-7B
Domains Tested:
- Code: HumanEval (30 samples)
- Math: GSM8K (subset)
- Translation: Flores-200 En-Fr
- Data-to-Text: WebNLG

Duration: ~15 minutes
Status: ✅ Complete
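As a reference for what "rejection" means in this phase, here is a minimal sketch of a greedy draft-verify loop. It is not the agent's actual code: the toy next-token functions stand in for Qwen2.5-0.5B and Qwen2.5-7B, the names `speculative_step` and `rejection_rate` are hypothetical, and the counting rule (tokens discarded after the first mismatch count as rejected) is an assumption. A real verifier would score all k draft tokens in one forward pass; the sequential calls here are only for readability.

```python
from typing import Callable, List, Tuple

# A "model" is any function mapping a token prefix to its greedy next token.
NextToken = Callable[[List[int]], int]


def speculative_step(draft: NextToken, verifier: NextToken,
                     prefix: List[int], k: int) -> Tuple[List[int], int, int]:
    """Run one draft-verify round. Returns (new_tokens, accepted, proposed)."""
    # 1. The draft model proposes k tokens autoregressively.
    proposal: List[int] = []
    for _ in range(k):
        proposal.append(draft(prefix + proposal))

    # 2. The verifier checks the proposed tokens in order.
    new_tokens: List[int] = []
    accepted = 0
    for tok in proposal:
        target = verifier(prefix + new_tokens)
        if tok == target:
            new_tokens.append(tok)
            accepted += 1
        else:
            # First mismatch: keep the verifier's token, discard the rest.
            new_tokens.append(target)
            break
    return new_tokens, accepted, len(proposal)


def rejection_rate(draft: NextToken, verifier: NextToken,
                   prompt: List[int], max_new_tokens: int, k: int = 4) -> float:
    """Fraction of proposed draft tokens that were not accepted (assumed definition)."""
    tokens = list(prompt)
    accepted_total = proposed_total = 0
    while len(tokens) - len(prompt) < max_new_tokens:
        new, acc, prop = speculative_step(draft, verifier, tokens, k)
        tokens.extend(new)
        accepted_total += acc
        proposed_total += prop
    return 1.0 - accepted_total / proposed_total


# Toy stand-ins (hypothetical): the verifier counts upward modulo 10,
# the draft agrees except at every third prefix length.
def toy_verifier(prefix: List[int]) -> int:
    return (prefix[-1] + 1) % 10


def toy_draft(prefix: List[int]) -> int:
    step = 2 if len(prefix) % 3 == 0 else 1
    return (prefix[-1] + step) % 10


toy_rate = rejection_rate(toy_draft, toy_verifier, prompt=[0], max_new_tokens=8)
print(f"toy rejection rate: {toy_rate:.1%}")
```

Under a different counting rule (for example, counting only the first mismatched token per round), the reported percentages would differ, so the rule should be fixed before comparing domains.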

Key Results:

Code: 14.0% rejection (LOWEST - contradicts hypothesis)
Translation: 34.9% rejection (HIGHEST)
Math: 26.1% rejection
Early tokens: 27.4% rejection vs Late: 22.3%
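The early-versus-late comparison can be computed by bucketing per-token accept/reject outcomes by position. The sketch below assumes a simple cutoff index separating "early" from "late"; the log does not state how the two buckets were defined, so `cutoff` and the function name `rejection_by_position` are hypothetical.

```python
from typing import Dict, Iterable, Tuple


def rejection_by_position(records: Iterable[Tuple[int, bool]],
                          cutoff: int) -> Dict[str, float]:
    """Rejection rate for positions below `cutoff` ("early") and the rest ("late").

    Each record is (token_position, accepted).
    """
    counts = {"early": [0, 0], "late": [0, 0]}  # [rejected, total] per bucket
    for position, accepted in records:
        bucket = "early" if position < cutoff else "late"
        counts[bucket][1] += 1
        if not accepted:
            counts[bucket][0] += 1
    return {
        name: (rejected / total if total else float("nan"))
        for name, (rejected, total) in counts.items()
    }


# Made-up records for illustration only; these are not experiment data.
example = [(0, False), (1, True), (5, True), (6, True), (7, False), (8, True)]
print(rejection_by_position(example, cutoff=4))
```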

10:30 - Phase 3 Execution (Option 3)

Started: Attention mask ablation study
Models: DistilGPT-2 (draft) + GPT-2 (verify)
Masks Tested:
1. TiDAR Original (hybrid bidirectional+causal)
2. Fully Causal
3. Fully Bidirectional
4. Windowed (k=32)
5. Strided (stride=4)
Domains: Code (50), Math (100), Translation (100)

Duration: ~15 minutes
Status: ✅ Complete
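To make the five mask variants concrete, here is a rough sketch of how each could be built as a boolean matrix. The log does not record the exact definitions the agent used, so every rule below is an assumption: the windowed and strided masks are taken to be causal, and the hybrid mask is taken to be causal over a prefix with a bidirectional block over the trailing draft positions. All function names are hypothetical.

```python
from typing import List

# mask[i][j] is True when query position i may attend to key position j.
Mask = List[List[bool]]


def causal_mask(n: int) -> Mask:
    return [[j <= i for j in range(n)] for i in range(n)]


def bidirectional_mask(n: int) -> Mask:
    return [[True] * n for _ in range(n)]


def windowed_mask(n: int, k: int = 32) -> Mask:
    # Assumption: causal sliding window over the k most recent positions,
    # counting position i itself.
    return [[0 <= i - j < k for j in range(n)] for i in range(n)]


def strided_mask(n: int, stride: int = 4) -> Mask:
    # Assumption: causal, attending to self and every stride-th earlier position.
    return [[j <= i and (i - j) % stride == 0 for j in range(n)] for i in range(n)]


def hybrid_mask(n: int, prefix_len: int) -> Mask:
    # Assumption: the first prefix_len positions are causal; the remaining
    # (draft) positions see the whole prefix and each other bidirectionally.
    mask = causal_mask(n)
    for i in range(prefix_len, n):
        for j in range(prefix_len, n):
            mask[i][j] = True
    return mask


# Print a small hybrid mask to inspect its shape (1 = may attend).
for row in hybrid_mask(6, prefix_len=3):
    print("".join("1" if allowed else "0" for allowed in row))
```

If the ablation actually used non-causal windows or a different stride pattern, the acceptance numbers would not be comparable to masks built this way.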

Key Results:

Code best: Windowed (20.0% acceptance)
Math/Translation best: Causal (31.2%/31.8%)
TiDAR mask NEVER optimal
Throughput best: Bidirectional (1.5x-2.5x)

10:45 - Scientific Rigor Review

Question Raised: Does simulation approach have scientific validity?
Investigation: Searched for official TiDAR implementation
Finding: Code not yet released ("coming soon" on https://tidarlm.github.io/)
Decision: Cannot reproduce TiDAR exactly

Critical Analysis:

❌ Speculative decoding ≠ TiDAR (diffusion-based drafting)
❌ Different architecture means results don't validate paper
✅ Results are valid for speculative decoding itself
✅ Insights are novel and publishable

Decision: Pivot to Option C - reframe as speculative decoding study

11:00 - Experiment Consolidation

Action: Created new unified experiment directory
Name: 20251128-speculative-decoding-cross-domain-analysis
Scope: Comprehensive analysis of draft-verify dynamics
Deliverable: Research paper on speculative decoding
Future Work: TiDAR comparison when code releases

Data Locations

Phase 1-2: Cross-Domain Rejection Analysis

Directory: 20251128-092557-analyze-the-tidar-hybrid-diffusion-autoregressive/
Log: /logs/agent.log
Results: Agent-generated report in log
Models: Qwen2.5-7B + Qwen2.5-0.5B
Data Size: ~440KB log file

Phase 3: Attention Mask Ablation

Directory: 20251128-103004-investigate-the-sensitivity-of-tidars-hybrid-diffu/
Log: /logs/agent.log
Results: Agent-generated report in log
Models: DistilGPT-2 + GPT-2
Data Size: TBD

Consolidated Experiment

Directory: 20251128-speculative-decoding-cross-domain-analysis/
Status: Active - analysis phase
Data: Copying from phase directories

Experimental Decisions & Rationale

Decision 1: Use Autonomous Researcher

Why: Efficient exploration of research space
Result: Completed 3 phases in 45 min vs. estimated 6-7 hours
Trade-off: Agent chose simulation over implementation
Lesson: Need to verify approach aligns with scientific goals

Decision 2: Accept Simulation Approach Initially

Why: Trusted autonomous agent's judgment
Result: Fast results but wrong architecture
Lesson: Always validate approach matches research objectives

Decision 3: Investigate Scientific Rigor

Why: User questioned validity of simulation
Action: Searched for official TiDAR code
Finding: Not available, simulation doesn't match paper
Outcome: Critical reframing required

Decision 4: Pivot to Speculative Decoding Study

Why: Cannot study TiDAR without its code, but the speculative decoding data is valid
Benefit: Can publish rigorous results now
Trade-off: Different from original goal
Future: Run TiDAR comparison when code releases

Hypotheses Tested

H1: Code has higher rejection than prose (syntax constraints)

Result: ❌ FALSIFIED
Data: Code 14.0% vs Translation 34.9%
Implication: Syntax appears to help draft prediction rather than hurt it

H2: Early position has higher rejection than late

Result: ✅ SUPPORTED
Data: Early 27.4% vs Late 22.3% (p < 0.05)
Implication: Context establishment is the bottleneck

H3: Rare tokens rejected more than common

Result: ⚠️ WEAK SUPPORT
Data: Rare 24.6% vs Common 23.1% (1.5 percentage-point gap)
Implication: Token frequency matters less than domain

H4: Throughput varies by domain

Result: ✅ SUPPORTED
Data: Code 26.7 t/s vs Translation 18.3 t/s (Code is ~46% faster)
Implication: Domain-specific optimization needed
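The size of the H4 gap depends on which domain is used as the baseline. A short check of the arithmetic, using only the two throughput figures reported above:

```python
code_tps, translation_tps = 26.7, 18.3  # tokens/s, from the H4 result

abs_gap = code_tps - translation_tps                # absolute gap in tokens/s
faster_vs_translation = abs_gap / translation_tps   # relative to the slower domain
slower_vs_code = abs_gap / code_tps                 # relative to the faster domain

print(f"absolute gap: {abs_gap:.1f} t/s")                   # absolute gap: 8.4 t/s
print(f"code vs translation: +{faster_vs_translation:.1%}")  # +45.9%
print(f"translation vs code: -{slower_vs_code:.1%}")         # -31.5%
```

Stating the baseline alongside the percentage avoids ambiguity when the figure is carried into result tables.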

H5 (NEW - Ablation): TiDAR mask is optimal

Result: ❌ FALSIFIED
Data: The TiDAR mask was not the best in any domain
Implication: Domain-adaptive masking needed

H6 (NEW - Ablation): Causal has highest rejection

Result: ❌ FALSIFIED
Data: Causal had the HIGHEST acceptance in Math and Translation (31.2%/31.8%); Windowed was best for Code (20.0%)
Implication: Full context critical for verification

Compute Resources

GPU Usage

Hardware: NVIDIA GB10 (128GB unified memory)
Utilization: Clean throughout (0% at start/end)
Conflicts: None (vLLM stopped, Ollama disabled)
Memory: Models ran in Docker containers

Time Breakdown

Phase 1-2: 15 minutes
Phase 3: 15 minutes
Setup/planning: 15 minutes
Analysis/consolidation: 30 minutes
Total: ~75 minutes active work

Cost

GPU hours: ~1.25 hours (the full 75 minutes of active work, not only model execution)
Cloud cost equivalent: $0 (local execution)
Modal equivalent cost: ~$2-3 for 1.25 A100-hours

Lessons Learned

1. Always Verify Approach Matches Goals

Issue: Agent chose simulation without verifying it matched TiDAR
Lesson: Explicitly check implementation matches paper's architecture
Fix: Add validation step in autonomous researcher workflow

2. Scientific Rigor > Speed

Issue: Fast results don't matter if they don't answer the question
Lesson: A 45-minute simulation is worth less than a 1-week proper implementation when the proper implementation is what the question requires
Fix: Pause and validate before accepting "efficient" alternatives

3. Code Availability Research

Issue: Assumed recent paper would have code
Lesson: Always check code availability before planning experiments
Fix: Add "find official implementation" as first step

4. Pivot is OK if Rigorous

Issue: Original goal (TiDAR) impossible without code
Lesson: Reframing to speculative decoding is valid if done properly
Fix: Clear documentation of pivot rationale and scope change

5. Agent Autonomy Needs Constraints

Issue: Agent is free to choose its own approach
Lesson: Need explicit constraints (e.g., "use official implementation only")
Fix: Add architectural constraints to research objectives

Next Steps

Immediate (Today)

✅ Consolidate experiment data
✅ Create unified experiment directory
✅ Document pivot decision
🔄 Extract quantitative results from logs
⏳ Create result tables

Short-term (This Week)

Statistical significance tests
Visualization generation (heatmaps, charts)
Analysis code cleanup
Paper draft v1
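For the statistical significance tests listed above, one reasonable starting point is a two-proportion z-test on rejection counts. The sketch below uses only the standard library. The token counts are hypothetical placeholders chosen to match the reported 27.4% and 22.3% rates; the real totals have to come from the logs. The test also treats tokens as independent, which is an approximation for tokens drawn from the same generation.

```python
import math
from typing import Tuple


def two_proportion_z_test(x1: int, n1: int, x2: int, n2: int) -> Tuple[float, float]:
    """Two-sided z-test for H0: p1 == p2, using the pooled standard error.

    x1, x2 are rejection counts; n1, n2 are the numbers of proposed tokens.
    Returns (z, p_value).
    """
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # Standard normal CDF via the error function, then the two-sided tail.
    cdf = 0.5 * (1 + math.erf(abs(z) / math.sqrt(2)))
    return z, 2 * (1 - cdf)


# Hypothetical counts: 274/1000 early vs 223/1000 late rejections.
z, p = two_proportion_z_test(x1=274, n1=1000, x2=223, n2=1000)
print(f"z = {z:.2f}, p = {p:.4f}")
```

With the actual token totals the p-value may differ substantially, since the standard error shrinks as the counts grow.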

Medium-term (Next Week)

Paper revision
Code release preparation
Blog post draft
Submission preparation

Future Work

Monitor TiDAR code release
Reproduce analysis with actual TiDAR
Comparative study: spec dec vs TiDAR diffusion drafting
Extend to more domains (code+math+translation+data-to-text → +summarization, +Q&A)

Open Questions

Why does syntax help drafting?
- Hypothesis: Predictable structure reduces uncertainty
- Test: Compare random code vs. well-formatted code
Can we predict optimal mask from domain properties?
- Hypothesis: Entropy/structure metrics predict best mask
- Test: Analyze domain characteristics vs. mask performance
Do findings generalize to other model pairs?
- Test: Different draft/verify model combinations
- Test: Different model scales (0.5B/7B vs 1B/13B vs 7B/70B)
How do findings apply to TiDAR's diffusion drafting?
- Answer: Must wait for code release
- Prediction: Similar domain effects, different magnitude
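For the open question about entropy and structure metrics, a first-pass domain metric could be the unigram Shannon entropy of each corpus. This is only one possible metric and the function name is hypothetical. Whitespace tokenization is used as a stand-in; a real analysis would presumably use the same tokenizer as the draft model.

```python
import math
from collections import Counter
from typing import Sequence


def unigram_entropy(tokens: Sequence[str]) -> float:
    """Shannon entropy (bits per token) of the empirical unigram distribution."""
    counts = Counter(tokens)
    total = len(tokens)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Made-up snippets for illustration only; these are not experiment data.
code_like = "for i in range ( n ) : total += i".split()
prose_like = "the quick brown fox jumps over the lazy dog".split()
print(f"code-like:  {unigram_entropy(code_like):.2f} bits/token")
print(f"prose-like: {unigram_entropy(prose_like):.2f} bits/token")
```

Unigram entropy ignores context, so a conditional measure such as the verifier's per-token cross-entropy may be a better predictor of draft acceptance.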

References & Links

Original Paper:

TiDAR: https://arxiv.org/abs/2511.08923
Project: https://tidarlm.github.io/

Related Work:

Speculative Decoding: Leviathan et al. (2023)
Medusa: Cai et al. (2024)
Draft-Verify survey: TBD

Our Experiment:

Session log: ~/docs/sessions/development/20251128-experiment-system-tidar-setup.md
Planning: ~/workspace/experiments/planned/ideas/20251128-tidar-draft-rejection-cross-domain.md
Active: ~/workspace/experiments/active/20251128-speculative-decoding-cross-domain-analysis/

Last Updated: 2025-11-28 11:00
Next Update: 2025-11-29 (after data extraction)
Maintained by: bioinfo