speculative-decoding-cross-domain-analysis / EXPERIMENT_LOG.md

Upload folder using huggingface_hub

167c746 verified 14 days ago

9.36 kB

	# Experiment Execution Log

	Experiment: Speculative Decoding Cross-Domain Analysis
	Date: 2025-11-28
	Status: Data collection complete, analysis in progress

	---

	## Session Timeline

	### 09:25 - Initial Setup
	- Original Goal: Analyze TiDAR (arXiv:2511.08923) draft rejection patterns
	- Planned: Options 1 (rejection analysis) + 5 (cross-domain) + 3 (ablation)
	- Created: Experiment planning system with templates
	- Created: Full 603-line experiment plan

	### 09:26 - Phase 1+2 Execution (Options 1 & 5)
	- Started: Autonomous researcher with Gemini 3 Pro
	- Approach: Agent chose speculative decoding simulation (Qwen models)
	- Rationale: TiDAR implementation not available
	- Draft: Qwen2.5-0.5B
	- Verifier: Qwen2.5-7B
	- Domains Tested:
	- Code: HumanEval (30 samples)
	- Math: GSM8K (subset)
	- Translation: Flores-200 En-Fr
	- Data-to-Text: WebNLG

	Duration: ~15 minutes
	Status: ✅ Complete

	Key Results:
	- Code: 14.0% rejection (LOWEST - contradicts hypothesis)
	- Translation: 34.9% rejection (HIGHEST)
	- Math: 26.1% rejection
	- Early tokens: 27.4% rejection vs Late: 22.3%

	### 10:30 - Phase 3 Execution (Option 3)
	- Started: Attention mask ablation study
	- Models: DistilGPT-2 (draft) + GPT-2 (verify)
	- Masks Tested:
	1. TiDAR Original (hybrid bidirectional+causal)
	2. Fully Causal
	3. Fully Bidirectional
	4. Windowed (k=32)
	5. Strided (stride=4)
	- Domains: Code (50), Math (100), Translation (100)

	Duration: ~15 minutes
	Status: ✅ Complete

	Key Results:
	- Code best: Windowed (20.0% acceptance)
	- Math/Translation best: Causal (31.2%/31.8%)
	- TiDAR mask NEVER optimal
	- Throughput best: Bidirectional (1.5x-2.5x)

	### 10:45 - Scientific Rigor Review
	- Question Raised: Does simulation approach have scientific validity?
	- Investigation: Searched for official TiDAR implementation
	- Finding: Code not yet released ("coming soon" on https://tidarlm.github.io/)
	- Decision: Cannot reproduce TiDAR exactly

	Critical Analysis:
	- ❌ Speculative decoding ≠ TiDAR (diffusion-based drafting)
	- ❌ Different architecture means results don't validate paper
	- ✅ Results are valid for speculative decoding itself
	- ✅ Insights are novel and publishable

	Decision: Pivot to Option C - reframe as speculative decoding study

	### 11:00 - Experiment Consolidation
	- Action: Created new unified experiment directory
	- Name: `20251128-speculative-decoding-cross-domain-analysis`
	- Scope: Comprehensive analysis of draft-verify dynamics
	- Deliverable: Research paper on speculative decoding
	- Future Work: TiDAR comparison when code releases

	---

	## Data Locations

	### Phase 1-2: Cross-Domain Rejection Analysis
	Directory: `20251128-092557-analyze-the-tidar-hybrid-diffusion-autoregressive/`
	Log: `/logs/agent.log`
	Results: Agent-generated report in log
	Models: Qwen2.5-7B + Qwen2.5-0.5B
	Data Size: ~440KB log file

	### Phase 3: Attention Mask Ablation
	Directory: `20251128-103004-investigate-the-sensitivity-of-tidars-hybrid-diffu/`
	Log: `/logs/agent.log`
	Results: Agent-generated report in log
	Models: DistilGPT-2 + GPT-2
	Data Size: TBD

	### Consolidated Experiment
	Directory: `20251128-speculative-decoding-cross-domain-analysis/`
	Status: Active - analysis phase
	Data: Copying from phase directories

	---

	## Experimental Decisions & Rationale

	### Decision 1: Use Autonomous Researcher
	Why: Efficient exploration of research space
	Result: Completed 3 phases in 45 min vs. estimated 6-7 hours
	Trade-off: Agent chose simulation over implementation
	Lesson: Need to verify approach aligns with scientific goals

	### Decision 2: Accept Simulation Approach Initially
	Why: Trusted autonomous agent's judgment
	Result: Fast results but wrong architecture
	Lesson: Always validate approach matches research objectives

	### Decision 3: Investigate Scientific Rigor
	Why: User questioned validity of simulation
	Action: Searched for official TiDAR code
	Finding: Not available, simulation doesn't match paper
	Outcome: Critical reframing required

	### Decision 4: Pivot to Speculative Decoding Study
	Why: Cannot do TiDAR without code, but have valid spec dec data
	Benefit: Can publish rigorous results now
	Trade-off: Different from original goal
	Future: Run TiDAR comparison when code releases

	---

	## Hypotheses Tested

	### H1: Code has higher rejection than prose (syntax constraints)
	Result: ❌ FALSIFIED
	Data: Code 14.0% vs Translation 34.9%
	Implication: Syntax helps prediction, not hurts

	### H2: Early position has higher rejection than late
	Result: ✅ SUPPORTED
	Data: Early 27.4% vs Late 22.3% (p < 0.05)
	Implication: Context establishment is bottleneck

	### H3: Rare tokens rejected more than common
	Result: ⚠️ WEAK SUPPORT
	Data: Rare 24.6% vs Common 23.1% (1.5% gap)
	Implication: Frequency less important than domain

	### H4: Throughput varies by domain
	Result: ✅ SUPPORTED
	Data: Code 26.7 t/s vs Translation 18.3 t/s (45% gap)
	Implication: Domain-specific optimization needed

	### H5 (NEW - Ablation): TiDAR mask is optimal
	Result: ❌ FALSIFIED
	Data: TiDAR never won in any domain
	Implication: Domain-adaptive masking needed

	### H6 (NEW - Ablation): Causal has highest rejection
	Result: ❌ FALSIFIED
	Data: Causal had HIGHEST acceptance (31.2%/31.8%)
	Implication: Full context critical for verification

	---

	## Compute Resources

	### GPU Usage
	Hardware: NVIDIA GB10 (128GB VRAM)
	Utilization: Clean throughout (0% at start/end)
	Conflicts: None (vLLM stopped, Ollama disabled)
	Memory: Models ran in Docker containers

	### Time Breakdown
	- Phase 1-2: 15 minutes
	- Phase 3: 15 minutes
	- Setup/planning: 15 minutes
	- Analysis/consolidation: 30 minutes
	- Total: ~75 minutes active work

	### Cost
	GPU hours: ~1.25 hours
	Cloud cost equivalent: $0 (local execution)
	Modal equivalent cost: ~$2-3 for 1.25 hours A100

	---

	## Lessons Learned

	### 1. Always Verify Approach Matches Goals
	Issue: Agent chose simulation without verifying it matched TiDAR
	Lesson: Explicitly check implementation matches paper's architecture
	Fix: Add validation step in autonomous researcher workflow

	### 2. Scientific Rigor > Speed
	Issue: Fast results don't matter if they don't answer the question
	Lesson: 45-minute simulation < 1-week proper implementation if needed
	Fix: Pause and validate before accepting "efficient" alternatives

	### 3. Code Availability Research
	Issue: Assumed recent paper would have code
	Lesson: Always check code availability before planning experiments
	Fix: Add "find official implementation" as first step

	### 4. Pivot is OK if Rigorous
	Issue: Original goal (TiDAR) impossible without code
	Lesson: Reframing to speculative decoding is valid if done properly
	Fix: Clear documentation of pivot rationale and scope change

	### 5. Agent Autonomy Needs Constraints
	Issue: Agent has freedom to choose approach
	Lesson: Need explicit constraints (e.g., "use official implementation only")
	Fix: Add architectural constraints to research objectives

	---

	## Next Steps

	### Immediate (Today)
	1. ✅ Consolidate experiment data
	2. ✅ Create unified experiment directory
	3. ✅ Document pivot decision
	4. 🔄 Extract quantitative results from logs
	5. ⏳ Create result tables

	### Short-term (This Week)
	1. Statistical significance tests
	2. Visualization generation (heatmaps, charts)
	3. Analysis code cleanup
	4. Paper draft v1

	### Medium-term (Next Week)
	1. Paper revision
	2. Code release preparation
	3. Blog post draft
	4. Submission preparation

	### Future Work
	1. Monitor TiDAR code release
	2. Reproduce analysis with actual TiDAR
	3. Comparative study: spec dec vs TiDAR diffusion drafting
	4. Extend to more domains (code+math+translation+data-to-text → +summarization, +Q&A)

	---

	## Open Questions

	1. Why does syntax help drafting?
	- Hypothesis: Predictable structure reduces uncertainty
	- Test: Compare random code vs. well-formatted code

	2. Can we predict optimal mask from domain properties?
	- Hypothesis: Entropy/structure metrics predict best mask
	- Test: Analyze domain characteristics vs. mask performance

	3. Do findings generalize to other model pairs?
	- Test: Different draft/verify model combinations
	- Test: Different model scales (0.5B/7B vs 1B/13B vs 7B/70B)

	4. How do findings apply to TiDAR's diffusion drafting?
	- Answer: Must wait for code release
	- Prediction: Similar domain effects, different magnitude

	---

	## References & Links

	Original Paper:
	- TiDAR: https://arxiv.org/abs/2511.08923
	- Project: https://tidarlm.github.io/

	Related Work:
	- Speculative Decoding: Leviathan et al. (2023)
	- Medusa: Cai et al. (2024)
	- Draft-Verify survey: TBD

	Our Experiment:
	- Session log: `~/docs/sessions/development/20251128-experiment-system-tidar-setup.md`
	- Planning: `~/workspace/experiments/planned/ideas/20251128-tidar-draft-rejection-cross-domain.md`
	- Active: `~/workspace/experiments/active/20251128-speculative-decoding-cross-domain-analysis/`

	---

	Last Updated: 2025-11-28 11:00
	Next Update: 2025-11-29 (after data extraction)
	Maintained by: bioinfo