File size: 6,461 Bytes
f80360c
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
# Stack 2.9 HumanEval Evaluation Plan

> **Status**: Pending GPU availability | **Last Updated**: 2026-04-01

This document provides complete instructions for running HumanEval benchmark evaluation on Stack 2.9.

## Quick Start (When GPU Available)

```bash
# 1. Navigate to eval directory
cd /Users/walidsobhi/.openclaw/workspace/stack-2.9/stack-2.9-eval

# 2. Install dependencies
pip install -r requirements.txt

# 3. Run quick test (1 sample)
python3 -m benchmarks.human_eval --max-problems 1 --provider ollama

# 4. Run full evaluation (20 problems - current dataset)
python3 -m benchmarks.human_eval --max-problems 20 --provider ollama

# 5. For full 164-problem benchmark, download dataset first
# See "Full HumanEval Dataset" section below
```

## Hardware Requirements

### Recommended
- **GPU**: NVIDIA A100 80GB (or H100 80GB)
- **RAM**: 128GB system memory
- **Storage**: 50GB free space

### Minimum
- **GPU**: NVIDIA RTX 4090 (24GB VRAM) with 4-bit quantization
- **RAM**: 64GB system memory
- **Storage**: 50GB free space

### This Machine (Insufficient)
- **GPU**: Apple Silicon (M-series) - no CUDA support
- **RAM**: 16-24GB unified memory
- **Status**: Cannot run 32B model inference

## Software Setup

### Ubuntu/Debian
```bash
# Install CUDA (if not already installed)
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt update
sudo apt install cuda-toolkit-12-1

# Install Python dependencies
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
pip install vllm transformers human-eval openai
```

### macOS (Intel/NVIDIA only)
```bash
# Install Python 3.10+
brew install python@3.11

# Create venv
python3.11 -m venv .venv
source .venv/bin/activate

# Install dependencies (CPU-only, will be slow)
pip install torch transformers human-eval
# Note: vLLM requires CUDA - not available on macOS
```

## Running the Evaluation

### Option 1: Using Built-in Benchmark (Current)

The repo has a simplified 20-problem dataset built into `benchmarks/human_eval.py`:

```bash
cd /Users/walidsobhi/.openclaw/workspace/stack-2.9/stack-2.9-eval

# With Ollama
python3 -m benchmarks.human_eval \
  --provider ollama \
  --model qwen2.5-coder:32b \
  --max-problems 20

# With OpenAI
export OPENAI_API_KEY=your-key-here
python3 -m benchmarks.human_eval \
  --provider openai \
  --model gpt-4o \
  --max-problems 20

# With Anthropic
export ANTHROPIC_API_KEY=your-key-here
python3 -m benchmarks.human_eval \
  --provider anthropic \
  --model claude-sonnet-4-20250514 \
  --max-problems 20
```

### Option 2: Full HumanEval Dataset (164 Problems)

```bash
# Clone human-eval repository
git clone https://github.com/openai/human-eval.git
cd human-eval

# Install
pip install -e .

# Create evaluation script
cat > eval_full.py << 'EOF'
import human_eval
from human_eval.data import write_jsonl, read_jsonl
from human_eval.evaluator import evaluate

# Load problems
problems = read_jsonl("data/HumanEval.jsonl.gz")

# Generate completions (using your model)
# ... generation code ...

# Evaluate
results = evaluate("examples.jsonl")
print(f"Pass@1: {results['pass_at_1']}")
EOF
```

### Option 3: Using vLLM (Fastest)

```bash
# Start vLLM server
python -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen2.5-Coder-32B-Instruct \
  --dtype half \
  --tensor-parallel-size 2

# In another terminal, run evaluation
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "def add(x, y):\n    \"\"\"\n    Add two numbers.\n    \"\"\"\n    pass",
    "max_tokens": 256
  }'
```

## Interpreting Results

### Expected Output Format
```json
{
  "pass_at_1": 14,
  "pass_at_3": 17,
  "pass_at_5": 18,
  "total_cases": 20,
  "accuracy": 0.70,
  "benchmark": "HumanEval",
  "model": "qwen2.5-coder:32b",
  "results": [
    {"task_id": 1, "passed": true, "error": null},
    {"task_id": 2, "passed": false, "error": "AssertionError"}
  ]
}
```

### Score Interpretation

| Pass@1 | Rating | Notes |
|--------|--------|-------|
| < 50% | Poor | Model struggles with basic functions |
| 50-70% | Fair | Basic competency, some gaps |
| 70-80% | Good | Solid coding ability |
| 80-90% | Excellent | Strong code generation |
| > 90% | Outstanding | Near-human performance |

### Expected Scores for Stack 2.9

| Model | Pass@1 | Pass@10 | Pass@100 |
|-------|--------|---------|----------|
| Qwen2.5-Coder-32B (baseline) | 76.8% | ~85% | ~93% |
| **Stack 2.9 (estimated)** | **78-82%** | **86-90%** | **93-95%** |

## Troubleshooting

### Out of Memory (OOM)
```
CUDA out of memory: Tried to allocate 40GB
```
**Solutions:**
1. Use quantization: `--quantization awq` or 4-bit
2. Reduce batch size: `--batch-size 1`
3. Use smaller model: Try 7B or 14B variant
4. Enable gradient checkpointing

### vLLM Errors
```
ValueError: Invalid model architecture
```
**Solutions:**
1. Update vLLM: `pip install -U vllm`
2. Check model support: https://docs.vllm.ai/en/latest/models/supported_models.html
3. Use HuggingFace backend instead

### Dataset Download Issues
```
HTTP 404: Not Found
```
**Solutions:**
1. Check URL: https://github.com/openai/human-eval/raw/main/data/HumanEval.jsonl.gz
2. Use mirror: https://huggingface.co/datasets/openai/human-eval

### Slow Inference
```
Tokens/second: < 5
```
**Solutions:**
1. Use A100/H100 GPU (10x faster than consumer cards)
2. Enable FlashAttention: `--enforce-eager` not set
3. Increase batch size for throughput testing

## Success Checklist

Before reporting results, verify:

- [ ] At least 20 problems evaluated
- [ ] Pass@1 calculated correctly (passed/total)
- [ ] Results saved to JSON file
- [ ] Model name documented
- [ ] Temperature and settings recorded
- [ ] Baseline comparison available (Qwen2.5-Coder-32B)

## Output Files

After running, these files should be created:

```
stack-2.9-eval/results/
β”œβ”€β”€ humaneval.json          # Final results
β”œβ”€β”€ humaneval_raw.json      # Raw model outputs
β”œβ”€β”€ humaneval_errors.json   # Failed attempts with errors
└── humaneval_log.txt       # Execution log
```

## Contact

For issues or questions:
- GitHub: https://github.com/my-ai-stack/stack-2.9/issues
- Docs: See `stack-2.9-eval/README.md`

---

**Note**: This machine cannot run the evaluation due to lack of NVIDIA GPU. Estimated results are based on Qwen2.5-Coder-32B published benchmarks.