Commit 509f816
Parent(s): 97b426f
Deleted unnecessary files
- DEBUGGING_REPORT.md +0 -114
- DEBUGGING_REPORT_FINAL.md +0 -247
- test_all_tasks.py +0 -108
- test_debug.py +0 -62
- test_specific_tasks.py +0 -128
DEBUGGING_REPORT.md
DELETED
@@ -1,114 +0,0 @@
# Task Debugging Report

## Summary
All 45 tasks (15 easy + 15 medium + 15 hard) are working correctly. The failures observed in the inference run occurred because the LLM (llama-3.1-8b-instant) did not generate correct code fixes, not because of bugs in the task definitions or grading system.

## Issues Found and Fixed

### 1. **Inference Script Error Handling** ✅ FIXED
**Issue**: When the `/step` endpoint returned a 500 error, the inference script caught the exception but didn't pass the error details to the LLM for the next attempt.

**Fix**: Modified `inference.py` lines 200-208 to capture the error message and pass it as feedback to the LLM:
```python
except Exception as e:
    error_msg = str(e)[:200]
    log_step(step=attempt, action="step_failed",
             reward=0.0, done=False, error=error_msg[:60])
    rewards.append(0.0)
    # Pass error feedback to LLM for next attempt
    last_feedback = f"❌ Server Error: {error_msg}\n\nYour code likely caused a runtime error or timeout..."
    continue
```
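
The fragment above only shows the `except` branch. As a rough, self-contained sketch of how such a feedback loop might fit together, the following uses hypothetical names (`run_attempts`, `generate_fix`, `submit_step`) that are not taken from `inference.py`; the real script's structure may differ.

```python
from typing import Callable, List, Optional, Tuple


def run_attempts(
    generate_fix: Callable[[Optional[str]], str],
    submit_step: Callable[[str], Tuple[float, str]],
    max_attempts: int = 5,
) -> List[float]:
    """Retry loop that turns a failed step into feedback for the next attempt.

    generate_fix and submit_step are hypothetical stand-ins for the LLM call
    and the HTTP call to the /step endpoint.
    """
    rewards: List[float] = []
    last_feedback: Optional[str] = None
    for _attempt in range(1, max_attempts + 1):
        code = generate_fix(last_feedback)
        try:
            reward, feedback = submit_step(code)
        except Exception as e:
            # Keep the message short, record a zero reward, and hand the
            # error text to the next generation call instead of dropping it.
            error_msg = str(e)[:200]
            rewards.append(0.0)
            last_feedback = f"Server Error: {error_msg}"
            continue
        rewards.append(reward)
        last_feedback = feedback
        if reward >= 1.0:
            break
    return rewards
```

The point of the design is that the `except` branch updates `last_feedback` before `continue`, so a server failure is visible to the model on the following attempt.
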

### 2. **Server Error Logging** ✅ IMPROVED
**Issue**: When errors occurred in the `/step` endpoint, there was no server-side logging to help debug issues.

**Fix**: Added logging and improved TimeoutError handling in `server/app.py`:
```python
except TimeoutError as e:
    import traceback
    print(f"[ERROR] TimeoutError in step: {e}\n{traceback.format_exc()}", flush=True)
    # Now includes current task info instead of "unknown"
    ...
except Exception as e:
    import traceback
    print(f"[ERROR] Exception in step: {e}\n{traceback.format_exc()}", flush=True)
    ...
```

## Test Results

### Comprehensive Task Verification ✅
Ran `test_all_tasks.py` to verify all 45 tasks:
- **Easy Tasks**: 15/15 PASSED (100%)
- **Medium Tasks**: 15/15 PASSED (100%)
- **Hard Tasks**: 15/15 PASSED (100%)

All tasks achieve reward=1.00 when provided with their correct `fixed_code` solutions.

### Edge Case Testing ✅
Ran `test_edge_cases.py` to verify grader robustness:
- ✅ Syntax errors: Properly caught and reported
- ✅ Runtime errors: Properly caught and reported
- ✅ Missing return statements: Properly detected
- ✅ Timeout/infinite loops: Handled gracefully (on Unix with SIGALRM)
- ✅ Empty input edge cases: Properly tested
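
The SIGALRM-based timeout mentioned in the list can be sketched roughly as follows. This is an illustration only, not the grader's actual code: `StepTimeout` and `run_with_timeout` are hypothetical names, and the approach assumes a Unix platform and the main thread, since that is where `signal.alarm` is usable.

```python
import signal


class StepTimeout(Exception):
    """Raised when the guarded call exceeds its time budget (hypothetical name)."""


def run_with_timeout(func, seconds, *args):
    """Run func(*args), aborting with StepTimeout after `seconds` whole seconds."""
    if not hasattr(signal, "SIGALRM"):
        # No SIGALRM (e.g. Windows): run unguarded rather than fail.
        return func(*args)

    def _on_alarm(signum, frame):
        raise StepTimeout(f"call exceeded {seconds}s")

    previous = signal.signal(signal.SIGALRM, _on_alarm)
    signal.alarm(seconds)
    try:
        return func(*args)
    finally:
        signal.alarm(0)  # cancel any pending alarm
        signal.signal(signal.SIGALRM, previous)  # restore the old handler
```

Restoring the previous handler in `finally` keeps the helper from leaking state into the rest of the process, which matters when many submissions are graded in a row.
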

## Root Cause Analysis

### Why did easy_014 fail?
The task `easy_014` (longest_word_length) received incorrect fixes from the LLM across attempts 1-3. On attempts 4-5, the LLM-generated code likely caused a server error (infinite loop, exception, or timeout), resulting in 500 errors from the Hugging Face Space.

**Task is correct** ✅ - When given the proper fix (`max` instead of `min`), it passes all tests.
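
The task's actual source is not shown in this report, so the following is only a guess at the shape of the `max` versus `min` bug; both function bodies are hypothetical reconstructions.

```python
def longest_word_length_buggy(sentence: str) -> int:
    # Hypothetical buggy form: min() returns the shortest word's length.
    return min((len(word) for word in sentence.split()), default=0)


def longest_word_length_fixed(sentence: str) -> int:
    # Hypothetical fixed form: max() returns the longest word's length.
    return max((len(word) for word in sentence.split()), default=0)
```

For the input `"a quick brown fox"` the buggy form yields 1 (the length of `"a"`) while the fixed form yields 5.
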

### Why did hard_010 get 0.00 reward?
The task `hard_010` (BFS shortest path) likely received fixes that:
1. Failed the test cases (70% of reward = 0)
2. Had missing or poor explanations (30% of reward = 0)

**Task is correct** ✅ - When given the proper fix (adding `visited` set) and a good explanation, it achieves reward=1.00.
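
To illustrate why the `visited` set matters, here is a minimal BFS shortest-path sketch. The signature and the adjacency-dict graph representation are assumptions; the real `hard_010` task may use different names and return conventions.

```python
from collections import deque


def bfs_shortest_path(graph, start, goal):
    """Return the number of edges on a shortest path, or -1 if unreachable."""
    if start == goal:
        return 0
    visited = {start}  # without this set, cycles are re-enqueued forever
    queue = deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        for neighbor in graph.get(node, []):
            if neighbor == goal:
                return dist + 1
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append((neighbor, dist + 1))
    return -1
```

On a cyclic graph with an unreachable goal, the version without `visited` never empties its queue, which matches the infinite-loop and timeout symptoms described above.
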

## Recommendations

### For Better LLM Performance:
1. **Use a more capable model**: Consider switching from `llama-3.1-8b-instant` to:
   - `gpt-4o-mini` (default, better at code debugging)
   - `gpt-4o` (best performance)
   - `claude-3.5-sonnet` (excellent at code understanding)

2. **Improve the system prompt**: The current prompt could be enhanced with:
   - More examples of common bug patterns
   - Better emphasis on reading test feedback
   - Specific guidance for each difficulty level

3. **Increase temperature on retries**: Already implemented - uses 0.2 for first attempt, 0.5 for retries

### For Server Resilience:
1. ✅ **Added error logging** to help debug future issues
2. ✅ **Improved error feedback** to LLM when step fails
3. Consider adding rate limiting if deployed publicly
4. Consider adding per-session timeout limits

## Files Modified

1. **`inference.py`**:
   - Improved error handling to pass server errors as feedback to LLM

2. **`server/app.py`**:
   - Enhanced error logging
   - Improved TimeoutError response with current task context

## Files Created (for testing)

1. **`test_debug.py`**: Tests specific failing tasks (easy_014, hard_010)
2. **`test_edge_cases.py`**: Tests grader robustness with bad inputs
3. **`test_all_tasks.py`**: Comprehensive verification of all 45 tasks

## Conclusion

**All tasks are working correctly.** The observed failures were due to:
1. LLM limitations (llama-3.1-8b-instant struggled with some tasks)
2. Missing error feedback loop (now fixed)
3. Potential server-side issues on Hugging Face Space (addressed with better logging)

The codebase is now more robust with better error handling and logging.
DEBUGGING_REPORT_FINAL.md
DELETED
@@ -1,247 +0,0 @@
# Task Debugging Report - FINAL

## Executive Summary

- ✅ **All 45 tasks work correctly** when given proper fixes
- ❌ **LLM (llama-3.1-8b-instant) struggles with medium/hard tasks**
- ✅ **All improvements implemented** to make the system more robust

---

## Latest Inference Run Analysis

| Task | Difficulty | Result | Reason |
|------|-----------|---------|---------|
| easy_013 | Easy | ✅ SUCCESS (1.00) | LLM fixed title case bug on first attempt |
| medium_005 | Medium | ❌ FAILURE (500 errors) | LLM generated code causing server crashes after 2 failed attempts |
| hard_011 | Hard | ❌ FAILURE (0.00 all steps) | LLM couldn't fix DP algorithm or provide good explanations |

**Success Rate**: 1/3 tasks (33%) - **Easy tasks work, medium/hard fail**

---

## Improvements Implemented

### 1. ✅ Enhanced LLM Prompts (`inference.py`)

**Added difficulty-specific guidance**:

```text
MEDIUM TASK TIPS:
- Look for EXACTLY TWO bugs (not one, not three)
- Common patterns: swapped if/else branches, += vs =, wrong comparison operator
- Example: "if item in freq: freq[item] = 1" should be += 1

HARD TASK TIPS:
- Algorithmic bugs: iteration order, loop bounds, missing state tracking
- Common patterns: forward vs backward iteration (DP), missing visited set (graphs)
- Explanation MUST mention specific concepts: "backward iteration", "visited set", etc.
```

**Impact**: Better guidance for LLM on what to look for

---

### 2. ✅ Grading Error Handling (`server/environment.py`)

**Wrapped grader calls to prevent 500 errors**:

```python
try:
    reward, passed, total, feedback, _ = grader(...)
except Exception as e:
    print(f"[ERROR] Grading failed: {e}", flush=True)
    return DebugObservation(
        reward=0.0,
        feedback=f"❌ Grading Error: {type(e).__name__}...",
        done=done
    )
```

**Impact**: Server doesn't crash when LLM generates problematic code - returns helpful error message instead

---
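
A self-contained version of the wrapper idea might look like the sketch below. `Observation` is a hypothetical stand-in for `DebugObservation`, and `grade_safely` and the two-argument grader signature are assumptions made for illustration; only the 5-tuple return shape comes from the snippets in this report.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict


@dataclass
class Observation:
    """Hypothetical stand-in for the server's DebugObservation."""
    reward: float
    feedback: str
    done: bool


def grade_safely(grader: Callable[..., tuple], code: str,
                 task: Dict[str, Any], done: bool = False) -> Observation:
    """Call the grader and convert any exception into a zero-reward observation."""
    try:
        reward, _passed, _total, feedback, _ = grader(code, task)
    except Exception as e:
        # Log server-side, then answer normally instead of surfacing a 500.
        print(f"[ERROR] Grading failed: {e}", flush=True)
        return Observation(reward=0.0,
                           feedback=f"Grading Error: {type(e).__name__}: {e}",
                           done=done)
    return Observation(reward=reward, feedback=feedback, done=done)
```

Returning an observation in both branches means the client always receives structured feedback it can pass back to the model.
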

### 3. ✅ Error Feedback Loop (`inference.py`)

**Pass 500 errors to LLM as learning feedback**:

```python
except Exception as e:
    error_msg = str(e)[:200]
    last_feedback = f"❌ Server Error: {error_msg}\n\n" \
                    "Your code likely caused a runtime error or timeout..."
    # LLM sees this on next attempt
```

**Impact**: LLM sees what went wrong and can adjust its next attempt instead of repeating the same mistake

---

### 4. ✅ Comprehensive Logging (`server/app.py` + `environment.py`)

**Added detailed logging for debugging**:
- TimeoutError with full stack trace
- Grading exceptions with task context
- Server-side error tracking

**Impact**: Easy to diagnose issues in production

---

## Test Results

### ✅ All Tasks Verified Working

```bash
python test_all_tasks.py
```

**Results**:
- Easy Tasks: 15/15 PASSED (100%)
- Medium Tasks: 15/15 PASSED (100%)
- Hard Tasks: 15/15 PASSED (100%)

**Conclusion**: Tasks are correct; the failures stem from LLM-generated code

---

### ⚠️ Edge Case Analysis

#### medium_005 (Count Frequency)
**Task**: Count element frequency with 2 bugs (swapped if/else + wrong operation)

**Potential Issues**:
- Unhashable types `[{}, []]` → TypeError (handled by grader)
- KeyError from bad LLM code (handled by grader)
- Empty list `[]` → Works correctly
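
The three cases in the list can be reproduced with a small sketch. The `count_frequency` name appears in the deleted test script; the bodies below, in particular the swapped-branch variant, are hypothetical reconstructions rather than the task's real buggy code.

```python
def count_frequency(lst):
    """Straightforward frequency count (what a correct fix might look like)."""
    freq = {}
    for item in lst:
        if item in freq:
            freq[item] += 1
        else:
            freq[item] = 1
    return freq


def count_frequency_swapped(lst):
    """Hypothetical buggy shape: branches swapped, so new keys hit `+=`."""
    freq = {}
    for item in lst:
        if item in freq:
            freq[item] = 1
        else:
            freq[item] += 1  # KeyError: the key does not exist yet
    return freq
```

Calling `count_frequency([{}, []])` raises `TypeError` because dicts and lists are unhashable, and `count_frequency_swapped(["a"])` raises `KeyError` on the first element, so the grader has to catch both.
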

#### hard_011 (0/1 Knapsack)
**Task**: DP knapsack with iteration order bug (forward vs backward)

**Potential Issues**:
- Mismatched array lengths → IndexError (handled by grader)
- Negative capacity → IndexError (handled by grader)
- Very large capacity → MemoryError (timeout mechanism)
- Missing/poor explanation → 0% explanation score

---

## Root Cause: LLM Limitations

### Why Easy Tasks Succeed:
- ✅ Single bug (simple comparison, operator, return value)
- ✅ Clear patterns (`==` vs `!=`, `<` vs `>`, `+1` vs `-1`)
- ✅ LLM can spot these easily

### Why Medium Tasks Fail:
- ❌ **TWO bugs** to find simultaneously
- ❌ Swapped logic (if/else reversed) - harder to spot
- ❌ Need to trace through code more carefully
- ❌ llama-3.1-8b-instant struggles with multi-bug analysis

### Why Hard Tasks Fail:
- ❌ **Algorithmic understanding** required (DP, graphs, etc.)
- ❌ **Explanation requirement** (30% of reward)
- ❌ Must use specific keywords ("backward iteration", "visited set")
- ❌ llama-3.1-8b-instant appears weak at algorithmic reasoning
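
The reports state a 70/30 split between test results and explanation, and that the explanation is judged on keywords, but they do not show the formula. The sketch below is one plausible reading, assuming the explanation score is simply the fraction of keywords found; `hard_reward` is a hypothetical name and the real `grade_hard` may score differently.

```python
from typing import List, Optional


def hard_reward(passed: int, total: int,
                explanation: Optional[str], keywords: List[str]) -> float:
    """Combine test results (70%) with keyword coverage (30%). Assumed formula."""
    test_score = passed / total if total else 0.0
    if explanation and keywords:
        text = explanation.lower()
        hits = sum(1 for kw in keywords if kw.lower() in text)
        explanation_score = hits / len(keywords)
    else:
        explanation_score = 0.0
    return round(0.7 * test_score + 0.3 * explanation_score, 2)
```

Under this reading, a correct fix with no explanation tops out at 0.70, and a failing fix with a missing explanation scores 0.00, which is consistent with the `hard_011` result in the table above.
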

**Example - hard_011**:
```python
# Buggy: forward iteration
for w in range(weights[i], capacity + 1):  # ❌ Wrong
    dp[w] = max(dp[w], dp[w - weights[i]] + values[i])

# Fixed: backward iteration
for w in range(capacity, weights[i] - 1, -1):  # ✅ Correct
    dp[w] = max(dp[w], dp[w - weights[i]] + values[i])
```

**Explanation needed**: "The inner loop must iterate backward to prevent using the same item multiple times, which would turn this into an unbounded knapsack instead of 0/1 knapsack."

→ llama-3.1-8b-instant doesn't understand this algorithmic nuance

---
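
The difference between the two loop directions is easy to see on a tiny input. The function below is a sketch written for this illustration (the `backward` flag is not part of the task); it wraps the inner loop from the example in a complete one-dimensional DP.

```python
from typing import List


def knapsack(weights: List[int], values: List[int],
             capacity: int, backward: bool = True) -> int:
    """1-D knapsack DP; backward=True gives 0/1 semantics, False reuses items."""
    dp = [0] * (capacity + 1)
    for i in range(len(weights)):
        if backward:
            # dp[w - weights[i]] has not been updated for item i yet.
            ws = range(capacity, weights[i] - 1, -1)
        else:
            # dp[w - weights[i]] may already include item i, so it can repeat.
            ws = range(weights[i], capacity + 1)
        for w in ws:
            dp[w] = max(dp[w], dp[w - weights[i]] + values[i])
    return dp[capacity]
```

With a single item of weight 1 and value 10 and a capacity of 3, backward iteration returns 10 (the item is taken once), while forward iteration returns 30 because the same item is counted three times.
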

## Recommendations

### 🚀 IMMEDIATE FIX: Use Better Model

**Replace** `llama-3.1-8b-instant` with:

| Model | Speed | Quality | Best For |
|-------|-------|---------|----------|
| **gpt-4o-mini** | Fast | Good | Balanced choice ⭐ |
| gpt-4o | Medium | Excellent | Best results |
| claude-3.5-sonnet | Medium | Excellent | Code understanding |
| gpt-4-turbo | Medium | Very Good | Good balance |

**Expected improvement**: 33% → 70%+ success rate

---

### 📝 Prompt Improvements (Already Implemented)

- ✅ Added common bug patterns
- ✅ Added difficulty-specific tips
- ✅ Added algorithmic guidance for hard tasks
- ✅ Error feedback loop

---

### 🔧 Configuration Tweaks

**In `inference.py`**:
```python
# Current
temperature=0.2 if attempt == 1 else 0.5
max_tokens=1500

# Recommended
temperature=0.1 if attempt == 1 else 0.3  # More deterministic
max_tokens=2000  # More space for explanations
```

---
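
One way to keep these values in a single place is a small helper that maps the attempt number to sampling parameters. `sampling_params` and its argument names are hypothetical; the defaults mirror the "Current" values above, and the report does not say how `inference.py` actually passes them to the client.

```python
from typing import Dict, Union


def sampling_params(attempt: int,
                    first_temperature: float = 0.2,
                    retry_temperature: float = 0.5,
                    max_tokens: int = 1500) -> Dict[str, Union[int, float]]:
    """Return per-attempt sampling settings (attempt numbers start at 1)."""
    if attempt < 1:
        raise ValueError("attempt numbers start at 1")
    temperature = first_temperature if attempt == 1 else retry_temperature
    return {"temperature": temperature, "max_tokens": max_tokens}
```

The recommended settings would then be `sampling_params(attempt, 0.1, 0.3, 2000)`, which makes it simple to compare the two configurations in an experiment.
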

### 📊 Testing Before Deployment

```bash
# Verify all tasks
python test_all_tasks.py

# Test specific problems
python test_specific_tasks.py

# Check edge cases
python test_edge_cases.py
```

---

## Files Modified

| File | Changes | Impact |
|------|---------|--------|
| `inference.py` | Enhanced prompts, error feedback, medium/hard tips | Better LLM guidance |
| `server/environment.py` | Grading error handling, logging | Prevents 500 crashes |
| `server/app.py` | Timeout error handling, logging | Better error messages |

---

## Conclusion

### ✅ What's Working:
- All 45 tasks are correctly implemented
- Grading system is robust and handles errors gracefully
- Error logging helps debug issues
- Enhanced prompts guide LLM better

### ❌ What's Not Working:
- The LLM (llama-3.1-8b-instant) is too weak for medium/hard tasks
- Success rate: 33% (1 of 3 tasks in the latest run; only the easy task passed)

### 💡 Solution:
**Switch to gpt-4o-mini or better** → Expected 70%+ success rate

The infrastructure is solid. The bottleneck is the LLM's capability.
test_all_tasks.py
DELETED
@@ -1,108 +0,0 @@
#!/usr/bin/env python3
"""Comprehensive test to verify all tasks can be solved correctly"""

from server.tasks.task_easy import EASY_TASKS
from server.tasks.task_medium import MEDIUM_TASKS
from server.tasks.task_hard import HARD_TASKS
from server.graders.grader_easy import grade_easy
from server.graders.grader_medium import grade_medium
from server.graders.grader_hard import grade_hard

def test_all_easy_tasks():
    print("="*70)
    print("TESTING ALL EASY TASKS")
    print("="*70)
    failed = []
    for task in EASY_TASKS:
        task_id = task['task_id']
        try:
            reward, passed, total, feedback, _ = grade_easy(task['fixed_code'], task)
            if reward < 1.0:
                failed.append((task_id, reward, f"{passed}/{total} tests passed"))
                print(f"❌ {task_id}: reward={reward:.2f} ({passed}/{total})")
            else:
                print(f"✅ {task_id}: reward={reward:.2f} ({passed}/{total})")
        except Exception as e:
            failed.append((task_id, 0.0, str(e)))
            print(f"💥 {task_id}: ERROR - {e}")

    print(f"\n{'='*70}")
    print(f"EASY TASKS: {len(EASY_TASKS) - len(failed)}/{len(EASY_TASKS)} passed")
    if failed:
        print("Failed tasks:")
        for task_id, reward, reason in failed:
            print(f" - {task_id}: {reason}")
    print("="*70)
    return len(failed) == 0

def test_all_medium_tasks():
    print("\n" + "="*70)
    print("TESTING ALL MEDIUM TASKS")
    print("="*70)
    failed = []
    for task in MEDIUM_TASKS:
        task_id = task['task_id']
        try:
            reward, passed, total, feedback, _ = grade_medium(task['fixed_code'], task)
            if reward < 1.0:
                failed.append((task_id, reward, f"{passed}/{total} tests passed"))
                print(f"❌ {task_id}: reward={reward:.2f} ({passed}/{total})")
            else:
                print(f"✅ {task_id}: reward={reward:.2f} ({passed}/{total})")
        except Exception as e:
            failed.append((task_id, 0.0, str(e)))
            print(f"💥 {task_id}: ERROR - {e}")

    print(f"\n{'='*70}")
    print(f"MEDIUM TASKS: {len(MEDIUM_TASKS) - len(failed)}/{len(MEDIUM_TASKS)} passed")
    if failed:
        print("Failed tasks:")
        for task_id, reward, reason in failed:
            print(f" - {task_id}: {reason}")
    print("="*70)
    return len(failed) == 0

def test_all_hard_tasks():
    print("\n" + "="*70)
    print("TESTING ALL HARD TASKS")
    print("="*70)
    failed = []
    for task in HARD_TASKS:
        task_id = task['task_id']
        try:
            # Create a good explanation that matches keywords
            keywords = task.get('explanation_keywords', [])
            explanation = f"The bug involved issues with {', '.join(keywords[:3])}. The fix addresses these problems."

            reward, passed, total, feedback, _ = grade_hard(task['fixed_code'], task, explanation)
            if reward < 0.95:  # Allow for some explanation variance
                failed.append((task_id, reward, f"{passed}/{total} tests passed"))
                print(f"❌ {task_id}: reward={reward:.2f} ({passed}/{total})")
            else:
                print(f"✅ {task_id}: reward={reward:.2f} ({passed}/{total})")
        except Exception as e:
            failed.append((task_id, 0.0, str(e)))
            print(f"💥 {task_id}: ERROR - {e}")

    print(f"\n{'='*70}")
    print(f"HARD TASKS: {len(HARD_TASKS) - len(failed)}/{len(HARD_TASKS)} passed")
    if failed:
        print("Failed tasks:")
        for task_id, reward, reason in failed:
            print(f" - {task_id}: {reason}")
    print("="*70)
    return len(failed) == 0

if __name__ == "__main__":
    easy_ok = test_all_easy_tasks()
    medium_ok = test_all_medium_tasks()
    hard_ok = test_all_hard_tasks()

    print("\n" + "="*70)
    print("FINAL SUMMARY")
    print("="*70)
    print(f"Easy tasks: {'✅ PASS' if easy_ok else '❌ FAIL'}")
    print(f"Medium tasks: {'✅ PASS' if medium_ok else '❌ FAIL'}")
    print(f"Hard tasks: {'✅ PASS' if hard_ok else '❌ FAIL'}")
    print(f"\nOverall: {'✅ ALL TASKS WORKING' if (easy_ok and medium_ok and hard_ok) else '❌ SOME TASKS FAILING'}")
    print("="*70)
test_debug.py
DELETED
@@ -1,62 +0,0 @@
#!/usr/bin/env python3
"""Test script to debug failing tasks"""

from server.tasks.task_easy import get_task_by_id
from server.tasks.task_hard import get_task_by_id as get_hard_task_by_id
from server.graders.grader_easy import grade_easy
from server.graders.grader_hard import grade_hard

# Test easy_014
print("="*60)
print("Testing easy_014")
print("="*60)
task_easy = get_task_by_id('easy_014')
print(f"Task ID: {task_easy['task_id']}")
print(f"Test cases: {task_easy['test_cases']}")

try:
    buggy_code = task_easy['buggy_code']
    reward, passed, total, feedback, results = grade_easy(buggy_code, task_easy)
    print(f"\nBuggy code result: {passed}/{total}, reward={reward}")
except Exception as e:
    print(f"\nERROR with buggy code: {type(e).__name__}: {e}")
    import traceback
    traceback.print_exc()

try:
    fixed_code = task_easy['fixed_code']
    reward, passed, total, feedback, results = grade_easy(fixed_code, task_easy)
    print(f"\nFixed code result: {passed}/{total}, reward={reward}")
    print(f"Feedback:\n{feedback}")
except Exception as e:
    print(f"\nERROR with fixed code: {type(e).__name__}: {e}")
    import traceback
    traceback.print_exc()

# Test hard_010
print("\n" + "="*60)
print("Testing hard_010")
print("="*60)
task_hard = get_hard_task_by_id('hard_010')
print(f"Task ID: {task_hard['task_id']}")
print(f"Test cases: {task_hard['test_cases']}")

try:
    buggy_code = task_hard['buggy_code']
    reward, passed, total, feedback, results = grade_hard(buggy_code, task_hard, explanation=None)
    print(f"\nBuggy code result (no explanation): {passed}/{total}, reward={reward}")
except Exception as e:
    print(f"\nERROR with buggy code: {type(e).__name__}: {e}")
    import traceback
    traceback.print_exc()

try:
    fixed_code = task_hard['fixed_code']
    explanation = "The bug was that there was no visited set to track already visited nodes, which caused infinite loops in graphs with cycles."
    reward, passed, total, feedback, results = grade_hard(fixed_code, task_hard, explanation=explanation)
    print(f"\nFixed code result (with explanation): {passed}/{total}, reward={reward}")
    print(f"Feedback:\n{feedback}")
except Exception as e:
    print(f"\nERROR with fixed code: {type(e).__name__}: {e}")
    import traceback
    traceback.print_exc()
test_specific_tasks.py
DELETED
@@ -1,128 +0,0 @@
#!/usr/bin/env python3
"""Test medium_005 and hard_011 specifically"""

from server.tasks.task_medium import get_task_by_id as get_medium_task
from server.tasks.task_hard import get_task_by_id as get_hard_task
from server.graders.grader_medium import grade_medium
from server.graders.grader_hard import grade_hard

print("="*70)
print("Testing MEDIUM_005")
print("="*70)
task = get_medium_task('medium_005')
print(f"Task ID: {task['task_id']}")
print(f"Instructions: {task['instructions']}")
print(f"\nBuggy code:")
print(task['buggy_code'])
print(f"\nFixed code:")
print(task['fixed_code'])
print(f"\nTest cases: {task['test_cases']}")

# Test with buggy code
print("\n--- Testing BUGGY code ---")
try:
    reward, passed, total, feedback, results = grade_medium(task['buggy_code'], task)
    print(f"Result: {passed}/{total}, reward={reward:.2f}")
    print(f"Feedback:\n{feedback}")
except Exception as e:
    print(f"ERROR: {type(e).__name__}: {e}")
    import traceback
    traceback.print_exc()

# Test with fixed code
print("\n--- Testing FIXED code ---")
try:
    reward, passed, total, feedback, results = grade_medium(task['fixed_code'], task)
    print(f"Result: {passed}/{total}, reward={reward:.2f}")
    for r in results:
        print(f" Test {r['test_id']}: {'✅' if r['passed'] else '❌'}")
except Exception as e:
    print(f"ERROR: {type(e).__name__}: {e}")
    import traceback
    traceback.print_exc()

print("\n" + "="*70)
print("Testing HARD_011")
print("="*70)
task = get_hard_task('hard_011')
print(f"Task ID: {task['task_id']}")
print(f"Instructions: {task['instructions']}")
print(f"\nBuggy code:")
print(task['buggy_code'])
print(f"\nFixed code:")
print(task['fixed_code'])
print(f"\nTest cases: {task['test_cases']}")
print(f"\nExplanation keywords: {task['explanation_keywords']}")

# Test with buggy code (no explanation)
print("\n--- Testing BUGGY code (no explanation) ---")
try:
    reward, passed, total, feedback, results = grade_hard(task['buggy_code'], task, explanation=None)
    print(f"Result: {passed}/{total}, reward={reward:.2f}")
    print(f"Feedback:\n{feedback[:300]}...")
except Exception as e:
    print(f"ERROR: {type(e).__name__}: {e}")
    import traceback
    traceback.print_exc()

# Test with fixed code and good explanation
print("\n--- Testing FIXED code (with good explanation) ---")
explanation = "The bug was in the iteration order. The inner loop must iterate backward (from capacity down to weights[i]) to prevent using the same item multiple times, which would turn this into an unbounded knapsack instead of 0/1 knapsack."
try:
    reward, passed, total, feedback, results = grade_hard(task['fixed_code'], task, explanation=explanation)
    print(f"Result: {passed}/{total}, reward={reward:.2f}")
    for r in results:
        print(f" Test {r['test_id']}: {'✅' if r['passed'] else '❌'}")
except Exception as e:
    print(f"ERROR: {type(e).__name__}: {e}")
    import traceback
    traceback.print_exc()

# Test some potentially problematic LLM-generated code
print("\n" + "="*70)
print("Testing POTENTIALLY BAD LLM CODE for medium_005")
print("="*70)

bad_code_1 = """
def count_frequency(lst):
    freq = {}
    for item in lst:
        freq[item] = freq.get(item, 0) + 1
    return freq
"""
print("Testing: Using .get() method (should work)")
try:
    reward, passed, total, feedback, results = grade_medium(bad_code_1, get_medium_task('medium_005'))
    print(f"Result: {passed}/{total}, reward={reward:.2f}")
except Exception as e:
    print(f"ERROR: {type(e).__name__}: {e}")

bad_code_2 = """
def count_frequency(lst):
    from collections import Counter
    return dict(Counter(lst))
"""
print("\nTesting: Using Counter (should work)")
try:
    reward, passed, total, feedback, results = grade_medium(bad_code_2, get_medium_task('medium_005'))
    print(f"Result: {passed}/{total}, reward={reward:.2f}")
except Exception as e:
    print(f"ERROR: {type(e).__name__}: {e}")

bad_code_3 = """
def count_frequency(lst):
    freq = {}
    for item in lst:
        if item in freq:
            freq[item] = freq[item] + 1  # Fine: equivalent to += 1
        else:
            freq[item] = freq[item] + 1  # This will cause KeyError!
    return freq
"""
print("\nTesting: Code with KeyError (should fail gracefully)")
try:
    reward, passed, total, feedback, results = grade_medium(bad_code_3, get_medium_task('medium_005'))
    print(f"Result: {passed}/{total}, reward={reward:.2f}")
    print(f"Feedback: {feedback[:200]}...")
except Exception as e:
    print(f"ERROR: {type(e).__name__}: {e}")