Spaces:
Sleeping
Sleeping
Commit ·
cbda222
1
Parent(s): 2785b89
some fixes
Browse files- DEBUGGING_REPORT.md +114 -0
- inference.py +4 -1
- server/app.py +14 -6
- server/graders/__pycache__/grader_easy.cpython-310.pyc +0 -0
- test_all_tasks.py +108 -0
- test_debug.py +62 -0
- test_edge_cases.py +80 -0
DEBUGGING_REPORT.md
ADDED
|
@@ -0,0 +1,114 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
# Task Debugging Report
|
| 2 |
+
|
| 3 |
+
## Summary
|
| 4 |
+
All 45 tasks (15 easy + 15 medium + 15 hard) are working correctly. The failures observed in the inference run were due to the LLM model (llama-3.1-8b-instant) not generating correct code fixes, not due to bugs in the task definitions or grading system.
|
| 5 |
+
|
| 6 |
+
## Issues Found and Fixed
|
| 7 |
+
|
| 8 |
+
### 1. **Inference Script Error Handling** ✅ FIXED
|
| 9 |
+
**Issue**: When the `/step` endpoint returned a 500 error, the inference script caught the exception but didn't pass the error details to the LLM for the next attempt.
|
| 10 |
+
|
| 11 |
+
**Fix**: Modified `inference.py` line 200-208 to capture the error message and pass it as feedback to the LLM:
|
| 12 |
+
```python
|
| 13 |
+
except Exception as e:
|
| 14 |
+
error_msg = str(e)[:200]
|
| 15 |
+
log_step(step=attempt, action="step_failed",
|
| 16 |
+
reward=0.0, done=False, error=error_msg[:60])
|
| 17 |
+
rewards.append(0.0)
|
| 18 |
+
# Pass error feedback to LLM for next attempt
|
| 19 |
+
last_feedback = f"❌ Server Error: {error_msg}\n\nYour code likely caused a runtime error or timeout..."
|
| 20 |
+
continue
|
| 21 |
+
```
|
| 22 |
+
|
| 23 |
+
### 2. **Server Error Logging** ✅ IMPROVED
|
| 24 |
+
**Issue**: When errors occurred in the `/step` endpoint, there was no server-side logging to help debug issues.
|
| 25 |
+
|
| 26 |
+
**Fix**: Added logging and improved TimeoutError handling in `server/app.py`:
|
| 27 |
+
```python
|
| 28 |
+
except TimeoutError as e:
|
| 29 |
+
import traceback
|
| 30 |
+
print(f"[ERROR] TimeoutError in step: {e}\n{traceback.format_exc()}", flush=True)
|
| 31 |
+
# Now includes current task info instead of "unknown"
|
| 32 |
+
...
|
| 33 |
+
except Exception as e:
|
| 34 |
+
import traceback
|
| 35 |
+
print(f"[ERROR] Exception in step: {e}\n{traceback.format_exc()}", flush=True)
|
| 36 |
+
...
|
| 37 |
+
```
|
| 38 |
+
|
| 39 |
+
## Test Results
|
| 40 |
+
|
| 41 |
+
### Comprehensive Task Verification ✅
|
| 42 |
+
Ran `test_all_tasks.py` to verify all 45 tasks:
|
| 43 |
+
- **Easy Tasks**: 15/15 PASSED (100%)
|
| 44 |
+
- **Medium Tasks**: 15/15 PASSED (100%)
|
| 45 |
+
- **Hard Tasks**: 15/15 PASSED (100%)
|
| 46 |
+
|
| 47 |
+
All tasks achieve reward=1.00 when provided with their correct `fixed_code` solutions.
|
| 48 |
+
|
| 49 |
+
### Edge Case Testing ✅
|
| 50 |
+
Ran `test_edge_cases.py` to verify grader robustness:
|
| 51 |
+
- ✅ Syntax errors: Properly caught and reported
|
| 52 |
+
- ✅ Runtime errors: Properly caught and reported
|
| 53 |
+
- ✅ Missing return statements: Properly detected
|
| 54 |
+
- ✅ Timeout/infinite loops: Handled gracefully (on Unix with SIGALRM)
|
| 55 |
+
- ✅ Empty input edge cases: Properly tested
|
| 56 |
+
|
| 57 |
+
## Root Cause Analysis
|
| 58 |
+
|
| 59 |
+
### Why did easy_014 fail?
|
| 60 |
+
The task `easy_014` (longest_word_length) received incorrect fixes from the LLM across attempts 1-3. On attempts 4-5, the LLM-generated code likely caused a server error (infinite loop, exception, or timeout), resulting in 500 errors from the Hugging Face Space.
|
| 61 |
+
|
| 62 |
+
**Task is correct** ✅ - When given the proper fix (`max` instead of `min`), it passes all tests.
|
| 63 |
+
|
| 64 |
+
### Why did hard_010 get 0.00 reward?
|
| 65 |
+
The task `hard_010` (BFS shortest path) likely received fixes that:
|
| 66 |
+
1. Failed the test cases (70% of reward = 0)
|
| 67 |
+
2. Had missing or poor explanations (30% of reward = 0)
|
| 68 |
+
|
| 69 |
+
**Task is correct** ✅ - When given the proper fix (adding `visited` set) and a good explanation, it achieves reward=1.00.
|
| 70 |
+
|
| 71 |
+
## Recommendations
|
| 72 |
+
|
| 73 |
+
### For Better LLM Performance:
|
| 74 |
+
1. **Use a more capable model**: Consider switching from `llama-3.1-8b-instant` to:
|
| 75 |
+
- `gpt-4o-mini` (default, better at code debugging)
|
| 76 |
+
- `gpt-4o` (best performance)
|
| 77 |
+
- `claude-3.5-sonnet` (excellent at code understanding)
|
| 78 |
+
|
| 79 |
+
2. **Improve the system prompt**: The current prompt could be enhanced with:
|
| 80 |
+
- More examples of common bug patterns
|
| 81 |
+
- Better emphasis on reading test feedback
|
| 82 |
+
- Specific guidance for each difficulty level
|
| 83 |
+
|
| 84 |
+
3. **Increase temperature on retries**: Already implemented - uses 0.2 for first attempt, 0.5 for retries
|
| 85 |
+
|
| 86 |
+
### For Server Resilience:
|
| 87 |
+
1. ✅ **Added error logging** to help debug future issues
|
| 88 |
+
2. ✅ **Improved error feedback** to LLM when step fails
|
| 89 |
+
3. Consider adding rate limiting if deployed publicly
|
| 90 |
+
4. Consider adding per-session timeout limits
|
| 91 |
+
|
| 92 |
+
## Files Modified
|
| 93 |
+
|
| 94 |
+
1. **`inference.py`**:
|
| 95 |
+
- Improved error handling to pass server errors as feedback to LLM
|
| 96 |
+
|
| 97 |
+
2. **`server/app.py`**:
|
| 98 |
+
- Enhanced error logging
|
| 99 |
+
- Improved TimeoutError response with current task context
|
| 100 |
+
|
| 101 |
+
## Files Created (for testing)
|
| 102 |
+
|
| 103 |
+
1. **`test_debug.py`**: Tests specific failing tasks (easy_014, hard_010)
|
| 104 |
+
2. **`test_edge_cases.py`**: Tests grader robustness with bad inputs
|
| 105 |
+
3. **`test_all_tasks.py`**: Comprehensive verification of all 45 tasks
|
| 106 |
+
|
| 107 |
+
## Conclusion
|
| 108 |
+
|
| 109 |
+
**All tasks are working correctly.** The observed failures were due to:
|
| 110 |
+
1. LLM model limitations (llama-3.1-8b-instant struggled with some tasks)
|
| 111 |
+
2. Missing error feedback loop (now fixed)
|
| 112 |
+
3. Potential server-side issues on Hugging Face Space (addressed with better logging)
|
| 113 |
+
|
| 114 |
+
The codebase is now more robust with better error handling and logging.
|
inference.py
CHANGED
|
@@ -201,9 +201,12 @@ def run_episode(env_url: str, difficulty: str) -> tuple:
|
|
| 201 |
result = env_step(env_url, fixed_code=fixed_code,
|
| 202 |
explanation=agent_action.get("explanation"))
|
| 203 |
except Exception as e:
|
|
|
|
| 204 |
log_step(step=attempt, action="step_failed",
|
| 205 |
-
reward=0.0, done=False, error=
|
| 206 |
rewards.append(0.0)
|
|
|
|
|
|
|
| 207 |
continue
|
| 208 |
|
| 209 |
reward = result.get("reward", 0.0)
|
|
|
|
| 201 |
result = env_step(env_url, fixed_code=fixed_code,
|
| 202 |
explanation=agent_action.get("explanation"))
|
| 203 |
except Exception as e:
|
| 204 |
+
error_msg = str(e)[:200]
|
| 205 |
log_step(step=attempt, action="step_failed",
|
| 206 |
+
reward=0.0, done=False, error=error_msg[:60])
|
| 207 |
rewards.append(0.0)
|
| 208 |
+
# Pass error feedback to LLM for next attempt
|
| 209 |
+
last_feedback = f"❌ Server Error: {error_msg}\n\nYour code likely caused a runtime error or timeout. Check for:\n- Infinite loops\n- Syntax errors\n- Runtime exceptions (IndexError, KeyError, etc.)\n- Edge cases not handled"
|
| 210 |
continue
|
| 211 |
|
| 212 |
reward = result.get("reward", 0.0)
|
server/app.py
CHANGED
|
@@ -105,19 +105,27 @@ async def step(request: StepRequest) -> StepResponse:
|
|
| 105 |
reward=observation.reward or 0.0,
|
| 106 |
done=observation.done,
|
| 107 |
)
|
| 108 |
-
except TimeoutError:
|
| 109 |
# Code execution timed out — return 0 reward instead of 500
|
|
|
|
|
|
|
| 110 |
return StepResponse(
|
| 111 |
-
observation={"task_id": "
|
| 112 |
-
"
|
| 113 |
-
"
|
| 114 |
-
"
|
| 115 |
-
"
|
|
|
|
|
|
|
|
|
|
|
|
|
| 116 |
"done": False},
|
| 117 |
reward=0.0,
|
| 118 |
done=False,
|
| 119 |
)
|
| 120 |
except Exception as e:
|
|
|
|
|
|
|
| 121 |
raise HTTPException(status_code=500, detail=f"Step failed: {str(e)}")
|
| 122 |
|
| 123 |
|
|
|
|
| 105 |
reward=observation.reward or 0.0,
|
| 106 |
done=observation.done,
|
| 107 |
)
|
| 108 |
+
except TimeoutError as e:
|
| 109 |
# Code execution timed out — return 0 reward instead of 500
|
| 110 |
+
import traceback
|
| 111 |
+
print(f"[ERROR] TimeoutError in step: {e}\n{traceback.format_exc()}", flush=True)
|
| 112 |
return StepResponse(
|
| 113 |
+
observation={"task_id": env._current_task.get("task_id", "unknown") if env._current_task else "unknown",
|
| 114 |
+
"difficulty": env._difficulty,
|
| 115 |
+
"buggy_code": env._current_task.get("buggy_code", "") if env._current_task else "",
|
| 116 |
+
"instructions": env._current_task.get("instructions", "") if env._current_task else "",
|
| 117 |
+
"test_cases_description": env._current_task.get("test_cases_description", "") if env._current_task else "",
|
| 118 |
+
"reward": 0.0,
|
| 119 |
+
"passed_tests": 0,
|
| 120 |
+
"total_tests": len(env._current_task.get("test_cases", [])) if env._current_task else 3,
|
| 121 |
+
"feedback": "TimeoutError: Code execution timed out. Possible infinite loop or very slow code.",
|
| 122 |
"done": False},
|
| 123 |
reward=0.0,
|
| 124 |
done=False,
|
| 125 |
)
|
| 126 |
except Exception as e:
|
| 127 |
+
import traceback
|
| 128 |
+
print(f"[ERROR] Exception in step: {e}\n{traceback.format_exc()}", flush=True)
|
| 129 |
raise HTTPException(status_code=500, detail=f"Step failed: {str(e)}")
|
| 130 |
|
| 131 |
|
server/graders/__pycache__/grader_easy.cpython-310.pyc
CHANGED
|
Binary files a/server/graders/__pycache__/grader_easy.cpython-310.pyc and b/server/graders/__pycache__/grader_easy.cpython-310.pyc differ
|
|
|
test_all_tasks.py
ADDED
|
@@ -0,0 +1,108 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
#!/usr/bin/env python3
|
| 2 |
+
"""Comprehensive test to verify all tasks can be solved correctly"""
|
| 3 |
+
|
| 4 |
+
from server.tasks.task_easy import EASY_TASKS
|
| 5 |
+
from server.tasks.task_medium import MEDIUM_TASKS
|
| 6 |
+
from server.tasks.task_hard import HARD_TASKS
|
| 7 |
+
from server.graders.grader_easy import grade_easy
|
| 8 |
+
from server.graders.grader_medium import grade_medium
|
| 9 |
+
from server.graders.grader_hard import grade_hard
|
| 10 |
+
|
| 11 |
+
def test_all_easy_tasks():
|
| 12 |
+
print("="*70)
|
| 13 |
+
print("TESTING ALL EASY TASKS")
|
| 14 |
+
print("="*70)
|
| 15 |
+
failed = []
|
| 16 |
+
for task in EASY_TASKS:
|
| 17 |
+
task_id = task['task_id']
|
| 18 |
+
try:
|
| 19 |
+
reward, passed, total, feedback, _ = grade_easy(task['fixed_code'], task)
|
| 20 |
+
if reward < 1.0:
|
| 21 |
+
failed.append((task_id, reward, f"{passed}/{total} tests passed"))
|
| 22 |
+
print(f"❌ {task_id}: reward={reward:.2f} ({passed}/{total})")
|
| 23 |
+
else:
|
| 24 |
+
print(f"✅ {task_id}: reward={reward:.2f} ({passed}/{total})")
|
| 25 |
+
except Exception as e:
|
| 26 |
+
failed.append((task_id, 0.0, str(e)))
|
| 27 |
+
print(f"💥 {task_id}: ERROR - {e}")
|
| 28 |
+
|
| 29 |
+
print(f"\n{'='*70}")
|
| 30 |
+
print(f"EASY TASKS: {len(EASY_TASKS) - len(failed)}/{len(EASY_TASKS)} passed")
|
| 31 |
+
if failed:
|
| 32 |
+
print("Failed tasks:")
|
| 33 |
+
for task_id, reward, reason in failed:
|
| 34 |
+
print(f" - {task_id}: {reason}")
|
| 35 |
+
print("="*70)
|
| 36 |
+
return len(failed) == 0
|
| 37 |
+
|
| 38 |
+
def test_all_medium_tasks():
|
| 39 |
+
print("\n" + "="*70)
|
| 40 |
+
print("TESTING ALL MEDIUM TASKS")
|
| 41 |
+
print("="*70)
|
| 42 |
+
failed = []
|
| 43 |
+
for task in MEDIUM_TASKS:
|
| 44 |
+
task_id = task['task_id']
|
| 45 |
+
try:
|
| 46 |
+
reward, passed, total, feedback, _ = grade_medium(task['fixed_code'], task)
|
| 47 |
+
if reward < 1.0:
|
| 48 |
+
failed.append((task_id, reward, f"{passed}/{total} tests passed"))
|
| 49 |
+
print(f"❌ {task_id}: reward={reward:.2f} ({passed}/{total})")
|
| 50 |
+
else:
|
| 51 |
+
print(f"✅ {task_id}: reward={reward:.2f} ({passed}/{total})")
|
| 52 |
+
except Exception as e:
|
| 53 |
+
failed.append((task_id, 0.0, str(e)))
|
| 54 |
+
print(f"💥 {task_id}: ERROR - {e}")
|
| 55 |
+
|
| 56 |
+
print(f"\n{'='*70}")
|
| 57 |
+
print(f"MEDIUM TASKS: {len(MEDIUM_TASKS) - len(failed)}/{len(MEDIUM_TASKS)} passed")
|
| 58 |
+
if failed:
|
| 59 |
+
print("Failed tasks:")
|
| 60 |
+
for task_id, reward, reason in failed:
|
| 61 |
+
print(f" - {task_id}: {reason}")
|
| 62 |
+
print("="*70)
|
| 63 |
+
return len(failed) == 0
|
| 64 |
+
|
| 65 |
+
def test_all_hard_tasks():
|
| 66 |
+
print("\n" + "="*70)
|
| 67 |
+
print("TESTING ALL HARD TASKS")
|
| 68 |
+
print("="*70)
|
| 69 |
+
failed = []
|
| 70 |
+
for task in HARD_TASKS:
|
| 71 |
+
task_id = task['task_id']
|
| 72 |
+
try:
|
| 73 |
+
# Create a good explanation that matches keywords
|
| 74 |
+
keywords = task.get('explanation_keywords', [])
|
| 75 |
+
explanation = f"The bug involved issues with {', '.join(keywords[:3])}. The fix addresses these problems."
|
| 76 |
+
|
| 77 |
+
reward, passed, total, feedback, _ = grade_hard(task['fixed_code'], task, explanation)
|
| 78 |
+
if reward < 0.95: # Allow for some explanation variance
|
| 79 |
+
failed.append((task_id, reward, f"{passed}/{total} tests passed"))
|
| 80 |
+
print(f"❌ {task_id}: reward={reward:.2f} ({passed}/{total})")
|
| 81 |
+
else:
|
| 82 |
+
print(f"✅ {task_id}: reward={reward:.2f} ({passed}/{total})")
|
| 83 |
+
except Exception as e:
|
| 84 |
+
failed.append((task_id, 0.0, str(e)))
|
| 85 |
+
print(f"💥 {task_id}: ERROR - {e}")
|
| 86 |
+
|
| 87 |
+
print(f"\n{'='*70}")
|
| 88 |
+
print(f"HARD TASKS: {len(HARD_TASKS) - len(failed)}/{len(HARD_TASKS)} passed")
|
| 89 |
+
if failed:
|
| 90 |
+
print("Failed tasks:")
|
| 91 |
+
for task_id, reward, reason in failed:
|
| 92 |
+
print(f" - {task_id}: {reason}")
|
| 93 |
+
print("="*70)
|
| 94 |
+
return len(failed) == 0
|
| 95 |
+
|
| 96 |
+
if __name__ == "__main__":
|
| 97 |
+
easy_ok = test_all_easy_tasks()
|
| 98 |
+
medium_ok = test_all_medium_tasks()
|
| 99 |
+
hard_ok = test_all_hard_tasks()
|
| 100 |
+
|
| 101 |
+
print("\n" + "="*70)
|
| 102 |
+
print("FINAL SUMMARY")
|
| 103 |
+
print("="*70)
|
| 104 |
+
print(f"Easy tasks: {'✅ PASS' if easy_ok else '❌ FAIL'}")
|
| 105 |
+
print(f"Medium tasks: {'✅ PASS' if medium_ok else '❌ FAIL'}")
|
| 106 |
+
print(f"Hard tasks: {'✅ PASS' if hard_ok else '❌ FAIL'}")
|
| 107 |
+
print(f"\nOverall: {'✅ ALL TASKS WORKING' if (easy_ok and medium_ok and hard_ok) else '❌ SOME TASKS FAILING'}")
|
| 108 |
+
print("="*70)
|
test_debug.py
ADDED
|
@@ -0,0 +1,62 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
#!/usr/bin/env python3
|
| 2 |
+
"""Test script to debug failing tasks"""
|
| 3 |
+
|
| 4 |
+
from server.tasks.task_easy import get_task_by_id
|
| 5 |
+
from server.tasks.task_hard import get_task_by_id as get_hard_task_by_id
|
| 6 |
+
from server.graders.grader_easy import grade_easy
|
| 7 |
+
from server.graders.grader_hard import grade_hard
|
| 8 |
+
|
| 9 |
+
# Test easy_014
|
| 10 |
+
print("="*60)
|
| 11 |
+
print("Testing easy_014")
|
| 12 |
+
print("="*60)
|
| 13 |
+
task_easy = get_task_by_id('easy_014')
|
| 14 |
+
print(f"Task ID: {task_easy['task_id']}")
|
| 15 |
+
print(f"Test cases: {task_easy['test_cases']}")
|
| 16 |
+
|
| 17 |
+
try:
|
| 18 |
+
buggy_code = task_easy['buggy_code']
|
| 19 |
+
reward, passed, total, feedback, results = grade_easy(buggy_code, task_easy)
|
| 20 |
+
print(f"\nBuggy code result: {passed}/{total}, reward={reward}")
|
| 21 |
+
except Exception as e:
|
| 22 |
+
print(f"\nERROR with buggy code: {type(e).__name__}: {e}")
|
| 23 |
+
import traceback
|
| 24 |
+
traceback.print_exc()
|
| 25 |
+
|
| 26 |
+
try:
|
| 27 |
+
fixed_code = task_easy['fixed_code']
|
| 28 |
+
reward, passed, total, feedback, results = grade_easy(fixed_code, task_easy)
|
| 29 |
+
print(f"\nFixed code result: {passed}/{total}, reward={reward}")
|
| 30 |
+
print(f"Feedback:\n{feedback}")
|
| 31 |
+
except Exception as e:
|
| 32 |
+
print(f"\nERROR with fixed code: {type(e).__name__}: {e}")
|
| 33 |
+
import traceback
|
| 34 |
+
traceback.print_exc()
|
| 35 |
+
|
| 36 |
+
# Test hard_010
|
| 37 |
+
print("\n" + "="*60)
|
| 38 |
+
print("Testing hard_010")
|
| 39 |
+
print("="*60)
|
| 40 |
+
task_hard = get_hard_task_by_id('hard_010')
|
| 41 |
+
print(f"Task ID: {task_hard['task_id']}")
|
| 42 |
+
print(f"Test cases: {task_hard['test_cases']}")
|
| 43 |
+
|
| 44 |
+
try:
|
| 45 |
+
buggy_code = task_hard['buggy_code']
|
| 46 |
+
reward, passed, total, feedback, results = grade_hard(buggy_code, task_hard, explanation=None)
|
| 47 |
+
print(f"\nBuggy code result (no explanation): {passed}/{total}, reward={reward}")
|
| 48 |
+
except Exception as e:
|
| 49 |
+
print(f"\nERROR with buggy code: {type(e).__name__}: {e}")
|
| 50 |
+
import traceback
|
| 51 |
+
traceback.print_exc()
|
| 52 |
+
|
| 53 |
+
try:
|
| 54 |
+
fixed_code = task_hard['fixed_code']
|
| 55 |
+
explanation = "The bug was that there was no visited set to track already visited nodes, which caused infinite loops in graphs with cycles."
|
| 56 |
+
reward, passed, total, feedback, results = grade_hard(fixed_code, task_hard, explanation=explanation)
|
| 57 |
+
print(f"\nFixed code result (with explanation): {passed}/{total}, reward={reward}")
|
| 58 |
+
print(f"Feedback:\n{feedback}")
|
| 59 |
+
except Exception as e:
|
| 60 |
+
print(f"\nERROR with fixed code: {type(e).__name__}: {e}")
|
| 61 |
+
import traceback
|
| 62 |
+
traceback.print_exc()
|
test_edge_cases.py
ADDED
|
@@ -0,0 +1,80 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
#!/usr/bin/env python3
|
| 2 |
+
"""Test edge cases that might cause 500 errors"""
|
| 3 |
+
|
| 4 |
+
from server.tasks.task_easy import get_task_by_id
|
| 5 |
+
from server.graders.grader_easy import grade_easy
|
| 6 |
+
|
| 7 |
+
# Test easy_014 with potentially bad code
|
| 8 |
+
task = get_task_by_id('easy_014')
|
| 9 |
+
|
| 10 |
+
print("="*60)
|
| 11 |
+
print("Testing easy_014 with various bad codes")
|
| 12 |
+
print("="*60)
|
| 13 |
+
|
| 14 |
+
# Test 1: Code with infinite loop
|
| 15 |
+
print("\n1. Testing with infinite loop code:")
|
| 16 |
+
bad_code1 = """
|
| 17 |
+
def longest_word_length(sentence):
|
| 18 |
+
while True:
|
| 19 |
+
pass
|
| 20 |
+
"""
|
| 21 |
+
try:
|
| 22 |
+
reward, passed, total, feedback, results = grade_easy(bad_code1, task)
|
| 23 |
+
print(f"Result: {passed}/{total}, reward={reward}")
|
| 24 |
+
print(f"Feedback: {feedback[:200]}...")
|
| 25 |
+
except Exception as e:
|
| 26 |
+
print(f"ERROR: {type(e).__name__}: {e}")
|
| 27 |
+
|
| 28 |
+
# Test 2: Code that doesn't return anything
|
| 29 |
+
print("\n2. Testing with code that returns None:")
|
| 30 |
+
bad_code2 = """
|
| 31 |
+
def longest_word_length(sentence):
|
| 32 |
+
words = sentence.split()
|
| 33 |
+
# forgot to return
|
| 34 |
+
"""
|
| 35 |
+
try:
|
| 36 |
+
reward, passed, total, feedback, results = grade_easy(bad_code2, task)
|
| 37 |
+
print(f"Result: {passed}/{total}, reward={reward}")
|
| 38 |
+
print(f"Feedback: {feedback[:200]}...")
|
| 39 |
+
except Exception as e:
|
| 40 |
+
print(f"ERROR: {type(e).__name__}: {e}")
|
| 41 |
+
|
| 42 |
+
# Test 3: Code with syntax error
|
| 43 |
+
print("\n3. Testing with syntax error:")
|
| 44 |
+
bad_code3 = """
|
| 45 |
+
def longest_word_length(sentence:
|
| 46 |
+
return max(len(w) for w in sentence.split())
|
| 47 |
+
"""
|
| 48 |
+
try:
|
| 49 |
+
reward, passed, total, feedback, results = grade_easy(bad_code3, task)
|
| 50 |
+
print(f"Result: {passed}/{total}, reward={reward}")
|
| 51 |
+
print(f"Feedback: {feedback[:200]}...")
|
| 52 |
+
except Exception as e:
|
| 53 |
+
print(f"ERROR: {type(e).__name__}: {e}")
|
| 54 |
+
|
| 55 |
+
# Test 4: Code with empty string input handling issue
|
| 56 |
+
print("\n4. Testing with code that might fail on empty string:")
|
| 57 |
+
bad_code4 = """
|
| 58 |
+
def longest_word_length(sentence):
|
| 59 |
+
words = sentence.split()
|
| 60 |
+
return max(len(w) for w in words) # This will fail if words is empty!
|
| 61 |
+
"""
|
| 62 |
+
try:
|
| 63 |
+
# Temporarily add an empty string test
|
| 64 |
+
task_copy = task.copy()
|
| 65 |
+
task_copy['test_cases'] = [{"input": "", "expected": 0}]
|
| 66 |
+
reward, passed, total, feedback, results = grade_easy(bad_code4, task_copy)
|
| 67 |
+
print(f"Result: {passed}/{total}, reward={reward}")
|
| 68 |
+
print(f"Feedback: {feedback[:200]}...")
|
| 69 |
+
except Exception as e:
|
| 70 |
+
print(f"ERROR: {type(e).__name__}: {e}")
|
| 71 |
+
|
| 72 |
+
# Test 5: Normal test cases
|
| 73 |
+
print("\n5. Testing with normal test cases:")
|
| 74 |
+
try:
|
| 75 |
+
reward, passed, total, feedback, results = grade_easy(bad_code4, task)
|
| 76 |
+
print(f"Result: {passed}/{total}, reward={reward}")
|
| 77 |
+
for result in results:
|
| 78 |
+
print(f" Test {result['test_id']}: {'✅' if result['passed'] else '❌'} - expected={result['expected']}, got={result['got']}")
|
| 79 |
+
except Exception as e:
|
| 80 |
+
print(f"ERROR: {type(e).__name__}: {e}")
|