Nitish committed on
Commit c1316d3 · Parent(s): f44f429
docs: finalize submission checklist and sign-off
OPENENV_SUBMISSION_CHECKLIST.md
CHANGED
@@ -102,6 +102,7 @@ TASK=hard python inference.py # expected: score < 0.8
 - [x] Easy task baseline score is ≥ 0.6.
 - [x] Medium task baseline score is meaningfully lower than easy (at least 0.15 gap).
 - [x] Hard task baseline score is < 0.8 (if it's ≥ 0.8, make it harder).
+(Easy: 0.883 | Medium: 0.500 | Hard: 0.512)
 
 ---
 
@@ -306,9 +307,9 @@ TASK=hard python inference.py # expected: score < 0.8
 
 | Task | Difficulty | Model | Score | Steps | Notes |
 |------|-----------|-------|-------|-------|-------|
-| python-off-by-one | easy | Llama-3.3-70B-Instruct | 0.
-| js-
-| python-
+| python-off-by-one | easy | Llama-3.3-70B-Instruct | 0.883 | 2 | |
+| js-idor-auth | medium | Llama-3.3-70B-Instruct | 0.500 | 2 | |
+| python-pickle-deserialization | hard | Llama-3.3-70B-Instruct | 0.512 | 2 | |
 
 - [x] The table is filled in with real numbers from a completed inference run.
 - [x] The easy task score is ≥ 0.6.
@@ -423,7 +424,7 @@ done
 
 Expected: Three complete runs, each emitting `[START]`, N×`[STEP]`, and `[END]` with no Python exceptions.
 
-- [x] ✓ PASSED — Easy score: 0.
+- [x] ✓ PASSED — Easy score: 0.883 Medium score: 0.500 Hard score: 0.512
 
 ### Step 5 — Verify log format
 
@@ -514,12 +515,12 @@ When all items above are checked, fill in this block and attach it to your submi
 Environment Name: Code Security Review
 HF Space URL: https://huggingface.co/spaces/inmodel/code-review-env
 Baseline Scores:
-- Easy task: 0.
-- Medium task: 0.
-- Hard task: 0.
+- Easy task: 0.883 (task name: python-off-by-one)
+- Medium task: 0.500 (task name: js-idor-auth)
+- Hard task: 0.512 (task name: python-pickle-deserialization)
 Inference runtime: < 1 minute
-Docker image size:
-Submitted by:
+Docker image size: ~300 MB
+Submitted by: Inmodel Labs
 Date: 2026-04-08
 
 I confirm all 18 disqualifying items are checked [yes/no]: yes
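
Note on the baseline thresholds: the three checklist items at lines 102-104 (easy ≥ 0.6, medium at least 0.15 below easy, hard < 0.8) can be checked mechanically against the scores reported in this commit. The following is a minimal sketch, not part of the repository; the names `BaselineScores` and `check_baselines` are hypothetical and introduced only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BaselineScores:
    """Hypothetical container for the three baseline scores."""
    easy: float
    medium: float
    hard: float


def check_baselines(scores, min_easy=0.6, min_gap=0.15, max_hard=0.8):
    """Evaluate the three threshold rules quoted from the checklist.

    Returns a dict mapping each rule description to True (pass) or False (fail).
    """
    return {
        "easy >= 0.6": scores.easy >= min_easy,
        "medium at least 0.15 below easy": (scores.easy - scores.medium) >= min_gap,
        "hard < 0.8": scores.hard < max_hard,
    }


# Scores as reported in this commit's sign-off block.
reported = BaselineScores(easy=0.883, medium=0.500, hard=0.512)
results = check_baselines(reported)

for rule, passed in results.items():
    print(f"{'PASS' if passed else 'FAIL'}  {rule}")
```

With the reported values, all three rules evaluate to true. The thresholds are keyword arguments so they can be adjusted if the checklist changes.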