Spaces:

AshwinP
/

compounding-test

Sleeping

apingali Claude Opus 4.7 (1M context) commited on 12 days ago

Commit

e084f76

1 Parent(s): 75a8a07

fix(parser): bump quoted_span limit 200 → 400 chars

Discovered via the SC-005 re-baseline after the trust_remote_code fix
landed on the Space. Phi-4-mini-instruct, when asked for a 5-15 word
quoted_span, consistently generates 200-220 char spans:

⚠ The model returned malformed output.
Detail: Score decreasing_marginal_cost.quoted_span must be ≤200 chars,
got 215

Two ways to fix:
(a) tighten the prompt to enforce a hard word/char limit
(b) loosen the parser ceiling

Picked (b). The 200-char floor was always a soft constraint to prevent
runaway model output, not a hard contract. Bumping to 400 chars gives
generous headroom for typical model output while still catching
pathological cases (a single 1000-char span would still fail loud).

Changes (symmetric Python + TS, per the parser-parity invariant):
gradio-apps/compounding-test/app.py:
parse_response — quoted_span ceiling 200 → 400
src/lib/diagnose-parser.ts:
parseResponse — same change

Test updates:
test_diagnose.py:
test_quoted_span_over_200_chars_raises → test_quoted_span_over_400_chars_raises
+ new test_quoted_span_up_to_400_chars_accepted (confirms 250-char
typical-output passes)
diagnose-parser.test.ts:
'rejects quoted_span over 200 chars' → 'rejects quoted_span over 400 chars'
+ new 'accepts quoted_span up to 400 chars (typical Phi-4-mini output)'

Verified:
pytest test_diagnose.py 64 passed, 1 skipped (+1 from new acceptance test)
vitest diagnose-parser 22 passed (+1 from new acceptance test)
npm run build clean

The Space currently runs the OLD ceiling — deploy.sh push to the Space
will land this fix. Visitor whose first attempt hit the malformed-output
error can retry after the rebuild (~3-5 min).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Files changed (2) hide show

app.py +2 -2
test_diagnose.py +17 -2

app.py CHANGED Viewed

@@ -148,9 +148,9 @@ def parse_response(raw: str) -> Response:
             )
         if not isinstance(s["quoted_span"], str) or not s["quoted_span"]:
             raise MalformedResponseError(f"Score {key}.quoted_span must be a non-empty string")
-        if len(s["quoted_span"]) > 200:
             raise MalformedResponseError(
-                f"Score {key}.quoted_span must be ≤200 chars, got {len(s['quoted_span'])}"
             )
         scores[key] = Score(
             score=s["score"], rationale=s["rationale"], quoted_span=s["quoted_span"]

             )
         if not isinstance(s["quoted_span"], str) or not s["quoted_span"]:
             raise MalformedResponseError(f"Score {key}.quoted_span must be a non-empty string")
+        if len(s["quoted_span"]) > 400:
             raise MalformedResponseError(
+                f"Score {key}.quoted_span must be ≤400 chars, got {len(s['quoted_span'])}"
             )
         scores[key] = Score(
             score=s["score"], rationale=s["rationale"], quoted_span=s["quoted_span"]

test_diagnose.py CHANGED Viewed

@@ -165,8 +165,12 @@ def test_empty_quoted_span_raises():
         parse_response(raw)
-def test_quoted_span_over_200_chars_raises():
-    over_limit = "x" * 201
     raw = VALID_JSON_BLOCK.replace(
         '"quoted_span": "claim outcomes Progressive observes directly"',
         f'"quoted_span": "{over_limit}"',
@@ -175,6 +179,17 @@ def test_quoted_span_over_200_chars_raises():
         parse_response(raw)
 # --- Tolerance: forward-compat and whitespace ------------------------------

         parse_response(raw)
+def test_quoted_span_over_400_chars_raises():
+    """The 400-char limit is a generous ceiling — Phi-4-mini consistently
+    generates ~200-220 char quoted_spans when asked for 5-15 words, so
+    we bumped from 200 to 400 to accommodate normal model output without
+    losing the runaway-output guard."""
+    over_limit = "x" * 401
     raw = VALID_JSON_BLOCK.replace(
         '"quoted_span": "claim outcomes Progressive observes directly"',
         f'"quoted_span": "{over_limit}"',
         parse_response(raw)
+def test_quoted_span_up_to_400_chars_accepted():
+    """Confirms the new ceiling lets typical Phi-4-mini output through."""
+    at_limit = "x" * 250  # well above the prior 200-char cap
+    raw = VALID_JSON_BLOCK.replace(
+        '"quoted_span": "claim outcomes Progressive observes directly"',
+        f'"quoted_span": "{at_limit}"',
+    )
+    r = parse_response(raw)
+    assert len(r.scores["proprietary_data"].quoted_span) == 250
 # --- Tolerance: forward-compat and whitespace ------------------------------