echoboi Claude Sonnet 4.6 committed on
Commit
77b9e25
·
1 Parent(s): 9e4cfa3

Strip docstrings from description length (parsimony scoring)


strip_comments() now does a two-pass strip:
1. AST pass: removes module/class/function docstring nodes
(Expr(Constant(str)) in first-statement position)
2. Tokenize pass: removes # inline comments

Before: agent code with large docstrings was penalised vs clean code
even when the actual algorithm was identical.
After: stripped_code_length(code_with_docs) == stripped_code_length(code_clean)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
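The two-pass strip described in this commit can be sketched as follows. This is an illustrative stand-in, not the committed implementation: the function name `strip_docstrings_and_comments` and the sample inputs are made up for the demo, and `ast.unparse` requires Python 3.9+.

```python
import ast
import io
import tokenize

def strip_docstrings_and_comments(code: str) -> str:
    # Pass 1: parse to an AST and drop any docstring node, i.e. a
    # string-constant expression in first-statement position of a
    # module, class, or (async) function body.
    try:
        tree = ast.parse(code)
        for node in ast.walk(tree):
            if isinstance(node, (ast.Module, ast.ClassDef,
                                 ast.FunctionDef, ast.AsyncFunctionDef)):
                body = node.body
                if (body and isinstance(body[0], ast.Expr)
                        and isinstance(body[0].value, ast.Constant)
                        and isinstance(body[0].value.value, str)):
                    body.pop(0)
                    if not body:
                        body.append(ast.Pass())  # a body may not be empty
        code = ast.unparse(tree)  # unparse also discards "#" comments
    except SyntaxError:
        pass  # unparseable input: fall through to the tokenize pass
    # Pass 2: drop "#" comment tokens (this mainly matters on the
    # fallback path, since ast.unparse already removed them above).
    try:
        toks = [t for t in tokenize.generate_tokens(io.StringIO(code).readline)
                if t.type != tokenize.COMMENT]
        code = tokenize.untokenize(toks)
    except (tokenize.TokenError, IndentationError):
        pass
    # Finally, drop blank lines and trailing whitespace.
    return "\n".join(ln.rstrip() for ln in code.splitlines() if ln.strip())

# Hypothetical inputs mirroring the commit's before/after claim:
with_docs = 'def f(x):\n    """Add one."""\n    return x + 1  # inline note\n'
clean = 'def f(x):\n    return x + 1\n'
```

With these inputs, both sources strip to the same text, which is the parsimony-scoring property the commit message claims: documented and undocumented versions of the same algorithm get the same description length.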

Files changed (1)
  1. discovery_env/scoring.py +34 -2
discovery_env/scoring.py CHANGED
@@ -101,10 +101,42 @@ def functional_accuracy(
 
 
 def strip_comments(code: str) -> str:
-    """Strip comments from Python source code.
-
-    Returns the code with comments removed, blank lines stripped.
+    """Strip comments AND docstrings from Python source code.
+
+    Two-pass approach:
+      1. AST pass — removes module/class/function docstrings
+         (string-expression nodes in first-statement position).
+         These are the triple-quoted blocks that bloat agent code
+         without contributing algorithmic complexity.
+      2. Tokenize pass — removes remaining # inline comments.
+
+    Blank lines are also removed. The result is a fair proxy for
+    the algorithmic description length used in parsimony scoring.
     """
+    import ast as _ast
+
+    # ── Pass 1: remove docstrings via AST ─────────────────────────
+    try:
+        tree = _ast.parse(code)
+        for node in _ast.walk(tree):
+            if not isinstance(node, (_ast.FunctionDef, _ast.AsyncFunctionDef,
+                                     _ast.ClassDef, _ast.Module)):
+                continue
+            if not node.body:
+                continue
+            first = node.body[0]
+            if (isinstance(first, _ast.Expr) and
+                    isinstance(first.value, _ast.Constant) and
+                    isinstance(first.value.value, str)):
+                node.body.pop(0)
+                # A body can't be empty — insert pass if needed
+                if not node.body:
+                    node.body.append(_ast.Pass())
+        code = _ast.unparse(tree)
+    except Exception:
+        pass  # malformed code — fall through to tokenize-only
+
+    # ── Pass 2: remove # comments via tokenize ────────────────────
     try:
         tokens = tokenize.generate_tokens(io.StringIO(code).readline)
         result = []