Exp 164: upload constraint-propagation-code (Exp 151 artifact)


Files changed (3)

README.md +140 -0
config.json +38 -0
model.safetensors +3 -0

README.md ADDED

	@@ -0,0 +1,140 @@

+---
+tags:
+  - energy-based-model
+  - ising-model
+  - constraint-satisfaction
+  - code-verification
+  - carnot
+license: apache-2.0
+---
+> **Research Artifact — Not Production-Ready**
+>
+> This model verifies code implementation responses using structural binary
+> features.  It achieves AUROC 0.867 on the held-out test set (Exp 89 reference:
+> 0.9096).  It detects common off-by-one, wrong-initialization, and wrong-logic
+> bugs — not arbitrary code errors.
+# constraint-propagation-code
+**Learned Ising constraint model for code implementation verification.**
+Trained via discriminative Contrastive Divergence (Exp 62/89) on 400 verified
+(question, correct, wrong) triples for code implementation tasks.  Assigns a
+scalar energy to binary-encoded code responses — lower energy means the code
+is more likely to be a correct implementation.
+## What It Is
+An Ising Energy-Based Model (EBM):
+```
+E(x) = -(b^T s + s^T J s)    s = 2x - 1 ∈ {-1, +1}^200
+```
+The coupling matrix J encodes co-occurrence patterns between structural code
+features that distinguish correct from buggy implementations (e.g., "has
+`range(1, n + 1)`" is strongly coupled to correctness in sum-range tasks).
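A minimal sketch of the energy formula above in plain Python, assuming `b` is a list of biases and `J` a square list-of-lists coupling matrix. The function name `ising_energy` and the three-feature toy values are illustrative assumptions, not taken from the Carnot codebase or the exported weights.

```python
def ising_energy(x, b, J):
    """Return E(x) = -(b^T s + s^T J s), where s = 2x - 1 maps bits to spins."""
    s = [2 * xi - 1 for xi in x]  # {0, 1} -> {-1, +1}
    n = len(s)
    linear = sum(b[i] * s[i] for i in range(n))
    quadratic = sum(s[i] * J[i][j] * s[j] for i in range(n) for j in range(n))
    return -(linear + quadratic)


# Toy parameters (hypothetical): features 0 and 1 are positively coupled.
b = [0.5, -0.2, 0.0]
J = [
    [0.0, 0.3, 0.0],
    [0.3, 0.0, 0.0],
    [0.0, 0.0, 0.0],
]

aligned = ising_energy([1, 1, 0], b, J)  # features 0 and 1 both present
opposed = ising_energy([1, 0, 0], b, J)  # feature 1 missing
print(aligned < opposed)  # the aligned pattern gets the lower energy
```

With a positive coupling between two features, configurations where both are present (or both absent) receive lower energy than configurations where they disagree.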
+## How It Was Trained
+**Algorithm:** Discriminative Contrastive Divergence (full-batch)
+| Hyperparameter | Value |
+|---|---|
+| Training pairs | 400 (80% of 500 generated) |
+| Feature dimension | 200 binary features |
+| Learning rate | 0.01 |
+| L1 regularization | 0.0 |
+| Weight decay | 0.005 |
+| Epochs | 300 |
+| Source | Exp 62 (domain CD) + Exp 89 (self-bootstrap) |
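The README does not spell out the update rule, so the following is only a sketch of one common form of full-batch discriminative Contrastive Divergence: correct responses act as the positive phase and wrong responses as the negative phase. The function names, the choice to apply weight decay to `J` only, and the toy data are all assumptions; the defaults mirror the hyperparameter table above.

```python
def to_spins(x):
    """Map a binary feature vector to spins in {-1, +1}."""
    return [2 * xi - 1 for xi in x]


def energy(s, b, J):
    n = len(s)
    linear = sum(b[i] * s[i] for i in range(n))
    quadratic = sum(s[i] * J[i][j] * s[j] for i in range(n) for j in range(n))
    return -(linear + quadratic)


def train_discriminative_cd(correct, wrong, lr=0.01, weight_decay=0.005, epochs=300):
    """Full-batch updates that lower E(correct) and raise E(wrong)."""
    n, m = len(correct[0]), len(correct)
    pos = [to_spins(x) for x in correct]
    neg = [to_spins(x) for x in wrong]
    b = [0.0] * n
    J = [[0.0] * n for _ in range(n)]
    for _ in range(epochs):
        for i in range(n):
            # Bias moves toward spins seen in correct samples, away from wrong ones.
            b[i] += lr * sum(p[i] - q[i] for p, q in zip(pos, neg)) / m
            for j in range(n):
                if i == j:
                    continue  # s_i * s_i = 1, so the diagonal only adds a constant
                grad = sum(p[i] * p[j] - q[i] * q[j] for p, q in zip(pos, neg)) / m
                J[i][j] += lr * grad - lr * weight_decay * J[i][j]
    return b, J


# Toy pairs (hypothetical): feature 1 is present only in correct responses.
correct = [[1, 1, 0], [1, 1, 1]]
wrong = [[1, 0, 0], [1, 0, 1]]
b, J = train_discriminative_cd(correct, wrong)
print(energy(to_spins(correct[0]), b, J) < energy(to_spins(wrong[0]), b, J))
```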
+**Training data:** Ten code implementation templates with exactly one bug per
+wrong implementation:
+| Function | Correct pattern | Bug pattern |
+|----------|----------------|-------------|
+| sum_range | `range(1, n + 1)` | `range(1, n)` (off-by-one) |
+| find_max | `result = lst[0]` | `result = 0` (wrong init) |
+| is_even | `n % 2 == 0` | `n % 2 == 1` (inverted) |
+| factorial | base case `n == 0` | base case `n == 1` (fails on `0!`) |
+| reverse_string | `s[::-1]` | `s[::1]` (no-op) |
+| count_vowels | `s.lower()` | missing `lower()` |
+| fibonacci | base case `n <= 0` | base case `n == 1` |
+| binary_search | `lo = mid + 1` | `lo = mid` (infinite loop) |
+| is_palindrome | `s == s[::-1]` | `s == ''.join(sorted(s))` |
+| flatten | `result.extend(flatten(item))` | `result.extend(item)` (shallow) |
+## Benchmark Results
+| Metric | This export | Exp 89 reference |
+|--------|-------------|-----------------|
+| AUROC (test) | **0.8669** | 0.9096 |
+| Accuracy (test) | **88.0%** | 88.0% |
+| Test set size | 100 | 25 |
+| Baseline AUROC | 0.5 | 0.5 |
+The AUROC gap relative to Exp 89 reflects the richer features used there
+(245-dim pipeline features, versus the 200-dim structural encoding used here).
+Accuracy is identical at 88.0%, although the two test sets differ in size
+(100 vs 25).
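For reference, AUROC for an energy model can be read as the probability that a randomly chosen correct response receives lower energy than a randomly chosen wrong one. The sketch below computes that pairwise statistic in plain Python. The function name and toy energies are illustrative, and the README does not state which implementation produced the reported 0.8669.

```python
def auroc_from_energies(correct_energies, wrong_energies):
    """Fraction of (correct, wrong) pairs ranked properly; ties count as half."""
    wins = 0.0
    for e_c in correct_energies:
        for e_w in wrong_energies:
            if e_c < e_w:
                wins += 1.0
            elif e_c == e_w:
                wins += 0.5
    return wins / (len(correct_energies) * len(wrong_energies))


# Toy energies (hypothetical): one correct sample scores worse than one wrong sample.
correct = [-2.0, -1.0]
wrong = [0.0, -1.5]
print(auroc_from_energies(correct, wrong))  # 0.75
```

A value of 0.5 corresponds to the chance baseline listed in the table, and 1.0 means every correct response has lower energy than every wrong one.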
+## Usage
+```python
+from carnot.inference.constraint_models import ConstraintPropagationModel
+from scripts.export_constraint_models import encode_answer
+# Load the exported model from a local directory
+model = ConstraintPropagationModel.from_pretrained(
+    "exports/constraint-propagation-models/code"
+)
+# Encode a correct and a buggy response to the same question
+question = "Write a function that returns the sum of integers from 1 to n."
+correct_code = "def sum_range(n):\n    total = 0\n    for i in range(1, n + 1):\n        total += i\n    return total"
+buggy_code   = "def sum_range(n):\n    total = 0\n    for i in range(1, n):\n        total += i\n    return total"
+x_correct = encode_answer(question, correct_code)
+x_buggy   = encode_answer(question, buggy_code)
+# Lower energy (and higher score) indicates a more likely correct implementation
+print(f"Correct energy: {model.energy(x_correct):.2f}")  # should be lower
+print(f"Buggy energy:   {model.energy(x_buggy):.2f}")    # should be higher
+print(f"Correct score:  {model.score(x_correct):.3f}")   # should be higher
+print(f"Buggy score:    {model.score(x_buggy):.3f}")     # should be lower
+```
+## Limitations
+1. **Template bugs only**: Trained on 10 specific bug patterns. Novel bug types
+   (e.g., wrong algorithm choice, logic errors in business rules) are outside
+   scope.
+2. **Structural features only**: Detects structural signals like "has `range(1,
+   n + 1)`" or "has `isinstance`" — does not execute or parse the code.
+3. **Lowest AUROC of the three domains**: Code structure is less distinctive
+   than arithmetic or logic patterns. This export reaches AUROC 0.867, versus
+   1.0 for the logic and arithmetic domains.
+4. **No semantic understanding**: Two implementations with the same keywords
+   but different logic will have similar energies.
+## Files
+| File | Description |
+|------|-------------|
+| `model.safetensors` | Coupling matrix J (200×200) and bias b (200,) as float32 |
+| `config.json` | Training metadata and benchmark results |
+| `README.md` | This file |
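As a sanity check on the file description above, the float32 payload for a 200×200 matrix plus a 200-vector is 160,800 bytes, which leaves 144 bytes of the 160,944-byte size recorded in the LFS pointer. That remainder is consistent with the safetensors layout of an 8-byte little-endian header length followed by a JSON header. The sketch below reads that header using only the standard library. The helper name is hypothetical, and the tensor names stored in the file are not stated in this commit.

```python
import json
import struct


def read_safetensors_header(path):
    """Return (header_len, header_dict) from a safetensors file.

    Layout: 8-byte little-endian unsigned length, then that many bytes of JSON
    describing each tensor's dtype, shape, and data offsets.
    """
    with open(path, "rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(header_len).decode("utf-8"))
    return header_len, header


FEATURE_DIM = 200
BYTES_PER_FLOAT32 = 4
payload_bytes = (FEATURE_DIM * FEATURE_DIM + FEATURE_DIM) * BYTES_PER_FLOAT32
header_overhead = 160944 - payload_bytes  # file size taken from the LFS pointer
print(payload_bytes, header_overhead)  # 160800 144
```

In practice the tensors themselves would be loaded with the `safetensors` library. This helper is only meant for inspecting which tensor names and shapes a file declares.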
+## Citation
+```bibtex
+@misc{carnot2026constraint_code,
+  title  = {Carnot Constraint Propagation Model: Code},
+  author = {Carnot-EBM},
+  year   = {2026},
+  url    = {https://github.com/ianblenke/carnot}
+}
+```
+## Spec
+- REQ-VERIFY-002, REQ-VERIFY-003, FR-11

config.json ADDED

	@@ -0,0 +1,38 @@

+{
+  "model_type": "ising_constraint_model",
+  "domain": "code",
+  "feature_dim": 200,
+  "carnot_version": "0.1.0",
+  "spec": [
+    "REQ-VERIFY-002",
+    "REQ-VERIFY-003",
+    "FR-11"
+  ],
+  "training": {
+    "n_pairs": 500,
+    "n_train": 400,
+    "n_test": 100,
+    "algorithm": "discriminative_cd",
+    "lr": 0.01,
+    "l1_lambda": 0.0,
+    "weight_decay": 0.005,
+    "n_epochs": 300,
+    "source_experiments": [
+      "Exp-62",
+      "Exp-89"
+    ]
+  },
+  "benchmark": {
+    "auroc_reproduced": 0.8669,
+    "accuracy_reproduced": 0.88,
+    "auroc_exp89_reference": 0.9096,
+    "accuracy_exp89_reference": 0.88,
+    "n_test_exp89": 25
+  },
+  "limitations": [
+    "Feature encoder uses binary structural features \u2014 not embeddings.",
+    "Only learns from structural patterns, not semantics.",
+    "Exp 89 AUROC (0.9096) was for self-bootstrapped pipeline data; this export uses simpler Exp 62-style deterministic encoding.",
+    "factual and scheduling domains not included (near-zero AUROC in Exp 89)."
+  ]
+}
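A small consistency check over the training block of the config above, written as a sketch. The excerpt is copied inline so the snippet runs on its own; in practice the values would be read from `config.json`.

```python
import json

# Excerpt of the fields from config.json that describe the data split.
config_excerpt = """
{
  "feature_dim": 200,
  "training": {"n_pairs": 500, "n_train": 400, "n_test": 100}
}
"""

cfg = json.loads(config_excerpt)
training = cfg["training"]

# The train and test counts should partition the generated pairs.
split_ok = training["n_train"] + training["n_test"] == training["n_pairs"]
train_fraction = training["n_train"] / training["n_pairs"]
print(split_ok, train_fraction)  # True 0.8
```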

model.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:35656cb317dc6d5d116bbfbb52536530c720e47f5d4f1cf81ba888dc70363645
+size 160944