ianblenke's picture
Exp 164: upload constraint-propagation-code (Exp 151 artifact)
646c7cb verified
---
tags:
- energy-based-model
- ising-model
- constraint-satisfaction
- code-verification
- carnot
license: apache-2.0
---
> **Research Artifact β€” Not Production-Ready**
>
> This model verifies code implementation responses using structural binary
> features. It achieves AUROC 0.867 on the held-out test set (Exp 89 reference:
> 0.9096). It detects common off-by-one, wrong-initialization, and wrong-logic
> bugs β€” not arbitrary code errors.
# constraint-propagation-code
**Learned Ising constraint model for code implementation verification.**
Trained via discriminative Contrastive Divergence (Exp 62/89) on 400 verified
(question, correct, wrong) triples for code implementation tasks. Assigns a
scalar energy to binary-encoded code responses β€” lower energy means the code
is more likely to be a correct implementation.
## What It Is
An Ising Energy-Based Model (EBM):
```
E(x) = -(b^T s + s^T J s) s = 2x - 1 ∈ {-1, +1}^200
```
The coupling matrix J encodes co-occurrence patterns between structural code
features that distinguish correct from buggy implementations (e.g., "has
`range(1, n + 1)`" is strongly coupled to correctness in sum-range tasks).
## How It Was Trained
**Algorithm:** Discriminative Contrastive Divergence (full-batch)
| Hyperparameter | Value |
|---|---|
| Training pairs | 400 (80% of 500 generated) |
| Feature dimension | 200 binary features |
| Learning rate | 0.01 |
| L1 regularization | 0.0 |
| Weight decay | 0.005 |
| Epochs | 300 |
| Source | Exp 62 (domain CD) + Exp 89 (self-bootstrap) |
**Training data:** Ten code implementation templates with exactly one bug per
wrong implementation:
| Function | Correct pattern | Bug pattern |
|----------|----------------|-------------|
| sum_range | `range(1, n + 1)` | `range(1, n)` (off-by-one) |
| find_max | `result = lst[0]` | `result = 0` (wrong init) |
| is_even | `n % 2 == 0` | `n % 2 == 1` (inverted) |
| factorial | base case `n == 0` | base case `n == 1` (misses 0!) |
| reverse_string | `s[::-1]` | `s[::1]` (no-op) |
| count_vowels | `s.lower()` | missing `lower()` |
| fibonacci | base case `n <= 0` | base case `n == 1` |
| binary_search | `lo = mid + 1` | `lo = mid` (infinite loop) |
| is_palindrome | `s == s[::-1]` | `s == ''.join(sorted(s))` |
| flatten | `result.extend(flatten(item))` | `result.extend(item)` (shallow) |
## Benchmark Results
| Metric | This export | Exp 89 reference |
|--------|-------------|-----------------|
| AUROC (test) | **0.8669** | 0.9096 |
| Accuracy (test) | **88.0%** | 88.0% |
| Test set size | 100 | 25 |
| Baseline AUROC | 0.5 | 0.5 |
The gap from Exp 89 reflects the richer constraint features (245-dim pipeline
features vs 200-dim structural encoding used here). Accuracy matches perfectly.
## Usage
```python
import numpy as np
from carnot.inference.constraint_models import ConstraintPropagationModel
# Load model
model = ConstraintPropagationModel.from_pretrained(
"exports/constraint-propagation-models/code"
)
# Encode a code response
from scripts.export_constraint_models import encode_answer
question = "Write a function that returns the sum of integers from 1 to n."
correct_code = "def sum_range(n):\n total = 0\n for i in range(1, n + 1):\n total += i\n return total"
buggy_code = "def sum_range(n):\n total = 0\n for i in range(1, n):\n total += i\n return total"
x_correct = encode_answer(question, correct_code)
x_buggy = encode_answer(question, buggy_code)
print(f"Correct energy: {model.energy(x_correct):.2f}") # should be lower
print(f"Buggy energy: {model.energy(x_buggy):.2f}") # should be higher
print(f"Correct score: {model.score(x_correct):.3f}") # should be higher
print(f"Buggy score: {model.score(x_buggy):.3f}") # should be lower
```
## Limitations
1. **Template bugs only**: Trained on 10 specific bug patterns. Novel bug types
(e.g., wrong algorithm choice, logic errors in business rules) are outside
scope.
2. **Structural features only**: Detects structural signals like "has `range(1,
n + 1)`" or "has `isinstance`" β€” does not execute or parse the code.
3. **Lowest AUROC of the three domains**: Code structure is less distinctive
than arithmetic or logic patterns. AUROC 0.867 vs 1.0 for logic/arithmetic.
4. **No semantic understanding**: Two implementations with the same keywords
but different logic will have similar energies.
## Files
| File | Description |
|------|-------------|
| `model.safetensors` | Coupling matrix J (200Γ—200) and bias b (200,) as float32 |
| `config.json` | Training metadata and benchmark results |
| `README.md` | This file |
## Citation
```bibtex
@misc{carnot2026constraint_code,
title = {Carnot Constraint Propagation Model: Code},
author = {Carnot-EBM},
year = {2026},
url = {https://github.com/ianblenke/carnot}
}
```
## Spec
- REQ-VERIFY-002, REQ-VERIFY-003, FR-11