File size: 4,918 Bytes
646c7cb
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
---
tags:
  - energy-based-model
  - ising-model
  - constraint-satisfaction
  - code-verification
  - carnot
license: apache-2.0
---

> **Research Artifact — Not Production-Ready**
>
> This model verifies code implementation responses using structural binary
> features.  It achieves AUROC 0.867 on the held-out test set (Exp 89 reference:
> 0.9096).  It detects common off-by-one, wrong-initialization, and wrong-logic
> bugs — not arbitrary code errors.

# constraint-propagation-code

**Learned Ising constraint model for code implementation verification.**

Trained via discriminative Contrastive Divergence (Exp 62/89) on 400 verified
(question, correct, wrong) triples for code implementation tasks.  Assigns a
scalar energy to binary-encoded code responses — lower energy means the code
is more likely to be a correct implementation.

## What It Is

An Ising Energy-Based Model (EBM):

```
E(x) = -(b^T s + s^T J s)    s = 2x - 1 ∈ {-1, +1}^200
```

The coupling matrix J encodes co-occurrence patterns between structural code
features that distinguish correct from buggy implementations (e.g., "has
`range(1, n + 1)`" is strongly coupled to correctness in sum-range tasks).

## How It Was Trained

**Algorithm:** Discriminative Contrastive Divergence (full-batch)

| Hyperparameter | Value |
|---|---|
| Training pairs | 400 (80% of 500 generated) |
| Feature dimension | 200 binary features |
| Learning rate | 0.01 |
| L1 regularization | 0.0 |
| Weight decay | 0.005 |
| Epochs | 300 |
| Source | Exp 62 (domain CD) + Exp 89 (self-bootstrap) |

**Training data:** Ten code implementation templates with exactly one bug per
wrong implementation:

| Function | Correct pattern | Bug pattern |
|----------|----------------|-------------|
| sum_range | `range(1, n + 1)` | `range(1, n)` (off-by-one) |
| find_max | `result = lst[0]` | `result = 0` (wrong init) |
| is_even | `n % 2 == 0` | `n % 2 == 1` (inverted) |
| factorial | base case `n == 0` | base case `n == 1` (misses 0!) |
| reverse_string | `s[::-1]` | `s[::1]` (no-op) |
| count_vowels | `s.lower()` | missing `lower()` |
| fibonacci | base case `n <= 0` | base case `n == 1` |
| binary_search | `lo = mid + 1` | `lo = mid` (infinite loop) |
| is_palindrome | `s == s[::-1]` | `s == ''.join(sorted(s))` |
| flatten | `result.extend(flatten(item))` | `result.extend(item)` (shallow) |

## Benchmark Results

| Metric | This export | Exp 89 reference |
|--------|-------------|-----------------|
| AUROC (test) | **0.8669** | 0.9096 |
| Accuracy (test) | **88.0%** | 88.0% |
| Test set size | 100 | 25 |
| Baseline AUROC | 0.5 | 0.5 |

The gap from Exp 89 reflects the richer constraint features (245-dim pipeline
features vs 200-dim structural encoding used here).  Accuracy matches perfectly.

## Usage

```python
import numpy as np
from carnot.inference.constraint_models import ConstraintPropagationModel

# Load model
model = ConstraintPropagationModel.from_pretrained(
    "exports/constraint-propagation-models/code"
)

# Encode a code response
from scripts.export_constraint_models import encode_answer
question = "Write a function that returns the sum of integers from 1 to n."
correct_code = "def sum_range(n):\n    total = 0\n    for i in range(1, n + 1):\n        total += i\n    return total"
buggy_code   = "def sum_range(n):\n    total = 0\n    for i in range(1, n):\n        total += i\n    return total"

x_correct = encode_answer(question, correct_code)
x_buggy   = encode_answer(question, buggy_code)

print(f"Correct energy: {model.energy(x_correct):.2f}")  # should be lower
print(f"Buggy energy:   {model.energy(x_buggy):.2f}")    # should be higher
print(f"Correct score:  {model.score(x_correct):.3f}")   # should be higher
print(f"Buggy score:    {model.score(x_buggy):.3f}")     # should be lower
```

## Limitations

1. **Template bugs only**: Trained on 10 specific bug patterns. Novel bug types
   (e.g., wrong algorithm choice, logic errors in business rules) are outside
   scope.
2. **Structural features only**: Detects structural signals like "has `range(1,
   n + 1)`" or "has `isinstance`" — does not execute or parse the code.
3. **Lowest AUROC of the three domains**: Code structure is less distinctive
   than arithmetic or logic patterns. AUROC 0.867 vs 1.0 for logic/arithmetic.
4. **No semantic understanding**: Two implementations with the same keywords
   but different logic will have similar energies.

## Files

| File | Description |
|------|-------------|
| `model.safetensors` | Coupling matrix J (200×200) and bias b (200,) as float32 |
| `config.json` | Training metadata and benchmark results |
| `README.md` | This file |

## Citation

```bibtex
@misc{carnot2026constraint_code,
  title  = {Carnot Constraint Propagation Model: Code},
  author = {Carnot-EBM},
  year   = {2026},
  url    = {https://github.com/ianblenke/carnot}
}
```

## Spec

- REQ-VERIFY-002, REQ-VERIFY-003, FR-11