Exp 164: upload constraint-propagation-code (Exp 151 artifact)
- README.md +140 -0
- config.json +38 -0
- model.safetensors +3 -0
README.md
ADDED
@@ -0,0 +1,140 @@
---
tags:
- energy-based-model
- ising-model
- constraint-satisfaction
- code-verification
- carnot
license: apache-2.0
---

> **Research Artifact: Not Production-Ready**
>
> This model verifies code implementation responses using structural binary
> features. It achieves AUROC 0.867 on the held-out test set (Exp 89 reference:
> 0.9096). It detects common off-by-one, wrong-initialization, and wrong-logic
> bugs, not arbitrary code errors.

# constraint-propagation-code

**Learned Ising constraint model for code implementation verification.**

Trained via discriminative Contrastive Divergence (Exp 62/89) on 400 verified
(question, correct, wrong) triples for code implementation tasks. Assigns a
scalar energy to binary-encoded code responses; lower energy means the code
is more likely to be a correct implementation.

## What It Is

An Ising Energy-Based Model (EBM):

```
E(x) = -(b^T s + s^T J s),    s = 2x - 1 ∈ {-1, +1}^200
```

The coupling matrix J encodes co-occurrence patterns between structural code
features that distinguish correct from buggy implementations (e.g., "has
`range(1, n + 1)`" is strongly coupled to correctness in sum-range tasks).

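To make the energy formula concrete, here is a minimal NumPy sketch of E(x). The function name `ising_energy` and the 3-feature weights are made up for illustration; they are not the trained 200-dimensional J and b, and the library's own `energy` method may differ in detail.

```python
import numpy as np

def ising_energy(x, J, b):
    """Return E(x) = -(b^T s + s^T J s) for a binary vector x in {0, 1}^d."""
    x = np.asarray(x, dtype=np.float64)
    s = 2.0 * x - 1.0  # map {0, 1} -> {-1, +1}
    return -(b @ s + s @ J @ s)

# Toy 3-feature weights (illustrative only, not the trained model).
J = np.array([[0.0, 0.5, 0.0],
              [0.5, 0.0, 0.0],
              [0.0, 0.0, 0.0]])
b = np.array([0.1, 0.0, -0.2])

# Features 0 and 1 are positively coupled, so having both lowers the energy.
e_agree = ising_energy([1, 1, 0], J, b)  # features 0 and 1 both present
e_clash = ising_energy([1, 0, 0], J, b)  # feature 0 present, feature 1 absent
print(e_agree, e_clash)
```

In this toy setup the vector where the two coupled features agree gets the lower energy, which is the sense in which J rewards co-occurring patterns.
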
## How It Was Trained

**Algorithm:** Discriminative Contrastive Divergence (full-batch)

| Hyperparameter | Value |
|---|---|
| Training pairs | 400 (80% of 500 generated) |
| Feature dimension | 200 binary features |
| Learning rate | 0.01 |
| L1 regularization | 0.0 |
| Weight decay | 0.005 |
| Epochs | 300 |
| Source | Exp 62 (domain CD) + Exp 89 (self-bootstrap) |

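The exact update rule is not spelled out here, so the following is only a sketch of what a full-batch discriminative update could look like under one assumption: the correct answers act as the positive phase and the paired wrong answers stand in for sampled negatives. The function name `train_discriminative_cd`, the zero initialization, and the toy data are hypothetical; the default hyperparameters mirror the table above.

```python
import numpy as np

def train_discriminative_cd(x_correct, x_wrong, lr=0.01,
                            weight_decay=0.005, n_epochs=300):
    """Sketch: lower E on correct answers, raise it on wrong ones.

    Assumes E(x) = -(b^T s + s^T J s) with s = 2x - 1, so descending
    E(correct) - E(wrong) moves b and J along the moment differences.
    """
    s_pos = 2.0 * np.asarray(x_correct, dtype=np.float64) - 1.0
    s_neg = 2.0 * np.asarray(x_wrong, dtype=np.float64) - 1.0
    d = s_pos.shape[1]
    J = np.zeros((d, d))
    b = np.zeros(d)
    for _ in range(n_epochs):
        # Full-batch first and second moments of each phase.
        grad_b = s_pos.mean(axis=0) - s_neg.mean(axis=0)
        grad_J = s_pos.T @ s_pos / len(s_pos) - s_neg.T @ s_neg / len(s_neg)
        np.fill_diagonal(grad_J, 0.0)  # s_i^2 = 1, so the diagonal is constant
        b += lr * (grad_b - weight_decay * b)
        J += lr * (grad_J - weight_decay * J)
    return J, b

# Toy data: feature 0 is set in every correct answer and unset in every wrong one.
J, b = train_discriminative_cd([[1, 0], [1, 1]], [[0, 0], [0, 1]])
print(b)
```
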
**Training data:** Ten code implementation templates with exactly one bug per
wrong implementation:

| Function | Correct pattern | Bug pattern |
|----------|----------------|-------------|
| sum_range | `range(1, n + 1)` | `range(1, n)` (off-by-one) |
| find_max | `result = lst[0]` | `result = 0` (wrong init) |
| is_even | `n % 2 == 0` | `n % 2 == 1` (inverted) |
| factorial | base case `n == 0` | base case `n == 1` (misses 0!) |
| reverse_string | `s[::-1]` | `s[::1]` (no-op) |
| count_vowels | `s.lower()` | missing `lower()` |
| fibonacci | base case `n <= 0` | base case `n == 1` |
| binary_search | `lo = mid + 1` | `lo = mid` (infinite loop) |
| is_palindrome | `s == s[::-1]` | `s == ''.join(sorted(s))` |
| flatten | `result.extend(flatten(item))` | `result.extend(item)` (shallow) |

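As an illustration of how small these single-bug differences are, the sketch below writes out two of the template pairs and runs them. The function bodies are reconstructions from the patterns in the table, not the actual training templates, and the `_correct` / `_buggy` names are hypothetical.

```python
def sum_range_correct(n):
    total = 0
    for i in range(1, n + 1):  # includes n
        total += i
    return total

def sum_range_buggy(n):
    total = 0
    for i in range(1, n):  # off-by-one: stops at n - 1
        total += i
    return total

def find_max_correct(lst):
    result = lst[0]  # start from a real element
    for value in lst:
        if value > result:
            result = value
    return result

def find_max_buggy(lst):
    result = 0  # wrong init: breaks on all-negative input
    for value in lst:
        if value > result:
            result = value
    return result

print(sum_range_correct(5), sum_range_buggy(5))                    # 15 10
print(find_max_correct([-3, -1, -2]), find_max_buggy([-3, -1, -2]))  # -1 0
```

Each pair differs by a single expression, which is the kind of signal a binary structural feature can pick up without running the code.
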
## Benchmark Results

| Metric | This export | Exp 89 reference |
|--------|-------------|-----------------|
| AUROC (test) | **0.8669** | 0.9096 |
| Accuracy (test) | **88.0%** | 88.0% |
| Test set size | 100 | 25 |
| Baseline AUROC | 0.5 | 0.5 |

The AUROC gap relative to Exp 89 reflects the richer features used there
(245-dim pipeline features vs. the 200-dim structural encoding used here).
Accuracy is identical at 88.0%, though the two test sets differ in size
(100 vs. 25 examples).

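For readers who want to see how an AUROC can be derived from energies, here is a small sketch using the standard pairwise definition: the fraction of (correct, wrong) pairs in which the correct answer has the lower energy, counting ties as half. The helper name `auroc_from_energies` and the energies are made up; this is not necessarily how the numbers in the table were computed.

```python
import numpy as np

def auroc_from_energies(e_correct, e_wrong):
    """Pairwise AUROC where lower energy means 'more likely correct'."""
    e_c = np.asarray(e_correct, dtype=np.float64)[:, None]
    e_w = np.asarray(e_wrong, dtype=np.float64)[None, :]
    wins = (e_c < e_w).mean()   # correct answer ranked better than wrong one
    ties = (e_c == e_w).mean()  # ties count as half a win
    return wins + 0.5 * ties

# Made-up energies: 7 of the 9 (correct, wrong) pairs are ordered correctly.
auc = auroc_from_energies([-2.0, -1.0, 0.5], [-1.5, 1.0, 2.0])
print(round(auc, 4))
```

A model that assigns the same energy to everything scores 0.5 under this definition, which matches the baseline row in the table.
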
## Usage

```python
import numpy as np
from carnot.inference.constraint_models import ConstraintPropagationModel
from scripts.export_constraint_models import encode_answer

# Load the exported model from the export directory
model = ConstraintPropagationModel.from_pretrained(
    "exports/constraint-propagation-models/code"
)

# Encode a correct and a buggy response to the same question
question = "Write a function that returns the sum of integers from 1 to n."
correct_code = "def sum_range(n):\n    total = 0\n    for i in range(1, n + 1):\n        total += i\n    return total"
buggy_code = "def sum_range(n):\n    total = 0\n    for i in range(1, n):\n        total += i\n    return total"

x_correct = encode_answer(question, correct_code)
x_buggy = encode_answer(question, buggy_code)

# Lower energy and higher score both indicate a more plausible implementation
print(f"Correct energy: {model.energy(x_correct):.2f}")  # should be lower
print(f"Buggy energy:   {model.energy(x_buggy):.2f}")  # should be higher
print(f"Correct score:  {model.score(x_correct):.3f}")  # should be higher
print(f"Buggy score:    {model.score(x_buggy):.3f}")  # should be lower
```

## Limitations

1. **Template bugs only**: Trained on 10 specific bug patterns. Novel bug types
   (e.g., wrong algorithm choice, logic errors in business rules) are outside
   scope.
2. **Structural features only**: Detects structural signals like
   "has `range(1, n + 1)`" or "has `isinstance`"; it does not execute or parse
   the code.
3. **Lowest AUROC of the three domains**: Code structure is less distinctive
   than arithmetic or logic patterns. AUROC 0.867 vs 1.0 for logic/arithmetic.
4. **No semantic understanding**: Two implementations with the same keywords
   but different logic will have similar energies.

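The second and fourth limitations can be illustrated with a toy encoder. The sketch below is hypothetical: the pattern list and the name `encode_structural` are invented for illustration and are not the real 200-feature `encode_answer` encoder. It only shows how substring-style binary features behave.

```python
import numpy as np

# Hypothetical pattern list; the actual feature set is not described here.
PATTERNS = ["range(1, n + 1)", "range(1, n)", "isinstance", "[::-1]", ".lower()"]

def encode_structural(code):
    """Binary vector: 1 if the pattern occurs as a substring, else 0."""
    return np.array([1 if p in code else 0 for p in PATTERNS], dtype=np.int8)

correct = "for i in range(1, n + 1):\n    total += i"
buggy = "for i in range(1, n):\n    total += i"
print(encode_structural(correct).tolist())  # [1, 0, 0, 0, 0]
print(encode_structural(buggy).tolist())    # [0, 1, 0, 0, 0]

# Same substrings, opposite logic: both map to the same feature vector.
same_a = encode_structural("return s == s[::-1]")
same_b = encode_structural("return s != s[::-1]")
print(np.array_equal(same_a, same_b))       # True
```

Because the last two snippets share every matched pattern, any energy function over these features has to assign them the same value.
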
## Files

| File | Description |
|------|-------------|
| `model.safetensors` | Coupling matrix J (200×200) and bias b (200,) as float32 |
| `config.json` | Training metadata and benchmark results |
| `README.md` | This file |

## Citation

```bibtex
@misc{carnot2026constraint_code,
  title = {Carnot Constraint Propagation Model: Code},
  author = {Carnot-EBM},
  year = {2026},
  url = {https://github.com/ianblenke/carnot}
}
```

## Spec

- REQ-VERIFY-002, REQ-VERIFY-003, FR-11
config.json
ADDED
@@ -0,0 +1,38 @@
{
  "model_type": "ising_constraint_model",
  "domain": "code",
  "feature_dim": 200,
  "carnot_version": "0.1.0",
  "spec": [
    "REQ-VERIFY-002",
    "REQ-VERIFY-003",
    "FR-11"
  ],
  "training": {
    "n_pairs": 500,
    "n_train": 400,
    "n_test": 100,
    "algorithm": "discriminative_cd",
    "lr": 0.01,
    "l1_lambda": 0.0,
    "weight_decay": 0.005,
    "n_epochs": 300,
    "source_experiments": [
      "Exp-62",
      "Exp-89"
    ]
  },
  "benchmark": {
    "auroc_reproduced": 0.8669,
    "accuracy_reproduced": 0.88,
    "auroc_exp89_reference": 0.9096,
    "accuracy_exp89_reference": 0.88,
    "n_test_exp89": 25
  },
  "limitations": [
    "Feature encoder uses binary structural features \u2014 not embeddings.",
    "Only learns from structural patterns, not semantics.",
    "Exp 89 AUROC (0.9096) was for self-bootstrapped pipeline data; this export uses simpler Exp 62-style deterministic encoding.",
    "factual and scheduling domains not included (near-zero AUROC in Exp 89)."
  ]
}
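As a quick sanity check on the metadata above, the following sketch parses a subset of the config and verifies that the split sizes add up and that the AUROC gap matches the figures quoted in the README. The embedded JSON copies only the fields it needs; the variable names are illustrative.

```python
import json

# Subset of config.json, copied from the file above.
CONFIG_SUBSET = """
{
  "feature_dim": 200,
  "training": {"n_pairs": 500, "n_train": 400, "n_test": 100},
  "benchmark": {"auroc_reproduced": 0.8669, "auroc_exp89_reference": 0.9096}
}
"""

cfg = json.loads(CONFIG_SUBSET)
train = cfg["training"]
bench = cfg["benchmark"]

split_ok = train["n_train"] + train["n_test"] == train["n_pairs"]
train_fraction = train["n_train"] / train["n_pairs"]
auroc_gap = round(bench["auroc_exp89_reference"] - bench["auroc_reproduced"], 4)

print(split_ok, train_fraction, auroc_gap)
```
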
model.safetensors
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:35656cb317dc6d5d116bbfbb52536530c720e47f5d4f1cf81ba888dc70363645
size 160944
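The pointer's size is consistent with the tensor shapes listed in the README. The arithmetic below is a sketch that assumes the file holds only J (200×200) and b (200,) as float32 with no padding; a safetensors file starts with an 8-byte header length followed by a JSON header, which would account for the remaining bytes.

```python
FEATURE_DIM = 200
BYTES_PER_FLOAT32 = 4
FILE_SIZE = 160944  # "size" field of the LFS pointer

j_bytes = FEATURE_DIM * FEATURE_DIM * BYTES_PER_FLOAT32  # coupling matrix J
b_bytes = FEATURE_DIM * BYTES_PER_FLOAT32                # bias vector b
payload = j_bytes + b_bytes

overhead = FILE_SIZE - payload    # bytes not explained by tensor data
json_header_bytes = overhead - 8  # minus the 8-byte header-length prefix

print(payload, overhead, json_header_bytes)  # 160800 144 136
```
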