ianblenke committed on
Commit 646c7cb · verified · 1 Parent(s): 5f8f9dc

Exp 164: upload constraint-propagation-code (Exp 151 artifact)

Files changed (3)
  1. README.md +140 -0
  2. config.json +38 -0
  3. model.safetensors +3 -0
README.md ADDED
@@ -0,0 +1,140 @@
---
tags:
- energy-based-model
- ising-model
- constraint-satisfaction
- code-verification
- carnot
license: apache-2.0
---

> **Research Artifact — Not Production-Ready**
>
> This model verifies code implementation responses using structural binary
> features. It achieves AUROC 0.867 on the held-out test set (Exp 89 reference:
> 0.9096). It detects common off-by-one, wrong-initialization, and wrong-logic
> bugs — not arbitrary code errors.

# constraint-propagation-code

**Learned Ising constraint model for code implementation verification.**

Trained via discriminative Contrastive Divergence (Exp 62/89) on 400 verified
(question, correct, wrong) triples for code implementation tasks. Assigns a
scalar energy to binary-encoded code responses — lower energy means the code
is more likely to be a correct implementation.

## What It Is

An Ising Energy-Based Model (EBM):

```
E(x) = -(b^T s + s^T J s),    where s = 2x - 1 ∈ {-1, +1}^200
```

The coupling matrix J encodes co-occurrence patterns between structural code
features that distinguish correct from buggy implementations (e.g., "has
`range(1, n + 1)`" is strongly coupled to correctness in sum-range tasks).

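As a concrete illustration, the energy above can be computed in a few lines of NumPy. This is a toy sketch with made-up couplings; `ising_energy` is not part of the Carnot API.

```python
import numpy as np

def ising_energy(x: np.ndarray, J: np.ndarray, b: np.ndarray) -> float:
    """E(x) = -(b^T s + s^T J s) with spins s = 2x - 1."""
    s = 2 * x - 1                      # binary {0, 1} -> spins {-1, +1}
    return float(-(b @ s + s @ J @ s))

# Toy 4-feature model; the exported model uses 200 features.
rng = np.random.default_rng(0)
J = rng.normal(scale=0.1, size=(4, 4))
J = (J + J.T) / 2                      # couplings are symmetric
b = rng.normal(scale=0.1, size=4)

x = np.array([1, 0, 1, 1])             # binary feature vector
print(ising_energy(x, J, b))
```

Because the energy is a quadratic form in the spins, a single matrix-vector product per response is all inference costs.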
## How It Was Trained

**Algorithm:** Discriminative Contrastive Divergence (full-batch)

| Hyperparameter | Value |
|---|---|
| Training pairs | 400 (80% of 500 generated) |
| Feature dimension | 200 binary features |
| Learning rate | 0.01 |
| L1 regularization | 0.0 |
| Weight decay | 0.005 |
| Epochs | 300 |
| Source | Exp 62 (domain CD) + Exp 89 (self-bootstrap) |

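The full-batch discriminative update can be sketched as follows. This assumes the simplest pairwise objective (descend the energy on correct encodings, ascend it on wrong ones) and is not the Carnot training code.

```python
import numpy as np

def cd_step(J, b, s_pos, s_neg, lr=0.01, weight_decay=0.005):
    """One full-batch discriminative step on spin-encoded (N, d) pairs.

    For E(s) = -(b^T s + s^T J s), descending E on correct responses
    (s_pos) and ascending it on wrong ones (s_neg) yields these updates.
    """
    db = s_pos.mean(axis=0) - s_neg.mean(axis=0)
    dJ = (s_pos.T @ s_pos - s_neg.T @ s_neg) / len(s_pos)
    b = b + lr * db
    J = (1 - lr * weight_decay) * J + lr * dJ   # L2 weight decay on couplings
    return J, b

# Toy run: positives share a fixed pattern, negatives are random spins.
rng = np.random.default_rng(0)
d = 8
J, b = np.zeros((d, d)), np.zeros(d)
s_pos = np.ones((32, d))
s_neg = rng.choice([-1.0, 1.0], size=(32, d))
for _ in range(50):
    J, b = cd_step(J, b, s_pos, s_neg)
```

After a few dozen steps the shared positive pattern sits well below random spins in energy, which is the separation the verifier relies on.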
**Training data:** Ten code implementation templates with exactly one bug per
wrong implementation:

| Function | Correct pattern | Bug pattern |
|----------|----------------|-------------|
| sum_range | `range(1, n + 1)` | `range(1, n)` (off-by-one) |
| find_max | `result = lst[0]` | `result = 0` (wrong init) |
| is_even | `n % 2 == 0` | `n % 2 == 1` (inverted) |
| factorial | base case `n == 0` | base case `n == 1` (misses 0!) |
| reverse_string | `s[::-1]` | `s[::1]` (no-op) |
| count_vowels | `s.lower()` | missing `lower()` |
| fibonacci | base case `n <= 0` | base case `n == 1` |
| binary_search | `lo = mid + 1` | `lo = mid` (infinite loop) |
| is_palindrome | `s == s[::-1]` | `s == ''.join(sorted(s))` |
| flatten | `result.extend(flatten(item))` | `result.extend(item)` (shallow) |

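Structural encoding of the kind listed above can be illustrated with a toy substring detector. The patterns below are hypothetical examples for illustration only; the shipped 200-feature encoder uses its own feature list.

```python
import numpy as np

# Hypothetical structural patterns; the real encoder's feature list differs.
PATTERNS = [
    "range(1, n + 1)",   # inclusive loop bound (correct sum_range)
    "range(1, n)",       # off-by-one loop bound
    "s[::-1]",           # slice reversal
    "n % 2 == 0",        # even check
    "lo = mid + 1",      # progressing binary search
    "isinstance",        # type dispatch, e.g. in recursive flatten
]

def encode(code: str) -> np.ndarray:
    """1 if the code contains the pattern as a substring, else 0."""
    return np.array([int(p in code) for p in PATTERNS])

correct = "def sum_range(n):\n    return sum(i for i in range(1, n + 1))"
buggy = "def sum_range(n):\n    return sum(i for i in range(1, n))"
```

Note that the correct and buggy variants light up disjoint features, which is exactly what lets the coupling matrix separate them.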
## Benchmark Results

| Metric | This export | Exp 89 reference |
|--------|-------------|-----------------|
| AUROC (test) | **0.8669** | 0.9096 |
| Accuracy (test) | **88.0%** | 88.0% |
| Test set size | 100 | 25 |
| Baseline AUROC | 0.5 | 0.5 |

The AUROC gap relative to Exp 89 reflects its richer constraint features
(245-dim pipeline features vs. the 200-dim structural encoding used here).
Test accuracy matches exactly at 88.0%.

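The AUROC figures above can be reproduced from raw scores with the pairwise comparison formula (a sketch, not the project's evaluation harness):

```python
import numpy as np

def auroc(scores_pos, scores_neg):
    """Probability that a random correct response outscores a random
    wrong one, counting ties as 0.5 (equivalent to rank-based AUROC)."""
    pos = np.asarray(scores_pos, dtype=float)[:, None]
    neg = np.asarray(scores_neg, dtype=float)[None, :]
    return float((pos > neg).mean() + 0.5 * (pos == neg).mean())

# Scores here would be negated energies: correct code should score higher.
print(auroc([0.9, 0.8, 0.7], [0.6, 0.75, 0.2]))
```

A perfect separator scores 1.0, a coin flip 0.5, matching the baseline row in the table.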
## Usage

```python
import numpy as np
from carnot.inference.constraint_models import ConstraintPropagationModel
from scripts.export_constraint_models import encode_answer

# Load model
model = ConstraintPropagationModel.from_pretrained(
    "exports/constraint-propagation-models/code"
)

# Encode a code response
question = "Write a function that returns the sum of integers from 1 to n."
correct_code = (
    "def sum_range(n):\n"
    "    total = 0\n"
    "    for i in range(1, n + 1):\n"
    "        total += i\n"
    "    return total"
)
buggy_code = (
    "def sum_range(n):\n"
    "    total = 0\n"
    "    for i in range(1, n):\n"
    "        total += i\n"
    "    return total"
)

x_correct = encode_answer(question, correct_code)
x_buggy = encode_answer(question, buggy_code)

print(f"Correct energy: {model.energy(x_correct):.2f}")  # should be lower
print(f"Buggy energy: {model.energy(x_buggy):.2f}")      # should be higher
print(f"Correct score: {model.score(x_correct):.3f}")    # should be higher
print(f"Buggy score: {model.score(x_buggy):.3f}")        # should be lower
```

## Limitations

1. **Template bugs only**: Trained on 10 specific bug patterns. Novel bug types
   (e.g., wrong algorithm choice, logic errors in business rules) are outside
   scope.
2. **Structural features only**: Detects structural signals like "has
   `range(1, n + 1)`" or "has `isinstance`" — does not execute or parse the code.
3. **Lowest AUROC of the three domains**: Code structure is less distinctive
   than arithmetic or logic patterns (AUROC 0.867 vs. 1.0 for logic/arithmetic).
4. **No semantic understanding**: Two implementations with the same keywords
   but different logic will have similar energies.

## Files

| File | Description |
|------|-------------|
| `model.safetensors` | Coupling matrix J (200×200) and bias b (200,) as float32 |
| `config.json` | Training metadata and benchmark results |
| `README.md` | This file |

## Citation

```bibtex
@misc{carnot2026constraint_code,
  title  = {Carnot Constraint Propagation Model: Code},
  author = {Carnot-EBM},
  year   = {2026},
  url    = {https://github.com/ianblenke/carnot}
}
```

## Spec

- REQ-VERIFY-002, REQ-VERIFY-003, FR-11
config.json ADDED
@@ -0,0 +1,38 @@
{
  "model_type": "ising_constraint_model",
  "domain": "code",
  "feature_dim": 200,
  "carnot_version": "0.1.0",
  "spec": [
    "REQ-VERIFY-002",
    "REQ-VERIFY-003",
    "FR-11"
  ],
  "training": {
    "n_pairs": 500,
    "n_train": 400,
    "n_test": 100,
    "algorithm": "discriminative_cd",
    "lr": 0.01,
    "l1_lambda": 0.0,
    "weight_decay": 0.005,
    "n_epochs": 300,
    "source_experiments": [
      "Exp-62",
      "Exp-89"
    ]
  },
  "benchmark": {
    "auroc_reproduced": 0.8669,
    "accuracy_reproduced": 0.88,
    "auroc_exp89_reference": 0.9096,
    "accuracy_exp89_reference": 0.88,
    "n_test_exp89": 25
  },
  "limitations": [
    "Feature encoder uses binary structural features \u2014 not embeddings.",
    "Only learns from structural patterns, not semantics.",
    "Exp 89 AUROC (0.9096) was for self-bootstrapped pipeline data; this export uses simpler Exp 62-style deterministic encoding.",
    "Factual and scheduling domains are not included (near-zero AUROC in Exp 89)."
  ]
}
model.safetensors ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:35656cb317dc6d5d116bbfbb52536530c720e47f5d4f1cf81ba888dc70363645
size 160944