Spaces:
Running on Zero
Running on Zero
File size: 7,159 Bytes
5ea8775 c4f3819 5ea8775 dc0c3a8 c4f3819 5ea8775 c4f3819 5ff9843 c4f3819 9f03f0f 5ea8775 c4f3819 5ff9843 c4f3819 4c69128 c4f3819 4c69128 c4f3819 4c69128 c4f3819 4c69128 c4f3819 4c69128 c4f3819 5ff9843 c4f3819 5ff9843 c4f3819 5ff9843 c4f3819 5ff9843 c4f3819 fd59e50 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 | ---
title: TestForge
emoji: "🔨"
colorFrom: blue
colorTo: indigo
sdk: gradio
sdk_version: 6.18.0
python_version: "3.12"
app_file: app.py
pinned: false
license: mit
tags:
- build-small-hackathon
- track:backyard-ai
- badge:tiny-titan
- badge:best-agent
- badge:best-demo
- testing
- agent
- python
- pytest
- developer-tools
- track:backyard
- sponsor:openbmb
- achievement:offgrid
short_description: Freeze legacy Python behavior before refactoring
---
# 🔨 TestForge — Characterization tests for legacy Python code
**Track:** Backyard AI · **Bonus badges:** Tiny Titan, Best Agent, Best Demo
Untested legacy Python code is scary to change because you do not know what
behavior you are about to break. **TestForge** points at a small Python repo,
discovers its public functions, executes the real code on representative inputs,
and generates a green pytest suite that locks in current behavior. Then it
proves the suite is not fake confidence by mutating the code and showing a
headline **mutation score**.
> Built as **a coding agent built by a coding agent**. The goal is not "AI
> writes tests." The goal is "AI creates a runnable, inspectable, regression-
> catching safety net before you refactor."
---
## Why this is trustworthy: the code verifies itself
The core design choice is **capture-then-assert**.
For every discovered function, TestForge chooses a few inputs, **runs the real
function**, records the actual return value or exception, and only then emits
pytest assertions.
```python
def test_apply_discount_case_1():
assert apply_discount(100.0, 10.0) == 90.0
def test_with_tax_case_4():
with pytest.raises(ValueError):
with_tax(100.0, -1.0)
```
This means:
- every generated test is green on the current code by construction
- every test calls a real public function, so it cannot collapse into `assert True`
- the expected value comes from execution, not model guesswork
Then TestForge applies a fixed catalog of source mutations and reruns the same
generated suite. The result is a visible quality metric like:
```text
Mutation score: 7/9 behavior changes detected
```
That turns "the AI made tests" into "the AI made tests that demonstrably catch
regressions."
---
## Models
The core pipeline is **deterministic and model-free** so the demo works with
**no model and no GPU**: every test comes from executing the real code on
type-hint-driven inputs.
| Job | Current implementation | Fallback |
|---|---|---|
| Input generation | Type-hint-driven deterministic cases | Fixed mixed literals |
| Test expectation generation | Real code execution | None needed |
| Quality proof | Hand-rolled mutation scoring | None needed |
| Edge-case assist (optional) | `Qwen/Qwen2.5-Coder-0.5B-Instruct` via `@spaces.GPU` proposes extra argument tuples | Skipped if the model/GPU is unavailable; deterministic cases still run |
Checking **"Use small coder model"** sends each function's signature, docstring,
and already-tried inputs to a small (0.5B parameter) coder model running on
ZeroGPU, which proposes a couple of extra edge-case argument tuples (empty,
negative, boundary values, ...). Those tuples are **not trusted directly** --
they go through the same `capture`-then-assert path as the deterministic cases
(`generator.capture`), so the recorded expectation always comes from running
the real code. A bad or redundant suggestion just becomes a redundant (still
correct) test; it can never make a test pass on model guesswork alone.
This keeps the demo fast, inspectable, and stable under hackathon time
pressure while satisfying the **ZeroGPU** requirement via
`model_suggest.suggest_extra_inputs`.
---
## How it works
```text
Pick bundled legacy repo
-> Analyze : AST discovery of public functions and signatures
-> Generate : choose deterministic inputs and capture real outputs
-> Render : write pytest characterization tests
-> Run : execute pytest in a subprocess
-> Score : apply deterministic mutations and count killed mutants
-> Export : produce a PR-style patch adding tests/
```
## Run locally
```bash
python -m pip install -r requirements.txt
python -m pytest -q
python app.py
```
Then click **Forge Tests** to generate a full suite for the bundled sample repo,
inspect the generated tests, and see the mutation score. Click **Inject a
regression** to watch the same suite turn red on a controlled behavior change.
## Tests
```bash
python -m pytest -q
```
The project's own test suite currently covers:
- analyzer discovery and arity checks
- capture of returned values and raised exceptions
- deterministic suite generation staying green
- runner parsing for passing and failing suites
- deterministic mutation scoring
- PR patch export
- forge pipeline smoke path
- injected regression causing exactly one failure
- model-suggestion parsing (arity-matched argument tuples only)
## Project layout
```text
app.py Gradio UI + orchestration
agent.py forge pipeline for the bundled sample repo
analyzer.py AST discovery of public functions and methods
generator.py deterministic inputs, capture, and pytest rendering
runner.py subprocess pytest execution and parsing
mutator.py hand-rolled mutation catalog and mutation score
patch.py PR-style unified diff export
inject.py controlled one-click demo regression
model_suggest.py optional @spaces.GPU small-model edge-case assist
samples/legacy_repo/ bundled untested Python target repo
tests/test_testforge.py
```
## Demo arc
1. Click **Forge Tests**
2. Watch TestForge analyze the sample repo and generate tests from real execution
3. See a green pytest result and the mutation score
4. Click **Inject a regression**
5. Watch the same suite fail on exactly one controlled change
6. Download `testforge.patch` as a PR-style artifact
## Build log
This repo was built in milestone commits:
- `Add legacy sample and AST analyzer`
- `Add deterministic capture runner`
- `Add deterministic mutation scoring`
- `Add forge app orchestration`
- `Document hackathon submission`
The most important product decision was keeping the **source of truth in the
code itself**, not in the model. Each milestone built the tooling and loop
around that principle.
## Privacy
The bundled target repo is synthetic and local. The deterministic demo path
(default) uses no external model, no database, and no persistent user storage.
Checking "Use small coder model" downloads `Qwen/Qwen2.5-Coder-0.5B-Instruct`
from the Hugging Face Hub and runs it on ZeroGPU; no other data leaves the Space.
---
## Submission links
- 🤗 **Live Space:** https://huggingface.co/spaces/build-small-hackathon/TestForge
- 🎥 **Demo video:** https://drive.google.com/file/d/1Zx1EBmk_iEMxSOV6v867cDa7WwpKZZyB/view?usp=sharing
- 📣 **Social post:** https://www.linkedin.com/posts/aman-maurya-2a394924b_buildsmallhackathon-openaicodex-python-share-7472383067186421760-Jf0y/?utm_source=share&utm_medium=member_desktop&rcm=ACoAAD3l3lsBvHlGmHXJP3WiWP5GwQFJQ2g9QZI
|