perceptron01

Update README.md

9f03f0f verified 12 days ago

metadata

title: TestForge
emoji: 🔨
colorFrom: blue
colorTo: indigo
sdk: gradio
sdk_version: 6.18.0
python_version: '3.12'
app_file: app.py
pinned: false
license: mit
tags:
  - build-small-hackathon
  - track:backyard-ai
  - badge:tiny-titan
  - badge:best-agent
  - badge:best-demo
  - testing
  - agent
  - python
  - pytest
  - developer-tools
  - track:backyard
  - sponsor:openbmb
  - achievement:offgrid
short_description: Freeze legacy Python behavior before refactoring

🔨 TestForge — Characterization tests for legacy Python code

Track: Backyard AI · Bonus badges: Tiny Titan, Best Agent, Best Demo

Untested legacy Python code is scary to change because you do not know what behavior you are about to break. Point TestForge at a small Python repo and it discovers the public functions, executes the real code on representative inputs, and generates a green pytest suite that locks in current behavior. It then shows that the suite is not fake confidence by mutating the code, rerunning the suite, and reporting a headline mutation score.

TestForge is a coding agent that was itself built by a coding agent. The goal is not "AI writes tests." The goal is "AI creates a runnable, inspectable, regression-catching safety net before you refactor."

Why this is trustworthy: the code verifies itself

The core design choice is capture-then-assert.

For every discovered function, TestForge chooses a few inputs, runs the real function, records the actual return value or exception, and only then emits pytest assertions.

def test_apply_discount_case_1():
    assert apply_discount(100.0, 10.0) == 90.0

def test_with_tax_case_4():
    with pytest.raises(ValueError):
        with_tax(100.0, -1.0)

This means:

every generated test is green on the current code by construction
every test calls a real public function, so it cannot collapse into assert True
the expected value comes from execution, not model guesswork

Then TestForge applies a fixed catalog of source mutations and reruns the same generated suite. The result is a visible quality metric like:

Mutation score: 7/9 behavior changes detected

That turns "the AI made tests" into "the AI made tests that demonstrably catch regressions."
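A rough sketch of how such a score can be computed is shown here. It assumes a text-substitution catalog and an in-process stand-in for the test suite; the actual mutator.py catalog and the subprocess rerun are not reproduced, and every name in the block is hypothetical.

```python
from typing import Dict, List, Tuple

# Stand-in legacy source; the bundled sample repo is not reproduced here.
SOURCE = '''
def apply_discount(price, percent):
    if percent < 0:
        raise ValueError("negative percent")
    return price - price * percent / 100
'''

# Hypothetical fixed catalog: (description, original text, mutated text).
MUTATIONS: List[Tuple[str, str, str]] = [
    ("minus -> plus", "price - price", "price + price"),
    ("times -> divide", "price * percent", "price / percent"),
    ("< -> <=", "percent < 0", "percent <= 0"),
]


def load(source: str) -> Dict[str, object]:
    """Execute source text and return the resulting module-like namespace."""
    namespace: Dict[str, object] = {}
    exec(source, namespace)
    return namespace


def suite_passes(namespace: Dict[str, object]) -> bool:
    """Stand-in for rerunning the generated pytest suite against a namespace."""
    f = namespace["apply_discount"]
    try:
        assert f(100.0, 10.0) == 90.0
        assert f(200.0, 50.0) == 100.0
    except Exception:
        return False
    return True


def mutation_score(source: str) -> Tuple[int, int]:
    """Return (killed, total) over every catalog entry that applies to source."""
    killed = total = 0
    for _description, old, new in MUTATIONS:
        if old not in source:
            continue  # this mutation does not apply to the file
        total += 1
        mutant = load(source.replace(old, new, 1))
        if not suite_passes(mutant):
            killed += 1  # the suite noticed the behavior change
    return killed, total


if __name__ == "__main__":
    killed, total = mutation_score(SOURCE)
    print(f"Mutation score: {killed}/{total} behavior changes detected")
```

With this stand-in suite the boundary mutation (`<` to `<=`) survives because no captured case uses a percent of zero. A surviving mutant like that is the useful signal: it points at an input the suite should add.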

Models

The core pipeline is deterministic and model-free so the demo works with no model and no GPU: every test comes from executing the real code on type-hint-driven inputs.

Job	Current implementation	Fallback
Input generation	Type-hint-driven deterministic cases	Fixed mixed literals
Test expectation generation	Real code execution	None needed
Quality proof	Hand-rolled mutation scoring	None needed
Edge-case assist (optional)	Qwen/Qwen2.5-Coder-0.5B-Instruct via @spaces.GPU proposes extra argument tuples	Skipped if the model/GPU is unavailable; deterministic cases still run
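As an illustration of the "Input generation" row, one way to derive deterministic cases from type hints is sketched here. The SAMPLES table, the FALLBACK literals, and the candidate_inputs name are assumptions for the example; the project's real value tables may differ.

```python
import inspect
from typing import Any, Callable, Dict, List, Tuple, get_type_hints

# Hypothetical per-type sample values.
SAMPLES: Dict[type, List[Any]] = {
    int: [0, 1, -1],
    float: [0.0, 100.0, -1.0],
    str: ["", "abc"],
    bool: [True, False],
}
# Used when a parameter has no usable hint ("fixed mixed literals").
FALLBACK: List[Any] = [0, "", None]


def candidate_inputs(func: Callable[..., Any], limit: int = 3) -> List[Tuple[Any, ...]]:
    """Build up to `limit` argument tuples, one sample column per parameter."""
    hints = get_type_hints(func)
    columns = [
        SAMPLES.get(hints.get(name), FALLBACK)
        for name in inspect.signature(func).parameters
    ]
    # Walk the columns in lockstep, wrapping short columns, so the same
    # function always yields the same cases in the same order.
    return [tuple(col[i % len(col)] for col in columns) for i in range(limit)]


# Stand-in for a legacy function with type hints.
def with_tax(amount: float, rate: float) -> float:
    return amount * (1 + rate)


if __name__ == "__main__":
    print(candidate_inputs(with_tax))
```

No randomness is involved, so repeated runs produce identical suites. The sketch ignores defaults, keyword-only parameters, and `*args`, which a fuller implementation would need to handle.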

Checking "Use small coder model" sends each function's signature, docstring, and already-tried inputs to a small (0.5B parameter) coder model running on ZeroGPU, which proposes a couple of extra edge-case argument tuples (empty, negative, boundary values, ...). Those tuples are not trusted directly: they go through the same capture-then-assert path as the deterministic cases (generator.capture), so the recorded expectation always comes from running the real code. A bad or redundant suggestion just becomes a redundant (still correct) test; it can never make a test pass on model guesswork alone.

This keeps the demo fast, inspectable, and stable under hackathon time pressure while satisfying the ZeroGPU requirement via model_suggest.suggest_extra_inputs.
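Filtering model output down to usable argument tuples might look like the sketch below. It assumes the model replies with one Python tuple literal per line, which is a guess about the prompt format; parse_suggestions is a hypothetical helper, not the actual model_suggest.suggest_extra_inputs.

```python
import ast
from typing import Iterable, List


def parse_suggestions(text: str, arity: int, already_tried: Iterable[tuple]) -> List[tuple]:
    """Keep only new tuple literals whose length matches the function's arity."""
    seen = {repr(args) for args in already_tried}
    accepted: List[tuple] = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        try:
            # literal_eval accepts literals only; it never calls functions.
            value = ast.literal_eval(line)
        except Exception:
            continue  # prose, code, or malformed output is dropped
        if not isinstance(value, tuple) or len(value) != arity:
            continue
        if repr(value) in seen:
            continue  # already covered by an earlier case
        seen.add(repr(value))
        accepted.append(value)
    return accepted


if __name__ == "__main__":
    reply = "(0.0, 0.0)\n(100.0, 10.0)\n(1, 2, 3)\nnot a tuple\n(100.0, -5.0)"
    print(parse_suggestions(reply, arity=2, already_tried=[(100.0, 10.0)]))
```

Anything that survives this filter is still only an input. The expected value is recorded afterwards by running the real function, exactly as for the deterministic cases.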

How it works

Pick bundled legacy repo
  -> Analyze        : AST discovery of public functions and signatures
  -> Generate       : choose deterministic inputs and capture real outputs
  -> Render         : write pytest characterization tests
  -> Run            : execute pytest in a subprocess
  -> Score          : apply deterministic mutations and count killed mutants
  -> Export         : produce a PR-style patch adding tests/
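The Analyze step can be approximated with the standard ast module. The sketch below only handles top-level functions and treats a leading underscore as "private"; the function name and return shape are assumptions, and analyzer.py may apply different rules (it also covers methods).

```python
import ast
from typing import List, Tuple


def discover_public_functions(source: str) -> List[Tuple[str, List[str]]]:
    """Return (name, positional parameter names) for each public top-level function."""
    tree = ast.parse(source)
    found: List[Tuple[str, List[str]]] = []
    for node in tree.body:  # top-level statements only; the code is never executed
        if isinstance(node, ast.FunctionDef) and not node.name.startswith("_"):
            params = [a.arg for a in node.args.posonlyargs + node.args.args]
            found.append((node.name, params))
    return found


if __name__ == "__main__":
    sample = (
        "def apply_discount(price, percent):\n"
        "    return price - price * percent / 100\n"
        "\n"
        "def _round_cents(value):\n"
        "    return round(value, 2)\n"
    )
    print(discover_public_functions(sample))
```

Parsing instead of importing means discovery has no side effects, and the parameter count gives the arity that later steps use to build and validate argument tuples.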

Run locally

python -m pip install -r requirements.txt
python -m pytest -q
python app.py

Then click Forge Tests to generate a full suite for the bundled sample repo, inspect the generated tests, and see the mutation score. Click Inject a regression to watch the same suite turn red on a controlled behavior change.

Tests

python -m pytest -q

The project's own test suite currently covers:

analyzer discovery and arity checks
capture of returned values and raised exceptions
deterministic suite generation staying green
runner parsing for passing and failing suites
deterministic mutation scoring
PR patch export
forge pipeline smoke path
injected regression causing exactly one failure
model-suggestion parsing (arity-matched argument tuples only)
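For the runner item in particular, a minimal sketch of subprocess execution plus summary parsing is given here. The function names and the returned dictionary shape are hypothetical, and the regex only targets the final summary line that pytest prints in quiet mode.

```python
import re
import subprocess
import sys
from typing import Dict


def parse_summary(output: str) -> Dict[str, int]:
    """Extract pass/fail/error counts from the last non-empty line of pytest output."""
    counts = {"passed": 0, "failed": 0, "errors": 0}
    lines = [line for line in output.splitlines() if line.strip()]
    if not lines:
        return counts
    for number, word in re.findall(r"(\d+) (passed|failed|errors?)", lines[-1]):
        key = "errors" if word.startswith("error") else word
        counts[key] += int(number)
    return counts


def run_pytest(test_path: str, timeout: float = 60.0) -> Dict[str, int]:
    """Run pytest in a child process so a crashing test cannot take down the app."""
    proc = subprocess.run(
        [sys.executable, "-m", "pytest", "-q", test_path],
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    return parse_summary(proc.stdout)


if __name__ == "__main__":
    print(parse_summary("........F [100%]\n1 failed, 8 passed in 0.10s\n"))
```

Using sys.executable keeps the child process on the same interpreter and installed packages as the app itself.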

Project layout

app.py                 Gradio UI + orchestration
agent.py               forge pipeline for the bundled sample repo
analyzer.py            AST discovery of public functions and methods
generator.py           deterministic inputs, capture, and pytest rendering
runner.py              subprocess pytest execution and parsing
mutator.py             hand-rolled mutation catalog and mutation score
patch.py               PR-style unified diff export
inject.py              controlled one-click demo regression
model_suggest.py       optional @spaces.GPU small-model edge-case assist
samples/legacy_repo/   bundled untested Python target repo
tests/test_testforge.py

Demo arc

Click Forge Tests
Watch TestForge analyze the sample repo and generate tests from real execution
See a green pytest result and the mutation score
Click Inject a regression
Watch the same suite fail on exactly one controlled change
Download testforge.patch as a PR-style artifact

Build log

This repo was built in milestone commits:

Add legacy sample and AST analyzer
Add deterministic capture runner
Add deterministic mutation scoring
Add forge app orchestration
Document hackathon submission

The most important product decision was keeping the source of truth in the code itself, not in the model. Each milestone built the tooling and loop around that principle.

Privacy

The bundled target repo is synthetic and local. The deterministic demo path (default) uses no external model, no database, and no persistent user storage. Checking "Use small coder model" downloads Qwen/Qwen2.5-Coder-0.5B-Instruct from the Hugging Face Hub and runs it on ZeroGPU; apart from that download request, no data leaves the Space.

Submission links

🤗 Live Space: https://huggingface.co/spaces/build-small-hackathon/TestForge
🎥 Demo video: https://drive.google.com/file/d/1Zx1EBmk_iEMxSOV6v867cDa7WwpKZZyB/view?usp=sharing
📣 Social post: https://www.linkedin.com/posts/aman-maurya-2a394924b_buildsmallhackathon-openaicodex-python-share-7472383067186421760-Jf0y/?utm_source=share&utm_medium=member_desktop&rcm=ACoAAD3l3lsBvHlGmHXJP3WiWP5GwQFJQ2g9QZI