Development Guide
This guide is for contributors working inside the LedgerShield repo. It covers setup, validation, CI expectations, and a detailed file map so it is easy to find the right place to make changes.
Local Setup
Prerequisites
- Python 3.11 or 3.12
- git
- Docker if you want container smoke tests
- an OpenAI-compatible endpoint only if you plan to run the LLM-powered comparison scripts
Install
git clone https://github.com/BiradarScripts/Meta-s-LedgerShield.git
cd Meta-s-LedgerShield
python -m venv .venv
source .venv/bin/activate
pip install -e .
pip install -r requirements.txt
Start the server
python -m server.app
Run the test suite
python -m pytest tests/ -q
Useful focused runs:
python -m pytest tests/test_ledgershield_env.py -q
python -m pytest tests/test_grading.py tests/test_task_c_guardrails.py tests/test_task_d_guardrails.py -q
python -m pytest tests/test_currency_engine.py tests/test_compliance_engine.py tests/test_curriculum.py -q
Validate packaging and submission workflow
bash validate-submission.sh
docker build -t ledgershield:dev .
If openenv is installed:
openenv validate
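The "if openenv is installed" condition can be automated. The following is a minimal sketch, assuming you want a local helper that skips the step rather than failing when the CLI is absent. The helper name `run_if_installed` is hypothetical and not part of the repo.

```python
import shutil
import subprocess
from typing import Optional, Sequence


def run_if_installed(command: str, args: Sequence[str]) -> Optional[int]:
    """Run `command args...` only when the command is on PATH.

    Returns the process exit code, or None when the command is missing.
    """
    executable = shutil.which(command)
    if executable is None:
        print(f"skipping: {command} not found on PATH")
        return None
    # check=False: report the exit code instead of raising on failure.
    completed = subprocess.run([executable, *args], check=False)
    return completed.returncode


if __name__ == "__main__":
    # `openenv validate` is the command from this guide; run from the repo root.
    run_if_installed("openenv", ["validate"])
```

A `None` return means the step was skipped, so a wrapper script can tell "not installed" apart from "validation failed" (non-zero exit code).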
CI Expectations
The repo includes ../.github/workflows/ci.yml, which currently runs:
- pytest on Python 3.11 and 3.12
- Docker build + container smoke test
- openenv.yaml metadata validation
Pytest configuration is centralized in ../pyproject.toml under [tool.pytest.ini_options]:
- asyncio_mode = "strict" with asyncio_default_fixture_loop_scope = "function"
- custom tests marker
- deprecation-warning filters for websockets.legacy
If you change APIs, packaging, or runtime behavior, assume CI should keep passing without special local context.
Repo Map
Root files
| Path | What it is for |
|---|---|
| ../README.md | top-level benchmark overview and quick start |
| ../CHANGELOG.md | human-readable project changes |
| ../Dockerfile | container image definition for server deployment |
| ../pyproject.toml | package metadata, dependencies, pytest config |
| ../requirements.txt | pinned runtime dependencies |
| ../uv.lock | lockfile for reproducible dependency installs |
| ../openenv.yaml | OpenEnv metadata, novelty claims, published benchmark numbers |
| ../__init__.py | package marker |
| ../client.py | thin HTTP client wrapper for the environment |
| ../ledgershield_env.py | compatibility re-export module for legacy imports |
| ../models.py | shared dataclasses, Pydantic reward model, typed internal returns |
| ../openenv_compat.py | adapter around openenv-core with local fallback server/client |
| ../inference.py | submission-safe agent with ModelCapabilityProfile tiers, evidence grounding, and strict stdout contract |
| ../inference_improved.py | experimental improved agent entrypoint |
| ../inference_llm_powered.py | richer LLM-powered agent used for debugging and comparisons |
| ../llm_utils.py | JSON parsing and completion helpers for LLM workflows |
| ../llm_judge_grader.py | optional LLM-as-judge grading experiments |
| ../compare_models_live.py | live multi-model comparison with capability profiles and monotonic strength checks |
| ../compare_all_models.py | broader multi-model sweep helper with --models, --output, --timeout, and a 0.85-aligned pass threshold |
| ../benchmark_report.py | public benchmark, holdout, and contrastive report generation |
| ../generate_branch_comparison_report.py | legacy reporting helper for saved branch comparison JSONs |
| ../generate_comparison_report.py | legacy reporting helper for multi-model JSON summaries |
| ../generate_final_report.py | legacy reporting helper for final comparison JSONs |
| ../generate_sota_report.py | legacy reporting helper for SOTA comparison JSONs |
| ../task_c_guardrails.py | Task C sanitization, composite signal detection, and constructive PAY evidence |
| ../task_d_guardrails.py | Task D sanitization, composite signal detection, and constructive PAY evidence |
| ../test_scoring.py | local baseline scoring simulation helper |
| ../validate_grader.py | end-to-end grader and environment validation script |
| ../validate_agent_grading.py | score-separation validation helper |
| ../validate-submission.sh | pre-submission validator for Docker, server health, and stdout contract |
| ../live_model_comparison.json | saved live comparison summary artifact |
server/
| Path | What it is for |
|---|---|
| ../server/__init__.py | package marker |
| ../server/app.py | FastAPI app builder and endpoint registration |
| ../server/environment.py | main environment loop, reward shaping, truncation logic, rendering |
| ../server/world_state.py | hidden/public state, artifacts, readiness, pressure resistance |
| ../server/tools.py | investigation tool implementations, email-thread payload construction, domain alignment inference |
| ../server/transition_engine.py | intervention handling and signal extraction |
| ../server/grading.py | task-specific grading rubrics |
| ../server/trajectory_grading.py | trajectory-aware scoring components |
| ../server/outcome_simulator.py | downstream operational/fraud outcome simulation |
| ../server/risk_rules.py | risk bucket logic and heuristic submission-risk assessment |
| ../server/pressure_events.py | adversarial pressure-event templates and scoring |
| ../server/vendor_simulator.py | callback vendor-response simulation |
| ../server/data_loader.py | fixture loading, indexing, and generated-case injection |
| ../server/case_factory.py | challenge/holdout/benign-twin generation |
| ../server/attack_library.py | 16 adversarial AP fraud attack templates |
| ../server/schema.py | canonical field/action/reason-code constants and normalizers |
| ../server/currency_engine.py | multi-currency realism utilities |
| ../server/compliance_engine.py | SOX-style internal-control evaluation |
| ../server/curriculum.py | dynamic difficulty adaptation |
| ../server/dual_agent_mode.py | watchdog-mode dual-agent novelty module |
server/fixtures/
| Path | What it stores |
|---|---|
| ../server/fixtures/cases.json | the 21 curated benchmark cases |
| ../server/fixtures/vendors.json | vendor master data |
| ../server/fixtures/vendor_history.json | historical vendor changes and fraud history |
| ../server/fixtures/po_records.json | purchase-order records |
| ../server/fixtures/receipts.json | goods-receipt records |
| ../server/fixtures/ledger_index.json | ledger/payment history used for duplicate detection |
| ../server/fixtures/email_threads.json | structured email-thread records |
| ../server/fixtures/policy_rules.json | policy rules used by lookup_policy |
tests/
| Path | What it validates |
|---|---|
| ../tests/conftest.py | shared fixtures and suite-wide pytest marker setup |
| ../tests/test_api_smoke.py | API endpoint smoke coverage |
| ../tests/test_benchmark_report.py | public/holdout/contrastive reporting behavior |
| ../tests/test_compare_all_models.py | score parsing helpers in broad model sweeps |
| ../tests/test_compare_models_live.py | live comparison stats, capability profiles, and rendering helpers |
| ../tests/test_compliance_engine.py | SOX compliance evaluation |
| ../tests/test_currency_engine.py | FX/IBAN/SWIFT/aging-report utilities |
| ../tests/test_curriculum.py | curriculum tiering and case selection |
| ../tests/test_grading.py | degenerate evidence cap and grading edge cases |
| ../tests/test_inference_contract.py | required stdout contract for inference.py |
| ../tests/test_inference_llm_powered.py | derived thread reasoning in LLM-powered inference |
| ../tests/test_inference_runtime.py | model capability profiles and runtime heuristics |
| ../tests/test_ledgershield_env.py | environment transitions, scoring, and holdout generation |
| ../tests/test_schema_reason_codes.py | reason-code normalization and aliasing |
| ../tests/test_task_c_guardrails.py | Task C submission guardrails and PAY evidence |
| ../tests/test_task_d_guardrails.py | Task D submission guardrails and PAY evidence |
docs/
| Path | What it covers |
|---|---|
| ../docs/README.md | docs landing page |
| ../docs/index.md | benchmark overview |
| ../docs/tasks.md | task contracts and scoring |
| ../docs/api-reference.md | REST API reference |
| ../docs/architecture.md | architecture deep dive |
| ../docs/development.md | this file |
| ../docs/deployment.md | deployment and runtime configuration |
Common Workflows
Changing the environment
Touch at least these files:
- server/environment.py
- server/world_state.py
- relevant tests in tests/test_ledgershield_env.py
- docs in docs/api-reference.md or docs/architecture.md if the contract changed
Changing grading
Touch at least these files:
- server/grading.py
- server/trajectory_grading.py
- any new utility modules such as server/compliance_engine.py
- tests in tests/test_grading.py and task-specific regression tests
Adding benchmark realism
Typical landing spots:
- server/currency_engine.py
- server/compliance_engine.py
- server/attack_library.py
- server/case_factory.py
- server/fixtures/cases.json
Updating inference behavior
Touch at least these files:
- inference.py
- inference_llm_powered.py if comparison/debug behavior must stay aligned
- task_c_guardrails.py / task_d_guardrails.py if structured output rules changed
- tests/test_inference_contract.py and relevant inference tests
Extension Guidance
Adding a new tool
- Implement the tool in ../server/tools.py.
- Add the action name to ../server/schema.py.
- Add cost handling and dispatch in ../server/environment.py.
- Add or update signal extraction in ../server/transition_engine.py if needed.
- Add tests and update docs.
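The tool steps above can be pictured as three small registrations that must agree on one action name. This is a self-contained sketch only: every identifier in it (`lookup_example`, `ALLOWED_ACTIONS`, `TOOL_COSTS`, `TOOL_DISPATCH`, `run_action`) is hypothetical, and the real structures in server/tools.py, server/schema.py, and server/environment.py may look quite different.

```python
from typing import Any, Callable, Dict, Tuple

# Hypothetical names throughout; the repo's actual layout may differ.

# 1. Tool implementation (the kind of function that would live in server/tools.py).
def lookup_example(payload: Dict[str, Any]) -> Dict[str, Any]:
    """Toy tool that echoes the requested key back as its result."""
    return {"tool": "lookup_example", "result": payload.get("key")}


# 2. Action name registration (the role played by server/schema.py).
ALLOWED_ACTIONS = {"lookup_example"}

# 3. Cost handling and dispatch (the role played by server/environment.py).
TOOL_COSTS: Dict[str, float] = {"lookup_example": 0.5}
TOOL_DISPATCH: Dict[str, Callable[[Dict[str, Any]], Dict[str, Any]]] = {
    "lookup_example": lookup_example,
}


def run_action(
    action: str, payload: Dict[str, Any], budget: float
) -> Tuple[Dict[str, Any], float]:
    """Validate the action, charge its cost, and call the tool."""
    if action not in ALLOWED_ACTIONS:
        raise ValueError(f"unknown action: {action}")
    cost = TOOL_COSTS[action]
    if cost > budget:
        raise ValueError(f"insufficient budget for {action}")
    return TOOL_DISPATCH[action](payload), budget - cost


result, remaining = run_action("lookup_example", {"key": "vendor-1"}, budget=2.0)
print(result, remaining)
```

A regression test for a new tool could assert that the action name appears in all three places, since a name missing from any one of them would fail only at dispatch time.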
Adding a new case
- Add it to ../server/fixtures/cases.json.
- Ensure any needed vendor/PO/receipt/email/ledger fixtures exist.
- Confirm case IDs are unique.
- Update ./tasks.md if the public case catalog changed.
- Add regression coverage.
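The uniqueness check for case IDs is easy to script. The sketch below assumes that cases.json is a top-level JSON array of objects and that each object stores its identifier under a `case_id` key. Both are assumptions about the fixture schema, so adjust `id_field` if the real key differs. The sample IDs are made up for illustration.

```python
import json
from collections import Counter
from pathlib import Path
from typing import Iterable, List


def find_duplicate_ids(cases: Iterable[dict], id_field: str = "case_id") -> List[str]:
    """Return the sorted IDs that appear more than once."""
    counts = Counter(case[id_field] for case in cases)
    return sorted(case_id for case_id, n in counts.items() if n > 1)


def check_cases_file(path: Path, id_field: str = "case_id") -> List[str]:
    """Load a fixture file (assumed to be a JSON array) and report duplicate IDs."""
    cases = json.loads(path.read_text(encoding="utf-8"))
    return find_duplicate_ids(cases, id_field)


# Made-up sample data; real case IDs will look different.
sample = [
    {"case_id": "EXAMPLE-001"},
    {"case_id": "EXAMPLE-002"},
    {"case_id": "EXAMPLE-001"},
]
print(find_duplicate_ids(sample))
```

To run it against the repo from the root directory, call `check_cases_file(Path("server/fixtures/cases.json"))`. An empty list means every ID is unique.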
Adding a new attack pattern
- Extend ../server/attack_library.py.
- Make sure the resulting reason codes and fraud flags are canonical.
- Add tests that prove the attack is reachable and meaningful.
Practical Notes
- The repo contains a mix of benchmark runtime code and historical helper scripts. Prefer editing the core runtime paths first.
- Some top-level report helpers are legacy utilities for saved JSON artifacts rather than part of the main runtime.
- Keep docs and tests in sync with any public contract changes.