# Development Guide

This guide is for contributors working inside the LedgerShield repo. It covers setup, validation, CI expectations, and a detailed file map so it is easy to find the right place to make changes.

## Local Setup

### Prerequisites

- Python 3.11 or 3.12
- `git`
- Docker if you want container smoke tests
- an OpenAI-compatible endpoint only if you plan to run the LLM-powered comparison scripts

### Install

```bash
git clone https://github.com/BiradarScripts/Meta-s-LedgerShield.git
cd Meta-s-LedgerShield
python -m venv .venv
source .venv/bin/activate
pip install -e .
pip install -r requirements.txt
```

### Start the server

```bash
python -m server.app
```

### Run the test suite

```bash
python -m pytest tests/ -q
```

Useful focused runs:

```bash
python -m pytest tests/test_ledgershield_env.py -q
python -m pytest tests/test_grading.py tests/test_task_c_guardrails.py tests/test_task_d_guardrails.py -q
python -m pytest tests/test_currency_engine.py tests/test_compliance_engine.py tests/test_curriculum.py -q
```

### Validate packaging and submission workflow

```bash
bash validate-submission.sh
docker build -t ledgershield:dev .
```

If `openenv` is installed:

```bash
openenv validate
```

## CI Expectations

The repo includes [`../.github/workflows/ci.yml`](../.github/workflows/ci.yml), which currently runs:

- pytest on Python 3.11 and 3.12
- Docker build + container smoke test
- `openenv.yaml` metadata validation

Pytest configuration is centralized in [`../pyproject.toml`](../pyproject.toml) under `[tool.pytest.ini_options]`:

- `asyncio_mode = "strict"` with `asyncio_default_fixture_loop_scope = "function"`
- custom `tests` marker
- deprecation-warning filters for `websockets.legacy`

If you change APIs, packaging, or runtime behavior, assume CI should keep passing without special local context.
## Repo Map

### Root files

| Path | What it is for |
|---|---|
| [`../README.md`](../README.md) | top-level benchmark overview and quick start |
| [`../CHANGELOG.md`](../CHANGELOG.md) | human-readable project changes |
| [`../Dockerfile`](../Dockerfile) | container image definition for server deployment |
| [`../pyproject.toml`](../pyproject.toml) | package metadata, dependencies, pytest config |
| [`../requirements.txt`](../requirements.txt) | pinned runtime dependencies |
| [`../uv.lock`](../uv.lock) | lockfile for reproducible dependency installs |
| [`../openenv.yaml`](../openenv.yaml) | OpenEnv metadata, novelty claims, published benchmark numbers |
| [`../__init__.py`](../__init__.py) | package marker |
| [`../client.py`](../client.py) | thin HTTP client wrapper for the environment |
| [`../ledgershield_env.py`](../ledgershield_env.py) | compatibility re-export module for legacy imports |
| [`../models.py`](../models.py) | shared dataclasses, Pydantic reward model, typed internal returns |
| [`../openenv_compat.py`](../openenv_compat.py) | adapter around `openenv-core` with local fallback server/client |
| [`../inference.py`](../inference.py) | submission-safe agent with `ModelCapabilityProfile` tiers, evidence grounding, and strict stdout contract |
| [`../inference_improved.py`](../inference_improved.py) | experimental improved agent entrypoint |
| [`../inference_llm_powered.py`](../inference_llm_powered.py) | richer LLM-powered agent used for debugging and comparisons |
| [`../llm_utils.py`](../llm_utils.py) | JSON parsing and completion helpers for LLM workflows |
| [`../llm_judge_grader.py`](../llm_judge_grader.py) | optional LLM-as-judge grading experiments |
| [`../compare_models_live.py`](../compare_models_live.py) | live multi-model comparison with capability profiles and monotonic strength checks |
| [`../compare_all_models.py`](../compare_all_models.py) | broader multi-model sweep helper with `--models`, `--output`, `--timeout`, and a `0.85`-aligned pass threshold |
| [`../benchmark_report.py`](../benchmark_report.py) | public benchmark, holdout, and contrastive report generation |
| [`../generate_branch_comparison_report.py`](../generate_branch_comparison_report.py) | legacy reporting helper for saved branch comparison JSONs |
| [`../generate_comparison_report.py`](../generate_comparison_report.py) | legacy reporting helper for multi-model JSON summaries |
| [`../generate_final_report.py`](../generate_final_report.py) | legacy reporting helper for final comparison JSONs |
| [`../generate_sota_report.py`](../generate_sota_report.py) | legacy reporting helper for SOTA comparison JSONs |
| [`../task_c_guardrails.py`](../task_c_guardrails.py) | Task C sanitization, composite signal detection, and constructive PAY evidence |
| [`../task_d_guardrails.py`](../task_d_guardrails.py) | Task D sanitization, composite signal detection, and constructive PAY evidence |
| [`../test_scoring.py`](../test_scoring.py) | local baseline scoring simulation helper |
| [`../validate_grader.py`](../validate_grader.py) | end-to-end grader and environment validation script |
| [`../validate_agent_grading.py`](../validate_agent_grading.py) | score-separation validation helper |
| [`../validate-submission.sh`](../validate-submission.sh) | pre-submission validator for Docker, server health, and stdout contract |
| [`../live_model_comparison.json`](../live_model_comparison.json) | saved live comparison summary artifact |

### `server/`

| Path | What it is for |
|---|---|
| [`../server/__init__.py`](../server/__init__.py) | package marker |
| [`../server/app.py`](../server/app.py) | FastAPI app builder and endpoint registration |
| [`../server/environment.py`](../server/environment.py) | main environment loop, reward shaping, truncation logic, rendering |
| [`../server/world_state.py`](../server/world_state.py) | hidden/public state, artifacts, readiness, pressure resistance |
| [`../server/tools.py`](../server/tools.py) | investigation tool implementations, email-thread payload construction, domain alignment inference |
| [`../server/transition_engine.py`](../server/transition_engine.py) | intervention handling and signal extraction |
| [`../server/grading.py`](../server/grading.py) | task-specific grading rubrics |
| [`../server/trajectory_grading.py`](../server/trajectory_grading.py) | trajectory-aware scoring components |
| [`../server/outcome_simulator.py`](../server/outcome_simulator.py) | downstream operational/fraud outcome simulation |
| [`../server/risk_rules.py`](../server/risk_rules.py) | risk bucket logic and heuristic submission-risk assessment |
| [`../server/pressure_events.py`](../server/pressure_events.py) | adversarial pressure-event templates and scoring |
| [`../server/vendor_simulator.py`](../server/vendor_simulator.py) | callback vendor-response simulation |
| [`../server/data_loader.py`](../server/data_loader.py) | fixture loading, indexing, and generated-case injection |
| [`../server/case_factory.py`](../server/case_factory.py) | challenge/holdout/benign-twin generation |
| [`../server/attack_library.py`](../server/attack_library.py) | 16 adversarial AP fraud attack templates |
| [`../server/schema.py`](../server/schema.py) | canonical field/action/reason-code constants and normalizers |
| [`../server/currency_engine.py`](../server/currency_engine.py) | multi-currency realism utilities |
| [`../server/compliance_engine.py`](../server/compliance_engine.py) | SOX-style internal-control evaluation |
| [`../server/curriculum.py`](../server/curriculum.py) | dynamic difficulty adaptation |
| [`../server/dual_agent_mode.py`](../server/dual_agent_mode.py) | watchdog-mode dual-agent novelty module |

### `server/fixtures/`

| Path | What it stores |
|---|---|
| [`../server/fixtures/cases.json`](../server/fixtures/cases.json) | the 21 curated benchmark cases |
| [`../server/fixtures/vendors.json`](../server/fixtures/vendors.json) | vendor master data |
| [`../server/fixtures/vendor_history.json`](../server/fixtures/vendor_history.json) | historical vendor changes and fraud history |
| [`../server/fixtures/po_records.json`](../server/fixtures/po_records.json) | purchase-order records |
| [`../server/fixtures/receipts.json`](../server/fixtures/receipts.json) | goods-receipt records |
| [`../server/fixtures/ledger_index.json`](../server/fixtures/ledger_index.json) | ledger/payment history used for duplicate detection |
| [`../server/fixtures/email_threads.json`](../server/fixtures/email_threads.json) | structured email-thread records |
| [`../server/fixtures/policy_rules.json`](../server/fixtures/policy_rules.json) | policy rules used by `lookup_policy` |

### `tests/`

| Path | What it validates |
|---|---|
| [`../tests/conftest.py`](../tests/conftest.py) | shared fixtures and suite-wide pytest marker setup |
| [`../tests/test_api_smoke.py`](../tests/test_api_smoke.py) | API endpoint smoke coverage |
| [`../tests/test_benchmark_report.py`](../tests/test_benchmark_report.py) | public/holdout/contrastive reporting behavior |
| [`../tests/test_compare_all_models.py`](../tests/test_compare_all_models.py) | score parsing helpers in broad model sweeps |
| [`../tests/test_compare_models_live.py`](../tests/test_compare_models_live.py) | live comparison stats, capability profiles, and rendering helpers |
| [`../tests/test_compliance_engine.py`](../tests/test_compliance_engine.py) | SOX compliance evaluation |
| [`../tests/test_currency_engine.py`](../tests/test_currency_engine.py) | FX/IBAN/SWIFT/aging-report utilities |
| [`../tests/test_curriculum.py`](../tests/test_curriculum.py) | curriculum tiering and case selection |
| [`../tests/test_grading.py`](../tests/test_grading.py) | degenerate evidence cap and grading edge cases |
| [`../tests/test_inference_contract.py`](../tests/test_inference_contract.py) | required stdout contract for `inference.py` |
| [`../tests/test_inference_llm_powered.py`](../tests/test_inference_llm_powered.py) | derived thread reasoning in LLM-powered inference |
| [`../tests/test_inference_runtime.py`](../tests/test_inference_runtime.py) | model capability profiles and runtime heuristics |
| [`../tests/test_ledgershield_env.py`](../tests/test_ledgershield_env.py) | environment transitions, scoring, and holdout generation |
| [`../tests/test_schema_reason_codes.py`](../tests/test_schema_reason_codes.py) | reason-code normalization and aliasing |
| [`../tests/test_task_c_guardrails.py`](../tests/test_task_c_guardrails.py) | Task C submission guardrails and PAY evidence |
| [`../tests/test_task_d_guardrails.py`](../tests/test_task_d_guardrails.py) | Task D submission guardrails and PAY evidence |

### `docs/`

| Path | What it covers |
|---|---|
| [`../docs/README.md`](../docs/README.md) | docs landing page |
| [`../docs/index.md`](../docs/index.md) | benchmark overview |
| [`../docs/tasks.md`](../docs/tasks.md) | task contracts and scoring |
| [`../docs/api-reference.md`](../docs/api-reference.md) | REST API reference |
| [`../docs/architecture.md`](../docs/architecture.md) | architecture deep dive |
| [`../docs/development.md`](../docs/development.md) | this file |
| [`../docs/deployment.md`](../docs/deployment.md) | deployment and runtime configuration |

## Common Workflows

### Changing the environment

Touch at least these files:

- `server/environment.py`
- `server/world_state.py`
- relevant tests in `tests/test_ledgershield_env.py`
- docs in `docs/api-reference.md` or `docs/architecture.md` if the contract changed

### Changing grading

Touch at least these files:

- `server/grading.py`
- `server/trajectory_grading.py`
- any new utility modules such as `server/compliance_engine.py`
- tests in `tests/test_grading.py` and task-specific regression tests

### Adding benchmark realism

Typical landing spots:

- `server/currency_engine.py`
- `server/compliance_engine.py`
- `server/attack_library.py`
- `server/case_factory.py`
- `server/fixtures/cases.json`

### Updating inference behavior

Touch at least these files:

- `inference.py`
- `inference_llm_powered.py` if comparison/debug behavior must stay aligned
- `task_c_guardrails.py` / `task_d_guardrails.py` if structured output rules changed
- `tests/test_inference_contract.py` and relevant inference tests

## Extension Guidance

### Adding a new tool

1. Implement the tool in [`../server/tools.py`](../server/tools.py).
2. Add the action name to [`../server/schema.py`](../server/schema.py).
3. Add cost handling and dispatch in [`../server/environment.py`](../server/environment.py).
4. Add or update signal extraction in [`../server/transition_engine.py`](../server/transition_engine.py) if needed.
5. Add tests and update docs.

### Adding a new case

1. Add it to [`../server/fixtures/cases.json`](../server/fixtures/cases.json).
2. Ensure any needed vendor/PO/receipt/email/ledger fixtures exist.
3. Confirm case IDs are unique.
4. Update [`./tasks.md`](./tasks.md) if the public case catalog changed.
5. Add regression coverage.

### Adding a new attack pattern

1. Extend [`../server/attack_library.py`](../server/attack_library.py).
2. Make sure the resulting reason codes and fraud flags are canonical.
3. Add tests that prove the attack is reachable and meaningful.

## Practical Notes

- The repo uses a mix of benchmark runtime code and historical helper scripts. Prefer editing the core runtime paths first.
- Some top-level report helpers are legacy utilities for saved JSON artifacts rather than part of the main runtime.
- Keep docs and tests in sync with any public contract changes.
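
For the "Confirm case IDs are unique" step when adding a new case, a small script can do the check for you. The sketch below is an assumption-heavy illustration: it supposes the fixture is a top-level JSON list of objects that each carry a `case_id` key, which this guide does not confirm, and the helper name `find_duplicate_case_ids` and the sample IDs are hypothetical. Adjust `id_key` and the loading step to match the real shape of `server/fixtures/cases.json`.

```python
import json
from collections import Counter


def find_duplicate_case_ids(cases, id_key="case_id"):
    """Return a sorted list of IDs that appear more than once.

    Assumes `cases` is a list of dicts keyed by `id_key`; both the list
    shape and the key name are assumptions, not the confirmed schema.
    """
    counts = Counter(case[id_key] for case in cases)
    return sorted(case_id for case_id, n in counts.items() if n > 1)


# Made-up sample data standing in for the real fixture file.
sample = json.loads(
    '[{"case_id": "CASE-A-001"}, {"case_id": "CASE-B-001"}, {"case_id": "CASE-A-001"}]'
)

duplicates = find_duplicate_case_ids(sample)
print(duplicates)  # prints ['CASE-A-001']
```

An empty list means no duplicates were found. Wrapping a check like this in a regression test is one way to keep the constraint enforced as the case catalog grows.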