Development Guide

This guide is for contributors working in the LedgerShield repo. It covers local setup, validation, CI expectations, and a detailed file map so you can quickly find the right place to make changes.

Local Setup

Prerequisites

  • Python 3.11 or 3.12
  • git
  • Docker, if you want to run the container smoke tests
  • an OpenAI-compatible endpoint only if you plan to run the LLM-powered comparison scripts

Install

git clone https://github.com/BiradarScripts/Meta-s-LedgerShield.git
cd Meta-s-LedgerShield

python -m venv .venv
source .venv/bin/activate

pip install -e .
pip install -r requirements.txt

Start the server

python -m server.app
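
Before pointing tests or scripts at a freshly started server, it can help to wait until it accepts connections. The sketch below is stdlib-only and checks only that a TCP port is open. The host and port are parameters because this guide does not state which address the server binds to; any value you pass is your own assumption about the local setup.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout_s: float = 30.0, interval_s: float = 0.5) -> bool:
    """Return True once a TCP connection to (host, port) succeeds, False on timeout."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=2.0):
                return True
        except OSError:
            # Nothing is listening yet (or the connect timed out); wait and retry.
            time.sleep(interval_s)
    return False
```

A call such as `wait_for_port("127.0.0.1", 8000)` would block for up to 30 seconds; the port number here is a placeholder, not a documented default. An open port does not prove the API is healthy, so treat this as a pre-check rather than a replacement for the smoke tests.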

Run the test suite

python -m pytest tests/ -q

Useful focused runs:

python -m pytest tests/test_ledgershield_env.py -q
python -m pytest tests/test_grading.py tests/test_task_c_guardrails.py tests/test_task_d_guardrails.py -q
python -m pytest tests/test_currency_engine.py tests/test_compliance_engine.py tests/test_curriculum.py -q

Validate packaging and submission workflow

bash validate-submission.sh
docker build -t ledgershield:dev .

If openenv is installed:

openenv validate

CI Expectations

The repo includes ../.github/workflows/ci.yml, which currently runs:

  • pytest on Python 3.11 and 3.12
  • Docker build + container smoke test
  • openenv.yaml metadata validation

Pytest configuration is centralized in ../pyproject.toml under [tool.pytest.ini_options]:

  • asyncio_mode = "strict" with asyncio_default_fixture_loop_scope = "function"
  • custom tests marker
  • deprecation-warning filters for websockets.legacy

If you change APIs, packaging, or runtime behavior, make sure CI still passes from a clean checkout, without relying on anything specific to your local environment.

Repo Map

Root files

| Path | What it is for |
| --- | --- |
| `../README.md` | top-level benchmark overview and quick start |
| `../CHANGELOG.md` | human-readable project changes |
| `../Dockerfile` | container image definition for server deployment |
| `../pyproject.toml` | package metadata, dependencies, pytest config |
| `../requirements.txt` | pinned runtime dependencies |
| `../uv.lock` | lockfile for reproducible dependency installs |
| `../openenv.yaml` | OpenEnv metadata, novelty claims, published benchmark numbers |
| `../__init__.py` | package marker |
| `../client.py` | thin HTTP client wrapper for the environment |
| `../ledgershield_env.py` | compatibility re-export module for legacy imports |
| `../models.py` | shared dataclasses, Pydantic reward model, typed internal returns |
| `../openenv_compat.py` | adapter around openenv-core with local fallback server/client |
| `../inference.py` | submission-safe agent with ModelCapabilityProfile tiers, evidence grounding, and strict stdout contract |
| `../inference_improved.py` | experimental improved agent entrypoint |
| `../inference_llm_powered.py` | richer LLM-powered agent used for debugging and comparisons |
| `../llm_utils.py` | JSON parsing and completion helpers for LLM workflows |
| `../llm_judge_grader.py` | optional LLM-as-judge grading experiments |
| `../compare_models_live.py` | live multi-model comparison with capability profiles and monotonic strength checks |
| `../compare_all_models.py` | broader multi-model sweep helper with --models, --output, --timeout, and a 0.85-aligned pass threshold |
| `../benchmark_report.py` | public benchmark, holdout, and contrastive report generation |
| `../generate_branch_comparison_report.py` | legacy reporting helper for saved branch comparison JSONs |
| `../generate_comparison_report.py` | legacy reporting helper for multi-model JSON summaries |
| `../generate_final_report.py` | legacy reporting helper for final comparison JSONs |
| `../generate_sota_report.py` | legacy reporting helper for SOTA comparison JSONs |
| `../task_c_guardrails.py` | Task C sanitization, composite signal detection, and constructive PAY evidence |
| `../task_d_guardrails.py` | Task D sanitization, composite signal detection, and constructive PAY evidence |
| `../test_scoring.py` | local baseline scoring simulation helper |
| `../validate_grader.py` | end-to-end grader and environment validation script |
| `../validate_agent_grading.py` | score-separation validation helper |
| `../validate-submission.sh` | pre-submission validator for Docker, server health, and stdout contract |
| `../live_model_comparison.json` | saved live comparison summary artifact |

server/

| Path | What it is for |
| --- | --- |
| `../server/__init__.py` | package marker |
| `../server/app.py` | FastAPI app builder and endpoint registration |
| `../server/environment.py` | main environment loop, reward shaping, truncation logic, rendering |
| `../server/world_state.py` | hidden/public state, artifacts, readiness, pressure resistance |
| `../server/tools.py` | investigation tool implementations, email-thread payload construction, domain alignment inference |
| `../server/transition_engine.py` | intervention handling and signal extraction |
| `../server/grading.py` | task-specific grading rubrics |
| `../server/trajectory_grading.py` | trajectory-aware scoring components |
| `../server/outcome_simulator.py` | downstream operational/fraud outcome simulation |
| `../server/risk_rules.py` | risk bucket logic and heuristic submission-risk assessment |
| `../server/pressure_events.py` | adversarial pressure-event templates and scoring |
| `../server/vendor_simulator.py` | callback vendor-response simulation |
| `../server/data_loader.py` | fixture loading, indexing, and generated-case injection |
| `../server/case_factory.py` | challenge/holdout/benign-twin generation |
| `../server/attack_library.py` | 16 adversarial AP fraud attack templates |
| `../server/schema.py` | canonical field/action/reason-code constants and normalizers |
| `../server/currency_engine.py` | multi-currency realism utilities |
| `../server/compliance_engine.py` | SOX-style internal-control evaluation |
| `../server/curriculum.py` | dynamic difficulty adaptation |
| `../server/dual_agent_mode.py` | watchdog-mode dual-agent novelty module |

server/fixtures/

| Path | What it stores |
| --- | --- |
| `../server/fixtures/cases.json` | the 21 curated benchmark cases |
| `../server/fixtures/vendors.json` | vendor master data |
| `../server/fixtures/vendor_history.json` | historical vendor changes and fraud history |
| `../server/fixtures/po_records.json` | purchase-order records |
| `../server/fixtures/receipts.json` | goods-receipt records |
| `../server/fixtures/ledger_index.json` | ledger/payment history used for duplicate detection |
| `../server/fixtures/email_threads.json` | structured email-thread records |
| `../server/fixtures/policy_rules.json` | policy rules used by lookup_policy |

tests/

| Path | What it validates |
| --- | --- |
| `../tests/conftest.py` | shared fixtures and suite-wide pytest marker setup |
| `../tests/test_api_smoke.py` | API endpoint smoke coverage |
| `../tests/test_benchmark_report.py` | public/holdout/contrastive reporting behavior |
| `../tests/test_compare_all_models.py` | score parsing helpers in broad model sweeps |
| `../tests/test_compare_models_live.py` | live comparison stats, capability profiles, and rendering helpers |
| `../tests/test_compliance_engine.py` | SOX compliance evaluation |
| `../tests/test_currency_engine.py` | FX/IBAN/SWIFT/aging-report utilities |
| `../tests/test_curriculum.py` | curriculum tiering and case selection |
| `../tests/test_grading.py` | degenerate evidence cap and grading edge cases |
| `../tests/test_inference_contract.py` | required stdout contract for inference.py |
| `../tests/test_inference_llm_powered.py` | derived thread reasoning in LLM-powered inference |
| `../tests/test_inference_runtime.py` | model capability profiles and runtime heuristics |
| `../tests/test_ledgershield_env.py` | environment transitions, scoring, and holdout generation |
| `../tests/test_schema_reason_codes.py` | reason-code normalization and aliasing |
| `../tests/test_task_c_guardrails.py` | Task C submission guardrails and PAY evidence |
| `../tests/test_task_d_guardrails.py` | Task D submission guardrails and PAY evidence |

docs/

| Path | What it covers |
| --- | --- |
| `../docs/README.md` | docs landing page |
| `../docs/index.md` | benchmark overview |
| `../docs/tasks.md` | task contracts and scoring |
| `../docs/api-reference.md` | REST API reference |
| `../docs/architecture.md` | architecture deep dive |
| `../docs/development.md` | this file |
| `../docs/deployment.md` | deployment and runtime configuration |

Common Workflows

Changing the environment

Touch at least these files:

  • server/environment.py
  • server/world_state.py
  • relevant tests in tests/test_ledgershield_env.py
  • docs in docs/api-reference.md or docs/architecture.md if the contract changed

Changing grading

Touch at least these files:

  • server/grading.py
  • server/trajectory_grading.py
  • any new utility modules such as server/compliance_engine.py
  • tests in tests/test_grading.py and task-specific regression tests

Adding benchmark realism

Typical landing spots:

  • server/currency_engine.py
  • server/compliance_engine.py
  • server/attack_library.py
  • server/case_factory.py
  • server/fixtures/cases.json

Updating inference behavior

Touch at least these files:

  • inference.py
  • inference_llm_powered.py if comparison/debug behavior must stay aligned
  • task_c_guardrails.py / task_d_guardrails.py if structured output rules changed
  • tests/test_inference_contract.py and relevant inference tests

Extension Guidance

Adding a new tool

  1. Implement the tool in ../server/tools.py.
  2. Add the action name to ../server/schema.py.
  3. Add cost handling and dispatch in ../server/environment.py.
  4. Add or update signal extraction in ../server/transition_engine.py if needed.
  5. Add tests and update docs.
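
The first three steps can be sketched in a single self-contained file to show how the pieces relate. Everything named here is hypothetical: `lookup_bank_change_log`, `ALLOWED_ACTIONS`, `TOOL_COSTS`, `TOOL_DISPATCH`, and `run_tool` are illustrative stand-ins, and the real structures in `server/tools.py`, `server/schema.py`, and `server/environment.py` may look quite different.

```python
from typing import Any, Callable


# Step 1 (would live in server/tools.py): the tool implementation.
# `lookup_bank_change_log` is a made-up tool used only for this sketch.
def lookup_bank_change_log(state: dict[str, Any], vendor_id: str) -> dict[str, Any]:
    changes = state.get("bank_changes", {}).get(vendor_id, [])
    return {"vendor_id": vendor_id, "changes": changes, "found": bool(changes)}


# Step 2 (would live in server/schema.py): register the action name.
ALLOWED_ACTIONS: set[str] = {"lookup_bank_change_log"}

# Step 3 (would live in server/environment.py): cost handling and dispatch.
TOOL_COSTS: dict[str, float] = {"lookup_bank_change_log": 0.05}  # placeholder cost
TOOL_DISPATCH: dict[str, Callable[..., dict[str, Any]]] = {
    "lookup_bank_change_log": lookup_bank_change_log,
}


def run_tool(state: dict[str, Any], action: str, **kwargs: Any) -> tuple[dict[str, Any], float]:
    """Validate the action name, call the tool, and return (result, cost)."""
    if action not in ALLOWED_ACTIONS:
        raise ValueError(f"unknown action: {action!r}")
    result = TOOL_DISPATCH[action](state, **kwargs)
    return result, TOOL_COSTS[action]
```

The point of the sketch is the failure mode it makes visible: a tool that is implemented but missing from the action list or the cost table fails at dispatch time, which is why the steps above touch all three files together.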

Adding a new case

  1. Add it to ../server/fixtures/cases.json.
  2. Ensure any needed vendor/PO/receipt/email/ledger fixtures exist.
  3. Confirm case IDs are unique.
  4. Update ./tasks.md if the public case catalog changed.
  5. Add regression coverage.
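
Step 3 is easy to automate. The sketch below assumes the fixture is a JSON array of objects that each carry a `case_id` key; that layout and the sample IDs are assumptions for illustration, so adjust `id_key` to match the actual schema of `cases.json`.

```python
import json
from collections import Counter


def duplicate_case_ids(cases: list[dict], id_key: str = "case_id") -> list[str]:
    """Return every ID that appears more than once, sorted for stable output."""
    counts = Counter(case[id_key] for case in cases)
    return sorted(case_id for case_id, n in counts.items() if n > 1)


# Inline sample standing in for server/fixtures/cases.json (hypothetical IDs).
sample = json.loads(
    '[{"case_id": "CASE-A-001"}, {"case_id": "CASE-B-001"}, {"case_id": "CASE-A-001"}]'
)
print(duplicate_case_ids(sample))  # ['CASE-A-001']
```

A regression test could load the real fixture with `json.load` and assert that the returned list is empty.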

Adding a new attack pattern

  1. Extend ../server/attack_library.py.
  2. Make sure the resulting reason codes and fraud flags are canonical.
  3. Add tests that prove the attack is reachable and meaningful.
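
For step 2, one lightweight approach is to compare a template's reason codes against the canonical vocabulary and report anything that does not match exactly. The sketch below uses made-up data: the code names, the template fields, and the helper `non_canonical_codes` are all hypothetical, and the real constants and normalizers live in `server/schema.py`.

```python
# Hypothetical canonical vocabulary; the real one is defined in server/schema.py.
CANONICAL_REASON_CODES = frozenset({"bank_override_attempt", "duplicate_near_match"})

# Hypothetical attack template; the field names are assumptions for illustration.
NEW_ATTACK = {
    "name": "example_attack",
    "reason_codes": ["bank_override_attempt", "Duplicate-Near-Match"],
}


def non_canonical_codes(template: dict, canonical: frozenset[str]) -> list[str]:
    """Return the template's reason codes that are not exact canonical matches."""
    return [code for code in template["reason_codes"] if code not in canonical]


print(non_canonical_codes(NEW_ATTACK, CANONICAL_REASON_CODES))  # ['Duplicate-Near-Match']
```

An exact-match check like this is deliberately strict: it flags aliases and casing variants at authoring time instead of relying on a normalizer to repair them later.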

Practical Notes

  • The repo mixes benchmark runtime code with historical helper scripts. Prefer editing the core runtime paths first.
  • Some top-level report helpers are legacy utilities for saved JSON artifacts rather than part of the main runtime.
  • Keep docs and tests in sync with any public contract changes.