Development Guide

This guide is for contributors working inside the LedgerShield repo. It covers local setup, validation, CI expectations, common workflows, and a detailed file map to help you find the right place to make changes.

Local Setup

Prerequisites

Python 3.11 or 3.12
git
Docker, if you want to run container smoke tests
an OpenAI-compatible endpoint, needed only if you plan to run the LLM-powered comparison scripts

Install

git clone https://github.com/BiradarScripts/Meta-s-LedgerShield.git
cd Meta-s-LedgerShield

python -m venv .venv
source .venv/bin/activate

pip install -e .
pip install -r requirements.txt

Start the server

python -m server.app

Run the test suite

python -m pytest tests/ -q

Useful focused runs:

python -m pytest tests/test_ledgershield_env.py -q
python -m pytest tests/test_grading.py tests/test_task_c_guardrails.py tests/test_task_d_guardrails.py -q
python -m pytest tests/test_currency_engine.py tests/test_compliance_engine.py tests/test_curriculum.py -q

Validate packaging and submission workflow

bash validate-submission.sh
docker build -t ledgershield:dev .

If openenv is installed:

openenv validate

CI Expectations

The repo includes ../.github/workflows/ci.yml, which currently runs:

pytest on Python 3.11 and 3.12
Docker build + container smoke test
openenv.yaml metadata validation

Pytest configuration is centralized in ../pyproject.toml under [tool.pytest.ini_options]:

asyncio_mode = "strict" with asyncio_default_fixture_loop_scope = "function"
custom tests marker
deprecation-warning filters for websockets.legacy

If you change APIs, packaging, or runtime behavior, make sure CI still passes on a clean checkout without relying on any local-only setup.

Repo Map

Root files

Path	What it is for
`../README.md`	top-level benchmark overview and quick start
`../CHANGELOG.md`	human-readable project changes
`../Dockerfile`	container image definition for server deployment
`../pyproject.toml`	package metadata, dependencies, pytest config
`../requirements.txt`	pinned runtime dependencies
`../uv.lock`	lockfile for reproducible dependency installs
`../openenv.yaml`	OpenEnv metadata, novelty claims, published benchmark numbers
`../__init__.py`	package marker
`../client.py`	thin HTTP client wrapper for the environment
`../ledgershield_env.py`	compatibility re-export module for legacy imports
`../models.py`	shared dataclasses, Pydantic reward model, typed internal returns
`../openenv_compat.py`	adapter around `openenv-core` with local fallback server/client
`../inference.py`	submission-safe agent with `ModelCapabilityProfile` tiers, evidence grounding, and strict stdout contract
`../inference_improved.py`	experimental improved agent entrypoint
`../inference_llm_powered.py`	richer LLM-powered agent used for debugging and comparisons
`../llm_utils.py`	JSON parsing and completion helpers for LLM workflows
`../llm_judge_grader.py`	optional LLM-as-judge grading experiments
`../compare_models_live.py`	live multi-model comparison with capability profiles and monotonic strength checks
`../compare_all_models.py`	broader multi-model sweep helper with `--models`, `--output`, `--timeout`, and a `0.85`-aligned pass threshold
`../benchmark_report.py`	public benchmark, holdout, and contrastive report generation
`../generate_branch_comparison_report.py`	legacy reporting helper for saved branch comparison JSONs
`../generate_comparison_report.py`	legacy reporting helper for multi-model JSON summaries
`../generate_final_report.py`	legacy reporting helper for final comparison JSONs
`../generate_sota_report.py`	legacy reporting helper for SOTA comparison JSONs
`../task_c_guardrails.py`	Task C sanitization, composite signal detection, and constructive PAY evidence
`../task_d_guardrails.py`	Task D sanitization, composite signal detection, and constructive PAY evidence
`../test_scoring.py`	local baseline scoring simulation helper
`../validate_grader.py`	end-to-end grader and environment validation script
`../validate_agent_grading.py`	score-separation validation helper
`../validate-submission.sh`	pre-submission validator for Docker, server health, and stdout contract
`../live_model_comparison.json`	saved live comparison summary artifact

`server/`

Path	What it is for
`../server/__init__.py`	package marker
`../server/app.py`	FastAPI app builder and endpoint registration
`../server/environment.py`	main environment loop, reward shaping, truncation logic, rendering
`../server/world_state.py`	hidden/public state, artifacts, readiness, pressure resistance
`../server/tools.py`	investigation tool implementations, email-thread payload construction, domain alignment inference
`../server/transition_engine.py`	intervention handling and signal extraction
`../server/grading.py`	task-specific grading rubrics
`../server/trajectory_grading.py`	trajectory-aware scoring components
`../server/outcome_simulator.py`	downstream operational/fraud outcome simulation
`../server/risk_rules.py`	risk bucket logic and heuristic submission-risk assessment
`../server/pressure_events.py`	adversarial pressure-event templates and scoring
`../server/vendor_simulator.py`	callback vendor-response simulation
`../server/data_loader.py`	fixture loading, indexing, and generated-case injection
`../server/case_factory.py`	challenge/holdout/benign-twin generation
`../server/attack_library.py`	16 adversarial AP fraud attack templates
`../server/schema.py`	canonical field/action/reason-code constants and normalizers
`../server/currency_engine.py`	multi-currency realism utilities
`../server/compliance_engine.py`	SOX-style internal-control evaluation
`../server/curriculum.py`	dynamic difficulty adaptation
`../server/dual_agent_mode.py`	watchdog-mode dual-agent novelty module

`server/fixtures/`

Path	What it stores
`../server/fixtures/cases.json`	the 21 curated benchmark cases
`../server/fixtures/vendors.json`	vendor master data
`../server/fixtures/vendor_history.json`	historical vendor changes and fraud history
`../server/fixtures/po_records.json`	purchase-order records
`../server/fixtures/receipts.json`	goods-receipt records
`../server/fixtures/ledger_index.json`	ledger/payment history used for duplicate detection
`../server/fixtures/email_threads.json`	structured email-thread records
`../server/fixtures/policy_rules.json`	policy rules used by `lookup_policy`

`tests/`

Path	What it validates
`../tests/conftest.py`	shared fixtures and suite-wide pytest marker setup
`../tests/test_api_smoke.py`	API endpoint smoke coverage
`../tests/test_benchmark_report.py`	public/holdout/contrastive reporting behavior
`../tests/test_compare_all_models.py`	score parsing helpers in broad model sweeps
`../tests/test_compare_models_live.py`	live comparison stats, capability profiles, and rendering helpers
`../tests/test_compliance_engine.py`	SOX compliance evaluation
`../tests/test_currency_engine.py`	FX/IBAN/SWIFT/aging-report utilities
`../tests/test_curriculum.py`	curriculum tiering and case selection
`../tests/test_grading.py`	degenerate evidence cap and grading edge cases
`../tests/test_inference_contract.py`	required stdout contract for `inference.py`
`../tests/test_inference_llm_powered.py`	derived thread reasoning in LLM-powered inference
`../tests/test_inference_runtime.py`	model capability profiles and runtime heuristics
`../tests/test_ledgershield_env.py`	environment transitions, scoring, and holdout generation
`../tests/test_schema_reason_codes.py`	reason-code normalization and aliasing
`../tests/test_task_c_guardrails.py`	Task C submission guardrails and PAY evidence
`../tests/test_task_d_guardrails.py`	Task D submission guardrails and PAY evidence

`docs/`

Path	What it covers
`../docs/README.md`	docs landing page
`../docs/index.md`	benchmark overview
`../docs/tasks.md`	task contracts and scoring
`../docs/api-reference.md`	REST API reference
`../docs/architecture.md`	architecture deep dive
`../docs/development.md`	this file
`../docs/deployment.md`	deployment and runtime configuration

Common Workflows

Changing the environment

Touch at least these files:

server/environment.py
server/world_state.py
relevant tests in tests/test_ledgershield_env.py
docs in docs/api-reference.md or docs/architecture.md if the contract changed

Changing grading

Touch at least these files:

server/grading.py
server/trajectory_grading.py
any new utility modules such as server/compliance_engine.py
tests in tests/test_grading.py and task-specific regression tests

Adding benchmark realism

Typical landing spots:

server/currency_engine.py
server/compliance_engine.py
server/attack_library.py
server/case_factory.py
server/fixtures/cases.json

Updating inference behavior

Touch at least these files:

inference.py
inference_llm_powered.py if comparison/debug behavior must stay aligned
task_c_guardrails.py / task_d_guardrails.py if structured output rules changed
tests/test_inference_contract.py and relevant inference tests
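The file-to-test pairings in the workflows above can be captured in a small lookup. This is a minimal sketch: the mapping is taken from this guide, but `FOCUSED_TESTS` and `focused_pytest_command` are hypothetical names that do not exist in the repo.

```python
from pathlib import PurePosixPath

# Pairings taken from the workflows in this guide; the helper is hypothetical.
FOCUSED_TESTS = {
    "server/environment.py": ["tests/test_ledgershield_env.py"],
    "server/world_state.py": ["tests/test_ledgershield_env.py"],
    "server/grading.py": ["tests/test_grading.py"],
    "server/trajectory_grading.py": ["tests/test_grading.py"],
    "inference.py": ["tests/test_inference_contract.py"],
    "task_c_guardrails.py": ["tests/test_task_c_guardrails.py"],
    "task_d_guardrails.py": ["tests/test_task_d_guardrails.py"],
}


def focused_pytest_command(changed_files):
    """Build a pytest command covering the tests mapped to the changed files."""
    targets = []
    for path in changed_files:
        key = PurePosixPath(path).as_posix()
        for test in FOCUSED_TESTS.get(key, []):
            if test not in targets:  # keep order, drop duplicates
                targets.append(test)
    if not targets:
        # Unmapped files: fall back to the full suite.
        targets = ["tests/"]
    return ["python", "-m", "pytest", *targets, "-q"]


print(" ".join(focused_pytest_command(["server/grading.py", "inference.py"])))
```

A focused run like this is a quick first check; the full `python -m pytest tests/ -q` run is still what CI executes.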

Extension Guidance

Adding a new tool

Implement the tool in ../server/tools.py.
Add the action name to ../server/schema.py.
Add cost handling and dispatch in ../server/environment.py.
Add or update signal extraction in ../server/transition_engine.py if needed.
Add tests and update docs.
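The steps above touch three places: the tool implementation, the action name, and cost handling plus dispatch. The sketch below shows how those pieces might fit together in a single file. Everything in it is hypothetical (the registry, the `check_bank_change` tool, and the cost value); the real `server/tools.py`, `server/schema.py`, and `server/environment.py` may be organized quite differently.

```python
from typing import Any, Callable, Dict, Set

# Hypothetical stand-ins for the real modules:
ALLOWED_ACTIONS: Set[str] = set()        # schema: canonical action names
TOOL_COSTS: Dict[str, float] = {}        # environment: per-call cost
TOOLS: Dict[str, Callable[[Dict[str, Any]], Dict[str, Any]]] = {}  # tools


def register_tool(name: str, cost: float):
    """Register a tool's action name, cost, and implementation together."""
    def decorator(func):
        ALLOWED_ACTIONS.add(name)
        TOOL_COSTS[name] = cost
        TOOLS[name] = func
        return func
    return decorator


@register_tool("check_bank_change", cost=0.02)  # hypothetical tool and cost
def check_bank_change(payload: Dict[str, Any]) -> Dict[str, Any]:
    """Report whether the bank account on file differs from the requested one."""
    return {"changed": payload.get("old_iban") != payload.get("new_iban")}


def dispatch(action: str, payload: Dict[str, Any]) -> Dict[str, Any]:
    """Reject unknown actions, then run the tool and report its cost."""
    if action not in ALLOWED_ACTIONS:
        raise ValueError(f"unknown action: {action}")
    return {"result": TOOLS[action](payload), "cost": TOOL_COSTS[action]}


print(dispatch("check_bank_change", {"old_iban": "A", "new_iban": "B"}))
```

Whatever the real layout, a test that calls the new action through the dispatch path (not just the tool function directly) helps confirm that the schema and cost steps were not skipped.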

Adding a new case

Add it to ../server/fixtures/cases.json.
Ensure any needed vendor/PO/receipt/email/ledger fixtures exist.
Confirm case IDs are unique.
Update ./tasks.md if the public case catalog changed.
Add regression coverage.
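The uniqueness check in the steps above is easy to automate. The sketch below assumes that `cases.json` holds a list of objects, each with a `case_id` key; both the key name and the sample IDs are assumptions, so adjust them to the real fixture layout.

```python
import json
from collections import Counter


def find_duplicate_case_ids(cases, id_key="case_id"):
    """Return the sorted list of IDs that appear more than once."""
    counts = Counter(case[id_key] for case in cases)
    return sorted(cid for cid, n in counts.items() if n > 1)


# Inline sample standing in for server/fixtures/cases.json (IDs are made up).
sample = json.loads(
    '[{"case_id": "CASE-A-001"}, {"case_id": "CASE-B-001"}, {"case_id": "CASE-A-001"}]'
)
print(find_duplicate_case_ids(sample))  # prints ['CASE-A-001']
```

Against the real fixture, the same function could be called on `json.load(open("server/fixtures/cases.json"))`, assuming the file is a top-level list.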

Adding a new attack pattern

Extend ../server/attack_library.py.
Make sure the resulting reason codes and fraud flags are canonical.
Add tests that prove the attack is reachable and meaningful.
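One way to check that reason codes are canonical is to compare each template's codes against the canonical set. The sketch below uses made-up data: the canonical codes, the template shape, and the helper name are all hypothetical, and the real constants live in `server/schema.py`.

```python
# Hypothetical canonical set; the real constants live in server/schema.py.
CANONICAL_REASON_CODES = {"bank_override_attempt", "duplicate_near_match"}

# Hypothetical attack templates, shaped as plain dicts for illustration only.
ATTACK_TEMPLATES = [
    {"name": "bank_change_fraud", "reason_codes": ["bank_override_attempt"]},
    {"name": "near_duplicate", "reason_codes": ["Duplicate-Near-Match"]},
]


def non_canonical_codes(templates, canonical):
    """Map each template name to its reason codes missing from the canonical set."""
    problems = {}
    for template in templates:
        bad = [code for code in template["reason_codes"] if code not in canonical]
        if bad:
            problems[template["name"]] = bad
    return problems


print(non_canonical_codes(ATTACK_TEMPLATES, CANONICAL_REASON_CODES))
# prints {'near_duplicate': ['Duplicate-Near-Match']}
```

A regression test could assert that this mapping is empty for every template in the attack library.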

Practical Notes

The repo mixes benchmark runtime code with historical helper scripts. When making changes, start with the core runtime paths.
Some top-level report helpers are legacy utilities for saved JSON artifacts rather than part of the main runtime.
Keep docs and tests in sync with any public contract changes.