Spaces:
Sleeping
Sleeping
File size: 13,429 Bytes
007fbdd 1065f91 007fbdd 1065f91 007fbdd 1065f91 007fbdd 1065f91 007fbdd 1065f91 007fbdd 1065f91 007fbdd 1065f91 007fbdd 1065f91 007fbdd 1065f91 007fbdd 1065f91 007fbdd 1065f91 007fbdd 1065f91 007fbdd 1065f91 007fbdd 1065f91 5f7588b 1065f91 5f7588b 1065f91 5f7588b 1ed9b86 1065f91 5f7588b 1065f91 5f7588b 1065f91 5f7588b 1065f91 5f7588b 1065f91 007fbdd 1065f91 007fbdd 1065f91 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 | # Development Guide
This guide is for contributors working inside the LedgerShield repo. It covers setup, validation, CI expectations, and a detailed file map so it is easy to find the right place to make changes.
## Local Setup
### Prerequisites
- Python 3.11 or 3.12
- `git`
- Docker if you want container smoke tests
- an OpenAI-compatible endpoint only if you plan to run the LLM-powered comparison scripts
### Install
```bash
git clone https://github.com/BiradarScripts/Meta-s-LedgerShield.git
cd Meta-s-LedgerShield
python -m venv .venv
source .venv/bin/activate
pip install -e .
pip install -r requirements.txt
```
### Start the server
```bash
python -m server.app
```
### Run the test suite
```bash
python -m pytest tests/ -q
```
Useful focused runs:
```bash
python -m pytest tests/test_ledgershield_env.py -q
python -m pytest tests/test_grading.py tests/test_task_c_guardrails.py tests/test_task_d_guardrails.py -q
python -m pytest tests/test_currency_engine.py tests/test_compliance_engine.py tests/test_curriculum.py -q
```
### Validate packaging and submission workflow
```bash
bash validate-submission.sh
docker build -t ledgershield:dev .
```
If `openenv` is installed:
```bash
openenv validate
```
## CI Expectations
The repo includes [`../.github/workflows/ci.yml`](../.github/workflows/ci.yml), which currently runs:
- pytest on Python 3.11 and 3.12
- Docker build + container smoke test
- `openenv.yaml` metadata validation
Pytest configuration is centralized in [`../pyproject.toml`](../pyproject.toml) under `[tool.pytest.ini_options]`:
- `asyncio_mode = "strict"` with `asyncio_default_fixture_loop_scope = "function"`
- custom `tests` marker
- deprecation-warning filters for `websockets.legacy`
If you change APIs, packaging, or runtime behavior, assume CI should keep passing without special local context.
## Repo Map
### Root files
| Path | What it is for |
|---|---|
| [`../README.md`](../README.md) | top-level benchmark overview and quick start |
| [`../CHANGELOG.md`](../CHANGELOG.md) | human-readable project changes |
| [`../Dockerfile`](../Dockerfile) | container image definition for server deployment |
| [`../pyproject.toml`](../pyproject.toml) | package metadata, dependencies, pytest config |
| [`../requirements.txt`](../requirements.txt) | pinned runtime dependencies |
| [`../uv.lock`](../uv.lock) | lockfile for reproducible dependency installs |
| [`../openenv.yaml`](../openenv.yaml) | OpenEnv metadata, novelty claims, published benchmark numbers |
| [`../__init__.py`](../__init__.py) | package marker |
| [`../client.py`](../client.py) | thin HTTP client wrapper for the environment |
| [`../ledgershield_env.py`](../ledgershield_env.py) | compatibility re-export module for legacy imports |
| [`../models.py`](../models.py) | shared dataclasses, Pydantic reward model, typed internal returns |
| [`../openenv_compat.py`](../openenv_compat.py) | adapter around `openenv-core` with local fallback server/client |
| [`../inference.py`](../inference.py) | submission-safe agent with `ModelCapabilityProfile` tiers, evidence grounding, and strict stdout contract |
| [`../inference_improved.py`](../inference_improved.py) | experimental improved agent entrypoint |
| [`../inference_llm_powered.py`](../inference_llm_powered.py) | richer LLM-powered agent used for debugging and comparisons |
| [`../llm_utils.py`](../llm_utils.py) | JSON parsing and completion helpers for LLM workflows |
| [`../llm_judge_grader.py`](../llm_judge_grader.py) | optional LLM-as-judge grading experiments |
| [`../compare_models_live.py`](../compare_models_live.py) | live multi-model comparison with capability profiles and monotonic strength checks |
| [`../compare_all_models.py`](../compare_all_models.py) | broader multi-model sweep helper with `--models`, `--output`, `--timeout`, and a `0.85`-aligned pass threshold |
| [`../benchmark_report.py`](../benchmark_report.py) | public benchmark, holdout, and contrastive report generation |
| [`../generate_branch_comparison_report.py`](../generate_branch_comparison_report.py) | legacy reporting helper for saved branch comparison JSONs |
| [`../generate_comparison_report.py`](../generate_comparison_report.py) | legacy reporting helper for multi-model JSON summaries |
| [`../generate_final_report.py`](../generate_final_report.py) | legacy reporting helper for final comparison JSONs |
| [`../generate_sota_report.py`](../generate_sota_report.py) | legacy reporting helper for SOTA comparison JSONs |
| [`../task_c_guardrails.py`](../task_c_guardrails.py) | Task C sanitization, composite signal detection, and constructive PAY evidence |
| [`../task_d_guardrails.py`](../task_d_guardrails.py) | Task D sanitization, composite signal detection, and constructive PAY evidence |
| [`../test_scoring.py`](../test_scoring.py) | local baseline scoring simulation helper |
| [`../validate_grader.py`](../validate_grader.py) | end-to-end grader and environment validation script |
| [`../validate_agent_grading.py`](../validate_agent_grading.py) | score-separation validation helper |
| [`../validate-submission.sh`](../validate-submission.sh) | pre-submission validator for Docker, server health, and stdout contract |
| [`../live_model_comparison.json`](../live_model_comparison.json) | saved live comparison summary artifact |
### `server/`
| Path | What it is for |
|---|---|
| [`../server/__init__.py`](../server/__init__.py) | package marker |
| [`../server/app.py`](../server/app.py) | FastAPI app builder and endpoint registration |
| [`../server/environment.py`](../server/environment.py) | main environment loop, reward shaping, truncation logic, rendering |
| [`../server/world_state.py`](../server/world_state.py) | hidden/public state, artifacts, readiness, pressure resistance |
| [`../server/tools.py`](../server/tools.py) | investigation tool implementations, email-thread payload construction, domain alignment inference |
| [`../server/transition_engine.py`](../server/transition_engine.py) | intervention handling and signal extraction |
| [`../server/grading.py`](../server/grading.py) | task-specific grading rubrics |
| [`../server/trajectory_grading.py`](../server/trajectory_grading.py) | trajectory-aware scoring components |
| [`../server/outcome_simulator.py`](../server/outcome_simulator.py) | downstream operational/fraud outcome simulation |
| [`../server/risk_rules.py`](../server/risk_rules.py) | risk bucket logic and heuristic submission-risk assessment |
| [`../server/pressure_events.py`](../server/pressure_events.py) | adversarial pressure-event templates and scoring |
| [`../server/vendor_simulator.py`](../server/vendor_simulator.py) | callback vendor-response simulation |
| [`../server/data_loader.py`](../server/data_loader.py) | fixture loading, indexing, and generated-case injection |
| [`../server/case_factory.py`](../server/case_factory.py) | challenge/holdout/benign-twin generation |
| [`../server/attack_library.py`](../server/attack_library.py) | 16 adversarial AP fraud attack templates |
| [`../server/schema.py`](../server/schema.py) | canonical field/action/reason-code constants and normalizers |
| [`../server/currency_engine.py`](../server/currency_engine.py) | multi-currency realism utilities |
| [`../server/compliance_engine.py`](../server/compliance_engine.py) | SOX-style internal-control evaluation |
| [`../server/curriculum.py`](../server/curriculum.py) | dynamic difficulty adaptation |
| [`../server/dual_agent_mode.py`](../server/dual_agent_mode.py) | watchdog-mode dual-agent novelty module |
### `server/fixtures/`
| Path | What it stores |
|---|---|
| [`../server/fixtures/cases.json`](../server/fixtures/cases.json) | the 21 curated benchmark cases |
| [`../server/fixtures/vendors.json`](../server/fixtures/vendors.json) | vendor master data |
| [`../server/fixtures/vendor_history.json`](../server/fixtures/vendor_history.json) | historical vendor changes and fraud history |
| [`../server/fixtures/po_records.json`](../server/fixtures/po_records.json) | purchase-order records |
| [`../server/fixtures/receipts.json`](../server/fixtures/receipts.json) | goods-receipt records |
| [`../server/fixtures/ledger_index.json`](../server/fixtures/ledger_index.json) | ledger/payment history used for duplicate detection |
| [`../server/fixtures/email_threads.json`](../server/fixtures/email_threads.json) | structured email-thread records |
| [`../server/fixtures/policy_rules.json`](../server/fixtures/policy_rules.json) | policy rules used by `lookup_policy` |
### `tests/`
| Path | What it validates |
|---|---|
| [`../tests/conftest.py`](../tests/conftest.py) | shared fixtures and suite-wide pytest marker setup |
| [`../tests/test_api_smoke.py`](../tests/test_api_smoke.py) | API endpoint smoke coverage |
| [`../tests/test_benchmark_report.py`](../tests/test_benchmark_report.py) | public/holdout/contrastive reporting behavior |
| [`../tests/test_compare_all_models.py`](../tests/test_compare_all_models.py) | score parsing helpers in broad model sweeps |
| [`../tests/test_compare_models_live.py`](../tests/test_compare_models_live.py) | live comparison stats, capability profiles, and rendering helpers |
| [`../tests/test_compliance_engine.py`](../tests/test_compliance_engine.py) | SOX compliance evaluation |
| [`../tests/test_currency_engine.py`](../tests/test_currency_engine.py) | FX/IBAN/SWIFT/aging-report utilities |
| [`../tests/test_curriculum.py`](../tests/test_curriculum.py) | curriculum tiering and case selection |
| [`../tests/test_grading.py`](../tests/test_grading.py) | degenerate evidence cap and grading edge cases |
| [`../tests/test_inference_contract.py`](../tests/test_inference_contract.py) | required stdout contract for `inference.py` |
| [`../tests/test_inference_llm_powered.py`](../tests/test_inference_llm_powered.py) | derived thread reasoning in LLM-powered inference |
| [`../tests/test_inference_runtime.py`](../tests/test_inference_runtime.py) | model capability profiles and runtime heuristics |
| [`../tests/test_ledgershield_env.py`](../tests/test_ledgershield_env.py) | environment transitions, scoring, and holdout generation |
| [`../tests/test_schema_reason_codes.py`](../tests/test_schema_reason_codes.py) | reason-code normalization and aliasing |
| [`../tests/test_task_c_guardrails.py`](../tests/test_task_c_guardrails.py) | Task C submission guardrails and PAY evidence |
| [`../tests/test_task_d_guardrails.py`](../tests/test_task_d_guardrails.py) | Task D submission guardrails and PAY evidence |
### `docs/`
| Path | What it covers |
|---|---|
| [`../docs/README.md`](../docs/README.md) | docs landing page |
| [`../docs/index.md`](../docs/index.md) | benchmark overview |
| [`../docs/tasks.md`](../docs/tasks.md) | task contracts and scoring |
| [`../docs/api-reference.md`](../docs/api-reference.md) | REST API reference |
| [`../docs/architecture.md`](../docs/architecture.md) | architecture deep dive |
| [`../docs/development.md`](../docs/development.md) | this file |
| [`../docs/deployment.md`](../docs/deployment.md) | deployment and runtime configuration |
## Common Workflows
### Changing the environment
Touch at least these files:
- `server/environment.py`
- `server/world_state.py`
- relevant tests in `tests/test_ledgershield_env.py`
- docs in `docs/api-reference.md` or `docs/architecture.md` if the contract changed
### Changing grading
Touch at least these files:
- `server/grading.py`
- `server/trajectory_grading.py`
- any new utility modules such as `server/compliance_engine.py`
- tests in `tests/test_grading.py` and task-specific regression tests
### Adding benchmark realism
Typical landing spots:
- `server/currency_engine.py`
- `server/compliance_engine.py`
- `server/attack_library.py`
- `server/case_factory.py`
- `server/fixtures/cases.json`
### Updating inference behavior
Touch at least these files:
- `inference.py`
- `inference_llm_powered.py` if comparison/debug behavior must stay aligned
- `task_c_guardrails.py` / `task_d_guardrails.py` if structured output rules changed
- `tests/test_inference_contract.py` and relevant inference tests
## Extension Guidance
### Adding a new tool
1. Implement the tool in [`../server/tools.py`](../server/tools.py).
2. Add the action name to [`../server/schema.py`](../server/schema.py).
3. Add cost handling and dispatch in [`../server/environment.py`](../server/environment.py).
4. Add or update signal extraction in [`../server/transition_engine.py`](../server/transition_engine.py) if needed.
5. Add tests and update docs.
### Adding a new case
1. Add it to [`../server/fixtures/cases.json`](../server/fixtures/cases.json).
2. Ensure any needed vendor/PO/receipt/email/ledger fixtures exist.
3. Confirm case IDs are unique.
4. Update [`./tasks.md`](./tasks.md) if the public case catalog changed.
5. Add regression coverage.
### Adding a new attack pattern
1. Extend [`../server/attack_library.py`](../server/attack_library.py).
2. Make sure the resulting reason codes and fraud flags are canonical.
3. Add tests that prove the attack is reachable and meaningful.
## Practical Notes
- The repo uses a mix of benchmark runtime code and historical helper scripts. Prefer editing the core runtime paths first.
- Some top-level report helpers are legacy utilities for saved JSON artifacts rather than part of the main runtime.
- Keep docs and tests in sync with any public contract changes.
|