prithic07 committed on
Commit
0dc27c9
·
1 Parent(s): 291987c

Docs: Fully restore technical specifications and tables

  1. README.md +76 -33
README.md CHANGED
@@ -1,3 +1,12 @@
  # ContextPrune: Adaptive Context Garbage Collection for RAG

  ContextPrune is a benchmark environment designed to solve the **"Attention Dilution"** problem in Large Language Model (LLM) workflows. It treats context management as a form of **Garbage Collection**, where the system identifies, filters, and compresses information to maintain high signal-to-noise ratios in RAG pipelines.
@@ -24,7 +33,7 @@ graph TD

  ## 2. Methodology: The Operational Loop

- ContextPrune enforces a 5-staged workflow that mirrors enterprise incident response. Each stage is designed to penalize laziness and reward systematic evidence handling.

  | Stage | Action | Rationale |
  | :--- | :--- | :--- |
@@ -36,63 +45,97 @@ ContextPrune enforces a 5-staged workflow that mirrors enterprise incident respo

  ---

- ## 3. Reward Engineering (The Benchmarking Grader)

- The environment calculates a weighted score (0.0 - 1.0) based on 8 distinct metrics. This ensures that a high score represents not just a "correct" answer, but an **optimal trajectory**.

- - **Required Coverage (24%)**: Inclusion of critical "Gold" artifacts identified in `tasks.py`.
- - **Cross-Domain Variety (12%)**: Rewards agents that correlate evidence across Support, Incident logs, and Release guardrails.
- - **Triage Thoroughness (12%)**: Penalizes agents that skip the inspection phase and blindly prioritize.
- - **Planning Logic (16%)**: Measures alignment between the drafted plan and the ground truth operational steps.
- - **Reporting Accuracy (18%)**: Presence of mission-critical operational keywords.
- - **Citation Fidelity (10%)**: Verification that claimed evidence is actually present in the working set.
- - **Token Efficiency (8%)**: Scaled bonus for solving the task with the smallest possible context.
- - **Hallucination Penalty (-18%)**: Severe deduction for claims made in the final report that lack any evidence in the prioritized chunks.

  ---

- ## 4. Scenario Benchmarks

- ContextPrune includes three canonical tasks that simulate high-pressure operational incidents:

- ### **[Hard] Executive Escalation: Suspected Admin Compromise**
- - **Objective**: Balance immediate customer protection, evidence preservation, and release safeguards.
- - **Budget**: 360 Tokens (Extremely Tight).
- - **Core Challenge**: Correlating suspicious incident logs with release-engineering change freezes across disjointed domains.

- ### **[Medium] Cross-Functional Outage Brief**
- - **Objective**: Align Support, Incident Command, and Release Engineering during a payment processing failure.
- - **Budget**: 620 Tokens.
- - **Core Challenge**: Filtering through overlapping narratives to find the "single source of truth" for customer comms.

- ### **[Easy] Refund Triage Memo**
- - **Objective**: Determine refund eligibility from support policies and outage impact artifacts.
- - **Budget**: 850 Tokens.
- - **Core Challenge**: Systematic inspection of policy artifacts to ensure relief is justified before escalation.

  ---

- ## 5. Technical Components

- - **`rag_optimizer_env/`**: Core state management, hybrid retrieval (Keyword + Semantic), and token estimation using `llm_runtime`.
- - **`app.py`**: A standard FastAPI implementation. Built for Context Optimization Research.
- - **`inference.py`**: A baseline agent script demonstrating how to use the OpenAI-compatible interface.
- - **`validate.py`**: A robust validation suite that runs a full episode lifecycle locally to ensure 100% environment compliance.

  ---

  ## 🚀 Quick Start

  1. **Setup**: `pip install -r requirements.txt`
- 2. **Server**: `python app.py` (Runs on Port 8000)
  3. **Control Panel**: `streamlit run optimizer_ui.py`
  4. **Validation**: `python validate.py`

-
  ---

  ## 🌎 Live Deployment

  - **Space URL**: [huggingface.co/spaces/prithic07/context-prune](https://huggingface.co/spaces/prithic07/context-prune)
  - **Direct App Link**: [prithic07-context-prune.hf.space](https://prithic07-context-prune.hf.space/)
- - **Space Repo**: `prithic07/context-prune`
 
 
 
+ ---
+ title: ContextPrune
+ emoji: 🧹
+ colorFrom: blue
+ colorTo: indigo
+ sdk: docker
+ pinned: false
+ ---
+
  # ContextPrune: Adaptive Context Garbage Collection for RAG

  ContextPrune is a benchmark environment designed to solve the **"Attention Dilution"** problem in Large Language Model (LLM) workflows. It treats context management as a form of **Garbage Collection**, where the system identifies, filters, and compresses information to maintain high signal-to-noise ratios in RAG pipelines.
 

  ## 2. Methodology: The Operational Loop

+ ContextPrune enforces a 5-staged workflow that mirrors enterprise incident response.

  | Stage | Action | Rationale |
  | :--- | :--- | :--- |
 

  ---

+ ## 3. Observation Space

+ The `RagObservation` provides the agent with the internal state of the incident and the current working set budget.

+ | Field | Type | Description |
+ | :--- | :--- | :--- |
+ | `case_id` | `str` | Unique simulated case identifier |
+ | `case_summary` | `str` | Real-world case context and background |
+ | `objective` | `str` | Specific deliverable the agent must produce |
+ | `workflow_stage` | `triage \| analysis \| resolution \| submitted` | Current stage in the operational loop |
+ | `customer_tier` | `standard \| business \| enterprise` | Customer criticality and SLA priority |
+ | `incident_severity` | `sev3 \| sev2 \| sev1` | Impact magnitude of the incident |
+ | `available_artifacts` | `List[ChunkSummary]` | Metadata for artifacts available for inspection |
+ | `reviewed_artifacts` | `List[str]` | IDs of artifacts already triaged |
+ | `prioritized_artifacts` | `List[str]` | IDs of artifacts currently in the working set |
+ | `plan_draft` | `Optional[str]` | Current state of the resolution plan |
+ | `total_tokens_used` | `int` | Current token cost of the working set |
+ | `token_budget` | `int` | Maximum allowed token budget |
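
Reviewer note: the observation table above maps naturally onto a typed record. The following is a minimal sketch using standard-library dataclasses, not the project's actual class. The field names and types come from the table; the contents of `ChunkSummary`, the default values, and the `remaining_budget` helper are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Literal, Optional


@dataclass
class ChunkSummary:
    # Hypothetical fields: the README does not list what ChunkSummary holds.
    artifact_id: str
    token_estimate: int


@dataclass
class RagObservation:
    # Field names and types follow the Observation Space table.
    case_id: str
    case_summary: str
    objective: str
    workflow_stage: Literal["triage", "analysis", "resolution", "submitted"]
    customer_tier: Literal["standard", "business", "enterprise"]
    incident_severity: Literal["sev3", "sev2", "sev1"]
    token_budget: int
    available_artifacts: List[ChunkSummary] = field(default_factory=list)
    reviewed_artifacts: List[str] = field(default_factory=list)
    prioritized_artifacts: List[str] = field(default_factory=list)
    plan_draft: Optional[str] = None
    total_tokens_used: int = 0

    def remaining_budget(self) -> int:
        """Illustrative helper: tokens left before the working set hits the budget."""
        return self.token_budget - self.total_tokens_used


# Example observation with made-up case data.
obs = RagObservation(
    case_id="case-001",
    case_summary="Suspected admin compromise reported by an enterprise customer.",
    objective="Draft an escalation report.",
    workflow_stage="triage",
    customer_tier="enterprise",
    incident_severity="sev1",
    token_budget=360,
    total_tokens_used=120,
)
print(obs.remaining_budget())
```

An agent could consult a helper like `remaining_budget()` before each `prioritize_artifact` call to avoid overshooting `token_budget`.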
+
+ ---
+
+ ## 4. Action Space
+
+ Agents interact with the environment through the following canonical actions:
+
+ | Action Type | Parameters | Effect |
+ | :--- | :--- | :--- |
+ | `inspect_artifact` | `artifact_id` | Review artifact keywords without committing to the working set |
+ | `prioritize_artifact` | `artifact_id` | Add a reviewed artifact to the working set (consumes tokens) |
+ | `summarize_artifact` | `artifact_id`, `ratio` | Compress a prioritized artifact using AI summarization |
+ | `set_resolution_plan` | `plan` | Update the draft plan before final submission |
+ | `submit_report` | `answer` | Generate final response and terminate the episode |
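
Reviewer note: a small builder can make the action table above concrete. This sketch validates parameter names against the table before producing a payload. The action types and parameter names are taken from the table; the payload shape (a flat dict with an `action_type` key) and the `build_action` helper are assumptions, since the diff does not show the wire format the server expects.

```python
from typing import Any, Dict

# Parameter names per action type, copied from the Action Space table.
ACTION_PARAMS = {
    "inspect_artifact": ("artifact_id",),
    "prioritize_artifact": ("artifact_id",),
    "summarize_artifact": ("artifact_id", "ratio"),
    "set_resolution_plan": ("plan",),
    "submit_report": ("answer",),
}


def build_action(action_type: str, **params: Any) -> Dict[str, Any]:
    """Check parameters against the table and return a payload (hypothetical shape)."""
    if action_type not in ACTION_PARAMS:
        raise ValueError(f"unknown action type: {action_type}")
    expected = set(ACTION_PARAMS[action_type])
    if set(params) != expected:
        raise ValueError(f"{action_type} expects parameters {sorted(expected)}")
    return {"action_type": action_type, **params}


# One plausible episode: inspect, prioritize, compress, plan, submit.
episode = [
    build_action("inspect_artifact", artifact_id="a1"),
    build_action("prioritize_artifact", artifact_id="a1"),
    build_action("summarize_artifact", artifact_id="a1", ratio=0.5),
    build_action("set_resolution_plan", plan="Freeze releases, notify customer."),
    build_action("submit_report", answer="Final report text."),
]
```

The ordering in `episode` mirrors the table's implied flow: an artifact is inspected before it is prioritized, and summarized only after it is in the working set.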

  ---

+ ## 5. Reward Engineering (The Benchmarking Grader)

+ The environment calculates a weighted score (0.0 - 1.0) based on 8 distinct metrics.

+ - **Required Coverage (24%)**: Inclusion of critical "Gold" artifacts.
+ - **Cross-Domain Variety (12%)**: Rewards correlation across Support, Incident logs, and Release guardrails.
+ - **Triage Thoroughness (12%)**: Penalizes skipping the inspection phase.
+ - **Planning Logic (16%)**: Alignment between the drafted plan and ground truth steps.
+ - **Reporting Accuracy (18%)**: Presence of mission-critical operational keywords.
+ - **Citation Fidelity (10%)**: Verification that claimed evidence is in the working set.
+ - **Token Efficiency (8%)**: Scaled bonus for minimal context usage.
+ - **Hallucination Penalty (-18%)**: Severe deduction for unsupported claims.
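
Reviewer note: the seven positive weights above sum to 100% (24 + 12 + 12 + 16 + 18 + 10 + 8), with the hallucination penalty applied on top. The sketch below shows one way such a score could be combined. The weights are from the list; the linear combination, the per-metric values in [0, 1], the clamping to the 0.0 to 1.0 range, and all identifier names are assumptions, not the grader's confirmed implementation.

```python
# Weights copied from the metric list; keys are illustrative names.
WEIGHTS = {
    "required_coverage": 0.24,
    "cross_domain_variety": 0.12,
    "triage_thoroughness": 0.12,
    "planning_logic": 0.16,
    "reporting_accuracy": 0.18,
    "citation_fidelity": 0.10,
    "token_efficiency": 0.08,
}
HALLUCINATION_PENALTY = 0.18


def weighted_score(metrics, hallucination=0.0):
    """Combine per-metric values in [0, 1] into one score (assumed linear form)."""
    positive = sum(WEIGHTS[name] * metrics.get(name, 0.0) for name in WEIGHTS)
    raw = positive - HALLUCINATION_PENALTY * hallucination
    # Clamp so the result stays inside the documented 0.0 to 1.0 range.
    return max(0.0, min(1.0, raw))


perfect = {name: 1.0 for name in WEIGHTS}
clean_run = weighted_score(perfect)                       # close to 1.0
penalized_run = weighted_score(perfect, hallucination=1.0)  # close to 0.82
```

Under these assumptions, an otherwise perfect trajectory with a fully triggered hallucination penalty lands near 0.82, which is below what a run could score by skipping Token Efficiency and Citation Fidelity entirely.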
+
+ ---

+ ## 6. Scenario Benchmarks

+ | Task | Difficulty | Steps | Budget | Key Challenge |
+ | :--- | :--- | :--- | :--- | :--- |
+ | `refund_triage_easy` | Easy | 7 | 850 | Systematically checking policy artifacts before relief. |
+ | `cross_function_brief_medium` | Medium | 8 | 620 | Filtering overlapping narratives for a singular source of truth. |
+ | `executive_escalation_hard` | Hard | 10 | 360 | Correlating suspicious logs with release freezes on a tight budget. |
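
Reviewer note: the scenario table above can be read as a small task registry. This sketch encodes the three rows and adds a budget check. Task IDs, step counts, and budgets come from the table; the `TaskSpec` type, the `fits_budget` helper, and the reading of "Steps" as a per-episode step count are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskSpec:
    difficulty: str
    steps: int         # "Steps" column; assumed to be the per-episode step count
    token_budget: int  # "Budget" column, in tokens


# Rows copied from the Scenario Benchmarks table.
TASKS = {
    "refund_triage_easy": TaskSpec("Easy", 7, 850),
    "cross_function_brief_medium": TaskSpec("Medium", 8, 620),
    "executive_escalation_hard": TaskSpec("Hard", 10, 360),
}


def fits_budget(task_id: str, working_set_tokens: int, artifact_tokens: int) -> bool:
    """Return True if adding an artifact keeps the working set within budget."""
    return working_set_tokens + artifact_tokens <= TASKS[task_id].token_budget


# The same 300-token working set leaves very different headroom per task.
for task_id in TASKS:
    print(task_id, fits_budget(task_id, working_set_tokens=300, artifact_tokens=100))
```

The loop illustrates why the hard task is tight: a 100-token artifact on top of a 300-token working set fits the easy and medium budgets but not the 360-token one.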

  ---

+ ## 7. Configuration & Environment

+ ### Environment Variables
+ | Variable | Default | Purpose |
+ | :--- | :--- | :--- |
+ | `API_BASE_URL` | `https://router.huggingface.co/v1` | OpenAI-compatible inference endpoint |
+ | `MODEL_NAME` | `Qwen/Qwen2.5-72B-Instruct` | Model used for baseline tasks |
+ | `HF_TOKEN` | *None* | Authentication for Hugging Face Inference API |
+ | `RAG_ENV_URL` | `http://localhost:7860` | Base URL for the ContextPrune server |
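
Reviewer note: a client script could resolve the variables in the table above roughly as follows. Variable names and defaults are taken from the table; the `load_config` helper is hypothetical and is not claimed to match how `inference.py` reads its settings.

```python
import os
from typing import Dict, Mapping, Optional

# Defaults copied from the Environment Variables table.
DEFAULTS = {
    "API_BASE_URL": "https://router.huggingface.co/v1",
    "MODEL_NAME": "Qwen/Qwen2.5-72B-Instruct",
    "RAG_ENV_URL": "http://localhost:7860",
}


def load_config(env: Optional[Mapping[str, str]] = None) -> Dict[str, Optional[str]]:
    """Resolve settings from the environment, falling back to the table defaults."""
    env = os.environ if env is None else env
    config: Dict[str, Optional[str]] = {
        name: env.get(name, default) for name, default in DEFAULTS.items()
    }
    # HF_TOKEN has no default; it stays None when unset.
    config["HF_TOKEN"] = env.get("HF_TOKEN")
    return config


config = load_config()
if config["HF_TOKEN"] is None:
    print("HF_TOKEN is not set; authenticated inference calls would likely fail.")
```

Passing a plain dict as `env` makes the helper easy to exercise without touching the real process environment.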
+
+ ### Project Components
+ - **`rag_optimizer_env/`**: State machine, hybrid retrieval, and token estimation.
+ - **`app.py`**: FastAPI implementation for remote agent interaction.
+ - **`inference.py`**: Baseline agent script (OpenAI-compatible).
+ - **`validate.py`**: Robust validation suite for episode lifecycle verification.

  ---

  ## 🚀 Quick Start

  1. **Setup**: `pip install -r requirements.txt`
+ 2. **Server**: `python app.py` (Runs on Port 7860)
  3. **Control Panel**: `streamlit run optimizer_ui.py`
  4. **Validation**: `python validate.py`

  ---

  ## 🌎 Live Deployment

  - **Space URL**: [huggingface.co/spaces/prithic07/context-prune](https://huggingface.co/spaces/prithic07/context-prune)
  - **Direct App Link**: [prithic07-context-prune.hf.space](https://prithic07-context-prune.hf.space/)
+ - **Space Repo ID**: `prithic07/context-prune`
+
+ Built for Context Optimization Research.