Commit ·
8bc6c43
1
Parent(s): f95e25b
fix: version bump to 0.4.0, add citation_verification task
- README.md +47 -75
- __init__.py +1 -1
- pyproject.toml +1 -1
- server/app.py +2 -2
README.md
CHANGED
|
@@ -1,13 +1,3 @@
|
|
| 1 |
-
---
|
| 2 |
-
title: ScholarEnv
|
| 3 |
-
emoji: 🔬
|
| 4 |
-
colorFrom: blue
|
| 5 |
-
colorTo: purple
|
| 6 |
-
sdk: docker
|
| 7 |
-
pinned: false
|
| 8 |
-
license: apache-2.0
|
| 9 |
-
---
|
| 10 |
-
|
| 11 |
<div align="center">
|
| 12 |
|
| 13 |
# 🔬 ScholarEnv
|
|
@@ -19,10 +9,11 @@ license: apache-2.0
|
|
| 19 |
[](LICENSE)
|
| 20 |
[](#four-tasks)
|
| 21 |
[](#testing)
|
|
|
|
| 22 |
|
| 23 |
**An AI agent that investigates papers – not one that produces them.**
|
| 24 |
|
| 25 |
-
[API Reference](#api-reference) · [Quick Start](#quick-start) · [Research](#research-foundation)
|
| 26 |
|
| 27 |
---
|
| 28 |
|
|
@@ -40,8 +31,8 @@ license: apache-2.0
|
|
| 40 |
|
| 41 |
The key insight: **LLMs are already good at formatting. They fail at auditing.**
|
| 42 |
|
| 43 |
-
Ask GPT-4o to format a manuscript → scores ~0.92 with no training.
|
| 44 |
-
Ask GPT-4o to find numerical claim mismatches in a paper → scores **0.20–0.45**.
|
| 45 |
|
| 46 |
That gap is exactly where RL adds value. The agent must discover a document traversal strategy – which sections to read first, which tables to cross-reference – that **varies by paper structure and cannot be reduced to a fixed prompt**. RL finds this strategy. Prompting cannot.
|
| 47 |
|
|
@@ -56,9 +47,9 @@ Formatting → Consistency → Claim Audit → Citation Check
|
|
| 56 |
|
| 57 |
| Task | What the agent does | Frontier baseline | RL target |
|
| 58 |
|------|-------------------|-------------------|-----------|
|
| 59 |
-
| `formatting_compliance` | Fix IEEE formatting violations | 0.80–0.95 | 0.95+ |
|
| 60 |
-
| `internal_consistency` | Find where paper contradicts itself | 0.40–0.65 | 0.65–0.80 |
|
| 61 |
-
| `claim_evidence_audit` | Find where text claims
|
| 62 |
| `citation_verification` | Identify ghost and misattributed references | 0.35–0.60 | 0.65–0.80 |
|
| 63 |
|
| 64 |
Task 3's low baseline is the core RL contribution – it proves genuine training headroom exists.
|
|
@@ -68,7 +59,6 @@ Task 3's low baseline is the core RL contribution – it proves genuine training
|
|
| 68 |
## Reward Design
|
| 69 |
|
| 70 |
### Task 1 – Progressive Reward Shaping (PRS)
|
| 71 |
-
|
| 72 |
Three stages unlock sequentially. Stage N only contributes when Stage N-1 ≥ threshold. Prevents GRPO gradient collapse.
|
| 73 |
|
| 74 |
```
|
|
@@ -77,41 +67,27 @@ Stage 2 | weight 0.35 | threshold 0.60 | Section order, word limits, capti
|
|
| 77 |
Stage 3 | weight 0.25 | threshold 0.70 | IEEE citations, author block, keywords
|
| 78 |
```
|
| 79 |
|
| 80 |
-
> Based on: [arXiv 2512.07478](https://arxiv.org/abs/2512.07478) – PRS for Agentic RL
|
| 81 |
-
|
| 82 |
### Tasks 2 & 3 – F-beta + Potential-Based Reward Shaping
|
| 83 |
-
|
| 84 |
**F-beta (β=0.5)** weights precision 4× over recall – prevents hallucination gaming:
|
| 85 |
-
|
| 86 |
```
|
| 87 |
-
F_β(
|
| 88 |
-
F_β(
|
| 89 |
```
|
| 90 |
|
| 91 |
-
**PBRS** (Ng et al., ICML 1999) gives dense intermediate rewards
|
| 92 |
-
|
| 93 |
```
|
| 94 |
Φ(s) = 0.30 × sections_read/total + 0.30 × tables_checked/total + 0.40 × claims_extracted/est
|
| 95 |
-
F(s,s') = γ·Φ(s') − Φ(s)  ← policy-invariant,
|
| 96 |
```
|
| 97 |
|
| 98 |
### Curriculum – AdaRFT + UCB1
|
| 99 |
-
|
| 100 |
-
Keeps agent in productive zone (avg score 0.40–0.70). UCB1 maximises **learning gradient** (reward variance), not mean reward.
|
| 101 |
-
|
| 102 |
-
```
|
| 103 |
-
avg > 0.70 → select harder papers
|
| 104 |
-
avg < 0.40 → select easier papers
|
| 105 |
-
```
|
| 106 |
-
|
| 107 |
-
> Based on: [arXiv 2504.05520](https://arxiv.org/abs/2504.05520) – AdaRFT Adaptive Data Selection
|
| 108 |
|
| 109 |
---
|
| 110 |
|
| 111 |
## Quick Start
|
| 112 |
|
| 113 |
### Install
|
| 114 |
-
|
| 115 |
```bash
|
| 116 |
git clone https://github.com/Nensi1311/research-paper-formatter-agent
|
| 117 |
cd research-paper-formatter-agent
|
|
@@ -119,27 +95,25 @@ pip install -r requirements.txt
|
|
| 119 |
```
|
| 120 |
|
| 121 |
### Generate corpus
|
| 122 |
-
|
| 123 |
```bash
|
| 124 |
python scripts/generate_corpus.py
|
| 125 |
```
|
| 126 |
|
| 127 |
### Run tests
|
| 128 |
-
|
| 129 |
```bash
|
| 130 |
python tests/test_all.py
|
| 131 |
# ✓ ALL TESTS PASSED (63/63)
|
| 132 |
```
|
| 133 |
|
| 134 |
### Start server
|
| 135 |
-
|
| 136 |
```bash
|
| 137 |
uvicorn server.app:app --host 0.0.0.0 --port 7860
|
| 138 |
```
|
| 139 |
|
| 140 |
-
### Test
|
| 141 |
-
|
| 142 |
```bash
|
| 143 |
for task in formatting_compliance internal_consistency claim_evidence_audit citation_verification; do
|
| 144 |
curl -s -X POST localhost:7860/reset \
|
| 145 |
-H "Content-Type: application/json" \
|
|
@@ -148,9 +122,10 @@ for task in formatting_compliance internal_consistency claim_evidence_audit cita
|
|
| 148 |
done
|
| 149 |
```
|
| 150 |
|
| 151 |
-
### Test
|
| 152 |
-
|
| 153 |
```powershell
|
| 154 |
foreach ($task in @("formatting_compliance","internal_consistency","claim_evidence_audit","citation_verification")) {
|
| 155 |
$body = '{"task_id":"' + $task + '"}'
|
| 156 |
$r = Invoke-RestMethod -Uri "http://localhost:7860/reset" -Method POST -ContentType "application/json" -Body $body
|
|
@@ -159,7 +134,6 @@ foreach ($task in @("formatting_compliance","internal_consistency","claim_eviden
|
|
| 159 |
```
|
| 160 |
|
| 161 |
### Docker
|
| 162 |
-
|
| 163 |
```bash
|
| 164 |
docker build -t scholar-env .
|
| 165 |
docker run -p 7860:7860 scholar-env
|
|
@@ -167,12 +141,11 @@ curl http://localhost:7860/health
|
|
| 167 |
```
|
| 168 |
|
| 169 |
### Run baseline agent
|
| 170 |
-
|
| 171 |
```bash
|
| 172 |
export API_BASE_URL="https://api-inference.huggingface.co/v1"
|
| 173 |
export MODEL_NAME="meta-llama/Llama-3.1-8B-Instruct"
|
| 174 |
export HF_TOKEN="hf_your_token"
|
| 175 |
-
export HF_SPACE_URL="https://
|
| 176 |
|
| 177 |
python inference.py
|
| 178 |
# Writes: baseline_scores.json
|
|
@@ -183,16 +156,14 @@ python inference.py
|
|
| 183 |
## API Reference
|
| 184 |
|
| 185 |
### `POST /reset`
|
| 186 |
-
|
| 187 |
```json
|
| 188 |
{"task_id": "formatting_compliance"}
|
| 189 |
```
|
| 190 |
-
|
| 191 |
-
Returns observation with `manuscript_text`, `style_guide`, `step_count`, `max_steps`, `hint`.
|
| 192 |
|
| 193 |
### `POST /step`
|
| 194 |
|
| 195 |
-
**Task 1
|
| 196 |
```json
|
| 197 |
{"task": "formatting_compliance", "formatted_text": "...full reformatted manuscript..."}
|
| 198 |
```
|
|
@@ -204,7 +175,7 @@ Returns observation with `manuscript_text`, `style_guide`, `step_count`, `max_st
|
|
| 204 |
{"task": "claim_evidence_audit", "action_type": "extract_claims", "section_name": "results"}
|
| 205 |
```
|
| 206 |
|
| 207 |
-
**Tasks 2/3 – submit
|
| 208 |
```json
|
| 209 |
{
|
| 210 |
"task": "claim_evidence_audit",
|
|
@@ -222,12 +193,12 @@ Returns observation with `manuscript_text`, `style_guide`, `step_count`, `max_st
|
|
| 222 |
}
|
| 223 |
```
|
| 224 |
|
| 225 |
-
**Task 4 –
|
| 226 |
```json
|
| 227 |
{"task": "citation_verification", "action_type": "check_citation", "citation_id": "ref_3"}
|
| 228 |
```
|
| 229 |
|
| 230 |
-
**Task 4 – submit
|
| 231 |
```json
|
| 232 |
{
|
| 233 |
"task": "citation_verification",
|
|
@@ -238,14 +209,9 @@ Returns observation with `manuscript_text`, `style_guide`, `step_count`, `max_st
|
|
| 238 |
}
|
| 239 |
```
|
| 240 |
|
| 241 |
-
**
|
| 242 |
```json
|
| 243 |
-
{
|
| 244 |
-
"observation": {...},
|
| 245 |
-
"reward": 0.7341,
|
| 246 |
-
"done": false,
|
| 247 |
-
"info": {"f_beta": 0.73, "precision": 0.8, "recall": 0.67}
|
| 248 |
-
}
|
| 249 |
```
|
| 250 |
|
| 251 |
### Other endpoints
|
|
@@ -253,7 +219,7 @@ Returns observation with `manuscript_text`, `style_guide`, `step_count`, `max_st
|
|
| 253 |
| Endpoint | Method | Description |
|
| 254 |
|---|---|---|
|
| 255 |
| `/health` | GET | `{"status":"ok","version":"0.4.0"}` |
|
| 256 |
-
| `/state` | GET | Episode state, curriculum summary |
|
| 257 |
| `/tasks` | GET | All 4 task descriptions |
|
| 258 |
| `/action_space` | GET | Full action schema |
|
| 259 |
|
|
@@ -263,21 +229,23 @@ Returns observation with `manuscript_text`, `style_guide`, `step_count`, `max_st
|
|
| 263 |
|
| 264 |
```
|
| 265 |
├── inference.py  ← Baseline agent (root – required by spec)
|
| 266 |
-
├── models.py  ← FormattingAction, ScholarAction, CitationAction
|
|
|
|
| 267 |
├── corpus.py  ← PaperCorpus loader
|
| 268 |
├── openenv.yaml  ← 4 tasks, endpoints, authors, baseline_script
|
| 269 |
├── Dockerfile
|
| 270 |
├── requirements.txt
|
|
|
|
| 271 |
│
|
| 272 |
├── data/
|
| 273 |
│   ├── papers/
|
| 274 |
-
│   │   ├── paper_001.json  ← NLP benchmark (easy)
|
| 275 |
-
│   │   ├── paper_002.json  ← CV survey (medium)
|
| 276 |
-
│   │   └── paper_003.json  ← MTL paper (hard)
|
| 277 |
│   └── styles/ieee.yaml
|
| 278 |
│
|
| 279 |
├── server/
|
| 280 |
-
│   ├── app.py  ← FastAPI
|
| 281 |
│   ├── environment.py  ← 4-task state machine
|
| 282 |
│   ├── reward_shaper.py  ← PBRS (Ng et al. 1999)
|
| 283 |
│   ├── curriculum.py  ← AdaRFT + UCB1
|
|
@@ -285,8 +253,8 @@ Returns observation with `manuscript_text`, `style_guide`, `step_count`, `max_st
|
|
| 285 |
│   ├── citation_verifier.py  ← Citation parser + SQLite cache
|
| 286 |
│   └── graders/
|
| 287 |
│       ├── formatting_grader.py  ← PRS 3-stage (Task 1)
|
| 288 |
-
│       ├── consistency_grader.py  ← F-beta (Task 2)
|
| 289 |
-
│       └── audit_grader.py  ← F-beta + PBRS (Task 3)
|
| 290 |
│
|
| 291 |
├── scripts/generate_corpus.py
|
| 292 |
└── tests/test_all.py  ← 63 assertions
|
|
@@ -298,7 +266,7 @@ Returns observation with `manuscript_text`, `style_guide`, `step_count`, `max_st
|
|
| 298 |
|
| 299 |
```
|
| 300 |
[Corpus] 8/8 ✓
|
| 301 |
-
[FormattingGrader] 8/8 ✓ PRS stage locking
|
| 302 |
[ConsistencyGrader] 9/9 ✓ F-beta, hallucination penalty
|
| 303 |
[AuditGrader] 6/6 ✓ Evidence specificity, coverage bonus
|
| 304 |
[PBRS] 6/6 ✓ Potential monotonicity, bonus bounds
|
|
@@ -318,19 +286,22 @@ Results: 63/63 passed – ALL TESTS PASSED
|
|
| 318 |
| [PRS · arXiv 2512.07478](https://arxiv.org/abs/2512.07478) | Task 1 progressive staging prevents GRPO gradient collapse |
|
| 319 |
| [PBRS · Ng, Harada & Russell, ICML 1999](http://www.cs.utexas.edu/~ai-lab/pubs/ICML99-shaping.pdf) | Policy-invariant dense intermediate rewards |
|
| 320 |
| [AdaRFT · arXiv 2504.05520](https://arxiv.org/abs/2504.05520) | Curriculum targeting [0.40, 0.70] productive zone |
|
| 321 |
-
| [RLVE · arXiv 2511.07317](https://arxiv.org/abs/2511.07317) | Adaptive difficulty
|
| 322 |
| [Veri-R1 · arXiv 2510.01932](https://arxiv.org/abs/2510.01932) | Online RL for claim verification is current SOTA |
|
| 323 |
-
| [LaMer · arXiv 2512.16848](https://arxiv.org/abs/2512.16848) | Structured feedback
|
| 324 |
| [StatCheck · Epskamp 2016](https://link.springer.com/article/10.3758/s13428-015-0664-2) | ~50% of papers have errors – scale motivation |
|
| 325 |
| [GROBID · Lopez 2008–2025](https://github.com/kermitt2/grobid) | Prior art; CitationVerifier is our RL-native alternative |
|
| 326 |
|
| 327 |
---
|
| 328 |
|
| 329 |
-
##
|
| 330 |
|
| 331 |
-
|
| 332 |
-
|
| 333 |
-
|
| 334 |
|
| 335 |
---
|
| 336 |
|
|
@@ -344,6 +315,7 @@ Results: 63/63 passed – ALL TESTS PASSED
|
|
| 344 |
|
| 345 |
*The future of AI isn't just models that generate – it's models that verify.*
|
| 346 |
|
|
|
|
| 347 |
[](https://github.com/Nensi1311/research-paper-formatter-agent)
|
| 348 |
|
| 349 |
</div>
|
| 1 |
<div align="center">
|
| 2 |
|
| 3 |
# 🔬 ScholarEnv
|
|
|
|
| 9 |
[](LICENSE)
|
| 10 |
[](#four-tasks)
|
| 11 |
[](#testing)
|
| 12 |
+
[](https://huggingface.co/spaces/nensi1311/research-paper-formatter-agent)
|
| 13 |
|
| 14 |
**An AI agent that investigates papers – not one that produces them.**
|
| 15 |
|
| 16 |
+
[Live Demo](https://huggingface.co/spaces/nensi1311/research-paper-formatter-agent) · [API Reference](#api-reference) · [Quick Start](#quick-start) · [Research](#research-foundation)
|
| 17 |
|
| 18 |
---
|
| 19 |
|
|
|
|
| 31 |
|
| 32 |
The key insight: **LLMs are already good at formatting. They fail at auditing.**
|
| 33 |
|
| 34 |
+
Ask GPT-4o to format a manuscript → scores ~0.92 with no training.
|
| 35 |
+
Ask GPT-4o to find all numerical claim mismatches in a paper → scores **0.20–0.45**.
|
| 36 |
|
| 37 |
That gap is exactly where RL adds value. The agent must discover a document traversal strategy – which sections to read first, which tables to cross-reference – that **varies by paper structure and cannot be reduced to a fixed prompt**. RL finds this strategy. Prompting cannot.
|
| 38 |
|
|
|
|
| 47 |
|
| 48 |
| Task | What the agent does | Frontier baseline | RL target |
|
| 49 |
|------|-------------------|-------------------|-----------|
|
| 50 |
+
| `formatting_compliance` | Fix IEEE formatting violations in a manuscript | 0.80–0.95 | 0.95+ |
|
| 51 |
+
| `internal_consistency` | Find where the paper contradicts itself | 0.40–0.65 | 0.65–0.80 |
|
| 52 |
+
| `claim_evidence_audit` | Find where text claims don't match table values | **0.20–0.45** | **0.55–0.75** |
|
| 53 |
| `citation_verification` | Identify ghost and misattributed references | 0.35–0.60 | 0.65–0.80 |
|
| 54 |
|
| 55 |
Task 3's low baseline is the core RL contribution – it proves genuine training headroom exists.
|
|
|
|
| 59 |
## Reward Design
|
| 60 |
|
| 61 |
### Task 1 – Progressive Reward Shaping (PRS)
|
|
|
|
| 62 |
Three stages unlock sequentially. Stage N only contributes when Stage N-1 ≥ threshold. Prevents GRPO gradient collapse.
|
| 63 |
|
| 64 |
```
|
|
|
|
| 67 |
Stage 3 | weight 0.25 | threshold 0.70 | IEEE citations, author block, keywords
|
| 68 |
```
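The stage gating described above can be sketched in Python. This is a minimal illustration under stated assumptions, not the code in `formatting_grader.py`: `Stage`, `DEFAULT_STAGES` and `progressive_reward` are hypothetical names, the Stage 1 weight (0.40) and threshold (0.50) are assumed so that the weights sum to 1, and each threshold is read as the score a stage must reach before the next stage contributes.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Stage:
    name: str
    weight: float     # share of the total reward
    threshold: float  # assumed: score needed here to unlock the next stage


# Stage 2 and Stage 3 values come from the README; Stage 1 is an assumption.
DEFAULT_STAGES = (
    Stage("stage_1", 0.40, 0.50),
    Stage("stage_2", 0.35, 0.60),
    Stage("stage_3", 0.25, 0.70),
)


def progressive_reward(stages: Sequence[Stage], scores: Sequence[float]) -> float:
    """Sum weighted stage scores, stopping at the first stage below its threshold."""
    total = 0.0
    for stage, score in zip(stages, scores):
        total += stage.weight * score
        if score < stage.threshold:
            break  # later stages stay locked and contribute nothing
    return total
```

Under this reading, a submission that fails Stage 1 earns only its partial Stage 1 credit, so the reward signal stays non-zero but cannot be inflated by later-stage checks.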
|
| 69 |
|
|
|
|
|
|
|
| 70 |
### Tasks 2 & 3 – F-beta + Potential-Based Reward Shaping
|
|
|
|
| 71 |
**F-beta (β=0.5)** weights precision 4× over recall – prevents hallucination gaming:
|
|
|
|
| 72 |
```
|
| 73 |
+
F_β(P=1.0, R=0.5) = 0.833  ← correct and precise ✓
|
| 74 |
+
F_β(P=0.2, R=1.0) = 0.238  ← spamming guesses ✗
|
| 75 |
```
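For reference, the standard F-beta definition is F_β = (1 + β²)·P·R / (β²·P + R). A minimal sketch follows; `f_beta` is a hypothetical helper name, and returning 0.0 when the denominator is zero is an assumed convention that the graders may or may not share.

```python
def f_beta(precision: float, recall: float, beta: float = 0.5) -> float:
    """Standard F-beta score; beta < 1 favours precision over recall."""
    b2 = beta * beta
    denom = b2 * precision + recall
    if denom <= 0.0:
        return 0.0  # assumed convention for the degenerate case
    return (1.0 + b2) * precision * recall / denom
```

With β = 0.5, `f_beta(1.0, 0.5)` evaluates to about 0.833, so a short list of correct findings outscores a long list of guesses that happens to cover everything.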
|
| 76 |
|
| 77 |
+
**PBRS** (Ng et al., ICML 1999) gives dense intermediate rewards per navigation step:
|
|
|
|
| 78 |
```
|
| 79 |
Φ(s) = 0.30 × sections_read/total + 0.30 × tables_checked/total + 0.40 × claims_extracted/est
|
| 80 |
+
F(s,s') = γ·Φ(s') − Φ(s)  ← policy-invariant, guaranteed by theory
|
| 81 |
```
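The potential and shaping term above translate directly into code. The sketch below is illustrative rather than a copy of `reward_shaper.py`: the function names are hypothetical, the default γ = 0.99 is an assumed value (the README does not state one), and clamping each ratio to [0, 1] is an added safeguard.

```python
def _ratio(done: int, total: int) -> float:
    """Fraction of work completed, clamped to [0, 1]; 0 when total is unknown."""
    if total <= 0:
        return 0.0
    return max(0.0, min(done / total, 1.0))


def potential(sections_read: int, total_sections: int,
              tables_checked: int, total_tables: int,
              claims_extracted: int, est_claims: int) -> float:
    """Phi(s): weighted navigation coverage, using the 0.30/0.30/0.40 weights."""
    return (0.30 * _ratio(sections_read, total_sections)
            + 0.30 * _ratio(tables_checked, total_tables)
            + 0.40 * _ratio(claims_extracted, est_claims))


def shaping_reward(phi_s: float, phi_next: float, gamma: float = 0.99) -> float:
    """F(s, s') = gamma * Phi(s') - Phi(s), the potential-based shaping term."""
    return gamma * phi_next - phi_s
```

Because the bonus depends only on the difference in potential between consecutive states, it rewards progress through the paper without changing which policy is optimal, which is the property established by Ng et al.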
|
| 82 |
|
| 83 |
### Curriculum – AdaRFT + UCB1
|
| 84 |
+
Keeps the agent in the productive zone (avg score 0.40–0.70). UCB1 maximises **learning gradient** (reward variance), not mean reward: a paper that always scores 0.95 teaches nothing.
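One way to picture this selection rule is the sketch below. It is an assumption-laden illustration, not the contents of `curriculum.py`: `difficulty_step` and `ucb1_pick` are hypothetical names, the exploration constant `c` is assumed, and population variance of past episode scores stands in for the "learning gradient".

```python
import math
from statistics import pvariance
from typing import Dict, List


def difficulty_step(avg_score: float, low: float = 0.40, high: float = 0.70) -> int:
    """Return +1 (harder), -1 (easier) or 0 (stay) relative to the productive zone."""
    if avg_score > high:
        return 1
    if avg_score < low:
        return -1
    return 0


def ucb1_pick(history: Dict[str, List[float]], c: float = 1.0) -> str:
    """Pick the paper with the highest variance-plus-exploration score.

    `history` maps paper id -> past episode scores and must be non-empty.
    Papers with no episodes yet are tried first.
    """
    for paper_id, scores in history.items():
        if not scores:
            return paper_id
    total = sum(len(scores) for scores in history.values())

    def ucb(paper_id: str) -> float:
        scores = history[paper_id]
        bonus = c * math.sqrt(2.0 * math.log(total) / len(scores))
        return pvariance(scores) + bonus

    return max(history, key=ucb)
```

A paper whose scores never move has zero variance, so it is only revisited through the exploration bonus, while papers with mixed outcomes are sampled more often.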
|
| 85 |
|
| 86 |
---
|
| 87 |
|
| 88 |
## Quick Start
|
| 89 |
|
| 90 |
### Install
|
|
|
|
| 91 |
```bash
|
| 92 |
git clone https://github.com/Nensi1311/research-paper-formatter-agent
|
| 93 |
cd research-paper-formatter-agent
|
|
|
|
| 95 |
```
|
| 96 |
|
| 97 |
### Generate corpus
|
|
|
|
| 98 |
```bash
|
| 99 |
python scripts/generate_corpus.py
|
| 100 |
```
|
| 101 |
|
| 102 |
### Run tests
|
|
|
|
| 103 |
```bash
|
| 104 |
python tests/test_all.py
|
| 105 |
# ✓ ALL TESTS PASSED (63/63)
|
| 106 |
```
|
| 107 |
|
| 108 |
### Start server
|
|
|
|
| 109 |
```bash
|
| 110 |
uvicorn server.app:app --host 0.0.0.0 --port 7860
|
| 111 |
```
|
| 112 |
|
| 113 |
+
### Test endpoints – Linux/macOS
|
|
|
|
| 114 |
```bash
|
| 115 |
+
curl http://localhost:7860/health
|
| 116 |
+
|
| 117 |
for task in formatting_compliance internal_consistency claim_evidence_audit citation_verification; do
|
| 118 |
curl -s -X POST localhost:7860/reset \
|
| 119 |
-H "Content-Type: application/json" \
|
|
|
|
| 122 |
done
|
| 123 |
```
|
| 124 |
|
| 125 |
+
### Test endpoints – Windows PowerShell
|
|
|
|
| 126 |
```powershell
|
| 127 |
+
Invoke-RestMethod -Uri "http://localhost:7860/health"
|
| 128 |
+
|
| 129 |
foreach ($task in @("formatting_compliance","internal_consistency","claim_evidence_audit","citation_verification")) {
|
| 130 |
$body = '{"task_id":"' + $task + '"}'
|
| 131 |
$r = Invoke-RestMethod -Uri "http://localhost:7860/reset" -Method POST -ContentType "application/json" -Body $body
|
|
|
|
| 134 |
```
|
| 135 |
|
| 136 |
### Docker
|
|
|
|
| 137 |
```bash
|
| 138 |
docker build -t scholar-env .
|
| 139 |
docker run -p 7860:7860 scholar-env
|
|
|
|
| 141 |
```
|
| 142 |
|
| 143 |
### Run baseline agent
|
|
|
|
| 144 |
```bash
|
| 145 |
export API_BASE_URL="https://api-inference.huggingface.co/v1"
|
| 146 |
export MODEL_NAME="meta-llama/Llama-3.1-8B-Instruct"
|
| 147 |
export HF_TOKEN="hf_your_token"
|
| 148 |
+
export HF_SPACE_URL="https://nensi1311-research-paper-formatter-agent.hf.space"
|
| 149 |
|
| 150 |
python inference.py
|
| 151 |
# Writes: baseline_scores.json
|
|
|
|
| 156 |
## API Reference
|
| 157 |
|
| 158 |
### `POST /reset`
|
|
|
|
| 159 |
```json
|
| 160 |
{"task_id": "formatting_compliance"}
|
| 161 |
```
|
| 162 |
+
Returns `observation` with `manuscript_text`, `style_guide`, `step_count`, `max_steps`, `hint`.
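The same call can be made from Python with only the standard library. This is a minimal sketch that assumes the server is running locally on port 7860 as in Quick Start; `build_reset_request` and `reset` are hypothetical helper names, and the shape of the returned JSON is whatever the server sends.

```python
import json
import urllib.request

BASE_URL = "http://localhost:7860"  # assumption: local server from Quick Start


def build_reset_request(task_id: str, base_url: str = BASE_URL) -> urllib.request.Request:
    """Build the POST /reset request without sending it."""
    body = json.dumps({"task_id": task_id}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/reset",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def reset(task_id: str, base_url: str = BASE_URL, timeout: float = 30.0) -> dict:
    """Send POST /reset and return the decoded JSON response."""
    request = build_reset_request(task_id, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Example (requires a running server):
#   payload = reset("formatting_compliance")
```

Separating request construction from sending keeps the payload easy to inspect or unit-test without a live server.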
|
|
|
|
| 163 |
|
| 164 |
### `POST /step`
|
| 165 |
|
| 166 |
+
**Task 1:**
|
| 167 |
```json
|
| 168 |
{"task": "formatting_compliance", "formatted_text": "...full reformatted manuscript..."}
|
| 169 |
```
|
|
|
|
| 175 |
{"task": "claim_evidence_audit", "action_type": "extract_claims", "section_name": "results"}
|
| 176 |
```
|
| 177 |
|
| 178 |
+
**Tasks 2/3 – submit:**
|
| 179 |
```json
|
| 180 |
{
|
| 181 |
"task": "claim_evidence_audit",
|
|
|
|
| 193 |
}
|
| 194 |
```
|
| 195 |
|
| 196 |
+
**Task 4 – navigate:**
|
| 197 |
```json
|
| 198 |
{"task": "citation_verification", "action_type": "check_citation", "citation_id": "ref_3"}
|
| 199 |
```
|
| 200 |
|
| 201 |
+
**Task 4 – submit:**
|
| 202 |
```json
|
| 203 |
{
|
| 204 |
"task": "citation_verification",
|
|
|
|
| 209 |
}
|
| 210 |
```
|
| 211 |
|
| 212 |
+
**Response:**
|
| 213 |
```json
|
| 214 |
+
{"observation": {...}, "reward": 0.7341, "done": false, "info": {"f_beta": 0.73, "precision": 0.8, "recall": 0.67}}
|
| 215 |
```
|
| 216 |
|
| 217 |
### Other endpoints
|
|
|
|
| 219 |
| Endpoint | Method | Description |
|
| 220 |
|---|---|---|
|
| 221 |
| `/health` | GET | `{"status":"ok","version":"0.4.0"}` |
|
| 222 |
+
| `/state` | GET | Episode state, curriculum summary, nav coverage |
|
| 223 |
| `/tasks` | GET | All 4 task descriptions |
|
| 224 |
| `/action_space` | GET | Full action schema |
|
| 225 |
|
|
|
|
| 229 |
|
| 230 |
```
|
| 231 |
├── inference.py  ← Baseline agent (root – required by spec)
|
| 232 |
+
├── models.py  ← FormattingAction, ScholarAction, CitationAction,
|
| 233 |
+
│                 ScholarObservation, AnyAction (discriminated union)
|
| 234 |
├── corpus.py  ← PaperCorpus loader
|
| 235 |
├── openenv.yaml  ← 4 tasks, endpoints, authors, baseline_script
|
| 236 |
├── Dockerfile
|
| 237 |
├── requirements.txt
|
| 238 |
+
├── validate-submission.sh  ← Official 3-step pre-submission validator
|
| 239 |
│
|
| 240 |
├── data/
|
| 241 |
│   ├── papers/
|
| 242 |
+
│   │   ├── paper_001.json  ← NLP benchmark (easy) – 5 refs, 1 ghost
|
| 243 |
+
│   │   ├── paper_002.json  ← CV survey (medium) – 4 refs, 1 ghost
|
| 244 |
+
│   │   └── paper_003.json  ← MTL paper (hard) – 5 refs, 1 ghost
|
| 245 |
│   └── styles/ieee.yaml
|
| 246 |
│
|
| 247 |
├── server/
|
| 248 |
+
│   ├── app.py  ← FastAPI: /reset /step /state /health /tasks
|
| 249 |
│   ├── environment.py  ← 4-task state machine
|
| 250 |
│   ├── reward_shaper.py  ← PBRS (Ng et al. 1999)
|
| 251 |
│   ├── curriculum.py  ← AdaRFT + UCB1
|
|
|
|
| 253 |
│   ├── citation_verifier.py  ← Citation parser + SQLite cache
|
| 254 |
│   └── graders/
|
| 255 |
│       ├── formatting_grader.py  ← PRS 3-stage (Task 1)
|
| 256 |
+
│       ├── consistency_grader.py  ← F-beta fuzzy-match (Task 2)
|
| 257 |
+
│       └── audit_grader.py  ← F-beta + PBRS coverage (Task 3)
|
| 258 |
│
|
| 259 |
├── scripts/generate_corpus.py
|
| 260 |
└── tests/test_all.py  ← 63 assertions
|
|
|
|
| 266 |
|
| 267 |
```
|
| 268 |
[Corpus] 8/8 ✓
|
| 269 |
+
[FormattingGrader] 8/8 ✓ PRS stage locking verified
|
| 270 |
[ConsistencyGrader] 9/9 ✓ F-beta, hallucination penalty
|
| 271 |
[AuditGrader] 6/6 ✓ Evidence specificity, coverage bonus
|
| 272 |
[PBRS] 6/6 ✓ Potential monotonicity, bonus bounds
|
|
|
|
| 286 |
| [PRS · arXiv 2512.07478](https://arxiv.org/abs/2512.07478) | Task 1 progressive staging prevents GRPO gradient collapse |
|
| 287 |
| [PBRS · Ng, Harada & Russell, ICML 1999](http://www.cs.utexas.edu/~ai-lab/pubs/ICML99-shaping.pdf) | Policy-invariant dense intermediate rewards |
|
| 288 |
| [AdaRFT · arXiv 2504.05520](https://arxiv.org/abs/2504.05520) | Curriculum targeting [0.40, 0.70] productive zone |
|
| 289 |
+
| [RLVE · arXiv 2511.07317](https://arxiv.org/abs/2511.07317) | Adaptive difficulty – why UCB1 maximises variance |
|
| 290 |
| [Veri-R1 · arXiv 2510.01932](https://arxiv.org/abs/2510.01932) | Online RL for claim verification is current SOTA |
|
| 291 |
+
| [LaMer · arXiv 2512.16848](https://arxiv.org/abs/2512.16848) | Structured feedback fields improve agent 11–19% |
|
| 292 |
| [StatCheck · Epskamp 2016](https://link.springer.com/article/10.3758/s13428-015-0664-2) | ~50% of papers have errors – scale motivation |
|
| 293 |
| [GROBID · Lopez 2008–2025](https://github.com/kermitt2/grobid) | Prior art; CitationVerifier is our RL-native alternative |
|
| 294 |
|
| 295 |
---
|
| 296 |
|
| 297 |
+
## Baseline Scores
|
| 298 |
|
| 299 |
+
| Task | Score | Notes |
|
| 300 |
+
|---|---|---|
|
| 301 |
+
| `formatting_compliance` | ~0.82 | Strong baseline, room to perfect |
|
| 302 |
+
| `internal_consistency` | ~0.51 | F-beta precision-biased |
|
| 303 |
+
| `claim_evidence_audit` | ~0.31 | **Core RL gap – biggest training value** |
|
| 304 |
+
| `citation_verification` | ~0.47 | Ghost detection improving with SQLite cache |
|
| 305 |
|
| 306 |
---
|
| 307 |
|
|
|
|
| 315 |
|
| 316 |
*The future of AI isn't just models that generate – it's models that verify.*
|
| 317 |
|
| 318 |
+
[](https://huggingface.co/spaces/nensi1311/research-paper-formatter-agent)
|
| 319 |
[](https://github.com/Nensi1311/research-paper-formatter-agent)
|
| 320 |
|
| 321 |
</div>
|
__init__.py
CHANGED
|
@@ -4,7 +4,7 @@ ScholarEnv – OpenEnv environment for scholarly integrity verification.
|
|
| 4 |
from .models import FormattingAction, ScholarAction, ScholarObservation, EpisodeStatus
|
| 5 |
from .corpus import PaperCorpus, Paper
|
| 6 |
|
| 7 |
-
__version__ = "0.
|
| 8 |
__all__ = [
|
| 9 |
"FormattingAction",
|
| 10 |
"ScholarAction",
|
|
|
|
| 4 |
from .models import FormattingAction, ScholarAction, ScholarObservation, EpisodeStatus
|
| 5 |
from .corpus import PaperCorpus, Paper
|
| 6 |
|
| 7 |
+
__version__ = "0.4.0"
|
| 8 |
__all__ = [
|
| 9 |
"FormattingAction",
|
| 10 |
"ScholarAction",
|
pyproject.toml
CHANGED
|
@@ -4,7 +4,7 @@ build-backend = "setuptools.backends.legacy:build"
|
|
| 4 |
|
| 5 |
[project]
|
| 6 |
name = "scholar-env"
|
| 7 |
-
version = "0.
|
| 8 |
description = "OpenEnv environment for scholarly integrity verification"
|
| 9 |
readme = "README.md"
|
| 10 |
license = {text = "Apache-2.0"}
|
|
|
|
| 4 |
|
| 5 |
[project]
|
| 6 |
name = "scholar-env"
|
| 7 |
+
version = "0.4.0"
|
| 8 |
description = "OpenEnv environment for scholarly integrity verification"
|
| 9 |
readme = "README.md"
|
| 10 |
license = {text = "Apache-2.0"}
|
server/app.py
CHANGED
|
@@ -41,7 +41,7 @@ app = FastAPI(
|
|
| 41 |
"Three tasks: formatting compliance, internal consistency, "
|
| 42 |
"claim-evidence audit."
|
| 43 |
),
|
| 44 |
-
version="0.
|
| 45 |
)
|
| 46 |
|
| 47 |
app.add_middleware(
|
|
@@ -72,7 +72,7 @@ async def health() -> dict:
|
|
| 72 |
env = get_env()
|
| 73 |
return {
|
| 74 |
"status": "ok",
|
| 75 |
-
"version": "0.
|
| 76 |
"corpus_size": len(env.corpus),
|
| 77 |
"tasks": list(TASK_CONFIG.keys()),
|
| 78 |
}
|
|
|
|
| 41 |
"Three tasks: formatting compliance, internal consistency, "
|
| 42 |
"claim-evidence audit."
|
| 43 |
),
|
| 44 |
+
version="0.4.0",
|
| 45 |
)
|
| 46 |
|
| 47 |
app.add_middleware(
|
|
|
|
| 72 |
env = get_env()
|
| 73 |
return {
|
| 74 |
"status": "ok",
|
| 75 |
+
"version": "0.4.0",
|
| 76 |
"corpus_size": len(env.corpus),
|
| 77 |
"tasks": list(TASK_CONFIG.keys()),
|
| 78 |
}
|