Spaces: flyingmaverick/scholar-env

flyingmaverick committed 18 days ago

Commit 8bc6c43 · 1 parent: f95e25b

fix: version bump to 0.4.0, add citation_verification task

Files changed (4)

README.md +47 -75
__init__.py +1 -1
pyproject.toml +1 -1
server/app.py +2 -2

README.md

@@ -1,13 +1,3 @@
----
-title: ScholarEnv
-emoji: 🔬
-colorFrom: blue
-colorTo: purple
-sdk: docker
-pinned: false
-license: apache-2.0
----
 <div align="center">
 # 🔬 ScholarEnv
@@ -19,10 +9,11 @@ license: apache-2.0
 [![License](https://img.shields.io/badge/License-Apache_2.0-orange?style=flat-square)](LICENSE)
 [![Tasks](https://img.shields.io/badge/Tasks-4-purple?style=flat-square)](#four-tasks)
 [![Tests](https://img.shields.io/badge/Tests-63%2F63-success?style=flat-square)](#testing)
 **An AI agent that investigates papers — not one that produces them.**
-[API Reference](#api-reference) · [Quick Start](#quick-start) · [Research](#research-foundation)
 ---
@@ -40,8 +31,8 @@ license: apache-2.0
 The key insight: **LLMs are already good at formatting. They fail at auditing.**
-Ask GPT-4o to format a manuscript → scores ~0.92 with no training.
-Ask GPT-4o to find numerical claim mismatches in a paper → scores **0.20–0.45**.
 That gap is exactly where RL adds value. The agent must discover a document traversal strategy — which sections to read first, which tables to cross-reference — that **varies by paper structure and cannot be reduced to a fixed prompt**. RL finds this strategy. Prompting cannot.
@@ -56,9 +47,9 @@ Formatting → Consistency → Claim Audit → Citation Check
 | Task | What the agent does | Frontier baseline | RL target |
 |------|-------------------|-------------------|-----------|
-| `formatting_compliance` | Fix IEEE formatting violations | 0.80–0.95 | 0.95+ |
-| `internal_consistency` | Find where paper contradicts itself | 0.40–0.65 | 0.65–0.80 |
-| `claim_evidence_audit` | Find where text claims ≠ table values | **0.20–0.45** | **0.55–0.75** |
 | `citation_verification` | Identify ghost and misattributed references | 0.35–0.60 | 0.65–0.80 |
 Task 3's low baseline is the core RL contribution — it proves genuine training headroom exists.
@@ -68,7 +59,6 @@ Task 3's low baseline is the core RL contribution — it proves genuine training
 ## Reward Design
 ### Task 1 — Progressive Reward Shaping (PRS)
 Three stages unlock sequentially. Stage N only contributes when Stage N-1 ≥ threshold. Prevents GRPO gradient collapse.
 ```
@@ -77,41 +67,27 @@ Stage 2 │ weight 0.35 │ threshold 0.60 │ Section order, word limits, capti
 Stage 3 │ weight 0.25 │ threshold 0.70 │ IEEE citations, author block, keywords
 ```
-> Based on: [arXiv 2512.07478](https://arxiv.org/abs/2512.07478) — PRS for Agentic RL
 ### Tasks 2 & 3 — F-beta + Potential-Based Reward Shaping
 **F-beta (β=0.5)** weights precision 4× over recall — prevents hallucination gaming:
 ```
-F_β(precision=1.0, recall=0.5) = 0.833   ✓ correct and precise
-F_β(precision=0.2, recall=1.0) = 0.227   ✗ spamming guesses
 ```
-**PBRS** (Ng et al., ICML 1999) gives dense intermediate rewards on every navigation step:
 ```
 Φ(s) = 0.30 × sections_read/total + 0.30 × tables_checked/total + 0.40 × claims_extracted/est
-F(s,s') = γ·Φ(s') − Φ(s)     ← policy-invariant, theoretically guaranteed
 ```
 ### Curriculum — AdaRFT + UCB1
-Keeps agent in productive zone (avg score 0.40–0.70). UCB1 maximises **learning gradient** (reward variance), not mean reward.
-```
-avg > 0.70  →  select harder papers
-avg < 0.40  →  select easier papers
-```
-> Based on: [arXiv 2504.05520](https://arxiv.org/abs/2504.05520) — AdaRFT Adaptive Data Selection
 ---
 ## Quick Start
 ### Install
 ```bash
 git clone https://github.com/Nensi1311/research-paper-formatter-agent
 cd research-paper-formatter-agent
@@ -119,27 +95,25 @@ pip install -r requirements.txt
 ```
 ### Generate corpus
 ```bash
 python scripts/generate_corpus.py
 ```
 ### Run tests
 ```bash
 python tests/test_all.py
 # → ALL TESTS PASSED (63/63)
 ```
 ### Start server
 ```bash
 uvicorn server.app:app --host 0.0.0.0 --port 7860
 ```
-### Test all 4 tasks — Linux/macOS
 ```bash
 for task in formatting_compliance internal_consistency claim_evidence_audit citation_verification; do
   curl -s -X POST localhost:7860/reset \
     -H "Content-Type: application/json" \
@@ -148,9 +122,10 @@ for task in formatting_compliance internal_consistency claim_evidence_audit cita
 done
 ```
-### Test all 4 tasks — Windows PowerShell
 ```powershell
 foreach ($task in @("formatting_compliance","internal_consistency","claim_evidence_audit","citation_verification")) {
     $body = '{"task_id":"' + $task + '"}'
     $r = Invoke-RestMethod -Uri "http://localhost:7860/reset" -Method POST -ContentType "application/json" -Body $body
@@ -159,7 +134,6 @@ foreach ($task in @("formatting_compliance","internal_consistency","claim_eviden
 ```
 ### Docker
 ```bash
 docker build -t scholar-env .
 docker run -p 7860:7860 scholar-env
@@ -167,12 +141,11 @@ curl http://localhost:7860/health
 ```
 ### Run baseline agent
 ```bash
 export API_BASE_URL="https://api-inference.huggingface.co/v1"
 export MODEL_NAME="meta-llama/Llama-3.1-8B-Instruct"
 export HF_TOKEN="hf_your_token"
-export HF_SPACE_URL="https://flyingmaverick-scholar-env.hf.space"
 python inference.py
 # Writes: baseline_scores.json
@@ -183,16 +156,14 @@ python inference.py
 ## API Reference
 ### `POST /reset`
 ```json
 {"task_id": "formatting_compliance"}
 ```
-Returns observation with `manuscript_text`, `style_guide`, `step_count`, `max_steps`, `hint`.
 ### `POST /step`
-**Task 1 — submit formatted manuscript:**
 ```json
 {"task": "formatting_compliance", "formatted_text": "...full reformatted manuscript..."}
 ```
@@ -204,7 +175,7 @@ Returns observation with `manuscript_text`, `style_guide`, `step_count`, `max_st
 {"task": "claim_evidence_audit", "action_type": "extract_claims", "section_name": "results"}
 ```
-**Tasks 2/3 — submit findings:**
 ```json
 {
   "task": "claim_evidence_audit",
@@ -222,12 +193,12 @@ Returns observation with `manuscript_text`, `style_guide`, `step_count`, `max_st
 }
 ```
-**Task 4 — check citation:**
 ```json
 {"task": "citation_verification", "action_type": "check_citation", "citation_id": "ref_3"}
 ```
-**Task 4 — submit verdicts:**
 ```json
 {
   "task": "citation_verification",
@@ -238,14 +209,9 @@ Returns observation with `manuscript_text`, `style_guide`, `step_count`, `max_st
 }
 ```
-**Step response:**
 ```json
-{
-  "observation": {...},
-  "reward": 0.7341,
-  "done": false,
-  "info": {"f_beta": 0.73, "precision": 0.8, "recall": 0.67}
-}
 ```
 ### Other endpoints
@@ -253,7 +219,7 @@ Returns observation with `manuscript_text`, `style_guide`, `step_count`, `max_st
 | Endpoint | Method | Description |
 |---|---|---|
 | `/health` | GET | `{"status":"ok","version":"0.4.0"}` |
-| `/state` | GET | Episode state, curriculum summary |
 | `/tasks` | GET | All 4 task descriptions |
 | `/action_space` | GET | Full action schema |
@@ -263,21 +229,23 @@ Returns observation with `manuscript_text`, `style_guide`, `step_count`, `max_st
 ```
 ├── inference.py                 ← Baseline agent (root — required by spec)
-├── models.py                    ← FormattingAction, ScholarAction, CitationAction
 ├── corpus.py                    ← PaperCorpus loader
 ├── openenv.yaml                 ← 4 tasks, endpoints, authors, baseline_script
 ├── Dockerfile
 ├── requirements.txt
 │
 ├── data/
 │   ├── papers/
-│   │   ├── paper_001.json       ← NLP benchmark (easy)
-│   │   ├── paper_002.json       ← CV survey (medium)
-│   │   └── paper_003.json       ← MTL paper (hard)
 │   └── styles/ieee.yaml
 │
 ├── server/
-│   ├── app.py                   ← FastAPI endpoints
 │   ├── environment.py           ← 4-task state machine
 │   ├── reward_shaper.py         ← PBRS (Ng et al. 1999)
 │   ├── curriculum.py            ← AdaRFT + UCB1
@@ -285,8 +253,8 @@ Returns observation with `manuscript_text`, `style_guide`, `step_count`, `max_st
 │   ├── citation_verifier.py     ← Citation parser + SQLite cache
 │   └── graders/
 │       ├── formatting_grader.py ← PRS 3-stage (Task 1)
-│       ├── consistency_grader.py← F-beta (Task 2)
-│       └── audit_grader.py      ← F-beta + PBRS (Task 3)
 │
 ├── scripts/generate_corpus.py
 └── tests/test_all.py            ← 63 assertions
@@ -298,7 +266,7 @@ Returns observation with `manuscript_text`, `style_guide`, `step_count`, `max_st
 ```
 [Corpus]              8/8  ✓
-[FormattingGrader]    8/8  ✓  PRS stage locking
 [ConsistencyGrader]   9/9  ✓  F-beta, hallucination penalty
 [AuditGrader]         6/6  ✓  Evidence specificity, coverage bonus
 [PBRS]                6/6  ✓  Potential monotonicity, bonus bounds
@@ -318,19 +286,22 @@ Results: 63/63 passed — ALL TESTS PASSED
 | [PRS · arXiv 2512.07478](https://arxiv.org/abs/2512.07478) | Task 1 progressive staging prevents GRPO gradient collapse |
 | [PBRS · Ng, Harada & Russell, ICML 1999](http://www.cs.utexas.edu/~ai-lab/pubs/ICML99-shaping.pdf) | Policy-invariant dense intermediate rewards |
 | [AdaRFT · arXiv 2504.05520](https://arxiv.org/abs/2504.05520) | Curriculum targeting [0.40, 0.70] productive zone |
-| [RLVE · arXiv 2511.07317](https://arxiv.org/abs/2511.07317) | Adaptive difficulty, UCB1 maximises variance |
 | [Veri-R1 · arXiv 2510.01932](https://arxiv.org/abs/2510.01932) | Online RL for claim verification is current SOTA |
-| [LaMer · arXiv 2512.16848](https://arxiv.org/abs/2512.16848) | Structured feedback improves agent 11–19% |
 | [StatCheck · Epskamp 2016](https://link.springer.com/article/10.3758/s13428-015-0664-2) | ~50% of papers have errors — scale motivation |
 | [GROBID · Lopez 2008–2025](https://github.com/kermitt2/grobid) | Prior art; CitationVerifier is our RL-native alternative |
 ---
-## Authors
-**Nensi Pansuriya · Krushna Parmar · Ishita Bhojani**
-*Meta × PyTorch OpenEnv Hackathon · Round 1 · April 2026*
 ---
@@ -344,6 +315,7 @@ Results: 63/63 passed — ALL TESTS PASSED
 *The future of AI isn't just models that generate — it's models that verify.*
 [![GitHub](https://img.shields.io/badge/GitHub-Repository-black?style=for-the-badge&logo=github)](https://github.com/Nensi1311/research-paper-formatter-agent)
 </div>

 <div align="center">
 # 🔬 ScholarEnv
 [![License](https://img.shields.io/badge/License-Apache_2.0-orange?style=flat-square)](LICENSE)
 [![Tasks](https://img.shields.io/badge/Tasks-4-purple?style=flat-square)](#four-tasks)
 [![Tests](https://img.shields.io/badge/Tests-63%2F63-success?style=flat-square)](#testing)
+[![HuggingFace](https://img.shields.io/badge/%F0%9F%A4%97%20Space-Live-yellow?style=flat-square)](https://huggingface.co/spaces/nensi1311/research-paper-formatter-agent)
 **An AI agent that investigates papers — not one that produces them.**
+[Live Demo](https://huggingface.co/spaces/nensi1311/research-paper-formatter-agent) · [API Reference](#api-reference) · [Quick Start](#quick-start) · [Research](#research-foundation)
 ---
 The key insight: **LLMs are already good at formatting. They fail at auditing.**
+Ask GPT-4o to format a manuscript → scores ~0.92 with no training.
+Ask GPT-4o to find all numerical claim mismatches in a paper → scores **0.20–0.45**.
 That gap is exactly where RL adds value. The agent must discover a document traversal strategy — which sections to read first, which tables to cross-reference — that **varies by paper structure and cannot be reduced to a fixed prompt**. RL finds this strategy. Prompting cannot.
 | Task | What the agent does | Frontier baseline | RL target |
 |------|-------------------|-------------------|-----------|
+| `formatting_compliance` | Fix IEEE formatting violations in a manuscript | 0.80–0.95 | 0.95+ |
+| `internal_consistency` | Find where the paper contradicts itself | 0.40–0.65 | 0.65–0.80 |
+| `claim_evidence_audit` | Find where text claims don't match table values | **0.20–0.45** | **0.55–0.75** |
 | `citation_verification` | Identify ghost and misattributed references | 0.35–0.60 | 0.65–0.80 |
 Task 3's low baseline is the core RL contribution — it proves genuine training headroom exists.
 ## Reward Design
 ### Task 1 — Progressive Reward Shaping (PRS)
 Three stages unlock sequentially. Stage N only contributes when Stage N-1 ≥ threshold. Prevents GRPO gradient collapse.
 ```
 Stage 3 │ weight 0.25 │ threshold 0.70 │ IEEE citations, author block, keywords
 ```
 ### Tasks 2 & 3 — F-beta + Potential-Based Reward Shaping
 **F-beta (β=0.5)** weights precision 4× over recall — prevents hallucination gaming:
 ```
+F_β(P=1.0, R=0.5) = 0.833   ← correct and precise ✓
+F_β(P=0.2, R=1.0) = 0.227   ← spamming guesses   ✗
 ```
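A minimal sketch of the two scores above, assuming the graders use the textbook F-beta definition (the grader code itself is not part of this diff, and the name `f_beta` is illustrative). Under that definition the first case reproduces 0.833, while the second evaluates to roughly 0.238, slightly above the quoted 0.227, so the actual grader may apply an additional penalty or a different variant.

```python
def f_beta(precision: float, recall: float, beta: float = 0.5) -> float:
    """Textbook F-beta: (1 + beta^2) * P * R / (beta^2 * P + R)."""
    if precision <= 0.0 or recall <= 0.0:
        return 0.0  # no true positives, so the score is zero
    b2 = beta * beta
    return (1.0 + b2) * precision * recall / (b2 * precision + recall)


# beta < 1 shifts weight toward precision, so a few correct findings
# beat many unfocused guesses.
precise = f_beta(precision=1.0, recall=0.5)
spammy = f_beta(precision=0.2, recall=1.0)
print(round(precise, 3))  # 0.833
print(round(spammy, 3))   # 0.238
```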
+**PBRS** (Ng et al., ICML 1999) gives dense intermediate rewards per navigation step:
 ```
 Φ(s) = 0.30 × sections_read/total + 0.30 × tables_checked/total + 0.40 × claims_extracted/est
+F(s,s') = γ·Φ(s') − Φ(s)     ← policy-invariant, guaranteed by theory
 ```
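The potential and shaping term above can be sketched as follows. This is an illustration, not the contents of `server/reward_shaper.py`: the `NavState` container, the capping of each ratio at 1.0, and the default `gamma` value are assumptions made for the example; only the 0.30 / 0.30 / 0.40 weights and the form of `F(s,s')` come from the README.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NavState:
    """Hypothetical snapshot of navigation progress within one episode."""
    sections_read: int
    total_sections: int
    tables_checked: int
    total_tables: int
    claims_extracted: int
    est_claims: int


def _ratio(done: int, total: int) -> float:
    # Assumption: ratios are capped at 1.0 and an empty total counts as 0.
    return min(done / total, 1.0) if total > 0 else 0.0


def potential(s: NavState) -> float:
    """Phi(s) with the 0.30 / 0.30 / 0.40 weights quoted above."""
    return (0.30 * _ratio(s.sections_read, s.total_sections)
            + 0.30 * _ratio(s.tables_checked, s.total_tables)
            + 0.40 * _ratio(s.claims_extracted, s.est_claims))


def shaping_bonus(s: NavState, s_next: NavState, gamma: float = 0.99) -> float:
    """F(s, s') = gamma * Phi(s') - Phi(s)."""
    return gamma * potential(s_next) - potential(s)


before = NavState(1, 4, 0, 2, 0, 5)
after = NavState(2, 4, 0, 2, 0, 5)  # one more section read
print(round(shaping_bonus(before, after, gamma=1.0), 3))  # 0.075
```

Because the bonus is a difference of potentials, it telescopes over an episode, which is why this form of shaping leaves the optimal policy unchanged.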
 ### Curriculum — AdaRFT + UCB1
+Keeps the agent in the productive zone (avg score 0.40–0.70). UCB1 maximises **learning gradient** (reward variance), not mean reward — a paper always scoring 0.95 teaches nothing.
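A rough sketch of how such a curriculum could be wired together, assuming that averages above the zone select harder papers and averages below it select easier ones. The function names, the three difficulty levels, the exploration constant `c`, and the sample reward histories are all hypothetical; `server/curriculum.py` is not shown in this diff and may differ.

```python
import math
from statistics import pvariance


def adjust_difficulty(avg_score: float, level: int, n_levels: int = 3,
                      low: float = 0.40, high: float = 0.70) -> int:
    """Move one difficulty level up or down when the average leaves [low, high]."""
    if avg_score > high:
        return min(level + 1, n_levels - 1)
    if avg_score < low:
        return max(level - 1, 0)
    return level


def ucb1_pick(history: dict[str, list[float]], c: float = 1.0) -> str:
    """Pick the paper with the highest reward variance plus a UCB1 bonus."""
    for paper_id, rewards in history.items():
        if not rewards:  # try every paper once before scoring
            return paper_id
    total = sum(len(r) for r in history.values())

    def score(paper_id: str) -> float:
        rewards = history[paper_id]
        explore = c * math.sqrt(2.0 * math.log(total) / len(rewards))
        return pvariance(rewards) + explore

    return max(history, key=score)


history = {
    "paper_001": [0.95, 0.95, 0.95],  # always solved: zero variance
    "paper_003": [0.20, 0.80, 0.50],  # mixed outcomes: high variance
}
print(ucb1_pick(history))          # paper_003
print(adjust_difficulty(0.82, 0))  # 1
```

Using variance rather than mean reward as the bandit value steers sampling toward papers where outcomes still differ between attempts.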
 ---
 ## Quick Start
 ### Install
 ```bash
 git clone https://github.com/Nensi1311/research-paper-formatter-agent
 cd research-paper-formatter-agent
 ```
 ### Generate corpus
 ```bash
 python scripts/generate_corpus.py
 ```
 ### Run tests
 ```bash
 python tests/test_all.py
 # → ALL TESTS PASSED (63/63)
 ```
 ### Start server
 ```bash
 uvicorn server.app:app --host 0.0.0.0 --port 7860
 ```
+### Test endpoints — Linux/macOS
 ```bash
+curl http://localhost:7860/health
 for task in formatting_compliance internal_consistency claim_evidence_audit citation_verification; do
   curl -s -X POST localhost:7860/reset \
     -H "Content-Type: application/json" \
 done
 ```
+### Test endpoints — Windows PowerShell
 ```powershell
+Invoke-RestMethod -Uri "http://localhost:7860/health"
 foreach ($task in @("formatting_compliance","internal_consistency","claim_evidence_audit","citation_verification")) {
     $body = '{"task_id":"' + $task + '"}'
     $r = Invoke-RestMethod -Uri "http://localhost:7860/reset" -Method POST -ContentType "application/json" -Body $body
 ```
 ### Docker
 ```bash
 docker build -t scholar-env .
 docker run -p 7860:7860 scholar-env
 ```
 ### Run baseline agent
 ```bash
 export API_BASE_URL="https://api-inference.huggingface.co/v1"
 export MODEL_NAME="meta-llama/Llama-3.1-8B-Instruct"
 export HF_TOKEN="hf_your_token"
+export HF_SPACE_URL="https://nensi1311-research-paper-formatter-agent.hf.space"
 python inference.py
 # Writes: baseline_scores.json
 ## API Reference
 ### `POST /reset`
 ```json
 {"task_id": "formatting_compliance"}
 ```
+Returns `observation` with `manuscript_text`, `style_guide`, `step_count`, `max_steps`, `hint`.
 ### `POST /step`
+**Task 1:**
 ```json
 {"task": "formatting_compliance", "formatted_text": "...full reformatted manuscript..."}
 ```
 {"task": "claim_evidence_audit", "action_type": "extract_claims", "section_name": "results"}
 ```
+**Tasks 2/3 — submit:**
 ```json
 {
   "task": "claim_evidence_audit",
 }
 ```
+**Task 4 — navigate:**
 ```json
 {"task": "citation_verification", "action_type": "check_citation", "citation_id": "ref_3"}
 ```
+**Task 4 — submit:**
 ```json
 {
   "task": "citation_verification",
 }
 ```
+**Response:**
 ```json
+{"observation": {...}, "reward": 0.7341, "done": false, "info": {"f_beta": 0.73, "precision": 0.8, "recall": 0.67}}
 ```
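For callers who prefer Python over curl, a minimal standard-library client might look like the sketch below. The helper names `build_request` and `post` are illustrative, and the base URL assumes the server was started locally on port 7860 as in the Quick Start; the payloads and response fields are the ones documented above.

```python
import json
from urllib.request import Request, urlopen

BASE_URL = "http://localhost:7860"  # assumes a locally running server


def build_request(path: str, payload: dict, base_url: str = BASE_URL) -> Request:
    """Build a JSON POST request for an endpoint such as /reset or /step."""
    body = json.dumps(payload).encode("utf-8")
    return Request(
        base_url + path,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def post(path: str, payload: dict, base_url: str = BASE_URL,
         timeout: float = 30.0) -> dict:
    """Send the request and decode the JSON response body."""
    with urlopen(build_request(path, payload, base_url), timeout=timeout) as resp:
        return json.load(resp)
```

With the server running, `post("/reset", {"task_id": "claim_evidence_audit"})` starts an episode, and `post("/step", {"task": "claim_evidence_audit", "action_type": "extract_claims", "section_name": "results"})` should return a dict containing `observation`, `reward`, `done`, and `info`.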
 ### Other endpoints
 | Endpoint | Method | Description |
 |---|---|---|
 | `/health` | GET | `{"status":"ok","version":"0.4.0"}` |
+| `/state` | GET | Episode state, curriculum summary, nav coverage |
 | `/tasks` | GET | All 4 task descriptions |
 | `/action_space` | GET | Full action schema |
 ```
 ├── inference.py                 ← Baseline agent (root — required by spec)
+├── models.py                    ← FormattingAction, ScholarAction, CitationAction,
+│                                   ScholarObservation, AnyAction (discriminated union)
 ├── corpus.py                    ← PaperCorpus loader
 ├── openenv.yaml                 ← 4 tasks, endpoints, authors, baseline_script
 ├── Dockerfile
 ├── requirements.txt
+├── validate-submission.sh       ← Official 3-step pre-submission validator
 │
 ├── data/
 │   ├── papers/
+│   │   ├── paper_001.json       ← NLP benchmark (easy)   — 5 refs, 1 ghost
+│   │   ├── paper_002.json       ← CV survey (medium)     — 4 refs, 1 ghost
+│   │   └── paper_003.json       ← MTL paper (hard)       — 5 refs, 1 ghost
 │   └── styles/ieee.yaml
 │
 ├── server/
+│   ├── app.py                   ← FastAPI: /reset /step /state /health /tasks
 │   ├── environment.py           ← 4-task state machine
 │   ├── reward_shaper.py         ← PBRS (Ng et al. 1999)
 │   ├── curriculum.py            ← AdaRFT + UCB1
 │   ├── citation_verifier.py     ← Citation parser + SQLite cache
 │   └── graders/
 │       ├── formatting_grader.py ← PRS 3-stage (Task 1)
+│       ├── consistency_grader.py← F-beta fuzzy-match (Task 2)
+│       └── audit_grader.py      ← F-beta + PBRS coverage (Task 3)
 │
 ├── scripts/generate_corpus.py
 └── tests/test_all.py            ← 63 assertions
 ```
 [Corpus]              8/8  ✓
+[FormattingGrader]    8/8  ✓  PRS stage locking verified
 [ConsistencyGrader]   9/9  ✓  F-beta, hallucination penalty
 [AuditGrader]         6/6  ✓  Evidence specificity, coverage bonus
 [PBRS]                6/6  ✓  Potential monotonicity, bonus bounds
 | [PRS · arXiv 2512.07478](https://arxiv.org/abs/2512.07478) | Task 1 progressive staging prevents GRPO gradient collapse |
 | [PBRS · Ng, Harada & Russell, ICML 1999](http://www.cs.utexas.edu/~ai-lab/pubs/ICML99-shaping.pdf) | Policy-invariant dense intermediate rewards |
 | [AdaRFT · arXiv 2504.05520](https://arxiv.org/abs/2504.05520) | Curriculum targeting [0.40, 0.70] productive zone |
+| [RLVE · arXiv 2511.07317](https://arxiv.org/abs/2511.07317) | Adaptive difficulty — why UCB1 maximises variance |
 | [Veri-R1 · arXiv 2510.01932](https://arxiv.org/abs/2510.01932) | Online RL for claim verification is current SOTA |
+| [LaMer · arXiv 2512.16848](https://arxiv.org/abs/2512.16848) | Structured feedback fields improve agent 11–19% |
 | [StatCheck · Epskamp 2016](https://link.springer.com/article/10.3758/s13428-015-0664-2) | ~50% of papers have errors — scale motivation |
 | [GROBID · Lopez 2008–2025](https://github.com/kermitt2/grobid) | Prior art; CitationVerifier is our RL-native alternative |
 ---
+## Baseline Scores
+| Task | Score | Notes |
+|---|---|---|
+| `formatting_compliance` | ~0.82 | Strong baseline, room to perfect |
+| `internal_consistency` | ~0.51 | F-beta precision-biased |
+| `claim_evidence_audit` | ~0.31 | **Core RL gap — biggest training value** |
+| `citation_verification` | ~0.47 | Ghost detection improving with SQLite cache |
 ---
 *The future of AI isn't just models that generate — it's models that verify.*
+[![Live Demo](https://img.shields.io/badge/%F0%9F%A4%97%20Live%20Demo-HuggingFace-blue?style=for-the-badge)](https://huggingface.co/spaces/nensi1311/research-paper-formatter-agent)
 [![GitHub](https://img.shields.io/badge/GitHub-Repository-black?style=for-the-badge&logo=github)](https://github.com/Nensi1311/research-paper-formatter-agent)
 </div>

__init__.py

@@ -4,7 +4,7 @@ ScholarEnv — OpenEnv environment for scholarly integrity verification.
 from .models import FormattingAction, ScholarAction, ScholarObservation, EpisodeStatus
 from .corpus import PaperCorpus, Paper
-__version__ = "0.3.0"
 __all__ = [
     "FormattingAction",
     "ScholarAction",

 from .models import FormattingAction, ScholarAction, ScholarObservation, EpisodeStatus
 from .corpus import PaperCorpus, Paper
+__version__ = "0.4.0"
 __all__ = [
     "FormattingAction",
     "ScholarAction",

pyproject.toml

@@ -4,7 +4,7 @@ build-backend = "setuptools.backends.legacy:build"
 [project]
 name            = "scholar-env"
-version         = "0.3.0"
 description     = "OpenEnv environment for scholarly integrity verification"
 readme          = "README.md"
 license         = {text = "Apache-2.0"}

 [project]
 name            = "scholar-env"
+version         = "0.4.0"
 description     = "OpenEnv environment for scholarly integrity verification"
 readme          = "README.md"
 license         = {text = "Apache-2.0"}

server/app.py

@@ -41,7 +41,7 @@ app = FastAPI(
         "Three tasks: formatting compliance, internal consistency, "
         "claim-evidence audit."
     ),
-    version="0.3.0",
 )
 app.add_middleware(
@@ -72,7 +72,7 @@ async def health() -> dict:
     env = get_env()
     return {
         "status": "ok",
-        "version": "0.3.0",
         "corpus_size": len(env.corpus),
         "tasks": list(TASK_CONFIG.keys()),
     }

         "Three tasks: formatting compliance, internal consistency, "
         "claim-evidence audit."
     ),
+    version="0.4.0",
 )
 app.add_middleware(
     env = get_env()
     return {
         "status": "ok",
+        "version": "0.4.0",
         "corpus_size": len(env.corpus),
         "tasks": list(TASK_CONFIG.keys()),
     }