Sync JudgeGPT app from GitHub

#4
by AliIqbal05 - opened
Files changed (2)
  1. README.md +173 -105
  2. modal_app.py +212 -193
README.md CHANGED
@@ -1,105 +1,173 @@
- ---
- title: Judge-GPT
- emoji: ⚖️
- colorFrom: yellow
- colorTo: red
- sdk: gradio
- sdk_version: 6.17.3
- app_file: app.py
- pinned: false
- license: mit
- short_description: AI-native miniature trials under 32B.
- ---
-
- # Judge-GPT
-
- Judge-GPT is a cinematic Gradio Space for the Build Small Hackathon's Thousand Token Wood track. It runs two-minute AI-native miniature trials where small-model agents act as advocates, judge, jurors, clerk, and evidence auditor.
-
- The app is built to stay under the 32B named-model budget:
-
- - `openai/gpt-oss-20b` for primary legal reasoning.
- - `openbmb/AgentCPM-Explore` for clerk/stage/verdict style.
- - `nvidia/Nemotron-Orchestrator-8B` for juror and evidence-auditor review.
-
- Total named budget: 32B parameters.
-
- ## What the app can do
-
- - Run cached trials for the Socrates and Barnaby demo cases without network search.
- - Run the Live Search Tribunal path, which builds a search packet from a user query and stops if live material is too weak to support a trial.
- - Add a hypothetical sidebar to shift the framing of a trial without editing cached case files.
- - Switch trial pacing between swift, measured, and ceremonial speeds.
- - Stage the courtroom with phase-specific visuals, agent puppets, evidence props, captions, and browser audio cues.
- - Show the Mind Layer as a compact JSON trace of agent turns and phase metadata.
- - Call a Modal streaming endpoint when `MODAL_TRIAL_URL` is configured. Endpoint or model failures stop the trial instead of substituting cached dialogue.
- - Retain decree and agent-trace export helpers in `sovereign_bench/export.py` for future UI restoration.
-
- ## Limitations
-
- - Judge-GPT is not legal advice and should not be used for real legal decisions.
- - Live search snippets are not independently verified by the app.
- - Output quality depends on Modal GPU availability, token limits, and the configured Hugging Face models.
- - Model, Modal, or live retrieval failures stop the current trial rather than returning substitute courtroom dialogue.
- - Trial results are not persisted across sessions.
- - Export generation remains in the codebase, but the visible download UI is currently hidden.
-
- ## Run locally
-
- ```powershell
- python -m pip install -r requirements.txt
- python app.py
- ```
-
- ## Modal backend
-
- The Gradio app works locally without Modal. If `MODAL_TRIAL_URL` is set, the Space calls the Modal streaming endpoint and stops the trial if the endpoint is unavailable.
-
- The deployed Modal endpoint runs each role prompt through a GPU-backed vLLM class on H100 by default. Traces mark successful GPU calls with `runtime: modal-gpu-vllm`, `provider: modal-gpu-vllm`, and `gpu: H100`. If a GPU/model load fails, the trial stops; the app does not substitute provider or cached dialogue.
-
- ```powershell
- python -m modal deploy modal_app.py
- ```
-
- Keep the deployed endpoint URL as a Hugging Face Space variable named `MODAL_TRIAL_URL`.
-
- ## Project targets
-
- Workspace connected to:
-
- - GitHub: `https://github.com/aliiqbal24/BuildSmallfinal.git`
- - Modal profile: `ali-j-iqbal24`
- - Hugging Face user: `AliIqbal05`
-
- ## Secrets
-
- Credentials are not committed to this repo.
-
- - Local Hugging Face CLI auth is stored in the Hugging Face cache.
- - Modal auth is stored in the local Modal profile.
- - Modal has a secret named `huggingface` with `HF_TOKEN`.
-
- Use the Modal secret in functions like this:
-
- ```python
- @app.function(secrets=[modal.Secret.from_name("huggingface")])
- def run_model():
-     token = os.getenv("HF_TOKEN")
- ```
-
- ## Developer guide
-
- - `app.py`: Gradio UI, CSS, JavaScript audio hooks, HTML renderers, and Modal/local streaming switch.
- - `sovereign_bench/engine.py`: trial phases, agent orchestration, verdict assembly, and trace construction.
- - `sovereign_bench/llm.py`: Hugging Face calls, strict model error handling, and prompt building.
- - `sovereign_bench/retrieval.py`: live search packet construction.
- - `sovereign_bench/models.py`: Pydantic schemas for cases, evidence, events, turns, votes, and verdicts.
- - `sovereign_bench/cases.py`: cached demo case packets.
- - `sovereign_bench/export.py`: dormant decree and trace writers.
- - `modal_app.py`: Modal deployment and GPU-backed streaming endpoint.
- - `tests/`: engine, case, and rendering regression coverage.
-
- ## Verify Modal to Hugging Face
-
- ```powershell
- python -m modal run modal_app.py
- ```
+ ---
+ title: Judge-GPT
+ emoji: ⚖️
+ colorFrom: yellow
+ colorTo: red
+ sdk: gradio
+ sdk_version: 6.17.3
+ app_file: app.py
+ pinned: false
+ license: mit
+ short_description: AI-native miniature trials under 32B.
+ tags:
+ - track:wood
+ - sponsor:openai
+ - sponsor:nvidia
+ - sponsor:modal
+ - achievement:offbrand
+ - achievement:fieldnotes
+ ---
+
+ # Judge-GPT
+
+ Judge-GPT is a cinematic Gradio courtroom for the Build Small Hackathon's Thousand Token Wood track. It turns a compact evidence packet into a two-minute AI-native trial: a clerk opens the docket, two lawyers argue opposite sides, Marcus Aurelius presides, six fixed-perspective jurors vote, and the court seals a verdict.
+
+ The point is not legal advice. It is a small-model theater for structured disagreement: evidence is visible, roles are constrained, hidden reasoning is stripped, and every trial leaves a trace of which agent said what.
+
+ ## Submission Links
+
+ - Hugging Face Space: https://huggingface.co/spaces/build-small-hackathon/JudgeGPT
+ - Demo video: https://drive.google.com/drive/folders/10pWJ7NVCsnVV7wOlqm4MGWg4Kmh4rMY2?usp=sharing
+ - Social post: TODO paste final public social post URL
+ - GitHub repo: https://github.com/aliiqbal24/BuildSmallfinal
+ - Field guide validator: https://build-small-hackathon-field-guide.hf.space/submit
+
+ ## What Judges Should Try
+
+ 1. Open the Space and keep the default `Trial of Socrates`.
+ 2. Click `Begin Trial`.
+ 3. Watch the courtroom progress from intake to verdict.
+ 4. Hover the judge, clerk, lawyers, and jurors to inspect model/agent threads.
+ 5. Open the `Evidence Drawer` and `Juror Panel` tabs after the verdict.
+ 6. Try `Greg Heffley vs Mom` for a lighter family-court case.
+ 7. Try `Custom` to write a short dispute and up to three pieces of evidence per side directly into the docket book.
+
+ ## Why It Fits Build Small
+
+ - **Thousand Token Wood:** the app is whimsical, theatrical, and AI-native rather than a generic chatbot.
+ - **Best Use of Codex:** Codex was used throughout implementation, debugging, UI iteration, tests, and commit prep in the connected GitHub repo.
+ - **Nemotron Hardware Prize:** Nemotron is a core runtime model for the jury and juror vote generation.
+ - **Best Use of Modal:** the Gradio Space delegates live model inference to a Modal GPU streaming endpoint.
+ - **Off-Brand:** the UI pushes past stock Gradio with a custom courtroom, animated puppets, docket book, evidence props, audio cues, and verdict staging.
+ - **Field Notes:** this README documents the build idea, model choices, runtime architecture, limitations, and submission checklist.
+
+ ## Small-Model Budget
+
+ Every named model is under the 32B parameter cap.
+
+ | Role | Model | Budgeted size | Used for |
+ | --- | --- | ---: | --- |
+ | Presiding advocate | `openai/gpt-oss-20b` | 20B | Judge, claimant lawyer, respondent lawyer, verdict voice |
+ | Clerk of style | `openbmb/AgentCPM-Explore` | 4B | Clerk/stage voice |
+ | Jury ring | `nvidia/Nemotron-Orchestrator-8B` | 8B | Jury panel and six juror votes |
+
+ Displayed aggregate budget: 32B. The app does not use a model above 32B.
+
+ ## How It Works
+
+ Judge-GPT runs a deterministic courtroom sequence over a `CasePacket`:
+
+ 1. Clerk opens the docket.
+ 2. Judge frames the dispute.
+ 3. Mike OSS argues for the claimant.
+ 4. Harvey Vector argues for the respondent.
+ 5. The evidence record is displayed without adding a third lawyer.
+ 6. The judge asks a hinge question.
+ 7. Each lawyer answers from their side.
+ 8. Nemotron Jury retires the panel.
+ 9. Six named jurors vote from distinct worldviews.
+ 10. The judge announces the final verdict.
+
+ The shipped demo cases are:
+
+ - `The Polis v. Socrates`
+ - `Greg Heffley v. Mom`
+ - `Custom`, built from the docket-book fields in the UI
+
+ ## Runtime Architecture
+
+ - `app.py` renders the Gradio UI, courtroom HTML/CSS, audio hooks, case preview book, and live event stream.
+ - `sovereign_bench/engine.py` orchestrates trial phases, model calls, evidence events, jury votes, verdict assembly, and trace metadata.
+ - `sovereign_bench/llm.py` builds role prompts, calls Hugging Face-compatible chat models, and rejects hidden reasoning or instruction echoes.
+ - `sovereign_bench/cases.py` contains the cached demo case packets.
+ - `modal_app.py` hosts the GPU-backed streaming endpoint used by the Space.
+ - `tests/` contains engine, case, and rendering regression tests.
+
+ The Gradio app uses `MODAL_TRIAL_URL` when set, otherwise it uses the built-in deployed Modal endpoint. The Modal app owns the Hugging Face token through a Modal secret named `huggingface`; no real credentials are committed.
+
+ ## Run Locally
+
+ ```powershell
+ python -m pip install -r requirements.txt
+ python app.py
+ ```
+
+ Open:
+
+ ```text
+ http://127.0.0.1:7860
+ ```
+
+ ## Deploy Modal Backend
+
+ ```powershell
+ python -m modal deploy modal_app.py
+ ```
+
+ After deployment, pre-warm every configured courtroom model in the deployed `sovereign-bench` app so the first trial does not wait for all GPU containers to cold start. Run this after each deploy because deployments reset Modal autoscaler overrides:
+
+ ```powershell
+ python -m modal run modal_app.py::warm_models
+ ```
+
+ If the endpoint changes, set the Hugging Face Space variable:
+
+ ```text
+ MODAL_TRIAL_URL=https://your-modal-endpoint.example
+ ```
+
+ ## Deploy Hugging Face Space
+
+ Create or upload this repo as a Gradio Space inside the official Build Small org:
+
+ ```text
+ build-small-hackathon/<your-space-name>
+ ```
+
+ Space settings:
+
+ - SDK: Gradio
+ - App file: `app.py`
+ - Python requirements: `requirements.txt`
+ - Optional variable: `MODAL_TRIAL_URL`
+ - No Space secret is required if using the hosted Modal endpoint.
+
+ ## Verification
+
+ ```powershell
+ python -m pytest
+ ```
+
+ Focused checks used during final prep:
+
+ ```powershell
+ python -m pytest tests/test_engine.py tests/test_ui_rendering.py
+ ```
+
+ ## Limitations
+
+ - Judge-GPT is not legal advice and should not be used for real legal decisions.
+ - The demo packets are compact, staged evidence packets, not exhaustive source research.
+ - Model, Modal, or retrieval failures stop the current trial instead of substituting fake dialogue.
+ - Trial results are not persisted across sessions.
+ - Custom trials require a short case context and evidence from both sides.
+
+ ## Final Submission Checklist
+
+ - [ ] Push the repo to the Build Small Hugging Face org as a Gradio Space.
+ - [ ] Confirm the Space launches and can complete `Trial of Socrates`.
+ - [ ] Record a short demo video showing the trial flow and verdict.
+ - [ ] Replace the `Demo video` TODO above with the final public URL.
+ - [ ] Publish one social post about the app.
+ - [ ] Replace the `Social post` TODO above with the final public URL.
+ - [ ] Run the README through the Build Small validator.
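Review note on the Small-Model Budget table: the "Displayed aggregate budget: 32B" figure can be checked by totalling the listed sizes. The sketch below is illustrative only. Its `MODEL_BUDGET` list is a hand-copied stand-in for `sovereign_bench.engine.MODEL_BUDGET`, and the `(role, model id, size in billions)` tuple layout is an assumption inferred from the three-element unpacking in `modal_app.py`. `CAP_B` and `check_budget` are hypothetical names, not part of the repo.

```python
# Hypothetical stand-in for sovereign_bench.engine.MODEL_BUDGET.
# Assumed layout: (role label, Hugging Face model id, size in billions).
MODEL_BUDGET = [
    ("Presiding advocate", "openai/gpt-oss-20b", 20),
    ("Clerk of style", "openbmb/AgentCPM-Explore", 4),
    ("Jury ring", "nvidia/Nemotron-Orchestrator-8B", 8),
]

CAP_B = 32  # parameter cap from the README, in billions


def check_budget(budget, cap_b=CAP_B):
    """Return (unique model ids, aggregate size, ids over the cap)."""
    # dict.fromkeys de-duplicates while keeping first-seen order,
    # the same idiom modal_app.py uses for USED_MODEL_IDS.
    model_ids = tuple(dict.fromkeys(model for _, model, _ in budget))
    sizes = {model: size for _, model, size in budget}
    total = sum(sizes[model] for model in model_ids)
    over_cap = [model for model in model_ids if sizes[model] > cap_b]
    return model_ids, total, over_cap


if __name__ == "__main__":
    ids, total, over_cap = check_budget(MODEL_BUDGET)
    print(f"{len(ids)} models, {total}B aggregate, over cap: {over_cap}")
```

With the sizes from the table, the aggregate is 20 + 4 + 8 = 32, which matches the displayed figure, and no single model exceeds the cap.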
modal_app.py CHANGED
@@ -1,193 +1,212 @@
- import os
- import time
-
- import modal
-
- from sovereign_bench.engine import stream_trial_jsonl
- from sovereign_bench.llm import (
-     ModelCall,
-     ModelResult,
-     build_role_messages,
-     messages_hash,
- )
- from sovereign_bench.models import TrialRequest
-
- app = modal.App("sovereign-bench")
- GPU_NAME = "H100"
- GPU_TIMEOUT_SECONDS = 20 * 60
- HF_CACHE_DIR = "/root/.cache/huggingface"
-
- image = (
-     modal.Image.debian_slim(python_version="3.12")
-     .pip_install("fastapi", "huggingface_hub", "httpx", "pydantic")
-     .add_local_dir("sovereign_bench", remote_path="/root/sovereign_bench")
- )
-
- model_cache = modal.Volume.from_name("sovereign-bench-model-cache", create_if_missing=True)
-
- vllm_image = (
-     modal.Image.from_registry("nvidia/cuda:12.8.1-devel-ubuntu22.04", add_python="3.12")
-     .entrypoint([])
-     .uv_pip_install(
-         "vllm==0.18.1",
-         "huggingface_hub[hf_transfer]==0.36.0",
-         "transformers",
-         "httpx",
-         "pydantic",
-     )
-     .env(
-         {
-             "HF_HUB_ENABLE_HF_TRANSFER": "1",
-             "HF_HOME": HF_CACHE_DIR,
-             "VLLM_WORKER_MULTIPROC_METHOD": "spawn",
-             "VLLM_USE_FLASHINFER_MOE_MXFP4_MXFP8": "1",
-         }
-     )
-     .add_local_dir("sovereign_bench", remote_path="/root/sovereign_bench")
- )
-
-
- @app.cls(
-     image=vllm_image,
-     gpu=GPU_NAME,
-     secrets=[modal.Secret.from_name("huggingface")],
-     volumes={HF_CACHE_DIR: model_cache},
-     timeout=GPU_TIMEOUT_SECONDS,
-     scaledown_window=10 * 60,
-     max_containers=3,
- )
- class VllmModel:
-     model_id: str = modal.parameter()
-
-     @modal.enter()
-     def load(self) -> None:
-         from vllm import LLM, SamplingParams
-
-         self.SamplingParams = SamplingParams
-         self.llm = LLM(
-             model=self.model_id,
-             trust_remote_code=True,
-             max_model_len=4096,
-             gpu_memory_utilization=0.9,
-         )
-
-     @modal.method()
-     def generate(self, payload: dict) -> dict:
-         from sovereign_bench.llm import ModelCallError, clean_model_text
-
-         started = time.perf_counter()
-         messages = payload["messages"]
-         max_tokens = int(payload.get("max_tokens") or 120)
-         temperature = float(payload.get("temperature") or 0.45)
-         sampling_params = self.SamplingParams(
-             max_tokens=max_tokens,
-             temperature=temperature,
-             top_p=0.9,
-         )
-         retry_messages = messages + [
-             {
-                 "role": "user",
-                 "content": (
-                     "Your previous response did not include visible courtroom dialogue. "
-                     "Return only the final spoken dialogue now. Do not include <think>, analysis, reasoning, markdown, or notes. /no_think"
-                 ),
-             }
-         ]
-         last_error: Exception | None = None
-         text = ""
-         for attempt_messages in (messages, retry_messages):
-             outputs = self.llm.chat(
-                 [attempt_messages],
-                 sampling_params=sampling_params,
-                 use_tqdm=False,
-                 chat_template_kwargs={"enable_thinking": False},
-             )
-             raw_text = outputs[0].outputs[0].text.strip()
-             try:
-                 text = clean_model_text(raw_text)
-                 break
-             except ModelCallError as exc:
-                 last_error = exc
-         if not text and last_error:
-             raise last_error
-         return {
-             "text": text,
-             "latency_ms": int((time.perf_counter() - started) * 1000),
-         }
-
-
- def modal_gpu_enabled() -> bool:
-     return os.getenv("SOVEREIGN_DISABLE_MODAL_GPU", "").lower() not in {"1", "true", "yes"}
-
-
- def modal_gpu_runner(**kwargs) -> ModelResult:
-     messages = build_role_messages(
-         agent=kwargs["agent"],
-         role=kwargs["role"],
-         case_summary=kwargs["case_summary"],
-         task=kwargs["task"],
-         evidence_summary=kwargs["evidence_summary"],
-     )
-     requested_model = kwargs["model"]
-     prompt_hash = messages_hash(messages)
-
-     if modal_gpu_enabled():
-         output = VllmModel(model_id=requested_model).generate.remote(
-             {
-                 "messages": messages,
-                 "max_tokens": kwargs.get("max_tokens", 120),
-                 "temperature": 0.45,
-             }
-         )
-         return ModelResult(
-             text=output["text"],
-             input_text="\n\n".join(f"{item.get('role', 'user').upper()}:\n{item.get('content', '')}" for item in messages)
-             + "\n\nASSISTANT:\n",
-             call=ModelCall(
-                 model=requested_model,
-                 provider="modal-gpu-vllm",
-                 ok=True,
-                 latency_ms=output["latency_ms"],
-                 prompt_hash=prompt_hash,
-                 requested_model=requested_model,
-                 runtime="modal-gpu-vllm",
-                 gpu=GPU_NAME,
-             ),
-         )
-
-     raise RuntimeError("Modal GPU is disabled; no provider fallback is allowed.")
-
-
- @app.function(image=image, secrets=[modal.Secret.from_name("huggingface")])
- def check_huggingface_connection() -> str:
-     token = os.getenv("HF_TOKEN")
-     if not token:
-         return "HF_TOKEN is not available inside Modal."
-
-     from huggingface_hub import HfApi
-
-     user = HfApi(token=token).whoami()["name"]
-     return f"Connected to Hugging Face as {user}."
-
-
- @app.function(
-     image=image,
-     secrets=[modal.Secret.from_name("huggingface")],
-     min_containers=1,
-     timeout=GPU_TIMEOUT_SECONDS,
- )
- @modal.fastapi_endpoint(method="POST", label="trial-stream")
- def trial_stream(payload: dict):
-     from fastapi.responses import StreamingResponse
-
-     request = TrialRequest.model_validate(payload)
-     delay = {"swift": 0.02, "measured": 0.12, "ceremonial": 0.25}[request.speed]
-     return StreamingResponse(
-         stream_trial_jsonl(request, delay=delay, model_runner=modal_gpu_runner),
-         media_type="application/x-ndjson",
-     )
-
-
- @app.local_entrypoint()
- def main():
-     print(check_huggingface_connection.remote())
+ import os
+ import time
+
+ import modal
+
+ from sovereign_bench.engine import MODEL_BUDGET, stream_trial_jsonl
+ from sovereign_bench.llm import (
+     ModelCall,
+     ModelResult,
+     build_role_messages,
+     messages_hash,
+ )
+ from sovereign_bench.models import TrialRequest
+
+ MODAL_APP_NAME = "sovereign-bench"
+ app = modal.App(MODAL_APP_NAME)
+ GPU_NAME = "H100"
+ GPU_TIMEOUT_SECONDS = 20 * 60
+ HF_CACHE_DIR = "/root/.cache/huggingface"
+ USED_MODEL_IDS = tuple(dict.fromkeys(model for _, model, _ in MODEL_BUDGET))
+
+ image = (
+     modal.Image.debian_slim(python_version="3.12")
+     .pip_install("fastapi", "huggingface_hub", "httpx", "pydantic")
+     .add_local_dir("sovereign_bench", remote_path="/root/sovereign_bench")
+ )
+
+ model_cache = modal.Volume.from_name("sovereign-bench-model-cache", create_if_missing=True)
+
+ vllm_image = (
+     modal.Image.from_registry("nvidia/cuda:12.8.1-devel-ubuntu22.04", add_python="3.12")
+     .entrypoint([])
+     .uv_pip_install(
+         "vllm==0.18.1",
+         "huggingface_hub[hf_transfer]==0.36.0",
+         "transformers",
+         "httpx",
+         "pydantic",
+     )
+     .env(
+         {
+             "HF_HUB_ENABLE_HF_TRANSFER": "1",
+             "HF_HOME": HF_CACHE_DIR,
+             "VLLM_WORKER_MULTIPROC_METHOD": "spawn",
+             "VLLM_USE_FLASHINFER_MOE_MXFP4_MXFP8": "1",
+         }
+     )
+     .add_local_dir("sovereign_bench", remote_path="/root/sovereign_bench")
+ )
+
+
+ @app.cls(
+     image=vllm_image,
+     gpu=GPU_NAME,
+     secrets=[modal.Secret.from_name("huggingface")],
+     volumes={HF_CACHE_DIR: model_cache},
+     timeout=GPU_TIMEOUT_SECONDS,
+     scaledown_window=10 * 60,
+     max_containers=3,
+ )
+ class VllmModel:
+     model_id: str = modal.parameter()
+
+     @modal.enter()
+     def load(self) -> None:
+         from vllm import LLM, SamplingParams
+
+         self.SamplingParams = SamplingParams
+         self.llm = LLM(
+             model=self.model_id,
+             trust_remote_code=True,
+             max_model_len=4096,
+             gpu_memory_utilization=0.9,
+         )
+
+     @modal.method()
+     def generate(self, payload: dict) -> dict:
+         from sovereign_bench.llm import ModelCallError, clean_model_text
+
+         started = time.perf_counter()
+         messages = payload["messages"]
+         max_tokens = int(payload.get("max_tokens") or 120)
+         temperature = float(payload.get("temperature") or 0.45)
+         sampling_params = self.SamplingParams(
+             max_tokens=max_tokens,
+             temperature=temperature,
+             top_p=0.9,
+         )
+         retry_messages = messages + [
+             {
+                 "role": "user",
+                 "content": (
+                     "Your previous response did not include visible courtroom dialogue. "
+                     "Return only the final answer now. Do not mention prompts, tasks, requirements, or that you are following instructions. "
+                     "Do not include <think>, analysis, reasoning, markdown, narration, or notes. /no_think"
+                 ),
+             }
+         ]
+         last_error: Exception | None = None
+         text = ""
+         for attempt_messages in (messages, retry_messages):
+             outputs = self.llm.chat(
+                 [attempt_messages],
+                 sampling_params=sampling_params,
+                 use_tqdm=False,
+                 chat_template_kwargs={"enable_thinking": False},
+             )
+             raw_text = outputs[0].outputs[0].text.strip()
+             try:
+                 text = clean_model_text(raw_text)
+                 break
+             except ModelCallError as exc:
+                 last_error = exc
+         if not text and last_error:
+             raise last_error
+         return {
+             "text": text,
+             "latency_ms": int((time.perf_counter() - started) * 1000),
+         }
+
+     @modal.method()
+     def warm(self) -> dict:
+         return {"model": self.model_id, "status": "warm"}
+
+
+ def modal_gpu_enabled() -> bool:
+     return os.getenv("SOVEREIGN_DISABLE_MODAL_GPU", "").lower() not in {"1", "true", "yes"}
+
+
+ def modal_gpu_runner(**kwargs) -> ModelResult:
+     messages = build_role_messages(
+         agent=kwargs["agent"],
+         role=kwargs["role"],
+         case_summary=kwargs["case_summary"],
+         task=kwargs["task"],
+         evidence_summary=kwargs["evidence_summary"],
+         trial_history=kwargs.get("trial_history", ""),
+         persona=kwargs.get("persona", ""),
+         objective=kwargs.get("objective", ""),
+     )
+     requested_model = kwargs["model"]
+     prompt_hash = messages_hash(messages)
+
+     if modal_gpu_enabled():
+         output = VllmModel(model_id=requested_model).generate.remote(
+             {
+                 "messages": messages,
+                 "max_tokens": kwargs.get("max_tokens", 120),
+                 "temperature": 0.45,
+             }
+         )
+         return ModelResult(
+             text=output["text"],
+             input_text="\n\n".join(f"{item.get('role', 'user').upper()}:\n{item.get('content', '')}" for item in messages)
+             + "\n\nASSISTANT:\n",
+             call=ModelCall(
+                 model=requested_model,
+                 provider="modal-gpu-vllm",
+                 ok=True,
+                 latency_ms=output["latency_ms"],
+                 prompt_hash=prompt_hash,
+                 requested_model=requested_model,
+                 runtime="modal-gpu-vllm",
+                 gpu=GPU_NAME,
+             ),
+         )
+
+     raise RuntimeError("Modal GPU is disabled; no provider fallback is allowed.")
+
+
+ @app.function(image=image, secrets=[modal.Secret.from_name("huggingface")])
+ def check_huggingface_connection() -> str:
+     token = os.getenv("HF_TOKEN")
+     if not token:
+         return "HF_TOKEN is not available inside Modal."
+
+     from huggingface_hub import HfApi
+
+     user = HfApi(token=token).whoami()["name"]
+     return f"Connected to Hugging Face as {user}."
+
+
+ @app.function(
+     image=image,
+     secrets=[modal.Secret.from_name("huggingface")],
+     min_containers=1,
+     timeout=GPU_TIMEOUT_SECONDS,
+ )
+ @modal.fastapi_endpoint(method="POST", label="trial-stream")
+ def trial_stream(payload: dict):
+     from fastapi.responses import StreamingResponse
+
+     request = TrialRequest.model_validate(payload)
+     delay = {"swift": 0.02, "measured": 0.12, "ceremonial": 0.25}[request.speed]
+     return StreamingResponse(
+         stream_trial_jsonl(request, delay=delay, model_runner=modal_gpu_runner),
+         media_type="application/x-ndjson",
+     )
+
+
+ @app.local_entrypoint()
+ def main():
+     print(check_huggingface_connection.remote())
+
+
+ @app.local_entrypoint()
+ def warm_models():
+     deployed_model = modal.Cls.from_name(MODAL_APP_NAME, "VllmModel")
+     for model_id in USED_MODEL_IDS:
+         model = deployed_model(model_id=model_id)
+         model.update_autoscaler(min_containers=1)
+         print(model.warm.remote())
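Review note on `VllmModel.generate`: the method tries the original messages once, then once more with an extra user turn asking for visible dialogue only, and re-raises the last cleaning error if both attempts fail. The sketch below isolates that control flow so it can be exercised without a GPU. It is a simplified illustration, not the repo's code: `ModelCallError` and `clean_model_text` are hypothetical stand-ins for the helpers in `sovereign_bench.llm` (whose real rejection rules are not shown in this diff), and `chat` is any callable standing in for `self.llm.chat`.

```python
class ModelCallError(Exception):
    """Stand-in for sovereign_bench.llm.ModelCallError (assumed shape)."""


def clean_model_text(raw: str) -> str:
    # Hypothetical cleaner: the real one is not shown in the diff.
    # Here, empty output or a visible <think> tag counts as a failure.
    text = raw.strip()
    if not text or "<think>" in text:
        raise ModelCallError("no visible courtroom dialogue")
    return text


def generate_with_retry(chat, messages: list, retry_prompt: str) -> str:
    """Try the original messages, then one retry with an extra user turn."""
    retry_messages = messages + [{"role": "user", "content": retry_prompt}]
    last_error = None
    for attempt_messages in (messages, retry_messages):
        raw_text = chat(attempt_messages)
        try:
            # First attempt that survives cleaning wins.
            return clean_model_text(raw_text)
        except ModelCallError as exc:
            last_error = exc
    # Both attempts failed: surface the last error, no fallback text.
    raise last_error


if __name__ == "__main__":
    replies = iter(["<think>drafting</think>", "The court is now in session."])
    result = generate_with_retry(
        lambda msgs: next(replies),
        [{"role": "user", "content": "Open the docket."}],
        "Return only the final answer now.",
    )
    print(result)
```

Because the retry reuses the same sampling settings in the diff, the second attempt differs only by the appended instruction, which keeps the failure mode simple: either cleaned text comes back or the trial stops with the last error.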