Sync JudgeGPT app from GitHub

by AliIqbal05 - opened 17 days ago

base: refs/heads/main ← from: refs/pr/4

Files changed (2): +385 -298

README.md +173 -105
modal_app.py +212 -193

README.md CHANGED

@@ -1,105 +1,173 @@
----
-title: Judge-GPT
-emoji: ⚖️
-colorFrom: yellow
-colorTo: red
-sdk: gradio
-sdk_version: 6.17.3
-app_file: app.py
-pinned: false
-license: mit
-short_description: AI-native miniature trials under 32B.
----
-# Judge-GPT
-Judge-GPT is a cinematic Gradio Space for the Build Small Hackathon's Thousand Token Wood track. It runs two-minute AI-native miniature trials where small-model agents act as advocates, judge, jurors, clerk, and evidence auditor.
-The app is built to stay under the 32B named-model budget:
-- `openai/gpt-oss-20b` for primary legal reasoning.
-- `openbmb/AgentCPM-Explore` for clerk/stage/verdict style.
-- `nvidia/Nemotron-Orchestrator-8B` for juror and evidence-auditor review.
-Total named budget: 32B parameters.
-## What the app can do
-- Run cached trials for the Socrates and Barnaby demo cases without network search.
-- Run the Live Search Tribunal path, which builds a search packet from a user query and stops if live material is too weak to support a trial.
-- Add a hypothetical sidebar to shift the framing of a trial without editing cached case files.
-- Switch trial pacing between swift, measured, and ceremonial speeds.
-- Stage the courtroom with phase-specific visuals, agent puppets, evidence props, captions, and browser audio cues.
-- Show the Mind Layer as a compact JSON trace of agent turns and phase metadata.
-- Call a Modal streaming endpoint when `MODAL_TRIAL_URL` is configured. Endpoint or model failures stop the trial instead of substituting cached dialogue.
-- Retain decree and agent-trace export helpers in `sovereign_bench/export.py` for future UI restoration.
-## Limitations
-- Judge-GPT is not legal advice and should not be used for real legal decisions.
-- Live search snippets are not independently verified by the app.
-- Output quality depends on Modal GPU availability, token limits, and the configured Hugging Face models.
-- Model, Modal, or live retrieval failures stop the current trial rather than returning substitute courtroom dialogue.
-- Trial results are not persisted across sessions.
-- Export generation remains in the codebase, but the visible download UI is currently hidden.
-## Run locally
-```powershell
-python -m pip install -r requirements.txt
-python app.py
-```
-## Modal backend
-The Gradio app works locally without Modal. If `MODAL_TRIAL_URL` is set, the Space calls the Modal streaming endpoint and stops the trial if the endpoint is unavailable.
-The deployed Modal endpoint runs each role prompt through a GPU-backed vLLM class on H100 by default. Traces mark successful GPU calls with `runtime: modal-gpu-vllm`, `provider: modal-gpu-vllm`, and `gpu: H100`. If a GPU/model load fails, the trial stops; the app does not substitute provider or cached dialogue.
-```powershell
-python -m modal deploy modal_app.py
-```
-Keep the deployed endpoint URL as a Hugging Face Space variable named `MODAL_TRIAL_URL`.
-## Project targets
-Workspace connected to:
-- GitHub: `https://github.com/aliiqbal24/BuildSmallfinal.git`
-- Modal profile: `ali-j-iqbal24`
-- Hugging Face user: `AliIqbal05`
-## Secrets
-Credentials are not committed to this repo.
-- Local Hugging Face CLI auth is stored in the Hugging Face cache.
-- Modal auth is stored in the local Modal profile.
-- Modal has a secret named `huggingface` with `HF_TOKEN`.
-Use the Modal secret in functions like this:
-```python
-@app.function(secrets=[modal.Secret.from_name("huggingface")])
-def run_model():
-    token = os.getenv("HF_TOKEN")
-```
-## Developer guide
-- `app.py`: Gradio UI, CSS, JavaScript audio hooks, HTML renderers, and Modal/local streaming switch.
-- `sovereign_bench/engine.py`: trial phases, agent orchestration, verdict assembly, and trace construction.
-- `sovereign_bench/llm.py`: Hugging Face calls, strict model error handling, and prompt building.
-- `sovereign_bench/retrieval.py`: live search packet construction.
-- `sovereign_bench/models.py`: Pydantic schemas for cases, evidence, events, turns, votes, and verdicts.
-- `sovereign_bench/cases.py`: cached demo case packets.
-- `sovereign_bench/export.py`: dormant decree and trace writers.
-- `modal_app.py`: Modal deployment and GPU-backed streaming endpoint.
-- `tests/`: engine, case, and rendering regression coverage.
-## Verify Modal to Hugging Face
-```powershell
-python -m modal run modal_app.py
-```

+---
+title: Judge-GPT
+emoji: ⚖️
+colorFrom: yellow
+colorTo: red
+sdk: gradio
+sdk_version: 6.17.3
+app_file: app.py
+pinned: false
+license: mit
+short_description: AI-native miniature trials under 32B.
+tags:
+  - track:wood
+  - sponsor:openai
+  - sponsor:nvidia
+  - sponsor:modal
+  - achievement:offbrand
+  - achievement:fieldnotes
+---
+# Judge-GPT
+Judge-GPT is a cinematic Gradio courtroom for the Build Small Hackathon's Thousand Token Wood track. It turns a compact evidence packet into a two-minute AI-native trial: a clerk opens the docket, two lawyers argue opposite sides, Marcus Aurelius presides, six fixed-perspective jurors vote, and the court seals a verdict.
+The point is not legal advice. It is a small-model theater for structured disagreement: evidence is visible, roles are constrained, hidden reasoning is stripped, and every trial leaves a trace of which agent said what.
+## Submission Links
+- Hugging Face Space: https://huggingface.co/spaces/build-small-hackathon/JudgeGPT
+- Demo video: https://drive.google.com/drive/folders/10pWJ7NVCsnVV7wOlqm4MGWg4Kmh4rMY2?usp=sharing
+- Social post: TODO paste final public social post URL
+- GitHub repo: https://github.com/aliiqbal24/BuildSmallfinal
+- Field guide validator: https://build-small-hackathon-field-guide.hf.space/submit
+## What Judges Should Try
+1. Open the Space and keep the default `Trial of Socrates`.
+2. Click `Begin Trial`.
+3. Watch the courtroom progress from intake to verdict.
+4. Hover the judge, clerk, lawyers, and jurors to inspect model/agent threads.
+5. Open the `Evidence Drawer` and `Juror Panel` tabs after the verdict.
+6. Try `Greg Heffley vs Mom` for a lighter family-court case.
+7. Try `Custom` to write a short dispute and up to three pieces of evidence per side directly into the docket book.
+## Why It Fits Build Small
+- **Thousand Token Wood:** the app is whimsical, theatrical, and AI-native rather than a generic chatbot.
+- **Best Use of Codex:** Codex was used throughout implementation, debugging, UI iteration, tests, and commit prep in the connected GitHub repo.
+- **Nemotron Hardware Prize:** Nemotron is a core runtime model for the jury and juror vote generation.
+- **Best Use of Modal:** the Gradio Space delegates live model inference to a Modal GPU streaming endpoint.
+- **Off-Brand:** the UI pushes past stock Gradio with a custom courtroom, animated puppets, docket book, evidence props, audio cues, and verdict staging.
+- **Field Notes:** this README documents the build idea, model choices, runtime architecture, limitations, and submission checklist.
+## Small-Model Budget
+Every named model is under the 32B parameter cap.
+| Role | Model | Budgeted size | Used for |
+| --- | --- | ---: | --- |
+| Presiding advocate | `openai/gpt-oss-20b` | 20B | Judge, claimant lawyer, respondent lawyer, verdict voice |
+| Clerk of style | `openbmb/AgentCPM-Explore` | 4B | Clerk/stage voice |
+| Jury ring | `nvidia/Nemotron-Orchestrator-8B` | 8B | Jury panel and six juror votes |
+Displayed aggregate budget: 32B. The app does not use a model above 32B.
+## How It Works
+Judge-GPT runs a deterministic courtroom sequence over a `CasePacket`:
+1. Clerk opens the docket.
+2. Judge frames the dispute.
+3. Mike OSS argues for the claimant.
+4. Harvey Vector argues for the respondent.
+5. The evidence record is displayed without adding a third lawyer.
+6. The judge asks a hinge question.
+7. Each lawyer answers from their side.
+8. Nemotron Jury retires the panel.
+9. Six named jurors vote from distinct worldviews.
+10. The judge announces the final verdict.
+The shipped demo cases are:
+- `The Polis v. Socrates`
+- `Greg Heffley v. Mom`
+- `Custom`, built from the docket-book fields in the UI
+## Runtime Architecture
+- `app.py` renders the Gradio UI, courtroom HTML/CSS, audio hooks, case preview book, and live event stream.
+- `sovereign_bench/engine.py` orchestrates trial phases, model calls, evidence events, jury votes, verdict assembly, and trace metadata.
+- `sovereign_bench/llm.py` builds role prompts, calls Hugging Face-compatible chat models, and rejects hidden reasoning or instruction echoes.
+- `sovereign_bench/cases.py` contains the cached demo case packets.
+- `modal_app.py` hosts the GPU-backed streaming endpoint used by the Space.
+- `tests/` contains engine, case, and rendering regression tests.
+The Gradio app uses `MODAL_TRIAL_URL` when it is set; otherwise it falls back to the built-in deployed Modal endpoint. The Modal app reads the Hugging Face token from a Modal secret named `huggingface`; no real credentials are committed.
+## Run Locally
+```powershell
+python -m pip install -r requirements.txt
+python app.py
+```
+Open:
+```text
+http://127.0.0.1:7860
+```
+## Deploy Modal Backend
+```powershell
+python -m modal deploy modal_app.py
+```
+After deployment, pre-warm every configured courtroom model in the deployed `sovereign-bench` app so the first trial does not wait for all GPU containers to cold start. Run this after each deploy because deployments reset Modal autoscaler overrides:
+```powershell
+python -m modal run modal_app.py::warm_models
+```
+If the endpoint changes, set the Hugging Face Space variable:
+```text
+MODAL_TRIAL_URL=https://your-modal-endpoint.example
+```
+## Deploy Hugging Face Space
+Create or upload this repo as a Gradio Space inside the official Build Small org:
+```text
+build-small-hackathon/<your-space-name>
+```
+Space settings:
+- SDK: Gradio
+- App file: `app.py`
+- Python requirements: `requirements.txt`
+- Optional variable: `MODAL_TRIAL_URL`
+- No Space secret is required if using the hosted Modal endpoint.
+## Verification
+```powershell
+python -m pytest
+```
+Focused checks used during final prep:
+```powershell
+python -m pytest tests/test_engine.py tests/test_ui_rendering.py
+```
+## Limitations
+- Judge-GPT is not legal advice and should not be used for real legal decisions.
+- The demo packets are compact, staged evidence packets, not exhaustive source research.
+- Model, Modal, or retrieval failures stop the current trial instead of substituting fake dialogue.
+- Trial results are not persisted across sessions.
+- Custom trials require a short case context and evidence from both sides.
+## Final Submission Checklist
+- [ ] Push the repo to the Build Small Hugging Face org as a Gradio Space.
+- [ ] Confirm the Space launches and can complete `Trial of Socrates`.
+- [ ] Record a short demo video showing the trial flow and verdict.
+- [ ] Confirm the `Demo video` link above is the final public URL.
+- [ ] Publish one social post about the app.
+- [ ] Replace the `Social post` TODO above with the final public URL.
+- [ ] Run the README through the Build Small validator.
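
Reviewer sketch (not part of the PR): the README says the Space uses `MODAL_TRIAL_URL` when set and otherwise a built-in endpoint, and `modal_app.py` returns events as `application/x-ndjson`. The client-side pattern might look roughly like the following. `app.py` is not shown in this diff, so `DEFAULT_TRIAL_URL`, `resolve_trial_url`, `iter_ndjson`, and the `phase`/`text` event fields are all hypothetical names chosen for illustration.

```python
from __future__ import annotations

import json
import os
from typing import Iterable, Iterator, Mapping

# Hypothetical placeholder; the real built-in endpoint URL is not shown in this diff.
DEFAULT_TRIAL_URL = "https://your-modal-endpoint.example"


def resolve_trial_url(env: Mapping[str, str] | None = None) -> str:
    """Prefer MODAL_TRIAL_URL when it is set and non-empty, else use the default."""
    env = os.environ if env is None else env
    return env.get("MODAL_TRIAL_URL", "").strip() or DEFAULT_TRIAL_URL


def iter_ndjson(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one decoded JSON object per non-empty line of an NDJSON stream."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


if __name__ == "__main__":
    # Assumed event shape, for illustration only.
    sample = [
        '{"phase": "intake", "text": "Docket opened."}',
        "",
        '{"phase": "verdict", "text": "Sealed."}',
    ]
    print(resolve_trial_url({}))
    for event in iter_ndjson(sample):
        print(event["phase"], "-", event["text"])
```

In a real client the `lines` argument would come from a streaming HTTP response rather than a list; blank keep-alive lines are skipped so they do not raise a JSON decode error.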

modal_app.py CHANGED

@@ -1,193 +1,212 @@
-import os
-import time
-import modal
-from sovereign_bench.engine import stream_trial_jsonl
-from sovereign_bench.llm import (
-    ModelCall,
-    ModelResult,
-    build_role_messages,
-    messages_hash,
-)
-from sovereign_bench.models import TrialRequest
-app = modal.App("sovereign-bench")
-GPU_NAME = "H100"
-GPU_TIMEOUT_SECONDS = 20 * 60
-HF_CACHE_DIR = "/root/.cache/huggingface"
-image = (
-    modal.Image.debian_slim(python_version="3.12")
-    .pip_install("fastapi", "huggingface_hub", "httpx", "pydantic")
-    .add_local_dir("sovereign_bench", remote_path="/root/sovereign_bench")
-)
-model_cache = modal.Volume.from_name("sovereign-bench-model-cache", create_if_missing=True)
-vllm_image = (
-    modal.Image.from_registry("nvidia/cuda:12.8.1-devel-ubuntu22.04", add_python="3.12")
-    .entrypoint([])
-    .uv_pip_install(
-        "vllm==0.18.1",
-        "huggingface_hub[hf_transfer]==0.36.0",
-        "transformers",
-        "httpx",
-        "pydantic",
-    )
-    .env(
-        {
-            "HF_HUB_ENABLE_HF_TRANSFER": "1",
-            "HF_HOME": HF_CACHE_DIR,
-            "VLLM_WORKER_MULTIPROC_METHOD": "spawn",
-            "VLLM_USE_FLASHINFER_MOE_MXFP4_MXFP8": "1",
-        }
-    )
-    .add_local_dir("sovereign_bench", remote_path="/root/sovereign_bench")
-)
-@app.cls(
-    image=vllm_image,
-    gpu=GPU_NAME,
-    secrets=[modal.Secret.from_name("huggingface")],
-    volumes={HF_CACHE_DIR: model_cache},
-    timeout=GPU_TIMEOUT_SECONDS,
-    scaledown_window=10 * 60,
-    max_containers=3,
-)
-class VllmModel:
-    model_id: str = modal.parameter()
-    @modal.enter()
-    def load(self) -> None:
-        from vllm import LLM, SamplingParams
-        self.SamplingParams = SamplingParams
-        self.llm = LLM(
-            model=self.model_id,
-            trust_remote_code=True,
-            max_model_len=4096,
-            gpu_memory_utilization=0.9,
-        )
-    @modal.method()
-    def generate(self, payload: dict) -> dict:
-        from sovereign_bench.llm import ModelCallError, clean_model_text
-        started = time.perf_counter()
-        messages = payload["messages"]
-        max_tokens = int(payload.get("max_tokens") or 120)
-        temperature = float(payload.get("temperature") or 0.45)
-        sampling_params = self.SamplingParams(
-            max_tokens=max_tokens,
-            temperature=temperature,
-            top_p=0.9,
-        )
-        retry_messages = messages + [
-            {
-                "role": "user",
-                "content": (
-                    "Your previous response did not include visible courtroom dialogue. "
-                    "Return only the final spoken dialogue now. Do not include <think>, analysis, reasoning, markdown, or notes. /no_think"
-                ),
-            }
-        ]
-        last_error: Exception | None = None
-        text = ""
-        for attempt_messages in (messages, retry_messages):
-            outputs = self.llm.chat(
-                [attempt_messages],
-                sampling_params=sampling_params,
-                use_tqdm=False,
-                chat_template_kwargs={"enable_thinking": False},
-            )
-            raw_text = outputs[0].outputs[0].text.strip()
-            try:
-                text = clean_model_text(raw_text)
-                break
-            except ModelCallError as exc:
-                last_error = exc
-        if not text and last_error:
-            raise last_error
-        return {
-            "text": text,
-            "latency_ms": int((time.perf_counter() - started) * 1000),
-        }
-def modal_gpu_enabled() -> bool:
-    return os.getenv("SOVEREIGN_DISABLE_MODAL_GPU", "").lower() not in {"1", "true", "yes"}
-def modal_gpu_runner(**kwargs) -> ModelResult:
-    messages = build_role_messages(
-        agent=kwargs["agent"],
-        role=kwargs["role"],
-        case_summary=kwargs["case_summary"],
-        task=kwargs["task"],
-        evidence_summary=kwargs["evidence_summary"],
-    )
-    requested_model = kwargs["model"]
-    prompt_hash = messages_hash(messages)
-    if modal_gpu_enabled():
-        output = VllmModel(model_id=requested_model).generate.remote(
-            {
-                "messages": messages,
-                "max_tokens": kwargs.get("max_tokens", 120),
-                "temperature": 0.45,
-            }
-        )
-        return ModelResult(
-            text=output["text"],
-            input_text="\n\n".join(f"{item.get('role', 'user').upper()}:\n{item.get('content', '')}" for item in messages)
-            + "\n\nASSISTANT:\n",
-            call=ModelCall(
-                model=requested_model,
-                provider="modal-gpu-vllm",
-                ok=True,
-                latency_ms=output["latency_ms"],
-                prompt_hash=prompt_hash,
-                requested_model=requested_model,
-                runtime="modal-gpu-vllm",
-                gpu=GPU_NAME,
-            ),
-        )
-    raise RuntimeError("Modal GPU is disabled; no provider fallback is allowed.")
-@app.function(image=image, secrets=[modal.Secret.from_name("huggingface")])
-def check_huggingface_connection() -> str:
-    token = os.getenv("HF_TOKEN")
-    if not token:
-        return "HF_TOKEN is not available inside Modal."
-    from huggingface_hub import HfApi
-    user = HfApi(token=token).whoami()["name"]
-    return f"Connected to Hugging Face as {user}."
-@app.function(
-    image=image,
-    secrets=[modal.Secret.from_name("huggingface")],
-    min_containers=1,
-    timeout=GPU_TIMEOUT_SECONDS,
-)
-@modal.fastapi_endpoint(method="POST", label="trial-stream")
-def trial_stream(payload: dict):
-    from fastapi.responses import StreamingResponse
-    request = TrialRequest.model_validate(payload)
-    delay = {"swift": 0.02, "measured": 0.12, "ceremonial": 0.25}[request.speed]
-    return StreamingResponse(
-        stream_trial_jsonl(request, delay=delay, model_runner=modal_gpu_runner),
-        media_type="application/x-ndjson",
-    )
-@app.local_entrypoint()
-def main():
-    print(check_huggingface_connection.remote())

+import os
+import time
+import modal
+from sovereign_bench.engine import MODEL_BUDGET, stream_trial_jsonl
+from sovereign_bench.llm import (
+    ModelCall,
+    ModelResult,
+    build_role_messages,
+    messages_hash,
+)
+from sovereign_bench.models import TrialRequest
+MODAL_APP_NAME = "sovereign-bench"
+app = modal.App(MODAL_APP_NAME)
+GPU_NAME = "H100"
+GPU_TIMEOUT_SECONDS = 20 * 60
+HF_CACHE_DIR = "/root/.cache/huggingface"
+USED_MODEL_IDS = tuple(dict.fromkeys(model for _, model, _ in MODEL_BUDGET))
+image = (
+    modal.Image.debian_slim(python_version="3.12")
+    .pip_install("fastapi", "huggingface_hub", "httpx", "pydantic")
+    .add_local_dir("sovereign_bench", remote_path="/root/sovereign_bench")
+)
+model_cache = modal.Volume.from_name("sovereign-bench-model-cache", create_if_missing=True)
+vllm_image = (
+    modal.Image.from_registry("nvidia/cuda:12.8.1-devel-ubuntu22.04", add_python="3.12")
+    .entrypoint([])
+    .uv_pip_install(
+        "vllm==0.18.1",
+        "huggingface_hub[hf_transfer]==0.36.0",
+        "transformers",
+        "httpx",
+        "pydantic",
+    )
+    .env(
+        {
+            "HF_HUB_ENABLE_HF_TRANSFER": "1",
+            "HF_HOME": HF_CACHE_DIR,
+            "VLLM_WORKER_MULTIPROC_METHOD": "spawn",
+            "VLLM_USE_FLASHINFER_MOE_MXFP4_MXFP8": "1",
+        }
+    )
+    .add_local_dir("sovereign_bench", remote_path="/root/sovereign_bench")
+)
+@app.cls(
+    image=vllm_image,
+    gpu=GPU_NAME,
+    secrets=[modal.Secret.from_name("huggingface")],
+    volumes={HF_CACHE_DIR: model_cache},
+    timeout=GPU_TIMEOUT_SECONDS,
+    scaledown_window=10 * 60,
+    max_containers=3,
+)
+class VllmModel:
+    model_id: str = modal.parameter()
+    @modal.enter()
+    def load(self) -> None:
+        from vllm import LLM, SamplingParams
+        self.SamplingParams = SamplingParams
+        self.llm = LLM(
+            model=self.model_id,
+            trust_remote_code=True,
+            max_model_len=4096,
+            gpu_memory_utilization=0.9,
+        )
+    @modal.method()
+    def generate(self, payload: dict) -> dict:
+        from sovereign_bench.llm import ModelCallError, clean_model_text
+        started = time.perf_counter()
+        messages = payload["messages"]
+        max_tokens = int(payload.get("max_tokens") or 120)
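+        # Check for None explicitly so a requested temperature of 0.0 is not replaced by the default.
+        raw_temperature = payload.get("temperature")
+        temperature = 0.45 if raw_temperature is None else float(raw_temperature)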
+        sampling_params = self.SamplingParams(
+            max_tokens=max_tokens,
+            temperature=temperature,
+            top_p=0.9,
+        )
+        retry_messages = messages + [
+            {
+                "role": "user",
+                "content": (
+                    "Your previous response did not include visible courtroom dialogue. "
+                    "Return only the final answer now. Do not mention prompts, tasks, requirements, or that you are following instructions. "
+                    "Do not include <think>, analysis, reasoning, markdown, narration, or notes. /no_think"
+                ),
+            }
+        ]
+        last_error: Exception | None = None
+        text = ""
+        for attempt_messages in (messages, retry_messages):
+            outputs = self.llm.chat(
+                [attempt_messages],
+                sampling_params=sampling_params,
+                use_tqdm=False,
+                chat_template_kwargs={"enable_thinking": False},
+            )
+            raw_text = outputs[0].outputs[0].text.strip()
+            try:
+                text = clean_model_text(raw_text)
+                break
+            except ModelCallError as exc:
+                last_error = exc
+        if not text and last_error:
+            raise last_error
+        return {
+            "text": text,
+            "latency_ms": int((time.perf_counter() - started) * 1000),
+        }
+    @modal.method()
+    def warm(self) -> dict:
+        return {"model": self.model_id, "status": "warm"}
+def modal_gpu_enabled() -> bool:
+    return os.getenv("SOVEREIGN_DISABLE_MODAL_GPU", "").lower() not in {"1", "true", "yes"}
+def modal_gpu_runner(**kwargs) -> ModelResult:
+    messages = build_role_messages(
+        agent=kwargs["agent"],
+        role=kwargs["role"],
+        case_summary=kwargs["case_summary"],
+        task=kwargs["task"],
+        evidence_summary=kwargs["evidence_summary"],
+        trial_history=kwargs.get("trial_history", ""),
+        persona=kwargs.get("persona", ""),
+        objective=kwargs.get("objective", ""),
+    )
+    requested_model = kwargs["model"]
+    prompt_hash = messages_hash(messages)
+    if modal_gpu_enabled():
+        output = VllmModel(model_id=requested_model).generate.remote(
+            {
+                "messages": messages,
+                "max_tokens": kwargs.get("max_tokens", 120),
+                "temperature": 0.45,
+            }
+        )
+        return ModelResult(
+            text=output["text"],
+            input_text="\n\n".join(f"{item.get('role', 'user').upper()}:\n{item.get('content', '')}" for item in messages)
+            + "\n\nASSISTANT:\n",
+            call=ModelCall(
+                model=requested_model,
+                provider="modal-gpu-vllm",
+                ok=True,
+                latency_ms=output["latency_ms"],
+                prompt_hash=prompt_hash,
+                requested_model=requested_model,
+                runtime="modal-gpu-vllm",
+                gpu=GPU_NAME,
+            ),
+        )
+    raise RuntimeError("Modal GPU is disabled; no provider fallback is allowed.")
+@app.function(image=image, secrets=[modal.Secret.from_name("huggingface")])
+def check_huggingface_connection() -> str:
+    token = os.getenv("HF_TOKEN")
+    if not token:
+        return "HF_TOKEN is not available inside Modal."
+    from huggingface_hub import HfApi
+    user = HfApi(token=token).whoami()["name"]
+    return f"Connected to Hugging Face as {user}."
+@app.function(
+    image=image,
+    secrets=[modal.Secret.from_name("huggingface")],
+    min_containers=1,
+    timeout=GPU_TIMEOUT_SECONDS,
+)
+@modal.fastapi_endpoint(method="POST", label="trial-stream")
+def trial_stream(payload: dict):
+    from fastapi.responses import StreamingResponse
+    request = TrialRequest.model_validate(payload)
+    delay = {"swift": 0.02, "measured": 0.12, "ceremonial": 0.25}[request.speed]
+    return StreamingResponse(
+        stream_trial_jsonl(request, delay=delay, model_runner=modal_gpu_runner),
+        media_type="application/x-ndjson",
+    )
+@app.local_entrypoint()
+def main():
+    print(check_huggingface_connection.remote())
+@app.local_entrypoint()
+def warm_models():
+    deployed_model = modal.Cls.from_name(MODAL_APP_NAME, "VllmModel")
+    for model_id in USED_MODEL_IDS:
+        model = deployed_model(model_id=model_id)
+        model.update_autoscaler(min_containers=1)
+        print(model.warm.remote())
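
Reviewer sketch (not part of the PR): `USED_MODEL_IDS` uses `dict.fromkeys` to drop duplicate model ids while keeping first-seen order, so `warm_models` would warm each model only once. The shape of `MODEL_BUDGET` is not shown in this diff. The sketch below assumes three-field tuples of (role label, model id, size in billions), filled in with the names and sizes from the README table, plus one repeated entry added purely to show the de-duplication.

```python
# Assumed shape: (role label, model id, size in billions of parameters).
# The real MODEL_BUDGET lives in sovereign_bench/engine.py and may differ.
MODEL_BUDGET = (
    ("Presiding advocate", "openai/gpt-oss-20b", 20),
    ("Clerk of style", "openbmb/AgentCPM-Explore", 4),
    ("Jury ring", "nvidia/Nemotron-Orchestrator-8B", 8),
    # Hypothetical repeated model id, included only to demonstrate de-duplication.
    ("Verdict voice", "openai/gpt-oss-20b", 20),
)

# dict.fromkeys keeps the first occurrence of each key in insertion order.
USED_MODEL_IDS = tuple(dict.fromkeys(model for _, model, _ in MODEL_BUDGET))

# Count each distinct model once when totalling the budget.
sizes_by_model = {model: size for _, model, size in MODEL_BUDGET}
total_billions = sum(sizes_by_model.values())

print(USED_MODEL_IDS)
print(total_billions)
```

Under these assumed sizes the distinct models total 32B, matching the aggregate figure in the README table. A plain `set` would also remove duplicates but would not guarantee a stable warm-up order.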