Modal Deployment: Implementation Report
Date: 2026-05-11
Context: Deploying Future-as-Label GRPO training on Modal
Duration: ~3 hours of iteration (10 failed builds → 1 successful launch)
Root cause: TRL's undeclared optional dependencies + eager import pattern
1. The Errors (in order)
| # | Error | Root Cause |
|---|---|---|
| 1 | accelerate==1.2.1 conflicts with trl==0.24.0 | TRL 0.24.0 requires accelerate>=1.4.0 |
| 2 | transformers==4.47.1 conflicts with trl==0.24.0 | TRL 0.24.0 now requires transformers>=4.56.1 |
| 3 | No module named 'mergekit' | TRL imports mergekit.config.MergeConfiguration unconditionally |
| 4 | No module named 'llm_blender' | TRL imports llm_blender from callbacks.py unconditionally |
| 5 | cannot import name 'TRANSFORMERS_CACHE' | llm_blender uses a removed transformers API |
| 6 | No module named 'weave' | TRL imports weave.EvaluationLogger from callbacks.py |
| 7 | Invalid version: 'N/A' (vllm stub) | Our vllm stub had no __version__, TRL tried to parse it |
| 8 | No module named 'vllm.distributed' | vllm version check passed → TRL tried deep vllm imports |
| 9 | GRPOConfig.__init__() got unexpected keyword argument 'max_prompt_length' | Parameter renamed/removed in TRL 0.28 |
| 10 | generation_batch_size (2) must be divisible by num_generations (16) | TRL 0.28 changed batch/generation semantics |
| 11 | _amp_foreach_non_finite_check_and_unscale_cuda not implemented for BFloat16 | T4 doesn't support bf16, but model weights are bf16 |
2. Root Cause Analysis
Problem A: TRL's Broken Dependency Declarations
TRL 0.24.0 and 0.28.0 both have undeclared hard imports of optional packages. The import chain:
from trl import GRPOTrainer
  → trl/trainer/grpo_trainer.py
    → from .callbacks import SyncRefModelCallback
      → trl/trainer/callbacks.py
        → from .judges import BasePairwiseJudge
          → import llm_blender  ← NOT in requirements
        → from weave import EvaluationLogger  ← NOT in requirements
    → from ..generation.vllm_generation import VLLMGeneration
      → from .vllm_client import VLLMClient
        → from vllm.distributed.device_communicators.pynccl import PyNcclCommunicator  ← DEEP internal
    → from ..mergekit_utils import MergeConfig, merge_models
      → from mergekit.config import MergeConfiguration  ← NOT in requirements
These are NOT guarded by try/except or is_available() checks. They execute unconditionally at import time. This means you cannot import GRPOTrainer without having weave, mergekit, llm_blender, and vllm installed β even though you'll never use any of them.
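For contrast, a guarded optional import could look like the minimal sketch below. It is not TRL's code: the helper name optional_import and the HAS_WEAVE flag are hypothetical, and the sketch only illustrates the try/except pattern that the imports above lack.

```python
import importlib
from types import ModuleType
from typing import Optional


def optional_import(name: str) -> Optional[ModuleType]:
    """Return the imported module, or None if it is not installed."""
    try:
        return importlib.import_module(name)
    except ImportError:
        # ModuleNotFoundError is a subclass of ImportError, so a missing
        # package lands here instead of aborting the importing module.
        return None


# Hypothetical usage: the optional feature is enabled only if the package exists.
weave = optional_import("weave")
HAS_WEAVE = weave is not None
```

With this shape, code that needs the optional package checks the flag at call time, so importing the trainer itself never fails because an unused extra is missing.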
Problem B: PyPI Version Drift
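TRL 0.24.0 installed from PyPI in May 2026 resolved with different dependency requirements than the TRL 0.24.0 setup from January 2025 (when we used it on Vertex AI with Unsloth). Our reading is that the transformers>=4.56.1 requirement was added retroactively. We did not verify this against the published package metadata, and PyPI does not let a released file be overwritten under the same version number, so treat this explanation as a hypothesis rather than a confirmed cause.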
Problem C: vllm Import is Unconditional Despite Version Check
TRL has an is_vllm_available() function that checks the version range, but the actual from ..generation.vllm_generation import VLLMGeneration line in grpo_trainer.py is not guarded by this check. The check only controls whether warnings are shown. The import always executes.
Problem D: TRL 0.28 API Breaking Changes (Undocumented)
- max_prompt_length removed from GRPOConfig (no deprecation warning in 0.28 release notes)
- generation_batch_size constraint added: must be divisible by num_generations
- warmup_ratio deprecated in favor of warmup_steps
- per_device_train_batch_size semantics changed (now represents prompts, not total samples)
3. How We Fixed It
The Solution Stack (final working configuration):
import modal

image = (
    modal.Image.debian_slim(python_version="3.11")
    .pip_install(
        "torch==2.7.0",
        "trl==0.28.0",
        "transformers",  # let pip resolve compatible version
        "accelerate",    # let pip resolve
        "peft",
        "datasets",
        "bitsandbytes",
        "weave",         # install for real (lightweight W&B package)
        "mergekit",      # install for real
        "scipy", "pandas", "numpy<2", "safetensors",
        "huggingface_hub", "hf_transfer",
        extra_index_url="https://download.pytorch.org/whl/cu128",
    )
    .run_commands(
        # llm_blender: install without deps, patch broken transformers import
        "pip install llm-blender --no-deps",
        "sed -i 's/from transformers.utils.hub import TRANSFORMERS_CACHE/TRANSFORMERS_CACHE = None/' .../llm_blender/blender/blender_utils.py",
        # vllm: patch TRL source to remove unconditional import
        "sed -i 's/from ..generation.vllm_generation import VLLMGeneration/VLLMGeneration = None/' .../trl/trainer/grpo_trainer.py",
        "sed -i 's/from .vllm_generation import VLLMGeneration/VLLMGeneration = None/' .../trl/generation/__init__.py",
        "sed -i 's/from .vllm_client import VLLMClient/VLLMClient = None/' .../trl/generation/__init__.py",
        # Validate: the build fails here if any import is still broken
        "python -c 'from trl import GRPOConfig, GRPOTrainer; print(\"OK\")'",
    )
)
Key Decisions:
| Package | Approach | Why |
|---|---|---|
| weave | Install for real | Lightweight, no conflicts |
| mergekit | Install for real | Has complex internals TRL imports |
| llm_blender | Install --no-deps + patch | Its deps conflict with modern transformers |
| vllm | Patch TRL source with sed | We don't use vllm; installing it conflicts with torch |
The GRPOConfig Fix:
from trl import GRPOConfig

# TRL 0.28 constraints:
# - No max_prompt_length parameter
# - generation_batch_size (which equaled per_device_train_batch_size in our run)
#   must be divisible by num_generations
# - fp16=False on T4 (model has bf16 weights, fp16 scaler can't handle them)
# - warmup_ratio deprecated (still works with warning)
GRPOConfig(
    per_device_train_batch_size=2,
    num_generations=2,           # 2 % 2 == 0 ✓
    max_completion_length=48,
    fp16=False,                  # avoid bf16/fp16 scaler conflict on T4
    ...
)
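The divisibility constraint can be checked before any GPU time is spent. The sketch below is a standalone pre-flight check, not TRL's own validation. It assumes the generation batch is the product of the per-device batch size, the number of processes, and the number of steps per generation; that formula and both function names are assumptions for illustration.

```python
def effective_generation_batch(per_device_batch: int,
                               num_processes: int = 1,
                               steps_per_generation: int = 1) -> int:
    """Assumed formula for the number of completions produced per generation round."""
    return per_device_batch * num_processes * steps_per_generation


def check_grpo_batch(per_device_batch: int,
                     num_generations: int,
                     num_processes: int = 1,
                     steps_per_generation: int = 1) -> int:
    """Return the number of distinct prompts per round, or raise if not divisible."""
    total = effective_generation_batch(
        per_device_batch, num_processes, steps_per_generation
    )
    if total % num_generations != 0:
        raise ValueError(
            f"generation batch ({total}) must be divisible by "
            f"num_generations ({num_generations})"
        )
    return total // num_generations
```

Under these assumptions, check_grpo_batch(2, 2) passes with one prompt per round, while check_grpo_batch(2, 16) raises, which mirrors error 10 in the table above.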
4. Approaches That Failed
| Approach | Why It Failed |
|---|---|
| Stub packages (empty __init__.py) | TRL imports specific classes/functions from submodules |
| vllm stub with __version__ = "0.11.0" | Version check passes → TRL tries to import vllm.distributed |
| vllm stub with __version__ = "0.0.1" | TRL still imports vllm unconditionally despite check |
| trl[vllm]==0.28.0 + real vllm | vllm requires torch==2.9.0, conflicts with torch==2.7.0 |
| trl==0.24.0 (original plan) | Requires transformers>=4.56.1 now; broken llm_blender import chain |
| trl[judges]==0.24.0 | Same broken import chain |
| Unsloth integration | Version conflicts with every other package |
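The stub failures in the first rows can be reproduced without TRL. The sketch below registers an empty stand-in module under a hypothetical name (fake_optional_pkg) and tries three import forms. The helper try_import is also hypothetical, and the module is registered through sys.modules rather than an empty __init__.py on disk, which is a simplification.

```python
import sys
import types

# Register an empty stand-in module, similar in effect to an empty stub package.
sys.modules["fake_optional_pkg"] = types.ModuleType("fake_optional_pkg")


def try_import(statement: str) -> str:
    """Run one import statement and report 'ok' or the exception class name."""
    try:
        exec(statement, {})
        return "ok"
    except ImportError as exc:
        return type(exc).__name__


results = {
    # A bare import is satisfied by the stub.
    "bare": try_import("import fake_optional_pkg"),
    # Importing a specific name fails: the stub defines nothing.
    "name": try_import("from fake_optional_pkg import SomeClass"),
    # Importing a submodule fails: the stub is not a package.
    "submodule": try_import("import fake_optional_pkg.distributed"),
}
```

Only the bare import succeeds, so a stub has to reproduce every class and submodule the importing library touches before it is of any use.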
5. Lessons Learned
For Future Modal Deployments
- Never pin TRL tightly without testing the full import. TRL's PyPI metadata drifts, and its import graph is a minefield of undeclared deps. Always include a validation step in the image build: python -c 'from trl import GRPOConfig, GRPOTrainer'.
- Start from Modal's official examples. Their GRPO example (modal-examples/06_gpu_and_ml/reinforcement-learning/grpo_trl.py) uses trl[vllm]==0.28.0 with vllm==0.12.0 + torch==2.9.0. If you need vllm, match those exactly. If you don't need vllm, patch it out.
- sed patches on library source are legitimate. When a library has broken optional imports that can't be resolved through pip, patching the source during image build is the pragmatic solution. Always validate with an import check afterward.
- Don't pin transitive deps. Let pip resolve transformers, accelerate, and peft versions freely; only pin torch and trl. Every pinned version is a potential conflict.
- T4 doesn't support bf16. If the model's weights are bf16, either:
  - Load with torch_dtype=torch.float32 (safe, slightly more memory), or
  - Set fp16=False in TrainingArguments (disable mixed precision entirely).

  Never set fp16=True on T4 with a bf16 model: the grad scaler crashes.
- TRL 0.28 changed per_device_train_batch_size semantics. It now means "prompts per step" and must be ≥ num_generations (or more precisely, divisible by it). With G=16 you need batch≥16, which may not fit. Start with G=2 or G=4.
- Use --detach carefully. Modal's detach mode only keeps the last triggered function alive. If train.remote() finishes and then evaluate.remote() starts, the train container dies. For sequential functions, use modal run without --detach, or use .spawn() for true background work.
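The bf16 lesson above can be turned into a small decision helper. This is a sketch under stated assumptions: native bf16 is treated as available from CUDA compute capability 8.0 upward (a T4 reports 7.5), and the function names and the choice to disable mixed precision entirely on older GPUs are ours, following the fp16=False workaround described above.

```python
from typing import Dict, Tuple


def has_native_bf16(capability: Tuple[int, int]) -> bool:
    """Assume native bf16 needs compute capability 8.0 or newer."""
    return capability >= (8, 0)


def mixed_precision_flags(capability: Tuple[int, int]) -> Dict[str, bool]:
    """Pick bf16/fp16 flags for a model whose weights are stored in bf16."""
    if has_native_bf16(capability):
        return {"bf16": True, "fp16": False}
    # Older GPUs such as the T4: disable mixed precision instead of
    # enabling fp16, whose grad scaler cannot handle bf16 tensors.
    return {"bf16": False, "fp16": False}


# In a real job the tuple would come from torch.cuda.get_device_capability().
t4_flags = mixed_precision_flags((7, 5))
```

The resulting dictionary could be unpacked into the training config, so the same script runs on a T4 and on newer GPUs without manual edits.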
For the ML Workflow
Separate image build from training logic. The image build should be validated independently with import checks before any GPU time is consumed. Every failed runtime import wastes a cold-start + GPU allocation cycle.
G=2 with batch=2 is the minimum viable GRPO config. It works but provides weak signal (70% zero-std batches in our run). For real experiments, G=8+ with batch=8+ is better β but requires more VRAM or a larger GPU.
4-bit quantized 0.5B models are tiny. They fit on any GPU without mixed precision. Don't waste time optimizing fp16/bf16 β just disable it and let the model run in whatever dtype it loads as.
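To see why small groups give a weak signal, consider how group-relative advantages behave when every completion in a group gets the same reward. The sketch below is a simplified illustration, not TRL's implementation: it uses the population standard deviation, omits any epsilon term, and the reward values are made up.

```python
from statistics import pstdev
from typing import List, Sequence


def group_advantages(rewards: Sequence[float]) -> List[float]:
    """Normalize rewards within one group; a zero-std group yields all zeros."""
    mean = sum(rewards) / len(rewards)
    std = pstdev(rewards)
    if std == 0:
        return [0.0] * len(rewards)
    return [(r - mean) / std for r in rewards]


def zero_std_fraction(groups: Sequence[Sequence[float]]) -> float:
    """Fraction of groups in which all rewards are identical."""
    zero = sum(1 for g in groups if pstdev(g) == 0)
    return zero / len(groups)


# Made-up rewards for four prompts with G=2 completions each.
groups = [[0.0, 0.0], [1.0, 1.0], [0.0, 1.0], [0.5, 0.5]]
fraction = zero_std_fraction(groups)
```

A group with identical rewards contributes no gradient signal, and with G=2 the chance that both completions score the same is high, which is consistent with the zero-std rate reported above.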
6. Decisions and Consequences
| Decision | Rationale | Consequence |
|---|---|---|
| Switch from TRL 0.24 to 0.28 | 0.24 had broken import chain with modern transformers | Required learning new GRPOConfig API |
| Drop Unsloth | Version conflicts with every TRL/transformers combo | Slightly slower training (~10%), but eliminates 90% of dep issues |
| Install weave + mergekit for real | Stubs don't work (TRL imports specific classes) | Adds ~200MB to image, but builds reliably |
| Patch vllm out of TRL source | No valid vllm version works without torch conflict | We lose ability to use vllm acceleration (irrelevant for 0.5B) |
| Use T4 instead of L40S | Cost savings (~$0.60/hr vs ~$1.50/hr) | Slower, no bf16, but 0.5B model fits easily |
| Set fp16=False | T4 bf16 incompatibility with model weights | Slightly more memory, but training runs without crashing |
| Set num_generations=2 | TRL 0.28 batch divisibility constraint | Weaker GRPO signal (70% zero-std), slower learning |
| Keep max_completion_length=48 | "PROB: 0.73" fits in ~10 tokens | 85% clipped ratio means model generates preamble; may need 128 if format doesn't converge |
7. What We'd Do Differently Next Time
- Start with Modal's exact example deps (torch 2.9.0 + trl 0.28.0 + vllm 0.12.0), then remove vllm if not needed, instead of building from scratch.
- Use image.run_function() instead of run_commands for complex patching. Modal supports running Python functions during image build, which avoids shell quoting issues.
- Test the image build locally first with modal run --no-remote (if your Modal CLI provides that flag) or a local Docker build, to catch import errors without consuming GPU credits.
- Pin fewer things. Only torch and trl need pinning. Everything else should be >=minimum at most.
- For TRL specifically: file a GitHub issue about the undeclared imports. The correct fix is upstream: wrap each optional import in callbacks.py in a try/except ImportError guard (for example, try to import weave and fall back to weave = None).
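A Python-based patch step of the kind suggested above might look like the sketch below. The function name replace_import_line is hypothetical, and the demo runs on a temporary file rather than on the installed TRL sources. Unlike sed -i, it raises when the pattern is missing, so a silently skipped patch cannot slip through the build.

```python
import tempfile
from pathlib import Path


def replace_import_line(path: Path, old: str, new: str) -> int:
    """Replace every occurrence of `old` in the file; fail loudly if absent."""
    text = path.read_text()
    count = text.count(old)
    if count == 0:
        raise ValueError(f"pattern not found in {path}: {old!r}")
    path.write_text(text.replace(old, new))
    return count


# Demo on a throwaway file standing in for a library module.
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "example_module.py"
    target.write_text("from .vllm_client import VLLMClient\nx = 1\n")
    replaced = replace_import_line(
        target, "from .vllm_client import VLLMClient", "VLLMClient = None"
    )
    patched_text = target.read_text()
```

In an image build, a function like this would be called from a Python function passed to image.run_function(), with the real file paths located at build time.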
8. Final Working State
Image: debian_slim + torch 2.7.0 + trl 0.28.0 + patches
GPU: Tesla T4 (16GB)
Training: 800 steps, 6.25s/step, ~83 min total
Status: Running, reward=0.08 at step 11 (format learning phase)
Cost estimate: ~$0.83 (83 min × $0.60/hr)
The training is now running successfully. First 11 steps show the model producing 85% clipped completions with 8% parse rate β consistent with the "format learning phase" prediction from the ADR. The format bonus should gradually teach the model to output "PROB: X.XX" over the next ~100 steps.
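The duration and cost figures above follow directly from the step count and step time. A quick check, using only the numbers reported in this section:

```python
steps = 800
seconds_per_step = 6.25
hourly_rate_usd = 0.60  # approximate T4 rate quoted above

total_minutes = steps * seconds_per_step / 60
cost_usd = total_minutes / 60 * hourly_rate_usd

print(f"{total_minutes:.1f} min, ${cost_usd:.2f}")
```

The estimate covers training time only; cold starts and image builds are not included.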