
Modal Deployment: Implementation Report

Date: 2026-05-11
Context: Deploying Future-as-Label GRPO training on Modal
Duration: ~3 hours of iteration (10 failed builds → 1 successful launch)
Root cause: TRL's undeclared optional dependencies + eager import pattern


1. The Errors (in order)

| # | Error | Root Cause |
|---|-------|------------|
| 1 | `accelerate==1.2.1` conflicts with `trl==0.24.0` | TRL 0.24.0 requires `accelerate>=1.4.0` |
| 2 | `transformers==4.47.1` conflicts with `trl==0.24.0` | TRL 0.24.0 now requires `transformers>=4.56.1` |
| 3 | `No module named 'mergekit'` | TRL imports `mergekit.config.MergeConfiguration` unconditionally |
| 4 | `No module named 'llm_blender'` | TRL imports `llm_blender` from `callbacks.py` unconditionally |
| 5 | `cannot import name 'TRANSFORMERS_CACHE'` | `llm_blender` uses a removed transformers API |
| 6 | `No module named 'weave'` | TRL imports `weave.EvaluationLogger` from `callbacks.py` |
| 7 | `Invalid version: 'N/A'` (vllm stub) | Our vllm stub had no `__version__`, and TRL tried to parse it |
| 8 | `No module named 'vllm.distributed'` | vllm version check passed → TRL tried deep vllm imports |
| 9 | `GRPOConfig.__init__() got unexpected keyword argument 'max_prompt_length'` | Parameter renamed/removed in TRL 0.28 |
| 10 | `generation_batch_size (2) must be divisible by num_generations (16)` | TRL 0.28 changed batch/generation semantics |
| 11 | `_amp_foreach_non_finite_check_and_unscale_cuda not implemented for BFloat16` | T4 doesn't support bf16, but model weights are bf16 |

2. Root Cause Analysis

Problem A: TRL's Broken Dependency Declarations

TRL 0.24.0 and 0.28.0 both have undeclared hard imports of optional packages. The import chain:

from trl import GRPOTrainer
  → trl/trainer/grpo_trainer.py
    → from .callbacks import SyncRefModelCallback
      → trl/trainer/callbacks.py
        → from .judges import BasePairwiseJudge
          → import llm_blender  ← NOT in requirements
        → from weave import EvaluationLogger  ← NOT in requirements
      → from ..generation.vllm_generation import VLLMGeneration
        → from .vllm_client import VLLMClient
          → from vllm.distributed.device_communicators.pynccl import PyNcclCommunicator  ← DEEP internal
    → from ..mergekit_utils import MergeConfig, merge_models
      → from mergekit.config import MergeConfiguration  ← NOT in requirements

These imports are NOT guarded by try/except or is_available() checks. They execute unconditionally at import time, so you cannot import GRPOTrainer without having weave, mergekit, llm_blender, and vllm installed, even if you never use any of them.
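One way to surface this class of failure before importing TRL at all is a preflight check that lists which of the modules in the chain are absent. The sketch below is a minimal illustration under stated assumptions: the module names are taken from the import chain above, and `missing_modules` is a hypothetical helper, not part of TRL or Modal.

```python
import importlib.util


def missing_modules(names):
    """Return the module names from `names` that cannot be located."""
    missing = []
    for name in names:
        try:
            found = importlib.util.find_spec(name) is not None
        except (ImportError, ValueError):
            # For dotted names find_spec imports the parent package first,
            # so a missing parent raises ModuleNotFoundError instead of
            # returning None.
            found = False
        if not found:
            missing.append(name)
    return missing


# Module names taken from the import chain above.
REQUIRED_BY_TRL_IMPORT = ["weave", "mergekit", "llm_blender", "vllm.distributed"]

if __name__ == "__main__":
    print("missing:", missing_modules(REQUIRED_BY_TRL_IMPORT))
```

Running this as an image build step would report every missing module in one pass, rather than one failed build per module.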

Problem B: PyPI Version Drift

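TRL 0.24.0 as installed from PyPI in May 2026 resolved different dependency requirements than the TRL 0.24.0 we used in January 2025 (on Vertex AI with Unsloth). We attributed this to package metadata being updated without a version bump, with the transformers>=4.56.1 requirement added retroactively, but we did not verify it. PyPI does not allow an uploaded release file to be replaced, so the difference may instead come from how the earlier environment resolved its dependencies.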

Problem C: vllm Import is Unconditional Despite Version Check

TRL has an is_vllm_available() function that checks the version range, but the actual from ..generation.vllm_generation import VLLMGeneration line in grpo_trainer.py is not guarded by this check. The check only controls whether warnings are shown. The import always executes.
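For contrast, a guarded import makes the availability check control the import itself, not just the warnings. The following is a generic sketch of that pattern, not TRL's code: `optional_import` and the package name `no_such_backend_xyz` are hypothetical and exist only for illustration.

```python
import importlib
import importlib.util


def optional_import(module_name, attr):
    """Return `module_name.attr`, or None when the top-level package is absent."""
    top_level = module_name.split(".")[0]
    if importlib.util.find_spec(top_level) is None:
        return None  # package not installed: skip the import entirely
    module = importlib.import_module(module_name)
    return getattr(module, attr)


# Present package: resolves to the real object.
OrderedDict = optional_import("collections", "OrderedDict")

# Absent package (hypothetical name): resolves to None instead of raising.
MissingClient = optional_import("no_such_backend_xyz.generation", "Client")
```

With this shape, code that never touches the optional backend keeps working, and code that does can fail with a clear message at the point of use.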

Problem D: TRL 0.28 API Breaking Changes (Undocumented)

  • max_prompt_length removed from GRPOConfig (no deprecation warning in 0.28 release notes)
  • generation_batch_size constraint added: must be divisible by num_generations
  • warmup_ratio deprecated in favor of warmup_steps
  • per_device_train_batch_size semantics changed (now represents prompts, not total samples)
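A removed keyword argument such as max_prompt_length fails only when the config is constructed, which on Modal means after the container has started. One defensive option is to compare the intended arguments against the constructor signature first. The sketch below uses a hypothetical `ExampleConfig` dataclass as a stand-in (it is not TRL's GRPOConfig), and the value 512 is an illustrative placeholder.

```python
import inspect
from dataclasses import dataclass


@dataclass
class ExampleConfig:
    """Hypothetical stand-in for a trainer config; not TRL's GRPOConfig."""

    per_device_train_batch_size: int = 8
    num_generations: int = 8
    max_completion_length: int = 256


def split_supported_kwargs(config_cls, kwargs):
    """Split kwargs into (accepted, rejected) based on config_cls.__init__."""
    params = inspect.signature(config_cls.__init__).parameters
    accepted = {k: v for k, v in kwargs.items() if k in params}
    rejected = sorted(k for k in kwargs if k not in params)
    return accepted, rejected


# max_prompt_length stands in for a parameter that a newer release dropped.
wanted = {"num_generations": 2, "max_completion_length": 48, "max_prompt_length": 512}
accepted, rejected = split_supported_kwargs(ExampleConfig, wanted)
config = ExampleConfig(**accepted)
print("rejected:", rejected)
```

This only works for constructors with explicit parameters. A class that accepts `**kwargs` would need a different check.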

3. How We Fixed It

The Solution Stack (final working configuration):

import modal

image = (
    modal.Image.debian_slim(python_version="3.11")
    .pip_install(
        "torch==2.7.0",
        "trl==0.28.0",
        "transformers",        # let pip resolve compatible version
        "accelerate",          # let pip resolve
        "peft",
        "datasets",
        "bitsandbytes",
        "weave",               # install for real (lightweight W&B package)
        "mergekit",            # install for real
        "scipy", "pandas", "numpy<2", "safetensors",
        "huggingface_hub", "hf_transfer",
        extra_index_url="https://download.pytorch.org/whl/cu128",
    )
    .run_commands(
        # llm_blender: install without deps, patch broken transformers import
        "pip install llm-blender --no-deps",
        "sed -i 's/from transformers.utils.hub import TRANSFORMERS_CACHE/TRANSFORMERS_CACHE = None/' .../llm_blender/blender/blender_utils.py",
        # vllm: patch TRL source to remove unconditional import
        "sed -i 's/from ..generation.vllm_generation import VLLMGeneration/VLLMGeneration = None/' .../trl/trainer/grpo_trainer.py",
        "sed -i 's/from .vllm_generation import VLLMGeneration/VLLMGeneration = None/' .../trl/generation/__init__.py",
        "sed -i 's/from .vllm_client import VLLMClient/VLLMClient = None/' .../trl/generation/__init__.py",
        # Validate
        "python -c 'from trl import GRPOConfig, GRPOTrainer; print(\"OK\")'",
    )
)

Key Decisions:

| Package | Approach | Why |
|---------|----------|-----|
| `weave` | Install for real | Lightweight, no conflicts |
| `mergekit` | Install for real | Has complex internals TRL imports |
| `llm_blender` | Install `--no-deps` + patch | Its deps conflict with modern transformers |
| `vllm` | Patch TRL source with sed | We don't use vllm; installing it conflicts with torch |
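One caveat with the sed patches is that `sed -i 's/old/new/'` exits successfully even when the pattern matches nothing, so a patch can silently stop applying after a library update. A Python equivalent can fail loudly instead. This is a minimal sketch: `replace_import_line` is a hypothetical helper, and `sample` is an illustrative string, not the real contents of grpo_trainer.py.

```python
def replace_import_line(source, old_line, new_line):
    """Replace every exact occurrence of `old_line`; raise if none is found."""
    count = source.count(old_line)
    if count == 0:
        raise ValueError(f"pattern not found: {old_line!r}")
    return source.replace(old_line, new_line), count


# Illustrative file contents only.
sample = (
    "import torch\n"
    "from ..generation.vllm_generation import VLLMGeneration\n"
    "from .callbacks import SyncRefModelCallback\n"
)

patched, n = replace_import_line(
    sample,
    "from ..generation.vllm_generation import VLLMGeneration",
    "VLLMGeneration = None",
)
print(f"replaced {n} line(s)")
```

Applied to a real file, the same function would wrap a read and a write of the target path. The build then stops at the patch step if the upstream line has changed, instead of failing later at import time.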

The GRPOConfig Fix:

# TRL 0.28 constraints:
# - No max_prompt_length parameter
# - per_device_train_batch_size must be divisible by num_generations
# - fp16=False on T4 (model has bf16 weights, fp16 scaler can't handle them)
# - warmup_ratio deprecated (still works with warning)

GRPOConfig(
    per_device_train_batch_size=2,
    num_generations=2,          # 2 % 2 == 0 ✓
    max_completion_length=48,
    fp16=False,                 # avoid bf16/fp16 scaler conflict on T4
    ...
)
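The divisibility rule from error 10 can be checked locally before launching a run. The sketch below takes generation_batch_size as a given input; how TRL derives it from the other batch settings is not reproduced here. Both helper names are hypothetical, and the minimum of two generations per prompt is an assumption.

```python
def check_grpo_batch(generation_batch_size, num_generations):
    """Raise ValueError when the batch cannot be split into whole groups."""
    if generation_batch_size % num_generations != 0:
        raise ValueError(
            f"generation_batch_size ({generation_batch_size}) must be "
            f"divisible by num_generations ({num_generations})"
        )


def valid_num_generations(generation_batch_size, minimum=2):
    """List the group sizes that divide the batch evenly."""
    return [
        g
        for g in range(minimum, generation_batch_size + 1)
        if generation_batch_size % g == 0
    ]


print(valid_num_generations(2))   # [2]
print(valid_num_generations(16))  # [2, 4, 8, 16]
```

With a generation batch of 2, the only valid group size is 2, which is why the configuration above uses num_generations=2.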

4. Approaches That Failed

| Approach | Why It Failed |
|----------|---------------|
| Stub packages (empty `__init__.py`) | TRL imports specific classes/functions from submodules |
| vllm stub with `__version__ = "0.11.0"` | Version check passes → TRL tries to import `vllm.distributed` |
| vllm stub with `__version__ = "0.0.1"` | TRL still imports vllm unconditionally despite check |
| `trl[vllm]==0.28.0` + real vllm | vllm requires `torch==2.9.0`, conflicts with `torch==2.7.0` |
| `trl==0.24.0` (original plan) | Requires `transformers>=4.56.1` now; broken llm_blender import chain |
| `trl[judges]==0.24.0` | Same broken import chain |
| Unsloth integration | Version conflicts with every other package |
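The stub failures follow from how Python resolves imports: an empty module satisfies a plain `import pkg`, but not `from pkg import Name` or `import pkg.sub`. The sketch below reproduces that with an in-memory stub. The package name `optional_dep_stub_demo` is hypothetical, and the imported names only mirror the shape of the imports TRL performs.

```python
import sys
import types

# Hypothetical package name, registered as an empty stub for this demo only.
sys.modules["optional_dep_stub_demo"] = types.ModuleType("optional_dep_stub_demo")

results = {}

try:
    import optional_dep_stub_demo  # satisfied by the stub

    results["import pkg"] = "ok"
except ImportError as exc:
    results["import pkg"] = type(exc).__name__

try:
    from optional_dep_stub_demo import EvaluationLogger  # name is not defined

    results["from pkg import name"] = "ok"
except ImportError as exc:
    results["from pkg import name"] = type(exc).__name__

try:
    import optional_dep_stub_demo.distributed  # stub is not a package

    results["import pkg.sub"] = "ok"
except ImportError as exc:
    results["import pkg.sub"] = type(exc).__name__

for statement, outcome in results.items():
    print(f"{statement}: {outcome}")
```

A stub would have to define every class and submodule that the importing library names, which is why installing the real package or patching the import line was less work here.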

5. Lessons Learned

For Future Modal Deployments

  1. Never pin TRL tightly without testing the full import. TRL's PyPI metadata drifts, and its import graph is a minefield of undeclared deps. Always include a validation step: python -c 'from trl import GRPOConfig, GRPOTrainer' in the image build.

  2. Start from Modal's official examples. Their GRPO example (modal-examples/06_gpu_and_ml/reinforcement-learning/grpo_trl.py) uses trl[vllm]==0.28.0 with vllm==0.12.0 + torch==2.9.0. If you need vllm, match those exactly. If you don't need vllm, patch it out.

  3. sed patches on library source are legitimate. When a library has broken optional imports that can't be resolved through pip, patching the source during image build is the pragmatic solution. Always validate with an import check afterward.

  4. Don't pin transitive deps. Let pip resolve transformers, accelerate, peft versions freely; only pin torch and trl. Every pinned version is a potential conflict.

  5. T4 doesn't support bf16. If the model's weights are bf16, either:

    • Load with torch_dtype=torch.float32 (safe, slightly more memory)
    • Set fp16=False in TrainingArguments (disable mixed precision entirely)
    • Never set fp16=True on T4 with a bf16 model: the grad scaler crashes
  6. TRL 0.28 changed per_device_train_batch_size semantics. It now means "prompts per step" and must be divisible by num_generations (and therefore ≥ num_generations). With G=16 you need batch≥16, which may not fit. Start with G=2 or G=4.

  7. Use --detach carefully. Modal's detach mode only keeps the last triggered function alive. If train.remote() finishes and then evaluate.remote() starts, the train container dies. For sequential functions, use modal run without --detach, or use .spawn() for true background work.
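The T4 rule in item 5 can be turned into a small decision function keyed on CUDA compute capability. Native bf16 arrives with compute capability 8.0; the T4 is 7.5 and the L40S is 8.9. The sketch below is an illustration under assumptions: the function names are hypothetical, and the returned dictionary only mimics TrainingArguments-style flags. In a real script, the capability tuple could come from torch.cuda.get_device_capability().

```python
def supports_native_bf16(compute_capability):
    """True when the GPU generation has native bf16 support (8.0 and newer)."""
    major, _minor = compute_capability
    return major >= 8


def pick_mixed_precision(compute_capability):
    """Choose bf16/fp16 flags for a model whose weights are stored in bf16."""
    if supports_native_bf16(compute_capability):
        return {"bf16": True, "fp16": False}
    # Older GPUs: disable mixed precision rather than combine the fp16
    # grad scaler with bf16 weights.
    return {"bf16": False, "fp16": False}


KNOWN_GPUS = {"T4": (7, 5), "L40S": (8, 9)}

for name, capability in KNOWN_GPUS.items():
    print(name, pick_mixed_precision(capability))
```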

For the ML Workflow

  1. Separate image build from training logic. The image build should be validated independently with import checks before any GPU time is consumed. Every failed runtime import wastes a cold-start + GPU allocation cycle.

  2. G=2 with batch=2 is the minimum viable GRPO config. It works but provides weak signal (70% zero-std batches in our run). For real experiments, G=8+ with batch=8+ is better, but it requires more VRAM or a larger GPU.

  3. 4-bit quantized 0.5B models are tiny. They fit on any GPU without mixed precision. Don't waste time optimizing fp16/bf16. Just disable it and let the model run in whatever dtype it loads as.
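The "zero-std" problem in item 2 comes from group-relative advantages: when every completion in a group gets the same reward, each advantage is zero and the group contributes no gradient. The sketch below uses the common formulation (reward minus group mean, divided by group standard deviation plus a small epsilon). The choice of population standard deviation and the epsilon value are assumptions, not TRL's exact implementation. `prob_all_equal` further assumes independent binary rewards, which is a simplification.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-4):
    """Normalize rewards within one group of completions for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def zero_std_fraction(groups):
    """Fraction of groups whose rewards are all identical."""
    return sum(1 for g in groups if pstdev(g) == 0) / len(groups)


def prob_all_equal(p_success, group_size):
    """Chance that a group of independent 0/1 rewards is all 0 or all 1."""
    return p_success**group_size + (1 - p_success) ** group_size


print(group_advantages([1.0, 1.0]))   # identical rewards: no learning signal
print(prob_all_equal(0.5, 2), prob_all_equal(0.5, 8))
```

Under the binary-reward assumption, larger groups make an all-equal outcome much less likely, which is the reason for preferring G=8+ when memory allows.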


6. Decisions and Consequences

| Decision | Rationale | Consequence |
|----------|-----------|-------------|
| Switch from TRL 0.24 to 0.28 | 0.24 had broken import chain with modern transformers | Required learning new GRPOConfig API |
| Drop Unsloth | Version conflicts with every TRL/transformers combo | Slightly slower training (~10%), but eliminates 90% of dep issues |
| Install weave + mergekit for real | Stubs don't work (TRL imports specific classes) | Adds ~200MB to image, but builds reliably |
| Patch vllm out of TRL source | No valid vllm version works without torch conflict | We lose ability to use vllm acceleration (irrelevant for 0.5B) |
| Use T4 instead of L40S | Cost savings (~$0.60/hr vs ~$1.50/hr) | Slower, no bf16, but 0.5B model fits easily |
| Set `fp16=False` | T4 bf16 incompatibility with model weights | Slightly more memory, but training runs without crashing |
| Set `num_generations=2` | TRL 0.28 batch divisibility constraint | Weaker GRPO signal (70% zero-std), slower learning |
| Keep `max_completion_length=48` | "PROB: 0.73" fits in ~10 tokens | 85% clipped ratio means model generates preamble; may need 128 if format doesn't converge |

7. What We'd Do Differently Next Time

  1. Start with Modal's exact example deps (torch 2.9.0 + trl 0.28.0 + vllm 0.12.0), then remove vllm if not needed, instead of building from scratch.

  2. Use image.run_function() instead of run_commands for complex patching. Modal supports running Python functions during image build, which avoids shell quoting issues.

  3. Test the image build on its own first, with a local Docker build or with modal run --no-remote if your Modal CLI provides that flag (we have not confirmed that it does), to catch import errors without consuming GPU credits.

  4. Pin fewer things. Only torch and trl need pinning. Everything else should be >=minimum at most.

  5. For TRL specifically: file a GitHub issue about the undeclared imports. The correct fix is upstream: try: import weave; except ImportError: weave = None in their callbacks.py.


8. Final Working State

Image: debian_slim + torch 2.7.0 + trl 0.28.0 + patches
GPU: Tesla T4 (16GB)
Training: 800 steps, 6.25s/step, ~83 min total
Status: Running, reward=0.08 at step 11 (format learning phase)
Cost estimate: ~$0.83 (83 min × $0.60/hr)
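The duration and cost figures above follow directly from the step count, the step time, and the hourly rate. A short sketch of the arithmetic, using a hypothetical `estimate_run` helper:

```python
def estimate_run(steps, seconds_per_step, dollars_per_hour):
    """Return (minutes, cost in dollars) for a fixed-length training run."""
    minutes = steps * seconds_per_step / 60
    cost = minutes / 60 * dollars_per_hour
    return minutes, cost


minutes, cost = estimate_run(steps=800, seconds_per_step=6.25, dollars_per_hour=0.60)
print(f"{minutes:.0f} min, ${cost:.2f}")  # 83 min, $0.83
```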

The training is now running successfully. The first 11 steps show the model producing 85% clipped completions with an 8% parse rate, consistent with the "format learning phase" prediction from the ADR. The format bonus should gradually teach the model to output "PROB: X.XX" over the next ~100 steps.
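For readers unfamiliar with the parse-rate metric, the sketch below shows one way a "PROB: X.XX" answer could be extracted and the rate computed over a batch of completions. The regular expression and both function names are assumptions made for illustration; the project's actual reward and parsing code is not shown in this report.

```python
import re

# Matches "PROB:" followed by a number starting with 0 or 1, e.g. "PROB: 0.73".
PROB_PATTERN = re.compile(r"PROB:\s*([01](?:\.\d+)?)(?!\d)")


def parse_prob(completion):
    """Return the probability as a float in [0, 1], or None if absent."""
    match = PROB_PATTERN.search(completion)
    if match is None:
        return None
    value = float(match.group(1))
    return value if 0.0 <= value <= 1.0 else None


def parse_rate(completions):
    """Fraction of completions containing a parseable probability."""
    if not completions:
        return 0.0
    parsed = sum(1 for c in completions if parse_prob(c) is not None)
    return parsed / len(completions)


samples = ["PROB: 0.73", "Let me think about this", "PROB: 1.5", "The answer is"]
print(parse_rate(samples))  # 0.25
```

A completion that is cut off at max_completion_length before reaching the "PROB:" line counts as unparsed, which is how a high clipped ratio and a low parse rate go together.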