AWS SageMaker Quickstart — the runnable-now GRPO smoke
The minimum path to running the Composer-replication RL inner loop on a real
GPU, end-to-end, for under $1. Implements F3 (research/design-F3-rl-sagemaker.md).
Live account facts (verified 2026-06-09, acct 386931836011, us-west-2)
| Fact | Value |
|---|---|
| ml.g5.2xlarge training-job quota | 1 (live, code L-2D6DEB3C) → no quota ticket needed |
| Execution role | arn:aws:iam::386931836011:role/service-role/AmazonSageMaker-ExecutionRole-20250725T133247 |
| Bucket (rendezvous + output) | amazon-sagemaker-386931836011-us-west-2-7597bf4d9a3d |
| PyTorch DLC base image | 763104351884.dkr.ecr.us-west-2.amazonaws.com/pytorch-training:2.7.1-gpu-py312-cu128-ubuntu22.04-sagemaker-v1.26 |

The base image MUST be the torch-2.7 DLC, as learned from two live runs (2026-06-09). The dependency chain forces it:
ComposerReplicationTrainer → trl 1.5.x → transformers>=4.56.2 → torch.float8_e8m0fnu (MXFP4, torch>=2.7). The torch-2.6 DLC fails AutoModel.from_pretrained with AttributeError: module 'torch' has no attribute 'float8_e8m0fnu', and pinning transformers down is impossible (trl 1.5's floor is 4.56.2).

Resolve the tag against the live registry: the AWS docs page lists wrong or stale tags (it showed a cu124 2.6 tag that doesn't exist; real tags are cu126 for 2.6 and cu128 for 2.7, each with a mandatory -vX.Y build suffix, and there is no bare floating tag):

aws ecr describe-images --registry-id 763104351884 \
  --repository-name pytorch-training --region us-west-2 \
  --query "reverse(sort_by(imageDetails,&imagePushedAt))[].imageTags" --output text \
  | tr '\t' '\n' | grep -E '^2\.7\.[0-9]+-gpu-py312-cu128-.*-sagemaker-v[0-9.]+$' | head -1

vLLM is OFF by default in the smoke: vllm==0.8.5 hard-pins torch==2.6.0, which fights the torch-2.7 base. The smoke uses model.generate rollout (it proves trainer-on-GPU + reward, not rollout speed). For colocated vLLM, bake a torch-2.7-matched vllm>=0.9 into an image and pass --image <ecr> --vllm.
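
The tag filter in the registry query above can also be expressed in Python, which may be handy if the launcher resolves the image itself. This is a minimal sketch, not code from the repository: `pick_dlc_tag` is a hypothetical helper, and it assumes the caller already has the tag list sorted newest-first (as the `--query` expression produces). The second and third sample tags are made up for illustration.

```python
import re

# Same pattern as the grep filter: torch 2.7.x, GPU, py312, cu128,
# and a mandatory -sagemaker-vX.Y build suffix.
TAG_PATTERN = re.compile(r"^2\.7\.[0-9]+-gpu-py312-cu128-.*-sagemaker-v[0-9.]+$")


def pick_dlc_tag(tags_newest_first):
    """Return the first (newest) tag matching the torch-2.7 cu128 pattern, or None."""
    for tag in tags_newest_first:
        if TAG_PATTERN.match(tag):
            return tag
    return None


if __name__ == "__main__":
    sample_tags = [
        "2.7.1-gpu-py312-cu128-ubuntu22.04-sagemaker-v1.26",  # tag quoted in the table above
        "2.7.1-cpu-py312-ubuntu22.04-sagemaker-v1.26",        # illustrative: CPU image, rejected
        "2.7.1-gpu-py312-cu128-ubuntu22.04-sagemaker",        # illustrative: no build suffix, rejected
    ]
    print(pick_dlc_tag(sample_tags))
```

Returning None rather than raising leaves it to the caller to decide whether a missing tag is fatal.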
SDK pin: the smoke launcher uses the sagemaker SDK v2 Estimator API. SDK v3 is an API rewrite that dropped
sagemaker.estimator.Estimator and sagemaker.pytorch, so install with pip install 'sagemaker>=2.200,<3'.
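
A launcher can fail fast when the wrong SDK major version is installed. The sketch below is one way to check the `>=2.200,<3` constraint before importing anything from `sagemaker`; the helper names are hypothetical and it only looks at the leading major.minor digits of the version string.

```python
import re
from importlib import metadata


def is_supported_sdk(version):
    """True if the version's major.minor satisfies >=2.200,<3."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        return False
    major_minor = (int(match.group(1)), int(match.group(2)))
    return (2, 200) <= major_minor < (3, 0)


def check_installed_sdk():
    """Return the installed sagemaker version, or exit with an install hint."""
    hint = "pip install 'sagemaker>=2.200,<3'"
    try:
        installed = metadata.version("sagemaker")
    except metadata.PackageNotFoundError:
        raise SystemExit(f"sagemaker is not installed; run: {hint}")
    if not is_supported_sdk(installed):
        raise SystemExit(f"sagemaker {installed} is unsupported; run: {hint}")
    return installed
```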
Run it (no local Docker build)
pip install 'sagemaker>=2.200,<3'
export AWS_REGION=us-west-2
python examples/gsm8k_grpo/run_sagemaker_launch.py --max-steps 20
This uses the stock PyTorch DLC directly as the training image and ships the
framework + entry script via source_dir; examples/gsm8k_grpo/requirements.txt
(trl + the RL stack) installs at job start. No 15 GB local image build, no ECR
push. The script trains Qwen/Qwen2.5-0.5B-Instruct with GRPO and the
GSM8K "#### NUMBER" RLVR reward, using model.generate rollout (vLLM is off
by default; see the torch-pin note above).
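
To make the "#### NUMBER" reward concrete, here is a minimal sketch of a verifiable reward of that shape. It is an assumption about how such a reward can be written, not the repository's implementation: `extract_final_number` and `gsm8k_reward` are hypothetical names, and the scoring rule (1.0 on exact numeric match, else 0.0) is the simplest possible choice.

```python
import re

# Matches the number following a "####" marker, allowing a sign,
# thousands separators, and a decimal part.
_FINAL_NUMBER = re.compile(r"####\s*(-?[\d,]*\.?\d+)")


def extract_final_number(text):
    """Return the last '#### NUMBER' value in text as a string, or None."""
    matches = _FINAL_NUMBER.findall(text)
    if not matches:
        return None
    return matches[-1].replace(",", "")


def gsm8k_reward(completion, reference):
    """1.0 if the completion's final number equals the reference's, else 0.0."""
    predicted = extract_final_number(completion)
    target = extract_final_number(reference)
    if predicted is None or target is None:
        return 0.0
    return 1.0 if float(predicted) == float(target) else 0.0


if __name__ == "__main__":
    reference = "48 / 2 = 24, and 48 + 24 = 72. #### 72"
    print(gsm8k_reward("She sold 72 in total. #### 72", reference))
    print(gsm8k_reward("I think the answer is 70.", reference))
```

A completion that never emits the marker scores 0.0, so the reward also pushes the policy toward the expected answer format.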
Flags: --no-wait (submit + poll later), --spot (managed spot, quota=1 too),
--vllm (enable colocated vLLM — only with a baked --image carrying a
torch-2.7 vllm), --image <ecr-uri> (use a prebuilt baked image instead of the DLC).
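
The flags above plausibly map onto SDK v2 Estimator arguments roughly as sketched here. This is an illustration under stated assumptions, not the launcher's actual code: `build_estimator_kwargs` and `launch` are hypothetical helpers, and the `source_dir`, `entry_point`, and `max_run` values are placeholders. The Estimator parameters themselves (`image_uri`, `use_spot_instances`, `max_wait`, `enable_network_isolation`) and `fit(wait=...)` are part of the sagemaker v2 API.

```python
def build_estimator_kwargs(image_uri, role_arn, output_s3_uri,
                           spot=False, max_run_seconds=3600):
    """Assemble Estimator keyword arguments for a single-GPU smoke job."""
    kwargs = {
        "image_uri": image_uri,             # stock DLC, or the baked image for --image
        "role": role_arn,
        "instance_count": 1,
        "instance_type": "ml.g5.2xlarge",
        "source_dir": ".",                  # placeholder: directory shipped to the job
        "entry_point": "train.py",          # placeholder: hypothetical entry script name
        "output_path": output_s3_uri,
        "max_run": max_run_seconds,
        "enable_network_isolation": False,  # container must reach huggingface.co and S3
    }
    if spot:  # --spot
        kwargs["use_spot_instances"] = True
        kwargs["max_wait"] = max_run_seconds  # must be >= max_run for spot jobs
    return kwargs


def launch(kwargs, wait=True):
    """Submit the job; wait=False corresponds to --no-wait."""
    from sagemaker.estimator import Estimator  # SDK v2 only
    estimator = Estimator(**kwargs)
    estimator.fit(wait=wait)
    return estimator


if __name__ == "__main__":
    # Placeholder arguments; substitute the image, role, and bucket for your account.
    print(build_estimator_kwargs("<image-uri>", "<role-arn>", "s3://<bucket>/output", spot=True))
```

Keeping the argument assembly separate from the submission makes the flag handling testable without AWS credentials.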
Cost: ml.g5.2xlarge ≈ $1.52/hr on-demand; a 20-step 0.5B smoke is
~15–25 min ⇒ well under $1. Spot ≈ $0.45–0.60/hr ⇒ pennies.
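
The cost estimate is hourly rate times runtime. A small worked calculation using only the rates and durations quoted above gives roughly $0.38 to $0.63 on-demand and roughly 11 to 25 cents on spot:

```python
def job_cost_usd(rate_per_hour, minutes):
    """Cost of a job billed at rate_per_hour that runs for the given minutes."""
    return rate_per_hour * minutes / 60.0


# On-demand: $1.52/hr for a 15 to 25 minute smoke.
on_demand_low = job_cost_usd(1.52, 15)
on_demand_high = job_cost_usd(1.52, 25)

# Spot: cheapest case is the low rate with the short run,
# most expensive case is the high rate with the long run.
spot_low = job_cost_usd(0.45, 15)
spot_high = job_cost_usd(0.60, 25)

print(f"on-demand: ${on_demand_low:.3f} to ${on_demand_high:.3f}")
print(f"spot:      ${spot_low:.3f} to ${spot_high:.3f}")
```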
The repeatable path (baked image)
For runs where the ~5–10 min per-job pip-install is unwanted (and for the
DiLoCo N-replica SageMakerExecutor, which passes ContainerEntrypoint and
needs the framework baked in), build the image once:
scripts/build_and_push_ecr.sh # creates ECR repo, builds, pushes composer-rl:smoke
python examples/gsm8k_grpo/run_sagemaker_launch.py \
--image 386931836011.dkr.ecr.us-west-2.amazonaws.com/composer-rl:smoke
On an Apple-Silicon host the build cross-compiles (--platform linux/amd64) the
~15 GB GPU DLC under emulation — slow; prefer a linux/amd64 host or CodeBuild.
Gotchas (load-bearing)
- EnableNetworkIsolation stays False (the default) so the container can reach huggingface.co (model + GSM8K download) and S3.
- vllm_gpu_memory_utilization=0.3 is the load-bearing knob on a 24 GB A10G: too high → OOM when the policy + grads also need the GPU; too low → tiny KV cache. Use --no-vllm if a vLLM wheel/CUDA mismatch surfaces.
- Warm pools are off: the g5 training warm pool usage quota is 0 in this account, so each job pays ~3–6 min cold-start. Request a warm-pool quota bump for iterative dev, or move the long inner loop to HyperPod (persistent).
- "Waiting for capacity" in SecondaryStatus is transient g5 capacity contention in the region, not an error; the job proceeds when capacity frees.
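
When a job is submitted with --no-wait, its status can be polled later. The sketch below is one way to do that, assuming a boto3 SageMaker client is passed in. `wait_for_job` is a hypothetical helper; `describe_training_job` and its `TrainingJobStatus` and `SecondaryStatus` fields are part of the SageMaker API. It treats only the terminal `TrainingJobStatus` values as a reason to stop, so a capacity wait just shows up as another printed line.

```python
import time

# Terminal values of TrainingJobStatus; anything else means keep polling.
TERMINAL_STATUSES = {"Completed", "Failed", "Stopped"}


def wait_for_job(sm_client, job_name, poll_seconds=30, sleep=time.sleep):
    """Poll a training job until it reaches a terminal status and return it."""
    while True:
        desc = sm_client.describe_training_job(TrainingJobName=job_name)
        status = desc["TrainingJobStatus"]
        secondary = desc.get("SecondaryStatus", "")
        # Print whatever the API reports; non-terminal states are not errors.
        print(f"{job_name}: {status} / {secondary}")
        if status in TERMINAL_STATUSES:
            return status
        sleep(poll_seconds)
```

Typical use would be `wait_for_job(boto3.client("sagemaker", region_name="us-west-2"), "<job-name>")`, with the job name taken from the launcher's output.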
Next: DiLoCo N-replica (the SageMakerExecutor path)
examples/diloco_sagemaker/run.py (driver, F3 §4.3) drives N independent
single-instance Training Jobs sharing one s3://.../rendezvous/ prefix via
ObjectStoreAllReduce, with no cross-job NCCL. N=1 runs today; N=2–4 needs an increase to the
"ml.g5.2xlarge for training job usage" quota. The DiLoCo math, loss,
trainer, and ObjectStoreAllReduce are unchanged from the smoke — the S3
rendezvous is the entire portability contract (validated file:// and live
s3://; see test_serverless_local.py::test_s3_rendezvous_allreduce_across_replicas).
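
The rendezvous idea can be illustrated with a toy all-reduce over a shared directory: each replica publishes its vector, then every replica waits until all N are present and averages them. This is a hypothetical sketch of the pattern only, not the repository's ObjectStoreAllReduce; the function names, the JSON file layout, and the `round-<id>/replica-<id>.json` naming are all assumptions, and a local directory stands in for the file:// or s3:// prefix.

```python
import json
import tempfile
import time
from pathlib import Path


def publish(rendezvous, round_id, replica_id, values):
    """Write this replica's vector under the shared prefix for the given round."""
    round_dir = Path(rendezvous) / f"round-{round_id}"
    round_dir.mkdir(parents=True, exist_ok=True)
    tmp = round_dir / f"replica-{replica_id}.json.tmp"
    tmp.write_text(json.dumps(values))
    # Rename so readers never observe a half-written file.
    tmp.replace(round_dir / f"replica-{replica_id}.json")


def all_reduce_mean(rendezvous, round_id, world_size, timeout_s=60.0, poll_s=0.1):
    """Wait until world_size replicas have published, then return the elementwise mean."""
    round_dir = Path(rendezvous) / f"round-{round_id}"
    deadline = time.monotonic() + timeout_s
    while True:
        files = sorted(round_dir.glob("replica-*.json"))
        if len(files) >= world_size:
            break
        if time.monotonic() > deadline:
            raise TimeoutError(f"only {len(files)}/{world_size} replicas published")
        time.sleep(poll_s)
    vectors = [json.loads(f.read_text()) for f in files]
    return [sum(column) / len(vectors) for column in zip(*vectors)]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as shared:
        publish(shared, 0, 0, [1.0, 2.0])
        publish(shared, 0, 1, [3.0, 4.0])
        print(all_reduce_mean(shared, 0, world_size=2))
```

Because replicas only read and write objects under one prefix, nothing in the pattern depends on the jobs being able to reach each other over the network.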