
Operationalizing Headless Large Language Model Fine-Tuning on RunPod: A Comprehensive Infrastructure and Workflow Analysis

1. Introduction: The Paradigm Shift to Headless AI Operations

The contemporary landscape of Large Language Model (LLM) development is currently undergoing a fundamental transition, shifting from interactive, exploratory environments toward rigorous, automated production pipelines. For years, the Jupyter notebook has served as the lingua franca of data science—a canvas for experimentation, visualization, and iterative code development. However, as the field matures from research to engineering, the limitations of the notebook paradigm become increasingly acute, particularly when applied to the resource-intensive and time-critical task of fine-tuning custom LLMs. The user requirement for a workflow that eliminates the notebook interface in favor of a "code-upload-and-train" paradigm reflects a sophisticated understanding of MLOps principles: reproducibility, resource efficiency, and maximizing computational throughput.
RunPod, as a specialized GPU cloud provider, occupies a unique and critical niche within this evolving ecosystem. Distinct from hyperscalers such as AWS, Azure, or Google Cloud Platform—which often necessitate complex Identity and Access Management (IAM) configurations, quota negotiations, and long-term commitments—RunPod offers a container-centric infrastructure that is ideally optimized for headless, ephemeral training jobs. The platform’s architecture effectively democratizes access to high-performance compute, offering everything from consumer-grade RTX 4090s to enterprise-class NVIDIA H100 clusters. This report provides an exhaustive, expert-level analysis of the architectural, operational, and software strategies necessary to fine-tune custom LLMs on RunPod using a strictly headless approach.
To fully satisfy the requirement of "training as fast as possible" with "custom training data," this analysis moves beyond simple tutorials to construct a robust engineering framework. It dissects the interplay between hardware selection (Secure vs. Community Cloud), containerization strategies (Docker-based execution), and high-efficiency fine-tuning frameworks (Unsloth and Axolotl). By decoupling the training process from an interactive Integrated Development Environment (IDE), developers can leverage spot instances more effectively, dramatically reduce idle compute costs, and integrate training runs into broader Continuous Integration/Continuous Deployment (CI/CD) pipelines. This report serves as a definitive guide to architecting these headless systems.

---

2. Infrastructure Architecture and Instance Selection Strategy

The foundation of any high-performance fine-tuning workflow is the underlying compute architecture. In the context of "training as fast as possible," the choice of hardware dictates not only the wall-clock time of the training run but also the stability, cost-efficiency, and maximum capable model size of the session. RunPod’s inventory is segmented into distinct tiers, each offering specific advantages and liabilities for headless operations. A nuanced understanding of these hardware profiles is essential for optimizing the price-performance ratio.

2.1 The GPU Hierarchy: Performance Profiles and Architectural Suitability

The selection of a specific GPU architecture must be directly correlated with the parameter count of the target model (e.g., Llama 3 8B, Mistral, or Llama 3 70B) and the chosen quantization method (Full Fine-Tuning vs. LoRA/QLoRA).

The Enterprise Tier: NVIDIA H100 and A100

For users prioritizing raw speed and throughput above all else, the NVIDIA H100 and A100 Tensor Core GPUs represent the gold standard of current AI acceleration. These cards are designed for datacenter reliability and massive parallel throughput.
The NVIDIA H100 (80GB) stands as the pinnacle of current commercial AI hardware. It is specifically engineered to accelerate Transformer-based models via its fourth-generation Tensor Cores and the dedicated Transformer Engine, which automatically manages mixed-precision calculations using FP8 formats.1 For headless workflows, the H100 offers a distinct advantage: its sheer speed minimizes the "window of vulnerability." In a headless setup, particularly one utilizing spot instances or decentralized nodes, the longer a job runs, the higher the statistical probability of a network disconnect or node preemption. By reducing training time by a factor of 3x or more compared to previous generations, the H100 significantly increases the reliability of job completion.2 It is the only viable option for users attempting to fully fine-tune models in the 70B+ parameter range within reasonable timeframes. However, this performance comes at a premium, with costs ranging from approximately $2.69 to $4.00 per hour depending on the specific configuration (SXM vs. PCIe) and market demand.1
The NVIDIA A100 (80GB) remains the industry workhorse for LLM training. While it lacks the H100's specific FP8 Transformer Engine, its 80GB of High Bandwidth Memory (HBM2e) provides sufficient capacity to fine-tune 70B models using QLoRA or 8B models with full precision and extended context windows.1 The availability of A100s on RunPod is generally higher than that of H100s, making them a more reliable fallback for automated pipelines that require immediate provisioning without queuing. For users engaging in "headless" operations where the script automatically requests resources, the A100's ubiquity often makes it the path of least resistance.4

The Prosumer Tier: NVIDIA RTX 4090 and RTX 6000 Ada

For users targeting smaller models, such as the 7B or 8B parameter classes (e.g., Llama 3 8B, Mistral, Gemma), the enterprise tier may represent overkill. The NVIDIA RTX 4090 has emerged as an exceptionally cost-effective alternative for these specific workloads.
With 24GB of VRAM, the RTX 4090 can comfortably handle 8B models using 4-bit quantization (QLoRA) or, when paired with memory-efficient frameworks like Unsloth, even larger batch sizes.5 The cost efficiency is dramatic: at approximately $0.34 to $0.69 per hour, a developer can run extensive hyperparameter sweeps (grid searches) for the cost of a single hour on an H100.1 However, the use of consumer hardware in a headless workflow introduces specific constraints. These cards are typically hosted in the "Community Cloud" tier, meaning they are decentralized nodes often residing in non-tier-1 datacenters or even private residences. This introduces a higher risk of interruption, necessitating that the headless script implements robust, frequent checkpointing to resume training automatically if a node goes offline.
The RTX 6000 Ada Generation bridges the gap, offering 48GB of VRAM—double that of the 4090—while retaining the Ada Lovelace architecture's efficiency. Priced around $0.79/hr, it allows for training mid-sized models (e.g., 30B parameters with QLoRA) or 8B models with much longer context windows than the 4090 allows.1
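
The VRAM-to-model-size matching described above can be reduced to a rough rule of thumb. The sketch below is illustrative only: the constants (0.5 bytes per parameter for 4-bit weights, a fixed adapter/optimizer allowance, and a 1.2x runtime overhead multiplier) are ballpark assumptions, not measured figures, and real headroom depends on batch size and context length.

```python
# Rough rule-of-thumb VRAM estimator for QLoRA fine-tuning.
# All constants are illustrative assumptions, not benchmarks.

def estimate_qlora_vram_gb(params_billions: float, lora_overhead_gb: float = 2.0) -> float:
    """Estimate VRAM (GB) needed to QLoRA-fine-tune a model of the given size."""
    weights_gb = params_billions * 0.5             # 4-bit quantized base weights
    return round((weights_gb + lora_overhead_gb) * 1.2, 1)  # adapters + runtime overhead

def fits_on(gpu_vram_gb: float, params_billions: float) -> bool:
    """Check whether the estimate fits within a GPU's VRAM budget."""
    return estimate_qlora_vram_gb(params_billions) <= gpu_vram_gb

# Under this estimate an 8B model fits a 24GB RTX 4090,
# while a 70B model requires the 80GB enterprise tier.
print(fits_on(24, 8), fits_on(24, 70), fits_on(80, 70))
```

This mirrors the tiering above: the prosumer cards cover the 7B-8B class, while 70B QLoRA work lands on 48-80GB hardware.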

2.2 Deployment Tiers: Secure Cloud vs. Community Cloud

RunPod segments its GPU inventory into two distinct tiers: Community Cloud and Secure Cloud. This distinction is critical for designing a headless operation, as it fundamentally dictates the reliability engineering required in the training code.
Secure Cloud represents enterprise-grade datacenters with high reliability, redundancy, and security certifications (SOC2, etc.). For a user whose primary requirement is to "upload and train," Secure Cloud offers the assurance that the pod will not vanish mid-training due to a provider pulling the machine off the network. The pricing is slightly higher, but the reduction in operational complexity—specifically the reduced need for aggressive fault-tolerance scripting—often outweighs the raw hourly cost difference.1 For the final "production" training run, specifically when processing a massive dataset that might take 10+ hours, Secure Cloud is the recommended tier to ensure uninterrupted execution.
Community Cloud consists of crowdsourced GPUs provided by third parties. While significantly cheaper, these function similarly to Spot instances in traditional clouds, though with potentially higher variance in uptime and network bandwidth. They are ideal for "bursty" workloads where a user might spin up 10 simultaneous experiments to test different learning rates. However, utilizing this tier for headless training requires the training script to be resilient. It implies that the "code upload" must include logic to check for existing checkpoints on a persistent volume and resume automatically, as the probability of a node restart is non-zero.1

2.3 Cost-Performance Matrix

To assist in making the precise hardware decision, the following table synthesizes the cost, utility, and risk profile of available hardware for fine-tuning tasks on RunPod.

| GPU Model | VRAM | Cloud Tier | Est. Price/Hr | Best Use Case | Headless Reliability |
|---|---|---|---|---|---|
| H100 SXM | 80GB | Secure | ~$2.69 | Full FT 70B+, Time-Critical Jobs | High (Fastest completion minimizes risk) |
| A100 SXM | 80GB | Secure | ~$1.49 | QLoRA 70B, Full FT 8B | High (Standard enterprise reliability) |
| A100 PCIe | 40GB | Secure | ~$1.39 | LoRA 13B-30B | Medium (Memory constraints may limit batch size) |
| RTX 6000 Ada | 48GB | Secure | ~$0.79 | Mid-range models (30B), Long Context | High (Excellent VRAM/Price ratio) |
| RTX 4090 | 24GB | Community | ~$0.34 | QLoRA 8B, Debugging, Sweeps | Low/Medium (Requires fault-tolerance logic) |
| RTX 3090 | 24GB | Community | ~$0.22 | Low-budget experimentation | Low (Slower speed increases interrupt risk) |

1
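
When reading the table, total job cost matters more than the hourly rate, since a faster GPU finishes sooner. The sketch below uses the approximate rates from the table and the ~3x generational speedup cited earlier; both numbers are ballpark figures, not benchmarks.

```python
# Total job cost = rate x wall-clock hours; a pricier, faster GPU can win.
# Rates are the approximate per-hour figures from the table above.
RATES = {"H100 SXM": 2.69, "A100 SXM": 1.49, "RTX 4090": 0.34}

def job_cost(gpu: str, hours: float) -> float:
    """Total cost in USD for a job of the given duration on the given GPU."""
    return round(RATES[gpu] * hours, 2)

# A job taking 10h on an A100, assuming a ~3x H100 speedup, takes ~3.3h on an H100:
a100_total = job_cost("A100 SXM", 10.0)
h100_total = job_cost("H100 SXM", 10.0 / 3)
print(a100_total, h100_total)  # here the H100 run is both faster and cheaper
```

The same arithmetic explains the RTX 4090's role: at ~$0.34/hr, many parallel hyperparameter sweeps cost less than one enterprise-GPU hour.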

---

3. The Headless Workflow Architecture: Containerization and Automation

To satisfy the user's explicit requirement of avoiding a Jupyter notebook in favor of a "code upload" model, the workflow must shift from an interactive session to a batch-processing paradigm. In this model, the local machine is used for code development and configuration, while the remote GPU serves purely as an execution engine. This requires a Docker-first approach where the environment, code, and execution logic are encapsulated within a portable container image.

3.1 The Docker-First Approach

The cornerstone of a robust headless workflow is containerization. Launching a generic Ubuntu pod and manually installing libraries via a startup script is prone to error, hard to reproduce, and slow. Instead, the user must define the entire training environment in a Docker image. This ensures that "uploading code" translates immediately to execution without manual environment setup.

The "Entrypoint" Strategy

In a standard interactive RunPod session, the container launches and idles, typically running a sleep command or a Jupyter server, waiting for a user to connect. In a headless workflow, the Docker container utilizes an ENTRYPOINT or CMD script that immediately initiates the training process upon launch. Crucially, once the training process concludes (whether successfully or with a failure), the script handles data egress and terminates the pod.7
This approach perfectly aligns with the "upload code and train" desire. The "code" is baked into the Docker image (or mounted at runtime), and the "train" command is the automatic, inevitable action of the container starting up.

Constructing the Golden Image

A "Golden Image" for fine-tuning must include the base CUDA drivers, the Python environment, and the specific fine-tuning frameworks (Axolotl or Unsloth). Below is an architectural breakdown of such a Dockerfile, optimized for RunPod.
Scenario: A Docker image designed for fine-tuning Llama 3 using Unsloth.

```dockerfile
# Use RunPod's base image or NVIDIA's CUDA image to ensure driver compatibility
# CUDA 11.8 or 12.1 is often required for modern LLM frameworks
FROM runpod/pytorch:2.1.0-py3.10-cuda11.8.0-devel

# Set working directory
WORKDIR /workspace

# Install system dependencies
# git-lfs is critical for downloading large models/datasets
RUN apt-get update && apt-get install -y git git-lfs htop nvtop tmux

# Install Python dependencies
# Unsloth and Axolotl often require specific versions of xformers and trl
# Using a requirements.txt allows for easier version pinning
COPY requirements.txt /workspace/requirements.txt
RUN pip install --no-cache-dir -r requirements.txt

# Install specific frameworks (Example: Unsloth)
# Note: Unsloth installation often requires specific CUDA paths
RUN pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
RUN pip install --no-deps "xformers<0.0.26" "trl<0.9.0" peft accelerate bitsandbytes

# Copy the automation scripts and training code
COPY train.py /workspace/train.py
COPY start.sh /workspace/start.sh

# Make the start script executable
RUN chmod +x /workspace/start.sh

# Set the entrypoint to the automation script
ENTRYPOINT ["/workspace/start.sh"]
```

7
Optimization Insight: Embedding the dataset directly into the Docker image (via COPY dataset.jsonl) is a viable strategy only for small datasets (<5GB). For massive datasets (>100GB), as implied by the "custom training data" requirement, this approach creates bloated images that are slow to push and pull. For large-scale data, the start.sh script should be designed to pull the data from S3 or a RunPod Network Volume at runtime, ensuring the Docker image remains lightweight and agile.10

3.2 The Automation Logic: The start.sh Script

The start.sh script acts as the "brain" of the headless operation. It orchestrates the sequence of events inside the pod, managing authentication, data ingestion, execution, and cleanup.

```bash
#!/bin/bash
set -e # Exit immediately if a command exits with a non-zero status

# 1. Environment Setup (logging in to Hugging Face and WandB)
# These tokens are passed via environment variables at runtime
if [ -n "$HF_TOKEN" ]; then
    huggingface-cli login --token "$HF_TOKEN"
fi

if [ -n "$WANDB_API_KEY" ]; then
    wandb login "$WANDB_API_KEY"
fi

# 2. Data Ingestion
# Download dataset from S3 or Network Volume if not present
if [ ! -f "/workspace/dataset.jsonl" ]; then
    echo "Downloading dataset from remote source..."
    # Example using a presigned URL or S3 CLI
    wget -O /workspace/dataset.jsonl "$DATASET_URL"
fi

# 3. Execution
echo "Starting Training..."
# Launch the Python training script
# Unsloth or Axolotl command goes here
python train.py --config config.json

# 4. Exfiltration/Saving
echo "Training Complete. Merging and Uploading..."
# Assuming train.py saves to /workspace/output
# This step ensures the trained weights are saved to HF Hub or S3
python upload_to_hub.py --path /workspace/output --repo my-user/my-finetuned-model

# 5. Cleanup (Critical for Cost Savings)
echo "Shutting down pod to stop billing..."
runpodctl stop pod $RUNPOD_POD_ID
```

7
FinOps Strategy: By including runpodctl stop pod $RUNPOD_POD_ID as the final command, the user ensures they only pay for the exact duration of the training. This effectively transforms a standard GPU pod into a serverless-like job, preventing "zombie pods" from racking up bills after the training is finished.12
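
The `upload_to_hub.py` step in the script is a user-supplied file; a minimal sketch might look like the following. The `--path`/`--repo` flag names simply mirror the invocation in `start.sh`, and while `HfApi.create_repo` and `HfApi.upload_folder` are real `huggingface_hub` calls, their exact behavior should be verified against the library's documentation.

```python
# Hedged sketch of the upload_to_hub.py referenced by start.sh.
# Assumes HF_TOKEN is already set in the environment by the start script.
import argparse
import os

def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="Push fine-tuned weights to the HF Hub")
    parser.add_argument("--path", required=True, help="local directory with merged weights")
    parser.add_argument("--repo", required=True, help="target repo id, e.g. my-user/my-finetuned-model")
    return parser.parse_args(argv)

def main(argv=None):
    args = parse_args(argv)
    if not os.path.isdir(args.path):
        raise SystemExit(f"output directory not found: {args.path}")
    from huggingface_hub import HfApi  # imported lazily; reads HF_TOKEN from the environment
    api = HfApi()
    api.create_repo(args.repo, exist_ok=True)   # no-op if the repo already exists
    api.upload_folder(folder_path=args.path, repo_id=args.repo)

if __name__ == "__main__":
    main()
```

Because the script exits non-zero when the output directory is missing, `set -e` in `start.sh` ensures a failed training run never proceeds to the shutdown step silently.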

3.3 Remote Management: The runpodctl Utility

For managing these headless pods, runpodctl is the essential Command Line Interface (CLI) tool provided by RunPod. It allows the user to spin up pods, stream logs, and transfer files without ever navigating to the web console.13
Automation via CLI:
The user can script the deployment of the training job from their local machine. A single command can instantiate the pod using the custom image defined above:

```bash
runpodctl create pod \
  --name "headless-llama3-finetune" \
  --gpuType "NVIDIA A100 80GB PCIe" \
  --imageName "myregistry/custom-llm-trainer:v1" \
  --containerDiskSize 100 \
  --volumeSize 200 \
  --env HF_TOKEN=$HF_TOKEN \
  --env WANDB_API_KEY=$WANDB_KEY \
  --env DATASET_URL="https://my-s3-bucket..."
```

14
This command fulfills the user's request: it uploads the configuration (via the image definition) and starts training immediately. The --gpuType flag ensures the job lands on the specific hardware required for speed, while --env passes the necessary secrets securely.
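
For CI pipelines, it can help to build this invocation programmatically rather than maintaining shell one-liners. The helper below is a local convenience function (an assumption, not part of any RunPod SDK); the flags mirror the `runpodctl` command shown above, and the token value is a placeholder.

```python
# Build the runpodctl invocation as an argument list for subprocess.run().
def build_create_pod_cmd(name, gpu_type, image, env, container_disk=100, volume=200):
    cmd = ["runpodctl", "create", "pod",
           "--name", name,
           "--gpuType", gpu_type,
           "--imageName", image,
           "--containerDiskSize", str(container_disk),
           "--volumeSize", str(volume)]
    for key, value in env.items():
        cmd += ["--env", f"{key}={value}"]  # secrets injected at runtime, never baked into the image
    return cmd

cmd = build_create_pod_cmd(
    "headless-llama3-finetune",
    "NVIDIA A100 80GB PCIe",
    "myregistry/custom-llm-trainer:v1",
    {"HF_TOKEN": "hf_xxx"},  # placeholder token
)
# subprocess.run(cmd, check=True) would launch the pod from a CI job
```

Using an argument list (rather than a shell string) avoids quoting bugs when dataset URLs or tokens contain special characters.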

---

4. Fine-Tuning Frameworks: The Engines of Efficiency

To train "as fast as possible" without reinventing the wheel, high-level fine-tuning frameworks are vastly superior to writing raw PyTorch training loops. The two leading contenders for this workflow on RunPod are Axolotl and Unsloth. Each offers distinct advantages for headless execution.

4.1 Axolotl: The Configuration-Driven Powerhouse

Axolotl is designed for users who want to define what to train, not how to code the training loop. It abstracts the complexity of the Hugging Face Trainer into a comprehensive YAML configuration file.15

  • Headless Suitability: Excellent. Because the entire training logic is encapsulated in a single YAML file, "uploading code" simply means injecting this config file into the container. There is no need to maintain complex Python scripts; the logic is declarative.
  • Feature Set: Axolotl supports Full Fine-Tuning (FFT), LoRA, QLoRA, and advanced techniques like Flash Attention 2 and Sample Packing. Sample packing is particularly relevant for speed, as it concatenates multiple short examples into a single sequence, removing padding tokens and significantly increasing training throughput.17
  • Workflow Integration:
    1. User edits config.yaml locally.
    2. User builds Docker image with this config or mounts it at runtime.
    3. Container starts and runs axolotl train config.yaml.
  • Multi-GPU Scaling: Axolotl excels at multi-GPU training using FSDP (Fully Sharded Data Parallel) or DeepSpeed. If the user intends to scale training across an 8x A100 node to maximize speed, Axolotl is the robust choice.17
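
As a concrete illustration of the declarative workflow, a QLoRA config for Llama 3 8B might look like the sketch below. The field names follow Axolotl's documented schema, but the values are placeholders and the exact options should be verified against the Axolotl docs for the installed version.

```yaml
# Illustrative Axolotl QLoRA config (config.yaml); values are placeholders.
base_model: meta-llama/Meta-Llama-3-8B
load_in_4bit: true
adapter: qlora
lora_r: 16
lora_alpha: 32
lora_dropout: 0.05

datasets:
  - path: /workspace/dataset.jsonl
    type: alpaca

sequence_len: 4096
sample_packing: true       # the throughput optimization described above
flash_attention: true

micro_batch_size: 2
gradient_accumulation_steps: 8
num_epochs: 3
learning_rate: 2e-4

output_dir: /workspace/output
wandb_project: headless-finetune
```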

4.2 Unsloth: The Efficiency Specialist

Unsloth is a framework optimized specifically for speed and memory efficiency on single-GPU setups. It utilizes custom Triton kernels to manually backpropagate gradients, achieving 2-5x faster training speeds and up to 80% less memory usage compared to standard Hugging Face implementations.17

  • Headless Suitability: High. Unsloth provides Docker images that can be easily adapted for headless execution.9 The speed gains directly address the user's requirement to "train as fast as possible."
  • Performance: For single-GPU setups (e.g., one H100 or A100), Unsloth is unrivaled. The memory savings allow users to fit significantly larger batch sizes into VRAM, which directly translates to faster wall-clock training times. For example, on a Llama 3 8B model, Unsloth can enable training with context lengths that would cause OOM (Out of Memory) errors on standard implementations.19
  • Limitation: Historically, Unsloth has been optimized for single-GPU training. While multi-GPU support is evolving, its primary strength remains in maximizing the throughput of a single card. For a user operating on a single powerful node (like an H100), Unsloth is likely the fastest option.18
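
A minimal `train.py` in the shape of Unsloth's documented API might look like the sketch below. `FastLanguageModel.from_pretrained` and `get_peft_model` follow Unsloth's published examples, but the hyperparameters are placeholders, the dataset is assumed to have a `text` column, and exact signatures should be checked against the Unsloth and TRL versions pinned in the image.

```python
# Hedged Unsloth training sketch; heavy imports stay inside main() so the
# helper below is testable without GPU dependencies installed.

def effective_batch(micro_batch_size: int, grad_accum_steps: int) -> int:
    """Sequences consumed per optimizer step. Unsloth's memory savings let
    micro_batch_size grow, which is where the wall-clock speedup comes from."""
    return micro_batch_size * grad_accum_steps

def main():
    from unsloth import FastLanguageModel
    from trl import SFTTrainer
    from transformers import TrainingArguments
    from datasets import load_dataset

    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name="unsloth/llama-3-8b-bnb-4bit",  # pre-quantized 4-bit checkpoint
        max_seq_length=4096,
        load_in_4bit=True,
    )
    model = FastLanguageModel.get_peft_model(
        model, r=16, lora_alpha=16,
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                        "gate_proj", "up_proj", "down_proj"],
    )

    dataset = load_dataset("json", data_files="/workspace/dataset.jsonl", split="train")
    trainer = SFTTrainer(
        model=model,
        tokenizer=tokenizer,
        train_dataset=dataset,
        dataset_text_field="text",             # assumes a pre-formatted "text" column
        args=TrainingArguments(
            per_device_train_batch_size=4,
            gradient_accumulation_steps=4,
            num_train_epochs=3,
            output_dir="/workspace/output",
            report_to="wandb",                 # external metrics for headless monitoring
        ),
    )
    trainer.train()

if __name__ == "__main__":
    main()
```

With these placeholder values, each optimizer step consumes `effective_batch(4, 4) = 16` sequences per device.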

4.3 Comparative Analysis for the User

| Feature | Axolotl | Unsloth | Strategic Recommendation |
|---|---|---|---|
| Configuration | YAML-based (Declarative) | Python/Script-based | Axolotl for strict config management and reproducibility |
| Speed (Single GPU) | High (uses Flash Attn) | Extreme (2x faster than Axolotl) | Unsloth for raw speed on single cards (H100/A100) |
| Multi-GPU | Native Support (DeepSpeed/FSDP) | Limited/Paid Tier | Axolotl for distributed training across clusters |
| Ease of Headless | Very High | High | Both are excellent; choice depends on scaling needs |

Expert Insight: Given the user's preference for "fast as possible" and "custom code," if the model fits on a single GPU (e.g., Llama 3 8B or 70B on an H100), Unsloth is the superior choice for raw throughput. If the user requires multi-GPU scaling or complex dataset mixing configurations, Axolotl provides a more robust infrastructure.18

---

5. Data Logistics: Solving the Custom Data Bottleneck

A major challenge in ephemeral, headless training is data logistics. The user specified "custom training data," which implies datasets that are not pre-cached in public hubs. Handling large datasets (100GB+) efficiently is critical to avoiding idle GPU time.

5.1 Storage Architectures: Network Volumes vs. NVMe vs. Object Storage

  • Local Pod Storage (Container Disk): This offers the fastest I/O performance. Data is stored on the NVMe SSD directly attached to the GPU instance. This is ideal for maximizing training speed, as the GPU is not starved of data. However, this storage is ephemeral; data is lost if the pod is terminated without external saving.5
  • RunPod Network Volumes: This is persistent storage that survives pod termination and allows data to be shared across pods.
    • Throughput Bottleneck: Network volumes can suffer from slower throughput (200-400 MB/s) compared to local NVMe, potentially bottlenecking the data loader during training of small models where the GPU processes batches faster than the disk can supply them.22
    • Region Lock: Network volumes are region-locked. If a volume is created in US-NJ, the user is forced to rent GPUs in US-NJ. This severely limits the ability to grab available H100s in other regions, contradicting the "train as fast as possible" goal.22
  • S3 / Object Storage: The most flexible approach. Data is stored in AWS S3 (or compatible) and streamed or downloaded at the start of the session.

5.2 Recommended Data Strategy for Speed

To maximize training speed, Local NVMe Storage is superior to Network Volumes, despite its ephemeral nature. The recommended workflow for headless execution is:

  1. Storage: Store the master dataset in a high-performance S3 bucket or RunPod's S3-compatible object storage layer.25
  2. Ingest: The start.sh script downloads the dataset from S3 to the pod's local /workspace directory (NVMe) at boot time.
  3. Train: The model trains off the fast local NVMe, ensuring the GPU is fully saturated.
  4. Egress: The start.sh uploads the checkpoints and final model back to S3 or Hugging Face.

This approach avoids the region-locking of Network Volumes and the I/O latency penalties, utilizing the immense bandwidth of datacenter GPUs for rapid setup.10

5.3 Transferring Large Data: The 100GB Challenge

For users who must use RunPod storage (e.g., due to compliance or cost), transferring 100GB+ of data from a local machine is non-trivial. The runpodctl send command creates a peer-to-peer transfer tunnel. While effective for smaller files, users have reported slow speeds and timeouts for large datasets.26
Insight: For datasets >100GB, do not upload from a home internet connection directly to a GPU pod. Instead:

  1. Spin up a cheap CPU pod on RunPod.
  2. Use rsync or runpodctl to upload the data to this CPU pod (which sits on the high-speed datacenter backbone).
  3. From the CPU pod, transfer the data to a Network Volume or S3 bucket.

This leverages the internal network backbone rather than residential ISP uplinks, preventing the GPU pod from sitting idle while waiting for data uploads.

---

6. Monitoring and Observability without Jupyter

In a headless environment, "blind" training is a significant operational risk. Observability must be externalized to ensure the user knows if the model is converging or if the pod has crashed.

6.1 Weights & Biases (WandB)

WandB is the de facto standard for headless monitoring. It integrates natively with both Axolotl and Unsloth (via the Hugging Face Trainer).

  • Real-Time Metrics: Loss curves, GPU utilization, memory usage, and learning rate schedules are streamed to the WandB dashboard in real-time. This allows the user to monitor the "pulse" of the training from a mobile device or laptop.
  • Artifacts: Model checkpoints and config files can be logged as artifacts, providing version control for the models and ensuring reproducibility.

6.2 Remote Logging

RunPod provides a logging driver that captures stdout and stderr from the container.

  • Command: runpodctl logs <pod_id> allows the user to check the console output from their local terminal to verify the script started correctly or to catch crash errors (e.g., CUDA OOM).11
  • Best Practice: The start.sh script should use set -e (exit immediately on error) and trap errors. Advanced users may add a curl command to the script to send a notification (e.g., via a Discord webhook or Slack API) if the training fails or succeeds, ensuring the user is alerted immediately without needing to constantly poll the logs.
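
A Python equivalent of the curl-based webhook alert might look like the sketch below. The `content` field is the real payload key for Discord-style webhooks; the URL and job name are placeholders, and a Slack webhook would use a `text` field instead.

```python
# Hedged webhook alert sketch for headless runs (Discord-style payload).
import json
import urllib.request

def build_alert(job_name: str, status: str, detail: str = "") -> bytes:
    """Construct the JSON body for a Discord-style webhook."""
    message = f"[{job_name}] training {status}" + (f": {detail}" if detail else "")
    return json.dumps({"content": message}).encode("utf-8")

def send_alert(webhook_url: str, payload: bytes) -> None:
    req = urllib.request.Request(
        webhook_url, data=payload, headers={"Content-Type": "application/json"}
    )
    urllib.request.urlopen(req)  # wrap in try/except so a failed alert never kills the run

# Invoked from a bash trap in start.sh, e.g. on failure:
#   send_alert(os.environ["WEBHOOK_URL"], build_alert("llama3-ft", "failed", "CUDA OOM"))
```

Calling this from a `trap '...' ERR` handler in `start.sh` covers the failure path that `set -e` would otherwise terminate silently.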

---

7. Advanced Optimization and Troubleshooting

7.1 Handling "Cold Starts" and Image Caching

Downloading large Docker images (often 10GB+ for ML images) takes time. RunPod caches images on the host node.

  • Strategy: Stick to a single image tag (e.g., myuser/trainer:v1). Once a specific host has pulled this image, subsequent runs on that same host are instant.
  • Docker Optimization: Use multi-stage builds to keep the final image size small. Remove cache files (pip cache purge) within the Dockerfile to minimize layer size.28

7.2 CUDA Version Mismatches

A common failure mode in custom images is a mismatch between the Docker container's CUDA toolkit and the host driver.

  • RunPod Environment: RunPod hosts generally run the latest NVIDIA drivers.
  • Image Requirement: Ensure the Docker image uses a compatible CUDA version (e.g., CUDA 11.8 or 12.1). Unsloth, for example, has specific requirements for CUDA 12.1 for maximum performance.9 Using the wrong base image will result in runtime errors regarding "Flash Attention" or "Bitsandbytes" compilation.

7.3 Spot Instance Interruptions

If using Community Cloud to save money, the pod may be preempted (shut down) if the provider needs the hardware.

  • Mitigation: Configure the training script to save checkpoints frequently (e.g., every 100 steps) to a mounted Network Volume or upload them immediately to S3.
  • Resume Logic: The start.sh should check for the existence of a checkpoint and automatically pass --resume_from_checkpoint to the training script. This ensures that if a pod dies and a new one is spawned, it picks up exactly where the last one left off.30

---

8. Conclusion and Strategic Roadmap

For a user demanding the fastest possible fine-tuning workflow without the overhead of Jupyter notebooks, RunPod offers a powerful substrate, provided the workflow is architected correctly. The optimal path requires moving away from interactive "pet" instances to ephemeral "cattle" instances managed by code.
The Recommended "Fast Track" Configuration:

  1. Hardware: NVIDIA H100 (Secure Cloud) for speed and reliability, or RTX 4090 (Community Cloud) for cost-efficiency.
  2. Framework: Unsloth for single-GPU jobs (fastest throughput); Axolotl for multi-GPU or complex configurations.
  3. Deployment: Custom Docker image with an ENTRYPOINT script that automates the Download -> Train -> Upload -> Terminate lifecycle.
  4. Interface: runpodctl for deployment; WandB for monitoring; SSH for emergency debugging.
  5. Data: S3-backed ingestion to local NVMe storage to bypass network volume I/O bottlenecks.

By adopting this headless architecture, the user transforms the fine-tuning process from a manual, error-prone task into a scalable, automated engineering operation, fully leveraging the raw compute power of RunPod's infrastructure. This report confirms that while RunPod's interface invites interactive use, its API and CLI capabilities are fully mature for the rigorous demands of headless, high-velocity machine learning operations.

Works cited

  1. Runpod GPU pricing: A complete breakdown and platform comparison | Blog - Northflank, accessed January 12, 2026, https://northflank.com/blog/runpod-gpu-pricing
  2. The NVIDIA H100 GPU Review: Why This AI Powerhouse Dominates (But Costs a Fortune) - Runpod, accessed January 12, 2026, https://www.runpod.io/articles/guides/nvidia-h100
  3. Runpod Secrets: Affordable A100/H100 Instances, accessed January 12, 2026, https://www.runpod.io/articles/guides/affordable-a100-h100-gpu-cloud
  4. Pricing | Runpod GPU cloud computing rates, accessed January 12, 2026, https://www.runpod.io/pricing
  5. RunPod Pricing 2025 Complete Guide (GPU Cloud Costs Breakdown) - Flexprice, accessed January 12, 2026, https://flexprice.io/blog/runprod-pricing-guide-with-gpu-costs
  6. No-Code AI: How I Ran My First LLM Without Coding | Runpod Blog, accessed January 12, 2026, https://www.runpod.io/blog/no-code-ai-run-llm
  7. Dockerfile - Runpod Documentation, accessed January 12, 2026, https://docs.runpod.io/tutorials/introduction/containers/create-dockerfiles
  8. Deploying AI Apps with Minimal Infrastructure and Docker - Runpod, accessed January 12, 2026, https://www.runpod.io/articles/guides/deploy-ai-apps-minimal-infrastructure-docker
  9. Fine-Tuning Local Models with Docker Offload and Unsloth, accessed January 12, 2026, https://www.docker.com/blog/fine-tuning-models-with-offload-and-unsloth/
  10. Optimize your workers - Runpod Documentation, accessed January 12, 2026, https://docs.runpod.io/serverless/development/optimization
  11. Manage Pods - Runpod Documentation, accessed January 12, 2026, https://docs.runpod.io/pods/manage-pods
  12. AI on a Schedule: Using Runpod's API to Run Jobs Only When Needed, accessed January 12, 2026, https://www.runpod.io/articles/guides/ai-on-a-schedule
  13. Overview - Runpod Documentation, accessed January 12, 2026, https://docs.runpod.io/runpodctl/overview
  14. create pod - Runpod Documentation, accessed January 12, 2026, https://docs.runpod.io/runpodctl/reference/runpodctl-create-pod
  15. LLM fine-tuning | LLM Inference Handbook - BentoML, accessed January 12, 2026, https://bentoml.com/llm/getting-started/llm-fine-tuning
  16. How to fine-tune a model using Axolotl | Runpod Blog, accessed January 12, 2026, https://www.runpod.io/blog/how-to-fine-tune-a-model-using-axolotl
  17. Best frameworks for fine-tuning LLMs in 2025 - Modal, accessed January 12, 2026, https://modal.com/blog/fine-tuning-llms
  18. Comparing LLM Fine-Tuning Frameworks: Axolotl, Unsloth, and Torchtune in 2025, accessed January 12, 2026, https://blog.spheron.network/comparing-llm-fine-tuning-frameworks-axolotl-unsloth-and-torchtune-in-2025
  19. Axolotl vs LLaMA-Factory vs Unsloth for AI Fine-Tuning 2026 - Index.dev, accessed January 12, 2026, https://www.index.dev/skill-vs-skill/ai-axolotl-vs-llama-factory-vs-unsloth
  20. [TEMPLATE] One-click Unsloth finetuning on RunPod : r/LocalLLaMA - Reddit, accessed January 12, 2026, https://www.reddit.com/r/LocalLLaMA/comments/1nyzzws/template_oneclick_unsloth_finetuning_on_runpod/
  21. unsloth/llama-3-8b-bnb-4bit - Hugging Face, accessed January 12, 2026, https://huggingface.co/unsloth/llama-3-8b-bnb-4bit
  22. PSA: Don't bother with Network Volumes on Runpod : r/StableDiffusion - Reddit, accessed January 12, 2026, https://www.reddit.com/r/StableDiffusion/comments/1nkcgvp/psa_dont_bother_with_network_volumes_on_runpod/
  23. Network volumes - Runpod Documentation, accessed January 12, 2026, https://docs.runpod.io/storage/network-volumes
  24. Using network volume with serverless - Runpod - Answer Overflow, accessed January 12, 2026, https://www.answeroverflow.com/m/1234830020678123610
  25. Streamline Your AI Workflows with RunPod's New S3-Compatible API, accessed January 12, 2026, https://www.runpod.io/blog/streamline-ai-workflows-s3-api
  26. Upload speed - Runpod - Answer Overflow, accessed January 12, 2026, https://www.answeroverflow.com/m/1415080595020709938
  27. `runpodctl send` crawling at <1MB speeds - Runpod - Answer Overflow, accessed January 12, 2026, https://www.answeroverflow.com/m/1208971275163406376
  28. MLOps Workflow for Docker-Based AI Model Deployment - Runpod, accessed January 12, 2026, https://www.runpod.io/articles/guides/mlops-workflow-docker-ai-deployment
  29. Installation - Axolotl Docs, accessed January 12, 2026, https://docs.axolotl.ai/docs/installation.html
  30. Does anyone use RunPod for SFT? If yes, you train via SSH or Jupyter (web-hosted), accessed January 12, 2026, https://www.reddit.com/r/LocalLLaMA/comments/1pd6vxu/does_anyone_use_runpod_for_sft_if_yes_you_train/