Why Small LoRA Datasets Score Better (And Why That Is Misleading)
About This Project
This project investigates a widely repeated heuristic in LoRA training: that smaller datasets produce better results.
I designed a controlled experiment (15 runs, 5×3 grid) to isolate the effects of dataset size, batch size, and epoch count.
The results show that the perceived advantage of small datasets is largely explained by increased per-image exposure (epoch count), not better generalization.
TL;DR
Smaller datasets don't produce better LoRA models; they overtrain more.
When total training samples are held constant, reducing dataset size from 400 to 25 images causes epoch count to increase from 25 to 400.
The resulting "quality gains" in CLIP identity scores are driven by massively increased per-image exposure, not superior learning.
Batch size, the variable practitioners actually control, has almost zero independent effect on fidelity metrics. What looks like a dataset-size effect is actually a training exposure effect (mediated by epoch count).
Key result: CLIP identity correlates with log(epoch count) at r = +0.83, a stronger relationship than with dataset size or batch size alone. Practitioners should treat per-image exposure (epoch count) as the primary tuning variable, rather than relying on dataset-size or batch-size heuristics.
Diversity (intra-set LPIPS) also increases with epoch count (r = +0.57), but substantially more slowly than CLIP identity (r = +0.83). This asymmetric growth, strong identity gains without proportional diversity gains, is consistent with memorization rather than generalization.
Contributions
This work provides:
- A controlled experimental design isolating epoch count as a confound
- A reproducible LoRA evaluation pipeline
- Empirical evidence against batch-size-based tuning heuristics
- Visual evidence that metric dominance does not guarantee output quality, a Goodhart-like failure mode where optimization against standard metrics produces systematically worse visual outputs
1. Introduction
Low-Rank Adaptation (LoRA) has become the standard method for efficiently fine-tuning large diffusion models for specific identities, styles, and concepts. Despite widespread adoption, most training guidance remains anecdotal, with forum posts recommending specific batch sizes or dataset sizes without quantitative backing.
This study provides quantified trade-offs across 15 LoRA training runs spanning 5 dataset sizes (25, 50, 100, 200, 400 images) and 3 batch sizes (1, 2, 4), all with total training samples (steps × batch_size = 10,000) held constant. The key contributions are:
- Formalized evidence that batch size recommendations (e.g., batch size = 4 from Hong & Choo 2025) do not generalize across dataset sizes
- Statistical decomposition showing epoch count is the dominant predictor of CLIP identity, not dataset size or batch size independently
- A reproducible evaluation framework for future LoRA studies
- Evidence that smaller datasets produce higher identity scores through increased per-image exposure, with visual inspection revealing that top-scoring outputs can exhibit artifacts invisible to automated metrics
While this study focuses on a single character and training setup, the experimental design isolates a general mechanism (per-image exposure) that is likely to apply across LoRA training scenarios.
Why This Matters
Most LoRA training guides recommend specific dataset sizes or batch sizes without controlling for total training exposure. This creates a systematic bias: small datasets appear to produce higher-fidelity models, but only because they overtrain by an order of magnitude more epochs. Practitioners following this advice may unknowingly optimize for metric scores that don't reflect actual output quality.
This matters beyond aesthetics. LoRAs are orders of magnitude cheaper than full fine-tuning, making high-quality generative models accessible to small teams and opening applications in low-data domains like medical imaging. Understanding the boundary between personalization and overtraining is essential for responsible deployment and for correctly allocating limited compute budgets.
2. Pipeline Overview
This study implements a full end-to-end LoRA evaluation pipeline, from data curation through statistical analysis:
- Dataset curation and stratified subsampling - 400 images scraped and manually curated; five subsets created by successive random halving (400, 200, 100, 50, and 25 images)
- Controlled LoRA training - 15 runs across a 5×3 grid (dataset size × batch size), all with a fixed total step budget
- Deterministic prompt-based generation - 5 evaluation prompts × 8 seeds per run = 40 outputs per LoRA, all with identical seeds across runs
- Multi-metric evaluation - CLIP identity, LPIPS identity, Clean-FID, intra-set diversity, and an overfitting proxy
- Statistical analysis - Log-linear regression decomposing the effects of dataset size, batch size, and epoch count
This pipeline is designed to isolate causal relationships between training variables rather than report raw performance. All code, prompts, metrics, and outputs are publicly available (see Section 7).
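As a concrete illustration of the generation stage, the sketch below shows one way the deterministic prompt × seed grid could be produced with diffusers. The checkpoint file name, prompt texts, and inference settings are placeholders, not the study's exact configuration; the actual prompts and settings are in the repository (Section 7).

```python
import torch
from diffusers import StableDiffusionXLPipeline

# Load the base checkpoint once; each run's LoRA is loaded and unloaded in turn.
# File name and settings below are placeholders for illustration only.
pipe = StableDiffusionXLPipeline.from_single_file(
    "illustrious_sdxl.safetensors", torch_dtype=torch.float16
).to("cuda")

# 5 fixed evaluation prompts x 8 fixed seeds = 40 outputs per LoRA,
# with identical seeds across all 15 runs.
prompts = ["close-up", "canonical outfit with sword", "alternate outfit",
           "full body", "upper body"]  # placeholder prompt texts
seeds = [0, 1, 2, 3, 4, 5, 6, 7]       # placeholder seed values

def generate_for_lora(lora_path: str, out_dir: str) -> None:
    pipe.load_lora_weights(lora_path)
    for p_idx, prompt in enumerate(prompts):
        for seed in seeds:
            gen = torch.Generator(device="cuda").manual_seed(seed)
            image = pipe(prompt, generator=gen).images[0]
            image.save(f"{out_dir}/prompt{p_idx}_seed{seed}.png")
    pipe.unload_lora_weights()  # reset adapters before the next run's LoRA
```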
3. Methodology
3.1 Dataset
- Subject: Kiss-Shot Acerola-Orion Heart-Under-Blade from Monogatari, chosen because SDXL struggles with her elaborate appearance and name
- Training splits: Five subsets created by successive random halving of the full 400-image set (400, 200, 100, 50, and 25 images)
- Source: Danbooru, Pixiv, Fancaps (Kizumonogatari films), supplemented with Gelbooru, Pinterest
- Curation: Manual deletion of poor-quality/off-subject images, subfolder-based tagging for text/speech bubbles and dress details
- Prompts: Five evaluation cases (close-up, canonical outfit with sword, alternate outfit, full body, upper body), deterministic seeds per prompt
- Reference gallery: 20 renders for CLIP/LPIPS/FID comparisons
Due to copyright restrictions, the full dataset is not redistributed.
The dataset was constructed through manual curation and preprocessing (including filtering, cropping, and tagging), and is not reproducible from source URLs or automated pipelines. The reference image set used for evaluation is also not redistributed.
3.2 Training Configuration
All 15 LoRAs were trained on Illustrious SDXL using CivitAI's Dreambooth/LoRA Trainer (Kohya fork).
Constants across all runs:
| Parameter | Value |
|---|---|
| Checkpoint | Illustrious SDXL |
| Resolution | 2047 × 2047 (as reported by trainer; bucketing enabled) |
| Bucket Enabled | Yes |
| Shuffle Tags | Yes |
| Keep Tokens | 3 |
| Clip Skip | 2 |
| Flip Augmentation | Yes |
| Unet LR | 0.00012 |
| Text Encoder LR | 0.00005 |
| LR Scheduler | cosine_with_restarts (3 cycles) |
| Min SNR Gamma | 5 |
| Network Dim / Alpha | 32 / 16 |
| Optimizer | Adafactor |
Note on resolution: CivitAI's trainer allowed 2047 × 2047 as the resolution setting rather than the expected 2048 × 2048. The exact bucketing behavior under this setting is unclear; all runs used the same configuration.
Experimental grid:
| Data Size | Batch Size | Steps | Epochs | Run ID |
|---|---|---|---|---|
| 400 | 4 | 2,500 | 25 | 1 |
| 400 | 2 | 5,000 | 25 | 2 |
| 400 | 1 | 10,000 | 25 | 3 |
| 200 | 4 | 2,500 | 50 | 4 |
| 200 | 2 | 5,000 | 50 | 5 |
| 200 | 1 | 10,000 | 50 | 6 |
| 100 | 4 | 2,500 | 100 | 7 |
| 100 | 2 | 5,000 | 100 | 8 |
| 100 | 1 | 10,000 | 100 | 9 |
| 50 | 4 | 2,500 | 200 | 10 |
| 50 | 2 | 5,000 | 200 | 11 |
| 50 | 1 | 10,000 | 200 | 12 |
| 25 | 4 | 2,500 | 400 | 13 |
| 25 | 2 | 5,000 | 400 | 14 |
| 25 | 1 | 10,000 | 400 | 15 |
Critical design note: Because total training samples are held constant, epochs = (steps × batch_size) / dataset_size. Shrinking the dataset from 400→25 causes epochs to increase from 25→400 regardless of batch size, a 16× increase in per-image exposure.
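A small sketch makes the confound explicit; it reproduces the epochs column of the grid above directly from the fixed sample budget:

```python
# Epochs are fully determined by the fixed sample budget and the dataset size:
# steps * batch_size = 10,000 for every run, so epochs = 10,000 / data_size.
TOTAL_SAMPLES = 10_000

for data_size in (400, 200, 100, 50, 25):
    for batch_size in (4, 2, 1):
        steps = TOTAL_SAMPLES // batch_size
        epochs = (steps * batch_size) / data_size
        print(f"data={data_size:3d}  batch={batch_size}  steps={steps:6,d}  epochs={epochs:.0f}")
```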
3.3 Metrics
- CLIP identity: Max cosine similarity of generations vs. references (OpenCLIP ViT-H/14)
- LPIPS identity: Perceptual distance to nearest reference (VGG-based)
- Clean-FID: Distribution distance vs. 20-image reference set
- Intra-set diversity: Average LPIPS between generation pairs (120 sampled pairs/run)
- Overfitting proxy: Fraction of generated samples with cosine similarity > 0.35 to the training set
Important caveats on metrics:
- FID with 20 references vs. 40 generations is below the sample size where Clean-FID estimates are reliable. Treat FID rankings as directional only; small differences (e.g., 189 vs. 194) are likely within noise.
- The overfitting proxy saturated at 1.0 for all runs, meaning the cosine > 0.35 threshold is too permissive for single-character LoRAs. This metric provides no discriminative signal in this study. A continuous mean similarity score would be more informative for future work.
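For concreteness, the sketch below shows one way the CLIP identity and intra-set diversity metrics could be computed with open_clip and lpips. The pretrained weight tag, image preprocessing, and the aggregation over generations (mean of per-generation maxima) are assumptions; the study's exact evaluation code is in the repository.

```python
import itertools
import random

import torch
import open_clip
import lpips
import torchvision.transforms.functional as TF
from PIL import Image

device = "cuda"

# CLIP identity: max cosine similarity of each generation to the reference set,
# then (assumed) averaged over the 40 generations of a run.
clip_model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-H-14", pretrained="laion2b_s32b_b79k")  # pretrained tag is an assumption
clip_model = clip_model.to(device).eval()

def clip_embed(paths):
    batch = torch.stack([preprocess(Image.open(p).convert("RGB")) for p in paths]).to(device)
    with torch.no_grad():
        feats = clip_model.encode_image(batch)
    return feats / feats.norm(dim=-1, keepdim=True)

def clip_identity(gen_paths, ref_paths):
    sims = clip_embed(gen_paths) @ clip_embed(ref_paths).T  # [n_gen, n_ref] cosines
    return sims.max(dim=1).values.mean().item()

# Intra-set diversity: mean LPIPS over 120 randomly sampled generation pairs.
lpips_net = lpips.LPIPS(net="vgg").to(device)

def _lpips_input(path, size=512):
    img = Image.open(path).convert("RGB").resize((size, size))
    return (TF.to_tensor(img) * 2 - 1).unsqueeze(0).to(device)  # LPIPS expects [-1, 1]

def intra_set_diversity(gen_paths, n_pairs=120, seed=0):
    rng = random.Random(seed)
    pairs = rng.sample(list(itertools.combinations(gen_paths, 2)), n_pairs)
    with torch.no_grad():
        return sum(lpips_net(_lpips_input(a), _lpips_input(b)).item() for a, b in pairs) / n_pairs
```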
4. Results
4.1 Full Results Table
| run_id | data_size | batch_size | epochs | clip_identity | lpips_identity | diversity_lpips | fid |
|---|---|---|---|---|---|---|---|
| 1 | 400 | 4 | 25 | 0.8470 | 0.5267 | 0.5460 | 191.66 |
| 2 | 400 | 2 | 25 | 0.8430 | 0.5231 | 0.5290 | 207.70 |
| 3 | 400 | 1 | 25 | 0.8530 | 0.5237 | 0.5480 | 198.26 |
| 4 | 200 | 4 | 50 | 0.8650 | 0.5086 | 0.4970 | 193.53 |
| 5 | 200 | 2 | 50 | 0.8580 | 0.5082 | 0.5200 | 200.96 |
| 6 | 200 | 1 | 50 | 0.8550 | 0.5087 | 0.5240 | 191.87 |
| 7 | 100 | 4 | 100 | 0.8590 | 0.5157 | 0.5320 | 199.43 |
| 8 | 100 | 2 | 100 | 0.8530 | 0.5187 | 0.5360 | 210.29 |
| 9 | 100 | 1 | 100 | 0.8590 | 0.5188 | 0.5230 | 207.96 |
| 10 | 50 | 4 | 200 | 0.8580 | 0.5367 | 0.5640 | 189.12 |
| 11 | 50 | 2 | 200 | 0.8600 | 0.5176 | 0.5380 | 192.47 |
| 12 | 50 | 1 | 200 | 0.8680 | 0.5272 | 0.5600 | 189.48 |
| 13 | 25 | 4 | 400 | 0.8760 | 0.5291 | 0.5660 | 194.55 |
| 14 | 25 | 2 | 400 | 0.8860 | 0.5047 | 0.5400 | 189.06 |
| 15 | 25 | 1 | 400 | 0.8720 | 0.5214 | 0.5800 | 201.60 |
4.2 The Epoch Confound
The central finding: epoch count explains CLIP identity better than dataset size and batch size combined.
Log-linear regression results (n = 15 runs):
| Metric | Model A: data × batch (adjusted R²) | Model B: epochs alone (adjusted R²) | Winner |
|---|---|---|---|
| clip_identity | 0.6060 | 0.6666 | Epochs |
| lpips_identity | −0.2059 | −0.0650 | Epochs |
| fid | −0.1024 | 0.0022 | Epochs |
| diversity_lpips | 0.1534 | 0.2679 | Epochs |
Epochs alone wins on every metric. The full interaction model (data × batch) with 4 parameters cannot beat a single-predictor epochs model on adjusted R². This indicates that previously reported dataset-size effects in LoRA training may be largely attributable to uncontrolled variation in epoch count.
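The model comparison above can be reproduced from the aggregated results; a minimal sketch with statsmodels, assuming the column names shown in the results table (the analysis notebook in the repository is the authoritative version):

```python
# Compare adjusted R² of the data×batch interaction model (Model A) against a
# single-predictor log(epochs) model (Model B) for each metric.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("results.csv")

for metric in ["clip_identity", "lpips_identity", "fid", "diversity_lpips"]:
    model_a = smf.ols(f"{metric} ~ np.log(data_size) * np.log(batch_size)", data=df).fit()
    model_b = smf.ols(f"{metric} ~ np.log(epochs)", data=df).fit()
    print(f"{metric:16s}  A (data x batch) adj R2 = {model_a.rsquared_adj:+.4f}   "
          f"B (epochs) adj R2 = {model_b.rsquared_adj:+.4f}")
```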
Key correlations:
- log(epochs) → CLIP identity: +0.83 (strong positive)
- log(data_size) → CLIP identity: −0.83 (strong negative but this is the same signal, inverted)
- log(batch_size) → CLIP identity: −0.02 (essentially zero)
Because the total sample budget is fixed, epochs = 10,000 / dataset_size, so log(epoch count) and log(dataset size) are perfectly collinear by design; the equal-magnitude, opposite-sign correlations with CLIP identity (±0.83) are the same underlying factor, total per-image exposure, viewed from two directions rather than independent effects. The gap between the epoch–identity correlation (r = +0.83) and the epoch–diversity correlation (r = +0.57) further indicates that increased training exposure disproportionately improves semantic similarity relative to output variation, a pattern consistent with partial memorization rather than generalization.
Notably, LPIPS identity, the perceptual distance metric, shows minimal systematic relationship with epoch count (R² = 0.011), while CLIP identity shows a strong one (R² = 0.689). This asymmetry is important: as epoch count increases, CLIP-level similarity rises substantially but perceptual similarity barely changes. The model learns to produce outputs that match the reference distribution in CLIP embedding space without proportional improvement in pixel-level perceptual fidelity. This decoupling between semantic similarity (CLIP) and perceptual fidelity (LPIPS) indicates that increased training exposure primarily drives representation alignment rather than visual quality improvement.
4.3 Run 14 Leads on Automated Metrics, But Metrics Aren't Everything
The smallest dataset size produced the single highest-scoring run on identity metrics, but the picture is more nuanced than it first appears.
Run 14 (25 images, batch size 2, 400 epochs) achieves:
| Metric | Run 14 Value | Rank (of 15) | Direction |
|---|---|---|---|
| CLIP identity | 0.886 | 1st | ↑ higher = better |
| LPIPS identity | 0.5047 | 1st | ↓ lower = better |
| FID vs refs | 189.06 | 1st | ↓ lower = better |
| Diversity LPIPS | 0.540 | 7th | ↑ higher = better |
Run 14 dominates on raw identity and distributional metrics. However, the asymmetry between its CLIP and LPIPS gains is telling: its CLIP identity is the highest of all runs by a wide margin (0.886 vs. the 400-image group mean of 0.848), yet its LPIPS identity improvement is much smaller in relative terms. This asymmetric response suggests the model is optimizing for CLIP-aligned features (high-level semantics) without improving perceptual fidelity, a signature of metric-aligned overfitting.
The best-per-group breakdown reinforces this pattern:
| Data Size | Epochs | Best CLIP Identity | Best LPIPS Identity | Best FID |
|---|---|---|---|---|
| 25 | 400 | 0.886 (run 14) | 0.5047 (run 14) | 189.06 (run 14) |
| 50 | 200 | 0.868 (run 12) | 0.5176 (run 11) | 189.12 (run 10) |
| 100 | 100 | 0.859 (run 7/9) | 0.5157 (run 7) | 199.43 (run 7) |
| 200 | 50 | 0.865 (run 4) | 0.5082 (run 5) | 191.87 (run 6) |
| 400 | 25 | 0.853 (run 3) | 0.5231 (run 2) | 191.66 (run 1) |
On identity metrics alone, this looks like an unambiguous win for small, high-epoch training. But automated metrics only capture part of the picture. As we show in Section 4.6, visual inspection reveals that Run 14's metric dominance does not translate to superior output quality.
Two important caveats apply to Run 14's apparent FID dominance:
- FID on 20 references vs. 40 generations is statistically unreliable. The difference between Run 14's FID (189.06) and Run 1's (191.66) is likely within noise at these sample sizes.
- The overfitting proxy saturated at 1.0 for all runs, so we have no working automated memorization detector in this study.
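The best-per-group breakdown above can be regenerated from results.csv; a small pandas sketch, assuming the column names shown in the full results table:

```python
# Best run per dataset-size group on each metric (higher CLIP identity is better,
# lower LPIPS identity and FID are better). Column names as in results.csv.
import pandas as pd

df = pd.read_csv("results.csv")
rows = []
for data_size, g in df.groupby("data_size"):
    rows.append({
        "data_size": data_size,
        "epochs": g["epochs"].iloc[0],
        "best_clip": g.loc[g["clip_identity"].idxmax(), ["run_id", "clip_identity"]].to_dict(),
        "best_lpips": g.loc[g["lpips_identity"].idxmin(), ["run_id", "lpips_identity"]].to_dict(),
        "best_fid": g.loc[g["fid"].idxmin(), ["run_id", "fid"]].to_dict(),
    })
print(pd.DataFrame(rows).to_string(index=False))
```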
4.4 Heatmaps
Figure 2: Metric heatmaps showing the interaction between dataset size and batch size. Each cell shows the metric value for that combination. Note how the dominant gradient runs top-to-bottom (dataset size / epochs) rather than left-to-right (batch size).
4.5 Partial Effects
Figure 3: Partial effect of dataset size (marginalized over batch size) on each metric. Error bars show ±1 SD across batch sizes.
Figure 4: Partial effect of batch size (marginalized over dataset size). Note the flat or near-flat slopes: batch size has minimal independent effect.
Figure 5: Partial effect of epoch count, the true confound variable. CLIP identity climbs steadily with epochs while other metrics remain relatively stable.
4.6 Visual Inspection: When Metrics Lie
Run #14 Data Size = 25, Batch = 2, Steps = 5,000
Run #3 Data Size = 400, Batch = 1, Steps = 10,000
Figure 6: Side-by-side comparison of outputs from Run 14 (25 images, 400 epochs, highest metric scores) and Run 3 (400 images, 25 epochs, lower metric scores). Despite dominating on every automated metric, Run 14 exhibits visible artifacting (edge ringing / oversharpening, anatomical inconsistencies, feature repetition, loss of fine detail). Run 3 produces outputs that demonstrate learned understanding of the character's features, proportions, and stylistic variation.
This is the critical disconnect: In our evaluation, Run 14 wins on metrics but loses on visual quality. The model has learned to produce outputs that are numerically similar to the reference set (high CLIP cosine similarity, low LPIPS distance) without faithfully reproducing what a human would recognize as a high-quality rendition of the character.
This finding has two implications:
- CLIP identity and LPIPS are necessary but not sufficient for evaluating LoRA quality. A model that scores well on these metrics may still produce visually unsatisfying outputs, especially when epoch counts are extreme.
- The "small datasets are better" heuristic is self-reinforcing. Practitioners who evaluate LoRAs by metric alone will consistently prefer overtrained models. Only visual inspection, or more sophisticated evaluation methods, breaks the cycle.
5. Discussion
5.1 Why Smaller Datasets Look "Better" on Paper
Practitioners often report that smaller, highly-curated datasets produce "better" LoRAs. Our data shows this is an artifact of overtraining that flatters automated metrics:
- A 25-image dataset at batch size 1 trains for 400 epochs vs. 25 epochs for the 400-image dataset
- The model sees each training image 16× more often
- CLIP identity rises (~0.87–0.89) because the model's outputs converge toward the training distribution
- Yet LPIPS identity, the perceptual distance metric, shows almost no meaningful relationship with epoch count (R² = 0.011; adjusted R² < 0), indicating that the perceptual quality gains do not keep pace with CLIP-level gains
- Visual inspection (Section 4.6) further reveals that the highest-scoring run exhibits artifacts that identity metrics fail to capture
The asymmetry between CLIP and LPIPS responses to epoch count is the key signal. CLIP identity rises strongly with epochs (R² = 0.689) while LPIPS identity is essentially flat. This means overtraining inflates semantic similarity scores without proportional perceptual improvement, a form of shallow memorization that standard evaluation pipelines reward rather than penalize.
This doesn't mean small datasets are useless; curation quality matters enormously. But the raw "higher CLIP score = better model" heuristic is misleading when epoch counts differ by an order of magnitude.
5.2 Batch Size Is Nearly Irrelevant
The correlation of log(batch_size) with CLIP identity is −0.02, effectively zero. This directly challenges recommendations like "batch size = 4" from Hong & Choo (2025), which were derived from a different domain (architectural massing generation on SD 1.5) and clearly do not generalize to character LoRA training on anime-focused SDXL checkpoints. Notably, Hong & Choo themselves acknowledged this limitation, stating that they "did not analyze in detail the changes in model performance according to combinations of hyperparameter settings for Batch Size, Epoch, and Repeats."
Interestingly, Lee & Lee (2026) found that batch size does significantly affect LoRA fine-tuning of LLMs for mathematical reasoning, with an optimal batch size that varies by dataset scale. The discrepancy with our null result may reflect fundamental differences between language model fine-tuning (where batch size interacts with gradient noise and learning rate) and diffusion model LoRA training (where per-image exposure via epochs dominates). This suggests that batch size sensitivity is domain-dependent, consistent with Lee & Lee's observation that the optimal batch size is sensitive to dataset scale, the one factor for which they could not establish reliable transferability.
Batch size does have a practical effect: it determines VRAM requirements. Practitioners on limited hardware can safely use batch size 1 without meaningful quality loss.
5.3 The Gap Between Metrics and Visual Quality
Perhaps the most important finding is that automated metrics and visual quality diverge at high epoch counts. Run 14 achieves the best scores on CLIP identity, LPIPS identity, and FID, yet produces visibly inferior outputs compared to Run 3 (Section 4.6).
The divergence between CLIP and LPIPS behavior across epoch counts is particularly informative. CLIP identity increases strongly and monotonically with log(epochs) (r = +0.83), while LPIPS identity shows essentially no trend (r ≈ 0). This means that at extreme epoch counts, the model produces outputs that are numerically close to the reference distribution at the CLIP embedding level (matching high-level semantic features like hair color, clothing, and pose) without proportional improvement in the perceptual qualities that humans value, such as texture detail, anatomical consistency, and color fidelity.
This is a form of Goodhart's Law applied to LoRA evaluation: when CLIP identity becomes the optimization target (implicitly, through the choice of small datasets and many epochs), it ceases to be a reliable measure of actual model quality.
Future work should explore perceptual quality metrics (e.g., human preference models, FID at adequate sample sizes) that better align with visual assessment.
5.4 Limitations
- Single subject (n = 1 character): Results may not generalize to realistic faces, objects, concepts, or styles
- Overfitting proxy saturated at 1.0: The cosine > 0.35 threshold is too permissive for single-character focus; a continuous mean similarity would be more discriminative
- FID on 20 refs vs. 40 generations: Sample sizes are too small for reliable FID; treat these numbers as directional only
- No statistical significance tests: With n = 15 runs and small effect sizes, confidence intervals or permutation tests would strengthen claims
- Visual comparison is qualitative: The artifact assessment in Section 4.6 is based on the author's visual inspection and is inherently subjective. Structured human evaluation (e.g., forced-choice preference studies) would provide stronger evidence
- Diversity sampling: Intra-set diversity was computed from 120 randomly sampled pairs per run (not all possible pairs), introducing minor sampling noise
- Training data sourced from fan art: Copyright considerations limit direct academic publication
- Single training budget: All runs used a fixed total of 10,000 training samples. The interaction between dataset size and epoch count may differ at other budget levels
6. Practical Takeaways
Fix your total step budget, then tune dataset size to control epoch count. These results suggest that per-image exposure (epoch count = steps × batch size / dataset size), not batch size, is the primary tuning variable. If your dataset is small relative to your step budget, you're overtraining. Consider early stopping, reducing total steps, or using a larger dataset at fewer epochs.
Batch size barely matters for quality. Use whatever your GPU can handle. Batch size 1 is fine.
"Higher CLIP identity" ≠"better model." Always visually inspect outputs alongside metrics. If metrics climb but outputs degrade, you're optimizing for the wrong signal. In this study, CLIP identity rose strongly with epoch count while LPIPS identity remained flat. This means metric gains at high epochs reflected semantic convergence, not perceptual improvement.
Watch for CLIP/LPIPS asymmetry as an overtraining diagnostic. When CLIP identity rises substantially across configurations but LPIPS identity does not, the model is likely memorizing high-level features without learning fine perceptual detail. This asymmetry is more informative than either metric alone.
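A minimal sketch of that diagnostic, computed from results.csv (column names as in the results table):

```python
# CLIP/LPIPS asymmetry as an overtraining diagnostic: a large positive correlation
# of log(epochs) with CLIP identity alongside a near-zero correlation with LPIPS
# identity suggests memorization-style gains rather than perceptual improvement.
import numpy as np
import pandas as pd

df = pd.read_csv("results.csv")
log_epochs = np.log(df["epochs"])

r_clip = np.corrcoef(log_epochs, df["clip_identity"])[0, 1]
r_lpips = np.corrcoef(log_epochs, df["lpips_identity"])[0, 1]
print(f"r(log epochs, CLIP identity)  = {r_clip:+.2f}")   # ~ +0.83 in this study
print(f"r(log epochs, LPIPS identity) = {r_lpips:+.2f}")  # ~ 0 in this study
print("asymmetry gap:", round(abs(r_clip) - abs(r_lpips), 2))
```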
200-image datasets showed a strong balance between identity and diversity in this study, suggesting a practical middle ground: enough images for diversity, few enough epochs to avoid overtraining at typical budgets. More replication is needed.
7. Reproducibility
GitHub Repo: Link
Contains:
- analysis notebook and statistical evaluation
- prompts and generation settings
- aggregated results (results.csv)
- experimental notebooks and workflow details
Note: Training and reference images are not redistributed due to copyright restrictions.
8. Acknowledgments
This study uses the Kohya sd-scripts training framework via CivitAI's Dreambooth/LoRA Trainer wrapper, and the Illustrious SDXL checkpoint.
References
- Hong, S.M. & Choo, S. (2025). Systematic Parameter Optimization for LoRA-Based Architectural Massing Generation Using Diffusion Models. Buildings, 15(19), 3477.
- Lee, S. & Lee, J. (2026). Beware of the Batch Size: Hyperparameter Bias in Evaluating LoRA. arXiv:2602.09492.
- Hessel, J. et al. (2021). CLIPScore: A Reference-free Evaluation Metric for Image Captioning. EMNLP 2021.