Why Small LoRA Datasets Score Better (And Why That Is Misleading)
About This Project
This project investigates a widely repeated heuristic in LoRA training: that smaller datasets produce better results.
I designed a controlled experiment (15 runs, 5×3 grid) to isolate the effects of dataset size, batch size, and epoch count.
The results show that the perceived advantage of small datasets is largely explained by increased per-image exposure (epoch count), not better generalization.
TL;DR
Smaller datasets don't produce better LoRA models; they overtrain more.
When total training samples are held constant, reducing dataset size from 400 to 25 images causes epoch count to increase from 25 to 400.
The resulting "quality gains" in CLIP identity scores are driven by massively increased per-image exposure, not superior learning.
Batch size, the variable practitioners actually control, has almost zero independent effect on fidelity metrics. What looks like a dataset-size effect is actually a training exposure effect (mediated by epoch count).
Key result: CLIP identity correlates with log(epoch count) at r = +0.83, a stronger relationship than with dataset size or batch size alone. Practitioners should treat per-image exposure (epoch count) as the primary tuning variable, rather than relying on dataset-size or batch-size heuristics.
Diversity (intra-set LPIPS) also increases with epoch count (r = +0.57), but substantially more slowly than CLIP identity (r = +0.83). This asymmetric growth, strong identity gains without proportional diversity gains, is consistent with memorization rather than generalization.
Contributions
This work provides:
- A controlled experimental design isolating epoch count as a confound
- A reproducible LoRA evaluation pipeline
- Empirical evidence against batch-size-based tuning heuristics
- Visual evidence that metric dominance does not guarantee output quality, a Goodhart-like failure mode where optimization against standard metrics produces systematically worse visual outputs
1. Introduction
Low-Rank Adaptation (LoRA) has become the standard method for efficiently fine-tuning large diffusion models for specific identities, styles, and concepts. Despite widespread adoption, most training guidance remains anecdotal, with forum posts recommending specific batch sizes or dataset sizes without quantitative backing.
This study provides quantified trade-offs across 15 LoRA training runs spanning 5 dataset sizes (25, 50, 100, 200, 400 images) and 3 batch sizes (1, 2, 4), all with total training samples (steps × batch_size = 10,000) held constant. The key contributions are:
- Formalized evidence that batch size recommendations (e.g., batch size = 4 from Hong & Choo 2025) do not generalize across dataset sizes
- Statistical decomposition showing epoch count is the dominant predictor of CLIP identity, not dataset size or batch size independently
- A reproducible evaluation framework for future LoRA studies
- Evidence that smaller datasets produce higher identity scores through increased per-image exposure, with visual inspection revealing that top-scoring outputs can exhibit artifacts invisible to automated metrics
While this study focuses on a single character and training setup, the experimental design isolates a general mechanism (per-image exposure) that is likely to apply across LoRA training scenarios.
Why This Matters
Most LoRA training guides recommend specific dataset sizes or batch sizes without controlling for total training exposure. This creates a systematic bias: small datasets appear to produce higher-fidelity models, but only because they overtrain by an order of magnitude more epochs. Practitioners following this advice may unknowingly optimize for metric scores that don't reflect actual output quality.
This matters beyond aesthetics. LoRAs are orders of magnitude cheaper than full fine-tuning, making high-quality generative models accessible to small teams and opening applications in low-data domains like medical imaging. Understanding the boundary between personalization and overtraining is essential for responsible deployment and for correctly allocating limited compute budgets.
2. Pipeline Overview
This study implements a full end-to-end LoRA evaluation pipeline, from data curation through statistical analysis:
- Dataset curation and stratified subsampling - 400 images scraped and manually curated; five subsets created by successive random halving (400, 200, 100, 50, and 25 images)
- Controlled LoRA training - 15 runs across a 5×3 grid (dataset size × batch size), all with a fixed total step budget
- Deterministic prompt-based generation - 5 evaluation prompts × 8 seeds per run = 40 outputs per LoRA, all with identical seeds across runs
- Multi-metric evaluation - CLIP identity, LPIPS identity, Clean-FID, intra-set diversity, and an overfitting proxy
- Statistical analysis - Log-linear regression decomposing the effects of dataset size, batch size, and epoch count
This pipeline is designed to isolate causal relationships between training variables rather than report raw performance. All code, prompts, metrics, and outputs are publicly available (see Section 7).
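As a concrete illustration of the generation stage, the sketch below shows one way the deterministic prompt × seed grid could be produced with diffusers. The checkpoint file name, prompt texts, and inference settings are placeholders, not the study's exact configuration; the actual prompts and settings are in the repository (Section 7).

```python
import torch
from diffusers import StableDiffusionXLPipeline

# Load the base checkpoint once; each run's LoRA is loaded and unloaded in turn.
# File name and settings below are placeholders for illustration only.
pipe = StableDiffusionXLPipeline.from_single_file(
    "illustrious_sdxl.safetensors", torch_dtype=torch.float16
).to("cuda")

# 5 fixed evaluation prompts x 8 fixed seeds = 40 outputs per LoRA,
# with identical seeds across all 15 runs.
prompts = ["close-up", "canonical outfit with sword", "alternate outfit",
           "full body", "upper body"]  # placeholder prompt texts
seeds = [0, 1, 2, 3, 4, 5, 6, 7]       # placeholder seed values

def generate_for_lora(lora_path: str, out_dir: str) -> None:
    pipe.load_lora_weights(lora_path)
    for p_idx, prompt in enumerate(prompts):
        for seed in seeds:
            gen = torch.Generator(device="cuda").manual_seed(seed)
            image = pipe(prompt, generator=gen).images[0]
            image.save(f"{out_dir}/prompt{p_idx}_seed{seed}.png")
    pipe.unload_lora_weights()  # reset adapters before the next run's LoRA
```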
3. Methodology
3.1 Dataset
- Subject: Kiss-Shot Acerola-Orion Heart-Under-Blade from Monogatari, chosen because SDXL struggles with her elaborate appearance and name
- Training splits: Five subsets created by successive random halving of the full 400-image set (400, 200, 100, 50, and 25 images)
- Source: Danbooru, Pixiv, Fancaps (Kizumonogatari films), supplemented with Gelbooru, Pinterest
- Curation: Manual deletion of poor-quality/off-subject images, subfolder-based tagging for text/speech bubbles and dress details
- Prompts: Five evaluation cases (close-up, canonical outfit with sword, alternate outfit, full body, upper body), deterministic seeds per prompt
- Reference gallery: 20 renders for CLIP/LPIPS/FID comparisons
Due to copyright restrictions, the full dataset is not redistributed.
The dataset was constructed through manual curation and preprocessing (including filtering, cropping, and tagging), and is not reproducible from source URLs or automated pipelines. The reference image set used for evaluation is also not redistributed.
3.2 Training Configuration
All 15 LoRAs were trained on Illustrious SDXL using CivitAI's Dreambooth/LoRA Trainer (Kohya fork).
Constants across all runs:
| Parameter | Value |
|---|---|
| Checkpoint | Illustrious SDXL |
| Resolution | 2047 × 2047 (as reported by trainer; bucketing enabled) |
| Bucket Enabled | Yes |
| Shuffle Tags | Yes |
| Keep Tokens | 3 |
| Clip Skip | 2 |
| Flip Augmentation | Yes |
| Unet LR | 0.00012 |
| Text Encoder LR | 0.00005 |
| LR Scheduler | cosine_with_restarts (3 cycles) |
| Min SNR Gamma | 5 |
| Network Dim / Alpha | 32 / 16 |
| Optimizer | Adafactor |
Note on resolution: CivitAI's trainer allowed 2047 × 2047 as the resolution setting rather than the expected 2048 × 2048. The exact bucketing behavior under this setting is unclear; all runs used the same configuration.
Experimental grid:
| Data Size | Batch Size | Steps | Epochs | Run ID |
|---|---|---|---|---|
| 400 | 4 | 2,500 | 25 | 1 |
| 400 | 2 | 5,000 | 25 | 2 |
| 400 | 1 | 10,000 | 25 | 3 |
| 200 | 4 | 2,500 | 50 | 4 |
| 200 | 2 | 5,000 | 50 | 5 |
| 200 | 1 | 10,000 | 50 | 6 |
| 100 | 4 | 2,500 | 100 | 7 |
| 100 | 2 | 5,000 | 100 | 8 |
| 100 | 1 | 10,000 | 100 | 9 |
| 50 | 4 | 2,500 | 200 | 10 |
| 50 | 2 | 5,000 | 200 | 11 |
| 50 | 1 | 10,000 | 200 | 12 |
| 25 | 4 | 2,500 | 400 | 13 |
| 25 | 2 | 5,000 | 400 | 14 |
| 25 | 1 | 10,000 | 400 | 15 |
Critical design note: Because total training samples are held constant, epochs = (steps × batch_size) / dataset_size. Shrinking the dataset from 400→25 causes epochs to increase from 25→400 regardless of batch size, a 16× increase in per-image exposure.
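A small sketch makes the confound explicit; it reproduces the epochs column of the grid above directly from the fixed sample budget:

```python
# Epochs are fully determined by the fixed sample budget and the dataset size:
# steps * batch_size = 10,000 for every run, so epochs = 10,000 / data_size.
TOTAL_SAMPLES = 10_000

for data_size in (400, 200, 100, 50, 25):
    for batch_size in (4, 2, 1):
        steps = TOTAL_SAMPLES // batch_size
        epochs = (steps * batch_size) / data_size
        print(f"data={data_size:3d}  batch={batch_size}  steps={steps:6,d}  epochs={epochs:.0f}")
```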
3.3 Metrics
- CLIP identity: Max cosine similarity of generations vs. references (OpenCLIP ViT-H/14)
- LPIPS identity: Perceptual distance to nearest reference (VGG-based)
- Clean-FID: Distribution distance vs. 20-image reference set
- Intra-set diversity: Average LPIPS between generation pairs (120 sampled pairs/run)
- Overfitting proxy: Fraction of generated samples with cosine similarity > 0.35 to the training set
Important caveats on metrics:
- FID with 20 references vs. 40 generations is below the sample size where Clean-FID estimates are reliable. Treat FID rankings as directional only; small differences (e.g., 189 vs. 194) are likely within noise.
- The overfitting proxy saturated at 1.0 for all runs, meaning the cosine > 0.35 threshold is too permissive for single-character LoRAs. This metric provides no discriminative signal in this study. A continuous mean similarity score would be more informative for future work.
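For concreteness, the sketch below shows one way the CLIP identity and intra-set diversity metrics could be computed with open_clip and lpips. The pretrained weight tag, image preprocessing, and the aggregation over generations (mean of per-generation maxima) are assumptions; the study's exact evaluation code is in the repository.

```python
import itertools
import random

import torch
import open_clip
import lpips
import torchvision.transforms.functional as TF
from PIL import Image

device = "cuda"

# CLIP identity: max cosine similarity of each generation to the reference set,
# then (assumed) averaged over the 40 generations of a run.
clip_model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-H-14", pretrained="laion2b_s32b_b79k")  # pretrained tag is an assumption
clip_model = clip_model.to(device).eval()

def clip_embed(paths):
    batch = torch.stack([preprocess(Image.open(p).convert("RGB")) for p in paths]).to(device)
    with torch.no_grad():
        feats = clip_model.encode_image(batch)
    return feats / feats.norm(dim=-1, keepdim=True)

def clip_identity(gen_paths, ref_paths):
    sims = clip_embed(gen_paths) @ clip_embed(ref_paths).T  # [n_gen, n_ref] cosines
    return sims.max(dim=1).values.mean().item()

# Intra-set diversity: mean LPIPS over 120 randomly sampled generation pairs.
lpips_net = lpips.LPIPS(net="vgg").to(device)

def _lpips_input(path, size=512):
    img = Image.open(path).convert("RGB").resize((size, size))
    return (TF.to_tensor(img) * 2 - 1).unsqueeze(0).to(device)  # LPIPS expects [-1, 1]

def intra_set_diversity(gen_paths, n_pairs=120, seed=0):
    rng = random.Random(seed)
    pairs = rng.sample(list(itertools.combinations(gen_paths, 2)), n_pairs)
    with torch.no_grad():
        return sum(lpips_net(_lpips_input(a), _lpips_input(b)).item() for a, b in pairs) / n_pairs
```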
4. Results
4.1 Full Results Table
| run_id | data_size | batch_size | epochs | clip_identity | lpips_identity | diversity_lpips | fid |
|---|---|---|---|---|---|---|---|
| 1 | 400 | 4 | 25 | 0.8470 | 0.5267 | 0.5460 | 191.66 |
| 2 | 400 | 2 | 25 | 0.8430 | 0.5231 | 0.5290 | 207.70 |
| 3 | 400 | 1 | 25 | 0.8530 | 0.5237 | 0.5480 | 198.26 |
| 4 | 200 | 4 | 50 | 0.8650 | 0.5086 | 0.4970 | 193.53 |
| 5 | 200 | 2 | 50 | 0.8580 | 0.5082 | 0.5200 | 200.96 |
| 6 | 200 | 1 | 50 | 0.8550 | 0.5087 | 0.5240 | 191.87 |
| 7 | 100 | 4 | 100 | 0.8590 | 0.5157 | 0.5320 | 199.43 |
| 8 | 100 | 2 | 100 | 0.8530 | 0.5187 | 0.5360 | 210.29 |
| 9 | 100 | 1 | 100 | 0.8590 | 0.5188 | 0.5230 | 207.96 |
| 10 | 50 | 4 | 200 | 0.8580 | 0.5367 | 0.5640 | 189.12 |
| 11 | 50 | 2 | 200 | 0.8600 | 0.5176 | 0.5380 | 192.47 |
| 12 | 50 | 1 | 200 | 0.8680 | 0.5272 | 0.5600 | 189.48 |
| 13 | 25 | 4 | 400 | 0.8760 | 0.5291 | 0.5660 | 194.55 |
| 14 | 25 | 2 | 400 | 0.8860 | 0.5047 | 0.5400 | 189.06 |
| 15 | 25 | 1 | 400 | 0.8720 | 0.5214 | 0.5800 | 201.60 |
4.2 The Epoch Confound
The central finding: epoch count explains CLIP identity better than dataset size and batch size combined.
Log-linear regression results (n = 15 runs):
| Metric | Model A: data × batch (adjusted R²) | Model B: epochs alone (adjusted R²) | Winner |
|---|---|---|---|
| clip_identity | 0.6060 | 0.6666 | Epochs |
| lpips_identity | −0.2059 | −0.0650 | Epochs |
| fid | −0.1024 | 0.0022 | Epochs |
| diversity_lpips | 0.1534 | 0.2679 | Epochs |
Epochs alone wins on every metric. The full interaction model (data × batch) with 4 parameters cannot beat a single-predictor epochs model on adjusted R². This indicates that previously reported dataset-size effects in LoRA training may be largely attributable to uncontrolled variation in epoch count.
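The model comparison above can be reproduced from the aggregated results; a minimal sketch with statsmodels, assuming the column names shown in the results table (the analysis notebook in the repository is the authoritative version):

```python
# Compare adjusted R² of the data×batch interaction model (Model A) against a
# single-predictor log(epochs) model (Model B) for each metric.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("results.csv")

for metric in ["clip_identity", "lpips_identity", "fid", "diversity_lpips"]:
    model_a = smf.ols(f"{metric} ~ np.log(data_size) * np.log(batch_size)", data=df).fit()
    model_b = smf.ols(f"{metric} ~ np.log(epochs)", data=df).fit()
    print(f"{metric:16s}  A (data x batch) adj R2 = {model_a.rsquared_adj:+.4f}   "
          f"B (epochs) adj R2 = {model_b.rsquared_adj:+.4f}")
```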
Key correlations:
- log(epochs) → CLIP identity: +0.83 (strong positive)
- log(data_size) → CLIP identity: −0.83 (strong negative but this is the same signal, inverted)
- log(batch_size) → CLIP identity: −0.02 (essentially zero)
Because the total sample budget is fixed, epochs = 10,000 / dataset_size, so log(epoch count) and log(dataset size) are perfectly collinear by design; the equal-magnitude, opposite-sign correlations with CLIP identity (±0.83) are the same underlying factor, total per-image exposure, viewed from two directions rather than independent effects. The gap between the epoch–identity correlation (r = +0.83) and the epoch–diversity correlation (r = +0.57) further indicates that increased training exposure disproportionately improves semantic similarity relative to output variation, a pattern consistent with partial memorization rather than generalization.
Notably, LPIPS identity, the perceptual distance metric, shows minimal systematic relationship with epoch count (R² = 0.011), while CLIP identity shows a strong one (R² = 0.689). This asymmetry is important: as epoch count increases, CLIP-level similarity rises substantially but perceptual similarity barely changes. The model learns to produce outputs that match the reference distribution in CLIP embedding space without proportional improvement in pixel-level perceptual fidelity. This decoupling between semantic similarity (CLIP) and perceptual fidelity (LPIPS) indicates that increased training exposure primarily drives representation alignment rather than visual quality improvement.
4.3 Run 14 Leads on Automated Metrics, But Metrics Aren't Everything
The smallest dataset size produced the single highest-scoring run on identity metrics, but the picture is more nuanced than it first appears.
Run 14 (25 images, batch size 2, 400 epochs) achieves:
| Metric | Run 14 Value | Rank (of 15) | Direction |
|---|---|---|---|
| CLIP identity | 0.886 | 1st | ↑ higher = better |
| LPIPS identity | 0.5047 | 1st | ↓ lower = better |
| FID vs refs | 189.06 | 1st | ↓ lower = better |
| Diversity LPIPS | 0.540 | 7th | ↑ higher = better |
Run 14 dominates on raw identity and distributional metrics. However, the asymmetry between its CLIP and LPIPS gains is telling: its CLIP identity is the highest of all runs by a wide margin (0.886 vs. the 400-image group mean of 0.848), yet its LPIPS identity improvement is much smaller in relative terms. This asymmetric response suggests the model is optimizing for CLIP-aligned features (high-level semantics) without improving perceptual fidelity, a signature of metric-aligned overfitting.
The best-per-group breakdown reinforces this pattern:
| Data Size | Epochs | Best CLIP Identity | Best LPIPS Identity | Best FID |
|---|---|---|---|---|
| 25 | 400 | 0.886 (run 14) | 0.5047 (run 14) | 189.06 (run 14) |
| 50 | 200 | 0.868 (run 12) | 0.5176 (run 11) | 189.12 (run 10) |
| 100 | 100 | 0.859 (run 7/9) | 0.5157 (run 7) | 199.43 (run 7) |
| 200 | 50 | 0.865 (run 4) | 0.5082 (run 5) | 191.87 (run 6) |
| 400 | 25 | 0.853 (run 3) | 0.5231 (run 2) | 191.66 (run 1) |
On identity metrics alone, this looks like an unambiguous win for small, high-epoch training. But automated metrics only capture part of the picture. As we show in Section 4.6, visual inspection reveals that Run 14's metric dominance does not translate to superior output quality.
Two important caveats apply to Run 14's apparent FID dominance:
- FID on 20 references vs. 40 generations is statistically unreliable. The difference between Run 14's FID (189.06) and Run 1's (191.66) is likely within noise at these sample sizes.
- The overfitting proxy saturated at 1.0 for all runs, so we have no working automated memorization detector in this study.
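The best-per-group breakdown above can be regenerated from results.csv; a small pandas sketch, assuming the column names shown in the full results table:

```python
# Best run per dataset-size group on each metric (higher CLIP identity is better,
# lower LPIPS identity and FID are better). Column names as in results.csv.
import pandas as pd

df = pd.read_csv("results.csv")
rows = []
for data_size, g in df.groupby("data_size"):
    rows.append({
        "data_size": data_size,
        "epochs": g["epochs"].iloc[0],
        "best_clip": g.loc[g["clip_identity"].idxmax(), ["run_id", "clip_identity"]].to_dict(),
        "best_lpips": g.loc[g["lpips_identity"].idxmin(), ["run_id", "lpips_identity"]].to_dict(),
        "best_fid": g.loc[g["fid"].idxmin(), ["run_id", "fid"]].to_dict(),
    })
print(pd.DataFrame(rows).to_string(index=False))
```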
4.4 Heatmaps
Figure 2: Metric heatmaps showing the interaction between dataset size and batch size. Each cell shows the metric value for that combination. Note how the dominant gradient runs top-to-bottom (dataset size / epochs) rather than left-to-right (batch size).
4.5 Partial Effects
Figure 3: Partial effect of dataset size (marginalized over batch size) on each metric. Error bars show ±1 SD across batch sizes.
Figure 4: Partial effect of batch size (marginalized over dataset size). Note the flat or near-flat slopes: batch size has minimal independent effect.
Figure 5: Partial effect of epoch count, the true confound variable. CLIP identity climbs steadily with epochs while other metrics remain relatively stable.
4.6 Visual Inspection: When Metrics Lie
Run #14 Data Size = 25, Batch = 2, Steps = 5,000
Run #3 Data Size = 400, Batch = 1, Steps = 10,000
Figure 6: Side-by-side comparison of outputs from Run 14 (25 images, 400 epochs, highest metric scores) and Run 3 (400 images, 25 epochs, lower metric scores). Despite dominating on every automated metric, Run 14 exhibits visible artifacting (edge ringing / oversharpening, anatomical inconsistencies, feature repetition, loss of fine detail). Run 3 produces outputs that demonstrate learned understanding of the character's features, proportions, and stylistic variation.
This is the critical disconnect: In our evaluation, Run 14 wins on metrics but loses on visual quality. The model has learned to produce outputs that are numerically similar to the reference set (high CLIP cosine similarity, low LPIPS distance) without faithfully reproducing what a human would recognize as a high-quality rendition of the character.
This finding has two implications:
- CLIP identity and LPIPS are necessary but not sufficient for evaluating LoRA quality. A model that scores well on these metrics may still produce visually unsatisfying outputs, especially when epoch counts are extreme.
- The "small datasets are better" heuristic is self-reinforcing. Practitioners who evaluate LoRAs by metric alone will consistently prefer overtrained models. Only visual inspection, or more sophisticated evaluation methods, breaks the cycle.
5. Discussion
5.1 Why Smaller Datasets Look "Better" on Paper
Practitioners often report that smaller, highly-curated datasets produce "better" LoRAs. Our data shows this is an artifact of overtraining that flatters automated metrics:
- A 25-image dataset at batch size 1 trains for 400 epochs vs. 25 epochs for the 400-image dataset
- The model sees each training image 16× more often
- CLIP identity rises (~0.87–0.89) because the model's outputs converge toward the training distribution
- Yet LPIPS identity, the perceptual distance metric, shows almost no meaningful relationship with epoch count (R² = 0.011; adjusted R² < 0), indicating that the perceptual quality gains do not keep pace with CLIP-level gains
- Visual inspection (Section 4.6) further reveals that the highest-scoring run exhibits artifacts that identity metrics fail to capture
The asymmetry between CLIP and LPIPS responses to epoch count is the key signal. CLIP identity rises strongly with epochs (R² = 0.689) while LPIPS identity is essentially flat. This means overtraining inflates semantic similarity scores without proportional perceptual improvement, a form of shallow memorization that standard evaluation pipelines reward rather than penalize.
This doesn't mean small datasets are useless; curation quality matters enormously. But the raw "higher CLIP score = better model" heuristic is misleading when epoch counts differ by an order of magnitude.
5.2 Batch Size Is Nearly Irrelevant
The correlation of log(batch_size) with CLIP identity is −0.02, effectively zero. This directly challenges recommendations like "batch size = 4" from Hong & Choo (2025), which were derived from a different domain (architectural massing generation on SD 1.5) and clearly do not generalize to character LoRA training on anime-focused SDXL checkpoints. Notably, Hong & Choo themselves acknowledged this limitation, stating that they "did not analyze in detail the changes in model performance according to combinations of hyperparameter settings for Batch Size, Epoch, and Repeats."
Interestingly, Lee & Lee (2026) found that batch size does significantly affect LoRA fine-tuning of LLMs for mathematical reasoning, with an optimal batch size that varies by dataset scale. The discrepancy with our null result may reflect fundamental differences between language model fine-tuning (where batch size interacts with gradient noise and learning rate) and diffusion model LoRA training (where per-image exposure via epochs dominates). This suggests that batch size sensitivity is domain-dependent, consistent with Lee & Lee's observation that the optimal batch size is sensitive to dataset scale, the one factor for which they could not establish reliable transferability.
Batch size does have a practical effect: it determines VRAM requirements. Practitioners on limited hardware can safely use batch size 1 without meaningful quality loss.
5.3 The Gap Between Metrics and Visual Quality
Perhaps the most important finding is that automated metrics and visual quality diverge at high epoch counts. Run 14 achieves the best scores on CLIP identity, LPIPS identity, and FID, yet produces visibly inferior outputs compared to Run 3 (Section 4.6).
The divergence between CLIP and LPIPS behavior across epoch counts is particularly informative. CLIP identity increases strongly and monotonically with log(epochs) (r = +0.83), while LPIPS identity shows essentially no trend (r ≈ 0). This means that at extreme epoch counts, the model produces outputs that are numerically close to the reference distribution at the CLIP embedding level (matching high-level semantic features like hair color, clothing, and pose) without proportional improvement in the perceptual qualities that humans value, such as texture detail, anatomical consistency, and color fidelity.
This is a form of Goodhart's Law applied to LoRA evaluation: when CLIP identity becomes the optimization target (implicitly, through the choice of small datasets and many epochs), it ceases to be a reliable measure of actual model quality.
Future work should explore perceptual quality metrics (e.g., human preference models, FID at adequate sample sizes) that better align with visual assessment.
5.4 Limitations
- Single subject (n = 1 character): Results may not generalize to realistic faces, objects, concepts, or styles
- Overfitting proxy saturated at 1.0: The cosine > 0.35 threshold is too permissive for single-character focus; a continuous mean similarity would be more discriminative
- FID on 20 refs vs. 40 generations: Sample sizes are too small for reliable FID; treat these numbers as directional only
- No statistical significance tests: With n = 15 runs and small effect sizes, confidence intervals or permutation tests would strengthen claims
- Visual comparison is qualitative: The artifact assessment in Section 4.6 is based on the author's visual inspection and is inherently subjective. Structured human evaluation (e.g., forced-choice preference studies) would provide stronger evidence
- Diversity sampling: Intra-set diversity was computed from 120 randomly sampled pairs per run (not all possible pairs), introducing minor sampling noise
- Training data sourced from fan art: Copyright considerations limit direct academic publication
- Single training budget: All runs used a fixed total of 10,000 training samples. The interaction between dataset size and epoch count may differ at other budget levels
6. Practical Takeaways
Fix your total step budget, then tune dataset size to control epoch count. These results suggest that per-image exposure (epoch count = steps × batch size / dataset size), not batch size, is the primary tuning variable. If your dataset is small relative to your step budget, you're overtraining. Consider early stopping, reducing total steps, or using a larger dataset at fewer epochs.
Batch size barely matters for quality. Use whatever your GPU can handle. Batch size 1 is fine.
"Higher CLIP identity" ≠"better model." Always visually inspect outputs alongside metrics. If metrics climb but outputs degrade, you're optimizing for the wrong signal. In this study, CLIP identity rose strongly with epoch count while LPIPS identity remained flat. This means metric gains at high epochs reflected semantic convergence, not perceptual improvement.
Watch for CLIP/LPIPS asymmetry as an overtraining diagnostic. When CLIP identity rises substantially across configurations but LPIPS identity does not, the model is likely memorizing high-level features without learning fine perceptual detail. This asymmetry is more informative than either metric alone.
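A minimal sketch of that diagnostic, computed from results.csv (column names as in the results table):

```python
# CLIP/LPIPS asymmetry as an overtraining diagnostic: a large positive correlation
# of log(epochs) with CLIP identity alongside a near-zero correlation with LPIPS
# identity suggests memorization-style gains rather than perceptual improvement.
import numpy as np
import pandas as pd

df = pd.read_csv("results.csv")
log_epochs = np.log(df["epochs"])

r_clip = np.corrcoef(log_epochs, df["clip_identity"])[0, 1]
r_lpips = np.corrcoef(log_epochs, df["lpips_identity"])[0, 1]
print(f"r(log epochs, CLIP identity)  = {r_clip:+.2f}")   # ~ +0.83 in this study
print(f"r(log epochs, LPIPS identity) = {r_lpips:+.2f}")  # ~ 0 in this study
print("asymmetry gap:", round(abs(r_clip) - abs(r_lpips), 2))
```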
200-image datasets showed a strong balance between identity and diversity in this study, suggesting a practical middle ground: enough images for diversity, few enough epochs to avoid overtraining at typical budgets. More replication is needed.
7. Reproducibility
GitHub Repo: Link
Contains:
- analysis notebook and statistical evaluation
- prompts and generation settings
- aggregated results (results.csv)
- experimental notebooks and workflow details
Note: Training and reference images are not redistributed due to copyright restrictions.
8. Acknowledgments
This study uses the Kohya sd-scripts training framework via CivitAI's Dreambooth/LoRA Trainer wrapper, and the Illustrious SDXL checkpoint.
References
- Hong, S.M. & Choo, S. (2025). Systematic Parameter Optimization for LoRA-Based Architectural Massing Generation Using Diffusion Models. Buildings, 15(19), 3477.
- Lee, S. & Lee, J. (2026). Beware of the Batch Size: Hyperparameter Bias in Evaluating LoRA. arXiv:2602.09492.
- Hessel, J. et al. (2021). CLIPScore: A Reference-free Evaluation Metric for Image Captioning. EMNLP 2021.