**Advancing structured attribute extraction from garment images through multi-stage reinforcement learning**

[](https://huggingface.co/Denali-AI)
[](https://huggingface.co/datasets/Denali-AI/eval-hard-3500)
[](https://www.apache.org/licenses/LICENSE-2.0)
[](https://huggingface.co/Denali-AI/qwen3-vl-8b-garment-classifier)

</div>

Denali AI develops and benchmarks vision-language models (VLMs) for **structured garment attribute extraction** – the task of analyzing a garment image and producing a complete JSON object describing 9 key attributes: type, color, pattern, neckline, sleeve length, closure, brand, size, and defect type.
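For illustration, a hypothetical instance of the target output; the nine keys mirror the attribute list above, but the exact canonical key names used by the project are an assumption:

```python
import json

# Hypothetical prediction for one garment image. The nine keys follow the
# attribute list above; the project's exact JSON key spellings may differ.
raw = """{
  "type": "t-shirt",
  "color": "navy blue",
  "pattern": "solid",
  "neckline": "crew",
  "sleeve_length": "short",
  "closure": "none",
  "brand": "unknown",
  "size": "M",
  "defect_type": null
}"""

record = json.loads(raw)
print(len(record))  # 9
```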

We systematically evaluate the impact of **supervised fine-tuning (SFT)**, **Group Relative Policy Optimization (GRPO)**, and **Group-relative Trajectory-based Policy Optimization (GTPO)** across multiple model architectures (Qwen3-VL, Qwen3.5-VL, InternVL3, Florence-2, Moondream2, Phi-4) and scales (1.6B to 122B parameters). Our best model, **[Qwen3-VL-8B SFT+GRPO](https://huggingface.co/Denali-AI/qwen3-vl-8b-garment-classifier)**, achieves **91.3% weighted score** with **100% JSON parse rate** on the eval_hard_3500 benchmark.

---


| Rank | Model | Architecture | Params | Training | Weighted | SBERT+NLI | JSON Parse | Throughput |
|:----:|-------|:------------:|:------:|:--------:|:--------:|:---------:|:----------:|:----------:|
| 1 | [Qwen3-VL-8B SFT+GRPO](https://huggingface.co/Denali-AI/qwen3-vl-8b-garment-classifier) | Qwen3-VL | 8B | SFT+GRPO | 91.3% | 78.7% | 100% | 7.5/s |
| 2 | [Qwen3-VL-2B-SFT-GRPO-v9](https://huggingface.co/Denali-AI/qwen3-vl-2b-sft-grpo-v9) | Qwen3-VL | 2B | SFT+GRPO | 89.5% | 78.5% | 100% | 15.9/s |
| 3 | [Qwen3-VL-8B SFT+GRPO NVFP4](https://huggingface.co/Denali-AI/qwen3-vl-8b-garment-classifier-nvfp4) | Qwen3-VL | 8B | SFT+GRPO | 89.5% | 77.0% | 100% | 12.1/s |
| 4 | [Qwen3-VL-8B-Instruct-Base](https://huggingface.co/Denali-AI/qwen3-vl-8b-instruct-base) | Qwen3-VL | 8B | Zero-shot | 87.5% | 75.6% | 100% | 5.5/s |
| 5 | [Qwen3-VL-8B-Instruct NVFP4](https://huggingface.co/Denali-AI/qwen3-vl-8b-instruct-nvfp4) | Qwen3-VL | 8B | Zero-shot | 87.2% | 75.0% | 100% | 8.2/s |
| 6 | [Qwen3.5-VL-2B Base](https://huggingface.co/Denali-AI/qwen35-2b-base) | Qwen3.5-VL | 2B | Zero-shot | 84.4% | 73.0% | 100% | 6.6/s |
| 7 | [Qwen3-VL-2B SFT+GRPO v9 NVFP4](https://huggingface.co/Denali-AI/qwen3-vl-2b-sft-grpo-v9-nvfp4) | Qwen3-VL | 2B | SFT+GRPO | 84.2% | 74.1% | 100% | 17.2/s |
| 8 | [Qwen3-VL-2B-Instruct Base](https://huggingface.co/Denali-AI/qwen3-vl-2b-instruct-base) | Qwen3-VL | 2B | Zero-shot | 76.4% | 66.7% | 100% | 15.1/s |
| 9 | [InternVL3-2B GRPO+GTPO Full](https://huggingface.co/Denali-AI/internvl3-2b-grpo-gtpo-full) | InternVL3 | 2B | GRPO+GTPO | 72.7% | 64.3% | 100% | 11.8/s |
| 10 | [InternVL3-2B GRPO+GTPO FP8](https://huggingface.co/Denali-AI/internvl3-2b-grpo-gtpo-fp8) | InternVL3 | 2B | GRPO+GTPO | 72.2% | 63.8% | 100% | 14.3/s |
| 11 | [InternVL3-2B Base](https://huggingface.co/Denali-AI/internvl3-2b-base) | InternVL3 | 2B | Zero-shot | 71.8% | 63.7% | 100% | 11.8/s |
| 12 | [Moondream2 Base](https://huggingface.co/Denali-AI/moondream2-base) | Moondream2 | 1.6B | Zero-shot | 69.8% | 61.8% | 100% | 1.4/s |
| 13 | [Qwen3.5-VL-2B SFT+GRPO+GTPO](https://huggingface.co/Denali-AI/qwen35-2b-sft-grpo-gtpo-merged) | Qwen3.5-VL | 2B | SFT+GRPO+GTPO | 65.3% | 60.1% | 100% | 11.3/s |
| 14 | [Qwen3.5-VL-2B SFT](https://huggingface.co/Denali-AI/qwen35-2b-sft-merged) | Qwen3.5-VL | 2B | SFT | 63.7% | 58.9% | 100% | 11.6/s |
| 15 | [Qwen3.5-VL-35B GPTQ-Int4](https://huggingface.co/Denali-AI/qwen35-35b-a3b-gptq-int4) | Qwen3.5-VL MoE | 35B (3B) | Zero-shot | 50.7% | 48.7% | 14% | 1.6/s |
| 16 | Qwen3.5-VL-9B NVFP4 | Qwen3.5-VL | 9B | Zero-shot | 47.0% | 46.0% | 8% | 1.7/s |
| 17 | [Qwen3.5-VL-9B SFT NVFP4](https://huggingface.co/Denali-AI/qwen35-9b-sft-nvfp4) | Qwen3.5-VL | 9B | SFT | 46.3% | 45.5% | 8% | 1.7/s |
| 18 | Qwen3.5-VL-2B Base NVFP4 | Qwen3.5-VL | 2B | Zero-shot | 42.9% | 42.9% | 0% | 4.0/s |
| 19 | [Qwen3.5-VL-122B NVFP4](https://huggingface.co/Denali-AI/qwen35-122b-a10b-nvfp4) | Qwen3.5-VL MoE | 122B (10B) | Zero-shot | 42.9% | 42.9% | 0% | 1.2/s |
| 20 | [Qwen3.5-VL-2B SFT NVFP4](https://huggingface.co/Denali-AI/qwen35-2b-sft-nvfp4) | Qwen3.5-VL | 2B | SFT | 42.9% | 42.9% | 0% | 4.0/s |
| 21 | Qwen3.5-VL-2B SFT+GRPO+GTPO NVFP4 | Qwen3.5-VL | 2B | SFT+GRPO+GTPO | 42.9% | 42.9% | 0% | 3.9/s |
| 22 | [Phi-4-Multimodal NVFP4](https://huggingface.co/microsoft/Phi-4-multimodal-instruct) | Phi-4 | 5.6B | Zero-shot | 42.9% | 42.9% | 0% | – |

> **Note:** Models ranked 18-22 have 0% JSON parse rate under NVFP4 quantization, meaning they cannot produce valid structured output – their weighted scores reflect the 42.9% floor from partial field matches in malformed outputs. Fine-tuning is required to unlock their potential.

---


**Key finding:** Qwen3-VL-2B v9 NVFP4 achieves the best accuracy-throughput trade-off at 84.2% weighted score and 17.2 samples/s, making it the Pareto-optimal choice for production deployment. For maximum accuracy, Qwen3-VL-8B SFT+GRPO reaches 91.3% at 7.5 samples/s.

### Structured Output Reliability

| Collection | Models | Description |
|------------|:------:|-------------|
| [**Qwen3-VL**](https://huggingface.co/collections/Denali-AI/qwen3-vl-models-69c70950fca01f437228c29b) | 7 | Top-performing Qwen3-VL based models (2B and 8B) |
| [**Qwen3.5-VL**](https://huggingface.co/collections/Denali-AI/qwen35-vl-models-69c70802ab21ae73a116cc92) | 10 | Qwen3.5-VL models (0.8B to 122B) |
| [**InternVL3**](https://huggingface.co/collections/Denali-AI/internvl3-models-69c70803ab21ae73a116cca2) | 6 | InternVL3 models (1B, 2B) |
| [**Florence-2**](https://huggingface.co/collections/Denali-AI/florence-2-models-69c70802f1456fd2264216e8) | 3 | Florence-2 encoder-decoder models |
| [**Benchmarks**](https://huggingface.co/collections/Denali-AI/benchmarks-and-datasets-69c708037d77aba79963c1a7) | 2 | Evaluation and training datasets |

### Metrics

We employ a **comprehensive multi-metric evaluation framework** rather than relying on exact match. Each metric captures a different dimension of prediction quality:

| Metric | Model | Description |
|--------|-------|-------------|
| **SBERT+NLI Combined** | – | Primary metric: average of SBERT cosine and NLI |
| **Weighted Score** | – | Field-weighted aggregate (see weights above) |

<details>
<summary><b>Metric Definitions (click to expand)</b></summary>

#### SBERT Cosine Similarity
Measures how semantically close the predicted value is to the ground truth by encoding both strings into dense vector embeddings using the [all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) sentence-transformer model and computing their cosine similarity. A score of 1.0 means the embeddings are identical in direction (semantically equivalent), while 0.0 means they are orthogonal (unrelated). This captures meaning-level similarity – for example, "navy blue" and "dark blue" score high despite being different strings. Values are thresholded: scores above 0.85 map to full credit, scores below 0.50 map to zero, and values in between are linearly scaled.
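The banded mapping can be sketched as a small helper. In practice the cosine comes from all-MiniLM-L6-v2 embeddings; the thresholding step itself is model-free, with the threshold values taken from the description above:

```python
def sbert_band_score(cosine: float) -> float:
    """Map a raw SBERT cosine similarity into the banded credit score."""
    if cosine >= 0.85:   # high similarity: full credit
        return 1.0
    if cosine < 0.50:    # low similarity: no credit
        return 0.0
    # linear scaling between the two thresholds
    return (cosine - 0.50) / (0.85 - 0.50)
```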

#### NLI Score (Natural Language Inference)
Uses a cross-encoder NLI model ([nli-MiniLM2-L6-H768](https://huggingface.co/cross-encoder/nli-MiniLM2-L6-H768)) to determine whether the predicted value *entails*, *contradicts*, or is *neutral* to the ground truth. The model scores the two values as a premise-hypothesis pair (e.g., "the color is navy blue" vs "the color is dark blue"). Entailment probability above 0.6 yields a score of at least 0.8; contradiction probability above 0.6 heavily penalizes the score (scaled down to 30% of base). This metric is particularly valuable for detecting semantic contradictions that string-level metrics would miss – e.g., "long sleeve" vs "short sleeve" are textually similar but semantically opposite.
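A sketch of just the adjustment step, assuming the NLI cross-encoder has already produced entailment and contradiction probabilities (thresholds as stated above):

```python
def nli_adjust(base_score: float, p_entail: float, p_contra: float) -> float:
    """Adjust a base similarity score using NLI class probabilities."""
    if p_contra > 0.6:                 # strong contradiction: heavy penalty
        return 0.3 * base_score
    if p_entail > 0.6:                 # strong entailment: floor at 0.8
        return max(base_score, 0.8)
    return base_score                  # neutral: leave unchanged
```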

#### Levenshtein Ratio
Computes the normalized edit distance between the predicted and ground-truth strings (after lowercasing and stripping). The ratio is `1 - (edit_distance / max_length)`, ranging from 0.0 (completely different) to 1.0 (identical). This character-level metric catches minor spelling variations and typos – for example, "pullover" vs "pull-over" score nearly 1.0. It complements the semantic metrics by providing a surface-level similarity signal that is model-free and deterministic.
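A minimal, dependency-free sketch of this ratio (real implementations typically delegate the edit distance to a library such as `rapidfuzz`):

```python
def levenshtein_ratio(pred: str, truth: str) -> float:
    """1 - edit_distance / max_length, on lowercased, stripped strings."""
    a, b = pred.lower().strip(), truth.lower().strip()
    if not a and not b:
        return 1.0
    # classic dynamic-programming edit distance, one row at a time
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return 1.0 - prev[-1] / max(len(a), len(b))
```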

#### Token F1
Computes token-level precision and recall by treating the predicted and ground-truth strings as bags of whitespace-delimited tokens. Precision is the fraction of predicted tokens that appear in the ground truth; recall is the fraction of ground-truth tokens that appear in the prediction. F1 is their harmonic mean. This metric handles multi-word values well – "light blue cotton" vs "blue cotton" gets partial credit for the overlapping tokens, unlike exact match, which would score 0. It is particularly useful for defect descriptions and color fields where partial matches are meaningful.
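A sketch of the computation, using sets for brevity (a true bag/multiset count differs only when tokens repeat within a value):

```python
def token_f1(pred: str, truth: str) -> float:
    """Harmonic mean of token-level precision and recall."""
    p_tokens = set(pred.lower().split())
    t_tokens = set(truth.lower().split())
    if not p_tokens or not t_tokens:
        return 1.0 if p_tokens == t_tokens else 0.0
    overlap = len(p_tokens & t_tokens)
    if overlap == 0:
        return 0.0
    precision = overlap / len(p_tokens)
    recall = overlap / len(t_tokens)
    return 2 * precision * recall / (precision + recall)
```

For "light blue cotton" vs "blue cotton", precision is 2/3 and recall is 1, giving F1 = 0.8.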

#### SBERT+NLI Combined
The **primary evaluation metric** used for ranking models. It combines SBERT cosine similarity and NLI scoring in a cascaded approach inspired by the training reward engine: first, the SBERT cosine score is mapped to a base score (1.0 if cosine >= 0.85, linearly scaled between 0.50-0.85, 0.0 below 0.50). Then, NLI adjusts this base: if the NLI model detects strong entailment (>0.6), the score is boosted to at least 0.8; if it detects strong contradiction (>0.6), the score is reduced to 30% of the base. This two-stage approach leverages both embedding similarity and logical inference for robust evaluation.
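Putting the two stages together, a self-contained sketch of the cascade; the thresholds come from the description above, and the cosine and NLI probabilities would come from the two models named earlier:

```python
def sbert_nli_combined(cosine: float, p_entail: float, p_contra: float) -> float:
    """Two-stage score: banded SBERT cosine, then NLI adjustment."""
    # Stage 1: band the cosine similarity into a base score.
    if cosine >= 0.85:
        base = 1.0
    elif cosine < 0.50:
        base = 0.0
    else:
        base = (cosine - 0.50) / 0.35
    # Stage 2: adjust with NLI class probabilities.
    if p_contra > 0.6:
        return 0.3 * base          # strong contradiction: scale to 30%
    if p_entail > 0.6:
        return max(base, 0.8)      # strong entailment: boost to at least 0.8
    return base
```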

#### Weighted Score
The **headline metric** for model comparison. It multiplies each field's SBERT+NLI Combined score by its domain-specific importance weight (type=2.5x, defect=2.0x, brand=1.5x, size=1.5x, others=1.0x) and normalizes by the total weight. This reflects real-world value – correctly identifying garment type and defects matters more than getting the closure style right. A hallucination (predicting a value when the ground truth is null) incurs a -0.3 penalty to discourage false positives. The weighted score ranges from 0% to 100%, with our best model achieving 91.3%.
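The weighting and penalty can be sketched as follows; `field_scores` holds each field's SBERT+NLI Combined score, and the weight table mirrors the multipliers above. Applying the -0.3 penalty per hallucinated field (rather than once on the aggregate) and clamping to [0, 1] are assumptions of this sketch:

```python
FIELD_WEIGHTS = {
    "type": 2.5, "defect": 2.0, "brand": 1.5, "size": 1.5,
    # all remaining fields default to 1.0
}
HALLUCINATION_PENALTY = -0.3  # predicted a value where ground truth is null

def weighted_score(field_scores: dict, hallucinated: frozenset = frozenset()) -> float:
    """Field-weighted aggregate of per-field scores, clamped to [0, 1]."""
    total_w = sum(FIELD_WEIGHTS.get(f, 1.0) for f in field_scores)
    acc = 0.0
    for field, score in field_scores.items():
        if field in hallucinated:
            score += HALLUCINATION_PENALTY
        acc += FIELD_WEIGHTS.get(field, 1.0) * score
    return max(0.0, min(1.0, acc / total_w))
```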

#### JSON Parse Rate
The percentage of model outputs that are valid, parseable JSON objects. Fine-tuned models achieve 100%; zero-shot models often fail at 0-14%. This is a binary pass/fail gate – if the output cannot be parsed as JSON, all field scores for that sample are 0.
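The gate is a plain `json.loads` check; one way to sketch it:

```python
import json

def parse_gate(raw_output: str):
    """Return the parsed object if the output is a valid JSON object,
    otherwise None (all field scores for that sample become 0)."""
    try:
        obj = json.loads(raw_output)
    except json.JSONDecodeError:
        return None
    return obj if isinstance(obj, dict) else None

def parse_rate(outputs: list) -> float:
    """Fraction of raw outputs that pass the gate."""
    return sum(parse_gate(o) is not None for o in outputs) / len(outputs)
```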

#### Throughput
End-to-end inference speed measured in samples per second, including network overhead, across 8 concurrent workers hitting a vLLM server. Higher throughput indicates better production viability. Measured on NVIDIA RTX PRO 6000 Blackwell (98 GB VRAM).
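A sketch of the concurrent measurement loop, assuming an `infer` callable that sends one sample to the serving endpoint (the vLLM client call itself is elided here):

```python
import time
from concurrent.futures import ThreadPoolExecutor

def measure_throughput(infer, samples, workers: int = 8) -> float:
    """End-to-end samples/second across `workers` concurrent workers."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # block until every sample has round-tripped through the server
        list(pool.map(infer, samples))
    elapsed = time.perf_counter() - start
    return len(samples) / elapsed
```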

</details>
This multi-metric approach captures semantic similarity rather than requiring exact string matches, which is critical for fields like color ("navy blue" vs "dark blue") and defect descriptions.

### Evaluation Protocol
## Key Findings

1. **Qwen3-VL-8B SFT+GRPO is the new champion at 91.3%.** Fine-tuning the 8B model with SFT+GRPO surpasses the previous best (2B v9 at 89.5%) while maintaining 100% JSON parse rate.

2. **Architecture matters more than scale.** The 2B Qwen3-VL (89.5%) outperforms the 35B Qwen3.5 MoE (50.7%) by a wide margin, and even the zero-shot Qwen3-VL-8B (87.5%) outperforms all fine-tuned Qwen3.5-VL models.

3. **SFT is non-negotiable for structured output.** All fine-tuned models achieve 100% JSON parse rate; all zero-shot NVFP4/GPTQ models fail at 0-14%. No amount of model scale compensates for the lack of format training.

4. **NVFP4 quantization preserves accuracy for Qwen3-VL.** The 8B NVFP4 variant loses only 1.8pp (91.3% vs 89.5%) while gaining 61% throughput (12.1 vs 7.5 samples/s). The 2B NVFP4 loses 5.3pp but gains 8% throughput (17.2 vs 15.9 samples/s).

5. **FP8 quantization is effectively free.** InternVL3-2B loses <1pp accuracy with FP8 while gaining 21% throughput (14.3 vs 11.8 samples/s).

6. **Qwen3-VL dominates at all scales.** Seven of the top eight models are Qwen3-VL variants. Even zero-shot Qwen3-VL-8B (87.5%) outperforms all fine-tuned InternVL3 and Qwen3.5-VL models.

7. **RL provides meaningful but modest gains.** GRPO+GTPO adds +1.6pp weighted score over SFT-only for Qwen3.5-2B, with the largest gains on brand (+9.2pp) and defect (+5.6pp).
---
| Direction | Expected Impact | Effort |
|-----------|:--------------:|:------:|
| **GTPO on Qwen3-VL-8B SFT+GRPO** | +1-3pp weighted (add trajectory optimization to the #1 model) | Low |
| **GTPO on Qwen3-VL-2B v9** | +2-4pp weighted (currently SFT+GRPO only) | Low |
| **SFT on Qwen3-VL-8B from zero-shot** | Push past 91.3% with a better starting point | Low |
| **QLoRA on Qwen3.5-35B GPTQ** | JSON parse 14% -> 100%, weighted 50% -> ~80%+ | Low |
| **OCR pre-processing pipeline** | Fix brand/size for Qwen3.5 models (+30-60pp on those fields) | Medium |
| **Higher LoRA rank (r=32/64)** | +1-3pp from increased adapter capacity | Low |
| **Guided JSON decoding** | Force 100% JSON parse on zero-shot models without training | Low |
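For the guided JSON decoding direction, one possible shape of the request: a JSON Schema constrains decoding so even an untrained model must emit a parseable object. This assumes a vLLM OpenAI-compatible server whose version supports the `guided_json` extension; the attribute key names are also assumptions:

```python
import json

# Schema constraining decoding to the nine-attribute object
# (key names assumed from the attribute list above).
FIELDS = ["type", "color", "pattern", "neckline", "sleeve_length",
          "closure", "brand", "size", "defect_type"]
GARMENT_SCHEMA = {
    "type": "object",
    "properties": {f: {"type": ["string", "null"]} for f in FIELDS},
    "required": FIELDS,
    "additionalProperties": False,
}

def extract(client, model: str, image_url: str) -> dict:
    """Query a vLLM OpenAI-compatible server with guided JSON decoding.
    `client` is an openai.OpenAI instance pointed at the vLLM endpoint;
    `guided_json` is vLLM's structured-output extension (assumption)."""
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": [
            {"type": "image_url", "image_url": {"url": image_url}},
            {"type": "text", "text": "Extract the garment attributes as JSON."},
        ]}],
        extra_body={"guided_json": GARMENT_SCHEMA},
    )
    return json.loads(resp.choices[0].message.content)
```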

Models we haven't tested but are strong candidates:

| Model | Parameters | Why Promising |
|-------|:----------:|---------------|
| **[InternVL3-4B](https://huggingface.co/OpenGVLab/InternVL3-4B)** | 4B | Mid-range InternVL – may close gap to Qwen3-VL |
| **[SmolVLM2-2.2B](https://huggingface.co/HuggingFaceTB/SmolVLM2-2.2B-Instruct)** | 2.2B | HuggingFace's efficient VLM – strong structured output |
| **[PaliGemma2-3B](https://huggingface.co/google/paligemma2-3b-pt-448)** | 3B | Google VLM with excellent OCR – may solve brand/size |
| **[MiniCPM-V-2.6](https://huggingface.co/openbmb/MiniCPM-V-2_6)** | 2.8B | Strong small VLM with good OCR capabilities |
| **[Qwen3-VL-32B](https://huggingface.co/Qwen/Qwen3-VL-32B-Instruct)** | 32B | Largest Qwen3-VL – given 8B dominance, could push past 95% |

### Long-Term Research
### Key Open Questions

- Why does Qwen3-VL dramatically outperform Qwen3.5-VL at the same scale? Is it the vision encoder, the cross-attention mechanism, or training data?
- Can RL gains be amplified beyond +1.8pp on the 8B model? Current GRPO hyperparameters may be suboptimal.
- Is there a parameter count sweet spot between 8B and 32B where accuracy saturates?
- Would instruction-tuned models (vs raw base models) yield better SFT starting points?

---