Qwen3.5 Grocery Multi-task Model
A pruned Qwen3.5-0.8B vision-language model (12 text layers, down from 24) fine-tuned for grocery product detection and classification.
Model Details
- Base model: Qwen/Qwen3.5-0.8B (pruned to 12 text layers)
- Tasks: Product classification (356 classes) + grid-based object detection
- Parameters: 860.7M total (backbone: 858.2M, cls head: 1.4M, det head: 1.0M)
- Training step: 7500
- Validation accuracy (classification): 0.801
- Validation loss: 0.6929
Architecture
- Backbone: Qwen3.5 vision encoder (12 ViT layers, 768 hidden) + merger + 12 text transformer blocks (1024 hidden)
- Classification head: Linear(1024, 1024) → GELU → Dropout → Linear(1024, 356) on mean-pooled features
- Detection head: Anchor-free 14x14 grid, predicts [conf, x_off, y_off, w, h] per cell
Files
- `model.safetensors` - Backbone weights (vision encoder + text layers)
- `cls_head.safetensors` - Classification head weights
- `det_head.safetensors` - Detection head weights
- `config.json` - Model config (pruned 12-layer Qwen3.5)
- `tokenizer.json` / `tokenizer_config.json` - Tokenizer files
Usage
```python
import torch
from safetensors.torch import load_file
from transformers import AutoModelForImageTextToText, AutoProcessor

# Load backbone and processor
model = AutoModelForImageTextToText.from_pretrained(
    "heiertech/qwen35-grocery-multitask",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
)
processor = AutoProcessor.from_pretrained(
    "heiertech/qwen35-grocery-multitask",
    trust_remote_code=True,
)

# Load classification head
cls_state = load_file("cls_head.safetensors")
# ... attach to your ClassificationHead module

# Load detection head
det_state = load_file("det_head.safetensors")
# ... attach to your DetectionHead module
```
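A `ClassificationHead` module matching the architecture described above (Linear → GELU → Dropout → Linear on mean-pooled features) might look like the sketch below. The submodule names (`fc1`/`fc2`) and dropout probability are assumptions; rename the layers to match the keys stored in `cls_head.safetensors` before calling `load_state_dict`.

```python
import torch
import torch.nn as nn

class ClassificationHead(nn.Module):
    """Linear(1024, 1024) -> GELU -> Dropout -> Linear(1024, 356),
    applied to mean-pooled backbone features. Layer names are
    assumptions and must match the saved state dict."""
    def __init__(self, hidden=1024, num_classes=356, p=0.1):
        super().__init__()
        self.fc1 = nn.Linear(hidden, hidden)
        self.act = nn.GELU()
        self.drop = nn.Dropout(p)
        self.fc2 = nn.Linear(hidden, num_classes)

    def forward(self, hidden_states):
        # mean-pool over the sequence dimension, then classify
        pooled = hidden_states.mean(dim=1)
        return self.fc2(self.drop(self.act(self.fc1(pooled))))

cls_head = ClassificationHead()
# cls_head.load_state_dict(cls_state)  # keys must match the module above
logits = cls_head(torch.randn(1, 77, 1024))  # (batch, seq, hidden) -> (batch, 356)
```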