Qwen3.5 Grocery Multi-task Model
A pruned Qwen3.5-0.8B vision-language model (12 text layers, down from 24) fine-tuned for grocery product detection and classification.
Model Details
- Base model: Qwen/Qwen3.5-0.8B (pruned to 12 text layers)
- Tasks: Product classification (356 classes) + grid-based object detection
- Parameters: 860.7M total (backbone: 858.2M, cls head: 1.4M, det head: 1.0M)
- Training step: 7500
- Validation accuracy (classification): 0.801
- Validation loss: 0.6929
Architecture
- Backbone: Qwen3.5 vision encoder (12 ViT layers, 768 hidden) + merger + 12 text transformer blocks (1024 hidden)
- Classification head: Linear(1024, 1024) → GELU → Dropout → Linear(1024, 356) on mean-pooled features
- Detection head: Anchor-free 14x14 grid, predicts [conf, x_off, y_off, w, h] per cell
Files
- `model.safetensors` - Backbone weights (vision encoder + text layers)
- `cls_head.safetensors` - Classification head weights
- `det_head.safetensors` - Detection head weights
- `config.json` - Model config (pruned 12-layer Qwen3.5)
- `tokenizer.json` / `tokenizer_config.json` - Tokenizer files
Usage
```python
import torch
from safetensors.torch import load_file
from transformers import AutoModelForImageTextToText, AutoProcessor

# Load backbone and processor
model = AutoModelForImageTextToText.from_pretrained(
    "heiertech/qwen35-grocery-multitask",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
)
processor = AutoProcessor.from_pretrained(
    "heiertech/qwen35-grocery-multitask",
    trust_remote_code=True,
)

# Load classification head
cls_state = load_file("cls_head.safetensors")
# ... attach to your ClassificationHead module

# Load detection head
det_state = load_file("det_head.safetensors")
# ... attach to your DetectionHead module
```
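A `ClassificationHead` module matching the architecture described above (Linear → GELU → Dropout → Linear on mean-pooled features) might look like the sketch below. The submodule names (`fc1`/`fc2`) and dropout probability are assumptions; rename the layers to match the keys stored in `cls_head.safetensors` before calling `load_state_dict`.

```python
import torch
import torch.nn as nn

class ClassificationHead(nn.Module):
    """Linear(1024, 1024) -> GELU -> Dropout -> Linear(1024, 356),
    applied to mean-pooled backbone features. Layer names are
    assumptions and must match the saved state dict."""
    def __init__(self, hidden=1024, num_classes=356, p=0.1):
        super().__init__()
        self.fc1 = nn.Linear(hidden, hidden)
        self.act = nn.GELU()
        self.drop = nn.Dropout(p)
        self.fc2 = nn.Linear(hidden, num_classes)

    def forward(self, hidden_states):
        # mean-pool over the sequence dimension, then classify
        pooled = hidden_states.mean(dim=1)
        return self.fc2(self.drop(self.act(self.fc1(pooled))))

cls_head = ClassificationHead()
# cls_head.load_state_dict(cls_state)  # keys must match the module above
logits = cls_head(torch.randn(1, 77, 1024))  # (batch, seq, hidden) -> (batch, 356)
```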