Wardrobe Assistant - Qwen3-VL-4B Fine-tuned Model
Model Details
Model Description
This is a fine-tuned version of Qwen3-VL-4B-Instruct optimized for analyzing and classifying clothing items in images. The model has been specifically trained to provide detailed garment analysis including type, category, color, pattern, fabric, fit, occasion, season, and gender appropriateness.
- Model Type: Vision Language Model (VLM)
- Base Model: Qwen3-VL-4B-Instruct
- Fine-tuning Task: Garment Classification & Analysis
- Input: Image + Natural Language Prompt
- Output: Structured JSON with garment attributes
- Architecture: Transformer-based Vision Language Model
Model Size
- Parameters: ~4 billion
- Precision: auto (torch_dtype="auto"; typically fp16 on GPU, with optional int8 quantization)
- Device: GPU recommended (CUDA) or CPU
Intended Use
Primary Use Cases
- Fashion E-commerce: Automated product listing and categorization
- Virtual Wardrobe Management: Organizing and analyzing personal clothing collections
- Fashion Recommendation Systems: Enabling wardrobe composition suggestions
- Style Analysis Applications: Providing detailed insights about clothing items
- Wardrobe Assistant Apps: Interactive applications for fashion-related queries
Direct Use
This model can be used directly to analyze images of clothing items and extract structured information about their characteristics.
Downstream Applications
- Integration into fashion platforms and e-commerce websites
- Mobile wardrobe management applications
- Style recommendation engines
- Virtual try-on technology
- Fashion AI assistants
How to Use
Installation
pip install transformers torch torchvision pillow gradio accelerate
Basic Usage
from transformers import Qwen3VLForConditionalGeneration, Qwen3VLProcessor
from PIL import Image
import torch
# Load model and processor
model = Qwen3VLForConditionalGeneration.from_pretrained(
"aman4014/Wardrobe-Initial-Classification-Model",
torch_dtype="auto",
device_map="auto"
).eval()
processor = Qwen3VLProcessor.from_pretrained(
"aman4014/Wardrobe-Initial-Classification-Model"
)
# Load image
image = Image.open("garment.jpg")
# Create prompt
prompt = """You are a fashion expert analyzing a garment image.
Analyze the clothing and return a JSON object with:
type, category, color, pattern, fabric, fit, occasion, season, gender"""
# Prepare inputs
messages = [{
"role": "user",
"content": [
{"type": "image", "image": image},
{"type": "text", "text": prompt}
]
}]
inputs = processor.apply_chat_template(
messages,
tokenize=True,
add_generation_prompt=True,
return_dict=True,
return_tensors="pt"
).to(model.device)  # follow the device chosen by device_map="auto" (GPU or CPU)
# Generate output
with torch.inference_mode():
generated_ids = model.generate(**inputs, max_new_tokens=512)
generated_ids_trimmed = [
out_ids[len(in_ids):]
for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output = processor.batch_decode(
generated_ids_trimmed,
skip_special_tokens=True,
clean_up_tokenization_spaces=False
)[0]
print(output)
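In practice the decoded text may wrap the JSON in Markdown code fences or include stray prose around it, so it is worth parsing defensively before using the result. A minimal sketch (the extract_json helper is illustrative, not part of this repository):

```python
import json
import re

def extract_json(raw_output: str) -> dict:
    """Pull the first JSON object out of the model's raw text output.

    Chat models sometimes wrap JSON in ```json fences or add surrounding
    prose, so we strip fences and locate the outermost braces first.
    """
    # Remove optional Markdown code fences such as ```json ... ```
    cleaned = re.sub(r"```(?:json)?", "", raw_output).strip()
    # Take the span from the first '{' to the last '}'
    start, end = cleaned.find("{"), cleaned.rfind("}")
    if start == -1 or end == -1:
        raise ValueError("no JSON object found in model output")
    return json.loads(cleaned[start : end + 1])

raw = '```json\n{"type": "T-Shirt", "category": "Topwear"}\n```'
print(extract_json(raw))  # {'type': 'T-Shirt', 'category': 'Topwear'}
```

If parsing fails, re-prompting the model (or falling back to "Unknown" attributes) is usually preferable to crashing the pipeline.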
Using with Gradio
The model can be deployed with Gradio for an interactive web interface. See the included app.py for a complete example implementation.
Output Format
The model is designed to output structured JSON with the following fields:
{
"type": "e.g., T-Shirt / Jeans / Dress / Jacket / Hoodie / Shorts / Saree / Kurta",
"category": "Topwear / Bottomwear / Footwear / Outerwear / Ethnic / Accessories",
"color": "Specific color names (e.g., Navy Blue, Olive Green)",
"pattern": "Solid / Striped / Checkered / Floral / Printed / Graphic / Embroidered / Tie-Dye",
"fabric": "Cotton / Denim / Wool / Polyester / Silk / Linen / Leather / Unknown",
"fit": "Slim / Regular / Oversized / Fitted / Relaxed / Unknown",
"occasion": "Casual / Formal / Sports / Party / Work / Ethnic",
"season": "Summer / Winter / Monsoon / All-Season",
"gender": "Men / Women / Unisex / Boys / Girls"
}
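Because most of these fields have a closed vocabulary, a lightweight validator can catch malformed predictions before they reach downstream systems. A sketch using the field names and value sets from the schema above (the validate helper itself is hypothetical):

```python
# Closed vocabularies copied from the output-format schema above.
ALLOWED = {
    "category": {"Topwear", "Bottomwear", "Footwear", "Outerwear", "Ethnic", "Accessories"},
    "occasion": {"Casual", "Formal", "Sports", "Party", "Work", "Ethnic"},
    "season":   {"Summer", "Winter", "Monsoon", "All-Season"},
    "gender":   {"Men", "Women", "Unisex", "Boys", "Girls"},
}
REQUIRED_FIELDS = ("type", "category", "color", "pattern",
                   "fabric", "fit", "occasion", "season", "gender")

def validate(pred: dict) -> list[str]:
    """Return a list of problems; an empty list means the output is usable."""
    problems = [f"missing field: {f}" for f in REQUIRED_FIELDS if f not in pred]
    for field, allowed in ALLOWED.items():
        if field in pred and pred[field] not in allowed:
            problems.append(f"unexpected {field!r} value: {pred[field]!r}")
    return problems

pred = {"type": "Jeans", "category": "Bottomwear", "color": "Dark Indigo",
        "pattern": "Solid", "fabric": "Denim", "fit": "Slim",
        "occasion": "Casual", "season": "All-Season", "gender": "Men"}
print(validate(pred))  # []
```

Open-ended fields such as type and color are deliberately left unchecked here, since their value spaces are not closed.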
Training & Fine-tuning
Training Data
- Fine-tuned on a curated dataset of clothing images with detailed annotations
- Covers diverse garment types, colors, patterns, fabrics, and styles
- Includes global fashion categories (Western, South Asian, etc.)
- Balanced representation across gender categories
Training Procedure
- Base Model: Qwen3-VL-4B-Instruct (instruction-following variant)
- Fine-tuning Method: LoRA (Low-Rank Adaptation) or full fine-tuning
- Training Framework: Hugging Face Transformers
- Optimization: Mixed precision training (fp16)
- Hardware: GPU (NVIDIA CUDA recommended)
Input Specifications
- Image Size: Optimized for 512x512 resolution
- Supported Formats: JPEG, PNG, WebP, etc.
- Color Space: RGB
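Since the model is tuned around 512x512 inputs, downscaling very large photos before inference keeps latency predictable. A minimal, dependency-free sketch of the size computation (the processor typically performs its own resizing as well, so this is an optional pre-step; target_size is an illustrative helper):

```python
def target_size(width: int, height: int, longest: int = 512) -> tuple[int, int]:
    """Scale (width, height) so the longest side equals `longest`,
    preserving the aspect ratio. Never returns a dimension below 1."""
    scale = longest / max(width, height)
    return max(1, round(width * scale)), max(1, round(height * scale))

print(target_size(1024, 768))  # (512, 384)
```

With Pillow this could be applied as: img = Image.open(path).convert("RGB"); img = img.resize(target_size(*img.size)).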
Limitations & Bias
Known Limitations
- Image Quality: Performance may degrade with very low-resolution or heavily obscured images
- Garment Visibility: Requires clear view of the garment; full-body shots may have reduced accuracy
- Ambiguous Cases: Colors and patterns with high ambiguity may be classified as "Unknown"
- Rare Garment Types: Performance may vary on uncommon or culturally specific clothing items
- Partial Visibility: Garments that are only partially visible may produce incomplete or "Unknown" attributes
Potential Biases
- The model's predictions may reflect biases present in the training data
- Color classification is subjective and culturally influenced
- Gender classification relies on traditional clothing associations, which may not match how a garment is actually worn
- The model may have varying performance across different skin tones and body types due to training data composition
Recommendation
- Verify outputs in critical applications
- Use as a support tool rather than sole decision-maker
- Implement human review for important use cases
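One way to operationalize the human-review recommendation is to route any prediction containing "Unknown" or missing attributes to a reviewer. A minimal sketch; which fields to check, and the routing policy, are application-specific assumptions:

```python
def needs_review(pred: dict) -> bool:
    """Flag a prediction for human review if any of the attributes that
    the model may classify as "Unknown" is unknown or missing.
    Illustrative only; the field set and policy are application choices."""
    core = ("fabric", "fit", "color", "pattern")
    return any(pred.get(field, "Unknown") == "Unknown" for field in core)

print(needs_review({"fabric": "Unknown", "fit": "Slim",
                    "color": "Navy Blue", "pattern": "Solid"}))  # True
print(needs_review({"fabric": "Cotton", "fit": "Slim",
                    "color": "Navy Blue", "pattern": "Solid"}))  # False
```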
Ethical Considerations
- Privacy: Do not use this model to identify individuals from clothing in images
- Fairness: Be aware of potential biases in gender and occasion classifications
- Consent: Ensure you have appropriate permissions to process images
- Intended Use: Use responsibly for fashion analysis and wardrobe management
Performance
Benchmark Results
- High accuracy on garment classification in the creator's evaluation (no quantitative benchmark figures are published for this checkpoint)
- Provides consistent JSON output structure
- Fast inference on GPU (typically <2 seconds per image)
- CPU inference supported with increased latency
Hardware Requirements
- Recommended: NVIDIA GPU with 6GB+ VRAM (RTX 3060 Ti or better)
- Minimum: GPU with 4GB VRAM or 16GB+ system RAM (CPU only)
- Tested On: CUDA 11.8+, PyTorch 2.0+
Inference Examples
Example 1: Blue Cotton T-Shirt
Input: Image of a plain blue cotton t-shirt
{
"type": "T-Shirt",
"category": "Topwear",
"color": "Royal Blue",
"pattern": "Solid",
"fabric": "Cotton",
"fit": "Regular",
"occasion": "Casual",
"season": "All-Season",
"gender": "Unisex"
}
Example 2: Denim Jeans
Input: Image of blue denim jeans
{
"type": "Jeans",
"category": "Bottomwear",
"color": "Dark Indigo",
"pattern": "Solid",
"fabric": "Denim",
"fit": "Slim",
"occasion": "Casual",
"season": "All-Season",
"gender": "Men"
}
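Parsed outputs like the two examples above can be aggregated into simple wardrobe statistics, such as item counts per category, which is the building block of the virtual-wardrobe use case. A sketch (the items list is illustrative):

```python
from collections import Counter

# Two parsed predictions, trimmed to the fields used for the summary.
items = [
    {"type": "T-Shirt", "category": "Topwear",    "season": "All-Season"},
    {"type": "Jeans",   "category": "Bottomwear", "season": "All-Season"},
]

# Count garments per category across the wardrobe.
by_category = Counter(item["category"] for item in items)
print(dict(by_category))  # {'Topwear': 1, 'Bottomwear': 1}
```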
Citation
If you use this model in your research or application, please cite:
@misc{wardrobe_assistant_qwen3vl,
author = {aman4014},
title = {Wardrobe Assistant - Qwen3-VL-4B Fine-tuned Model},
year = {2026},
publisher = {Hugging Face},
howpublished = {\url{https://huggingface.co/aman4014/Wardrobe-Initial-Classification-Model}}
}
Licensing
This model is based on Qwen3-VL-4B-Instruct. Please refer to the Qwen3 License for the base model's licensing terms.
Contributors
- Model Creator: aman4014
- Base Model: Alibaba Qwen Team
- Framework: Hugging Face Transformers
Contact & Support
For issues, questions, or feedback regarding this model, please:
- Open an issue on the model's Hugging Face repository
- Contact the model creator directly
Changelog
Version 1.0 (Initial Release)
- Released fine-tuned Qwen3-VL-4B for wardrobe analysis
- Supports 9 key garment attributes
- Gradio web interface included
- JSON output format standardized
Last Updated: March 2026
Model Hub: https://huggingface.co/aman4014/Wardrobe-Initial-Classification-Model