Saptadip Saha
---
title: VisionQuery
emoji: 🔍
colorFrom: indigo
colorTo: purple
sdk: docker
app_port: 7860
short_description: SigLIP-based zero-shot image classification
tags:
- vision
- zero-shot
- siglip
- taipy
- image-classification
- transformers
---
# VisionQuery
### Zero-Shot Image Understanding with Google SigLIP + Taipy
## Problem Statement
Traditional image classification systems demand:
- **Thousands of labeled images** per category
- **Expensive GPU training pipelines**
- **Re-training** every time you add a new category
- **ML expertise** to build and maintain
This makes vision AI inaccessible for most real-world use cases.
## Solution
**VisionQuery AI** uses **SigLIP** (Sigmoid Loss for Language Image Pre-Training, by Google DeepMind) to deliver **zero-shot image classification**:
- Describe what you're looking for in **plain English**
- No training data or fine-tuning, ever
- Add **unlimited categories** on the fly
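- Multilingual: SigLIP's multilingual checkpoints support **100+ languages**; the default `google/siglip-base-patch16-224` checkpoint is trained on English text, so English labels work best with it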
## How to Use
1. **Upload** any image (JPG, PNG, WebP)
2. **Enter text labels** as comma-separated descriptions,
   e.g. `a cat, a dog, a person walking, a sunset`
3. Click **Analyze Image**
4. Instantly see **similarity scores** for every label
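The comma-separated input from step 2 has to be turned into a clean list of labels before scoring. The sketch below shows one way to do that; `parse_labels` is a hypothetical helper written for illustration, not necessarily how the app parses its input.

```python
def parse_labels(raw: str) -> list[str]:
    """Split a comma-separated label string into clean, unique labels."""
    seen = set()
    labels = []
    for part in raw.split(","):
        label = " ".join(part.split())  # trim and collapse inner whitespace
        key = label.lower()
        if label and key not in seen:   # skip blanks and case-insensitive duplicates
            seen.add(key)
            labels.append(label)
    return labels


print(parse_labels("a cat, a dog,  a person walking , a sunset, A Cat,"))
# ['a cat', 'a dog', 'a person walking', 'a sunset']
```

Dropping empty entries and duplicates keeps a stray trailing comma or a repeated label from showing up as an extra row in the results.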
## How SigLIP Works
```
Image ──► ViT Encoder  ──► Image Embedding ──┐
                                             ├──► Sigmoid Score per pair
Text  ──► Text Encoder ──► Text Embedding  ──┘
```
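Unlike CLIP's softmax loss, which normalises scores across all pairs in a batch, SigLIP uses a **sigmoid loss**: each image-text pair is scored independently. In practice this gives:
- Better calibration: a label's score does not depend on which other labels you enter
- True multi-label support: several labels can score high for the same image
- Strong zero-shot accuracy (see the SigLIP paper in the Citation section)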
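To make the difference concrete, here is a small illustration in plain Python that applies both scoring rules to the same logits. The labels and logit values are made up for the example and do not come from the model.

```python
import math


def softmax(logits):
    """Normalise logits so the scores compete and sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(logits):
    """Score every logit on its own; the scores need not sum to 1."""
    return [1.0 / (1.0 + math.exp(-x)) for x in logits]


# Hypothetical logits for one image against three labels; two labels fit well.
labels = ["a dog", "a park", "a spaceship"]
logits = [3.0, 2.5, -4.0]

for name, sm, sg in zip(labels, softmax(logits), sigmoid(logits)):
    print(f"{name:12s} softmax={sm:.3f} sigmoid={sg:.3f}")

# Multi-label tagging: keep every label whose independent score passes a threshold.
tags = [name for name, sg in zip(labels, sigmoid(logits)) if sg >= 0.5]
print(tags)
```

With softmax, the two matching labels split the probability mass between them. With sigmoid, both score above 0.9, so a simple threshold can keep both as tags.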
**Model used:** `google/siglip-base-patch16-224`
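A minimal sketch of scoring labels with this checkpoint through 🤗 Transformers follows. It assumes `torch`, `transformers`, and `Pillow` are installed; `score_labels` and `rank_scores` are hypothetical helper names, and the app's actual inference code may differ.

```python
def rank_scores(labels, scores):
    """Pair each label with its score and sort from best to worst match."""
    return sorted(zip(labels, scores), key=lambda pair: pair[1], reverse=True)


def score_labels(image_path, labels, model_id="google/siglip-base-patch16-224"):
    """Return (label, score) pairs for one image, highest score first."""
    # Heavy imports stay local so rank_scores remains usable without them.
    import torch
    from PIL import Image
    from transformers import AutoModel, AutoProcessor

    model = AutoModel.from_pretrained(model_id)
    processor = AutoProcessor.from_pretrained(model_id)

    image = Image.open(image_path).convert("RGB")
    # SigLIP was trained with text padded to a fixed length, hence padding="max_length".
    inputs = processor(text=labels, images=image, padding="max_length", return_tensors="pt")

    with torch.no_grad():
        outputs = model(**inputs)

    # logits_per_image has shape (1, num_labels); sigmoid scores each label independently.
    scores = torch.sigmoid(outputs.logits_per_image)[0].tolist()
    return rank_scores(labels, scores)


# Example call (needs a real image file and downloads the model on first use):
# for label, score in score_labels("photo.jpg", ["a cat", "a dog", "a sunset"]):
#     print(f"{label}: {score:.1%}")
```

Loading the model inside the function keeps the sketch short; a long-running app would more likely load the model and processor once at startup and reuse them for every request.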
## Tech Stack
| Layer | Technology |
|---|---|
| Vision-Language Model | Google SigLIP via 🤗 Transformers |
| GUI Framework | [Taipy](https://github.com/Avaiga/taipy) |
| Charts | Plotly |
| Deployment | Hugging Face Spaces (Docker) |
| Backend | PyTorch |
## Applications
| Domain | Use Case |
|---|---|
| 🏥 Healthcare | Describe symptoms → find matching scan types |
| 🛒 E-Commerce | Natural language visual product search |
| 🔒 Security | Detect unusual scenes with text descriptions |
| 🎨 Asset Management | Auto-tag image libraries |
| ♿ Accessibility | Auto-describe images for visually impaired users |
| 🔬 Research | Classify microscopy / satellite imagery |
## Local Setup
```bash
git clone https://huggingface.co/spaces/YOUR_USERNAME/visionquery-ai
cd visionquery-ai
pip install -r requirements.txt
python app.py
```
The app runs at `http://localhost:7860`.
## Citation
```
@article{zhai2023sigmoid,
title = {Sigmoid Loss for Language Image Pre-Training},
author = {Zhai, Xiaohua and others},
journal = {arXiv:2303.15343},
year = {2023},
publisher = {Google DeepMind}
}
```