Saptadip Saha

	---
	title: VisionQuery
	emoji: 🔍
	colorFrom: indigo
	colorTo: purple
	sdk: docker
	app_port: 7860
	short_description: SigLIP based zero-shot image classification
	tags:
	- vision
	- zero-shot
	- siglip
	- taipy
	- image-classification
	- transformers
	---

	# VisionQuery
	### Zero-Shot Image Understanding with Google SigLIP + Taipy



	## Problem Statement

	Traditional image classification systems demand:
	- Thousands of labeled images per category
	- Expensive GPU training pipelines
	- Re-training every time you add a new category
	- ML expertise to build and maintain

	This makes vision AI inaccessible for most real-world use cases.



	## Solution

	VisionQuery uses SigLIP (Sigmoid Loss for Language Image Pre-Training, from Google DeepMind) to deliver zero-shot image classification:

	- Describe what you're looking for in plain English
	- No training data or fine-tuning required
	- Add or change categories on the fly, with no re-training
	- Multilingual labels need a multilingual SigLIP checkpoint; the `google/siglip-base-patch16-224` checkpoint used here was trained on English text



	## How to Use

	1. Upload any image (JPG, PNG, WebP)
	2. Enter text labels as comma-separated descriptions, e.g. `a cat, a dog, a person walking, a sunset`
	3. Click Analyze Image
	4. Instantly see similarity scores for every label
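
Step 2 takes free-form comma-separated text, so the input needs light cleanup before it is scored. A minimal sketch of that parsing follows, assuming a hypothetical helper named `parse_labels` (it is not taken from `app.py`, whose own parsing may differ) that trims whitespace and drops empty or repeated entries:

```python
def parse_labels(raw: str) -> list[str]:
    """Split a comma-separated string into clean, de-duplicated labels.

    Hypothetical helper for illustration; the app's own parsing may differ.
    """
    labels: list[str] = []
    seen: set[str] = set()
    for part in raw.split(","):
        label = part.strip()
        key = label.lower()  # treat "A Cat" and "a cat" as the same label
        if label and key not in seen:
            seen.add(key)
            labels.append(label)
    return labels


print(parse_labels("a cat, a dog, , a person walking, A Cat,a sunset"))
# ['a cat', 'a dog', 'a person walking', 'a sunset']
```

Keeping the first spelling of each label preserves the order the user typed, which makes the resulting score list easier to read.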



	## How SigLIP Works

	```
	Image ──► ViT Encoder  ──► Image Embedding ──┐
	                                             ├──► Sigmoid score per pair
	Text  ──► Text Encoder ──► Text Embedding  ──┘
	```

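	Unlike CLIP's softmax loss, which normalises each score against every other pair in the batch, SigLIP uses a sigmoid loss that scores each image-text pair independently. At inference this gives:
	- Per-label scores that do not depend on which other labels are listed
	- Multi-label support: several labels can score high for the same image
	- Stronger zero-shot accuracy than a softmax baseline at smaller batch sizes, as reported in the SigLIP paper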
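
The difference between the two scoring rules can be sketched in plain Python. The 2-D embeddings, the `scale` and `bias` values, and the helper names below are illustrative assumptions, not values from the SigLIP checkpoint; the real model learns its own scale and bias.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def pair_logits(image_emb, text_embs, scale=10.0, bias=-5.0):
    """One logit per (image, text) pair: scale * cosine similarity + bias.

    scale and bias are made-up illustrative values.
    """
    img = l2_normalize(image_emb)
    return [
        scale * sum(a * b for a, b in zip(img, l2_normalize(t))) + bias
        for t in text_embs
    ]


# Toy embeddings: the first two texts both match the image, the third does not.
image = [1.0, 0.0]
texts = [[1.0, 0.0], [1.0, 0.1], [0.0, 1.0]]

logits = pair_logits(image, texts)
sigmoid_scores = [sigmoid(z) for z in logits]  # each pair scored on its own
softmax_scores = softmax(logits)               # scores compete for a total of 1

print("sigmoid:", [round(s, 3) for s in sigmoid_scores])
print("softmax:", [round(s, 3) for s in softmax_scores])
```

In this toy case both matching labels get a sigmoid score close to 1, while softmax has to split its total of 1 roughly in half between them. Adding a fourth label would change every softmax score but leave the existing sigmoid scores untouched.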

	Model used: `google/siglip-base-patch16-224`
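
For reference, scoring an image against a list of labels with this checkpoint through 🤗 Transformers might look like the sketch below. It follows the usage pattern from the model card and is not necessarily how `app.py` does it; `score_labels` and `rank_labels` are hypothetical names. It assumes `torch`, `transformers`, `sentencepiece`, and `Pillow` are installed.

```python
MODEL_ID = "google/siglip-base-patch16-224"


def rank_labels(labels, scores):
    """Pair each label with its score, highest score first."""
    return sorted(zip(labels, scores), key=lambda pair: pair[1], reverse=True)


def score_labels(image, labels, model_id=MODEL_ID):
    """Return (label, score) pairs for a PIL image, best match first.

    Heavy imports stay local so this module can be imported without
    loading anything; the first call downloads the model weights.
    """
    import torch
    from transformers import AutoModel, AutoProcessor

    model = AutoModel.from_pretrained(model_id)
    processor = AutoProcessor.from_pretrained(model_id)

    # SigLIP was trained with fixed-length text padding, hence padding="max_length".
    inputs = processor(
        text=labels, images=image, padding="max_length", return_tensors="pt"
    )
    with torch.no_grad():
        outputs = model(**inputs)

    # logits_per_image has shape (1, num_labels); sigmoid scores each pair on its own.
    scores = torch.sigmoid(outputs.logits_per_image)[0].tolist()
    return rank_labels(labels, scores)
```

A call such as `score_labels(Image.open("photo.jpg"), ["a cat", "a dog"])`, with `Image` imported from `PIL` and `photo.jpg` standing in for any local file, returns the labels sorted by score. In a long-running app the model and processor would normally be loaded once at startup instead of on every call.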



	## Tech Stack

	| Layer | Technology |
	|---|---|
	| Vision-Language Model | Google SigLIP via 🤗 Transformers |
	| GUI Framework | [Taipy](https://github.com/Avaiga/taipy) |
	| Charts | Plotly |
	| Deployment | Hugging Face Spaces (Docker) |
	| Backend | PyTorch |



	## Applications

	| Domain | Use Case |
	|---|---|
	| 🏥 Healthcare | Describe symptoms → find matching scan types |
	| 🛒 E-Commerce | Natural language visual product search |
	| 🔒 Security | Detect unusual scenes with text descriptions |
	| 🎨 Asset Management | Auto-tag image libraries |
	| ♿ Accessibility | Auto-describe images for visually impaired |
	| 🔬 Research | Classify microscopy / satellite imagery |



	## Local Setup

	```bash
	git clone https://huggingface.co/spaces/YOUR_USERNAME/visionquery-ai
	cd visionquery-ai
	pip install -r requirements.txt
	python app.py
	```

	The app runs at `http://localhost:7860`.
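
To confirm the server actually came up, a small standard-library check can be run from a second terminal. This is a sketch only; `is_app_up` is a hypothetical helper and not part of the repository.

```python
import urllib.error
import urllib.request


def is_app_up(url: str = "http://localhost:7860", timeout: float = 2.0) -> bool:
    """Return True if anything answers HTTP at `url`, False otherwise."""
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        # The server answered, just with an error status.
        return True
    except OSError:
        # Connection refused, timeout, DNS failure, and similar.
        return False


if __name__ == "__main__":
    print("app reachable:", is_app_up())
```

The first launch can take a while because the model weights have to be downloaded, so a `False` result right after starting `python app.py` does not necessarily mean the setup failed.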



	## Citation

	```
	@article{zhai2023sigmoid,
	  title     = {Sigmoid Loss for Language Image Pre-Training},
	  author    = {Zhai, Xiaohua and others},
	  journal   = {arXiv preprint arXiv:2303.15343},
	  year      = {2023},
	  publisher = {Google DeepMind}
	}
	```