| title: VisionQuery | |
| emoji: π | |
| colorFrom: indigo | |
| colorTo: purple | |
| sdk: docker | |
| app_port: 7860 | |
| short_description: SigLIP-based zero-shot image classification | |
| tags: | |
| - vision | |
| - zero-shot | |
| - siglip | |
| - taipy | |
| - image-classification | |
| - transformers | |
| # VisionQuery | |
| ### Zero-Shot Image Understanding with Google SigLIP + Taipy | |
| ## Problem Statement | |
| Traditional image classification systems demand: | |
| - **Thousands of labeled images** per category | |
| - **Expensive GPU training pipelines** | |
| - **Re-training** every time you add a new category | |
| - **ML expertise** to build and maintain | |
| This puts vision AI out of reach for many real-world use cases. | |
| ## Solution | |
| **VisionQuery AI** uses **SigLIP** (Sigmoid Loss for Language-Image Pre-Training by Google DeepMind) to deliver **zero-shot image classification**: | |
| - Describe what you're looking for in **plain English** | |
| - No training data or fine-tuning, ever | |
| - Add **unlimited categories** on the fly | |
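| - Multilingual: SigLIP's multilingual checkpoints support **100+ languages** (the default `google/siglip-base-patch16-224` checkpoint is trained mainly on English text) | |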
| ## How to Use | |
| 1. **Upload** any image (JPG, PNG, WebP) | |
| 2. **Enter text labels** as comma-separated descriptions | |
| e.g. `a cat, a dog, a person walking, a sunset` | |
| 3. Click **Analyze Image** | |
| 4. Instantly see **similarity scores** for every label | |
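The comma-separated input from step 2 has to be turned into a list of labels before scoring. A minimal sketch of that parsing step, assuming a hypothetical helper named `parse_labels` (the app's actual implementation may differ):

```python
def parse_labels(raw: str) -> list[str]:
    """Split a comma-separated string into clean, de-duplicated labels."""
    seen = set()
    labels = []
    for part in raw.split(","):
        label = part.strip()
        key = label.lower()  # treat "A Cat" and "a cat" as the same label
        if label and key not in seen:
            seen.add(key)
            labels.append(label)
    return labels


print(parse_labels("a cat, a dog, , A Cat,a sunset "))
# ['a cat', 'a dog', 'a sunset']
```

Dropping empty and duplicate entries keeps the score chart readable when the input has stray commas or repeated labels.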
| ## How SigLIP Works | |
| ``` | |
| Image ──► ViT Encoder  ──► Image Embedding ──┐ | |
|                                              ├──► Sigmoid Score per pair | |
| Text  ──► Text Encoder ──► Text Embedding  ──┘ | |
| ``` | |
| Unlike CLIP's softmax loss, which normalises scores across all image-text pairs in a batch, SigLIP uses a **sigmoid loss**: each image-text pair is scored independently. This gives: | |
| - Better calibration | |
| - True multi-label support (several labels can score high for the same image) | |
| - Stronger zero-shot accuracy, as reported in the SigLIP paper | |
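The practical effect of independent scoring shows up at inference time when labels are added or removed. The sketch below uses made-up logits (not real model output) to compare a softmax over the candidate labels with a per-label sigmoid. Adding a fourth label changes every softmax score but leaves the existing sigmoid scores untouched:

```python
import math


def softmax(logits):
    """Scores that compete with each other and always sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(logits):
    """Independent scores: each depends only on its own logit."""
    return [1.0 / (1.0 + math.exp(-x)) for x in logits]


# Hypothetical image-text logits for three labels (illustrative values only)
labels = ["a dog", "a person walking", "a sunset"]
logits = [2.0, 1.5, -3.0]

# The same image scored again with one extra candidate label
extended = logits + [4.0]

print("softmax, 3 labels:", [round(p, 3) for p in softmax(logits)])
print("softmax, 4 labels:", [round(p, 3) for p in softmax(extended)[:3]])
print("sigmoid, 3 labels:", [round(p, 3) for p in sigmoid(logits)])
print("sigmoid, 4 labels:", [round(p, 3) for p in sigmoid(extended)[:3]])
```

With sigmoid scoring, "a dog" and "a person walking" can both score high for the same image, which is what makes multi-label tagging possible.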
| **Model used:** `google/siglip-base-patch16-224` | |
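For reference, here is a minimal sketch of scoring one image against a list of labels with this checkpoint through 🤗 Transformers. It follows the usage pattern from the public model card. The function names `score_image` and `rank_labels` are hypothetical, and this is not necessarily how `app.py` is written. It assumes `torch`, `transformers`, and `Pillow` are installed:

```python
def rank_labels(labels, scores):
    """Pair each label with its score, highest score first."""
    return sorted(zip(labels, scores), key=lambda pair: pair[1], reverse=True)


def score_image(image_path, labels, model_id="google/siglip-base-patch16-224"):
    """Return (label, probability) pairs for one image, best match first."""
    # Heavy imports are deferred so the helper above works without them.
    import torch
    from PIL import Image
    from transformers import AutoModel, AutoProcessor

    model = AutoModel.from_pretrained(model_id)
    processor = AutoProcessor.from_pretrained(model_id)

    image = Image.open(image_path).convert("RGB")
    # The model card pads text to max_length, matching how the model was trained.
    inputs = processor(
        text=labels, images=image, padding="max_length", return_tensors="pt"
    )

    with torch.no_grad():
        outputs = model(**inputs)

    # One independent sigmoid probability per (image, label) pair.
    probs = torch.sigmoid(outputs.logits_per_image)[0]
    return rank_labels(labels, probs.tolist())
```

Calling `score_image("photo.jpg", ["a cat", "a dog"])` downloads the checkpoint on first use. A long-running app would normally load the model and processor once at startup rather than on every request.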
| ## Tech Stack | |
| | Layer | Technology | | |
| |---|---| | |
| | Vision-Language Model | Google SigLIP via 🤗 Transformers | |
| | GUI Framework | [Taipy](https://github.com/Avaiga/taipy) | | |
| | Charts | Plotly | | |
| | Deployment | Hugging Face Spaces (Docker) | | |
| | Backend | PyTorch | | |
| ## Applications | |
| | Domain | Use Case | | |
| |---|---| | |
| | 🏥 Healthcare | Describe symptoms → find matching scan types | |
| | 🛒 E-Commerce | Natural language visual product search | |
| | 🔒 Security | Detect unusual scenes with text descriptions | |
| | 🎨 Asset Management | Auto-tag image libraries | |
| | ♿ Accessibility | Auto-describe images for visually impaired users | |
| | 🔬 Research | Classify microscopy / satellite imagery | |
| ## Local Setup | |
| ```bash | |
| git clone https://huggingface.co/spaces/YOUR_USERNAME/visionquery-ai | |
| cd visionquery-ai | |
| pip install -r requirements.txt | |
| python app.py | |
| ``` | |
| The app runs at `http://localhost:7860`. | |
| ## Citation | |
| ``` | |
| @article{zhai2023sigmoid, | |
| title = {Sigmoid Loss for Language Image Pre-Training}, | |
| author = {Zhai, Xiaohua and Mustafa, Basil and Kolesnikov, Alexander and Beyer, Lucas}, | |
| journal = {arXiv preprint arXiv:2303.15343}, | |
| year = {2023}, | |
| publisher = {Google DeepMind} | |
| } | |
| ``` | |