PromptRIS (Prompt-Driven Referring Image Segmentation) is a multimodal deep learning model that segments objects in an image based on natural language prompts.

Unlike traditional segmentation models that rely only on visual cues, PromptRIS integrates vision and language understanding to localize and segment objects referred to by descriptive text (e.g., "the man wearing a red shirt").

The model combines:

- CLIP for text–image semantic alignment
- Segment Anything Model (SAM) for high-quality mask generation
- Cross-modal attention fusion for prompt-guided feature refinement
- Contrastive learning to improve referring-expression grounding

PromptRIS is capable of:

- 🖼 Referring Expression Segmentation
- 🔍 Prompt-guided Object Localization
- 🧠 Cross-modal Visual Reasoning
- 🎯 Fine-grained Instance Segmentation

The architecture enables precise mask prediction by aligning textual embeddings with visual feature maps before refining the masks with SAM.

This repository provides:

- Trained PromptRIS model weights
- SAM checkpoint integration
- An inference pipeline
- A reproducible training structure

PromptRIS is suitable for research, AI assistants, interactive vision systems, and advanced human–AI visual interaction tasks.
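The CLIP-based text–image alignment mentioned above boils down to comparing a text embedding against candidate visual embeddings and picking the best-scoring region for mask refinement. As a rough, stdlib-only illustration (toy vectors, not the model's actual embeddings or API):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy embeddings standing in for CLIP text / image-region features.
text_embedding = [0.9, 0.1, 0.3]       # e.g., "the man wearing a red shirt"
region_embeddings = {
    "region_a": [0.8, 0.2, 0.4],       # visually matches the prompt
    "region_b": [0.1, 0.9, 0.2],       # unrelated region
}

# The region most aligned with the prompt is selected for mask refinement.
scores = {name: cosine_similarity(text_embedding, emb)
          for name, emb in region_embeddings.items()}
best_region = max(scores, key=scores.get)
print(best_region)  # the prompt-matching region wins
```

In the real model the embeddings come from the CLIP encoders and alignment happens over dense feature maps rather than two named regions, but the scoring principle is the same.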
---
license: mit
language:
- en
tags:
- deep-learning
- computer-vision
- vision-language
- segmentation
- multimodal
- pytorch
library_name: pytorch
---

# DEEP – Vision-Language Intelligence Framework

## 🔥 Overview

DEEP is a multimodal AI framework that integrates computer vision and language understanding to perform intelligent visual reasoning tasks.

The system is designed for:

- 🧠 Vision-Language Understanding
- 🖼 Image Segmentation
- 📝 Visual Question Answering
- 🔍 Prompt-driven Object Localization
- 🤖 AI Agent-based Visual Reasoning

This repository contains the model weights, training scripts, and inference pipeline.

---

## 🏗 Architecture

The architecture integrates:

- Vision Encoder (CNN / ViT)
- Text Encoder (Transformer-based)
- Cross-Modal Attention Fusion
- Task-specific Heads (Segmentation / QA / Classification)

Pipeline Flow:

- Image → Vision Encoder
- Text Prompt → Text Encoder
- Fusion → Cross Attention
- Output → Task Head

---

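The fusion step in the pipeline can be pictured as single-query cross attention: the text embedding acts as the query and attends over visual patch features, producing a prompt-weighted visual summary. A minimal stdlib-only sketch (toy vectors; the actual model uses learned projections and multi-head attention):

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(text_query, visual_features):
    """Single-head, single-query cross attention:
    the text embedding attends over visual feature vectors."""
    scale = math.sqrt(len(text_query))
    scores = [sum(q * k for q, k in zip(text_query, feat)) / scale
              for feat in visual_features]
    weights = softmax(scores)
    # Fused feature = attention-weighted sum of the visual features.
    fused = [sum(w * feat[i] for w, feat in zip(weights, visual_features))
             for i in range(len(visual_features[0]))]
    return fused, weights

text_query = [1.0, 0.0]                                  # toy text-encoder output
visual_features = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]   # toy patch features
fused, weights = cross_attention(text_query, visual_features)
print([round(w, 3) for w in weights])
```

The patch most similar to the text query receives the largest attention weight, which is exactly the behavior the task heads rely on for prompt-guided localization.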
## 📊 Training Details

- Framework: PyTorch
- Optimizer: AdamW
- Loss: Cross-Entropy / Contrastive Loss
- Training Strategy: Supervised Learning
- Hardware: GPU-based Training

---

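The contrastive loss listed above is commonly instantiated as an InfoNCE-style objective: the similarity between a text prompt and its matching image region should dominate the similarities to non-matching regions. A rough stdlib-only illustration (toy scores; not the repository's actual loss code):

```python
import math

def info_nce_loss(sim_scores, positive_index, temperature=0.07):
    """InfoNCE-style contrastive loss over similarity scores:
    minimized when the positive pair's score dominates the negatives'."""
    logits = [s / temperature for s in sim_scores]
    # log-sum-exp computed stably, then subtract the positive logit.
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_sum - logits[positive_index]

# Toy similarities between one text prompt and three image regions;
# index 0 is the matching (positive) region.
good = info_nce_loss([0.9, 0.1, 0.2], positive_index=0)  # well aligned
bad = info_nce_loss([0.1, 0.9, 0.2], positive_index=0)   # misaligned
print(good < bad)
```

Lower loss for the well-aligned case is what drives the encoders toward better referring-expression grounding during training.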
## 🚀 Usage

### Install Dependencies

```bash
pip install torch torchvision transformers
```