athul020 commited on
Commit
163d6b3
·
verified ·
1 Parent(s): 0ff0b22

Update README.md

Browse files

PromptRIS (Prompt-Driven Referring Image Segmentation) is a multimodal deep learning model designed to segment objects in an image based on natural language prompts.

Unlike traditional segmentation models that rely only on visual cues, PromptRIS integrates vision and language understanding to localize and segment objects referred to by descriptive text (e.g., “the man wearing a red shirt”).

The model combines:

CLIP for text–image semantic alignment

Segment Anything Model for high-quality mask generation

Cross-modal attention fusion for prompt-guided feature refinement

Contrastive learning to improve referring expression grounding

PromptRIS is capable of:

🖼 Referring Expression Segmentation

🔍 Prompt-guided Object Localization

🧠 Cross-modal Visual Reasoning

🎯 Fine-grained Instance Segmentation

The architecture enables precise mask prediction by aligning textual embeddings with visual feature maps before mask refinement through SAM.

This repository provides:

Trained PromptRIS model weights

SAM checkpoint integration

Inference pipeline

Reproducible training structure

PromptRIS is suitable for research, AI assistants, interactive vision systems, and advanced human–AI visual interaction tasks.

Files changed (1) hide show
  1. README.md +66 -3
README.md CHANGED
@@ -1,3 +1,66 @@
1
- ---
2
- license: mit
3
- ---
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ ---
2
+ license: mit
3
+ language:
4
+ - en
5
+ tags:
6
+ - deep-learning
7
+ - computer-vision
8
+ - vision-language
9
+ - segmentation
10
+ - multimodal
11
+ - pytorch
12
+ library_name: pytorch
13
+ ---
14
+
15
+ # DEEP – Vision-Language Intelligence Framework
16
+
17
+ ## 🔥 Overview
18
+
19
+ DEEP is a multimodal AI framework that integrates computer vision and language understanding to perform intelligent visual reasoning tasks.
20
+
21
+ The system is designed for:
22
+
23
+ - 🧠 Vision-Language Understanding
24
+ - 🖼 Image Segmentation
25
+ - 📝 Visual Question Answering
26
+ - 🔍 Prompt-driven Object Localization
27
+ - 🤖 AI Agent-based Visual Reasoning
28
+
29
+ This repository contains model weights, training scripts, and inference pipeline.
30
+
31
+ ---
32
+
33
+ ## 🏗 Architecture
34
+
35
+ The architecture integrates:
36
+
37
+ - Vision Encoder (CNN / ViT)
38
+ - Text Encoder (Transformer-based)
39
+ - Cross-Modal Attention Fusion
40
+ - Task-specific Heads (Segmentation / QA / Classification)
41
+
42
+ Pipeline Flow:
43
+
44
+ Image → Vision Encoder
45
+ Text Prompt → Text Encoder
46
+ Fusion → Cross Attention
47
+ Output → Task Head
48
+
49
+ ---
50
+
51
+ ## 📊 Training Details
52
+
53
+ - Framework: PyTorch
54
+ - Optimizer: AdamW
55
+ - Loss: Cross-Entropy / Contrastive Loss
56
+ - Training Strategy: Supervised Learning
57
+ - Hardware: GPU-based Training
58
+
59
+ ---
60
+
61
+ ## 🚀 Usage
62
+
63
+ ### Install Dependencies
64
+
65
+ ```bash
66
+ pip install torch torchvision transformers