title: Vnitx Image
emoji: 🐢
colorFrom: blue
colorTo: blue
sdk: docker
pinned: false
license: mit
short_description: AI vs Human Image Prompt Detection

Multimodal Visual Security Engine (EasyOCR + ONNX DeBERTa + CLIP + BLIP)

System Architecture

graph TD
    Input[Input: Image/Video Frame] --> Split{Parallel Process}

    %% Engine D Logic
    Split --> EngineD[Engine D: Prompt Injection]
    EngineD --> OCR[EasyOCR: Extract Text]
    OCR --> Norm[Normalization Layer]
    Norm --> InjectModel[DeBERTa Prompt Injection - ONNX]
    InjectModel --> ThreatCheck{Threat Dictionary - aux}
    ThreatCheck --> RiskScore[Risk Score + Reason]

    %% Engine E Logic
    Split --> EngineE[Engine E: Cross-Modal]
    EngineE --> BLIP[BLIP: Image Caption]
    InputAudio[Input: Audio Transcript] --> CLIP_Text[CLIP Text Encoder]
    EngineE --> CLIP_Img[CLIP Image Encoder]
    CLIP_Text --> Cosine[Cosine Similarity Calc]
    CLIP_Img --> Cosine
    Cosine --> Threshold{Is Score < 0.18?}
    Threshold -- Yes --> Mismatch[Status: MISMATCH - Deepfake]
    Threshold -- No --> Match[Status: MATCH - Genuine]

Engine D (Visual Prompt Injection)
Engine D combines OCR-based text extraction with ML classification. EasyOCR extracts visible or hidden text (using CLAHE and Otsu binarization to recover low-contrast regions), a normalization layer de-obfuscates tokens, and a DeBERTa prompt-injection classifier running on ONNX Runtime scores the risk. A small threat dictionary supplies auxiliary evidence for the reason string; it is not the primary detector.
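
The normalization and scoring steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: `LEET_MAP`, `THREAT_TERMS`, `normalize`, `assess`, and the `0.5` threshold are all hypothetical, and `model_score` stands in for whatever probability the DeBERTa ONNX model returns.

```python
import re
import unicodedata

# Hypothetical lookup tables; the project's real tables are not shown in this README.
LEET_MAP = str.maketrans(
    {"0": "o", "1": "i", "3": "e", "4": "a", "5": "s", "7": "t", "@": "a", "$": "s"}
)
ZERO_WIDTH = re.compile(r"[\u200b\u200c\u200d\u2060\ufeff]")
THREAT_TERMS = ("ignore previous instructions", "system prompt", "disregard")


def normalize(text: str) -> str:
    """Fold common obfuscations so the classifier sees plain lowercase text."""
    text = unicodedata.normalize("NFKC", text)  # fold full-width and compatibility forms
    text = ZERO_WIDTH.sub("", text)             # drop invisible zero-width characters
    text = text.lower().translate(LEET_MAP)     # undo simple leetspeak substitutions
    return re.sub(r"\s+", " ", text).strip()    # collapse runs of whitespace


def assess(ocr_text: str, model_score: float, threshold: float = 0.5) -> dict:
    """Combine a classifier score with dictionary hits into a risk verdict.

    model_score is a placeholder for the DeBERTa ONNX output; threshold is an
    assumed value, not one taken from the project.
    """
    clean = normalize(ocr_text)
    hits = [term for term in THREAT_TERMS if term in clean]
    flagged = model_score >= threshold          # only the model score decides
    reason = f"classifier score {model_score:.2f}"
    if hits:
        reason += "; matched terms: " + ", ".join(hits)  # auxiliary evidence only
    return {"text": clean, "flagged": flagged, "risk_score": model_score, "reason": reason}
```

In this sketch the dictionary only enriches the reason string: `assess("system prompt", 0.10)` is not flagged despite the dictionary hit. The leetspeak table is deliberately naive and would also rewrite legitimate digits, so a real normalization layer would need to be more selective.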

Engine E (Cross-Modal Consistency)
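Semantic rather than OCR-based. CLIP (ViT-B/32) embeds both the video frame and the audio transcript into a shared vector space, and the cosine similarity between the two embeddings indicates whether the visual context matches the spoken context: a score below 0.18 is reported as MISMATCH (deepfake), anything else as MATCH (genuine). In addition, BLIP generates an image caption, which is compared with the OCR text to detect prompt/scene misalignment.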
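
The threshold decision can be illustrated with plain Python. This is a sketch under stated assumptions: the embeddings are ordinary float sequences standing in for CLIP outputs, the function names are hypothetical, and only the 0.18 cut-off comes from the architecture diagram.

```python
import math
from typing import Sequence

MISMATCH_THRESHOLD = 0.18  # cut-off shown in the architecture diagram


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # assumption: a zero vector is treated as having no similarity
    return dot / (norm_a * norm_b)


def consistency_status(
    image_emb: Sequence[float],
    text_emb: Sequence[float],
    threshold: float = MISMATCH_THRESHOLD,
) -> str:
    """Return "MISMATCH" when similarity falls below the threshold, else "MATCH"."""
    score = cosine_similarity(image_emb, text_emb)
    return "MISMATCH" if score < threshold else "MATCH"
```

For example, two orthogonal toy vectors such as `[1.0, 0.0]` and `[0.0, 1.0]` have similarity 0 and yield `"MISMATCH"`, while identical vectors yield `"MATCH"`. Real CLIP embeddings have hundreds of dimensions, and in practice the similarity would usually be computed with a tensor library rather than pure Python.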

Quick Start

# Install dependencies
pip install -r requirements.txt

# Run the Visual Engine Test
python -m src.engines.visual_engine