
---
title: Contextual Communication Demo
emoji: πŸ“‘
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: 6.2.0
app_file: app.py
pinned: false
---

Contextual Communication Demo

An interactive demo for contextual communication in bandwidth-degraded environments (e.g., ISR collection from drones). The core idea is context-aware compression: transmit an extremely compact latent representation while ensuring the decoded output remains useful for downstream decision-making (e.g., object detection).

This repository implements contextual spatial compression for EO/IR-style imagery using an ROI-aware learned image compression model (TIC-style VAE) guided by segmentation masks.

Features

  • Contextual (ROI) compression: preserves fidelity in mission-relevant regions while aggressively compressing non-relevant background.
  • Mission-driven context extraction: map a mission prompt to ROI masks via multiple segmentation strategies:
    • Class-based segmentation (SegFormer / YOLO / Mask2Former / Mask R-CNN)
    • Prompt/referring segmentation (SAM3)
    • Optional object detection overlays to visualize task retention on the decoded image
  • Two operator knobs for bandwidth adaptation:
    • Background preservation ($\sigma$, 0.01–1.0): lower = more background degradation
    • Transmission quality (checkpoint/lambda selection): higher = larger payload / better reconstruction
  • Visualization: compare input vs decoded output and optionally highlight context regions.
  • CLI tools for segmentation, ROI compression, and before/after detection retention.

Setup

pip install -r requirements.txt

Checkpoints are expected under checkpoints/ (e.g., checkpoints/tic_lambda_0.0483.pth.tar).

By default, model weights/caches downloaded by detection/segmentation backends are also stored under checkpoints/:

  • Hugging Face models under checkpoints/hf/
  • Torch/torchvision weights under checkpoints/torch/
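The routing itself is typically achieved by pointing the frameworks' standard cache environment variables at checkpoints/ before any backend is imported. A minimal sketch of the idea (illustrative only; the actual model_cache.py may differ in detail):

```python
import os
from pathlib import Path

# Illustrative cache routing (model_cache.py may differ): point the standard
# cache environment variables at checkpoints/, usually before importing any
# detection/segmentation backend.
CACHE_ROOT = Path("checkpoints")
os.environ["HF_HOME"] = str(CACHE_ROOT / "hf")        # Hugging Face models
os.environ["TORCH_HOME"] = str(CACHE_ROOT / "torch")  # torch/torchvision weights

# Create the cache directories up front so first-use downloads have a home.
for sub in ("hf", "torch"):
    (CACHE_ROOT / sub).mkdir(parents=True, exist_ok=True)
```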

Usage

Interactive Demo (Hugging Face Spaces / Local)

This repo includes a Gradio app intended for Hugging Face Spaces (app_file: app.py). To run locally:

python app.py

In the UI:

  • Enter a Mission and choose a Context Extraction Method (ROI).
  • Tune the two knobs to match bandwidth constraints:
    • Transmission quality (checkpoint selection)
    • Background preservation ($\sigma$)
  • Optionally enable object detection overlays.

Note: the app's Video tab is currently a placeholder (inactive in the UI); video processing is available through the programmatic API.

CLI: Contextual Spatial Compression (Images)

python roi_compressor.py \
    --input data/images/car/0016cf15fa4d4e16.jpg \
    --output results/compressed.jpg \
    --checkpoint checkpoints/tic_lambda_0.0483.pth.tar \
    --sigma 0.3 \
    --seg-method yolo \
    --seg-classes car \
    --highlight

Key arguments:

  • --input: path to the input image
  • --output: path for the compressed output
  • --checkpoint: path to the model checkpoint
  • --sigma: background quality factor (lower = more compression; default 0.3)
  • --lambda: rate-distortion tradeoff parameter (default 0.0483)
  • --seg-method: segformer, yolo, mask2former, or maskrcnn (default segformer)
  • --seg-classes: classes to treat as ROI (e.g., car, person)
  • --highlight: save a comparison grid with the ROI highlighted
  • --load-mask: bypass segmentation using a precomputed mask
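One way to build intuition for --sigma: imagine a per-pixel quality map that is 1.0 inside the ROI mask and sigma outside. The model's actual conditioning is learned, so this is purely illustrative:

```python
def quality_map(mask, sigma):
    """Illustrative per-pixel quality weights: 1.0 in the ROI, sigma elsewhere.

    mask  -- 2D list of 0/1 ROI values
    sigma -- background quality in (0.01, 1.0]; lower = harsher background
    """
    return [[1.0 if m else sigma for m in row] for row in mask]

roi = [[0, 1, 1],
       [0, 1, 0]]
print(quality_map(roi, 0.3))
# [[0.3, 1.0, 1.0], [0.3, 1.0, 0.3]]
```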

CLI: Segmentation Only

python roi_segmenter.py \
    --input data/images/car/0016cf15fa4d4e16.jpg \
    --output results/mask.png \
    --method segformer \
    --classes car \
    --visualize

Prompt-based segmentation (SAM3):

python roi_segmenter.py \
    --input data/images/car/0016cf15fa4d4e16.jpg \
    --output results/mask.png \
    --method sam3 \
    --prompt "a car" \
    --visualize

CLI: Detection Retention (Before vs After)

Compare original vs already-compressed:

python roi_detection_eval.py \
    --before data/images/car/0016cf15fa4d4e16.jpg \
    --after results/compressed.jpg \
    --detectors yolo fasterrcnn detr \
    --viz-dir results/det_viz

Or generate the "after" image via ROI compression and then evaluate:

python roi_detection_eval.py \
    --before data/images/car/0016cf15fa4d4e16.jpg \
    --checkpoint checkpoints/tic_lambda_0.0483.pth.tar \
    --sigma 0.3 \
    --seg-method yolo --seg-classes car \
    --detectors yolo fasterrcnn \
    --save-after results/after.jpg \
    --viz-dir results/det_viz

Open-vocabulary example (YOLO-World):

python roi_detection_eval.py \
    --before data/images/person/kodim04.png \
    --checkpoint checkpoints/tic_lambda_0.0483.pth.tar \
    --sigma 0.3 \
    --seg-method yolo --seg-classes person \
    --detectors yolo_world \
    --open-vocab-classes "person,car" \
    --viz-dir results/det_viz
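Conceptually, the retention score asks how many boxes detected on the original image are still found on the decoded one. A minimal sketch of the idea via greedy IoU matching (not necessarily how roi_detection_eval.py computes its metrics):

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

def retention(before, after, thr=0.5):
    """Fraction of 'before' boxes greedily matched to an 'after' box at IoU >= thr."""
    unmatched = list(after)
    kept = 0
    for box in before:
        hit = next((c for c in unmatched if iou(box, c) >= thr), None)
        if hit is not None:
            unmatched.remove(hit)  # each 'after' box can match at most once
            kept += 1
    return kept / len(before) if before else 1.0

# e.g., one of two original detections survives compression:
# retention([(0, 0, 10, 10), (20, 20, 30, 30)], [(1, 1, 10, 10)]) -> 0.5
```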

Project Structure

.
β”œβ”€β”€ app.py                    # Gradio demo (Hugging Face Spaces)
β”œβ”€β”€ README.md
β”œβ”€β”€ requirements.txt
β”œβ”€β”€ model_cache.py            # Cache routing to `checkpoints/`
β”œβ”€β”€ roi_compressor.py         # CLI: contextual (ROI) image compression
β”œβ”€β”€ roi_segmenter.py          # CLI: ROI mask generation
β”œβ”€β”€ roi_detection_eval.py     # CLI: before/after detection retention
β”œβ”€β”€ segmentation/             # Segmenters + factory
β”œβ”€β”€ detection/                # Detectors + factory
β”œβ”€β”€ vae/                      # ROI-aware TIC model + compression utils
β”œβ”€β”€ checkpoints/              # Compression checkpoints + model caches
β”œβ”€β”€ data/images/              # Sample images
β”œβ”€β”€ examples.sh               # Example CLI commands
└── _segmentation_comparison.ipynb

Modular API

Segmentation:

from PIL import Image

from segmentation import create_segmenter

image = Image.open("data/images/car/0016cf15fa4d4e16.jpg")

segmenter = create_segmenter("yolo", device="cuda", conf_threshold=0.3)
mask = segmenter(image, target_classes=["car", "person"])

Compression:

from vae import load_checkpoint, compress_image

model = load_checkpoint("checkpoints/tic_lambda_0.0483.pth.tar", device="cuda")
out = compress_image(image, mask, model, sigma=0.3, device="cuda")
compressed = out["compressed"]
bpp = out["bpp"]

Notes

  • OpenCV is included via opencv-python-headless (recommended for server/Spaces environments).
  • Some backends download weights on first use; caches are routed under checkpoints/.
  • Output directories like results/ are created at runtime by the CLIs.


Object Detection (New)

An extensible object detection module is available in detection/, with multiple backends implemented:

  • YOLO (Ultralytics)
  • YOLO-World (Ultralytics, open-vocabulary)
  • Faster R-CNN (torchvision)
  • RetinaNet (torchvision)
  • SSD (torchvision)
  • FCOS (torchvision)
  • DETR (transformers)
  • Deformable DETR (transformers, if supported by your installed version)
  • EfficientDet (optional, requires effdet)
  • Grounding DINO (transformers, open-vocabulary)

Open-vocabulary detectors (YOLO-World / Grounding DINO) require text prompts/classes at runtime.

Open-vocabulary example (Grounding DINO):

python roi_detection_eval.py \
    --before data/images/car/0016cf15fa4d4e16.jpg \
    --checkpoint checkpoints/tic_lambda_0.0483.pth.tar \
    --sigma 0.3 \
    --seg-method yolo --seg-classes car \
    --detectors grounding_dino \
    --open-vocab-classes "car,person" \
    --viz-dir results/det_viz

Programmatic API

The application exposes a Gradio API for programmatic access to all features:

Image API

  • /segment - Segment image β†’ mask or overlay
  • /compress - Compress image with optional ROI mask
  • /detect - Run object detection β†’ JSON or overlay
  • /process - Full pipeline: segment β†’ compress β†’ detect

Video API (Buffered)

  • /segment_video - Segment video β†’ mask file or overlay video
  • /compress_video - Compress video with optional cached masks
  • /detect_video - Run detection on video β†’ JSON or overlay video
  • /process_video - Full pipeline with static/dynamic modes

Video API (Streaming, New)

  • /stream_process_video - Stream compressed chunks progressively (HLS-style)
  • /stream_compress_video - Stream chunks with pre-computed masks

Key difference: streaming endpoints yield chunks as they are produced (low latency, roughly one second to the first chunk) instead of buffering the entire video, which makes them well suited to real-time streaming applications.
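The difference between the buffered and streaming endpoints is the standard generator pattern: rather than returning one result after the whole video is processed, a streaming endpoint yields each chunk as soon as it is encoded. A toy sketch of that distinction (encode is a hypothetical stand-in for the real per-chunk compression):

```python
def encode(chunk):
    # Hypothetical stand-in for ROI-compressing a chunk of frames.
    return {"frames": len(chunk)}

def process_buffered(frames, chunk_size=8):
    """Buffered: nothing is returned until every chunk is done."""
    chunks = []
    for i in range(0, len(frames), chunk_size):
        chunks.append(encode(frames[i:i + chunk_size]))
    return chunks  # caller waits for the whole video

def process_streaming(frames, chunk_size=8):
    """Streaming: each chunk is yielded as soon as it is encoded."""
    for i in range(0, len(frames), chunk_size):
        yield encode(frames[i:i + chunk_size])  # caller sees ~one-chunk latency
```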

See API.md for complete documentation with examples.
See STREAMING_API.md for streaming API guide and comparison.

Quick Example

import json

from gradio_client import Client, handle_file

client = Client("http://localhost:7860")

# Image: segment β†’ compress β†’ detect
compressed, mask, bpp, ratio, coverage, detections = client.predict(
    handle_file("image.jpg"),
    "car, person",  # mission prompt
    "sam3",         # ROI method
    4,              # quality level (1-5)
    0.3,            # sigma (background preservation)
    True,           # run detection
    "yolo",         # detection method
    "",             # detection classes
    api_name="/process"
)

# Video: streaming compression (chunk-by-chunk)
chunk_stream = client.submit(
    handle_file("video.mp4"),
    "person, car",
    "sam3", "static",
    4, 0.3, 15.0,
    api_name="/stream_process_video"
)

for chunk_json in chunk_stream:
    chunk = json.loads(chunk_json)
    if chunk.get("status") == "complete":
        break
    print(f"Chunk {chunk['chunk_index']}: {len(chunk['frames'])} frames")

JavaScript/Frontend Integration

Streaming also works from JavaScript: the @gradio/client package supports async iterators for consuming streamed output:

import { Client } from "@gradio/client";

const client = await Client.connect("http://localhost:7860");
const stream = client.submit("/stream_process_video", {
  video_path: videoFile,
  prompt: "person, car",
  segmentation_method: "sam3",
  mode: "static",
  quality: 4,
  sigma: 0.3,
  output_fps: 15.0,
  frame_format: "jpeg",
  frame_quality: 85
});

for await (const msg of stream) {
  const chunk = JSON.parse(msg.data);
  if (chunk.status === "complete") break;
  
  // Display frames immediately
  displayFrame(`data:image/jpeg;base64,${chunk.frames[0]}`);
}

Complete examples are available in STREAMING_API.md, along with the detailed streaming guide.