Spaces:
Running
Running
metadata
title: SemSorter
emoji: ♻️
colorFrom: blue
colorTo: indigo
sdk: docker
app_port: 7860
pinned: false
SemSorter — AI Hazard Sorting System
Real-time robotic arm simulation controlled by a multimodal AI agent using the Vision-Agents SDK by GetStream.
🤖 Overview
SemSorter is an AI-powered hazardous waste sorting system where a Franka Panda robotic arm, simulated in MuJoCo, is controlled by a multimodal AI agent. The agent:
- Watches the conveyor belt via a live camera feed
- Detects hazardous items (flammable / chemical) using Gemini VLM
- Plans and executes pick-and-place operations via Gemini LLM function-calling
- Speaks back results using ElevenLabs TTS
- Listens to voice commands via Deepgram STT
All orchestration uses the Vision-Agents SDK by GetStream.
🏗 Architecture
Browser ←─── WebSocket ───→ FastAPI Server
│
Vision-Agents SDK Agent
┌─────────┴──────────┐
gemini.LLM deepgram.STT
(tool-calling) (voice→text)
│
VLM Bridge
│
MuJoCo Sim (Franka Panda)
🚀 Quick Start
Prerequisites
- Python 3.10+
- MuJoCo 3.x
- EGL (headless GPU rendering)
Local Setup
# Clone
git clone https://github.com/KaustubhUp025/SemSorter.git
cd SemSorter
# Install dependencies
pip install -r requirements-server.txt
# Configure API keys
cp .env.example .env
# Edit .env with your keys:
# GOOGLE_API_KEY, DEEPGRAM_API_KEY, ELEVENLABS_API_KEY
# STREAM_API_KEY, STREAM_API_SECRET
# Run
MUJOCO_GL=egl uvicorn SemSorter.server.app:app --host 0.0.0.0 --port 8000
# Open http://localhost:8000
Voice Agent (Vision-Agents SDK CLI)
cd Vision-Agents
MUJOCO_GL=egl uv run python ../SemSorter/agent/agent.py run
📦 Project Structure
SemSorter/
├── SemSorter/
│ ├── simulation/
│ │ ├── controller.py # MuJoCo sim + IK + pick-and-place
│ │ └── semsorter_scene.xml # MJCF scene (Panda + conveyor + bins)
│ ├── vision/
│ │ ├── vision_pipeline.py # Gemini VLM hazard detection
│ │ └── vlm_bridge.py # VLM → sim item matching
│ ├── agent/
│ │ ├── agent.py # Vision-Agents SDK agent
│ │ └── semsorter_instructions.md
│ └── server/
│ ├── app.py # FastAPI + WebSocket video stream
│ ├── agent_bridge.py # SDK bridge + quota detection
│ └── static/index.html # Web UI
├── Vision-Agents/ # GetStream Vision-Agents SDK
├── Dockerfile
├── render.yaml
└── requirements-server.txt
🔑 API Keys Required
| Service | Purpose | Free tier |
|---|---|---|
| Google Gemini | LLM orchestration + VLM detection | 15 RPM |
| Deepgram | Speech-to-Text | 45 min/month |
| ElevenLabs | Text-to-Speech | ~10k chars/month |
| GetStream | Real-time video call (Voice agent) | Free tier available |
API exhaustion handling: The server detects quota errors (
429 / ResourceExhausted) and automatically switches to demo-mode per service, showing a banner in the UI.
🐳 Deploy to Render
- Fork this repo
- Create a new Web Service on Render.com pointing to your fork
- Add your API keys as Environment Variables in the Render dashboard
- Done — Render auto-deploys from
render.yaml