SAM 3 / 3.1 + Sapiens Pose ZeroGPU
Track objects in videos and annotate images with text prompts
Generate semantic embeddings for videos and text
Analyze images or videos to get descriptions, answers, and OCR text
Classify actions in videos with custom or default categories
Transcribe audio to text with precise word timestamps
Find when events happen in videos, count them, and get a timeline