# Saken OmniFF — Architecture Whitepaper **Full name:** Saken OmniFF **Runtime name:** OmniFF Runtime **Kazakh-first variant:** OmniFF-KZ **Core formula:** FFmpeg for AI inference, generation, and multimodal transformation **Document status:** architectural doctrine **Purpose:** single canonical document for design, implementation, publication, and explanation --- ## Naming Policy | Surface | Identifier | |---------|-----------| | CLI binary | `omniff` | | Python package | `saken-omniff` | | Rust crates | `omniff-core`, `omniff-graph`, `omniff-runtime`, `omniff-cli` | | NPM package | `@saken/omniff` | | GitHub | `stukenov/omniff` | | Hugging Face | `stukenov/omniff-runtime` | All public APIs, imports, configs, and docs must use these canonical names. No aliases. --- ## 1. Summary OmniFF Runtime is not a neural network model. It is a universal multimodal runtime that accepts any input — text, audio, image, video, documents, structured data — and transforms them into any output modality through a managed graph of models, filters, validators, and planners. Architecture inspired by FFmpeg: ``` container → demuxer → decoder → filtergraph → encoder → muxer ``` For AI this becomes: ``` input container → demuxer → modality decoder → OmniFrame normalization → Thinking+ planner → AI filtergraph → model experts → validators → output encoder → muxer ``` Key distinction from ordinary LLM systems: the model is not the center of the product. The model is one computational node. The center is the runtime, the graph, and the unified format for processing multimodal streams. ### Full modality matrix ``` text → text, image, video, audio image → text, image, video video → text, image, video audio → text, audio, video document → text, document mixed input → mixed output ``` The system does not pretend to be one monolithic model. It is a unified runtime with many specialized models inside. --- ## 2. Core Doctrine ### 2.1. This is a runtime, not a model OmniFF Runtime is not one Transformer, not one `.safetensors`, not one decoder-only LLM. Correct definitions: ``` Omni inference runtime Multimodal graph engine AI media processing framework Routed multimodal model system FFmpeg-like AI runtime ``` A model is only one type of node inside the runtime. ### 2.2. FFmpeg Principle FFmpeg became foundational not because it was one codec. It became foundational because it gave a common language for containers, streams, codecs, filters, transformations, and output. OmniFF Runtime does the same for AI: ``` media streams → AI streams frames → OmniFrames filters → AI filters codecs → model experts filtergraph → OmniGraph metadata → prompt/control side data muxing → multimodal output assembly ``` ### 2.3. Models are codecs ``` LLM = text codec / reasoning codec Whisper = audio perception codec VLM = vision perception codec Image diffusion = image generation codec Video diffusion = video generation codec TTS = speech generation codec Encoder router = routing codec OCR = document perception codec ``` A model does not control the system. A model executes its role in the graph. --- ## 3. Architectural Laws ### Law 1. Runtime over model The product must not be hostage to one model, one provider, one tokenizer, one inference engine, or one weight format. Models can be swapped. The runtime must remain. ### Law 2. Graph over pipeline The system must not be a collection of hardcoded pipeline scripts. All transformations must be expressed through a graph. Wrong: ``` if image then do this if video then do that if audio then do another script ``` Right: ``` input → graph planner → DAG → execution → validation → output ``` ### Law 3. Prompt is a control layer, not just a string Prompt must be represented as structured side data: - user prompt - system prompt - task prompt - modality prompt - generation prompt - negative prompt - control prompt - validator prompt - constraints, seed, strength, masks - reference assets, preservation rules - thinking budget ### Law 4. Thinking+ is a planner, not a final answer Thinking+ is a control module that builds an execution plan, selects a graph, assigns models, sets constraints, launches validators, and decides on retry. The user receives a brief execution summary, not the internal chain of reasoning. ### Law 5. Router must be cheap Router must not be a large LLM. Router must be an encoder-only or other cheap classifier. ``` prompt / normalized semantic state → encoder-only classifier → selected model / route / graph ``` Router does not generate answers. Router selects routes. ### Law 6. Do not run all models always Wrong: ``` 0.6B → 4B → 14B → 32B always ``` Right: ``` router → minimum sufficient model ``` Only on failure: ``` fallback / escalation / validation retry ``` ### Law 7. Omni-directions require separate generative branches An LLM or VLM alone cannot close image-to-image and video-to-video. Separate branches needed: - image generation, editing, inpainting, ControlNet-like control - video generation, video-to-video, temporal consistency - audio generation, TTS - document rendering ### Law 8. One external product, many internal experts Outside: one API, one CLI, one SDK, one HF repo, one Docker. Inside: modular — ``` router, ASR, VLM, LLM, image generator, video generator, TTS, OCR, document parser, validators, scheduler ``` ### Law 9. Architectural honesty over marketing Cannot pretend this is one monolithic neural network. ``` A FFmpeg-like multimodal AI runtime with routed model experts and thinking-controlled graph execution. ``` --- ## 4. Core Architecture ### 4.1. Top-level flow ``` User input ↓ Input container ↓ Demuxer ↓ Modality decoder ↓ OmniPacket / OmniFrame ↓ Normalization ↓ Thinking+ Planner ↓ Router ↓ OmniGraph ↓ Model/filter execution ↓ Validators ↓ Output encoder ↓ Muxer ↓ Final output ``` ### 4.2. Core libraries By analogy with FFmpeg: | Library | Responsibility | |---------|---------------| | `libomniformat` | input/output containers, demux/mux | | `libomnimodel` | model loading and execution | | `libomnifilter` | AI filters and transformations | | `libomnigraph` | DAG planning and execution | | `libomnimemory` | tensors, frames, cache, device placement | | `libomnivalidate` | validators, critics, constraint checks | | `libomnischedule` | scheduling, batching, GPU/CPU placement | | `libomniapi` | CLI, SDK, HTTP API | ### 4.3. Runtime entities #### OmniPacket Raw input fragment: ``` text chunk, audio bytes, video packet, image bytes, PDF page, JSON message, subtitle segment, metadata block ``` #### OmniFrame Normalized processing object: ``` text tokens, audio PCM, image tensor, video frame, embedding, mask, depth map, pose map, transcript, OCR layer, scene graph, semantic state, control map ``` #### OmniGraph DAG describing task execution: ``` nodes = models / filters / tools / validators edges = data dependencies side data = prompts / controls / constraints ``` #### OmniNode One executable node: ``` ASR node, VLM node, LLM node, image generation node, video generation node, OCR node, validator node, ffmpeg utility node, scheduler node ``` #### OmniModel Model wrapper: ``` load, unload, infer, generate, stream, batch, quantize, cache ``` #### OmniFilter Data transformation: ``` resize, crop, normalize, extract_depth, extract_edges, extract_pose, detect_faces, track_objects, split_shots, summarize, translate, style_transfer ``` #### OmniValidator Result verification: ``` language check, schema check, visual prompt adherence, face preservation, temporal consistency, OCR correctness, toxicity/safety check, factuality check, format validation ``` --- ## 5. Routing ### 5.1. Router role Router does not answer the user. Router selects: - which graph template to use - which models to invoke - which thinking level to enable - which validator is needed - which escalation policy applies ### 5.2. Encoder-only router Preferred architecture: ``` XLM-R / ModernBERT / BGE-style encoder + classification head → route class ``` Output: ```json { "selected_route": "image_to_image", "confidence": 0.87, "risk": "low", "thinking": "normal" } ``` ### 5.3. Route classes ``` TEXT_SIMPLE TEXT_NORMAL TEXT_COMPLEX AUDIO_TRANSCRIBE_ONLY AUDIO_QA IMAGE_CAPTION IMAGE_EDIT TEXT_TO_IMAGE TEXT_TO_VIDEO IMAGE_TO_VIDEO VIDEO_SUMMARY VIDEO_TO_VIDEO DOCUMENT_OCR_QA DOCUMENT_TO_DOCUMENT REJECT_OR_HUMAN_REVIEW ``` ### 5.4. Model ladder (text/reasoning) If using Qwen family as text/reasoning backbone: ``` Qwen3-0.6B router-assistant / cheap tasks Qwen3-4B normal assistant Qwen3-14B hard tasks Qwen3-32B local high-quality / judge ``` Production minimum: ``` Qwen3-0.6B (cheap/fast) Qwen3-4B (normal) Qwen3-14B (complex) Qwen3-32B (judge) ``` Router selects the minimum sufficient model. Escalation only on failure. --- ## 6. Thinking+ ### 6.1. Purpose Thinking+ is a control layer for planning and execution control, not just "the model thinks longer." Thinking+ must: 1. Understand the task 2. Determine input/output modalities 3. Select graph template 4. Choose models 5. Assign validators 6. Set constraints 7. Define retry policy 8. Form execution plan ### 6.2. Thinking levels ``` thinking=off fast routing, single pass, minimal checking thinking=fast router + simple graph thinking=normal planner + executor + validator thinking=deep planner + executor + critic + retry thinking=research multiple candidates + judge + detailed validation ``` ### 6.3. Execution plan example ```json { "task": "video_to_video", "preserve": ["faces", "voice", "camera_structure"], "style": "premium minimal corporate", "required_nodes": [ "shot_detection", "face_tracking", "audio_transcription", "style_transfer_video", "temporal_validator", "audio_mux" ], "risk": "high", "validator": "vlm_video_judge", "retry_policy": "up_to_2" } ``` ### 6.4. User-facing output Internal reasoning chain is never mandatory output. User receives brief route explanation: ```json { "mode": "deep", "route": "image_to_image", "controls": ["depth", "mask", "reference"], "generator": "image_edit_model", "validator": "vision_validator" } ``` --- ## 7. Prompt Control ### 7.1. Prompt as side data Every OmniFrame carries side data: ```json { "prompt": "matte graphite car wrap", "negative_prompt": "cartoon, distorted wheels, wrong car shape", "seed": 42, "strength": 0.35, "preserve_identity": true, "preserve_layout": true, "control_maps": ["depth", "canny", "mask"], "thinking_budget": 2048, "validator_threshold": 0.82 } ``` ### 7.2. Prompt layers ``` system prompt → global behavior user prompt → user intent task prompt → task-specific instructions modality prompt → per-modality hints generation prompt → enriched generation instruction negative prompt → what to avoid control prompt → structural control validator prompt → validation criteria ``` ### 7.3. Image prompt control ```json { "user_prompt": "Make the car matte graphite", "generation_prompt": "black Hyundai Elantra 2021, matte graphite wrap, premium realistic studio lighting", "negative_prompt": "cartoon, damaged car, wrong wheels, deformed body", "preserve": { "car_model": true, "camera_angle": true, "body_shape": true, "background": false }, "strength": 0.38, "controls": ["canny", "depth"] } ``` ### 7.4. Video prompt control ```json { "global_prompt": "cinematic corporate video, clean premium style", "shot_prompts": [ {"shot": 1, "prompt": "slow dolly-in, preserve subject identity"}, {"shot": 2, "prompt": "soft lighting, premium office mood"} ], "negative_prompt": "flickering, face distortion, unstable hands, warped text", "temporal_consistency": "high", "style_strength": 0.45 } ``` --- ## 8. Omni-Directions ### 8.1. Text → Text Purpose: answers, analysis, translation, correction, classification, legal, code, RAG, structured output. ``` text input → language detection → router → LLM expert → validator → text output ``` ### 8.2. Audio → Text Purpose: transcription, speech translation, meeting summary, call center, lectures. ``` audio → VAD/chunking → ASR → transcript cleanup → language correction → LLM/summary/QA → text output ``` ### 8.3. Audio → Audio Purpose: speech-to-speech assistant, dubbing, voice translation, call center automation. ``` audio → ASR → LLM → TTS → audio encoder ``` ### 8.4. Audio → Video Purpose: podcast visualization, music video generation, audio-driven animation. ``` audio → ASR/analysis → scene planner → video generation → audio mux → output ``` ### 8.5. Image → Text Purpose: captioning, OCR, visual QA, document analysis, screenshot understanding. ``` image → image decoder → VLM/OCR → normalized text/scene graph → router → LLM → text output ``` ### 8.6. Text → Image Purpose: image generation, visual concepts, design, ads, UI mockups. ``` text prompt → prompt planner → image generation model → image validator → image encoder ``` ### 8.7. Image → Image Purpose: image editing, stylization, inpainting, outpainting, color change, shape preservation, reference-based generation. ``` image + prompt → image analysis → mask/control extraction → edit planner → image edit model → vision validator → output image ``` Controls: mask, depth, canny, pose, segmentation, reference image, style strength, identity preservation, layout preservation, negative prompt, seed. ### 8.8. Text → Video Purpose: clip generation, ads, storyboard-to-video, presentation videos. ``` text prompt → scene planner → shot list → video generation model → temporal validator → video encoder ``` ### 8.9. Image → Video Purpose: image animation, motion prompt, avatar video, product animation. ``` image + motion prompt → image analysis → motion planner → image-to-video model → temporal validator → video output ``` ### 8.10. Video → Text Purpose: video summary, lecture analysis, surveillance analysis, meeting extraction, content indexing. ``` video → demux audio/video → shot detection → keyframe extraction → ASR → VLM analysis → multimodal summary → text output ``` ### 8.11. Video → Image Purpose: keyframe extraction, thumbnail generation, scene capture. ``` video → shot detection → keyframe selection → VLM analysis → best frame selection → optional image enhancement → image output ``` ### 8.12. Video → Video Purpose: style transfer, enhancement, cinematic transformation, face/body/background preservation, corporate video transformation, generative editing. ``` video → demux → shot detection → keyframe extraction → audio transcription → motion analysis → depth/pose/edge maps → video edit planner → video generation/editing model → temporal consistency filter → audio restoration/mux → video output ``` Controls: global prompt, per-shot prompt, negative prompt, motion strength, style strength, identity preservation, camera preservation, seed per shot, control maps per frame, mask tracks, temporal consistency. ### 8.13. Document → Text Purpose: PDF analysis, contract review, law analysis, table extraction, OCR, document QA. ``` document → parser/OCR → layout extraction → chunks → retrieval/reasoning → text output ``` ### 8.14. Document → Document Purpose: contracts, whitepapers, PRDs, technical specs, government letters, Word/PDF/slides generation. ``` document/input text → structure planner → content generator → format renderer → validator → output document ``` --- ## 9. Image-to-Image Architecture ### 9.1. Why image-to-image is not LLM-only LLM can understand an instruction but must not be the sole image generator. ``` LLM/VLM = understand and plan Image model = generate/edit Validator = check ``` ### 9.2. Image-to-image nodes ``` decode_image analyze_image_with_vlm extract_mask extract_depth extract_edges extract_pose plan_edit_with_thinking run_image_edit_model validate_prompt_adherence validate_preservation encode_image ``` ### 9.3. Example graph ```json { "nodes": [ {"id": "analyze_image", "model": "vlm"}, {"id": "extract_depth", "model": "depth"}, {"id": "extract_mask", "model": "sam"}, {"id": "plan_edit", "model": "llm_thinking"}, {"id": "generate_image", "model": "image_edit"}, {"id": "validate", "model": "vision_validator"} ], "edges": [ ["analyze_image", "plan_edit"], ["extract_depth", "generate_image"], ["extract_mask", "generate_image"], ["plan_edit", "generate_image"], ["generate_image", "validate"] ] } ``` --- ## 10. Video-to-Video Architecture ### 10.1. Why video-to-video is harder than image-to-image Processing each frame independently causes: - flickering - identity loss - face distortion - motion destruction - unstable style - inter-frame artifacts Video-to-video requires temporal consistency. ### 10.2. Video-to-video nodes ``` demux_video_audio decode_video_frames shot_detection keyframe_selection transcribe_audio analyze_keyframes extract_depth_sequence extract_pose_sequence extract_edges_sequence track_faces track_objects plan_shots_with_thinking run_video_edit_model temporal_consistency_filter restore_audio encode_video mux_audio_video validate_video ``` ### 10.3. Example graph ```json { "nodes": [ {"id": "split_video", "tool": "ffmpeg"}, {"id": "analyze_keyframes", "model": "vlm"}, {"id": "transcribe_audio", "model": "asr"}, {"id": "extract_motion", "tool": "optical_flow"}, {"id": "make_control_maps", "models": ["depth", "canny", "pose"]}, {"id": "plan_shots", "model": "llm_thinking"}, {"id": "generate_video", "model": "video_diffusion"}, {"id": "restore_audio", "tool": "ffmpeg"}, {"id": "validate_video", "model": "video_validator"} ] } ``` --- ## 11. Runtime Scheduling ### 11.1. Scheduler as critical layer Without a scheduler the system becomes a slow Python script. Scheduler must manage: - CPU/GPU placement - model loading and unloading - batch processing and streaming - caching and retry - memory pressure and prioritization - long-running jobs and device affinity ### 11.2. Example device distribution ``` CPU: demux, decode, OCR preprocessing, ffmpeg ops, graph planning GPU 0: ASR / Whisper GPU 1: VLM / image analysis GPU 2: LLM / planner / router GPU 3: image/video generation ``` ### 11.3. Model loading policy Not all models should be in memory at all times. ``` hot models: router, small LLM, ASR small warm models: VLM, medium LLM, image edit cold models: video generation, huge judge, rare experts ``` Scheduler capabilities: ``` preload, lazy load, unload, pin to GPU, move to CPU, quantized load, batch requests, reuse cache ``` --- ## 12. CLI ### 12.1. FFmpeg-like CLI ```bash omniff -i input.jpg \ -prompt "make it matte graphite, preserve body and angle" \ -of image \ -o result.png ``` ```bash omniff -i input.mp4 \ -prompt "premium corporate ad style" \ -thinking deep \ -preserve faces,voice,structure \ -strength 0.42 \ -o output.mp4 ``` ```bash omniff -i lesson_audio.mp3 \ -task summarize \ -lang kk \ -model auto \ -o output.md ``` ```bash omniff -i contract.pdf \ -task "find risks and write brief summary" \ -thinking deep \ -o review.docx ``` ### 12.2. Explicit graph CLI ```bash omniff -i input.mp4 \ -graph graphs/video_to_video_premium.yaml \ -prompt "premium Apple-like corporate style" \ -o output.mp4 ``` --- ## 13. SDK / API ### 13.1. Python API ```python from omniff import OmniFFRuntime runtime = OmniFFRuntime.from_pretrained("stukenov/omniff-runtime") result = runtime.run( input="input.mp4", prompt="Video-to-video in premium style, preserve faces", output_modality="video", thinking="deep", controls={ "preserve_identity": True, "preserve_audio": True, "style_strength": 0.45, "temporal_consistency": "high", }, ) result.save("output.mp4") ``` ### 13.2. Planning API ```python graph = runtime.plan( input="car.jpg", prompt="make it matte graphite", output_modality="image", ) print(graph) result = runtime.execute(graph) ``` ### 13.3. HTTP API ``` POST /v1/omniff/run ``` ```json { "input": "file://input.mp4", "prompt": "premium style video", "output_modality": "video", "thinking": "deep", "controls": { "preserve_faces": true, "preserve_voice": true, "style_strength": 0.45 } } ``` --- ## 14. Packaging ### 14.1. Not one safetensors Production packaging: ``` omniff-runtime/ omniff.yaml graph_templates/ models/ router/ asr/ vlm/ llm_small/ llm_large/ image_generator/ video_generator/ tts/ processors/ validators/ runtime/ README.md ``` ### 14.2. omniff.yaml ```yaml name: omniff-runtime version: 0.1 router: type: encoder_classifier path: models/router experts: text_small: type: causal_lm path: models/llm_small text_large: type: causal_lm path: models/llm_large asr: type: speech_to_text path: models/asr vision: type: vision_language path: models/vlm image_edit: type: diffusion_image_edit path: models/image_generator video_edit: type: diffusion_video_edit path: models/video_generator tts: type: text_to_speech path: models/tts ``` ### 14.3. Hugging Face custom architecture For research/demo — HF repo with custom code: ``` configuration_omniff.py modeling_omniff.py processing_omniff.py config.json routing_config.yaml ``` Loading: ```python model = OmniFFRuntime.from_pretrained( "stukenov/omniff-runtime", trust_remote_code=True, ) ``` Production must not depend on loading everything as one `AutoModelForCausalLM`. --- ## 15. Cascade Routing ### 15.1. Principle ``` simple → small model normal → medium model complex → large model critical → judge / human review ``` Saves cost by orders of magnitude on real traffic — large model often never starts. Quality depends on router accuracy. ### 15.2. Escalation flow ``` Router selects minimum sufficient model → model executes → validator checks result → on failure: escalate to stronger model or retry with adjusted controls → on repeated failure: mark as failed or route to human review ``` --- ## 16. Safety and Quality ### 16.1. Validator-first philosophy Every complex graph must have a validator. **Text validators:** ``` language, format, JSON schema, citation, risk, factuality ``` **Image validators:** ``` prompt adherence, identity preservation, layout preservation, NSFW/safety, artifact detection, OCR/text correctness ``` **Video validators:** ``` temporal consistency, face preservation, flicker detection, motion coherence, audio-video sync, prompt adherence ``` ### 16.2. Escalation On validator failure: ``` retry same graph with adjusted controls → or escalate to stronger model → or ask for clarification → or mark as failed → or route to human review ``` --- ## 17. Logging and Router Training ### 17.1. What to log ``` request_id, user_id/tenant_id, input modalities, output modality, prompt hash, language, task_type, selected_route, selected_models, thinking_mode, latency, cost estimate, success/failure, validator scores, fallbacks, retries, user feedback, output metadata ``` ### 17.2. Router training Primary label: **cheapest sufficient route** — the cheapest model/graph that produced acceptable quality. Process: ``` 1. Collect real and synthetic prompts 2. Run through different route/model variants 3. Score outputs with judge model + partial human eval 4. Assign cheapest-sufficient label 5. Train encoder-only classifier 6. Export to ONNX/Candle 7. Embed in runtime 8. Continuously retrain on logs ``` --- ## 18. Technical Stack ### 18.1. Runtime core Start: ``` Python + PyTorch + Transformers + Diffusers + FFmpeg bindings ``` Production: ``` Rust/Go runtime shell Python model workers where needed ONNX/Candle/TensorRT acceleration where justified ``` ### 18.2. No dependency on vLLM/SGLang vLLM and SGLang may be optional backends but never the foundation. ``` OmniGraph owns routing, planning, graph execution, and scheduling. Model backends are replaceable. ``` ### 18.3. Model backend types ``` PyTorch, Transformers, Diffusers, ONNX Runtime, Candle, GGUF/llama.cpp-style, custom CUDA kernels, external API adapter ``` --- ## 19. MVP Roadmap ### v0.1 — Prove the architecture ``` text → text image → text audio → text → text image → image ``` Components: ``` OmniFF CLI OmniFrame / OmniPacket OmniGraph / OmniNode OmniFFRuntime encoder-only router ASR module, VLM module, LLM module image edit module, validator module ``` ### v0.2 ``` text → image video → text image → video document → text ``` ### v0.3 ``` video → video audio → audio document → document multi-pass validation scheduler model hot/warm/cold loading ``` ### v1.0 ``` universal graph planner Thinking+ controller prompt-control side data full modality matrix validators production scheduler CLI + SDK + HTTP API plugin model interface ``` --- ## 20. Product Identity ### Short positioning ``` OmniFF Runtime is a FFmpeg-like multimodal AI processing engine. ``` ### Extended positioning ``` A multimodal graph runtime with encoder-only routing, thinking-controlled planning, and pluggable model experts for text, speech, vision, image generation, video generation, documents, and structured outputs. ``` ### Kazakh-first variant (OmniFF-KZ) ``` OmniFF-KZ is a Kazakh-first multimodal AI runtime that combines Qwen expert hierarchy, ASR, vision, image/video generation, and document intelligence through a unified graph execution engine with native Kazakh language support. ``` --- ## 21. What This Must Not Be OmniFF Runtime must not be: ``` a LangChain pipeline a collection of Python scripts a wrapper over vLLM a chatbot a HuggingFace demo a ComfyUI clone one big safetensors a gateway a multimodal LLM an agent framework ``` It must be: ``` runtime, format, graph engine, model orchestration layer, scheduler, prompt-control system, validator system, CLI/API/SDK ``` --- ## 22. Canonical Formula ``` input → demux → decode → normalize into OmniFrames → plan with Thinking+ → execute graph of AI filters/models → validate → encode → mux → output ``` One repo. One config. One processor. One runtime. One CLI. One API. Many experts. One graph executor. --- ## 23. Conclusion OmniFF Runtime is an infrastructure system of a new class: a multimodal AI runtime that relates to models the way FFmpeg relates to codecs. It does not compete with Qwen, Whisper, VLMs, diffusion models, or TTS. It uses them as interchangeable computational nodes. Its value is not "one model that does everything." Its value is a unified engineering way to build any transformation: ``` text ↔ audio ↔ image ↔ video ↔ document ↔ structured data ``` With control: ``` prompt, negative prompt, thinking, router, models, validators, constraints, schedulers, quality thresholds ``` **Saken OmniFF Runtime:** A FFmpeg-like AI runtime for routed, thinking-controlled, multimodal generation and transformation. Not "one model." A stronger category: **an operating environment for multimodal AI inference and generation.**