Saken OmniFF: Architecture Whitepaper
Full name: Saken OmniFF
Runtime name: OmniFF Runtime
Kazakh-first variant: OmniFF-KZ
Core formula: FFmpeg for AI inference, generation, and multimodal transformation
Document status: architectural doctrine
Purpose: single canonical document for design, implementation, publication, and explanation
Naming Policy
| Surface | Identifier |
|---|---|
| CLI binary | omniff |
| Python package | saken-omniff |
| Rust crates | omniff-core, omniff-graph, omniff-runtime, omniff-cli |
| NPM package | @saken/omniff |
| GitHub | stukenov/omniff |
| Hugging Face | stukenov/omniff-runtime |
All public APIs, imports, configs, and docs must use these canonical names. No aliases.
1. Summary
OmniFF Runtime is not a neural network model. It is a universal multimodal runtime that accepts any input (text, audio, image, video, documents, structured data) and transforms it into any output modality through a managed graph of models, filters, validators, and planners.
Architecture inspired by FFmpeg:
container → demuxer → decoder → filtergraph → encoder → muxer
For AI this becomes:
input container
→ demuxer
→ modality decoder
→ OmniFrame normalization
→ Thinking+ planner
→ AI filtergraph
→ model experts
→ validators
→ output encoder
→ muxer
Key distinction from ordinary LLM systems: the model is not the center of the product. The model is one computational node. The center is the runtime, the graph, and the unified format for processing multimodal streams.
Full modality matrix
text → text, image, video, audio
image → text, image, video
video → text, image, video
audio → text, audio, video
document → text, document
mixed input → mixed output
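The matrix above can be encoded as a lookup table that a planner consults before building a graph. This is a minimal sketch: `MODALITY_MATRIX` and `is_supported` are hypothetical names, not part of any documented OmniFF API, and the "mixed" row is left out because its semantics are not specified here.

```python
# Hypothetical encoding of the modality matrix; names are illustrative only.
MODALITY_MATRIX: dict[str, frozenset[str]] = {
    "text": frozenset({"text", "image", "video", "audio"}),
    "image": frozenset({"text", "image", "video"}),
    "video": frozenset({"text", "image", "video"}),
    "audio": frozenset({"text", "audio", "video"}),
    "document": frozenset({"text", "document"}),
}


def is_supported(source: str, target: str) -> bool:
    """Return True if the matrix lists a source -> target direction."""
    return target in MODALITY_MATRIX.get(source, frozenset())
```

A planner could call `is_supported("audio", "video")` early and reject unsupported directions before any model is loaded.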
The system does not pretend to be one monolithic model. It is a unified runtime with many specialized models inside.
2. Core Doctrine
2.1. This is a runtime, not a model
OmniFF Runtime is not one Transformer, not one .safetensors, not one decoder-only LLM.
Correct definitions:
Omni inference runtime
Multimodal graph engine
AI media processing framework
Routed multimodal model system
FFmpeg-like AI runtime
A model is only one type of node inside the runtime.
2.2. FFmpeg Principle
FFmpeg became foundational not because it was one codec. It became foundational because it gave a common language for containers, streams, codecs, filters, transformations, and output.
OmniFF Runtime does the same for AI:
media streams → AI streams
frames → OmniFrames
filters → AI filters
codecs → model experts
filtergraph → OmniGraph
metadata → prompt/control side data
muxing → multimodal output assembly
2.3. Models are codecs
LLM = text codec / reasoning codec
Whisper = audio perception codec
VLM = vision perception codec
Image diffusion = image generation codec
Video diffusion = video generation codec
TTS = speech generation codec
Encoder router = routing codec
OCR = document perception codec
A model does not control the system. A model executes its role in the graph.
3. Architectural Laws
Law 1. Runtime over model
The product must not be hostage to one model, one provider, one tokenizer, one inference engine, or one weight format. Models can be swapped. The runtime must remain.
Law 2. Graph over pipeline
The system must not be a collection of hardcoded pipeline scripts. All transformations must be expressed through a graph.
Wrong:
if image then do this
if video then do that
if audio then do another script
Right:
input → graph planner → DAG → execution → validation → output
Law 3. Prompt is a control layer, not just a string
Prompt must be represented as structured side data:
- user prompt
- system prompt
- task prompt
- modality prompt
- generation prompt
- negative prompt
- control prompt
- validator prompt
- constraints, seed, strength, masks
- reference assets, preservation rules
- thinking budget
Law 4. Thinking+ is a planner, not a final answer
Thinking+ is a control module that builds an execution plan, selects a graph, assigns models, sets constraints, launches validators, and decides on retry. The user receives a brief execution summary, not the internal chain of reasoning.
Law 5. Router must be cheap
Router must not be a large LLM. Router must be an encoder-only or other cheap classifier.
prompt / normalized semantic state
→ encoder-only classifier
→ selected model / route / graph
Router does not generate answers. Router selects routes.
Law 6. Do not run all models always
Wrong:
0.6B → 4B → 14B → 32B always
Right:
router → minimum sufficient model
Only on failure:
fallback / escalation / validation retry
Law 7. Omni-directions require separate generative branches
An LLM or VLM alone cannot cover image-to-image and video-to-video. Separate generative branches are needed:
- image generation, editing, inpainting, ControlNet-like control
- video generation, video-to-video, temporal consistency
- audio generation, TTS
- document rendering
Law 8. One external product, many internal experts
Outside: one API, one CLI, one SDK, one HF repo, one Docker.
Inside: a modular set of components:
router, ASR, VLM, LLM, image generator, video generator,
TTS, OCR, document parser, validators, scheduler
Law 9. Architectural honesty over marketing
The project must not pretend to be one monolithic neural network. The honest description is:
An FFmpeg-like multimodal AI runtime with routed model experts
and thinking-controlled graph execution.
4. Core Architecture
4.1. Top-level flow
User input
↓
Input container
↓
Demuxer
↓
Modality decoder
↓
OmniPacket / OmniFrame
↓
Normalization
↓
Thinking+ Planner
↓
Router
↓
OmniGraph
↓
Model/filter execution
↓
Validators
↓
Output encoder
↓
Muxer
↓
Final output
4.2. Core libraries
By analogy with FFmpeg:
| Library | Responsibility |
|---|---|
| libomniformat | input/output containers, demux/mux |
| libomnimodel | model loading and execution |
| libomnifilter | AI filters and transformations |
| libomnigraph | DAG planning and execution |
| libomnimemory | tensors, frames, cache, device placement |
| libomnivalidate | validators, critics, constraint checks |
| libomnischedule | scheduling, batching, GPU/CPU placement |
| libomniapi | CLI, SDK, HTTP API |
4.3. Runtime entities
OmniPacket
Raw input fragment:
text chunk, audio bytes, video packet, image bytes,
PDF page, JSON message, subtitle segment, metadata block
OmniFrame
Normalized processing object:
text tokens, audio PCM, image tensor, video frame,
embedding, mask, depth map, pose map, transcript,
OCR layer, scene graph, semantic state, control map
OmniGraph
DAG describing task execution:
nodes = models / filters / tools / validators
edges = data dependencies
side data = prompts / controls / constraints
OmniNode
One executable node:
ASR node, VLM node, LLM node, image generation node,
video generation node, OCR node, validator node,
ffmpeg utility node, scheduler node
OmniModel
Model wrapper:
load, unload, infer, generate, stream,
batch, quantize, cache
OmniFilter
Data transformation:
resize, crop, normalize, extract_depth, extract_edges,
extract_pose, detect_faces, track_objects, split_shots,
summarize, translate, style_transfer
OmniValidator
Result verification:
language check, schema check, visual prompt adherence,
face preservation, temporal consistency, OCR correctness,
toxicity/safety check, factuality check, format validation
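The first two entities can be pictured as plain data containers with a decoder between them. The sketch below is an assumption about shape only: the field names (`modality`, `payload`, `stream_index`, `data`, `side_data`) and the toy `decode_text_packet` function are hypothetical and are not defined anywhere in this document.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass(frozen=True)
class OmniPacket:
    """Raw input fragment as it leaves the demuxer (hypothetical fields)."""
    modality: str            # e.g. "text", "audio", "image"
    payload: bytes
    stream_index: int = 0


@dataclass
class OmniFrame:
    """Normalized processing object passed between graph nodes."""
    modality: str
    data: Any                                        # tokens, PCM, tensor, ...
    side_data: dict[str, Any] = field(default_factory=dict)


def decode_text_packet(packet: OmniPacket) -> OmniFrame:
    """Toy modality decoder: UTF-8 bytes -> whitespace-split tokens."""
    if packet.modality != "text":
        raise ValueError(f"expected a text packet, got {packet.modality!r}")
    text = packet.payload.decode("utf-8")
    return OmniFrame(modality="text", data=text.split())
```

Keeping the packet immutable and the frame mutable reflects one possible design choice: packets are raw evidence, while frames accumulate side data as they move through the graph.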
5. Routing
5.1. Router role
Router does not answer the user. Router selects:
- which graph template to use
- which models to invoke
- which thinking level to enable
- which validator is needed
- which escalation policy applies
5.2. Encoder-only router
Preferred architecture:
XLM-R / ModernBERT / BGE-style encoder
+ classification head
→ route class
Output:
{
"selected_route": "IMAGE_EDIT",
"confidence": 0.87,
"risk": "low",
"thinking": "normal"
}
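The step from classifier logits to a routing decision can be sketched without any ML dependency. The code below assumes the encoder and classification head have already produced one logit per route class; `ROUTE_CLASSES` is a small subset of the route classes used in this document, and the low-confidence fallback to `REJECT_OR_HUMAN_REVIEW` plus the `min_confidence` parameter are assumptions, not documented behavior.

```python
import math

# Subset of route classes, in the (assumed) order of the classifier outputs.
ROUTE_CLASSES = ["TEXT_SIMPLE", "IMAGE_EDIT", "VIDEO_TO_VIDEO", "REJECT_OR_HUMAN_REVIEW"]


def softmax(logits: list[float]) -> list[float]:
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def pick_route(logits: list[float], min_confidence: float = 0.5) -> dict:
    """Turn raw logits into a routing decision with a confidence value."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    confidence = probs[best]
    # Assumption: uncertain predictions are not executed automatically.
    if confidence >= min_confidence:
        route = ROUTE_CLASSES[best]
    else:
        route = "REJECT_OR_HUMAN_REVIEW"
    return {"selected_route": route, "confidence": round(confidence, 2)}
```

Because the router only computes a softmax and an argmax on top of a small encoder, its cost stays far below that of generating text with an LLM.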
5.3. Route classes
TEXT_SIMPLE
TEXT_NORMAL
TEXT_COMPLEX
AUDIO_TRANSCRIBE_ONLY
AUDIO_QA
IMAGE_CAPTION
IMAGE_EDIT
TEXT_TO_IMAGE
TEXT_TO_VIDEO
IMAGE_TO_VIDEO
VIDEO_SUMMARY
VIDEO_TO_VIDEO
DOCUMENT_OCR_QA
DOCUMENT_TO_DOCUMENT
REJECT_OR_HUMAN_REVIEW
5.4. Model ladder (text/reasoning)
If using Qwen family as text/reasoning backbone:
Qwen3-0.6B: router-assistant / cheap tasks
Qwen3-4B: normal assistant
Qwen3-14B: hard tasks
Qwen3-32B: local high-quality / judge
Production minimum:
Qwen3-0.6B (cheap/fast)
Qwen3-4B (normal)
Qwen3-14B (complex)
Qwen3-32B (judge)
Router selects the minimum sufficient model. Escalation only on failure.
6. Thinking+
6.1. Purpose
Thinking+ is a control layer for planning and execution control, not just "the model thinks longer."
Thinking+ must:
- Understand the task
- Determine input/output modalities
- Select graph template
- Choose models
- Assign validators
- Set constraints
- Define retry policy
- Form execution plan
6.2. Thinking levels
thinking=off
fast routing, single pass, minimal checking
thinking=fast
router + simple graph
thinking=normal
planner + executor + validator
thinking=deep
planner + executor + critic + retry
thinking=research
multiple candidates + judge + detailed validation
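One way to make these levels operational is a small profile table that the planner reads. The sketch below is illustrative: `ThinkingProfile`, its fields, and every numeric value (retry counts, candidate counts) are assumptions chosen to mirror the descriptions above, not settings defined by this document.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ThinkingProfile:
    use_planner: bool     # build an explicit execution plan first
    use_validator: bool   # run validators on the result
    max_retries: int      # retries allowed after a failed validation
    candidates: int       # candidates generated for a judge to compare


# All values below are illustrative assumptions.
PROFILES = {
    "off": ThinkingProfile(use_planner=False, use_validator=False, max_retries=0, candidates=1),
    "fast": ThinkingProfile(use_planner=False, use_validator=False, max_retries=0, candidates=1),
    "normal": ThinkingProfile(use_planner=True, use_validator=True, max_retries=0, candidates=1),
    "deep": ThinkingProfile(use_planner=True, use_validator=True, max_retries=2, candidates=1),
    "research": ThinkingProfile(use_planner=True, use_validator=True, max_retries=2, candidates=4),
}


def profile_for(level: str) -> ThinkingProfile:
    """Look up the profile for a thinking level, rejecting unknown names."""
    try:
        return PROFILES[level]
    except KeyError:
        raise ValueError(f"unknown thinking level: {level!r}") from None
```

Expressing levels as data rather than branches keeps the executor generic: it reads the profile and never needs to know the level name.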
6.3. Execution plan example
{
"task": "video_to_video",
"preserve": ["faces", "voice", "camera_structure"],
"style": "premium minimal corporate",
"required_nodes": [
"shot_detection",
"face_tracking",
"audio_transcription",
"style_transfer_video",
"temporal_validator",
"audio_mux"
],
"risk": "high",
"validator": "vlm_video_judge",
"retry_policy": "up_to_2"
}
6.4. User-facing output
The internal reasoning chain is never a mandatory output. The user receives a brief route explanation:
{
"mode": "deep",
"route": "image_to_image",
"controls": ["depth", "mask", "reference"],
"generator": "image_edit_model",
"validator": "vision_validator"
}
7. Prompt Control
7.1. Prompt as side data
Every OmniFrame carries side data:
{
"prompt": "matte graphite car wrap",
"negative_prompt": "cartoon, distorted wheels, wrong car shape",
"seed": 42,
"strength": 0.35,
"preserve_identity": true,
"preserve_layout": true,
"control_maps": ["depth", "canny", "mask"],
"thinking_budget": 2048,
"validator_threshold": 0.82
}
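A typed container makes this side data harder to misuse than a free-form dictionary. The sketch below mirrors the keys of the example above; the class name `PromptSideData`, the default values, and the assumption that `strength` and `validator_threshold` live in the range 0 to 1 are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class PromptSideData:
    """Structured prompt side data attached to a frame (hypothetical type)."""
    prompt: str
    negative_prompt: str = ""
    seed: Optional[int] = None
    strength: float = 0.5
    preserve_identity: bool = False
    preserve_layout: bool = False
    control_maps: list[str] = field(default_factory=list)
    thinking_budget: int = 0
    validator_threshold: float = 0.0

    def __post_init__(self) -> None:
        # Assumption: both values are fractions in [0, 1].
        for name in ("strength", "validator_threshold"):
            value = getattr(self, name)
            if not 0.0 <= value <= 1.0:
                raise ValueError(f"{name} must be in [0, 1], got {value}")
```

Since the field names match the JSON keys, a parsed payload can be passed directly as `PromptSideData(**payload)`, and a typo in a key fails loudly instead of being silently ignored.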
7.2. Prompt layers
system prompt: global behavior
user prompt: user intent
task prompt: task-specific instructions
modality prompt: per-modality hints
generation prompt: enriched generation instruction
negative prompt: what to avoid
control prompt: structural control
validator prompt: validation criteria
7.3. Image prompt control
{
"user_prompt": "Make the car matte graphite",
"generation_prompt": "black Hyundai Elantra 2021, matte graphite wrap, premium realistic studio lighting",
"negative_prompt": "cartoon, damaged car, wrong wheels, deformed body",
"preserve": {
"car_model": true,
"camera_angle": true,
"body_shape": true,
"background": false
},
"strength": 0.38,
"controls": ["canny", "depth"]
}
7.4. Video prompt control
{
"global_prompt": "cinematic corporate video, clean premium style",
"shot_prompts": [
{"shot": 1, "prompt": "slow dolly-in, preserve subject identity"},
{"shot": 2, "prompt": "soft lighting, premium office mood"}
],
"negative_prompt": "flickering, face distortion, unstable hands, warped text",
"temporal_consistency": "high",
"style_strength": 0.45
}
8. Omni-Directions
8.1. Text → Text
Purpose: answers, analysis, translation, correction, classification, legal, code, RAG, structured output.
text input → language detection → router → LLM expert → validator → text output
8.2. Audio → Text
Purpose: transcription, speech translation, meeting summary, call center, lectures.
audio → VAD/chunking → ASR → transcript cleanup → language correction
→ LLM/summary/QA → text output
8.3. Audio → Audio
Purpose: speech-to-speech assistant, dubbing, voice translation, call center automation.
audio → ASR → LLM → TTS → audio encoder
8.4. Audio → Video
Purpose: podcast visualization, music video generation, audio-driven animation.
audio → ASR/analysis → scene planner → video generation → audio mux → output
8.5. Image → Text
Purpose: captioning, OCR, visual QA, document analysis, screenshot understanding.
image → image decoder → VLM/OCR → normalized text/scene graph → router → LLM → text output
8.6. Text → Image
Purpose: image generation, visual concepts, design, ads, UI mockups.
text prompt → prompt planner → image generation model → image validator → image encoder
8.7. Image → Image
Purpose: image editing, stylization, inpainting, outpainting, color change, shape preservation, reference-based generation.
image + prompt → image analysis → mask/control extraction → edit planner
→ image edit model → vision validator → output image
Controls: mask, depth, canny, pose, segmentation, reference image, style strength, identity preservation, layout preservation, negative prompt, seed.
8.8. Text → Video
Purpose: clip generation, ads, storyboard-to-video, presentation videos.
text prompt → scene planner → shot list → video generation model
→ temporal validator → video encoder
8.9. Image → Video
Purpose: image animation, motion prompt, avatar video, product animation.
image + motion prompt → image analysis → motion planner
→ image-to-video model → temporal validator → video output
8.10. Video → Text
Purpose: video summary, lecture analysis, surveillance analysis, meeting extraction, content indexing.
video → demux audio/video → shot detection → keyframe extraction → ASR
→ VLM analysis → multimodal summary → text output
8.11. Video → Image
Purpose: keyframe extraction, thumbnail generation, scene capture.
video → shot detection → keyframe selection → VLM analysis → best frame selection
→ optional image enhancement → image output
8.12. Video → Video
Purpose: style transfer, enhancement, cinematic transformation, face/body/background preservation, corporate video transformation, generative editing.
video → demux → shot detection → keyframe extraction → audio transcription
→ motion analysis → depth/pose/edge maps → video edit planner
→ video generation/editing model → temporal consistency filter
→ audio restoration/mux → video output
Controls: global prompt, per-shot prompt, negative prompt, motion strength, style strength, identity preservation, camera preservation, seed per shot, control maps per frame, mask tracks, temporal consistency.
8.13. Document → Text
Purpose: PDF analysis, contract review, law analysis, table extraction, OCR, document QA.
document → parser/OCR → layout extraction → chunks → retrieval/reasoning → text output
8.14. Document → Document
Purpose: contracts, whitepapers, PRDs, technical specs, government letters, Word/PDF/slides generation.
document/input text → structure planner → content generator
→ format renderer → validator → output document
9. Image-to-Image Architecture
9.1. Why image-to-image is not LLM-only
LLM can understand an instruction but must not be the sole image generator.
LLM/VLM = understand and plan
Image model = generate/edit
Validator = check
9.2. Image-to-image nodes
decode_image
analyze_image_with_vlm
extract_mask
extract_depth
extract_edges
extract_pose
plan_edit_with_thinking
run_image_edit_model
validate_prompt_adherence
validate_preservation
encode_image
9.3. Example graph
{
"nodes": [
{"id": "analyze_image", "model": "vlm"},
{"id": "extract_depth", "model": "depth"},
{"id": "extract_mask", "model": "sam"},
{"id": "plan_edit", "model": "llm_thinking"},
{"id": "generate_image", "model": "image_edit"},
{"id": "validate", "model": "vision_validator"}
],
"edges": [
["analyze_image", "plan_edit"],
["extract_depth", "generate_image"],
["extract_mask", "generate_image"],
["plan_edit", "generate_image"],
["generate_image", "validate"]
]
}
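An executor needs an order in which every node runs after its inputs. For the edge list above, a topological sort is enough; the sketch below uses Python's standard `graphlib` module and is only an illustration of the idea, with `execution_order` as a hypothetical helper name rather than part of OmniGraph.

```python
from graphlib import TopologicalSorter

# Edges copied from the example graph: (upstream, downstream).
EDGES = [
    ("analyze_image", "plan_edit"),
    ("extract_depth", "generate_image"),
    ("extract_mask", "generate_image"),
    ("plan_edit", "generate_image"),
    ("generate_image", "validate"),
]


def execution_order(edges: list[tuple[str, str]]) -> list[str]:
    """Return node ids so that every node appears after its dependencies."""
    sorter = TopologicalSorter()
    for upstream, downstream in edges:
        # add(node, *predecessors): downstream depends on upstream.
        sorter.add(downstream, upstream)
    # static_order() raises graphlib.CycleError if the graph is not a DAG.
    return list(sorter.static_order())
```

The three analysis nodes have no mutual dependencies, so a scheduler is free to run them in parallel; only `generate_image` and `validate` have a fixed position at the end.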
10. Video-to-Video Architecture
10.1. Why video-to-video is harder than image-to-image
Processing each frame independently causes:
- flickering
- identity loss
- face distortion
- motion destruction
- unstable style
- inter-frame artifacts
Video-to-video requires temporal consistency.
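A crude way to quantify flicker is the average pixel change between consecutive frames. The sketch below is a deliberately simplified stand-in for a temporal validator: frames are assumed to be flat lists of pixel intensities of equal length, and the function names are hypothetical. A real validator would need to compensate for genuine motion (for example with optical flow), which this metric does not do.

```python
def mean_abs_diff(frame_a: list[float], frame_b: list[float]) -> float:
    """Mean absolute per-pixel difference between two equally sized frames."""
    if len(frame_a) != len(frame_b):
        raise ValueError("frames must have the same number of pixels")
    return sum(abs(a - b) for a, b in zip(frame_a, frame_b)) / len(frame_a)


def flicker_score(frames: list[list[float]]) -> float:
    """Average frame-to-frame change; 0.0 means a perfectly static clip."""
    if len(frames) < 2:
        return 0.0
    diffs = [mean_abs_diff(prev, cur) for prev, cur in zip(frames, frames[1:])]
    return sum(diffs) / len(diffs)
```

Comparing the score of the generated clip against the score of the source clip gives a rough signal: a large increase suggests the edit introduced instability that was not present in the original motion.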
10.2. Video-to-video nodes
demux_video_audio
decode_video_frames
shot_detection
keyframe_selection
transcribe_audio
analyze_keyframes
extract_depth_sequence
extract_pose_sequence
extract_edges_sequence
track_faces
track_objects
plan_shots_with_thinking
run_video_edit_model
temporal_consistency_filter
restore_audio
encode_video
mux_audio_video
validate_video
10.3. Example graph
{
"nodes": [
{"id": "split_video", "tool": "ffmpeg"},
{"id": "analyze_keyframes", "model": "vlm"},
{"id": "transcribe_audio", "model": "asr"},
{"id": "extract_motion", "tool": "optical_flow"},
{"id": "make_control_maps", "models": ["depth", "canny", "pose"]},
{"id": "plan_shots", "model": "llm_thinking"},
{"id": "generate_video", "model": "video_diffusion"},
{"id": "restore_audio", "tool": "ffmpeg"},
{"id": "validate_video", "model": "video_validator"}
]
}
11. Runtime Scheduling
11.1. Scheduler as critical layer
Without a scheduler the system becomes a slow Python script. Scheduler must manage:
- CPU/GPU placement
- model loading and unloading
- batch processing and streaming
- caching and retry
- memory pressure and prioritization
- long-running jobs and device affinity
11.2. Example device distribution
CPU: demux, decode, OCR preprocessing, ffmpeg ops, graph planning
GPU 0: ASR / Whisper
GPU 1: VLM / image analysis
GPU 2: LLM / planner / router
GPU 3: image/video generation
11.3. Model loading policy
Not all models should be in memory at all times.
hot models: router, small LLM, ASR small
warm models: VLM, medium LLM, image edit
cold models: video generation, huge judge, rare experts
Scheduler capabilities:
preload, lazy load, unload, pin to GPU, move to CPU,
quantized load, batch requests, reuse cache
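The hot/warm/cold policy can be approximated with a least-recently-used cache in which hot models are pinned and never evicted. This is a minimal sketch under stated assumptions: `ModelCache` is a hypothetical class, capacity is counted in models rather than bytes of GPU memory, and `loader` stands in for whatever actually loads weights.

```python
from collections import OrderedDict
from typing import Any, Callable, Iterable


class ModelCache:
    """LRU cache of loaded models; pinned (hot) models are never evicted."""

    def __init__(self, capacity: int, pinned: Iterable[str] = ()) -> None:
        self.capacity = capacity
        self.pinned = set(pinned)
        self._loaded: "OrderedDict[str, Any]" = OrderedDict()

    def get(self, name: str, loader: Callable[[str], Any]) -> Any:
        """Return a loaded model, lazily loading it and evicting if needed."""
        if name in self._loaded:
            self._loaded.move_to_end(name)   # mark as most recently used
            return self._loaded[name]
        self._evict_until_room()
        self._loaded[name] = loader(name)
        return self._loaded[name]

    def _evict_until_room(self) -> None:
        while len(self._loaded) >= self.capacity:
            # Oldest non-pinned entry is the eviction victim.
            victim = next((n for n in self._loaded if n not in self.pinned), None)
            if victim is None:
                raise RuntimeError("all loaded models are pinned; cannot evict")
            del self._loaded[victim]

    def loaded(self) -> list[str]:
        return list(self._loaded)
```

With the router pinned, a rarely used video generator can displace a warm VLM without ever displacing the router itself.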
12. CLI
12.1. FFmpeg-like CLI
omniff -i input.jpg \
-prompt "make it matte graphite, preserve body and angle" \
-of image \
-o result.png
omniff -i input.mp4 \
-prompt "premium corporate ad style" \
-thinking deep \
-preserve faces,voice,structure \
-strength 0.42 \
-o output.mp4
omniff -i lesson_audio.mp3 \
-task summarize \
-lang kk \
-model auto \
-o output.md
omniff -i contract.pdf \
-task "find risks and write brief summary" \
-thinking deep \
-o review.docx
12.2. Explicit graph CLI
omniff -i input.mp4 \
-graph graphs/video_to_video_premium.yaml \
-prompt "premium Apple-like corporate style" \
-o output.mp4
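For prototyping the flag surface before a native binary exists, the single-dash long options shown above can be parsed with Python's standard `argparse`. This is an illustrative sketch only: the `dest` names, defaults, and the choice to split `-preserve` on commas are assumptions, and it says nothing about how the real `omniff` binary parses its arguments.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Prototype parser for the FFmpeg-style flags used in the examples."""
    parser = argparse.ArgumentParser(prog="omniff")
    parser.add_argument("-i", dest="input", required=True, help="input file")
    parser.add_argument("-o", dest="output", required=True, help="output file")
    parser.add_argument("-of", dest="output_format", help="output modality")
    parser.add_argument("-prompt", help="user prompt")
    parser.add_argument("-task", help="task description")
    parser.add_argument("-lang", help="target language code")
    parser.add_argument("-model", default="auto", help="model or 'auto'")
    parser.add_argument("-graph", help="path to an explicit graph template")
    parser.add_argument(
        "-thinking",
        choices=["off", "fast", "normal", "deep", "research"],
        default="normal",
    )
    # Comma-separated list, e.g. "faces,voice,structure".
    parser.add_argument("-preserve", type=lambda s: s.split(","), default=[])
    parser.add_argument("-strength", type=float)
    return parser
```

Calling `build_parser().parse_args([...])` with the tokens of the second example yields a namespace whose `preserve` attribute is a list and whose `strength` is a float, which is a convenient shape to hand to a planner.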
13. SDK / API
13.1. Python API
from omniff import OmniFFRuntime
runtime = OmniFFRuntime.from_pretrained("stukenov/omniff-runtime")
result = runtime.run(
input="input.mp4",
prompt="Video-to-video in premium style, preserve faces",
output_modality="video",
thinking="deep",
controls={
"preserve_identity": True,
"preserve_audio": True,
"style_strength": 0.45,
"temporal_consistency": "high",
},
)
result.save("output.mp4")
13.2. Planning API
graph = runtime.plan(
input="car.jpg",
prompt="make it matte graphite",
output_modality="image",
)
print(graph)
result = runtime.execute(graph)
13.3. HTTP API
POST /v1/omniff/run
{
"input": "file://input.mp4",
"prompt": "premium style video",
"output_modality": "video",
"thinking": "deep",
"controls": {
"preserve_faces": true,
"preserve_voice": true,
"style_strength": 0.45
}
}
14. Packaging
14.1. Not one safetensors
Production packaging:
omniff-runtime/
  omniff.yaml
  graph_templates/
  models/
    router/
    asr/
    vlm/
    llm_small/
    llm_large/
    image_generator/
    video_generator/
    tts/
  processors/
  validators/
  runtime/
  README.md
14.2. omniff.yaml
name: omniff-runtime
version: 0.1
router:
  type: encoder_classifier
  path: models/router
experts:
  text_small:
    type: causal_lm
    path: models/llm_small
  text_large:
    type: causal_lm
    path: models/llm_large
  asr:
    type: speech_to_text
    path: models/asr
  vision:
    type: vision_language
    path: models/vlm
  image_edit:
    type: diffusion_image_edit
    path: models/image_generator
  video_edit:
    type: diffusion_video_edit
    path: models/video_generator
  tts:
    type: text_to_speech
    path: models/tts
14.3. Hugging Face custom architecture
For research and demo use, an HF repo with custom code:
configuration_omniff.py
modeling_omniff.py
processing_omniff.py
config.json
routing_config.yaml
Loading:
model = OmniFFRuntime.from_pretrained(
"stukenov/omniff-runtime",
trust_remote_code=True,
)
Production must not depend on loading everything as one AutoModelForCausalLM.
15. Cascade Routing
15.1. Principle
simple → small model
normal → medium model
complex → large model
critical → judge / human review
This can cut cost by orders of magnitude on real traffic, because the large model often never starts.
The trade-off is that quality depends on router accuracy.
15.2. Escalation flow
Router selects minimum sufficient model
→ model executes
→ validator checks result
→ on failure: escalate to stronger model or retry with adjusted controls
→ on repeated failure: mark as failed or route to human review
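The escalation flow above can be sketched as a loop over a model ladder. The function below is a simplified illustration: `run_with_escalation`, the `run_model` and `validate` callables, and the default threshold of 0.82 (borrowed from the side-data example earlier in this document) are assumptions, and the "retry with adjusted controls" branch is omitted for brevity.

```python
from typing import Any, Callable


def run_with_escalation(
    task: str,
    ladder: list[str],
    run_model: Callable[[str, str], Any],
    validate: Callable[[Any], float],
    threshold: float = 0.82,
) -> dict:
    """Try models from cheapest to strongest until one passes validation."""
    attempts: list[tuple[str, float]] = []
    for model in ladder:
        output = run_model(model, task)
        score = validate(output)
        attempts.append((model, score))
        if score >= threshold:
            return {"status": "ok", "model": model, "output": output, "attempts": attempts}
    # Every rung failed: hand the request to a human instead of guessing.
    return {"status": "human_review", "model": None, "output": None, "attempts": attempts}
```

In a full system the router would pick the starting rung, so the loop would begin at the predicted minimum sufficient model rather than at the bottom of the ladder.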
16. Safety and Quality
16.1. Validator-first philosophy
Every complex graph must have a validator.
Text validators:
language, format, JSON schema, citation, risk, factuality
Image validators:
prompt adherence, identity preservation, layout preservation,
NSFW/safety, artifact detection, OCR/text correctness
Video validators:
temporal consistency, face preservation, flicker detection,
motion coherence, audio-video sync, prompt adherence
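Running several validators and collecting their verdicts can be sketched as follows. The helper `run_validators` is hypothetical, each validator is assumed to return a score between 0 and 1, and a single shared threshold is a simplification; a real system would likely use per-validator thresholds.

```python
from typing import Any, Callable


def run_validators(
    output: Any,
    validators: dict[str, Callable[[Any], float]],
    threshold: float,
) -> dict:
    """Score an output with every validator and report which ones failed."""
    scores = {name: check(output) for name, check in validators.items()}
    failed = sorted(name for name, score in scores.items() if score < threshold)
    return {"passed": not failed, "scores": scores, "failed": failed}
```

Returning the list of failed validators, not just a boolean, lets the escalation step choose a targeted response, such as lowering style strength when only identity preservation fails.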
16.2. Escalation
On validator failure:
retry same graph with adjusted controls
→ or escalate to stronger model
→ or ask for clarification
→ or mark as failed
→ or route to human review
17. Logging and Router Training
17.1. What to log
request_id, user_id/tenant_id, input modalities, output modality,
prompt hash, language, task_type, selected_route, selected_models,
thinking_mode, latency, cost estimate, success/failure,
validator scores, fallbacks, retries, user feedback, output metadata
17.2. Router training
Primary label: cheapest sufficient route, i.e. the cheapest model/graph that produced acceptable quality.
Process:
1. Collect real and synthetic prompts
2. Run through different route/model variants
3. Score outputs with judge model + partial human eval
4. Assign cheapest-sufficient label
5. Train encoder-only classifier
6. Export to ONNX/Candle
7. Embed in runtime
8. Continuously retrain on logs
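Step 4 of the process, assigning the cheapest-sufficient label, reduces to a filter followed by a minimum. The sketch below assumes each prompt has been run through several routes and that each run is recorded as a `(route, cost, score)` tuple; the function name, the tuple layout, and the fallback label for prompts that no route handles acceptably are assumptions.

```python
def cheapest_sufficient_route(
    results: list[tuple[str, float, float]],
    min_score: float,
) -> str:
    """Pick the cheapest route whose judged score meets the quality bar.

    results: (route, cost, score) for one prompt across candidate routes.
    """
    acceptable = [r for r in results if r[2] >= min_score]
    if not acceptable:
        # Assumption: prompts no route can handle get the review label.
        return "REJECT_OR_HUMAN_REVIEW"
    return min(acceptable, key=lambda r: r[1])[0]
```

The resulting label is what the encoder-only classifier is trained to predict, so the quality bar chosen here directly shapes how aggressively the router prefers small models.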
18. Technical Stack
18.1. Runtime core
Start:
Python + PyTorch + Transformers + Diffusers + FFmpeg bindings
Production:
Rust/Go runtime shell
Python model workers where needed
ONNX/Candle/TensorRT acceleration where justified
18.2. No dependency on vLLM/SGLang
vLLM and SGLang may be optional backends but never the foundation.
OmniGraph owns routing, planning, graph execution, and scheduling.
Model backends are replaceable.
18.3. Model backend types
PyTorch, Transformers, Diffusers, ONNX Runtime,
Candle, GGUF/llama.cpp-style, custom CUDA kernels,
external API adapter
19. MVP Roadmap
v0.1: Prove the architecture
text → text
image → text
audio → text → text
image → image
Components:
OmniFF CLI
OmniFrame / OmniPacket
OmniGraph / OmniNode
OmniFFRuntime
encoder-only router
ASR module, VLM module, LLM module
image edit module, validator module
v0.2
text → image
video → text
image → video
document → text
v0.3
video → video
audio → audio
document → document
multi-pass validation
scheduler
model hot/warm/cold loading
v1.0
universal graph planner
Thinking+ controller
prompt-control side data
full modality matrix
validators
production scheduler
CLI + SDK + HTTP API
plugin model interface
20. Product Identity
Short positioning
OmniFF Runtime is an FFmpeg-like multimodal AI processing engine.
Extended positioning
A multimodal graph runtime with encoder-only routing,
thinking-controlled planning, and pluggable model experts
for text, speech, vision, image generation, video generation,
documents, and structured outputs.
Kazakh-first variant (OmniFF-KZ)
OmniFF-KZ is a Kazakh-first multimodal AI runtime that combines
Qwen expert hierarchy, ASR, vision, image/video generation,
and document intelligence through a unified graph execution engine
with native Kazakh language support.
21. What This Must Not Be
OmniFF Runtime must not be:
a LangChain pipeline
a collection of Python scripts
a wrapper over vLLM
a chatbot
a HuggingFace demo
a ComfyUI clone
one big safetensors
a gateway
a multimodal LLM
an agent framework
It must be:
runtime, format, graph engine, model orchestration layer,
scheduler, prompt-control system, validator system, CLI/API/SDK
22. Canonical Formula
input
→ demux
→ decode
→ normalize into OmniFrames
→ plan with Thinking+
→ execute graph of AI filters/models
→ validate
→ encode
→ mux
→ output
One repo. One config. One processor. One runtime. One CLI. One API. Many experts. One graph executor.
23. Conclusion
OmniFF Runtime is an infrastructure system of a new class: a multimodal AI runtime that relates to models the way FFmpeg relates to codecs.
It does not compete with Qwen, Whisper, VLMs, diffusion models, or TTS. It uses them as interchangeable computational nodes.
Its value is not "one model that does everything." Its value is a unified engineering way to build any transformation:
text ↔ audio ↔ image ↔ video ↔ document ↔ structured data
With control:
prompt, negative prompt, thinking, router, models,
validators, constraints, schedulers, quality thresholds
Saken OmniFF Runtime: An FFmpeg-like AI runtime for routed, thinking-controlled, multimodal generation and transformation.
Not "one model." A stronger category: an operating environment for multimodal AI inference and generation.