zenlm/zen-foley

@@ -1,114 +1,55 @@
 ---
-library_name: diffusers
-pipeline_tag: text-to-audio
-language:
-  - en
-license: other
 tags:
-  - video-to-audio
-  - audio-generation
-  - foley
-  - sound-effects
   - zen
   - zenlm
   - hanzo
 ---
 # Zen Foley
-**Zen LM by Hanzo AI** — Automatic foley sound and audio effect generation from video content.
-## Specs
-| Property | Value |
-|----------|-------|
-| Architecture | Zen Foley Transformer (Flow Matching) |
-| Task | Video-to-Audio / Foley Sound Generation |
-| Model Type | Audio Diffusion Transformer |
-| Precision | BF16 |
-## Model Variants
-This repository contains two model variants:
-| File | Variant | Depth (triple/single) | Hidden Size | Heads |
-|------|---------|----------------------|-------------|-------|
-| `hunyuanvideo_foley.pth` | XXL | 18 triple + 36 single | 1536 | 12 |
-| `hunyuanvideo_foley_xl.pth` | XL | 12 triple + 24 single | 1408 | 11 |
-| `synchformer_state_dict.pth` | Sync encoder | - | - | - |
-| `vae_128d_48k.pth` | VAE (48kHz, 128-dim) | - | - | - |
-## Model Files
-| File | Role | Notes |
-|------|------|-------|
-| `hunyuanvideo_foley.pth` | Main XXL foley model | Best quality |
-| `hunyuanvideo_foley_xl.pth` | XL foley model | Faster inference |
-| `synchformer_state_dict.pth` | Audio-visual synchronization encoder | Required |
-| `vae_128d_48k.pth` | 48kHz audio VAE (128-dim latents) | Required |
-| `config.yaml` | XXL model configuration | Architecture params |
-| `config_xl.yaml` | XL model configuration | Architecture params |
-## API Access (Recommended)
-```python
-from openai import OpenAI
-client = OpenAI(
-    base_url='https://api.hanzo.ai/v1',
-    api_key='your-api-key',
-)
-# Generate foley audio from video description
-response = client.audio.speech.create(
-    model='zen-foley',
-    input='footsteps on gravel with ambient wind',
-    voice='foley',
-)
-response.stream_to_file('foley.wav')
-```
-## Local Usage
 ```python
 import torch
-device = 'cuda' if torch.cuda.is_available() else 'cpu'
-# Load XXL model (best quality)
-foley_model = torch.load(
-    'hunyuanvideo_foley.pth',
-    map_location=device,
-    weights_only=False,
-)
-# Load auxiliary models
-vae = torch.load('vae_128d_48k.pth', map_location=device, weights_only=False)
-sync_encoder = torch.load(
-    'synchformer_state_dict.pth',
-    map_location=device,
-    weights_only=False,
-)
 ```
-See [github.com/zenlm/zen-audio](https://github.com/zenlm/zen-audio) for the full inference pipeline.
-## Capabilities
-- Automatic foley sound synthesis from video frames
-- Audio-visual synchronization (24fps video alignment)
-- 48kHz high-fidelity audio output
-- Environmental ambience generation
-- Object interaction sounds (footsteps, impacts, rustling)
-- Custom sound effect generation from text description
-## Hardware Requirements
-| Variant | VRAM |
-|---------|------|
-| XXL (recommended) | 24GB+ |
-| XL (faster) | 16GB+ |
 ## License
-Community License — model weights are subject to the upstream community license terms.

 ---
+language: en
+license: apache-2.0
 tags:
+  - text-to-audio
   - zen
   - zenlm
   - hanzo
+  - foley
+  - sound-effects
+  - audio
+pipeline_tag: text-to-audio
+library_name: transformers
 ---
 # Zen Foley
+Foley sound effects generation model for video and interactive media production.
+## Overview
+Built on **Zen MoDE (Mixture of Distilled Experts)** architecture with 1B parameters.
+Developed by [Hanzo AI](https://hanzo.ai) and the [Zoo Labs Foundation](https://zoo.ngo).
+## Quick Start
 ```python
+from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor
 import torch
+model_id = "zenlm/zen-foley"
+processor = AutoProcessor.from_pretrained(model_id)
+model = AutoModelForSpeechSeq2Seq.from_pretrained(model_id, torch_dtype=torch.float16, device_map="auto")
+# Load audio
+import librosa
+audio, sr = librosa.load("audio.wav", sr=16000)
+inputs = processor(audio, sampling_rate=sr, return_tensors="pt").to(model.device)
+outputs = model.generate(**inputs)
+print(processor.batch_decode(outputs, skip_special_tokens=True)[0])
 ```
+## Model Details
+| Attribute | Value |
+|-----------|-------|
+| Parameters | 1B |
+| Architecture | Zen MoDE |
+| Context | 10s audio |
+| License | Apache 2.0 |
 ## License
+Apache 2.0
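
The "Model Variants" table in the removed card lists a hidden size and a head count for each checkpoint. Dividing one by the other gives the per-head width, and adding the triple and single depths gives the total block count. The following is a minimal sketch of that arithmetic, using only the values printed in that table. The names `VARIANTS`, `head_dim`, and `total_blocks` are illustrative assumptions and are not part of any Zen or upstream API.

```python
# Values copied from the "Model Variants" table in the removed card.
# The dictionary layout and helper names are hypothetical, for illustration only.
VARIANTS = {
    "XXL": {"triple": 18, "single": 36, "hidden_size": 1536, "heads": 12},
    "XL": {"triple": 12, "single": 24, "hidden_size": 1408, "heads": 11},
}


def head_dim(hidden_size: int, heads: int) -> int:
    """Return the per-head width, raising if the hidden size does not split evenly."""
    if hidden_size % heads != 0:
        raise ValueError("hidden_size must be divisible by heads")
    return hidden_size // heads


def total_blocks(triple: int, single: int) -> int:
    """Return the combined number of triple-stream and single-stream blocks."""
    return triple + single


if __name__ == "__main__":
    for name, spec in VARIANTS.items():
        dim = head_dim(spec["hidden_size"], spec["heads"])
        blocks = total_blocks(spec["triple"], spec["single"])
        print(f"{name}: head_dim={dim}, blocks={blocks}")
```

With the table's numbers, both variants come out to the same per-head width (1536 / 12 and 1408 / 11 are both 128), so the XL variant differs from XXL in depth and head count rather than in head size.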