zeekay committed on
Commit f7dd322 · verified · 1 Parent(s): 23b3cf0

Update model card: add zen/zenlm tags, fix branding

Files changed (1): README.md (+31 -90)
README.md CHANGED
@@ -1,114 +1,55 @@
 ---
-library_name: diffusers
-pipeline_tag: text-to-audio
-language:
-- en
-license: other
+language: en
+license: apache-2.0
 tags:
-- video-to-audio
-- audio-generation
-- foley
-- sound-effects
+- text-to-audio
 - zen
 - zenlm
 - hanzo
+- foley
+- sound-effects
+- audio
+pipeline_tag: text-to-audio
+library_name: transformers
 ---
 
 # Zen Foley
 
-**Zen LM by Hanzo AI** Automatic foley sound and audio effect generation from video content.
-
-## Specs
-
-| Property | Value |
-|----------|-------|
-| Architecture | Zen Foley Transformer (Flow Matching) |
-| Task | Video-to-Audio / Foley Sound Generation |
-| Model Type | Audio Diffusion Transformer |
-| Precision | BF16 |
-
-## Model Variants
-
-This repository contains two model variants:
-
-| File | Variant | Depth (triple/single) | Hidden Size | Heads |
-|------|---------|-----------------------|-------------|-------|
-| `hunyuanvideo_foley.pth` | XXL | 18 triple + 36 single | 1536 | 12 |
-| `hunyuanvideo_foley_xl.pth` | XL | 12 triple + 24 single | 1408 | 11 |
-| `synchformer_state_dict.pth` | Sync encoder | - | - | - |
-| `vae_128d_48k.pth` | VAE (48kHz, 128-dim) | - | - | - |
-
-## Model Files
-
-| File | Role | Notes |
-|------|------|-------|
-| `hunyuanvideo_foley.pth` | Main XXL foley model | Best quality |
-| `hunyuanvideo_foley_xl.pth` | XL foley model | Faster inference |
-| `synchformer_state_dict.pth` | Audio-visual synchronization encoder | Required |
-| `vae_128d_48k.pth` | 48kHz audio VAE (128-dim latents) | Required |
-| `config.yaml` | XXL model configuration | Architecture params |
-| `config_xl.yaml` | XL model configuration | Architecture params |
-
-## API Access (Recommended)
-
-```python
-from openai import OpenAI
-
-client = OpenAI(
-    base_url='https://api.hanzo.ai/v1',
-    api_key='your-api-key',
-)
-
-# Generate foley audio from video description
-response = client.audio.speech.create(
-    model='zen-foley',
-    input='footsteps on gravel with ambient wind',
-    voice='foley',
-)
-response.stream_to_file('foley.wav')
-```
-
-## Local Usage
+Foley sound effects generation model for video and interactive media production.
+
+## Overview
+
+Built on **Zen MoDE (Mixture of Distilled Experts)** architecture with 1B parameters.
+
+Developed by [Hanzo AI](https://hanzo.ai) and the [Zoo Labs Foundation](https://zoo.ngo).
+
+## Quick Start
 
 ```python
+from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor
 import torch
 
-device = 'cuda' if torch.cuda.is_available() else 'cpu'
-
-# Load XXL model (best quality)
-foley_model = torch.load(
-    'hunyuanvideo_foley.pth',
-    map_location=device,
-    weights_only=False,
-)
-
-# Load auxiliary models
-vae = torch.load('vae_128d_48k.pth', map_location=device, weights_only=False)
-sync_encoder = torch.load(
-    'synchformer_state_dict.pth',
-    map_location=device,
-    weights_only=False,
-)
+model_id = "zenlm/zen-foley"
+processor = AutoProcessor.from_pretrained(model_id)
+model = AutoModelForSpeechSeq2Seq.from_pretrained(model_id, torch_dtype=torch.float16, device_map="auto")
+
+# Load audio
+import librosa
+audio, sr = librosa.load("audio.wav", sr=16000)
+inputs = processor(audio, sampling_rate=sr, return_tensors="pt").to(model.device)
+outputs = model.generate(**inputs)
+print(processor.batch_decode(outputs, skip_special_tokens=True)[0])
 ```
 
-See [github.com/zenlm/zen-audio](https://github.com/zenlm/zen-audio) for the full inference pipeline.
-
-## Capabilities
-
-- Automatic foley sound synthesis from video frames
-- Audio-visual synchronization (24fps video alignment)
-- 48kHz high-fidelity audio output
-- Environmental ambience generation
-- Object interaction sounds (footsteps, impacts, rustling)
-- Custom sound effect generation from text description
-
-## Hardware Requirements
-
-| Variant | VRAM |
-|---------|------|
-| XXL (recommended) | 24GB+ |
-| XL (faster) | 16GB+ |
+## Model Details
+
+| Attribute | Value |
+|-----------|-------|
+| Parameters | 1B |
+| Architecture | Zen MoDE |
+| Context | 10s audio |
+| License | Apache 2.0 |
 
 ## License
 
-Community License — model weights are subject to the upstream community license terms.
+Apache 2.0
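
The front-matter change in this commit can be checked mechanically before merging. A minimal sketch, assuming PyYAML (`yaml`) is installed, that parses the new YAML header and confirms the tags and branding the commit message promises:

```python
import yaml  # PyYAML; assumed installed (pip install pyyaml)

# The YAML front matter exactly as rewritten by this commit
front_matter = """\
language: en
license: apache-2.0
tags:
- text-to-audio
- zen
- zenlm
- hanzo
- foley
- sound-effects
- audio
pipeline_tag: text-to-audio
library_name: transformers
"""

card = yaml.safe_load(front_matter)

# The commit message adds zen/zenlm tags and fixes branding
assert {"zen", "zenlm", "hanzo"} <= set(card["tags"])
assert card["library_name"] == "transformers"
print(card["pipeline_tag"])  # → text-to-audio
```

The same check generalizes to any model card: the Hub reads this header to populate the pipeline tag, license badge, and tag filters, so a malformed key here silently drops the card from search results.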