antonios-makro committed on
Commit 18a538b · verified · 1 Parent(s): 1aba25a

Upload 4 files

Files changed (5)
  1. .gitattributes +1 -0
  2. README.md +225 -0
  3. config.json +62 -0
  4. wav2arkit_cpu.onnx +3 -0
  5. wav2arkit_cpu.onnx.data +3 -0
.gitattributes CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
  *.zip filter=lfs diff=lfs merge=lfs -text
  *.zst filter=lfs diff=lfs merge=lfs -text
  *tfevents* filter=lfs diff=lfs merge=lfs -text
+ wav2arkit_cpu.onnx.data filter=lfs diff=lfs merge=lfs -text
README.md CHANGED
@@ -1,3 +1,228 @@
  ---
  license: apache-2.0
+ base_model: 3DAIGC/LAM_audio2exp
+ library_name: onnxruntime
+ pipeline_tag: audio-to-audio
+ tags:
+ - onnx
+ - audio2expression
+ - arkit
+ - blendshapes
+ - facial-animation
+ - avatar
+ - wav2vec2
+ - realtime
+ - cpu
  ---
+
+ # Wav2ARKit - Audio to Facial Expression (ONNX)
+
+ A **fused, end-to-end ONNX model** that converts raw audio waveforms directly into 52 ARKit-compatible facial blendshapes. It combines the [Facebook Wav2Vec2](https://huggingface.co/facebook/wav2vec2-base-960h) encoder with the [LAM Audio2Expression](https://huggingface.co/3DAIGC/LAM_audio2exp) decoder, optimized for real-time CPU inference.
+
+ ## ✨ Features
+
+ | Feature | Value |
+ |---------|-------|
+ | **Input** | Raw 16kHz audio waveform |
+ | **Output** | 52 ARKit blendshapes @ 30fps |
+ | **Inference** | ~45ms per second of audio |
+ | **Speed** | ~22× faster than realtime |
+ | **Size** | 1.8 MB graph (`wav2arkit_cpu.onnx`) + 402 MB external weights (`wav2arkit_cpu.onnx.data`) |
+
+ ## Quick Start
+
+ ```python
+ import onnxruntime as ort
+ import numpy as np
+
+ # Load model
+ session = ort.InferenceSession("wav2arkit_cpu.onnx", providers=["CPUExecutionProvider"])
+
+ # Load audio (16kHz, mono, float32)
+ # Example: 1 second = 16000 samples
+ audio = np.random.randn(1, 16000).astype(np.float32)
+
+ # Run inference
+ blendshapes = session.run(None, {"audio_waveform": audio})[0]
+ # Output: (1, 30, 52) - 30 frames at 30fps, 52 blendshapes
+ ```
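+
+ The speed figures above are easy to sanity-check; a minimal sketch (absolute numbers will vary with your CPU):
+
+ ```python
+ import time
+ import numpy as np
+ import onnxruntime as ort
+
+ session = ort.InferenceSession("wav2arkit_cpu.onnx", providers=["CPUExecutionProvider"])
+ audio = np.zeros((1, 16000), dtype=np.float32)  # 1 second of silence
+
+ session.run(None, {"audio_waveform": audio})  # warm-up run
+ t0 = time.perf_counter()
+ session.run(None, {"audio_waveform": audio})
+ dt = time.perf_counter() - t0
+ print(f"{dt * 1000:.1f} ms per 1 s of audio ({1.0 / dt:.0f}x realtime)")
+ ```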
+
+ ## Model Specification
+
+ ### Input
+ | Name | Type | Shape | Description |
+ |------|------|-------|-------------|
+ | `audio_waveform` | float32 | `[batch, samples]` | Raw audio at 16kHz |
+
+ ### Output
+ | Name | Type | Shape | Description |
+ |------|------|-------|-------------|
+ | `blendshapes` | float32 | `[batch, frames, 52]` | ARKit blendshapes, values in [0, 1] |
+
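+ Both specs can also be read directly off the graph (the symbolic dimension names printed here are illustrative and may differ in the actual export):
+
+ ```python
+ import onnxruntime as ort
+
+ session = ort.InferenceSession("wav2arkit_cpu.onnx", providers=["CPUExecutionProvider"])
+ for t in session.get_inputs():
+     print("input: ", t.name, t.type, t.shape)   # audio_waveform tensor(float) [batch, samples]
+ for t in session.get_outputs():
+     print("output:", t.name, t.type, t.shape)   # blendshapes tensor(float) [batch, frames, 52]
+ ```
+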
+ ### Frame Calculation
+ ```
+ output_frames = ceil(30 × (num_samples / 16000))
+ ```
+ Example: 1 second of audio (16000 samples) → 30 frames
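+
+ A quick check of the formula in code:
+
+ ```python
+ import math
+
+ def expected_frames(num_samples: int, sample_rate: int = 16000, fps: int = 30) -> int:
+     # output_frames = ceil(30 × (num_samples / 16000))
+     return math.ceil(fps * num_samples / sample_rate)
+
+ assert expected_frames(16000) == 30   # 1 s of audio
+ assert expected_frames(8000) == 15    # 0.5 s of audio
+ assert expected_frames(16001) == 31   # a partial frame rounds up
+ ```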
+
+ ## ARKit Blendshapes
+
+ <details>
+ <summary>52 blendshape indices (click to expand)</summary>
+
+ | Idx | Name | Idx | Name |
+ |-----|------|-----|------|
+ | 0 | browDownLeft | 26 | mouthFrownRight |
+ | 1 | browDownRight | 27 | mouthFunnel |
+ | 2 | browInnerUp | 28 | mouthLeft |
+ | 3 | browOuterUpLeft | 29 | mouthLowerDownLeft |
+ | 4 | browOuterUpRight | 30 | mouthLowerDownRight |
+ | 5 | cheekPuff | 31 | mouthPressLeft |
+ | 6 | cheekSquintLeft | 32 | mouthPressRight |
+ | 7 | cheekSquintRight | 33 | mouthPucker |
+ | 8 | eyeBlinkLeft | 34 | mouthRight |
+ | 9 | eyeBlinkRight | 35 | mouthRollLower |
+ | 10 | eyeLookDownLeft | 36 | mouthRollUpper |
+ | 11 | eyeLookDownRight | 37 | mouthShrugLower |
+ | 12 | eyeLookInLeft | 38 | mouthShrugUpper |
+ | 13 | eyeLookInRight | 39 | mouthSmileLeft |
+ | 14 | eyeLookOutLeft | 40 | mouthSmileRight |
+ | 15 | eyeLookOutRight | 41 | mouthStretchLeft |
+ | 16 | eyeLookUpLeft | 42 | mouthStretchRight |
+ | 17 | eyeLookUpRight | 43 | mouthUpperUpLeft |
+ | 18 | eyeSquintLeft | 44 | mouthUpperUpRight |
+ | 19 | eyeSquintRight | 45 | noseSneerLeft |
+ | 20 | eyeWideLeft | 46 | noseSneerRight |
+ | 21 | eyeWideRight | 47 | tongueOut |
+ | 22 | jawForward | 48 | mouthClose |
+ | 23 | jawLeft | 49 | mouthDimpleLeft |
+ | 24 | jawOpen | 50 | mouthDimpleRight |
+ | 25 | mouthFrownLeft | 51 | jawRight |
+
+ </details>
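+
+ For convenience, a sketch that pairs one output frame with these names. The order is taken from the index table above; note that `config.json` lists the same 52 names in a different grouping.
+
+ ```python
+ import numpy as np
+
+ # Model output order, per the index table above.
+ ARKIT_NAMES = [
+     "browDownLeft", "browDownRight", "browInnerUp", "browOuterUpLeft", "browOuterUpRight",
+     "cheekPuff", "cheekSquintLeft", "cheekSquintRight", "eyeBlinkLeft", "eyeBlinkRight",
+     "eyeLookDownLeft", "eyeLookDownRight", "eyeLookInLeft", "eyeLookInRight",
+     "eyeLookOutLeft", "eyeLookOutRight", "eyeLookUpLeft", "eyeLookUpRight",
+     "eyeSquintLeft", "eyeSquintRight", "eyeWideLeft", "eyeWideRight",
+     "jawForward", "jawLeft", "jawOpen", "mouthFrownLeft", "mouthFrownRight",
+     "mouthFunnel", "mouthLeft", "mouthLowerDownLeft", "mouthLowerDownRight",
+     "mouthPressLeft", "mouthPressRight", "mouthPucker", "mouthRight",
+     "mouthRollLower", "mouthRollUpper", "mouthShrugLower", "mouthShrugUpper",
+     "mouthSmileLeft", "mouthSmileRight", "mouthStretchLeft", "mouthStretchRight",
+     "mouthUpperUpLeft", "mouthUpperUpRight", "noseSneerLeft", "noseSneerRight",
+     "tongueOut", "mouthClose", "mouthDimpleLeft", "mouthDimpleRight", "jawRight",
+ ]
+
+ def top_active(frame: np.ndarray, k: int = 3) -> list[tuple[str, float]]:
+     """Return the k most active blendshapes for one (52,) output frame."""
+     idx = np.argsort(frame)[::-1][:k]
+     return [(ARKIT_NAMES[i], float(frame[i])) for i in idx]
+ ```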
+
+ ## Usage Examples
+
+ ### Python with audio file
+ ```python
+ import onnxruntime as ort
+ import numpy as np
+ import soundfile as sf
+
+ session = ort.InferenceSession("wav2arkit_cpu.onnx", providers=["CPUExecutionProvider"])
+
+ # Load audio and downmix to mono first (librosa.resample expects time on the last axis)
+ audio, sr = sf.read("speech.wav")
+ if audio.ndim > 1:
+     audio = audio.mean(axis=1)
+
+ # Resample to 16kHz if needed
+ if sr != 16000:
+     import librosa
+     audio = librosa.resample(audio, orig_sr=sr, target_sr=16000)
+
+ # Run inference
+ audio_input = audio.astype(np.float32).reshape(1, -1)
+ blendshapes = session.run(None, {"audio_waveform": audio_input})[0]
+
+ print(f"Duration: {len(audio)/16000:.2f}s → {blendshapes.shape[1]} frames")
+ ```
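+
+ Output frames sit on a fixed 30fps grid, so frame `i` corresponds to timestamp `i / 30` seconds. Continuing the example above, a small sketch that dumps timestamped frames to JSON (the file name is arbitrary):
+
+ ```python
+ import json
+
+ fps = 30
+ records = [
+     {"t": i / fps, "weights": frame.tolist()}   # one record per ~33.3 ms frame
+     for i, frame in enumerate(blendshapes[0])
+ ]
+ with open("blendshapes.json", "w") as f:
+     json.dump(records, f)
+ ```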
+
+ ### C++
+ ```cpp
+ #include <onnxruntime_cxx_api.h>
+
+ Ort::Env env(ORT_LOGGING_LEVEL_WARNING, "Wav2ARKit");
+ // Note: the wide-string path is for Windows; pass a narrow string on Linux/macOS.
+ Ort::Session session(env, L"wav2arkit_cpu.onnx", Ort::SessionOptions{});
+
+ std::vector<float> audio(16000); // 1 second
+ std::vector<int64_t> shape = {1, 16000};
+
+ Ort::MemoryInfo mem = Ort::MemoryInfo::CreateCpu(OrtArenaAllocator, OrtMemTypeDefault);
+ Ort::Value input = Ort::Value::CreateTensor<float>(mem, audio.data(), audio.size(), shape.data(), shape.size());
+
+ const char* input_names[] = {"audio_waveform"};
+ const char* output_names[] = {"blendshapes"};
+ auto output = session.Run({}, input_names, &input, 1, output_names, 1);
+ ```
+
+ ### JavaScript (onnxruntime-web/node)
+ ```javascript
+ const ort = require('onnxruntime-node');
+
+ const session = await ort.InferenceSession.create('wav2arkit_cpu.onnx');
+ const audioTensor = new ort.Tensor('float32', audioData, [1, audioData.length]);
+ const { blendshapes } = await session.run({ audio_waveform: audioTensor });
+ ```
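+
+ For long or live audio, one simple option is to run fixed-size chunks and concatenate the frames. A minimal sketch of that approach; note each call is independent, so small discontinuities at chunk boundaries are possible:
+
+ ```python
+ import numpy as np
+ import onnxruntime as ort
+
+ session = ort.InferenceSession("wav2arkit_cpu.onnx", providers=["CPUExecutionProvider"])
+
+ def run_chunked(audio: np.ndarray, chunk_s: float = 1.0, sr: int = 16000) -> np.ndarray:
+     """Run inference chunk by chunk; returns (1, total_frames, 52)."""
+     step = int(chunk_s * sr)
+     parts = []
+     for start in range(0, len(audio), step):
+         piece = audio[start:start + step].astype(np.float32).reshape(1, -1)
+         parts.append(session.run(None, {"audio_waveform": piece})[0])
+     return np.concatenate(parts, axis=1)
+ ```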
+
+ ## Architecture
+
+ ```
+ Audio Input  [batch, samples] @ 16kHz
+       │
+       ▼
+ Wav2Vec2 Encoder
+   ├─ CNN Feature Extractor (50fps)
+   ├─ Linear Interpolation 50fps → 30fps
+   └─ Transformer Encoder (12 layers)
+       │  [batch, frames, 768]
+       ▼
+ Feature Projection (768 → 512)
+       │  [batch, frames, 512]
+       ▼
+ Identity Encoder ◄── Identity ID (0-11): int → one-hot [12] → MLP → [64]
+   ├─ Concat: [512] + [64]              (baked as ID=11 in this export)
+   └─ SeqTranslator (3× Conv + LN + ReLU)
+       │  [batch, 512, frames]
+       ▼
+ Decoder (3× Conv1D + LayerNorm)
+       │  [batch, 512, frames]
+       ▼
+ Output Projection (512 → 52) + σ (sigmoid)
+       │
+       ▼
+ Output  [batch, frames, 52] @ 30fps, values ∈ [0, 1]
+ ```
+
+ **Note:** The identity encoder supports 12 speaker identities (0-11). This ONNX export bakes in identity `11` for single-speaker inference, so no identity input is exposed.
+
+ ## License
+
+ Apache 2.0. Based on:
+ - [3DAIGC/LAM_audio2exp](https://huggingface.co/3DAIGC/LAM_audio2exp)
+ - [facebook/wav2vec2-base-960h](https://huggingface.co/facebook/wav2vec2-base-960h)
config.json ADDED
@@ -0,0 +1,62 @@
+ {
+   "model_name": "wav2arkit_cpu",
+   "description": "End-to-end audio to ARKit blendshape model",
+   "format": "onnx",
+
+   "source_models": {
+     "audio_encoder": "facebook/wav2vec2-base-960h",
+     "expression_decoder": "3DAIGC/LAM_audio2exp"
+   },
+
+   "audio_encoder": {
+     "source": "facebook/wav2vec2-base-960h",
+     "hidden_size": 768
+   },
+
+   "preprocessing": {
+     "sample_rate": 16000,
+     "channels": 1,
+     "normalize": false
+   },
+
+   "input_spec": {
+     "name": "audio_waveform",
+     "dtype": "float32",
+     "shape": ["batch_size", "num_samples"]
+   },
+
+   "output_spec": {
+     "name": "blendshapes",
+     "dtype": "float32",
+     "shape": ["batch_size", "num_frames", 52],
+     "fps": 30,
+     "value_range": [0.0, 1.0]
+   },
+
+   "num_blendshapes": 52,
+   "output_fps": 30,
+   "frame_formula": "ceil(30 * num_samples / 16000)",
+
+   "blendshape_names": [
+     "browDownLeft", "browDownRight", "browInnerUp", "browOuterUpLeft", "browOuterUpRight",
+     "cheekPuff", "cheekSquintLeft", "cheekSquintRight",
+     "eyeBlinkLeft", "eyeBlinkRight", "eyeLookDownLeft", "eyeLookDownRight",
+     "eyeLookInLeft", "eyeLookInRight", "eyeLookOutLeft", "eyeLookOutRight",
+     "eyeLookUpLeft", "eyeLookUpRight", "eyeSquintLeft", "eyeSquintRight",
+     "eyeWideLeft", "eyeWideRight",
+     "jawForward", "jawLeft", "jawOpen", "jawRight",
+     "mouthClose", "mouthDimpleLeft", "mouthDimpleRight", "mouthFrownLeft", "mouthFrownRight",
+     "mouthFunnel", "mouthLeft", "mouthLowerDownLeft", "mouthLowerDownRight",
+     "mouthPressLeft", "mouthPressRight", "mouthPucker", "mouthRight",
+     "mouthRollLower", "mouthRollUpper", "mouthShrugLower", "mouthShrugUpper",
+     "mouthSmileLeft", "mouthSmileRight", "mouthStretchLeft", "mouthStretchRight",
+     "mouthUpperUpLeft", "mouthUpperUpRight",
+     "noseSneerLeft", "noseSneerRight", "tongueOut"
+   ],
+
+   "onnx": {
+     "opset_version": 18,
+     "producer": "pytorch",
+     "model_file": "wav2arkit_cpu.onnx"
+   }
+ }
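
A minimal sketch that cross-checks the exported graph against this config (assumes both files sit in the working directory):

```python
import json
import onnxruntime as ort

with open("config.json") as f:
    cfg = json.load(f)

session = ort.InferenceSession(cfg["onnx"]["model_file"], providers=["CPUExecutionProvider"])

assert session.get_inputs()[0].name == cfg["input_spec"]["name"]    # "audio_waveform"
assert session.get_outputs()[0].name == cfg["output_spec"]["name"]  # "blendshapes"
assert len(cfg["blendshape_names"]) == cfg["num_blendshapes"]       # 52 names listed
```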
wav2arkit_cpu.onnx ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:cdecbfad3915dd20b2f0718942d0b8894b2ee11edcc5a9a9da45d29a46af2ed9
+ size 1862753
wav2arkit_cpu.onnx.data ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:c0f0364673c6e50be126b193e2b56809c16ac6bee4805aea9b8251ce53429bf8
+ size 402063360