Atotti committed c528222 (verified) · 1 Parent(s): 8a0696f

Upload README.md with huggingface_hub

Files changed (1): README.md (+84 -0)

# AFWhisper - Audio Flamingo Whisper Encoder

The sound encoder (sound_tower) extracted from Audio-Flamingo-3: an audio encoder based on the Qwen2-Audio architecture.

## Model Info

- **Base**: Qwen2AudioEncoder
- **Hidden Size**: 1280
- **Layers**: 32
- **Attention Heads**: 20
- **Sample Rate**: 16000 Hz
- **Max Audio Length**: 30 seconds (fixed)
- **Original**: [nvidia/audio-flamingo-3](https://huggingface.co/nvidia/audio-flamingo-3)

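These values can be read back from the checkpoint itself. A minimal sketch, assuming the repo ships a standard `Qwen2AudioEncoderConfig` (as the usage example below implies):

```python
from transformers.models.qwen2_audio.configuration_qwen2_audio import Qwen2AudioEncoderConfig

# Load only the encoder config (no weights) and check the advertised sizes
config = Qwen2AudioEncoderConfig.from_pretrained("Atotti/AFWhisper")
print(config.d_model)                  # hidden size, expected 1280
print(config.encoder_layers)           # expected 32
print(config.encoder_attention_heads)  # expected 20
```
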
## Installation

```bash
pip install transformers torch librosa
```

## Usage

### Using Transformers

```python
import torch
import numpy as np
import librosa
from transformers import AutoFeatureExtractor
from transformers.models.qwen2_audio.modeling_qwen2_audio import Qwen2AudioEncoder

# Load model
model = Qwen2AudioEncoder.from_pretrained("Atotti/AFWhisper")
model = model.to("cuda", dtype=torch.bfloat16)
model.eval()

# Load feature extractor (from Qwen2-Audio)
feature_extractor = AutoFeatureExtractor.from_pretrained("Qwen/Qwen2-Audio-7B")

# Load audio (16 kHz, 30 s fixed length)
audio, sr = librosa.load("audio.wav", sr=16000)

# Pad/trim to 30 seconds
target_len = 16000 * 30
if len(audio) < target_len:
    audio = np.pad(audio, (0, target_len - len(audio)))
else:
    audio = audio[:target_len]

# Extract features
inputs = feature_extractor([audio], sampling_rate=16000, return_tensors="pt")
input_features = inputs.input_features.to("cuda", dtype=torch.bfloat16)

# Encode
with torch.no_grad():
    output = model(input_features=input_features)
    features = output.last_hidden_state  # [1, T, 1280]

print(f"Features shape: {features.shape}")

# Mean pooling for utterance-level embedding
embedding = features.mean(dim=1)  # [1, 1280]
```

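The feature extractor also accepts a list of arrays, so several clips can be encoded in one forward pass. A minimal sketch, assuming `clips` is a hypothetical list of 30-second float arrays padded/trimmed as above, with `model` and `feature_extractor` from the previous example:

```python
# Batch encoding: clips is a list of np.ndarray, each 16000 * 30 samples
inputs = feature_extractor(clips, sampling_rate=16000, return_tensors="pt")
input_features = inputs.input_features.to("cuda", dtype=torch.bfloat16)

with torch.no_grad():
    batch_features = model(input_features=input_features).last_hidden_state  # [len(clips), T, 1280]

batch_embeddings = batch_features.mean(dim=1)  # [len(clips), 1280]
```
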
## Output

- **Sequential features**: `[batch, time_steps, 1280]` - frame-level time-series features
- **Pooled embedding**: `[batch, 1280]` - utterance-level embedding

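Pooled embeddings can be compared directly, e.g. for audio similarity or retrieval. A minimal sketch, where `features_a` and `features_b` are hypothetical encoder outputs for two different clips obtained as in the usage example:

```python
import torch.nn.functional as F

# features_a, features_b: [1, T, 1280] encoder outputs for two clips
emb_a = features_a.mean(dim=1)  # [1, 1280]
emb_b = features_b.mean(dim=1)
similarity = F.cosine_similarity(emb_a, emb_b)  # [1], values in [-1, 1]
print(similarity.item())
```
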
## License

See [nvidia/audio-flamingo-3](https://huggingface.co/nvidia/audio-flamingo-3) for license information.

## Citation

```bibtex
@article{kong2024audio,
  title={Audio Flamingo: A Novel Audio Language Model with Few-Shot Learning and Dialogue Abilities},
  author={Kong, Zhifeng and Goel, Arushi and Badlani, Rohan and Wang, Wei and Valle, Rafael and Catanzaro, Bryan},
  journal={arXiv preprint arXiv:2402.01831},
  year={2024}
}
```