---
base_model:
- nvidia/audio-flamingo-3
pipeline_tag: feature-extraction
---
# AFWhisper - Audio Flamingo Whisper Encoder

The sound encoder (`sound_tower`) extracted from Audio-Flamingo-3.
## Model Info
- Base: Qwen2AudioEncoder
- Hidden Size: 1280
- Layers: 32
- Attention Heads: 20
- Sample Rate: 16000 Hz
- Max Audio Length: 30 seconds (fixed)
- Original: nvidia/audio-flamingo-3
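The fixed 30-second window pins down the input sizes. A back-of-the-envelope sketch, assuming a Whisper-style log-mel front end with a 160-sample hop (the hop length is an assumption, not stated in the specs above):

```python
# Input sizes implied by the fixed 30 s window at 16 kHz.
# Assumption: Whisper-style front end with hop_length = 160 samples (10 ms).
sample_rate = 16000   # Hz, from the model info above
max_seconds = 30      # fixed input length
hop_length = 160      # assumed STFT hop

num_samples = sample_rate * max_seconds       # waveform samples per clip
num_mel_frames = num_samples // hop_length    # mel frames fed to the encoder

print(num_samples, num_mel_frames)  # 480000 3000
```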
## Installation

```bash
pip install transformers torch librosa
```
## Usage

### Using Transformers
```python
import torch
import numpy as np
import librosa
from transformers import AutoFeatureExtractor
from transformers.models.qwen2_audio.modeling_qwen2_audio import Qwen2AudioEncoder

# Load model
model = Qwen2AudioEncoder.from_pretrained("Atotti/AFWhisper")
model = model.to("cuda", dtype=torch.bfloat16)
model.eval()

# Load feature extractor (from Qwen2-Audio)
feature_extractor = AutoFeatureExtractor.from_pretrained("Qwen/Qwen2-Audio-7B")

# Load audio (16 kHz; the encoder expects a fixed 30 s input)
audio, sr = librosa.load("audio.wav", sr=16000)

# Pad/trim to 30 seconds
target_len = 16000 * 30
if len(audio) < target_len:
    audio = np.pad(audio, (0, target_len - len(audio)))
else:
    audio = audio[:target_len]

# Extract log-mel features
inputs = feature_extractor([audio], sampling_rate=16000, return_tensors="pt")
input_features = inputs.input_features.to("cuda", dtype=torch.bfloat16)

# Encode
with torch.no_grad():
    output = model(input_features=input_features)
    features = output.last_hidden_state  # [1, T, 1280]

print(f"Features shape: {features.shape}")

# Mean pooling for an utterance-level embedding
embedding = features.mean(dim=1)  # [1, 1280]
```
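When processing many variable-length clips, the pad/trim step above can be factored into a small helper (a sketch; `pad_or_trim` is a hypothetical helper name, not part of the model's API):

```python
import numpy as np

def pad_or_trim(audio: np.ndarray, target_len: int = 16000 * 30) -> np.ndarray:
    """Zero-pad or truncate a mono waveform to exactly target_len samples.

    Hypothetical helper mirroring the inline pad/trim step in the usage
    example above; not part of the AFWhisper API.
    """
    if len(audio) < target_len:
        return np.pad(audio, (0, target_len - len(audio)))
    return audio[:target_len]

# A 5 s clip is zero-padded, a 40 s clip is truncated; both end up at 480000 samples
short = np.ones(16000 * 5, dtype=np.float32)
long_clip = np.ones(16000 * 40, dtype=np.float32)
print(pad_or_trim(short).shape, pad_or_trim(long_clip).shape)  # (480000,) (480000,)
```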
## Output

- Sequential features: `[batch, time_steps, 1280]` - frame-level (time-series) features
- Pooled embedding: `[batch, 1280]` - utterance-level embedding
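The pooled embedding can be compared across utterances, for example with cosine similarity. A minimal sketch using random tensors as stand-ins for real encoder outputs (the time dimension here is arbitrary, chosen only for illustration):

```python
import torch
import torch.nn.functional as F

# Dummy stand-ins for encoder outputs of two utterances: [batch, time_steps, 1280].
# The time dimension (750) is arbitrary for illustration.
feats_a = torch.randn(1, 750, 1280)
feats_b = torch.randn(1, 750, 1280)

# Mean pooling -> utterance-level embeddings: [batch, 1280]
emb_a = feats_a.mean(dim=1)
emb_b = feats_b.mean(dim=1)

# Cosine similarity in [-1, 1]; higher means more similar utterances
sim = F.cosine_similarity(emb_a, emb_b, dim=-1)
print(sim.shape)  # torch.Size([1])
```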
## License

See [nvidia/audio-flamingo-3](https://huggingface.co/nvidia/audio-flamingo-3) for license information.