---
base_model:
- nvidia/audio-flamingo-3
pipeline_tag: feature-extraction
---

# AFWhisper - Audio Flamingo Whisper Encoder

The sound encoder (sound_tower) extracted from Audio-Flamingo-3.

## Model Info

- **Base**: Qwen2AudioEncoder
- **Hidden Size**: 1280
- **Layers**: 32
- **Attention Heads**: 20
- **Sample Rate**: 16000 Hz
- **Max Audio Length**: 30 seconds (fixed)
- **Original**: [nvidia/audio-flamingo-3](https://huggingface.co/nvidia/audio-flamingo-3)

## Installation

```bash
pip install transformers torch librosa
```

## Usage

### Using Transformers

```python
import librosa
import numpy as np
import torch
from transformers import AutoFeatureExtractor
from transformers.models.qwen2_audio.modeling_qwen2_audio import Qwen2AudioEncoder

# Load the encoder
model = Qwen2AudioEncoder.from_pretrained("Atotti/AFWhisper")
model = model.to("cuda", dtype=torch.bfloat16)
model.eval()

# Load the feature extractor (from Qwen2-Audio)
feature_extractor = AutoFeatureExtractor.from_pretrained("Qwen/Qwen2-Audio-7B")

# Load audio at 16 kHz; the encoder expects a fixed 30-second window
audio, sr = librosa.load("audio.wav", sr=16000)

# Pad or trim to exactly 30 seconds
target_len = 16000 * 30
if len(audio) < target_len:
    audio = np.pad(audio, (0, target_len - len(audio)))
else:
    audio = audio[:target_len]

# Compute log-mel input features
inputs = feature_extractor([audio], sampling_rate=16000, return_tensors="pt")
input_features = inputs.input_features.to("cuda", dtype=torch.bfloat16)

# Encode
with torch.no_grad():
    output = model(input_features=input_features)
    features = output.last_hidden_state  # [1, T, 1280]

print(f"Features shape: {features.shape}")

# Mean pooling for an utterance-level embedding
embedding = features.mean(dim=1)  # [1, 1280]
```

## Output

- **Sequential features**: `[batch, time_steps, 1280]` - frame-level features over time
- **Pooled embedding**: `[batch, 1280]` - a single utterance-level embedding

For clips longer than 30 seconds, and for comparing clips via the pooled embedding, see the sketches at the end of this card.

## License

See [nvidia/audio-flamingo-3](https://huggingface.co/nvidia/audio-flamingo-3) for license information.
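
## Handling audio longer than 30 seconds (sketch)

The encoder consumes fixed 30-second windows, so longer recordings have to be chunked. Below is a minimal sketch that splits a clip into non-overlapping 30-second windows, encodes each one, and averages the pooled embeddings. The `encode_window` helper and `long_audio.wav` are illustrative, not part of the released model's API, and zero-padding the final partial window is a simplification that slightly dilutes its embedding.

```python
import librosa
import numpy as np
import torch
from transformers import AutoFeatureExtractor
from transformers.models.qwen2_audio.modeling_qwen2_audio import Qwen2AudioEncoder

SR = 16000
WINDOW = SR * 30  # the encoder consumes fixed 30-second windows

model = Qwen2AudioEncoder.from_pretrained("Atotti/AFWhisper").to("cuda", dtype=torch.bfloat16).eval()
feature_extractor = AutoFeatureExtractor.from_pretrained("Qwen/Qwen2-Audio-7B")

def encode_window(window: np.ndarray) -> torch.Tensor:
    """Encode one (zero-padded) 30 s window to a [1, 1280] mean-pooled embedding."""
    if len(window) < WINDOW:
        # Zero-pad short windows; padded frames are included in the mean (a simplification)
        window = np.pad(window, (0, WINDOW - len(window)))
    inputs = feature_extractor([window], sampling_rate=SR, return_tensors="pt")
    feats = inputs.input_features.to("cuda", dtype=torch.bfloat16)
    with torch.no_grad():
        hidden = model(input_features=feats).last_hidden_state  # [1, T, 1280]
    return hidden.mean(dim=1)  # [1, 1280]

# Split a long clip into non-overlapping 30 s windows and average their embeddings
audio, _ = librosa.load("long_audio.wav", sr=SR)  # hypothetical file
windows = [audio[i : i + WINDOW] for i in range(0, len(audio), WINDOW)]
embedding = torch.cat([encode_window(w) for w in windows]).mean(dim=0, keepdim=True)
print(embedding.shape)  # [1, 1280]
```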
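
## Comparing clips with the pooled embedding (sketch)

One natural use of the utterance-level embedding is measuring similarity between clips. The sketch below reuses the hypothetical `encode_window` helper, `SR`, and `WINDOW` from the previous section; `clip_a.wav` and `clip_b.wav` are placeholder file names.

```python
import torch.nn.functional as F

# Trim each clip to one 30 s window before encoding (encode_window pads short clips)
emb_a = encode_window(librosa.load("clip_a.wav", sr=SR)[0][:WINDOW])
emb_b = encode_window(librosa.load("clip_b.wav", sr=SR)[0][:WINDOW])

# Cosine similarity between the two utterance-level embeddings
similarity = F.cosine_similarity(emb_a.float(), emb_b.float(), dim=-1)
print(f"cosine similarity: {similarity.item():.3f}")
```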