---
base_model:
- nvidia/audio-flamingo-3
pipeline_tag: feature-extraction
---

# AFWhisper - Audio Flamingo Whisper Encoder

The sound encoder (`sound_tower`) of Audio Flamingo 3, packaged as a standalone model for audio feature extraction.

## Model Info

- **Base**: Qwen2AudioEncoder
- **Hidden Size**: 1280
- **Layers**: 32
- **Attention Heads**: 20
- **Sample Rate**: 16000 Hz
- **Max Audio Length**: 30 seconds (fixed)
- **Original**: [nvidia/audio-flamingo-3](https://huggingface.co/nvidia/audio-flamingo-3)

## Installation

```bash
pip install transformers torch librosa
```

## Usage

### Using Transformers

```python
import librosa
import numpy as np
import torch
from transformers import AutoFeatureExtractor
from transformers.models.qwen2_audio.modeling_qwen2_audio import Qwen2AudioEncoder
from transformers.models.qwen2_audio.configuration_qwen2_audio import Qwen2AudioEncoderConfig

# Load model
model = Qwen2AudioEncoder.from_pretrained("Atotti/AFWhisper")
model = model.to("cuda", dtype=torch.bfloat16)
model.eval()

# Load feature extractor (from Qwen2-Audio)
feature_extractor = AutoFeatureExtractor.from_pretrained("Qwen/Qwen2-Audio-7B")

# Load audio (resampled to 16 kHz mono)
audio, sr = librosa.load("audio.wav", sr=16000)

# Pad/trim to the fixed 30-second window
target_len = 16000 * 30
if len(audio) < target_len:
    audio = np.pad(audio, (0, target_len - len(audio)))
else:
    audio = audio[:target_len]

# Extract log-mel features
inputs = feature_extractor([audio], sampling_rate=16000, return_tensors="pt")
input_features = inputs.input_features.to("cuda", dtype=torch.bfloat16)

# Encode
with torch.no_grad():
    output = model(input_features=input_features)
    features = output.last_hidden_state  # [1, T, 1280]

print(f"Features shape: {features.shape}")

# Mean pooling for utterance-level embedding
embedding = features.mean(dim=1)  # [1, 1280]
```
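
Because every clip is padded to 30 seconds, the plain mean in the example above also averages frames that cover only padding. The following is a minimal sketch of an alternative that pools only the frames overlapping real audio. The helper name `masked_mean_pool` is hypothetical, and the 640-samples-per-frame figure is an assumption (10 ms mel hop, followed by a stride-2 convolution and stride-2 average pooling); check it against the printed feature shape before relying on it.

```python
import torch

# Assumption: one encoder frame spans 640 input samples at 16 kHz
# (160-sample mel hop x stride-2 convolution x stride-2 average pooling).
SAMPLES_PER_FRAME = 640


def masked_mean_pool(features: torch.Tensor, num_samples: int) -> torch.Tensor:
    """Mean-pool only the frames that overlap the un-padded audio.

    features: [1, T, D] encoder output for a single clip.
    num_samples: clip length in samples before padding.
    """
    # Integer ceiling division avoids floating-point rounding issues
    valid = -(-num_samples // SAMPLES_PER_FRAME)
    # Clamp to [1, T] so empty or over-long clips still pool something
    valid = max(1, min(valid, features.shape[1]))
    return features[:, :valid].mean(dim=1)


# Stand-in tensor with the same layout as the encoder output
dummy = torch.zeros(1, 750, 1280)
dummy[:, :250] = 1.0  # pretend the first 10 s contain real audio
pooled = masked_mean_pool(dummy, num_samples=16000 * 10)
print(pooled.shape)
```

With the real model, record `len(audio)` before padding and pass it as `num_samples` together with `features`. Note that the encoder's self-attention still sees the padded frames, so this changes only the pooling step, not the frame features themselves.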

## Output

- **Sequential features**: `[batch, time_steps, 1280]` - frame-level (time-series) features
- **Pooled embedding**: `[batch, 1280]` - utterance-level embedding
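
The value of `time_steps` follows from the fixed 30-second input. The arithmetic below is a rough sketch under stated assumptions: the hop length and the two stride values are not given in this card and are assumed from the Whisper-style front end and the Qwen2AudioEncoder architecture, so treat the result as an expectation to confirm against `features.shape`.

```python
SAMPLE_RATE = 16000  # Hz, from Model Info
MAX_SECONDS = 30     # fixed input window, from Model Info
HOP_LENGTH = 160     # assumption: Whisper-style 10 ms mel hop
CONV_STRIDE = 2      # assumption: stride-2 convolution in the encoder stem
POOL_STRIDE = 2      # assumption: stride-2 average pooling after the transformer layers

# Number of log-mel frames produced by the feature extractor
mel_frames = SAMPLE_RATE * MAX_SECONDS // HOP_LENGTH
# Number of frames left after the two downsampling stages
time_steps = mel_frames // CONV_STRIDE // POOL_STRIDE

print(mel_frames, time_steps)
```

If these assumptions hold, each output frame covers 40 ms of audio and a 30-second clip yields 750 frames.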

## License

See [nvidia/audio-flamingo-3](https://huggingface.co/nvidia/audio-flamingo-3) for license information.