---
base_model:
- nvidia/audio-flamingo-3
pipeline_tag: feature-extraction
---
# AFWhisper - Audio Flamingo Whisper Encoder

The sound encoder (`sound_tower`) extracted from Audio-Flamingo-3.

## Model Info

- **Base**: Qwen2AudioEncoder
- **Hidden Size**: 1280
- **Layers**: 32
- **Attention Heads**: 20
- **Sample Rate**: 16000 Hz
- **Max Audio Length**: 30 seconds (fixed)
- **Original**: [nvidia/audio-flamingo-3](https://huggingface.co/nvidia/audio-flamingo-3)
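Because the input window is fixed at 30 seconds, the encoder's output length is predictable. A rough sketch of the arithmetic, assuming the standard Whisper-style front end (16 kHz input, 10 ms mel hop, then a stride-2 convolution inside the encoder) — the exact time dimension `T` you see at runtime may differ if the encoder applies additional pooling:

```python
# Frame-count arithmetic for one fixed 30 s window.
# Assumption: standard Whisper-style front end — 16 kHz audio,
# 10 ms mel hop (160 samples), and a stride-2 conv in the encoder.
sample_rate = 16_000
seconds = 30
hop_length = 160                                   # 10 ms hop in samples

mel_frames = sample_rate * seconds // hop_length   # mel-spectrogram frames
encoder_frames = mel_frames // 2                   # stride-2 conv halves it

print(mel_frames, encoder_frames)  # 3000 1500
```

Treat `encoder_frames` as an upper bound on `T`; verify the actual shape with a forward pass as in the usage example.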

## Installation

```bash
pip install transformers torch librosa
```

## Usage

### Using Transformers

```python
import torch
import numpy as np
from transformers import AutoFeatureExtractor
from transformers.models.qwen2_audio.modeling_qwen2_audio import Qwen2AudioEncoder

# Load model
model = Qwen2AudioEncoder.from_pretrained("Atotti/AFWhisper")
model = model.to("cuda", dtype=torch.bfloat16)
model.eval()

# Load feature extractor (from Qwen2-Audio)
feature_extractor = AutoFeatureExtractor.from_pretrained("Qwen/Qwen2-Audio-7B")

# Load audio (16kHz, 30s fixed length)
import librosa
audio, sr = librosa.load("audio.wav", sr=16000)

# Pad/trim to 30 seconds
target_len = 16000 * 30
if len(audio) < target_len:
    audio = np.pad(audio, (0, target_len - len(audio)))
else:
    audio = audio[:target_len]

# Extract features
inputs = feature_extractor([audio], sampling_rate=16000, return_tensors="pt")
input_features = inputs.input_features.to("cuda", dtype=torch.bfloat16)

# Encode
with torch.no_grad():
    output = model(input_features=input_features)
    features = output.last_hidden_state  # [1, T, 1280]

print(f"Features shape: {features.shape}")

# Mean pooling for utterance-level embedding
embedding = features.mean(dim=1)  # [1, 1280]
```
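When processing many files, the inline pad/trim step above can be factored into a small helper. A minimal NumPy-only sketch (the function name `pad_or_trim` is illustrative, not part of any library used here):

```python
import numpy as np

def pad_or_trim(audio: np.ndarray, target_len: int = 16_000 * 30) -> np.ndarray:
    """Zero-pad or truncate a mono waveform to exactly target_len samples."""
    if len(audio) < target_len:
        return np.pad(audio, (0, target_len - len(audio)))
    return audio[:target_len]

short = pad_or_trim(np.ones(16_000, dtype=np.float32))       # 1 s clip, padded
long = pad_or_trim(np.ones(16_000 * 40, dtype=np.float32))   # 40 s clip, trimmed
print(short.shape, long.shape)  # (480000,) (480000,)
```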

## Output

- **Sequential features**: `[batch, time_steps, 1280]` - frame-level time-series features
- **Pooled embedding**: `[batch, 1280]` - utterance-level embedding
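Pooled embeddings can be compared with cosine similarity, e.g. for audio retrieval or clustering. A self-contained sketch using random stand-ins for real embeddings (shapes match the pooled output above; `cosine_similarity` is a local helper, not a library call):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two 1-D embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
emb_a = rng.standard_normal(1280).astype(np.float32)  # stand-in for a [1280] embedding
emb_b = rng.standard_normal(1280).astype(np.float32)

print(cosine_similarity(emb_a, emb_a))  # ~1.0 for identical vectors
print(cosine_similarity(emb_a, emb_b))  # somewhere in [-1, 1]
```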

## License

See [nvidia/audio-flamingo-3](https://huggingface.co/nvidia/audio-flamingo-3) for license information.