---
license: mit
base_model:
- moonshotai/Kimi-Audio-7B-Instruct
pipeline_tag: feature-extraction
---
# Kimi-Audio Whisper Encoder
A Whisper encoder fine-tuned as part of Kimi-Audio. It extracts continuous acoustic features from audio.
## Model Info
- **Base**: whisper-large-v3
- **Hidden Size**: 1280
- **Original**: [moonshotai/Kimi-Audio-7B-Instruct](https://huggingface.co/moonshotai/Kimi-Audio-7B-Instruct)
## Installation
```bash
pip install transformers librosa torch
```
## Usage
### Using Transformers (Recommended)
```python
import torch
import librosa
from transformers import WhisperFeatureExtractor, WhisperModel

# Load the model and keep only the encoder
model = WhisperModel.from_pretrained("Atotti/Kimi-Audio-Whisper-Encoder")
model = model.encoder.to("cuda", dtype=torch.bfloat16)
model.eval()

# Load audio, resampled to 16 kHz mono
audio, sr = librosa.load("audio.wav", sr=16000)

# Compute log-mel input features with Whisper's feature extractor
feature_extractor = WhisperFeatureExtractor.from_pretrained("Atotti/Kimi-Audio-Whisper-Encoder")
inputs = feature_extractor(audio, sampling_rate=16000, return_tensors="pt")
input_features = inputs.input_features.to("cuda", dtype=torch.bfloat16)

# Get encoder output (no gradients needed for feature extraction)
with torch.no_grad():
    encoder_output = model(input_features)

features = encoder_output.last_hidden_state  # [1, T, 1280]
print(f"Features shape: {features.shape}")
```
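By default, Whisper's feature extractor pads or truncates each input to a 30-second window, so a longer recording would be cut off in the snippet above. The following is a minimal sketch of splitting a longer recording into 30-second windows, assuming 16 kHz audio. `chunk_bounds` is a hypothetical helper for illustration, not part of `transformers`.

```python
def chunk_bounds(num_samples, sr=16000, chunk_seconds=30.0):
    """Return (start, end) sample indices covering the audio in fixed-size windows."""
    chunk = int(sr * chunk_seconds)
    if chunk <= 0:
        raise ValueError("sr * chunk_seconds must be positive")
    # The last window may be shorter than the others
    return [(start, min(start + chunk, num_samples))
            for start in range(0, num_samples, chunk)]


# Example: a 70-second recording at 16 kHz is split into three windows
bounds = chunk_bounds(70 * 16000)
print(bounds)  # [(0, 480000), (480000, 960000), (960000, 1120000)]
```

Each slice `audio[start:end]` can then be passed through `feature_extractor` and the encoder in the same way as a short clip; the final, shorter window is zero-padded by the extractor.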
### Pooled Features
```python
# Mean pooling for utterance-level embedding
pooled = features.mean(dim=1) # [1, 1280]
```
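Because the feature extractor pads every input to 30 seconds, a plain mean over `dim=1` also averages frames that correspond to padding. The sketch below pools only over frames that overlap real audio, assuming 16 kHz input and an encoder frame of roughly 20 ms (320 samples). `masked_mean_pool` is a hypothetical helper, the valid-frame count is an approximation, and whether this helps a given downstream task is not verified here.

```python
import math

import torch


def masked_mean_pool(features, num_samples, samples_per_frame=320):
    """Mean over encoder frames that overlap real (non-padded) audio.

    features: tensor of shape [batch, time_steps, hidden]
    num_samples: list with the audio length in samples for each batch item
    """
    feats = features.float()  # accumulate in float32 to limit bfloat16 rounding
    batch, time_steps, _ = feats.shape
    # Approximate number of valid frames per item, clamped to [1, time_steps]
    valid = torch.tensor(
        [min(time_steps, max(1, math.ceil(n / samples_per_frame))) for n in num_samples],
        device=feats.device,
    )
    # mask[b, t] is True where frame t falls inside the real audio of item b
    mask = torch.arange(time_steps, device=feats.device)[None, :] < valid[:, None]
    summed = (feats * mask.unsqueeze(-1)).sum(dim=1)
    return summed / valid[:, None]
```

With the variables from the usage example, this would be called as `pooled = masked_mean_pool(features, [len(audio)])`, giving a `[1, 1280]` tensor in float32.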
## Output
- **Sequential features**: `[batch, time_steps, 1280]` - frame-level (time-series) features
- **Pooled features**: `[batch, 1280]` - utterance-level features
## License
See [moonshotai/Kimi-Audio-7B-Instruct](https://huggingface.co/moonshotai/Kimi-Audio-7B-Instruct) for license information.