---
license: mit
base_model:
- moonshotai/Kimi-Audio-7B-Instruct
pipeline_tag: feature-extraction
---
# Kimi-Audio Whisper Encoder
A Whisper encoder fine-tuned as part of Kimi-Audio. It extracts continuous acoustic features from audio.
## Model Info
- **Base**: whisper-large-v3
- **Hidden Size**: 1280
- **Original**: [moonshotai/Kimi-Audio-7B-Instruct](https://huggingface.co/moonshotai/Kimi-Audio-7B-Instruct)
## Installation
```bash
pip install transformers librosa torch
```
## Usage
### Using Transformers (Recommended)
```python
import torch
import librosa
from transformers import WhisperFeatureExtractor, WhisperModel

# Load the model and keep only the encoder
model = WhisperModel.from_pretrained("Atotti/Kimi-Audio-Whisper-Encoder")
encoder = model.encoder.to("cuda", dtype=torch.bfloat16)
encoder.eval()

# Load audio at the 16 kHz sampling rate Whisper expects
audio, sr = librosa.load("audio.wav", sr=16000)

# Convert the waveform to log-mel input features
feature_extractor = WhisperFeatureExtractor.from_pretrained("Atotti/Kimi-Audio-Whisper-Encoder")
inputs = feature_extractor(audio, sampling_rate=16000, return_tensors="pt")
input_features = inputs.input_features.to("cuda", dtype=torch.bfloat16)

# Run the encoder to get continuous acoustic features
with torch.no_grad():
    encoder_output = encoder(input_features)
features = encoder_output.last_hidden_state  # [1, T, 1280]
print(f"Features shape: {features.shape}")
```
### Pooled Features
```python
# Mean pooling for utterance-level embedding
pooled = features.mean(dim=1) # [1, 1280]
```
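One common use of the pooled embedding is comparing utterances with cosine similarity. A minimal sketch, using random tensors in place of real encoder output (the shapes match the encoder's `[batch, time_steps, 1280]` output):

```python
import torch
import torch.nn.functional as F

# Stand-ins for encoder outputs of two utterances
features_a = torch.randn(1, 1500, 1280)
features_b = torch.randn(1, 1500, 1280)

# Mean-pool over the time dimension for utterance-level embeddings
emb_a = features_a.mean(dim=1)  # [1, 1280]
emb_b = features_b.mean(dim=1)  # [1, 1280]

# Cosine similarity between the two utterance embeddings
similarity = F.cosine_similarity(emb_a, emb_b, dim=-1)  # [1], in [-1, 1]
```

With real encoder output, higher similarity suggests acoustically more similar utterances.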
## Output
- **Sequential features**: `[batch, time_steps, 1280]` - frame-level features over time
- **Pooled features**: `[batch, 1280]` - utterance-level features
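For reference, `time_steps` follows from Whisper's fixed-length input: the feature extractor pads or truncates audio to a 30-second chunk at 16 kHz, yielding 3000 log-mel frames (10 ms hop), and the encoder's convolutional stem downsamples by a factor of 2. These are standard whisper-large-v3 values and are assumed to carry over to this fine-tuned checkpoint:

```python
# Standard whisper-large-v3 input arithmetic (assumed for this checkpoint)
chunk_seconds = 30
sampling_rate = 16000
hop_length = 160          # mel hop: 10 ms per frame
conv_downsample = 2       # stride of the encoder's second conv layer

mel_frames = chunk_seconds * sampling_rate // hop_length  # 3000
time_steps = mel_frames // conv_downsample                # 1500
```

So a full 30-second chunk produces features of shape `[1, 1500, 1280]`.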
## License
See [moonshotai/Kimi-Audio-7B-Instruct](https://huggingface.co/moonshotai/Kimi-Audio-7B-Instruct) for license information.