---
license: mit
base_model:
- moonshotai/Kimi-Audio-7B-Instruct
pipeline_tag: feature-extraction
---
# Kimi-Audio Whisper Encoder

The Whisper encoder as fine-tuned in Kimi-Audio. It extracts continuous acoustic features from audio.

## Model Info

- **Base**: whisper-large-v3
- **Hidden Size**: 1280
- **Original**: [moonshotai/Kimi-Audio-7B-Instruct](https://huggingface.co/moonshotai/Kimi-Audio-7B-Instruct)

## Installation

```bash
pip install transformers librosa torch
```

## Usage

### Using Transformers (Recommended)

```python
import torch
import librosa
from transformers import WhisperFeatureExtractor, WhisperModel

MODEL_ID = "Atotti/Kimi-Audio-Whisper-Encoder"
# Fall back to CPU when no GPU is available
device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the full Whisper model, then keep only the encoder
model = WhisperModel.from_pretrained(MODEL_ID).encoder
model = model.to(device, dtype=torch.bfloat16)
model.eval()

# Load audio, resampled to the 16 kHz rate Whisper expects
audio, sr = librosa.load("audio.wav", sr=16000)

# Convert the waveform to log-mel input features using Whisper's feature extractor
feature_extractor = WhisperFeatureExtractor.from_pretrained(MODEL_ID)
inputs = feature_extractor(audio, sampling_rate=sr, return_tensors="pt")
input_features = inputs.input_features.to(device, dtype=torch.bfloat16)

# Get encoder output
with torch.no_grad():
    encoder_output = model(input_features)
    features = encoder_output.last_hidden_state  # [1, T, 1280]

print(f"Features shape: {features.shape}")
```

### Pooled Features

```python
# Mean pooling for utterance-level embedding
pooled = features.mean(dim=1)  # [1, 1280]
```
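
With its default settings, Whisper's feature extractor pads every clip to 30 seconds, so a plain mean over the time axis also averages frames that correspond to padding. The sketch below pools only the leading frames that cover real audio. `pool_valid_frames` is a hypothetical helper, not part of this repository, and the number of frames to keep is something you supply (roughly 50 frames per second of audio, assuming the standard Whisper front end). Because the encoder attends over the whole padded window, the kept frames are still influenced by padding, so treat this as an approximation.

```python
import torch


def pool_valid_frames(features: torch.Tensor, valid_frames: int) -> torch.Tensor:
    """Mean-pool only the first `valid_frames` encoder frames.

    features: encoder output of shape [batch, time_steps, hidden].
    valid_frames: number of leading frames assumed to cover real audio
        (hypothetical parameter; the same value is used for the whole batch).
    """
    time_steps = features.shape[1]
    # Clamp so that at least one frame and at most all frames are pooled.
    valid = max(1, min(time_steps, valid_frames))
    # Pool in float32 to limit rounding error when features are bfloat16.
    return features[:, :valid].float().mean(dim=1)
```

For the `features` and `audio` variables from the Usage section, a call could look like `pool_valid_frames(features, valid_frames=len(audio) // 320 + 1)`, where 320 samples per frame is an assumption based on Whisper's 16 kHz front end.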

## Output

- **Sequential features**: `[batch, time_steps, 1280]` - frame-level (time-series) features
- **Pooled features**: `[batch, 1280]` - utterance-level features
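
As a rough guide to `time_steps`, the sketch below estimates how many encoder frames a clip produces. It assumes the standard Whisper front end (16 kHz audio, hop length 160, a stride-2 convolution in the encoder, 30-second padded windows). These values are not stated in this model card, and `encoder_frames` is a hypothetical helper, so check them against the repository's preprocessor config.

```python
def encoder_frames(num_samples: int, sampling_rate: int = 16000,
                   hop_length: int = 160, conv_stride: int = 2,
                   chunk_seconds: int = 30) -> tuple[int, int]:
    """Return (total_frames, valid_frames) for one padded window.

    total_frames: encoder output length after padding to `chunk_seconds`.
    valid_frames: frames that overlap the real (unpadded) audio.
    All defaults are assumed standard Whisper values.
    """
    mel_frames = chunk_seconds * sampling_rate // hop_length  # log-mel frames per window
    total = mel_frames // conv_stride                         # encoder output frames
    samples_per_frame = hop_length * conv_stride              # 20 ms at 16 kHz
    # Ceil-divide so a partially filled final frame still counts.
    valid = -(-num_samples // samples_per_frame)
    return total, min(total, valid)


print(encoder_frames(5 * 16000))  # 5 s clip -> (1500, 250)
```

Under these assumptions, audio longer than 30 seconds is truncated to one window, so longer recordings would need to be split into chunks before extraction.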

## License

See [moonshotai/Kimi-Audio-7B-Instruct](https://huggingface.co/moonshotai/Kimi-Audio-7B-Instruct) for license information.