---
license: mit
base_model:
- moonshotai/Kimi-Audio-7B-Instruct
pipeline_tag: feature-extraction
---

# Kimi-Audio Whisper Encoder

The Whisper encoder fine-tuned as part of Kimi-Audio. It extracts continuous acoustic features from audio.

## Model Info

- **Base**: whisper-large-v3
- **Hidden Size**: 1280
- **Original**: [moonshotai/Kimi-Audio-7B-Instruct](https://huggingface.co/moonshotai/Kimi-Audio-7B-Instruct)

## Installation

```bash
pip install transformers librosa torch
```

## Usage

### Using Transformers (Recommended)

```python
import torch
import librosa
from transformers import WhisperFeatureExtractor, WhisperModel

# Load the model and keep only the encoder
model = WhisperModel.from_pretrained("Atotti/Kimi-Audio-Whisper-Encoder")
model = model.encoder.to("cuda", dtype=torch.bfloat16)
model.eval()

# Load audio at Whisper's expected 16 kHz sampling rate
audio, sr = librosa.load("audio.wav", sr=16000)

# Convert the waveform to log-mel input features using Whisper's feature extractor
feature_extractor = WhisperFeatureExtractor.from_pretrained("Atotti/Kimi-Audio-Whisper-Encoder")
inputs = feature_extractor(audio, sampling_rate=16000, return_tensors="pt")
input_features = inputs.input_features.to("cuda", dtype=torch.bfloat16)

# Get encoder output
with torch.no_grad():
    encoder_output = model(input_features)
    features = encoder_output.last_hidden_state  # [1, T, 1280]

print(f"Features shape: {features.shape}")
```

### Pooled Features

```python
# Mean pooling for an utterance-level embedding
pooled = features.mean(dim=1)  # [1, 1280]
```

Note that Whisper's feature extractor pads audio to 30 seconds, so for short clips the mean is taken over padded frames as well; consider pooling only over the frames that correspond to actual audio. A sketch that compares two utterances via these pooled embeddings appears at the end of this card.

## Output

- **Sequential features**: `[batch, time_steps, 1280]` - frame-level time-series features
- **Pooled features**: `[batch, 1280]` - utterance-level features

## License

See [moonshotai/Kimi-Audio-7B-Instruct](https://huggingface.co/moonshotai/Kimi-Audio-7B-Instruct) for license information.
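
## Example: Comparing Utterances

As a usage sketch (not part of the original card), the pooled embeddings can be compared with cosine similarity to gauge how acoustically similar two utterances are. The `extract_pooled` helper and the file names below are hypothetical; the extraction steps mirror the Usage section above.

```python
import torch
import librosa
from transformers import WhisperFeatureExtractor, WhisperModel

model_id = "Atotti/Kimi-Audio-Whisper-Encoder"
feature_extractor = WhisperFeatureExtractor.from_pretrained(model_id)
encoder = WhisperModel.from_pretrained(model_id).encoder.to("cuda", dtype=torch.bfloat16)
encoder.eval()

def extract_pooled(path: str) -> torch.Tensor:
    # Hypothetical helper: load audio, run the encoder, mean-pool over time
    audio, _ = librosa.load(path, sr=16000)
    inputs = feature_extractor(audio, sampling_rate=16000, return_tensors="pt")
    input_features = inputs.input_features.to("cuda", dtype=torch.bfloat16)
    with torch.no_grad():
        hidden = encoder(input_features).last_hidden_state  # [1, T, 1280]
    return hidden.mean(dim=1)  # [1, 1280]

# Placeholder file names
a = extract_pooled("utterance_a.wav")
b = extract_pooled("utterance_b.wav")

# Cast to float32 before the similarity computation for numerical stability
similarity = torch.nn.functional.cosine_similarity(a.float(), b.float(), dim=-1)
print(f"Cosine similarity: {similarity.item():.3f}")
```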