pgkwon1 committed on
Commit 4ebed0f · verified · 1 Parent(s): a8c82be

Upload 4 files

Files changed (4):
  1. Dockerfile +22 -0
  2. README.md +140 -6
  3. app.py +528 -0
  4. requirements.txt +12 -0
Dockerfile ADDED
@@ -0,0 +1,22 @@
+ FROM python:3.10-slim
+
+ WORKDIR /app
+
+ # Install system packages
+ RUN apt-get update && apt-get install -y \
+     ffmpeg \
+     git \
+     && rm -rf /var/lib/apt/lists/*
+
+ # Install Python packages
+ COPY requirements.txt .
+ RUN pip install --no-cache-dir -r requirements.txt
+
+ # Copy the app
+ COPY app.py .
+
+ # Expose the port
+ EXPOSE 7860
+
+ # Run the server
+ CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "7860"]
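
To try the image locally before pushing it to Spaces, a build-and-run sketch (the `speechlib-api` tag is illustrative, not part of the repo; this is deployment configuration, assuming Docker is installed):

```shell
# Build the image from the Dockerfile above
docker build -t speechlib-api .

# Run it, mapping the port the CMD listens on
docker run --rm -p 7860:7860 speechlib-api
# The API is then reachable at http://localhost:7860 (interactive docs at /docs)
```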
README.md CHANGED
@@ -1,11 +1,145 @@
  ---
- title: SpeechlibProject
- emoji: 👍
- colorFrom: purple
- colorTo: green
+ title: Speechlib API
+ emoji: 🎤
+ colorFrom: blue
+ colorTo: purple
  sdk: docker
+ app_file: app.py
  pinned: false
- license: mit
  ---

- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
+ # Speechlib REST API (ECAPA-TDNN)
+
+ A REST API for speaker diarization, speaker identification, and speech-to-text (STT).
+
+ ## Features
+
+ - **Speaker diarization**: separates multiple speakers with pyannote/speaker-diarization-3.1
+ - **Speaker identification**: matches registered speakers with the speechbrain ECAPA-TDNN model (high accuracy)
+ - **Speech recognition**: STT with faster-whisper (large-v3-turbo)
+
+ ## API Endpoints
+
+ ### GET /
+ Check API status
+
+ ### GET /health
+ Health check
+
+ ### POST /transcribe
+ STT + speaker diarization only (no speaker identification)
+
+ **Parameters (multipart/form-data):**
+ - `audio`: audio file (required)
+ - `language`: language code (default: ko)
+ - `hf_token`: HuggingFace token (required)
+
+ ### POST /process
+ Full pipeline: speaker diarization + speaker identification + STT
+
+ **Parameters (multipart/form-data):**
+ - `audio`: audio file to analyze (required)
+ - `voice_sample`: speaker sample file (optional)
+ - `speaker_name`: name for the identified speaker (default: speaker)
+ - `language`: language code (default: ko)
+ - `hf_token`: HuggingFace token (required)
+
+ ## Usage Example
+
+ ### cURL
+
+ ```bash
+ # STT only
+ curl -X POST "https://YOUR_SPACE.hf.space/transcribe" \
+   -F "audio=@audio.wav" \
+   -F "language=ko" \
+   -F "hf_token=hf_YOUR_TOKEN"
+
+ # With speaker identification
+ curl -X POST "https://YOUR_SPACE.hf.space/process" \
+   -F "audio=@conversation.wav" \
+   -F "voice_sample=@speaker_sample.wav" \
+   -F "speaker_name=홍길동" \
+   -F "language=ko" \
+   -F "hf_token=hf_YOUR_TOKEN"
+ ```
+
+ ### Python
+
+ ```python
+ import requests
+
+ # STT only
+ response = requests.post(
+     "https://YOUR_SPACE.hf.space/transcribe",
+     files={"audio": open("audio.wav", "rb")},
+     data={"language": "ko", "hf_token": "hf_YOUR_TOKEN"}
+ )
+ print(response.json())
+
+ # With speaker identification
+ response = requests.post(
+     "https://YOUR_SPACE.hf.space/process",
+     files={
+         "audio": open("conversation.wav", "rb"),
+         "voice_sample": open("speaker_sample.wav", "rb")
+     },
+     data={
+         "speaker_name": "홍길동",
+         "language": "ko",
+         "hf_token": "hf_YOUR_TOKEN"
+     }
+ )
+ print(response.json())
+ ```
+
+ ### JavaScript/Node.js
+
+ ```javascript
+ const FormData = require('form-data');
+ const fs = require('fs');
+ const axios = require('axios');
+
+ (async () => {
+   const form = new FormData();
+   form.append('audio', fs.createReadStream('audio.wav'));
+   form.append('language', 'ko');
+   form.append('hf_token', 'hf_YOUR_TOKEN');
+
+   const response = await axios.post(
+     'https://YOUR_SPACE.hf.space/transcribe',
+     form,
+     { headers: form.getHeaders() }
+   );
+   console.log(response.data);
+ })();
+ ```
+
+ ## Response Format
+
+ ```json
+ {
+   "success": true,
+   "segments": [
+     {
+       "start": 0.0,
+       "end": 2.5,
+       "text": "안녕하세요",
+       "speaker": "홍길동",
+       "similarity": 85.3
+     }
+   ],
+   "speaker_stats": {
+     "홍길동": {
+       "count": 10,
+       "duration": 45.5
+     }
+   },
+   "total_segments": 20
+ }
+ ```
+
+ ## Notes
+
+ - ECAPA-TDNN matches a speaker when the similarity is at or above the 25% threshold
+ - Automatically uses the GPU when one is available
+ - Supported audio formats: wav, mp3, m4a, ogg, flac, aac
+ - API docs: https://YOUR_SPACE.hf.space/docs
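
The 25% threshold above refers to cosine similarity between L2-normalized ECAPA embeddings. A minimal sketch of that comparison, with toy 3-dimensional vectors standing in for real 192-dimensional embeddings:

```python
import numpy as np

SIMILARITY_THRESHOLD = 0.25  # the API's default threshold

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity; after L2-normalization this reduces to a dot product."""
    a = a / np.linalg.norm(a)
    b = b / np.linalg.norm(b)
    return float(np.dot(a, b))

# Toy vectors standing in for ECAPA embeddings (made up for illustration)
ref = np.array([1.0, 0.0, 0.0])
close = np.array([0.9, 0.1, 0.0])
far = np.array([0.0, 0.2, 1.0])

print(cosine_similarity(ref, close) >= SIMILARITY_THRESHOLD)  # True  -> matched
print(cosine_similarity(ref, far) >= SIMILARITY_THRESHOLD)    # False -> left unmatched
```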
app.py ADDED
@@ -0,0 +1,528 @@
+ """
+ Speechlib REST API - HuggingFace Spaces (ECAPA-TDNN version)
+ Speaker diarization + speaker identification + STT
+ """
+ import os
+ import tempfile
+ import json
+ import numpy as np
+ import shutil
+ from typing import List, Dict, Optional
+ from contextlib import asynccontextmanager
+
+ # Environment setup
+ os.environ["HF_HUB_DISABLE_SYMLINKS_WARNING"] = "1"
+ os.environ["HF_HUB_DISABLE_SYMLINKS"] = "1"
+
+ import torch
+
+ # PyTorch compatibility patch (branch by version)
+ if hasattr(torch.serialization, 'add_safe_globals'):
+     torch.serialization.add_safe_globals([torch.torch_version.TorchVersion])
+     from pyannote.audio.core import task as pyannote_task
+     from pyannote.audio.core.io import Audio
+     torch.serialization.add_safe_globals([
+         pyannote_task.Specifications,
+         pyannote_task.Problem,
+         pyannote_task.Resolution,
+         Audio
+     ])
+
+ # Patch torch.load to default to weights_only=False
+ original_load = torch.load
+ def patched_load(*args, **kwargs):
+     if 'weights_only' not in kwargs:
+         kwargs['weights_only'] = False
+     return original_load(*args, **kwargs)
+ torch.load = patched_load
+
+ from fastapi import FastAPI, UploadFile, File, Form, HTTPException
+ from fastapi.responses import JSONResponse
+ import uvicorn
+ import torchaudio
+ from pydub import AudioSegment
+
+
+ class SpeakerPipelineECAPA:
+     """
+     Speaker identification pipeline using ECAPA-TDNN embeddings.
+     """
+
+     def __init__(
+         self,
+         hf_token: str,
+         whisper_model: str = "large-v3-turbo",
+         similarity_threshold: float = 0.25,
+         device: str = None
+     ):
+         self.hf_token = hf_token
+         self.whisper_model_size = whisper_model
+         self.similarity_threshold = similarity_threshold
+
+         # Use the GPU when available, otherwise fall back to CPU
+         if device is None:
+             self.device = "cuda" if torch.cuda.is_available() else "cpu"
+         else:
+             self.device = device
+
+         self.registered_speakers: Dict[str, np.ndarray] = {}
+
+         # Models (lazy loading)
+         self._diarization_pipeline = None
+         self._ecapa_model = None
+         self._whisper_model = None
+
+         print("[SpeakerPipeline ECAPA-TDNN] initialized")
+         print(f" - Device: {self.device}")
+         print(f" - Threshold: {similarity_threshold}")
+
+     @property
+     def diarization_pipeline(self):
+         if self._diarization_pipeline is None:
+             print("[Loading] pyannote/speaker-diarization-3.1...")
+             from pyannote.audio import Pipeline
+             self._diarization_pipeline = Pipeline.from_pretrained(
+                 "pyannote/speaker-diarization-3.1",
+                 use_auth_token=self.hf_token
+             )
+             self._diarization_pipeline.to(torch.device(self.device))
+         return self._diarization_pipeline
+
+     @property
+     def ecapa_model(self):
+         if self._ecapa_model is None:
+             print("[Loading] speechbrain ECAPA-TDNN...")
+             from speechbrain.inference.speaker import EncoderClassifier
+             self._ecapa_model = EncoderClassifier.from_hparams(
+                 source="speechbrain/spkrec-ecapa-voxceleb",
+                 savedir="pretrained_models/spkrec-ecapa-voxceleb",
+                 run_opts={"device": self.device}
+             )
+         return self._ecapa_model
+
+     @property
+     def whisper_model(self):
+         if self._whisper_model is None:
+             print(f"[Loading] faster-whisper {self.whisper_model_size}...")
+             from faster_whisper import WhisperModel
+             compute_type = "float16" if self.device == "cuda" else "int8"
+             self._whisper_model = WhisperModel(
+                 self.whisper_model_size,
+                 device=self.device,
+                 compute_type=compute_type
+             )
+         return self._whisper_model
+
+     def _load_audio(self, audio_path: str) -> tuple:
+         """Load and preprocess audio."""
+         ext = os.path.splitext(audio_path)[1].lower()
+
+         if ext in ['.m4a', '.mp4', '.aac', '.ogg', '.flac', '.mp3']:
+             # Convert compressed formats to wav via pydub/ffmpeg first
+             audio = AudioSegment.from_file(audio_path)
+             audio = audio.set_channels(1).set_frame_rate(16000)
+             with tempfile.NamedTemporaryFile(suffix='.wav', delete=False) as tmp:
+                 tmp_path = tmp.name
+             audio.export(tmp_path, format='wav')
+             waveform, sample_rate = torchaudio.load(tmp_path)
+             os.unlink(tmp_path)
+         else:
+             waveform, sample_rate = torchaudio.load(audio_path)
+
+         # Downmix to mono
+         if waveform.shape[0] > 1:
+             waveform = waveform.mean(dim=0, keepdim=True)
+
+         # Resample to 16 kHz
+         if sample_rate != 16000:
+             resampler = torchaudio.transforms.Resample(sample_rate, 16000)
+             waveform = resampler(waveform)
+             sample_rate = 16000
+
+         return waveform, sample_rate
+
+     def get_embedding_ecapa(self, waveform: torch.Tensor) -> np.ndarray:
+         """Extract an embedding with ECAPA-TDNN."""
+         if waveform.dim() == 2:
+             waveform = waveform.squeeze(0)
+
+         waveform = waveform.to(self.device)
+
+         with torch.no_grad():
+             embedding = self.ecapa_model.encode_batch(waveform.unsqueeze(0))
+
+         return embedding.squeeze().cpu().numpy()
+
+     def register_speaker(self, name: str, audio_paths: List[str]) -> None:
+         """Register a speaker from reference samples."""
+         print(f"\n[Speaker registration] {name} ({len(audio_paths)} samples)")
+         embeddings = []
+
+         for path in audio_paths:
+             if not os.path.exists(path):
+                 continue
+
+             try:
+                 waveform, sr = self._load_audio(path)
+                 emb = self.get_embedding_ecapa(waveform)
+                 emb = emb / np.linalg.norm(emb)
+                 embeddings.append(emb)
+                 print(f" ✓ {os.path.basename(path)}")
+             except Exception as e:
+                 print(f" ✗ error ({os.path.basename(path)}): {e}")
+
+         if not embeddings:
+             print(" [Warning] no valid samples!")
+             return
+
+         # Average the L2-normalized embeddings and renormalize
+         avg_embedding = np.mean(embeddings, axis=0)
+         avg_embedding = avg_embedding / np.linalg.norm(avg_embedding)
+         self.registered_speakers[name] = avg_embedding
+         print(f"[Speaker registration] {name} done!")
+
+     def process(self, audio_path: str, language: str = "ko") -> List[Dict]:
+         """Main processing entry point."""
+         print(f"\n[Processing] {os.path.basename(audio_path)}")
+
+         waveform, sample_rate = self._load_audio(audio_path)
+         audio_dict = {"waveform": waveform, "sample_rate": sample_rate}
+
+         # 1. Speaker diarization
+         print("[1/3] Running speaker diarization...")
+         raw_diarization = self.diarization_pipeline(audio_dict)
+
+         # Some pyannote versions wrap the annotation; find the object with itertracks()
+         diarization = None
+         if hasattr(raw_diarization, "itertracks"):
+             diarization = raw_diarization
+         else:
+             for attr in dir(raw_diarization):
+                 if attr.startswith("_"):
+                     continue
+                 try:
+                     val = getattr(raw_diarization, attr)
+                     if hasattr(val, "itertracks"):
+                         diarization = val
+                         break
+                 except Exception:
+                     pass
+
+         if diarization is None:
+             raise RuntimeError("Could not parse the diarization result.")
+
+         segments = []
+         for turn, _, speaker in diarization.itertracks(yield_label=True):
+             segments.append({
+                 "start": turn.start,
+                 "end": turn.end,
+                 "diarization_speaker": speaker
+             })
+         print(f" → {len(segments)} segments detected")
+
+         # 2. Speaker identification (ECAPA-TDNN)
+         if self.registered_speakers:
+             print("[2/3] Identifying speakers (ECAPA-TDNN)...")
+
+             speaker_embeddings = {}
+             speakers_found = set(seg["diarization_speaker"] for seg in segments)
+
+             for spk in speakers_found:
+                 spk_embs = []
+                 for seg in segments:
+                     if seg["diarization_speaker"] != spk:
+                         continue
+
+                     duration = seg["end"] - seg["start"]
+                     if duration < 0.5:
+                         continue
+
+                     try:
+                         start_sample = int(seg["start"] * sample_rate)
+                         end_sample = int(seg["end"] * sample_rate)
+                         end_sample = min(end_sample, waveform.shape[1])
+                         seg_waveform = waveform[:, start_sample:end_sample]
+
+                         if seg_waveform.shape[1] < sample_rate * 0.3:
+                             continue
+
+                         emb = self.get_embedding_ecapa(seg_waveform)
+                         emb = emb / np.linalg.norm(emb)
+                         spk_embs.append(emb)
+                     except Exception:
+                         pass
+
+                 if spk_embs:
+                     speaker_embeddings[spk] = spk_embs
+
+             # Score each diarized speaker against every registered speaker
+             speaker_mapping = {}
+             speaker_scores = {}
+
+             for spk, embs in speaker_embeddings.items():
+                 avg_emb = np.mean(embs, axis=0)
+                 avg_emb = avg_emb / np.linalg.norm(avg_emb)
+
+                 speaker_scores[spk] = {}
+                 for name, ref_emb in self.registered_speakers.items():
+                     sim = np.dot(avg_emb, ref_emb)
+                     speaker_scores[spk][name] = sim
+
+             # Competitive matching: each registered name claims its best-scoring speaker
+             for reg_name in self.registered_speakers.keys():
+                 best_spk = None
+                 best_sim = -1
+
+                 for spk in speaker_scores.keys():
+                     if spk in [m[0] for m in speaker_mapping.values() if m[0] != spk]:
+                         continue
+
+                     sim = speaker_scores[spk].get(reg_name, -1)
+                     if sim > best_sim:
+                         best_sim = sim
+                         best_spk = spk
+
+                 if best_spk and best_sim >= self.similarity_threshold:
+                     speaker_mapping[best_spk] = (reg_name, best_sim)
+
+             for spk in speaker_scores.keys():
+                 if spk not in speaker_mapping:
+                     speaker_mapping[spk] = (spk, 0.0)
+
+             for seg in segments:
+                 d_spk = seg["diarization_speaker"]
+                 if d_spk in speaker_mapping:
+                     seg["speaker"], seg["similarity"] = speaker_mapping[d_spk]
+                 else:
+                     seg["speaker"] = d_spk
+                     seg["similarity"] = 0.0
+         else:
+             for seg in segments:
+                 seg["speaker"] = seg["diarization_speaker"]
+                 seg["similarity"] = 0.0
+
+         # 3. STT
+         print("[3/3] Running speech recognition (STT)...")
+         whisper_segs, _ = self.whisper_model.transcribe(
+             audio_path, language=language, beam_size=5, vad_filter=True
+         )
+         whisper_results = [{"start": s.start, "end": s.end, "text": s.text.strip()} for s in whisper_segs]
+
+         # 4. Merge: assign each Whisper segment the speaker with maximal overlap
+         final_results = []
+         for w_seg in whisper_results:
+             best_speaker = "Unknown"
+             best_overlap = 0
+             best_sim = 0.0
+
+             for d_seg in segments:
+                 overlap = max(0, min(w_seg["end"], d_seg["end"]) - max(w_seg["start"], d_seg["start"]))
+                 if overlap > best_overlap:
+                     best_overlap = overlap
+                     best_speaker = d_seg["speaker"]
+                     best_sim = d_seg.get("similarity", 0.0)
+
+             final_results.append({
+                 "start": w_seg["start"],
+                 "end": w_seg["end"],
+                 "text": w_seg["text"],
+                 "speaker": best_speaker,
+                 "similarity": round(best_sim * 100, 1)
+             })
+
+         return final_results
+
+
+ # Global pipeline instance
+ _pipeline: Optional[SpeakerPipelineECAPA] = None
+
+
+ def get_pipeline(hf_token: str) -> SpeakerPipelineECAPA:
+     """Return the singleton pipeline instance."""
+     global _pipeline
+     if _pipeline is None:
+         _pipeline = SpeakerPipelineECAPA(hf_token=hf_token)
+     return _pipeline
+
+
+ # FastAPI app
+ @asynccontextmanager
+ async def lifespan(app: FastAPI):
+     # On startup
+     print("🚀 Speechlib API server starting")
+     yield
+     # On shutdown
+     print("👋 Speechlib API server stopped")
+
+
+ app = FastAPI(
+     title="Speechlib API",
+     description="Speaker diarization + speaker identification + STT REST API (ECAPA-TDNN)",
+     version="1.0.0",
+     lifespan=lifespan
+ )
+
+
+ @app.get("/")
+ async def root():
+     """Check API status."""
+     return {
+         "status": "ok",
+         "message": "Speechlib API (ECAPA-TDNN)",
+         "endpoints": {
+             "/transcribe": "POST - STT + speaker diarization",
+             "/process": "POST - full pipeline (with speaker identification)"
+         }
+     }
+
+
+ @app.get("/health")
+ async def health_check():
+     """Health check."""
+     return {"status": "healthy", "device": "cuda" if torch.cuda.is_available() else "cpu"}
+
+
+ @app.post("/transcribe")
+ async def transcribe(
+     audio: UploadFile = File(..., description="Audio file"),
+     language: str = Form(default="ko", description="Language code (ko, en, ja, zh)"),
+     hf_token: str = Form(..., description="HuggingFace token")
+ ):
+     """
+     STT + speaker diarization only (no speaker identification).
+     """
+     temp_dir = None
+     try:
+         # Save the upload to a temp file
+         temp_dir = tempfile.mkdtemp()
+         audio_path = os.path.join(temp_dir, audio.filename)
+
+         with open(audio_path, "wb") as f:
+             content = await audio.read()
+             f.write(content)
+
+         # Run the pipeline
+         pipeline = get_pipeline(hf_token)
+         pipeline.registered_speakers.clear()  # no speaker identification
+
+         results = pipeline.process(audio_path, language=language)
+
+         # Format the results
+         segments = []
+         speaker_stats = {}
+
+         for r in results:
+             segments.append({
+                 "start": round(r["start"], 2),
+                 "end": round(r["end"], 2),
+                 "text": r["text"],
+                 "speaker": r["speaker"]
+             })
+
+             speaker = r["speaker"]
+             if speaker not in speaker_stats:
+                 speaker_stats[speaker] = {"count": 0, "duration": 0}
+             speaker_stats[speaker]["count"] += 1
+             speaker_stats[speaker]["duration"] += r["end"] - r["start"]
+
+         for speaker in speaker_stats:
+             speaker_stats[speaker]["duration"] = round(speaker_stats[speaker]["duration"], 2)
+
+         return JSONResponse(content={
+             "success": True,
+             "segments": segments,
+             "speaker_stats": speaker_stats,
+             "total_segments": len(segments)
+         })
+
+     except Exception as e:
+         import traceback
+         return JSONResponse(
+             status_code=500,
+             content={
+                 "success": False,
+                 "error": str(e),
+                 "traceback": traceback.format_exc()
+             }
+         )
+     finally:
+         if temp_dir and os.path.exists(temp_dir):
+             shutil.rmtree(temp_dir, ignore_errors=True)
+
+
+ @app.post("/process")
+ async def process_audio(
+     audio: UploadFile = File(..., description="Audio file to analyze"),
+     voice_sample: UploadFile = File(default=None, description="Speaker sample file (optional)"),
+     speaker_name: str = Form(default="speaker", description="Name for the identified speaker"),
+     language: str = Form(default="ko", description="Language code (ko, en, ja, zh)"),
+     hf_token: str = Form(..., description="HuggingFace token")
+ ):
+     """
+     Full pipeline: speaker diarization + speaker identification + STT.
+     """
+     temp_dir = None
+     try:
+         # Create a temp directory
+         temp_dir = tempfile.mkdtemp()
+
+         # Save the main audio file
+         audio_path = os.path.join(temp_dir, audio.filename)
+         with open(audio_path, "wb") as f:
+             content = await audio.read()
+             f.write(content)
+
+         # Get the pipeline
+         pipeline = get_pipeline(hf_token)
+         pipeline.registered_speakers.clear()
+
+         # Register the speaker if a voice sample was provided
+         if voice_sample and voice_sample.filename:
+             sample_path = os.path.join(temp_dir, voice_sample.filename)
+             with open(sample_path, "wb") as f:
+                 sample_content = await voice_sample.read()
+                 f.write(sample_content)
+             pipeline.register_speaker(speaker_name, [sample_path])
+
+         # Process
+         results = pipeline.process(audio_path, language=language)
+
+         # Format the results
+         segments = []
+         speaker_stats = {}
+
+         for r in results:
+             segments.append({
+                 "start": round(r["start"], 2),
+                 "end": round(r["end"], 2),
+                 "text": r["text"],
+                 "speaker": r["speaker"],
+                 "similarity": r["similarity"]
+             })
+
+             speaker = r["speaker"]
+             if speaker not in speaker_stats:
+                 speaker_stats[speaker] = {"count": 0, "duration": 0}
+             speaker_stats[speaker]["count"] += 1
+             speaker_stats[speaker]["duration"] += r["end"] - r["start"]
+
+         for speaker in speaker_stats:
+             speaker_stats[speaker]["duration"] = round(speaker_stats[speaker]["duration"], 2)
+
+         return JSONResponse(content={
+             "success": True,
+             "segments": segments,
+             "speaker_stats": speaker_stats,
+             "total_segments": len(segments)
+         })
+
+     except Exception as e:
+         import traceback
+         return JSONResponse(
+             status_code=500,
+             content={
+                 "success": False,
+                 "error": str(e),
+                 "traceback": traceback.format_exc()
+             }
+         )
+     finally:
+         if temp_dir and os.path.exists(temp_dir):
+             shutil.rmtree(temp_dir, ignore_errors=True)
+
+
+ if __name__ == "__main__":
+     uvicorn.run(app, host="0.0.0.0", port=7860)
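
The merge step in `process` labels each Whisper segment with the diarization speaker that overlaps it the most in time. The same greedy rule in isolation, with toy segments (a sketch, not the app's code path):

```python
def assign_speakers(whisper_segments, diar_segments):
    """For each transcript segment, pick the diarized speaker with maximal temporal overlap."""
    labeled = []
    for w in whisper_segments:
        best_speaker, best_overlap = "Unknown", 0.0
        for d in diar_segments:
            # Overlap length of [w.start, w.end] and [d.start, d.end], clamped at 0
            overlap = max(0.0, min(w["end"], d["end"]) - max(w["start"], d["start"]))
            if overlap > best_overlap:
                best_overlap, best_speaker = overlap, d["speaker"]
        labeled.append({**w, "speaker": best_speaker})
    return labeled

whisper = [{"start": 0.0, "end": 2.0, "text": "hello"},
           {"start": 2.5, "end": 5.0, "text": "hi there"}]
diar = [{"start": 0.0, "end": 2.2, "speaker": "SPEAKER_00"},
        {"start": 2.2, "end": 5.0, "speaker": "SPEAKER_01"}]

print(assign_speakers(whisper, diar))
# First segment gets SPEAKER_00, the second SPEAKER_01
```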
requirements.txt ADDED
@@ -0,0 +1,12 @@
+ fastapi>=0.100.0
+ uvicorn>=0.23.0
+ python-multipart>=0.0.6
+ torch==2.4.0
+ torchaudio==2.4.0
+ pyannote.audio==3.3.2
+ speechbrain==1.0.0
+ faster-whisper>=1.0.0
+ pydub>=0.25.1
+ numpy<2.0.0
+ ffmpeg-python
+ soundfile