Spaces: jade2zhong/Document-based_audio_transcriber


jade2zhong committed 24 days ago

Commit

c8eadee

verified ·

1 Parent(s): 66d8af1

Upload 3 files


Files changed (3)

README.md +77 -0
app.py +518 -0
网站搭建说明.md +472 -0

README.md ADDED

	@@ -0,0 +1,77 @@

+# Context-Aware Audio Correction
+This Gradio app transcribes an audio sample, retrieves relevant passages from a reference document, and asks a language model to correct likely ASR mistakes using only document-backed evidence.
+## Main Flow
+```text
+Upload document
+-> extract text
+-> split into passages
+-> upload or record audio
+-> transcribe with Whisper
+-> retrieve related document passages
+-> correct near-sound and domain-term errors
+```
+## Recognition Profiles
+The app separates English, Chinese, and automatic recognition with explicit ASR profiles:
+| Profile | Default model | Use case |
+|---|---|---|
+| English optimized | `openai/whisper-small.en` | English-only lectures and presentations |
+| Chinese | `openai/whisper-small` | Mandarin recordings |
+| Auto detect | `openai/whisper-small` | Unknown or mixed-language recordings |
+## Local Run
+```powershell
+python -m venv .venv
+.\.venv\Scripts\Activate.ps1
+pip install -r requirements.txt
+$env:HF_TOKEN="your Hugging Face token"
+python app.py
+```
+Open:
+```text
+http://127.0.0.1:7860
+```
+## Hugging Face Spaces
+Upload these files to the Space root directory:
+```text
+app.py
+requirements.txt
+packages.txt
+README.md
+```
+Then add this secret in `Settings -> Variables and secrets`:
+```text
+HF_TOKEN=your Hugging Face token
+```
+## Optional Variables
+```text
+ASR_MODEL_EN=openai/whisper-small.en
+ASR_MODEL_ZH=openai/whisper-small
+ASR_MODEL_AUTO=openai/whisper-small
+EMBEDDING_MODEL=sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
+LLM_MODEL=Qwen/Qwen2.5-7B-Instruct-1M
+```
+`ASR_MODEL` is still supported as the default multilingual ASR model for Chinese and Auto profiles.
+## Notes
+- Scanned PDFs need OCR before upload.
+- Free CPU Spaces can be slow on the first run because models must be downloaded and loaded.
+- Start with short audio samples, around 20 seconds to 2 minutes.
+- The correction step is evidence-bound. It should not freely rewrite the transcript.
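
The precedence between `ASR_MODEL` and the per-profile variables listed in the README can be sketched as follows. This is a minimal illustration that mirrors the `os.getenv` defaults in `app.py`; the function name `resolve_asr_models` and the short keys `en`, `zh`, and `auto` are hypothetical and exist only for this example.

```python
import os


def resolve_asr_models(env=None):
    """Return the model ID each profile would use for a given environment.

    `env` is any mapping of variable names to values; it defaults to os.environ.
    """
    env = os.environ if env is None else env
    # ASR_MODEL only changes the multilingual default, never the English profile.
    multilingual_default = env.get("ASR_MODEL", "openai/whisper-small")
    return {
        "en": env.get("ASR_MODEL_EN", "openai/whisper-small.en"),
        "zh": env.get("ASR_MODEL_ZH", multilingual_default),
        "auto": env.get("ASR_MODEL_AUTO", multilingual_default),
    }


if __name__ == "__main__":
    # A profile-specific variable wins over ASR_MODEL for that profile only.
    print(resolve_asr_models({"ASR_MODEL": "openai/whisper-medium"}))
```

Passing a plain dictionary instead of reading `os.environ` directly keeps the precedence easy to check without touching the real environment.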

app.py ADDED

	@@ -0,0 +1,518 @@

+import json
+import os
+import re
+from pathlib import Path
+import gradio as gr
+import numpy as np
+import pdfplumber
+from docx import Document
+from openai import OpenAI
+from sentence_transformers import SentenceTransformer
+from transformers import pipeline
+EMBEDDING_MODEL = os.getenv(
+    "EMBEDDING_MODEL", "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2"
+)
+LLM_MODEL = os.getenv("LLM_MODEL", "Qwen/Qwen2.5-7B-Instruct-1M")
+HF_TOKEN = os.getenv("HF_TOKEN")
+DEFAULT_MULTILINGUAL_ASR_MODEL = os.getenv("ASR_MODEL", "openai/whisper-small")
+ASR_PROFILES = {
+    "English optimized - Whisper small.en": {
+        "model": os.getenv("ASR_MODEL_EN", "openai/whisper-small.en"),
+        "language": None,
+        "description": "Best default for English-only lectures and presentations.",
+    },
+    "Chinese - Whisper multilingual small": {
+        "model": os.getenv("ASR_MODEL_ZH", DEFAULT_MULTILINGUAL_ASR_MODEL),
+        "language": "chinese",
+        "description": "Use this for Mandarin recordings and Chinese documents.",
+    },
+    "Auto detect - Whisper multilingual small": {
+        "model": os.getenv("ASR_MODEL_AUTO", DEFAULT_MULTILINGUAL_ASR_MODEL),
+        "language": None,
+        "description": "Use this when the recording language is uncertain or mixed.",
+    },
+}
+asr_pipelines = {}
+embedding_model = None
+llm_client = None
+APP_CSS = """
+:root {
+  --brand: #0f766e;
+  --brand-strong: #115e59;
+  --ink: #111827;
+  --muted: #64748b;
+  --line: #d8ded9;
+  --paper: #ffffff;
+  --wash: #f6f7f2;
+  --accent: #c2410c;
+}
+body,
+.gradio-container {
+  background: var(--wash) !important;
+  color: var(--ink);
+  font-family: Inter, ui-sans-serif, system-ui, -apple-system, BlinkMacSystemFont, "Segoe UI", sans-serif;
+}
+.main {
+  max-width: 1180px !important;
+  margin: 0 auto !important;
+}
+.app-shell {
+  padding: 28px 28px 12px;
+  border-bottom: 1px solid var(--line);
+}
+.app-kicker {
+  margin: 0 0 8px;
+  color: var(--brand-strong);
+  font-size: 12px;
+  font-weight: 700;
+  letter-spacing: 0.08em;
+  text-transform: uppercase;
+}
+.app-title {
+  margin: 0;
+  color: var(--ink);
+  font-size: 34px;
+  line-height: 1.12;
+  letter-spacing: 0;
+}
+.app-subtitle {
+  margin: 12px 0 0;
+  max-width: 780px;
+  color: var(--muted);
+  font-size: 16px;
+  line-height: 1.6;
+}
+.status-strip {
+  display: grid;
+  grid-template-columns: repeat(3, minmax(0, 1fr));
+  gap: 10px;
+  margin-top: 20px;
+}
+.status-item {
+  background: #ffffff;
+  border: 1px solid var(--line);
+  border-radius: 8px;
+  padding: 12px 14px;
+}
+.status-label {
+  color: var(--muted);
+  font-size: 12px;
+  margin-bottom: 4px;
+}
+.status-value {
+  color: var(--ink);
+  font-weight: 700;
+  font-size: 14px;
+}
+.gradio-container .block {
+  border-radius: 8px !important;
+}
+.gradio-container button.primary {
+  background: var(--brand) !important;
+  border-color: var(--brand) !important;
+}
+.gradio-container button.primary:hover {
+  background: var(--brand-strong) !important;
+  border-color: var(--brand-strong) !important;
+}
+textarea,
+input,
+.wrap {
+  border-radius: 8px !important;
+}
+.output-panel textarea {
+  font-size: 14px !important;
+  line-height: 1.55 !important;
+}
+.correction-notes,
+.evidence-panel {
+  background: var(--paper);
+}
+@media (max-width: 760px) {
+  .app-shell {
+    padding: 22px 18px 8px;
+  }
+  .app-title {
+    font-size: 28px;
+  }
+  .status-strip {
+    grid-template-columns: 1fr;
+  }
+}
+"""
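+def get_asr_pipeline(model_id: str):
+    # Build each ASR pipeline once and cache it by model ID.
+    if model_id not in asr_pipelines:
+        asr_pipelines[model_id] = pipeline(
+            "automatic-speech-recognition",
+            model=model_id,
+            device=-1,  # run on CPU
+            chunk_length_s=30,  # Whisper works on 30 s windows; chunk longer clips
+        )
+    return asr_pipelines[model_id]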
+def get_embedding_model():
+    global embedding_model
+    if embedding_model is None:
+        embedding_model = SentenceTransformer(EMBEDDING_MODEL)
+    return embedding_model
+def get_llm_client():
+    global llm_client
+    if not HF_TOKEN:
+        return None
+    if llm_client is None:
+        llm_client = OpenAI(
+            base_url="https://router.huggingface.co/v1",
+            api_key=HF_TOKEN,
+        )
+    return llm_client
+def read_text_file(path: Path) -> str:
+    for encoding in ("utf-8", "gb18030"):
+        try:
+            return path.read_text(encoding=encoding)
+        except UnicodeDecodeError:
+            continue
+    return path.read_text(errors="ignore")
+def extract_document_text(file_path: str) -> str:
+    path = Path(file_path)
+    suffix = path.suffix.lower()
+    if suffix == ".txt":
+        text = read_text_file(path)
+    elif suffix == ".pdf":
+        pages = []
+        with pdfplumber.open(path) as pdf:
+            for page in pdf.pages:
+                pages.append(page.extract_text() or "")
+        text = "\n".join(pages)
+    elif suffix == ".docx":
+        doc = Document(path)
+        text = "\n".join(p.text for p in doc.paragraphs)
+    else:
+        raise ValueError("Only PDF, DOCX, and TXT documents are supported.")
+    text = re.sub(r"[ \t]+", " ", text)
+    text = re.sub(r"\n{3,}", "\n\n", text)
+    return text.strip()
+def split_into_chunks(text: str, max_chars: int = 700, overlap: int = 90) -> list[str]:
+    paragraphs = re.split(r"\n\s*\n+", text)
+    pieces = []
+    for paragraph in paragraphs:
+        paragraph = paragraph.strip()
+        if not paragraph:
+            continue
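+        # Split after Western sentence punctuation followed by whitespace, and
+        # directly after CJK sentence punctuation, which has no trailing space.
+        pieces.extend(re.split(r"(?<=[.!?;:])\s+|(?<=[。！？；])", paragraph))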
+    pieces = [p.strip() for p in pieces if p and p.strip()]
+    chunks = []
+    current = ""
+    for piece in pieces:
+        if len(piece) > max_chars:
+            if current:
+                chunks.append(current)
+                current = ""
+            step = max(1, max_chars - overlap)  # guard against overlap >= max_chars
+            for start in range(0, len(piece), step):
+                chunks.append(piece[start : start + max_chars])
+            continue
+        candidate = piece if not current else f"{current}\n{piece}"
+        if len(candidate) <= max_chars:
+            current = candidate
+        else:
+            chunks.append(current)
+            current = piece
+    if current:
+        chunks.append(current)
+    return [chunk for chunk in chunks if len(chunk) >= 20]
+def resolve_asr_profile(profile_name: str) -> dict:
+    return ASR_PROFILES.get(profile_name, next(iter(ASR_PROFILES.values())))
+def transcribe_audio(audio_path: str, profile_name: str) -> str:
+    profile = resolve_asr_profile(profile_name)
+    generate_kwargs = {}
+    # English-only checkpoints (IDs ending in ".en") do not accept `task` or
+    # `language`; recent transformers releases raise a ValueError for them.
+    if not profile["model"].endswith(".en"):
+        generate_kwargs["task"] = "transcribe"
+        if profile["language"]:
+            generate_kwargs["language"] = profile["language"]
+    result = get_asr_pipeline(profile["model"])(audio_path, generate_kwargs=generate_kwargs)
+    if isinstance(result, dict):
+        return str(result.get("text", "")).strip()
+    return str(result).strip()
+def retrieve_contexts(raw_transcript: str, chunks: list[str], top_k: int):
+    model = get_embedding_model()
+    doc_vectors = model.encode(chunks, normalize_embeddings=True)
+    query_vector = model.encode([raw_transcript], normalize_embeddings=True)[0]
+    scores = np.matmul(doc_vectors, query_vector)
+    top_indices = np.argsort(scores)[::-1][:top_k]
+    return [(int(i), float(scores[i]), chunks[int(i)]) for i in top_indices]
+def build_correction_prompt(raw_transcript: str, contexts) -> list[dict]:
+    context_text = "\n\n".join(
+        f"[Document passage {idx + 1} | similarity {score:.3f}]\n{text}"
+        for idx, score, text in contexts
+    )
+    system_prompt = (
+        "You are a strict ASR correction assistant. Correct the transcript only when the "
+        "provided document context gives clear evidence. Focus on homophones, near-sound "
+        "mistakes, technical terms, names, acronyms, chapter titles, and domain-specific "
+        "phrases. Preserve the original sentence structure as much as possible. Do not "
+        "summarize, rewrite freely, or add information that was not spoken."
+    )
+    user_prompt = f"""
+Correct the ASR transcript using the document passages below.
+Rules:
+1. Treat the raw transcript as the primary text.
+2. Make only evidence-backed corrections.
+3. Prefer keeping the original word when the document context is not strong enough.
+4. Output JSON only. Do not output Markdown.
+JSON schema:
+{{
+  "corrected_text": "the complete corrected transcript",
+  "changes": [
+    {{
+      "original": "incorrect word or phrase",
+      "corrected": "corrected word or phrase",
+      "reason": "why the document supports this correction"
+    }}
+  ]
+}}
+Document passages:
+{context_text}
+Raw ASR transcript:
+{raw_transcript}
+""".strip()
+    return [
+        {"role": "system", "content": system_prompt},
+        {"role": "user", "content": user_prompt},
+    ]
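+def parse_json_response(text: str):
+    # Try strict JSON first, then fall back to the outermost {...} span,
+    # which covers replies wrapped in prose or Markdown fences.
+    text = text or ""
+    try:
+        return json.loads(text)
+    except json.JSONDecodeError:
+        match = re.search(r"\{.*\}", text, flags=re.S)
+        if match:
+            try:
+                return json.loads(match.group(0))
+            except json.JSONDecodeError:
+                pass
+    raise ValueError("The language model did not return valid JSON.")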
+def correct_with_llm(raw_transcript: str, contexts):
+    client = get_llm_client()
+    if client is None:
+        return {
+            "corrected_text": raw_transcript,
+            "changes": [
+                {
+                    "original": "LLM correction skipped",
+                    "corrected": "LLM correction skipped",
+                    "reason": "HF_TOKEN is not set. Add HF_TOKEN locally or in Hugging Face Spaces secrets.",
+                }
+            ],
+        }
+    completion = client.chat.completions.create(
+        model=LLM_MODEL,
+        messages=build_correction_prompt(raw_transcript, contexts),
+        temperature=0.1,
+        max_tokens=1200,
+    )
+    content = completion.choices[0].message.content
+    return parse_json_response(content)
+def format_contexts(contexts) -> str:
+    blocks = []
+    for rank, (idx, score, text) in enumerate(contexts, start=1):
+        blocks.append(f"### Passage {rank}\nSimilarity: `{score:.3f}`\n\n{text}")
+    return "\n\n---\n\n".join(blocks)
+def format_changes(changes) -> str:
+    if not changes:
+        return "No document-backed correction was needed."
+    lines = []
+    for item in changes:
+        original = item.get("original", "")
+        corrected = item.get("corrected", "")
+        reason = item.get("reason", "")
+        lines.append(f"- `{original}` -> `{corrected}`: {reason}")
+    return "\n".join(lines)
+def run_app(document_file, audio_file, profile_name, top_k):
+    if document_file is None:
+        raise gr.Error("Upload a PDF, DOCX, or TXT reference document first.")
+    if audio_file is None:
+        raise gr.Error("Upload or record an audio sample first.")
+    document_text = extract_document_text(document_file)
+    if not document_text:
+        raise gr.Error("No text was extracted from the document. Scanned PDFs need OCR first.")
+    chunks = split_into_chunks(document_text)
+    if not chunks:
+        raise gr.Error("The document is too short to build context.")
+    raw_transcript = transcribe_audio(audio_file, profile_name)
+    if not raw_transcript:
+        raise gr.Error("No speech text was recognized from the audio.")
+    contexts = retrieve_contexts(raw_transcript, chunks, int(top_k))
+    try:
+        correction = correct_with_llm(raw_transcript, contexts)
+    except Exception as exc:
+        raise gr.Error(f"LLM correction failed: {exc}") from exc
+    corrected_text = correction.get("corrected_text", raw_transcript)
+    changes = correction.get("changes", [])
+    return (
+        raw_transcript,
+        corrected_text,
+        format_changes(changes),
+        format_contexts(contexts),
+    )
+theme = gr.themes.Soft(
+    primary_hue="teal",
+    secondary_hue="orange",
+    neutral_hue="zinc",
+    radius_size="sm",
+)
+with gr.Blocks(
+    title="Context-Aware Audio Correction",
+    theme=theme,
+    css=APP_CSS,
+) as demo:
+    gr.HTML(
+        """
+        <section class="app-shell">
+          <p class="app-kicker">Hugging Face ASR + document retrieval</p>
+          <h1 class="app-title">Context-Aware Audio Correction</h1>
+          <p class="app-subtitle">
+            Upload a reference document and an audio clip. The app transcribes speech,
+            retrieves matching document passages, and corrects likely ASR mistakes using
+            only document-backed evidence.
+          </p>
+          <div class="status-strip">
+            <div class="status-item">
+              <div class="status-label">ASR profiles</div>
+              <div class="status-value">English / Chinese / Auto</div>
+            </div>
+            <div class="status-item">
+              <div class="status-label">Context engine</div>
+              <div class="status-value">Sentence embeddings</div>
+            </div>
+            <div class="status-item">
+              <div class="status-label">Correction policy</div>
+              <div class="status-value">Evidence-bound</div>
+            </div>
+          </div>
+        </section>
+        """
+    )
+    with gr.Row():
+        with gr.Column(scale=1, min_width=320):
+            document_input = gr.File(
+                label="Reference document",
+                file_types=[".pdf", ".docx", ".txt"],
+                type="filepath",
+            )
+            audio_input = gr.Audio(
+                label="Audio sample",
+                sources=["upload", "microphone"],
+                type="filepath",
+            )
+        with gr.Column(scale=1, min_width=320):
+            profile_input = gr.Radio(
+                label="Recognition profile",
+                choices=list(ASR_PROFILES.keys()),
+                value="English optimized - Whisper small.en",
+                info=(
+                    "English uses an English-only Whisper model. Chinese and Auto use "
+                    "the multilingual Whisper model."
+                ),
+            )
+            top_k_input = gr.Slider(
+                label="Document passages to retrieve",
+                minimum=1,
+                maximum=8,
+                value=4,
+                step=1,
+            )
+            submit_button = gr.Button("Transcribe and correct", variant="primary")
+    with gr.Row(elem_classes=["output-panel"]):
+        raw_output = gr.Textbox(label="Raw Whisper transcript", lines=9)
+        corrected_output = gr.Textbox(label="Context-corrected transcript", lines=9)
+    changes_output = gr.Markdown(
+        label="Correction notes",
+        elem_classes=["correction-notes"],
+    )
+    contexts_output = gr.Markdown(
+        label="Document evidence",
+        elem_classes=["evidence-panel"],
+    )
+    submit_button.click(
+        fn=run_app,
+        inputs=[document_input, audio_input, profile_input, top_k_input],
+        outputs=[raw_output, corrected_output, changes_output, contexts_output],
+    )
+if __name__ == "__main__":
+    demo.launch()
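
`run_app` calls `.get` on whatever `correct_with_llm` returns, so a reply that parses as JSON but is not an object with the expected keys would fail late. A minimal sketch of normalizing the parsed reply before it reaches the UI, assuming that falling back to the raw transcript is the desired behavior; the function name `normalize_correction` is hypothetical and not part of `app.py`.

```python
def normalize_correction(parsed, raw_transcript):
    """Coerce a parsed LLM reply into {"corrected_text": str, "changes": [dict, ...]}."""
    # Anything other than a JSON object is treated as "no correction".
    if not isinstance(parsed, dict):
        return {"corrected_text": raw_transcript, "changes": []}

    text = parsed.get("corrected_text")
    if not isinstance(text, str) or not text.strip():
        text = raw_transcript  # keep the raw transcript when the field is unusable

    changes = parsed.get("changes")
    if not isinstance(changes, list):
        changes = []
    # Drop entries that are not objects so later .get calls cannot fail.
    changes = [item for item in changes if isinstance(item, dict)]

    return {"corrected_text": text, "changes": changes}


if __name__ == "__main__":
    print(normalize_correction(["not", "an", "object"], "raw text"))
```

Keeping the raw transcript as the fallback matches the evidence-bound policy: when the reply is unusable, nothing is changed.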

网站搭建说明.md ADDED

	@@ -0,0 +1,472 @@

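+# Setup Guide for a Hugging Face-Based Document-Aware Audio Transcription and Correction Website
+## 1. Project Goal
+This project implements a web app: the user uploads a reference document and an audio clip, the system transcribes the audio to text, and then corrects errors in the transcript based on the reference document.
+Ordinary speech recognition systems often transcribe technical terms, acronyms, names, and course terminology as similar-sounding but wrong words. The core idea here is to avoid relying on the speech model alone and to bring in document context, so the system knows what the recording is likely about.
+Example:
+```text
+Raw transcript:
+This lecture explains back propagation and banishing gradients.
+Terms in the document:
+backpropagation, vanishing gradients
+After correction:
+This lecture explains backpropagation and vanishing gradients.
+```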
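+## 2. Overall Design
+The system has four core modules:
+```text
+Document upload
+  -> extract document text
+  -> split the document into passages
+  -> embed the passages as semantic vectors
+Audio upload
+  -> Whisper speech recognition
+  -> raw transcript
+Semantic retrieval
+  -> use the raw transcript to retrieve the most relevant document passages
+LLM correction
+  -> pass the raw transcript and the relevant passages to the language model
+  -> require the model to correct near-sound and domain-term errors using only document evidence
+```
+The page outputs four parts: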
+```text
+1. Raw Whisper transcript
+2. Context-corrected transcript
+3. Correction notes
+4. Document evidence
+```
+## 3. Technologies Used
+The project mainly uses the following technologies:
+```text
+Python
+Gradio
+Hugging Face Spaces
+Hugging Face Transformers
+Whisper ASR model
+SentenceTransformer embedding model
+Hugging Face Router / Inference Provider
+Qwen instruction model
+```
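+The role of each component:
+| Technology | Role |
+|---|---|
+| Gradio | Quickly build the web interface |
+| Hugging Face Spaces | Host the web app |
+| Transformers pipeline | Run Whisper for speech recognition |
+| Whisper | Convert audio to text |
+| SentenceTransformer | Convert document passages and the transcript into vectors |
+| NumPy | Compute similarity between text vectors |
+| pdfplumber | Extract text from PDFs |
+| python-docx | Extract text from Word documents |
+| Hugging Face Router | Call a hosted LLM for correction |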
+## 4. Project File Structure
+The project root must contain these files:
+```text
+app.py
+requirements.txt
+packages.txt
+README.md
+网站搭建说明.md
+```
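+Where:
+| File | Purpose |
+|---|---|
+| app.py | Main program of the website |
+| requirements.txt | Python dependency list |
+| packages.txt | System dependencies, for example ffmpeg |
+| README.md | Brief project description |
+| 网站搭建说明.md | This setup and operation guide |
+When uploading to Hugging Face Spaces, `app.py`, `requirements.txt`, and `packages.txt` must be in the Space root, not in a subfolder.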
+## 5. Running Locally
+### 5.1 Go to the project directory
+In PowerShell, run:
+```powershell
+cd "C:\Users\29697\Documents\Codex\2026-05-14\huggingface"
+```
+### 5.2 Create a virtual environment
+```powershell
+python -m venv .venv
+```
+### 5.3 Install dependencies
+If the virtual environment can be activated, run:
+```powershell
+.\.venv\Scripts\Activate.ps1
+pip install -r requirements.txt
+```
+If activation fails with an execution policy error, install with the virtual environment's Python directly:
+```powershell
+.\.venv\Scripts\python.exe -m pip install -r requirements.txt
+```
+### 5.4 Set the Hugging Face token
+Create an access token in your Hugging Face account:
+```text
+https://huggingface.co/settings/tokens
+```
+Then set the environment variable in PowerShell:
+```powershell
+$env:HF_TOKEN="hf_xxxxxxxxxxxxxxxxx"
+```
+Note: do not write `HF_TOKEN` into the code, and do not share it with anyone.
+### 5.5 Start the website
+If the virtual environment is activated:
+```powershell
+python app.py
+```
+If it is not activated:
+```powershell
+.\.venv\Scripts\python.exe app.py
+```
+The following terminal output means startup succeeded:
+```text
+Running on local URL: http://127.0.0.1:7860
+```
+Open in a browser:
+```text
+http://127.0.0.1:7860
+```
+Note: it is normal for PowerShell to appear stuck, because it is running the web server. To stop the site, press `Ctrl + C` in PowerShell.
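+## 6. Using the Website
+After opening the page, follow these steps:
+1. Upload the reference document under `Reference document`.
+2. Supported document formats are `PDF`, `DOCX`, and `TXT`.
+3. Upload audio under `Audio sample`, or record with the microphone.
+4. Choose a recognition profile under `Recognition profile`.
+5. For English recordings, choose `English optimized - Whisper small.en`.
+6. For Chinese recordings, choose `Chinese - Whisper multilingual small`.
+7. If the language is uncertain, choose `Auto detect - Whisper multilingual small`.
+8. Click `Transcribe and correct`.
+9. Review the raw transcript, the corrected transcript, the correction notes, and the document evidence.
+For a first test, use short material:
+```text
+English document: 100 to 300 words
+English recording: 20 to 60 seconds
+Recording environment: quiet, single speaker
+```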
+## 7. English Test Sample
+Create a `test.txt` with the following content:
+```text
+This lecture explains backpropagation, vanishing gradients, convolutional neural networks, and attention mechanisms.
+Backpropagation is a core algorithm for training neural networks.
+Vanishing gradients can make deep neural networks difficult to train.
+Attention mechanisms are widely used in natural language processing and speech recognition.
+```
+When recording, read:
+```text
+This lecture explains backpropagation and vanishing gradients. It also introduces convolutional neural networks and attention mechanisms.
+```
+If Whisper gets some technical terms wrong, the system tries to correct them based on the document.
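+## 8. Deploying to Hugging Face Spaces
+### 8.1 Create a Space
+1. Log in to Hugging Face.
+2. Click `New Space`.
+3. Choose `Gradio` as the Space SDK.
+4. Visibility can be set to `Public`.
+5. Create the Space.
+### 8.2 Upload files
+Go to the Space's `Files` page and upload:
+```text
+app.py
+requirements.txt
+packages.txt
+README.md
+```
+After uploading, click:
+```text
+Commit changes to main
+```
+### 8.3 Set the secret
+Go to:
+```text
+Settings -> Variables and secrets
+```
+Add a secret:
+```text
+Name: HF_TOKEN
+Value: hf_xxxxxxxxxxxxxxxxx
+```
+Note: the variable name must be exactly `HF_TOKEN`, including letter case.
+### 8.4 Wait for the build
+Return to the Space page and check the status:
+```text
+Building      installing dependencies and starting the app
+Running       the app is running
+Build error   dependency installation failed
+Runtime error the app failed to start
+```
+If an error appears, open `Logs` and read the last few dozen lines.
+### 8.5 Share the website
+If the Space is public, others can reach it through a link of this form:
+```text
+https://huggingface.co/spaces/<username>/<space-name>
+```
+Or:
+```text
+https://<username>-<space-name>.hf.space
+```
+Do not send others the local address:
+```text
+http://127.0.0.1:7860
+```
+This address only opens on your own computer.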
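+## 9. Limits of the Free Hugging Face Tier
+Free Hugging Face Spaces generally suit coursework presentations and lightweight demos, not large-scale or high-concurrency use. Common limits include:
+```text
+1. The first startup is slow.
+2. A free Space may go to sleep.
+3. CPU inference speed is limited.
+4. Loading large models and transcribing audio may take a while.
+5. Long audio is slow to process.
+```
+For a demo, it is therefore advisable to:
+```text
+Use short audio, 20 seconds to 2 minutes
+Use TXT or text-based PDFs
+Avoid scanned PDFs
+Open the Space ahead of time to avoid a cold start during the demo
+Prepare local-run screenshots or a demo video as a backup
+```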
+## 10. Model Configuration
+The page already splits English, Chinese, and automatic recognition into three profiles:
+```text
+English optimized - Whisper small.en: openai/whisper-small.en
+Chinese - Whisper multilingual small: openai/whisper-small
+Auto detect - Whisper multilingual small: openai/whisper-small
+```
+If you mainly transcribe English, add this under Variables in the Hugging Face Space:
+```text
+ASR_MODEL_EN=openai/whisper-small.en
+```
+To replace the model used for Chinese or automatic recognition, add:
+```text
+ASR_MODEL_ZH=openai/whisper-small
+ASR_MODEL_AUTO=openai/whisper-small
+```
+The legacy `ASR_MODEL` variable still works and serves as the default multilingual model for the Chinese and Auto profiles.
+For higher accuracy, you can try:
+```text
+openai/whisper-medium
+openai/whisper-large-v3
+```
+These models run slower on a free CPU Space and may even hurt the experience.
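+## 11. Troubleshooting
+### 11.1 requirements.txt not found
+Error:
+```text
+Could not open requirements file: requirements.txt
+```
+Cause: the current directory has no `requirements.txt`, or the file was not uploaded to the Space root.
+Fix:
+```text
+Make sure app.py and requirements.txt are in the same directory.
+Make sure requirements.txt was not saved as requirements.txt.txt.
+```
+### 11.2 PowerShell appears stuck
+This is normal. `python app.py` starts a web server, so PowerShell keeps running.
+To stop the server:
+```text
+Ctrl + C
+```
+### 11.3 The page opens, but clicking the button is slow
+Causes:
+```text
+The first run has to load the Whisper model, the embedding model, and the LLM endpoint.
+Inference on a free CPU Space is slow.
+```
+Fix:
+```text
+Test with short audio first.
+Open the page ahead of time to warm it up.
+Shorten the document.
+```
+### 11.4 The LLM made no corrections
+Possible causes:
+```text
+HF_TOKEN is not set.
+The raw transcript is already accurate enough.
+The document does not contain the relevant terms.
+The retrieved passages are not relevant.
+```
+Fix:
+```text
+Check HF_TOKEN.
+Use a document that matches the recording more closely.
+Raise "Document passages to retrieve" to 5 or 6.
+```
+### 11.5 Classmates in mainland China cannot open the site
+Access to Hugging Face from mainland China networks can be unreliable. Options:
+```text
+Prepare a demo video.
+Prepare local-run screenshots.
+Ask classmates to test the link in advance.
+If necessary, migrate to a platform or cloud server hosted in China.
+```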
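+## 12. Presenting the Project
+You can introduce the project like this:
+```text
+This project does not simply call a speech recognition model. It adds document-context retrieval and LLM correction after ordinary ASR.
+The system first uses Whisper to get a raw transcript, then uses semantic vectors to retrieve the document passages most relevant to the transcript,
+and finally has the LLM correct technical-term, near-sound, and proper-noun errors using only that document evidence.
+```
+Innovations worth emphasizing:
+```text
+1. Uses a reference document as domain context.
+2. Targets technical-term and near-sound errors.
+3. Outputs the basis for each change, which makes results more trustworthy.
+4. Deployed as a web app, so users do not need to install models locally.
+```
+## 13. Possible Improvements
+To develop the project further, consider:
+```text
+1. Add highlighting to mark the modified words.
+2. Add audio segmentation to support longer recordings.
+3. Add OCR to support scanned PDFs.
+4. Add a user-defined glossary.
+5. Use a stronger embedding model to improve retrieval accuracy.
+6. Use a GPU Space to speed up inference.
+7. Add export of the corrected result as TXT or DOCX.
+```
+## 14. References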
+- Hugging Face Gradio Spaces: https://huggingface.co/docs/hub/spaces-sdks-gradio
+- Hugging Face Spaces Dependencies: https://huggingface.co/docs/hub/en/spaces-dependencies
+- Hugging Face Space Secrets: https://huggingface.co/docs/huggingface_hub/v0.28.0/en/guides/manage-spaces
+- Hugging Face Transformers Pipeline: https://huggingface.co/docs/transformers/v4.40.0/pipeline_tutorial
+- Hugging Face Whisper Documentation: https://huggingface.co/docs/transformers/model_doc/whisper
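
The setup guide lists highlighting modified words as a possible improvement. A minimal sketch of one way to locate those words, assuming a word-level comparison with the standard-library `difflib` is acceptable; the function name `changed_spans` is hypothetical and this is not part of `app.py`. Splitting on whitespace suits English; Chinese text would likely need a character-level comparison instead.

```python
from difflib import SequenceMatcher


def changed_spans(raw: str, corrected: str) -> list[tuple[str, str]]:
    """Return (original, corrected) word spans that differ between two transcripts."""
    raw_words = raw.split()
    new_words = corrected.split()
    # autojunk=False keeps the matcher from discarding frequent words on long inputs.
    matcher = SequenceMatcher(a=raw_words, b=new_words, autojunk=False)
    spans = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            spans.append((" ".join(raw_words[i1:i2]), " ".join(new_words[j1:j2])))
    return spans


if __name__ == "__main__":
    raw = "This lecture explains back propagation and banishing gradients."
    fixed = "This lecture explains backpropagation and vanishing gradients."
    for before, after in changed_spans(raw, fixed):
        print(f"{before!r} -> {after!r}")
```

Computing the spans from the two transcripts, instead of trusting the `changes` list returned by the language model, means the highlights always reflect what actually changed in the text.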