Space: ZTXRiley/ASR_AGENT_


unknown committed on Mar 17

Commit 675f962

2 Parent(s): fa16b79 255bb80

Update ASR LLM Agent


Files changed (6)

README.md +7 -9
data/manifest.jsonl +0 -2
data/manifest_hf.jsonl +0 -50
pipeline/run_all.py +15 -5
scripts/run_hf_job.py +16 -74
ui/app.py +45 -19

README.md CHANGED

@@ -11,22 +11,21 @@ pinned: false
 Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
 ## ASR LLM Agent Upgrade
-This version adds an LLM-based diagnosis layer on top of alignment/event statistics:
-- `analysis/llm_analyzer.py`: sends representative ASR error cases + aggregate stats to an LLM
 - `pipeline/run_analysis.py`: optionally runs LLM diagnosis when `OPENAI_API_KEY` is set
-- `scripts/run_diagnostic.py`: regenerate `llm_diagnosis.json` and `diagnostic_report.md`
-- `report.md`: now includes LLM semantic findings and priority actions
 ### What the LLM adds
 Compared with rule-only classification, the LLM layer can:
 - separate surface-form differences from true semantic distortions
-- identify meaning-preserving paraphrases vs business-critical errors
 - infer likely causes from representative cases
 - propose prioritized, actionable improvement suggestions
@@ -34,7 +33,7 @@ Compared with rule-only classification, the LLM layer can:
 ```bash
 export OPENAI_API_KEY=your_key
-python pipeline/run_all.py   --manifest data/manifest.jsonl   --model_name openai/whisper-small   --llm_model gpt-4.1-mini
 ```
 Or rerun diagnosis only for an existing run:
@@ -44,10 +43,9 @@ export OPENAI_API_KEY=your_key
 python scripts/run_diagnostic.py --run_id <run_id> --model gpt-4.1-mini
 ```
 ## Qwen3-ASR
-This project now supports `Qwen/Qwen3-ASR-0.6B` and `Qwen/Qwen3-ASR-1.7B` via the `qwen-asr` package.
 Install the runtime dependency:

 Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
 ## ASR LLM Agent Upgrade
+This version adds an LLM-based diagnosis layer on top of alignment and event statistics:
+- `analysis/llm_analyzer.py`: sends representative ASR error cases and aggregate stats to an LLM
 - `pipeline/run_analysis.py`: optionally runs LLM diagnosis when `OPENAI_API_KEY` is set
+- `scripts/run_diagnostic.py`: regenerates `llm_diagnosis.json` and `diagnostic_report.md`
+- `report.md`: includes LLM semantic findings and priority actions
 ### What the LLM adds
 Compared with rule-only classification, the LLM layer can:
 - separate surface-form differences from true semantic distortions
+- identify meaning-preserving paraphrases versus business-critical errors
 - infer likely causes from representative cases
 - propose prioritized, actionable improvement suggestions
 ```bash
 export OPENAI_API_KEY=your_key
+python pipeline/run_all.py --manifest data/manifest.jsonl --model_name openai/whisper-small --llm_model gpt-4.1-mini
 ```
 Or rerun diagnosis only for an existing run:
 python scripts/run_diagnostic.py --run_id <run_id> --model gpt-4.1-mini
 ```
 ## Qwen3-ASR
+This project supports `Qwen/Qwen3-ASR-0.6B` and `Qwen/Qwen3-ASR-1.7B` via the `qwen-asr` package.
 Install the runtime dependency:

data/manifest.jsonl DELETED

@@ -1,2 +0,0 @@
-{"utt_id":"u001","audio_uri":"data/audio/u001.wav","ref_text":"今天天气很好","meta":{"device":"mobile","domain":"daily","speaker":"spk1"}}
-{"utt_id":"u002","audio_uri":"data/audio/u002.wav","ref_text":"我们下午三点开会","meta":{"device":"farfield","domain":"meeting","speaker":"spk2"}}
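
For reference, the two removed records use a one-JSON-object-per-line layout with `utt_id`, `audio_uri`, `ref_text`, and `meta` fields. A minimal sketch of reading such a manifest follows. The `ManifestEntry` class and `parse_manifest_lines` function are hypothetical names used for illustration and are not taken from this repository; treating the first three fields as required is also an assumption.

```python
import json
from dataclasses import dataclass, field
from typing import Any, Dict, Iterable, List

# Assumption: these three fields are mandatory; "meta" is treated as optional.
REQUIRED_FIELDS = ("utt_id", "audio_uri", "ref_text")


@dataclass
class ManifestEntry:
    """One utterance record from a JSONL manifest (hypothetical helper type)."""
    utt_id: str
    audio_uri: str
    ref_text: str
    meta: Dict[str, Any] = field(default_factory=dict)


def parse_manifest_lines(lines: Iterable[str]) -> List[ManifestEntry]:
    """Parse JSONL manifest lines, skipping blank lines and checking required fields."""
    entries: List[ManifestEntry] = []
    for line_no, raw in enumerate(lines, start=1):
        line = raw.strip()
        if not line:
            continue  # tolerate blank lines between records
        record = json.loads(line)
        missing = [key for key in REQUIRED_FIELDS if key not in record]
        if missing:
            raise ValueError(f"line {line_no}: missing fields {missing}")
        entries.append(
            ManifestEntry(
                utt_id=record["utt_id"],
                audio_uri=record["audio_uri"],
                ref_text=record["ref_text"],
                meta=record.get("meta", {}),
            )
        )
    return entries


# The two records removed in this commit, used here as sample input.
sample = [
    '{"utt_id":"u001","audio_uri":"data/audio/u001.wav","ref_text":"今天天气很好","meta":{"device":"mobile","domain":"daily","speaker":"spk1"}}',
    '{"utt_id":"u002","audio_uri":"data/audio/u002.wav","ref_text":"我们下午三点开会","meta":{"device":"farfield","domain":"meeting","speaker":"spk2"}}',
]
entries = parse_manifest_lines(sample)
```

To read a file instead of a list, the same function could be called as `parse_manifest_lines(open(path, encoding="utf-8"))`; the explicit UTF-8 encoding matters because `ref_text` contains Chinese characters.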

data/manifest_hf.jsonl DELETED

@@ -1,50 +0,0 @@
-{"utt_id": "fsicoli_common_voice_22_0_00000", "audio_uri": "C:\\Users\\hp\\Desktop\\ASR Agent\\ASR_AGENT_\\data\\hf_audio\\fsicoli_common_voice_22_0_00000.wav", "ref_text": "巴顿是位于美国加利福尼亚州阿马多尔县的一个非建制地区。", "meta": {"dataset_id": "fsicoli/common_voice_22_0", "dataset_config": "zh-CN", "split": "validation", "text_field": "sentence", "sample_rate": 48000, "client_id": "bcb4464171113dd9b51f371c3eecea06771fde83e7e3239ad0516469c6dcdf80170d26c7d1b1ef2476c45b51bfb4ee5549f07d7002bcfcec9b371a30c873b92d", "gender": "male_masculine", "accent": "", "age": "twenties", "locale": "zh-CN"}}
-{"utt_id": "fsicoli_common_voice_22_0_00001", "audio_uri": "C:\\Users\\hp\\Desktop\\ASR Agent\\ASR_AGENT_\\data\\hf_audio\\fsicoli_common_voice_22_0_00001.wav", "ref_text": "恩骑尉，是中国清朝时的爵名。", "meta": {"dataset_id": "fsicoli/common_voice_22_0", "dataset_config": "zh-CN", "split": "validation", "text_field": "sentence", "sample_rate": 48000, "client_id": "bcb4464171113dd9b51f371c3eecea06771fde83e7e3239ad0516469c6dcdf80170d26c7d1b1ef2476c45b51bfb4ee5549f07d7002bcfcec9b371a30c873b92d", "gender": "male_masculine", "accent": "", "age": "twenties", "locale": "zh-CN"}}
-{"utt_id": "fsicoli_common_voice_22_0_00002", "audio_uri": "C:\\Users\\hp\\Desktop\\ASR Agent\\ASR_AGENT_\\data\\hf_audio\\fsicoli_common_voice_22_0_00002.wav", "ref_text": "仙台盐釜港是位于日本宫城县、内的海港，由宫城县政府负责港务营运。", "meta": {"dataset_id": "fsicoli/common_voice_22_0", "dataset_config": "zh-CN", "split": "validation", "text_field": "sentence", "sample_rate": 48000, "client_id": "bcb4464171113dd9b51f371c3eecea06771fde83e7e3239ad0516469c6dcdf80170d26c7d1b1ef2476c45b51bfb4ee5549f07d7002bcfcec9b371a30c873b92d", "gender": "male_masculine", "accent": "", "age": "twenties", "locale": "zh-CN"}}
-{"utt_id": "fsicoli_common_voice_22_0_00003", "audio_uri": "C:\\Users\\hp\\Desktop\\ASR Agent\\ASR_AGENT_\\data\\hf_audio\\fsicoli_common_voice_22_0_00003.wav", "ref_text": "利马的阳台是西班牙殖民时期和共和国时期建造的文化遗产。", "meta": {"dataset_id": "fsicoli/common_voice_22_0", "dataset_config": "zh-CN", "split": "validation", "text_field": "sentence", "sample_rate": 48000, "client_id": "bcb4464171113dd9b51f371c3eecea06771fde83e7e3239ad0516469c6dcdf80170d26c7d1b1ef2476c45b51bfb4ee5549f07d7002bcfcec9b371a30c873b92d", "gender": "male_masculine", "accent": "", "age": "twenties", "locale": "zh-CN"}}
-{"utt_id": "fsicoli_common_voice_22_0_00004", "audio_uri": "C:\\Users\\hp\\Desktop\\ASR Agent\\ASR_AGENT_\\data\\hf_audio\\fsicoli_common_voice_22_0_00004.wav", "ref_text": "成山，字屏临，号进斋，满洲正蓝旗人。", "meta": {"dataset_id": "fsicoli/common_voice_22_0", "dataset_config": "zh-CN", "split": "validation", "text_field": "sentence", "sample_rate": 48000, "client_id": "bcb4464171113dd9b51f371c3eecea06771fde83e7e3239ad0516469c6dcdf80170d26c7d1b1ef2476c45b51bfb4ee5549f07d7002bcfcec9b371a30c873b92d", "gender": "male_masculine", "accent": "", "age": "twenties", "locale": "zh-CN"}}
-{"utt_id": "fsicoli_common_voice_22_0_00005", "audio_uri": "C:\\Users\\hp\\Desktop\\ASR Agent\\ASR_AGENT_\\data\\hf_audio\\fsicoli_common_voice_22_0_00005.wav", "ref_text": "嘉靖十一年任福建龙溪县知县。", "meta": {"dataset_id": "fsicoli/common_voice_22_0", "dataset_config": "zh-CN", "split": "validation", "text_field": "sentence", "sample_rate": 48000, "client_id": "bcb4464171113dd9b51f371c3eecea06771fde83e7e3239ad0516469c6dcdf80170d26c7d1b1ef2476c45b51bfb4ee5549f07d7002bcfcec9b371a30c873b92d", "gender": "male_masculine", "accent": "", "age": "twenties", "locale": "zh-CN"}}
-{"utt_id": "fsicoli_common_voice_22_0_00006", "audio_uri": "C:\\Users\\hp\\Desktop\\ASR Agent\\ASR_AGENT_\\data\\hf_audio\\fsicoli_common_voice_22_0_00006.wav", "ref_text": "科莫巴比是位于美国亚利桑那州皮马县的一个人口普查指定地区。", "meta": {"dataset_id": "fsicoli/common_voice_22_0", "dataset_config": "zh-CN", "split": "validation", "text_field": "sentence", "sample_rate": 48000, "client_id": "bcb4464171113dd9b51f371c3eecea06771fde83e7e3239ad0516469c6dcdf80170d26c7d1b1ef2476c45b51bfb4ee5549f07d7002bcfcec9b371a30c873b92d", "gender": "male_masculine", "accent": "", "age": "twenties", "locale": "zh-CN"}}
-{"utt_id": "fsicoli_common_voice_22_0_00007", "audio_uri": "C:\\Users\\hp\\Desktop\\ASR Agent\\ASR_AGENT_\\data\\hf_audio\\fsicoli_common_voice_22_0_00007.wav", "ref_text": "历史上明永乐皇帝、清干隆皇帝等曾经多次到访，并留下牌匾和诗句。", "meta": {"dataset_id": "fsicoli/common_voice_22_0", "dataset_config": "zh-CN", "split": "validation", "text_field": "sentence", "sample_rate": 48000, "client_id": "bcb4464171113dd9b51f371c3eecea06771fde83e7e3239ad0516469c6dcdf80170d26c7d1b1ef2476c45b51bfb4ee5549f07d7002bcfcec9b371a30c873b92d", "gender": "male_masculine", "accent": "", "age": "twenties", "locale": "zh-CN"}}
-{"utt_id": "fsicoli_common_voice_22_0_00008", "audio_uri": "C:\\Users\\hp\\Desktop\\ASR Agent\\ASR_AGENT_\\data\\hf_audio\\fsicoli_common_voice_22_0_00008.wav", "ref_text": "小花仙动画角色列表记录了所有在中国大陆动画《小花仙》系列中出场角色的详细介绍。", "meta": {"dataset_id": "fsicoli/common_voice_22_0", "dataset_config": "zh-CN", "split": "validation", "text_field": "sentence", "sample_rate": 48000, "client_id": "bcb4464171113dd9b51f371c3eecea06771fde83e7e3239ad0516469c6dcdf80170d26c7d1b1ef2476c45b51bfb4ee5549f07d7002bcfcec9b371a30c873b92d", "gender": "male_masculine", "accent": "", "age": "twenties", "locale": "zh-CN"}}
-{"utt_id": "fsicoli_common_voice_22_0_00009", "audio_uri": "C:\\Users\\hp\\Desktop\\ASR Agent\\ASR_AGENT_\\data\\hf_audio\\fsicoli_common_voice_22_0_00009.wav", "ref_text": "妳不要再去那里了", "meta": {"dataset_id": "fsicoli/common_voice_22_0", "dataset_config": "zh-CN", "split": "validation", "text_field": "sentence", "sample_rate": 48000, "client_id": "bcb4464171113dd9b51f371c3eecea06771fde83e7e3239ad0516469c6dcdf80170d26c7d1b1ef2476c45b51bfb4ee5549f07d7002bcfcec9b371a30c873b92d", "gender": "male_masculine", "accent": "", "age": "twenties", "locale": "zh-CN"}}
-{"utt_id": "fsicoli_common_voice_22_0_00010", "audio_uri": "C:\\Users\\hp\\Desktop\\ASR Agent\\ASR_AGENT_\\data\\hf_audio\\fsicoli_common_voice_22_0_00010.wav", "ref_text": "银座松竹广场是位于日本东京都中央区筑地一丁目的摩天大楼。", "meta": {"dataset_id": "fsicoli/common_voice_22_0", "dataset_config": "zh-CN", "split": "validation", "text_field": "sentence", "sample_rate": 48000, "client_id": "bcb4464171113dd9b51f371c3eecea06771fde83e7e3239ad0516469c6dcdf80170d26c7d1b1ef2476c45b51bfb4ee5549f07d7002bcfcec9b371a30c873b92d", "gender": "male_masculine", "accent": "", "age": "twenties", "locale": "zh-CN"}}
-{"utt_id": "fsicoli_common_voice_22_0_00011", "audio_uri": "C:\\Users\\hp\\Desktop\\ASR Agent\\ASR_AGENT_\\data\\hf_audio\\fsicoli_common_voice_22_0_00011.wav", "ref_text": "儿童权利监察使办公室设于华沙。", "meta": {"dataset_id": "fsicoli/common_voice_22_0", "dataset_config": "zh-CN", "split": "validation", "text_field": "sentence", "sample_rate": 48000, "client_id": "bcb4464171113dd9b51f371c3eecea06771fde83e7e3239ad0516469c6dcdf80170d26c7d1b1ef2476c45b51bfb4ee5549f07d7002bcfcec9b371a30c873b92d", "gender": "male_masculine", "accent": "", "age": "twenties", "locale": "zh-CN"}}
-{"utt_id": "fsicoli_common_voice_22_0_00012", "audio_uri": "C:\\Users\\hp\\Desktop\\ASR Agent\\ASR_AGENT_\\data\\hf_audio\\fsicoli_common_voice_22_0_00012.wav", "ref_text": "灰阶音乐是位于香港的一家独立唱片厂牌和音乐出版公司。", "meta": {"dataset_id": "fsicoli/common_voice_22_0", "dataset_config": "zh-CN", "split": "validation", "text_field": "sentence", "sample_rate": 48000, "client_id": "bcb4464171113dd9b51f371c3eecea06771fde83e7e3239ad0516469c6dcdf80170d26c7d1b1ef2476c45b51bfb4ee5549f07d7002bcfcec9b371a30c873b92d", "gender": "male_masculine", "accent": "", "age": "twenties", "locale": "zh-CN"}}
-{"utt_id": "fsicoli_common_voice_22_0_00013", "audio_uri": "C:\\Users\\hp\\Desktop\\ASR Agent\\ASR_AGENT_\\data\\hf_audio\\fsicoli_common_voice_22_0_00013.wav", "ref_text": "梁士济，字遂良，广东广州府南海县人，明朝、南明政治人物。", "meta": {"dataset_id": "fsicoli/common_voice_22_0", "dataset_config": "zh-CN", "split": "validation", "text_field": "sentence", "sample_rate": 48000, "client_id": "bcb4464171113dd9b51f371c3eecea06771fde83e7e3239ad0516469c6dcdf80170d26c7d1b1ef2476c45b51bfb4ee5549f07d7002bcfcec9b371a30c873b92d", "gender": "male_masculine", "accent": "", "age": "twenties", "locale": "zh-CN"}}
-{"utt_id": "fsicoli_common_voice_22_0_00014", "audio_uri": "C:\\Users\\hp\\Desktop\\ASR Agent\\ASR_AGENT_\\data\\hf_audio\\fsicoli_common_voice_22_0_00014.wav", "ref_text": "姜涛曾就读轩尼诗道官立下午小学、邓肇坚维多利亚官立中学和青年学院。", "meta": {"dataset_id": "fsicoli/common_voice_22_0", "dataset_config": "zh-CN", "split": "validation", "text_field": "sentence", "sample_rate": 48000, "client_id": "bcb4464171113dd9b51f371c3eecea06771fde83e7e3239ad0516469c6dcdf80170d26c7d1b1ef2476c45b51bfb4ee5549f07d7002bcfcec9b371a30c873b92d", "gender": "male_masculine", "accent": "", "age": "twenties", "locale": "zh-CN"}}
-{"utt_id": "fsicoli_common_voice_22_0_00015", "audio_uri": "C:\\Users\\hp\\Desktop\\ASR Agent\\ASR_AGENT_\\data\\hf_audio\\fsicoli_common_voice_22_0_00015.wav", "ref_text": "上海江南长兴重工有限责任公司简称长兴重工，厂区位于上海长兴岛船舶制造基地。", "meta": {"dataset_id": "fsicoli/common_voice_22_0", "dataset_config": "zh-CN", "split": "validation", "text_field": "sentence", "sample_rate": 48000, "client_id": "bcb4464171113dd9b51f371c3eecea06771fde83e7e3239ad0516469c6dcdf80170d26c7d1b1ef2476c45b51bfb4ee5549f07d7002bcfcec9b371a30c873b92d", "gender": "male_masculine", "accent": "", "age": "twenties", "locale": "zh-CN"}}
-{"utt_id": "fsicoli_common_voice_22_0_00016", "audio_uri": "C:\\Users\\hp\\Desktop\\ASR Agent\\ASR_AGENT_\\data\\hf_audio\\fsicoli_common_voice_22_0_00016.wav", "ref_text": "卢启贤是香港的亿万富翁企业家和慈善家。", "meta": {"dataset_id": "fsicoli/common_voice_22_0", "dataset_config": "zh-CN", "split": "validation", "text_field": "sentence", "sample_rate": 48000, "client_id": "bcb4464171113dd9b51f371c3eecea06771fde83e7e3239ad0516469c6dcdf80170d26c7d1b1ef2476c45b51bfb4ee5549f07d7002bcfcec9b371a30c873b92d", "gender": "male_masculine", "accent": "", "age": "twenties", "locale": "zh-CN"}}
-{"utt_id": "fsicoli_common_voice_22_0_00017", "audio_uri": "C:\\Users\\hp\\Desktop\\ASR Agent\\ASR_AGENT_\\data\\hf_audio\\fsicoli_common_voice_22_0_00017.wav", "ref_text": "在这类故事的早期版本里，女人的猪脸外观是由巫术导致的。", "meta": {"dataset_id": "fsicoli/common_voice_22_0", "dataset_config": "zh-CN", "split": "validation", "text_field": "sentence", "sample_rate": 48000, "client_id": "bcb4464171113dd9b51f371c3eecea06771fde83e7e3239ad0516469c6dcdf80170d26c7d1b1ef2476c45b51bfb4ee5549f07d7002bcfcec9b371a30c873b92d", "gender": "male_masculine", "accent": "", "age": "twenties", "locale": "zh-CN"}}
-{"utt_id": "fsicoli_common_voice_22_0_00018", "audio_uri": "C:\\Users\\hp\\Desktop\\ASR Agent\\ASR_AGENT_\\data\\hf_audio\\fsicoli_common_voice_22_0_00018.wav", "ref_text": "事件起因据信是天然气爆炸。", "meta": {"dataset_id": "fsicoli/common_voice_22_0", "dataset_config": "zh-CN", "split": "validation", "text_field": "sentence", "sample_rate": 48000, "client_id": "bcb4464171113dd9b51f371c3eecea06771fde83e7e3239ad0516469c6dcdf80170d26c7d1b1ef2476c45b51bfb4ee5549f07d7002bcfcec9b371a30c873b92d", "gender": "male_masculine", "accent": "", "age": "twenties", "locale": "zh-CN"}}
-{"utt_id": "fsicoli_common_voice_22_0_00019", "audio_uri": "C:\\Users\\hp\\Desktop\\ASR Agent\\ASR_AGENT_\\data\\hf_audio\\fsicoli_common_voice_22_0_00019.wav", "ref_text": "在工作了九年后，伯爵不幸去世。", "meta": {"dataset_id": "fsicoli/common_voice_22_0", "dataset_config": "zh-CN", "split": "validation", "text_field": "sentence", "sample_rate": 48000, "client_id": "bcb4464171113dd9b51f371c3eecea06771fde83e7e3239ad0516469c6dcdf80170d26c7d1b1ef2476c45b51bfb4ee5549f07d7002bcfcec9b371a30c873b92d", "gender": "male_masculine", "accent": "", "age": "twenties", "locale": "zh-CN"}}
-{"utt_id": "fsicoli_common_voice_22_0_00020", "audio_uri": "C:\\Users\\hp\\Desktop\\ASR Agent\\ASR_AGENT_\\data\\hf_audio\\fsicoli_common_voice_22_0_00020.wav", "ref_text": "整个系统称为键接合。", "meta": {"dataset_id": "fsicoli/common_voice_22_0", "dataset_config": "zh-CN", "split": "validation", "text_field": "sentence", "sample_rate": 48000, "client_id": "bcb4464171113dd9b51f371c3eecea06771fde83e7e3239ad0516469c6dcdf80170d26c7d1b1ef2476c45b51bfb4ee5549f07d7002bcfcec9b371a30c873b92d", "gender": "male_masculine", "accent": "", "age": "twenties", "locale": "zh-CN"}}
-{"utt_id": "fsicoli_common_voice_22_0_00021", "audio_uri": "C:\\Users\\hp\\Desktop\\ASR Agent\\ASR_AGENT_\\data\\hf_audio\\fsicoli_common_voice_22_0_00021.wav", "ref_text": "大和和纪，日本漫画家，代表作有《窈窕淑女》、《源氏物语》等。", "meta": {"dataset_id": "fsicoli/common_voice_22_0", "dataset_config": "zh-CN", "split": "validation", "text_field": "sentence", "sample_rate": 48000, "client_id": "bcb4464171113dd9b51f371c3eecea06771fde83e7e3239ad0516469c6dcdf80170d26c7d1b1ef2476c45b51bfb4ee5549f07d7002bcfcec9b371a30c873b92d", "gender": "male_masculine", "accent": "", "age": "twenties", "locale": "zh-CN"}}
-{"utt_id": "fsicoli_common_voice_22_0_00022", "audio_uri": "C:\\Users\\hp\\Desktop\\ASR Agent\\ASR_AGENT_\\data\\hf_audio\\fsicoli_common_voice_22_0_00022.wav", "ref_text": "事后三天，赵宇被福州市公安局晋安分局以涉嫌故意伤害罪刑事拘留。", "meta": {"dataset_id": "fsicoli/common_voice_22_0", "dataset_config": "zh-CN", "split": "validation", "text_field": "sentence", "sample_rate": 48000, "client_id": "bcb4464171113dd9b51f371c3eecea06771fde83e7e3239ad0516469c6dcdf80170d26c7d1b1ef2476c45b51bfb4ee5549f07d7002bcfcec9b371a30c873b92d", "gender": "male_masculine", "accent": "", "age": "twenties", "locale": "zh-CN"}}
-{"utt_id": "fsicoli_common_voice_22_0_00023", "audio_uri": "C:\\Users\\hp\\Desktop\\ASR Agent\\ASR_AGENT_\\data\\hf_audio\\fsicoli_common_voice_22_0_00023.wav", "ref_text": "由春岗互通向萝岗方向排列", "meta": {"dataset_id": "fsicoli/common_voice_22_0", "dataset_config": "zh-CN", "split": "validation", "text_field": "sentence", "sample_rate": 48000, "client_id": "bcb4464171113dd9b51f371c3eecea06771fde83e7e3239ad0516469c6dcdf80170d26c7d1b1ef2476c45b51bfb4ee5549f07d7002bcfcec9b371a30c873b92d", "gender": "male_masculine", "accent": "", "age": "twenties", "locale": "zh-CN"}}
-{"utt_id": "fsicoli_common_voice_22_0_00024", "audio_uri": "C:\\Users\\hp\\Desktop\\ASR Agent\\ASR_AGENT_\\data\\hf_audio\\fsicoli_common_voice_22_0_00024.wav", "ref_text": "弘光帝即位，让刘文照袭封新乐伯，南京沦陷后寄居在高邮，开辟农田种菜直到去世。", "meta": {"dataset_id": "fsicoli/common_voice_22_0", "dataset_config": "zh-CN", "split": "validation", "text_field": "sentence", "sample_rate": 48000, "client_id": "bcb4464171113dd9b51f371c3eecea06771fde83e7e3239ad0516469c6dcdf80170d26c7d1b1ef2476c45b51bfb4ee5549f07d7002bcfcec9b371a30c873b92d", "gender": "male_masculine", "accent": "", "age": "twenties", "locale": "zh-CN"}}
-{"utt_id": "fsicoli_common_voice_22_0_00025", "audio_uri": "C:\\Users\\hp\\Desktop\\ASR Agent\\ASR_AGENT_\\data\\hf_audio\\fsicoli_common_voice_22_0_00025.wav", "ref_text": "露露夫人终究与三姐弟达成了协议。", "meta": {"dataset_id": "fsicoli/common_voice_22_0", "dataset_config": "zh-CN", "split": "validation", "text_field": "sentence", "sample_rate": 48000, "client_id": "99a4cee094a7058f27e615982d793da9039f8916c4cb0934eafecb601214cb89657ddee22f688a38782a72f5b6622a323ed6dca74f6663430f8cb3c0804563ea", "gender": "male_masculine", "accent": "出生地：31 上海市", "age": "teens", "locale": "zh-CN"}}
-{"utt_id": "fsicoli_common_voice_22_0_00026", "audio_uri": "C:\\Users\\hp\\Desktop\\ASR Agent\\ASR_AGENT_\\data\\hf_audio\\fsicoli_common_voice_22_0_00026.wav", "ref_text": "武定州，中国唐朝时设置的州。", "meta": {"dataset_id": "fsicoli/common_voice_22_0", "dataset_config": "zh-CN", "split": "validation", "text_field": "sentence", "sample_rate": 48000, "client_id": "99a4cee094a7058f27e615982d793da9039f8916c4cb0934eafecb601214cb89657ddee22f688a38782a72f5b6622a323ed6dca74f6663430f8cb3c0804563ea", "gender": "male_masculine", "accent": "出生地：31 上海市", "age": "teens", "locale": "zh-CN"}}
-{"utt_id": "fsicoli_common_voice_22_0_00027", "audio_uri": "C:\\Users\\hp\\Desktop\\ASR Agent\\ASR_AGENT_\\data\\hf_audio\\fsicoli_common_voice_22_0_00027.wav", "ref_text": "宝陀寺，可以指", "meta": {"dataset_id": "fsicoli/common_voice_22_0", "dataset_config": "zh-CN", "split": "validation", "text_field": "sentence", "sample_rate": 48000, "client_id": "99a4cee094a7058f27e615982d793da9039f8916c4cb0934eafecb601214cb89657ddee22f688a38782a72f5b6622a323ed6dca74f6663430f8cb3c0804563ea", "gender": "male_masculine", "accent": "出生地：31 上海市", "age": "teens", "locale": "zh-CN"}}
-{"utt_id": "fsicoli_common_voice_22_0_00028", "audio_uri": "C:\\Users\\hp\\Desktop\\ASR Agent\\ASR_AGENT_\\data\\hf_audio\\fsicoli_common_voice_22_0_00028.wav", "ref_text": "去札幌啤酒博物馆", "meta": {"dataset_id": "fsicoli/common_voice_22_0", "dataset_config": "zh-CN", "split": "validation", "text_field": "sentence", "sample_rate": 48000, "client_id": "99a4cee094a7058f27e615982d793da9039f8916c4cb0934eafecb601214cb89657ddee22f688a38782a72f5b6622a323ed6dca74f6663430f8cb3c0804563ea", "gender": "male_masculine", "accent": "出生地：31 上海市", "age": "teens", "locale": "zh-CN"}}
-{"utt_id": "fsicoli_common_voice_22_0_00029", "audio_uri": "C:\\Users\\hp\\Desktop\\ASR Agent\\ASR_AGENT_\\data\\hf_audio\\fsicoli_common_voice_22_0_00029.wav", "ref_text": "洛莱塔是位于美国加利福尼亚州洪堡县的一个人口普查指定地区。", "meta": {"dataset_id": "fsicoli/common_voice_22_0", "dataset_config": "zh-CN", "split": "validation", "text_field": "sentence", "sample_rate": 48000, "client_id": "99a4cee094a7058f27e615982d793da9039f8916c4cb0934eafecb601214cb89657ddee22f688a38782a72f5b6622a323ed6dca74f6663430f8cb3c0804563ea", "gender": "male_masculine", "accent": "出生地：31 上海市", "age": "teens", "locale": "zh-CN"}}
-{"utt_id": "fsicoli_common_voice_22_0_00030", "audio_uri": "C:\\Users\\hp\\Desktop\\ASR Agent\\ASR_AGENT_\\data\\hf_audio\\fsicoli_common_voice_22_0_00030.wav", "ref_text": "许州人。", "meta": {"dataset_id": "fsicoli/common_voice_22_0", "dataset_config": "zh-CN", "split": "validation", "text_field": "sentence", "sample_rate": 48000, "client_id": "99a4cee094a7058f27e615982d793da9039f8916c4cb0934eafecb601214cb89657ddee22f688a38782a72f5b6622a323ed6dca74f6663430f8cb3c0804563ea", "gender": "male_masculine", "accent": "出生地：31 上海市", "age": "teens", "locale": "zh-CN"}}
-{"utt_id": "fsicoli_common_voice_22_0_00031", "audio_uri": "C:\\Users\\hp\\Desktop\\ASR Agent\\ASR_AGENT_\\data\\hf_audio\\fsicoli_common_voice_22_0_00031.wav", "ref_text": "班纳镇区为美国堪萨斯州杰克逊县辖下的镇区。", "meta": {"dataset_id": "fsicoli/common_voice_22_0", "dataset_config": "zh-CN", "split": "validation", "text_field": "sentence", "sample_rate": 48000, "client_id": "99a4cee094a7058f27e615982d793da9039f8916c4cb0934eafecb601214cb89657ddee22f688a38782a72f5b6622a323ed6dca74f6663430f8cb3c0804563ea", "gender": "male_masculine", "accent": "出生地：31 上海市", "age": "teens", "locale": "zh-CN"}}
-{"utt_id": "fsicoli_common_voice_22_0_00032", "audio_uri": "C:\\Users\\hp\\Desktop\\ASR Agent\\ASR_AGENT_\\data\\hf_audio\\fsicoli_common_voice_22_0_00032.wav", "ref_text": "范家庄遗址，位于山东省潍坊市坊子区坊城街道。", "meta": {"dataset_id": "fsicoli/common_voice_22_0", "dataset_config": "zh-CN", "split": "validation", "text_field": "sentence", "sample_rate": 48000, "client_id": "99a4cee094a7058f27e615982d793da9039f8916c4cb0934eafecb601214cb89657ddee22f688a38782a72f5b6622a323ed6dca74f6663430f8cb3c0804563ea", "gender": "male_masculine", "accent": "出生地：31 上海市", "age": "teens", "locale": "zh-CN"}}
-{"utt_id": "fsicoli_common_voice_22_0_00033", "audio_uri": "C:\\Users\\hp\\Desktop\\ASR Agent\\ASR_AGENT_\\data\\hf_audio\\fsicoli_common_voice_22_0_00033.wav", "ref_text": "郭新立，河北安国人，出生于北京，中国教育人物，现任山东大学党委书记。", "meta": {"dataset_id": "fsicoli/common_voice_22_0", "dataset_config": "zh-CN", "split": "validation", "text_field": "sentence", "sample_rate": 48000, "client_id": "99a4cee094a7058f27e615982d793da9039f8916c4cb0934eafecb601214cb89657ddee22f688a38782a72f5b6622a323ed6dca74f6663430f8cb3c0804563ea", "gender": "male_masculine", "accent": "出生地：31 上海市", "age": "teens", "locale": "zh-CN"}}
-{"utt_id": "fsicoli_common_voice_22_0_00034", "audio_uri": "C:\\Users\\hp\\Desktop\\ASR Agent\\ASR_AGENT_\\data\\hf_audio\\fsicoli_common_voice_22_0_00034.wav", "ref_text": "龟山风景区管理处是下辖的一个类似乡级单位。", "meta": {"dataset_id": "fsicoli/common_voice_22_0", "dataset_config": "zh-CN", "split": "validation", "text_field": "sentence", "sample_rate": 48000, "client_id": "22950d9b987d2554c0d7130808cc60fcb5255d92bb579ad138f4da5e2d5fc52b02d4639e4fe708ef5b820a04812fd3f530e3ea93abfac3e55c8dc2ad22696403", "gender": "", "accent": "", "age": "", "locale": "zh-CN"}}
-{"utt_id": "fsicoli_common_voice_22_0_00035", "audio_uri": "C:\\Users\\hp\\Desktop\\ASR Agent\\ASR_AGENT_\\data\\hf_audio\\fsicoli_common_voice_22_0_00035.wav", "ref_text": "后来他随着李成栋反正，历任光禄卿、户部右侍郎，兵部左侍郎，永历二年晋兵部尚书。", "meta": {"dataset_id": "fsicoli/common_voice_22_0", "dataset_config": "zh-CN", "split": "validation", "text_field": "sentence", "sample_rate": 48000, "client_id": "22950d9b987d2554c0d7130808cc60fcb5255d92bb579ad138f4da5e2d5fc52b02d4639e4fe708ef5b820a04812fd3f530e3ea93abfac3e55c8dc2ad22696403", "gender": "", "accent": "", "age": "", "locale": "zh-CN"}}
-{"utt_id": "fsicoli_common_voice_22_0_00036", "audio_uri": "C:\\Users\\hp\\Desktop\\ASR Agent\\ASR_AGENT_\\data\\hf_audio\\fsicoli_common_voice_22_0_00036.wav", "ref_text": "同年加入中国人民解放军。", "meta": {"dataset_id": "fsicoli/common_voice_22_0", "dataset_config": "zh-CN", "split": "validation", "text_field": "sentence", "sample_rate": 48000, "client_id": "22950d9b987d2554c0d7130808cc60fcb5255d92bb579ad138f4da5e2d5fc52b02d4639e4fe708ef5b820a04812fd3f530e3ea93abfac3e55c8dc2ad22696403", "gender": "", "accent": "", "age": "", "locale": "zh-CN"}}
-{"utt_id": "fsicoli_common_voice_22_0_00037", "audio_uri": "C:\\Users\\hp\\Desktop\\ASR Agent\\ASR_AGENT_\\data\\hf_audio\\fsicoli_common_voice_22_0_00037.wav", "ref_text": "由马可、许亚军领衔主演，并由岳红、柯蓝、王策、孙爽联合主演。", "meta": {"dataset_id": "fsicoli/common_voice_22_0", "dataset_config": "zh-CN", "split": "validation", "text_field": "sentence", "sample_rate": 48000, "client_id": "22950d9b987d2554c0d7130808cc60fcb5255d92bb579ad138f4da5e2d5fc52b02d4639e4fe708ef5b820a04812fd3f530e3ea93abfac3e55c8dc2ad22696403", "gender": "", "accent": "", "age": "", "locale": "zh-CN"}}
-{"utt_id": "fsicoli_common_voice_22_0_00038", "audio_uri": "C:\\Users\\hp\\Desktop\\ASR Agent\\ASR_AGENT_\\data\\hf_audio\\fsicoli_common_voice_22_0_00038.wav", "ref_text": "生于崎玉县川越市，女子美术大学肄业。", "meta": {"dataset_id": "fsicoli/common_voice_22_0", "dataset_config": "zh-CN", "split": "validation", "text_field": "sentence", "sample_rate": 48000, "client_id": "22950d9b987d2554c0d7130808cc60fcb5255d92bb579ad138f4da5e2d5fc52b02d4639e4fe708ef5b820a04812fd3f530e3ea93abfac3e55c8dc2ad22696403", "gender": "", "accent": "", "age": "", "locale": "zh-CN"}}
-{"utt_id": "fsicoli_common_voice_22_0_00039", "audio_uri": "C:\\Users\\hp\\Desktop\\ASR Agent\\ASR_AGENT_\\data\\hf_audio\\fsicoli_common_voice_22_0_00039.wav", "ref_text": "旧福布斯敦是位于美国加利福尼亚州比尤特县的一个非建制地区。", "meta": {"dataset_id": "fsicoli/common_voice_22_0", "dataset_config": "zh-CN", "split": "validation", "text_field": "sentence", "sample_rate": 48000, "client_id": "22950d9b987d2554c0d7130808cc60fcb5255d92bb579ad138f4da5e2d5fc52b02d4639e4fe708ef5b820a04812fd3f530e3ea93abfac3e55c8dc2ad22696403", "gender": "", "accent": "", "age": "", "locale": "zh-CN"}}
-{"utt_id": "fsicoli_common_voice_22_0_00040", "audio_uri": "C:\\Users\\hp\\Desktop\\ASR Agent\\ASR_AGENT_\\data\\hf_audio\\fsicoli_common_voice_22_0_00040.wav", "ref_text": "大厅供穆斯林祈祷，这也是他们见面以结束禁食的地方。", "meta": {"dataset_id": "fsicoli/common_voice_22_0", "dataset_config": "zh-CN", "split": "validation", "text_field": "sentence", "sample_rate": 48000, "client_id": "22950d9b987d2554c0d7130808cc60fcb5255d92bb579ad138f4da5e2d5fc52b02d4639e4fe708ef5b820a04812fd3f530e3ea93abfac3e55c8dc2ad22696403", "gender": "", "accent": "", "age": "", "locale": "zh-CN"}}
-{"utt_id": "fsicoli_common_voice_22_0_00041", "audio_uri": "C:\\Users\\hp\\Desktop\\ASR Agent\\ASR_AGENT_\\data\\hf_audio\\fsicoli_common_voice_22_0_00041.wav", "ref_text": "我们就没办法改善", "meta": {"dataset_id": "fsicoli/common_voice_22_0", "dataset_config": "zh-CN", "split": "validation", "text_field": "sentence", "sample_rate": 48000, "client_id": "22950d9b987d2554c0d7130808cc60fcb5255d92bb579ad138f4da5e2d5fc52b02d4639e4fe708ef5b820a04812fd3f530e3ea93abfac3e55c8dc2ad22696403", "gender": "", "accent": "", "age": "", "locale": "zh-CN"}}
-{"utt_id": "fsicoli_common_voice_22_0_00042", "audio_uri": "C:\\Users\\hp\\Desktop\\ASR Agent\\ASR_AGENT_\\data\\hf_audio\\fsicoli_common_voice_22_0_00042.wav", "ref_text": "四号镇区是位于美国阿肯色州本顿县的一个镇区。", "meta": {"dataset_id": "fsicoli/common_voice_22_0", "dataset_config": "zh-CN", "split": "validation", "text_field": "sentence", "sample_rate": 48000, "client_id": "22950d9b987d2554c0d7130808cc60fcb5255d92bb579ad138f4da5e2d5fc52b02d4639e4fe708ef5b820a04812fd3f530e3ea93abfac3e55c8dc2ad22696403", "gender": "", "accent": "", "age": "", "locale": "zh-CN"}}
-{"utt_id": "fsicoli_common_voice_22_0_00043", "audio_uri": "C:\\Users\\hp\\Desktop\\ASR Agent\\ASR_AGENT_\\data\\hf_audio\\fsicoli_common_voice_22_0_00043.wav", "ref_text": "格梅林后来出版了若干本关于化学、制药科学、矿物学和植物学的教科书。", "meta": {"dataset_id": "fsicoli/common_voice_22_0", "dataset_config": "zh-CN", "split": "validation", "text_field": "sentence", "sample_rate": 48000, "client_id": "22950d9b987d2554c0d7130808cc60fcb5255d92bb579ad138f4da5e2d5fc52b02d4639e4fe708ef5b820a04812fd3f530e3ea93abfac3e55c8dc2ad22696403", "gender": "", "accent": "", "age": "", "locale": "zh-CN"}}
-{"utt_id": "fsicoli_common_voice_22_0_00044", "audio_uri": "C:\\Users\\hp\\Desktop\\ASR Agent\\ASR_AGENT_\\data\\hf_audio\\fsicoli_common_voice_22_0_00044.wav", "ref_text": "同年获选澳门十大杰出运动员，是首位获奖的篮球员。", "meta": {"dataset_id": "fsicoli/common_voice_22_0", "dataset_config": "zh-CN", "split": "validation", "text_field": "sentence", "sample_rate": 48000, "client_id": "3c71635420e0de3a0272e28a63d340dbaaeb5d99e246668955f38c25279dfdbbd8eec8cc8663601fe11d6cfd81a45f9a2e8a5d55379220fe71d24a00bee0effb", "gender": "male_masculine", "accent": "出生地：42 湖北省", "age": "thirties", "locale": "zh-CN"}}
-{"utt_id": "fsicoli_common_voice_22_0_00045", "audio_uri": "C:\\Users\\hp\\Desktop\\ASR Agent\\ASR_AGENT_\\data\\hf_audio\\fsicoli_common_voice_22_0_00045.wav", "ref_text": "阿尔德斯普林斯是位于美国加利福尼亚州弗雷斯诺县的一个非建制地区。", "meta": {"dataset_id": "fsicoli/common_voice_22_0", "dataset_config": "zh-CN", "split": "validation", "text_field": "sentence", "sample_rate": 48000, "client_id": "3c71635420e0de3a0272e28a63d340dbaaeb5d99e246668955f38c25279dfdbbd8eec8cc8663601fe11d6cfd81a45f9a2e8a5d55379220fe71d24a00bee0effb", "gender": "male_masculine", "accent": "出生地：42 湖北省", "age": "thirties", "locale": "zh-CN"}}
-{"utt_id": "fsicoli_common_voice_22_0_00046", "audio_uri": "C:\\Users\\hp\\Desktop\\ASR Agent\\ASR_AGENT_\\data\\hf_audio\\fsicoli_common_voice_22_0_00046.wav", "ref_text": "巴特勒是位于美国亚利桑那州莫哈维县的一个非建制地区。", "meta": {"dataset_id": "fsicoli/common_voice_22_0", "dataset_config": "zh-CN", "split": "validation", "text_field": "sentence", "sample_rate": 48000, "client_id": "3c71635420e0de3a0272e28a63d340dbaaeb5d99e246668955f38c25279dfdbbd8eec8cc8663601fe11d6cfd81a45f9a2e8a5d55379220fe71d24a00bee0effb", "gender": "male_masculine", "accent": "出生地：42 湖北省", "age": "thirties", "locale": "zh-CN"}}
-{"utt_id": "fsicoli_common_voice_22_0_00047", "audio_uri": "C:\\Users\\hp\\Desktop\\ASR Agent\\ASR_AGENT_\\data\\hf_audio\\fsicoli_common_voice_22_0_00047.wav", "ref_text": "最后放弃", "meta": {"dataset_id": "fsicoli/common_voice_22_0", "dataset_config": "zh-CN", "split": "validation", "text_field": "sentence", "sample_rate": 48000, "client_id": "3c71635420e0de3a0272e28a63d340dbaaeb5d99e246668955f38c25279dfdbbd8eec8cc8663601fe11d6cfd81a45f9a2e8a5d55379220fe71d24a00bee0effb", "gender": "male_masculine", "accent": "出生地：42 湖北省", "age": "thirties", "locale": "zh-CN"}}
-{"utt_id": "fsicoli_common_voice_22_0_00048", "audio_uri": "C:\\Users\\hp\\Desktop\\ASR Agent\\ASR_AGENT_\\data\\hf_audio\\fsicoli_common_voice_22_0_00048.wav", "ref_text": "薄刀峰林场，是下辖的一个类似乡级单位。", "meta": {"dataset_id": "fsicoli/common_voice_22_0", "dataset_config": "zh-CN", "split": "validation", "text_field": "sentence", "sample_rate": 48000, "client_id": "3c71635420e0de3a0272e28a63d340dbaaeb5d99e246668955f38c25279dfdbbd8eec8cc8663601fe11d6cfd81a45f9a2e8a5d55379220fe71d24a00bee0effb", "gender": "male_masculine", "accent": "出生地：42 湖北省", "age": "thirties", "locale": "zh-CN"}}
-{"utt_id": "fsicoli_common_voice_22_0_00049", "audio_uri": "C:\\Users\\hp\\Desktop\\ASR Agent\\ASR_AGENT_\\data\\hf_audio\\fsicoli_common_voice_22_0_00049.wav", "ref_text": "该季他第一次出赛是在九局上担任普林斯·菲尔德的代跑。", "meta": {"dataset_id": "fsicoli/common_voice_22_0", "dataset_config": "zh-CN", "split": "validation", "text_field": "sentence", "sample_rate": 48000, "client_id": "dfacf81ef98f2b80ebf3a932d8c926f7fa65ffaa8dfc35edefc1344d0e4096cc52dd6cd86f2b29d9ae8dc8bf25d4ac3e0fd6133ed370de7f4e6df6d89193c9b4", "gender": "male_masculine", "accent": "出生地：35 福建省", "age": "twenties", "locale": "zh-CN"}}
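
Every removed record stores `audio_uri` as an absolute Windows path under `C:\Users\hp\Desktop\ASR Agent\ASR_AGENT_\data\hf_audio\`, which resolves only on the machine that generated the file. A rough sketch of how such a value could be re-rooted under a repository-relative directory is shown below. The `to_relative_audio_uri` function and its `audio_dir` default are hypothetical, and whether the pipeline resolves relative paths against the repository root is an assumption.

```python
from pathlib import PureWindowsPath


def to_relative_audio_uri(audio_uri: str, audio_dir: str = "data/hf_audio") -> str:
    """Keep only the file name of a Windows-style path and re-root it under audio_dir.

    PureWindowsPath parses backslash-separated paths on any operating system,
    so this works even when the manifest is processed on Linux.
    """
    file_name = PureWindowsPath(audio_uri).name
    return f"{audio_dir}/{file_name}"


# Decoded value of audio_uri from the first removed record.
original = r"C:\Users\hp\Desktop\ASR Agent\ASR_AGENT_\data\hf_audio\fsicoli_common_voice_22_0_00000.wav"
portable = to_relative_audio_uri(original)
# portable == "data/hf_audio/fsicoli_common_voice_22_0_00000.wav"
```

The sketch discards the directory part entirely, which is only safe when file names are unique within the target directory, as they are for the 50 `utt_id`-derived names above.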

pipeline/run_all.py CHANGED

@@ -6,26 +6,36 @@ from pipeline.run_analysis import run_analysis
 from pipeline.run_asr import run_asr
-def main():
     ap = argparse.ArgumentParser()
     ap.add_argument("--manifest", required=True)
     ap.add_argument("--model_name", default="openai/whisper-small")
     ap.add_argument("--device", default="cpu")
-    ap.add_argument("--backend", default="auto", choices=["auto", "whisper_transformers", "qwen3_asr"])
     ap.add_argument("--llm_model", default="gpt-4.1-mini")
     ap.add_argument("--disable_llm", action="store_true")
-    ap.add_argument("--language", default="zh")
     args = ap.parse_args()
     run_id = run_asr(
         manifest_path=args.manifest,
         model_repo_id=args.model_name,
         device=args.device,
         asr_config={"language": args.language},
         backend=args.backend,
     )
-    run_analysis(run_id, llm_enabled=not args.disable_llm, llm_model=args.llm_model)
-    print(f"Done. Run: runs/{run_id}")
 if __name__ == "__main__":

 from pipeline.run_asr import run_asr
+BACKEND_CHOICES = ["auto", "whisper_transformers", "qwen3_asr"]
+def main() -> None:
     ap = argparse.ArgumentParser()
     ap.add_argument("--manifest", required=True)
     ap.add_argument("--model_name", default="openai/whisper-small")
     ap.add_argument("--device", default="cpu")
+    ap.add_argument("--backend", default="auto", choices=BACKEND_CHOICES)
+    ap.add_argument("--language", default="zh")
+    ap.add_argument("--out_root", default="runs")
     ap.add_argument("--llm_model", default="gpt-4.1-mini")
     ap.add_argument("--disable_llm", action="store_true")
     args = ap.parse_args()
     run_id = run_asr(
         manifest_path=args.manifest,
+        out_root=args.out_root,
         model_repo_id=args.model_name,
         device=args.device,
         asr_config={"language": args.language},
         backend=args.backend,
     )
+    run_analysis(
+        run_id,
+        out_root=args.out_root,
+        llm_enabled=not args.disable_llm,
+        llm_model=args.llm_model,
+    )
+    print(f"Done. Run: {args.out_root}/{run_id}")
 if __name__ == "__main__":
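
The new side of `pipeline/run_all.py` adds `--out_root` and forwards it, together with the LLM flags, to `run_analysis`. A minimal sketch of that flag handling follows. The parser only mirrors the flags visible in this diff, and `build_parser` and `analysis_kwargs` are hypothetical helpers introduced for illustration; they are not part of the project.

```python
import argparse

# Mirrors the flags visible in the new side of the diff; rebuilt here only
# for illustration, not imported from the project.
BACKEND_CHOICES = ["auto", "whisper_transformers", "qwen3_asr"]


def build_parser() -> argparse.ArgumentParser:
    ap = argparse.ArgumentParser()
    ap.add_argument("--manifest", required=True)
    ap.add_argument("--model_name", default="openai/whisper-small")
    ap.add_argument("--device", default="cpu")
    ap.add_argument("--backend", default="auto", choices=BACKEND_CHOICES)
    ap.add_argument("--language", default="zh")
    ap.add_argument("--out_root", default="runs")
    ap.add_argument("--llm_model", default="gpt-4.1-mini")
    ap.add_argument("--disable_llm", action="store_true")
    return ap


def analysis_kwargs(args: argparse.Namespace) -> dict:
    # Hypothetical helper: the keyword arguments main() passes to run_analysis.
    return {
        "out_root": args.out_root,
        "llm_enabled": not args.disable_llm,  # the CLI flag is the negation
        "llm_model": args.llm_model,
    }


args = build_parser().parse_args(
    ["--manifest", "data/manifest.jsonl", "--out_root", "runs_tmp", "--disable_llm"]
)
kwargs = analysis_kwargs(args)
print(kwargs)
```

Keeping the CLI flag negative (`--disable_llm`) while the function parameter stays positive (`llm_enabled`) means the LLM step runs by default and a single `not` at the call site is the only translation needed.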

scripts/run_hf_job.py CHANGED

@@ -5,9 +5,7 @@ import json
 import os
 import sys
 from pathlib import Path
-from typing import Optional, Dict, Any
-import pandas as pd
 # Ensure project root is on sys.path
 PROJECT_ROOT = Path(__file__).resolve().parents[1]
@@ -29,11 +27,6 @@ def build_manifest_from_hf(
     num_samples: int,
     out_manifest: Path,
 ) -> int:
-    """
-    Robust HF loader for datasets with audio.
-    Uses streaming=True and materializes audio to local wav files.
-    Works well for Common Voice-like datasets when dataset scripts are supported.
-    """
     from datasets import load_dataset
     import soundfile as sf
@@ -96,7 +89,6 @@ def build_manifest_from_hf(
             "text_field": text_field,
             "sample_rate": sr,
         }
         for k in ["client_id", "gender", "accent", "age", "locale"]:
             if k in item:
                 meta[k] = item.get(k)
@@ -121,61 +113,7 @@ def build_manifest_from_hf(
     return len(records)
-def run_diagnostic_pipeline(run_id: str, runs_dir: str = "runs") -> None:
-    """
-    Generate:
-      - root_cause.json
-      - diagnostic_report.md
-    under runs/<run_id>/
-    """
-    from analysis.root_cause import infer_root_causes
-    from report.diagnostic_report import generate_report_with_openai
-    from openai import OpenAI
-    run_dir = Path(runs_dir) / run_id
-    # aligned.jsonl
-    aligned_path = run_dir / "aligned.jsonl"
-    aligned_rows = []
-    if aligned_path.exists():
-        with aligned_path.open("r", encoding="utf-8") as f:
-            for line in f:
-                line = line.strip()
-                if line:
-                    aligned_rows.append(json.loads(line))
-    df_align = pd.DataFrame(aligned_rows)
-    # events.parquet
-    events_path = run_dir / "events.parquet"
-    df_events = pd.read_parquet(events_path) if events_path.exists() else pd.DataFrame()
-    # summary.json
-    summary_path = run_dir / "summary.json"
-    summary = json.loads(summary_path.read_text(encoding="utf-8")) if summary_path.exists() else {}
-    # Rule/statistics-based diagnosis
-    root_cause = infer_root_causes(df_events, df_align)
-    (run_dir / "root_cause.json").write_text(
-        json.dumps(root_cause, ensure_ascii=False, indent=2),
-        encoding="utf-8"
-    )
-    # LLM report
-    api_key = os.getenv("OPENAI_API_KEY", "").strip()
-    if not api_key:
-        report_text = (
-            "# Diagnostic Report\n\n"
-            "OPENAI_API_KEY is not set, so only root_cause.json was generated.\n\n"
-            "Please add OPENAI_API_KEY in Hugging Face Space Settings → Secrets."
-        )
-    else:
-        client = OpenAI(api_key=api_key)
-        report_text = generate_report_with_openai(root_cause, summary, client)
-    (run_dir / "diagnostic_report.md").write_text(report_text, encoding="utf-8")
-def main():
     ap = argparse.ArgumentParser()
     ap.add_argument("--dataset_id", required=True)
     ap.add_argument("--dataset_config", default="")
@@ -186,7 +124,9 @@ def main():
     ap.add_argument("--model_repo_id", required=True)
     ap.add_argument("--backend", default="auto", choices=["auto", "whisper_transformers", "qwen3_asr"])
     ap.add_argument("--language", default="zh")
     ap.add_argument("--out_root", default="runs")
     args = ap.parse_args()
@@ -196,7 +136,7 @@ def main():
     data_dir.mkdir(parents=True, exist_ok=True)
     manifest_path = data_dir / "manifest_hf.jsonl"
-    print("[1/5] Building manifest from Hugging Face dataset...")
     n = build_manifest_from_hf(
         dataset_id=args.dataset_id,
         dataset_config=args.dataset_config.strip() or None,
@@ -207,28 +147,30 @@ def main():
     )
     print(f"  - Wrote {n} samples to {manifest_path}")
-    print("[2/5] Running ASR inference...")
     from pipeline.run_asr import run_asr
     run_id = run_asr(
         manifest_path=str(manifest_path),
         out_root=args.out_root,
         model_repo_id=args.model_repo_id,
-        device="cpu",
         asr_config={"language": args.language},
         backend=args.backend,
     )
     print(f"  - ASR done. run_id={run_id}")
-    print("[3/5] Running analysis (align/events/report)...")
     from pipeline.run_analysis import run_analysis
-    run_analysis(run_id, out_root=args.out_root)
-    print("[4/5] Running diagnostic report...")
-    run_diagnostic_pipeline(run_id, runs_dir=args.out_root)
-    print("[5/5] Done.")
     print(f"Run directory: {Path(args.out_root) / run_id}")

 import os
 import sys
 from pathlib import Path
+from typing import Any, Dict, Optional
 # Ensure project root is on sys.path
 PROJECT_ROOT = Path(__file__).resolve().parents[1]
     num_samples: int,
     out_manifest: Path,
 ) -> int:
     from datasets import load_dataset
     import soundfile as sf
             "text_field": text_field,
             "sample_rate": sr,
         }
         for k in ["client_id", "gender", "accent", "age", "locale"]:
             if k in item:
                 meta[k] = item.get(k)
     return len(records)
+def main() -> None:
     ap = argparse.ArgumentParser()
     ap.add_argument("--dataset_id", required=True)
     ap.add_argument("--dataset_config", default="")
     ap.add_argument("--model_repo_id", required=True)
     ap.add_argument("--backend", default="auto", choices=["auto", "whisper_transformers", "qwen3_asr"])
     ap.add_argument("--language", default="zh")
+    ap.add_argument("--device", default="cpu")
+    ap.add_argument("--llm_model", default="gpt-4.1-mini")
+    ap.add_argument("--disable_llm", action="store_true")
     ap.add_argument("--out_root", default="runs")
     args = ap.parse_args()
     data_dir.mkdir(parents=True, exist_ok=True)
     manifest_path = data_dir / "manifest_hf.jsonl"
+    print("[1/4] Building manifest from Hugging Face dataset...")
     n = build_manifest_from_hf(
         dataset_id=args.dataset_id,
         dataset_config=args.dataset_config.strip() or None,
     )
     print(f"  - Wrote {n} samples to {manifest_path}")
+    print("[2/4] Running ASR inference...")
     from pipeline.run_asr import run_asr
     run_id = run_asr(
         manifest_path=str(manifest_path),
         out_root=args.out_root,
         model_repo_id=args.model_repo_id,
+        device=args.device,
         asr_config={"language": args.language},
         backend=args.backend,
     )
     print(f"  - ASR done. run_id={run_id}")
+    print("[3/4] Running analysis and diagnosis...")
     from pipeline.run_analysis import run_analysis
+    run_analysis(
+        run_id,
+        out_root=args.out_root,
+        llm_enabled=not args.disable_llm,
+        llm_model=args.llm_model,
+    )
+    print("[4/4] Done.")
     print(f"Run directory: {Path(args.out_root) / run_id}")
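
With `--device`, `--llm_model`, and `--disable_llm` now accepted by this script, a caller such as the Gradio UI in this commit can assemble the command as a list of arguments. The sketch below assumes the script lives at `scripts/run_hf_job.py` and covers only flags that are visible in this diff; `build_cmd` is a hypothetical helper, and flags hidden by the collapsed hunks are left out.

```python
import sys


def build_cmd(
    dataset_id: str,
    model_repo_id: str,
    backend: str = "auto",
    language: str = "zh",
    device: str = "cpu",
    llm_model: str = "gpt-4.1-mini",
    disable_llm: bool = False,
    dataset_config: str = "",
    script: str = "scripts/run_hf_job.py",  # assumed location of the script
) -> list[str]:
    cmd = [
        sys.executable,
        script,
        "--dataset_id", dataset_id.strip(),
        "--model_repo_id", model_repo_id.strip(),
        "--backend", backend.strip(),
        "--language", language.strip(),
        "--device", device.strip(),
        "--llm_model", llm_model.strip(),
    ]
    # store_true flags take no value, so they are appended only when set.
    if disable_llm:
        cmd.append("--disable_llm")
    # Optional value flags are skipped entirely when the value is blank.
    if dataset_config and dataset_config.strip():
        cmd += ["--dataset_config", dataset_config.strip()]
    return cmd


cmd = build_cmd(
    "fsicoli/common_voice_22_0",
    "openai/whisper-small",
    disable_llm=True,
    dataset_config=" zh-CN ",
)
print(cmd[2:])
```

Passing the command as a list to `subprocess.run` avoids shell quoting issues with values such as dataset ids that contain slashes.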

ui/app.py CHANGED

@@ -4,6 +4,7 @@ import json
 import subprocess
 import sys
 from pathlib import Path
 import gradio as gr
 import pandas as pd
@@ -13,9 +14,10 @@ RUNS_DIR = Path("runs")
 SEMANTIC_JUDGEMENTS = ["ALL", "语义基本等价", "轻微偏差", "明显偏差", "严重失真", "不确定"]
 SEVERITIES = ["ALL", "high", "medium", "low"]
 BUSINESS_IMPACTS = ["ALL", "high", "medium", "low"]
-def list_runs():
     if not RUNS_DIR.exists():
         return []
     return sorted(
@@ -41,7 +43,7 @@ def _read_jsonl(path: Path):
 def _normalize_semantic_cell(xs):
-    def _clean_seq(seq):
         out = []
         for x in seq:
             if x is None:
@@ -95,6 +97,21 @@ def _normalize_semantic_df(df: pd.DataFrame) -> pd.DataFrame:
     return out
 def load_run(run_id: str):
     run_dir = RUNS_DIR / run_id
     meta = _read_json(run_dir / "run_meta.json", {})
@@ -110,10 +127,11 @@ def load_run(run_id: str):
     return meta, summary, df_align, df_events, df_semantic, llm_diagnosis, diagnostic_text
-def build_summary_md(meta, summary, df_semantic: pd.DataFrame | None = None):
     lines = []
     lines.append(f"### Run ID: `{meta.get('run_id')}`")
     lines.append(f"- Model: `{meta.get('model_info')}`")
     if "wer_mean" in summary and summary["wer_mean"] is not None:
         lines.append(f"- WER(mean): **{summary['wer_mean']:.4f}**")
     if "cer_mean" in summary and summary["cer_mean"] is not None:
@@ -129,7 +147,7 @@ def build_summary_md(meta, summary, df_semantic: pd.DataFrame | None = None):
     return "\n".join(lines)
-def build_semantic_overview_md(df_semantic: pd.DataFrame, llm_diagnosis: dict):
     if df_semantic is None or len(df_semantic) == 0:
         return "### Semantic Overview\n暂无 per-utterance LLM 语义诊断结果。请先用配置了 `OPENAI_API_KEY` 的流程跑分析。"
     lines = ["### Semantic Overview"]
@@ -175,13 +193,13 @@ def _head_semantic(df_semantic: pd.DataFrame) -> pd.DataFrame:
         "semantic_error_types_str", "reason", "ref_text", "hyp_text",
     ]
     cols = [c for c in cols if c in df_semantic.columns]
-    return df_semantic.sort_values([c for c in ["business_impact", "severity", "cer"] if c in df_semantic.columns], ascending=[True, True, False][:len([c for c in ["business_impact", "severity", "cer"] if c in df_semantic.columns])]).head(100)[cols]
 def on_select_run(run_id):
     if not run_id:
         empty = pd.DataFrame()
-        return "", empty, empty, empty, "", "No diagnostic report yet.", gr.update(choices=[]), gr.update(choices=[])
     meta, summary, df_align, df_events, df_semantic, llm_diagnosis, diagnostic_text = load_run(run_id)
     md = build_summary_md(meta, summary, df_semantic)
@@ -261,10 +279,7 @@ def search_semantic(run_id, judgement, severity, business_impact, semantic_type,
     if min_cer is not None and "cer" in q.columns:
         q = q[q["cer"].fillna(0) >= float(min_cer)]
-    order_cols = [c for c in ["business_impact", "severity", "cer"] if c in q.columns]
-    if order_cols:
-        q = q.sort_values(order_cols, ascending=[True, True, False][:len(order_cols)])
     cols = [
         "utt_id", "semantic_judgement", "severity", "business_impact", "wer", "cer",
         "semantic_error_types_str", "reason", "improvement_suggestions_str", "domain", "accent",
@@ -278,6 +293,7 @@ def apply_backend_preset(backend, model_repo_id, language):
     backend = str(backend or "auto").strip()
     model_repo_id = str(model_repo_id or "").strip()
     language = str(language or "").strip()
     if backend == "qwen3_asr":
         if (not model_repo_id) or ("qwen3-asr" not in model_repo_id.lower()):
             model_repo_id = "Qwen/Qwen3-ASR-0.6B"
@@ -285,16 +301,20 @@ def apply_backend_preset(backend, model_repo_id, language):
             language = "zh"
         info = "Qwen3-ASR 已启用。建议模型：Qwen/Qwen3-ASR-0.6B 或 Qwen/Qwen3-ASR-1.7B。若环境未安装 qwen-asr，任务会失败。"
         return model_repo_id, language, info
     if backend == "whisper_transformers":
         if (not model_repo_id) or ("whisper" not in model_repo_id.lower()):
             model_repo_id = "openai/whisper-small"
         info = "Whisper Transformers 已启用。"
-        return model_repo_id, language or "zh", info
     info = "backend=auto：会根据模型名自动选择适配器；模型名包含 qwen3-asr 时会走 Qwen3-ASR Adapter。"
     return model_repo_id or "openai/whisper-small", language or "zh", info
-def run_hf_job(dataset_id, dataset_config, split, text_field, model_repo_id, backend, language, num_samples):
     model_repo_id, language, preset_info = apply_backend_preset(backend, model_repo_id, language)
     cmd = [
         sys.executable,
@@ -305,16 +325,19 @@ def run_hf_job(dataset_id, dataset_config, split, text_field, model_repo_id, bac
         "--model_repo_id", model_repo_id.strip(),
         "--backend", str(backend).strip(),
         "--language", language.strip(),
         "--num", str(int(num_samples)),
     ]
     if dataset_config and dataset_config.strip():
         cmd += ["--dataset_config", dataset_config.strip()]
     p = subprocess.run(cmd, capture_output=True, text=True)
-    out = (p.stdout or "") + ("\n" + (p.stderr or "") if p.stderr else "")
     if p.returncode != 0:
-        out = preset_info + "\n\n" + out
-        out += "\n\n[HINT] If you see 401/403 for Common Voice: set HF_TOKEN in Space Settings → Secrets, and accept dataset terms on HF."
         empty = pd.DataFrame()
         return out, gr.update(), "", empty, empty, empty, "", "No diagnostic report yet.", gr.update(), gr.update()
@@ -325,7 +348,6 @@ def run_hf_job(dataset_id, dataset_config, split, text_field, model_repo_id, bac
     else:
         md, align_view, events_view, semantic_view, semantic_md, diagnostic_text, type_dd, domain_dd = "", pd.DataFrame(), pd.DataFrame(), pd.DataFrame(), "", "No diagnostic report yet.", gr.update(), gr.update()
-    out = preset_info + "\n\n" + out
     return out, gr.update(choices=runs, value=latest), md, align_view, events_view, semantic_view, semantic_md, diagnostic_text, type_dd, domain_dd
@@ -335,7 +357,7 @@ with gr.Blocks() as demo:
     with gr.Accordion("Run from Hugging Face", open=True):
         gr.Markdown(
             "Fill in a dataset and an ASR model, then click **Run**. "
-            "If the dataset is gated, set `HF_TOKEN` in Space **Settings → Secrets**. "
             "For LLM semantic diagnostics, make sure `OPENAI_API_KEY` is available."
         )
         with gr.Row():
@@ -347,8 +369,12 @@ with gr.Blocks() as demo:
             num_samples = gr.Number(label="Num samples", value=50, precision=0)
         with gr.Row():
             model_repo_id = gr.Textbox(label="HF model repo id", value="openai/whisper-small")
-            backend = gr.Dropdown(label="ASR backend", choices=["auto", "whisper_transformers", "qwen3_asr"], value="auto")
             language = gr.Textbox(label="Language", value="zh")
         run_btn = gr.Button("Run")
         backend_info = gr.Markdown("backend=auto：会根据模型名自动选择适配器；模型名包含 qwen3-asr 时会走 Qwen3-ASR Adapter。")
         logs = gr.Textbox(label="Logs", lines=16)
@@ -421,6 +447,6 @@ with gr.Blocks() as demo:
     run_btn.click(
         run_hf_job,
-        inputs=[dataset_id, dataset_config, split, text_field, model_repo_id, backend, language, num_samples],
         outputs=[logs, run_dd, summary_md, align_tbl, events_tbl, semantic_tbl, semantic_overview_md, diagnostic_md, semantic_type, semantic_domain],
     )

 import subprocess
 import sys
 from pathlib import Path
+from typing import Iterable
 import gradio as gr
 import pandas as pd
 SEMANTIC_JUDGEMENTS = ["ALL", "语义基本等价", "轻微偏差", "明显偏差", "严重失真", "不确定"]
 SEVERITIES = ["ALL", "high", "medium", "low"]
 BUSINESS_IMPACTS = ["ALL", "high", "medium", "low"]
+_BACKEND_CHOICES = ["auto", "whisper_transformers", "qwen3_asr"]
+def list_runs() -> list[str]:
     if not RUNS_DIR.exists():
         return []
     return sorted(
 def _normalize_semantic_cell(xs):
+    def _clean_seq(seq: Iterable):
         out = []
         for x in seq:
             if x is None:
     return out
+def _apply_priority_order(df: pd.DataFrame) -> pd.DataFrame:
+    if df is None or len(df) == 0:
+        return df
+    out = df.copy()
+    if "business_impact" in out.columns:
+        out["business_impact"] = pd.Categorical(out["business_impact"], categories=["high", "medium", "low"], ordered=True)
+    if "severity" in out.columns:
+        out["severity"] = pd.Categorical(out["severity"], categories=["high", "medium", "low"], ordered=True)
+    order_cols = [c for c in ["business_impact", "severity", "cer"] if c in out.columns]
+    if order_cols:
+        ascending = [True if c != "cer" else False for c in order_cols]
+        out = out.sort_values(order_cols, ascending=ascending, na_position="last")
+    return out
 def load_run(run_id: str):
     run_dir = RUNS_DIR / run_id
     meta = _read_json(run_dir / "run_meta.json", {})
     return meta, summary, df_align, df_events, df_semantic, llm_diagnosis, diagnostic_text
+def build_summary_md(meta, summary, df_semantic: pd.DataFrame | None = None) -> str:
     lines = []
     lines.append(f"### Run ID: `{meta.get('run_id')}`")
     lines.append(f"- Model: `{meta.get('model_info')}`")
+    lines.append(f"- Backend: `{meta.get('backend', 'unknown')}`")
     if "wer_mean" in summary and summary["wer_mean"] is not None:
         lines.append(f"- WER(mean): **{summary['wer_mean']:.4f}**")
     if "cer_mean" in summary and summary["cer_mean"] is not None:
     return "\n".join(lines)
+def build_semantic_overview_md(df_semantic: pd.DataFrame, llm_diagnosis: dict) -> str:
     if df_semantic is None or len(df_semantic) == 0:
         return "### Semantic Overview\n暂无 per-utterance LLM 语义诊断结果。请先用配置了 `OPENAI_API_KEY` 的流程跑分析。"
     lines = ["### Semantic Overview"]
         "semantic_error_types_str", "reason", "ref_text", "hyp_text",
     ]
     cols = [c for c in cols if c in df_semantic.columns]
+    return _apply_priority_order(df_semantic).head(100)[cols]
 def on_select_run(run_id):
     if not run_id:
         empty = pd.DataFrame()
+        return "", empty, empty, empty, "", "No diagnostic report yet.", gr.update(choices=["ALL"], value="ALL"), gr.update(choices=["ALL"], value="ALL")
     meta, summary, df_align, df_events, df_semantic, llm_diagnosis, diagnostic_text = load_run(run_id)
     md = build_summary_md(meta, summary, df_semantic)
     if min_cer is not None and "cer" in q.columns:
         q = q[q["cer"].fillna(0) >= float(min_cer)]
+    q = _apply_priority_order(q)
     cols = [
         "utt_id", "semantic_judgement", "severity", "business_impact", "wer", "cer",
         "semantic_error_types_str", "reason", "improvement_suggestions_str", "domain", "accent",
     backend = str(backend or "auto").strip()
     model_repo_id = str(model_repo_id or "").strip()
     language = str(language or "").strip()
     if backend == "qwen3_asr":
         if (not model_repo_id) or ("qwen3-asr" not in model_repo_id.lower()):
             model_repo_id = "Qwen/Qwen3-ASR-0.6B"
             language = "zh"
         info = "Qwen3-ASR 已启用。建议模型：Qwen/Qwen3-ASR-0.6B 或 Qwen/Qwen3-ASR-1.7B。若环境未安装 qwen-asr，任务会失败。"
         return model_repo_id, language, info
     if backend == "whisper_transformers":
         if (not model_repo_id) or ("whisper" not in model_repo_id.lower()):
             model_repo_id = "openai/whisper-small"
+        if not language:
+            language = "zh"
         info = "Whisper Transformers 已启用。"
+        return model_repo_id, language, info
     info = "backend=auto：会根据模型名自动选择适配器；模型名包含 qwen3-asr 时会走 Qwen3-ASR Adapter。"
     return model_repo_id or "openai/whisper-small", language or "zh", info
+def run_hf_job(dataset_id, dataset_config, split, text_field, model_repo_id, backend, language, device, llm_model, disable_llm, num_samples):
     model_repo_id, language, preset_info = apply_backend_preset(backend, model_repo_id, language)
     cmd = [
         sys.executable,
         "--model_repo_id", model_repo_id.strip(),
         "--backend", str(backend).strip(),
         "--language", language.strip(),
+        "--device", str(device).strip(),
+        "--llm_model", str(llm_model).strip(),
         "--num", str(int(num_samples)),
     ]
+    if disable_llm:
+        cmd.append("--disable_llm")
     if dataset_config and dataset_config.strip():
         cmd += ["--dataset_config", dataset_config.strip()]
     p = subprocess.run(cmd, capture_output=True, text=True)
+    out = preset_info + "\n\n" + (p.stdout or "") + ("\n" + (p.stderr or "") if p.stderr else "")
     if p.returncode != 0:
+        out += "\n\n[HINT] If you see 401/403 for Common Voice: set HF_TOKEN in Space Settings -> Secrets, and accept dataset terms on HF."
         empty = pd.DataFrame()
         return out, gr.update(), "", empty, empty, empty, "", "No diagnostic report yet.", gr.update(), gr.update()
     else:
         md, align_view, events_view, semantic_view, semantic_md, diagnostic_text, type_dd, domain_dd = "", pd.DataFrame(), pd.DataFrame(), pd.DataFrame(), "", "No diagnostic report yet.", gr.update(), gr.update()
     return out, gr.update(choices=runs, value=latest), md, align_view, events_view, semantic_view, semantic_md, diagnostic_text, type_dd, domain_dd
     with gr.Accordion("Run from Hugging Face", open=True):
         gr.Markdown(
             "Fill in a dataset and an ASR model, then click **Run**. "
+            "If the dataset is gated, set `HF_TOKEN` in Space **Settings -> Secrets**. "
             "For LLM semantic diagnostics, make sure `OPENAI_API_KEY` is available."
         )
         with gr.Row():
             num_samples = gr.Number(label="Num samples", value=50, precision=0)
         with gr.Row():
             model_repo_id = gr.Textbox(label="HF model repo id", value="openai/whisper-small")
+            backend = gr.Dropdown(label="ASR backend", choices=_BACKEND_CHOICES, value="auto")
             language = gr.Textbox(label="Language", value="zh")
+        with gr.Row():
+            device = gr.Dropdown(label="Device", choices=["cpu", "cuda"], value="cpu")
+            llm_model = gr.Textbox(label="LLM model", value="gpt-4.1-mini")
+            disable_llm = gr.Checkbox(label="Disable LLM diagnosis", value=False)
         run_btn = gr.Button("Run")
         backend_info = gr.Markdown("backend=auto：会根据模型名自动选择适配器；模型名包含 qwen3-asr 时会走 Qwen3-ASR Adapter。")
         logs = gr.Textbox(label="Logs", lines=16)
     run_btn.click(
         run_hf_job,
+        inputs=[dataset_id, dataset_config, split, text_field, model_repo_id, backend, language, device, llm_model, disable_llm, num_samples],
         outputs=[logs, run_dd, summary_md, align_tbl, events_tbl, semantic_tbl, semantic_overview_md, diagnostic_md, semantic_type, semantic_domain],
     )
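
The `_apply_priority_order` helper added in this file replaces a plain ascending sort on the string columns, which orders the labels alphabetically (`high`, `low`, `medium`) rather than by priority. The sketch below re-expresses the intended ordering with the standard library only, as an illustration and not the commit's pandas code: rows are ranked `high`, then `medium`, then `low` on `business_impact` and `severity`, then by `cer` descending, with missing or unknown values last. `PRIORITY` and `priority_key` are hypothetical names, and the sample rows are made up.

```python
from typing import Optional

# Explicit rank per label; anything else (None, unknown label) sorts last.
PRIORITY = {"high": 0, "medium": 1, "low": 2}


def priority_key(row: dict) -> tuple:
    cer: Optional[float] = row.get("cer")
    return (
        PRIORITY.get(row.get("business_impact"), len(PRIORITY)),
        PRIORITY.get(row.get("severity"), len(PRIORITY)),
        cer is None,      # rows without a CER go after rows with one
        -(cer or 0.0),    # higher CER first
    )


rows = [
    {"utt_id": "a", "business_impact": "low", "severity": "high", "cer": 0.9},
    {"utt_id": "b", "business_impact": "high", "severity": "medium", "cer": 0.2},
    {"utt_id": "c", "business_impact": "high", "severity": "high", "cer": 0.1},
    {"utt_id": "d", "business_impact": "high", "severity": "high", "cer": 0.5},
    {"utt_id": "e", "business_impact": None, "severity": "low", "cer": 0.3},
]

ordered = [r["utt_id"] for r in sorted(rows, key=priority_key)]
print(ordered)  # ['d', 'c', 'b', 'a', 'e']

# For comparison, a plain string sort puts "low" ahead of "medium".
print(sorted(["high", "medium", "low"]))  # ['high', 'low', 'medium']
```

In the pandas version from the diff, the ordered `pd.Categorical` plays the role of the `PRIORITY` mapping, and `na_position="last"` handles the missing values.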