bupalinyu committed on
Commit
594cf88
·
verified ·
1 Parent(s): c144783

Update inference examples from open-audio-opd

  1. README.md +58 -6
README.md CHANGED
@@ -74,7 +74,63 @@ The following results are from the `open-audio-opd` evaluation. Lower CER/WER is
 
  ## Inference
 
- Install the open-source inference code:
 
  ```bash
  git clone https://github.com/AutoArk/open-audio-opd
@@ -82,14 +138,12 @@ cd open-audio-opd
  pip install -e .
  ```
 
- Prepare an input JSONL file. Each line is one ASR sample:
 
  ```json
  {"audio":"/path/to/audio.wav","text":"","task":"asr","begin_time":-1,"end_time":-1}
  ```
 
- Run inference:
-
  ```bash
  python scripts/infer/ark_asr_transformers.py \
  --input /path/to/input.jsonl \
@@ -106,8 +160,6 @@ The output JSONL preserves input metadata and adds:
  - `pred_text`: cleaned prediction text for downstream evaluation
  - `pred_text_raw`: raw decoded generation before cleanup
 
- For longer audio, adjust `--max_audio_seconds`. For CPU inference, use `--dtype float32` and `--attn_impl eager`.
-
  ## Evaluation
 
  The repository also includes a J/WER evaluation entrypoint:
 
 
  ## Inference
 
+ Run ASR inference with Hugging Face Transformers:
+
+ ```python
+ import torch
+ from transformers import AutoModelForCausalLM, AutoProcessor, AutoTokenizer
+
+ model_path = "AutoArk-AI/ARK-ASR-0.6B"
+ audio_path = "assets/libai.wav"
+
+ device = "cuda" if torch.cuda.is_available() else "cpu"
+ torch_dtype = torch.float16 if device == "cuda" else torch.float32
+
+ processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)
+ tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
+ model = AutoModelForCausalLM.from_pretrained(
+     model_path,
+     trust_remote_code=True,
+     torch_dtype=torch_dtype,
+     attn_implementation="sdpa",
+ ).to(device)
+
+ conversation = [
+     {
+         "role": "user",
+         "content": [
+             {"type": "audio", "path": audio_path},
+             {"type": "text", "text": "Please transcribe this audio."},
+         ],
+     }
+ ]
+
+ inputs = processor.apply_chat_template(
+     conversation,
+     add_generation_prompt=True,
+     return_tensors="pt",
+ )
+ inputs = inputs.to(device)
+ if "audios" in inputs:
+     inputs["audios"] = inputs["audios"].to(dtype=torch_dtype)
+
+ bad_words_ids = [[token_id] for token_id in tokenizer.all_special_ids if token_id != tokenizer.eos_token_id]
+ outputs = model.generate(
+     **inputs,
+     do_sample=False,
+     max_new_tokens=256,
+     pad_token_id=tokenizer.pad_token_id,
+     eos_token_id=tokenizer.eos_token_id,
+     bad_words_ids=bad_words_ids,
+ )
+ decoded_outputs = tokenizer.batch_decode(
+     outputs[:, inputs.input_ids.shape[1] :],
+     skip_special_tokens=True,
+ )
+ print(decoded_outputs)
+ ```
+
+ For batch JSONL inference, use the open-source inference code:
 
  ```bash
  git clone https://github.com/AutoArk/open-audio-opd
 
  pip install -e .
  ```
 
+ The input JSONL should contain one ASR sample per line:
 
  ```json
  {"audio":"/path/to/audio.wav","text":"","task":"asr","begin_time":-1,"end_time":-1}
  ```
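
A file in this shape could be generated rather than written by hand. The following is a minimal sketch using only the Python standard library; the helper name `write_asr_manifest` and the example paths are hypothetical, and the field values (`"text": ""`, `"task": "asr"`, `begin_time`/`end_time` of `-1`) are copied from the sample line above rather than taken from a documented schema.

```python
import json
from pathlib import Path


def write_asr_manifest(audio_paths, out_path):
    """Write one JSON object per line, mirroring the sample line's fields.

    Hypothetical helper: the field values are copied from the README sample,
    not from a documented schema.
    """
    out_path = Path(out_path)
    with out_path.open("w", encoding="utf-8") as f:
        for audio in audio_paths:
            sample = {
                "audio": str(audio),
                "text": "",
                "task": "asr",
                "begin_time": -1,
                "end_time": -1,
            }
            # ensure_ascii=False keeps non-ASCII path names readable in the file.
            f.write(json.dumps(sample, ensure_ascii=False) + "\n")
    return out_path
```

For example, `write_asr_manifest(["/path/to/a.wav", "/path/to/b.wav"], "input.jsonl")` would produce a two-line file that can be passed to `--input`.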
 
  ```bash
  python scripts/infer/ark_asr_transformers.py \
  --input /path/to/input.jsonl \
 
  - `pred_text`: cleaned prediction text for downstream evaluation
  - `pred_text_raw`: raw decoded generation before cleanup
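
To use these fields downstream, the output file can be read back line by line. This is a rough sketch, assuming that the input's `audio` key is carried over unchanged into each output record (the README says input metadata is preserved); the helper name `load_predictions` is hypothetical.

```python
import json


def load_predictions(path):
    """Return (audio, pred_text) pairs from an output JSONL file.

    Hypothetical helper: assumes each record keeps the input `audio` key
    and carries the `pred_text` field described above.
    """
    rows = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            record = json.loads(line)
            rows.append((record.get("audio"), record.get("pred_text", "")))
    return rows
```

`pred_text_raw` can be read the same way when the generation before cleanup needs to be inspected.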
 
  ## Evaluation
 
  The repository also includes a J/WER evaluation entrypoint: