AutoArk-AI/ARK-ASR-0.6B

@@ -74,7 +74,63 @@ The following results are from the `open-audio-opd` evaluation. Lower CER/WER is
 ## Inference
-Install the open-source inference code:
 ```bash
 git clone https://github.com/AutoArk/open-audio-opd
@@ -82,14 +138,12 @@ cd open-audio-opd
 pip install -e .
 ```
-Prepare an input JSONL file. Each line is one ASR sample:
 ```json
 {"audio":"/path/to/audio.wav","text":"","task":"asr","begin_time":-1,"end_time":-1}
 ```
-Run inference:
 ```bash
 python scripts/infer/ark_asr_transformers.py \
   --input /path/to/input.jsonl \
@@ -106,8 +160,6 @@ The output JSONL preserves input metadata and adds:
 - `pred_text`: cleaned prediction text for downstream evaluation
 - `pred_text_raw`: raw decoded generation before cleanup
-For longer audio, adjust `--max_audio_seconds`. For CPU inference, use `--dtype float32` and `--attn_impl eager`.
 ## Evaluation
 The repository also includes a CER/WER evaluation entrypoint:

 ## Inference
+Run ASR inference with Hugging Face Transformers:
+```python
+import torch
+from transformers import AutoModelForCausalLM, AutoProcessor, AutoTokenizer
+model_path = "AutoArk-AI/ARK-ASR-0.6B"
+audio_path = "assets/libai.wav"
+device = "cuda" if torch.cuda.is_available() else "cpu"
+torch_dtype = torch.float16 if device == "cuda" else torch.float32
+processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)
+tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
+model = AutoModelForCausalLM.from_pretrained(
+    model_path,
+    trust_remote_code=True,
+    torch_dtype=torch_dtype,
+    attn_implementation="sdpa",
+).to(device)
+conversation = [
+    {
+        "role": "user",
+        "content": [
+            {"type": "audio", "path": audio_path},
+            {"type": "text", "text": "Please transcribe this audio."},
+        ],
+    }
+]
+inputs = processor.apply_chat_template(
+    conversation,
+    add_generation_prompt=True,
+    return_tensors="pt",
+)
+inputs = inputs.to(device)
+if "audios" in inputs:
+    inputs["audios"] = inputs["audios"].to(dtype=torch_dtype)
+bad_words_ids = [[token_id] for token_id in tokenizer.all_special_ids if token_id != tokenizer.eos_token_id]
+outputs = model.generate(
+    **inputs,
+    do_sample=False,
+    max_new_tokens=256,
+    pad_token_id=tokenizer.pad_token_id,
+    eos_token_id=tokenizer.eos_token_id,
+    bad_words_ids=bad_words_ids,
+)
+decoded_outputs = tokenizer.batch_decode(
+    outputs[:, inputs.input_ids.shape[1] :],
+    skip_special_tokens=True,
+)
+print(decoded_outputs)
+```
+For batch JSONL inference, use the open-source inference code:
 ```bash
 git clone https://github.com/AutoArk/open-audio-opd
@@ -82,14 +138,12 @@ cd open-audio-opd
 pip install -e .
 ```
+The input JSONL should contain one ASR sample per line:
 ```json
 {"audio":"/path/to/audio.wav","text":"","task":"asr","begin_time":-1,"end_time":-1}
 ```
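As a rough sketch, an input file in this format could be generated from a list of audio paths using only the Python standard library. The helper name `write_asr_manifest` is hypothetical and not part of `open-audio-opd`; the field values simply mirror the sample record above, including the `-1` placeholders for `begin_time` and `end_time`.

```python
import json
from pathlib import Path


def write_asr_manifest(audio_paths, manifest_path):
    """Write one JSON object per line, mirroring the sample ASR record.

    Hypothetical helper for illustration; not part of the repository.
    """
    manifest_path = Path(manifest_path)
    with manifest_path.open("w", encoding="utf-8") as f:
        for audio in audio_paths:
            record = {
                "audio": str(audio),   # path to the audio file
                "text": "",            # left empty, as in the sample record
                "task": "asr",
                "begin_time": -1,      # copied from the sample record
                "end_time": -1,        # copied from the sample record
            }
            # ensure_ascii=False keeps non-ASCII paths readable in the file
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
    return manifest_path
```

Calling `write_asr_manifest(["/path/to/audio.wav"], "/path/to/input.jsonl")` would produce a one-line file that can be passed to `--input`.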
 ```bash
 python scripts/infer/ark_asr_transformers.py \
   --input /path/to/input.jsonl \
@@ -106,8 +160,6 @@ The output JSONL preserves input metadata and adds:
 - `pred_text`: cleaned prediction text for downstream evaluation
 - `pred_text_raw`: raw decoded generation before cleanup
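A minimal sketch for reading these fields back from the output file, assuming each line is a JSON object that carries the original `audio` field alongside `pred_text` and `pred_text_raw`. The function name `load_predictions` is hypothetical and not part of the repository.

```python
import json


def load_predictions(output_path):
    """Return (audio, pred_text, pred_text_raw) tuples from an output JSONL file.

    Hypothetical helper for illustration; missing fields come back as None.
    """
    rows = []
    with open(output_path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            record = json.loads(line)
            rows.append(
                (
                    record.get("audio"),
                    record.get("pred_text"),
                    record.get("pred_text_raw"),
                )
            )
    return rows
```

Comparing `pred_text` with `pred_text_raw` for a few samples is a quick way to see what the cleanup step changed before running evaluation.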
 ## Evaluation
 The repository also includes a CER/WER evaluation entrypoint: