Update inference examples from open-audio-opd
README.md
CHANGED
@@ -74,7 +74,63 @@ The following results are from the `open-audio-opd` evaluation. Lower CER/WER is
 
 ## Inference
 
-
+Run ASR inference with Hugging Face Transformers:
+
+```python
+import torch
+from transformers import AutoModelForCausalLM, AutoProcessor, AutoTokenizer
+
+model_path = "AutoArk-AI/ARK-ASR-0.6B"
+audio_path = "assets/libai.wav"
+
+device = "cuda" if torch.cuda.is_available() else "cpu"
+torch_dtype = torch.float16 if device == "cuda" else torch.float32
+
+processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)
+tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
+model = AutoModelForCausalLM.from_pretrained(
+    model_path,
+    trust_remote_code=True,
+    torch_dtype=torch_dtype,
+    attn_implementation="sdpa",
+).to(device)
+
+conversation = [
+    {
+        "role": "user",
+        "content": [
+            {"type": "audio", "path": audio_path},
+            {"type": "text", "text": "Please transcribe this audio."},
+        ],
+    }
+]
+
+inputs = processor.apply_chat_template(
+    conversation,
+    add_generation_prompt=True,
+    return_tensors="pt",
+)
+inputs = inputs.to(device)
+if "audios" in inputs:
+    inputs["audios"] = inputs["audios"].to(dtype=torch_dtype)
+
+bad_words_ids = [[token_id] for token_id in tokenizer.all_special_ids if token_id != tokenizer.eos_token_id]
+outputs = model.generate(
+    **inputs,
+    do_sample=False,
+    max_new_tokens=256,
+    pad_token_id=tokenizer.pad_token_id,
+    eos_token_id=tokenizer.eos_token_id,
+    bad_words_ids=bad_words_ids,
+)
+decoded_outputs = tokenizer.batch_decode(
+    outputs[:, inputs.input_ids.shape[1] :],
+    skip_special_tokens=True,
+)
+print(decoded_outputs)
+```
+
+For batch JSONL inference, use the open-source inference code:
 
 ```bash
 git clone https://github.com/AutoArk/open-audio-opd
@@ -82,14 +138,12 @@ cd open-audio-opd
 pip install -e .
 ```
 
-
+The input JSONL should contain one ASR sample per line:
 
 ```json
 {"audio":"/path/to/audio.wav","text":"","task":"asr","begin_time":-1,"end_time":-1}
 ```
 
-Run inference:
-
 ```bash
 python scripts/infer/ark_asr_transformers.py \
   --input /path/to/input.jsonl \
@@ -106,8 +160,6 @@ The output JSONL preserves input metadata and adds:
 - `pred_text`: cleaned prediction text for downstream evaluation
 - `pred_text_raw`: raw decoded generation before cleanup
 
-For longer audio, adjust `--max_audio_seconds`. For CPU inference, use `--dtype float32` and `--attn_impl eager`.
-
 ## Evaluation
 
 The repository also includes a J/WER evaluation entrypoint:
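
The README text in this diff describes an input JSONL with one ASR sample per line and an output JSONL that keeps the input fields and adds `pred_text` and `pred_text_raw`. A minimal sketch of preparing and consuming those files might look like the following. The helper names `write_manifest` and `read_predictions` are hypothetical and not part of `open-audio-opd`; the reading of `text` as a reference transcript and of `-1` as "use the whole file" is an assumption based only on the example record shown in the diff.

```python
import json


def write_manifest(audio_paths, manifest_path):
    """Write one ASR sample per line, using the fields from the README example record."""
    with open(manifest_path, "w", encoding="utf-8") as f:
        for audio in audio_paths:
            record = {
                "audio": str(audio),
                "text": "",        # assumed: reference transcript, left empty when unknown
                "task": "asr",
                "begin_time": -1,  # assumed: -1 means no segment start is given
                "end_time": -1,    # assumed: -1 means no segment end is given
            }
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_predictions(output_path):
    """Yield (audio, pred_text) pairs from an output JSONL file, skipping blank lines."""
    with open(output_path, encoding="utf-8") as f:
        for line in f:
            if line.strip():
                record = json.loads(line)
                # Input metadata is preserved, so "audio" should still be present.
                yield record["audio"], record.get("pred_text", "")
```

The file written by `write_manifest` would then be passed to `scripts/infer/ark_asr_transformers.py` via `--input`, and `read_predictions` can iterate over whatever output file that script produces.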