aoiandroid ZHANGYUXUAN-zR committed
Commit 32ed653

Duplicate from zai-org/GLM-ASR-Nano-2512

Co-authored-by: zR <ZHANGYUXUAN-zR@users.noreply.huggingface.co>
.gitattributes ADDED
@@ -0,0 +1,35 @@
+ *.7z filter=lfs diff=lfs merge=lfs -text
+ *.arrow filter=lfs diff=lfs merge=lfs -text
+ *.bin filter=lfs diff=lfs merge=lfs -text
+ *.bz2 filter=lfs diff=lfs merge=lfs -text
+ *.ckpt filter=lfs diff=lfs merge=lfs -text
+ *.ftz filter=lfs diff=lfs merge=lfs -text
+ *.gz filter=lfs diff=lfs merge=lfs -text
+ *.h5 filter=lfs diff=lfs merge=lfs -text
+ *.joblib filter=lfs diff=lfs merge=lfs -text
+ *.lfs.* filter=lfs diff=lfs merge=lfs -text
+ *.mlmodel filter=lfs diff=lfs merge=lfs -text
+ *.model filter=lfs diff=lfs merge=lfs -text
+ *.msgpack filter=lfs diff=lfs merge=lfs -text
+ *.npy filter=lfs diff=lfs merge=lfs -text
+ *.npz filter=lfs diff=lfs merge=lfs -text
+ *.onnx filter=lfs diff=lfs merge=lfs -text
+ *.ot filter=lfs diff=lfs merge=lfs -text
+ *.parquet filter=lfs diff=lfs merge=lfs -text
+ *.pb filter=lfs diff=lfs merge=lfs -text
+ *.pickle filter=lfs diff=lfs merge=lfs -text
+ *.pkl filter=lfs diff=lfs merge=lfs -text
+ *.pt filter=lfs diff=lfs merge=lfs -text
+ *.pth filter=lfs diff=lfs merge=lfs -text
+ *.rar filter=lfs diff=lfs merge=lfs -text
+ *.safetensors filter=lfs diff=lfs merge=lfs -text
+ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
+ *.tar.* filter=lfs diff=lfs merge=lfs -text
+ *.tar filter=lfs diff=lfs merge=lfs -text
+ *.tflite filter=lfs diff=lfs merge=lfs -text
+ *.tgz filter=lfs diff=lfs merge=lfs -text
+ *.wasm filter=lfs diff=lfs merge=lfs -text
+ *.xz filter=lfs diff=lfs merge=lfs -text
+ *.zip filter=lfs diff=lfs merge=lfs -text
+ *.zst filter=lfs diff=lfs merge=lfs -text
+ *tfevents* filter=lfs diff=lfs merge=lfs -text
README.md ADDED
@@ -0,0 +1,125 @@
+ ---
+ license: mit
+ language:
+ - en
+ - zh
+ pipeline_tag: automatic-speech-recognition
+ library_name: transformers
+ ---
+ # GLM-ASR-Nano-2512
+
+ <div align="center">
+ <img src=https://raw.githubusercontent.com/zai-org/GLM-ASR/refs/heads/main/resources/logo.svg width="20%"/>
+ </div>
+ <p align="center">
+ 👋 Join our <a href="https://raw.githubusercontent.com/zai-org/GLM-ASR/refs/heads/main/resources/wechat.png" target="_blank">WeChat</a> community
+ </p>
+
+ ## Model Introduction
+
+ **GLM-ASR-Nano-2512** is a robust, open-source speech recognition model with **1.5B parameters**. Designed for
+ real-world complexity, it outperforms OpenAI Whisper V3 on multiple benchmarks while maintaining a compact size.
+
+ Key capabilities include:
+
+ * **Exceptional Dialect Support:**
+ Beyond standard Mandarin and English, the model is highly optimized for **Cantonese (粤语)** and other dialects,
+ effectively bridging the gap in dialectal speech recognition.
+
+ * **Low-Volume Speech Robustness:**
+ Specifically trained for **whisper/quiet speech** scenarios, it captures and accurately transcribes extremely
+ low-volume audio that traditional models often miss.
+
+ * **SOTA Performance:**
+ Achieves the **lowest average error rate (4.10)** among comparable open-source models, with significant advantages
+ on Chinese benchmarks (Wenet Meeting, Aishell-1, etc.).
+
+ ## Benchmark
+
+ We evaluated GLM-ASR-Nano against leading open-source and closed-source models. The results demonstrate that
+ **GLM-ASR-Nano (1.5B)** achieves superior performance, particularly in challenging acoustic environments.
+
+ ![Benchmark results](https://raw.githubusercontent.com/zai-org/GLM-ASR/refs/heads/main/resources/bench.png)
+
+ Notes:
+
+ - Wenet Meeting reflects real-world meeting scenarios with noise and overlapping speech.
+ - Aishell-1 is a standard Mandarin benchmark.
+
+ ## Inference
+
+ `GLM-ASR-Nano-2512` can be easily integrated using the `transformers` library.
+ We will support `transformers 5.x` as well as inference frameworks such as `vLLM` and `SGLang`.
+ You can find more example code on [GitHub](https://github.com/zai-org/GLM-ASR).
+
+ ### Transformers 🤗
+
+ Install `transformers` from source:
+ ```bash
+ pip install git+https://github.com/huggingface/transformers
+ ```
+
+ #### Basic Usage
+
+ ```python
+ from transformers import AutoModelForSeq2SeqLM, AutoProcessor
+
+ processor = AutoProcessor.from_pretrained("zai-org/GLM-ASR-Nano-2512")
+ model = AutoModelForSeq2SeqLM.from_pretrained("zai-org/GLM-ASR-Nano-2512", dtype="auto", device_map="auto")
+
+ inputs = processor.apply_transcription_request("https://huggingface.co/datasets/hf-internal-testing/dummy-audio-samples/resolve/main/bcn_weather.mp3")
+
+ inputs = inputs.to(model.device, dtype=model.dtype)
+ outputs = model.generate(**inputs, do_sample=False, max_new_tokens=500)
+
+ decoded_outputs = processor.batch_decode(outputs[:, inputs.input_ids.shape[1] :], skip_special_tokens=True)
+ print(decoded_outputs)
+ ```
+
+ #### Using Audio Arrays Directly
+
+ You can also use audio arrays directly:
+
+ ```python
+ from transformers import GlmAsrForConditionalGeneration, AutoProcessor
+ from datasets import load_dataset
+ from datasets import Audio
+
+ processor = AutoProcessor.from_pretrained("zai-org/GLM-ASR-Nano-2512")
+ model = GlmAsrForConditionalGeneration.from_pretrained("zai-org/GLM-ASR-Nano-2512", dtype="auto", device_map="auto")
+
+ # Load an audio array directly from a dataset, resampled to the model's rate
+ ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
+ ds = ds.cast_column("audio", Audio(sampling_rate=processor.feature_extractor.sampling_rate))
+ audio_array = ds[0]["audio"]["array"]
+
+ inputs = processor.apply_transcription_request(audio_array)
+
+ inputs = inputs.to(model.device, dtype=model.dtype)
+ outputs = model.generate(**inputs, do_sample=False, max_new_tokens=500)
+
+ decoded_outputs = processor.batch_decode(outputs[:, inputs.input_ids.shape[1] :], skip_special_tokens=True)
+ print(decoded_outputs)
+ ```
+
+ #### Batched Inference
+
+ You can process multiple audio files at once:
+
+ ```python
+ from transformers import GlmAsrForConditionalGeneration, AutoProcessor
+
+ processor = AutoProcessor.from_pretrained("zai-org/GLM-ASR-Nano-2512")
+ model = GlmAsrForConditionalGeneration.from_pretrained("zai-org/GLM-ASR-Nano-2512", dtype="auto", device_map="auto")
+
+ inputs = processor.apply_transcription_request([
+     "https://huggingface.co/datasets/hf-internal-testing/dummy-audio-samples/resolve/main/bcn_weather.mp3",
+     "https://huggingface.co/datasets/hf-internal-testing/dummy-audio-samples/resolve/main/obama.mp3",
+ ])
+
+ inputs = inputs.to(model.device, dtype=model.dtype)
+ outputs = model.generate(**inputs, do_sample=False, max_new_tokens=500)
+
+ decoded_outputs = processor.batch_decode(outputs[:, inputs.input_ids.shape[1] :], skip_special_tokens=True)
+ print(decoded_outputs)
+ ```
chat_template.jinja ADDED
@@ -0,0 +1,32 @@
+ {%- macro to_text(content) -%}
+ {%- if content is string -%}
+ {{- content -}}
+ {%- elif content is iterable and content is not mapping -%}
+ {%- for item in content -%}
+ {%- if item is mapping and item.type == 'text' and item.text is defined -%}
+ {{- item.text -}}
+ {%- elif item is mapping and (item.type == 'audio' or 'audio' in item) -%}
+ <|begin_of_audio|><|pad|><|end_of_audio|><|user|>
+ {% elif item is string -%}
+ {{- item -}}
+ {%- endif -%}
+ {%- endfor -%}
+ {%- else -%}
+ {{- content -}}
+ {%- endif -%}
+ {%- endmacro -%}
+ {%- for m in messages -%}
+ {%- if m.role == 'system' -%}
+ <|system|>
+ {{ to_text(m.content) | trim }}
+ {%- elif m.role == 'user' -%}
+ <|user|>
+ {{ to_text(m.content) | trim }}
+ {%- elif m.role == 'assistant' -%}
+ <|assistant|>
+ {{ to_text(m.content) | trim }}
+ {%- endif -%}
+ {%- endfor -%}
+ {%- if add_generation_prompt -%}
+ <|assistant|>
+ {% endif -%}
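To make the template's whitespace control concrete, here is a minimal sketch that renders it with `jinja2` directly. The template string below is trimmed to the user/assistant branches for brevity, and the audio payload is a hypothetical placeholder; in practice `processor.apply_chat_template` applies the full template and also prepares audio features.

```python
# Sketch: render a trimmed copy of the chat template with plain jinja2.
# Assumption: message dicts follow the {"type": "text"/"audio", ...} shape
# the macro above checks for; "placeholder.wav" is a made-up value.
from jinja2 import Template

chat_template = """\
{%- macro to_text(content) -%}
{%- if content is string -%}
{{- content -}}
{%- elif content is iterable and content is not mapping -%}
{%- for item in content -%}
{%- if item is mapping and item.type == 'text' and item.text is defined -%}
{{- item.text -}}
{%- elif item is mapping and (item.type == 'audio' or 'audio' in item) -%}
<|begin_of_audio|><|pad|><|end_of_audio|><|user|>
{% elif item is string -%}
{{- item -}}
{%- endif -%}
{%- endfor -%}
{%- else -%}
{{- content -}}
{%- endif -%}
{%- endmacro -%}
{%- for m in messages -%}
{%- if m.role == 'user' -%}
<|user|>
{{ to_text(m.content) | trim }}
{%- endif -%}
{%- endfor -%}
{%- if add_generation_prompt -%}
<|assistant|>
{% endif -%}"""

messages = [{
    "role": "user",
    "content": [
        {"type": "audio", "audio": "placeholder.wav"},
        {"type": "text", "text": "Please transcribe this audio into text"},
    ],
}]

prompt = Template(chat_template).render(messages=messages, add_generation_prompt=True)
# The audio item collapses to the fixed <|begin_of_audio|><|pad|><|end_of_audio|>
# span, which the processor later expands to one <|pad|> slot per audio frame.
print(prompt)
```

Note the `{%- ... -%}` markers: they strip surrounding whitespace so that only the role tokens and content reach the model, with no stray blank lines between turns.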
config.json ADDED
@@ -0,0 +1,60 @@
+ {
+   "architectures": [
+     "GlmAsrForConditionalGeneration"
+   ],
+   "audio_config": {
+     "attention_dropout": 0.0,
+     "head_dim": 64,
+     "hidden_act": "gelu",
+     "hidden_size": 1280,
+     "initializer_range": 0.02,
+     "intermediate_size": 5120,
+     "max_position_embeddings": 1500,
+     "model_type": "glmasr_encoder",
+     "num_attention_heads": 20,
+     "num_hidden_layers": 32,
+     "num_key_value_heads": 20,
+     "num_mel_bins": 128,
+     "partial_rotary_factor": 0.5,
+     "rope_parameters": {
+       "partial_rotary_factor": 0.5,
+       "rope_theta": 10000.0,
+       "rope_type": "default"
+     }
+   },
+   "audio_token_id": 59260,
+   "dtype": "bfloat16",
+   "hidden_size": 2048,
+   "model_type": "glmasr",
+   "projector_hidden_act": "gelu",
+   "text_config": {
+     "attention_bias": false,
+     "attention_dropout": 0.0,
+     "eos_token_id": [
+       59246,
+       59253,
+       59255
+     ],
+     "head_dim": 128,
+     "hidden_act": "silu",
+     "hidden_size": 2048,
+     "initializer_range": 0.02,
+     "intermediate_size": 6144,
+     "max_position_embeddings": 8192,
+     "mlp_bias": false,
+     "model_type": "llama",
+     "num_attention_heads": 16,
+     "num_hidden_layers": 28,
+     "num_key_value_heads": 4,
+     "pretraining_tp": 1,
+     "rms_norm_eps": 1e-05,
+     "rope_parameters": {
+       "rope_theta": 10000.0,
+       "rope_type": "default"
+     },
+     "use_cache": true,
+     "vocab_size": 59264
+   },
+   "transformers_version": "5.0.0.dev0",
+   "vocab_size": 59264
+ }
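As a quick sanity check on the shapes declared above, the following sketch verifies that in both sub-configs `num_attention_heads * head_dim` matches `hidden_size` (values copied verbatim from the JSON). The grouped-query-attention reading of the 16:4 head ratio is an inference from the head counts, not something the file states.

```python
# Consistency check of the attention shapes in config.json.
# Values are copied from the config above; nothing is loaded from the Hub.
audio_cfg = {"hidden_size": 1280, "num_attention_heads": 20, "head_dim": 64}
text_cfg = {"hidden_size": 2048, "num_attention_heads": 16, "head_dim": 128}

for name, cfg in [("audio", audio_cfg), ("text", text_cfg)]:
    # Heads times per-head width must tile the model dimension exactly.
    assert cfg["num_attention_heads"] * cfg["head_dim"] == cfg["hidden_size"], name

# Text decoder: 16 query heads over 4 key/value heads suggests each KV head
# is shared by 4 query heads (grouped-query attention).
queries_per_kv = 16 // 4
print(queries_per_kv)  # 4
```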
generation_config.json ADDED
@@ -0,0 +1,10 @@
+ {
+   "_from_model_config": true,
+   "bos_token_id": 1,
+   "eos_token_id": [
+     59246,
+     59253,
+     59255
+   ],
+   "transformers_version": "5.0.0.dev0"
+ }
model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:b8af83ccf6b34dfc7921cedcc46d4a6dc6aaffa661b8f71b44e3a2ff60a90a91
+ size 4515776712
processor_config.json ADDED
@@ -0,0 +1,20 @@
+ {
+   "audio_token": "<|pad|>",
+   "default_transcription_prompt": "Please transcribe this audio into text",
+   "feature_extractor": {
+     "chunk_length": 30,
+     "dither": 0.0,
+     "feature_extractor_type": "WhisperFeatureExtractor",
+     "feature_size": 128,
+     "hop_length": 160,
+     "n_fft": 400,
+     "n_samples": 480000,
+     "nb_max_frames": 3000,
+     "padding_side": "right",
+     "padding_value": 0.0,
+     "return_attention_mask": false,
+     "sampling_rate": 16000
+   },
+   "max_audio_len": 655,
+   "processor_class": "GlmAsrProcessor"
+ }
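The feature-extractor settings above describe Whisper-style 30-second chunks, and the numbers are internally consistent; a quick arithmetic check with the values copied from the JSON:

```python
# Check that the Whisper-style feature extractor values fit together:
# 480000 samples at 16 kHz is a 30 s chunk, and a 160-sample hop over
# that chunk yields the 3000 mel frames declared as nb_max_frames.
sampling_rate = 16000   # Hz
n_samples = 480000      # samples per padded chunk
hop_length = 160        # samples between adjacent mel frames
chunk_length = 30       # seconds

assert n_samples == chunk_length * sampling_rate  # 30 s of 16 kHz audio
assert n_samples // hop_length == 3000            # matches nb_max_frames
print(n_samples // hop_length)  # 3000
```

Each mel frame therefore covers 10 ms of audio (160 / 16000 s), the standard Whisper frame rate.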
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1,37 @@
+ {
+   "backend": "tokenizers",
+   "clean_up_tokenization_spaces": false,
+   "do_lower_case": false,
+   "eos_token": "<|endoftext|>",
+   "extra_special_tokens": [
+     "<|endoftext|>",
+     "[MASK]",
+     "[gMASK]",
+     "[sMASK]",
+     "<sop>",
+     "<eop>",
+     "<|system|>",
+     "<|user|>",
+     "<|assistant|>",
+     "<|observation|>",
+     "<|begin_of_image|>",
+     "<|end_of_image|>",
+     "<|begin_of_video|>",
+     "<|end_of_video|>",
+     "<|pad|>",
+     "<|begin_of_audio|>",
+     "<|end_of_audio|>"
+   ],
+   "is_local": false,
+   "model_input_names": [
+     "input_ids",
+     "attention_mask"
+   ],
+   "model_max_length": 65536,
+   "model_specific_special_tokens": {},
+   "pad_token": "<|endoftext|>",
+   "padding_side": "left",
+   "processor_class": "GlmAsrProcessor",
+   "remove_space": false,
+   "tokenizer_class": "TokenizersBackend"
+ }