| | --- |
| | language: |
| | - en |
| | - as |
| | - bn |
| | - brx |
| | - doi |
| | - gu |
| | - hi |
| | - kn |
| | - ks |
| | - kok |
| | - mai |
| | - ml |
| | - mni |
| | - mr |
| | - ne |
| | - or |
| | - pa |
| | - sa |
| | - sat |
| | - sd |
| | - ta |
| | - te |
| | - ur |
| | license: mit |
| | tags: |
| | - english |
| | - punctuation-restoration |
| | - multilingual |
| | - indic-languages |
| | - ai4bharat |
| | datasets: |
| | - ai4bharat/sangraha |
| | - HuggingFaceFW/fineweb-2 |
| | - ai4bharat/indicvoices_r |
| | - ai4bharat/IndicCorpV2 |
| | metrics: |
| | - f1 |
| | pipeline_tag: token-classification |
| | library_name: cadence-punctuation |
| | base_model: |
| | - google/gemma-3-270m |
| | widget: |
| | - text: hello world how are you today |
| | example_title: English Punctuation |
| | - text: यह एक हिंदी वाक्य है |
| | example_title: Hindi Punctuation |
| | - text: cadence is a great model for punctuation |
| | example_title: Another English Example |
| | --- |
| | |
| | # Cadence-Fast |
| |
|
| | A multilingual punctuation restoration model based on Gemma-3-270M. Cadence-Fast supports English and 22 other Indian langauges. |
| |
|
| | Cadence-Fast is distilled from [ai4bharat/Cadence](https://huggingface.co/ai4bharat/Cadence) for punctuation prediction on English and Indic languages text. |
| | Cadence-Fast achieves over 93.8% performance of Cadence at a much faster speed. |
| |
|
| | <a href="https://arxiv.org/abs/2506.03793" target="_blank" rel="noopener noreferrer" style="text-decoration: none; color: inherit;"> |
| | <span style="display: inline-flex; align-items: center; gap: 0.3em;"> |
| | <img src="https://huggingface.co/ai4bharat/Cadence/resolve/main/arxiv_logo.svg" alt="arXiv" style="height: 1em;"> |
| | <span>Mark My Words: A Robust Multilingual Model for Punctuation in Text and Speech Transcripts</span> |
| | </span> |
| | </a> |
| | |
| | ## Model Description |
| |
|
| | - **Model Type**: Token Classification (Punctuation Prediction) |
| | - **Base Model**: Gemma-3-270M |
| | - **Languages**: English + 22 Indic Languages |
| | - **Task**: Automatic punctuation restoration |
| |
|
| | ## Installation (Optional) |
| | Python package has features such as sliding-window decoding, (rule-based) capitalisation of English text and some (rule-based) corrections for the errors made by the model. |
| |
|
| | ```bash |
| | pip install cadence-punctuation |
| | ``` |
| |
|
| | ## Quick Start |
| |
|
| | ### Using the python package (Recommended) |
| |
|
| | ```python |
| | from Cadence import PunctuationModel |
| | |
| | # Load model from local path or downloads at the specified directory |
| | model = PunctuationModel(model="Cadence-Fast","path/to/model") |
| | |
| | # Punctuate single text |
| | text = "hello world how are you today" |
| | result = model.punctuate([text]) |
| | print(result[0]) # "Hello world, how are you today?" |
| | |
| | # Punctuate multiple texts |
| | texts = [ |
| | "hello world how are you", |
| | "this is another test sentence", |
| | "यह एक हिंदी वाक्य है" # Hindi example |
| | ] |
| | results = model.punctuate(texts, batch_size=8) |
| | for original, punctuated in zip(texts, results): |
| | print(f"Original: {original}") |
| | print(f"Punctuated: {punctuated}") |
| | print() |
| | ``` |
| |
|
| | ### Using AutoModel |
| |
|
| | ```python |
| | from transformers import AutoTokenizer, AutoModel |
| | import torch |
| | # Load model and tokenizer |
| | model_name = "ai4bharat/Cadence-Fast" |
| | tokenizer = AutoTokenizer.from_pretrained(model_name) |
| | model = AutoModel.from_pretrained(model_name, trust_remote_code=True) |
| | id2label = model.config.id2label |
| | text = "यह एक वाक्य है इसका क्या मतलब है" |
| | # text = "this is a test sentence what do you think" |
| | # Tokenize input and prepare for model |
| | inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True) |
| | input_ids = inputs['input_ids'][0] # Get input_ids for the first (and only) sentence |
| | with torch.no_grad(): |
| | outputs = model(**inputs) |
| | predictions_for_sentence = torch.argmax(outputs.logits, dim=-1)[0] |
| | result_tokens_and_punctuation = [] |
| | all_token_strings = tokenizer.convert_ids_to_tokens(input_ids.tolist()) # Get all token strings |
| | for i, token_id_value in enumerate(input_ids.tolist()): |
| | # Process only non-padding tokens based on the attention mask |
| | if inputs['attention_mask'][0][i] == 0: |
| | continue |
| | current_token_string = all_token_strings[i] |
| | is_special_token = token_id_value in tokenizer.all_special_ids |
| | |
| | if not is_special_token: |
| | result_tokens_and_punctuation.append(current_token_string) |
| | |
| | predicted_punctuation_id = predictions_for_sentence[i].item() |
| | punctuation_character = id2label[predicted_punctuation_id] |
| | if punctuation_character != "O" and not is_special_token: |
| | result_tokens_and_punctuation.append(punctuation_character) |
| | punctuated_text = tokenizer.convert_tokens_to_string(result_tokens_and_punctuation) |
| | print(f"Original Text: {text}") |
| | print(f"Punctuated Text: {punctuated_text}") |
| | ``` |
| |
|
| | ## Officially Supported Languages |
| | - English, Assamese, Bengali, Bodo, Dogri, Gujarati, Hindi, Kannada, Kashmiri, Konkani, Maithili, Malayalam, Manipuri, Marathi, Nepali, Odia, Punjabi, Sanskrit, Santali, Sindhi, Tamil, Telugu, Urdu |
| |
|
| | Tokenizer doesn't support Manipuri's Meitei script well. The model can punctuate if the text is transliterated to Bengali's script. |
| |
|
| | One can try using this model for languages not listed above. Performance may vary. |
| |
|
| | ## Supported Punctuation |
| | The model can predict the following punctuation marks: |
| | - Period (.) |
| | - Comma (,) |
| | - Question mark (?) |
| | - Exclamation mark (!) |
| | - Semicolon (;) |
| | - Colon (:) |
| | - Hyphen (-) |
| | - Quotes (" and ') |
| | - Ellipse (...) |
| | - Parentheses () |
| | - Hindi Danda (।) |
| | - Urdu punctuation (۔، ؟) |
| | - Arabic punctuation (٬ ،) |
| | - Santali punctuation (᱾ ᱾।) |
| | - Sanskrit punctuation (॥) |
| | - And various combinations |
| |
|
| | ## Configuration Options for cadence-puncuation |
| |
|
| | ### PunctuationModel Parameters |
| | All the parameters are optional to pass. |
| | - `model`: Can be choosen between "Cadence" (based on Gemma-3-1B) and "Cadence-Fast" (based on Gemma-3-270M). (default: "Cadence") |
| | - `model_path`: Path to a local directory where model weights will be downloaded to and cached, or from which pre-downloaded weights will be loaded. If None, weights downloaded to default HuggingFace cache location. |
| | - `gpu_id`: Specific GPU device ID to use (e.g., 0, 1). If None, the model will attempt to auto-detect and use an available GPU. This parameter is ignored if cpu is True. (default: None) |
| | - `cpu`: If True, forces the model to run on the CPU, even if a GPU is available. (default: False) |
| | - `max_length`: Maximum sequence length the model can process at once. If sliding_window is True, this value is used as the width of each sliding window. If sliding_window is False, texts longer than max_length will be truncated. (default: 300) |
| | - `attn_implementation`: The attention implementation to use. (default: "eager") |
| | - `sliding_window`: If True, enables sliding window mechanism to process texts longer than max_length. The text is split into overlapping chunks of max_length. If False, texts longer than max_length are truncated. (default: True) |
| | - `verbose`: Enable verbose logging. (default: False) |
| | - `d_type`: Precision with which weights are loaded. (default: bfloat16) |
| | - `batch_size` ((for punctuate() method)): Batch size to use. (default: 8) |
| |
|
| | ```python |
| | # Custom configuration |
| | model = PunctuationModel( |
| | model="Cadence-Fast" |
| | model_path="path/to/download/weights", |
| | gpu_id=0, # Use specific GPU |
| | max_length=512, # length for trunation; also used as window size when sliding_window=True |
| | attn_implementation="flash_attention_2", |
| | sliding_window=True, # Handle long texts |
| | verbose=False, # Quiet mode |
| | d_type="bfloat16" |
| | ) |
| | batch_size=32 |
| | # Process long texts with sliding window |
| | long_text = "Your very long text here..." * 100 |
| | short_text = "a short text" |
| | result = model.punctuate([long_text, short_text],batch_size=batch_size) |
| | ``` |
| |
|
| | ## License |
| | MIT License |
| |
|
| |
|