How to use ZurichNLP/SwissBERT-CS with Transformers:

```python
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("token-classification", model="ZurichNLP/SwissBERT-CS")

# Load model directly
from transformers import AutoTokenizer, AutoModelForTokenClassification

tokenizer = AutoTokenizer.from_pretrained("ZurichNLP/SwissBERT-CS")
model = AutoModelForTokenClassification.from_pretrained("ZurichNLP/SwissBERT-CS")
```
---
language:
- gsw
- en
- fr
- it
license: mit
library_name: transformers
pipeline_tag: token-classification
base_model: ZurichNLP/swissbert
tags:
- code-switching
- language-identification
- child-speech
- multilingual
---

# SwissBERT for Token-Level Language Identification in Multilingual Child Speech

This repository contains a fine-tuned version of **SwissBERT** for word-level language identification in multilingual child–caregiver interactions.

The model predicts a language label for each word and supports downstream analyses such as:

- Inter-sentential code-switching
- Intra-sentential code-switching
- Cross-speaker switching
- Switch-point detection
- Multilingual child speech profiling

The model was trained on manually annotated child speech transcripts containing Swiss German, English, French, Italian, and an “other” category.
Because Swiss German child speech data is limited, the model was partially trained on the **SwissDial dataset** to improve Swiss German coverage.
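
As an illustration of how per-word labels can feed switch-point detection, the sketch below scans a label sequence and reports the positions where the language changes. The function name `find_switch_points` is hypothetical, and skipping words labeled `other` is an assumption made for this example, not a rule stated by the authors.

```python
from typing import List, Tuple


def find_switch_points(labels: List[str], ignore: Tuple[str, ...] = ("other",)) -> List[int]:
    """Return indices of words whose language differs from the previous labeled word."""
    switches = []
    previous = None
    for index, label in enumerate(labels):
        if label in ignore:
            continue  # assumption: words without a clear language do not count as switches
        if previous is not None and label != previous:
            switches.append(index)
        previous = label
    return switches


# Labels for "Das isch good gäll", taken from the training data example in this card.
print(find_switch_points(["gsw", "gsw", "eng", "gsw"]))  # -> [2, 3]
```

A sequence with more than one language and at least one switch point is a candidate for intra-sentential code-switching. Comparing the dominant label of adjacent sentences would be one way to approach the inter-sentential case.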

---

## Model Description

- **Base model:** `ZurichNLP/swissbert` (X-MOD architecture, a modular variant of XLM-RoBERTa)
- **Task:** Token classification (word-level language ID)
- **Labels:** Swiss German (`gsw`), English (`eng`), French (`fra`), Italian (`ita`), Other (`other`); the label mapping used in training also defines `deu`
- **Tokenizer:** SentencePiece (slow tokenizer), extended with:
  - `<medium>`
  - `<year>`
  - `<month>`

The model is designed for:

- Child multilingualism research
- Code-switching analysis
- Annotation pipelines
- Automatic language tagging in naturalistic child speech

---

## Training Data

The training dataset is a tab-separated file with the following structure:

| sentence_id | token | label |
|-------------|-------|-------|
| 12 | Das | gsw |
| 12 | isch | gsw |
| 12 | good | eng |
| 12 | gäll | gsw |

Tokens are grouped by `sentence_id` to form sequences for token-level classification.

---

## Training Pipeline

The model was trained using the Hugging Face **Trainer API**.

### 1. Load labeled data

- Read the TSV file with columns `(sentence_id, token, label)`
- Remove rows with an empty token or label
- Normalize labels to lowercase
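
The loading step above can be sketched with the standard library alone. This is a minimal example, not the authors' script: the function name `read_labeled_rows` is hypothetical, and it assumes the file has a header row with the column names `sentence_id`, `token`, and `label`.

```python
import csv
import io


def read_labeled_rows(handle):
    """Read (sentence_id, token, label) rows, dropping empty fields and lowercasing labels."""
    # QUOTE_NONE keeps quotation marks inside transcribed tokens untouched.
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    rows = []
    for row in reader:
        token = (row.get("token") or "").strip()
        label = (row.get("label") or "").strip().lower()
        if not token or not label:
            continue  # skip rows with an empty token or label
        rows.append((row["sentence_id"].strip(), token, label))
    return rows


# In-memory sample standing in for the real TSV file.
sample = "sentence_id\ttoken\tlabel\n12\tDas\tGSW\n12\t\tgsw\n12\tgood\tENG\n"
rows = read_labeled_rows(io.StringIO(sample))
print(rows)  # -> [('12', 'Das', 'gsw'), ('12', 'good', 'eng')]
```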

---

### 2. Group tokens into sentences

Tokens and labels are grouped by `sentence_id` to form input sequences.
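
A minimal sketch of the grouping step, assuming the rows are `(sentence_id, token, label)` tuples in file order. The helper name `group_by_sentence` is hypothetical.

```python
def group_by_sentence(rows):
    """Group (sentence_id, token, label) rows into parallel token and label lists."""
    sentences = {}  # dicts keep insertion order, so sentences stay in file order
    for sentence_id, token, label in rows:
        tokens, labels = sentences.setdefault(sentence_id, ([], []))
        tokens.append(token)
        labels.append(label)
    return list(sentences.values())


rows = [
    ("12", "Das", "gsw"),
    ("12", "isch", "gsw"),
    ("12", "good", "eng"),
    ("12", "gäll", "gsw"),
]
print(group_by_sentence(rows))
# -> [(['Das', 'isch', 'good', 'gäll'], ['gsw', 'gsw', 'eng', 'gsw'])]
```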

---

### 3. Build label mappings

```python
# Map each language label to the integer class ID used by the classification head.
label2id = {
    "gsw": 0,
    "deu": 1,
    "eng": 2,
    "fra": 3,
    "ita": 4,
    "other": 5,
}
# Inverse mapping, used to turn predicted class IDs back into labels.
id2label = {v: k for k, v in label2id.items()}
```

### 4. Tokenization and label alignment

Because SwissBERT uses SentencePiece, each word may be split into multiple subword units.
The slow tokenizer does not expose `word_ids()`, so alignment was implemented manually:

- First subword receives the label
- Remaining subwords receive `-100` (ignored in loss)
- CLS and SEP tokens also receive `-100`
- Sequences padded/truncated to `MAX_LENGTH = 128`
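
The alignment rules above can be sketched as follows. This is an illustrative reconstruction, not the authors' code: `align_labels` and `toy_tokenize` are hypothetical names, the toy splitter only stands in for the real tokenizer's `tokenize` method, and giving padding positions `-100` is an assumption.

```python
IGNORE_INDEX = -100  # label value skipped by the cross-entropy loss


def align_labels(words, word_labels, tokenize, label2id, max_length=128):
    """Label the first subword of each word; mask all other positions with -100."""
    aligned = []
    for word, label in zip(words, word_labels):
        pieces = tokenize(word)
        if not pieces:
            continue  # a word that yields no subwords contributes no positions
        aligned.append(label2id[label])
        aligned.extend([IGNORE_INDEX] * (len(pieces) - 1))
    aligned = aligned[: max_length - 2]  # leave room for CLS and SEP
    aligned = [IGNORE_INDEX] + aligned + [IGNORE_INDEX]
    aligned += [IGNORE_INDEX] * (max_length - len(aligned))  # assumption: padding is masked too
    return aligned


def toy_tokenize(word):
    """Hypothetical stand-in for tokenizer.tokenize: split into chunks of 3 characters."""
    return [word[i:i + 3] for i in range(0, len(word), 3)]


label2id = {"gsw": 0, "deu": 1, "eng": 2, "fra": 3, "ita": 4, "other": 5}
words = ["Das", "isch", "good", "gäll"]
labels = ["gsw", "gsw", "eng", "gsw"]
print(align_labels(words, labels, toy_tokenize, label2id, max_length=12))
# -> [-100, 0, 0, -100, 2, -100, 0, -100, -100, -100, -100, -100]
```

With the real tokenizer, `tokenize` would be the tokenizer's own method, and the input IDs would need to be truncated and padded in the same way so that both sequences stay the same length.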

---

### 5. Training Configuration

- **Epochs:** 5
- **Batch size:** 8
- **Learning rate:** 5e-5
- **Weight decay:** 0.01
- **Evaluation:** every epoch
- **Metric:** F1 (seqeval)
- **Best model selection:** enabled
- **Tokenizer:** slow SentencePiece tokenizer
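
One way these settings could map onto `TrainingArguments` keyword arguments is sketched below. The mapping is an assumption: the card does not say whether the batch size is per device, which save strategy was used, or what the metric key is called, and `output_dir` in the commented usage is a placeholder.

```python
# Hyperparameters from the list above, keyed by TrainingArguments argument names.
training_config = {
    "num_train_epochs": 5,
    "per_device_train_batch_size": 8,  # assumption: "batch size" means per device
    "learning_rate": 5e-5,
    "weight_decay": 0.01,
    "eval_strategy": "epoch",  # called evaluation_strategy in older transformers releases
    "save_strategy": "epoch",  # assumption: must match the evaluation strategy
    "load_best_model_at_end": True,
    "metric_for_best_model": "f1",  # assumption: key returned by compute_metrics
}

# Hypothetical usage, requires the transformers library:
# from transformers import TrainingArguments
# args = TrainingArguments(output_dir="swissbert-cs-output", **training_config)
print(training_config["learning_rate"])
```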
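
---

### 6. Model Setup

```python
from transformers import AutoModelForTokenClassification

MODEL_NAME = "ZurichNLP/swissbert"  # base model named in this card

# Attach a token-classification head sized to the label set from step 3.
model = AutoModelForTokenClassification.from_pretrained(
    MODEL_NAME,
    num_labels=len(label2id),
    id2label=id2label,
    label2id=label2id,
)
```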

### 7. Metrics

Evaluation uses seqeval, reporting:

- token-level F1
- per-label precision and recall
- full classification report printed during training
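
Before scoring, predictions have to be reduced to the positions that carry a real label. The sketch below shows that step under stated assumptions: `decode_for_seqeval` is a hypothetical helper, and the inputs are plain nested lists of class IDs rather than the arrays the Trainer passes to its metrics callback.

```python
def decode_for_seqeval(pred_ids, gold_ids, id2label, ignore_index=-100):
    """Drop masked positions and convert class IDs to label strings, one list per sentence."""
    pred_tags, gold_tags = [], []
    for pred_row, gold_row in zip(pred_ids, gold_ids):
        # Keep only positions with a real gold label (the first subword of each word).
        kept = [(p, g) for p, g in zip(pred_row, gold_row) if g != ignore_index]
        pred_tags.append([id2label[p] for p, _ in kept])
        gold_tags.append([id2label[g] for _, g in kept])
    return pred_tags, gold_tags


id2label = {0: "gsw", 1: "deu", 2: "eng", 3: "fra", 4: "ita", 5: "other"}
gold = [[-100, 0, 0, -100, 2, -100]]
pred = [[0, 0, 1, 2, 2, 0]]
print(decode_for_seqeval(pred, gold, id2label))
# -> ([['gsw', 'deu', 'eng']], [['gsw', 'gsw', 'eng']])
```

The two lists can then be passed to `seqeval.metrics.f1_score` and `seqeval.metrics.classification_report`, which both take the gold sequences first and the predicted sequences second.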

## References

Agnese D'Angelo, Sina Ahmadi, Moritz M. Daum, and Stephanie Wermelinger. 2026. Code-Switching Detection in Multilingual Child Speech with SwissBERT. In *Proceedings of the 11th Swiss Text Analytics Conference (SwissText 2026)*, Zurich, Switzerland.