---
language:
  - gsw
  - en
  - fr
  - it
license: mit
library_name: transformers
pipeline_tag: token-classification
base_model: ZurichNLP/swissbert
tags:
  - code-switching
  - language-identification
  - child-speech
  - multilingual
---

# SwissBERT for Token-Level Language Identification in Multilingual Child Speech

This repository contains a fine-tuned version of **SwissBERT** for word-level language identification in multilingual child–caregiver interactions.  
The model predicts a language label for each word and supports downstream analyses such as:

- Inter-sentential code-switching  
- Intra-sentential code-switching  
- Cross-speaker switching  
- Switch-point detection  
- Multilingual child speech profiling  

The model was trained on manually annotated child speech transcripts containing Swiss German, English, French, Italian, and an “other” category.  
Because Swiss German child speech data is limited, the model was partially trained on the **SwissDial dataset** to improve Swiss German coverage.
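As an illustration of the switch-point analysis listed above, here is a minimal sketch that locates language switches in a sequence of predicted word labels. The function name `find_switch_points` and the choice to skip `other` labels are assumptions made for this example, not part of the released code.

```python
def find_switch_points(labels, ignore=("other",)):
    """Return the indices of words whose language differs from the previous counted word."""
    switches = []
    previous = None
    for index, label in enumerate(labels):
        if label in ignore:
            continue  # assumption: "other" words neither start nor end a switch
        if previous is not None and label != previous:
            switches.append(index)
        previous = label
    return switches


labels = ["gsw", "gsw", "eng", "gsw"]  # labels for "Das isch good gäll"
print(find_switch_points(labels))  # [2, 3]
```

Applied per utterance, this gives intra-sentential switches. Comparing the last label of one utterance with the first label of the next would cover the inter-sentential case.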

---

## Model Description

- **Base model:** `ZurichNLP/swissbert` (X-MOD architecture: an XLM-R-style encoder with language adapters)  
- **Task:** Token classification (word-level language ID)  
- **Labels:** Swiss German (`gsw`), German (`deu`), English (`eng`), French (`fra`), Italian (`ita`), Other (`other`)  
- **Tokenizer:** SentencePiece (slow tokenizer), extended with:
  - `<medium>`
  - `<year>`
  - `<month>`

The model is designed for:
- Child multilingualism research  
- Code-switching analysis  
- Annotation pipelines  
- Automatic language tagging in naturalistic child speech  

---

## Training Data

The training dataset is a tab-separated file with the following structure:

| sentence_id | token | label |
|-------------|-------|-------|
| 12          | Das   | gsw   |
| 12          | isch  | gsw   |
| 12          | good  | eng   |
| 12          | gäll  | gsw   |

Tokens are grouped by `sentence_id` to form sequences for token-level classification.

---

## Training Pipeline

The model was trained using the Hugging Face **Trainer API**.

### 1. Load labeled data
- Read TSV file with `(sentence_id, token, label)`
- Remove empty tokens and labels
- Normalize labels to lowercase
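
The loading step can be sketched with the standard library alone. This example assumes the file has a header row with the column names shown in the table above. The helper name `load_rows` is hypothetical.

```python
import csv
import io


def load_rows(handle):
    """Read (sentence_id, token, label) rows, dropping empty tokens or labels."""
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    rows = []
    for row in reader:
        token = (row.get("token") or "").strip()
        label = (row.get("label") or "").strip().lower()  # normalize to lowercase
        if not token or not label:
            continue  # remove rows with an empty token or label
        rows.append((row["sentence_id"].strip(), token, label))
    return rows


# In-memory stand-in for the TSV file, used only for this example.
sample = io.StringIO(
    "sentence_id\ttoken\tlabel\n"
    "12\tDas\tGSW\n"
    "12\t\tgsw\n"
    "12\tgood\tENG\n"
)
print(load_rows(sample))  # [('12', 'Das', 'gsw'), ('12', 'good', 'eng')]
```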

---

### 2. Group tokens into sentences
Tokens and labels are grouped by `sentence_id` to form input sequences.
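
A minimal sketch of the grouping step, assuming the rows are `(sentence_id, token, label)` tuples. The helper name `group_by_sentence` is hypothetical.

```python
def group_by_sentence(rows):
    """Group (sentence_id, token, label) rows into per-sentence token and label lists."""
    sentences = {}
    for sentence_id, token, label in rows:
        tokens, labels = sentences.setdefault(sentence_id, ([], []))
        tokens.append(token)
        labels.append(label)
    # Dicts preserve insertion order, so sentences come out in first-seen order.
    return list(sentences.values())


rows = [
    ("12", "Das", "gsw"),
    ("12", "isch", "gsw"),
    ("12", "good", "eng"),
    ("12", "gäll", "gsw"),
    ("13", "yes", "eng"),
]
for tokens, labels in group_by_sentence(rows):
    print(tokens, labels)
```

Because the grouping is keyed on `sentence_id`, rows that share an ID are merged even when they are not adjacent in the file.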

---

### 3. Build label mappings

```python
label2id = {
    "gsw": 0,
    "deu": 1,
    "eng": 2,
    "fra": 3,
    "ita": 4,
    "other": 5
}

id2label = {v: k for k, v in label2id.items()}
```


### 4. Tokenization and label alignment

Because SwissBERT uses SentencePiece, each word may be split into multiple subword units.

Slow tokenizers do not provide `word_ids()`, so labels were aligned to subwords manually:

- First subword receives the label  
- Remaining subwords receive `-100` (ignored in loss)  
- CLS and SEP tokens also receive `-100`  
- Sequences padded/truncated to `MAX_LENGTH = 128`  
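
The alignment rules above can be sketched independently of the tokenizer. In this example `split_word` stands in for the SentencePiece tokenizer, the special-token strings are placeholders, and labelling padding positions with `-100` is an assumption. None of the names come from the original training script.

```python
IGNORE_INDEX = -100  # positions with this label are skipped by the loss


def align_labels(words, word_labels, split_word, label2id, max_length=128):
    """Label the first subword of each word; give every other position IGNORE_INDEX."""
    pieces = ["<s>"]  # placeholder for the CLS token
    label_ids = [IGNORE_INDEX]
    for word, label in zip(words, word_labels):
        for position, subword in enumerate(split_word(word)):
            pieces.append(subword)
            label_ids.append(label2id[label] if position == 0 else IGNORE_INDEX)
    # Truncate so that the SEP placeholder still fits within max_length.
    pieces = pieces[: max_length - 1] + ["</s>"]
    label_ids = label_ids[: max_length - 1] + [IGNORE_INDEX]
    # Pad up to max_length (assumption: padding is also ignored in the loss).
    padding = max_length - len(pieces)
    pieces += ["<pad>"] * padding
    label_ids += [IGNORE_INDEX] * padding
    return pieces, label_ids


def toy_split(word):
    """Toy splitter for illustration only: cut words longer than three characters."""
    return [word[:3], word[3:]] if len(word) > 3 else [word]


pieces, label_ids = align_labels(
    ["Das", "isch", "good", "gäll"],
    ["gsw", "gsw", "eng", "gsw"],
    toy_split,
    {"gsw": 0, "eng": 2},
    max_length=10,
)
print(pieces)
print(label_ids)  # [-100, 0, 0, -100, 2, -100, 0, -100, -100, -100]
```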

---

### 5. Training Configuration

- **Epochs:** 5  
- **Batch size:** 8  
- **Learning rate:** 5e-5  
- **Weight decay:** 0.01  
- **Evaluation:** every epoch  
- **Metric:** F1 (seqeval)  
- **Best model selection:** enabled  
- **Tokenizer:** slow SentencePiece tokenizer  
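
For orientation, the settings above could map onto `transformers.TrainingArguments` parameters roughly as follows. This is a sketch and not the original configuration. Treating the batch size as per-device, the save strategy, and the metric key `f1` are assumptions. Note that `eval_strategy` is named `evaluation_strategy` in older `transformers` releases.

```python
# Keyword arguments that could be unpacked into transformers.TrainingArguments(**training_config)
# together with an output_dir of your choice.
training_config = {
    "num_train_epochs": 5,
    "per_device_train_batch_size": 8,  # assumption: the stated batch size is per device
    "per_device_eval_batch_size": 8,   # assumption: same size for evaluation
    "learning_rate": 5e-5,
    "weight_decay": 0.01,
    "eval_strategy": "epoch",          # "evaluation_strategy" in older versions
    "save_strategy": "epoch",          # assumption: must match eval_strategy for best-model loading
    "load_best_model_at_end": True,
    "metric_for_best_model": "f1",     # assumption: key returned by the metrics function
}
```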

---

### 6. Model Setup

```python
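from transformers import AutoModelForTokenClassification

MODEL_NAME = "ZurichNLP/swissbert"  # base checkpoint named in this card

# label2id / id2label are the mappings built in step 3
model = AutoModelForTokenClassification.from_pretrained(
    MODEL_NAME,
    num_labels=len(label2id),
    id2label=id2label,
    label2id=label2id,
)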
```

### 7. Metrics
Evaluation uses seqeval, reporting:

- token-level F1
- per-label precision and recall
- full classification report printed during training
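
To make the reported quantities concrete, the sketch below computes per-label precision, recall, and F1 over label IDs in plain Python, skipping positions labelled `-100`. It is a stand-in for illustration and not the seqeval computation used during training. The function name `per_label_scores` is hypothetical.

```python
from collections import Counter


def per_label_scores(gold, predicted, ignore_index=-100):
    """Return {label_id: (precision, recall, f1)} over positions that carry a gold label."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for gold_id, predicted_id in zip(gold, predicted):
        if gold_id == ignore_index:
            continue  # non-first subwords, special tokens, and padding
        if gold_id == predicted_id:
            tp[gold_id] += 1
        else:
            fp[predicted_id] += 1
            fn[gold_id] += 1
    scores = {}
    for label in sorted(set(tp) | set(fp) | set(fn)):
        predicted_total = tp[label] + fp[label]
        gold_total = tp[label] + fn[label]
        precision = tp[label] / predicted_total if predicted_total else 0.0
        recall = tp[label] / gold_total if gold_total else 0.0
        denominator = precision + recall
        f1 = 2 * precision * recall / denominator if denominator else 0.0
        scores[label] = (precision, recall, f1)
    return scores


gold = [0, 0, 2, 0, -100]
predicted = [0, 2, 2, 0, 1]
for label, (precision, recall, f1) in per_label_scores(gold, predicted).items():
    print(label, round(precision, 3), round(recall, 3), round(f1, 3))
```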

## References

Agnese D'Angelo, Sina Ahmadi, Moritz M. Daum, and Stephanie Wermelinger. 2026. Code-Switching Detection in Multilingual Child Speech with SwissBERT. In *Proceedings of the 11th Swiss Text Analytics Conference (SwissText 2026)*, Zurich, Switzerland.