	---
	language:
	- gsw
	- en
	- fr
	- it
	license: mit
	library_name: transformers
	pipeline_tag: token-classification
	base_model: ZurichNLP/swissbert
	tags:
	- code-switching
	- language-identification
	- child-speech
	- multilingual
	---

	# SwissBERT for Token-Level Language Identification in Multilingual Child Speech

	This repository contains a fine-tuned version of SwissBERT for word-level language identification in multilingual child–caregiver interactions.
	The model predicts a language label for each word and supports downstream analyses such as:

	- Inter-sentential code-switching
	- Intra-sentential code-switching
	- Cross-speaker switching
	- Switch-point detection
	- Multilingual child speech profiling

	The model was trained on manually annotated child speech transcripts containing Swiss German, English, French, Italian, and an “other” category.
	Because Swiss German child speech data is limited, the model was partially trained on the SwissDial dataset to improve Swiss German coverage.
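
As an illustration of the switch-point detection listed above, per-word labels can be scanned for positions where the language changes. This is a minimal sketch, not part of the released code: the function name `find_switch_points` is hypothetical, and skipping `other` words when comparing neighbours is an assumption about how such words should be treated.

```python
from typing import List, Tuple


def find_switch_points(labels: List[str], ignore: Tuple[str, ...] = ("other",)) -> List[int]:
    """Return the indices of words whose language differs from the previous counted word."""
    switches = []
    previous = None
    for i, label in enumerate(labels):
        if label in ignore:
            continue  # assumption: words without a clear language do not count as switches
        if previous is not None and label != previous:
            switches.append(i)
        previous = label
    return switches


# Example sentence taken from the training-data table in this card.
words = ["Das", "isch", "good", "gäll"]
labels = ["gsw", "gsw", "eng", "gsw"]
switch_points = find_switch_points(labels)
print([(i, words[i]) for i in switch_points])  # [(2, 'good'), (3, 'gäll')]
```

The same function can be applied to the concatenated labels of consecutive utterances to look at inter-sentential or cross-speaker switches.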

	---

	## Model Description

	- Base model: `ZurichNLP/swissbert` (XLM-RoBERTa architecture)
	- Task: Token classification (word-level language ID)
	- Labels: Swiss German (`gsw`), English (`eng`), French (`fra`), Italian (`ita`), Other (`other`); the training label mapping additionally includes `deu`
	- Tokenizer: SentencePiece (slow tokenizer), extended with:
	  - `<medium>`
	  - `<year>`
	  - `<month>`

	The model is designed for:
	- Child multilingualism research
	- Code-switching analysis
	- Annotation pipelines
	- Automatic language tagging in naturalistic child speech

	---

	## Training Data

	The training dataset is a tab-separated file with the following structure:

	| sentence_id | token | label |
	|-------------|-------|-------|
	| 12 | Das | gsw |
	| 12 | isch | gsw |
	| 12 | good | eng |
	| 12 | gäll | gsw |

	Tokens are grouped by `sentence_id` to form sequences for token-level classification.

	---

	## Training Pipeline

	The model was trained using the Hugging Face Trainer API.

	### 1. Load labeled data
	- Read TSV file with `(sentence_id, token, label)`
	- Remove empty tokens and labels
	- Normalize labels to lowercase
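
The loading step above could look roughly like the following. This is a sketch under stated assumptions: the file is assumed to have a header row naming the columns `sentence_id`, `token`, and `label`, and `load_rows` is a hypothetical helper name. An in-memory string stands in for the real TSV file.

```python
import csv
import io


def load_rows(handle):
    """Read (sentence_id, token, label) rows, dropping empty fields and lowercasing labels."""
    # QUOTE_NONE keeps quotation marks inside transcribed tokens untouched.
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    rows = []
    for row in reader:
        token = (row.get("token") or "").strip()
        label = (row.get("label") or "").strip().lower()
        if not token or not label:
            continue  # remove rows with an empty token or label
        rows.append((row["sentence_id"].strip(), token, label))
    return rows


# Small in-memory sample; the third data row has an empty token and is dropped.
sample = "sentence_id\ttoken\tlabel\n12\tDas\tGSW\n12\tisch\tgsw\n12\t\tgsw\n12\tgood\tENG\n"
rows = load_rows(io.StringIO(sample))
print(rows)
```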

	---

	### 2. Group tokens into sentences
	Tokens and labels are grouped by `sentence_id` to form input sequences.
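
A minimal sketch of this grouping step, assuming the rows are already available as `(sentence_id, token, label)` tuples in file order. The helper name `group_by_sentence` and the second sentence (`13`) are made up for illustration.

```python
def group_by_sentence(rows):
    """Group (sentence_id, token, label) rows into parallel token and label lists."""
    sentences = {}  # dicts keep insertion order, so sentences stay in file order
    for sentence_id, token, label in rows:
        tokens, labels = sentences.setdefault(sentence_id, ([], []))
        tokens.append(token)
        labels.append(label)
    return list(sentences.values())


rows = [
    ("12", "Das", "gsw"),
    ("12", "isch", "gsw"),
    ("12", "good", "eng"),
    ("12", "gäll", "gsw"),
    ("13", "yes", "eng"),  # hypothetical second sentence
]
sentences = group_by_sentence(rows)
for tokens, labels in sentences:
    print(tokens, labels)
```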

	---

	### 3. Build label mappings

	```python
	label2id = {
	    "gsw": 0,
	    "deu": 1,
	    "eng": 2,
	    "fra": 3,
	    "ita": 4,
	    "other": 5,
	}

	id2label = {v: k for k, v in label2id.items()}
	```


	### 4. Tokenization and label alignment

	Because SwissBERT uses a SentencePiece tokenizer, a single word may be split into multiple subword units, so word-level labels were aligned to subwords manually:

	- First subword receives the label
	- Remaining subwords receive `-100` (ignored in loss)
	- CLS and SEP tokens also receive `-100`
	- Sequences padded/truncated to `MAX_LENGTH = 128`
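
The alignment rules above can be sketched as follows. This is an illustration, not the original training code: `align_labels` is a hypothetical helper, and `tokenize` is any callable that maps a word to its subword pieces. With the real model this would be the slow tokenizer's `tokenize` method; a toy splitter is used here so the example runs without downloading anything.

```python
from typing import Callable, Dict, List

IGNORE_INDEX = -100  # positions with this label are ignored by the loss
MAX_LENGTH = 128


def align_labels(
    words: List[str],
    labels: List[str],
    label2id: Dict[str, int],
    tokenize: Callable[[str], List[str]],
    max_length: int = MAX_LENGTH,
) -> List[int]:
    """Label the first subword of each word; mask other subwords, specials, and padding."""
    aligned = [IGNORE_INDEX]  # CLS position
    for word, label in zip(words, labels):
        pieces = tokenize(word)
        if not pieces:
            continue  # the tokenizer produced nothing for this word
        aligned.append(label2id[label])  # first subword carries the label
        aligned.extend([IGNORE_INDEX] * (len(pieces) - 1))  # remaining subwords
    aligned = aligned[: max_length - 1]  # truncate, leaving room for SEP
    aligned.append(IGNORE_INDEX)  # SEP position
    aligned.extend([IGNORE_INDEX] * (max_length - len(aligned)))  # padding
    return aligned


def toy_tokenize(word: str) -> List[str]:
    """Stand-in for a subword tokenizer: split words longer than three characters."""
    return [word[:3], word[3:]] if len(word) > 3 else [word]


label2id = {"gsw": 0, "deu": 1, "eng": 2, "fra": 3, "ita": 4, "other": 5}
aligned = align_labels(
    ["Das", "isch", "good", "gäll"],
    ["gsw", "gsw", "eng", "gsw"],
    label2id,
    toy_tokenize,
    max_length=12,
)
print(aligned)  # [-100, 0, 0, -100, 2, -100, 0, -100, -100, -100, -100, -100]
```

The input IDs have to be built with the same per-word segmentation and the same truncation so that labels and tokens stay position-aligned.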

	---

	### 5. Training Configuration

	- Epochs: 5
	- Batch size: 8
	- Learning rate: 5e-5
	- Weight decay: 0.01
	- Evaluation: every epoch
	- Metric: F1 (seqeval)
	- Best model selection: enabled
	- Tokenizer: slow SentencePiece tokenizer
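
The settings listed above map roughly onto `TrainingArguments` keyword names as shown below. This is a sketch, not the original configuration: `output_dir` is a hypothetical path, the batch size is assumed to be per device, `"f1"` assumes the metric function returns a key of that name, and `eval_strategy` is called `evaluation_strategy` in older `transformers` releases.

```python
# Keyword arguments mirroring the configuration list in this card.
training_kwargs = {
    "output_dir": "swissbert-cs-checkpoints",  # hypothetical path
    "num_train_epochs": 5,
    "per_device_train_batch_size": 8,  # assumption: "batch size" means per device
    "learning_rate": 5e-5,
    "weight_decay": 0.01,
    "eval_strategy": "epoch",  # evaluate after every epoch
    "save_strategy": "epoch",  # must match eval_strategy to load the best model
    "load_best_model_at_end": True,  # best model selection
    "metric_for_best_model": "f1",  # assumption: compute_metrics returns an "f1" key
}

# With transformers installed, these can be passed on directly:
#   from transformers import TrainingArguments
#   args = TrainingArguments(**training_kwargs)
for name, value in training_kwargs.items():
    print(f"{name} = {value!r}")
```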

	---

	### 6. Model Setup

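	```python
	from transformers import AutoModelForTokenClassification

	# Assumed to be the base model named in this card's metadata.
	MODEL_NAME = "ZurichNLP/swissbert"

	# label2id and id2label are the mappings built in step 3.
	model = AutoModelForTokenClassification.from_pretrained(
	    MODEL_NAME,
	    num_labels=len(label2id),
	    id2label=id2label,
	    label2id=label2id,
	)
	```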

	### 7. Metrics
	Evaluation uses seqeval, reporting:

	- token-level F1
	- per-label precision and recall
	- full classification report printed during training
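
To make the per-label figures concrete, the following sketch computes token-level precision, recall, and F1 in plain Python. It is not seqeval and may not reproduce its numbers exactly; it only illustrates the quantities involved. Positions labeled `-100` are skipped, and the helper name `per_label_scores` and the example predictions are made up.

```python
from collections import Counter

IGNORE_INDEX = -100  # masked positions (extra subwords, special tokens, padding)


def per_label_scores(gold, pred):
    """Return {label: (precision, recall, f1)} over positions that carry a gold label."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == IGNORE_INDEX:
            continue
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted label was wrong here
            fn[g] += 1  # gold label was missed here
    scores = {}
    for label in sorted(set(tp) | set(fp) | set(fn)):
        predicted = tp[label] + fp[label]
        actual = tp[label] + fn[label]
        precision = tp[label] / predicted if predicted else 0.0
        recall = tp[label] / actual if actual else 0.0
        total = precision + recall
        f1 = 2 * precision * recall / total if total else 0.0
        scores[label] = (precision, recall, f1)
    return scores


# Label IDs follow the mapping in this card (0 = gsw, 2 = eng); predictions are invented.
gold = [-100, 0, 0, -100, 2, -100, 0, -100]
pred = [5, 0, 2, 1, 2, 3, 0, 0]
scores = per_label_scores(gold, pred)
for label, (precision, recall, f1) in scores.items():
    print(label, round(precision, 3), round(recall, 3), round(f1, 3))
```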

	## References

	Agnese D'Angelo, Sina Ahmadi, Moritz M. Daum, and Stephanie Wermelinger. 2026. Code-Switching Detection in Multilingual Child Speech with SwissBERT. In Proceedings of the 11th Swiss Text Analytics Conference (SwissText 2026), Zurich, Switzerland.