DialSeg-Ar-Gemma3-4B
Model Summary
DialSeg-Ar-Gemma3-4B is a model for linear semantic segmentation in dialectal conversational Arabic and related transcript-like genres.
The model is designed to split a sequence of utterances into contiguous topic-coherent segments.
What the Model Does
Given an ordered list of utterances, the model predicts segment splits.
In our setup, the task is framed as instruction-following generation over structured input: the model outputs a JSON list of segments, each identified by the line IDs it contains.
Example input
[
{"line_id": 1, "speaker": "A", "text": "..."},
{"line_id": 2, "speaker": "B", "text": "..."},
{"line_id": 3, "speaker": "A", "text": "..."}
]
Example output
[
{"split_id": 1, "line_ids": "1,2"},
{"split_id": 2, "line_ids": "3"}
]
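The model's output can be consumed with a small helper. The sketch below (not part of the released code; `parse_segments` is a hypothetical name) turns the JSON output shown above into ordered lists of integer line IDs:

```python
import json

# Hypothetical helper: parse the model's JSON output into ordered
# groups of integer line IDs, e.g. [[1, 2], [3]].
def parse_segments(output_json: str) -> list[list[int]]:
    segments = json.loads(output_json)
    # Sort defensively in case segments arrive out of split_id order.
    segments.sort(key=lambda s: s["split_id"])
    return [[int(i) for i in seg["line_ids"].split(",")] for seg in segments]

example = '[{"split_id": 1, "line_ids": "1,2"}, {"split_id": 2, "line_ids": "3"}]'
print(parse_segments(example))  # [[1, 2], [3]]
```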
Intended Uses
Direct use
This model can be used for:
- topic segmentation of Arabic conversational transcripts
- chunking spoken or transcript-like content for downstream NLP
- segmentation of dialectal Arabic discourse
- baseline evaluation on DialSeg-Ar
Downstream use
Potential downstream applications include:
- transcript chunking for retrieval systems
- podcast navigation / chaptering
- discourse preprocessing for summarization
- call-center or conversational analytics
Note: this model is evaluated intrinsically for segmentation quality. Its impact on downstream tasks such as RAG or summarization was not directly measured in the paper.
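As one illustration of downstream chunking (a sketch under assumed inputs, not an evaluated pipeline from the paper), predicted segments can be joined into text chunks for a retrieval index:

```python
# Illustrative sketch: turn predicted segments into text chunks for
# retrieval. `utterances` and `segments` are assumed example inputs;
# line IDs are 1-based, matching the model's I/O format.
utterances = {
    1: ("A", "..."),
    2: ("B", "..."),
    3: ("A", "..."),
}
segments = [[1, 2], [3]]  # e.g. parsed from the model's JSON output

# One chunk per topic segment, with speaker labels preserved.
chunks = [
    "\n".join(f"{utterances[i][0]}: {utterances[i][1]}" for i in seg)
    for seg in segments
]
```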
Covered varieties and genres
- Broadcast transcripts
- Phone conversations
- Code-switched podcasts
- Dialogues from fiction books
- MSA news corpora
For details, see the dataset card:
- Dataset:
MBZUAI/DialSeg-Ar
How to Use
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "MBZUAI/DialSeg-Ar-Gemma3-4B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
# device_map="auto" places the model on the available GPU(s), if any.
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")
prompt = """
Split the conversation (podcasts) in Gulf Arabic-English into sequential segments,
where each segment contains lines that discuss the same topic.
Conversation:
{"line_id": 1, "speaker": "A", "text": "..."}
{"line_id": 2, "speaker": "B", "text": "..."}
"""
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)
# The decoded string echoes the prompt followed by the JSON segmentation.
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
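Since greedy decoding returns the prompt followed by the model's continuation, a small post-processing step is needed to recover the JSON segmentation. The helper below (`extract_segments` is a hypothetical name, not part of the released code) strips the prompt and pulls out the first JSON array:

```python
import json
import re

# Hypothetical post-processing: the decoded output echoes the prompt,
# so strip it and extract the first JSON array from the continuation.
def extract_segments(decoded: str, prompt: str) -> list[dict]:
    continuation = decoded[len(prompt):] if decoded.startswith(prompt) else decoded
    match = re.search(r"\[.*\]", continuation, re.DOTALL)
    if match is None:
        raise ValueError("no JSON array found in model output")
    return json.loads(match.group(0))
```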
Citation
If you use this model, please cite:
@inproceedings{chirkunov2026linear,
  title     = {Linear Semantic Segmentation for Low-Resource Spoken Dialects},
  author    = {Chirkunov, Kirill and Samih, Younes and Freihat, Abed Alhakim and Aldarmaki, Hanan},
  booktitle = {Proceedings of ACL 2026},
  year      = {2026}
}