
DialSeg-Ar-Gemma3-4B

Model Summary

DialSeg-Ar-Gemma3-4B is a model for linear semantic segmentation in dialectal conversational Arabic and related transcript-like genres.

The model is designed to split a sequence of utterances into contiguous topic-coherent segments.


What the Model Does

Given an ordered list of utterances, the model predicts segment splits.

In our setup, the task is framed as instruction-following generation over structured input, where the model outputs a JSON list of segments with line IDs.

Example input

[
  {"line_id": 1, "speaker": "A", "text": "..."},
  {"line_id": 2, "speaker": "B", "text": "..."},
  {"line_id": 3, "speaker": "A", "text": "..."}
]

Example output

[
  {"split_id": 1, "line_ids": "1,2"},
  {"split_id": 2, "line_ids": "3"}
]
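The output format above can be post-processed with a few lines of Python. The helper below is a minimal sketch (not part of the released model's API) that converts the `line_ids` strings into integer lists, one per segment:

```python
import json

def parse_segments(output_json: str):
    """Parse the model's JSON output into a list of line-ID lists.

    Each segment's "line_ids" field is a comma-separated string
    (e.g. "1,2"), which we convert to a list of integers, ordered
    by split_id.
    """
    segments = json.loads(output_json)
    return [
        [int(i) for i in seg["line_ids"].split(",")]
        for seg in sorted(segments, key=lambda s: s["split_id"])
    ]

raw = '[{"split_id": 1, "line_ids": "1,2"}, {"split_id": 2, "line_ids": "3"}]'
print(parse_segments(raw))  # [[1, 2], [3]]
```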

Intended Uses

Direct use

This model can be used for:

  • topic segmentation of Arabic conversational transcripts
  • chunking spoken or transcript-like content for downstream NLP
  • segmentation of dialectal Arabic discourse
  • baseline evaluation on DialSeg-Ar

Downstream use

Potential downstream applications include:

  • transcript chunking for retrieval systems
  • podcast navigation / chaptering
  • discourse preprocessing for summarization
  • call-center or conversational analytics

Note: this model is evaluated intrinsically for segmentation quality. Its impact on downstream tasks such as RAG or summarization was not directly measured in the paper.

Covered varieties and genres

  • Broadcast transcripts
  • Phone conversations
  • Code-switched podcasts
  • Dialogues from fiction books
  • MSA news corpora

For details, see the dataset card:

  • Dataset: MBZUAI/DialSeg-Ar

How to Use

from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "MBZUAI/DialSeg-Ar-Gemma3-4B"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

prompt = """
Split the conversation (a Gulf Arabic-English code-switched podcast) into sequential
segments, where each segment contains lines that discuss the same topic.

Conversation:
{"line_id": 1, "speaker": "A", "text": "..."}
{"line_id": 2, "speaker": "B", "text": "..."}
"""

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
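Since the decoded text can include the prompt and surrounding prose alongside the predicted segments, a small extraction step is usually needed before parsing. The helper below is an assumption-laden sketch (the `extract_segments` name and the regex heuristic are ours, not part of the model's documented interface): it locates the first bracketed span in the decoded output and attempts to parse it as JSON.

```python
import json
import re

def extract_segments(decoded: str):
    """Heuristically pull a JSON array of segments out of decoded model text.

    The generation may echo the prompt or add extra prose, so we grab the
    widest [...] span and try to parse it; returns None if no valid JSON
    array is found.
    """
    match = re.search(r"\[.*\]", decoded, flags=re.DOTALL)
    if match is None:
        return None
    try:
        return json.loads(match.group(0))
    except json.JSONDecodeError:
        return None
```

For example, `extract_segments('Segments: [{"split_id": 1, "line_ids": "1,2"}]')` returns the parsed list, while plain prose returns None.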

Citation

If you use this model, please cite:

@inproceedings{chirkunov2026linear,
  title     = {Linear Semantic Segmentation for Low-Resource Spoken Dialects},
  author    = {Chirkunov, Kirill and Samih, Younes and Freihat, Abed Alhakim and Aldarmaki, Hanan},
  booktitle = {Proceedings of ACL 2026},
  year      = {2026}
}

Model size: 4B parameters (Safetensors, BF16)