---
language:
  - ar
license: cc-by-nc-4.0
pipeline_tag: text-generation
tags:
  - arabic
  - dialectal-arabic
  - discourse-analysis
  - linear-segmentation
  - semantic-segmentation
  - topic-segmentation
  - low-resource
pretty_name: DialSeg-Ar-Gemma3-4B
datasets:
  - MBZUAI/DialSeg-Ar
---

DialSeg-Ar-Gemma3-4B

Model Summary

DialSeg-Ar-Gemma3-4B is a model for linear semantic segmentation in dialectal conversational Arabic and related transcript-like genres.

The model is designed to split a sequence of utterances into contiguous topic-coherent segments.


What the Model Does

Given an ordered list of utterances, the model predicts segment split points.

In our setup, the task is framed as instruction-following generation over structured input, where the model outputs a JSON list of segments with line IDs.

Example input

```json
[
  {"line_id": 1, "speaker": "A", "text": "..."},
  {"line_id": 2, "speaker": "B", "text": "..."},
  {"line_id": 3, "speaker": "A", "text": "..."}
]
```

Example output

```json
[
  {"split_id": 1, "line_ids": "1,2"},
  {"split_id": 2, "line_ids": "3"}
]
```
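Given the output format above, the predicted segments can be mapped back to utterance spans with a few lines of post-processing. This is a minimal sketch (the helper name `parse_segments` is illustrative, not part of the model's API):

```python
import json

def parse_segments(output_json: str) -> list[list[int]]:
    """Convert the model's JSON segment list into ordered lists of line IDs."""
    segments = json.loads(output_json)
    return [
        [int(i) for i in seg["line_ids"].split(",")]
        for seg in sorted(segments, key=lambda s: s["split_id"])
    ]

example = '[{"split_id": 1, "line_ids": "1,2"}, {"split_id": 2, "line_ids": "3"}]'
print(parse_segments(example))  # [[1, 2], [3]]
```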

Intended Uses

Direct use

This model can be used for:

  • topic segmentation of Arabic conversational transcripts
  • chunking spoken or transcript-like content for downstream NLP
  • segmentation of dialectal Arabic discourse
  • baseline evaluation on DialSeg-Ar

Downstream use

Potential downstream applications include:

  • transcript chunking for retrieval systems
  • podcast navigation / chaptering
  • discourse preprocessing for summarization
  • call-center or conversational analytics

Note: this model is evaluated intrinsically for segmentation quality. Its impact on downstream tasks such as RAG or summarization was not directly measured in the paper.

Covered varieties and genres

  • Broadcast transcripts
  • Phone conversations
  • Code-switched podcasts
  • Dialogues from fiction books
  • MSA news corpora

For details, see the dataset card:

  • Dataset: MBZUAI/DialSeg-Ar

How to Use

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "MBZUAI/DialSeg-Ar-Gemma3-4B"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

prompt = """
Split the conversation (a podcast in code-switched Gulf Arabic and English) into sequential segments,
where each segment contains lines that discuss the same topic.

Conversation:
{"line_id": 1, "speaker": "A", "text": "..."}
{"line_id": 2, "speaker": "B", "text": "..."}
"""

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
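The decoded text can then be post-processed into structured segments. A hedged sketch, assuming the model follows the instructed format and emits a single JSON array (the regex and the fallback to an empty list are our assumptions, not a documented contract):

```python
import json
import re

def extract_segments(generated: str) -> list[dict]:
    """Pull the first JSON array out of the generated text and parse it.
    Returns [] if no well-formed array is found."""
    match = re.search(r"\[.*\]", generated, flags=re.DOTALL)
    if not match:
        return []
    try:
        return json.loads(match.group(0))
    except json.JSONDecodeError:
        return []

out = 'Here are the segments: [{"split_id": 1, "line_ids": "1,2"}]'
print(extract_segments(out))  # [{'split_id': 1, 'line_ids': '1,2'}]
```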

Citation

If you use this model, please cite:

```bibtex
@inproceedings{chirkunov2026dialseg,
  title     = {Linear Semantic Segmentation for Low-Resource Spoken Dialects},
  author    = {Chirkunov, Kirill and Samih, Younes and Freihat, Abed Alhakim and Aldarmaki, Hanan},
  booktitle = {Proceedings of ACL 2026},
  year      = {2026}
}
```