---
language:
  - ar
license: cc-by-nc-4.0
pipeline_tag: text-generation
tags:
  - arabic
  - dialectal-arabic
  - discourse-analysis
  - linear-segmentation
  - semantic-segmentation
  - topic-segmentation
  - low-resource
pretty_name: DialSeg-Ar-Gemma3-4B
datasets:
  - MBZUAI/DialSeg-Ar
---

DialSeg-Ar-Gemma3-4B

Model Summary

DialSeg-Ar-Gemma3-4B is a model for linear semantic segmentation in dialectal conversational Arabic and related transcript-like genres.

The model is designed to split a sequence of utterances into contiguous topic-coherent segments.


What the Model Does

Given an ordered list of utterances, the model predicts segment split points.

In our setup, the task is framed as instruction-following generation over structured input, where the model outputs a JSON list of segments with line IDs.

Example input

```json
[
  {"line_id": 1, "speaker": "A", "text": "..."},
  {"line_id": 2, "speaker": "B", "text": "..."},
  {"line_id": 3, "speaker": "A", "text": "..."}
]
```

Example output

```json
[
  {"split_id": 1, "line_ids": "1,2"},
  {"split_id": 2, "line_ids": "3"}
]
```
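Given the output format above, the predicted segments can be mapped back to utterance spans with a few lines of post-processing. This is a minimal sketch (the helper name `parse_segments` is illustrative, not part of the model's API):

```python
import json

def parse_segments(output_json: str) -> list[list[int]]:
    """Convert the model's JSON segment list into ordered lists of line IDs."""
    segments = json.loads(output_json)
    return [
        [int(i) for i in seg["line_ids"].split(",")]
        for seg in sorted(segments, key=lambda s: s["split_id"])
    ]

example = '[{"split_id": 1, "line_ids": "1,2"}, {"split_id": 2, "line_ids": "3"}]'
print(parse_segments(example))  # [[1, 2], [3]]
```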

Intended Uses

Direct use

This model can be used for:

  • topic segmentation of Arabic conversational transcripts
  • chunking spoken or transcript-like content for downstream NLP
  • segmentation of dialectal Arabic discourse
  • baseline evaluation on DialSeg-Ar

Downstream use

Potential downstream applications include:

  • transcript chunking for retrieval systems
  • podcast navigation / chaptering
  • discourse preprocessing for summarization
  • call-center or conversational analytics

Note: this model is evaluated intrinsically for segmentation quality. Its impact on downstream tasks such as RAG or summarization was not directly measured in the paper.

Covered varieties and genres

  • Broadcast transcripts
  • Phone conversations
  • Code-switched podcasts
  • Dialogues from fiction books
  • MSA news corpora

For details, see the dataset card:

  • Dataset: MBZUAI/DialSeg-Ar

How to Use

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "MBZUAI/DialSeg-Ar-Gemma3-4B"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

prompt = """
Split the conversation (a podcast in code-switched Gulf Arabic and English) into sequential segments,
where each segment contains lines that discuss the same topic.

Conversation:
{"line_id": 1, "speaker": "A", "text": "..."}
{"line_id": 2, "speaker": "B", "text": "..."}
"""

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
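The decoded text can then be post-processed into structured segments. A hedged sketch, assuming the model follows the instructed format and emits a single JSON array (the regex and the fallback to an empty list are our assumptions, not a documented contract):

```python
import json
import re

def extract_segments(generated: str) -> list[dict]:
    """Pull the first JSON array out of the generated text and parse it.
    Returns [] if no well-formed array is found."""
    match = re.search(r"\[.*\]", generated, flags=re.DOTALL)
    if not match:
        return []
    try:
        return json.loads(match.group(0))
    except json.JSONDecodeError:
        return []

out = 'Here are the segments: [{"split_id": 1, "line_ids": "1,2"}]'
print(extract_segments(out))  # [{'split_id': 1, 'line_ids': '1,2'}]
```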

Citation

If you use this model, please cite:

```bibtex
@inproceedings{chirkunov2026dialseg,
  title     = {Linear Semantic Segmentation for Low-Resource Spoken Dialects},
  author    = {Chirkunov, Kirill and Samih, Younes and Freihat, Abed Alhakim and Aldarmaki, Hanan},
  booktitle = {Proceedings of ACL 2026},
  year      = {2026}
}
```