

	---
	license: cc-by-nc-4.0
	language:
	- ar
	- en
	base_model:
	- meta-llama/Llama-3.1-8B
	extra_gated_fields:
	  First Name: text
	  Last Name: text
	  Date of birth: date_picker
	  Country: country
	  Affiliation: text
	  Job title:
	    type: select
	    options:
	      - Student
	      - Research Graduate
	      - AI researcher
	      - AI developer/engineer
	      - Reporter
	      - Other
	  geo: ip_location
	  ? By clicking Submit below I accept the terms of the license and acknowledge that
	    the information I provide will be collected stored processed and shared in accordance
	    with the Meta Privacy Policy
	  : checkbox
	---

	# DALLA Llama

	dalla-llama is an Arabic-focused adaptation of `meta-llama/Llama-3.1-8B`, built using the [DALLA suite](https://github.com/U4RASD/dalla-model-training).
	The model uses a tokenizer modified through our [R-BPE framework](https://github.com/U4RASD/r-bpe) to improve Arabic coverage without increasing vocabulary size.
	It was further trained on curated, culturally grounded Arabic data to support more fluent Arabic generation and better value alignment with Arab communities.
	This model serves as a demonstration of the DALLA pipeline for adapting open-weight models to Arabic.
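One rough way to check the tokenizer claim above is to compare the average number of tokens per word across tokenizers on the same Arabic text. The snippet below is a minimal sketch of that measurement, not part of the DALLA or R-BPE code: `tokens_per_word` is a hypothetical helper, and the character-level tokenizer is a stand-in used only so the example runs without downloading a model.

```python
from typing import Callable, Iterable, List


def tokens_per_word(texts: Iterable[str], tokenize: Callable[[str], List]) -> float:
    """Average number of tokens per whitespace-delimited word (hypothetical helper)."""
    total_tokens = 0
    total_words = 0
    for text in texts:
        total_tokens += len(tokenize(text))
        total_words += len(text.split())
    if total_words == 0:
        raise ValueError("no words in input")
    return total_tokens / total_words


def char_level(text: str) -> List[str]:
    """Stand-in tokenizer for illustration only: one token per non-space character."""
    return list(text.replace(" ", ""))


sample = ["من انت؟"]
print(tokens_per_word(sample, char_level))
```

To compare real tokenizers, pass each one's tokenization function in place of `char_level`, for example `tokenizer.tokenize`, assuming the loaded tokenizer exposes that method as Hugging Face tokenizers do. A lower value on Arabic text means fewer tokens per word.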

	## Intended Use

	This model is released for research purposes and general experimentation with Arabic language tasks.
	It is not designed for deployment in high-risk settings, and its outputs should not be relied on for factual, legal, medical, or sensitive decisions.

	## Getting Started

	```sh
	pip install -U transformers
	pip install -U accelerate
	pip install -U rbpe
	```

	```python
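	import torch
	from rbpe import RBPETokenizer  # R-BPE tokenizer package (installed above)
	from transformers import AutoModelForCausalLM, AutoTokenizer

	model_id = "dru-ac/dalla-llama-it"

	tokenizer = AutoTokenizer.from_pretrained(model_id)
	model = AutoModelForCausalLM.from_pretrained(
	    model_id,
	    device_map="auto",
	    torch_dtype=torch.bfloat16,
	)

	# The prompt means "Who are you?"
	messages = [
	    {"role": "user", "content": "من انت؟"},
	]
	# Apply the chat template and move the prompt to the device the model was placed on.
	input_ids = tokenizer.apply_chat_template(
	    messages,
	    tokenize=True,
	    add_generation_prompt=True,
	    return_tensors="pt",
	).to(model.device)

	outputs = model.generate(input_ids, max_new_tokens=256)
	# Decode only the newly generated tokens, skipping the prompt.
	print(tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True))
	# أنا دلّة، نموذج لغوي ضخم تم تدريبي على مجموعة واسعة من البيانات في مختلف المجالات للإجابة على أسئلة المستخدمين. تم تطويري من قبل باحثي ومهندسي المركز العربي للأبحاث ودراسة السياسات الذي يقع مقره الرئيسي في الدوحة، قطر. يمكنك سؤالي عن مختلف المواضيع خاصة المتعلقة بالثقافة واللغة العربية.
	# English: I am Dalla, a large language model trained on a wide range of data across various domains to answer users' questions. I was developed by researchers and engineers at the Arab Center for Research and Policy Studies, headquartered in Doha, Qatar. You can ask me about various topics, especially those related to Arabic culture and language.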
	```