---
library_name: transformers
license: apache-2.0
base_model: Qwen/Qwen2.5-1.5B-Instruct
datasets:
- iamtarun/python_code_instructions_18k_alpaca
language:
- ar
- en
pipeline_tag: text-generation
tags:
- llama-factory
- lora
- qwen2
- python
- arabic
- code
- instruction-tuning
- fine-tuned
---

# 🐍 Python Assistant (Arabic)

A fine-tuned version of **Qwen2.5-1.5B-Instruct** that answers Python programming questions in **Arabic**, returning structured JSON output. Trained with QLoRA (LoRA rank 32) via LLaMA-Factory.

---

## Model Details

- **Developed by:** jana-ashraf-ai  
- **Base Model:** [Qwen/Qwen2.5-1.5B-Instruct](https://huggingface.co/Qwen/Qwen2.5-1.5B-Instruct)  
- **Model type:** Causal Language Model (text-generation)  
- **Language(s):** Arabic (answers) + English (questions)  
- **License:** Apache 2.0  
- **Fine-tuning method:** QLoRA (LoRA rank=32) via LLaMA-Factory  

---

## What does this model do?

Given a Python programming question in English, the model returns a structured JSON answer **in Arabic**, explaining the solution step by step.

---

## How to Use
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "jana-ashraf-ai/python-assistant"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto"
)

system_prompt = """You are a Python expert assistant.
Answer the user's Python question in Arabic following the Output Schema.
Do not add any introduction or conclusion."""

question = "How do I reverse a list in Python?"

messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": question}
]

text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

outputs = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
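Since the model is trained to answer in a JSON structure, you will typically want to parse the generated text back into a Python object. The card does not publish the exact schema, so the helper below and the sample keys (`answer`, `code`) are illustrative assumptions, not the official format; adapt them to the schema you trained with:

```python
import json


def parse_model_output(raw: str) -> dict:
    """Extract and parse the first JSON object found in the model's raw output.

    Chat templates may leave preamble or trailing text around the JSON,
    so we slice from the first '{' to the last '}' before parsing.
    NOTE: the keys in the parsed object depend on the training schema,
    which is not published on this card.
    """
    start = raw.find("{")
    end = raw.rfind("}")
    if start == -1 or end < start:
        raise ValueError("no JSON object found in model output")
    return json.loads(raw[start : end + 1])


# Example with a made-up payload (the real schema may differ):
raw = 'preamble {"answer": "اعكس القائمة باستخدام التشريح", "code": "my_list[::-1]"} trailing'
parsed = parse_model_output(raw)
print(parsed["code"])  # -> my_list[::-1]
```

If parsing fails frequently, lowering the sampling temperature (or using greedy decoding) usually makes the JSON structure more reliable.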

---

## Training Details

| Parameter | Value |
|-----------|-------|
| Base model | Qwen2.5-1.5B-Instruct |
| Fine-tuning method | LoRA (QLoRA) |
| LoRA rank | 32 |
| LoRA target | all |
| Training samples | 1,000 |
| Epochs | 3 |
| Learning rate | 1e-4 |
| LR scheduler | cosine |
| Warmup ratio | 0.1 |
| Batch size | 1 (grad accum = 8) |
| Precision | fp16 |
| Quantization | 4-bit (nf4) |
| Framework | LLaMA-Factory |
| Hardware | Google Colab T4 GPU |

---

## Training Data

Fine-tuned on a curated subset (1,000 samples) from [iamtarun/python_code_instructions_18k_alpaca](https://huggingface.co/datasets/iamtarun/python_code_instructions_18k_alpaca).

The answers were annotated and restructured with GPT to produce Arabic explanations following a JSON schema.

**Train / Val split:** 90% / 10%
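A 90/10 split of 1,000 samples can be reproduced with a deterministic shuffle. This is a stdlib sketch under stated assumptions: the seed value and the helper name `make_split` are illustrative, not the card author's actual preprocessing pipeline:

```python
import random


def make_split(samples, train_frac=0.9, seed=42):
    """Shuffle deterministically, then split into train/validation lists.

    A fixed seed keeps the split reproducible across runs.
    """
    items = list(samples)
    rng = random.Random(seed)
    rng.shuffle(items)
    cut = int(len(items) * train_frac)
    return items[:cut], items[cut:]


# 1,000 samples -> 900 train / 100 validation, as in the table above.
train, val = make_split(range(1000))
print(len(train), len(val))  # -> 900 100
```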

---

## Limitations

- The model is optimized for Python questions only.
- Answers are in Arabic, so the model is not suitable for English-only use cases.
- Small model size (1.5B) may struggle with very complex programming problems.
- Output quality depends on the question being clear and specific.