|
|
---
|
|
|
license: mit
|
|
|
language:
|
|
|
- zho
|
|
|
- eng
|
|
|
- fra
|
|
|
- spa
|
|
|
- por
|
|
|
- deu
|
|
|
- ita
|
|
|
- rus
|
|
|
- jpn
|
|
|
- kor
|
|
|
- vie
|
|
|
- tha
|
|
|
- ara
|
|
|
base_model:
|
|
|
- Qwen/Qwen2.5-1.5B-Instruct
|
|
|
pipeline_tag: text2text-generation
|
|
|
library_name: transformers
|
|
|
tags:
|
|
|
- Text-To-SQL
|
|
|
- Arabic
|
|
|
- Spider
|
|
|
- SQL
|
|
|
---
|
|
|
|
|
|
# Model Card for Arabic Text-To-SQL (OsamaMo)
|
|
|
|
|
|
## Model Details
|
|
|
|
|
|
### Model Description
|
|
|
|
|
|
This model is fine-tuned on the Spider dataset with Arabic-translated questions for the Text-To-SQL task. It is based on **Qwen/Qwen2.5-1.5B-Instruct** and trained using LoRA on Kaggle for 15 hours on a **P100 8GB GPU**.
|
|
|
|
|
|
- **Developed by:** Osama Mohamed ([OsamaMo](https://huggingface.co/OsamaMo))
|
|
|
- **Funded by:** Self-funded
|
|
|
- **Shared by:** Osama Mohamed
|
|
|
- **Model type:** Text-to-SQL fine-tuned model
|
|
|
- **Language(s):** Arabic (ar)
|
|
|
- **License:** MIT
|
|
|
- **Finetuned from:** Qwen/Qwen2.5-1.5B-Instruct
|
|
|
|
|
|
### Model Sources
|
|
|
|
|
|
- **Repository:** [Hugging Face Model Hub](https://huggingface.co/OsamaMo/Arabic_Text-To-SQL)
|
|
|
- **Dataset:** Spider (translated to Arabic)
|
|
|
- **Training Script:** [LLaMA-Factory](https://github.com/huggingface/transformers/tree/main/src/transformers/models/llama_factory)
|
|
|
|
|
|
## Uses
|
|
|
|
|
|
### Direct Use
|
|
|
|
|
|
This model is intended for converting **Arabic natural language questions** into SQL queries. It can be used for database querying in Arabic-speaking applications.
|
|
|
|
|
|
### Downstream Use
|
|
|
|
|
|
Can be fine-tuned further for specific databases or Arabic dialect adaptations.
|
|
|
|
|
|
### Out-of-Scope Use
|
|
|
|
|
|
- The model is **not** intended for direct execution of SQL queries.
|
|
|
- Not recommended for non-database-related NLP tasks.
|
|
|
|
|
|
## Bias, Risks, and Limitations
|
|
|
|
|
|
- The model might generate incorrect or non-optimized SQL queries.
|
|
|
- Bias may exist due to dataset translations and model pretraining data.
|
|
|
|
|
|
### Recommendations
|
|
|
|
|
|
- Validate generated SQL queries before execution.
|
|
|
- Ensure compatibility with specific database schemas.
|
|
|
|
|
|
## How to Get Started with the Model
|
|
|
### Load Model
|
|
|
```python
|
|
|
from transformers import AutoModelForCausalLM, AutoTokenizer
|
|
|
import torch
|
|
|
import re
|
|
|
|
|
|
device = "cuda" if torch.cuda.is_available() else "cpu"
|
|
|
base_model_id = "Qwen/Qwen2.5-1.5B-Instruct"
|
|
|
finetuned_model_id = "OsamaMo/Arabic_Text-To-SQL_using_Qwen2.5-1.5B"
|
|
|
|
|
|
# Load the base model and adapter for fine-tuning
|
|
|
model = AutoModelForCausalLM.from_pretrained(
|
|
|
base_model_id,
|
|
|
device_map="auto",
|
|
|
torch_dtype=torch.bfloat16
|
|
|
)
|
|
|
model.load_adapter(finetuned_model_id)
|
|
|
|
|
|
tokenizer = AutoTokenizer.from_pretrained(base_model_id)
|
|
|
|
|
|
def generate_resp(messages):
|
|
|
text = tokenizer.apply_chat_template(
|
|
|
messages,
|
|
|
tokenize=False,
|
|
|
add_generation_prompt=True
|
|
|
)
|
|
|
model_inputs = tokenizer([text], return_tensors="pt").to(device)
|
|
|
generated_ids = model.generate(
|
|
|
model_inputs.input_ids,
|
|
|
max_new_tokens=1024,
|
|
|
do_sample=False, temperature= False,
|
|
|
)
|
|
|
generated_ids = [
|
|
|
output_ids[len(input_ids):]
|
|
|
for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
|
|
|
]
|
|
|
response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
|
|
|
return response
|
|
|
```
|
|
|
|
|
|
|
|
|
|
|
|
### Example Usage
|
|
|
```python
|
|
|
|
|
|
# Production-ready system message for SQL generation
|
|
|
system_message = (
|
|
|
"You are a highly advanced Arabic text-to-SQL converter. Your mission is to Understand first the db schema and reltions between it and then accurately transform Arabic "
|
|
|
"natural language queries into SQL queries with precision and clarity.\n"
|
|
|
)
|
|
|
|
|
|
def get_sql_query(db_schema, arabic_query):
|
|
|
# Construct the instruction message including the DB schema and the Arabic query
|
|
|
instruction_message = "\n".join([
|
|
|
"## DB-Schema:",
|
|
|
db_schema,
|
|
|
"",
|
|
|
"## User-Prompt:",
|
|
|
arabic_query,
|
|
|
"# Output SQL:",
|
|
|
"```SQL"
|
|
|
])
|
|
|
|
|
|
messages = [
|
|
|
{"role": "system", "content": system_message},
|
|
|
{"role": "user", "content": instruction_message}
|
|
|
]
|
|
|
|
|
|
response = generate_resp(messages)
|
|
|
|
|
|
# Extract the SQL query from the response using a regex to capture text within the ```sql markdown block
|
|
|
match = re.search(r"```sql\s*(.*?)\s*```", response, re.DOTALL | re.IGNORECASE)
|
|
|
if match:
|
|
|
sql_query = match.group(1).strip()
|
|
|
return sql_query
|
|
|
else:
|
|
|
return response.strip()
|
|
|
|
|
|
# Example usage:
|
|
|
example_db_schema = r'''{
|
|
|
'Pharmcy':
|
|
|
CREATE TABLE `purchase` (
|
|
|
`BARCODE` varchar(20) NOT NULL,
|
|
|
`NAME` varchar(50) NOT NULL,
|
|
|
`TYPE` varchar(20) NOT NULL,
|
|
|
`COMPANY_NAME` varchar(20) NOT NULL,
|
|
|
`QUANTITY` int NOT NULL,
|
|
|
`PRICE` double NOT NULL,
|
|
|
`AMOUNT` double NOT NULL,
|
|
|
PRIMARY KEY (`BARCODE`),
|
|
|
KEY `fkr3` (`COMPANY_NAME`),
|
|
|
CONSTRAINT `fkr3` FOREIGN KEY (`COMPANY_NAME`) REFERENCES `company` (`NAME`) ON DELETE CASCADE ON UPDATE CASCADE
|
|
|
) ENGINE=InnoDB DEFAULT CHARSET=latin1
|
|
|
|
|
|
CREATE TABLE `sales` (
|
|
|
`BARCODE` varchar(20) NOT NULL,
|
|
|
`NAME` varchar(50) NOT NULL,
|
|
|
`TYPE` varchar(10) NOT NULL,
|
|
|
`DOSE` varchar(10) NOT NULL,
|
|
|
`QUANTITY` int NOT NULL,
|
|
|
`PRICE` double NOT NULL,
|
|
|
`AMOUNT` double NOT NULL,
|
|
|
`DATE` varchar(15) NOT NULL
|
|
|
) ENGINE=InnoDB DEFAULT CHARSET=latin1
|
|
|
|
|
|
CREATE TABLE `users` (
|
|
|
`ID` int NOT NULL,
|
|
|
`NAME` varchar(50) NOT NULL,
|
|
|
`DOB` varchar(20) NOT NULL,
|
|
|
`ADDRESS` varchar(100) NOT NULL,
|
|
|
`PHONE` varchar(20) NOT NULL,
|
|
|
`SALARY` double NOT NULL,
|
|
|
`PASSWORD` varchar(20) NOT NULL,
|
|
|
PRIMARY KEY (`ID`)
|
|
|
) ENGINE=InnoDB DEFAULT CHARSET=latin1
|
|
|
|
|
|
CREATE TABLE `history_sales` (
|
|
|
`USER_NAME` varchar(20) NOT NULL,
|
|
|
`BARCODE` varchar(20) NOT NULL,
|
|
|
`NAME` varchar(50) NOT NULL,
|
|
|
`TYPE` varchar(10) NOT NULL,
|
|
|
`DOSE` varchar(10) NOT NULL,
|
|
|
`QUANTITY` int NOT NULL,
|
|
|
`PRICE` double NOT NULL,
|
|
|
`AMOUNT` double NOT NULL,
|
|
|
`DATE` varchar(15) NOT NULL,
|
|
|
`TIME` varchar(20) NOT NULL
|
|
|
) ENGINE=InnoDB DEFAULT CHARSET=latin1
|
|
|
|
|
|
CREATE TABLE `expiry` (
|
|
|
`PRODUCT_NAME` varchar(50) NOT NULL,
|
|
|
`PRODUCT_CODE` varchar(20) NOT NULL,
|
|
|
`DATE_OF_EXPIRY` varchar(10) NOT NULL,
|
|
|
`QUANTITY_REMAIN` int NOT NULL
|
|
|
) ENGINE=InnoDB DEFAULT CHARSET=latin1
|
|
|
|
|
|
CREATE TABLE `drugs` (
|
|
|
`NAME` varchar(50) NOT NULL,
|
|
|
`TYPE` varchar(20) NOT NULL,
|
|
|
`BARCODE` varchar(20) NOT NULL,
|
|
|
`DOSE` varchar(10) NOT NULL,
|
|
|
`CODE` varchar(10) NOT NULL,
|
|
|
`COST_PRICE` double NOT NULL,
|
|
|
`SELLING_PRICE` double NOT NULL,
|
|
|
`EXPIRY` varchar(20) NOT NULL,
|
|
|
`COMPANY_NAME` varchar(50) NOT NULL,
|
|
|
`PRODUCTION_DATE` date NOT NULL,
|
|
|
`EXPIRATION_DATE` date NOT NULL,
|
|
|
`PLACE` varchar(20) NOT NULL,
|
|
|
`QUANTITY` int NOT NULL,
|
|
|
PRIMARY KEY (`BARCODE`)
|
|
|
) ENGINE=InnoDB DEFAULT CHARSET=latin1
|
|
|
|
|
|
CREATE TABLE `company` (
|
|
|
`NAME` varchar(50) NOT NULL,
|
|
|
`ADDRESS` varchar(50) NOT NULL,
|
|
|
`PHONE` varchar(20) NOT NULL,
|
|
|
PRIMARY KEY (`NAME`)
|
|
|
) ENGINE=InnoDB DEFAULT CHARSET=latin1
|
|
|
|
|
|
Answer the following questions about this schema:
|
|
|
}'''
|
|
|
|
|
|
example_arabic_query = "اريد الباركود الخاص بدواء يبداء اسمه بحرف 's'"
|
|
|
|
|
|
sql_result = get_sql_query(example_db_schema, example_arabic_query)
|
|
|
print("استعلام SQL الناتج:")
|
|
|
print(sql_result)
|
|
|
```
|
|
|
|
|
|
## Training Details
|
|
|
|
|
|
### Training Data
|
|
|
|
|
|
- Dataset: **Spider (translated into Arabic)**
|
|
|
- Preprocessing: Questions converted to Arabic while keeping SQL queries unchanged.
|
|
|
- Training format:
|
|
|
- System instruction guiding Arabic-to-SQL conversion.
|
|
|
- Database schema provided for context.
|
|
|
- Arabic user queries mapped to correct SQL output.
|
|
|
- Output is strictly formatted SQL queries enclosed in markdown code blocks.
|
|
|
|
|
|
### Training Procedure
|
|
|
|
|
|
#### Training Hyperparameters
|
|
|
|
|
|
- **Batch size:** 1 (per device)
|
|
|
- **Gradient accumulation:** 4 steps
|
|
|
- **Learning rate:** 1.0e-4
|
|
|
- **Epochs:** 3
|
|
|
- **Scheduler:** Cosine
|
|
|
- **Warmup ratio:** 0.1
|
|
|
- **Precision:** bf16
|
|
|
|
|
|
#### Speeds, Sizes, Times
|
|
|
|
|
|
- **Training time:** 15 hours on **NVIDIA P100 8GB**
|
|
|
- **Checkpointing every:** 500 steps
|
|
|
|
|
|
## Evaluation
|
|
|
|
|
|
### Testing Data
|
|
|
|
|
|
- Validation dataset: Spider validation set (translated to Arabic)
|
|
|
|
|
|
### Metrics
|
|
|
|
|
|
- Exact Match (EM) for SQL correctness
|
|
|
- Execution Accuracy (EX) on databases
|
|
|
|
|
|
### Results
|
|
|
|
|
|
- Model achieved **competitive SQL generation accuracy** for Arabic queries.
|
|
|
- Further testing required for robustness.
|
|
|
|
|
|
## Environmental Impact
|
|
|
|
|
|
- **Hardware Type:** NVIDIA Tesla P100 8GB
|
|
|
- **Hours used:** 15
|
|
|
- **Cloud Provider:** Kaggle
|
|
|
- **Carbon Emitted:** Estimated using [ML Impact Calculator](https://mlco2.github.io/impact#compute)
|
|
|
|
|
|
## Technical Specifications
|
|
|
|
|
|
### Model Architecture and Objective
|
|
|
|
|
|
- Transformer-based **Qwen2.5-1.5B** architecture.
|
|
|
- Fine-tuned for Text-to-SQL task using LoRA.
|
|
|
|
|
|
### Compute Infrastructure
|
|
|
|
|
|
- **Hardware:** Kaggle P100 GPU (8GB VRAM)
|
|
|
- **Software:** Python, Transformers, LLaMA-Factory, Hugging Face Hub
|
|
|
|
|
|
## Citation
|
|
|
|
|
|
If you use this model, please cite:
|
|
|
|
|
|
```bibtex
|
|
|
@misc{OsamaMo_ArabicSQL,
|
|
|
author = {Osama Mohamed},
|
|
|
title = {Arabic Text-To-SQL Model},
|
|
|
year = {2024},
|
|
|
howpublished = {\url{https://huggingface.co/OsamaMo/Arabic_Text-To-SQL}}
|
|
|
}
|
|
|
```
|
|
|
|
|
|
## Model Card Contact
|
|
|
|
|
|
For questions, contact **Osama Mohamed** via Hugging Face ([OsamaMo](https://huggingface.co/OsamaMo)).
|
|
|
|
|
|
|