Komma-LuisMiSanVe committed on
Commit 108741c · 1 Parent(s): 4fe838d

Upload 4 files

Files changed (5)
  1. .gitattributes +1 -0
  2. README.es.md +68 -0
  3. README.md +68 -3
  4. train.json +3 -0
  5. trainer.py +92 -0
.gitattributes CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
  *.zip filter=lfs diff=lfs merge=lfs -text
  *.zst filter=lfs diff=lfs merge=lfs -text
  *tfevents* filter=lfs diff=lfs merge=lfs -text
+ train.json filter=lfs diff=lfs merge=lfs -text
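
To see which files the updated `.gitattributes` routes through Git LFS, the patterns in this hunk can be approximated in Python. This is only a rough sketch: `fnmatchcase` approximates gitattributes matching well enough for these simple patterns but is not a full implementation, and the name `is_lfs_tracked` is hypothetical.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath

# Patterns visible in this hunk only; the real file has more entries above line 33.
LFS_PATTERNS = ["*.zip", "*.zst", "*tfevents*", "train.json"]

def is_lfs_tracked(path: str) -> bool:
    """Return True if the file's base name matches one of the LFS patterns.

    Patterns without a slash are compared against the base name, which
    mirrors how gitattributes applies them at any directory depth.
    """
    name = PurePosixPath(path).name
    return any(fnmatchcase(name, pattern) for pattern in LFS_PATTERNS)

print(is_lfs_tracked("train.json"))   # True
print(is_lfs_tracked("trainer.py"))   # False
```

Under these assumptions, only `train.json` among the files in this commit is stored as an LFS pointer; the READMEs and `trainer.py` stay as regular Git blobs.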
README.es.md ADDED
@@ -0,0 +1,68 @@
+ > [See in English](https://huggingface.co/Komma-LuisMiSanVe/LangToSQL/blob/main/README.md)
+
+ <img src="https://raw.githubusercontent.com/LuisMiSanVe/LuisMiSanVe/refs/heads/main/Resources/LangToSQL/LangToSQLLLM_banner.png" style="width: 100%; height: auto;" alt="LangToSQL LLM Banner">
+
+ # 🤖 AI Model for PostgreSQL statements
+ [![image](https://img.shields.io/badge/postgres-%23316192.svg?style=for-the-badge&logo=postgresql&logoColor=white)](https://www.postgresql.org/)
+ [![image](https://img.shields.io/badge/json-5E5C5C?style=for-the-badge&logo=json&logoColor=white)](https://www.newtonsoft.com/json)
+ [![image](https://img.shields.io/badge/Visual_Studio_Code-0078D4?style=for-the-badge&logo=visual%20studio%20code&logoColor=white)](https://code.visualstudio.com/)
+ [![Python](https://img.shields.io/badge/python-3670A0?style=for-the-badge&logo=python&logoColor=ffdd54)](https://www.python.org/)
+ [![PyTorch](https://img.shields.io/badge/PyTorch-%23EE4C2C.svg?style=for-the-badge&logo=PyTorch&logoColor=white)](https://pytorch.org/)
+ [![NumPy](https://img.shields.io/badge/numpy-%23013243.svg?style=for-the-badge&logo=numpy&logoColor=white)](https://numpy.org/)
+ [![HuggingFace](https://img.shields.io/badge/Hugging%20Face-%23000040.svg?style=for-the-badge&logo=Hugging%20Face&logoColor=ffdf00)](https://huggingface.co/Komma-LuisMiSanVe)
+
+ >[!NOTE]
+ > Take a look at the other versions of the program:
+ >- [WinForms](https://github.com/LuisMiSanVe/LangToSQL/tree/main)
+ >- [REST API](https://github.com/LuisMiSanVe/LangToSQL_API/tree/main)
+ >- [ChatBot](https://github.com/LuisMiSanVe/LangToSQL_ChatBot/tree/main)
+ >- [NuGet](https://github.com/LuisMiSanVe/LangToSQL_NuGet/tree/main)
+ >- [Android](https://github.com/LuisMiSanVe/GeminiLiteSQL/tree/main)
+
+ The AI model has been trained to convert natural language into PostgreSQL statements.
+
+ ## 📝 Technology Explanation
+ The model uses [DeepSeek Coder](https://huggingface.co/deepseek-ai/deepseek-coder-1.3b-base) as its base and is fine-tuned on the [Spider](https://yale-lily.github.io/spider) datasets.
+
+ The `JSON` dataset file contains `train_spider.json` from **Spider**, since it is the main dataset.
+
+ The model can be exported to `GGUF` with [llama.cpp](https://github.com/ggml-org/llama.cpp) so that you can use it in programs like [LM Studio](https://lmstudio.ai/).
+
+ ## 🛠️ Installation
+ To run the training script yourself, first install [Python](https://www.python.org/) and then run this command:
+ ```
+ pip install transformers datasets peft accelerate bitsandbytes trl
+ ```
+ Depending on your Python installation (for example, on Windows with the `py` launcher), you may need to use this one instead:
+ ```
+ py -m pip install transformers datasets peft accelerate bitsandbytes trl
+ ```
+
+ ## 📂 Files
+ This repository includes the files of the trained LLM, its training script and the training dataset.
+
+ You can download the final `GGUF` from the [Releases](https://github.com/LuisMiSanVe/LangToSQL_LLM/releases).
+
+ ## 🚀 Releases
+ A version will be released only when the following points are met:\
+ Important new features and critical bug fixes will trigger the immediate release of a new version, while other minor changes or fixes must wait one week from when they were added to the repository before being included in the new version, so that other possible changes can be added as well.
+ >[!NOTE]
+ >These possible new changes will not extend the wait for the new version beyond one week.
+
+ The version number will follow this format: \
+ \[Major Addition\].\[Minor Addition\].\[Bug Fixes\]
+
+ ## 💻 Technologies Used
+ - Programming Language: [Python](https://www.python.org/)
+ - Libraries:
+   - [transformers](https://pypi.org/project/transformers/)
+   - [datasets](https://pypi.org/project/datasets/)
+   - [peft](https://pypi.org/project/peft/)
+   - [accelerate](https://pypi.org/project/accelerate/)
+   - [bitsandbytes](https://pypi.org/project/bitsandbytes/)
+   - [trl](https://pypi.org/project/trl/)
+ - Other:
+   - [llama.cpp](https://github.com/ggml-org/llama.cpp)
+   - [DeepSeek Coder](https://huggingface.co/deepseek-ai/deepseek-coder-1.3b-base)
+   - [Spider](https://yale-lily.github.io/spider)
+ - Recommended IDE: [VS Code](https://code.visualstudio.com/)
README.md CHANGED
@@ -1,3 +1,68 @@
- ---
- license: apache-2.0
- ---
+ > [See in Spanish/Ver en español](https://huggingface.co/Komma-LuisMiSanVe/LangToSQL/blob/main/README.es.md)
+
+ <img src="https://raw.githubusercontent.com/LuisMiSanVe/LuisMiSanVe/refs/heads/main/Resources/LangToSQL/LangToSQLLLM_banner.png" style="width: 100%; height: auto;" alt="LangToSQL LLM Banner">
+
+ # 🤖 AI Model for PostgreSQL queries
+ [![image](https://img.shields.io/badge/postgres-%23316192.svg?style=for-the-badge&logo=postgresql&logoColor=white)](https://www.postgresql.org/)
+ [![image](https://img.shields.io/badge/json-5E5C5C?style=for-the-badge&logo=json&logoColor=white)](https://www.newtonsoft.com/json)
+ [![image](https://img.shields.io/badge/Visual_Studio_Code-0078D4?style=for-the-badge&logo=visual%20studio%20code&logoColor=white)](https://code.visualstudio.com/)
+ [![Python](https://img.shields.io/badge/python-3670A0?style=for-the-badge&logo=python&logoColor=ffdd54)](https://www.python.org/)
+ [![PyTorch](https://img.shields.io/badge/PyTorch-%23EE4C2C.svg?style=for-the-badge&logo=PyTorch&logoColor=white)](https://pytorch.org/)
+ [![NumPy](https://img.shields.io/badge/numpy-%23013243.svg?style=for-the-badge&logo=numpy&logoColor=white)](https://numpy.org/)
+ [![HuggingFace](https://img.shields.io/badge/Hugging%20Face-%23000040.svg?style=for-the-badge&logo=Hugging%20Face&logoColor=ffdf00)](https://huggingface.co/Komma-LuisMiSanVe)
+
+ >[!NOTE]
+ > Check out other versions of this program:
+ >- [WinForms](https://github.com/LuisMiSanVe/LangToSQL/tree/main)
+ >- [REST API](https://github.com/LuisMiSanVe/LangToSQL_API/tree/main)
+ >- [ChatBot](https://github.com/LuisMiSanVe/LangToSQL_ChatBot/tree/main)
+ >- [NuGet](https://github.com/LuisMiSanVe/LangToSQL_NuGet/tree/main)
+ >- [Android](https://github.com/LuisMiSanVe/GeminiLiteSQL/tree/main)
+
+ This AI model has been trained to turn natural language into PostgreSQL queries.
+
+ ## 📝 Technology Explanation
+ This model uses [DeepSeek Coder](https://huggingface.co/deepseek-ai/deepseek-coder-1.3b-base) as its base and is fine-tuned on the [Spider](https://yale-lily.github.io/spider) datasets.
+
+ The `JSON` dataset file contains **Spider**'s `train_spider.json`, as it is the main dataset.
+
+ The model can be exported to `GGUF` with [llama.cpp](https://github.com/ggml-org/llama.cpp) so it can be used by programs like [LM Studio](https://lmstudio.ai/).
+
+ ## 🛠️ Setup
+ To run the training script yourself, first install [Python](https://www.python.org/) and then run this command:
+ ```
+ pip install transformers datasets peft accelerate bitsandbytes trl
+ ```
+ Depending on your Python installation (for example, on Windows with the `py` launcher), you may have to use this instead:
+ ```
+ py -m pip install transformers datasets peft accelerate bitsandbytes trl
+ ```
+
+ ## 📂 Files
+ This repository includes the trained LLM's files, its training script and the training dataset.
+
+ You can download the final `GGUF` from the [Releases](https://github.com/LuisMiSanVe/LangToSQL_LLM/releases).
+
+ ## 🚀 Releases
+ Versions are released according to the following policy:\
+ New major features and critical bug fixes trigger the immediate release of a new version, while other minor changes or fixes wait one week from the time they are introduced in the repository before being included in a new version, so that other potential changes can be added.
+ >[!NOTE]
+ >These potential new changes will not increase the wait time for the new version beyond one week.
+
+ The version number will follow this format: \
+ \[Major Feature\].\[Minor Feature\].\[Bug Fixes\]
+
+ ## 💻 Technologies Used
+ - Programming Language: [Python](https://www.python.org/)
+ - Libraries:
+   - [transformers](https://pypi.org/project/transformers/)
+   - [datasets](https://pypi.org/project/datasets/)
+   - [peft](https://pypi.org/project/peft/)
+   - [accelerate](https://pypi.org/project/accelerate/)
+   - [bitsandbytes](https://pypi.org/project/bitsandbytes/)
+   - [trl](https://pypi.org/project/trl/)
+ - Other:
+   - [llama.cpp](https://github.com/ggml-org/llama.cpp)
+   - [DeepSeek Coder](https://huggingface.co/deepseek-ai/deepseek-coder-1.3b-base)
+   - [Spider](https://yale-lily.github.io/spider)
+ - Recommended IDE: [VS Code](https://code.visualstudio.com/)
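
The release policy in the README's Releases section (a three-part version number, immediate releases for major features and critical fixes, and a one-week hold for everything else) can be sketched in Python. The function names `bump` and `release_date` are hypothetical, and resetting the lower fields to zero on a bump is an assumption borrowed from semantic versioning; the README does not state it.

```python
from datetime import date, timedelta

def bump(version: str, change: str) -> str:
    """Bump a "major.minor.fix" version string.

    `change` is "major", "minor" or "fix". Resetting the lower fields
    to zero is an assumption, not something the README specifies.
    """
    major, minor, fix = (int(part) for part in version.split("."))
    if change == "major":
        return f"{major + 1}.0.0"
    if change == "minor":
        return f"{major}.{minor + 1}.0"
    if change == "fix":
        return f"{major}.{minor}.{fix + 1}"
    raise ValueError(f"unknown change type: {change}")

def release_date(change_date: date, immediate: bool) -> date:
    """Date on which the next version ships.

    Major features and critical fixes ship on the day they land.
    Otherwise the release waits one week from the first pending change,
    and later changes do not extend that wait.
    """
    return change_date if immediate else change_date + timedelta(weeks=1)

print(bump("1.2.3", "minor"))                  # 1.3.0
print(release_date(date(2025, 1, 6), False))   # 2025-01-13
```

For held releases, `change_date` is the date of the first pending minor change, which is what keeps the wait capped at one week.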
train.json ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:c43d0d72e59e1a9e1a60837da9bf70d5a6277226bdb7f634d544f380646f527a
+ size 24928884
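
The three added lines are a Git LFS pointer, not the dataset itself: the repository stores only the spec version, a content hash, and the byte size, while the real `train.json` lives in LFS storage. A minimal sketch of reading such a pointer follows; the helper name `parse_lfs_pointer` is hypothetical.

```python
def parse_lfs_pointer(text: str) -> dict:
    """Parse a Git LFS pointer file into its fields."""
    fields = {}
    for line in text.strip().splitlines():
        # Each line is "<key> <value>", separated by the first space.
        key, _, value = line.partition(" ")
        fields[key] = value
    # "oid" has the form "<hash-algorithm>:<hex digest>".
    algorithm, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "hash_algorithm": algorithm,
        "digest": digest,
        "size_bytes": int(fields["size"]),
    }

POINTER = """\
version https://git-lfs.github.com/spec/v1
oid sha256:c43d0d72e59e1a9e1a60837da9bf70d5a6277226bdb7f634d544f380646f527a
size 24928884
"""

info = parse_lfs_pointer(POINTER)
print(info["hash_algorithm"], info["size_bytes"])   # sha256 24928884
print(round(info["size_bytes"] / 1024**2, 1))       # 23.8 (MiB)
```

The pointer therefore describes a dataset file of roughly 23.8 MiB, which is why the commit also adds `train.json` to `.gitattributes`.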
trainer.py ADDED
@@ -0,0 +1,92 @@
+ import torch
+ from datasets import load_dataset
+ from transformers import AutoTokenizer, AutoModelForCausalLM, TrainingArguments
+ from peft import LoraConfig, PeftModel
+ from trl import SFTTrainer
+
+ model_name = "deepseek-ai/deepseek-coder-1.3b-base"
+ tokenizer = AutoTokenizer.from_pretrained(model_name)
+ # Reuse the EOS token for padding
+ tokenizer.pad_token = tokenizer.eos_token
+
+ model = AutoModelForCausalLM.from_pretrained(
+     model_name,
+     torch_dtype=torch.float32,
+     device_map={"": "cpu"}  # Trains on the CPU; change this to use a GPU instead
+ )
+
+ dataset = load_dataset("json", data_files="train.json", split="train")
+
+ def format_example(example):
+     # Map Spider's question/query fields to instruction-style columns
+     return {
+         "instruction": example["question"],
+         "input": "",
+         "output": example["query"]
+     }
+
+ dataset = dataset.map(format_example)
+
+ def tokenize(example):
+     # A causal LM needs the question and its SQL query in one sequence so that
+     # labels line up with input_ids token by token. The newline separator is a
+     # simple choice; use the same prompt format at inference time.
+     text = example["instruction"] + "\n" + example["output"] + tokenizer.eos_token
+     encoded = tokenizer(
+         text,
+         padding="max_length",
+         truncation=True,
+         max_length=512
+     )
+
+     # Use the tokenizer's own attention mask: pad_token equals eos_token here,
+     # so comparing ids against pad_token_id would also mask the real EOS.
+     # Padding positions are set to -100 so the loss ignores them.
+     labels = [
+         token_id if mask == 1 else -100
+         for token_id, mask in zip(encoded["input_ids"], encoded["attention_mask"])
+     ]
+
+     return {
+         "input_ids": encoded["input_ids"],
+         "attention_mask": encoded["attention_mask"],
+         "labels": labels
+     }
+
+ dataset = dataset.map(tokenize, batched=False)
+
+ # LoRA adapters on the attention query and value projections
+ peft_config = LoraConfig(
+     r=16,
+     lora_alpha=32,
+     target_modules=["q_proj", "v_proj"],
+     lora_dropout=0.05,
+     bias="none",
+     task_type="CAUSAL_LM"
+ )
+
+ training_args = TrainingArguments(
+     output_dir="./sql-model",
+     per_device_train_batch_size=1,
+     gradient_accumulation_steps=4,
+     learning_rate=2e-4,
+     num_train_epochs=1,  # More epochs can improve accuracy but lengthen training
+     logging_steps=10,
+     save_strategy="epoch",
+     fp16=False
+ )
+
+ trainer = SFTTrainer(
+     model=model,
+     train_dataset=dataset,
+     peft_config=peft_config,
+     args=training_args
+ )
+
+ trainer.train()
+
+ # Save the LoRA adapter and the tokenizer
+ trainer.model.save_pretrained("./sql-model")
+ tokenizer.save_pretrained("./sql-model")
+
+ # Reload the base model and merge the adapter into it for export
+ base_model = AutoModelForCausalLM.from_pretrained(
+     model_name,
+     torch_dtype=torch.float32,
+     device_map={"": "cpu"}
+ )
+ model_merged = PeftModel.from_pretrained(base_model, "./sql-model")
+ model_merged = model_merged.merge_and_unload()
+ model_merged.save_pretrained("./sql-model-merged")
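
For a sense of scale, the settings in `trainer.py` can be turned into rough numbers. The sketch below is back-of-the-envelope arithmetic, not output from the script: the hidden size (2048) and layer count (24) are assumed values for `deepseek-coder-1.3b-base`, the 7,000-example dataset size is purely illustrative, and both function names are hypothetical.

```python
import math

def lora_param_count(hidden_size: int, num_layers: int, rank: int, modules_per_layer: int) -> int:
    """Trainable LoRA weights when every target is a square hidden x hidden projection.

    Each adapted projection adds two matrices, A (rank x hidden) and
    B (hidden x rank), which is 2 * rank * hidden parameters.
    """
    return num_layers * modules_per_layer * 2 * rank * hidden_size

def optimizer_steps(num_examples: int, batch_size: int, grad_accum: int, epochs: int) -> int:
    """Approximate number of optimizer updates on a single device."""
    effective_batch = batch_size * grad_accum
    return math.ceil(num_examples / effective_batch) * epochs

# r=16 on q_proj and v_proj (2 modules per layer); model dimensions are assumed.
print(lora_param_count(hidden_size=2048, num_layers=24, rank=16, modules_per_layer=2))  # 3145728

# Batch size 1 with 4 accumulation steps gives an effective batch of 4.
print(optimizer_steps(num_examples=7000, batch_size=1, grad_accum=4, epochs=1))  # 1750
```

Under those assumptions the adapter trains about 3.1 million parameters, a small fraction of the 1.3B-parameter base model, which is what makes CPU training feasible at all, if slow.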