---
datasets:
- neo4j/text2cypher-2024v1
base_model:
- google/gemma-2-9b-it
---

# Model Card for text2cypher-gemma-2-9b-it-finetuned-2024v1 (GGUF)

## Model Details

This is the GGUF-format version of neo4j/text2cypher-gemma-2-9b-it-finetuned-2024v1.
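
As a quick start, the GGUF file can be run with any llama.cpp-compatible runtime. Below is a minimal sketch using llama-cpp-python; the `.gguf` filename is a placeholder for the file actually published in this repository, and the prompt wording is an assumption rather than the exact template used during fine-tuning.

```python
# Minimal sketch: load the GGUF file and ask for a Cypher statement.
# The filename is a placeholder; the prompt wording is an assumption, not the
# exact fine-tuning template.
from llama_cpp import Llama  # pip install llama-cpp-python

llm = Llama(
    model_path="text2cypher-gemma-2-9b-it-finetuned-2024v1.Q4_K_M.gguf",  # placeholder name
    n_ctx=4096,  # leave room for the question plus the graph schema
)

schema = "Node properties: Person {name: STRING}, Movie {title: STRING} ..."  # your schema text
question = "Which movies did Tom Hanks act in?"

out = llm.create_chat_completion(
    messages=[{
        "role": "user",
        "content": f"Generate a Cypher statement to answer the question.\nQuestion: {question}\nSchema: {schema}",
    }],
    max_tokens=256,
)
print(out["choices"][0]["message"]["content"])
```

The same file should also work with other GGUF runtimes such as the llama.cpp CLI or Ollama.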
### Model Description

This model serves as a demonstration of how fine-tuning foundational models using the Neo4j-Text2Cypher (2024) dataset (https://huggingface.co/datasets/neo4j/text2cypher-2024v1) can enhance performance on the Text2Cypher task.
Please note that this is part of ongoing research and exploration, aimed at highlighting the dataset's potential rather than providing a production-ready solution.

- Base model: google/gemma-2-9b-it
- Dataset: neo4j/text2cypher-2024v1

An overview of the fine-tuned models and the benchmarking results is shared at https://medium.com/p/d77be96ab65a and https://medium.com/p/b2203d1173b0.
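
If you want to look at the underlying training data, the dataset can be loaded directly from the Hugging Face Hub. A minimal sketch follows; the split and column names (`question`, `schema`, `cypher`) are assumptions, so verify them against the dataset card before relying on them.

```python
# Minimal sketch for inspecting the fine-tuning data.
# Split and column names are assumptions; check the dataset card if they differ.
from datasets import load_dataset

ds = load_dataset("neo4j/text2cypher-2024v1", split="train")
row = ds[0]
print(row["question"])  # natural-language question
print(row["schema"])    # graph schema provided with the question
print(row["cypher"])    # target Cypher query
```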
## Bias, Risks, and Limitations

We need to be cautious about a few risks:

- In our evaluation setup, the training and test sets come from the same data distribution (sampled from a larger dataset). If the data distribution changes, the results may not follow the same pattern.
- The datasets used were gathered from publicly available sources. Over time, foundational models may access both the training and test sets, potentially achieving similar or even better results.

## Training Details

### Training Procedure

Used RunPod with the following setup:

- 1 x A100 PCIe
- 31 vCPU, 117 GB RAM
- runpod/pytorch:2.4.0-py3.11-cuda12.4.1-devel-ubuntu22.04
- On-Demand - Secure Cloud
- 60 GB Disk
- 60 GB Pod Volume

### Training Hyperparameters

    lora_config = LoraConfig(
        r=64,
        lora_alpha=64,
        target_modules=target_modules,
        lora_dropout=0.05,
        bias="none",
        task_type="CAUSAL_LM",
    )

    sft_config = SFTConfig(
        dataset_text_field=dataset_text_field,
        per_device_train_batch_size=4,
        gradient_accumulation_steps=8,
        dataset_num_proc=16,
        max_seq_length=1600,
        logging_dir="./logs",
        num_train_epochs=1,
        learning_rate=2e-5,
        save_steps=5,
        save_total_limit=1,
        logging_steps=5,
        output_dir="outputs",
        optim="paged_adamw_8bit",
        save_strategy="steps",
    )

    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_use_double_quant=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
    )
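
For context, the following is a rough, hypothetical sketch of how configs like these can be wired into a TRL `SFTTrainer` run. It is not the authors' training script: it assumes the `lora_config`, `sft_config`, and `bnb_config` objects defined above are in scope, and the LoRA target modules, prompt rendering, and text-column name below are illustrative assumptions.

```python
# Hypothetical wiring of the configs above into a QLoRA-style SFT run; NOT the
# authors' exact script. Values marked as assumed are illustrative placeholders.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import SFTTrainer

target_modules = ["q_proj", "k_proj", "v_proj", "o_proj"]  # assumed; define before lora_config
dataset_text_field = "text"                                # assumed; define before sft_config

# Render each row into one training string; the real prompt template may differ.
raw = load_dataset("neo4j/text2cypher-2024v1", split="train")
train_dataset = raw.map(
    lambda ex: {"text": f"Question: {ex['question']}\nSchema: {ex['schema']}\nCypher: {ex['cypher']}"}
)

model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-2-9b-it",
    quantization_config=bnb_config,  # 4-bit NF4 loading from above
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("google/gemma-2-9b-it")

trainer = SFTTrainer(
    model=model,
    args=sft_config,                 # SFTConfig from above
    peft_config=lora_config,         # LoRA settings from above
    train_dataset=train_dataset,
    tokenizer=tokenizer,
)
trainer.train()
```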

### Note on creating your own schemas

In the dataset we used, the schemas are already provided. They were created either by:

- directly using the schema provided by the input data source, or
- creating the schema with the neo4j-graphrag package (see the SchemaReader.get_schema(...) function).

For your own Neo4j database, you can use the SchemaReader functions from the neo4j-graphrag package, as sketched below.
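
A minimal sketch of pulling a schema string from a live database follows. It assumes the `get_schema` helper exposed by the neo4j-graphrag package; the `SchemaReader` wrapper referenced above may organize this differently, so check the package documentation. Connection details are placeholders.

```python
# Sketch only: extract a text schema from your own Neo4j database to place in the prompt.
# Assumes neo4j_graphrag.schema.get_schema exists in your installed version; the
# SchemaReader wrapper mentioned above may expose this under a different name.
# Connection URI and credentials are placeholders.
import neo4j
from neo4j_graphrag.schema import get_schema

driver = neo4j.GraphDatabase.driver("neo4j://localhost:7687", auth=("neo4j", "password"))
try:
    schema_text = get_schema(driver)  # schema string to paste into the model prompt
    print(schema_text)
finally:
    driver.close()
```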