---
language: en
license: apache-2.0
base_model: TinyLlama/TinyLlama-1.1B-Chat-v1.0
tags:
- data-engineering
- fine-tuned
- qlora
- llm
---

# TinyLlama Data Engineering Assistant

A TinyLlama-1.1B model fine-tuned on data engineering Q&A pairs using QLoRA.
It answers questions about data engineering concepts more accurately than the base model.

## Base model
[TinyLlama/TinyLlama-1.1B-Chat-v1.0](https://huggingface.co/TinyLlama/TinyLlama-1.1B-Chat-v1.0)

## Training
- **Method:** QLoRA (4-bit quantization + LoRA)
- **Dataset:** 15 custom data engineering Q&A pairs
- **Epochs:** 10
- **LoRA rank:** 16
- **Hardware:** NVIDIA T4 (Google Colab free tier)
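For reference, a QLoRA setup matching the bullet points above can be sketched with `peft` and `bitsandbytes`. Only the LoRA rank (16) and the 4-bit method come from this card; the alpha, dropout, target modules, and quantization details below are illustrative assumptions, not the exact training configuration.

```python
# Sketch of a QLoRA setup (4-bit base model + LoRA adapters).
# Hyperparameters other than r=16 are assumptions for illustration.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

base = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"

# Load the frozen base weights in 4-bit NF4 to fit the T4's memory
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)
model = AutoModelForCausalLM.from_pretrained(base, quantization_config=bnb_config)
model = prepare_model_for_kbit_training(model)

# Attach trainable LoRA adapters; target modules are an assumption
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the adapter weights are trainable
```

The quantized base stays frozen; only the low-rank adapter matrices are updated, which is what makes training feasible on a free-tier T4.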

## Topics covered
ETL, data warehouses, data lakes, Apache Spark, dbt, Apache Airflow,
DAGs, batch vs stream processing, data pipelines, partitioning,
data lineage, medallion architecture, idempotency, BigQuery,
dimensional modeling, RAG

## How to use
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_name = "cyb3rr31a/tinyllama-data-engineering"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)

# Use the same prompt template the model was fine-tuned on
prompt = "### Question:\nWhat is dbt?\n\n### Answer:\n"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
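Since the decoded output echoes the prompt, it can help to wrap the template in a small helper and strip the prompt from the generated text. The functions below are hypothetical conveniences, not part of the released model:

```python
def build_prompt(question: str) -> str:
    """Format a question with the Q&A template the model was fine-tuned on."""
    return f"### Question:\n{question}\n\n### Answer:\n"

def extract_answer(decoded: str, prompt: str) -> str:
    """Strip the echoed prompt so only the generated answer remains."""
    if decoded.startswith(prompt):
        return decoded[len(prompt):].strip()
    return decoded.strip()
```

For example, `extract_answer(tokenizer.decode(outputs[0], skip_special_tokens=True), prompt)` returns just the answer text.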

## Limitations
This model was fine-tuned on a small dataset of 15 examples for demonstration
purposes. It performs best on the topics covered in the training data.