---
language: en
license: apache-2.0
base_model: TinyLlama/TinyLlama-1.1B-Chat-v1.0
tags:
- data-engineering
- fine-tuned
- qlora
- llm
---

# TinyLlama Data Engineering Assistant

A TinyLlama-1.1B model fine-tuned on data engineering Q&A pairs using QLoRA. It answers questions about data engineering concepts more accurately than the base model.

## Base model

[TinyLlama/TinyLlama-1.1B-Chat-v1.0](https://huggingface.co/TinyLlama/TinyLlama-1.1B-Chat-v1.0)

## Training

- **Method:** QLoRA (4-bit quantization + LoRA)
- **Dataset:** 15 custom data engineering Q&A pairs
- **Epochs:** 10
- **LoRA rank:** 16
- **Hardware:** NVIDIA T4 (Google Colab free tier)

## Topics covered

ETL, data warehouses, data lakes, Apache Spark, dbt, Apache Airflow, DAGs, batch vs. stream processing, data pipelines, partitioning, data lineage, medallion architecture, idempotency, BigQuery, dimensional modeling, RAG

## How to use

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_name = "cyb3rr31a/tinyllama-data-engineering"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)

# Use the same "### Question / ### Answer" template the model was trained on.
prompt = "### Question:\nWhat is dbt?\n\n### Answer:\n"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

## Limitations

This model was fine-tuned on a small dataset of 15 examples for demonstration purposes. It performs best on the topics covered in the training data.
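## Training configuration sketch

The training details above (4-bit quantized base model plus rank-16 LoRA adapters) can be sketched with `transformers` and `peft`. Only the method and the rank are stated in this card; the quantization type, `lora_alpha`, dropout, and target modules below are illustrative assumptions, not the recorded hyperparameters.

```python
import torch
from transformers import BitsAndBytesConfig
from peft import LoraConfig

# QLoRA: the frozen base model is loaded in 4-bit precision.
# nf4 quantization with fp16 compute is a common default -- an assumption here.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

# Rank-16 LoRA, as stated under Training; alpha, dropout, and the
# target modules are illustrative assumptions.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
```

These two configs would be passed to `AutoModelForCausalLM.from_pretrained` (as `quantization_config`) and to `peft.get_peft_model`, respectively, before running a standard `Trainer` loop.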
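## Prompt template helpers

The usage example above hard-codes the `### Question / ### Answer` template the model was fine-tuned on. Two small helpers (hypothetical, not part of the released repo) make the template explicit and strip the echoed prompt from the generated text:

```python
def format_prompt(question: str) -> str:
    """Build a prompt matching the fine-tuning template."""
    return f"### Question:\n{question}\n\n### Answer:\n"

def extract_answer(generated: str) -> str:
    """Return only the text after the '### Answer:' marker."""
    return generated.split("### Answer:\n", 1)[-1].strip()

prompt = format_prompt("What is dbt?")
```

`prompt` can be passed to the tokenizer exactly as in the usage example, and `extract_answer` applied to the decoded output.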