---
language: en
license: apache-2.0
base_model: TinyLlama/TinyLlama-1.1B-Chat-v1.0
tags:
- data-engineering
- fine-tuned
- qlora
- llm
---

# TinyLlama Data Engineering Assistant

A TinyLlama-1.1B model fine-tuned on data engineering Q&A pairs using QLoRA.
Compared with the base model, it gives more focused, on-topic answers to
questions about common data engineering concepts.

## Base model
[TinyLlama/TinyLlama-1.1B-Chat-v1.0](https://huggingface.co/TinyLlama/TinyLlama-1.1B-Chat-v1.0)

## Training
- **Method:** QLoRA (4-bit quantization + LoRA)
- **Dataset:** 15 custom data engineering Q&A pairs
- **Epochs:** 10
- **LoRA rank:** 16
- **Hardware:** NVIDIA T4 (Google Colab free tier)
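The setup above can be sketched with `transformers`, `bitsandbytes`, and `peft`. Only the method, base model, and LoRA rank come from this card; the remaining hyperparameters (`lora_alpha`, dropout, target modules) are illustrative assumptions, not the exact training config:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# 4-bit NF4 quantization -- the "Q" in QLoRA
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

model = AutoModelForCausalLM.from_pretrained(
    "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)

# LoRA adapters: rank 16 as listed above; alpha/dropout/targets are assumed
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
```

Only the small LoRA adapter matrices are trained; the quantized base weights stay frozen, which is what lets a 1.1B model fit on a free-tier T4.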

## Topics covered
ETL, data warehouses, data lakes, Apache Spark, dbt, Apache Airflow,
DAGs, batch vs stream processing, data pipelines, partitioning,
data lineage, medallion architecture, idempotency, BigQuery,
dimensional modeling, RAG

## How to use
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_name = "cyb3rr31a/tinyllama-data-engineering"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)

# Use the same question/answer template the model was fine-tuned on
prompt = "### Question:\nWhat is dbt?\n\n### Answer:\n"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
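Because the model expects the `### Question:` / `### Answer:` template shown above, a small helper (hypothetical, not shipped with the repo) keeps prompts consistent:

```python
def build_prompt(question: str) -> str:
    """Format a question in the template the model was fine-tuned on."""
    return f"### Question:\n{question.strip()}\n\n### Answer:\n"

# Example: pass the result to tokenizer(...) as in the snippet above
print(build_prompt("What is a DAG in Airflow?"))
```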

## Limitations
This model was fine-tuned on a small dataset of 15 examples for demonstration
purposes. It performs best on the topics covered in the training data.