---
language: en
license: apache-2.0
base_model: TinyLlama/TinyLlama-1.1B-Chat-v1.0
tags:
- data-engineering
- fine-tuned
- qlora
- llm
---

# TinyLlama Data Engineering Assistant

A TinyLlama-1.1B model fine-tuned on data engineering Q&A pairs using QLoRA.
It answers questions about data engineering concepts more accurately than the base model.

## Base model
[TinyLlama/TinyLlama-1.1B-Chat-v1.0](https://huggingface.co/TinyLlama/TinyLlama-1.1B-Chat-v1.0)

## Training
- **Method:** QLoRA (4-bit quantization + LoRA)
- **Dataset:** 15 custom data engineering Q&A pairs
- **Epochs:** 10
- **LoRA rank:** 16
- **Hardware:** NVIDIA T4 (Google Colab free tier)
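For reference, a QLoRA setup matching the bullet points above can be sketched with `peft` and `bitsandbytes`. Only the LoRA rank (16) and the 4-bit method come from this card; the alpha, dropout, target modules, and quantization details below are illustrative assumptions, not the exact training configuration.

```python
# Sketch of a QLoRA setup (4-bit base model + LoRA adapters).
# Hyperparameters other than r=16 are assumptions for illustration.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

base = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"

# Load the frozen base weights in 4-bit NF4 to fit the T4's memory
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)
model = AutoModelForCausalLM.from_pretrained(base, quantization_config=bnb_config)
model = prepare_model_for_kbit_training(model)

# Attach trainable LoRA adapters; target modules are an assumption
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the adapter weights are trainable
```

The quantized base stays frozen; only the low-rank adapter matrices are updated, which is what makes training feasible on a free-tier T4.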

## Topics covered
ETL, data warehouses, data lakes, Apache Spark, dbt, Apache Airflow,
DAGs, batch vs stream processing, data pipelines, partitioning,
data lineage, medallion architecture, idempotency, BigQuery,
dimensional modeling, RAG

## How to use
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_name = "cyb3rr31a/tinyllama-data-engineering"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)

# Use the same prompt template the model was fine-tuned on
prompt = "### Question:\nWhat is dbt?\n\n### Answer:\n"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
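Since the decoded output echoes the prompt, it can help to wrap the template in a small helper and strip the prompt from the generated text. The functions below are hypothetical conveniences, not part of the released model:

```python
def build_prompt(question: str) -> str:
    """Format a question with the Q&A template the model was fine-tuned on."""
    return f"### Question:\n{question}\n\n### Answer:\n"

def extract_answer(decoded: str, prompt: str) -> str:
    """Strip the echoed prompt so only the generated answer remains."""
    if decoded.startswith(prompt):
        return decoded[len(prompt):].strip()
    return decoded.strip()
```

For example, `extract_answer(tokenizer.decode(outputs[0], skip_special_tokens=True), prompt)` returns just the answer text.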

## Limitations
This model was fine-tuned on a small dataset of 15 examples for demonstration
purposes. It performs best on the topics covered in the training data.