---
license: mit
language:
- tr
- en
library_name: transformers
tags:
- kubernetes
- devops
- quantized
- 4bit
- gemma3
- bitsandbytes
base_model: aciklab/kubernetes-ai
model_type: gemma3
quantized_by: aciklab
---
# Kubernetes AI - 4bit Safetensors
Fine-tuned Gemma 3 12B model specialized for answering Kubernetes questions in Turkish, quantized to 4bit format for efficient inference with reduced memory footprint.
## Model Description
This repository contains a 4bit quantized version of the Kubernetes AI model, optimized for running on consumer hardware with reduced VRAM/RAM requirements. The model uses BitsAndBytes quantization with safetensors format for fast loading and efficient inference.
**Primary Purpose:** Answer Kubernetes-related questions in Turkish with minimal hardware requirements.
## Model Specifications
| Specification | Details |
|---------------|---------|
| **Format** | Safetensors (4bit quantized) |
| **Base Model** | unsloth/gemma-3-12b-it-qat-bnb-4bit |
| **Quantization** | 4bit (BitsAndBytes) |
| **Model Size** | ~7.2 GB |
| **Memory Usage** | ~8-10 GB VRAM/RAM |
| **Precision** | 4bit weights, FP16 compute |
## Quick Start
### Installation
```bash
# Install required packages
pip install torch transformers accelerate bitsandbytes safetensors
```
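Optionally, you can verify that PyTorch sees a CUDA GPU before loading the model; the BitsAndBytes 4bit kernels target CUDA, so this quick check catches environment problems early:

```python
import torch

# BitsAndBytes 4bit inference expects a CUDA-capable GPU
print(torch.__version__, torch.cuda.is_available())
```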
### Basic Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Load model and tokenizer; the quantization settings are stored with the
# checkpoint, so no extra BitsAndBytes configuration is needed here
model_name = "aciklab/kubernetes-ai-4bit"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    trust_remote_code=True
)

# Prepare input (Turkish: "How do I create a deployment with 3 replicas in Kubernetes?")
prompt = "Kubernetes'te 3 replikaya sahip bir deployment nasıl oluştururum?"

# Format with the chat template; the system message (in Turkish) instructs the
# model to act as a Kubernetes expert assistant that answers in Turkish
messages = [
    {"role": "system", "content": "Sen Kubernetes konusunda uzmanlaşmış bir yapay zeka asistanısın. Kubernetes ile ilgili soruları Türkçe olarak yanıtlıyorsun."},
    {"role": "user", "content": prompt}
]
input_text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(input_text, return_tensors="pt").to(model.device)

# Generate response
outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    temperature=1.0,
    top_p=0.95,
    top_k=64,
    repetition_penalty=1.05,
    do_sample=True
)

# Decode only the newly generated tokens, skipping the echoed prompt
response = tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True)
print(response)
```
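For interactive use, you can optionally stream the response as it is generated with Transformers' `TextStreamer`; a minimal sketch reusing `model`, `tokenizer`, and `inputs` from the snippet above:

```python
from transformers import TextStreamer

# Print decoded tokens to stdout as they are produced; skip_prompt avoids
# re-printing the input prompt
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
model.generate(
    **inputs,
    max_new_tokens=512,
    temperature=1.0,
    top_p=0.95,
    top_k=64,
    do_sample=True,
    streamer=streamer
)
```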
## Training Details
This model was created by merging the [aciklab/kubernetes-ai](https://huggingface.co/aciklab/kubernetes-ai) LoRA adapters into the base model:
- **Base Model:** unsloth/gemma-3-12b-it-qat-bnb-4bit
- **Training Method:** LoRA (Low-Rank Adaptation)
- **LoRA Rank:** 8
- **Target Modules:** q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
- **Training Dataset:** ~160,210 examples from Kubernetes docs, Stack Overflow, and DevOps datasets
- **Training Time:** 28 hours on NVIDIA RTX 5070 12GB
- **Max Sequence Length:** 1024 tokens
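For reference, the LoRA settings listed above correspond roughly to the following PEFT configuration. This is a sketch, not the original training script; `lora_alpha` and `lora_dropout` are illustrative assumptions, as the card does not state them:

```python
from peft import LoraConfig

# LoRA settings matching the card above
lora_config = LoraConfig(
    r=8,  # LoRA rank from the card
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_alpha=16,     # assumption: not stated in the card
    lora_dropout=0.0,  # assumption: not stated in the card
    task_type="CAUSAL_LM",
)
```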
### Training Dataset Summary
| Dataset Category | Count | Description |
|-----------------|-------|-------------|
| **Kubernetes Official Docs** | 8,910 | Concepts, kubectl, setup, tasks, tutorials |
| **Stack Overflow** | 52,000 | Kubernetes Q&A from community |
| **DevOps Datasets** | 62,500 | General DevOps and Kubernetes content |
| **Configurations & CLI** | 36,800 | Kubernetes configs, kubectl examples, operators |
| **Total** | **~160,210** | Comprehensive Kubernetes knowledge base |
## Quantization Details
This model uses 4bit quantization with BitsAndBytes for optimal memory efficiency:
- **Source:** Merged LoRA adapters with base model
- **Quantization Method:** BitsAndBytes 4bit (NF4)
- **Compute Precision:** FP16
- **Format:** Safetensors (fast loading)
- **Memory Footprint:** ~7.2 GB on disk, ~8-10 GB in memory
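The equivalent BitsAndBytes setup looks roughly like this. It is a sketch of how a merged FP16 checkpoint could be quantized and serialized, not the exact export script; `merged-model-path` is a placeholder:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# NF4 4bit weights with FP16 compute, as described above
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)
model = AutoModelForCausalLM.from_pretrained(
    "merged-model-path",  # placeholder: merged base + LoRA checkpoint
    quantization_config=bnb_config,
    device_map="auto",
)
# Serializing 4bit weights to safetensors requires recent
# transformers/bitsandbytes versions
model.save_pretrained("kubernetes-ai-4bit")
```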
### Advantages of 4bit Format
- **Efficient Memory Usage:** Runs on GPUs with 8GB+ VRAM
- **Fast Loading:** Safetensors format loads quickly
- **Good Quality:** Minimal accuracy loss compared to full precision
- **Framework Support:** Compatible with Transformers, vLLM, Text Generation Inference
- **Flexible Deployment:** Can run on CPU with acceptable speed
## Hardware Requirements
### Minimum (GPU)
- **GPU:** 8GB VRAM
- **RAM:** 8GB system memory
- **Storage:** 10GB free space
### Recommended
- **GPU:** 12GB+ VRAM
- **RAM:** 16GB system memory
- **Storage:** 15GB free space
## Limitations
- **Language:** Optimized primarily for Turkish and English
- **Domain:** Specialized for Kubernetes; may not perform well on general topics
- **Quantization:** 4bit quantization may occasionally affect response quality on complex queries
## License
This model is released under the **MIT License**. Free to use in commercial and open-source projects.
## Citation
If you use this model in your research or applications, please cite:
```bibtex
@misc{kubernetes-ai-4bit,
  author       = {HAVELSAN/Açıklab},
  title        = {Kubernetes AI - 4bit Safetensors},
  year         = {2025},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/aciklab/kubernetes-ai-4bit}}
}
```
## Contact
**Produced by:** HAVELSAN/Açıklab
For questions, feedback, or issues, please open an issue on the model repository or contact us through Hugging Face.
## Related Models
- [aciklab/kubernetes-ai](https://huggingface.co/aciklab/kubernetes-ai) - Original LoRA adapters
- [aciklab/kubernetes-ai-GGUF](https://huggingface.co/aciklab/kubernetes-ai-GGUF) - GGUF quantized versions for llama.cpp
---
**Note:** This is a 4bit quantized model ready for immediate use with the Transformers library. No additional model merging or quantization required.