# Atlas-1B: Lightweight Fine-tuned LLM for Edge and Low-Memory Devices

🚀 **Atlas-1B** is a 1.2-billion-parameter model fine-tuned from **BaseLLM-1B** to deliver improved accuracy, reasoning, and efficiency on low-power inference devices (e.g., Jetson, Ryzen APUs, and mobile LLM frameworks).

This release introduces **quantization-aware fine-tuning**, **dataset specialization**, and **token-efficiency optimization**, making it a solid drop-in model for on-device AI use cases.

---
## 🧠 Model Overview

- **Base model:** BaseLLM-1B v1.3 (transformer-based, autoregressive)
- **Architecture:** Decoder-only transformer
- **Parameters:** 1.2B
- **Precision support:** FP16 / INT8 / INT4
- **Context length:** 16K tokens
- **Tokenizer:** SentencePiece (32K vocabulary)
- **Frameworks supported:** PyTorch, vLLM, and SGLang

The model is optimized for **edge inference** and **multi-request throughput**, using roughly 30% less memory bandwidth at batch size 4 than the base model.
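As a back-of-the-envelope sanity check, the weight footprint at each supported precision can be estimated from the parameter count alone. This is a rough sketch only: it ignores KV cache, activations, and framework overhead, which is why measured FP16 usage in the benchmarks below is higher than the weights-only figure.

```python
# Rough weight-only memory estimate for a 1.2B-parameter model.
# Ignores KV cache, activations, and framework overhead, so real
# end-to-end usage will be noticeably higher.

BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}

def weight_footprint_gb(n_params: float, precision: str) -> float:
    """Return the approximate weight memory in GiB for a given precision."""
    return n_params * BYTES_PER_PARAM[precision] / 1024**3

for prec in ("fp16", "int8", "int4"):
    print(f"{prec}: {weight_footprint_gb(1.2e9, prec):.2f} GiB")
# fp16: 2.24 GiB
# int8: 1.12 GiB
# int4: 0.56 GiB
```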
---
## 🧩 Use Cases

- On-device chat assistants
- Smart IoT response systems
- Embedded analytics (offline summarization, intent detection, etc.)
- Lightweight reasoning for robotics

---
## 🔧 Fine-tuning Details

| Attribute | Description |
|-----------|-------------|
| **Dataset** | Blend of 50M tokens curated for code, chat, and reasoning |
| **Training framework** | PyTorch + DeepSpeed ZeRO-2 |
| **Optimizer** | AdamW |
| **Learning rate** | 2e-5 (cosine decay) |
| **Batch size** | 512 tokens per GPU |
| **Epochs** | 3 |
| **Loss function** | Cross-entropy (token-level) |
| **Special techniques** | LoRA adapters (rank = 8), QLoRA-aware fine-tuning, FlashAttention-2 integration |
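To give a sense of what the "LoRA adapters (rank = 8)" entry implies, the sketch below counts the trainable parameters a rank-8 adapter adds to a single weight matrix. The hidden size of 2048 is an illustrative assumption, not a published detail of Atlas-1B.

```python
# LoRA replaces a frozen d_out x d_in weight update with two low-rank
# factors A (r x d_in) and B (d_out x r), so it trains r * (d_in + d_out)
# parameters per matrix instead of d_in * d_out.
# NOTE: the hidden size of 2048 is a hypothetical value for illustration.

def lora_trainable_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters for one LoRA-adapted weight matrix."""
    return rank * (d_in + d_out)

d = 2048                # assumed hidden size (illustrative)
full = d * d            # full-matrix update: 4,194,304 params
lora = lora_trainable_params(d, d, rank=8)   # 32,768 params
print(f"LoRA trains {lora / full:.2%} of the full update")  # -> 0.78%
```

This is why rank-8 adapters keep fine-tuning memory low enough to combine with the quantized (QLoRA-style) base weights mentioned in the table.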
---
## 🧪 Performance Benchmarks

| Metric | BaseLLM-1B | Atlas-1B |
|--------|------------|----------|
| **MMLU (subset)** | 30.2 | 38.7 |
| **CodeEval (Python)** | 22.4 | 29.1 |
| **Average latency (Jetson Orin, INT4)** | 213 ms | 158 ms |
| **Memory usage (FP16)** | 7.9 GB | 5.4 GB |

> Benchmarks measured with vLLM 0.4.2 and the SGLang backend on an RTX 3060 (12 GB) and a Jetson Orin AGX.
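The relative gains on the latency and memory rows can be computed directly from the table; this is plain arithmetic on the numbers above, nothing model-specific.

```python
# Relative improvement of Atlas-1B over BaseLLM-1B, taken from the
# benchmark table (lower is better for both latency and memory).

def reduction_pct(base: float, new: float) -> float:
    """Percentage reduction from base to new."""
    return (base - new) / base * 100

print(f"INT4 latency:  {reduction_pct(213, 158):.1f}% lower")  # ~25.8%
print(f"FP16 memory:   {reduction_pct(7.9, 5.4):.1f}% lower")  # ~31.6%
```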