
# Atlas-1B: Lightweight Fine-tuned LLM for Edge and Low-Memory Devices

🚀 **Atlas-1B** is a 1.2-billion-parameter model fine-tuned from BaseLLM-1B to deliver improved accuracy, reasoning, and efficiency on low-power inference devices (e.g., Jetson boards, Ryzen APUs, and mobile LLM runtimes).
This release introduces quantization-aware fine-tuning, dataset specialization, and token-efficiency optimizations, making it a solid drop-in model for on-device AI use cases.


## 🧠 Model Overview

- **Base model:** BaseLLM-1B v1.3 (transformer-based, autoregressive)
- **Architecture:** Decoder-only transformer
- **Parameters:** 1.2B
- **Precision support:** FP16 / INT8 / INT4
- **Context length:** 16K tokens
- **Tokenizer:** SentencePiece (32K vocabulary)
- **Frameworks supported:** PyTorch, vLLM, and SGLang

This model was optimized specifically for edge inference and multi-request throughput, providing roughly 30% lower memory-bandwidth usage at batch size 4 compared to the base model.
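To put the precision options above in perspective, a back-of-envelope weight-footprint estimate follows directly from the parameter count. This is a sketch for weights only (no KV cache, activations, or runtime overhead), so the numbers are intentionally lower than the full memory figures quoted later in this card.

```python
# Rough weight-only memory estimate per precision for a 1.2B-parameter
# model. Back-of-envelope arithmetic, not measurements from this card:
# full runtime usage (KV cache, activations) is substantially higher.

PARAMS = 1.2e9

BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}

def weight_gb(precision: str) -> float:
    """Approximate weight footprint in GB for the given precision."""
    return PARAMS * BYTES_PER_PARAM[precision] / 1e9

for p in ("fp16", "int8", "int4"):
    print(f"{p}: ~{weight_gb(p):.1f} GB")  # fp16 ~2.4, int8 ~1.2, int4 ~0.6
```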


## 🧩 Use Cases

- On-device chat assistants
- Smart IoT response systems
- Embedded analytics (offline summarization, intent detection, etc.)
- Lightweight reasoning for robotics
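For the chat-assistant use case, one of the supported serving paths is vLLM's OpenAI-compatible server. The sketch below is illustrative only: the model ID, dtype, and context-length flag are assumptions, not values confirmed by this card.

```shell
# Hypothetical launch of an OpenAI-compatible endpoint via vLLM.
# The model ID "edge_llm/Atlas-1B" is assumed from the repo name;
# adjust --dtype / --max-model-len to your hardware and weights.
python -m vllm.entrypoints.openai.api_server \
  --model edge_llm/Atlas-1B \
  --dtype float16 \
  --max-model-len 16384
```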

## 🔧 Fine-tuning Details

| Attribute | Description |
| --- | --- |
| Dataset | Blend of 50M tokens curated for code, chat, and reasoning |
| Training framework | PyTorch + DeepSpeed ZeRO-2 |
| Optimizer | AdamW |
| Learning rate | 2e-5 (cosine decay) |
| Batch size | 512 tokens per GPU |
| Epochs | 3 |
| Loss function | Cross-entropy (token-level) |
| Special techniques | LoRA adapters (rank = 8), QLoRA-aware fine-tuning, FlashAttention-2 integration |
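The cosine decay noted in the learning-rate row can be written in a few lines. This is a minimal sketch starting from the quoted peak LR of 2e-5 and annealing to zero; warmup steps and a minimum-LR floor, which real training recipes often add, are omitted.

```python
import math

# Cosine learning-rate decay from the table's peak LR (2e-5) to zero.
# Sketch only: no warmup, no min-LR floor.

PEAK_LR = 2e-5

def cosine_lr(step: int, total_steps: int, peak_lr: float = PEAK_LR) -> float:
    """Cosine-annealed learning rate at `step` out of `total_steps`."""
    progress = step / total_steps  # 0.0 at the start, 1.0 at the end
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))

# Starts at the peak, passes through half the peak at the midpoint,
# and reaches (numerically) zero at the final step.
```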

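The rank-8 LoRA adapters listed under special techniques follow the standard low-rank-update formulation: the frozen weight `W` is augmented by a trainable product `B @ A` scaled by `alpha / r`. A minimal NumPy sketch (dimensions and `alpha` are illustrative, not values from this card):

```python
import numpy as np

# Minimal rank-8 LoRA forward pass: y = W x + (alpha / r) * B A x.
# W is frozen; only A (down-projection) and B (up-projection) train.
# Layer size and alpha below are illustrative.

rng = np.random.default_rng(0)
d_out, d_in, r, alpha = 64, 64, 8, 16

W = rng.standard_normal((d_out, d_in))       # frozen base weight
A = rng.standard_normal((r, d_in)) * 0.01    # trainable down-projection
B = np.zeros((d_out, r))                     # trainable up-projection (zero-init)

def lora_forward(x: np.ndarray) -> np.ndarray:
    """Base path plus the scaled low-rank adapter path."""
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.standard_normal(d_in)
# With B zero-initialised, the adapter contributes nothing at step 0,
# so the adapted layer starts out identical to the frozen base layer.
assert np.allclose(lora_forward(x), W @ x)
```

Zero-initialising `B` is the conventional choice: it guarantees training starts from the base model's behaviour, with the adapter learning only a delta.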
## 🧪 Performance Benchmarks

| Metric | BaseLLM-1B | Atlas-1B |
| --- | --- | --- |
| MMLU (subset) | 30.2 | 38.7 |
| CodeEval (Python) | 22.4 | 29.1 |
| Average latency (Jetson Orin, INT4) | 213 ms | 158 ms |
| Memory usage (FP16) | 7.9 GB | 5.4 GB |

Benchmarks were measured with vLLM 0.4.2 and the SGLang backend on an RTX 3060 (12 GB) and a Jetson AGX Orin.
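The relative gains implied by the table are easy to recompute from the quoted numbers (simple arithmetic on the figures above, not a re-run of the benchmarks):

```python
# Relative improvements implied by the benchmark table's quoted numbers.

latency_gain = (213 - 158) / 213   # Jetson Orin INT4 average latency
memory_gain = (7.9 - 5.4) / 7.9    # FP16 memory usage

print(f"latency: {latency_gain:.1%} faster")   # ~25.8%
print(f"memory:  {memory_gain:.1%} smaller")   # ~31.6%
```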