Tara 1.4 (Base Model)
Tara 1.4 is a tiny experimental base language model built using a Mixture of Experts (MoE) architecture. It is designed to act as an extremely lightweight, edge-deployable foundation model capable of basic text completion.
While it has roughly 107M total parameters, its sparse MoE architecture means that only about 65.6M of them are active for any given token during inference.
This release represents the pure base model (not fine-tuned for tool calling or instruction following). It generates unstructured, stream-of-consciousness text and is intended as a starting point for further specialized fine-tuning or edge-computing research.
Model Details
- Model name: tara1.4
- Architecture: Custom LlamaMoeForCausalLM (4 Experts, Top-2 Routing)
- Total Parameters: ~106.9M
- Active Parameters / Token: ~65.6M
- Context length: 1,024 tokens
- Vocabulary size: 32,768 (Tara Flagship v1 Tokenizer)
- Hidden size: 448
- Intermediate size: 1536
- Layers: 12 (2 dense layers, followed by MoE layers)
- Attention heads: 7 (1 KV head)
- Weights format: safetensors
- License: Apache-2.0
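With 7 query heads sharing a single KV head, the key/value cache stays small even at the full context length. The sketch below estimates its size for one sequence. It assumes a head dimension of 64 (hidden size 448 divided by 7 heads) and float32 cache entries; neither value is stated explicitly above, and `kv_cache_bytes` is a hypothetical helper, not part of the repository.

```python
def kv_cache_bytes(layers=12, kv_heads=1, head_dim=64,
                   context=1024, bytes_per_value=4):
    """Size of the key/value cache for one sequence.

    head_dim=64 and bytes_per_value=4 (float32) are assumptions.
    The leading factor of 2 accounts for storing both keys and values.
    """
    return 2 * layers * kv_heads * head_dim * context * bytes_per_value


one_kv_head = kv_cache_bytes(kv_heads=1)     # layout listed in Model Details
seven_kv_heads = kv_cache_bytes(kv_heads=7)  # hypothetical: one KV head per query head

print(f"1 KV head : {one_kv_head / 2**20:.1f} MiB")     # 6.0 MiB
print(f"7 KV heads: {seven_kv_heads / 2**20:.1f} MiB")  # 42.0 MiB
```

Under these assumptions, a full 1,024-token context needs about 6 MiB of cache, one seventh of what a layout with a separate KV head per query head would need.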
The Move to Mixture of Experts (MoE)
The transition to Tara 1.4 marked a major architectural shift for the Tara series. Previous versions were dense LLaMA/GPT-2 style models. However, scaling up the reasoning capacity of a "tiny LLM" while maintaining ultra-low inference costs (for local and IoT deployment) required a new approach.
We implemented a custom LLaMA-based Mixture of Experts architecture. The model uses 4 specialized experts and routes each token to the top 2 of them. This lets the model grow its total parameter count and factual capacity without a matching increase in computational cost (FLOPs) per token: each token pays for only two of the four expert MLPs.
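The routing idea can be illustrated with a minimal PyTorch sketch. This is not the code in modeling_llama_moe.py: the class names `SwiGLUExpert` and `Top2MoE`, the LLaMA-style gated expert MLP, and the choice to apply softmax over only the two selected router logits are all assumptions made for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SwiGLUExpert(nn.Module):
    """One expert MLP (LLaMA-style gated MLP; an assumption for this sketch)."""

    def __init__(self, hidden_size, intermediate_size):
        super().__init__()
        self.gate = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.up = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.down = nn.Linear(intermediate_size, hidden_size, bias=False)

    def forward(self, x):
        return self.down(F.silu(self.gate(x)) * self.up(x))


class Top2MoE(nn.Module):
    """Hypothetical sparse MLP block: 4 experts, each token uses its top 2."""

    def __init__(self, hidden_size=448, intermediate_size=1536,
                 num_experts=4, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(hidden_size, num_experts, bias=False)
        self.experts = nn.ModuleList(
            [SwiGLUExpert(hidden_size, intermediate_size) for _ in range(num_experts)]
        )

    def forward(self, x):
        # x has shape (num_tokens, hidden_size).
        logits = self.router(x)                              # (tokens, experts)
        top_logits, chosen = torch.topk(logits, self.top_k, dim=-1)
        weights = F.softmax(top_logits, dim=-1)              # normalise over the 2 picks
        out = torch.zeros_like(x)
        # Plain Python loop over experts: simple to read, but slow on a GPU.
        for expert_id, expert in enumerate(self.experts):
            token_idx, slot = torch.where(chosen == expert_id)
            if token_idx.numel() == 0:
                continue  # no token selected this expert
            expert_out = expert(x[token_idx]) * weights[token_idx, slot].unsqueeze(-1)
            out.index_add_(0, token_idx, expert_out)
        return out, chosen


torch.manual_seed(0)
layer = Top2MoE()
tokens = torch.randn(5, 448)
mixed, chosen = layer(tokens)
print(mixed.shape)   # torch.Size([5, 448])
print(chosen.shape)  # torch.Size([5, 2])
```

Each token's output is a weighted sum of exactly two expert outputs, so the other two experts contribute no compute for that token.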
Compute Efficiency & Active Parameters
One of the most important metrics for Tara 1.4 is the distinction between Total Parameters and Active Parameters:
- Total Parameters (106.9M): All weights in the model; this sets the memory footprint on disk and in RAM.
- Active Parameters (65.6M): The actual number of weights evaluated during a forward pass for a single token.
Because each token activates only 2 of the 4 experts, the model has the representational capacity of a ~107M-parameter model but needs roughly the compute (FLOPs) of a ~65M-parameter model per token. In principle, this sparse activation lowers inference cost and energy consumption, which suits battery-powered edge computing; realizing that gain in practice depends on an efficient expert-dispatch implementation.
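The two figures can be reproduced from the values listed under Model Details. The sketch below is a back-of-the-envelope count, not a dump of the checkpoint. It assumes LLaMA-style blocks (three bias-free projections per MLP, two RMSNorm weight vectors per layer plus a final norm), input and output embeddings that share weights, dense layers with the same intermediate size as the experts, and a head dimension of 64. `count_params` is a hypothetical helper.

```python
def count_params(hidden=448, inter=1536, layers=12, dense_layers=2,
                 experts=4, top_k=2, heads=7, kv_heads=1, vocab=32768):
    """Return (total, active-per-token) parameter counts under the stated assumptions."""
    head_dim = hidden // heads                      # 64 (assumed)
    # Q and O projections are hidden x hidden; K and V use only the KV heads.
    attn = 2 * hidden * hidden + 2 * hidden * kv_heads * head_dim
    mlp = 3 * hidden * inter                        # gate, up, down (assumed gated MLP)
    router = hidden * experts                       # one linear router per MoE layer
    norms = 2 * hidden                              # two norm vectors per layer
    moe_layers = layers - dense_layers
    embed = vocab * hidden                          # tied embeddings (assumed)

    # Weights every token touches regardless of routing.
    shared = (layers * (attn + norms) + dense_layers * mlp
              + moe_layers * router + embed + hidden)
    total = shared + moe_layers * experts * mlp     # all experts stored
    active = shared + moe_layers * top_k * mlp      # only top-k experts evaluated
    return total, active


total, active = count_params()
print(f"total  = {total / 1e6:.1f}M")   # total  = 106.9M
print(f"active = {active / 1e6:.1f}M")  # active = 65.6M
```

That the count lands on 106.9M and 65.6M suggests the assumptions are close to the real layout, but the repository's configuration file remains the authoritative source.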
Benchmarking vs GPT-2 Small
To test the efficiency of the Mixture of Experts architecture, we benchmarked Tara 1.4-base against the classic GPT-2 Small (124M parameters) using standard PyTorch on a single GPU.
| Metric | GPT-2 Small (Dense) | Tara 1.4-base (MoE) |
|---|---|---|
| Total Parameters | 124.4M | 106.9M |
| Active Parameters / Token | 124.4M | 65.6M |
| Peak VRAM Usage | 502.96 MB | 419.80 MB |
| Tokens per second (Unoptimized) | 82.92 | 22.51 |
Analysis:
Tara 1.4 uses about 83 MB less peak VRAM than GPT-2 Small and, in theory, needs roughly half the FLOPs per token (65.6M versus 124.4M active parameters). However, in plain PyTorch without custom GPU kernels (such as Triton or Flash-MoE implementations), Tara generates tokens about 3.7 times slower than GPT-2 (22.51 versus 82.92 tokens per second). This is a well-known MoE bottleneck: dispatching tokens to experts with Python-level for loops incurs routing and memory-bandwidth overhead that outweighs the FLOP savings. Writing optimized kernels for the LlamaMoeDecoderLayer would be needed to realize the speed this sparse architecture promises.
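For reference, the ratios implied by the table can be worked out directly. The snippet below only restates the reported numbers; the dictionary keys are illustrative names, and treating the active-parameter ratio as a stand-in for the FLOPs ratio is a simplification.

```python
# Values copied from the benchmark table above.
gpt2 = {"active_params_m": 124.4, "peak_vram_mb": 502.96, "tokens_per_s": 82.92}
tara = {"active_params_m": 65.6, "peak_vram_mb": 419.80, "tokens_per_s": 22.51}

# Active-parameter ratio, used here as a rough proxy for FLOPs per token.
active_ratio = tara["active_params_m"] / gpt2["active_params_m"]
vram_saved_mb = gpt2["peak_vram_mb"] - tara["peak_vram_mb"]
slowdown = gpt2["tokens_per_s"] / tara["tokens_per_s"]

print(f"active-parameter ratio: {active_ratio:.2f}")      # 0.53
print(f"peak VRAM saved:        {vram_saved_mb:.2f} MB")  # 83.16 MB
print(f"throughput slowdown:    {slowdown:.2f}x")         # 3.68x
```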
Custom Architecture Scripts
Because LlamaMoeForCausalLM is a custom architecture, this repository includes the necessary Python files (modeling_llama_moe.py and configuration_llama_moe.py). When loading the model with Hugging Face transformers, ensure you have trust_remote_code=True enabled to allow the custom scripts to load.
Capability
Tara 1.4 is a base model. It is capable of syntactic text completion and predicting the next likely tokens based on its training distribution.
Example usage:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
repo_id = "aungkomyint/tara1.4-base" # Or your local path
# Make sure to set trust_remote_code=True for the custom MoE architecture!
tokenizer = AutoTokenizer.from_pretrained(repo_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(repo_id, trust_remote_code=True, torch_dtype=torch.float32)
prompt = "The capital of France is"
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    output = model.generate(
        **inputs,
        max_new_tokens=120,       # Adjust between 16 and 256
        do_sample=True,           # Set to False for greedy decoding
        temperature=0.7,          # Adjust between 0.00 and 1.20
        top_p=0.9,                # Adjust between 0.10 and 1.00
        repetition_penalty=1.08,  # Penalize repeated phrases
        pad_token_id=tokenizer.pad_token_id,
        eos_token_id=tokenizer.eos_token_id,
    )
print(tokenizer.decode(output[0], skip_special_tokens=True))
Limitations
- Hallucinations: As a ~100M parameter model, it simply does not have the capacity to store robust factual world knowledge. It will confidently generate incorrect facts (e.g., claiming Paris has a population of 1,000 people).
- No Instruction Tuning: This model does not understand instructions. If you ask it a question, it is highly likely to just generate more questions or continue the prompt rather than answering it.
- Not a Tool Agent: It has not been fine-tuned for tool calling.
Citation
If you use this model or the custom MoE implementation, cite it as:
Aung Ko Myint. Tara 1.4 Base. 2026. Hugging Face model checkpoint.