Tara 1.4 (Base Model)

Tara 1.4 is a tiny experimental base language model built using a Mixture of Experts (MoE) architecture. It is designed to act as an extremely lightweight, edge-deployable foundation model capable of basic text completion.

Although it has roughly 107M total parameters, its sparse MoE routing means only ~65M parameters are active per token during inference.

This release represents the pure base model (not fine-tuned for tool calling or instruction following). It generates unstructured, stream-of-consciousness text and is intended as a starting point for further specialized fine-tuning or edge-computing research.

Model Details

  • Model name: tara1.4
  • Architecture: Custom LlamaMoeForCausalLM (4 Experts, Top-2 Routing)
  • Total Parameters: ~106.9M
  • Active Parameters / Token: ~65.6M
  • Context length: 1,024 tokens
  • Vocabulary size: 32,768 (Tara Flagship v1 Tokenizer)
  • Hidden size: 448
  • Intermediate size: 1536
  • Layers: 12 (2 dense layers, followed by MoE layers)
  • Attention heads: 7 (1 KV head)
  • Weights format: safetensors
  • License: Apache-2.0
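
As a rough illustration of what the single KV head means for edge memory budgets, the sketch below estimates the key/value cache size at the full 1,024-token context. It assumes a head dimension of 448 / 7 = 64, a float32 cache, and one cache per layer; none of these are confirmed by the configuration files, and the 7-KV-head figure is a hypothetical comparison, not a released variant.

```python
# Values taken from the Model Details list.
HIDDEN_SIZE = 448
NUM_LAYERS = 12
NUM_HEADS = 7
NUM_KV_HEADS = 1
CONTEXT_LEN = 1024

BYTES_PER_VALUE = 4                  # assumption: float32 cache
HEAD_DIM = HIDDEN_SIZE // NUM_HEADS  # assumption: 448 / 7 = 64


def kv_cache_bytes(num_kv_heads: int, seq_len: int = CONTEXT_LEN) -> int:
    """Bytes needed to cache keys and values for one sequence."""
    # Factor 2 covers one key tensor and one value tensor per layer.
    per_token = 2 * NUM_LAYERS * num_kv_heads * HEAD_DIM * BYTES_PER_VALUE
    return per_token * seq_len


shared = kv_cache_bytes(NUM_KV_HEADS)  # 1 KV head, as listed above
full = kv_cache_bytes(NUM_HEADS)       # hypothetical: one KV head per query head

print(f"1 KV head : {shared / 2**20:.1f} MiB")
print(f"7 KV heads: {full / 2**20:.1f} MiB")
```

Under these assumptions the cache scales linearly with the number of KV heads, so sharing one KV head across all 7 query heads cuts it by a factor of 7.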

The Move to Mixture of Experts (MoE)

The transition to Tara 1.4 marked a major architectural shift for the Tara series. Previous versions were dense LLaMA/GPT-2 style models. However, scaling up the reasoning capacity of a "tiny LLM" while maintaining ultra-low inference costs (for local and IoT deployment) required a new approach.

We implemented a custom LLaMA-based Mixture of Experts architecture. Each MoE layer holds 4 experts, and a router sends every token to the top 2 of them. This lets the model grow its total parameter count and factual capacity without a proportional increase in computational cost (FLOPs) per token, because only the selected experts run for each token.
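
The routing step can be sketched in plain Python as follows. This is a minimal illustration, not the code in modeling_llama_moe.py: it assumes a softmax over all 4 router logits followed by renormalizing the top-2 weights (one common convention), and the names `top2_route` and `moe_forward` and the toy scalar "experts" are hypothetical.

```python
import math


def top2_route(router_logits):
    """Return [(expert_index, weight), ...] for the two highest-scoring experts."""
    # Softmax over all experts, shifted by the max for numerical stability.
    m = max(router_logits)
    exps = [math.exp(x - m) for x in router_logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Keep the two most probable experts and renormalize their weights to sum to 1.
    top2 = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:2]
    norm = sum(probs[i] for i in top2)
    return [(i, probs[i] / norm) for i in top2]


def moe_forward(x, experts, router_logits):
    """Weighted sum of the two selected experts' outputs; the others never run."""
    return sum(w * experts[i](x) for i, w in top2_route(router_logits))


# Toy stand-ins for the 4 expert MLPs (hypothetical scalar functions).
experts = [lambda x: x + 1, lambda x: 2 * x, lambda x: -x, lambda x: x * x]
logits = [0.1, 2.0, -1.0, 2.0]

routes = top2_route(logits)
output = moe_forward(3.0, experts, logits)
print(routes)
print(output)
```

Only two of the four expert functions are called per token, which is where the per-token compute saving comes from.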

Compute Efficiency & Active Parameters

One of the most important metrics for Tara 1.4 is the distinction between Total Parameters and Active Parameters:

  • Total Parameters (106.9M): Every weight in the checkpoint. This determines the memory footprint on disk and in RAM.
  • Active Parameters (65.6M): The actual number of weights evaluated during a forward pass for a single token.

Because each token activates only 2 of the 4 experts, the model has the representational capacity of a ~107M-parameter model while requiring roughly the per-token compute (FLOPs) of a ~65M-parameter model. In principle, this sparse activation lowers inference cost and energy consumption, which suits battery-powered edge computing. Turning the FLOP savings into faster wall-clock inference depends on an efficient expert-dispatch implementation.
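
The two figures can be reproduced approximately from the Model Details values. The sketch below is a back-of-the-envelope count, not output from the actual checkpoint: it assumes tied input/output embeddings, bias-free projections, a three-matrix gated MLP (gate, up, down) per expert, two norm weight vectors per layer plus a final norm, a linear router per MoE layer, and 10 MoE layers after the 2 dense ones.

```python
# Values from the Model Details list.
VOCAB, HIDDEN, INTERMEDIATE = 32_768, 448, 1_536
LAYERS, DENSE_LAYERS = 12, 2
HEADS, KV_HEADS = 7, 1
EXPERTS, TOP_K = 4, 2
HEAD_DIM = HIDDEN // HEADS  # assumption: 64

# Per-component weight counts (all assumptions listed in the text above).
embedding = VOCAB * HIDDEN                         # tied with the output head
attention = (2 * HIDDEN * HEADS * HEAD_DIM         # query and output projections
             + 2 * HIDDEN * KV_HEADS * HEAD_DIM)   # key and value projections
mlp = 3 * HIDDEN * INTERMEDIATE                    # gate, up, down
router = HIDDEN * EXPERTS                          # one linear router per MoE layer
norms = 2 * HIDDEN * LAYERS + HIDDEN               # per-layer norms + final norm


def model_params(experts_counted: int) -> int:
    """Count weights, including `experts_counted` expert MLPs per MoE layer."""
    dense = DENSE_LAYERS * (attention + mlp)
    moe = (LAYERS - DENSE_LAYERS) * (attention + router + experts_counted * mlp)
    return embedding + dense + moe + norms


total = model_params(EXPERTS)  # all 4 experts stored
active = model_params(TOP_K)   # only the 2 routed experts evaluated per token

print(f"total : {total / 1e6:.1f}M")
print(f"active: {active / 1e6:.1f}M")
```

Under these assumptions the counts round to the stated 106.9M total and 65.6M active parameters, and the whole gap comes from the two unselected expert MLPs in each MoE layer.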

Benchmarking vs GPT-2 Small

To test the efficiency of the Mixture of Experts architecture, we benchmarked Tara 1.4-base against the classic GPT-2 Small (124M parameters) using standard PyTorch on a single GPU.

| Metric | GPT-2 Small (Dense) | Tara 1.4-base (MoE) |
|---|---|---|
| Total Parameters | 124.4M | 106.9M |
| Active Parameters / Token | 124.4M | 65.6M |
| Peak VRAM Usage | 502.96 MB | 419.80 MB |
| Tokens per second (Unoptimized) | 82.92 | 22.51 |

Analysis: Tara 1.4 uses about 83 MB less peak VRAM than GPT-2 Small and, with roughly 53% of the active parameters, theoretically needs about half the FLOPs per token. However, in pure PyTorch without custom GPU kernels (such as Triton or Flash-MoE implementations), Tara generates tokens about 3.7x slower than GPT-2. This is a well-known MoE bottleneck: dispatching tokens to experts in standard Python-level for loops adds routing overhead and many small, memory-bandwidth-bound operations. Writing optimized kernels for the LlamaMoeDecoderLayer should recover much of the theoretical speed of this sparse architecture.
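
The ratios behind this analysis can be rechecked directly from the table. The snippet below only redoes the arithmetic on the reported numbers; the dictionary keys are illustrative names, and it does not rerun the benchmark.

```python
# Reported figures from the benchmark table.
gpt2 = {"active_m": 124.4, "vram_mb": 502.96, "tok_s": 82.92}
tara = {"active_m": 65.6, "vram_mb": 419.80, "tok_s": 22.51}

vram_saved = gpt2["vram_mb"] - tara["vram_mb"]     # MB of peak VRAM saved
active_ratio = tara["active_m"] / gpt2["active_m"]  # share of active parameters
slowdown = gpt2["tok_s"] / tara["tok_s"]            # measured throughput gap

print(f"VRAM saved       : {vram_saved:.2f} MB")
print(f"Active parameters: {active_ratio:.0%} of GPT-2 Small")
print(f"Throughput       : {slowdown:.1f}x slower than GPT-2 Small")
```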

Custom Architecture Scripts

Because LlamaMoeForCausalLM is a custom architecture, this repository includes the necessary Python files (modeling_llama_moe.py and configuration_llama_moe.py). When loading the model with Hugging Face transformers, ensure you have trust_remote_code=True enabled to allow the custom scripts to load.

Capability

Tara 1.4 is a base model. It is capable of syntactic text completion and predicting the next likely tokens based on its training distribution.

Example usage:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "aungkomyint/tara1.4-base" # Or your local path

# Make sure to set trust_remote_code=True for the custom MoE architecture!
tokenizer = AutoTokenizer.from_pretrained(repo_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(repo_id, trust_remote_code=True, torch_dtype=torch.float32)

prompt = "The capital of France is"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    output = model.generate(
        **inputs,
        max_new_tokens=120,          # Adjust between 16 and 256
        do_sample=True,             # Set to False for greedy decoding
        temperature=0.7,            # Must be > 0 when sampling; adjust up to 1.20
        top_p=0.9,                  # Adjust between 0.10 and 1.00
        repetition_penalty=1.08,    # Penalize repeated phrases
        pad_token_id=tokenizer.pad_token_id,
        eos_token_id=tokenizer.eos_token_id,
    )

print(tokenizer.decode(output[0], skip_special_tokens=True))

Limitations

  • Hallucinations: As a ~100M parameter model, it simply does not have the capacity to store robust factual world knowledge. It will confidently generate incorrect facts (e.g., claiming Paris has a population of 1,000 people).
  • No Instruction Tuning: This model does not understand instructions. If you ask it a question, it is highly likely to just generate more questions or continue the prompt rather than answering it.
  • Not a Tool Agent: It has not been fine-tuned for tool calling.

Citation

If you use this model or the custom MoE implementation, cite it as:

Aung Ko Myint. Tara 1.4 Base. 2026. Hugging Face model checkpoint.