Llama 3.2 1B: SYNAXIM .symb Format (INT4)
The first model converted to the SYNAXIM proprietary .symb inference format.
This is Meta's Llama-3.2-1B converted to SYNAXIM's framework-free .symb binary format with INT4 per-group quantization. It runs entirely through the SYNAXIM Symbiotic State Engine, with no PyTorch, no Transformers library, and no KV cache.
Quick Start
1. Install SYNAXIM
pip install grrn-inference
Or install from source:
git clone https://github.com/GRRN-MAKER/SYNAXIM.git
cd SYNAXIM
pip install -e .
2. Download This Model
# Using huggingface-cli
huggingface-cli download GRRNNOB/SYNAXIM --local-dir ./llama-1b-symb
# Or using Python
from huggingface_hub import snapshot_download
snapshot_download("GRRNNOB/SYNAXIM", local_dir="./llama-1b-symb")
3. Run Inference
from grrn_inference import GRRNModel
# Load the model
model = GRRNModel.from_pretrained("./llama-1b-symb")
# Generate text
result = model.generate("The meaning of life is", max_tokens=50, temperature=0.7)
print(result.text)
print(f"Speed: {result.tokens_per_second} tok/s")
4. Chat (OpenAI-Style)
result = model.chat([
    {"role": "user", "content": "Explain quantum computing simply."}
], max_tokens=200)
print(result.choices[0].message["content"])
5. Streaming
for chunk in model.stream("Once upon a time", max_tokens=100):
    print(chunk.text, end="", flush=True)
6. Serve as OpenAI API
from grrn_inference import serve
serve("./llama-1b-symb", port=8000, api_key="my-secret-key")
Then connect with any OpenAI client:
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="my-secret-key")
response = client.chat.completions.create(
    model="llama-1b-symb",
    messages=[{"role": "user", "content": "Hello!"}]
)
print(response.choices[0].message.content)
Model Details
| Property | Value |
|---|---|
| Base Model | meta-llama/Llama-3.2-1B |
| Architecture | LlamaForCausalLM (Dense) |
| Parameters | 1.24B |
| Hidden Size | 2048 |
| Layers | 16 |
| Attention Heads | 32 Q / 8 KV (GQA 4:1) |
| Head Dim | 64 |
| Vocabulary | 128,256 tokens |
| Intermediate Size | 8,192 |
| Activation | SiLU |
| RoPE θ | 500,000 |
| Tied Embeddings | Yes (lm_head = embed_tokens.T) |
| Format | .symb (SYNAXIM proprietary binary) |
| Quantization | INT4, group_size=128 |
| Compression | 3.8× vs FP16 |
| Total Size | ~674 MB |
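As a sanity check on the table, the parameter count can be rebuilt from the listed dimensions. This is a minimal sketch that assumes a bias-free Llama-style layout (Q/K/V/O projections, three SwiGLU matrices, two RMSNorm vectors per layer) with tied embeddings; the constant and function names are illustrative and not part of SYNAXIM.

```python
# Architecture values copied from the Model Details table.
VOCAB = 128_256
HIDDEN = 2048
LAYERS = 16
Q_HEADS = 32
KV_HEADS = 8
HEAD_DIM = 64
INTERMEDIATE = 8192


def count_parameters() -> int:
    """Count weights for a bias-free Llama-style dense model with tied embeddings."""
    embed = VOCAB * HIDDEN                   # shared with lm_head (tied)
    attn_q = HIDDEN * Q_HEADS * HEAD_DIM
    attn_k = HIDDEN * KV_HEADS * HEAD_DIM    # GQA: fewer K/V heads than Q heads
    attn_v = HIDDEN * KV_HEADS * HEAD_DIM
    attn_o = Q_HEADS * HEAD_DIM * HIDDEN
    mlp = 3 * HIDDEN * INTERMEDIATE          # gate, up, down
    norms = 2 * HIDDEN                       # pre-attention + pre-MLP RMSNorm
    per_layer = attn_q + attn_k + attn_v + attn_o + mlp + norms
    final_norm = HIDDEN
    return embed + LAYERS * per_layer + final_norm


total = count_parameters()
print(f"{total:,} parameters (~{total / 1e9:.2f}B)")
# prints: 1,235,814,400 parameters (~1.24B)
```

The result agrees with the 1.24B figure in the table.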
.symb File Structure
llama-1b-symb/
├── config.symb.json          # Architecture + quantization config
├── embeddings.symb           # Token embeddings (INT4, 66 MB)
├── final_norm.symb           # Final RMSNorm (FP16, 4 KB)
├── tokenizer/                # Tokenizer files
│   ├── tokenizer.json
│   ├── tokenizer_config.json
│   └── special_tokens_map.json
└── layers/
    ├── layer_00/
    │   ├── attn_q.symb       # Q projection (INT4)
    │   ├── attn_k.symb       # K projection (INT4)
    │   ├── attn_v.symb       # V projection (INT4)
    │   ├── attn_o.symb       # Output projection (INT4)
    │   ├── mlp_gate.symb     # SwiGLU gate (INT4)
    │   ├── mlp_up.symb       # SwiGLU up (INT4)
    │   ├── mlp_down.symb     # SwiGLU down (INT4)
    │   ├── norm_attn.symb    # Pre-attention RMSNorm (FP16)
    │   └── norm_mlp.symb     # Pre-MLP RMSNorm (FP16)
    ├── layer_01/
    │   └── ...
    └── layer_15/
        └── ...
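To check a download against this layout, the on-disk sizes can be summed per component. The sketch below uses only the standard library and relies on nothing beyond the directory structure shown above; `summarize_symb_dir` is a hypothetical helper, not part of `grrn_inference`.

```python
from pathlib import Path


def summarize_symb_dir(model_dir: str) -> dict:
    """Sum on-disk sizes of *.symb files, grouped per layer or per top-level file."""
    root = Path(model_dir)
    sizes = {}
    for path in sorted(root.rglob("*.symb")):
        parts = path.relative_to(root).parts
        if parts[0] == "layers" and len(parts) > 2:
            key = parts[1]      # e.g. "layer_00"
        else:
            key = parts[0]      # e.g. "embeddings.symb"
        sizes[key] = sizes.get(key, 0) + path.stat().st_size
    return sizes


if __name__ == "__main__":
    model_dir = "./llama-1b-symb"  # path used in the Quick Start
    if Path(model_dir).is_dir():
        for name, size in summarize_symb_dir(model_dir).items():
            print(f"{name:<20} {size / 1e6:8.1f} MB")
```

Note that `config.symb.json` is skipped because its name does not end in `.symb`.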
How SYNAXIM Works
SYNAXIM replaces the standard Transformer inference paradigm:
| | Standard Transformer | SYNAXIM |
|---|---|---|
| Memory Model | KV-Cache (grows with context) | O(1) M matrix (fixed size) |
| Attention | Q·K^T·V with stored K,V pairs | Sigmoid-gated associative memory update |
| Runtime | PyTorch + CUDA | NumPy only (zero framework) |
| Weight Format | safetensors (open) | .symb (proprietary INT4 bitpacked) |
| Install Size | ~2 GB (PyTorch + deps) | < 5 MB |
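To put rough numbers on the Memory Model row, the following sketch compares a standard FP16 KV cache for this architecture with a fixed-size state. The KV-cache formula is the usual one for grouped-query attention. The fixed-state layout is an assumption made for illustration only: one (head_dim × head_dim) FP32 matrix per query head per layer. The actual shape, count, and precision of the M matrix are not specified here.

```python
LAYERS, KV_HEADS, Q_HEADS, HEAD_DIM = 16, 8, 32, 64  # from Model Details
FP16_BYTES, FP32_BYTES = 2, 4


def kv_cache_bytes(context_len: int) -> int:
    """Standard FP16 KV cache: one K and one V vector per KV head, layer, and token."""
    return 2 * LAYERS * KV_HEADS * HEAD_DIM * FP16_BYTES * context_len


def fixed_state_bytes() -> int:
    """ASSUMPTION: one (HEAD_DIM x HEAD_DIM) FP32 matrix per query head per layer."""
    return LAYERS * Q_HEADS * HEAD_DIM * HEAD_DIM * FP32_BYTES


for n in (1_024, 8_192, 131_072):
    print(f"context {n:>7,}: KV cache {kv_cache_bytes(n) / 2**20:8.1f} MiB, "
          f"fixed state {fixed_state_bytes() / 2**20:.1f} MiB")
```

Under these assumptions the KV cache costs 32 KiB per token (32 MiB at 1,024 tokens), while the fixed state stays at 8 MiB regardless of context length.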
The Symbiotic Gate
Instead of growing a KV cache, SYNAXIM maintains a fixed-size state matrix M:
M_{t+1} = σ(gate_score) · M_t + (1 - σ(gate_score)) · key ⊗ value
output = query · M_{t+1}
- M is (D × D): fixed size, never grows, regardless of sequence length
- Gate score computed from Q·K similarity controls retention vs. new imprint
- Memory: O(D²) fixed vs O(n·d) growing KV-cache
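The update rule can be sketched in a few lines of NumPy. This is an illustration of the two equations as written, not the engine's actual kernel. In particular, it assumes `gate_score` is the plain dot product of query and key, since the text only says the score is computed from Q·K similarity; `symbiotic_step` is a hypothetical name.

```python
import numpy as np


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + np.exp(-x))


def symbiotic_step(M, query, key, value):
    """One update of the fixed-size state M, following the gate equations.

    ASSUMPTION: gate_score is the dot product query . key.
    """
    gate = sigmoid(float(query @ key))
    # Retain a fraction of the old state, imprint the new key/value outer product.
    M_next = gate * M + (1.0 - gate) * np.outer(key, value)
    output = query @ M_next
    return M_next, output


rng = np.random.default_rng(0)
D = 64  # illustrative; matches the head dimension in Model Details
M = np.zeros((D, D))
for _ in range(1000):
    q, k, v = rng.standard_normal((3, D)) / np.sqrt(D)
    M, out = symbiotic_step(M, q, k, v)

print(M.shape, out.shape)  # prints: (64, 64) (64,)
```

The state keeps its (D × D) shape after 1,000 tokens, which is the fixed-memory property described in the list above.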
Device Selection
# Auto-detect (uses Numba if available, else pure NumPy)
model = GRRNModel.from_pretrained("./llama-1b-symb", device="cpu")
# Force Numba-accelerated CPU (requires: pip install grrn-inference[cpu])
model = GRRNModel.from_pretrained("./llama-1b-symb", device="cpu-accelerated")
# Force pure NumPy (no dependencies beyond numpy)
model = GRRNModel.from_pretrained("./llama-1b-symb", device="cpu-numpy")
# Triton GPU (requires: pip install grrn-inference[gpu])
model = GRRNModel.from_pretrained("./llama-1b-symb", device="cuda")
System Requirements
| Requirement | Minimum |
|---|---|
| Python | 3.9+ |
| RAM | 4 GB |
| Disk | 1 GB |
| OS | macOS, Linux, Windows |
| GPU | Not required (CPU-only) |
Core dependencies: numpy, safetensors, tqdm. That's it.
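A quick preflight check for the Python and disk minimums in the table might look like the following. It uses only the standard library; `preflight` is a hypothetical helper, and "1 GB" is read as 10^9 bytes. RAM is not checked because the standard library has no portable call for it.

```python
import shutil
import sys

MIN_PYTHON = (3, 9)
MIN_DISK_BYTES = 10**9  # "1 GB" from the table, read as 10^9 bytes


def preflight(path: str = ".") -> dict:
    """Check the Python and disk minimums from the System Requirements table."""
    free = shutil.disk_usage(path).free
    return {
        "python_ok": sys.version_info[:2] >= MIN_PYTHON,
        "disk_ok": free >= MIN_DISK_BYTES,
        "free_disk_gb": round(free / 10**9, 1),
    }


print(preflight())
```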
Convert Your Own Model
pip install grrn-inference
grrn-convert meta-llama/Llama-3.2-1B ./my-llama-symb --quantize int4
Or in Python:
from grrn_inference import SymbioticConverter
converter = SymbioticConverter()
converter.convert(
    source="meta-llama/Llama-3.2-1B",
    output_dir="./my-llama-symb",
    quantize="int4"
)
Supports: LLaMA, Qwen, Mistral, Phi, Gemma, Mixtral, DeepSeek, DBRX.
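For readers unfamiliar with per-group INT4 quantization, here is a generic sketch of the idea with `group_size=128`. It is not the converter's implementation, and the real .symb bit layout may differ. The assumptions are symmetric quantization, one FP16 scale per group, and two 4-bit values packed per byte with the low nibble first.

```python
import numpy as np

GROUP_SIZE = 128  # matches the group_size listed in Model Details


def quantize_int4(weights: np.ndarray):
    """Symmetric per-group INT4. weights.size must be a multiple of GROUP_SIZE."""
    groups = weights.astype(np.float32).reshape(-1, GROUP_SIZE)
    scales = (np.abs(groups).max(axis=1, keepdims=True) / 7.0).astype(np.float16)
    scales[scales == 0] = 1.0                       # avoid division by zero
    q = np.clip(np.round(groups / scales), -8, 7).astype(np.int8)
    nibbles = (q + 8).astype(np.uint8).reshape(-1)  # shift to 0..15
    packed = nibbles[0::2] | (nibbles[1::2] << 4)   # low nibble first
    return packed, scales


def dequantize_int4(packed: np.ndarray, scales: np.ndarray, shape) -> np.ndarray:
    """Unpack two nibbles per byte and rescale each group."""
    nibbles = np.empty(packed.size * 2, dtype=np.uint8)
    nibbles[0::2] = packed & 0x0F
    nibbles[1::2] = packed >> 4
    q = nibbles.astype(np.float32) - 8.0
    return (q.reshape(-1, GROUP_SIZE) * scales.astype(np.float32)).reshape(shape)


rng = np.random.default_rng(0)
w = rng.standard_normal((2048, 512)).astype(np.float32)  # shape of one K projection
packed, scales = quantize_int4(w)
w_hat = dequantize_int4(packed, scales, w.shape)
bits_per_weight = 8 * (packed.nbytes + scales.nbytes) / w.size
print(f"{bits_per_weight:.3f} bits/weight, max abs error {np.abs(w - w_hat).max():.3f}")
```

With these choices each weight costs 4.125 bits (4 bits plus a 16-bit scale shared by 128 weights), a ratio of about 3.88 against FP16, which is in line with the 3.8× compression listed in Model Details.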
Important Notes
⚠️ This is a test release of the SYNAXIM engine.
This model was converted from standard Transformer weights (trained with standard softmax self-attention, which is normally served with a KV cache at inference time). The SYNAXIM Symbiotic State Engine uses a fundamentally different inference paradigm (O(1) associative memory). Output quality from standard models running through the Symbiotic Gate will differ from their original behavior; this is by design.
This release demonstrates the complete pipeline: install → download → convert → load → generate → serve. Future releases will include models specifically trained for the Symbiotic Gate paradigm.
Links
- Engine Source: github.com/GRRN-MAKER/SYNAXIM
- Original Model: meta-llama/Llama-3.2-1B
- Author: GRRNMAKER
Citation
@software{synaxim,
title={SYNAXIM: Symbiotic Native Axiom Inference Machine},
author={GRRNMAKER},
year={2026},
url={https://github.com/GRRN-MAKER/SYNAXIM}
}
SYNAXIM: Because inference should be a machine, not a framework. Built by GRRNMAKER