# Olifant: Memory-Based Language Model
Olifant is a memory-based language model that uses TiMBL (Tilburg Memory-Based Learner) instead of neural networks. It stores training instances in an indexed instance base and uses k-nearest neighbors for prediction.
Note that this is not a neural language model. Its internal structure is a decision tree with nodes, not a neural network with layers of units and weighted connections between them. A memory-based model does not have parameters; it has nodes.
## Key Features
- No neural network weights - Uses .ibase files (memory-based k-NN model)
- Full prediction explainability - See which training instances influenced each prediction
- CPU-only inference - No GPU required
- Lower CO2 emissions - Significantly more environmentally friendly than neural LMs
- HuggingFace compatible - Works with standard transformers API
## Requirements
This model requires the Olifant package:
```bash
pip install olifant
```
Olifant also relies on the command-line version of the TiMBL memory-based classification engine for training. On Debian/Ubuntu systems:

```bash
apt install timbl
```

On Alpine Linux:

```bash
apk add timbl
```

On macOS with Homebrew:

```bash
brew install timbl
```
## Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load model (requires trust_remote_code for custom architecture)
model = AutoModelForCausalLM.from_pretrained(
    "antalvdb/olifant-hf",
    trust_remote_code=True
)

# Load tokenizer (uses GPT-2 tokenizer)
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model.set_tokenizer(tokenizer)

# Generate text
input_ids = tokenizer.encode("The quick brown", return_tensors="pt")
output_ids = model.generate(
    input_ids,
    max_length=20,
    do_sample=False,
    pad_token_id=tokenizer.eos_token_id
)
print(tokenizer.decode(output_ids[0]))
```
## Model Details
- Architecture: Memory-based learning with TiMBL (TRIBL2 algorithm)
- Training data: EduFineWeb subset (100K lines, 24M tokens)
- Context window: 4 tokens (l4r0 configuration)
- Vocabulary: GPT-2 tokenizer (50,257 tokens)
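The l4r0 configuration means each training instance consists of the four tokens to the left of the prediction target and no right context, with the next token as the class label. A minimal sketch of this windowing (the on-disk layout is TiMBL's own indexed format; the padding symbol `_` here is an illustrative assumption):

```python
def to_instances(tokens, left=4):
    """Turn a token sequence into l4r0-style instances:
    four left-context features plus the next token as class label."""
    pad = ["_"] * left          # pad the start so every token gets a full context
    padded = pad + tokens
    return [
        (padded[i:i + left], padded[i + left])
        for i in range(len(tokens))
    ]

instances = to_instances("the quick brown fox jumps".split())
# first instance predicts "the" from an all-padding context;
# last instance predicts "jumps" from ("the", "quick", "brown", "fox")
```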
## How It Works
Unlike neural language models that learn distributed representations, Olifant:
- Stores all training n-grams as instances in an indexed database
- At inference time, finds the k-nearest neighbors to the input context
- Returns a probability distribution based on the class labels of neighbors
This approach provides full transparency: you can inspect exactly which training examples influenced each prediction.
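The steps above can be sketched as a toy k-NN next-token predictor. This is a simplification for illustration only: it uses a plain feature-overlap distance over flat storage, whereas TiMBL's TRIBL2 algorithm combines tree-based indexing with weighted nearest-neighbor search.

```python
from collections import Counter

def build_instance_base(tokens, context=4):
    """Store every (left-context, next-token) pair as an instance."""
    return [
        (tuple(tokens[i - context:i]), tokens[i])
        for i in range(context, len(tokens))
    ]

def overlap_distance(a, b):
    """Count mismatching feature positions (plain overlap metric)."""
    return sum(x != y for x, y in zip(a, b))

def predict(base, query, k=3):
    """Rank instances by distance to the query context and return
    a class distribution over the labels of the k nearest neighbors."""
    ranked = sorted(base, key=lambda inst: overlap_distance(inst[0], query))
    nearest = ranked[:k]
    counts = Counter(label for _, label in nearest)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}, nearest

tokens = "the quick brown fox jumps over the lazy dog the quick brown fox sleeps".split()
base = build_instance_base(tokens)
dist, neighbors = predict(base, ("the", "quick", "brown", "fox"))
```

Because `neighbors` is returned alongside the distribution, you can see exactly which stored training instances produced the prediction, which is the source of the explainability described above.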
## Spaces
- Explainability demo: https://huggingface.co/spaces/antalvdb/olifant-explainability-demo
- Autoregressive generation demo: https://huggingface.co/spaces/antalvdb/olifant-generate-server
## Citation
If you use this model, please cite:
```bibtex
@misc{bosch2025memorybasedlanguagemodelsefficient,
  title={Memory-based Language Models: An Efficient, Explainable, and Eco-friendly Approach to Large Language Modeling},
  author={Antal van den Bosch and Ainhoa Risco Patón and Teun Buijse and Peter Berck and Maarten van Gompel},
  year={2025},
  eprint={2510.22317},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2510.22317},
}
```
## License

GPL-3.0 License