OLIFANT EduFineweb Chatbot

OLIFANT (Memory-Based Language Model) is a CPU-based, fully explainable language model that replaces neural networks with memory-based learning. Every prediction can be traced back to specific training examples, providing complete transparency.

Model Description

This model is trained on EduFineweb (high-quality educational web text) combined with chatbot instruction data, enabling conversational text generation with full explainability. Three model sizes are available. All models are based on the TiMBL memory-based learning engine and use IGTree as their classifier; IGTree is TiMBL's fast decision-tree approximation of k-nearest-neighbor classification. All three models use the GPT-2 tokenizer.

1. XS model, edufineweb_chatbot_71M.l4r0.igtree.ibase

| Feature | Value |
|---|---|
| Context Window | 4 tokens |
| Training Data | EduFineweb shard 1, first 50M tokens + Chatbot/Instruct Data (~21M tokens) |
| Model Size (file) | ~1.4 GB |

2. S model, edufineweb_chatbot_121M.l4r0.igtree.ibase

| Feature | Value |
|---|---|
| Context Window | 4 tokens |
| Training Data | EduFineweb shard 1, 100M tokens + Chatbot/Instruct Data (~21M tokens) |
| Model Size (file) | ~2.4 GB |

3. M model, edufineweb_train_1-3_chatbot_tok.l16r0.igtree.ibase

| Feature | Value |
|---|---|
| Context Window | 16 tokens |
| Training Data | EduFineweb shards 1-3, 300M tokens + Chatbot/Instruct Data (~21M tokens) |
| Model Size (file) | ~8.1 GB |
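The `l4r0`/`l16r0` suffixes reflect TiMBL's windowing scheme: each training instance consists of N left-context tokens as features and the next token as the class label. The helper below is a hypothetical illustration of this windowing, not the actual training pipeline:

```python
def make_instances(tokens, left=4, pad="_"):
    """Convert a token sequence into TiMBL-style l{left}r0 instances:
    `left` context features followed by the next token as the class."""
    padded = [pad] * left + tokens
    instances = []
    for i in range(left, len(padded)):
        context = padded[i - left:i]   # the feature window
        target = padded[i]             # the class label to predict
        instances.append((context, target))
    return instances

# Example: windowing a short sentence (l4r0)
instances = make_instances(["The", "capital", "of", "France", "is", "Paris"])
# First instance: (["_", "_", "_", "_"], "The")
```

Every token in the corpus yields one instance, so the number of instances equals the number of training tokens.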

Key Features

  • 🔍 Full Explainability: Every prediction includes references to specific training examples with similarity scores
  • 🌱 Eco-Friendly: ~1,000x lower CO2 emissions than neural LLMs, thanks to CPU-only training and inference
  • 📋 Regulatory Compliance: Complete audit trail for healthcare, finance, and legal applications
  • 💻 No GPU Required: Runs on standard CPUs with ~8-10 GB RAM

Intended Use

  • Conversational AI with explainable outputs
  • Regulated industries requiring decision audit trails
  • Edge computing and resource-constrained environments
  • Green AI applications prioritizing sustainability
  • Research into interpretable language models

How to Use

With the Gradio Demo

Try the interactive demo: antalvdb/olifant-generate

Programmatic Usage

from transformers import GPT2Tokenizer
import timbl

# Load tokenizer
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

# Load OLIFANT model
classifier = timbl.TimblClassifier(
    "olifant",
    "-a1 +D +vdb+di"  # -a1 = IGTree; +D/+vdb/+di report class distributions and distances
)
classifier.load("edufineweb_train_1-3_chatbot_tok.l16r0.igtree.ibase")

# Prepare context (16 tokens, underscore-padded)
prompt = "The capital of France is"
tokens = tokenizer.tokenize(prompt)
context = ["_"] * (16 - len(tokens)) + tokens[-16:]

# Predict next token
result = classifier.classify(context)
predicted_token = result[0]
print(f"Predicted: {tokenizer.convert_tokens_to_string([predicted_token])}")
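To generate longer completions, the single-token prediction above can be wrapped in a greedy loop. This is a minimal sketch: it assumes a `classify` callable (such as `classifier.classify` from the example above) that takes a list of `window` features and returns a tuple whose first element is the predicted label; sampling and stopping criteria are left out:

```python
def greedy_generate(classify, tokens, n_new=20, window=16, pad="_"):
    """Greedily extend `tokens` by repeatedly predicting the next token.

    classify: callable taking a list of `window` features, returning
              a tuple whose first element is the predicted class label.
    """
    out = list(tokens)
    for _ in range(n_new):
        # Underscore-pad on the left, then keep the last `window` tokens
        context = [pad] * max(0, window - len(out)) + out[-window:]
        result = classify(context)
        out.append(result[0])
    return out
```

With the model loaded as above, usage would look like `generated = greedy_generate(classifier.classify, tokenizer.tokenize(prompt))`, followed by `tokenizer.convert_tokens_to_string(generated)`.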

Training Data

  • EduFineweb: High-quality educational web text (shards 1-3)
  • Chatbot Instructions: Conversational prompt-response pairs
  • Total: ~71M (XS) to ~321M (M) tokens per model, indexed in a prefix trie structure
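Storing instances in a prefix trie means contexts that share a prefix share nodes, which keeps the index compact and lets prediction back off to shorter matched contexts. The sketch below illustrates the idea only; it is not TiMBL's actual IGTree implementation, in which features are additionally ordered by information gain before the tree is built:

```python
from collections import defaultdict

class TrieNode:
    def __init__(self):
        self.children = {}
        self.class_counts = defaultdict(int)  # next-token frequencies at this node

def insert(root, context, target):
    """Store one (context, next-token) instance; shared prefixes share nodes."""
    node = root
    node.class_counts[target] += 1
    for feat in context:
        node = node.children.setdefault(feat, TrieNode())
        node.class_counts[target] += 1

def predict(root, context):
    """Walk as deep as the context matches, then return the majority class
    at the deepest matched node (a simple back-off)."""
    node = root
    for feat in context:
        if feat not in node.children:
            break
        node = node.children[feat]
    return max(node.class_counts, key=node.class_counts.get)
```

Because every prediction ends at a concrete node built from specific training instances, the supporting examples can be read off directly, which is the source of the model's explainability.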

Performance

| Metric | Value |
|---|---|
| Inference Speed | 10-50 tokens/sec (CPU) |
| RAM Required | ~8-10 GB |
| Accuracy | Approaching GPT-2 level |
| Best Use | Short-form completions (20-50 tokens) |

Limitations

  • Context window: 4-16 tokens (considerably shorter than modern neural LLMs)
  • Creativity: Memory-based retrieval limits novel generation, stays close to training data
  • Optimal for: Factual completions, recitations and structured responses
  • Dependencies: Requires TiMBL system package for training

Environmental Impact

OLIFANT achieves a roughly 1,000x lower carbon footprint than GPU-based neural language models:

  • No GPU required for training or inference
  • Efficient prefix trie storage
  • Minimal compute requirements

Citation

@article{vandenbosch2025olifant,
  title={Memory-based Language Models: An Efficient, Explainable, and Eco-friendly Approach to Large Language Modeling},
  author={Van den Bosch, Antal and Risco Pat{\'o}n, Alejandro and Buijse, Thom and Berck, Peter and Van Gompel, Maarten},
  journal={arXiv preprint arXiv:2510.22317},
  year={2025}
}

License

GPL-3.0
