nielsr HF Staff committed on
Commit
f93c65f
verified
1 Parent(s): 7ed2a13

Improve model card: Add pipeline tag, library name, paper, code, and usage


This PR significantly enhances the model card for `mblm-chatbot-instruction-prompts-igtree` by:
- Adding the `pipeline_tag: text-generation` to correctly categorize the model for next-token prediction tasks.
- Adding `library_name: transformers`. The `TimblHuggingFaceModel` architecture (declared in `config.json` and used in the GitHub README's "Hugging Face style" example) inherits from `transformers.PreTrainedModel`, so the model is compatible with the Hugging Face `transformers` library; this tag also enables the automated "How to use" widget.
- Including a direct link to the research paper: [Memory-based Language Models: An Efficient, Explainable, and Eco-friendly Approach to Large Language Modeling](https://huggingface.co/papers/2510.22317).
- Adding a link to the official GitHub repository: [https://github.com/antalvdb/olifant](https://github.com/antalvdb/olifant).
- Providing a sample usage code snippet, adapted from the GitHub repository, demonstrating how to initialize and use the model for text generation with `transformers`.
- Adding a comprehensive model description and the relevant BibTeX citation.

Please review and merge this PR to improve the discoverability and usability of this model on the Hub.
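For reference, the model card's YAML front matter after these changes reads as follows (this mirrors the diff in this PR):

```yaml
---
datasets:
- alespalla/chatbot_instruction_prompts
language:
- en
license: gpl-3.0
pipeline_tag: text-generation
library_name: transformers
---
```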

Files changed (1)
  1. README.md +84 -2
README.md CHANGED
@@ -1,7 +1,89 @@
  ---
- license: gpl-3.0
  datasets:
  - alespalla/chatbot_instruction_prompts
  language:
  - en
- ---
+ license: gpl-3.0
+ pipeline_tag: text-generation
+ library_name: transformers
+ ---
+
+ # Olifant: Memory-based language modeling
+
+ This repository contains the **Olifant** model, an implementation of memory-based language modeling, presented in the paper [Memory-based Language Models: An Efficient, Explainable, and Eco-friendly Approach to Large Language Modeling](https://huggingface.co/papers/2510.22317).
+
+ Olifant models offer several unique properties and benefits:
+ * **Scalable Learning**: Learning is scalable and incremental. Model performance increases approximately log-linearly with more data; model size, learning time, and RAM usage co-scale linearly with more data.
+ * **Low CO2 Emissions**: Consistently low CO2 emissions during training and inference. *Olifant* runs on CPUs, with estimated CO2 emissions significantly lower than neural LM training (1,000 times lower) and inference (10-100 times lower).
+ * **Transparent Functioning**: Fully transparent functioning. *Olifant* offers nearest-neighbor-based explanations for predictions, based on individual examples, allowing for full provenance.
+ * **Intentional Memorization**: Depending on context size settings, *Olifant* models can faithfully recite the majority of tokens from their training data.
+
+ For more details, installation instructions, and further usage examples, please refer to the [official GitHub repository](https://github.com/antalvdb/olifant).
+
+ ## Usage (Hugging Face style)
+
+ You can use the `TimblHuggingFaceModel` with the Hugging Face `transformers` library for GPT-style text completion. This requires the `olifant` library to be installed (e.g., `pip install olifant`).
+
+ **Note:** For actual inference, you will need a trained `.ibase` classifier file. The `CLASSIFIER_PATH` in the example below should point to your `.ibase` file. You can generate this file by following the training instructions in the [Olifant GitHub repository](https://github.com/antalvdb/olifant#training).
+
+ ```python
+ import torch
+ from transformers import AutoTokenizer, AutoConfig
+ from olifant.model.hf_wrapper import TimblHuggingFaceModel
+ from olifant.classifier import timbl
+
+ # Define paths and arguments
+ # IMPORTANT: Replace "path/to/your/textfile_tok.l4r0.ibase" with the actual path to your .ibase file.
+ CLASSIFIER_PATH = "path/to/your/textfile_tok.l4r0.ibase"
+ TOKENIZER_NAME = "gpt2"  # The tokenizer used during training (e.g., 'gpt2' as per olifant-tok)
+ TIMBL_ARGS = "-a4"  # For TRIBL2 k-NN approximation (as recommended for inference)
+
+ # Initialize the tokenizer
+ tokenizer = AutoTokenizer.from_pretrained(TOKENIZER_NAME)
+ tokenizer.add_special_tokens({'pad_token': '_'})
+ tokenizer.pad_token = "_"  # Ensure pad token is set
+
+ # Initialize the Timbl classifier
+ classifier_core = timbl.TimblClassifier(CLASSIFIER_PATH, TIMBL_ARGS)
+ classifier_core.load()
+
+ # Load the model configuration from the Hugging Face Hub
+ config = AutoConfig.from_pretrained("antalvdb/mblm-chatbot-instruction-prompts-igtree")
+
+ # Initialize the TimblHuggingFaceModel
+ model = TimblHuggingFaceModel(config, classifier_core, tokenizer)
+
+ # Example text generation
+ input_text = "The quick brown fox jumps over the lazy"
+ input_ids = tokenizer.encode(input_text, return_tensors="pt")
+
+ # Perform text generation
+ with torch.no_grad():
+     output_ids = model.generate(
+         input_ids,
+         max_new_tokens=10,
+         num_beams=1,
+         do_sample=False,  # Use greedy decoding for simplicity
+         pad_token_id=tokenizer.pad_token_id,
+         eos_token_id=tokenizer.eos_token_id,
+     )
+
+ generated_text = tokenizer.decode(output_ids[0], skip_special_tokens=True)
+ print(f"Input: {input_text}")
+ print(f"Generated: {generated_text}")
+ ```
+
+ ## Citation
+ If you find this work helpful, please consider citing the paper:
+
+ ```bibtex
+ @article{van_den_bosch_risco_paton_buijse_berck_van_gompel_2025,
+   title={Memory-based language models: An efficient, explainable, and eco-friendly approach to large language modeling},
+   author={Van den Bosch, Antal and Risco Patón, Ainhoa and Buijse, Teun and Berck, Peter and Van Gompel, Maarten},
+   year={2025},
+   eprint={2510.22317},
+   archivePrefix={arXiv},
+   primaryClass={cs.CL},
+   url={https://arxiv.org/abs/2510.22317},
+ }
+ ```