Improve model card: Add abstract and full paper title to link
#2 by nielsr (HF Staff) - opened
README.md CHANGED

```diff
@@ -1,15 +1,15 @@
 ---
-language: en
-license: apache-2.0
-library_name: transformers
-tags:
-- tptt
-- peft
-- trust_remote_code
-pipeline_tag: text-generation
 base_model: allenai/OLMo-1B-hf
 datasets:
 - yahma/alpaca-cleaned
+language: en
+library_name: transformers
+license: apache-2.0
+pipeline_tag: text-generation
+tags:
+- tptt
+- peft
+- trust_remote_code
 ---
 
 # Titanesque-OLMo-1B-hf
@@ -34,8 +34,10 @@ datasets:
 
 Titanesque version of `allenai/OLMo-1B-hf` with parallel linearized attention (TPTT 😊) and PEFT.
 
-The architecture was presented in the paper [TPTT](https://huggingface.co/papers/2506.17671).
+The architecture was presented in the paper [TPTT: Transforming Pretrained Transformers into Titans](https://huggingface.co/papers/2506.17671).
 
+## Abstract
+
+Transformer-based large language models (LLMs) have achieved strong performance across many natural language processing tasks. Nonetheless, their quadratic computational and memory requirements, particularly in self-attention layers, pose challenges for efficient inference on long contexts and for deployment in resource-limited environments. We present TPTT (Transforming Pretrained Transformers into Titans), a framework designed to augment pretrained Transformers with linearized attention (LiZA) and internal memory gating via Memory as Gate (MaG), applied without full retraining. TPTT supports parameter-efficient fine-tuning (LoRA) and integrates with standard toolkits such as Hugging Face Transformers. We evaluated TPTT on several pretrained models, including Llama-1B, OlMoE-1B-7B, Qwen2.5-1.5B, Gemma3-270m, OpenELM-1.3B, and Mistral-7B, in order to assess applicability across architectures of different scales. Experiments on models with approximately 1 billion parameters, evaluated primarily on the MMLU benchmark, suggest potential improvements in both efficiency and accuracy compared to baseline models. For example, Titans-Llama-1B exhibited up to a 20% relative increase in Exact Match scores in one-shot evaluation. An additional finding is that it is possible to convert a quadratic-attention model into a purely linear-attention model using the DeltaProduct mechanism. All training runs were carried out with modest computational resources. These preliminary findings indicate that TPTT may help adapt pretrained LLMs for long-context tasks with limited overhead. Further studies on larger models and a broader set of benchmarks will be necessary to evaluate the generality and robustness of the framework. Code is available at this https URL. Python package at this https URL.
+
 ## Model list
 
@@ -45,7 +47,7 @@ Classic model parameter with LiZA injection :
 |-------------------------------|----------------------|------------|------------|----------------|---------------|------|-------------------------------------------------------|
 | delta_rule                    | 8192 (default)       | 0.5        | False      | 64             | False         | Yes  | Parallel linearized attention with delta_rule operator|
 | delta_rule_gelu               | 8192 (default)       | 0.5        | False      | 64             | False         | Yes  | Non-linear operator with gelu activation              |
-| delta_product                 | 8192 (default)       | 0.5        | False
+| delta_product                 | 8192 (default)       | 0.5        | False      | 64             | False         | Yes  | Second order operator with derivative trick           |
 | delta_product_r               | 8192 (default)       | 0.5        | False      | 64             | False         | Yes  | Second order operator with rotative trick             |
 | delta_product_c               | 8192 (default)       | 0.5        | False      | 64             | False         | Yes  | Second order operator with combined trick             |
 
@@ -73,5 +75,4 @@ print(tokenizer.decode(outputs, skip_special_tokens=True))
 
 If you use TPTT in your academic work, please cite [Furfaro](https://huggingface.co/ffurfaro). For questions or support, please open an issue on the [GitHub repository](https://github.com/fabienfrfr/tptt) or contact the maintainer.
 
-
 ---
```
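For context on the operator table above: the `delta_rule` entry refers to the delta-rule state update commonly used in linearized attention (as in DeltaNet-style methods). The sketch below is a rough NumPy illustration of that recurrence only, not TPTT's actual implementation; the function name, shapes, and the per-token `beta` gate are assumptions.

```python
import numpy as np

def delta_rule_attention(q, k, v, beta):
    """Illustrative delta-rule linear-attention recurrence (assumed form):
        S_t = S_{t-1} - beta_t * (S_{t-1} k_t) k_t^T + beta_t * v_t k_t^T
        o_t = S_t q_t
    i.e. the state S first erases the old value stored under key k_t,
    then writes the new value v_t, gated by beta_t.
    Shapes: q, k: (T, d_k); v: (T, d_v); beta: (T,)."""
    T, d_k = k.shape
    d_v = v.shape[1]
    S = np.zeros((d_v, d_k))          # associative state (value x key)
    outputs = np.zeros((T, d_v))
    for t in range(T):
        kt, vt, bt = k[t], v[t], beta[t]
        # Delta-rule update: remove the current association for k_t,
        # then store v_t under k_t.
        S = S - bt * np.outer(S @ kt, kt) + bt * np.outer(vt, kt)
        outputs[t] = S @ kt * 0 + S @ q[t]  # read out with the query
    return outputs
```

With `beta = 1` and unit-norm keys, querying a previously written key retrieves the most recently stored value, which is the "overwrite" behavior that distinguishes the delta rule from plain additive linear attention.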
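The card also tags `peft`, and the abstract notes that TPTT uses parameter-efficient fine-tuning via LoRA. As a minimal sketch of the LoRA reparameterization itself (hypothetical names; this is not the `peft` library's API), the frozen weight is augmented by a trainable low-rank update:

```python
import numpy as np

def lora_forward(x, W, A, B, alpha=16.0):
    """Sketch of a LoRA-adapted linear layer (illustrative only).

    The frozen weight W (d_out x d_in) is augmented by a low-rank
    update (alpha / r) * B @ A, with A (r x d_in) and B (d_out x r).
    Only A and B would receive gradients during fine-tuning.
    """
    r = A.shape[0]  # LoRA rank
    return x @ W.T + (alpha / r) * (x @ A.T) @ B.T
```

Because `B` is conventionally initialized to zero, the adapted layer starts out exactly equal to the pretrained layer, so fine-tuning begins from the base model's behavior.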