---
datasets:
- HuggingFaceTB/smollm-corpus
language:
- en
license: mit
library_name: transformers
pipeline_tag: text-generation
---

# Raw 1B Shared

This model is a 1B-parameter language model pre-trained as a baseline for the research presented in the paper [Enhancing Linguistic Competence of Language Models through Pre-training with Language Learning Tasks](https://huggingface.co/papers/2601.03448).

L2T (Language Learning Tasks) is a pre-training framework that integrates structured linguistic tasks alongside standard next-token prediction to explicitly optimize for linguistic competence in Large Language Models (LLMs). This specific checkpoint is the baseline model trained on raw text.

- **Paper:** [Enhancing Linguistic Competence of Language Models through Pre-training with Language Learning Tasks](https://huggingface.co/papers/2601.03448)
- **Repository:** [gucci-j/l2t](https://github.com/gucci-j/l2t)

## How to Get Started with the Model

Use the code below to get started with the model.

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("l2t-project/raw-1b-shared")
tokenizer = AutoTokenizer.from_pretrained("l2t-project/raw-1b-shared")
```
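
Once loaded, text can be generated with the standard `generate` API. The sketch below is illustrative: the prompt and decoding parameters are not taken from the paper.

```python
# Illustrative example: encode a prompt and sample a short continuation.
inputs = tokenizer("The quick brown fox", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50, do_sample=True, top_p=0.9)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```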

## Citation

```bibtex
@article{yamaguchi2026enhancinglinguisticcompetencelanguage,
  title={Enhancing Linguistic Competence of Language Models through Pre-training with Language Learning Tasks},
  author={Atsuki Yamaguchi and Maggie Mi and Nikolaos Aletras},
  year={2026},
  eprint={2601.03448},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2601.03448},
  journal={arXiv},
  volume={abs/2601.03448}
}
```