language:
  - en
license: cc-by-4.0
tags:
  - model_hub_mixin
  - pytorch_model_hub_mixin
pipeline_tag: feature-extraction

ARC-Encoder models

This page hosts ARC8-Encoder_Mistral, one of three released versions of pretrained ARC-Encoders. The architectures and training methods are described in the paper ARC-Encoder: learning compressed text representations for large language models (https://arxiv.org/abs/2510.20535). Code to reproduce the pretraining, further fine-tune the encoders, or evaluate them on downstream tasks is available in the ARC-Encoder repository.

Sample Usage

First, use the following code to load the released models and set up the expected folder layout under your <TMP_PATH>. This only needs to be done once per model:

from embed_llm.models.augmented_model import load_and_save_released_models

# Example for ARC8_Encoder_Mistral, other options include "ARC8_Encoder_Llama" or "ARC8_Encoder_multi"
load_and_save_released_models("ARC8_Encoder_Mistral", hf_token="<YOUR_HF_TOKEN>")

Remark: This code snippet loads the model from Hugging Face and then creates the appropriate folder at <TMP_PATH>, containing the checkpoint and the additional files needed for fine-tuning or evaluation with this codebase. To free disk space, you can then delete the model from your Hugging Face cache.
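Deleting the cached copy can be scripted. The following is a minimal sketch using only the Python standard library, not part of the ARC-Encoder codebase. It assumes the default Hugging Face hub cache layout (a `models--<org>--<name>` folder under the hub cache directory, which can be overridden with the `HF_HUB_CACHE` or `HF_HOME` environment variables). The helper names `hf_cache_dir`, `cached_model_folder` and `delete_cached_model` are hypothetical, and `<ORG>` in the usage note stands for the organization that hosts the model on the Hub.

```python
import os
import shutil
from pathlib import Path
from typing import Optional


def hf_cache_dir() -> Path:
    """Resolve the hub cache directory (assumes the default Hugging Face conventions)."""
    if "HF_HUB_CACHE" in os.environ:
        return Path(os.environ["HF_HUB_CACHE"])
    hf_home = os.environ.get("HF_HOME", str(Path.home() / ".cache" / "huggingface"))
    return Path(hf_home) / "hub"


def cached_model_folder(repo_id: str, cache_dir: Optional[Path] = None) -> Path:
    """Return the folder where a model repo is cached, e.g. models--org--name."""
    base = cache_dir if cache_dir is not None else hf_cache_dir()
    return base / ("models--" + repo_id.replace("/", "--"))


def delete_cached_model(repo_id: str, cache_dir: Optional[Path] = None) -> bool:
    """Delete the cached copy of a model repo. Returns False if nothing was cached."""
    folder = cached_model_folder(repo_id, cache_dir)
    if not folder.is_dir():
        return False
    shutil.rmtree(folder)
    return True
```

Once the folder at <TMP_PATH> has been created, calling `delete_cached_model("<ORG>/ARC8_Encoder_Mistral")` would remove the cached download. Only the cache copy is touched; the files under <TMP_PATH> are left in place.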

Model Details

All encoders released here are built on a Llama3.2-3B base backbone and trained on web-crawl data filtered using Dactory. The release consists of two ARC-Encoders, each trained for a single decoder, and one trained for two decoders at the same time:

ARC8-Encoder_Llama, trained on 2.6B tokens specifically for the Llama3.1-8B base decoder, with a pooling factor of 8.
ARC8-Encoder_Mistral, trained on 2.6B tokens specifically for the Mistral-7B base decoder, with a pooling factor of 8.
ARC8-Encoder_multi, trained by sampling between the two decoders, with a pooling factor of 8.
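To give a feel for what a pooling factor of 8 means in practice, the sketch below treats it as a compression ratio: roughly one output representation for every 8 input tokens. This is illustrative arithmetic only, not code from the ARC-Encoder repository. The helper name `compressed_length` is hypothetical, and rounding a trailing partial group up to one extra vector is an assumption; the paper and codebase define the exact behavior.

```python
def compressed_length(n_tokens: int, pooling_factor: int = 8) -> int:
    """Approximate number of compressed representations for n_tokens input tokens.

    Assumption: a trailing group with fewer than pooling_factor tokens
    still produces one representation (ceiling division).
    """
    if n_tokens < 0:
        raise ValueError("n_tokens must be non-negative")
    if pooling_factor < 1:
        raise ValueError("pooling_factor must be at least 1")
    return (n_tokens + pooling_factor - 1) // pooling_factor
```

Under these assumptions, a 1024-token context would be passed to the decoder as about 128 representations at a pooling factor of 8, and about 256 at a pooling factor of 4.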

Uses

As described in the paper, the pretrained ARC-Encoders can be fine-tuned to perform various downstream tasks. You can also adapt an ARC-Encoder to a new pooling factor (PF) by fine-tuning it on the desired PF. For optimal results, we recommend fine-tuning toward a lower PF than the one used during pretraining. To reproduce the results presented in the paper, you can use our released fine-tuning dataset, ARC_finetuning.

Licensing

ARC-Encoders are licensed under the CC-BY 4.0 license.

Terms of use: As the released models are pretrained from a Llama3.2-3B backbone, ARC-Encoders are subject to the Llama Terms of Use found in the Llama license.

Citations

If you use one of these models, please cite:

@misc{pilchen2025arcencoderlearningcompressedtext,
      title={ARC-Encoder: learning compressed text representations for large language models}, 
      author={Hippolyte Pilchen and Edouard Grave and Patrick Pérez},
      year={2025},
      eprint={2510.20535},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2510.20535}, 
}