nielsr (HF Staff) committed · Commit f59eeeb · verified · 1 Parent(s): b68d310

Add pipeline tag, sample usage, and update GitHub link

This PR improves the model card for ARC-Encoder by making the following changes:

- **Adds `pipeline_tag: feature-extraction`** to the metadata. This accurately reflects the model's core functionality (compressing text into continuous representations) and enhances its discoverability on the Hugging Face Hub.
- **Adds a "Sample Usage" section** with a Python code snippet and a helpful remark, directly sourced from the project's GitHub README. This provides users with a quick and easy way to get started with the model.
- **Updates the GitHub repository link** in the introductory paragraph to point to the project's root URL (`https://github.com/kyutai-labs/ARC-Encoder`) for consistency with the project's GitHub README.
- **Corrects a typo** from "dowstream tasks" to "downstream tasks" for better readability.

These changes aim to make the model card more informative and user-friendly.
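
As a rough illustration of the metadata change, the sketch below reads the YAML front matter of a model card and checks the `pipeline_tag` value. The helper `read_front_matter` is hypothetical and not part of this PR or of any Hugging Face library; it only handles the flat `key: value` pairs and simple `- item` lists that appear in this card, so a real YAML parser would be the better choice for anything more complex.

```python
def read_front_matter(card_text: str) -> dict:
    """Parse flat `key: value` pairs and `- item` lists from YAML front matter.

    Hypothetical helper for illustration only: it does not support nested
    mappings, quoting, or multi-line values.
    """
    lines = card_text.splitlines()
    # Front matter must start on the first line with a `---` delimiter.
    if not lines or lines[0].strip() != "---":
        return {}

    metadata = {}
    current_key = None
    for line in lines[1:]:
        stripped = line.strip()
        if stripped == "---":  # closing delimiter: stop reading metadata
            break
        if stripped.startswith("- ") and isinstance(metadata.get(current_key), list):
            # List item belonging to the most recent key with an empty value.
            metadata[current_key].append(stripped[2:].strip())
        elif ":" in stripped:
            key, _, value = stripped.partition(":")
            current_key = key.strip()
            value = value.strip()
            # An empty value means a list is expected on the following lines.
            metadata[current_key] = value if value else []
    return metadata


# Front matter as it appears in the updated README.md of this PR.
CARD = """---
language:
- en
license: cc-by-4.0
tags:
- model_hub_mixin
- pytorch_model_hub_mixin
pipeline_tag: feature-extraction
---

# ARC-Encoder models
"""

meta = read_front_matter(CARD)
print(meta["pipeline_tag"])
```

The same check could be run against the old front matter to confirm that `pipeline_tag` was previously absent.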

Files changed (1)
  1. README.md +23 -12
README.md CHANGED
@@ -1,41 +1,52 @@
  ---
- license: cc-by-4.0
  language:
  - en
  tags:
  - model_hub_mixin
  - pytorch_model_hub_mixin
  ---

- # ARC-Encoder models

- This page houses `ARC8-Encoder_Mistral` from three different versions of pretrained ARC-Encoders. Architectures and methods to train them are described in the paper *ARC-Encoder: learning compressed text representations for large language models* available [here](https://arxiv.org/abs/2510.20535). A code to reproduce the pretraining, further fine-tune the encoders or even evaluate them on dowstream tasks is available at [ARC-Encoder repository](https://github.com/kyutai-labs/ARC-Encoder/tree/main).

- ## Models Details

- All the encoders released here are trained on web crawl filtered using [Dactory](https://github.com/kyutai-labs/dactory) based on a [Llama3.2-3B](https://github.com/meta-llama/llama-cookbook) base backbone. It consists in two ARC-Encoder specifically trained for one decoder and one for two decoders in the same time:
  - `ARC8-Encoder_Llama`, trained on 2.6B tokens on [Llama3.1-8B](https://github.com/meta-llama/llama-cookbook) base specifically with a pooling factor of 8.
  - `ARC8-Encoder_Mistral`, trained on 2.6B tokens on [Mistral-7B](https://github.com/mistralai/mistral-finetune?tab=readme-ov-file) base specifically with a pooling factor of 8.
  - `ARC8-Encoder_multi`, trained by sampling among the two decoders with a pooling factor of 8.

- ### Uses

- As described in the [paper](https://arxiv.org/abs/2510.20535), the pretrained ARC-Encoders can be fine-tuned to perform various downstream tasks.
  You can also adapt an ARC-Encoder to a new pooling factor (PF) by fine-tuning it on the desired PF.
  For optimal results, we recommend fine-tuning toward a lower PF than the one used during pretraining.
  To reproduce the results presented in the paper, you can use our released fine-tuning dataset, [ARC_finetuning](https://huggingface.co/datasets/kyutai/ARC_finetuning).

- ### Licensing

- ARC-Encoders are licensed under the CC-BY 4.0 license.

  Terms of use: As the released models are pretrained from Llama3.2 3B backbone, ARC-Encoders are subject to the Llama Terms of Use found at [Llama license](https://www.llama.com/license/).

- ## Citations

- If you use one of these models, please cite:

- ```bibtex
  @misc{pilchen2025arcencoderlearningcompressedtext,
  title={ARC-Encoder: learning compressed text representations for large language models},
  author={Hippolyte Pilchen and Edouard Grave and Patrick Pérez},
 
  ---
  language:
  - en
+ license: cc-by-4.0
  tags:
  - model_hub_mixin
  - pytorch_model_hub_mixin
+ pipeline_tag: feature-extraction
  ---

+ # ARC-Encoder models
+
+ This page houses `ARC8-Encoder_Mistral` from three different versions of pretrained ARC-Encoders. Architectures and methods to train them are described in the paper *ARC-Encoder: learning compressed text representations for large language models* available [here](https://arxiv.org/abs/2510.20535). A code to reproduce the pretraining, further fine-tune the encoders or even evaluate them on downstream tasks is available at [ARC-Encoder repository](https://github.com/kyutai-labs/ARC-Encoder).
+
+ ## Sample Usage
+ First, use the following code to load the released models and format the folders accurately in your `<TMP_PATH>`. You just need to perform it once per model:
+ ```python
+ from embed_llm.models.augmented_model import load_and_save_released_models

+ # Example for ARC8_Encoder_Mistral, other options include "ARC8_Encoder_Llama" or "ARC8_Encoder_multi"
+ load_and_save_released_models("ARC8_Encoder_Mistral", hf_token="<YOUR_HF_TOKEN>")
+ ```
+ *Remark:* This code snippet loads the model from Hugging Face and then creates the appropriate folder at `<TMP_PATH>` containing the checkpoint and additional necessary files to perform finetuning or evaluation with this codebase. To reduce the occupied memory space, you can then delete the model from your Hugging Face cache.

+ ## Models Details

+ All the encoders released here are trained on web crawl filtered using [Dactory](https://github.com/kyutai-labs/dactory) based on a [Llama3.2-3B](https://github.com/meta-llama/llama-cookbook) base backbone. It consists in two ARC-Encoder specifically trained for one decoder and one for two decoders in the same time:
  - `ARC8-Encoder_Llama`, trained on 2.6B tokens on [Llama3.1-8B](https://github.com/meta-llama/llama-cookbook) base specifically with a pooling factor of 8.
  - `ARC8-Encoder_Mistral`, trained on 2.6B tokens on [Mistral-7B](https://github.com/mistralai/mistral-finetune?tab=readme-ov-file) base specifically with a pooling factor of 8.
  - `ARC8-Encoder_multi`, trained by sampling among the two decoders with a pooling factor of 8.

+ ### Uses

+ As described in the [paper](https://arxiv.org/abs/2510.20535), the pretrained ARC-Encoders can be fine-tuned to perform various downstream tasks.
  You can also adapt an ARC-Encoder to a new pooling factor (PF) by fine-tuning it on the desired PF.
  For optimal results, we recommend fine-tuning toward a lower PF than the one used during pretraining.
  To reproduce the results presented in the paper, you can use our released fine-tuning dataset, [ARC_finetuning](https://huggingface.co/datasets/kyutai/ARC_finetuning).

+ ### Licensing

+ ARC-Encoders are licensed under the CC-BY 4.0 license.

  Terms of use: As the released models are pretrained from Llama3.2 3B backbone, ARC-Encoders are subject to the Llama Terms of Use found at [Llama license](https://www.llama.com/license/).

+ ## Citations

+ If you use one of these models, please cite:

+ ```bibtex
  @misc{pilchen2025arcencoderlearningcompressedtext,
  title={ARC-Encoder: learning compressed text representations for large language models},
  author={Hippolyte Pilchen and Edouard Grave and Patrick Pérez},