orionweller committed · verified · Commit 04c021a · Parent: 621b6e6

Update README.md

Files changed (1): README.md (+651 -52)
README.md CHANGED
@@ -28,10 +28,11 @@ This model is part of the Ettin suite - the first collection of paired encoder-o

- [Decoder Models](#decoder-models)
- [Cross-Objective Models](#cross-objective-models)
- [Accessing Training Checkpoints](#accessing-training-checkpoints)
- [Research Applications](#research-applications)
- [Training Details](#training-details)
- [Model Architecture](#model-architecture)
- [Usage Examples](#usage-examples)
- [Fine-tuning Examples](#fine-tuning-examples)
- [Citation](#citation)

## 📊 Performance Highlights

@@ -53,7 +54,10 @@ This model is part of the Ettin suite - the first collection of paired encoder-o

### Installation
```bash
pip install torch>=1.9.0
# Encoders work with transformers>=4.48.0.
# Decoders require installing from main until the next pip release (transformers>=4.54.x will include them):
pip install git+https://github.com/huggingface/transformers.git
```
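
You can sanity-check the environment before running the examples below; this minimal snippet assumes only that `transformers` is importable (a main-branch install reports a dev version string):

```python
# Quick environment check: encoders need transformers>=4.48.0; decoders need a
# build from main until the next release (which reports a dev version string).
import transformers

print(transformers.__version__)
```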

### 30-Second Examples

@@ -198,6 +202,62 @@ model = AutoModelForCausalLM.from_pretrained(

This checkpoint availability enables detailed analysis of training dynamics, loss curves, and capability emergence across the complete 2T-token training process.
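
As a sketch of the kind of analysis this enables, the loop below tracks language-modeling loss on a fixed probe sentence across checkpoints. This is an illustrative sketch, not a script from the repo: swap in the checkpoint revision tags listed on each model's Hub page.

```python
# Illustrative sketch: trace LM loss on a probe sentence across training
# checkpoints. Replace the revision list with the checkpoint tags published
# on the model's Hugging Face page ("main" is the final model).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "jhu-clsp/ettin-decoder-17m"
tokenizer = AutoTokenizer.from_pretrained(repo)
probe = tokenizer("The capital of France is Paris.", return_tensors="pt")

for revision in ["main"]:  # e.g. add earlier checkpoint tags from the Hub here
    model = AutoModelForCausalLM.from_pretrained(repo, revision=revision)
    with torch.no_grad():
        loss = model(**probe, labels=probe["input_ids"]).loss
    print(f"{revision}: loss = {loss.item():.3f}")
```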

## 🔬 Research Applications

### What Makes Ettin Unique

Ettin provides the first **controlled comparison** of encoder vs. decoder architectures:

- **Identical Training Data**: Same 2T token mixture across all models
- **Matched Architectures**: Only attention patterns and objectives differ
- **Open Everything**: Training data, model weights, and batch-level training order
- **Multiple Scales**: Fair comparison from 17M to 1B parameters
- **250+ Checkpoints**: Complete training trajectory analysis

### Key Research Findings

1. **Architecture Specialization Persists**:
   - Encoders excel at classification/retrieval even vs. larger decoders
   - Decoders excel at generation even vs. larger encoders
   - A 400M encoder beats a 1B decoder on MNLI (89.2 vs. 88.2)

2. **Cross-Training Limitations**:
   - Converting decoder→encoder or encoder→decoder underperforms native training
   - 50B tokens of continued training is insufficient to close the gaps
   - The native training objective remains superior

3. **Scaling Insights**:
   - Performance gaps between architectures widen with size
   - Decoder-from-encoder adaptation scales particularly poorly

### Use Cases for Researchers

- **Architecture Studies**: Compare encoder vs. decoder capabilities fairly
- **Training Dynamics**: Analyze 250+ checkpoints with batch-level data ordering
- **Scaling Laws**: Study how architectural advantages change with scale
- **Transfer Learning**: Investigate cross-objective training effectiveness
- **Replication Studies**: First open replication of the ModernBERT training recipe

### Reproducibility

All training artifacts are publicly available:
- Training data with exact batch ordering
- Model checkpoints every 8.5B tokens
- Complete hyperparameter configurations
- Training code and evaluation scripts

## Training Details

**Data:** High-quality mixture including DCLM, Dolma v1.7, scientific papers, code, and curated sources totaling 2T+ tokens

**Architecture:** Transformer with RoPE, GLU activations, and prenorm layers

**Training Phases:**
- **Pre-training**: 1.7T tokens with a diverse data mixture
- **Mid-training**: 250B tokens with higher-quality filtered data and context extension to 8K
- **Decay phase**: 100B tokens with premium data sources

**Key Features:**
- Context length: Up to 8K tokens
- Vocabulary: 50,368 tokens (ModernBERT tokenizer)
- Deep but efficient architectures following MobileLLM principles

## Model Architecture

| Parameter | 17M | 32M | 68M | 150M | 400M | 1B |
|:----------|:----|:----|:----|:-----|:-----|:---|
| Layers | 7 | 10 | 19 | 22 | 28 | 28 |
| Hidden Size | 256 | 384 | 512 | 768 | 1024 | 1792 |
| Intermediate Size | 384 | 576 | 768 | 1152 | 2624 | 3840 |
| Attention Heads | 4 | 6 | 8 | 12 | 16 | 28 |
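
The table can be cross-checked against any released variant's config; a minimal sketch assuming only standard `transformers` config fields:

```python
# Verify a variant against the architecture table above.
from transformers import AutoConfig

config = AutoConfig.from_pretrained("jhu-clsp/ettin-encoder-150m")
print(config.num_hidden_layers)    # 22 per the 150M column
print(config.hidden_size)          # 768
print(config.num_attention_heads)  # 12
```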

## Usage Examples

### Encoder: Masked Language Modeling

@@ -274,74 +334,613 @@ print(generated)

</details>

## Fine-tuning Examples

### Encoders

<details><summary>Click to see how to finetune this into a dense embedding model using Sentence Transformers</summary>

```python
import argparse

from datasets import load_dataset
from sentence_transformers import (
    SentenceTransformer,
    SentenceTransformerTrainer,
    SentenceTransformerTrainingArguments,
)
from sentence_transformers.evaluation import TripletEvaluator
from sentence_transformers.losses import CachedMultipleNegativesRankingLoss
from sentence_transformers.training_args import BatchSamplers


def main():
    # Parse the learning rate & model name
    parser = argparse.ArgumentParser()
    parser.add_argument("--lr", type=float, default=8e-5)
    parser.add_argument("--model_name", type=str, default="jhu-clsp/ettin-encoder-150m")
    args = parser.parse_args()
    lr = args.lr
    model_name = args.model_name
    model_shortname = model_name.split("/")[-1]

    # 1. Load a model to finetune
    model = SentenceTransformer(model_name)

    # 2. Load a dataset to finetune on
    dataset = load_dataset(
        "sentence-transformers/msmarco-co-condenser-margin-mse-sym-mnrl-mean-v1",
        "triplet-hard",
        split="train",
    )
    dataset_dict = dataset.train_test_split(test_size=1_000, seed=12)
    train_dataset = dataset_dict["train"].select(range(1_250_000))
    eval_dataset = dataset_dict["test"]

    # 3. Define a loss function
    loss = CachedMultipleNegativesRankingLoss(model, mini_batch_size=16)  # Increase mini_batch_size if you have enough VRAM

    run_name = f"{model_shortname}-DPR-{lr}"
    # 4. (Optional) Specify training arguments
    args = SentenceTransformerTrainingArguments(
        # Required parameter:
        output_dir=f"output/{model_shortname}/{run_name}",
        # Optional training parameters:
        num_train_epochs=1,
        per_device_train_batch_size=512,
        per_device_eval_batch_size=512,
        warmup_ratio=0.05,
        fp16=False,  # Use FP16 only if your GPU can't run BF16
        bf16=True,  # Set to False if your GPU does not support BF16
        batch_sampler=BatchSamplers.NO_DUPLICATES,  # (Cached)MultipleNegativesRankingLoss benefits from no duplicates
        learning_rate=lr,
        # Optional tracking/debugging parameters:
        save_strategy="steps",
        save_steps=500,
        save_total_limit=2,
        logging_steps=500,
        run_name=run_name,  # Used in `wandb`, `tensorboard`, `neptune`, etc. if installed
    )

    # 5. (Optional) Create an evaluator & evaluate the base model
    dev_evaluator = TripletEvaluator(
        anchors=eval_dataset["query"],
        positives=eval_dataset["positive"],
        negatives=eval_dataset["negative"],
        name="msmarco-co-condenser-dev",
    )
    dev_evaluator(model)

    # 6. Create a trainer & train
    trainer = SentenceTransformerTrainer(
        model=model,
        args=args,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
        loss=loss,
        evaluator=dev_evaluator,
    )
    trainer.train()

    # 7. (Optional) Evaluate the trained model on the evaluator after training
    dev_evaluator(model)

    # 8. Save the model
    model.save_pretrained(f"output/{model_shortname}/{run_name}/final")

    # 9. (Optional) Push it to the Hugging Face Hub
    model.push_to_hub(run_name, private=False)


if __name__ == "__main__":
    main()
```
</details>
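After training, the saved checkpoint works as a drop-in dense retriever. A hedged usage sketch; the path mirrors the script's output layout (with the default `--lr`, the run name renders as `ettin-encoder-150m-DPR-8e-05`):

```python
# Embed a query and candidate documents with the finetuned model, then rank
# by cosine similarity. The path follows the training script's output layout.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("output/ettin-encoder-150m/ettin-encoder-150m-DPR-8e-05/final")
query_emb = model.encode("What is the capital of France?")
doc_embs = model.encode(["Paris is the capital of France.", "Berlin is the capital of Germany."])
print(util.cos_sim(query_emb, doc_embs))
```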
 
 
<details><summary>Click to see how to finetune this into a multi-vector embedding model with PyLate</summary>

```python
from datasets import load_dataset
from pylate import losses, models, utils
from sentence_transformers import (
    SentenceTransformerTrainer,
    SentenceTransformerTrainingArguments,
)


def main():
    # Load the datasets required for knowledge distillation (train, queries, documents)
    train = load_dataset(
        path="lightonai/ms-marco-en-bge",
        name="train",
        split="train",
    )

    queries = load_dataset(
        path="lightonai/ms-marco-en-bge",
        name="queries",
        split="train",
    )

    documents = load_dataset(
        path="lightonai/ms-marco-en-bge",
        name="documents",
        split="train",
    )

    # Set the transformation to load the documents/queries texts using the corresponding ids on the fly
    train.set_transform(
        utils.KDProcessing(queries=queries, documents=documents).transform,
    )

    # Define the base model, training parameters, and output directory
    num_train_epochs = 1
    lr = 8e-5
    batch_size = 16
    accum_steps = 1
    model_name = "jhu-clsp/ettin-encoder-150m"
    model_shortname = model_name.split("/")[-1]

    # Set the run name for logging and output directory
    run_name = f"{model_shortname}-colbert-KD-{lr}"
    output_dir = f"output/{model_shortname}/{run_name}"

    # Initialize the ColBERT model from the base model
    model = models.ColBERT(model_name_or_path=model_name)

    # Configure the training arguments (e.g., epochs, batch size, learning rate)
    args = SentenceTransformerTrainingArguments(
        output_dir=output_dir,
        num_train_epochs=num_train_epochs,
        per_device_train_batch_size=batch_size,
        fp16=False,  # Use FP16 only if your GPU can't run BF16
        bf16=True,  # Set to False if your GPU does not support BF16
        run_name=run_name,
        logging_steps=10,
        learning_rate=lr,
        gradient_accumulation_steps=accum_steps,
        warmup_ratio=0.05,
    )

    # Use the distillation loss function for training
    train_loss = losses.Distillation(model=model)

    # Initialize the trainer
    trainer = SentenceTransformerTrainer(
        model=model,
        args=args,
        train_dataset=train,
        loss=train_loss,
        data_collator=utils.ColBERTCollator(tokenize_fn=model.tokenize),
    )

    # Start the training process
    trainer.train()

    model.save_pretrained(f"{output_dir}/final")


if __name__ == "__main__":
    main()
```
</details>
521
 
522
+ <details><summary>Click to see how to finetune this into a sparse retrieval model using Sentence Transformers</summary>
 
 
 
 
523
 
524
+ ```python
525
+ import logging
526
 
527
+ from datasets import load_dataset
 
 
 
 
528
 
529
+ from sentence_transformers import (
530
+ SparseEncoder,
531
+ SparseEncoderModelCardData,
532
+ SparseEncoderTrainer,
533
+ SparseEncoderTrainingArguments,
534
+ )
535
+ from sentence_transformers.sparse_encoder.evaluation import SparseNanoBEIREvaluator
536
+ from sentence_transformers.sparse_encoder.losses import SparseMultipleNegativesRankingLoss, SpladeLoss
537
+ from sentence_transformers.training_args import BatchSamplers
538
+
539
+ logging.basicConfig(format="%(asctime)s - %(message)s", datefmt="%Y-%m-%d %H:%M:%S", level=logging.INFO)
540
+
541
+ # 1. Load a model to finetune with 2. (Optional) model card data
542
+ model = SparseEncoder(
543
+ "jhu-clsp/ettin-encoder-150m",
544
+ model_card_data=SparseEncoderModelCardData(
545
+ language="en",
546
+ license="apache-2.0",
547
+ )
548
+ )
549
 
550
+ # 3. Load a dataset to finetune on
551
+ full_dataset = load_dataset("sentence-transformers/natural-questions", split="train").select(range(100_000))
552
+ dataset_dict = full_dataset.train_test_split(test_size=1_000, seed=12)
553
+ train_dataset = dataset_dict["train"]
554
+ eval_dataset = dataset_dict["test"]
555
+
556
+ # 4. Define a loss function
557
+ loss = SpladeLoss(
558
+ model=model,
559
+ loss=SparseMultipleNegativesRankingLoss(model=model),
560
+ query_regularizer_weight=5e-5,
561
+ document_regularizer_weight=3e-5,
562
+ )
563
 
564
+ # 5. (Optional) Specify training arguments
565
+ run_name = "splade-distilbert-base-uncased-nq"
566
+ args = SparseEncoderTrainingArguments(
567
+ # Required parameter:
568
+ output_dir=f"models/{run_name}",
569
+ # Optional training parameters:
570
+ num_train_epochs=1,
571
+ per_device_train_batch_size=16,
572
+ per_device_eval_batch_size=16,
573
+ learning_rate=2e-5,
574
+ warmup_ratio=0.1,
575
+ fp16=True, # Set to False if you get an error that your GPU can't run on FP16
576
+ bf16=False, # Set to True if you have a GPU that supports BF16
577
+ batch_sampler=BatchSamplers.NO_DUPLICATES, # MultipleNegativesRankingLoss benefits from no duplicate samples in a batch
578
+ # Optional tracking/debugging parameters:
579
+ eval_strategy="steps",
580
+ eval_steps=1000,
581
+ save_strategy="steps",
582
+ save_steps=1000,
583
+ save_total_limit=2,
584
+ logging_steps=200,
585
+ run_name=run_name, # Will be used in W&B if `wandb` is installed
586
+ )
587
 
588
+ # 6. (Optional) Create an evaluator & evaluate the base model
589
+ dev_evaluator = SparseNanoBEIREvaluator(dataset_names=["msmarco", "nfcorpus", "nq"], batch_size=16)
590
+
591
+ # 7. Create a trainer & train
592
+ trainer = SparseEncoderTrainer(
593
+ model=model,
594
+ args=args,
595
+ train_dataset=train_dataset,
596
+ eval_dataset=eval_dataset,
597
+ loss=loss,
598
+ evaluator=dev_evaluator,
599
+ )
600
+ trainer.train()
601
 
602
+ # 8. Evaluate the model performance again after training
603
+ dev_evaluator(model)
 
 
604
 
605
+ # 9. Save the trained model
606
+ model.save_pretrained(f"models/{run_name}/final")
607
 
608
+ # 10. (Optional) Push it to the Hugging Face Hub
609
+ model.push_to_hub(run_name)
610
+
611
+ ```
612
+ </details>
613
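
A hedged usage sketch for the trained sparse model; `encode_query`/`encode_document` and `similarity` follow the Sentence Transformers v5 `SparseEncoder` API, and the path follows the script above:

```python
# Score a query against documents with the trained sparse (SPLADE-style) model.
from sentence_transformers import SparseEncoder

model = SparseEncoder("models/splade-ettin-encoder-150m-nq/final")
query_emb = model.encode_query("who wrote the declaration of independence")
doc_embs = model.encode_document([
    "The Declaration of Independence was drafted by Thomas Jefferson.",
    "The Eiffel Tower is located in Paris.",
])
print(model.similarity(query_emb, doc_embs))  # higher = more relevant
```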

<details><summary>Click to see how to finetune this into a reranker model using Sentence Transformers</summary>

```python
import logging
import traceback

import torch
from datasets import load_dataset
from sentence_transformers import SentenceTransformer
from sentence_transformers.cross_encoder import (
    CrossEncoder,
    CrossEncoderModelCardData,
    CrossEncoderTrainer,
    CrossEncoderTrainingArguments,
)
from sentence_transformers.cross_encoder.evaluation import (
    CrossEncoderNanoBEIREvaluator,
    CrossEncoderRerankingEvaluator,
)
from sentence_transformers.cross_encoder.losses import BinaryCrossEntropyLoss
from sentence_transformers.evaluation import SequentialEvaluator
from sentence_transformers.util import mine_hard_negatives

# Set the log level to INFO to get more information
logging.basicConfig(format="%(asctime)s - %(message)s", datefmt="%Y-%m-%d %H:%M:%S", level=logging.INFO)


def main():
    model_name = "jhu-clsp/ettin-encoder-150m"

    train_batch_size = 64
    num_epochs = 1
    num_hard_negatives = 5  # How many hard negatives should be mined for each question-answer pair

    # 1a. Load a model to finetune with 1b. (Optional) model card data
    model = CrossEncoder(
        model_name,
        model_card_data=CrossEncoderModelCardData(
            language="en",
            license="apache-2.0",
        ),
    )
    print("Model max length:", model.max_length)
    print("Model num labels:", model.num_labels)

    # 2a. Load the GooAQ dataset: https://huggingface.co/datasets/sentence-transformers/gooaq
    logging.info("Read the gooaq training dataset")
    full_dataset = load_dataset("sentence-transformers/gooaq", split="train").select(range(100_000))
    dataset_dict = full_dataset.train_test_split(test_size=1_000, seed=12)
    train_dataset = dataset_dict["train"]
    eval_dataset = dataset_dict["test"]
    logging.info(train_dataset)
    logging.info(eval_dataset)

    # 2b. Modify our training dataset to include hard negatives using a very efficient embedding model
    embedding_model = SentenceTransformer("sentence-transformers/static-retrieval-mrl-en-v1", device="cpu")
    hard_train_dataset = mine_hard_negatives(
        train_dataset,
        embedding_model,
        num_negatives=num_hard_negatives,  # How many negatives per question-answer pair
        margin=0,  # Similarity between query and negative samples should be x lower than query-positive similarity
        range_min=0,  # Skip the x most similar samples
        range_max=100,  # Consider only the x most similar samples
        sampling_strategy="top",  # Sample the top negatives from the range
        batch_size=4096,  # Use a batch size of 4096 for the embedding model
        output_format="labeled-pair",  # The output format is (query, passage, label), as required by BinaryCrossEntropyLoss
        use_faiss=True,
    )
    logging.info(hard_train_dataset)

    # 2c. (Optionally) Save the hard training dataset to disk
    # hard_train_dataset.save_to_disk("gooaq-hard-train")
    # Load again with:
    # hard_train_dataset = load_from_disk("gooaq-hard-train")

    # 3. Define our training loss.
    # pos_weight is recommended to be set as the ratio between positives to negatives, a.k.a. `num_hard_negatives`
    loss = BinaryCrossEntropyLoss(model=model, pos_weight=torch.tensor(num_hard_negatives))

    # 4a. Define evaluators. We use the CrossEncoderNanoBEIREvaluator, which is a light-weight evaluator for English reranking
    nano_beir_evaluator = CrossEncoderNanoBEIREvaluator(
        dataset_names=["msmarco", "nfcorpus", "nq"],
        batch_size=train_batch_size,
    )

    # 4b. Define a reranking evaluator by mining hard negatives given query-answer pairs
    # We include the positive answer in the list of negatives, so the evaluator can use the performance of the
    # embedding model as a baseline.
    hard_eval_dataset = mine_hard_negatives(
        eval_dataset,
        embedding_model,
        corpus=full_dataset["answer"],  # Use the full dataset as the corpus
        num_negatives=30,  # How many documents to rerank
        batch_size=4096,
        include_positives=True,
        output_format="n-tuple",
        use_faiss=True,
    )
    logging.info(hard_eval_dataset)
    reranking_evaluator = CrossEncoderRerankingEvaluator(
        samples=[
            {
                "query": sample["question"],
                "positive": [sample["answer"]],
                "documents": [sample[column_name] for column_name in hard_eval_dataset.column_names[2:]],
            }
            for sample in hard_eval_dataset
        ],
        batch_size=train_batch_size,
        name="gooaq-dev",
        # Realistic setting: only rerank the positives that the retriever found
        # Set to True to rerank *all* positives
        always_rerank_positives=False,
    )

    # 4c. Combine the evaluators & run the base model on them
    evaluator = SequentialEvaluator([reranking_evaluator, nano_beir_evaluator])
    evaluator(model)

    # 5. Define the training arguments
    short_model_name = model_name if "/" not in model_name else model_name.split("/")[-1]
    run_name = f"reranker-{short_model_name}-gooaq-bce"
    args = CrossEncoderTrainingArguments(
        # Required parameter:
        output_dir=f"models/{run_name}",
        # Optional training parameters:
        num_train_epochs=num_epochs,
        per_device_train_batch_size=train_batch_size,
        per_device_eval_batch_size=train_batch_size,
        learning_rate=2e-5,
        warmup_ratio=0.1,
        fp16=False,  # Use FP16 only if your GPU can't run BF16
        bf16=True,  # Set to False if your GPU does not support BF16
        dataloader_num_workers=4,
        load_best_model_at_end=True,
        metric_for_best_model="eval_gooaq-dev_ndcg@10",
        # Optional tracking/debugging parameters:
        eval_strategy="steps",
        eval_steps=1000,
        save_strategy="steps",
        save_steps=1000,
        save_total_limit=2,
        logging_steps=200,
        logging_first_step=True,
        run_name=run_name,  # Will be used in W&B if `wandb` is installed
        seed=12,
    )

    # 6. Create the trainer & start training
    trainer = CrossEncoderTrainer(
        model=model,
        args=args,
        train_dataset=hard_train_dataset,
        loss=loss,
        evaluator=evaluator,
    )
    trainer.train()

    # 7. Evaluate the final model, useful to include these in the model card
    evaluator(model)

    # 8. Save the final model
    final_output_dir = f"models/{run_name}/final"
    model.save_pretrained(final_output_dir)

    # 9. (Optional) save the model to the Hugging Face Hub!
    # It is recommended to run `huggingface-cli login` to log into your Hugging Face account first
    try:
        model.push_to_hub(run_name)
    except Exception:
        logging.error(
            f"Error uploading model to the Hugging Face Hub:\n{traceback.format_exc()}To upload it manually, you can run "
            f"`huggingface-cli login`, followed by loading the model using `model = CrossEncoder({final_output_dir!r})` "
            f"and saving it using `model.push_to_hub('{run_name}')`."
        )


if __name__ == "__main__":
    main()
```
</details>
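
For inference, the trained reranker scores (query, document) pairs directly via the standard `CrossEncoder.predict` API; the path follows the script above:

```python
# Score candidate answers for one query with the finetuned reranker.
from sentence_transformers.cross_encoder import CrossEncoder

model = CrossEncoder("models/reranker-ettin-encoder-150m-gooaq-bce/final")
query = "how many calories in an egg"
docs = [
    "A large egg contains roughly 72 calories.",
    "Eggs are laid by female animals of many different species.",
]
scores = model.predict([(query, doc) for doc in docs])
print(scores)  # higher score = more relevant
```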

### Decoders

<details>
<summary>Click to expand decoder training code</summary>

**Full training:**
```bash
python trl/scripts/sft.py \
    --model_name_or_path jhu-clsp/ettin-decoder-17m \
    --dataset_name trl-lib/Capybara \
    --learning_rate 2.0e-5 \
    --num_train_epochs 1 \
    --packing \
    --per_device_train_batch_size 2 \
    --gradient_accumulation_steps 8 \
    --gradient_checkpointing \
    --eos_token '<|im_end|>' \
    --eval_strategy steps \
    --eval_steps 100 \
    --output_dir ettin-decoder-17m \
    --push_to_hub
```

**LoRA:**
```bash
python trl/scripts/sft.py \
    --model_name_or_path jhu-clsp/ettin-decoder-17m \
    --dataset_name trl-lib/Capybara \
    --learning_rate 2.0e-4 \
    --num_train_epochs 1 \
    --packing \
    --per_device_train_batch_size 2 \
    --gradient_accumulation_steps 8 \
    --gradient_checkpointing \
    --eos_token '<|im_end|>' \
    --eval_strategy steps \
    --eval_steps 100 \
    --use_peft \
    --lora_r 32 \
    --lora_alpha 16 \
    --output_dir ettin-decoder-17m \
    --push_to_hub
```

where `sft.py` is:
```python
import argparse

from datasets import load_dataset
from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer
from transformers.models.auto.modeling_auto import MODEL_FOR_IMAGE_TEXT_TO_TEXT_MAPPING_NAMES

from trl import (
    ModelConfig,
    ScriptArguments,
    SFTConfig,
    SFTTrainer,
    TrlParser,
    clone_chat_template,
    get_kbit_device_map,
    get_peft_config,
    get_quantization_config,
)


def main(script_args, training_args, model_args):
    ################
    # Model init kwargs & Tokenizer
    ################
    quantization_config = get_quantization_config(model_args)
    model_kwargs = dict(
        revision=model_args.model_revision,
        trust_remote_code=model_args.trust_remote_code,
        attn_implementation=model_args.attn_implementation,
        torch_dtype=model_args.torch_dtype,
        use_cache=False if training_args.gradient_checkpointing else True,
        device_map=get_kbit_device_map() if quantization_config is not None else None,
        quantization_config=quantization_config,
    )

    # Create model
    config = AutoConfig.from_pretrained(model_args.model_name_or_path)
    valid_image_text_architectures = MODEL_FOR_IMAGE_TEXT_TO_TEXT_MAPPING_NAMES.values()

    if config.architectures and any(arch in valid_image_text_architectures for arch in config.architectures):
        from transformers import AutoModelForImageTextToText

        model_kwargs.pop("use_cache", None)  # Image models do not support cache
        model = AutoModelForImageTextToText.from_pretrained(model_args.model_name_or_path, **model_kwargs)
    else:
        model = AutoModelForCausalLM.from_pretrained(model_args.model_name_or_path, **model_kwargs)

    # Create tokenizer
    tokenizer = AutoTokenizer.from_pretrained(
        model_args.model_name_or_path, trust_remote_code=model_args.trust_remote_code, use_fast=True
    )

    # Set default chat template if needed
    if tokenizer.chat_template is None:
        # TODO: source should be passed as an argument
        model, tokenizer = clone_chat_template(model, tokenizer, "Qwen/Qwen3-0.6B")

    ################
    # Dataset
    ################
    dataset = load_dataset(script_args.dataset_name, name=script_args.dataset_config)

    ################
    # Training
    ################
    trainer = SFTTrainer(
        model=model,
        args=training_args,
        train_dataset=dataset[script_args.dataset_train_split],
        eval_dataset=dataset[script_args.dataset_test_split] if training_args.eval_strategy != "no" else None,
        processing_class=tokenizer,
        peft_config=get_peft_config(model_args),
    )

    trainer.train()

    # Save and push to hub
    trainer.save_model(training_args.output_dir)
    if training_args.push_to_hub:
        trainer.push_to_hub(dataset_name=script_args.dataset_name)


def make_parser(subparsers: argparse._SubParsersAction = None):
    dataclass_types = (ScriptArguments, SFTConfig, ModelConfig)
    if subparsers is not None:
        parser = subparsers.add_parser("sft", help="Run the SFT training script", dataclass_types=dataclass_types)
    else:
        parser = TrlParser(dataclass_types)
    return parser


if __name__ == "__main__":
    parser = make_parser()
    # When using the trl cli, this script may be run with additional arguments, corresponding accelerate arguments.
    # To ensure that their parsing does not interfere with the script arguments, parse the arguments with
    # `return_remaining_strings=True`, then ignore the remaining strings.
    script_args, training_args, model_args, _ = parser.parse_args_and_config(return_remaining_strings=True)
    main(script_args, training_args, model_args)
```
</details>
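
Once fine-tuned, the decoder can be queried with its chat template (cloned from Qwen/Qwen3-0.6B in `sft.py`). A hedged sketch assuming the `--output_dir ettin-decoder-17m` used above:

```python
# Generate from the SFT'd decoder using the chat template set during training.
from transformers import AutoModelForCausalLM, AutoTokenizer

path = "ettin-decoder-17m"  # --output_dir from the training command above
tokenizer = AutoTokenizer.from_pretrained(path)
model = AutoModelForCausalLM.from_pretrained(path)

messages = [{"role": "user", "content": "Write a haiku about paired encoders and decoders."}]
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")
outputs = model.generate(inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```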

## Citation