FastFit is a method and a Python package designed to provide fast and accurate few-shot classification, especially for scenarios with many semantically similar classes. FastFit uses a novel approach that integrates batch contrastive learning and token-level similarity scoring. Compared to existing few-shot learning packages, such as SetFit, Transformers, or few-shot prompting of large language models via API calls, FastFit significantly improves multi-class classification performance in both speed and accuracy across FewMany, our newly curated English benchmark, and multilingual datasets. FastFit demonstrates a 3-20x improvement in training speed, completing training in just a few seconds.
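
To build intuition for the token-level similarity scoring mentioned above, here is a minimal NumPy sketch (illustrative only — not FastFit's actual implementation, and all names in it are hypothetical): each class is represented by token embeddings of its label text, a query is scored against a class by taking, for every query token, its best match among the class tokens and summing, and the highest-scoring class wins.

```python
import numpy as np

def token_level_score(query_tokens: np.ndarray, class_tokens: np.ndarray) -> float:
    """Late-interaction similarity: for each query token, take the max
    similarity over the class tokens, then sum over query tokens."""
    # Normalize rows so dot products become cosine similarities.
    q = query_tokens / np.linalg.norm(query_tokens, axis=1, keepdims=True)
    c = class_tokens / np.linalg.norm(class_tokens, axis=1, keepdims=True)
    sim = q @ c.T                 # shape: (num_query_tokens, num_class_tokens)
    return sim.max(axis=1).sum()  # best class token per query token

def classify(query_tokens: np.ndarray, classes: dict) -> str:
    """Return the class whose token representations best match the query."""
    scores = {name: token_level_score(query_tokens, toks) for name, toks in classes.items()}
    return max(scores, key=scores.get)

# Toy 4-dimensional "embeddings" standing in for encoder outputs.
rng = np.random.default_rng(0)
classes = {"card_lost": rng.normal(size=(3, 4)), "refund": rng.normal(size=(2, 4))}
query = rng.normal(size=(5, 4))
print(classify(query, classes))
```

Training then couples scores like this with a batch contrastive objective, as described above.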
## Running the Training Script
Our package provides a convenient command-line tool `train_fastfit` to train text classification models. This tool comes with a variety of configurable parameters to customize your training process.
### Prerequisites
Before running the training script, ensure you have Python installed along with our package and its dependencies. If you haven't already installed our package, you can do so using pip:
```bash
pip install fast-fit
```
### Usage
To run the training script with custom configurations, use the `train_fastfit` command followed by the necessary arguments. These are similar to the Hugging Face training arguments, with a few additions specific to FastFit.
### Example Command
Here's an example of how to use the `train_fastfit` command with specific settings:
```bash
train_fastfit \
--model_name_or_path "roberta-base" \
--train_file $TRAIN_FILE \
--validation_file $VALIDATION_FILE \
--output_dir ./tmp/try \
--overwrite_output_dir \
--report_to none \
--label_column_name label \
--text_column_name text \
--num_train_epochs 40 \
--dataloader_drop_last false \
--per_device_train_batch_size 32 \
--per_device_eval_batch_size 64 \
--evaluation_strategy steps \
--max_text_length 128 \
--logging_steps 100 \
--num_repeats 4 \
--save_strategy no \
--optim adafactor \
--clf_loss_factor 0.1 \
--do_train \
--fp16 \
--projection_dim 128
```
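
The `--train_file` and `--validation_file` arguments accept CSV or JSON files (see the parameter list below). As an illustrative sketch rather than a prescribed schema, a minimal CSV could look like the following — the column names just have to match what you pass to `--text_column_name` and `--label_column_name` (here `text` and `label`, with made-up rows):

```csv
text,label
"How do I reset my PIN?",card_pin
"My card never arrived.",card_delivery
```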
### Output
Upon execution, `train_fastfit` will start the training process based on your parameters and output the results, including logs and model checkpoints, to the designated directory.
## Training with Python
You can also run training directly from Python:
```python
from datasets import load_dataset
from fastfit import FastFitTrainer, sample_dataset

# Load a dataset from the Hugging Face Hub
dataset = load_dataset("mteb/banking77")
dataset["validation"] = dataset["test"]

# Down-sample the training data for 5-shot training
dataset["train"] = sample_dataset(dataset["train"], label_column="label_text", num_samples_per_label=5)

trainer = FastFitTrainer(
    model_name_or_path="roberta-base",
    label_column_name="label_text",
    text_column_name="text",
    num_train_epochs=40,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=64,
    max_text_length=128,
    dataloader_drop_last=False,
    num_repeats=4,
    optim="adafactor",
    clf_loss_factor=0.1,
    fp16=True,
    dataset=dataset,
)

model = trainer.train()
results = trainer.evaluate()

print("Accuracy: {:.1f}".format(results["eval_accuracy"] * 100))
```
Output: `Accuracy: 82.4`
Then the model can be saved:
```python
model.save_pretrained("fast-fit")
```
Then you can use the model for inference:
```python
from fastfit import FastFit
from transformers import AutoTokenizer, pipeline

model = FastFit.from_pretrained("fast-fit")
tokenizer = AutoTokenizer.from_pretrained("roberta-base")
classifier = pipeline("text-classification", model=model, tokenizer=tokenizer)

print(classifier("I love this package!"))
```
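
As with any Hugging Face `text-classification` pipeline, the call returns a list with one dictionary per input, of the form `[{'label': ..., 'score': ...}]`, where the label is one of your class names and the score is the model's confidence.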
## All Available Parameters
**Optional Arguments:**
- `-h, --help`: Show this help message and exit.
- `--num_repeats NUM_REPEATS`: The number of times to repeat the queries and docs in every batch. (default: 1)
- `--proj_dim PROJ_DIM`: The dimension of the projection layer. (default: 128)
- `--clf_loss_factor CLF_LOSS_FACTOR`: The factor to scale the classification loss. (default: 0.1)
- `--pretrain_mode [PRETRAIN_MODE]`: Whether to do pre-training. (default: False)
- `--inference_type INFERENCE_TYPE`: The inference type to be used. (default: sim)
- `--rep_tokens REP_TOKENS`: The tokens to use for representation when calculating the similarity in training and inference. (default: all)
- `--length_norm [LENGTH_NORM]`: Whether to normalize by length while accounting for padding. (default: False)
- `--mlm_factor MLM_FACTOR`: The factor to scale the MLM loss. (default: 0.0)
- `--mask_prob MASK_PROB`: The probability of masking a token. (default: 0.0)
- `--model_name_or_path MODEL_NAME_OR_PATH`: Path to a pretrained model or a model identifier from huggingface.co/models. (default: None)
- `--config_name CONFIG_NAME`: Pretrained config name or path, if not the same as model_name. (default: None)
- `--tokenizer_name TOKENIZER_NAME`: Pretrained tokenizer name or path, if not the same as model_name. (default: None)
- `--cache_dir CACHE_DIR`: Where to store the pretrained models downloaded from huggingface.co. (default: None)
- `--use_fast_tokenizer [USE_FAST_TOKENIZER]`: Whether to use a fast tokenizer (backed by the tokenizers library). (default: True)
- `--no_use_fast_tokenizer`: Disable the fast tokenizer (backed by the tokenizers library). (default: False)
- `--model_revision MODEL_REVISION`: The specific model version to use (can be a branch name, tag name, or commit id). (default: main)
- `--use_auth_token [USE_AUTH_TOKEN]`: Use the token generated when running `transformers-cli login` (necessary to use this script with private models). (default: False)
- `--ignore_mismatched_sizes [IGNORE_MISMATCHED_SIZES]`: Enable loading a pretrained model whose head dimensions are different. (default: False)
- `--load_from_FastFit [LOAD_FROM_FASTFIT]`: Load the model from a trained FastFit model directory. (default: False)
- `--task_name TASK_NAME`: The name of the task to train on. (default: None)
- `--metric_name METRIC_NAME`: The name of the metric to use for evaluation. (default: accuracy)
- `--dataset_name DATASET_NAME`: The name of the dataset to use (via the datasets library). (default: None)
- `--dataset_config_name DATASET_CONFIG_NAME`: The configuration name of the dataset to use (via the datasets library). (default: None)
- `--max_seq_length MAX_SEQ_LENGTH`: The maximum total input sequence length after tokenization. Sequences longer than this will be truncated; sequences shorter will be padded. (default: 128)
- `--overwrite_cache [OVERWRITE_CACHE]`: Overwrite the cached preprocessed datasets or not. (default: False)
- `--pad_to_max_length [PAD_TO_MAX_LENGTH]`: Whether to pad all samples to `max_seq_length`. If False, samples are padded dynamically when batching, to the maximum length in the batch. (default: True)
- `--no_pad_to_max_length`: Disable padding all samples to `max_seq_length`; samples are padded dynamically when batching, to the maximum length in the batch. (default: False)
- `--max_train_samples MAX_TRAIN_SAMPLES`: For debugging purposes or quicker training, truncate the number of training examples to this value if set. (default: None)
- `--max_eval_samples MAX_EVAL_SAMPLES`: For debugging purposes or quicker training, truncate the number of evaluation examples to this value if set. (default: None)
- `--max_predict_samples MAX_PREDICT_SAMPLES`: For debugging purposes or quicker training, truncate the number of prediction examples to this value if set. (default: None)
- `--train_file TRAIN_FILE`: A csv or json file containing the training data. (default: None)
- `--validation_file VALIDATION_FILE`: A csv or json file containing the validation data. (default: None)
- `--test_file TEST_FILE`: A csv or json file containing the test data. (default: None)
- `--custom_goal_acc CUSTOM_GOAL_ACC`: If set, save the model every this many steps. (default: None)
- `--text_column_name TEXT_COLUMN_NAME`: The name of the column in the datasets containing the input texts. (default: None)
- `--label_column_name LABEL_COLUMN_NAME`: The name of the column in the datasets containing the labels. (default: None)
- `--max_text_length MAX_TEXT_LENGTH`: The maximum total input sequence length after tokenization for text. (default: 32)
- `--max_label_length MAX_LABEL_LENGTH`: The maximum total input sequence length after tokenization for labels. (default: 32)
- `--pre_train [PRE_TRAIN]`: The path to the pretrained model. (default: False)
- `--added_tokens_per_label ADDED_TOKENS_PER_LABEL`: The number of added tokens to add to every class. (default: None)
- `--added_tokens_mask_factor ADDED_TOKENS_MASK_FACTOR`: How much of the added tokens should consist of mask-token embeddings. (default: 0.0)
- `--added_tokens_tfidf_factor ADDED_TOKENS_TFIDF_FACTOR`: How much of the added tokens should consist of tf-idf token embeddings. (default: 0.0)
- `--pad_query_with_mask [PAD_QUERY_WITH_MASK]`: Whether to pad the query with the mask token. (default: False)
- `--pad_doc_with_mask [PAD_DOC_WITH_MASK]`: Whether to pad the docs with the mask token. (default: False)
- `--doc_mapper DOC_MAPPER`: The source for mapping docs to augmented docs. (default: None)
- `--doc_mapper_type DOC_MAPPER_TYPE`: The type of doc mapper. (default: file)
- `--output_dir OUTPUT_DIR`: The output directory where the model predictions and checkpoints will be written. (default: None)
- `--overwrite_output_dir [OVERWRITE_OUTPUT_DIR]`: Overwrite the content of the output directory. Use this to continue training if output_dir points to a checkpoint directory. (default: False)
- `--do_train [DO_TRAIN]`: Whether to run training. (default: False)
- `--do_eval [DO_EVAL]`: Whether to run eval on the dev set. (default: False)
- `--do_predict [DO_PREDICT]`: Whether to run predictions on the test set. (default: False)
- `--evaluation_strategy {no,steps,epoch}`: The evaluation strategy to use. (default: no)
- `--prediction_loss_only [PREDICTION_LOSS_ONLY]`: When performing evaluation and predictions, only return the loss. (default: False)
- `--per_device_train_batch_size PER_DEVICE_TRAIN_BATCH_SIZE`: Batch size per GPU/TPU core/CPU for training. (default: 8)
- `--per_device_eval_batch_size PER_DEVICE_EVAL_BATCH_SIZE`: Batch size per GPU/TPU core/CPU for evaluation. (default: 8)
- `--per_gpu_train_batch_size PER_GPU_TRAIN_BATCH_SIZE`: Deprecated; prefer `--per_device_train_batch_size`. Batch size per GPU/TPU core/CPU for training. (default: None)
- `--per_gpu_eval_batch_size PER_GPU_EVAL_BATCH_SIZE`: Deprecated; prefer `--per_device_eval_batch_size`. Batch size per GPU/TPU core/CPU for evaluation. (default: None)
- `--gradient_accumulation_steps GRADIENT_ACCUMULATION_STEPS`: Number of update steps to accumulate before performing a backward/update pass. (default: 1)
- `--eval_accumulation_steps EVAL_ACCUMULATION_STEPS`: Number of prediction steps to accumulate before moving the tensors to the CPU. (default: None)
- `--eval_delay EVAL_DELAY`: Number of epochs or steps to wait before the first evaluation can be performed, depending on the evaluation_strategy. (default: 0)
- `--learning_rate LEARNING_RATE`: The initial learning rate for AdamW. (default: 5e-05)
- `--weight_decay WEIGHT_DECAY`: Weight decay for AdamW, if any is applied. (default: 0.0)
- `--adam_beta1 ADAM_BETA1`: Beta1 for the AdamW optimizer. (default: 0.9)
- `--adam_beta2 ADAM_BETA2`: Beta2 for the AdamW optimizer. (default: 0.999)
- `--adam_epsilon ADAM_EPSILON`: Epsilon for the AdamW optimizer. (default: 1e-08)
- `--max_grad_norm MAX_GRAD_NORM`: Max gradient norm. (default: 1.0)
- `--num_train_epochs NUM_TRAIN_EPOCHS`: Total number of training epochs to perform. (default: 3.0)
- `--max_steps MAX_STEPS`: If > 0: set the total number of training steps to perform. Overrides num_train_epochs. (default: -1)
- `--lr_scheduler_type {linear,cosine,cosine_with_restarts,polynomial,constant,constant_with_warmup}`: The scheduler type to use. (default: linear)
- `--warmup_ratio WARMUP_RATIO`: Linear warmup over warmup_ratio fraction of total steps. (default: 0.0)
- `--warmup_steps WARMUP_STEPS`: Linear warmup over warmup_steps. (default: 0)
- `--log_level {debug,info,warning,error,critical,passive}`: Logger log level to use on the main node. Possible choices are the log levels as strings: 'debug', 'info', 'warning', 'error', and 'critical', plus a 'passive' level which doesn't set anything and lets the application set the level. (default: passive)
- `--log_level_replica {debug,info,warning,error,critical,passive}`: Logger log level to use on replica nodes. Same choices and defaults as `log_level`. (default: passive)
- `--log_on_each_node [LOG_ON_EACH_NODE]`: In multinode distributed training, whether to log once per node or only once on the main node. (default: True)
- `--no_log_on_each_node`: In multinode distributed training, log only once on the main node rather than once per node. (default: False)
- `--logging_dir LOGGING_DIR`: TensorBoard log directory. (default: None)
- `--logging_strategy {no,steps,epoch}`: The logging strategy to use. (default: steps)
- `--logging_first_step [LOGGING_FIRST_STEP]`: Log the first global_step. (default: False)
- `--logging_steps LOGGING_STEPS`: Log every X update steps. (default: 500)
- `--logging_nan_inf_filter [LOGGING_NAN_INF_FILTER]`: Filter nan and inf losses for logging. (default: True)
- `--no_logging_nan_inf_filter`: Do not filter nan and inf losses for logging. (default: False)
- `--save_strategy {no,steps,epoch}`: The checkpoint save strategy to use. (default: steps)
- `--save_steps SAVE_STEPS`: Save a checkpoint every X update steps. (default: 500)
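
As one way these flags can compose, here is a hypothetical evaluation-only invocation over a previously trained model, assuming the flags combine as in standard Hugging Face training scripts (paths and column names are placeholders — adjust to your setup):

```bash
train_fastfit \
--model_name_or_path ./tmp/try \
--load_from_FastFit \
--validation_file $VALIDATION_FILE \
--text_column_name text \
--label_column_name label \
--output_dir ./tmp/eval \
--do_eval
```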