Help with training dataset output and validation .txts

#33

by miraclewizard - opened Oct 24, 2023


Hello! This is such a great resource, and I'm really looking forward to using it. I want to train on a relatively small dataset (<677 lines, so ~10k tokens). I followed the directions and created a training.txt file with <|endoftext|> as the header for each AA sequence, and I set aside ~10% (70 lines) as a validation.txt dataset. My command:

python run_clm.py --model_name_or_path nferruz/ProtGPT2 --train_file training.txt --validation_file validation.txt --tokenizer_name nferruz/ProtGPT2 --output_dir /home/grant/test
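For anyone wanting to reproduce the data preparation described above, here is a minimal sketch of it. It assumes the sequences are already available as a list of plain strings and that each <|endoftext|> header sits on its own line before its sequence; the function names `split_sequences` and `write_dataset`, the 10% fraction default, and the seed are illustrative choices, not part of ProtGPT2 or run_clm.py.

```python
import random

# Header token described in the post; one header precedes each sequence.
EOS = "<|endoftext|>"


def split_sequences(sequences, val_fraction=0.1, seed=42):
    """Shuffle the sequences and return (train, validation) lists.

    The seed is fixed so the same split is produced on every run.
    """
    shuffled = list(sequences)
    random.Random(seed).shuffle(shuffled)
    # Keep at least one sequence for validation, even for tiny inputs.
    n_val = max(1, round(len(shuffled) * val_fraction))
    return shuffled[n_val:], shuffled[:n_val]


def write_dataset(sequences, path):
    """Write each sequence preceded by an <|endoftext|> header line.

    Assumption: header and sequence are on separate lines.
    """
    with open(path, "w") as handle:
        for seq in sequences:
            handle.write(f"{EOS}\n{seq}\n")
```

Calling `split_sequences(all_sequences)` and then `write_dataset(train, "training.txt")` and `write_dataset(val, "validation.txt")` would produce the two files passed to `--train_file` and `--validation_file`. Shuffling before the split avoids a validation set made only of the last sequences in the original file.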

The command runs without any errors, and as far as I can tell no error.txt is produced (yay!). However, my output contains only a README.md, which reads as shown below. The results field is empty ([]), so I'm assuming that the training dataset is too small (I saw some other posts where people used >500 sequences). Am I right in my assumption? Is this the end of the road?
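A quick way to tell whether any training actually happened is to look at what the output directory contains. The sketch below lists the files and checks for saved weights; the helper name `summarize_output_dir` is hypothetical, and the two weight file names are an assumption about what the Trainer typically saves (sharded checkpoints would use other names).

```python
from pathlib import Path

# Assumption: single-file weight names commonly written when a model is saved.
WEIGHT_FILES = {"pytorch_model.bin", "model.safetensors"}


def summarize_output_dir(output_dir):
    """Return the file names in output_dir and two simple checks on them."""
    names = sorted(p.name for p in Path(output_dir).iterdir())
    return {
        "files": names,
        # True if any known weight file was saved.
        "has_weights": any(name in WEIGHT_FILES for name in names),
        # True if the directory holds nothing but the model card.
        "only_readme": names == ["README.md"],
    }
```

Running `summarize_output_dir("/home/grant/test")` on the directory from the command above would show whether anything besides the model card was written. If only README.md is present, one possibility worth ruling out (an assumption, not something confirmed in this thread) is that the script was run without `--do_train` and `--do_eval`, which the Hugging Face example scripts use to switch training and evaluation on.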

It would also help to have an example training.txt (and accompanying validation.txt) that I could download, so I know what a "good" validation run looks like.

ANY help would be appreciated. THANKS!

my output: README.md

license: apache-2.0
base_model: nferruz/ProtGPT2
tags:
- generated_from_trainer
model-index:
- name: test
results: []

test

This model is a fine-tuned version of nferruz/ProtGPT2 on an unknown dataset.

Model description

More information needed

Intended uses & limitations

More information needed

Training and evaluation data

More information needed

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

learning_rate: 5e-05
train_batch_size: 8
eval_batch_size: 8
seed: 42
optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
lr_scheduler_type: linear
num_epochs: 3.0

Framework versions

Transformers 4.35.0.dev0
Pytorch 2.1.0+cu118
Datasets 2.14.5
Tokenizers 0.14.1
