Update README.md
README.md
@@ -44,7 +44,6 @@ The tokenizer is a minimal GPT-2-style vocabulary:
 
 * Implemented via `GPT2TokenizerFast`
 * Merges file is empty (no BPE applied)
-* Saved to the `dna_tokenizer/` directory for reuse
 
 ---
 
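The hunk above mentions a minimal GPT-2-style vocabulary built with `GPT2TokenizerFast` and an empty merges file. A small sketch of how such a character-level DNA tokenizer can be assembled, assuming the vocabulary contents, the `N` unknown token, and the temporary file layout (none of these are taken from the repo's actual code):

```python
# Hypothetical sketch: a character-level DNA tokenizer via GPT2TokenizerFast.
# Vocab contents and the "N" unk token are assumptions, not the repo's code.
import json
import os
import tempfile

from transformers import GPT2TokenizerFast

# One id per base, plus an unknown token and GPT-2's end-of-text marker.
vocab = {tok: i for i, tok in enumerate(["A", "C", "G", "T", "N", "<|endoftext|>"])}

tmp = tempfile.mkdtemp()
with open(os.path.join(tmp, "vocab.json"), "w") as f:
    json.dump(vocab, f)
with open(os.path.join(tmp, "merges.txt"), "w") as f:
    f.write("#version: 0.2\n")  # empty merges file: no BPE merges are applied

tok = GPT2TokenizerFast(
    vocab_file=os.path.join(tmp, "vocab.json"),
    merges_file=os.path.join(tmp, "merges.txt"),
    unk_token="N",
)

ids = tok.encode("ACGT")  # with no merges, each base maps to its own id
print(ids)
tok.save_pretrained(os.path.join(tmp, "dna_tokenizer"))  # saved for reuse
```

With an empty merges file the byte-level BPE never joins characters, so tokenization degenerates to a per-base lookup, which is what "no BPE applied" implies.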
@@ -53,9 +52,7 @@ The tokenizer is a minimal GPT-2-style vocabulary:
 * Original dataset is cleaned to keep only `A`, `C`, `G`, `T`
 * Sequences are chunked into segments of length 1024
 * Very short chunks (<200bp) are discarded
-*
-
-If no validation set is provided, a 10% split is made from the training set.
+* A 10% validation split is made from the training set.
 
 ---
 
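The second hunk describes the data preparation pipeline: clean to `A`/`C`/`G`/`T`, chunk to length 1024, discard chunks under 200bp, and hold out a 10% validation split. A sketch of those steps, where the function names and the toy input sequences are illustrative assumptions:

```python
# Hypothetical sketch of the README's data-preparation steps.
# Function names and the toy input are assumptions, not the repo's code.
import random
import re


def prepare_chunks(raw_seqs, chunk_len=1024, min_len=200):
    """Keep only A/C/G/T, cut into fixed-length chunks, drop very short ones."""
    chunks = []
    for seq in raw_seqs:
        cleaned = re.sub(r"[^ACGT]", "", seq.upper())  # keep only A, C, G, T
        for i in range(0, len(cleaned), chunk_len):
            chunk = cleaned[i:i + chunk_len]
            if len(chunk) >= min_len:  # very short chunks (<200bp) are discarded
                chunks.append(chunk)
    return chunks


def train_val_split(chunks, val_frac=0.10, seed=0):
    """Hold out a 10% validation split from the training chunks."""
    rng = random.Random(seed)
    shuffled = chunks[:]
    rng.shuffle(shuffled)
    n_val = max(1, int(len(shuffled) * val_frac))
    return shuffled[n_val:], shuffled[:n_val]


raw = ["ACGTN" * 500, "acgt" * 300, "AC"]  # toy sequences with noise bases
chunks = prepare_chunks(raw)
train, val = train_val_split(chunks)
print(len(chunks), len(train), len(val))  # → 3 2 1
```

The 2500bp sequence cleans to 2000bp and yields two chunks; the 1200bp sequence yields one (its 176bp tail is dropped); `"AC"` is discarded outright.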