edeneldith committed · Commit aa67673 · verified · 1 Parent(s): 172d88b

Upload README.md with huggingface_hub

Files changed (1)
  1. README.md +3 -3
README.md CHANGED
@@ -42,7 +42,7 @@ COLM is a novel autoregressive language model that operates entirely in the comp
  | **Age group estimate** | 84% rated age 13-16 |
  | **Training time** | 8.7 hours |
  | **Hardware** | Single RTX 5060 Ti 16GB |
- | **Tokenizer** | 499-token word+character hybrid |
+ | **Tokenizer** | 499-token word+character hybrid (396 word tokens, 98 character fallback) |
  | **Domain** | Theological-philosophical prose |
 
  At 498k parameters — roughly half the size of TinyStories' smallest coherent model — COLM generates thematically coherent philosophical prose at temperature 1 with no spell correction.
@@ -114,8 +114,8 @@ Trained on the [DCDM dataset](https://huggingface.co/datasets/edeneldith/DCDM)
 
  ## Limitations
 
- - **Spelling:** The 499-token vocabulary means most words are assembled from character tokens, producing spelling variation
- - **Single domain:** Trained only on theological-philosophical text; cross-domain performance is untested
+ - **Spelling:** The 499-token vocabulary contains 396 whole-word tokens covering common English and corpus-specific domain words; words outside this vocabulary require character-level assembly, producing spelling variation on out-of-vocabulary terms
+ - **Single trained model:** The released checkpoint has only generated text in the DCDM theological-philosophical register; cross-domain output from the trained model is untested. The data generation pipeline has been validated across approximately 894,000 tokens of private source material spanning archaeology, theology, mythology, philosophy, political history, intelligence studies, science fiction, and AI research.
  - **Batch size:** Final run used batch_size=4 rather than intended 32 — results are a lower bound
 
  ## Citation