---
license: mit
language:
- en
- es
pipeline_tag: text-generation
tags:
- word-generator
- mini
- tiny
- experiment
- small
- mistral-lm
- text-generation-inference
- word-generation
- test
- fun
- explore
- lexical
- words
- word
---

# Tiny-Word

Tiny-Word is an extremely tiny Mistral-like model with approximately 81k parameters. It generates English or Spanish words, or word-like sequences.

## Architecture

| Key               | Value |
| :---------------: | :---: |
| hidden_size       | 32    |
| num_layers        | 2     |
| num_heads         | 1     |
| num_kv_heads      | 1     |
| intermediate_size | 256   |
| vocab_size        | 1200  |

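For reference, the table maps onto a Hugging Face `MistralConfig` roughly as sketched below. This is an illustration, not the exact config shipped with the repo: fields not listed in the card keep library defaults, and `sliding_window=64` is taken from the Training Setup section.

```python
from transformers import MistralConfig, MistralForCausalLM

# Illustrative config built from the table above; unlisted fields
# (rope_theta, tie_word_embeddings, etc.) keep MistralConfig defaults.
config = MistralConfig(
    hidden_size=32,
    num_hidden_layers=2,
    num_attention_heads=1,
    num_key_value_heads=1,
    intermediate_size=256,
    vocab_size=1200,
    sliding_window=64,  # value reported in "Training Setup" below
)

# Randomly initialized model with this architecture (not the trained weights).
model = MistralForCausalLM(config)
```
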
## Training

Tiny-Word was trained on 753,232 unique words (entries), amounting to 3,225,398 tokens and 7,022,310 characters. ~660k of those words are English and ~90k are Spanish.

### Dataset

| Key                     | Value     |
| :---------------------: | :-------: |
| Entries (words)         | 753,232   |
| Tokens                  | 3,225,398 |
| Characters              | 7,022,310 |
| Avg. Tokens Per Entry   | ~4.2      |
| Avg. Words Per Entry    | 1         |
| Avg. Chars Per Entry    | ~9.3      |
| Longest Entry (Tokens)  | 36        |
| Shortest Entry (Tokens) | 1         |
| English Words           | ~660k     |
| Spanish Words           | ~90k      |

### Training Setup

We trained the model for 6 epochs with a batch size of 128 and 2 gradient accumulation steps.
The sliding_window was set to 64 even though the longest entry is only 36 tokens, which is wasteful; it does not hurt model quality, it only slows training down.
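
The training script itself is not included in this repo. As a rough illustration only, the hyperparameters above would map onto `transformers` `TrainingArguments` along these lines; the output directory and the commented-out dataset wiring are placeholder assumptions, not the actual setup:

```python
from transformers import Trainer, TrainingArguments

# Only the hyperparameters stated above are taken from the card;
# output_dir and the dataset objects below are placeholder assumptions.
args = TrainingArguments(
    output_dir="tiny-word-checkpoints",
    num_train_epochs=6,
    per_device_train_batch_size=128,  # effective batch of 256 with accumulation
    gradient_accumulation_steps=2,
)

# trainer = Trainer(model=model, args=args,
#                   train_dataset=train_ds, eval_dataset=val_ds)
# trainer.train()
```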

#### Hardware

Tiny-Word was trained on Google Colaboratory with a single NVIDIA Tesla T4 GPU (15 GB of VRAM) and 12.7 GB of system RAM.

### Training Results

| step  | train_loss | val_loss | train_ppl | val_ppl |
| :---- | :--------- | :------- | :-------- | :------ |
| 1000  | 4.9619     | 4.5201   | ~143.0    | ~91.8   |
| 3000  | 4.0093     | 3.9156   | ~55.0     | ~50.2   |
| 4000  | 3.8464     | 3.7951   | ~46.8     | ~44.5   |
| 6000  | 3.6814     | 3.6612   | ~39.7     | ~38.9   |
| 7000  | 3.6329     | 3.6182   | ~37.8     | ~37.2   |
| 9000  | 3.5684     | 3.5636   | ~35.5     | ~35.3   |
| 10000 | 3.5452     | 3.5444   | ~34.7     | ~34.6   |
| 12000 | 3.5139     | 3.5161   | ~33.6     | ~33.7   |
| 15000 | 3.4784     | 3.4861   | ~32.4     | ~32.6   |

Tiny-Word shows promising results even at its tiny size (~81k parameters). Given the relatively easy task (predicting subword tokens within single words), this is expected.
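
The perplexity columns are simply the exponential of the corresponding loss, which you can spot-check:

```python
import math

# ppl = exp(loss); compare against the table above.
print(round(math.exp(4.5201), 1))  # 91.8 -> val_ppl at step 1000
print(round(math.exp(3.4784), 1))  # 32.4 -> train_ppl at step 15000
```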

## Generation Examples

Prompt:

```
d
```

Output:

```
desmounder's's's
```

Prompt:

```
0333333333
```

Output:

```
ruperperse'sf
```

Prompt:

```
a
```

Output:

```
utomatographic'sphon
```

Prompt:

```
e
```

Output:

```
equip’s’s’s
```

The model generates plausible, pronounceable word-like sequences, and sometimes real words. It handles almost any input: even a nonsensical prompt will still produce a word-like continuation.
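
To try prompts like these without the full REPL in the Quick Demo section, a minimal snippet along these lines should work (the sampling parameters mirror the demo's defaults; outputs will differ between runs):

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

repo = "Harley-ml/tiny-word"
tokenizer = AutoTokenizer.from_pretrained(repo)
model = AutoModelForCausalLM.from_pretrained(repo)

inputs = tokenizer("d", return_tensors="pt", add_special_tokens=False)
with torch.no_grad():
    out = model.generate(
        **inputs,
        max_new_tokens=16,
        do_sample=True,
        temperature=0.4,
        top_p=0.9,
        pad_token_id=tokenizer.pad_token_id if tokenizer.pad_token_id is not None else tokenizer.eos_token_id,
    )
print(tokenizer.decode(out[0], skip_special_tokens=True))
```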

## Limitations

1. It does not generate sentences, prose, code, or anything besides a single word-like sequence.
2. It cannot reason or produce complex language.
3. It often appends artifacts after the generated word, such as "'s" or "'sphon" (a small cleanup sketch follows this list).
4. Most generated words aren't real; they instead reflect the lexicon and morphology of English and Spanish.
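
If the trailing artifacts from limitation 3 are a problem, a simple heuristic post-processing step (not part of the model) can strip them:

```python
import re

def clean_word(text: str) -> str:
    """Heuristic cleanup for Tiny-Word outputs: keep the leading word-like
    chunk and drop repeated possessive-style artifacts such as "'s's"."""
    match = re.match(r"[A-Za-záéíóúüñÁÉÍÓÚÜÑ'’]+", text.strip())
    word = match.group(0) if match else text.strip()
    return re.sub(r"(?:['’]s)+$", "", word)

print(clean_word("desmounder's's's"))  # -> desmounder
print(clean_word("equip’s’s’s"))       # -> equip
```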

## Quick Demo

```python
#!/usr/bin/env python3
"""
Tiny Mistral REPL demo — streaming tokens (TextStreamer if available, else manual sampling).
Commands: :quit, :help, :show, :set <param> <value>  (max_new_tokens, temperature, top_p, full_output)
"""
from __future__ import annotations
import shlex
import time
from typing import Optional

import torch
from transformers import AutoTokenizer, MistralForCausalLM

# --------- CONFIG ----------
MODEL_DIR = "Harley-ml/tiny-word"
TOKENIZER_DIR = MODEL_DIR
DEFAULT_MAX_NEW_TOKENS = 16
DEFAULT_TEMPERATURE = 0.4
DEFAULT_TOP_P = 0.9
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
PROMPT = ">>> "
# ---------------------------

def load_tokenizer(path: str):
    print("Loading tokenizer...", path)
    # Downloads (or reuses the cached) tokenizer files from the Hub.
    tok = AutoTokenizer.from_pretrained(path, use_fast=True)
    if tok.pad_token is None:
        # Fall back to the EOS token for padding, or register new specials.
        if getattr(tok, "eos_token", None) is not None:
            tok.add_special_tokens({"pad_token": tok.eos_token})
        else:
            tok.add_special_tokens({"pad_token": "<pad>", "eos_token": "</s>"})
    print("Tokenizer ready. vocab_size=", getattr(tok, "vocab_size", "N/A"))
    return tok

def load_model(path: str, device: str):
    print("Loading model...", path)
    try:
        desired_dtype = torch.float16 if device.startswith("cuda") else torch.float32
        model = MistralForCausalLM.from_pretrained(path, torch_dtype=desired_dtype)
        print("Loaded with requested dtype.")
    except Exception as e:
        print("Load warning, retrying without dtype:", e)
        model = MistralForCausalLM.from_pretrained(path)

    try:
        model.to(device)
        # Make sure the parameter dtype matches the device (fp16 on GPU, fp32 on CPU).
        if device.startswith("cuda") and next(model.parameters()).dtype != torch.float16:
            model.half()
        if not device.startswith("cuda") and next(model.parameters()).dtype != torch.float32:
            model.to(torch.float32)
    except Exception as e:
        print("Model move/convert warning:", e)

    # generate() needs a pad token id; reuse EOS if none was saved with the model.
    if model.config.pad_token_id is None:
        model.config.pad_token_id = model.config.eos_token_id
    model.eval()
    return model

# Simple nucleus/top-p filtering for a single logits vector
def top_p_filtering(logits: torch.Tensor, top_p: float, min_keep: int = 1) -> torch.Tensor:
    if top_p <= 0 or top_p >= 1.0:
        return logits
    sorted_logits, sorted_idx = torch.sort(logits, descending=True)
    probs = torch.softmax(sorted_logits, dim=-1)
    cumprobs = torch.cumsum(probs, dim=-1)
    cutoff = (cumprobs > top_p).nonzero(as_tuple=False)
    if cutoff.numel() > 0:
        idx = int(cutoff[0].item())
        cutoff_idx = max(idx + 1, min_keep)
    else:
        cutoff_idx = sorted_logits.size(-1)
    mask = torch.ones_like(sorted_logits, dtype=torch.bool)
    mask[cutoff_idx:] = False
    filtered = sorted_logits.masked_fill(~mask, -float("inf"))
    return torch.empty_like(filtered).scatter_(0, sorted_idx, filtered)

# Manual streaming generator (single-batch)
def manual_stream_generate(model, tokenizer, prompt: str, device: str,
                           max_new_tokens: int = 64, temperature: float = 1.0, top_p: float = 0.9,
                           eos_token_id: Optional[int] = None):
    inputs = tokenizer(prompt, return_tensors="pt", add_special_tokens=False)
    input_ids = inputs["input_ids"].to(device)

    # Feed the whole prompt on the first step, then one new token at a time,
    # reusing the KV cache so every prompt token is processed exactly once.
    past = None
    next_input = input_ids
    for _ in range(max_new_tokens):
        with torch.no_grad():
            out = model(input_ids=next_input, past_key_values=past, use_cache=True)
        logits = out.logits[:, -1, :]  # (batch, vocab)
        past = getattr(out, "past_key_values", past)

        if temperature != 1.0:
            logits = logits / max(temperature, 1e-8)

        # Top-p filtering is done in float32 on CPU, then we sample.
        filtered = top_p_filtering(logits[0].float().cpu(), top_p).to(device)
        probs = torch.nn.functional.softmax(filtered.unsqueeze(0), dim=-1)
        next_token = torch.multinomial(probs, num_samples=1)
        token_id = int(next_token[0, 0].item())

        token_text = tokenizer.decode([token_id], clean_up_tokenization_spaces=False)
        yield token_id, token_text

        if eos_token_id is not None and token_id == eos_token_id:
            break
        next_input = torch.tensor([[token_id]], dtype=torch.long, device=device)

def has_text_streamer():
    try:
        from transformers import TextStreamer  # type: ignore
        return True
    except Exception:
        return False

# tiny REPL state
class State:
    def __init__(self):
        self.max_new_tokens = DEFAULT_MAX_NEW_TOKENS
        self.temperature = DEFAULT_TEMPERATURE
        self.top_p = DEFAULT_TOP_P
        self.full_output = False
        self.stream = True

def handle_generation(model, tokenizer, prompt: str, device: str, state: State):
    eos = getattr(tokenizer, "eos_token_id", None)
    try:
        if has_text_streamer():
            from transformers import TextStreamer
            streamer = TextStreamer(tokenizer, skip_prompt=not state.full_output, skip_special_tokens=True)
            inputs = tokenizer(prompt, return_tensors="pt", truncation=True, add_special_tokens=False)
            inputs = {k: v.to(device) for k, v in inputs.items() if isinstance(v, torch.Tensor)}
            inputs.pop("token_type_ids", None)
            model.generate(**inputs,
                           max_new_tokens=state.max_new_tokens,
                           do_sample=True,
                           temperature=state.temperature,
                           top_p=state.top_p,
                           pad_token_id=tokenizer.pad_token_id,
                           eos_token_id=tokenizer.eos_token_id,
                           streamer=streamer)
            print("")  # newline after streamer
            return

        # fallback: manual streaming
        gen = manual_stream_generate(model, tokenizer, prompt, device,
                                     max_new_tokens=state.max_new_tokens,
                                     temperature=state.temperature,
                                     top_p=state.top_p,
                                     eos_token_id=eos)
        if state.full_output:
            print("PROMPT:", prompt)
        print("GENERATING:", end=" ", flush=True)

        count = 0
        t0 = time.time()
        for _tok_id, tok_text in gen:
            count += 1
            print(tok_text, end="", flush=True)
        print()
        print(f"(generated {count} tokens in {time.time()-t0:.2f}s)")
    except KeyboardInterrupt:
        print("\n[interrupted] Generation aborted by user.")
    except Exception as e:
        print("Generation error:", e)

def repl(model, tokenizer, device):
    state = State()
    help_text = (
        "Commands:\n"
        "  :quit\n"
        "  :help\n"
        "  :show\n"
        "  :set <param> <value>   # params: max_new_tokens, temperature, top_p, full_output, stream\n"
        "  (blank line repeats last prompt)\n"
    )
    print("Tiny Mistral REPL — device:", device)
    print(help_text)
    last = ""
    while True:
        try:
            raw = input(PROMPT).strip()
        except (EOFError, KeyboardInterrupt):
            print("\nExiting.")
            break
        if not raw:
            raw = last
        if not raw:
            continue

        if raw.startswith(":"):
            toks = shlex.split(raw)
            cmd = toks[0].lower()
            if cmd == ":quit":
                print("bye.")
                break
            if cmd == ":help":
                print(help_text)
                continue
            if cmd == ":show":
                print(f"max_new_tokens={state.max_new_tokens}, temperature={state.temperature}, "
                      f"top_p={state.top_p}, full_output={state.full_output}, stream={state.stream}")
                continue
            if cmd == ":set":
                if len(toks) < 3:
                    print("usage: :set <param> <value>")
                    continue
                k, v = toks[1], toks[2]
                try:
                    if k == "max_new_tokens":
                        state.max_new_tokens = int(v)
                    elif k == "temperature":
                        state.temperature = float(v)
                    elif k == "top_p":
                        state.top_p = float(v)
                    elif k in ("full_output", "full"):
                        state.full_output = v.lower() in ("1", "true", "yes", "y")
                    elif k == "stream":
                        state.stream = v.lower() in ("1", "true", "yes", "y")
                    else:
                        print("unknown param:", k)
                        continue
                    print("OK.")
                except Exception as e:
                    print("set error:", e)
                continue
            print("unknown command")
            continue

        last = raw
        if state.stream:
            handle_generation(model, tokenizer, raw, device, state)
        else:
            # non-streaming generate
            try:
                inputs = tokenizer(raw, return_tensors="pt", truncation=True, add_special_tokens=False)
                inputs = {k: v.to(device) for k, v in inputs.items() if isinstance(v, torch.Tensor)}
                inputs.pop("token_type_ids", None)
                out = model.generate(**inputs,
                                     max_new_tokens=state.max_new_tokens,
                                     do_sample=True,
                                     temperature=state.temperature,
                                     top_p=state.top_p,
                                     pad_token_id=tokenizer.pad_token_id,
                                     eos_token_id=tokenizer.eos_token_id)
                seq = out[0]
                input_len = inputs["input_ids"].shape[1] if "input_ids" in inputs else 0
                text = tokenizer.decode(seq if state.full_output else seq[input_len:], skip_special_tokens=True)
                print("\nOUTPUT\n", text)
            except Exception as e:
                print("Generation failed:", e)

def main():
    device = DEVICE
    tokenizer = load_tokenizer(TOKENIZER_DIR)
    model = load_model(MODEL_DIR, device)
    repl(model, tokenizer, device)

if __name__ == "__main__":
    main()
```