---
license: mit
language:
- en
- es
pipeline_tag: text-generation
tags:
- word-generator
- mini
- tiny
- experiment
- small
- mistral-lm
- text-generation-inference
- word-generation
- test
- fun
- explore
- lexical
- words
- word
---

# Tiny-Word

Tiny-Word is an extremely tiny Mistral-like model with approximately 81k parameters. It generates English or Spanish words, or word-like sequences.

## Architecture

| Key               | Value |
| :---------------: | :---: |
| hidden_size       | 32    |
| num_layers        | 2     |
| num_heads         | 1     |
| num_kv_heads      | 1     |
| intermediate_size | 256   |
| vocab_size        | 1200  |

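For reference, the table maps onto a Hugging Face `MistralConfig` roughly as sketched below. This is an illustration, not the exact config shipped with the repo: fields not listed in the card keep library defaults, and `sliding_window=64` is taken from the Training Setup section.

```python
from transformers import MistralConfig, MistralForCausalLM

# Illustrative config built from the table above; unlisted fields
# (rope_theta, tie_word_embeddings, etc.) keep MistralConfig defaults.
config = MistralConfig(
    hidden_size=32,
    num_hidden_layers=2,
    num_attention_heads=1,
    num_key_value_heads=1,
    intermediate_size=256,
    vocab_size=1200,
    sliding_window=64,  # value reported in "Training Setup" below
)

# Randomly initialized model with this architecture (not the trained weights).
model = MistralForCausalLM(config)
```
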
## Training

Tiny-Word was trained on 753,232 unique words (entries), amounting to 3,225,398 tokens and 7,022,310 characters. ~660k of those words are English and ~90k are Spanish.

### Dataset

| Key                     | Value     |
| :---------------------: | :-------: |
| Entries (words)         | 753,232   |
| Tokens                  | 3,225,398 |
| Characters              | 7,022,310 |
| Avg. Tokens Per Entry   | ~4.2      |
| Avg. Words Per Entry    | 1         |
| Avg. Chars Per Entry    | ~9.3      |
| Longest Entry (Tokens)  | 36        |
| Shortest Entry (Tokens) | 1         |
| English Words           | ~660k     |
| Spanish Words           | ~90k      |

### Training Setup

We trained the model for 6 epochs with a batch size of 128 and 2 gradient accumulation steps.
The sliding_window was set to 64 even though the longest entry is only 36 tokens, which is wasteful; it does not hurt model quality, it only slows training down.
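
The training script itself is not included in this repo. As a rough illustration only, the hyperparameters above would map onto `transformers` `TrainingArguments` along these lines; the output directory and the commented-out dataset wiring are placeholder assumptions, not the actual setup:

```python
from transformers import Trainer, TrainingArguments

# Only the hyperparameters stated above are taken from the card;
# output_dir and the dataset objects below are placeholder assumptions.
args = TrainingArguments(
    output_dir="tiny-word-checkpoints",
    num_train_epochs=6,
    per_device_train_batch_size=128,  # effective batch of 256 with accumulation
    gradient_accumulation_steps=2,
)

# trainer = Trainer(model=model, args=args,
#                   train_dataset=train_ds, eval_dataset=val_ds)
# trainer.train()
```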

#### Hardware

Tiny-Word was trained on Google Colaboratory with a single NVIDIA Tesla T4 GPU (15 GB of VRAM) and 12.7 GB of system RAM.

### Training Results

| step  | train_loss | val_loss | train_ppl | val_ppl |
| :---- | :--------- | :------- | :-------- | :------ |
| 1000  | 4.9619     | 4.5201   | ~143.0    | ~91.8   |
| 3000  | 4.0093     | 3.9156   | ~55.0     | ~50.2   |
| 4000  | 3.8464     | 3.7951   | ~46.8     | ~44.5   |
| 6000  | 3.6814     | 3.6612   | ~39.7     | ~38.9   |
| 7000  | 3.6329     | 3.6182   | ~37.8     | ~37.2   |
| 9000  | 3.5684     | 3.5636   | ~35.5     | ~35.3   |
| 10000 | 3.5452     | 3.5444   | ~34.7     | ~34.6   |
| 12000 | 3.5139     | 3.5161   | ~33.6     | ~33.7   |
| 15000 | 3.4784     | 3.4861   | ~32.4     | ~32.6   |

Tiny-Word shows promising results even at its tiny size (~81k parameters). Given the relatively easy task (predicting subword tokens within single words), this is expected.
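
The perplexity columns are simply the exponential of the corresponding loss, which you can spot-check:

```python
import math

# ppl = exp(loss); compare against the table above.
print(round(math.exp(4.5201), 1))  # 91.8 -> val_ppl at step 1000
print(round(math.exp(3.4784), 1))  # 32.4 -> train_ppl at step 15000
```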

## Generation Examples

Prompt:

```
d
```

Output:

```
desmounder's's's
```

Prompt:

```
0333333333
```

Output:

```
ruperperse'sf
```

Prompt:

```
a
```

Output:

```
utomatographic'sphon
```

Prompt:

```
e
```

Output:

```
equip’s’s’s
```

The model generates plausible, pronounceable word-like sequences, and sometimes real words. It handles almost any input: even a nonsensical prompt will still produce a word-like continuation.
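
To try prompts like these without the full REPL in the Quick Demo section, a minimal snippet along these lines should work (the sampling parameters mirror the demo's defaults; outputs will differ between runs):

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

repo = "Harley-ml/tiny-word"
tokenizer = AutoTokenizer.from_pretrained(repo)
model = AutoModelForCausalLM.from_pretrained(repo)

inputs = tokenizer("d", return_tensors="pt", add_special_tokens=False)
with torch.no_grad():
    out = model.generate(
        **inputs,
        max_new_tokens=16,
        do_sample=True,
        temperature=0.4,
        top_p=0.9,
        pad_token_id=tokenizer.pad_token_id if tokenizer.pad_token_id is not None else tokenizer.eos_token_id,
    )
print(tokenizer.decode(out[0], skip_special_tokens=True))
```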

## Limitations

1. It does not generate sentences, prose, code, or anything besides a single word-like sequence.
2. It cannot reason or produce complex language.
3. It often appends artifacts after the generated word, such as "'s" or "'sphon" (a small cleanup sketch follows this list).
4. Most generated words aren't real; they instead reflect the lexicon and morphology of English and Spanish.
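
If the trailing artifacts from limitation 3 are a problem, a simple heuristic post-processing step (not part of the model) can strip them:

```python
import re

def clean_word(text: str) -> str:
    """Heuristic cleanup for Tiny-Word outputs: keep the leading word-like
    chunk and drop repeated possessive-style artifacts such as "'s's"."""
    match = re.match(r"[A-Za-záéíóúüñÁÉÍÓÚÜÑ'’]+", text.strip())
    word = match.group(0) if match else text.strip()
    return re.sub(r"(?:['’]s)+$", "", word)

print(clean_word("desmounder's's's"))  # -> desmounder
print(clean_word("equip’s’s’s"))       # -> equip
```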

## Quick Demo

```python
#!/usr/bin/env python3
"""
Tiny Mistral REPL demo — streaming tokens (TextStreamer if available, else manual sampling).
Commands: :quit, :help, :show, :set <param> <value>  (max_new_tokens, temperature, top_p, full_output)
"""
from __future__ import annotations
import shlex
import time
from typing import Optional

import torch
from transformers import AutoTokenizer, MistralForCausalLM

# --------- CONFIG ----------
MODEL_DIR = "Harley-ml/tiny-word"
TOKENIZER_DIR = MODEL_DIR
DEFAULT_MAX_NEW_TOKENS = 16
DEFAULT_TEMPERATURE = 0.4
DEFAULT_TOP_P = 0.9
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
PROMPT = ">>> "
# ---------------------------

def load_tokenizer(path: str):
    print("Loading tokenizer...", path)
    # Downloads (or reuses the cached) tokenizer files from the Hub.
    tok = AutoTokenizer.from_pretrained(path, use_fast=True)
    if tok.pad_token is None:
        # Fall back to the EOS token for padding, or register new specials.
        if getattr(tok, "eos_token", None) is not None:
            tok.add_special_tokens({"pad_token": tok.eos_token})
        else:
            tok.add_special_tokens({"pad_token": "<pad>", "eos_token": "</s>"})
    print("Tokenizer ready. vocab_size=", getattr(tok, "vocab_size", "N/A"))
    return tok

def load_model(path: str, device: str):
    print("Loading model...", path)
    try:
        desired_dtype = torch.float16 if device.startswith("cuda") else torch.float32
        model = MistralForCausalLM.from_pretrained(path, torch_dtype=desired_dtype)
        print("Loaded with requested dtype.")
    except Exception as e:
        print("Load warning, retrying without dtype:", e)
        model = MistralForCausalLM.from_pretrained(path)

    try:
        model.to(device)
        # Make sure the parameter dtype matches the device (fp16 on GPU, fp32 on CPU).
        if device.startswith("cuda") and next(model.parameters()).dtype != torch.float16:
            model.half()
        if not device.startswith("cuda") and next(model.parameters()).dtype != torch.float32:
            model.to(torch.float32)
    except Exception as e:
        print("Model move/convert warning:", e)

    # generate() needs a pad token id; reuse EOS if none was saved with the model.
    if model.config.pad_token_id is None:
        model.config.pad_token_id = model.config.eos_token_id
    model.eval()
    return model

# Simple nucleus/top-p filtering for a single logits vector
def top_p_filtering(logits: torch.Tensor, top_p: float, min_keep: int = 1) -> torch.Tensor:
    if top_p <= 0 or top_p >= 1.0:
        return logits
    sorted_logits, sorted_idx = torch.sort(logits, descending=True)
    probs = torch.softmax(sorted_logits, dim=-1)
    cumprobs = torch.cumsum(probs, dim=-1)
    cutoff = (cumprobs > top_p).nonzero(as_tuple=False)
    if cutoff.numel() > 0:
        idx = int(cutoff[0].item())
        cutoff_idx = max(idx + 1, min_keep)
    else:
        cutoff_idx = sorted_logits.size(-1)
    mask = torch.ones_like(sorted_logits, dtype=torch.bool)
    mask[cutoff_idx:] = False
    filtered = sorted_logits.masked_fill(~mask, -float("inf"))
    return torch.empty_like(filtered).scatter_(0, sorted_idx, filtered)

# Manual streaming generator (single-batch)
def manual_stream_generate(model, tokenizer, prompt: str, device: str,
                           max_new_tokens: int = 64, temperature: float = 1.0, top_p: float = 0.9,
                           eos_token_id: Optional[int] = None):
    inputs = tokenizer(prompt, return_tensors="pt", add_special_tokens=False)
    input_ids = inputs["input_ids"].to(device)

    # Feed the whole prompt on the first step, then one new token at a time,
    # reusing the KV cache so every prompt token is processed exactly once.
    past = None
    next_input = input_ids
    for _ in range(max_new_tokens):
        with torch.no_grad():
            out = model(input_ids=next_input, past_key_values=past, use_cache=True)
        logits = out.logits[:, -1, :]  # (batch, vocab)
        past = getattr(out, "past_key_values", past)

        if temperature != 1.0:
            logits = logits / max(temperature, 1e-8)

        # Top-p filtering is done in float32 on CPU, then we sample.
        filtered = top_p_filtering(logits[0].float().cpu(), top_p).to(device)
        probs = torch.nn.functional.softmax(filtered.unsqueeze(0), dim=-1)
        next_token = torch.multinomial(probs, num_samples=1)
        token_id = int(next_token[0, 0].item())

        token_text = tokenizer.decode([token_id], clean_up_tokenization_spaces=False)
        yield token_id, token_text

        if eos_token_id is not None and token_id == eos_token_id:
            break
        next_input = torch.tensor([[token_id]], dtype=torch.long, device=device)

def has_text_streamer():
    try:
        from transformers import TextStreamer  # type: ignore
        return True
    except Exception:
        return False

# tiny REPL state
class State:
    def __init__(self):
        self.max_new_tokens = DEFAULT_MAX_NEW_TOKENS
        self.temperature = DEFAULT_TEMPERATURE
        self.top_p = DEFAULT_TOP_P
        self.full_output = False
        self.stream = True

def handle_generation(model, tokenizer, prompt: str, device: str, state: State):
    eos = getattr(tokenizer, "eos_token_id", None)
    try:
        if has_text_streamer():
            from transformers import TextStreamer
            streamer = TextStreamer(tokenizer, skip_prompt=not state.full_output, skip_special_tokens=True)
            inputs = tokenizer(prompt, return_tensors="pt", truncation=True, add_special_tokens=False)
            inputs = {k: v.to(device) for k, v in inputs.items() if isinstance(v, torch.Tensor)}
            inputs.pop("token_type_ids", None)
            model.generate(**inputs,
                           max_new_tokens=state.max_new_tokens,
                           do_sample=True,
                           temperature=state.temperature,
                           top_p=state.top_p,
                           pad_token_id=tokenizer.pad_token_id,
                           eos_token_id=tokenizer.eos_token_id,
                           streamer=streamer)
            print("")  # newline after streamer
            return

        # fallback: manual streaming
        gen = manual_stream_generate(model, tokenizer, prompt, device,
                                     max_new_tokens=state.max_new_tokens,
                                     temperature=state.temperature,
                                     top_p=state.top_p,
                                     eos_token_id=eos)
        if state.full_output:
            print("PROMPT:", prompt)
        print("GENERATING:", end=" ", flush=True)

        count = 0
        t0 = time.time()
        for _tok_id, tok_text in gen:
            count += 1
            print(tok_text, end="", flush=True)
        print()
        print(f"(generated {count} tokens in {time.time()-t0:.2f}s)")
    except KeyboardInterrupt:
        print("\n[interrupted] Generation aborted by user.")
    except Exception as e:
        print("Generation error:", e)

def repl(model, tokenizer, device):
    state = State()
    help_text = (
        "Commands:\n"
        "  :quit\n"
        "  :help\n"
        "  :show\n"
        "  :set <param> <value>   # params: max_new_tokens, temperature, top_p, full_output, stream\n"
        "  (blank line repeats last prompt)\n"
    )
    print("Tiny Mistral REPL — device:", device)
    print(help_text)
    last = ""
    while True:
        try:
            raw = input(PROMPT).strip()
        except (EOFError, KeyboardInterrupt):
            print("\nExiting.")
            break
        if not raw:
            raw = last
        if not raw:
            continue

        if raw.startswith(":"):
            toks = shlex.split(raw)
            cmd = toks[0].lower()
            if cmd == ":quit":
                print("bye.")
                break
            if cmd == ":help":
                print(help_text)
                continue
            if cmd == ":show":
                print(f"max_new_tokens={state.max_new_tokens}, temperature={state.temperature}, "
                      f"top_p={state.top_p}, full_output={state.full_output}, stream={state.stream}")
                continue
            if cmd == ":set":
                if len(toks) < 3:
                    print("usage: :set <param> <value>")
                    continue
                k, v = toks[1], toks[2]
                try:
                    if k == "max_new_tokens":
                        state.max_new_tokens = int(v)
                    elif k == "temperature":
                        state.temperature = float(v)
                    elif k == "top_p":
                        state.top_p = float(v)
                    elif k in ("full_output", "full"):
                        state.full_output = v.lower() in ("1", "true", "yes", "y")
                    elif k == "stream":
                        state.stream = v.lower() in ("1", "true", "yes", "y")
                    else:
                        print("unknown param:", k)
                        continue
                    print("OK.")
                except Exception as e:
                    print("set error:", e)
                continue
            print("unknown command")
            continue

        last = raw
        if state.stream:
            handle_generation(model, tokenizer, raw, device, state)
        else:
            # non-streaming generate
            try:
                inputs = tokenizer(raw, return_tensors="pt", truncation=True, add_special_tokens=False)
                inputs = {k: v.to(device) for k, v in inputs.items() if isinstance(v, torch.Tensor)}
                inputs.pop("token_type_ids", None)
                out = model.generate(**inputs,
                                     max_new_tokens=state.max_new_tokens,
                                     do_sample=True,
                                     temperature=state.temperature,
                                     top_p=state.top_p,
                                     pad_token_id=tokenizer.pad_token_id,
                                     eos_token_id=tokenizer.eos_token_id)
                seq = out[0]
                input_len = inputs["input_ids"].shape[1] if "input_ids" in inputs else 0
                text = tokenizer.decode(seq if state.full_output else seq[input_len:], skip_special_tokens=True)
                print("\nOUTPUT\n", text)
            except Exception as e:
                print("Generation failed:", e)

def main():
    device = DEVICE
    tokenizer = load_tokenizer(TOKENIZER_DIR)
    model = load_model(MODEL_DIR, device)
    repl(model, tokenizer, device)

if __name__ == "__main__":
    main()
```