Update README.md

---
license: mit
language:
- en
tags:
- code
- markdown
- tiny
- small
- quick
- fast
- 28M
- mistral
- text-generation-inference
---

# **Mini-MD**

Mini-MD is a **~28M parameter transformer decoder** trained on ~200k markdown files from GitHub.

## Architecture

| Key | Value |
| :---: | :---: |
| `hidden_size` | 384 |
| `num_layers` | 8 |
| `num_heads` | 6 |
| `num_kv_heads` | 2 |
| `head_dim` | 64 |
| `intermediate_size` | 1536 |
| `vocab_size` | 14002 |
| `sliding_window` | 640 |
| `rope_theta` | 10000.0 |
| `tie_embeddings` | True |
| `total_params` | 28061568 |

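For orientation, the table maps onto Hugging Face's `MistralConfig` roughly as below. This is a sketch of the assumed correspondence (key names such as `num_layers` vs. `num_hidden_layers` differ between the table and the config class), not a copy of the repo's `config.json`:

```python
from transformers import MistralConfig

# Assumed mapping from the table above to MistralConfig field names.
config = MistralConfig(
    hidden_size=384,
    num_hidden_layers=8,        # num_layers
    num_attention_heads=6,      # num_heads
    num_key_value_heads=2,      # num_kv_heads (grouped-query attention)
    head_dim=64,                # consistent with hidden_size / num_heads
    intermediate_size=1536,
    vocab_size=14002,
    sliding_window=640,
    rope_theta=10000.0,
    tie_word_embeddings=True,   # tie_embeddings
)
```
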
## Training

### Training Parameters

| Key | Value |
| :---: | :---: |
| `num_epochs` | 3 |
| `batch_size` | 5 |
| `stride` | 620 |
| `seq_len` | 640 |
| `val_split` | 0.09 |
| `learning_rate` | 2e-4 |

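Since `stride` (620) is slightly smaller than `seq_len` (640), consecutive training windows overlap by 20 tokens. A minimal sketch of that chunking, as a hypothetical helper rather than the actual training code:

```python
def chunk(token_ids, seq_len=640, stride=620):
    """Yield overlapping fixed-length windows over a token stream.

    With seq_len=640 and stride=620, adjacent windows share 20 tokens.
    """
    for start in range(0, max(len(token_ids) - seq_len, 0) + 1, stride):
        yield token_ids[start:start + seq_len]
```
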
### Training Results

| `train_loss` | `val_loss` | `step` | `epoch` |
| :---: | :---: | :---: | :---: |
| 6.8138 | 5.7706 | 1200 | 0.02 |
| 2.4274 | 2.5915 | 12000 | 0.24 |
| 2.1519 | 2.2091 | 30000 | 0.59 |
| 2.0411 | 2.0464 | 48000 | 0.95 |
| 1.7728 | 1.8912 | 84000 | 1.66 |
| 1.7304 | 1.8494 | 100800 | 1.99 |
| 1.6394 | 1.7599 | 132000 | 2.60 |
| 1.6794 | 1.7234 | 151200 | 2.98 |

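Assuming `val_loss` is the usual mean per-token cross-entropy in nats, the final checkpoint corresponds to a validation perplexity of about exp(1.7234) ≈ 5.6:

```python
import math

# val_loss (cross-entropy, assumed nats/token) -> perplexity
print(math.exp(1.7234))  # ~5.60
```
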
### Hardware

- GPU: a single NVIDIA RTX 2060 6GB (14GB shared system memory)
- CPU: AMD Ryzen 5 2600
- RAM: 16GB

## Generations

Input:

```
# README
```

Output:

```
Cover Evolution

Official documentation for dis curated documentation: cosu:

**README.md** (Recommended for reference)
- **Repository**:
- **Bloomberg**:

**Welcome to Open Questions about Apache Dis curriculum**
- **Security:** All files are securely required through social media.
- **International (FONTLOG)**: Only include relevant code and resources when necessary to see new or better info.
```

---

Input:

```
# Getting Started with Rust
```

Output:

```
You need a Rust crate for your operating system or platform environment, but you can
install the Rust organization for Rust programs, and it's enough you'd really want it
to work with your operating system.

### Building Tools

#### FFI Compatibility

Rust's Rust has a compiler that can handle the following:
- Using a toolchain with the AVX-based kernels
- FFI Compatibility on ARM for iOS, iOS, Robot, etc
- FFI Support from Linux for XPC platforms
```

---

Input:

```
## Installation
```

Output:

```
in this project

For building the Heft and French benchmarks, there are two popular tools you can use to build
various benchmarks:

* [Jim]( and [iL2](
* [Lucid]( is supported, with [t5]( the
[`LuxSrc32`]( CLI to get a better benchmark

## Usage in this project:

Npm install
[LuxSrc32]( It is recommended to use precompiled GopherLM files
```

---

Input:

```
The quick brown fox
```

Output:

```
ispepper is the skype of the games and you just want to generate them to your be.
To use this you will need a Galicu sensor that has 2x 3x2 pesus.

The other four mandaracters share it - you want the German light/highest food wey no
your browses for game. Geometric, like grove, grove, and bus are pencilumo.
```

---

Input:

```
Hello world
```

Output:

```
I can find a brown, I know it may need the full-fledip data into my phone
buying you...
my first time I went my email, we're going through it.
You should now see you.
Thank you the wonderful! Thanks! It's very cool when the dog would bed.
We can't go through the world until you've started it!
```

---

Input:

```
Once upon a time,
```

Output:

```
all data are moved to any place the user is aware of their operation and then removed.

When the following is present:

* The ability to move state through the same service, which allows
the ability of clients to move state across a network interface.

During a [DaemonSet controller](/docs/concepts/cluster-administration/deployment/),
all webhooks of the DaemonSet controller may be invoked from a user through a
"Service Cluster" resource without any actions.
```

## Capabilities & Restrictions

### Capabilities

- Continues GitHub-style markdown documents plausibly, particularly README sections, changelogs, installation guides, and API documentation
- Produces syntactically well-formed code blocks across multiple languages (Python, Rust, Go, C++, JavaScript)
- Sustains a single topic for several paragraphs when the prompt closely matches the training distribution (e.g. `# Getting Started with <common language>`)

### Restrictions

- Not an instruction-following model — treats all input as a document prefix to continue, not a query to answer; see the sketch after this list
- Out-of-distribution prompts (natural language, fiction, conversation) produce incoherent or nonsensical output
- Prone to topic drift over longer generations, gradually sliding into unrelated documentation
- Prone to repetition loops, particularly on short or ambiguous prompts
- Generates hallucinated URLs, package names, library names, and version numbers with no grounding
- Multilingual output may appear mid-generation, inherited from non-English READMEs in the training corpus; coherence in non-English output is lower than in English
- Not suitable for any production use

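In practice that means prompting with the opening of a document, not a question. A minimal sketch using the `transformers` pipeline (sampling settings are illustrative, not tuned):

```python
from transformers import pipeline

generator = pipeline("text-generation", model="Harley-ml/Mini-MD")

# Give the model a document prefix to continue, not a query to answer.
result = generator(
    "# Getting Started with Rust\n\n",
    max_new_tokens=128,
    do_sample=True,
    temperature=0.9,
    top_p=0.9,
)
print(result[0]["generated_text"])
```
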
## Inference

```python
#!/usr/bin/env python3
"""
Tiny Mistral REPL demo — streaming tokens (TextStreamer if available, else manual sampling).
Commands: :quit, :help, :show, :set <param> <value> (max_new_tokens, temperature, top_p, full_output, stream)
"""
from __future__ import annotations
import shlex
import time
from typing import Optional

import torch
from transformers import AutoTokenizer, MistralForCausalLM

# --------- CONFIG ----------
MODEL_DIR = "Harley-ml/Mini-MD"
TOKENIZER_DIR = MODEL_DIR
DEFAULT_MAX_NEW_TOKENS = 640
DEFAULT_TEMPERATURE = 0.9
DEFAULT_TOP_P = 0.9
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
PROMPT = ">>> "
# ---------------------------

def load_tokenizer(path: str):
    print("Loading tokenizer...", path)
    tok = AutoTokenizer.from_pretrained(path, use_fast=True, local_files_only=False)
    # Make sure a pad token exists so generate() has one to fall back on.
    if tok.pad_token is None:
        if getattr(tok, "eos_token", None) is not None:
            tok.add_special_tokens({"pad_token": tok.eos_token})
        else:
            tok.add_special_tokens({"pad_token": "<pad>", "eos_token": "</s>"})
    print("Tokenizer ready. vocab_size=", getattr(tok, "vocab_size", "N/A"))
    return tok

def load_model(path: str, device: str):
    print("Loading model...", path)
    # fp16 on GPU, fp32 on CPU; older transformers versions reject the
    # `dtype` kwarg, so fall back to loading plain and converting afterwards.
    try:
        desired_dtype = torch.float16 if device.startswith("cuda") else torch.float32
        model = MistralForCausalLM.from_pretrained(path, local_files_only=False, dtype=desired_dtype)
        print("Loaded with dtype arg.")
    except TypeError:
        model = MistralForCausalLM.from_pretrained(path, local_files_only=False)
        print("Loaded without dtype; will convert.")
    except Exception as e:
        print("Load warning, retrying without dtype:", e)
        model = MistralForCausalLM.from_pretrained(path, local_files_only=False)

    try:
        model.to(device)
        if device.startswith("cuda") and next(model.parameters()).dtype != torch.float16:
            model.half()
        if not device.startswith("cuda") and next(model.parameters()).dtype != torch.float32:
            model.to(torch.float32)
    except Exception as e:
        print("Model move/convert warning:", e)

    model.eval()
    return model

# Simple nucleus/top-p filtering for a single logits vector
def top_p_filtering(logits: torch.Tensor, top_p: float, min_keep: int = 1) -> torch.Tensor:
    if top_p <= 0 or top_p >= 1.0:
        return logits
    sorted_logits, sorted_idx = torch.sort(logits, descending=True)
    probs = torch.softmax(sorted_logits, dim=-1)
    cumprobs = torch.cumsum(probs, dim=-1)
    cutoff = (cumprobs > top_p).nonzero(as_tuple=False)
    if cutoff.numel() > 0:
        # Keep everything up to and including the token that crosses top_p.
        cutoff_idx = max(int(cutoff[0].item()) + 1, min_keep)
    else:
        cutoff_idx = sorted_logits.size(-1)
    mask = torch.ones_like(sorted_logits, dtype=torch.bool)
    mask[cutoff_idx:] = False
    filtered = sorted_logits.masked_fill(~mask, -float("inf"))
    # Scatter back so the filtered logits line up with the original vocab order.
    return torch.empty_like(filtered).scatter_(0, sorted_idx, filtered)

# Manual streaming generator (single-batch)
def manual_stream_generate(model, tokenizer, prompt: str, device: str,
                           max_new_tokens: int = 64, temperature: float = 1.0, top_p: float = 0.9,
                           eos_token_id: Optional[int] = None):
    inputs = tokenizer(prompt, return_tensors="pt", add_special_tokens=False)
    input_ids = inputs["input_ids"].to(device)
    attention_mask = inputs.get("attention_mask", None)
    if attention_mask is not None:
        attention_mask = attention_mask.to(device)

    # Prefill: run the whole prompt once, cache the key/values, and sample the
    # first new token from these logits (re-feeding the last prompt token with
    # the cache would process it twice).
    with torch.no_grad():
        out = model(input_ids=input_ids, attention_mask=attention_mask, use_cache=True)
    past = getattr(out, "past_key_values", None)
    logits = out.logits[:, -1, :]  # (batch, vocab)

    for _ in range(max_new_tokens):
        if temperature != 1.0:
            logits = logits / max(temperature, 1e-8)

        # Sample in fp32 on CPU for numerical safety with fp16 weights.
        filtered = top_p_filtering(logits[0].float().cpu(), top_p).to(device)
        probs = torch.nn.functional.softmax(filtered.unsqueeze(0), dim=-1)
        next_token = torch.multinomial(probs, num_samples=1)
        token_id = int(next_token[0, 0].item())

        token_text = tokenizer.decode([token_id], clean_up_tokenization_spaces=False)
        yield token_id, token_text

        if eos_token_id is not None and token_id == eos_token_id:
            break
        with torch.no_grad():
            out = model(input_ids=torch.tensor([[token_id]], dtype=torch.long, device=device),
                        past_key_values=past, use_cache=True)
        past = getattr(out, "past_key_values", past)
        logits = out.logits[:, -1, :]

def has_text_streamer():
    try:
        from transformers import TextStreamer  # type: ignore
        return True
    except Exception:
        return False

# tiny REPL state
class State:
    def __init__(self):
        self.max_new_tokens = DEFAULT_MAX_NEW_TOKENS
        self.temperature = DEFAULT_TEMPERATURE
        self.top_p = DEFAULT_TOP_P
        self.full_output = False
        self.stream = True

def handle_generation(model, tokenizer, prompt: str, device: str, state: State):
    eos = getattr(tokenizer, "eos_token_id", None)
    try:
        if has_text_streamer():
            from transformers import TextStreamer
            streamer = TextStreamer(tokenizer, skip_prompt=not state.full_output, skip_special_tokens=True)
            inputs = tokenizer(prompt, return_tensors="pt", truncation=True, add_special_tokens=False)
            inputs = {k: v.to(device) for k, v in inputs.items() if isinstance(v, torch.Tensor)}
            inputs.pop("token_type_ids", None)
            model.generate(**inputs,
                           max_new_tokens=state.max_new_tokens,
                           do_sample=True,
                           temperature=state.temperature,
                           top_p=state.top_p,
                           pad_token_id=tokenizer.pad_token_id,
                           eos_token_id=tokenizer.eos_token_id,
                           streamer=streamer)
            print("")  # newline after streamer
            return
        # fallback: manual streaming
        gen = manual_stream_generate(model, tokenizer, prompt, device,
                                     max_new_tokens=state.max_new_tokens,
                                     temperature=state.temperature,
                                     top_p=state.top_p,
                                     eos_token_id=eos)
        if state.full_output:
            print("PROMPT:", prompt)
        print("GENERATING:", end=" ", flush=True)

        count = 0
        t0 = time.time()
        for _tok_id, tok_text in gen:
            count += 1
            print(tok_text, end="", flush=True)
        print()
        print(f"(generated {count} tokens in {time.time()-t0:.2f}s)")
    except KeyboardInterrupt:
        print("\n[interrupted] Generation aborted by user.")
    except Exception as e:
        print("Generation error:", e)

def repl(model, tokenizer, device):
    state = State()
    help_text = (
        "Commands:\n"
        "  :quit\n"
        "  :help\n"
        "  :show\n"
        "  :set <param> <value>   # params: max_new_tokens, temperature, top_p, full_output, stream\n"
        "  (blank line repeats last prompt)\n"
    )
    print("Tiny Mistral REPL — device:", device)
    print(help_text)
    last = ""
    while True:
        try:
            raw = input(PROMPT).strip()
        except (EOFError, KeyboardInterrupt):
            print("\nExiting.")
            break
        if not raw:
            raw = last
        if not raw:
            continue

        if raw.startswith(":"):
            toks = shlex.split(raw)
            cmd = toks[0].lower()
            if cmd == ":quit":
                print("bye.")
                break
            if cmd == ":help":
                print(help_text); continue
            if cmd == ":show":
                print(f"max_new_tokens={state.max_new_tokens}, temperature={state.temperature}, top_p={state.top_p}, full_output={state.full_output}, stream={state.stream}")
                continue
            if cmd == ":set":
                if len(toks) < 3:
                    print("usage: :set <param> <value>"); continue
                k, v = toks[1], toks[2]
                try:
                    if k == "max_new_tokens":
                        state.max_new_tokens = int(v)
                    elif k == "temperature":
                        state.temperature = float(v)
                    elif k == "top_p":
                        state.top_p = float(v)
                    elif k in ("full_output", "full"):
                        state.full_output = v.lower() in ("1", "true", "yes", "y")
                    elif k == "stream":
                        state.stream = v.lower() in ("1", "true", "yes", "y")
                    else:
                        print("unknown param:", k)
                        continue
                    print("OK.")
                except Exception as e:
                    print("set error:", e)
                continue
            print("unknown command")
            continue

        last = raw
        if state.stream:
            handle_generation(model, tokenizer, raw, device, state)
        else:
            # non-streaming generate
            try:
                inputs = tokenizer(raw, return_tensors="pt", truncation=True, add_special_tokens=False)
                inputs = {k: v.to(device) for k, v in inputs.items() if isinstance(v, torch.Tensor)}
                inputs.pop("token_type_ids", None)
                out = model.generate(**inputs,
                                     max_new_tokens=state.max_new_tokens,
                                     do_sample=True,
                                     temperature=state.temperature,
                                     top_p=state.top_p,
                                     pad_token_id=tokenizer.pad_token_id,
                                     eos_token_id=tokenizer.eos_token_id)
                seq = out[0]
                input_len = inputs["input_ids"].shape[1] if "input_ids" in inputs else 0
                text = tokenizer.decode(seq if state.full_output else seq[input_len:], skip_special_tokens=True)
                print("\nOUTPUT\n", text)
            except Exception as e:
                print("Generation failed:", e)

def main():
    device = DEVICE
    tokenizer = load_tokenizer(TOKENIZER_DIR)
    model = load_model(MODEL_DIR, device)
    # Keep the model config's pad token in sync with the tokenizer.
    model.config.pad_token_id = tokenizer.pad_token_id
    repl(model, tokenizer, device)

if __name__ == "__main__":
    main()
```
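
Running the script drops you into the REPL; anything that is not a `:command` is treated as a document prefix to continue. An illustrative session (the script filename is arbitrary, and output varies with sampling):

```
$ python mini_md_repl.py
Tiny Mistral REPL — device: cuda
>>> :set temperature 0.8
OK.
>>> # Getting Started with Rust
GENERATING: ...
```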