# Hamlet Next-Word Prediction (LSTM | Keras)
A lightweight LSTM language model trained to predict the next word from a short text prompt (Hamlet-style).
This Hugging Face repo hosts the trained model + tokenizer used by a Streamlit inference app.
## Training → Model → Inference
- Training notebook (Colab): https://colab.research.google.com/drive/1Hh7BKYroKbbZnMxAQ8R6mzsIW1xawaP6
- Inference app (Streamlit): https://github.com/sparklerz/Deep-Learning-Fundamentals-Suite (page: pages/04_Hamlet_Next_Word_LSTM.py)
## What's in this repo
- `artifacts/next_word_lstm.h5` — trained Keras model
- `artifacts/tokenizer.pickle` — fitted Keras Tokenizer
- `artifacts/config.json` — generation/config values (e.g., `max_sequence_len`, vocab cap)
- `hamlet.txt` — training text used in the notebook
## Inputs
- Seed text (English).
- The text is tokenized with the saved tokenizer and padded to a fixed context length:
  `max_sequence_len = 40` (from `artifacts/config.json`)
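To make the padding step concrete, here is a plain-Python sketch of what Keras's `pad_sequences(..., padding="pre")` does to the tokenized seed (the real code in the Quickstart below uses Keras directly):

```python
# Sketch of the pre-padding step, without Keras.
# pad_sequences(..., maxlen=L, padding="pre") left-pads with zeros and
# keeps only the last L tokens when the sequence is longer than L.
def pad_pre(token_ids, maxlen):
    trimmed = token_ids[-maxlen:]              # keep the most recent tokens
    return [0] * (maxlen - len(trimmed)) + trimmed

# With max_sequence_len = 40, the model's context window is 39 tokens:
print(pad_pre([5, 17, 2], 39)[:5])   # [0, 0, 0, 0, 0] — mostly padding
print(pad_pre([5, 17, 2], 39)[-3:])  # [5, 17, 2] — seed tokens at the end
```

Pre-padding (rather than post-padding) keeps the seed tokens adjacent to the prediction position, which is what the LSTM was trained on.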
## Output
- Next-word probabilities over the vocabulary.
- In the Streamlit app you can:
- show top-k next-word suggestions
- generate multiple words using top-k + top-p (nucleus) sampling with temperature and a small repeat penalty
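The app's exact sampler lives in the Streamlit repo; the sketch below illustrates the same ingredients (temperature, top-k, top-p/nucleus filtering, and a repeat penalty) over a probability vector. Parameter defaults here are illustrative assumptions, not the app's values.

```python
import numpy as np

def sample_next(probs, k=10, p=0.9, temperature=0.8,
                recent_ids=(), repeat_penalty=1.2, rng=None):
    """Sample one token id from next-word probabilities (illustrative)."""
    if rng is None:
        rng = np.random.default_rng()
    logits = np.log(np.clip(probs, 1e-12, None))
    for i in set(recent_ids):                   # penalize recently used tokens
        logits[i] -= np.log(repeat_penalty)
    logits /= temperature                       # temperature scaling
    order = np.argsort(logits)[::-1][:k]        # top-k candidates
    cand = np.exp(logits[order] - logits[order].max())
    cand /= cand.sum()
    keep = np.cumsum(cand) <= p                 # top-p (nucleus) cut
    keep[0] = True                              # always keep the best token
    cand, order = cand[keep], order[keep]
    cand /= cand.sum()
    return int(rng.choice(order, p=cand))
```

Low temperature sharpens the distribution toward the argmax; the repeat penalty divides the probability of recently generated tokens so short loops ("the the the") become less likely.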
## Quickstart (load + predict next word)
```python
import json
import pickle

import numpy as np
import tensorflow as tf
from huggingface_hub import hf_hub_download
from tensorflow.keras.preprocessing.sequence import pad_sequences

REPO_ID = "ash001/hamlet-nextword-lstm"

# Download artifacts
model_path = hf_hub_download(REPO_ID, "artifacts/next_word_lstm.h5")
tok_path = hf_hub_download(REPO_ID, "artifacts/tokenizer.pickle")
cfg_path = hf_hub_download(REPO_ID, "artifacts/config.json")

model = tf.keras.models.load_model(model_path, compile=False)

with open(tok_path, "rb") as f:
    tokenizer = pickle.load(f)

with open(cfg_path, "r") as f:
    cfg = json.load(f)
max_sequence_len = int(cfg["max_sequence_len"])

def next_word_topk(seed_text: str, k: int = 10):
    """Return the k most likely next words with their probabilities."""
    token_list = tokenizer.texts_to_sequences([seed_text])[0]
    # The model sees max_sequence_len - 1 tokens of context, pre-padded.
    token_list = pad_sequences([token_list], maxlen=max_sequence_len - 1, padding="pre")
    probs = model.predict(token_list, verbose=0)[0]
    top_idx = np.argsort(probs)[-k:][::-1]  # indices of the k highest probabilities
    return [(tokenizer.index_word.get(int(i), ""), float(probs[i])) for i in top_idx]

print(next_word_topk("what a piece of work", k=10))
```
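Generating multiple words is just a loop that appends the predicted word and re-predicts. The sketch below abstracts the model behind any `predict_top1` callable (a hypothetical stub stands in for the LSTM here, so the loop's mechanics are clear without downloading the model):

```python
# Sketch of greedy multi-word generation on top of a next-word predictor.
# `predict_top1` is any callable mapping seed text -> (word, prob).
def generate(seed_text, n_words, predict_top1):
    text = seed_text
    for _ in range(n_words):
        word, _prob = predict_top1(text)
        if not word:               # unknown index -> stop early
            break
        text += " " + word
    return text

# Hypothetical stub predictor for illustration (the real app uses the LSTM):
stub = lambda t: ("that" if t.endswith("work") else "is", 1.0)
print(generate("what a piece of work", 3, stub))
```

With the real model you would pass a wrapper around `next_word_topk` (e.g., taking the top-1 entry), or plug in the sampler described under Output for more varied text.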
License: apache-2.0