Update README.md
README.md
CHANGED
|
@@ -1,3 +1,102 @@
|
|
| 1 |
-
---
|
| 2 |
-
|
| 3 |
-
-
---
language: en
license: apache-2.0
pipeline_tag: text-classification
library_name: tf-keras
tags:
- nlp
- text-classification
- sentiment-analysis
- imdb
- simplernn
- tensorflow
- keras
- streamlit
---

# 🎬 IMDB Movie Review Sentiment (SimpleRNN | Keras)

A lightweight **SimpleRNN** model trained on the **Keras IMDB** dataset to predict **movie review sentiment**.

This Hugging Face repo hosts the trained model artifact used by a Streamlit inference app.

## Training → Model → Inference
- **Training notebook (Colab):** https://colab.research.google.com/drive/14A_qc4aLvx5I0cFsK9lJYHRymGjzZIyK
- **Inference app (Streamlit):** https://github.com/sparklerz/Deep-Learning-Fundamentals-Suite
  (page: `pages/03_IMDB_Sentiment_SimpleRNN.py`)

## What’s in this repo
- `artifacts/simple_rnn_imdb.h5`: trained Keras model
- `artifacts/config.json`: key inference settings:
  - `max_features` (vocab size cap)
  - `max_len` (sequence length)
  - `threshold_default` (classification threshold)
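
As a rough illustration of how these three settings might be read and type-checked before inference, here is a minimal sketch. The values in `EXAMPLE_CONFIG` are placeholders chosen for the example, not the values shipped in `artifacts/config.json`, and `load_settings` is a hypothetical helper name.

```python
import json

# Placeholder values for illustration only (assumption); the real values
# live in artifacts/config.json.
EXAMPLE_CONFIG = '{"max_features": 10000, "max_len": 500, "threshold_default": 0.5}'


def load_settings(raw: str) -> dict:
    """Parse the config JSON and coerce each key to the type inference expects."""
    cfg = json.loads(raw)
    settings = {
        "max_features": int(cfg["max_features"]),
        "max_len": int(cfg["max_len"]),
        # Fall back to 0.5 when the key is absent.
        "threshold": float(cfg.get("threshold_default", 0.5)),
    }
    if settings["max_features"] <= 0 or settings["max_len"] <= 0:
        raise ValueError("max_features and max_len must be positive")
    if not 0.0 <= settings["threshold"] <= 1.0:
        raise ValueError("threshold must lie in [0, 1]")
    return settings


print(load_settings(EXAMPLE_CONFIG))
```

Validating once at load time keeps a malformed config from surfacing later as a confusing shape or indexing error.
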

## Inputs
- A short English movie review (free text).

## Preprocessing (same as Streamlit app)
- Lowercase + tokenize with regex: `[a-z']+`
- Convert tokens to integer IDs using the **Keras IMDB word index** (`tensorflow.keras.datasets.imdb.get_word_index()`)
- Apply the standard Keras IMDB offset:
  - start token = `1`
  - unknown token = `2`
  - word indices are shifted by `+3`
- Clip words to `max_features`; anything outside becomes `2` (unknown)
- Pad/truncate to `max_len` using `pad_sequences` (`padding="pre"`, `truncating="post"`)
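
To make the offset and padding rules concrete, the steps above can be traced on a toy vocabulary without TensorFlow. This is a sketch under stated assumptions: `TOY_WORD_INDEX` is a made-up stand-in for the real IMDB word index, `encode` is a hypothetical helper, and the last two lines of the function imitate `pad_sequences` with plain lists for a single sequence.

```python
import re

# Made-up stand-in for imdb.get_word_index() (assumption, illustration only).
TOY_WORD_INDEX = {"the": 1, "movie": 2, "was": 3, "great": 4, "ending": 5}

START, UNKNOWN, OFFSET = 1, 2, 3  # Keras IMDB conventions listed above


def encode(text, word_index, max_features, max_len):
    """Turn raw text into a fixed-length list of integer IDs."""
    tokens = re.findall(r"[a-z']+", text.lower())
    seq = [START]
    for tok in tokens:
        rank = word_index.get(tok)  # None for out-of-vocabulary words
        idx = UNKNOWN if rank is None else rank + OFFSET
        if idx >= max_features:  # outside the vocabulary cap
            idx = UNKNOWN
        seq.append(idx)
    seq = seq[:max_len]  # truncating="post": drop tokens from the end
    return [0] * (max_len - len(seq)) + seq  # padding="pre": zeros in front


print(encode("The movie was GREAT, truly!", TOY_WORD_INDEX, max_features=7, max_len=8))
# -> [0, 0, 1, 4, 5, 6, 2, 2]
```

Here `great` shifts to `7`, which falls outside `max_features=7` and becomes `2`, while `truly` is not in the toy index at all and also becomes `2`.
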

## Output
- A single probability: **P(positive)** in `[0, 1]`.
- Decision rule:
  - `Positive` if `P(positive) >= threshold`
  - `Negative` otherwise
- Default threshold is read from `artifacts/config.json` (typically `0.5`).
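
The decision rule is small enough to sketch directly. The `classify` helper below is a hypothetical name used for illustration, and the probabilities are invented inputs rather than model outputs.

```python
def classify(prob_pos: float, threshold: float = 0.5) -> str:
    """Map P(positive) to a label; ties at the threshold count as Positive."""
    if not 0.0 <= prob_pos <= 1.0:
        raise ValueError("prob_pos must lie in [0, 1]")
    return "Positive" if prob_pos >= threshold else "Negative"


# Invented probabilities, not real model outputs.
for p in (0.91, 0.50, 0.49):
    print(p, classify(p))

# Raising the threshold makes the Positive label stricter.
print(classify(0.60, threshold=0.70))  # -> Negative
```
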

## Quickstart (load + predict)
```python
import json
import re

import tensorflow as tf
from huggingface_hub import hf_hub_download
from tensorflow.keras.datasets import imdb
from tensorflow.keras.preprocessing.sequence import pad_sequences

REPO_ID = "ash001/imdb-sentiment-simple-rnn"

# Load model + config
model_path = hf_hub_download(REPO_ID, "artifacts/simple_rnn_imdb.h5")
cfg_path = hf_hub_download(REPO_ID, "artifacts/config.json")
with open(cfg_path, "r") as f:
    cfg = json.load(f)

# compile=False: inference only, so optimizer and loss state are not restored
model = tf.keras.models.load_model(model_path, compile=False)
word_index = imdb.get_word_index()

max_features = int(cfg["max_features"])
max_len = int(cfg["max_len"])
threshold = float(cfg.get("threshold_default", 0.5))

def text_to_sequence(text: str):
    text = text.lower()
    tokens = re.findall(r"[a-z']+", text)

    seq = [1]  # start token
    for w in tokens:
        rank = word_index.get(w)  # None for out-of-vocabulary words
        # Shift known words by +3; unknown words map to 2 without the shift
        idx = 2 if rank is None else rank + 3
        if idx >= max_features:
            idx = 2  # outside the vocabulary cap -> unknown
        seq.append(idx)

    return pad_sequences([seq], maxlen=max_len, truncating="post", padding="pre")

text = "This movie was surprisingly good, with great acting and a strong ending."
X = text_to_sequence(text)

prob_pos = float(model.predict(X, verbose=0).reshape(-1)[0])
label = "Positive" if prob_pos >= threshold else "Negative"
print("P(positive) =", prob_pos, "|", label)
```