Commit 28a28f6 · verified · committed by ash001 · 1 parent: 0054658

Update README.md

Files changed (1):
  1. README.md +102 -3
README.md CHANGED
@@ -1,3 +1,102 @@
- ---
- license: apache-2.0
- ---
+ ---
+ language: en
+ license: apache-2.0
+ pipeline_tag: text-classification
+ library_name: tf-keras
+ tags:
+ - nlp
+ - text-classification
+ - sentiment-analysis
+ - imdb
+ - simplernn
+ - tensorflow
+ - keras
+ - streamlit
+ ---
+
+ # 🎬 IMDB Movie Review Sentiment (SimpleRNN | Keras)
+
+ A lightweight **SimpleRNN** model trained on the **Keras IMDB** dataset to predict **movie review sentiment**.
+
+ This Hugging Face repo hosts the trained model artifact used by a Streamlit inference app.
+
+ ## Training → Model → Inference
+ - **Training notebook (Colab):** https://colab.research.google.com/drive/14A_qc4aLvx5I0cFsK9lJYHRymGjzZIyK
+ - **Inference app (Streamlit):** https://github.com/sparklerz/Deep-Learning-Fundamentals-Suite
+   (page: `pages/03_IMDB_Sentiment_SimpleRNN.py`)
+
+ ## What’s in this repo
+ - `artifacts/simple_rnn_imdb.h5`: trained Keras model
+ - `artifacts/config.json`: key inference settings:
+   - `max_features` (vocab size cap)
+   - `max_len` (sequence length)
+   - `threshold_default` (classification threshold)
+
+ ## Inputs
+ - A short English movie review (free text).
+
+ ## Preprocessing (same as Streamlit app)
+ - Lowercase the text and tokenize with the regex `[a-z']+`
+ - Convert tokens to integer IDs using the **Keras IMDB word index** (`tensorflow.keras.datasets.imdb.get_word_index()`)
+ - Apply the standard Keras IMDB offset:
+   - start token = `1`
+   - unknown token = `2`
+   - word indices are shifted by `+3`
+ - Replace any shifted index `>= max_features` with `2` (unknown)
+ - Pad/truncate to `max_len` using `pad_sequences` (`padding="pre"`, `truncating="post"`)
+
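The preprocessing steps listed above can be sketched in plain Python without downloading anything. This is only an illustration: `TOY_WORD_INDEX`, `encode`, and the `max_features` / `max_len` values are hypothetical stand-ins for the real Keras word index and the settings in `artifacts/config.json`, and the last two lines of `encode` imitate what `pad_sequences` does with `truncating="post"` and `padding="pre"`.

```python
import re

# Hypothetical toy word index (word -> 1-based frequency rank), standing in
# for tensorflow.keras.datasets.imdb.get_word_index().
TOY_WORD_INDEX = {"the": 1, "and": 2, "movie": 3, "great": 4, "acting": 5}

PAD, START, UNKNOWN, OFFSET = 0, 1, 2, 3


def encode(text, word_index, max_features, max_len):
    """Encode free text as a fixed-length list of integer IDs."""
    tokens = re.findall(r"[a-z']+", text.lower())
    seq = [START]
    for word in tokens:
        rank = word_index.get(word)
        # Known words are shifted by +3; unseen words become UNKNOWN.
        idx = UNKNOWN if rank is None else rank + OFFSET
        # Words beyond the vocabulary cap also become UNKNOWN.
        if idx >= max_features:
            idx = UNKNOWN
        seq.append(idx)
    seq = seq[:max_len]  # truncating="post": drop tokens from the end
    return [PAD] * (max_len - len(seq)) + seq  # padding="pre": zeros in front


encoded = encode("The movie was GREAT!", TOY_WORD_INDEX, max_features=7, max_len=6)
print(encoded)  # [0, 1, 4, 6, 2, 2]
```

In this toy run, "was" is missing from the index and "great" shifts to `7`, which hits the `max_features=7` cap, so both end up as `2`.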
+ ## Output
+ - A single probability: **P(positive)** in `[0, 1]`.
+ - Decision rule:
+   - `Positive` if `P(positive) >= threshold`
+   - `Negative` otherwise
+ - Default threshold is read from `artifacts/config.json` (typically `0.5`).
+
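The decision rule above is small enough to isolate in a helper. The sketch below is an assumption-laden illustration: `decide` is a hypothetical name, and the `cfg` dict holds a placeholder value rather than the actual contents of `artifacts/config.json`.

```python
# Placeholder standing in for the parsed artifacts/config.json;
# the real file may hold a different threshold.
cfg = {"threshold_default": 0.5}


def decide(prob_pos, threshold):
    """Map P(positive) to a label; a probability equal to the threshold counts as Positive."""
    if not 0.0 <= prob_pos <= 1.0:
        raise ValueError("prob_pos must be a probability in [0, 1]")
    return "Positive" if prob_pos >= threshold else "Negative"


threshold = float(cfg.get("threshold_default", 0.5))
labels = [decide(p, threshold) for p in (0.12, 0.5, 0.93)]
print(labels)  # ['Negative', 'Positive', 'Positive']
```

Raising the threshold makes the `Positive` label stricter, which can be useful when false positives are more costly than false negatives.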
+ ## Quickstart (load + predict)
+ ```python
+ import json
+ import re
+ import tensorflow as tf
+ from huggingface_hub import hf_hub_download
+ from tensorflow.keras.preprocessing.sequence import pad_sequences
+ from tensorflow.keras.datasets import imdb
+
+ REPO_ID = "ash001/imdb-sentiment-simple-rnn"
+
+ # Load model + config
+ model_path = hf_hub_download(REPO_ID, "artifacts/simple_rnn_imdb.h5")
+ cfg_path = hf_hub_download(REPO_ID, "artifacts/config.json")
+ with open(cfg_path, "r") as f:
+     cfg = json.load(f)
+
+ model = tf.keras.models.load_model(model_path, compile=False)
+ word_index = imdb.get_word_index()  # word -> rank, before the +3 offset
+
+ max_features = int(cfg["max_features"])
+ max_len = int(cfg["max_len"])
+ threshold = float(cfg.get("threshold_default", 0.5))
+
+ def text_to_sequence(text: str):
+     tokens = re.findall(r"[a-z']+", text.lower())
+
+     seq = [1]  # start token
+     for w in tokens:
+         rank = word_index.get(w)
+         idx = 2 if rank is None else rank + 3  # unseen words map to 2 (unknown)
+         if idx >= max_features:
+             idx = 2  # outside the vocab cap: also unknown
+         seq.append(idx)
+
+     return pad_sequences([seq], maxlen=max_len, truncating="post", padding="pre")
+
+ text = "This movie was surprisingly good, with great acting and a strong ending."
+ X = text_to_sequence(text)
+
+ prob_pos = float(model.predict(X, verbose=0).reshape(-1)[0])
+ label = "Positive" if prob_pos >= threshold else "Negative"
+ print("P(positive) =", prob_pos, "|", label)
+ ```
+
+ ---
+ license: apache-2.0
+ ---