hblim committed
Commit 30984a9 · Parent: a6576f0

Add Spaces front-matter to README

Files changed (1): README.md (+50 -39)
README.md CHANGED
@@ -1,9 +1,20 @@
-# Reddit Sentiment Pipeline
+---
+title: "Reddit Sentiment Tracker"
+emoji: "📰"
+colorFrom: "indigo"
+colorTo: "red"
+sdk: "streamlit"
+sdk_version: "1.44.1"
+app_file: "app.py"
+pinned: false
+---
 
-[![CI Status](https://github.com/halstonblim/reddit_sentiment_pipeline/actions/workflows/daily.yml/badge.svg)](https://github.com/halstonblim/reddit_sentiment_pipeline/actions/workflows/daily.yml)
-[![Streamlit App](https://img.shields.io/badge/demo-streamlit-ff4b4b?logo=streamlit)](https://redditsentimentpipeline.streamlit.app/)
+# Reddit Sentiment Pipeline
 
-A fully‑automated **end‑to‑end MLOps** pipeline that tracks daily sentiment trends on Reddit, scores posts with a transformer‑based model served from Replicate, summarizes the results, and publishes an interactive Streamlit dashboard—all orchestrated by GitHub Actions.
+[![CI Status](https://github.com/halstonblim/reddit_sentiment_pipeline/actions/workflows/daily.yml/badge.svg)](https://github.com/halstonblim/reddit_sentiment_pipeline/actions/workflows/daily.yml)
+[![Streamlit App](https://img.shields.io/badge/demo-streamlit-ff4b4b?logo=streamlit)](https://redditsentimentpipeline.streamlit.app/)
+
+A fully‑automated **end‑to‑end MLOps** pipeline that tracks daily sentiment trends on Reddit, scores posts with a transformer‑based model served from Replicate, summarizes the results, and publishes an interactive Streamlit dashboard—all orchestrated by GitHub Actions.
 
 ***Analyzing the Public Discourse of AI News***
 
@@ -20,7 +31,7 @@ We use the [DistilBERT sentiment analysis model](https://github.com/halstonblim/
 
 ---
 
-## Table of Contents
+## Table of Contents
 1. [Project Structure](#project-structure)
 2. [Installation & Quick start](#installation)
 3. [Configuration](#configuration)
@@ -35,31 +46,31 @@ We use the [DistilBERT sentiment analysis model](https://github.com/halstonblim/
 
 ## Project Structure
 
-````text
+```text
 reddit_sentiment_pipeline/
-├── reddit_analysis/   # Back‑end
-   ├── __init__.py
-   ├── scraper/
-      └── scrape.py   # Collect raw posts → HF dataset (data_raw)
-   ├── inference/
-      └── score.py   # Call Replicate model → adds sentiment scores
-   ├── summarizer/
-      └── summarize.py   # Aggregate + export CSV summaries (data_scored)
-   ├── config_utils.py   # Secrets & YAML helper
-   ├── tests/ # Pytest test-suite
+├── reddit_analysis/ # Back‑end
+   ├── __init__.py
+   ├── scraper/
+      └── scrape.py # Collect raw posts → HF dataset (data_raw)
+   ├── inference/
+      └── score.py # Call Replicate model → adds sentiment scores
+   ├── summarizer/
+      └── summarize.py # Aggregate + export CSV summaries (data_scored)
+   ├── config_utils.py # Secrets & YAML helper
+   ├── tests/ # Pytest test-suite
 |
-├── streamlit_app/   # Front‑end
-   └── app.py
+├── streamlit_app/ # Front‑end
+   └── app.py
 
 ├── .github/
-   └── workflows/
-      ├── daily.yml   # Cron‑triggered ETL + summarize
+   └── workflows/
+      ├── daily.yml # Cron‑triggered ETL + summarize
 
-├── config.yaml   # Default runtime config (subreddits, models …)
+├── config.yaml # Default runtime config (subreddits, models …)
 ├── requirements.txt # requirements for front end only
 ├── requirements-dev.txt # requirements for local development
 └── README.md
-````
+```
 
 ### Automated Workflow
 ```
@@ -112,15 +123,15 @@ Once those are configured you can run the following which should scrape Reddit,
 pip install -r requirements-dev.txt
 
 # Run the full pipeline for today
-$ python -m reddit_analysis.scraper.scrape   --date $(date +%F)
-$ python -m reddit_analysis.inference.score  --date $(date +%F)
+$ python -m reddit_analysis.scraper.scrape --date $(date +%F)
+$ python -m reddit_analysis.inference.score --date $(date +%F)
 $ python -m reddit_analysis.summarizer.summarize --date $(date +%F)
 ```
 ---
 
 ## Configuration
 
-All non‑secret settings live in **`config.yaml`**; sensitive tokens are supplied via environment variables or a `.env` file.
+All non‑secret settings live in **`config.yaml`**; sensitive tokens are supplied via environment variables or a `.env` file.
 
 ```yaml
 # config.yaml (excerpt)
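
The hunk ends just as the `config.yaml` excerpt opens, so the excerpt itself is not visible here. From keys this README does confirm (a `subreddits:` list, visible in the next hunk's context, and a `replicate_model` setting named under *Extending / Customizing*), a plausible shape is sketched below; every other key and value is an illustrative assumption, not the repo's actual config:

```yaml
# Hypothetical sketch of config.yaml; only the subreddits list and
# replicate_model key are confirmed by this README. Values are invented.
subreddits:
  - LocalLLaMA          # assumed example entry
  - artificial          # assumed example entry
posts_per_day: 50       # assumed name for the top-N scrape limit
replicate_model: "owner/sentiment-model:versionhash"  # assumed identifier format
```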
@@ -134,7 +145,7 @@ subreddits:
 
 | Variable | Where to set | Description |
 |----------|-------------|-------------|
-| `HF_TOKEN` | GitHub → *Settings  Secrets and variables* <br>or local `.env` | Personal access token with **write** permission to the HF dataset |
+| `HF_TOKEN` | GitHub → *Settings Secrets and variables* <br>or local `.env` | Personal access token with **write** permission to the HF dataset |
 | `REPLICATE_API_TOKEN` | same | Token to invoke the Replicate model |
 | `ENV` | optional | `local`, `ci`, `prod` – toggles logging & Streamlit behaviour |
 
@@ -142,8 +153,8 @@ subreddits:
 
 ## Backend reddit analysis
 
-### 1. `scraper.scrape`
-Collects the top *N* daily posts from each configured subreddit and appends them to a [Hugging Face **Parquet** dataset](https://huggingface.co/datasets/hblim/top_reddit_posts_daily/tree/main/data_raw) (`data_raw`).
+### 1. `scraper.scrape`
+Collects the top *N* daily posts from each configured subreddit and appends them to a [Hugging Face **Parquet** dataset](https://huggingface.co/datasets/hblim/top_reddit_posts_daily/tree/main/data_raw) (`data_raw`).
 
 ```bash
 python -m reddit_analysis.scraper.scrape \
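
For orientation, the scrape step described in this hunk reduces to: pull the top *N* daily submissions with `praw`, build a table keyed by `post_id`, and upload a Parquet file to the `data_raw` folder of the HF dataset. A minimal sketch follows, assuming hypothetical helper and environment-variable names; the `praw` and `huggingface_hub` calls themselves are standard API, but this is not the repo's actual `scrape.py`:

```python
# Illustrative sketch of the scrape step, not the repo's actual scrape.py.
import os

import pandas as pd
import praw
from huggingface_hub import HfApi

def scrape_top_posts(subreddit_name: str, limit: int) -> pd.DataFrame:
    """Collect today's top posts from one subreddit into a DataFrame."""
    reddit = praw.Reddit(
        client_id=os.environ["REDDIT_CLIENT_ID"],        # assumed env var names
        client_secret=os.environ["REDDIT_CLIENT_SECRET"],
        user_agent="reddit_sentiment_pipeline",
    )
    rows = [
        {"post_id": post.id, "subreddit": subreddit_name,
         "title": post.title, "selftext": post.selftext, "score": post.score}
        for post in reddit.subreddit(subreddit_name).top(time_filter="day", limit=limit)
    ]
    return pd.DataFrame(rows)

df = scrape_top_posts("LocalLLaMA", limit=50)        # assumed example subreddit
df.to_parquet("posts.parquet")
HfApi().upload_file(                                 # append to the dataset repo;
    path_or_fileobj="posts.parquet",                 # de-dup keys on post_id server-side
    path_in_repo="data_raw/2025-01-01.parquet",      # assumed file layout
    repo_id="hblim/top_reddit_posts_daily",
    repo_type="dataset",
    token=os.environ["HF_TOKEN"],
)
```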
@@ -152,12 +163,12 @@ python -m reddit_analysis.scraper.scrape \
 --overwrite # re‑upload if already exists
 ```
 
-* **Dependencies:** [`praw`](https://praw.readthedocs.io/), `huggingface‑hub`
+* **Dependencies:** [praw](https://praw.readthedocs.io/), `huggingface‑hub`
 * **De‑duplication:** handled server‑side via dataset row `post_id` as primary key—**no local state needed**.
 
 ---
 
-### 2. `inference.score`
+### 2. `inference.score`
 Downloads one day of raw posts, sends raw text consisting of `title + selftext` to the **Replicate** hosted model in batches for optimized parallel computation, and pushes a scored Parquet file to a separate [Hugging Face **Parquet** dataset](https://huggingface.co/datasets/hblim/top_reddit_posts_daily/tree/main/data_scored) `data_scored`.
 
 ```bash
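
The batching described here, plus the retry behaviour called out in the next hunk (automatic back-off for `httpx.RemoteProtocolError`), fit a standard pattern. A hedged sketch with an invented model identifier, batch size, and input schema; only `replicate.run` and the `httpx` exception type are real API:

```python
# Illustrative sketch of batched Replicate scoring with back-off; the model
# identifier, batch size, and input schema are assumptions.
import time

import httpx
import replicate

MODEL = "hblim/distilbert-sentiment:0123abcd"   # hypothetical model reference

def score_batch(texts: list[str], max_retries: int = 3):
    """Send one batch of texts to the hosted model, retrying dropped connections."""
    for attempt in range(max_retries):
        try:
            return replicate.run(MODEL, input={"texts": texts})
        except httpx.RemoteProtocolError:
            time.sleep(2 ** attempt)            # exponential back-off: 1s, 2s, 4s
    raise RuntimeError("Replicate call failed after retries")

# title + selftext concatenated, as the hunk above describes
posts = ["Post title. Post body.", "Another title. Another body."]
batch_size = 8                                  # assumed
scores = []
for start in range(0, len(posts), batch_size):
    scores.extend(score_batch(posts[start:start + batch_size]))
```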
@@ -170,7 +181,7 @@ python -m reddit_analysis.inference.score \
 * **Retry logic:** automatic back‑off for `httpx.RemoteProtocolError`.
 ---
 
-### 3. `summarizer.summarize`
+### 3. `summarizer.summarize`
 Aggregates daily sentiment by subreddit (mean & weighted means) and writes a compact CSV plus a Parquet summary.
 
 ```bash
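
The aggregation this hunk describes (per-subreddit mean and a weighted mean, with the explicit date + subreddit sort noted in the next hunk) can be sketched as follows; the column names and the choice of upvotes as weights are assumptions:

```python
# Illustrative sketch of the daily summary aggregation; column names
# and the use of upvotes as weights are assumptions.
import numpy as np
import pandas as pd

scored = pd.DataFrame({                          # stand-in for one day of scored posts
    "date": ["2025-01-01"] * 4,
    "subreddit": ["LocalLLaMA", "LocalLLaMA", "artificial", "artificial"],
    "sentiment": [0.9, 0.4, 0.7, 0.2],
    "score": [120, 30, 50, 10],                  # Reddit upvotes, used as weights
})

summary = (
    scored.groupby(["date", "subreddit"], sort=False)   # rely on the explicit sort below
          .apply(lambda g: pd.Series({
              "mean_sentiment": g["sentiment"].mean(),
              "weighted_sentiment": np.average(g["sentiment"], weights=g["score"]),
          }))
          .reset_index()
          .sort_values(["date", "subreddit"])    # explicit, deterministic ordering
)
summary.to_csv("summary.csv", index=False)
```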
@@ -179,20 +190,20 @@ python -m reddit_analysis.summarizer.summarize \
 --output_format csv parquet
 ```
 
-* **Uses `pandas` `groupby` (no default sorting—explicitly sorts by date + subreddit).**
+* **Uses `pandas` `groupby` (no default sorting—explicitly sorts by date + subreddit).**
 * **Exports** are placed under `data_summary/` in the same HF dataset repo.
 
 ---
 
 ## Unit tests
 
-The backend test‑suite lives in `reddit_analysis/tests/` and can be executed with **pytest**:
+The backend test‑suite lives in `reddit_analysis/tests/` and can be executed with **pytest**:
 
 ```bash
 pytest -q
 ```
 
-| File | What it tests | Key fixtures / mocks |
+| File | What it tests | Key fixtures / mocks |
 |------|--------------|----------------------|
 | `tests/scraper/test_scrape.py` | Reddit fetch logic, de‑duplication rules | `praw.Reddit`, `huggingface_hub.HfApi` mocked via `monkeypatch` |
 | `tests/inference/test_score.py` | Batching, error handling when HF token missing | Fake Replicate API via `httpx.MockTransport` |
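
The mocking strategy the table names (patching `praw.Reddit` via `monkeypatch` so no network is touched) follows the usual pytest pattern. A minimal sketch, with an invented function under test standing in for the repo's real scraper entry point:

```python
# Illustrative monkeypatch-style test; scrape_top_posts is a hypothetical
# helper, not necessarily the repo's real scraper API.
from reddit_analysis.scraper import scrape

class FakeSubmission:
    def __init__(self):
        self.id, self.title, self.selftext, self.score = "abc123", "hello", "", 1

class FakeReddit:
    def __init__(self, *args, **kwargs):
        pass
    def subreddit(self, name):
        return self
    def top(self, time_filter="day", limit=None):
        return [FakeSubmission()]

def test_scrape_builds_rows(monkeypatch):
    monkeypatch.setattr("praw.Reddit", FakeReddit)   # no real Reddit call
    df = scrape.scrape_top_posts("LocalLLaMA", limit=1)
    assert df.loc[0, "post_id"] == "abc123"
```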
@@ -222,15 +233,15 @@ streamlit run streamlit_app/app.py
 
 | Step | What it does |
 |------|--------------|
-| **Setup** | Checkout repo, install Python 3.12, cache pip deps |
+| **Setup** | Checkout repo, install Python 3.12, cache pip deps |
 | **Scrape** | `python -m reddit_analysis.scraper.scrape --date $DATE` |
 | **Score** | `python -m reddit_analysis.inference.score --date $DATE` |
 | **Summarize** | `python -m reddit_analysis.summarizer.summarize --date $DATE` |
 | **Tests** | `pytest -q` |
 
-*Trigger:* `cron: "0 21 * * *"` → 4pm America/Chicago every day.
+*Trigger:* `cron: "0 21 * * *"` → 4 pm America/Chicago every day.
 
-Secrets (`HF_TOKEN`, `REPLICATE_API_TOKEN`) are injected via **repository secrets** so the workflow can push to Hugging Face and call Replicate. The runner is completely stateless—every job starts on a fresh VM and writes data only to external storage (HF dataset).
+Secrets (`HF_TOKEN`, `REPLICATE_API_TOKEN`) are injected via **repository secrets** so the workflow can push to Hugging Face and call Replicate. The runner is completely stateless—every job starts on a fresh VM and writes data only to external storage (HF dataset).
 
 ---
 
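Putting this hunk's step table, cron trigger, and secrets paragraph together, `daily.yml` plausibly looks like the sketch below; the step order, schedule, and commands come from this README, while the job name and action versions are assumptions:

```yaml
# Hypothetical reconstruction of .github/workflows/daily.yml from the
# table above; job name and action versions are assumptions.
name: daily
on:
  schedule:
    - cron: "0 21 * * *"        # 21:00 UTC = 4 pm America/Chicago (during DST)
jobs:
  pipeline:
    runs-on: ubuntu-latest      # fresh VM every run; no local state survives
    env:
      HF_TOKEN: ${{ secrets.HF_TOKEN }}
      REPLICATE_API_TOKEN: ${{ secrets.REPLICATE_API_TOKEN }}
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
          cache: pip
      - run: pip install -r requirements-dev.txt
      - run: python -m reddit_analysis.scraper.scrape --date $(date +%F)
      - run: python -m reddit_analysis.inference.score --date $(date +%F)
      - run: python -m reddit_analysis.summarizer.summarize --date $(date +%F)
      - run: pytest -q
```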
@@ -250,7 +261,7 @@ Example of failure state:
 
 ## Extending / Customizing
 
-* **Change subreddits** – edit the list in `config.yaml` or pass `--subreddits` to the scraper.
-* **Swap sentiment models** – point `replicate_model` to any text‑classification model on Replicate with single‑sentence input.
+* **Change subreddits** – edit the list in `config.yaml` or pass `--subreddits` to the scraper.
+* **Swap sentiment models** – point `replicate_model` to any text‑classification model on Replicate with single‑sentence input.
 * **Augment summaries** – create additional aggregator modules (e.g. keyword extraction) and add a new step in `daily.yml`.
 