Add Spaces front-matter to README

README.md (changed)
@@ -1,9 +1,20 @@
---
title: "Reddit Sentiment Tracker"
emoji: "📰"
colorFrom: "indigo"
colorTo: "red"
sdk: "streamlit"
sdk_version: "1.44.1"
app_file: "app.py"
pinned: false
---

# Reddit Sentiment Pipeline

[](https://github.com/halstonblim/reddit_sentiment_pipeline/actions/workflows/daily.yml)
[](https://redditsentimentpipeline.streamlit.app/)

A fully‑automated **end‑to‑end MLOps** pipeline that tracks daily sentiment trends on Reddit, scores posts with a transformer‑based model served from Replicate, summarizes the results, and publishes an interactive Streamlit dashboard—all orchestrated by GitHub Actions.

***Analyzing the Public Discourse of AI News***
@@ -20,7 +31,7 @@ We use the [DistilBERT sentiment analysis model](https://github.com/halstonblim/

---

## Table of Contents
1. [Project Structure](#project-structure)
2. [Installation & Quick start](#installation)
3. [Configuration](#configuration)
@@ -35,31 +46,31 @@ We use the [DistilBERT sentiment analysis model](https://github.com/halstonblim/

## Project Structure

```text
reddit_sentiment_pipeline/
├── reddit_analysis/            # Back‑end
│   ├── __init__.py
│   ├── scraper/
│   │   └── scrape.py           # Collect raw posts → HF dataset (data_raw)
│   ├── inference/
│   │   └── score.py            # Call Replicate model → adds sentiment scores
│   ├── summarizer/
│   │   └── summarize.py        # Aggregate + export CSV summaries (data_scored)
│   ├── config_utils.py         # Secrets & YAML helper
│   ├── tests/                  # Pytest test-suite
│
├── streamlit_app/              # Front‑end
│   └── app.py
│
├── .github/
│   └── workflows/
│       ├── daily.yml           # Cron‑triggered ETL + summarize
│
├── config.yaml                 # Default runtime config (subreddits, models …)
├── requirements.txt            # requirements for front end only
├── requirements-dev.txt        # requirements for local development
└── README.md
```

### Automated Workflow
```
…
```
@@ -112,15 +123,15 @@ Once those are configured you can run the following which should scrape Reddit,
```bash
pip install -r requirements-dev.txt

# Run the full pipeline for today
$ python -m reddit_analysis.scraper.scrape --date $(date +%F)
$ python -m reddit_analysis.inference.score --date $(date +%F)
$ python -m reddit_analysis.summarizer.summarize --date $(date +%F)
```
---

## Configuration

All non‑secret settings live in **`config.yaml`**; sensitive tokens are supplied via environment variables or a `.env` file.

```yaml
# config.yaml (excerpt)
# … (remaining keys, including the subreddits list, elided in this diff)
```
@@ -134,7 +145,7 @@ subreddits:

| Variable | Where to set | Description |
|----------|-------------|-------------|
| `HF_TOKEN` | GitHub → *Settings › Secrets and variables* <br>or local `.env` | Personal access token with **write** permission to the HF dataset |
| `REPLICATE_API_TOKEN` | same | Token to invoke the Replicate model |
| `ENV` | optional | `local`, `ci`, `prod` – toggles logging & Streamlit behaviour |
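The project tree above describes `config_utils.py` as the secrets & YAML helper. As a rough sketch of that split (the function names here are invented for illustration; the repo's actual helper may differ):

```python
# Hypothetical sketch of the YAML/secrets split (names invented for
# illustration; see reddit_analysis/config_utils.py for the real code).
import os
from pathlib import Path

import yaml                      # pyyaml
from dotenv import load_dotenv   # python-dotenv, used for local .env files


def load_config(path: str = "config.yaml") -> dict:
    """Read non-secret runtime settings (subreddits, model name, ...)."""
    return yaml.safe_load(Path(path).read_text(encoding="utf-8"))


def get_secret(name: str) -> str:
    """Fetch a token such as HF_TOKEN from the environment or a .env file."""
    load_dotenv()  # harmless no-op when no .env file exists
    value = os.getenv(name)
    if value is None:
        raise RuntimeError(f"{name} is not set (see the table above)")
    return value
```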
@@ -142,8 +153,8 @@ subreddits:

## Backend Reddit analysis

### 1. `scraper.scrape`
Collects the top *N* daily posts from each configured subreddit and appends them to a [Hugging Face **Parquet** dataset](https://huggingface.co/datasets/hblim/top_reddit_posts_daily/tree/main/data_raw) (`data_raw`).

```bash
python -m reddit_analysis.scraper.scrape \
  # … (other flags elided in this diff)
  --overwrite          # re‑upload if already exists
```

* **Dependencies:** [praw](https://praw.readthedocs.io/), `huggingface-hub`
* **De‑duplication:** handled server‑side via dataset row `post_id` as primary key—**no local state needed**.
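The de‑duplication guarantee can be pictured with a small sketch (illustrative only, not the repo's actual `scrape.py`; it assumes pandas DataFrames with a `post_id` column):

```python
# Illustrative only. Shows why re-running the scraper is safe: rows are
# keyed on post_id, so duplicates collapse instead of accumulating.
import pandas as pd


def append_posts(existing: pd.DataFrame, fresh: pd.DataFrame) -> pd.DataFrame:
    """Merge freshly scraped rows into the day's table without duplicates."""
    combined = pd.concat([existing, fresh], ignore_index=True)
    # keep="last" lets a re-scrape refresh fields (e.g. upvotes) for a post
    return combined.drop_duplicates(subset="post_id", keep="last")
```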

---

### 2. `inference.score`
Downloads one day of raw posts, sends the combined `title + selftext` text to the **Replicate**‑hosted model in batches so posts are scored in parallel, and pushes a scored Parquet file to a separate [Hugging Face **Parquet** dataset](https://huggingface.co/datasets/hblim/top_reddit_posts_daily/tree/main/data_scored) (`data_scored`).

```bash
python -m reddit_analysis.inference.score \
  # … (flags elided in this diff)
```

* **Retry logic:** automatic back‑off for `httpx.RemoteProtocolError`.
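Batching and the back‑off just mentioned combine roughly as in this sketch (hypothetical names; `score_batch` stands in for the actual Replicate call made by `inference/score.py`):

```python
# Sketch of batching + exponential back-off (hypothetical helper names).
import time

import httpx


def score_all(texts, score_batch, batch_size=32, max_retries=3):
    """Score texts in fixed-size batches, retrying transient HTTP errors."""
    results = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        for attempt in range(max_retries):
            try:
                results.extend(score_batch(batch))
                break
            except httpx.RemoteProtocolError:
                if attempt == max_retries - 1:
                    raise
                time.sleep(2 ** attempt)  # 1s, 2s, 4s, ...
    return results
```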
---

### 3. `summarizer.summarize`
Aggregates daily sentiment by subreddit (mean & weighted means) and writes a compact CSV plus a Parquet summary.

```bash
python -m reddit_analysis.summarizer.summarize \
  # … (other flags elided in this diff)
  --output_format csv parquet
```

* **Uses** `pandas` `groupby` (no default sorting—explicitly sorts by date + subreddit).
* **Exports** are placed under `data_summary/` in the same HF dataset repo.
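A minimal sketch of the aggregation step, under stated assumptions: the scored file lives at `data_scored/<date>.parquet`, the columns are named `subreddit`, `sentiment`, and `upvotes`, and the weighted mean is upvote‑weighted. The repo's `summarize.py` is authoritative:

```python
# Sketch of the daily aggregation (column names and file layout assumed).
import pandas as pd


def summarize_day(date: str) -> pd.DataFrame:
    scored = pd.read_parquet(  # hf:// paths need huggingface_hub installed
        f"hf://datasets/hblim/top_reddit_posts_daily/data_scored/{date}.parquet"
    )
    scored["weighted"] = scored["sentiment"] * scored["upvotes"]
    grouped = scored.groupby("subreddit")
    summary = pd.DataFrame({
        "mean_sentiment": grouped["sentiment"].mean(),
        "weighted_sentiment": grouped["weighted"].sum() / grouped["upvotes"].sum(),
    }).reset_index()
    # groupby output order is an implementation detail; sort explicitly
    return summary.assign(date=date).sort_values(["date", "subreddit"])
```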
---

## Unit tests

The backend test‑suite lives in `reddit_analysis/tests/` and can be executed with **pytest**:

```bash
pytest -q
```

| File | What it tests | Key fixtures / mocks |
|------|--------------|----------------------|
| `tests/scraper/test_scrape.py` | Reddit fetch logic, de‑duplication rules | `praw.Reddit`, `huggingface_hub.HfApi` mocked via `monkeypatch` |
| `tests/inference/test_score.py` | Batching, error handling when HF token missing | Fake Replicate API via `httpx.MockTransport` |
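For instance, the fake Replicate API in the last row can be built entirely from `httpx` primitives. This illustrative test is not copied from the repo:

```python
# Illustrative use of httpx.MockTransport: the handler plays the role of
# the Replicate endpoint, so no request leaves the process during tests.
import httpx


def test_fake_replicate_roundtrip():
    def handler(request: httpx.Request) -> httpx.Response:
        return httpx.Response(200, json={"output": [0.93]})

    client = httpx.Client(
        transport=httpx.MockTransport(handler),
        base_url="https://api.replicate.com",
    )
    resp = client.post("/v1/predictions", json={"input": {"text": "great news"}})
    assert resp.json()["output"] == [0.93]
```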
@@ -222,15 +233,15 @@ streamlit run streamlit_app/app.py

| Step | What it does |
|------|--------------|
| **Setup** | Checkout repo, install Python 3.12, cache pip deps |
| **Scrape** | `python -m reddit_analysis.scraper.scrape --date $DATE` |
| **Score** | `python -m reddit_analysis.inference.score --date $DATE` |
| **Summarize** | `python -m reddit_analysis.summarizer.summarize --date $DATE` |
| **Tests** | `pytest -q` |

*Trigger:* `cron: "0 21 * * *"` → 21:00 UTC every day (4 pm America/Chicago while daylight‑saving time is in effect).

Secrets (`HF_TOKEN`, `REPLICATE_API_TOKEN`) are injected via **repository secrets** so the workflow can push to Hugging Face and call Replicate. The runner is completely stateless—every job starts on a fresh VM and writes data only to external storage (HF dataset).

---
@@ -250,7 +261,7 @@ Example of failure state:

## Extending / Customizing

* **Change subreddits** – edit the list in `config.yaml` or pass `--subreddits` to the scraper.
* **Swap sentiment models** – point `replicate_model` to any text‑classification model on Replicate with single‑sentence input.
* **Augment summaries** – create additional aggregator modules (e.g. keyword extraction, sketched below) and add a new step in `daily.yml`.
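For the last bullet, a hypothetical starting point might look like the following. The module path, column names, and bag‑of‑words approach are all invented for illustration; the sketch only mirrors the summarizer's read‑aggregate‑export shape:

```python
# Hypothetical reddit_analysis/keywords/extract.py — one way an extra
# aggregator could look. Column names and the word-count approach are
# assumptions, not part of the existing pipeline.
from collections import Counter

import pandas as pd

STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "for", "on"}


def top_keywords(posts: pd.DataFrame, n: int = 10) -> pd.DataFrame:
    """Most frequent non-stopword title words per subreddit."""
    rows = []
    for subreddit, group in posts.groupby("subreddit"):
        words = " ".join(group["title"]).lower().split()
        counts = Counter(w for w in words if w.isalpha() and w not in STOPWORDS)
        rows += [{"subreddit": subreddit, "keyword": w, "count": c}
                 for w, c in counts.most_common(n)]
    return pd.DataFrame(rows)
```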