---
title: "Reddit Sentiment Tracker"
emoji: "📰"
colorFrom: "indigo"
colorTo: "red"
sdk: "streamlit"
sdk_version: "1.44.1"
app_file: "frontend/app.py"
pinned: false
---
# Reddit Sentiment Pipeline
[Daily pipeline workflow](https://github.com/halstonblim/reddit_sentiment_pipeline/actions/workflows/daily.yml)
[Live dashboard](https://redditsentimentpipeline.streamlit.app/)
A fully‑automated **end‑to‑end MLOps** pipeline that tracks daily sentiment trends on Reddit, scores posts with a transformer‑based model served from Replicate, summarizes the results, and publishes an interactive Streamlit dashboard, all orchestrated by GitHub Actions.

***Analyzing the Public Discourse of AI News***
The pipeline is currently configured in `config.yaml` to analyze AI news discourse across `r/artificial`, `r/LocalLLaMA`, `r/singularity`, and `r/OpenAI`. The data is persisted across three steps:
1. **Scrapes** new submissions from a configurable list of subreddits (→ `data_raw/`)
2. **Classifies** each post with a sentiment model served on Replicate (→ `data_scored/`)
3. **Summarizes** daily trends for lightweight front-end consumption (→ `daily_summary/`)

More information on the data can be found in the Hugging Face dataset repo [hblim/top_reddit_posts_daily](https://huggingface.co/datasets/hblim/top_reddit_posts_daily).
***Sentiment Analysis***

We use the [DistilBERT sentiment analysis model](https://github.com/halstonblim/batch-bert-sentiment), which is wrapped with Cog for easy deployment on Replicate. The model handles batched input texts in a single API call, which improves performance by parallelizing computation on the GPU.
---

## Table of Contents

1. [Project Structure](#project-structure)
2. [Installation & Quick start](#installation)
3. [Configuration](#configuration)
4. [Back-end reddit_analysis](#backend-reddit-analysis)
5. [Unit tests](#unit-tests)
6. [Front-end Streamlit](#front-end-streamlit)
7. [CI/CD & GitHub Actions](#cicd-github-actions)
8. [Monitoring with Grafana/Prometheus](#monitoring-with-grafanaprometheus)
9. [Extending / Customizing](#extending--customizing)

---
## Project Structure

```text
reddit_sentiment_pipeline/
├── reddit_analysis/          # Back‑end
│   ├── __init__.py
│   ├── scraper/
│   │   └── scrape.py         # Collect raw posts → HF dataset (data_raw)
│   ├── inference/
│   │   └── score.py          # Call Replicate model → adds sentiment scores
│   ├── summarizer/
│   │   └── summarize.py      # Aggregate + export CSV summaries (data_scored)
│   ├── config_utils.py       # Secrets & YAML helper
│   └── tests/                # Pytest test-suite
│
├── streamlit_app/            # Front‑end
│   └── app.py
│
├── .github/
│   └── workflows/
│       └── daily.yml         # Cron‑triggered ETL + summarize
│
├── config.yaml               # Default runtime config (subreddits, models …)
├── requirements.txt          # Requirements for the front end only
├── requirements-dev.txt      # Requirements for local development
└── README.md
```
### Automated Workflow

```
[GitHub Actions Cron @ 21:00 UTC]
          |
          v
+---------------------+
|    Scrape Reddit    |   ← `scraper/scrape.py --date $DATE`
+---------------------+
          |
          v
+---------------------+
| Sentiment Analysis  |   ← `inference/score.py --date $DATE`
+---------------------+
          |
          v
+---------------------+
|      Summarize      |   ← `summarizer/summarize.py --date $DATE`
+---------------------+
          |
          v
[HF Dataset: data files]
          |
          v
Frontend (Streamlit app)
          |
          v
Public URL (Streamlit Cloud)
```
---

## Installation

To run the front-end Streamlit app locally:

```bash
git clone https://github.com/halstonblim/reddit_sentiment_pipeline.git
cd reddit_sentiment_pipeline
pip install -r requirements.txt
streamlit run frontend/app.py
```
To run the back-end analysis locally and set up your own scrape, score, and export pipeline, the steps are roughly:

- Create Reddit, Hugging Face, and Replicate accounts and obtain API tokens
- Configure a `.env` file with the secrets (HF, Replicate, and Reddit tokens)
- Edit `config.yaml` to point to your Hugging Face repository and Replicate model, and to list the subreddits to scrape

Once those are configured, run the following to scrape Reddit, analyze the text remotely with the Replicate model, and export the results to Hugging Face:
```bash
pip install -r requirements-dev.txt

# Run the full pipeline for today
python -m reddit_analysis.scraper.scrape --date $(date +%F)
python -m reddit_analysis.inference.score --date $(date +%F)
python -m reddit_analysis.summarizer.summarize --date $(date +%F)
```
---

## Configuration

All non‑secret settings live in **`config.yaml`**; sensitive tokens are supplied via environment variables or a `.env` file.
```yaml
# config.yaml (excerpt)
repo_id: hblim/top_reddit_posts_daily
push_to_hf: true
subreddits:
  - name: apple
    post_limit: 100
    comment_limit: 5
```
| Variable | Where to set | Description |
|----------|--------------|-------------|
| `HF_TOKEN` | GitHub → *Settings › Secrets and variables* <br>or local `.env` | Personal access token with **write** permission to the HF dataset |
| `REPLICATE_API_TOKEN` | same | Token to invoke the Replicate model |
| `ENV` | optional | `local`, `ci`, `prod` – toggles logging & Streamlit behaviour |
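
For reference, a minimal sketch of how these settings could be combined at runtime, assuming `python-dotenv` and `PyYAML`; it is illustrative only and does not mirror the repo's actual `config_utils.py` API.

```python
# Illustrative only: load non-secret settings from config.yaml and secrets
# from the environment (a local .env file is picked up via python-dotenv).
import os

import yaml                      # PyYAML
from dotenv import load_dotenv   # python-dotenv

load_dotenv()                    # no-op in CI, where secrets come from repository secrets

with open("config.yaml") as f:
    cfg = yaml.safe_load(f)

hf_token = os.environ["HF_TOKEN"]                    # fail fast if a secret is missing
replicate_token = os.environ["REPLICATE_API_TOKEN"]
subreddits = [s["name"] for s in cfg["subreddits"]]
```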
---

## Backend reddit analysis

### 1. `scraper.scrape`

Collects the top *N* daily posts from each configured subreddit and appends them to a [Hugging Face **Parquet** dataset](https://huggingface.co/datasets/hblim/top_reddit_posts_daily/tree/main/data_raw) (`data_raw`).
```bash
# --date:      YYYY-MM-DD (defaults to today)
# --limit:     optional, posts per subreddit
# --overwrite: re-upload even if the day's file already exists
python -m reddit_analysis.scraper.scrape --date 2025-04-22 --limit 100 --overwrite
```
* **Dependencies:** [praw](https://praw.readthedocs.io/), `huggingface_hub`
* **De‑duplication:** handled server‑side, using the dataset row `post_id` as the primary key, so **no local state is needed**.
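
For orientation, a minimal sketch of what this step does, assuming PRAW credentials in environment variables; the variable names and output path below are illustrative, not the repo's exact implementation.

```python
import os

import pandas as pd
import praw

reddit = praw.Reddit(
    client_id=os.environ["REDDIT_CLIENT_ID"],          # assumed variable name
    client_secret=os.environ["REDDIT_CLIENT_SECRET"],  # assumed variable name
    user_agent="reddit_sentiment_pipeline",
)

rows = []
for submission in reddit.subreddit("artificial").top(time_filter="day", limit=100):
    rows.append({
        "post_id": submission.id,          # later used as the de-duplication key
        "title": submission.title,
        "selftext": submission.selftext,
        "score": submission.score,
        "created_utc": submission.created_utc,
    })

# One Parquet file per day, appended to the data_raw folder of the HF dataset
pd.DataFrame(rows).to_parquet("data_raw/2025-04-22.parquet", index=False)
```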
---

### 2. `inference.score`

Downloads one day of raw posts, sends the combined `title + selftext` of each post to the **Replicate**-hosted model in batches (so GPU computation is parallelized), and pushes a scored Parquet file to a separate [Hugging Face **Parquet** dataset folder](https://huggingface.co/datasets/hblim/top_reddit_posts_daily/tree/main/data_scored) (`data_scored`).
```bash
python -m reddit_analysis.inference.score \
  --date 2025-04-22 \
  --model your-org/sentiment-model:latest \
  --batch_size 64   # Replicate parallelism
```
* **Retry logic:** automatic back‑off for `httpx.RemoteProtocolError`.
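
As a rough illustration of the batched call with the `replicate` client: the model slug, input field name, and encoding below are placeholders, so check the model's schema on Replicate for the real parameter names.

```python
import json

import replicate  # reads REPLICATE_API_TOKEN from the environment

texts = [
    "Apple releases new on-device AI features",
    "Another subscription price hike announced",
]

# One API call scores the whole batch instead of one request per post
output = replicate.run(
    "your-org/sentiment-model:latest",   # placeholder model reference
    input={"texts": json.dumps(texts)},  # assumed input field name and encoding
)
print(output)  # e.g. a list of {"label": ..., "score": ...} entries
```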
---

### 3. `summarizer.summarize`

Aggregates daily sentiment by subreddit (mean & weighted means) and writes a compact CSV plus a Parquet summary.
```bash
python -m reddit_analysis.summarizer.summarize \
  --date 2025-04-22 \
  --output_format csv parquet
```
* **Uses `pandas` `groupby`**, with an explicit sort by date + subreddit rather than relying on default ordering.
* **Exports** are placed under `data_summary/` in the same HF dataset repo.
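
A minimal sketch of the aggregation, assuming the scored frame has columns such as `subreddit`, `sentiment`, and `score` (upvotes); the real column names and output layout may differ.

```python
import pandas as pd

scored = pd.read_parquet("data_scored/2025-04-22.parquet")

def summarize_group(g: pd.DataFrame) -> pd.Series:
    return pd.Series({
        "mean_sentiment": g["sentiment"].mean(),
        # weight each post's sentiment by its upvote count
        "weighted_sentiment": (g["sentiment"] * g["score"]).sum() / g["score"].sum(),
        "post_count": len(g),
    })

summary = (
    scored.groupby("subreddit")[["sentiment", "score"]]
    .apply(summarize_group)
    .reset_index()
    .sort_values("subreddit")   # explicit sort instead of relying on groupby ordering
)

summary.to_csv("data_summary/2025-04-22.csv", index=False)
```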
---

## Unit tests

The backend test‑suite lives in `reddit_analysis/tests/` and can be executed with **pytest**:

```bash
pytest -q
```
| File | What it tests | Key fixtures / mocks |
|------|---------------|----------------------|
| `tests/scraper/test_scrape.py` | Reddit fetch logic, de‑duplication rules | `praw.Reddit`, `huggingface_hub.HfApi` mocked via `monkeypatch` |
| `tests/inference/test_score.py` | Batching, error handling when HF token missing | Fake Replicate API via `httpx.MockTransport` |
| `tests/summarizer/test_summarize.py` | Correct aggregation & sorting | `pandas` dummy frames |
CI runs the tests on every push (see [daily.yml](#cicd-github-actions)).
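
To show the `monkeypatch` pattern named in the table above, here is a small, self-contained sketch; the fake classes and the patched target are illustrative and do not reproduce the repo's actual fixtures.

```python
# Sketch: replace praw.Reddit with a fake so no network call is made.
import praw

class FakeSubmission:
    def __init__(self, post_id, title):
        self.id, self.title = post_id, title
        self.selftext, self.score, self.created_utc = "", 1, 0.0

class FakeSubreddit:
    def top(self, time_filter="day", limit=None):
        return iter([FakeSubmission("abc", "only post")])

class FakeReddit:
    def __init__(self, *args, **kwargs):
        pass
    def subreddit(self, name):
        return FakeSubreddit()

def test_scrape_uses_fake_reddit(monkeypatch):
    monkeypatch.setattr(praw, "Reddit", FakeReddit)
    reddit = praw.Reddit(client_id="x", client_secret="y", user_agent="test")
    posts = list(reddit.subreddit("artificial").top(time_filter="day", limit=5))
    assert posts[0].id == "abc"
```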
---

## Front end (Streamlit)

`streamlit_app/app.py` provides an interactive dashboard that:

1. Downloads the daily summary CSVs from HF.
2. Displays time‑series sentiment trends, top posts tables, and subreddit post counts.
3. Allows filtering by date range or subreddit with responsive Altair charts.
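
A minimal sketch of the data-loading side, assuming the summaries are published under a `daily_summary/` folder in the dataset repo; the file name and column names are illustrative, not the app's actual code.

```python
import pandas as pd
import streamlit as st
from huggingface_hub import hf_hub_download

# Pull one day's summary straight from the dataset repo (path is illustrative)
path = hf_hub_download(
    repo_id="hblim/top_reddit_posts_daily",
    repo_type="dataset",
    filename="daily_summary/2025-04-22.csv",
)
summary = pd.read_csv(path)

st.bar_chart(summary, x="subreddit", y="mean_sentiment")  # assumed column names
st.dataframe(summary)
```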
```bash
# Local preview
streamlit run streamlit_app/app.py
```
---

## CI/CD GitHub Actions

### `.github/workflows/daily.yml`

| Step | What it does |
|------|--------------|
| **Setup** | Checkout repo, install Python 3.12, cache pip deps |
| **Scrape** | `python -m reddit_analysis.scraper.scrape --date $DATE` |
| **Score** | `python -m reddit_analysis.inference.score --date $DATE` |
| **Summarize** | `python -m reddit_analysis.summarizer.summarize --date $DATE` |
| **Tests** | `pytest -q` |
*Trigger:* `cron: "0 21 * * *"` → 21:00 UTC (4 p.m. in Chicago during daylight saving time) every day.

Secrets (`HF_TOKEN`, `REPLICATE_API_TOKEN`) are injected via **repository secrets** so the workflow can push to Hugging Face and call Replicate. The runner is completely stateless: every job starts on a fresh VM and writes data only to external storage (the HF dataset).
---

## Monitoring with Grafana/Prometheus

A lightweight local Prometheus + Grafana stack monitors the pipeline: each stage pushes `job_success` and `job_duration_seconds` metrics, and a Grafana dashboard surfaces run health and latency trends.
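
A rough sketch of how a stage could push these metrics with `prometheus_client`, assuming a local Pushgateway on its default port; the gateway address, job label, and the `run_stage()` placeholder are illustrative, not the repo's actual instrumentation.

```python
import time

from prometheus_client import CollectorRegistry, Gauge, push_to_gateway

def run_stage() -> None:
    """Placeholder for one pipeline stage (scrape / score / summarize)."""
    time.sleep(0.1)

registry = CollectorRegistry()
job_success = Gauge("job_success", "1 if the stage finished cleanly, 0 otherwise", registry=registry)
job_duration = Gauge("job_duration_seconds", "Wall-clock runtime of the stage", registry=registry)

start = time.time()
try:
    run_stage()
    job_success.set(1)
except Exception:
    job_success.set(0)
    raise
finally:
    job_duration.set(time.time() - start)
    # Assumes a Pushgateway is listening on localhost:9091
    push_to_gateway("localhost:9091", job="reddit_sentiment_scrape", registry=registry)
```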
Example of success state:

![Grafana dashboard, success state](grafana_success.png)

Example of failure state:

![Grafana dashboard, failure state](grafana_failure.png)

---
## Extending / Customizing

* **Change subreddits** – edit the list in `config.yaml` or pass `--subreddits` to the scraper.
* **Swap sentiment models** – point `replicate_model` to any text‑classification model on Replicate with single‑sentence input.
* **Augment summaries** – create additional aggregator modules (e.g. keyword extraction) and add a new step in `daily.yml`.