---
title: PolStance
emoji: 👁
colorFrom: blue
colorTo: gray
sdk: gradio
sdk_version: 6.0.0
app_file: app.py
pinned: false
license: mit
short_description: Classify a Chinese sentence's stance toward Taiwanese parties
---
# PolStance - A Political Stance Detector
PolStance takes advantage of transformer-based models to classify political stances in text.
It fine-tunes a BERT model on a dataset of political statements, each labeled with its stance by the Gemini Flash Lite model.
This repository contains the training and inference code, as well as the scripts that build the dataset. The model works on Chinese and classifies stances into "KMT", "DPP", and "Neutral".
It reaches about 72% accuracy, but the labeling quality is uneven and the model struggles with statements that oppose a party. More work is needed to improve performance.
The project is deployed on Hugging Face Spaces for easy access. Click here to try it out.
## Status
- Implemented title crawling from multiple news websites
- Implemented data cleaning
- Implemented data labeling using Gemini Flash Lite
- Implemented model training using BERT
- Set up a simplified inference pipeline
- Set up a web app for easy access
## Roadmap
- Better crawlers, possibly scraping article bodies for more content
- Improve model performance; labeling quality is currently the main bottleneck
- A more complete command-line interface to manage the whole pipeline
## Crawlers and Data Cleaning
The crawlers are implemented in `getTitle.py`. The script uses Selenium for web scraping and BeautifulSoup for HTML parsing, with functions to crawl titles from multiple news websites. The data cleaning functions live in the same script: the base cleaning removes empty and very short titles. I collected around 20k titles from various news websites.
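The cleaning step described above can be sketched as a small pure function. This is a hypothetical illustration, not the code in `getTitle.py`; the function name `clean_titles` and the `min_length` threshold are assumptions, since the README only says short and empty titles are removed.

```python
import re

def clean_titles(titles, min_length=8):
    """Drop empty, whitespace-only, and very short titles, then dedupe.

    min_length is a guessed threshold; the actual cutoff in getTitle.py
    may differ.
    """
    seen = set()
    cleaned = []
    for title in titles:
        t = re.sub(r"\s+", " ", title).strip()  # collapse stray whitespace
        if len(t) < min_length or t in seen:
            continue
        seen.add(t)
        cleaned.append(t)
    return cleaned

# The duplicate and the two-character title are both dropped.
print(clean_titles(["總統大選最新民調出爐 兩黨差距縮小",
                    "總統大選最新民調出爐 兩黨差距縮小",
                    "快訊"]))
```

Deduplicating at this stage also keeps syndicated headlines that appear on several news sites from being counted twice.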
## Data Labeling
Data labeling is done with the Gemini Flash Lite model; the labeling function is implemented in `getTitleLabel.py`. The script reads the cleaned data, asks Gemini Flash Lite to label each title with its stance, and saves the labeled data back into the same database. The resulting labels have a roughly 7:5:7 ratio of KMT/DPP/Neutral.
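One fiddly part of LLM-based labeling is normalizing the model's free-text reply into a clean label. The helper below is a hypothetical sketch of that step (the name `parse_stance` and the defensive stripping are assumptions, not code from `getTitleLabel.py`):

```python
VALID_STANCES = {"KMT", "DPP", "Neutral"}

def parse_stance(raw_reply):
    """Normalize a raw LLM reply into one of the three stance labels.

    LLM replies often carry extra whitespace, punctuation, or casing;
    matching defensively avoids silently mislabeled rows.
    """
    text = raw_reply.strip().strip(".。\"'")
    for stance in VALID_STANCES:
        if stance.lower() == text.lower():
            return stance
    return None  # caller can retry the API call or skip the title
```

Returning `None` on unrecognized replies lets the pipeline retry or drop a title instead of storing a bad label.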
## Model Training
Model training is implemented in `trainModel.py`. The script uses the Hugging Face Transformers library to fine-tune a BERT model on the labeled dataset, adding a classification head trained with cross-entropy loss. The trained model is saved for later use. Training runs on the MPS backend of my baseline M3 Pro MacBook Pro and reaches around 72% accuracy on the validation set.
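The cross-entropy objective mentioned above can be illustrated numerically. This is a minimal, dependency-free sketch of the loss the classification head minimizes, not the Transformers code in `trainModel.py`:

```python
import math

LABELS = ["KMT", "DPP", "Neutral"]

def cross_entropy(logits, target_index):
    """Cross-entropy for one example: -log(softmax(logits)[target])."""
    m = max(logits)                           # subtract max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    probs = [e / sum(exps) for e in exps]
    return -math.log(probs[target_index])

# Uniform logits give -log(1/3) ≈ 1.0986 regardless of the target class,
# which is the expected starting loss for an untrained 3-way classifier.
print(round(cross_entropy([0.0, 0.0, 0.0], LABELS.index("DPP")), 4))
```

Watching whether training loss drops well below log(3) ≈ 1.10 is a quick sanity check that the head is learning anything at all.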
## Inference
The inference pipeline is implemented in `inference.py`. The script loads the trained model and exposes a function that predicts the stance of a given sentence.
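The post-model step of that prediction function can be sketched as an argmax over the three logits. The `ID2LABEL` ordering below is an assumption for illustration; the real mapping is fixed by how `trainModel.py` encoded the labels.

```python
ID2LABEL = {0: "KMT", 1: "DPP", 2: "Neutral"}  # assumed label order

def predict_stance(logits):
    """Map the model's three raw logits for one sentence to a stance label."""
    best = max(range(len(logits)), key=lambda i: logits[i])  # argmax
    return ID2LABEL[best]

print(predict_stance([0.2, 2.7, -1.1]))  # → DPP
```

Softmax is unnecessary here: it is monotonic, so the argmax over logits and over probabilities is the same class.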
## Web App
The web app is built with Gradio in `app.py`. It provides a simple interface where users input a sentence and get the model's predicted stance. The same file powers the Hugging Face Spaces deployment.
## Requirements
The project is managed with uv; `pyproject.toml` lists the project dependencies. To install them, run `uv sync`. A `.env` file should contain the Gemini API key for data labeling and the Hugging Face API key for model inference.