---
language:
- en
- ko
- zh
- ja
- es
- fr
- ru
- hi
metrics:
- accuracy
base_model:
- distilbert/distilbert-base-multilingual-cased
pipeline_tag: text-classification
---

---

# BERTopic Model for Serverless Inference

A BERTopic model for multilingual topic modeling, tailored for tourism feedback analysis. The model is serialized in **safetensors** format for fast loading and is designed for **serverless inference** in cloud environments. It is a key component of our thesis project, **"Enhancing Tourist Destination Management through a Multilingual Web-Based Tourist Survey System with Machine Learning."**

## Overview

This model uses BERTopic to extract meaningful topics from a multilingual dataset of tourist reviews. It supports eight languages, so it can surface themes in feedback from a linguistically diverse set of visitors. The model is optimized for serverless architectures and can be deployed on AWS Lambda or Cloud Functions, or served behind a FastAPI application.

> **Thesis Context:**
> As part of the thesis project, this model works in tandem with a DistilBERT-based sentiment analyzer to improve tourism feedback collection. The combined approach addresses language barriers and inefficiencies in traditional survey methods, supporting data-driven decision-making for tourism management.

## Key Features

- **Multilingual Support:**
  Supports 8 languages: English, Spanish, French, Chinese, Japanese, German, Korean, and Tagalog.
- **Pre-trained & Fine-tuned:**
  Trained on 160k synthetic and real tourist reviews, capturing both emotional tone and topical diversity.
- **Optimized Serialization:**
  Uses safetensors for faster and safer model loading.
- **Serverless Inference Ready:**
  Tailored for serverless deployment on AWS Lambda or Cloud Functions, or behind a FastAPI application.

## Model Architecture & Details

- **Architecture:** BERTopic
- **Embedding Model:** `paraphrase-multilingual-MiniLM-L12-v2`
- **Dimensionality Reduction:** UMAP
- **Clustering Algorithm:** HDBSCAN
- **Vectorizer:** CountVectorizer with class-based TF-IDF (c-TF-IDF) weighting
- **Dataset:** 160k synthetic and real tourist reviews categorized by emotional tone and topics

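To make the vectorizer step more concrete: BERTopic's default topic representation merges the documents of each cluster into one bag of words and scores terms with a class-based TF-IDF (c-TF-IDF). The following is a minimal pure-Python sketch of that weighting, assuming whitespace tokenization and the standard formulation (term frequency within the topic times `log(1 + A / f_t)`). The function names and the toy clusters are hypothetical and are not taken from this model's training pipeline.

```python
import math
from collections import Counter


def c_tf_idf(docs_per_topic):
    """Return {topic: {term: weight}} from {topic: [documents]}.

    Assumed weighting: (term count / words in topic) * log(1 + A / f_t),
    where f_t is the term's count over all topics and A is the average
    number of words per topic.
    """
    # One bag of words per topic: all of a topic's documents are merged.
    counts = {
        topic: Counter(word for doc in docs for word in doc.lower().split())
        for topic, docs in docs_per_topic.items()
    }
    total_per_term = Counter()
    for bag in counts.values():
        total_per_term.update(bag)
    avg_words = sum(total_per_term.values()) / len(counts)

    weights = {}
    for topic, bag in counts.items():
        n_words = sum(bag.values())
        weights[topic] = {
            term: (freq / n_words) * math.log(1 + avg_words / total_per_term[term])
            for term, freq in bag.items()
        }
    return weights


def top_terms(weights, topic, k=3):
    """Return the k highest-weighted terms for one topic."""
    ranked = sorted(weights[topic].items(), key=lambda item: item[1], reverse=True)
    return [term for term, _ in ranked[:k]]


# Toy clusters (hypothetical data, not taken from the training set).
toy_clusters = {
    "lodging": ["the hotel room was clean", "the hotel staff was friendly"],
    "transport": ["the bus was late", "the taxi was expensive"],
}
toy_weights = c_tf_idf(toy_clusters)
print(top_terms(toy_weights, "lodging", k=1))  # prints ['hotel']
```

In this toy example "hotel" outranks "the" for the lodging cluster because "the" is spread across both clusters. With clusters this small, common words can still rank highly, which is why stop-word handling in the vectorizer matters in practice.
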
## Model Performance Metrics

- **Topic Coherence Score:** *XX.XX* (placeholder)
- **Diversity Score:** *XX.XX* (placeholder)
- **Sentiment Analysis Accuracy:** *≥ 70%* (as part of the complementary system)

## How to Use

### Loading the Model

```python
from bertopic import BERTopic

# A safetensors-serialized BERTopic model is saved as a directory, so pass the
# directory (placeholder path) and name the embedding model used for new documents.
model = BERTopic.load(
    "path/to/model_dir",
    embedding_model="paraphrase-multilingual-MiniLM-L12-v2",
)
```

### Performing Topic Modeling

```python
# Sample documents for topic modeling
docs = [
    "The hotel had a great view of the beach and excellent service.",
    "Transportation was a bit difficult to find late at night."
]

# Extract topics from the documents
topics, probs = model.transform(docs)
print("Topics:", topics)
print("Probabilities:", probs)
```

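The topic IDs returned by `transform` are easier to read once they are paired with each topic's keywords. The helper below is a small sketch of one way to do that, assuming a loaded BERTopic model that exposes `transform` and `get_topic`. The name `describe_assignments` and the shape of the returned rows are illustrative choices, not part of BERTopic.

```python
def describe_assignments(model, docs, top_n=5):
    """Pair each document with its topic ID and that topic's top keywords."""
    topics, _ = model.transform(docs)
    rows = []
    for doc, topic_id in zip(docs, topics):
        topic_id = int(topic_id)
        # get_topic returns (word, score) pairs, or False for an unknown topic.
        words = model.get_topic(topic_id) or []
        rows.append({
            "doc": doc,
            "topic": topic_id,
            "is_outlier": topic_id == -1,  # BERTopic uses -1 for outliers
            "keywords": [word for word, _ in words[:top_n]],
        })
    return rows
```

With a model loaded, calling `describe_assignments(model, docs)` returns one dictionary per document, which can be serialized directly as a JSON response.
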
## Deployment Guide

- **Serverless Platforms:**
  Include dependencies such as `safetensors`, `bertopic`, and `sentence-transformers` in the deployment package, whether the target is AWS Lambda, Cloud Functions, or a FastAPI application.
- **Memory Optimization:**
  Use the safetensors serialization for a smaller memory footprint and faster model loading.
- **Scaling Considerations:**
  Load the model once at cold start and reuse it across subsequent requests so that warm invocations skip the loading cost.

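The load-once-and-reuse advice can be sketched as a module-level cache inside an AWS Lambda-style handler. This is a minimal sketch under stated assumptions: the event body is assumed to be a JSON string of the form `{"docs": [...]}`, `MODEL_DIR` is a placeholder path, and the `loader` parameter is a hypothetical hook added so the caching logic can be exercised without the real model. Here the model is loaded on the first invocation in a container and kept for later ones.

```python
import json

MODEL_DIR = "path/to/model_dir"  # placeholder path (assumption)
EMBEDDING_MODEL = "paraphrase-multilingual-MiniLM-L12-v2"

_model = None  # module-level cache; survives while the container stays warm


def load_model():
    """Load the BERTopic model from disk (slow; meant to run once per container)."""
    from bertopic import BERTopic  # imported lazily to keep module import cheap

    return BERTopic.load(MODEL_DIR, embedding_model=EMBEDDING_MODEL)


def get_model(loader=load_model):
    """Return the cached model, loading it on first use."""
    global _model
    if _model is None:
        _model = loader()
    return _model


def handler(event, context, loader=load_model):
    """Lambda-style entry point; expects a JSON body like {"docs": [...]}."""
    docs = json.loads(event["body"])["docs"]
    topics, _ = get_model(loader).transform(docs)
    return {
        "statusCode": 200,
        "body": json.dumps({"topics": [int(t) for t in topics]}),
    }
```

The same `get_model` function can back a FastAPI route. The key design choice is that the expensive load happens outside the per-request path.
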
## Limitations

- **Variable Topic Coherence:**
  Coherence may vary by language.
- **Dataset Biases:**
  The model’s performance may be influenced by biases in the training data.
- **Latency Constraints:**
  Not ideal for real-time, low-latency applications (< 50 ms response time).

## License

[Insert License Here]

## Citation

```bibtex
@inproceedings{your_citation,
  title={BERTopic Model for Multilingual Tourism Feedback},
  author={Paul Andre D. Tadiar},
  year={2025}
}
```

---

*For inquiries or contributions, please open an issue on the Hugging Face repository.*

---