---
language:
- en
- ko
- zh
- ja
- es
- fr
- ru
- hi
metrics:
- accuracy
base_model:
- distilbert/distilbert-base-multilingual-cased
pipeline_tag: text-classification
---
# BERTopic Model for Serverless Inference
A BERTopic model for multilingual topic modeling, specifically tailored for tourism feedback analysis. This model is serialized in **safetensors** format for optimized loading and is designed for **serverless inference** in cloud environments. It is a key component of our thesis project, **"Enhancing Tourist Destination Management through a Multilingual Web-Based Tourist Survey System with Machine Learning."**
## Overview
This model uses BERTopic to extract meaningful topics from a multilingual dataset of tourist reviews. It supports eight languages, which makes it suitable for analyzing feedback from a diverse set of visitors. It is optimized for serverless deployment, for example on AWS Lambda or Cloud Functions, or behind a FastAPI service.
> **Thesis Context:**
> As part of the thesis project, this model works in tandem with a DistilBERT-based sentiment analyzer to revolutionize tourism feedback collection. The combined approach overcomes language barriers and inefficiencies in traditional survey methods, enhancing data-driven decision-making for tourism management.
## Key Features
- **Multilingual Support:**
Supports 8 languages: English, Spanish, French, Chinese, Japanese, German, Korean, and Tagalog.
- **Pre-trained & Fine-tuned:**
Trained on 160k synthetic and real tourist reviews, capturing both emotional tone and topical diversity.
- **Optimized Serialization:**
Uses safetensors for faster and safer model loading.
- **Serverless Inference Ready:**
Tailored for deployment on serverless platforms such as AWS Lambda and Cloud Functions, or behind a FastAPI service.
## Model Architecture & Details
- **Architecture:** BERTopic
- **Embedding Model:** `paraphrase-multilingual-MiniLM-L12-v2`
- **Dimensionality Reduction:** UMAP
- **Clustering Algorithm:** HDBSCAN
- **Vectorizer:** CountVectorizer, with topic words weighted by BERTopic's class-based TF-IDF (c-TF-IDF)
- **Dataset:** 160k synthetic and real tourist reviews categorized by emotional tone and topics
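The components listed above can be wired together roughly as follows. This is a minimal sketch, not the training script used for this model: the hyperparameter values and the function name `build_topic_model` are illustrative assumptions, and only the component choices come from the list above.

```python
# Illustrative hyperparameters (assumptions, not the values used for this model).
UMAP_PARAMS = {"n_neighbors": 15, "n_components": 5, "min_dist": 0.0, "metric": "cosine"}
HDBSCAN_PARAMS = {"min_cluster_size": 10, "metric": "euclidean", "prediction_data": True}
EMBEDDING_MODEL = "paraphrase-multilingual-MiniLM-L12-v2"


def build_topic_model():
    """Assemble an unfitted BERTopic pipeline from the components listed above."""
    # Imported lazily so this module can be imported without the heavy dependencies.
    from bertopic import BERTopic
    from hdbscan import HDBSCAN
    from sklearn.feature_extraction.text import CountVectorizer
    from umap import UMAP

    return BERTopic(
        embedding_model=EMBEDDING_MODEL,          # resolved through sentence-transformers
        umap_model=UMAP(**UMAP_PARAMS),           # dimensionality reduction
        hdbscan_model=HDBSCAN(**HDBSCAN_PARAMS),  # density-based clustering
        vectorizer_model=CountVectorizer(),       # term counts for topic words
    )
```

Calling `build_topic_model().fit_transform(docs)` on a list of review strings would fit the pipeline and return the topic assignments along with their probabilities.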
## Model Performance Metrics
- **Topic Coherence Score:** *XX.XX* (placeholder)
- **Diversity Score:** *XX.XX* (placeholder)
- **Sentiment Analysis Accuracy:** *≥ 70%* (as part of the complementary system)
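This card does not state how the diversity score is computed. One common definition is the share of unique words among the top-k words of all topics, and the sketch below assumes that definition. The function name `topic_diversity` and the toy topics are hypothetical and are not output from this model.

```python
def topic_diversity(topic_words, top_k=10):
    """Fraction of unique words among the top_k words of every topic (0 to 1)."""
    top_words = [word for words in topic_words for word in words[:top_k]]
    if not top_words:
        return 0.0
    return len(set(top_words)) / len(top_words)


# Toy topics for illustration only; not output from this model.
toy_topics = [
    ["beach", "hotel", "view"],
    ["hotel", "service", "staff"],
]
print(round(topic_diversity(toy_topics, top_k=3), 3))  # 5 unique of 6 words -> 0.833
```

A value close to 1 means the topics share few top words, while a value close to 0 means they overlap heavily.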
## How to Use
### Loading the Model
```python
from bertopic import BERTopic

# BERTopic's safetensors serialization writes a directory of files
# (weights plus JSON config), so pass that directory, not a single file.
model = BERTopic.load("path/to/model_dir")
```
### Performing Topic Modeling
```python
# Sample documents for topic modeling
docs = [
    "The hotel had a great view of the beach and excellent service.",
    "Transportation was a bit difficult to find late at night.",
]
# Assign each document to its closest learned topic (-1 marks an outlier)
topics, probs = model.transform(docs)
print("Topics:", topics)
print("Probabilities:", probs)
```
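The topic IDs returned by `transform` are integers, so it often helps to turn them into readable keyword labels. The helper below is a minimal sketch under the assumption that `model` is a loaded BERTopic instance; the name `describe_topics` is hypothetical, while `get_topic` is the BERTopic method that returns `(word, score)` pairs for a topic.

```python
def describe_topics(model, topics, top_n=5):
    """Return a short keyword label for each predicted topic ID."""
    labels = []
    for topic_id in topics:
        if topic_id == -1:
            # BERTopic reserves -1 for documents it could not assign to a topic.
            labels.append("outlier")
            continue
        # get_topic returns (word, score) pairs, or False for an unknown topic.
        words = model.get_topic(int(topic_id)) or []
        labels.append(", ".join(word for word, _ in words[:top_n]))
    return labels
```

For example, `describe_topics(model, topics)` would return one label per input document, in the same order as `docs`.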
## Deployment Guide
- **Serverless Platforms:**
Ensure dependencies such as `safetensors`, `bertopic`, and `sentence-transformers` are included in your deployment package, whether it targets AWS Lambda, Cloud Functions, or a FastAPI service.
- **Memory Optimization:**
Use safetensors serialization to keep the model artifact small and quick to load, which shortens cold starts.
- **Scaling Considerations:**
Load the model at cold start and reuse it for subsequent requests to efficiently scale in serverless environments.
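The load-once-and-reuse advice above can be sketched as a Lambda-style handler. This is a minimal sketch, not a tested deployment: the event shape (a JSON body with a `docs` list), the `MODEL_DIR` path, and the names `load_bertopic` and `make_handler` are all assumptions made for illustration.

```python
import json
from functools import lru_cache

MODEL_DIR = "path/to/model_dir"  # hypothetical location of the saved model


def load_bertopic():
    """Load the model from disk; imported lazily to keep module import cheap."""
    from bertopic import BERTopic
    return BERTopic.load(MODEL_DIR)


def make_handler(load_model):
    """Build a handler that loads the model once and reuses it afterwards."""

    @lru_cache(maxsize=1)
    def get_model():
        return load_model()

    def handler(event, context=None):
        # Assumed event shape: {"body": "{\"docs\": [\"...\", \"...\"]}"}
        docs = json.loads(event["body"])["docs"]
        topics, _ = get_model().transform(docs)
        body = json.dumps({"topics": [int(t) for t in topics]})
        return {"statusCode": 200, "body": body}

    return handler


# Entry point for the serverless runtime; the model loads on the first request.
handler = make_handler(load_bertopic)
```

Passing the loader in as an argument keeps the caching logic independent of BERTopic, so the handler can be exercised locally with a stub model before packaging.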
## Limitations
- **Variable Topic Coherence:**
Coherence may vary by language.
- **Dataset Biases:**
The model’s performance may be influenced by biases in the training data.
- **Latency Constraints:**
Not ideal for real-time applications that need response times under 50 ms.
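To check whether a given deployment fits a latency budget such as the 50 ms mentioned above, warm inference time can be measured directly. The helper below is a rough sketch: the name `median_latency_ms` and the default of 20 runs are assumptions, and it measures only warm calls, not cold-start loading.

```python
import statistics
import time


def median_latency_ms(predict, docs, runs=20):
    """Median wall-clock time of predict(docs) in milliseconds over several runs."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        predict(docs)
        timings.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(timings)
```

With a loaded model, `median_latency_ms(model.transform, ["Great beach, friendly staff."])` would give a figure to compare against the budget on the target hardware.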
## License
[Insert License Here]
## Citation
```bibtex
@inproceedings{your_citation,
title={BERTopic Model for Multilingual Tourism Feedback},
author={Paul Andre D. Tadiar},
year={2025}
}
```
---
*For inquiries or contributions, please open an issue on the Hugging Face repository.*
---