---
language:
- en
- ko
- zh
- ja
- es
- fr
- ru
- hi
metrics:
- accuracy
base_model:
- distilbert/distilbert-base-multilingual-cased
pipeline_tag: text-classification
---
---
# BERTopic Model for Serverless Inference
A BERTopic model for multilingual topic modeling, specifically tailored for tourism feedback analysis. This model is serialized in **safetensors** format for optimized loading and is designed for **serverless inference** in cloud environments. It is a key component of our thesis project, **"Enhancing Tourist Destination Management through a Multilingual Web-Based Tourist Survey System with Machine Learning."**
## Overview
This model leverages BERTopic to extract meaningful topics from a multilingual dataset of tourist reviews. It supports eight languages, making it a versatile tool for understanding diverse customer feedback. Optimized for serverless architectures, the model is suited to deployment on platforms such as AWS Lambda and Cloud Functions, or behind a FastAPI service.
> **Thesis Context:**
> As part of the thesis project, this model works in tandem with a DistilBERT-based sentiment analyzer to revolutionize tourism feedback collection. The combined approach overcomes language barriers and inefficiencies in traditional survey methods, enhancing data-driven decision-making for tourism management.
## Key Features
- **Multilingual Support:**
Supports 8 languages: English, Spanish, French, Chinese, Japanese, German, Korean, and Tagalog.
- **Pre-trained & Fine-tuned:**
Trained on 160k synthetic and real tourist reviews, capturing both emotional tone and topical diversity.
- **Optimized Serialization:**
Uses safetensors for faster and safer model loading.
- **Serverless Inference Ready:**
Tailored for deployment on serverless platforms such as AWS Lambda and Cloud Functions, or behind a FastAPI service.
## Model Architecture & Details
- **Architecture:** BERTopic
- **Embedding Model:** `paraphrase-multilingual-MiniLM-L12-v2`
- **Dimensionality Reduction:** UMAP
- **Clustering Algorithm:** HDBSCAN
- **Vectorizer:** CountVectorizer, with BERTopic's class-based TF-IDF (c-TF-IDF) weighting applied to the counts
- **Dataset:** 160k synthetic and real tourist reviews categorized by emotional tone and topics
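As a rough illustration of how the components listed above fit together, the sketch below assembles them through BERTopic's constructor and saves the result with safetensors serialization. The function names `build_topic_model` and `save_topic_model` are hypothetical, and the UMAP and HDBSCAN settings are BERTopic's documented defaults, used here as assumptions because this card does not state the values used in training.

```python
EMBEDDING_MODEL_NAME = "paraphrase-multilingual-MiniLM-L12-v2"


def build_topic_model():
    """Assemble a BERTopic pipeline from the components named in this card.

    Hyperparameters are BERTopic's defaults (an assumption, not the
    author's actual training configuration).
    """
    # Imported inside the function so importing this module stays cheap.
    from bertopic import BERTopic
    from hdbscan import HDBSCAN
    from sentence_transformers import SentenceTransformer
    from sklearn.feature_extraction.text import CountVectorizer
    from umap import UMAP

    embedding_model = SentenceTransformer(EMBEDDING_MODEL_NAME)
    umap_model = UMAP(n_neighbors=15, n_components=5, min_dist=0.0, metric="cosine")
    hdbscan_model = HDBSCAN(
        min_cluster_size=10,
        metric="euclidean",
        cluster_selection_method="eom",
        prediction_data=True,  # needed to assign topics to unseen documents
    )
    vectorizer_model = CountVectorizer()

    return BERTopic(
        embedding_model=embedding_model,
        umap_model=umap_model,
        hdbscan_model=hdbscan_model,
        vectorizer_model=vectorizer_model,
    )


def save_topic_model(topic_model, out_dir):
    """Save a fitted model as a directory of safetensors and JSON files."""
    topic_model.save(
        out_dir,
        serialization="safetensors",
        save_ctfidf=True,
        # Store only the embedding model's name, not its weights.
        save_embedding_model=EMBEDDING_MODEL_NAME,
    )
```

Saving only a pointer to the embedding model keeps the artifact small; the embedding weights are then fetched or bundled separately at deployment time.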
## Model Performance Metrics
- **Topic Coherence Score:** *XX.XX* (placeholder)
- **Diversity Score:** *XX.XX* (placeholder)
- **Sentiment Analysis Accuracy:** *≥ 70%* (as part of the complementary system)
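This card does not define how the Diversity Score is computed. One common definition, assumed here purely for illustration, is the fraction of unique words among the top-k words of all topics. The helper name `topic_diversity` is hypothetical.

```python
def topic_diversity(topic_words):
    """Return the share of unique words across all topics' top words.

    topic_words: list of lists, one list of top-k words per topic.
    A value near 1.0 means topics share few words; near 0.0 means heavy overlap.
    """
    all_words = [word for words in topic_words for word in words]
    if not all_words:
        return 0.0
    return len(set(all_words)) / len(all_words)


example_topics = [
    ["beach", "hotel", "view"],
    ["hotel", "bus", "taxi"],
]
# 5 unique words out of 6 in total, because "hotel" appears in both topics.
print(round(topic_diversity(example_topics), 4))
```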
## How to Use
### Loading the Model
```python
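from bertopic import BERTopic

# For safetensors serialization, load from the directory that holds the saved
# files; passing the embedding model lets transform() embed new documents.
model = BERTopic.load("path/to/model_dir", embedding_model="paraphrase-multilingual-MiniLM-L12-v2")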
```
### Performing Topic Modeling
```python
# Sample documents for topic modeling
docs = [
"The hotel had a great view of the beach and excellent service.",
"Transportation was a bit difficult to find late at night."
]
# Extract topics from the documents
topics, probs = model.transform(docs)
print("Topics:", topics)
print("Probabilities:", probs)
```
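The topic IDs returned by `transform` are integers, so a lookup step is usually needed to make them readable. The sketch below maps each predicted ID to its top keywords using BERTopic's `get_topic` method. The helper name `describe_topics` and the `"<outlier>"` label are hypothetical; the sketch relies on BERTopic's convention that topic `-1` collects outlier documents.

```python
def describe_topics(model, topic_ids, top_n=5):
    """Pair each predicted topic ID with its top keywords."""
    described = []
    for topic_id in topic_ids:
        topic_id = int(topic_id)  # IDs may arrive as NumPy integers
        if topic_id == -1:
            # BERTopic assigns -1 to documents that fit no topic.
            described.append((topic_id, ["<outlier>"]))
            continue
        # get_topic returns a list of (word, score) pairs, or False if unknown.
        words = model.get_topic(topic_id) or []
        described.append((topic_id, [word for word, _ in words[:top_n]]))
    return described
```

With the `model` and `topics` from the example above, `describe_topics(model, topics)` returns one `(topic_id, keywords)` pair per input document.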
## Deployment Guide
- **Serverless Platforms:**
Ensure dependencies such as `safetensors`, `bertopic`, and `sentence-transformers` are included in your deployment package, whether you target AWS Lambda or serve the model from a FastAPI application.
- **Memory Optimization:**
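Use safetensors serialization for a smaller model artifact and faster loading, which shortens cold starts. The serialization format does not by itself speed up inference.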
- **Scaling Considerations:**
Load the model at cold start and reuse it for subsequent requests to efficiently scale in serverless environments.
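The load-once pattern described under Scaling Considerations can be sketched as follows, assuming an AWS Lambda-style `handler(event, context)` entry point. The `MODEL_DIR` environment variable, the `docs` key in the event, and the function names are all hypothetical choices made for this example.

```python
import functools
import os

# Hypothetical location of the saved model directory inside the package.
MODEL_DIR = os.environ.get("MODEL_DIR", "path/to/model_dir")


@functools.lru_cache(maxsize=1)
def get_model():
    """Load the model on first use; later calls in the same container reuse it."""
    from bertopic import BERTopic  # deferred so the import cost is paid once

    return BERTopic.load(
        MODEL_DIR,
        embedding_model="paraphrase-multilingual-MiniLM-L12-v2",
    )


def predict_topics(docs, model):
    """Run topic inference and return JSON-serializable topic IDs."""
    topics, _ = model.transform(docs)
    return [int(topic) for topic in topics]


def handler(event, context=None):
    """Entry point: read documents from the event and return their topics."""
    docs = event["docs"]
    return {"topics": predict_topics(docs, get_model())}
```

Because `get_model` is cached at module level, only the first request served by a fresh container pays the loading cost; warm invocations go straight to `transform`.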
## Limitations
- **Variable Topic Coherence:**
Coherence may vary by language.
- **Dataset Biases:**
The model’s performance may be influenced by biases in the training data.
- **Latency Constraints:**
Not ideal for real-time, low-latency applications (response times under 50 ms).
## License
[Insert License Here]
## Citation
```bibtex
@inproceedings{your_citation,
title={BERTopic Model for Multilingual Tourism Feedback},
author={Paul Andre D. Tadiar},
year={2025}
}
```
---
*For inquiries or contributions, please open an issue on the Hugging Face repository.*
---