---
language:
- en
- ko
- zh
- ja
- es
- fr
- ru
- hi
metrics:
- accuracy
base_model:
- distilbert/distilbert-base-multilingual-cased
pipeline_tag: text-classification
---

# BERTopic Model for Serverless Inference

A BERTopic model for multilingual topic modeling, tailored for tourism feedback analysis. The model is serialized in **safetensors** format for optimized loading and is designed for **serverless inference** in cloud environments. It is a key component of our thesis project, **"Enhancing Tourist Destination Management through a Multilingual Web-Based Tourist Survey System with Machine Learning."**

## Overview

This model leverages BERTopic to extract meaningful topics from a multilingual dataset of tourist reviews. It supports eight languages, making it a versatile tool for understanding diverse customer feedback. Optimized for serverless architectures, the model is well suited to deployment behind a FastAPI service or on platforms such as AWS Lambda and Cloud Functions.

> **Thesis Context:**
> As part of the thesis project, this model works in tandem with a DistilBERT-based sentiment analyzer to improve tourism feedback collection. The combined approach addresses language barriers and the inefficiencies of traditional survey methods, supporting data-driven decision-making for tourism management.

## Key Features

- **Multilingual Support:** Handles eight languages: English, Spanish, French, Chinese, Japanese, German, Korean, and Tagalog.
- **Pre-trained & Fine-tuned:** Trained on 160k synthetic and real tourist reviews, capturing both emotional tone and topical diversity.
- **Optimized Serialization:** Uses safetensors for faster and safer model loading.
- **Serverless Inference Ready:** Tailored for deployment behind a FastAPI service or on serverless platforms such as AWS Lambda and Cloud Functions.
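The serverless-readiness described above depends on loading the model once per container at cold start and reusing it across invocations. Below is a minimal sketch of that caching pattern; the `get_model` loader, the `handler` entry point, and the `LOAD_CALLS` counter are illustrative stand-ins (in a real deployment, `get_model` would call `BERTopic.load(...)` on the serialized checkpoint):

```python
from functools import lru_cache

# Counts loads purely for illustration; a real handler would not need this.
LOAD_CALLS = 0

@lru_cache(maxsize=1)
def get_model():
    """Load the model once per container, then reuse the cached instance."""
    global LOAD_CALLS
    LOAD_CALLS += 1
    # Stand-in for: BERTopic.load("path/to/model")
    return object()

def handler(event, context=None):
    """AWS Lambda-style entry point that reuses the cached model."""
    model = get_model()
    # Real inference would be: topics, probs = model.transform(event["docs"])
    return {"model_id": id(model)}

# Two invocations in the same warm container share one model load.
first = handler({"docs": ["sample review"]})
second = handler({"docs": ["another review"]})
```

The same pattern applies to a FastAPI service: construct the model at module import (or in a startup hook) rather than inside the request handler.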
## Model Architecture & Details

- **Architecture:** BERTopic
- **Embedding Model:** `paraphrase-multilingual-MiniLM-L12-v2`
- **Dimensionality Reduction:** UMAP
- **Clustering Algorithm:** HDBSCAN
- **Vectorizer:** CountVectorizer with TF-IDF preprocessing
- **Dataset:** 160k synthetic and real tourist reviews categorized by emotional tone and topics

## Model Performance Metrics

- **Topic Coherence Score:** *XX.XX* (placeholder)
- **Diversity Score:** *XX.XX* (placeholder)
- **Sentiment Analysis Accuracy:** *≥ 70%* (as part of the complementary system)

## How to Use

### Loading the Model

```python
from bertopic import BERTopic

# Load the BERTopic model; BERTopic.load accepts the directory the model
# was saved to (with safetensors serialization) or a Hugging Face repo ID.
model = BERTopic.load("path/to/model")
```

### Performing Topic Modeling

```python
# Sample documents for topic modeling
docs = [
    "The hotel had a great view of the beach and excellent service.",
    "Transportation was a bit difficult to find late at night."
]

# Assign each document to a topic
topics, probs = model.transform(docs)

print("Topics:", topics)
print("Probabilities:", probs)
```

## Deployment Guide

- **Serverless Platforms:** Include dependencies such as `safetensors`, `bertopic`, and `sentence-transformers` in your deployment package for AWS Lambda, Cloud Functions, or a containerized FastAPI service.
- **Memory Optimization:** Use safetensors for a reduced memory footprint and faster model loading.
- **Scaling Considerations:** Load the model once at cold start and reuse it for subsequent requests to scale efficiently in serverless environments.

## Limitations

- **Variable Topic Coherence:** Topic coherence may vary across languages.
- **Dataset Biases:** Performance may reflect biases in the training data.
- **Latency Constraints:** Not suited to hard real-time applications requiring sub-50 ms response times.

## License

[Insert License Here]

## Citation

```bibtex
@inproceedings{your_citation,
  title={BERTopic Model for Multilingual Tourism Feedback},
  author={Paul Andre D.
Tadiar},
  year={2025}
}
```

---

*For inquiries or contributions, please open an issue on the Hugging Face repository.*