---
language:
- en
- ko
- zh
- ja
- es
- fr
- ru
- hi
metrics:
- accuracy
base_model:
- distilbert/distilbert-base-multilingual-cased
pipeline_tag: text-classification
---

# BERTopic Model for Serverless Inference

A BERTopic model for multilingual topic modeling, tailored for tourism feedback analysis. The model is serialized in **safetensors** format for optimized loading and is designed for **serverless inference** in cloud environments. It is a key component of our thesis project, **"Enhancing Tourist Destination Management through a Multilingual Web-Based Tourist Survey System with Machine Learning."**

## Overview

This model leverages BERTopic to extract meaningful topics from a multilingual dataset of tourist reviews. It supports eight languages, making it a versatile tool for understanding diverse customer feedback. Optimized for serverless architectures, the model is well suited to deployment behind a FastAPI service or on platforms such as AWS Lambda and Cloud Functions.

> **Thesis Context:**
> As part of the thesis project, this model works in tandem with a DistilBERT-based sentiment analyzer to improve tourism feedback collection. The combined approach addresses language barriers and the inefficiencies of traditional survey methods, supporting data-driven decision-making for tourism management.

## Key Features

- **Multilingual Support:** Handles eight languages: English, Spanish, French, Chinese, Japanese, German, Korean, and Tagalog.
- **Pre-trained & Fine-tuned:** Trained on 160k synthetic and real tourist reviews, capturing both emotional tone and topical diversity.
- **Optimized Serialization:** Uses safetensors for faster and safer model loading.
- **Serverless Inference Ready:** Tailored for deployment behind a FastAPI service or on serverless platforms such as AWS Lambda and Cloud Functions.
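The serverless-readiness described above depends on loading the model once per container at cold start and reusing it across invocations. Below is a minimal sketch of that caching pattern; the `get_model` loader, the `handler` entry point, and the `LOAD_CALLS` counter are illustrative stand-ins (in a real deployment, `get_model` would call `BERTopic.load(...)` on the serialized checkpoint):

```python
from functools import lru_cache

# Counts loads purely for illustration; a real handler would not need this.
LOAD_CALLS = 0

@lru_cache(maxsize=1)
def get_model():
    """Load the model once per container, then reuse the cached instance."""
    global LOAD_CALLS
    LOAD_CALLS += 1
    # Stand-in for: BERTopic.load("path/to/model")
    return object()

def handler(event, context=None):
    """AWS Lambda-style entry point that reuses the cached model."""
    model = get_model()
    # Real inference would be: topics, probs = model.transform(event["docs"])
    return {"model_id": id(model)}

# Two invocations in the same warm container share one model load.
first = handler({"docs": ["sample review"]})
second = handler({"docs": ["another review"]})
```

The same pattern applies to a FastAPI service: construct the model at module import (or in a startup hook) rather than inside the request handler.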
## Model Architecture & Details

- **Architecture:** BERTopic
- **Embedding Model:** `paraphrase-multilingual-MiniLM-L12-v2`
- **Dimensionality Reduction:** UMAP
- **Clustering Algorithm:** HDBSCAN
- **Vectorizer:** CountVectorizer with TF-IDF preprocessing
- **Dataset:** 160k synthetic and real tourist reviews categorized by emotional tone and topics

## Model Performance Metrics

- **Topic Coherence Score:** *XX.XX* (placeholder)
- **Diversity Score:** *XX.XX* (placeholder)
- **Sentiment Analysis Accuracy:** *≥ 70%* (as part of the complementary system)

## How to Use

### Loading the Model

```python
from bertopic import BERTopic

# Load the BERTopic model; BERTopic.load accepts the directory the model
# was saved to (with safetensors serialization) or a Hugging Face repo ID.
model = BERTopic.load("path/to/model")
```

### Performing Topic Modeling

```python
# Sample documents for topic modeling
docs = [
    "The hotel had a great view of the beach and excellent service.",
    "Transportation was a bit difficult to find late at night."
]

# Assign each document to a topic
topics, probs = model.transform(docs)

print("Topics:", topics)
print("Probabilities:", probs)
```

## Deployment Guide

- **Serverless Platforms:** Include dependencies such as `safetensors`, `bertopic`, and `sentence-transformers` in your deployment package for AWS Lambda, Cloud Functions, or a containerized FastAPI service.
- **Memory Optimization:** Use safetensors for a reduced memory footprint and faster model loading.
- **Scaling Considerations:** Load the model once at cold start and reuse it for subsequent requests to scale efficiently in serverless environments.

## Limitations

- **Variable Topic Coherence:** Topic coherence may vary across languages.
- **Dataset Biases:** Performance may reflect biases in the training data.
- **Latency Constraints:** Not suited to hard real-time applications requiring sub-50 ms response times.

## License

[Insert License Here]

## Citation

```bibtex
@inproceedings{your_citation,
  title={BERTopic Model for Multilingual Tourism Feedback},
  author={Paul Andre D.
Tadiar},
  year={2025}
}
```

---

*For inquiries or contributions, please open an issue on the Hugging Face repository.*