	---
	language:
	- en
	- ko
	- zh
	- ja
	- es
	- fr
	- ru
	- hi
	metrics:
	- accuracy
	base_model:
	- distilbert/distilbert-base-multilingual-cased
	pipeline_tag: text-classification
	---

	---

	# BERTopic Model for Serverless Inference

	A BERTopic model for multilingual topic modeling, specifically tailored for tourism feedback analysis. This model is serialized in safetensors format for optimized loading and is designed for serverless inference in cloud environments. It is a key component of our thesis project, "Enhancing Tourist Destination Management through a Multilingual Web-Based Tourist Survey System with Machine Learning."

	## Overview

	This model uses BERTopic to extract meaningful topics from a multilingual dataset of tourist reviews. It supports eight languages, making it a versatile tool for understanding diverse customer feedback. It is optimized for serverless deployment, for example on AWS Lambda or Cloud Functions, or behind a FastAPI service.

	> Thesis Context:
	> As part of the thesis project, this model works in tandem with a DistilBERT-based sentiment analyzer to improve tourism feedback collection. The combined approach addresses the language barriers and inefficiencies of traditional survey methods, supporting data-driven decision-making for tourism management.

	## Key Features

	- Multilingual Support:
	Supports 8 languages: English, Spanish, French, Chinese, Japanese, German, Korean, and Tagalog.
	- Pre-trained & Fine-tuned:
	Trained on 160k synthetic and real tourist reviews, capturing both emotional tone and topical diversity.
	- Optimized Serialization:
	Uses safetensors for faster and safer model loading.
	- Serverless Inference Ready:
	Tailored for serverless platforms such as AWS Lambda and Cloud Functions, and for serving through a FastAPI application.

	## Model Architecture & Details

	- Architecture: BERTopic
	- Embedding Model: `paraphrase-multilingual-MiniLM-L12-v2`
	- Dimensionality Reduction: UMAP
	- Clustering Algorithm: HDBSCAN
	- Vectorizer: CountVectorizer, with class-based TF-IDF (c-TF-IDF) term weighting
	- Dataset: 160k synthetic and real tourist reviews categorized by emotional tone and topics
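
BERTopic turns clusters into topic keywords by pooling the documents of each cluster and scoring terms with class-based TF-IDF (c-TF-IDF). The sketch below illustrates that scoring step in plain Python. The whitespace tokenizer, the `c_tf_idf` helper, and the toy reviews are illustrative assumptions, not the vectorizer settings used to train this model.

```python
import math
from collections import Counter


def c_tf_idf(class_docs):
    """Score terms per class: tf(t, c) * log(1 + A / f_t).

    tf(t, c) is the term's share of the words in class c, f_t is the term's
    count over all classes, and A is the average number of words per class.
    """
    # Pool the documents of each class and count terms (naive tokenizer).
    counts = {
        label: Counter(" ".join(docs).lower().split())
        for label, docs in class_docs.items()
    }
    term_totals = Counter()
    for counter in counts.values():
        term_totals.update(counter)
    avg_words = sum(term_totals.values()) / len(counts)

    scores = {}
    for label, counter in counts.items():
        class_size = sum(counter.values())
        scores[label] = {
            term: (freq / class_size) * math.log(1 + avg_words / term_totals[term])
            for term, freq in counter.items()
        }
    return scores


# Toy clusters standing in for HDBSCAN output (hypothetical data).
clusters = {
    0: ["great beach view", "beach hotel"],
    1: ["late bus", "no taxi late"],
}
scores = c_tf_idf(clusters)
top_terms = {label: max(s, key=s.get) for label, s in scores.items()}
print(top_terms)
```

Terms that are frequent inside one cluster but rare elsewhere receive the highest scores, which is why they surface as topic keywords.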

	## Model Performance Metrics

	- Topic Coherence Score: XX.XX (placeholder)
	- Diversity Score: XX.XX (placeholder)
	- Sentiment Analysis Accuracy: ≥ 70% (reported for the complementary DistilBERT-based sentiment analyzer, not for this topic model)
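
This card does not state how the diversity score is computed. One common definition is the share of unique words among the top-k words of all topics, where 1.0 means no topic shares a keyword with another. The sketch below assumes that definition; the `topic_diversity` helper and the sample keyword lists are hypothetical.

```python
def topic_diversity(topic_words, top_k=10):
    """Return the fraction of unique words among each topic's top_k words."""
    words = [word for topic in topic_words for word in topic[:top_k]]
    if not words:
        return 0.0
    return len(set(words)) / len(words)


# Hypothetical keyword lists; "hotel" appears in both topics.
example_topics = [
    ["beach", "hotel", "view"],
    ["bus", "taxi", "hotel"],
]
score = topic_diversity(example_topics, top_k=3)
print(f"Topic diversity: {score:.2f}")
```

With a fitted model, the keyword lists could be built from `model.get_topic(topic_id)` for each topic id.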

	## How to Use

	### Loading the Model

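	```python
	from bertopic import BERTopic

	# BERTopic's safetensors serialization writes a directory, so pass the
	# directory (placeholder path below) rather than a single .safetensors file.
	# The embedding model is named explicitly so transform() can embed new text.
	model = BERTopic.load(
	    "path/to/model_dir",
	    embedding_model="sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2",
	)
	```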

	### Performing Topic Modeling

	```python
	# Sample documents for topic modeling
	docs = [
	"The hotel had a great view of the beach and excellent service.",
	"Transportation was a bit difficult to find late at night."
	]

	# Extract topics from the documents
	topics, probs = model.transform(docs)
	print("Topics:", topics)
	print("Probabilities:", probs)
	```
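
The topic ids returned by `transform` are integers, and BERTopic reserves `-1` for outliers. To make them readable, each id can be mapped to its top keywords with `get_topic`. The `describe_topics` helper below is a hypothetical convenience wrapper, not part of BERTopic.

```python
def describe_topics(model, topic_ids, top_n=3):
    """Map each predicted topic id to a short keyword label."""
    labels = []
    for topic_id in topic_ids:
        topic_id = int(topic_id)  # also accepts NumPy integer ids
        if topic_id == -1:
            labels.append("outlier")
            continue
        # get_topic returns a list of (word, score) pairs, or False if the
        # topic id is unknown to the model.
        words = model.get_topic(topic_id)
        if not words:
            labels.append(f"topic {topic_id}")
        else:
            labels.append(", ".join(word for word, _ in words[:top_n]))
    return labels


# Usage with the objects from the example above:
# print(describe_topics(model, topics))
```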

	## Deployment Guide

	- Serverless Platforms:
	Include dependencies such as `safetensors`, `bertopic`, and `sentence-transformers` in your deployment package, whether it targets AWS Lambda or a FastAPI service.
	- Memory Optimization:
	Use safetensors for a reduced memory footprint and faster inference.
	- Scaling Considerations:
	Load the model at cold start and reuse it for subsequent requests to efficiently scale in serverless environments.
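
The cold-start advice above can be sketched as a cached loader shared by every invocation in the same container. This is a minimal sketch for an AWS Lambda-style handler. The `MODEL_DIR` environment variable, the request shape (`{"docs": [...]}` in a JSON body), and the `model_factory` parameter are assumptions made for illustration.

```python
import functools
import json
import os

# Hypothetical location of the saved model directory inside the package.
MODEL_DIR = os.environ.get("MODEL_DIR", "path/to/model_dir")


@functools.lru_cache(maxsize=1)
def get_model():
    """Load the model on first use; later calls return the cached object."""
    from bertopic import BERTopic  # imported lazily to keep module import cheap

    return BERTopic.load(
        MODEL_DIR,
        embedding_model="sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2",
    )


def handler(event, context, model_factory=get_model):
    """Lambda-style entry point; model_factory is injectable for testing."""
    docs = json.loads(event["body"])["docs"]
    topics, _ = model_factory().transform(docs)
    body = {"topics": [int(topic) for topic in topics]}
    return {"statusCode": 200, "body": json.dumps(body)}
```

Because the cache lives at module level, only the first request in a fresh container pays the loading cost. Warm invocations reuse the same model object.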

	## Limitations

	- Variable Topic Coherence:
	Coherence may vary by language.
	- Dataset Biases:
	The model’s performance may be influenced by biases in the training data.
	- Latency Constraints:
	Not suited to real-time applications that need responses in under 50 ms.
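
Whether a given deployment fits a latency budget is best checked by measurement. The sketch below times any callable and reports median and 95th-percentile latency in milliseconds. The `measure_latency_ms` helper and its default run counts are illustrative assumptions.

```python
import math
import statistics
import time


def measure_latency_ms(fn, runs=20, warmup=2):
    """Time fn() over several runs and return p50/p95 latency in ms."""
    for _ in range(warmup):
        fn()  # warm-up calls are not recorded
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    samples.sort()
    # Nearest-rank 95th percentile.
    p95_index = max(0, math.ceil(0.95 * len(samples)) - 1)
    return {"p50": statistics.median(samples), "p95": samples[p95_index]}


# Usage with a loaded model (assumes `model` and `docs` exist):
# print(measure_latency_ms(lambda: model.transform(docs)))
```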

	## License

	[Insert License Here]

	## Citation

	```bibtex
	@inproceedings{your_citation,
	title={BERTopic Model for Multilingual Tourism Feedback},
	author={Paul Andre D. Tadiar},
	year={2025}
	}
	```

	---

	For inquiries or contributions, please open an issue on the Hugging Face repository.

	---