---
base_model:
- distilbert/distilbert-base-multilingual-cased
pipeline_tag: text-classification
---

# BERTopic Model for Serverless Inference

A BERTopic model for multilingual topic modeling, specifically tailored for tourism feedback analysis. This model is serialized in **safetensors** format for optimized loading and is designed for **serverless inference** in cloud environments. It is a key component of our thesis project, **"Enhancing Tourist Destination Management through a Multilingual Web-Based Tourist Survey System with Machine Learning."**

## Overview

This model leverages BERTopic to extract meaningful topics from a multilingual dataset of tourist reviews. It supports eight languages, making it a versatile tool for understanding diverse customer feedback. Optimized for serverless architectures, the model is well suited for deployment on platforms such as AWS Lambda and Cloud Functions, or behind a FastAPI service.

> **Thesis Context:**
> As part of the thesis project, this model works in tandem with a DistilBERT-based sentiment analyzer to streamline tourism feedback collection. The combined approach overcomes language barriers and the inefficiencies of traditional survey methods, enhancing data-driven decision-making for tourism management.

## Key Features

- **Multilingual Support:** Supports 8 languages: English, Spanish, French, Chinese, Japanese, German, Korean, and Tagalog.
- **Pre-trained & Fine-tuned:** Trained on 160k synthetic and real tourist reviews, capturing both emotional tone and topical diversity.
- **Optimized Serialization:** Uses safetensors for faster and safer model loading.
- **Serverless Inference Ready:** Tailored for deployment on serverless architectures such as AWS Lambda and Cloud Functions.

## Model Architecture & Details

- **Architecture:** BERTopic
- **Embedding Model:** `paraphrase-multilingual-MiniLM-L12-v2`
- **Dimensionality Reduction:** UMAP
- **Clustering Algorithm:** HDBSCAN
- **Vectorizer:** CountVectorizer with TF-IDF preprocessing
- **Dataset:** 160k synthetic and real tourist reviews, categorized by emotional tone and topic

## Model Performance Metrics

- **Topic Coherence Score:** *XX.XX* (placeholder)
- **Diversity Score:** *XX.XX* (placeholder)
- **Sentiment Analysis Accuracy:** *≥ 70%* (achieved by the complementary sentiment analysis system)

## How to Use

### Loading the Model

```python
from bertopic import BERTopic

# Load the BERTopic model. A model saved with serialization="safetensors"
# is written as a directory, so point `load` at that directory rather
# than at a single .safetensors file.
model = BERTopic.load("path/to/model")
```

### Performing Topic Modeling

```python
# Sample documents for topic modeling
docs = [
    "The hotel had a great view of the beach and excellent service.",
    "Transportation was a bit difficult to find late at night.",
]

# Assign each document to its most likely topic
topics, probs = model.transform(docs)
print("Topics:", topics)
print("Probabilities:", probs)
```

## Deployment Guide

- **Serverless Platforms:** Include dependencies such as `safetensors`, `bertopic`, and `sentence-transformers` in the deployment package for platforms like AWS Lambda, or in the container image behind a FastAPI service.
- **Memory Optimization:** Use safetensors serialization for a reduced memory footprint and faster model loading.
- **Scaling Considerations:** Load the model once at cold start and reuse it for subsequent requests so the service scales efficiently in serverless environments.
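The cold-start pattern above can be sketched as follows. Here `load_model` is a stand-in for the real `BERTopic.load("path/to/model")` call, and the `handler` signature is hypothetical, not a specific platform's API.

```python
from functools import lru_cache

@lru_cache(maxsize=1)
def load_model():
    # Stand-in for the expensive BERTopic.load call. lru_cache ensures
    # the load runs once per container, at cold start, and the cached
    # object is returned on every warm invocation.
    return {"name": "bertopic-stand-in"}

def handler(event, context=None):
    # Hypothetical serverless entry point: the cached model is reused
    # across requests instead of being reloaded each time.
    model = load_model()
    return {"model": model["name"], "docs": len(event.get("docs", []))}
```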

## Limitations

- **Variable Topic Coherence:** Topic coherence may vary across languages.
- **Dataset Biases:** The model's performance may be influenced by biases in the training data.
- **Latency Constraints:** Not ideal for real-time, low-latency applications (under ~50 ms response time).

## License

[Insert License Here]

## Citation

```bibtex
@inproceedings{your_citation,
  title={BERTopic Model for Multilingual Tourism Feedback},
  author={Paul Andre D. Tadiar},
  year={2025}
}
```

---

*For inquiries or contributions, please open an issue on the Hugging Face repository.*