---
base_model:
- distilbert/distilbert-base-multilingual-cased
pipeline_tag: text-classification
---

# BERTopic Model for Serverless Inference

A BERTopic model for multilingual topic modeling, specifically tailored for tourism feedback analysis. This model is serialized in **safetensors** format for optimized loading and is designed for **serverless inference** in cloud environments. It is a key component of our thesis project, **"Enhancing Tourist Destination Management through a Multilingual Web-Based Tourist Survey System with Machine Learning."**

## Overview

This model leverages BERTopic to extract meaningful topics from a multilingual dataset of tourist reviews. It supports eight languages, making it a versatile tool for understanding diverse customer feedback. Optimized for serverless architectures, the model is well suited for deployment on platforms such as AWS Lambda and Cloud Functions, or behind a FastAPI service.

> **Thesis Context:**
> As part of the thesis project, this model works in tandem with a DistilBERT-based sentiment analyzer to streamline tourism feedback collection. The combined approach overcomes language barriers and the inefficiencies of traditional survey methods, enhancing data-driven decision-making for tourism management.

## Key Features

- **Multilingual Support:** Supports 8 languages: English, Spanish, French, Chinese, Japanese, German, Korean, and Tagalog.
- **Pre-trained & Fine-tuned:** Trained on 160k synthetic and real tourist reviews, capturing both emotional tone and topical diversity.
- **Optimized Serialization:** Uses safetensors for faster and safer model loading.
- **Serverless Inference Ready:** Tailored for deployment on serverless architectures such as AWS Lambda and Cloud Functions.

## Model Architecture & Details

- **Architecture:** BERTopic
- **Embedding Model:** `paraphrase-multilingual-MiniLM-L12-v2`
- **Dimensionality Reduction:** UMAP
- **Clustering Algorithm:** HDBSCAN
- **Vectorizer:** CountVectorizer with TF-IDF preprocessing
- **Dataset:** 160k synthetic and real tourist reviews, categorized by emotional tone and topic

## Model Performance Metrics

- **Topic Coherence Score:** *XX.XX* (placeholder)
- **Diversity Score:** *XX.XX* (placeholder)
- **Sentiment Analysis Accuracy:** *≥ 70%* (achieved by the complementary sentiment analysis system)

## How to Use

### Loading the Model

```python
from bertopic import BERTopic

# Load the BERTopic model. A model saved with serialization="safetensors"
# is written as a directory, so point `load` at that directory rather
# than at a single .safetensors file.
model = BERTopic.load("path/to/model")
```

### Performing Topic Modeling

```python
# Sample documents for topic modeling
docs = [
    "The hotel had a great view of the beach and excellent service.",
    "Transportation was a bit difficult to find late at night.",
]

# Assign each document to its most likely topic
topics, probs = model.transform(docs)
print("Topics:", topics)
print("Probabilities:", probs)
```

## Deployment Guide

- **Serverless Platforms:** Include dependencies such as `safetensors`, `bertopic`, and `sentence-transformers` in the deployment package for platforms like AWS Lambda, or in the container image behind a FastAPI service.
- **Memory Optimization:** Use safetensors serialization for a reduced memory footprint and faster model loading.
- **Scaling Considerations:** Load the model once at cold start and reuse it for subsequent requests so the service scales efficiently in serverless environments.
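The cold-start pattern above can be sketched as follows. Here `load_model` is a stand-in for the real `BERTopic.load("path/to/model")` call, and the `handler` signature is hypothetical, not a specific platform's API.

```python
from functools import lru_cache

@lru_cache(maxsize=1)
def load_model():
    # Stand-in for the expensive BERTopic.load call. lru_cache ensures
    # the load runs once per container, at cold start, and the cached
    # object is returned on every warm invocation.
    return {"name": "bertopic-stand-in"}

def handler(event, context=None):
    # Hypothetical serverless entry point: the cached model is reused
    # across requests instead of being reloaded each time.
    model = load_model()
    return {"model": model["name"], "docs": len(event.get("docs", []))}
```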

## Limitations

- **Variable Topic Coherence:** Topic coherence may vary across languages.
- **Dataset Biases:** The model's performance may be influenced by biases in the training data.
- **Latency Constraints:** Not ideal for real-time, low-latency applications (under ~50 ms response time).

## License

[Insert License Here]

## Citation

```bibtex
@inproceedings{your_citation,
  title={BERTopic Model for Multilingual Tourism Feedback},
  author={Paul Andre D. Tadiar},
  year={2025}
}
```

---

*For inquiries or contributions, please open an issue on the Hugging Face repository.*