SCANSKY committed on
Commit dd5af4c · verified
1 Parent(s): e4797c9

Update README.md

Files changed (1)
  1. README.md +78 -45
README.md CHANGED
@@ -14,76 +14,109 @@ base_model:
14
  - distilbert/distilbert-base-multilingual-cased
15
  pipeline_tag: text-classification
16
  ---
17
- # Model Card: BERTopic Model for Serverless Inference
18
-
19
- ## Model Description
20
- This is a BERTopic model trained for **topic modeling** on a multilingual dataset. The model is serialized in **safetensors** format for optimized loading and is designed for **serverless inference** in cloud environments.
21
-
22
- ### Features
23
- - **Multilingual support** (Supports 8 languages)
24
- - **Pre-trained and fine-tuned on synthetic and real tourist reviews**
25
- - **Safetensors format for faster and safer model loading**
26
- - **Optimized for serverless architectures (FastAPI, AWS Lambda, Cloud Functions, etc.)**
27
-
28
- ## Intended Use
29
- - **Tourism feedback analysis**
30
- - **Customer review topic modeling**
31
- - **Data-driven decision-making for tourism offices**
32
- - **Research on multilingual topic modeling**
33
-
34
- ## Model Details
35
- - **Architecture**: BERTopic
36
- - **Embedding Model**: `paraphrase-multilingual-MiniLM-L12-v2`
37
- - **Dimensionality Reduction**: UMAP
38
- - **Clustering Algorithm**: HDBSCAN
39
- - **Vectorizer**: CountVectorizer (TF-IDF preprocessing)
40
- - **Languages**: English, Spanish, French, Chinese, Japanese, German, Korean, Tagalog
41
- - **Dataset**: 160k synthetic and real tourist reviews categorized by emotional tone and topics
42
-
43
- ## Model Performance
44
- - **Topic Coherence Score**: *XX.XX*
45
- - **Diversity Score**: *XX.XX*
46
- - **Sentiment Analysis Accuracy**: *≥ 70%*
 
 
 
 
 
 
 
 
 
 
47
 
48
  ## How to Use
49
- ### Load the Model:
 
 
50
  ```python
51
  from bertopic import BERTopic
52
  from safetensors.torch import load_file
53
 
54
- # Load model
55
  model = BERTopic.load("path/to/model.safetensors")
56
  ```
57
- ### Perform Topic Modeling:
 
 
58
  ```python
59
- docs = ["The hotel had a great view of the beach and excellent service.",
60
- "Transportation was a bit difficult to find late at night."]
 
 
 
61
 
 
62
  topics, probs = model.transform(docs)
63
- print(topics)
 
64
  ```
65
 
66
  ## Deployment Guide
67
- - **AWS Lambda / FastAPI**: Ensure `safetensors`, `bertopic`, and `sentence-transformers` are included in the dependencies.
68
- - **Memory Optimization**: Use `safetensors` for faster inference and reduced memory footprint.
69
- - **Serverless Scaling**: Load the model in memory at cold start and reuse for subsequent requests.
 
 
 
 
70
 
71
  ## Limitations
72
- - **Topic coherence may vary by language**
73
- - **Sensitive to dataset biases**
74
- - **Not suitable for real-time low-latency applications (<50ms response time)**
 
 
 
 
75
 
76
  ## License
 
77
  [Insert License Here]
78
 
79
  ## Citation
80
- ```
 
81
  @inproceedings{your_citation,
82
  title={BERTopic Model for Multilingual Tourism Feedback},
83
- author={Your Name},
84
  year={2025}
85
  }
86
  ```
87
 
88
  ---
89
- *For inquiries or contributions, please open an issue on the Hugging Face repository.*
 
 
 
 
 
14
  - distilbert/distilbert-base-multilingual-cased
15
  pipeline_tag: text-classification
16
  ---
18
+
19
+ ---
20
+
21
+ # BERTopic Model for Serverless Inference
22
+
23
+ A BERTopic model for multilingual topic modeling, specifically tailored for tourism feedback analysis. This model is serialized in **safetensors** format for optimized loading and is designed for **serverless inference** in cloud environments. It is a key component of our thesis project, **"Enhancing Tourist Destination Management through a Multilingual Web-Based Tourist Survey System with Machine Learning."**
24
+
25
+ ## Overview
26
+
27
+ This model uses BERTopic to extract topics from a multilingual dataset of tourist reviews. It supports eight languages, which makes it suitable for analyzing feedback from diverse visitors. It is packaged for serverless deployment, for example on AWS Lambda or Cloud Functions, or behind a FastAPI service.
28
+
29
+ > **Thesis Context:**
30
+ > As part of the thesis project, this model works alongside a DistilBERT-based sentiment analyzer to collect and analyze tourist feedback. The combined approach addresses the language barriers and inefficiencies of traditional survey methods and supports data-driven decision-making in tourism management.
31
+
32
+ ## Key Features
33
+
34
+ - **Multilingual Support:**
35
+ Supports 8 languages: English, Spanish, French, Chinese, Japanese, German, Korean, and Tagalog.
36
+ - **Pre-trained & Fine-tuned:**
37
+ Trained on 160k synthetic and real tourist reviews, capturing both emotional tone and topical diversity.
38
+ - **Optimized Serialization:**
39
+ Uses safetensors for faster and safer model loading.
40
+ - **Serverless Inference Ready:**
41
+ Packaged for serverless deployment, for example on AWS Lambda or Cloud Functions, or behind a FastAPI service.
42
+
43
+ ## Model Architecture & Details
44
+
45
+ - **Architecture:** BERTopic
46
+ - **Embedding Model:** `paraphrase-multilingual-MiniLM-L12-v2`
47
+ - **Dimensionality Reduction:** UMAP
48
+ - **Clustering Algorithm:** HDBSCAN
49
+ - **Vectorizer:** CountVectorizer, followed by BERTopic's class-based TF-IDF (c-TF-IDF) weighting
50
+ - **Dataset:** 160k synthetic and real tourist reviews categorized by emotional tone and topics
51
+
52
+ ## Model Performance Metrics
53
+
54
+ - **Topic Coherence Score:** *XX.XX* (placeholder)
55
+ - **Diversity Score:** *XX.XX* (placeholder)
56
+ - **Sentiment Analysis Accuracy:** *≥ 70%* (as part of the complementary system)
57
 
58
  ## How to Use
59
+
60
+ ### Loading the Model
61
+
62
  ```python
63
  from bertopic import BERTopic
64
  from safetensors.torch import load_file
65
 
66
+ # Load the BERTopic model (with safetensors serialization, the path is the saved model directory)
67
  model = BERTopic.load("path/to/model.safetensors")
68
  ```
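
A note on the path used above: when a model is saved with safetensors serialization, BERTopic writes a directory rather than a single file, and `BERTopic.load` is given that directory. The sketch below assumes a local directory and assumes the embedding model listed in this card should be reattached at load time. The helper name `load_topic_model` and the default identifier string are illustrative and not part of the repository.

```python
from pathlib import Path

# Assumption: sentence-transformers identifier for the embedding model named in this card.
DEFAULT_EMBEDDING = "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2"


def load_topic_model(model_dir, embedding_model=DEFAULT_EMBEDDING):
    """Load a BERTopic model from a local directory (hypothetical helper)."""
    path = Path(model_dir)
    # A single file would be treated as a pickle by BERTopic.load, so this
    # helper insists on the directory layout used by safetensors serialization.
    if not path.is_dir():
        raise NotADirectoryError(f"Expected a saved model directory, got: {model_dir}")

    # Imported lazily so the path check above works without the dependency installed.
    from bertopic import BERTopic

    return BERTopic.load(str(path), embedding_model=embedding_model)
```

This helper deliberately rejects anything that is not a local directory; loading from a Hugging Face repository id would need a separate code path.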
69
+
70
+ ### Performing Topic Modeling
71
+
72
  ```python
73
+ # Sample documents for topic modeling
74
+ docs = [
75
+ "The hotel had a great view of the beach and excellent service.",
76
+ "Transportation was a bit difficult to find late at night."
77
+ ]
78
 
79
+ # Extract topics from the documents
80
  topics, probs = model.transform(docs)
81
+ print("Topics:", topics)
82
+ print("Probabilities:", probs)
83
  ```
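
`transform` returns one topic id per input document, and BERTopic assigns `-1` to documents it treats as outliers. A minimal sketch of grouping documents by their assigned topic follows. `group_docs_by_topic` is a hypothetical helper, and the topic ids in the usage lines are made up for illustration; they are not output from this model.

```python
from collections import defaultdict

OUTLIER_TOPIC = -1  # BERTopic's id for documents not assigned to any topic


def group_docs_by_topic(docs, topics):
    """Map each topic id to the list of documents assigned to it."""
    if len(docs) != len(topics):
        raise ValueError("docs and topics must have the same length")
    grouped = defaultdict(list)
    for doc, topic in zip(docs, topics):
        # int() normalizes NumPy integers so the keys are plain Python ints.
        grouped[int(topic)].append(doc)
    return dict(grouped)


# Usage with made-up topic ids (illustrative only).
example_docs = [
    "The hotel had a great view of the beach and excellent service.",
    "Transportation was a bit difficult to find late at night.",
]
example_topics = [3, -1]
grouped = group_docs_by_topic(example_docs, example_topics)
outliers = grouped.get(OUTLIER_TOPIC, [])
```

Separating the outlier group makes it easier to see how much feedback the model could not place in any topic.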
84
 
85
  ## Deployment Guide
86
+
87
+ - **Serverless Platforms:**
88
+ Include `safetensors`, `bertopic`, and `sentence-transformers` in the deployment package, whether the target is AWS Lambda or a FastAPI service.
89
+ - **Memory Optimization:**
90
+ Use safetensors serialization for faster model loading and a smaller model artifact.
91
+ - **Scaling Considerations:**
92
+ Load the model at cold start and reuse it for subsequent requests to efficiently scale in serverless environments.
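
The cold-start advice above can be sketched as a loader that runs once per container and a handler that reuses the result. This is a sketch under assumptions: the names `make_cached_loader` and `make_handler`, the `{"docs": [...]}` event shape, and the response format are all hypothetical. The loader function is passed in, so in a real deployment it would be something like `BERTopic.load`.

```python
import functools


def make_cached_loader(load_fn, model_path):
    """Return a zero-argument function that loads the model on first use only."""

    @functools.lru_cache(maxsize=1)
    def get_model():
        # Runs once per process (the cold start); later calls return the cached model.
        return load_fn(model_path)

    return get_model


def make_handler(get_model):
    """Build a Lambda-style handler that reuses the cached model across requests."""

    def handler(event, context=None):
        docs = event.get("docs", [])
        if not docs:
            # Skip loading and inference when there is nothing to classify.
            return {"topics": []}
        topics, _probs = get_model().transform(docs)
        return {"topics": [int(t) for t in topics]}

    return handler
```

Passing the loader in as an argument keeps the handler testable with a stub model and keeps the heavy import out of the request path.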
93
 
94
  ## Limitations
95
+
96
+ - **Variable Topic Coherence:**
97
+ Coherence may vary by language.
98
+ - **Dataset Biases:**
99
+ The model’s performance may be influenced by biases in the training data.
100
+ - **Latency Constraints:**
101
+ Not ideal for real-time low-latency applications (<50ms response time).
102
 
103
  ## License
104
+
105
  [Insert License Here]
106
 
107
  ## Citation
108
+
109
+ ```bibtex
110
  @inproceedings{your_citation,
111
  title={BERTopic Model for Multilingual Tourism Feedback},
112
+ author={Paul Andre D. Tadiar},
113
  year={2025}
114
  }
115
  ```
116
 
117
  ---
118
+
119
+ *For inquiries or contributions, please open an issue on the Hugging Face repository.*
120
+
121
+ ---
122
+