CCRss committed on
Commit
d0f107b
·
verified ·
1 Parent(s): 99691a8

Create README.md

Files changed (1)
  1. README.md +93 -0
README.md ADDED
@@ -0,0 +1,93 @@
+ ---
+ license: mit
+ language:
+ - en
+ tags:
+ - topic-modeling
+ ---
+
+ # Top2Vec Scientific Texts Model
+
+ This repository hosts the `top2vec_scientific_texts` model, a specialized Top2Vec model trained on scientific texts for topic modeling and semantic search.
+
+ ## Model Overview
+
+ The `top2vec_scientific_texts` model is built for analyzing scientific literature. It leverages the Universal Sentence Encoder for embedding texts and uses Top2Vec for topic modeling.
+
+ ### Key Features:
+
+ - **Domain-Specific:** Tailored for scientific texts.
+ - **Base Model:** Utilizes the Universal Sentence Encoder for effective text embeddings.
+ - **Topic Modeling:** Employs Top2Vec for discovering topics in scientific documents.
+
+ ## Installation
+
+ To use the model, install the following dependencies:
+
+ ```bash
+ pip install top2vec
+ pip install "top2vec[sentence_encoders]"
+ pip install tensorflow==2.8.0
+ pip install tensorflow-probability==0.16.0
+ ```
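After installing, a quick check can confirm that the packages are visible to the interpreter. The sketch below is an illustration and not part of the original repository: the helper name `installed_versions` is hypothetical, and it relies only on the standard library's `importlib.metadata`.

```python
from importlib import metadata


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # The distribution is not installed in this environment.
            versions[name] = None
    return versions


if __name__ == "__main__":
    # Distribution names taken from the pip commands above.
    required = ["top2vec", "tensorflow", "tensorflow-probability"]
    for name, version in installed_versions(required).items():
        print(f"{name}: {version or 'NOT INSTALLED'}")
```

If any line reports `NOT INSTALLED`, rerun the corresponding `pip install` command in the same environment.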
+
+ ## Usage
+
+ The example below trains a Top2Vec model on your own documents using the Universal Sentence Encoder:
+
+ ```python
+ from top2vec import Top2Vec
+
+ # Load your documents
+ docs = ["Document 1 text", "Document 2 text", ...]
+
+ # Initialize the Top2Vec model
+ model = Top2Vec(
+     documents=docs,
+     speed='learn',
+     workers=80,  # number of worker threads; adjust to your CPU core count
+     embedding_model='universal-sentence-encoder',
+     umap_args={'n_neighbors': 15, 'n_components': 5, 'metric': 'cosine', 'min_dist': 0.0, 'random_state': 42},
+     hdbscan_args={'min_cluster_size': 15, 'metric': 'euclidean', 'cluster_selection_method': 'eom'}
+ )
+ ```
+
+ ### Save the Model
+
+ ```python
+ model.save('top2vec_scientific_texts_model')
+ ```
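A saved model can later be reloaded and inspected without retraining. The following is a minimal sketch, assuming the file name used in the `model.save` call above. The helper `summarize_topics` and its parameters are hypothetical names introduced for illustration; `Top2Vec.load`, `get_num_topics`, `get_topic_sizes`, and `get_topics` are methods of the Top2Vec library.

```python
def summarize_topics(model_path="top2vec_scientific_texts_model",
                     num_topics=5, words_per_topic=10):
    """Load a saved Top2Vec model and return (topic_num, size, top_words) tuples."""
    # Imported lazily so this function can be defined without top2vec installed.
    from top2vec import Top2Vec

    model = Top2Vec.load(model_path)

    # Never request more topics than the model actually contains.
    num_topics = min(num_topics, model.get_num_topics())

    # Both calls return topics ordered from largest to smallest.
    topic_sizes, _ = model.get_topic_sizes()
    topic_words, _word_scores, topic_nums = model.get_topics(num_topics)

    return [
        (int(topic_nums[i]), int(topic_sizes[i]), list(topic_words[i][:words_per_topic]))
        for i in range(num_topics)
    ]
```

Calling `summarize_topics()` requires the dependencies from the Installation section and a saved model file at the given path.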
+
+ ## Dataset
+
+ The model was trained on a dataset of scientific abstracts sourced from [arXiv](https://arxiv.org/). The dataset covers a range of topics within the field of computer science from 2010 to 2024.
+
+ The dataset is available on Hugging Face as [arxiv_papers_cs](https://huggingface.co/datasets/CCRss/arxiv_papers_cs).
+
+ ## Use Cases
+
+ The `top2vec_scientific_texts` model can be used for various purposes, including:
+
+ - **Topic Discovery:** Identify the main topics within a collection of scientific texts.
+ - **Semantic Search:** Find documents that are semantically similar to a query text.
+ - **Trend Analysis:** Analyze the evolution of topics over time.
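To make the semantic search use case concrete, the sketch below ranks documents by cosine similarity between a query vector and document vectors. It is a conceptual illustration only: the three-dimensional vectors and document names are made up, and the helpers `cosine_similarity` and `rank_documents` are hypothetical, not part of Top2Vec. Real embeddings from the Universal Sentence Encoder have many more dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_documents(query_vec, doc_vecs, top_k=3):
    """Return the top_k (doc_id, score) pairs, most similar first."""
    scored = [(doc_id, cosine_similarity(query_vec, vec)) for doc_id, vec in doc_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Made-up toy vectors standing in for real document embeddings.
toy_vectors = {
    "uav_disaster_response": [0.9, 0.1, 0.0],
    "graph_neural_networks": [0.1, 0.9, 0.2],
    "drone_path_planning": [0.8, 0.2, 0.1],
}
query = [1.0, 0.0, 0.0]
ranking = rank_documents(query, toy_vectors, top_k=2)
for doc_id, score in ranking:
    print(f"{doc_id}: {score:.3f}")
```

With a trained model, Top2Vec exposes comparable functionality through methods such as `query_documents` and `search_documents_by_keywords`, so a hand-written ranking like this is not needed in practice.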
+
+ ## Examples
+
+ Here are some examples of the model's output for the thematic group "UAV in Disasters and Emergency":
+
+ ### Trend Analysis for "UAV in Disasters and Emergency"
+
+ ![Trend Analysis](path/to/trend_analysis_disasters_emergency.png)
+
+ This graph shows the trend of interest in the use of UAVs in disaster and emergency situations over time.
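The README does not state how the trend graph was computed. One plausible approach, sketched here under that assumption, is to measure the share of each year's papers assigned to a topic. The helper `topic_share_by_year`, the topic labels, and the records are all hypothetical and chosen only to illustrate the calculation.

```python
from collections import Counter, defaultdict


def topic_share_by_year(records):
    """Given (year, topic_label) pairs, return {topic: {year: share of that year's papers}}."""
    totals = Counter()                # papers per year, across all topics
    per_topic = defaultdict(Counter)  # papers per year for each topic
    for year, topic in records:
        totals[year] += 1
        per_topic[topic][year] += 1
    return {
        topic: {year: counts[year] / totals[year] for year in sorted(counts)}
        for topic, counts in per_topic.items()
    }


# Made-up records: each tuple is (publication year, assigned topic label).
toy_records = [
    (2020, "uav_disaster"), (2020, "other"), (2020, "other"), (2020, "other"),
    (2021, "uav_disaster"), (2021, "uav_disaster"), (2021, "other"), (2021, "other"),
]
shares = topic_share_by_year(toy_records)
print(shares["uav_disaster"])
```

Using shares rather than raw counts keeps the trend from being dominated by overall growth in the number of papers per year.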
+
+ ### Key Metrics Table
+
+ ## Contributions
+
+ We welcome contributions to the `top2vec_scientific_texts` model. If you have suggestions or improvements, or encounter any issues, please open an issue or submit a pull request.
+
+ ## License
+
+ This project is licensed under the MIT License.