Sunny6727 committed on
Commit
acc7268
·
verified ·
1 Parent(s): 9aa0736

Create README.md

Files changed (1)
  1. README.md +133 -0
README.md ADDED
@@ -0,0 +1,133 @@
1
+ ---
2
+ language:
3
+ - en
4
+ tags:
5
+ - word2vec
6
+ - embeddings
7
+ - nlp
8
+ - sports
9
+ - outdoors
10
+ - amazon-reviews
11
+ metrics:
12
+ - semantic similarity
13
+ ---
14
+
15
+ # Word2Vec Model for Amazon Sports & Outdoors Reviews
16
+
17
+ ## Model Description
18
+
19
+ This is a Word2Vec model trained on Amazon product reviews from the Sports & Outdoors category. The model was trained using the Gensim library on 296,337 reviews to learn word embeddings that capture semantic relationships between words in the context of sports and outdoor product reviews.
20
+
21
+ - **Model type**: Word2Vec (reported as Skip-gram; note that Gensim trains CBOW by default unless `sg=1` is set)
22
+ - **Training data**: Amazon Sports & Outdoors reviews (296,337 reviews)
23
+ - **Vocabulary size**: Not reported; determined by `min_count=2`, so every word that appears at least twice is kept
24
+ - **Vector dimension**: 100 (Gensim default)
25
+ - **Window size**: 10 words
26
+
27
+ ## Intended Uses & Limitations
28
+
29
+ ### Intended Use
30
+ This model is designed for:
31
+ - Semantic similarity tasks for sports and outdoor-related vocabulary
32
+ - Product recommendation systems
33
+ - Review analysis and sentiment tasks
34
+ - Keyword expansion and related term discovery
35
+ - Educational and research purposes
36
+
37
+ ### Limitations
38
+ - The model is specialized for the sports and outdoors domain
39
+ - Performance on vocabulary outside this domain may be limited
40
+ - Inherits any biases present in the Amazon review data
41
+ - May not perform well for very recent terminology not present in the training data
42
+
43
+ ## How to Use
44
+
45
+ ### Installation
46
+ ```bash
47
+ pip install gensim pandas
48
+ ```
49
+
50
+ ### Loading the Model
51
+ ```python
52
+ import gensim
53
+
54
+ # Load the model
55
+ model = gensim.models.Word2Vec.load("word2vec_model.model")
56
+ ```
57
+
58
+ ### Getting Word Similarities
59
+ ```python
60
+ # Find words similar to "good"
61
+ similar_words = model.wv.most_similar("good", topn=5)
62
+ print(similar_words)
63
+
64
+ # Find words similar to "slow"
65
+ similar_words = model.wv.most_similar("slow", topn=5)
66
+ print(similar_words)
67
+ ```
68
+
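Because the vocabulary only contains tokens seen at least twice in the training reviews, looking up an unseen word raises a `KeyError`. A minimal sketch of a guarded lookup follows. The helper name `safe_most_similar` is hypothetical (it is not part of Gensim), and the sketch assumes `wv` is a Gensim 4.x `KeyedVectors` object such as `model.wv`.

```python
from typing import List, Tuple


def safe_most_similar(wv, word: str, topn: int = 5) -> List[Tuple[str, float]]:
    """Return the nearest neighbors of `word`, or an empty list if it is out of vocabulary.

    `wv` is assumed to behave like gensim's KeyedVectors: it supports the
    `in` operator and exposes `most_similar(word, topn=...)`.
    """
    if word not in wv:
        return []
    return wv.most_similar(word, topn=topn)
```

With a loaded model this would be called as `safe_most_similar(model.wv, "good", topn=5)`.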
69
+ ### Additional Operations
70
+ ```python
71
+ # Get word vector
72
+ vector = model.wv['running']
73
+
74
+ # Calculate similarity between two words
75
+ similarity = model.wv.similarity('hiking', 'outdoors')
76
+
77
+ # Find odd one out (use single tokens; multi-word phrases are not in the vocabulary)
78
+ odd_one = model.wv.doesnt_match(['tent', 'backpack', 'basketball'])
79
+ ```
80
+
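`similarity` returns the cosine similarity of the two word vectors. As an illustration of that formula only (Gensim's own implementation works on NumPy arrays), a plain-Python sketch with a hypothetical helper name:

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)
```

Calling `cosine_similarity(model.wv['hiking'], model.wv['outdoors'])` should agree with `model.wv.similarity('hiking', 'outdoors')` up to floating-point error.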
81
+ ## Training Details
82
+
83
+ ### Training Data
84
+ The model was trained on the [Amazon Sports & Outdoors reviews dataset](https://snap.stanford.edu/data/amazon/productGraph/categoryFiles/reviews_Sports_and_Outdoors_5.json.gz), which contains 296,337 reviews with 9 columns each. The review text was preprocessed (lowercased and tokenized) with Gensim's `simple_preprocess` function.
85
+
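The preprocessing step described above can be sketched as follows. This is not the original training script: it assumes the file holds one JSON object per line and that the review body lives in a field named `reviewText`, and the helper name `iter_review_tokens` is hypothetical. Gensim's `simple_preprocess` is used unless another tokenizer is passed in.

```python
import gzip
import json
from typing import Callable, Iterator, List, Optional


def iter_review_tokens(
    path: str,
    tokenize: Optional[Callable[[str], List[str]]] = None,
    text_field: str = "reviewText",  # assumed field name
) -> Iterator[List[str]]:
    """Yield one token list per review from a gzipped JSON-lines file."""
    if tokenize is None:
        # Imported lazily so the helper can be used with a custom tokenizer.
        from gensim.utils import simple_preprocess
        tokenize = simple_preprocess
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            record = json.loads(line)
            # Reviews without the text field yield an empty token list.
            yield tokenize(record.get(text_field, ""))
```

For example, `sentences = list(iter_review_tokens("reviews_Sports_and_Outdoors_5.json.gz"))` would produce the tokenized corpus to pass to Word2Vec.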
86
+ ### Hyperparameters
87
+ - Window size: 10
88
+ - Minimum word count: 2
89
+ - Vector size: 100 (default)
90
+ - Training algorithm: Skip-gram (requires `sg=1`; Gensim's default, `sg=0`, is CBOW)
91
+ - Negative samples: 5 (default)
92
+ - Epochs: 5 (default)
93
+
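As a rough sketch of how the hyperparameters above map onto Gensim 4.x keyword arguments (this is not the original training script, and the helper names are hypothetical). The `sg` flag is left as an explicit parameter because Skip-gram requires `sg=1`, whereas Gensim's own default of `sg=0` selects CBOW.

```python
from typing import Any, Dict, Iterable, List


def word2vec_params(sg: int) -> Dict[str, Any]:
    """Keyword arguments mirroring the listed hyperparameters (Gensim 4.x names)."""
    return {
        "vector_size": 100,  # embedding dimension
        "window": 10,        # context window size
        "min_count": 2,      # drop words seen fewer than 2 times
        "sg": sg,            # 1 = Skip-gram, 0 = CBOW
        "negative": 5,       # negative samples
        "epochs": 5,         # passes over the corpus
    }


def train_word2vec(sentences: Iterable[List[str]], sg: int = 1):
    """Train a model on tokenized sentences; gensim is imported lazily."""
    from gensim.models import Word2Vec

    # Materialize the corpus so it can be iterated once per epoch.
    return Word2Vec(sentences=list(sentences), **word2vec_params(sg))
```

A trained model could then be stored with `model.save("word2vec_model.model")`, matching the filename used in the loading example.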
94
+ ## Evaluation
95
+
96
+ The model can be evaluated by examining the semantic relationships it captures. For example:
97
+ - It should find "excellent", "great", and "nice" similar to "good"
98
+ - Words such as "fast" and "quick" may rank among the nearest neighbors of "slow", because antonyms tend to occur in similar contexts
99
+ - It should maintain sports-specific relationships (e.g., "football" related to "soccer")
100
+
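One informal way to run the checks above is to count how many expected words show up among a query's nearest neighbors. The helper below is a hypothetical sketch; it assumes the `(word, score)` pairs returned by `model.wv.most_similar`.

```python
from typing import Iterable, List, Set, Tuple


def neighbor_hits(neighbors: Iterable[Tuple[str, float]], expected: Set[str]) -> List[str]:
    """Return the expected words that appear among the neighbors, in ranked order."""
    return [word for word, _score in neighbors if word in expected]
```

For instance, `neighbor_hits(model.wv.most_similar("good", topn=10), {"excellent", "great", "nice"})` lists which of the three expected words the model actually ranks near "good".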
101
+ ## Model Performance
102
+
103
+ No quantitative evaluation (such as accuracy on analogy tasks) is provided. Qualitative inspection of nearest neighbors is the only evidence offered that the model captures meaningful semantic relationships in the sports and outdoors domain.
104
+
105
+ ## Ethical Considerations
106
+
107
+ - The model may reflect biases present in the original Amazon reviews
108
+ - Should not be used for automated decision-making without human oversight
109
+ - Users should be aware that word embeddings can amplify societal biases
110
+
111
+ ## Citation
112
+
113
+ If you use this model in your research, please cite the original Amazon reviews dataset:
114
+
115
+ ```
116
+ Please cite one or both of the following if you use the data in any way:
117
+
118
+ Ups and downs: Modeling the visual evolution of fashion trends with one-class collaborative filtering
119
+ R. He, J. McAuley
120
+ WWW, 2016
122
+
123
+ Image-based recommendations on styles and substitutes
124
+ J. McAuley, C. Targett, J. Shi, A. van den Hengel
125
+ SIGIR, 2015
128
+ ```
129
+
130
+ ## License
131
+
132
+ The model is shared for research purposes. The original data follows Amazon's terms of use.