---
language:
- en
tags:
- word2vec
- embeddings
- nlp
- sports
- outdoors
- amazon-reviews
metrics:
- semantic similarity
---

# Word2Vec Model for Amazon Sports & Outdoors Reviews

## Model Description

This is a Word2Vec model trained with the Gensim library on 296,337 Amazon product reviews from the Sports & Outdoors category. Its word embeddings capture semantic relationships between words as they are used in sports and outdoor product reviews.

- **Model type**: Word2Vec (CBOW if `sg` was left at the Gensim default of 0; skip-gram requires `sg=1`)
- **Training data**: Amazon Sports & Outdoors reviews (296,337 reviews)
- **Vocabulary size**: Not reported; it is determined by `min_count=2`, which keeps only words that appear at least twice
- **Vector dimension**: 100 (Gensim default)
- **Window size**: 10 words

## Intended Uses & Limitations

### Intended Use
This model is designed for:
- Semantic similarity tasks for sports and outdoor-related vocabulary
- Product recommendation systems
- Review analysis and sentiment tasks
- Keyword expansion and related term discovery
- Educational and research purposes
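
For the keyword expansion use case, a minimal sketch might look like the following. It assumes `wv` is the loaded model's `model.wv`; the helper name `expand_keywords` and the `min_score` cutoff are illustrative and are not part of this model or of Gensim.

```python
def expand_keywords(wv, seeds, topn=5, min_score=0.5):
    """Return related terms for each seed word found in the vocabulary.

    `wv` is assumed to behave like Gensim's KeyedVectors (`model.wv`):
    it supports `word in wv` and `wv.most_similar(word, topn=...)`.
    """
    expanded = {}
    for seed in seeds:
        if seed not in wv:
            # Out-of-vocabulary seeds are skipped instead of raising KeyError.
            continue
        neighbors = wv.most_similar(seed, topn=topn)
        # Keep only neighbors whose similarity score reaches the cutoff.
        expanded[seed] = [word for word, score in neighbors if score >= min_score]
    return expanded
```

A call such as `expand_keywords(model.wv, ["tent", "kayak"])` would return a dictionary mapping each seed to its nearest neighbors. The 0.5 cutoff is arbitrary and is best tuned by inspecting real output.
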

### Limitations
- The model is specialized for the sports and outdoors domain
- Performance on vocabulary outside this domain may be limited
- Inherits any biases present in the Amazon review data
- Has no vector for words that are absent from the training data or appear fewer than twice in it, such as very recent terminology
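
One way to gauge whether a text falls within the model's domain is to measure how many of its tokens have a vector. The sketch below assumes `wv` is `model.wv` (any object that supports `token in wv` works); `vocabulary_coverage` is a hypothetical helper, not a Gensim function.

```python
def vocabulary_coverage(wv, tokens):
    """Return (coverage, missing) for a list of tokens.

    coverage is the fraction of tokens that have a vector in `wv`;
    missing is the sorted list of distinct tokens that do not.
    """
    tokens = list(tokens)
    if not tokens:
        return 0.0, []
    missing = sorted({t for t in tokens if t not in wv})
    known = sum(1 for t in tokens if t in wv)
    return known / len(tokens), missing
```

Tokens should be produced the same way as during training, for example with `gensim.utils.simple_preprocess`, before calling `vocabulary_coverage(model.wv, tokens)`. A low coverage value suggests the text lies outside the sports and outdoors vocabulary.
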

## How to Use

### Installation
```bash
pip install gensim pandas
```

### Loading the Model
```python
import gensim

# Load the model
model = gensim.models.Word2Vec.load("word2vec_model.model")
```
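
Once the model is loaded, its actual vocabulary size and training settings can be read back from the object. The following is a sketch that assumes Gensim 4.x attribute names; `describe_model` is an illustrative helper and not part of Gensim.

```python
def describe_model(model):
    """Collect basic facts about a loaded Word2Vec model in a dict."""
    return {
        # Number of distinct words that survived the min_count filter.
        "vocabulary_size": len(model.wv),
        "vector_size": model.wv.vector_size,
        # Training settings; getattr returns None if an attribute is
        # missing in a particular Gensim version.
        "window": getattr(model, "window", None),
        "min_count": getattr(model, "min_count", None),
        "sg": getattr(model, "sg", None),  # 0 = CBOW, 1 = skip-gram
        "epochs": getattr(model, "epochs", None),
    }
```

Calling `describe_model(model)` on the loaded model shows, among other things, whether it was trained as CBOW or skip-gram.
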

### Getting Word Similarities
```python
# Find words similar to "good"
similar_words = model.wv.most_similar("good", topn=5)
print(similar_words)

# Find words similar to "slow"
similar_words = model.wv.most_similar("slow", topn=5)
print(similar_words)
```

### Additional Operations
```python
# Get word vector
vector = model.wv['running']

# Calculate similarity between two words
similarity = model.wv.similarity('hiking', 'outdoors')

# Find odd one out (single tokens only: simple_preprocess splits
# phrases such as "sleeping bag" into separate words)
odd_one = model.wv.doesnt_match(['tent', 'backpack', 'basketball'])
```
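
`model.wv.similarity` returns the cosine similarity of the two word vectors. As an illustration of that formula only, a plain-Python version could look like this; the vectors are made up and far shorter than the model's 100 dimensions.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy 3-dimensional vectors, not taken from the model.
print(cosine_similarity([1.0, 2.0, 0.0], [2.0, 4.0, 0.0]))  # parallel vectors
print(cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0]))  # orthogonal vectors
```

Parallel vectors score close to 1 and orthogonal vectors score 0, so a higher value means the two words were used in more similar contexts.
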

## Training Details

### Training Data
The model was trained on the [Amazon Sports & Outdoors reviews dataset](https://snap.stanford.edu/data/amazon/productGraph/categoryFiles/reviews_Sports_and_Outdoors_5.json.gz), which contains 296,337 reviews with 9 columns each. The text was preprocessed using Gensim's `simple_preprocess` function, which lowercases the text and splits it into tokens.
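
A preprocessing step along these lines could feed the training run. This is a sketch, not the original training script: it assumes the file stores one JSON object per line and that the review body lives in a `reviewText` field; `iter_review_texts` is a hypothetical helper name.

```python
import gzip
import json

def iter_review_texts(path):
    """Yield the review text of each record in a gzipped JSON-lines file."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            record = json.loads(line)
            # "reviewText" is assumed to be the field holding the review body.
            text = record.get("reviewText")
            if text:
                yield text
```

Each yielded string can then be passed to `gensim.utils.simple_preprocess` to obtain the token list used as one training sentence.
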

### Hyperparameters
- Window size: 10
- Minimum word count: 2
- Vector size: 100 (default)
- Training algorithm: Gensim default (the default is CBOW, `sg=0`; skip-gram requires `sg=1`)
- Negative samples: 5 (default)
- Epochs: 5 (default)
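
The settings listed here map onto Gensim's `Word2Vec` constructor roughly as follows. This is a sketch rather than the original training script: parameter names follow Gensim 4.x (older releases use `size` and `iter`), and `W2V_PARAMS` and `train_word2vec` are hypothetical names.

```python
# Hyperparameters from the list above; anything not named here is left
# at Gensim's defaults.
W2V_PARAMS = {
    "vector_size": 100,
    "window": 10,
    "min_count": 2,
    "negative": 5,
    "epochs": 5,
}

def train_word2vec(sentences, params=W2V_PARAMS):
    """Train a Word2Vec model on a list of token lists."""
    # Imported lazily so the settings can be inspected without Gensim installed.
    from gensim.models import Word2Vec

    # `sg` is not set here, so Gensim's default (sg=0, CBOW) applies;
    # add "sg": 1 to params to train skip-gram instead.
    # `sentences` must be re-iterable (for example a list), because
    # Gensim passes over the corpus more than once.
    return Word2Vec(sentences=sentences, **params)
```
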

## Evaluation

The model can be evaluated by examining the semantic relationships it captures. For example:
- The nearest neighbors of "good" should include words such as "excellent", "great", and "nice"
- The nearest neighbors of "slow" are likely to include antonyms such as "fast" and "quick", because antonyms occur in similar contexts
- It should maintain sports-specific relationships (e.g., "football" related to "soccer")
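
Checks like these can be scripted. A minimal sketch, assuming `wv` is `model.wv`; `neighbor_hits` is an illustrative helper and the default of `topn=10` is arbitrary.

```python
def neighbor_hits(wv, word, expected, topn=10):
    """Return the expected words that appear among the top-n neighbors of `word`."""
    neighbors = {w for w, _score in wv.most_similar(word, topn=topn)}
    return [w for w in expected if w in neighbors]
```

For instance, `neighbor_hits(model.wv, "good", ["excellent", "great", "nice"])` lists which of the three expected words actually appear among the neighbors. No particular result is guaranteed for this model.
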

## Model Performance

No quantitative evaluation metrics, such as accuracy on analogy tasks, are provided. Qualitatively, the model demonstrates meaningful semantic relationships for vocabulary in the sports and outdoors domain.

## Ethical Considerations

- The model may reflect biases present in the original Amazon reviews
- Should not be used for automated decision making without human oversight
- Users should be aware that word embeddings can amplify societal biases

## Citation

If you use this model in your research, please cite one or both of the following papers for the original Amazon reviews dataset:

```
Ups and downs: Modeling the visual evolution of fashion trends with one-class collaborative filtering
R. He, J. McAuley
WWW, 2016

Image-based recommendations on styles and substitutes
J. McAuley, C. Targett, J. Shi, A. van den Hengel
SIGIR, 2015
```

## License

The model is shared for research purposes. The original data follows Amazon's terms of use.