Abhay01702 committed · Commit 1790c6f · verified · 1 Parent(s): f2a9d65

Upload README.md

Files changed (1): README.md (+104 −3)
# Assignment 2 – Introduction to NLP

## Environment

```bash
pip install torch numpy scipy scikit-learn matplotlib seaborn nltk gensim
python -c "import nltk; nltk.download('brown'); nltk.download('universal_tagset')"
```

Python ≥ 3.9 and PyTorch ≥ 2.0 recommended.

---

## File Structure

```
<2025201014>_A2.zip
├── embeddings/
│   ├── svd.pt
│   └── cbow.pt
├── svd_embeddings.py
├── word2vec.py
├── pos_tagger.py
├── README.md
└── report.pdf
```

> Pre-trained GloVe weights are downloaded automatically via `gensim.downloader`.
> Trained `.pt` files: <https://huggingface.co/Abhay01702/Assignment02>

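The exact layout of `svd.pt` / `cbow.pt` is not specified here; assuming each checkpoint stores a vocabulary plus a weight matrix, a save/load round-trip looks like:

```python
# Sketch of the assumed .pt checkpoint format: a dict of vocab + weights.
# The real svd.pt / cbow.pt contents may differ.
import torch

vocab = ["the", "of", "and"]            # toy vocabulary
weights = torch.randn(len(vocab), 100)  # EMBEDDING_DIM = 100

torch.save({"vocab": vocab, "weights": weights}, "toy.pt")

ckpt = torch.load("toy.pt")
word2vec = dict(zip(ckpt["vocab"], ckpt["weights"]))
print(word2vec["the"].shape)  # torch.Size([100])
```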
---

## Step 1 – Train SVD Embeddings

```bash
python svd_embeddings.py
# Output: embeddings/svd.pt
```

Hyperparameters:

| Parameter | Value | Justification |
|---|---|---|
| CONTEXT_WINDOW | 5 | Standard NLP window; captures local syntactic context |
| EMBEDDING_DIM | 100 | Balances expressiveness & cost; matches Word2Vec |
| MIN_FREQ | 2 | Removes hapax legomena that add noise |
| WEIGHTING | PPMI + distance decay | Removes frequency bias; weights nearby context more |
48
+ ---
49
+
50
+ ## Step 2 – Train Word2Vec (CBOW + Negative Sampling)
51
+
52
+ ```bash
53
+ python word2vec.py
54
+ # Output: embeddings/cbow.pt
55
+ ```

Hyperparameters:

| Parameter | Value | Justification |
|---|---|---|
| CONTEXT_WINDOW | 5 | Matches SVD for fair comparison |
| EMBEDDING_DIM | 100 | Matches SVD |
| NEG_SAMPLES | 10 | Mikolov et al. recommend 5–20 for small corpora |
| EPOCHS | 10 | Loss plateaus after ~8 epochs on Brown |
| BATCH_SIZE | 512 | Efficient mini-batch |
| LR | 0.001 | Adam default; stable convergence |

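One CBOW + negative-sampling update can be sketched as follows (toy sizes; illustrative, not the exact `word2vec.py` code):

```python
# One CBOW step with negative sampling: average the context vectors,
# score the true target up and sampled noise words down.
import torch
import torch.nn.functional as F

V, D = 50, 16                        # toy vocab size / embedding dim
in_emb = torch.nn.Embedding(V, D)    # context ("input") embeddings
out_emb = torch.nn.Embedding(V, D)   # target ("output") embeddings

context = torch.tensor([[1, 2, 4, 5]])  # window of 5 minus the center word
target = torch.tensor([3])
neg = torch.randint(0, V, (1, 10))      # NEG_SAMPLES = 10

h = in_emb(context).mean(dim=1)                        # (1, D)
pos_score = (out_emb(target) * h).sum(-1)              # (1,)
neg_score = torch.bmm(out_emb(neg), h.unsqueeze(-1)).squeeze(-1)  # (1, 10)

loss = -(F.logsigmoid(pos_score) + F.logsigmoid(-neg_score).sum(-1)).mean()
loss.backward()
```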

---

## Step 3 – POS Tagger + Analysis

```bash
# Run everything (analogy + bias + tagger training):
python pos_tagger.py --mode train_all

# Run only analogy tests:
python pos_tagger.py --mode analogy

# Run only bias check:
python pos_tagger.py --mode bias
```

Outputs: console logs + `confusion_<variant>.png` images.
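
The `--mode` dispatch above might be wired roughly like this (function bodies and names are assumptions; the actual `pos_tagger.py` may differ):

```python
# Sketch of argparse-based mode dispatch for pos_tagger.py.
import argparse

def main(argv=None):
    parser = argparse.ArgumentParser(description="POS tagger + analysis")
    parser.add_argument("--mode", choices=["train_all", "analogy", "bias"],
                        default="train_all")
    args = parser.parse_args(argv)
    if args.mode in ("train_all", "analogy"):
        print("running analogy tests")  # placeholder for analogy step
    if args.mode in ("train_all", "bias"):
        print("running bias check")     # placeholder for bias step
    if args.mode == "train_all":
        print("training taggers")       # placeholder for MLP training
    return args.mode

main(["--mode", "analogy"])
```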

---

## MLP Hyperparameters

| Parameter | Value |
|---|---|
| WINDOW | 2 (5-word input window) |
| HIDDEN | [512, 256] |
| DROPOUT | 0.3 |
| ACTIVATION | ReLU |
| EPOCHS | 20 |
| BATCH_SIZE | 512 |
| LR | 0.001 (Adam) |
| Freeze embeddings? | GloVe: frozen; SVD & CBOW: fine-tuned |
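
The table corresponds to a window-based MLP of roughly this shape (a sketch; the class name and 100-dim embeddings are assumptions):

```python
# Window MLP tagger: concatenate 5 word vectors, pass through [512, 256].
import torch
import torch.nn as nn

class WindowMLPTagger(nn.Module):
    def __init__(self, vocab_size, n_tags, emb_dim=100, window=2):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        in_dim = (2 * window + 1) * emb_dim  # 5 concatenated word vectors
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, 512), nn.ReLU(), nn.Dropout(0.3),
            nn.Linear(512, 256), nn.ReLU(), nn.Dropout(0.3),
            nn.Linear(256, n_tags),
        )

    def forward(self, windows):           # windows: (batch, 5) word ids
        x = self.emb(windows).flatten(1)  # (batch, 5 * emb_dim)
        return self.mlp(x)

model = WindowMLPTagger(vocab_size=1000, n_tags=12)
out = model(torch.zeros((3, 5), dtype=torch.long))  # tag logits: (3, 12)
```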

Freeze justification: GloVe was trained on a massive external corpus, so fine-tuning it on
the small Brown corpus risks degrading its representations. The SVD and CBOW embeddings
were trained on Brown itself, so fine-tuning them jointly with the tagger is beneficial.
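
Freezing an embedding layer in PyTorch amounts to excluding its weights from gradient updates (a sketch; the real `pos_tagger.py` wiring may differ):

```python
# Frozen vs. fine-tuned embedding layers.
import torch.nn as nn

glove_emb = nn.Embedding(1000, 100)
glove_emb.weight.requires_grad = False  # frozen: no gradient updates

cbow_emb = nn.Embedding(1000, 100)      # fine-tuned: requires_grad stays True

# Hand only the trainable parameters to the optimizer
params = [p for m in (glove_emb, cbow_emb)
          for p in m.parameters() if p.requires_grad]
```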

**Created By:** Abhay Sharma
**Roll No.:** 2025201014