# Assignment 2 – Introduction to NLP

## Environment

```bash
pip install torch numpy scipy scikit-learn matplotlib seaborn nltk gensim
python -c "import nltk; nltk.download('brown'); nltk.download('universal_tagset')"
```

Python ≥ 3.9 and PyTorch ≥ 2.0 recommended.

---

## File Structure

```
<2025201014>_A2.zip
├── embeddings/
│   ├── svd.pt
│   └── cbow.pt
├── svd_embeddings.py
├── word2vec.py
├── pos_tagger.py
├── README.md
└── report.pdf
```

> Pre-trained GloVe weights are downloaded automatically via `gensim.downloader`.
> Trained `.pt` files: https://huggingface.co/Abhay01702/Assignment02

---

## Step 1 – Train SVD Embeddings

```bash
python svd_embeddings.py
# Output: embeddings/svd.pt
```

Hyperparameters:
| Parameter | Value | Justification |
|---|---|---|
| CONTEXT_WINDOW | 5 | Standard NLP window; captures local syntactic context |
| EMBEDDING_DIM | 100 | Balances expressiveness & cost; matches Word2Vec |
| MIN_FREQ | 2 | Removes hapax legomena that add noise |
| WEIGHTING | PPMI + distance decay | PPMI corrects raw-frequency bias; distance decay weights nearby context more |
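
The pipeline implied by the table — distance-decayed co-occurrence counts, PPMI reweighting, then truncated SVD — can be sketched on a toy corpus. This is an illustrative numpy sketch, not the assignment's actual code; the real script uses the Brown corpus and `EMBEDDING_DIM = 100`:

```python
import numpy as np

# Toy corpus standing in for Brown; DIM is shrunk to fit the toy vocabulary.
corpus = [["the", "cat", "sat", "on", "the", "mat"],
          ["the", "dog", "sat", "on", "the", "rug"]]
WINDOW, DIM = 5, 4

vocab = sorted({w for s in corpus for w in s})
idx = {w: i for i, w in enumerate(vocab)}
V = len(vocab)

# Distance-decayed co-occurrence: a neighbour at distance d contributes
# weight 1/d, so closer context words count more.
C = np.zeros((V, V))
for sent in corpus:
    for i, w in enumerate(sent):
        for j in range(max(0, i - WINDOW), min(len(sent), i + WINDOW + 1)):
            if j != i:
                C[idx[w], idx[sent[j]]] += 1.0 / abs(i - j)

# PPMI: clip pointwise mutual information at zero, removing raw-frequency bias.
total = C.sum()
p_w = C.sum(axis=1, keepdims=True) / total
p_c = C.sum(axis=0, keepdims=True) / total
with np.errstate(divide="ignore", invalid="ignore"):
    pmi = np.log((C / total) / (p_w * p_c))
ppmi = np.maximum(pmi, 0.0)
ppmi[~np.isfinite(ppmi)] = 0.0

# Truncated SVD: keep the top-DIM singular directions as word embeddings.
U, S, _ = np.linalg.svd(ppmi)
embeddings = U[:, :DIM] * S[:DIM]
print(embeddings.shape)  # (V, DIM)
```

Scaling `U` by the singular values (rather than using `U` alone) is one common convention; the report should state which variant the script uses.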

---

## Step 2 – Train Word2Vec (CBOW + Negative Sampling)

```bash
python word2vec.py
# Output: embeddings/cbow.pt
```

Hyperparameters:
| Parameter | Value | Justification |
|---|---|---|
| CONTEXT_WINDOW | 5 | Matches SVD for fair comparison |
| EMBEDDING_DIM | 100 | Matches SVD |
| NEG_SAMPLES | 10 | Mikolov et al. recommend 5-20 for small corpora |
| EPOCHS | 10 | Loss plateaus after ~8 epochs on Brown |
| BATCH_SIZE | 512 | Efficient mini-batch |
| LR | 0.001 | Adam default; stable convergence |
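
The CBOW + negative-sampling objective behind these hyperparameters can be sketched in a few lines of PyTorch. Class and variable names here are illustrative, not taken from `word2vec.py`; only the window size and `NEG_SAMPLES = 10` mirror the table:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CBOW(nn.Module):
    """Minimal CBOW with negative sampling (a sketch, not the assignment's code)."""
    def __init__(self, vocab_size: int, dim: int):
        super().__init__()
        self.in_emb = nn.Embedding(vocab_size, dim)   # context ("input") vectors
        self.out_emb = nn.Embedding(vocab_size, dim)  # target ("output") vectors

    def forward(self, context, target, negatives):
        # context: (B, 2*WINDOW) ids -> mean-pooled context vector (B, D)
        h = self.in_emb(context).mean(dim=1)
        pos = (self.out_emb(target) * h).sum(-1)                             # (B,)
        neg = torch.bmm(self.out_emb(negatives), h.unsqueeze(2)).squeeze(2)  # (B, K)
        # Pull the true target towards the context; push K noise words away.
        return -(F.logsigmoid(pos).mean() + F.logsigmoid(-neg).mean())

model = CBOW(vocab_size=100, dim=100)
ctx = torch.randint(0, 100, (8, 10))   # batch of 8, window of 5 on each side
tgt = torch.randint(0, 100, (8,))
neg = torch.randint(0, 100, (8, 10))   # NEG_SAMPLES = 10 per target
loss = model(ctx, tgt, neg)
```

In training, `loss.backward()` plus an Adam step at LR 0.001 would complete one mini-batch update; negatives are typically drawn from a unigram^0.75 noise distribution.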

---

## Step 3 – POS Tagger + Analysis

```bash
# Run everything (analogy + bias + tagger training):
python pos_tagger.py --mode train_all

# Run only analogy tests:
python pos_tagger.py --mode analogy

# Run only bias check:
python pos_tagger.py --mode bias
```

Outputs: console logs + `confusion_<variant>.png` confusion-matrix images.
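
The `--mode` switch used above suggests a dispatch along these lines. This is an assumed sketch of the CLI plumbing — only the flag name and its three values come from the commands shown; the handler names are hypothetical:

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    # Mirrors the documented invocations: train_all / analogy / bias.
    parser = argparse.ArgumentParser(description="POS tagger + embedding analysis")
    parser.add_argument("--mode",
                        choices=["train_all", "analogy", "bias"],
                        default="train_all",
                        help="which part of the pipeline to run")
    return parser

args = build_parser().parse_args(["--mode", "analogy"])
print(args.mode)  # analogy
```

An unknown value (e.g. `--mode foo`) makes `argparse` exit with a usage error, so typos fail fast instead of silently retraining everything.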

---

## MLP Hyperparameters

| Parameter | Value |
|---|---|
| WINDOW | 2 (5-word input window) |
| HIDDEN | [512, 256] |
| DROPOUT | 0.3 |
| ACTIVATION | ReLU |
| EPOCHS | 20 |
| BATCH_SIZE | 512 |
| LR | 0.001 (Adam) |
| Freeze embeddings? | GloVe: frozen; SVD & CBOW: fine-tuned |

Freeze justification: GloVe was trained on a massive external corpus, so fine-tuning on
the small Brown corpus risks degrading its representations. The SVD and CBOW embeddings
were trained on Brown itself, so joint fine-tuning is beneficial.
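
In PyTorch, this freeze/fine-tune choice reduces to a single flag on the embedding layer. A minimal sketch — the function name and random stand-in weights are illustrative:

```python
import torch
import torch.nn as nn

def make_embedding(weights: torch.Tensor, freeze: bool) -> nn.Embedding:
    # freeze=True keeps pretrained (GloVe) vectors fixed during training;
    # freeze=False lets Brown-trained SVD/CBOW vectors fine-tune jointly.
    return nn.Embedding.from_pretrained(weights, freeze=freeze)

pretrained = torch.randn(5000, 100)  # stand-in for loaded GloVe / svd.pt / cbow.pt
glove_emb = make_embedding(pretrained, freeze=True)
cbow_emb = make_embedding(pretrained.clone(), freeze=False)
print(glove_emb.weight.requires_grad, cbow_emb.weight.requires_grad)  # False True
```

With `freeze=True` the weight tensor has `requires_grad=False`, so the optimizer never updates it even if it is passed to `model.parameters()`.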
  
**Created By:** Abhay Sharma  
**Roll No.** 2025201014