# Assignment 2 – Introduction to NLP
## Environment
```bash
pip install torch numpy scipy scikit-learn matplotlib seaborn nltk gensim
python -c "import nltk; nltk.download('brown'); nltk.download('universal_tagset')"
```
Python ≥ 3.9 and PyTorch ≥ 2.0 are recommended.
---
## File Structure
```
<2025201014>_A2.zip
├── embeddings/
│   ├── svd.pt
│   └── cbow.pt
├── svd_embeddings.py
├── word2vec.py
├── pos_tagger.py
├── README.md
└── report.pdf
```
> Pre-trained GloVe weights are downloaded automatically via `gensim.downloader`.
> Trained `.pt` files: <https://huggingface.co/Abhay01702/Assignment02>
---
## Step 1 – Train SVD Embeddings
```bash
python svd_embeddings.py
# Output: embeddings/svd.pt
```
Hyperparameters:
| Parameter | Value | Justification |
|---|---|---|
| CONTEXT_WINDOW | 5 | Standard NLP window; captures local syntactic context |
| EMBEDDING_DIM | 100 | Balances expressiveness & cost; matches Word2Vec |
| MIN_FREQ | 2 | Removes hapax legomena that add noise |
| WEIGHTING | PPMI + distance decay | Removes frequency bias; weights nearby context more |
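The pipeline in the table above can be sketched as follows. This is a simplified, illustrative version of what `svd_embeddings.py` does (the function name `ppmi_svd_embeddings` and the dense-matrix implementation are assumptions; the actual script may use sparse matrices and truncated SVD for efficiency): count co-occurrences with 1/distance decay inside the window, convert to PPMI, then factorize.

```python
import numpy as np
from collections import Counter

def ppmi_svd_embeddings(sentences, window=5, dim=100, min_freq=2):
    """Sketch: distance-decayed co-occurrence -> PPMI -> truncated SVD."""
    # Vocabulary with MIN_FREQ filtering (drops hapax legomena).
    counts = Counter(w for s in sentences for w in s)
    vocab = sorted(w for w, c in counts.items() if c >= min_freq)
    idx = {w: i for i, w in enumerate(vocab)}

    # Co-occurrence matrix with 1/distance weighting inside the window.
    M = np.zeros((len(vocab), len(vocab)))
    for s in sentences:
        for i, w in enumerate(s):
            if w not in idx:
                continue
            for j in range(max(0, i - window), min(len(s), i + window + 1)):
                if j != i and s[j] in idx:
                    M[idx[w], idx[s[j]]] += 1.0 / abs(i - j)

    # PPMI: max(0, log P(w,c) / (P(w) P(c))) removes frequency bias.
    total = M.sum()
    pw = M.sum(axis=1, keepdims=True) / total
    pc = M.sum(axis=0, keepdims=True) / total
    with np.errstate(divide="ignore", invalid="ignore"):
        ppmi = np.maximum(np.log((M / total) / (pw * pc)), 0.0)
    ppmi[~np.isfinite(ppmi)] = 0.0

    # Truncated SVD: rows of U * sqrt(S) serve as word embeddings.
    U, S, _ = np.linalg.svd(ppmi, full_matrices=False)
    k = min(dim, len(S))
    return vocab, U[:, :k] * np.sqrt(S[:k])
```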
---
## Step 2 – Train Word2Vec (CBOW + Negative Sampling)
```bash
python word2vec.py
# Output: embeddings/cbow.pt
```
Hyperparameters:
| Parameter | Value | Justification |
|---|---|---|
| CONTEXT_WINDOW | 5 | Matches SVD for fair comparison |
| EMBEDDING_DIM | 100 | Matches SVD |
| NEG_SAMPLES | 10 | Mikolov et al. recommend 5–20 for small corpora |
| EPOCHS | 10 | Loss plateaus after ~8 epochs on Brown |
| BATCH_SIZE | 512 | Efficient mini-batch |
| LR | 0.001 | Adam default; stable convergence |
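The CBOW + negative sampling objective behind these hyperparameters can be sketched in PyTorch as below. This is a minimal illustration, not the actual `word2vec.py` (the class name is hypothetical, and subsampling, the unigram^0.75 noise distribution, and batching are omitted): context embeddings are averaged, and the loss pushes the target's dot product up and the negatives' down.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CBOWNegSampling(nn.Module):
    """Sketch of CBOW with negative sampling (hypothetical class name)."""

    def __init__(self, vocab_size, dim=100):
        super().__init__()
        self.in_emb = nn.Embedding(vocab_size, dim)   # context vectors
        self.out_emb = nn.Embedding(vocab_size, dim)  # target vectors

    def forward(self, context, target, negatives):
        # context: (B, 2*window), target: (B,), negatives: (B, K)
        h = self.in_emb(context).mean(dim=1)               # (B, dim)
        pos = (self.out_emb(target) * h).sum(-1)           # (B,)
        neg = torch.bmm(self.out_emb(negatives),           # (B, K)
                        h.unsqueeze(-1)).squeeze(-1)
        # Maximize log sigmoid(pos) + sum_k log sigmoid(-neg_k).
        loss = -(F.logsigmoid(pos) + F.logsigmoid(-neg).sum(-1))
        return loss.mean()
```

Training then iterates this loss with Adam (LR 0.001) over mini-batches of 512 for 10 epochs, as in the table above.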
---
## Step 3 – POS Tagger + Analysis
```bash
# Run everything (analogy + bias + tagger training):
python pos_tagger.py --mode train_all
# Run only analogy tests:
python pos_tagger.py --mode analogy
# Run only bias check:
python pos_tagger.py --mode bias
```
Outputs: console logs and `confusion_<variant>.png` images.
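A tagger with WINDOW = 2 classifies each token from the concatenated embeddings of a 5-word window. A sketch of that feature construction is below; the function name `window_features` and zero-padding at sentence boundaries are assumptions for illustration, not necessarily what `pos_tagger.py` does.

```python
import torch

def window_features(sent_embs, window=2):
    """Sketch: concatenate embeddings of a (2*window + 1)-token window
    around each position, zero-padding past sentence boundaries."""
    T, D = sent_embs.shape
    pad = torch.zeros(window, D)
    padded = torch.cat([pad, sent_embs, pad], dim=0)   # (T + 2*window, D)
    # Row t holds the flattened embeddings of positions t-window .. t+window.
    return torch.stack(
        [padded[t:t + 2 * window + 1].reshape(-1) for t in range(T)]
    )  # (T, (2*window + 1) * D)
```

With 100-dimensional embeddings this yields 500-dimensional inputs to the MLP described in the next section.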
---
## MLP Hyperparameters
| Parameter | Value |
|---|---|
| WINDOW | 2 (5-word input window) |
| HIDDEN | [512, 256] |
| DROPOUT | 0.3 |
| ACTIVATION | ReLU |
| EPOCHS | 20 |
| BATCH_SIZE | 512 |
| LR | 0.001 (Adam) |
| Freeze embeddings? | GloVe: frozen; SVD & CBOW: fine-tuned |
Freeze justification: GloVe was trained on a massive external corpus, and fine-tuning it on
the small Brown corpus risks degrading its representations. The SVD and CBOW embeddings were
trained on Brown itself, so joint fine-tuning is beneficial.
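One way to realize this freeze policy in PyTorch is shown below. This is a sketch, not the exact `pos_tagger.py` code (the helper name `build_embedding` and the variant strings are assumptions): `nn.Embedding.from_pretrained` loads the weights, and `freeze` controls whether gradients flow into them.

```python
import torch
import torch.nn as nn

def build_embedding(weights: torch.Tensor, variant: str) -> nn.Embedding:
    """Sketch: freeze GloVe, fine-tune SVD/CBOW (hypothetical helper)."""
    freeze = (variant == "glove")
    return nn.Embedding.from_pretrained(weights, freeze=freeze)
```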
**Created By:** Abhay Sharma
**Roll No.:** 2025201014