# Assignment 2 – Introduction to NLP

## Environment

```bash
pip install torch numpy scipy scikit-learn matplotlib seaborn nltk gensim
python -c "import nltk; nltk.download('brown'); nltk.download('universal_tagset')"
```

Python ≥ 3.9 and PyTorch ≥ 2.0 recommended.

---

## File Structure

```
<2025201014>_A2.zip
├── embeddings/
│   ├── svd.pt
│   └── cbow.pt
├── svd_embeddings.py
├── word2vec.py
├── pos_tagger.py
├── README.md
└── report.pdf
```

> Pre-trained GloVe weights are downloaded automatically via `gensim.downloader`.
> Trained `.pt` files: <https://huggingface.co/Abhay01702/Assignment02>

---

## Step 1 – Train SVD Embeddings

```bash
python svd_embeddings.py
# Output: embeddings/svd.pt
```

Hyperparameters:

| Parameter | Value | Justification |
|---|---|---|
| CONTEXT_WINDOW | 5 | Standard NLP window; captures local syntactic context |
| EMBEDDING_DIM | 100 | Balances expressiveness and cost; matches Word2Vec |
| MIN_FREQ | 2 | Removes hapax legomena that add noise |
| WEIGHTING | PPMI + distance decay | Removes frequency bias; weights nearby context more |

---

## Step 2 – Train Word2Vec (CBOW + Negative Sampling)

```bash
python word2vec.py
# Output: embeddings/cbow.pt
```

Hyperparameters:

| Parameter | Value | Justification |
|---|---|---|
| CONTEXT_WINDOW | 5 | Matches SVD for a fair comparison |
| EMBEDDING_DIM | 100 | Matches SVD |
| NEG_SAMPLES | 10 | Mikolov et al. recommend 5–20 for small corpora |
| EPOCHS | 10 | Loss plateaus after ~8 epochs on Brown |
| BATCH_SIZE | 512 | Efficient mini-batching |
| LR | 0.001 | Adam default; stable convergence |

---

## Step 3 – POS Tagger + Analysis

```bash
# Run everything (analogy + bias + tagger training):
python pos_tagger.py --mode train_all

# Run only analogy tests:
python pos_tagger.py --mode analogy

# Run only bias check:
python pos_tagger.py --mode bias
```

Outputs: console logs + `confusion_.png` images.
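The PPMI + distance-decay weighting used in Step 1 can be sketched as follows. This is a minimal illustration, not the code in `svd_embeddings.py`: the function name, the skipped MIN_FREQ filtering, and the dense co-occurrence matrix (the real script would need a sparse one for Brown) are all simplifications.

```python
# Sketch of Step 1: distance-weighted co-occurrence counts -> PPMI -> truncated SVD.
# Illustrative only; names and simplifications do not mirror svd_embeddings.py.
import numpy as np

CONTEXT_WINDOW, EMBEDDING_DIM = 5, 100

def ppmi_svd_embeddings(sentences, dim=EMBEDDING_DIM, window=CONTEXT_WINDOW):
    vocab = {w: i for i, w in enumerate({w for s in sentences for w in s})}
    V = len(vocab)
    counts = np.zeros((V, V))
    for sent in sentences:
        for i, w in enumerate(sent):
            for j in range(max(0, i - window), min(len(sent), i + window + 1)):
                if i != j:
                    # distance decay: closer context words contribute more
                    counts[vocab[w], vocab[sent[j]]] += 1.0 / abs(i - j)
    total = counts.sum()
    row = counts.sum(axis=1, keepdims=True)
    col = counts.sum(axis=0, keepdims=True)
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log((counts * total) / (row * col))
    # PPMI: clip negative/undefined PMI values to zero (removes frequency bias)
    ppmi = np.nan_to_num(np.maximum(pmi, 0.0))
    U, S, _ = np.linalg.svd(ppmi, full_matrices=False)
    k = min(dim, V)
    # scale left singular vectors by sqrt of singular values (common convention)
    return vocab, U[:, :k] * np.sqrt(S[:k])
```

The `sqrt(S)` scaling is one common convention for SVD word vectors; using `U` alone is another reasonable choice.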
---

## MLP Hyperparameters

| Parameter | Value |
|---|---|
| WINDOW | 2 (5-word input window) |
| HIDDEN | [512, 256] |
| DROPOUT | 0.3 |
| ACTIVATION | ReLU |
| EPOCHS | 20 |
| BATCH_SIZE | 512 |
| LR | 0.001 (Adam) |
| Freeze embeddings? | GloVe: frozen; SVD & CBOW: fine-tuned |

Freeze justification: GloVe was trained on a massive external corpus, so fine-tuning on the small Brown corpus risks degrading its representations. SVD/CBOW embeddings were trained on Brown itself, so joint fine-tuning is beneficial.

**Created By:** Abhay Sharma
**Roll No.:** 2025201014
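The window-MLP architecture in the table above can be sketched as a PyTorch module: the 2*WINDOW+1 = 5 context-word embeddings are concatenated and fed through the [512, 256] ReLU stack with dropout. The class name and forward interface here are hypothetical, not the actual API of `pos_tagger.py`.

```python
# Hypothetical sketch of the window-based MLP POS tagger described above.
import torch
import torch.nn as nn

class WindowMLPTagger(nn.Module):
    def __init__(self, n_tags, emb_dim=100, window=2,
                 hidden=(512, 256), dropout=0.3):
        super().__init__()
        in_dim = (2 * window + 1) * emb_dim  # 5 concatenated word vectors
        layers, prev = [], in_dim
        for h in hidden:
            layers += [nn.Linear(prev, h), nn.ReLU(), nn.Dropout(dropout)]
            prev = h
        layers.append(nn.Linear(prev, n_tags))  # logits over the tagset
        self.net = nn.Sequential(*layers)

    def forward(self, window_embs):
        # window_embs: (batch, 2*window+1, emb_dim) -> (batch, n_tags)
        return self.net(window_embs.flatten(1))
```

Trained with `nn.CrossEntropyLoss` and Adam (LR 0.001), this matches the hyperparameters listed above; the embedding lookup (frozen for GloVe, trainable for SVD/CBOW) would sit in front of this module.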