# Assignment 2 – Introduction to NLP
## Environment
```bash
pip install torch numpy scipy scikit-learn matplotlib seaborn nltk gensim
python -c "import nltk; nltk.download('brown'); nltk.download('universal_tagset')"
```
Python ≥ 3.9 and PyTorch ≥ 2.0 recommended.
---
## File Structure
```
<2025201014>_A2.zip
├── embeddings/
│   ├── svd.pt
│   └── cbow.pt
├── svd_embeddings.py
├── word2vec.py
├── pos_tagger.py
├── README.md
└── report.pdf
```
> Pre-trained GloVe weights are downloaded automatically via `gensim.downloader`.
> Trained `.pt` files: <https://huggingface.co/Abhay01702/Assignment02>
---
## Step 1 – Train SVD Embeddings
```bash
python svd_embeddings.py
# Output: embeddings/svd.pt
```
Hyperparameters:
| Parameter | Value | Justification |
|---|---|---|
| CONTEXT_WINDOW | 5 | Standard NLP window; captures local syntactic context |
| EMBEDDING_DIM | 100 | Balances expressiveness & cost; matches Word2Vec |
| MIN_FREQ | 2 | Removes hapax legomena that add noise |
| WEIGHTING | PPMI + distance decay | Removes frequency bias; weights nearby context more |
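The PPMI + distance-decay weighting above can be sketched in a few lines. The toy corpus, the `1/distance` decay, and all variable names here are illustrative assumptions, not the actual `svd_embeddings.py` implementation:

```python
import numpy as np

corpus = [["the", "dog", "chased", "the", "cat"],
          ["the", "cat", "sat", "on", "the", "mat"]]
WINDOW = 5

vocab = sorted({w for sent in corpus for w in sent})
idx = {w: i for i, w in enumerate(vocab)}
V = len(vocab)

# Co-occurrence counts with 1/distance decay: nearer context words count more.
C = np.zeros((V, V))
for sent in corpus:
    for i, w in enumerate(sent):
        for j in range(max(0, i - WINDOW), min(len(sent), i + WINDOW + 1)):
            if i != j:
                C[idx[w], idx[sent[j]]] += 1.0 / abs(i - j)

# PPMI = max(0, log P(w,c) / (P(w) P(c))): clipping removes frequency bias
# from negative/unreliable associations.
total = C.sum()
Pw = C.sum(axis=1, keepdims=True) / total
Pc = C.sum(axis=0, keepdims=True) / total
with np.errstate(divide="ignore", invalid="ignore"):
    pmi = np.log((C / total) / (Pw * Pc))
ppmi = np.where(np.isfinite(pmi), np.maximum(pmi, 0.0), 0.0)

# Truncated SVD turns the sparse PPMI matrix into dense embeddings;
# the dimension is capped by the tiny vocabulary here.
U, S, _ = np.linalg.svd(ppmi)
dim = min(100, V)
embeddings = U[:, :dim] * np.sqrt(S[:dim])
print(embeddings.shape)
```

Scaling `U` by `sqrt(S)` (rather than `S` or nothing) is a common compromise; the real script may make a different choice.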
---
## Step 2 – Train Word2Vec (CBOW + Negative Sampling)
```bash
python word2vec.py
# Output: embeddings/cbow.pt
```
Hyperparameters:
| Parameter | Value | Justification |
|---|---|---|
| CONTEXT_WINDOW | 5 | Matches SVD for fair comparison |
| EMBEDDING_DIM | 100 | Matches SVD |
| NEG_SAMPLES | 10 | Mikolov et al. recommend 5-20 for small corpora |
| EPOCHS | 10 | Loss plateaus after ~8 epochs on Brown |
| BATCH_SIZE | 512 | Efficient mini-batch |
| LR | 0.001 | Adam default; stable convergence |
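A single CBOW step with the negative-sampling objective, in numpy for clarity. The word IDs, initialization, and uniform noise sampling are toy assumptions (real Word2Vec samples noise from the unigram distribution raised to 0.75); `word2vec.py` trains this with Adam in PyTorch:

```python
import numpy as np

rng = np.random.default_rng(0)
V, DIM, NEG = 1000, 100, 10

W_in = rng.normal(scale=0.1, size=(V, DIM))   # input (context) embeddings
W_out = rng.normal(scale=0.1, size=(V, DIM))  # output (target) embeddings

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# CBOW hidden layer: average the embeddings of the surrounding context words.
context_ids = np.array([3, 17, 42, 7])
target_id = 99
h = W_in[context_ids].mean(axis=0)

# Score the true target against NEG random noise words.
neg_ids = rng.integers(0, V, size=NEG)
pos_score = W_out[target_id] @ h
neg_scores = W_out[neg_ids] @ h

# Negative-sampling loss: -log sigma(v_o . h) - sum_k log sigma(-v_k . h)
loss = -np.log(sigmoid(pos_score)) - np.log(sigmoid(-neg_scores)).sum()
print(loss)
```

With NEG = 10 the model only needs 11 dot products per step instead of a full |V|-way softmax, which is why negative sampling makes training on Brown cheap.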
---
## Step 3 – POS Tagger + Analysis
```bash
# Run everything (analogy + bias + tagger training):
python pos_tagger.py --mode train_all
# Run only analogy tests:
python pos_tagger.py --mode analogy
# Run only bias check:
python pos_tagger.py --mode bias
```
Outputs: console logs + `confusion_<variant>.png` images.
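The analogy mode evaluates vector offsets with cosine similarity. A minimal version, using toy hand-picked vectors in place of the trained `embeddings/*.pt` files:

```python
import numpy as np

# Toy embeddings; a real run would load embeddings/svd.pt or embeddings/cbow.pt.
emb = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "man":   np.array([0.8, 0.1, 0.1]),
    "woman": np.array([0.1, 0.1, 0.9]),
    "queen": np.array([0.2, 0.8, 0.9]),
    "apple": np.array([0.5, -0.3, 0.2]),
}

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# king - man + woman should land nearest to queen
# (the query words themselves are excluded, as is standard).
query = emb["king"] - emb["man"] + emb["woman"]
best = max((w for w in emb if w not in {"king", "man", "woman"}),
           key=lambda w: cosine(query, emb[w]))
print(best)  # queen
```

The bias mode applies the same machinery: comparing cosine similarities of occupation words against gendered word pairs.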
---
## MLP Hyperparameters
| Parameter | Value |
|---|---|
| WINDOW | 2 (5-word input window) |
| HIDDEN | [512, 256] |
| DROPOUT | 0.3 |
| ACTIVATION | ReLU |
| EPOCHS | 20 |
| BATCH_SIZE | 512 |
| LR | 0.001 (Adam) |
| Freeze embeddings? | GloVe: frozen; SVD & CBOW: fine-tuned |
Freeze justification: GloVe was trained on a massive external corpus; fine-tuning on
the small Brown corpus risks degrading its representations. SVD/CBOW embeddings were
trained on Brown so joint fine-tuning is beneficial.
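In PyTorch, the freeze/fine-tune split amounts to one flag when building the embedding layer. A sketch with random stand-in weights (the real tagger loads GloVe via gensim and the trained `.pt` files):

```python
import torch
import torch.nn as nn

VOCAB, DIM = 5000, 100
pretrained = torch.randn(VOCAB, DIM)  # stand-in for GloVe / SVD / CBOW weights

# GloVe: freeze=True keeps the externally trained weights fixed,
# so the small Brown corpus cannot degrade them.
glove_emb = nn.Embedding.from_pretrained(pretrained, freeze=True)

# SVD/CBOW: freeze=False lets Brown-trained vectors adapt to the tagging task.
cbow_emb = nn.Embedding.from_pretrained(pretrained.clone(), freeze=False)

print(glove_emb.weight.requires_grad, cbow_emb.weight.requires_grad)
```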
**Created By:** Abhay Sharma
**Roll No.:** 2025201014