Assignment 2 – Introduction to NLP
Environment
```bash
pip install torch numpy scipy scikit-learn matplotlib seaborn nltk gensim
python -c "import nltk; nltk.download('brown'); nltk.download('universal_tagset')"
```
Python ≥ 3.9 and PyTorch ≥ 2.0 recommended.
File Structure
```text
<2025201014>_A2.zip
├── embeddings/
│   ├── svd.pt
│   └── cbow.pt
├── svd_embeddings.py
├── word2vec.py
├── pos_tagger.py
├── README.md
└── report.pdf
```
Pre-trained GloVe weights are downloaded automatically via `gensim.downloader`. Trained `.pt` files: https://huggingface.co/Abhay01702/Assignment02
Step 1 – Train SVD Embeddings
```bash
python svd_embeddings.py
# Output: embeddings/svd.pt
```
Hyperparameters:
| Parameter | Value | Justification |
|---|---|---|
| CONTEXT_WINDOW | 5 | Standard NLP window; captures local syntactic context |
| EMBEDDING_DIM | 100 | Balances expressiveness & cost; matches Word2Vec |
| MIN_FREQ | 2 | Removes hapax legomena that add noise |
| WEIGHTING | PPMI + distance decay | Removes frequency bias; weights nearby context more |
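The PPMI + distance-decay weighting in the table can be sketched as follows. This is a toy illustration on a two-sentence corpus with made-up sizes (`WINDOW=2`, `DIM=4`), not the assignment's actual `svd_embeddings.py`, which runs the same pipeline over Brown with `CONTEXT_WINDOW=5` and `EMBEDDING_DIM=100`:

```python
import numpy as np

# Toy corpus; the assignment uses the Brown corpus instead.
corpus = [["the", "cat", "sat", "on", "the", "mat"],
          ["the", "dog", "sat", "on", "the", "rug"]]

WINDOW = 2  # toy window (assignment: 5)
DIM = 4     # toy embedding size (assignment: 100)

vocab = sorted({w for sent in corpus for w in sent})
idx = {w: i for i, w in enumerate(vocab)}
V = len(vocab)

# Distance-decayed co-occurrence counts: a neighbour at distance d adds 1/d,
# so nearby context words weigh more than distant ones.
C = np.zeros((V, V))
for sent in corpus:
    for i, w in enumerate(sent):
        for d in range(1, WINDOW + 1):
            if i + d < len(sent):
                C[idx[w], idx[sent[i + d]]] += 1.0 / d
                C[idx[sent[i + d]], idx[w]] += 1.0 / d

# PPMI = max(0, log P(w,c) / (P(w) P(c))): removes raw-frequency bias.
total = C.sum()
Pw = C.sum(axis=1, keepdims=True) / total
Pc = C.sum(axis=0, keepdims=True) / total
with np.errstate(divide="ignore", invalid="ignore"):
    pmi = np.log((C / total) / (Pw * Pc))
ppmi = np.where(np.isfinite(pmi) & (pmi > 0), pmi, 0.0)

# Truncated SVD: keep the top-DIM singular directions as word embeddings.
U, S, _ = np.linalg.svd(ppmi)
emb = U[:, :DIM] * np.sqrt(S[:DIM])
print(emb.shape)  # (V, DIM) = (7, 4) for this toy vocabulary
```

Scaling `U` by the square root of the singular values is one common convention; using `U` alone or `U @ diag(S)` are equally valid choices.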
Step 2 – Train Word2Vec (CBOW + Negative Sampling)
```bash
python word2vec.py
# Output: embeddings/cbow.pt
```
Hyperparameters:
| Parameter | Value | Justification |
|---|---|---|
| CONTEXT_WINDOW | 5 | Matches SVD for fair comparison |
| EMBEDDING_DIM | 100 | Matches SVD |
| NEG_SAMPLES | 10 | Mikolov et al. recommend 5-20 for small corpora |
| EPOCHS | 10 | Loss plateaus after ~8 epochs on Brown |
| BATCH_SIZE | 512 | Efficient mini-batch |
| LR | 0.001 | Adam default; stable convergence |
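A single CBOW training step with negative sampling looks roughly like the sketch below. Sizes and the random (context, target, negative) batch are invented for illustration; the real `word2vec.py` draws these from Brown and loops for the epochs listed above:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy sizes (assignment: D=100, NEG=10, batch 512).
V, D, NEG, B, CTX = 100, 16, 10, 8, 4

in_emb = nn.Embedding(V, D)   # input (context) vectors
out_emb = nn.Embedding(V, D)  # output (target) vectors

ctx = torch.randint(0, V, (B, CTX))  # context word ids
pos = torch.randint(0, V, (B,))      # true centre words
neg = torch.randint(0, V, (B, NEG))  # negative samples

h = in_emb(ctx).mean(dim=1)                 # CBOW: average the context vectors
pos_score = (h * out_emb(pos)).sum(-1)      # (B,)
neg_score = torch.bmm(out_emb(neg), h.unsqueeze(2)).squeeze(2)  # (B, NEG)

# Negative-sampling objective: raise the score of the true word,
# lower the scores of the sampled negatives.
loss = -(F.logsigmoid(pos_score).mean() + F.logsigmoid(-neg_score).mean())
loss.backward()
print(loss.item())
```

In full training this step would sit inside an Adam loop (`lr=0.001`) over mini-batches of 512, with negatives drawn from the unigram distribution raised to the 3/4 power as in Mikolov et al.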
Step 3 – POS Tagger + Analysis
```bash
# Run everything (analogy + bias + tagger training):
python pos_tagger.py --mode train_all
# Run only analogy tests:
python pos_tagger.py --mode analogy
# Run only bias check:
python pos_tagger.py --mode bias
```
Outputs: console logs + confusion_*.png images.
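The analogy mode boils down to vector arithmetic plus a cosine-similarity ranking. A minimal sketch on hand-made 2-d vectors (the words and values below are invented; the script applies the same arithmetic to the trained embeddings):

```python
import numpy as np

# Hand-made toy vectors for the classic king - man + woman -> queen test.
emb = {
    "king":  np.array([2.0, 1.0]),
    "queen": np.array([2.0, 0.1]),
    "man":   np.array([1.0, 1.0]),
    "woman": np.array([1.0, 0.0]),
    "apple": np.array([0.5, 2.0]),  # distractor
}

def cos(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

query = emb["king"] - emb["man"] + emb["woman"]
# Rank candidates by cosine, excluding the three query words themselves.
best = max((w for w in emb if w not in {"king", "man", "woman"}),
           key=lambda w: cos(query, emb[w]))
print(best)  # -> queen
```

Excluding the query words from the candidate set is standard practice; otherwise one of the inputs often wins trivially.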
MLP Hyperparameters
| Parameter | Value |
|---|---|
| WINDOW | 2 (5-word input window) |
| HIDDEN | [512, 256] |
| DROPOUT | 0.3 |
| ACTIVATION | ReLU |
| EPOCHS | 20 |
| BATCH_SIZE | 512 |
| LR | 0.001 (Adam) |
| Freeze embeddings? | GloVe: frozen; SVD & CBOW: fine-tuned |
Freeze justification: GloVe was trained on a massive external corpus; fine-tuning on the small Brown corpus risks degrading its representations. SVD/CBOW embeddings were trained on Brown so joint fine-tuning is beneficial.
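The tagger architecture from the table — a 5-word window of concatenated embeddings fed to a [512, 256] ReLU MLP with dropout — can be sketched as below. Vocabulary, embedding size, and the random batch are toy stand-ins (the assignment uses 100-d embeddings and the 12-tag universal tagset over Brown):

```python
import torch
import torch.nn as nn

# Toy sizes (assignment: D=100, 12 universal tags, Brown vocabulary).
V, D, TAGS, WIN = 50, 16, 12, 2
span = 2 * WIN + 1  # 5-word input window

emb = nn.Embedding(V, D)
emb.weight.requires_grad = False  # frozen, as for GloVe; SVD/CBOW stay trainable

mlp = nn.Sequential(
    nn.Linear(span * D, 512), nn.ReLU(), nn.Dropout(0.3),
    nn.Linear(512, 256), nn.ReLU(), nn.Dropout(0.3),
    nn.Linear(256, TAGS),
)

x = torch.randint(0, V, (8, span))         # batch of 5-word windows
logits = mlp(emb(x).flatten(start_dim=1))  # (8, TAGS): one tag score per window
print(logits.shape)
```

Training would add `nn.CrossEntropyLoss` on the centre word's gold tag and an Adam optimizer (`lr=0.001`) over the parameters that require gradients, so the frozen GloVe table is automatically excluded.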
Created By: Abhay Sharma
Roll No. 2025201014