Assignment 2 – Introduction to NLP
Environment
```bash
pip install torch numpy scipy scikit-learn matplotlib seaborn nltk gensim
python -c "import nltk; nltk.download('brown'); nltk.download('universal_tagset')"
```
Python ≥ 3.9 and PyTorch ≥ 2.0 recommended.
File Structure
```text
<2025201014>_A2.zip
├── embeddings/
│   ├── svd.pt
│   └── cbow.pt
├── svd_embeddings.py
├── word2vec.py
├── pos_tagger.py
├── README.md
└── report.pdf
```
Pre-trained GloVe weights are downloaded automatically via `gensim.downloader`. Trained `.pt` files: https://huggingface.co/Abhay01702/Assignment02
Step 1 – Train SVD Embeddings
```bash
python svd_embeddings.py
# Output: embeddings/svd.pt
```
Hyperparameters:
| Parameter | Value | Justification |
|---|---|---|
| CONTEXT_WINDOW | 5 | Standard NLP window; captures local syntactic context |
| EMBEDDING_DIM | 100 | Balances expressiveness & cost; matches Word2Vec |
| MIN_FREQ | 2 | Removes hapax legomena that add noise |
| WEIGHTING | PPMI + distance decay | Removes frequency bias; weights nearby context more |
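The PPMI + distance-decay weighting in the table can be sketched as follows. This is a toy illustration on a two-sentence corpus with made-up sizes (`WINDOW=2`, `DIM=4`), not the assignment's actual `svd_embeddings.py`, which runs the same pipeline over Brown with `CONTEXT_WINDOW=5` and `EMBEDDING_DIM=100`:

```python
import numpy as np

# Toy corpus; the assignment uses the Brown corpus instead.
corpus = [["the", "cat", "sat", "on", "the", "mat"],
          ["the", "dog", "sat", "on", "the", "rug"]]

WINDOW = 2  # toy window (assignment: 5)
DIM = 4     # toy embedding size (assignment: 100)

vocab = sorted({w for sent in corpus for w in sent})
idx = {w: i for i, w in enumerate(vocab)}
V = len(vocab)

# Distance-decayed co-occurrence counts: a neighbour at distance d adds 1/d,
# so nearby context words weigh more than distant ones.
C = np.zeros((V, V))
for sent in corpus:
    for i, w in enumerate(sent):
        for d in range(1, WINDOW + 1):
            if i + d < len(sent):
                C[idx[w], idx[sent[i + d]]] += 1.0 / d
                C[idx[sent[i + d]], idx[w]] += 1.0 / d

# PPMI = max(0, log P(w,c) / (P(w) P(c))): removes raw-frequency bias.
total = C.sum()
Pw = C.sum(axis=1, keepdims=True) / total
Pc = C.sum(axis=0, keepdims=True) / total
with np.errstate(divide="ignore", invalid="ignore"):
    pmi = np.log((C / total) / (Pw * Pc))
ppmi = np.where(np.isfinite(pmi) & (pmi > 0), pmi, 0.0)

# Truncated SVD: keep the top-DIM singular directions as word embeddings.
U, S, _ = np.linalg.svd(ppmi)
emb = U[:, :DIM] * np.sqrt(S[:DIM])
print(emb.shape)  # (V, DIM) = (7, 4) for this toy vocabulary
```

Scaling `U` by the square root of the singular values is one common convention; using `U` alone or `U @ diag(S)` are equally valid choices.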
Step 2 – Train Word2Vec (CBOW + Negative Sampling)
```bash
python word2vec.py
# Output: embeddings/cbow.pt
```
Hyperparameters:
| Parameter | Value | Justification |
|---|---|---|
| CONTEXT_WINDOW | 5 | Matches SVD for fair comparison |
| EMBEDDING_DIM | 100 | Matches SVD |
| NEG_SAMPLES | 10 | Mikolov et al. recommend 5-20 for small corpora |
| EPOCHS | 10 | Loss plateaus after ~8 epochs on Brown |
| BATCH_SIZE | 512 | Efficient mini-batch |
| LR | 0.001 | Adam default; stable convergence |
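A single CBOW training step with negative sampling looks roughly like the sketch below. Sizes and the random (context, target, negative) batch are invented for illustration; the real `word2vec.py` draws these from Brown and loops for the epochs listed above:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy sizes (assignment: D=100, NEG=10, batch 512).
V, D, NEG, B, CTX = 100, 16, 10, 8, 4

in_emb = nn.Embedding(V, D)   # input (context) vectors
out_emb = nn.Embedding(V, D)  # output (target) vectors

ctx = torch.randint(0, V, (B, CTX))  # context word ids
pos = torch.randint(0, V, (B,))      # true centre words
neg = torch.randint(0, V, (B, NEG))  # negative samples

h = in_emb(ctx).mean(dim=1)                 # CBOW: average the context vectors
pos_score = (h * out_emb(pos)).sum(-1)      # (B,)
neg_score = torch.bmm(out_emb(neg), h.unsqueeze(2)).squeeze(2)  # (B, NEG)

# Negative-sampling objective: raise the score of the true word,
# lower the scores of the sampled negatives.
loss = -(F.logsigmoid(pos_score).mean() + F.logsigmoid(-neg_score).mean())
loss.backward()
print(loss.item())
```

In full training this step would sit inside an Adam loop (`lr=0.001`) over mini-batches of 512, with negatives drawn from the unigram distribution raised to the 3/4 power as in Mikolov et al.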
Step 3 – POS Tagger + Analysis
```bash
# Run everything (analogy + bias + tagger training):
python pos_tagger.py --mode train_all
# Run only analogy tests:
python pos_tagger.py --mode analogy
# Run only bias check:
python pos_tagger.py --mode bias
```
Outputs: console logs + confusion_*.png images.
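The analogy mode boils down to vector arithmetic plus a cosine-similarity ranking. A minimal sketch on hand-made 2-d vectors (the words and values below are invented; the script applies the same arithmetic to the trained embeddings):

```python
import numpy as np

# Hand-made toy vectors for the classic king - man + woman -> queen test.
emb = {
    "king":  np.array([2.0, 1.0]),
    "queen": np.array([2.0, 0.1]),
    "man":   np.array([1.0, 1.0]),
    "woman": np.array([1.0, 0.0]),
    "apple": np.array([0.5, 2.0]),  # distractor
}

def cos(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

query = emb["king"] - emb["man"] + emb["woman"]
# Rank candidates by cosine, excluding the three query words themselves.
best = max((w for w in emb if w not in {"king", "man", "woman"}),
           key=lambda w: cos(query, emb[w]))
print(best)  # -> queen
```

Excluding the query words from the candidate set is standard practice; otherwise one of the inputs often wins trivially.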
MLP Hyperparameters
| Parameter | Value |
|---|---|
| WINDOW | 2 (5-word input window) |
| HIDDEN | [512, 256] |
| DROPOUT | 0.3 |
| ACTIVATION | ReLU |
| EPOCHS | 20 |
| BATCH_SIZE | 512 |
| LR | 0.001 (Adam) |
| Freeze embeddings? | GloVe: frozen; SVD & CBOW: fine-tuned |
Freeze justification: GloVe was trained on a massive external corpus; fine-tuning on the small Brown corpus risks degrading its representations. SVD/CBOW embeddings were trained on Brown so joint fine-tuning is beneficial.
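The tagger architecture from the table — a 5-word window of concatenated embeddings fed to a [512, 256] ReLU MLP with dropout — can be sketched as below. Vocabulary, embedding size, and the random batch are toy stand-ins (the assignment uses 100-d embeddings and the 12-tag universal tagset over Brown):

```python
import torch
import torch.nn as nn

# Toy sizes (assignment: D=100, 12 universal tags, Brown vocabulary).
V, D, TAGS, WIN = 50, 16, 12, 2
span = 2 * WIN + 1  # 5-word input window

emb = nn.Embedding(V, D)
emb.weight.requires_grad = False  # frozen, as for GloVe; SVD/CBOW stay trainable

mlp = nn.Sequential(
    nn.Linear(span * D, 512), nn.ReLU(), nn.Dropout(0.3),
    nn.Linear(512, 256), nn.ReLU(), nn.Dropout(0.3),
    nn.Linear(256, TAGS),
)

x = torch.randint(0, V, (8, span))         # batch of 5-word windows
logits = mlp(emb(x).flatten(start_dim=1))  # (8, TAGS): one tag score per window
print(logits.shape)
```

Training would add `nn.CrossEntropyLoss` on the centre word's gold tag and an Adam optimizer (`lr=0.001`) over the parameters that require gradients, so the frozen GloVe table is automatically excluded.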
Created By: Abhay Sharma
Roll No. 2025201014