
Assignment 2 – Introduction to NLP

Environment

pip install torch numpy scipy scikit-learn matplotlib seaborn nltk gensim
python -c "import nltk; nltk.download('brown'); nltk.download('universal_tagset')"

Python ≥ 3.9 and PyTorch ≥ 2.0 recommended.


File Structure

<2025201014>_A2.zip
├── embeddings/
│   ├── svd.pt
│   └── cbow.pt
├── svd_embeddings.py
├── word2vec.py
├── pos_tagger.py
├── README.md
└── report.pdf

Pre-trained GloVe weights are downloaded automatically via gensim.downloader. Trained .pt files are available at https://huggingface.co/Abhay01702/Assignment02


Step 1 – Train SVD Embeddings

python svd_embeddings.py
# Output: embeddings/svd.pt

Hyperparameters:

| Parameter | Value | Justification |
|---|---|---|
| CONTEXT_WINDOW | 5 | Standard NLP window; captures local syntactic context |
| EMBEDDING_DIM | 100 | Balances expressiveness and cost; matches Word2Vec |
| MIN_FREQ | 2 | Removes hapax legomena that add noise |
| WEIGHTING | PPMI + distance decay | Removes frequency bias; weights nearby context more heavily |
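The pipeline above (distance-weighted co-occurrence counts, PPMI re-weighting, truncated SVD) can be sketched roughly as follows. This is a minimal illustration, not the actual contents of svd_embeddings.py; the function name and corpus format are assumptions.

```python
import numpy as np
from collections import Counter

def ppmi_svd_embeddings(corpus, window=5, dim=100, min_freq=2):
    """Hypothetical sketch: PPMI + distance-decay co-occurrence, then truncated SVD.

    corpus: list of tokenized sentences (lists of strings).
    Returns (vocab, embeddings) with embeddings of shape (len(vocab), dim).
    """
    # Vocabulary filtered by minimum frequency (drops hapax legomena)
    counts = Counter(w for sent in corpus for w in sent)
    vocab = [w for w, c in counts.items() if c >= min_freq]
    idx = {w: i for i, w in enumerate(vocab)}
    V = len(vocab)

    # Co-occurrence counts with 1/distance decay: nearer context weighs more
    C = np.zeros((V, V))
    for sent in corpus:
        ids = [idx[w] for w in sent if w in idx]
        for i, wi in enumerate(ids):
            for j in range(max(0, i - window), min(len(ids), i + window + 1)):
                if i != j:
                    C[wi, ids[j]] += 1.0 / abs(i - j)

    # Positive PMI: max(0, log P(w,c) / (P(w) P(c))) removes frequency bias
    total = C.sum()
    pw = C.sum(axis=1, keepdims=True) / total
    pc = C.sum(axis=0, keepdims=True) / total
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log((C / total) / (pw * pc))
    ppmi = np.maximum(pmi, 0.0)
    ppmi[~np.isfinite(ppmi)] = 0.0

    # Truncated SVD: keep the top-`dim` singular directions, scaled by sqrt(S)
    U, S, _ = np.linalg.svd(ppmi)
    emb = U[:, :dim] * np.sqrt(S[:dim])
    return vocab, emb
```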

Step 2 – Train Word2Vec (CBOW + Negative Sampling)

python word2vec.py
# Output: embeddings/cbow.pt

Hyperparameters:

| Parameter | Value | Justification |
|---|---|---|
| CONTEXT_WINDOW | 5 | Matches SVD for fair comparison |
| EMBEDDING_DIM | 100 | Matches SVD |
| NEG_SAMPLES | 10 | Mikolov et al. recommend 5-20 for small corpora |
| EPOCHS | 10 | Loss plateaus after ~8 epochs on Brown |
| BATCH_SIZE | 512 | Efficient mini-batch size |
| LR | 0.001 | Adam default; stable convergence |
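The CBOW-with-negative-sampling objective can be sketched as a small PyTorch module: average the context embeddings, score the true target against K sampled negatives, and maximise log-sigmoid of positive scores plus log-sigmoid of negated negative scores. This is an illustrative sketch, not the exact model in word2vec.py; the class name and tensor shapes are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CBOWNegSampling(nn.Module):
    """Hypothetical sketch of CBOW with negative sampling."""

    def __init__(self, vocab_size, dim=100):
        super().__init__()
        self.in_emb = nn.Embedding(vocab_size, dim)   # context (input) vectors
        self.out_emb = nn.Embedding(vocab_size, dim)  # target (output) vectors

    def forward(self, context, target, negatives):
        # context: (B, 2*window), target: (B,), negatives: (B, K)
        h = self.in_emb(context).mean(dim=1)               # (B, D) averaged context
        pos = (self.out_emb(target) * h).sum(-1)           # (B,) positive scores
        neg = torch.bmm(self.out_emb(negatives),           # (B, K) negative scores
                        h.unsqueeze(-1)).squeeze(-1)
        # Negative-sampling loss: -log s(pos) - sum_k log s(-neg_k)
        return -(F.logsigmoid(pos) + F.logsigmoid(-neg).sum(-1)).mean()
```

In training, `negatives` would be drawn from the unigram distribution raised to the 3/4 power, as in the original Word2Vec papers.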

Step 3 – POS Tagger + Analysis

# Run everything (analogy + bias + tagger training):
python pos_tagger.py --mode train_all

# Run only analogy tests:
python pos_tagger.py --mode analogy

# Run only bias check:
python pos_tagger.py --mode bias

Outputs: console logs + confusion_.png images.
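The analogy mode presumably solves "a is to b as c is to ?" by nearest-neighbour search on the offset vector b − a + c. A minimal sketch of such a query, assuming row-normalised embeddings and `stoi`/`itos` vocab mappings (the helper name and signature are hypothetical, not taken from pos_tagger.py):

```python
import torch

def analogy(emb, stoi, itos, a, b, c, topk=1):
    """Hypothetical sketch: words nearest to vec(b) - vec(a) + vec(c).

    emb: (V, D) tensor with unit-norm rows, so dot product = cosine similarity.
    """
    q = emb[stoi[b]] - emb[stoi[a]] + emb[stoi[c]]
    q = q / q.norm()
    sims = emb @ q                       # cosine similarity to every word
    for w in (a, b, c):
        sims[stoi[w]] = -float("inf")    # exclude the query words themselves
    return [itos[i] for i in sims.topk(topk).indices.tolist()]
```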


MLP Hyperparameters

| Parameter | Value |
|---|---|
| WINDOW | 2 (5-word input window) |
| HIDDEN | [512, 256] |
| DROPOUT | 0.3 |
| ACTIVATION | ReLU |
| EPOCHS | 20 |
| BATCH_SIZE | 512 |
| LR | 0.001 (Adam) |
| Freeze embeddings? | GloVe: frozen; SVD & CBOW: fine-tuned |

Freeze justification: GloVe was trained on a massive external corpus; fine-tuning on the small Brown corpus risks degrading its representations. SVD/CBOW embeddings were trained on Brown so joint fine-tuning is beneficial.
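The window-MLP architecture and the per-embedding freeze choice can be sketched together: `nn.Embedding.from_pretrained(..., freeze=...)` keeps the GloVe table fixed while leaving SVD/CBOW tables trainable. A minimal sketch under assumed names (the class is illustrative, not the actual tagger in pos_tagger.py):

```python
import torch
import torch.nn as nn

class WindowMLPTagger(nn.Module):
    """Hypothetical sketch: 5-word-window MLP tagger matching the table above."""

    def __init__(self, pretrained, n_tags, window=2, hidden=(512, 256),
                 dropout=0.3, freeze=False):
        super().__init__()
        # freeze=True for GloVe; False for SVD/CBOW so they fine-tune jointly
        self.emb = nn.Embedding.from_pretrained(pretrained, freeze=freeze)
        dim = pretrained.size(1) * (2 * window + 1)  # concat 5 embeddings
        layers = []
        for h in hidden:                             # [512, 256] hidden stack
            layers += [nn.Linear(dim, h), nn.ReLU(), nn.Dropout(dropout)]
            dim = h
        layers.append(nn.Linear(dim, n_tags))
        self.mlp = nn.Sequential(*layers)

    def forward(self, window_ids):                   # (B, 2*window + 1)
        x = self.emb(window_ids).flatten(1)          # (B, 5 * D)
        return self.mlp(x)                           # (B, n_tags) tag logits
```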

Created By: Abhay Sharma
Roll No. 2025201014