# Assignment 2 – Introduction to NLP

## Environment

```bash
pip install torch numpy scipy scikit-learn matplotlib seaborn nltk gensim
python -c "import nltk; nltk.download('brown'); nltk.download('universal_tagset')"
```

Python ≥ 3.9 and PyTorch ≥ 2.0 recommended.

---

## File Structure

```
<2025201014>_A2.zip
├── embeddings/
│   ├── svd.pt
│   └── cbow.pt
├── svd_embeddings.py
├── word2vec.py
├── pos_tagger.py
├── README.md
└── report.pdf
```

> Pre-trained GloVe weights are downloaded automatically via `gensim.downloader`.
> Trained `.pt` files: <https://huggingface.co/Abhay01702/Assignment02>

---

## Step 1 – Train SVD Embeddings

```bash
python svd_embeddings.py
# Output: embeddings/svd.pt
```

Hyperparameters:

| Parameter | Value | Justification |
|---|---|---|
| CONTEXT_WINDOW | 5 | Standard NLP window; captures local syntactic context |
| EMBEDDING_DIM | 100 | Balances expressiveness and cost; matches Word2Vec |
| MIN_FREQ | 2 | Removes hapax legomena that add noise |
| WEIGHTING | PPMI + distance decay | Removes frequency bias; weights nearby context more |

---

## Step 2 – Train Word2Vec (CBOW + Negative Sampling)

```bash
python word2vec.py
# Output: embeddings/cbow.pt
```

Hyperparameters:

| Parameter | Value | Justification |
|---|---|---|
| CONTEXT_WINDOW | 5 | Matches SVD for a fair comparison |
| EMBEDDING_DIM | 100 | Matches SVD |
| NEG_SAMPLES | 10 | Mikolov et al. recommend 5–20 for small corpora |
| EPOCHS | 10 | Loss plateaus after ~8 epochs on Brown |
| BATCH_SIZE | 512 | Efficient mini-batching |
| LR | 0.001 | Adam default; stable convergence |

---

## Step 3 – POS Tagger + Analysis

```bash
# Run everything (analogy + bias + tagger training):
python pos_tagger.py --mode train_all

# Run only analogy tests:
python pos_tagger.py --mode analogy

# Run only bias check:
python pos_tagger.py --mode bias
```

Outputs: console logs + `confusion_.png` images.
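The PPMI + distance-decay weighting used in Step 1 can be sketched as follows. This is a minimal illustration, not the code in `svd_embeddings.py`: the function name, the skipped MIN_FREQ filtering, and the dense co-occurrence matrix (the real script would need a sparse one for Brown) are all simplifications.

```python
# Sketch of Step 1: distance-weighted co-occurrence counts -> PPMI -> truncated SVD.
# Illustrative only; names and simplifications do not mirror svd_embeddings.py.
import numpy as np

CONTEXT_WINDOW, EMBEDDING_DIM = 5, 100

def ppmi_svd_embeddings(sentences, dim=EMBEDDING_DIM, window=CONTEXT_WINDOW):
    vocab = {w: i for i, w in enumerate({w for s in sentences for w in s})}
    V = len(vocab)
    counts = np.zeros((V, V))
    for sent in sentences:
        for i, w in enumerate(sent):
            for j in range(max(0, i - window), min(len(sent), i + window + 1)):
                if i != j:
                    # distance decay: closer context words contribute more
                    counts[vocab[w], vocab[sent[j]]] += 1.0 / abs(i - j)
    total = counts.sum()
    row = counts.sum(axis=1, keepdims=True)
    col = counts.sum(axis=0, keepdims=True)
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log((counts * total) / (row * col))
    # PPMI: clip negative/undefined PMI values to zero (removes frequency bias)
    ppmi = np.nan_to_num(np.maximum(pmi, 0.0))
    U, S, _ = np.linalg.svd(ppmi, full_matrices=False)
    k = min(dim, V)
    # scale left singular vectors by sqrt of singular values (common convention)
    return vocab, U[:, :k] * np.sqrt(S[:k])
```

The `sqrt(S)` scaling is one common convention for SVD word vectors; using `U` alone is another reasonable choice.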
---

## MLP Hyperparameters

| Parameter | Value |
|---|---|
| WINDOW | 2 (5-word input window) |
| HIDDEN | [512, 256] |
| DROPOUT | 0.3 |
| ACTIVATION | ReLU |
| EPOCHS | 20 |
| BATCH_SIZE | 512 |
| LR | 0.001 (Adam) |
| Freeze embeddings? | GloVe: frozen; SVD & CBOW: fine-tuned |

Freeze justification: GloVe was trained on a massive external corpus, so fine-tuning on the small Brown corpus risks degrading its representations. SVD/CBOW embeddings were trained on Brown itself, so joint fine-tuning is beneficial.

**Created By:** Abhay Sharma
**Roll No.:** 2025201014
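The window-MLP architecture in the table above can be sketched as a PyTorch module: the 2*WINDOW+1 = 5 context-word embeddings are concatenated and fed through the [512, 256] ReLU stack with dropout. The class name and forward interface here are hypothetical, not the actual API of `pos_tagger.py`.

```python
# Hypothetical sketch of the window-based MLP POS tagger described above.
import torch
import torch.nn as nn

class WindowMLPTagger(nn.Module):
    def __init__(self, n_tags, emb_dim=100, window=2,
                 hidden=(512, 256), dropout=0.3):
        super().__init__()
        in_dim = (2 * window + 1) * emb_dim  # 5 concatenated word vectors
        layers, prev = [], in_dim
        for h in hidden:
            layers += [nn.Linear(prev, h), nn.ReLU(), nn.Dropout(dropout)]
            prev = h
        layers.append(nn.Linear(prev, n_tags))  # logits over the tagset
        self.net = nn.Sequential(*layers)

    def forward(self, window_embs):
        # window_embs: (batch, 2*window+1, emb_dim) -> (batch, n_tags)
        return self.net(window_embs.flatten(1))
```

Trained with `nn.CrossEntropyLoss` and Adam (LR 0.001), this matches the hyperparameters listed above; the embedding lookup (frozen for GloVe, trainable for SVD/CBOW) would sit in front of this module.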