# Assignment 2 – Introduction to NLP

## Environment

```bash
pip install torch numpy scipy scikit-learn matplotlib seaborn nltk gensim
python -c "import nltk; nltk.download('brown'); nltk.download('universal_tagset')"
```

Python ≥ 3.9 and PyTorch ≥ 2.0 recommended.

---
## File Structure

```
<2025201014>_A2.zip
├── embeddings/
│   ├── svd.pt
│   └── cbow.pt
├── svd_embeddings.py
├── word2vec.py
├── pos_tagger.py
├── README.md
└── report.pdf
```

> Pre-trained GloVe weights are downloaded automatically via `gensim.downloader`.
> Trained `.pt` files: <https://huggingface.co/Abhay01702/Assignment02>

---
## Step 1 – Train SVD Embeddings

```bash
python svd_embeddings.py
# Output: embeddings/svd.pt
```
Hyperparameters:

| Parameter | Value | Justification |
|---|---|---|
| CONTEXT_WINDOW | 5 | Standard NLP window; captures local syntactic context |
| EMBEDDING_DIM | 100 | Balances expressiveness and cost; matches Word2Vec |
| MIN_FREQ | 2 | Removes hapax legomena that add noise |
| WEIGHTING | PPMI + distance decay | Removes frequency bias; weights nearby context more heavily |
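The table above can be made concrete with a small sketch of the pipeline: a distance-decayed co-occurrence matrix, PPMI reweighting, then truncated SVD. The helper below is illustrative only (the function name and toy corpus are not from `svd_embeddings.py`):

```python
import numpy as np
from collections import Counter

def ppmi_svd_embeddings(corpus, window=5, dim=2, min_freq=1):
    """Sketch: distance-weighted co-occurrence -> PPMI -> truncated SVD."""
    # Vocabulary, dropping words rarer than min_freq.
    counts = Counter(w for sent in corpus for w in sent)
    vocab = sorted(w for w, c in counts.items() if c >= min_freq)
    idx = {w: i for i, w in enumerate(vocab)}

    # Distance-decayed co-occurrence: a context word at offset d gets weight 1/d.
    C = np.zeros((len(vocab), len(vocab)))
    for sent in corpus:
        for i, w in enumerate(sent):
            if w not in idx:
                continue
            for d in range(1, window + 1):
                for j in (i - d, i + d):
                    if 0 <= j < len(sent) and sent[j] in idx:
                        C[idx[w], idx[sent[j]]] += 1.0 / d

    # PPMI: max(0, log P(w,c) / (P(w) P(c))) removes frequency bias.
    total = C.sum()
    pw = C.sum(axis=1, keepdims=True) / total
    pc = C.sum(axis=0, keepdims=True) / total
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log((C / total) / (pw * pc))
    ppmi = np.where(np.isfinite(pmi) & (pmi > 0), pmi, 0.0)

    # Truncated SVD: keep the top `dim` singular directions as embeddings.
    U, S, _ = np.linalg.svd(ppmi)
    return vocab, U[:, :dim] * S[:dim]

corpus = [["the", "cat", "sat"], ["the", "dog", "sat"]]
vocab, emb = ppmi_svd_embeddings(corpus, window=2, dim=2)
print(emb.shape)  # (4, 2)
```

The real script uses `CONTEXT_WINDOW=5`, `EMBEDDING_DIM=100`, and `MIN_FREQ=2` per the table; the tiny values here just keep the example readable.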

---

## Step 2 – Train Word2Vec (CBOW + Negative Sampling)

```bash
python word2vec.py
# Output: embeddings/cbow.pt
```

Hyperparameters:

| Parameter | Value | Justification |
|---|---|---|
| CONTEXT_WINDOW | 5 | Matches SVD for a fair comparison |
| EMBEDDING_DIM | 100 | Matches SVD |
| NEG_SAMPLES | 10 | Mikolov et al. recommend 5–20 for small corpora |
| EPOCHS | 10 | Loss plateaus after ~8 epochs on Brown |
| BATCH_SIZE | 512 | Efficient mini-batch size |
| LR | 0.001 | Adam default; stable convergence |
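For reference, the CBOW negative-sampling objective averages the context embeddings into one hidden vector, rewards a high score against the true centre word, and penalises high scores against `NEG_SAMPLES` random negatives. A minimal numpy sketch (the function and toy tensors are illustrative, not code from `word2vec.py`):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cbow_neg_sampling_loss(ctx_vecs, target_vec, neg_vecs):
    """ctx_vecs: (C, D) context embeddings; target_vec: (D,) true centre word;
    neg_vecs: (K, D) sampled negatives. Returns the scalar loss."""
    h = ctx_vecs.mean(axis=0)                      # CBOW hidden state
    pos = np.log(sigmoid(target_vec @ h))          # pull the true word closer
    neg = np.log(sigmoid(-(neg_vecs @ h))).sum()   # push negatives away
    return -(pos + neg)

rng = np.random.default_rng(0)
loss = cbow_neg_sampling_loss(rng.normal(size=(4, 100)),   # window of 4 context words
                              rng.normal(size=100),
                              rng.normal(size=(10, 100)))  # NEG_SAMPLES = 10
print(float(loss) > 0)  # True: each term is a negative log-probability
```

In the actual training script this loss would be minimised with Adam (`LR=0.001`) over mini-batches of 512 examples.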

---

## Step 3 – POS Tagger + Analysis

```bash
# Run everything (analogy + bias + tagger training):
python pos_tagger.py --mode train_all

# Run only the analogy tests:
python pos_tagger.py --mode analogy

# Run only the bias check:
python pos_tagger.py --mode bias
```

Outputs: console logs plus `confusion_<variant>.png` images.
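The analogy mode follows the usual 3CosAdd recipe: answer "a is to b as c is to ?" with the word nearest to b − a + c by cosine similarity, excluding the three query words. A toy sketch (the `analogy` helper and hand-made vectors below are illustrative, not the tagger's code):

```python
import numpy as np

def analogy(emb, a, b, c):
    """Return the word closest to emb[b] - emb[a] + emb[c] by cosine similarity,
    skipping the query words themselves. `emb` is a {word: vector} dict."""
    q = emb[b] - emb[a] + emb[c]
    q = q / np.linalg.norm(q)
    best, best_sim = None, -np.inf
    for w, v in emb.items():
        if w in (a, b, c):
            continue
        sim = (v / np.linalg.norm(v)) @ q
        if sim > best_sim:
            best, best_sim = w, sim
    return best

# Toy 2-d vectors arranged so that king - man + woman lands on queen.
emb = {
    "man":   np.array([1.0, 0.0]),
    "woman": np.array([0.0, 1.0]),
    "king":  np.array([1.0, 1.0]),
    "queen": np.array([0.0, 2.0]),
}
print(analogy(emb, "man", "king", "woman"))  # queen
```

The bias mode can reuse the same cosine machinery, comparing similarities of occupation words to gendered anchor words.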

---

## MLP Hyperparameters
| Parameter | Value |
|---|---|
| WINDOW | 2 (5-word input window) |
| HIDDEN | [512, 256] |
| DROPOUT | 0.3 |
| ACTIVATION | ReLU |
| EPOCHS | 20 |
| BATCH_SIZE | 512 |
| LR | 0.001 (Adam) |
| Freeze embeddings? | GloVe: frozen; SVD & CBOW: fine-tuned |

Freeze justification: GloVe was trained on a massive external corpus, so fine-tuning on
the small Brown corpus risks degrading its representations. The SVD and CBOW embeddings
were trained on Brown itself, so joint fine-tuning is beneficial.
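The WINDOW=2 row means each token is classified from the concatenated embeddings of itself and its two neighbours on each side, with zero-padding at sentence boundaries. A minimal sketch of that feature construction (the function and variable names are illustrative, not from `pos_tagger.py`):

```python
import numpy as np

def window_features(emb_matrix, sent_ids, pos, window=2):
    """Concatenate the embeddings of the 2*window+1 tokens around `pos`.
    With window=2 and 100-dim embeddings this gives a 500-dim MLP input."""
    dim = emb_matrix.shape[1]
    feats = []
    for offset in range(-window, window + 1):
        j = pos + offset
        if 0 <= j < len(sent_ids):
            feats.append(emb_matrix[sent_ids[j]])
        else:
            feats.append(np.zeros(dim))  # zero-pad outside the sentence
    return np.concatenate(feats)

emb = np.random.default_rng(0).normal(size=(50, 100))  # toy 50-word embedding table
sent = [3, 7, 12]                                      # token ids of a 3-word sentence
x = window_features(emb, sent, pos=0)
print(x.shape)  # (500,)
```

For the first token (`pos=0`) the two left-context slots are zero vectors, so the first 200 dimensions of `x` are zero; the resulting 500-dim vectors feed the [512, 256] MLP described above.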

**Created By:** Abhay Sharma
**Roll No.:** 2025201014