# Assignment 2 – Introduction to NLP

## Environment

```bash
pip install torch numpy scipy scikit-learn matplotlib seaborn nltk gensim
python -c "import nltk; nltk.download('brown'); nltk.download('universal_tagset')"
```

Python ≥ 3.9 and PyTorch ≥ 2.0 are recommended.

---

## File Structure

```
<2025201014>_A2.zip
├── embeddings/
│   ├── svd.pt
│   └── cbow.pt
├── svd_embeddings.py
├── word2vec.py
├── pos_tagger.py
├── README.md
└── report.pdf
```

> Pre-trained GloVe weights are downloaded automatically via `gensim.downloader`.
> Trained `.pt` files: https://huggingface.co/Abhay01702/Assignment02

---
## Step 1 – Train SVD Embeddings

```bash
python svd_embeddings.py
# Output: embeddings/svd.pt
```

Hyperparameters:

| Parameter | Value | Justification |
|---|---|---|
| CONTEXT_WINDOW | 5 | Standard NLP window; captures local syntactic context |
| EMBEDDING_DIM | 100 | Balances expressiveness and cost; matches Word2Vec |
| MIN_FREQ | 2 | Removes hapax legomena that add noise |
| WEIGHTING | PPMI + distance decay | Removes frequency bias; weights nearby context more |
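For reference, the PPMI + distance-decay + truncated-SVD pipeline can be sketched on a toy corpus (the toy data, tiny dimensions, and variable names are illustrative assumptions; `svd_embeddings.py` builds the same quantities from Brown):

```python
from collections import Counter
import numpy as np

CONTEXT_WINDOW = 5
EMBEDDING_DIM = 2   # toy value; the assignment uses 100
MIN_FREQ = 1        # toy corpus is too small for MIN_FREQ = 2

corpus = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
]

# Vocabulary filtered by minimum frequency
freq = Counter(w for sent in corpus for w in sent)
vocab = sorted(w for w, c in freq.items() if c >= MIN_FREQ)
idx = {w: i for i, w in enumerate(vocab)}

# Co-occurrence matrix with 1/distance decay inside the window
V = len(vocab)
C = np.zeros((V, V))
for sent in corpus:
    for i, w in enumerate(sent):
        lo, hi = max(0, i - CONTEXT_WINDOW), min(len(sent), i + CONTEXT_WINDOW + 1)
        for j in range(lo, hi):
            if i != j:
                C[idx[w], idx[sent[j]]] += 1.0 / abs(i - j)

# Positive PMI: max(0, log P(w,c) / (P(w) P(c))) removes frequency bias
total = C.sum()
pw = C.sum(axis=1, keepdims=True) / total
pc = C.sum(axis=0, keepdims=True) / total
with np.errstate(divide="ignore", invalid="ignore"):
    pmi = np.log((C / total) / (pw * pc))
ppmi = np.maximum(pmi, 0.0)
ppmi[~np.isfinite(ppmi)] = 0.0

# Truncated SVD: keep the top-k singular directions as embeddings
U, S, _ = np.linalg.svd(ppmi)
embeddings = U[:, :EMBEDDING_DIM] * S[:EMBEDDING_DIM]
print(embeddings.shape)  # (V, EMBEDDING_DIM)
```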

---
## Step 2 – Train Word2Vec (CBOW + Negative Sampling)

```bash
python word2vec.py
# Output: embeddings/cbow.pt
```

Hyperparameters:

| Parameter | Value | Justification |
|---|---|---|
| CONTEXT_WINDOW | 5 | Matches SVD for a fair comparison |
| EMBEDDING_DIM | 100 | Matches SVD |
| NEG_SAMPLES | 10 | Mikolov et al. recommend 5–20 for small corpora |
| EPOCHS | 10 | Loss plateaus after ~8 epochs on Brown |
| BATCH_SIZE | 512 | Efficient mini-batching |
| LR | 0.001 | Adam default; stable convergence |
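A single CBOW + negative-sampling training step, sketched with toy word ids (the layout and names are assumptions; `word2vec.py` may organise this differently, and in practice negatives are drawn from the unigram^0.75 distribution rather than uniformly):

```python
import torch
import torch.nn.functional as F

VOCAB, DIM, NEG = 1000, 100, 10
in_emb = torch.nn.Embedding(VOCAB, DIM)   # context (input) embeddings
out_emb = torch.nn.Embedding(VOCAB, DIM)  # target (output) embeddings
opt = torch.optim.Adam(
    list(in_emb.parameters()) + list(out_emb.parameters()), lr=0.001
)

# Toy batch: B windows of 2*CONTEXT_WINDOW = 10 context ids + their targets
B = 4
context = torch.randint(0, VOCAB, (B, 10))
target = torch.randint(0, VOCAB, (B,))
negatives = torch.randint(0, VOCAB, (B, NEG))

h = in_emb(context).mean(dim=1)               # (B, DIM) averaged context
pos_score = (out_emb(target) * h).sum(dim=1)  # (B,) dot with true target
neg_score = torch.bmm(out_emb(negatives), h.unsqueeze(2)).squeeze(2)  # (B, NEG)

# Negative-sampling objective: push pos scores to 1, neg scores to 0
loss = (F.binary_cross_entropy_with_logits(pos_score, torch.ones(B))
        + F.binary_cross_entropy_with_logits(neg_score, torch.zeros(B, NEG)))
opt.zero_grad()
loss.backward()
opt.step()
```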

---
## Step 3 – POS Tagger + Analysis

```bash
# Run everything (analogy + bias + tagger training):
python pos_tagger.py --mode train_all

# Run only analogy tests:
python pos_tagger.py --mode analogy

# Run only bias check:
python pos_tagger.py --mode bias
```

Outputs: console logs + `confusion_<variant>.png` images.
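The analogy mode amounts to a nearest-neighbour search over `vec(b) - vec(a) + vec(c)`. A toy sketch (the vectors below are fabricated for illustration, not taken from the trained embeddings):

```python
import numpy as np

emb = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.9, 0.1, 0.8]),
    "man":   np.array([0.1, 0.9, 0.1]),
    "woman": np.array([0.1, 0.1, 0.9]),
    "apple": np.array([0.5, 0.5, 0.5]),  # distractor
}

def analogy(a, b, c):
    """Return the word whose vector is closest (cosine) to vec(b) - vec(a) + vec(c)."""
    target = emb[b] - emb[a] + emb[c]
    return max(
        (w for w in emb if w not in {a, b, c}),
        key=lambda w: emb[w] @ target
        / (np.linalg.norm(emb[w]) * np.linalg.norm(target)),
    )

print(analogy("man", "king", "woman"))  # queen
```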

---
## MLP Hyperparameters

| Parameter | Value |
|---|---|
| WINDOW | 2 (5-word input window) |
| HIDDEN | [512, 256] |
| DROPOUT | 0.3 |
| ACTIVATION | ReLU |
| EPOCHS | 20 |
| BATCH_SIZE | 512 |
| LR | 0.001 (Adam) |
| Freeze embeddings? | GloVe: frozen; SVD & CBOW: fine-tuned |

Freeze justification: GloVe was trained on a massive external corpus, so fine-tuning it on the small Brown corpus risks degrading its representations. The SVD and CBOW embeddings were trained on Brown itself, so joint fine-tuning is beneficial.
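A sketch of how the window MLP and the freeze switch fit together (class and variable names are hypothetical; `pos_tagger.py` may be organised differently):

```python
import torch
import torch.nn as nn

VOCAB, DIM, N_TAGS, WINDOW = 1000, 100, 12, 2  # 12 universal tags

class WindowMLPTagger(nn.Module):
    def __init__(self, freeze_embeddings: bool):
        super().__init__()
        self.emb = nn.Embedding(VOCAB, DIM)
        # GloVe variant: frozen; SVD/CBOW variants: fine-tuned jointly
        self.emb.weight.requires_grad = not freeze_embeddings
        in_dim = (2 * WINDOW + 1) * DIM  # 5 concatenated word vectors
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, 512), nn.ReLU(), nn.Dropout(0.3),
            nn.Linear(512, 256), nn.ReLU(), nn.Dropout(0.3),
            nn.Linear(256, N_TAGS),
        )

    def forward(self, window_ids):           # (B, 5) word ids
        x = self.emb(window_ids).flatten(1)  # (B, 5*DIM)
        return self.mlp(x)                   # (B, N_TAGS) tag logits

tagger = WindowMLPTagger(freeze_embeddings=True)  # GloVe variant
logits = tagger(torch.randint(0, VOCAB, (4, 5)))
print(logits.shape)  # torch.Size([4, 12])
```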

---

**Created By:** Abhay Sharma

**Roll No.:** 2025201014