# Assignment 2 – Introduction to NLP

## Environment

```bash
pip install torch numpy scipy scikit-learn matplotlib seaborn nltk gensim
python -c "import nltk; nltk.download('brown'); nltk.download('universal_tagset')"
```

Python ≥ 3.9 and PyTorch ≥ 2.0 are recommended.

---

## File Structure

```
<2025201014>_A2.zip
├── embeddings/
│   ├── svd.pt
│   └── cbow.pt
├── svd_embeddings.py
├── word2vec.py
├── pos_tagger.py
├── README.md
└── report.pdf
```

> Pre-trained GloVe weights are downloaded automatically via `gensim.downloader`.
> Trained `.pt` files: https://huggingface.co/Abhay01702/Assignment02

---
## Step 1 – Train SVD Embeddings

```bash
python svd_embeddings.py
# Output: embeddings/svd.pt
```

Hyperparameters:

| Parameter | Value | Justification |
|---|---|---|
| CONTEXT_WINDOW | 5 | Standard NLP window; captures local syntactic context |
| EMBEDDING_DIM | 100 | Balances expressiveness and cost; matches Word2Vec |
| MIN_FREQ | 2 | Removes hapax legomena that add noise |
| WEIGHTING | PPMI + distance decay | Removes frequency bias; weights nearby context more |
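For reference, the PPMI + distance-decay + truncated-SVD pipeline can be sketched on a toy corpus (the toy data, tiny dimensions, and variable names are illustrative assumptions; `svd_embeddings.py` builds the same quantities from Brown):

```python
from collections import Counter
import numpy as np

CONTEXT_WINDOW = 5
EMBEDDING_DIM = 2   # toy value; the assignment uses 100
MIN_FREQ = 1        # toy corpus is too small for MIN_FREQ = 2

corpus = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
]

# Vocabulary filtered by minimum frequency
freq = Counter(w for sent in corpus for w in sent)
vocab = sorted(w for w, c in freq.items() if c >= MIN_FREQ)
idx = {w: i for i, w in enumerate(vocab)}

# Co-occurrence matrix with 1/distance decay inside the window
V = len(vocab)
C = np.zeros((V, V))
for sent in corpus:
    for i, w in enumerate(sent):
        lo, hi = max(0, i - CONTEXT_WINDOW), min(len(sent), i + CONTEXT_WINDOW + 1)
        for j in range(lo, hi):
            if i != j:
                C[idx[w], idx[sent[j]]] += 1.0 / abs(i - j)

# Positive PMI: max(0, log P(w,c) / (P(w) P(c))) removes frequency bias
total = C.sum()
pw = C.sum(axis=1, keepdims=True) / total
pc = C.sum(axis=0, keepdims=True) / total
with np.errstate(divide="ignore", invalid="ignore"):
    pmi = np.log((C / total) / (pw * pc))
ppmi = np.maximum(pmi, 0.0)
ppmi[~np.isfinite(ppmi)] = 0.0

# Truncated SVD: keep the top-k singular directions as embeddings
U, S, _ = np.linalg.svd(ppmi)
embeddings = U[:, :EMBEDDING_DIM] * S[:EMBEDDING_DIM]
print(embeddings.shape)  # (V, EMBEDDING_DIM)
```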

---
## Step 2 – Train Word2Vec (CBOW + Negative Sampling)

```bash
python word2vec.py
# Output: embeddings/cbow.pt
```

Hyperparameters:

| Parameter | Value | Justification |
|---|---|---|
| CONTEXT_WINDOW | 5 | Matches SVD for a fair comparison |
| EMBEDDING_DIM | 100 | Matches SVD |
| NEG_SAMPLES | 10 | Mikolov et al. recommend 5–20 for small corpora |
| EPOCHS | 10 | Loss plateaus after ~8 epochs on Brown |
| BATCH_SIZE | 512 | Efficient mini-batching |
| LR | 0.001 | Adam default; stable convergence |
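A single CBOW + negative-sampling training step, sketched with toy word ids (the layout and names are assumptions; `word2vec.py` may organise this differently, and in practice negatives are drawn from the unigram^0.75 distribution rather than uniformly):

```python
import torch
import torch.nn.functional as F

VOCAB, DIM, NEG = 1000, 100, 10
in_emb = torch.nn.Embedding(VOCAB, DIM)   # context (input) embeddings
out_emb = torch.nn.Embedding(VOCAB, DIM)  # target (output) embeddings
opt = torch.optim.Adam(
    list(in_emb.parameters()) + list(out_emb.parameters()), lr=0.001
)

# Toy batch: B windows of 2*CONTEXT_WINDOW = 10 context ids + their targets
B = 4
context = torch.randint(0, VOCAB, (B, 10))
target = torch.randint(0, VOCAB, (B,))
negatives = torch.randint(0, VOCAB, (B, NEG))

h = in_emb(context).mean(dim=1)               # (B, DIM) averaged context
pos_score = (out_emb(target) * h).sum(dim=1)  # (B,) dot with true target
neg_score = torch.bmm(out_emb(negatives), h.unsqueeze(2)).squeeze(2)  # (B, NEG)

# Negative-sampling objective: push pos scores to 1, neg scores to 0
loss = (F.binary_cross_entropy_with_logits(pos_score, torch.ones(B))
        + F.binary_cross_entropy_with_logits(neg_score, torch.zeros(B, NEG)))
opt.zero_grad()
loss.backward()
opt.step()
```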

---
## Step 3 – POS Tagger + Analysis

```bash
# Run everything (analogy + bias + tagger training):
python pos_tagger.py --mode train_all

# Run only analogy tests:
python pos_tagger.py --mode analogy

# Run only bias check:
python pos_tagger.py --mode bias
```

Outputs: console logs + `confusion_<variant>.png` images.
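The analogy mode amounts to a nearest-neighbour search over `vec(b) - vec(a) + vec(c)`. A toy sketch (the vectors below are fabricated for illustration, not taken from the trained embeddings):

```python
import numpy as np

emb = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.9, 0.1, 0.8]),
    "man":   np.array([0.1, 0.9, 0.1]),
    "woman": np.array([0.1, 0.1, 0.9]),
    "apple": np.array([0.5, 0.5, 0.5]),  # distractor
}

def analogy(a, b, c):
    """Return the word whose vector is closest (cosine) to vec(b) - vec(a) + vec(c)."""
    target = emb[b] - emb[a] + emb[c]
    return max(
        (w for w in emb if w not in {a, b, c}),
        key=lambda w: emb[w] @ target
        / (np.linalg.norm(emb[w]) * np.linalg.norm(target)),
    )

print(analogy("man", "king", "woman"))  # queen
```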

---
## MLP Hyperparameters

| Parameter | Value |
|---|---|
| WINDOW | 2 (5-word input window) |
| HIDDEN | [512, 256] |
| DROPOUT | 0.3 |
| ACTIVATION | ReLU |
| EPOCHS | 20 |
| BATCH_SIZE | 512 |
| LR | 0.001 (Adam) |
| Freeze embeddings? | GloVe: frozen; SVD & CBOW: fine-tuned |

Freeze justification: GloVe was trained on a massive external corpus, so fine-tuning it on the small Brown corpus risks degrading its representations. The SVD and CBOW embeddings were trained on Brown itself, so joint fine-tuning is beneficial.
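A sketch of how the window MLP and the freeze switch fit together (class and variable names are hypothetical; `pos_tagger.py` may be organised differently):

```python
import torch
import torch.nn as nn

VOCAB, DIM, N_TAGS, WINDOW = 1000, 100, 12, 2  # 12 universal tags

class WindowMLPTagger(nn.Module):
    def __init__(self, freeze_embeddings: bool):
        super().__init__()
        self.emb = nn.Embedding(VOCAB, DIM)
        # GloVe variant: frozen; SVD/CBOW variants: fine-tuned jointly
        self.emb.weight.requires_grad = not freeze_embeddings
        in_dim = (2 * WINDOW + 1) * DIM  # 5 concatenated word vectors
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, 512), nn.ReLU(), nn.Dropout(0.3),
            nn.Linear(512, 256), nn.ReLU(), nn.Dropout(0.3),
            nn.Linear(256, N_TAGS),
        )

    def forward(self, window_ids):           # (B, 5) word ids
        x = self.emb(window_ids).flatten(1)  # (B, 5*DIM)
        return self.mlp(x)                   # (B, N_TAGS) tag logits

tagger = WindowMLPTagger(freeze_embeddings=True)  # GloVe variant
logits = tagger(torch.randint(0, VOCAB, (4, 5)))
print(logits.shape)  # torch.Size([4, 12])
```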

---

**Created By:** Abhay Sharma

**Roll No.:** 2025201014