# Assignment 2 – Introduction to NLP

## Environment

```bash
pip install torch numpy scipy scikit-learn matplotlib seaborn nltk gensim
python -c "import nltk; nltk.download('brown'); nltk.download('universal_tagset')"
```

Python ≥ 3.9 and PyTorch ≥ 2.0 recommended.

---
## File Structure

```
<2025201014>_A2.zip
├── embeddings/
│   ├── svd.pt
│   └── cbow.pt
├── svd_embeddings.py
├── word2vec.py
├── pos_tagger.py
├── README.md
└── report.pdf
```

> Pre-trained GloVe weights are downloaded automatically via `gensim.downloader`.
> Trained `.pt` files: <https://huggingface.co/Abhay01702/Assignment02>

---
## Step 1 – Train SVD Embeddings

```bash
python svd_embeddings.py
# Output: embeddings/svd.pt
```
Hyperparameters:

| Parameter | Value | Justification |
|---|---|---|
| CONTEXT_WINDOW | 5 | Standard NLP window; captures local syntactic context |
| EMBEDDING_DIM | 100 | Balances expressiveness and cost; matches Word2Vec |
| MIN_FREQ | 2 | Removes hapax legomena that add noise |
| WEIGHTING | PPMI + distance decay | Removes frequency bias; weights nearby context more heavily |
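The table above can be made concrete with a small sketch of the pipeline: a distance-decayed co-occurrence matrix, PPMI reweighting, then truncated SVD. The helper below is illustrative only (the function name and toy corpus are not from `svd_embeddings.py`):

```python
import numpy as np
from collections import Counter

def ppmi_svd_embeddings(corpus, window=5, dim=2, min_freq=1):
    """Sketch: distance-weighted co-occurrence -> PPMI -> truncated SVD."""
    # Vocabulary, dropping words rarer than min_freq.
    counts = Counter(w for sent in corpus for w in sent)
    vocab = sorted(w for w, c in counts.items() if c >= min_freq)
    idx = {w: i for i, w in enumerate(vocab)}

    # Distance-decayed co-occurrence: a context word at offset d gets weight 1/d.
    C = np.zeros((len(vocab), len(vocab)))
    for sent in corpus:
        for i, w in enumerate(sent):
            if w not in idx:
                continue
            for d in range(1, window + 1):
                for j in (i - d, i + d):
                    if 0 <= j < len(sent) and sent[j] in idx:
                        C[idx[w], idx[sent[j]]] += 1.0 / d

    # PPMI: max(0, log P(w,c) / (P(w) P(c))) removes frequency bias.
    total = C.sum()
    pw = C.sum(axis=1, keepdims=True) / total
    pc = C.sum(axis=0, keepdims=True) / total
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log((C / total) / (pw * pc))
    ppmi = np.where(np.isfinite(pmi) & (pmi > 0), pmi, 0.0)

    # Truncated SVD: keep the top `dim` singular directions as embeddings.
    U, S, _ = np.linalg.svd(ppmi)
    return vocab, U[:, :dim] * S[:dim]

corpus = [["the", "cat", "sat"], ["the", "dog", "sat"]]
vocab, emb = ppmi_svd_embeddings(corpus, window=2, dim=2)
print(emb.shape)  # (4, 2)
```

The real script uses `CONTEXT_WINDOW=5`, `EMBEDDING_DIM=100`, and `MIN_FREQ=2` per the table; the tiny values here just keep the example readable.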

---

## Step 2 – Train Word2Vec (CBOW + Negative Sampling)

```bash
python word2vec.py
# Output: embeddings/cbow.pt
```

Hyperparameters:

| Parameter | Value | Justification |
|---|---|---|
| CONTEXT_WINDOW | 5 | Matches SVD for a fair comparison |
| EMBEDDING_DIM | 100 | Matches SVD |
| NEG_SAMPLES | 10 | Mikolov et al. recommend 5–20 for small corpora |
| EPOCHS | 10 | Loss plateaus after ~8 epochs on Brown |
| BATCH_SIZE | 512 | Efficient mini-batch size |
| LR | 0.001 | Adam default; stable convergence |
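For reference, the CBOW negative-sampling objective averages the context embeddings into one hidden vector, rewards a high score against the true centre word, and penalises high scores against `NEG_SAMPLES` random negatives. A minimal numpy sketch (the function and toy tensors are illustrative, not code from `word2vec.py`):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cbow_neg_sampling_loss(ctx_vecs, target_vec, neg_vecs):
    """ctx_vecs: (C, D) context embeddings; target_vec: (D,) true centre word;
    neg_vecs: (K, D) sampled negatives. Returns the scalar loss."""
    h = ctx_vecs.mean(axis=0)                      # CBOW hidden state
    pos = np.log(sigmoid(target_vec @ h))          # pull the true word closer
    neg = np.log(sigmoid(-(neg_vecs @ h))).sum()   # push negatives away
    return -(pos + neg)

rng = np.random.default_rng(0)
loss = cbow_neg_sampling_loss(rng.normal(size=(4, 100)),   # window of 4 context words
                              rng.normal(size=100),
                              rng.normal(size=(10, 100)))  # NEG_SAMPLES = 10
print(float(loss) > 0)  # True: each term is a negative log-probability
```

In the actual training script this loss would be minimised with Adam (`LR=0.001`) over mini-batches of 512 examples.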

---

## Step 3 – POS Tagger + Analysis

```bash
# Run everything (analogy + bias + tagger training):
python pos_tagger.py --mode train_all

# Run only the analogy tests:
python pos_tagger.py --mode analogy

# Run only the bias check:
python pos_tagger.py --mode bias
```

Outputs: console logs plus `confusion_<variant>.png` images.
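The analogy mode follows the usual 3CosAdd recipe: answer "a is to b as c is to ?" with the word nearest to b − a + c by cosine similarity, excluding the three query words. A toy sketch (the `analogy` helper and hand-made vectors below are illustrative, not the tagger's code):

```python
import numpy as np

def analogy(emb, a, b, c):
    """Return the word closest to emb[b] - emb[a] + emb[c] by cosine similarity,
    skipping the query words themselves. `emb` is a {word: vector} dict."""
    q = emb[b] - emb[a] + emb[c]
    q = q / np.linalg.norm(q)
    best, best_sim = None, -np.inf
    for w, v in emb.items():
        if w in (a, b, c):
            continue
        sim = (v / np.linalg.norm(v)) @ q
        if sim > best_sim:
            best, best_sim = w, sim
    return best

# Toy 2-d vectors arranged so that king - man + woman lands on queen.
emb = {
    "man":   np.array([1.0, 0.0]),
    "woman": np.array([0.0, 1.0]),
    "king":  np.array([1.0, 1.0]),
    "queen": np.array([0.0, 2.0]),
}
print(analogy(emb, "man", "king", "woman"))  # queen
```

The bias mode can reuse the same cosine machinery, comparing similarities of occupation words to gendered anchor words.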

---

## MLP Hyperparameters
| Parameter | Value |
|---|---|
| WINDOW | 2 (5-word input window) |
| HIDDEN | [512, 256] |
| DROPOUT | 0.3 |
| ACTIVATION | ReLU |
| EPOCHS | 20 |
| BATCH_SIZE | 512 |
| LR | 0.001 (Adam) |
| Freeze embeddings? | GloVe: frozen; SVD & CBOW: fine-tuned |

Freeze justification: GloVe was trained on a massive external corpus, so fine-tuning on
the small Brown corpus risks degrading its representations. The SVD and CBOW embeddings
were trained on Brown itself, so joint fine-tuning is beneficial.
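The WINDOW=2 row means each token is classified from the concatenated embeddings of itself and its two neighbours on each side, with zero-padding at sentence boundaries. A minimal sketch of that feature construction (the function and variable names are illustrative, not from `pos_tagger.py`):

```python
import numpy as np

def window_features(emb_matrix, sent_ids, pos, window=2):
    """Concatenate the embeddings of the 2*window+1 tokens around `pos`.
    With window=2 and 100-dim embeddings this gives a 500-dim MLP input."""
    dim = emb_matrix.shape[1]
    feats = []
    for offset in range(-window, window + 1):
        j = pos + offset
        if 0 <= j < len(sent_ids):
            feats.append(emb_matrix[sent_ids[j]])
        else:
            feats.append(np.zeros(dim))  # zero-pad outside the sentence
    return np.concatenate(feats)

emb = np.random.default_rng(0).normal(size=(50, 100))  # toy 50-word embedding table
sent = [3, 7, 12]                                      # token ids of a 3-word sentence
x = window_features(emb, sent, pos=0)
print(x.shape)  # (500,)
```

For the first token (`pos=0`) the two left-context slots are zero vectors, so the first 200 dimensions of `x` are zero; the resulting 500-dim vectors feed the [512, 256] MLP described above.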

**Created By:** Abhay Sharma
**Roll No.:** 2025201014