# Vedika - Sanskrit NLP Toolkit

Vedika is a comprehensive toolkit for Sanskrit text processing, offering deep-learning-based tools for sandhi splitting and joining, text normalization, sentence splitting, syllabification, and tokenization.

## Features

- **Sandhi Processing**
  - Split compound Sanskrit words using attention-based neural networks
  - Join Sanskrit words with proper sandhi rules
  - Beam-search support for multiple candidate splits
- **Text Processing**
  - Syllabification
  - Tokenization
  - Sentence splitting
  - Text normalization
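Several of the text-processing features are rule-driven at their core. As a concept sketch (standalone, not Vedika's actual implementation), sentence splitting for Devanagari text typically breaks on the danda (।) and double danda (॥) punctuation marks:

```python
import re

def split_sentences(text):
    """Split Devanagari text on danda (। U+0964) and double danda (॥ U+0965)."""
    parts = re.split(r"[\u0964\u0965]+", text)
    return [p.strip() for p in parts if p.strip()]

print(split_sentences("रामः वनं गच्छति। सीता अपि गच्छति॥"))
# ['रामः वनं गच्छति', 'सीता अपि गच्छति']
```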
## Installation

```bash
# Install from PyPI (coming soon)
pip install vedika

# Install from source
git clone https://github.com/tanuj437/vedika.git
cd vedika
pip install -e .
```
## Requirements

- Python >= 3.8
- PyTorch >= 1.9.0
- NumPy >= 1.19.0
- Pandas >= 1.3.0
- tqdm >= 4.62.0
- regex >= 2021.8.3
## Quick Start

### Sandhi Splitting

```python
from vedika import SanskritSplit

# Initialize splitter
splitter = SanskritSplit()

# Split a single word
result = splitter.split("रामायणम्")
print(result['split'])  # Output: राम + अयन + अम्

# Batch processing
words = ["रामायणम्", "गीतागोविन्दम्"]
results = splitter.split_batch(words)
for result in results:
    print(f"{result['input']} → {result['split']}")
```
### Sandhi Joining

```python
from vedika import SandhiJoiner

# Initialize joiner
joiner = SandhiJoiner()

# Join split words
result = joiner.join("राम+अस्ति")
print(result)  # Output: रामास्ति

# Batch processing
texts = ["राम+अस्ति", "गच्छ+अमि"]
results = joiner.join_batch(texts)
print(results)  # ['रामास्ति', 'गच्छामि']
```
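For intuition about the other text tools, orthographic syllabification (akshara segmentation) of Devanagari can be sketched with a single regular expression. The snippet below is an illustrative standalone sketch, not the code in Vedika's `syllabification.py`:

```python
import re

# One akshara: optional consonant+virama conjunct cluster, then a consonant
# with an optional vowel sign (matra) or an independent vowel, followed by
# optional candrabindu/anusvara/visarga and an optional word-final virama.
AKSHARA = re.compile(
    r"(?:[\u0915-\u0939\u0958-\u095F]\u094D)*"          # conjunct consonants
    r"(?:[\u0915-\u0939\u0958-\u095F][\u093E-\u094C]?"  # consonant (+ matra)
    r"|[\u0904-\u0914])"                                # or independent vowel
    r"[\u0901-\u0903]*"                                 # nasalization/visarga
    r"\u094D?"                                          # word-final virama
)

def syllabify(word):
    return AKSHARA.findall(word)

print(syllabify("रामायणम्"))  # ['रा', 'मा', 'य', 'ण', 'म्']
print(syllabify("गच्छामि"))   # ['ग', 'च्छा', 'मि']
```

Note how the conjunct च्छ stays in one akshara because the virama binds the two consonants together.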
## Advanced Usage

### Beam Search for Multiple Suggestions

```python
# Get multiple suggestions with beam search
result = splitter.split("रामायणम्", beam_size=3)
print(f"Best split: {result['split']}")
print(f"Confidence: {result['confidence']}")
print("Alternatives:")
for alt in result['alternatives']:
    print(f"- {alt['split']} (confidence: {alt['confidence']})")
```
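Under the hood, beam search keeps only the `beam_size` highest-scoring partial hypotheses at each decoding step instead of committing to a single greedy choice. A minimal generic sketch (hypothetical, independent of Vedika's actual decoder):

```python
import heapq
import math

def beam_search(step_fn, start, beam_size=3, steps=3):
    """Generic beam search.

    step_fn(seq) returns a list of (token, log_prob) continuations;
    returns the beam_size best (score, sequence) pairs after `steps` steps.
    """
    beams = [(0.0, [start])]
    for _ in range(steps):
        candidates = []
        for score, seq in beams:
            for token, logp in step_fn(seq):
                candidates.append((score + logp, seq + [token]))
        # Keep only the beam_size highest-scoring hypotheses
        beams = heapq.nlargest(beam_size, candidates, key=lambda b: b[0])
    return beams

# Toy model: token 'a' is always more likely than 'b'
toy = lambda seq: [("a", math.log(0.7)), ("b", math.log(0.3))]
best_score, best_seq = beam_search(toy, "<s>", beam_size=2)[0]
print(best_seq)  # ['<s>', 'a', 'a', 'a']
```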
### Model Information

```python
# Get model details
info = splitter.get_model_info()
print(f"Vocabulary size: {info['vocabulary_size']}")
print(f"Device: {info['device']}")
print(f"Configuration: {info['model_config']}")
```
## Project Structure

```
vedika/
├── __init__.py
├── normalizer.py
├── sandhi_join.py
├── sandhi_split.py
├── sentence_splitter.py
├── syllabification.py
├── tokenizer.py
└── data/
    ├── cleaned_metres.json
    ├── sandhi_joiner.pth
    └── sandhi_split.pth
```
## Model Architecture

The sandhi processing models use:

- Bidirectional LSTM encoder
- GRU decoder with attention
- Multi-head attention mechanism
- Character-level processing
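The components above can be wired together roughly as follows. This is an illustrative PyTorch sketch of the described architecture; the layer sizes and the `query_proj` bridge between embedding and encoder dimensions are assumptions, not Vedika's actual model code:

```python
import torch
import torch.nn as nn

class SandhiSeq2Seq(nn.Module):
    """Character-level seq2seq: BiLSTM encoder, GRU decoder, multi-head attention."""

    def __init__(self, vocab_size=128, emb_dim=64, hidden=128, heads=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        # Bidirectional LSTM encoder over character embeddings
        self.encoder = nn.LSTM(emb_dim, hidden, batch_first=True, bidirectional=True)
        # Project decoder inputs into the encoder's 2*hidden space for attention
        self.query_proj = nn.Linear(emb_dim, 2 * hidden)
        # Multi-head attention over encoder states
        self.attn = nn.MultiheadAttention(2 * hidden, heads, batch_first=True)
        # GRU decoder consumes [char embedding; attended context]
        self.decoder = nn.GRU(emb_dim + 2 * hidden, 2 * hidden, batch_first=True)
        self.out = nn.Linear(2 * hidden, vocab_size)

    def forward(self, src, tgt):
        enc_out, _ = self.encoder(self.embed(src))      # (B, S, 2*hidden)
        tgt_emb = self.embed(tgt)                       # (B, T, emb_dim)
        ctx, _ = self.attn(self.query_proj(tgt_emb), enc_out, enc_out)
        dec_out, _ = self.decoder(torch.cat([tgt_emb, ctx], dim=-1))
        return self.out(dec_out)                        # (B, T, vocab_size)

model = SandhiSeq2Seq()
logits = model(torch.randint(0, 128, (2, 7)), torch.randint(0, 128, (2, 5)))
print(logits.shape)  # torch.Size([2, 5, 128])
```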
## Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

## License

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.

## Authors

- Tanuj Saxena
- Soumya Sharma
## Citation

If you use Vedika in your research, please cite:

```bibtex
@software{vedika2025,
  title={Vedika: A Sanskrit Text Processing Toolkit},
  author={Saxena, Tanuj and Sharma, Soumya},
  year={2025},
  url={https://github.com/tanuj437/vedika}
}
```

## Contact

- Email: tanuj.saxena.rks@gmail.com, soumyasharma1599@gmail.com