# Vedika - Sanskrit NLP Toolkit

Vedika is a comprehensive toolkit for Sanskrit text processing, offering deep-learning-based tools for sandhi splitting and joining, text normalization, sentence splitting, syllabification, and tokenization.

## Features

- **Sandhi Processing**
  - Split compound Sanskrit words using attention-based neural networks
  - Join Sanskrit words with proper sandhi rules
  - Beam-search support for multiple candidate splits
- **Text Processing**
  - Syllabification
  - Tokenization
  - Sentence splitting
  - Text normalization
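Several of the text-processing features are rule-driven at their core. As a concept sketch (standalone, not Vedika's actual implementation), sentence splitting for Devanagari text typically breaks on the danda (।) and double danda (॥) punctuation marks:

```python
import re

def split_sentences(text):
    """Split Devanagari text on danda (। U+0964) and double danda (॥ U+0965)."""
    parts = re.split(r"[\u0964\u0965]+", text)
    return [p.strip() for p in parts if p.strip()]

print(split_sentences("रामः वनं गच्छति। सीता अपि गच्छति॥"))
# ['रामः वनं गच्छति', 'सीता अपि गच्छति']
```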
## Installation

```bash
# Install from PyPI (coming soon)
pip install vedika

# Install from source
git clone https://github.com/tanuj437/vedika.git
cd vedika
pip install -e .
```
## Requirements

- Python >= 3.8
- PyTorch >= 1.9.0
- NumPy >= 1.19.0
- Pandas >= 1.3.0
- tqdm >= 4.62.0
- regex >= 2021.8.3
## Quick Start

### Sandhi Splitting

```python
from vedika import SanskritSplit

# Initialize splitter
splitter = SanskritSplit()

# Split a single word
result = splitter.split("रामायणम्")
print(result['split'])  # Output: राम + अयन + अम्

# Batch processing
words = ["रामायणम्", "गीतागोविन्दम्"]
results = splitter.split_batch(words)
for result in results:
    print(f"{result['input']} → {result['split']}")
```
### Sandhi Joining

```python
from vedika import SandhiJoiner

# Initialize joiner
joiner = SandhiJoiner()

# Join split words
result = joiner.join("राम+अस्ति")
print(result)  # Output: रामास्ति

# Batch processing
texts = ["राम+अस्ति", "गच्छ+अमि"]
results = joiner.join_batch(texts)
print(results)  # ['रामास्ति', 'गच्छामि']
```
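For intuition about the other text tools, orthographic syllabification (akshara segmentation) of Devanagari can be sketched with a single regular expression. The snippet below is an illustrative standalone sketch, not the code in Vedika's `syllabification.py`:

```python
import re

# One akshara: optional consonant+virama conjunct cluster, then a consonant
# with an optional vowel sign (matra) or an independent vowel, followed by
# optional candrabindu/anusvara/visarga and an optional word-final virama.
AKSHARA = re.compile(
    r"(?:[\u0915-\u0939\u0958-\u095F]\u094D)*"          # conjunct consonants
    r"(?:[\u0915-\u0939\u0958-\u095F][\u093E-\u094C]?"  # consonant (+ matra)
    r"|[\u0904-\u0914])"                                # or independent vowel
    r"[\u0901-\u0903]*"                                 # nasalization/visarga
    r"\u094D?"                                          # word-final virama
)

def syllabify(word):
    return AKSHARA.findall(word)

print(syllabify("रामायणम्"))  # ['रा', 'मा', 'य', 'ण', 'म्']
print(syllabify("गच्छामि"))   # ['ग', 'च्छा', 'मि']
```

Note how the conjunct च्छ stays in one akshara because the virama binds the two consonants together.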
## Advanced Usage

### Beam Search for Multiple Suggestions

```python
# Get multiple suggestions with beam search
result = splitter.split("रामायणम्", beam_size=3)
print(f"Best split: {result['split']}")
print(f"Confidence: {result['confidence']}")
print("Alternatives:")
for alt in result['alternatives']:
    print(f"- {alt['split']} (confidence: {alt['confidence']})")
```
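Under the hood, beam search keeps only the `beam_size` highest-scoring partial hypotheses at each decoding step instead of committing to a single greedy choice. A minimal generic sketch (hypothetical, independent of Vedika's actual decoder):

```python
import heapq
import math

def beam_search(step_fn, start, beam_size=3, steps=3):
    """Generic beam search.

    step_fn(seq) returns a list of (token, log_prob) continuations;
    returns the beam_size best (score, sequence) pairs after `steps` steps.
    """
    beams = [(0.0, [start])]
    for _ in range(steps):
        candidates = []
        for score, seq in beams:
            for token, logp in step_fn(seq):
                candidates.append((score + logp, seq + [token]))
        # Keep only the beam_size highest-scoring hypotheses
        beams = heapq.nlargest(beam_size, candidates, key=lambda b: b[0])
    return beams

# Toy model: token 'a' is always more likely than 'b'
toy = lambda seq: [("a", math.log(0.7)), ("b", math.log(0.3))]
best_score, best_seq = beam_search(toy, "<s>", beam_size=2)[0]
print(best_seq)  # ['<s>', 'a', 'a', 'a']
```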
### Model Information

```python
# Get model details
info = splitter.get_model_info()
print(f"Vocabulary size: {info['vocabulary_size']}")
print(f"Device: {info['device']}")
print(f"Configuration: {info['model_config']}")
```
## Project Structure

```
vedika/
├── __init__.py
├── normalizer.py
├── sandhi_join.py
├── sandhi_split.py
├── sentence_splitter.py
├── syllabification.py
├── tokenizer.py
└── data/
    ├── cleaned_metres.json
    ├── sandhi_joiner.pth
    └── sandhi_split.pth
```
## Model Architecture

The sandhi processing models use:

- Bidirectional LSTM encoder
- GRU decoder with attention
- Multi-head attention mechanism
- Character-level processing
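The components above can be wired together roughly as follows. This is an illustrative PyTorch sketch of the described architecture; the layer sizes and the `query_proj` bridge between embedding and encoder dimensions are assumptions, not Vedika's actual model code:

```python
import torch
import torch.nn as nn

class SandhiSeq2Seq(nn.Module):
    """Character-level seq2seq: BiLSTM encoder, GRU decoder, multi-head attention."""

    def __init__(self, vocab_size=128, emb_dim=64, hidden=128, heads=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        # Bidirectional LSTM encoder over character embeddings
        self.encoder = nn.LSTM(emb_dim, hidden, batch_first=True, bidirectional=True)
        # Project decoder inputs into the encoder's 2*hidden space for attention
        self.query_proj = nn.Linear(emb_dim, 2 * hidden)
        # Multi-head attention over encoder states
        self.attn = nn.MultiheadAttention(2 * hidden, heads, batch_first=True)
        # GRU decoder consumes [char embedding; attended context]
        self.decoder = nn.GRU(emb_dim + 2 * hidden, 2 * hidden, batch_first=True)
        self.out = nn.Linear(2 * hidden, vocab_size)

    def forward(self, src, tgt):
        enc_out, _ = self.encoder(self.embed(src))      # (B, S, 2*hidden)
        tgt_emb = self.embed(tgt)                       # (B, T, emb_dim)
        ctx, _ = self.attn(self.query_proj(tgt_emb), enc_out, enc_out)
        dec_out, _ = self.decoder(torch.cat([tgt_emb, ctx], dim=-1))
        return self.out(dec_out)                        # (B, T, vocab_size)

model = SandhiSeq2Seq()
logits = model(torch.randint(0, 128, (2, 7)), torch.randint(0, 128, (2, 5)))
print(logits.shape)  # torch.Size([2, 5, 128])
```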
## Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

## License

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.

## Authors

- Tanuj Saxena
- Soumya Sharma
## Citation

If you use Vedika in your research, please cite:

```bibtex
@software{vedika2025,
  title={Vedika: A Sanskrit Text Processing Toolkit},
  author={Saxena, Tanuj and Sharma, Soumya},
  year={2025},
  url={https://github.com/tanuj437/vedika}
}
```

## Contact

- Email: tanuj.saxena.rks@gmail.com, soumyasharma1599@gmail.com