# Vedika - Sanskrit NLP Toolkit

Vedika is a toolkit for Sanskrit text processing. It offers deep learning-based sandhi splitting and joining, along with text normalization, sentence splitting, syllabification, and tokenization.

## Features

- **Sandhi Processing**
  - Split compound Sanskrit words using attention-based neural networks
  - Join Sanskrit words with proper sandhi rules
  - Support for beam search to get multiple suggestions
- **Text Processing**
  - Syllabification
  - Tokenization
  - Sentence splitting
  - Text normalization

## Installation

```bash
# Install from PyPI (coming soon)
pip install vedika

# Install from source
git clone https://github.com/tanuj437/vedika.git
cd vedika
pip install -e .
```

## Requirements

- Python >= 3.8
- PyTorch >= 1.9.0
- NumPy >= 1.19.0
- Pandas >= 1.3.0
- tqdm >= 4.62.0
- regex >= 2021.8.3

## Quick Start

### Sandhi Splitting

```python
from vedika import SanskritSplit

# Initialize splitter
splitter = SanskritSplit()

# Split a single word
result = splitter.split("रामायणम्")
print(result['split'])  # Output: राम + अयन + अम्

# Batch processing
words = ["रामायणम्", "गीतागोविन्दम्"]
results = splitter.split_batch(words)
for result in results:
    print(f"{result['input']} → {result['split']}")
```

### Sandhi Joining

```python
from vedika import SandhiJoiner

# Initialize joiner
joiner = SandhiJoiner()

# Join split words
result = joiner.join("राम+अस्ति")
print(result)  # Output: रामास्ति

# Batch processing
texts = ["राम+अस्ति", "गच्छ+अमि"]
results = joiner.join_batch(texts)
print(results)  # ['रामास्ति', 'गच्छामि']
```
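
To make the `+` notation above concrete, here is a minimal rule-based sketch of a single sandhi rule (a + a/ā → ā) that reproduces the two joins shown. It is illustrative only: the function names are hypothetical, it is not how Vedika's neural joiner works, and every other sandhi rule is ignored.

```python
# Illustrative only: one hand-written rule, not Vedika's neural joiner.
A_VOWELS = {"\u0905", "\u0906"}  # independent vowels अ (a) and आ (ā)
AA_MATRA = "\u093E"              # dependent vowel sign ā

def is_consonant(ch: str) -> bool:
    """True for a bare Devanagari consonant letter (क..ह), which carries an inherent 'a'."""
    return "\u0915" <= ch <= "\u0939"

def join_a_plus_a(left: str, right: str) -> str:
    """Merge final inherent 'a' with initial a/ā into ā; otherwise just concatenate."""
    if left and right and is_consonant(left[-1]) and right[0] in A_VOWELS:
        return left + AA_MATRA + right[1:]
    return left + right  # no other rule is modelled in this sketch

def join_plus_separated(text: str) -> str:
    """Join a '+'-separated string such as 'राम+अस्ति' from left to right."""
    parts = text.split("+")
    result = parts[0]
    for part in parts[1:]:
        result = join_a_plus_a(result, part)
    return result

print(join_plus_separated("राम+अस्ति"))  # रामास्ति
print(join_plus_separated("गच्छ+अमि"))   # गच्छामि
```

A learned model is useful precisely because real text needs many interacting rules and exceptions that a lookup like this does not cover.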

## Advanced Usage

### Beam Search for Multiple Suggestions

```python
# Get multiple suggestions with beam search (reuses `splitter` from Quick Start)
result = splitter.split("रामायणम्", beam_size=3)
print(f"Best split: {result['split']}")
print(f"Confidence: {result['confidence']}")
print("Alternatives:")
for alt in result['alternatives']:
    print(f"- {alt['split']} (confidence: {alt['confidence']})")
```
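
For readers unfamiliar with beam search, the following is a generic sketch of the idea behind `beam_size`: keep the best few partial outputs at each step instead of only one. The probability table and function names are hypothetical toy values, and Vedika's actual decoder may differ in its details.

```python
import math
from typing import Callable, Dict, List, Tuple

Hypothesis = Tuple[List[str], float]  # (token sequence, log-probability)

def beam_search(step_fn: Callable[[List[str]], Dict[str, float]],
                start: str, end: str,
                beam_size: int = 3, max_len: int = 10) -> List[Hypothesis]:
    """Return up to beam_size finished sequences, best first."""
    beams: List[Hypothesis] = [([start], 0.0)]
    finished: List[Hypothesis] = []
    for _ in range(max_len):
        # Expand every live hypothesis by every possible next token.
        candidates = [(seq + [tok], score + math.log(p))
                      for seq, score in beams
                      for tok, p in step_fn(seq).items()]
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        # Keep only the top beam_size; completed ones leave the beam.
        for seq, score in candidates[:beam_size]:
            (finished if seq[-1] == end else beams).append((seq, score))
        if not beams:
            break
    finished.sort(key=lambda c: c[1], reverse=True)
    return finished[:beam_size]

# Hypothetical next-token probabilities standing in for a trained decoder.
TABLE = {
    "<s>": {"a": 0.6, "b": 0.4},
    "a": {"</s>": 0.7, "b": 0.3},
    "b": {"</s>": 0.9, "a": 0.1},
}

results = beam_search(lambda seq: TABLE[seq[-1]], "<s>", "</s>", beam_size=2)
for seq, score in results:
    print(" ".join(seq), round(math.exp(score), 2))
```

The first result plays the role of the best split, and the remaining ones correspond to the alternatives, each with its own score.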

### Model Information

```python
# Get model details (reuses `splitter` from Quick Start)
info = splitter.get_model_info()
print(f"Vocabulary size: {info['vocabulary_size']}")
print(f"Device: {info['device']}")
print(f"Configuration: {info['model_config']}")
```

## Project Structure

```
vedika/
├── __init__.py
├── normalizer.py
├── sandhi_join.py
├── sandhi_split.py
├── sentence_splitter.py
├── syllabification.py
├── tokenizer.py
└── data/
    ├── cleaned_metres.json
    ├── sandhi_joiner.pth
    └── sandhi_split.pth
```

## Model Architecture

The sandhi processing models use:
- Bidirectional LSTM encoder
- GRU decoder with attention
- Multi-head attention mechanism
- Character-level processing
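
As a rough illustration of the character-level processing step, the sketch below maps each Unicode code point of a word to an integer id before it would be fed to an encoder. The special tokens, function names, and vocabulary layout are assumptions made for this example, not Vedika's actual vocabulary.

```python
from typing import Dict, List

# Hypothetical special tokens; the real model may use different ones.
PAD, SOS, EOS, UNK = "<pad>", "<sos>", "<eos>", "<unk>"

def build_vocab(words: List[str]) -> Dict[str, int]:
    """Assign an id to each special token and each code point seen in words."""
    vocab = {tok: i for i, tok in enumerate([PAD, SOS, EOS, UNK])}
    for ch in sorted(set("".join(words))):
        vocab[ch] = len(vocab)
    return vocab

def encode(word: str, vocab: Dict[str, int]) -> List[int]:
    """Wrap the word in <sos>/<eos> and map unseen characters to <unk>."""
    body = [vocab.get(ch, vocab[UNK]) for ch in word]
    return [vocab[SOS]] + body + [vocab[EOS]]

def decode(ids: List[int], vocab: Dict[str, int]) -> str:
    """Invert encode(), dropping padding and boundary tokens."""
    inverse = {i: tok for tok, i in vocab.items()}
    skip = {vocab[PAD], vocab[SOS], vocab[EOS]}
    return "".join(inverse[i] for i in ids if i not in skip)

vocab = build_vocab(["रामायणम्", "राम+अस्ति"])
ids = encode("राम", vocab)  # three code points plus <sos> and <eos>
print(ids)
print(decode(ids, vocab))
```

Working on code points rather than whole words keeps the vocabulary small and lets a model see vowel signs and the virama individually, which is where sandhi changes occur.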

## Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

## License

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.

## Authors

- Tanuj Saxena
- Soumya Sharma

## Citation

If you use Vedika in your research, please cite:

```bibtex
@software{vedika2025,
  title={Vedika: A Sanskrit Text Processing Toolkit},
  author={Saxena, Tanuj and Sharma, Soumya},
  year={2025},
  url={https://github.com/tanuj437/vedika}
}
```

## Contact

- Email: tanuj.saxena.rks@gmail.com, soumyasharma1599@gmail.com