# Word Segmentation in Sanskrit Using Energy Based Models

This is a reconstruction of the paper [Free as in Free Word Order: An Energy Based Model for Word Segmentation and Morphological Tagging in Sanskrit](https://aclanthology.org/D18-1276/).

The original repository is available [here](https://zenodo.org/records/1035413#.W35s8hjhUUs).

## Folder Structure
```
├── dir/
├── wordsegmentation/
   ├── skt_dcs_DS.bz2_4K_bigram_mir_10K/
   ├── skt_dcs_DS.bz2_4K_bigram_mir_heldout/
```

## Prerequisites
* Python 3
  * scipy
  * numpy
  * csv (standard library)
  * pickle (standard library)
  * multiprocessing (standard library)
  * bz2 (standard library)
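
As a quick sanity check before training, the prerequisites above can be verified with a short script. This is a minimal sketch and not part of the repository: the helper name `missing_modules` and the two module lists are our own, and the script only checks that each module can be located, not that its version is compatible.

```python
import importlib.util

# scipy and numpy are third-party packages; the others ship with Python 3.
THIRD_PARTY = ["scipy", "numpy"]
STDLIB = ["csv", "pickle", "multiprocessing", "bz2"]


def missing_modules(names):
    """Return the module names that cannot be located in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(THIRD_PARTY + STDLIB)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All prerequisites found.")
```

If scipy or numpy is reported missing, it can usually be installed with `pip install scipy numpy`.
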
## Instructions for Training
1. Change your current directory to `dir`.

2. Run `Train_clique.py` with the following command:

```bash
python Train_clique.py
```

**TRAINING OUTPUTS** are already stored in `dir/outputs/train_t3896665073989`.  

**NOTE**: To train on a different input feature (BM2, BM3, BR2, BR3, PM2, PM3, PR2 or PR3), set the `bz2_input_folder` value in the main function to the corresponding path before starting the training.



| Feature Code | `bz2_input_folder` Path                                          |
|--------------|------------------------------------------------------------------|
| BM2          | wordsegmentation/skt_dcs_DS.bz2_4K_bigram_mir_10K/               |
| BM3          | wordsegmentation/skt_dcs_DS.bz2_1L_bigram_mir_10K/               |
| BR2          | wordsegmentation/skt_dcs_DS.bz2_4K_bigram_rfe_10K/               |
| BR3          | wordsegmentation/skt_dcs_DS.bz2_1L_bigram_rfe_10K/               |
| PM2          | wordsegmentation/skt_dcs_DS.bz2_4K_pmi_mir_10K/                  |
| PM3          | wordsegmentation/skt_dcs_DS.bz2_1L_pmi_mir_10K2/                 |
| PR2          | wordsegmentation/skt_dcs_DS.bz2_4K_pmi_rfe_10K/                  |
| PR3          | wordsegmentation/skt_dcs_DS.bz2_1L_pmi_rfe_10K/                  |
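
To avoid retyping paths when switching features, the table can be expressed as a lookup. This is only a sketch under our own assumptions: `FEATURE_FOLDERS` and `input_folder_for` are hypothetical names that do not exist in `Train_clique.py`, and the paths are copied verbatim from the table. The returned string is what you would assign to `bz2_input_folder` in the main function.

```python
# Hypothetical lookup; paths are copied verbatim from the table in this README.
FEATURE_FOLDERS = {
    "BM2": "wordsegmentation/skt_dcs_DS.bz2_4K_bigram_mir_10K/",
    "BM3": "wordsegmentation/skt_dcs_DS.bz2_1L_bigram_mir_10K/",
    "BR2": "wordsegmentation/skt_dcs_DS.bz2_4K_bigram_rfe_10K/",
    "BR3": "wordsegmentation/skt_dcs_DS.bz2_1L_bigram_rfe_10K/",
    "PM2": "wordsegmentation/skt_dcs_DS.bz2_4K_pmi_mir_10K/",
    "PM3": "wordsegmentation/skt_dcs_DS.bz2_1L_pmi_mir_10K2/",
    "PR2": "wordsegmentation/skt_dcs_DS.bz2_4K_pmi_rfe_10K/",
    "PR3": "wordsegmentation/skt_dcs_DS.bz2_1L_pmi_rfe_10K/",
}


def input_folder_for(tag):
    """Return the input folder for a feature code such as 'BM2'."""
    try:
        return FEATURE_FOLDERS[tag.upper()]
    except KeyError:
        known = ", ".join(sorted(FEATURE_FOLDERS))
        raise ValueError(f"Unknown feature code {tag!r}; expected one of: {known}") from None


# Value to assign to bz2_input_folder when training on BM2.
bz2_input_folder = input_folder_for("BM2")
```
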





## Instructions for Testing



After training, update the `modelList` dictionary in `test_clique.py` with the name of the neural network that was saved during training. When testing a feature, provide the name of the network that was trained on that same feature.



We only provide the trained model for BM2, which was our best-performing feature. If the name of the neural network is left unchanged, testing is performed on the pre-trained BM2 model provided in `outputs/train_t7978754709018`.
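
To find the name of a saved network, it can help to list what training wrote to disk. The sketch below is not part of the repository. It assumes that each training run creates a folder named `train_t<number>` under `outputs/`, as the paths in this README suggest, and the helper name `list_training_runs` is our own. How the saved name is then entered into `modelList` depends on `test_clique.py` itself.

```python
from pathlib import Path


def list_training_runs(outputs_dir="outputs"):
    """Return run folders named train_t*, newest first by modification time."""
    root = Path(outputs_dir)
    if not root.is_dir():
        return []
    runs = [path for path in root.glob("train_t*") if path.is_dir()]
    return sorted(runs, key=lambda path: path.stat().st_mtime, reverse=True)


if __name__ == "__main__":
    # Print each run folder together with the files saved inside it.
    for run in list_training_runs():
        print(run.name, sorted(item.name for item in run.iterdir()))
```
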



To test with a particular feature vector, pass the tag of the feature when running the script:

```bash
python test_clique.py -t <tag>
```

For example:

```bash
python test_clique.py -t BM2
```

After testing has finished, run the following command to see the precision and recall values for both the word and word++ prediction tasks:

```bash
python evaluate.py <tag>
```

For example:

```bash
python evaluate.py BM2
```
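
For readers unfamiliar with the reported metrics, the sketch below shows how precision, recall and F-score are commonly computed for one sentence by comparing predicted items against the gold items. It is illustrative only and is not the logic of `evaluate.py`: the function name `precision_recall_f1`, the multiset comparison, and the placeholder tokens are our own assumptions.

```python
from collections import Counter


def precision_recall_f1(predicted, gold):
    """Compare two item lists as multisets and return (precision, recall, F1)."""
    # Items count as correct only as often as they appear in both lists.
    overlap = sum((Counter(predicted) & Counter(gold)).values())
    precision = overlap / len(predicted) if predicted else 0.0
    recall = overlap / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


if __name__ == "__main__":
    # Placeholder tokens: one spurious prediction ("w4") lowers precision only.
    p, r, f = precision_recall_f1(["w1", "w2", "w3", "w4"], ["w1", "w2", "w3"])
    print(f"precision={p:.2f} recall={r:.2f} f1={f:.4f}")
```

Here three of the four predicted items are correct, so precision is 0.75, while all three gold items are recovered, so recall is 1.0.
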

## Citing
```bibtex
@inproceedings{krishna-etal-2018-free,
    title = "Free as in Free Word Order: An Energy Based Model for Word Segmentation and Morphological Tagging in {S}anskrit",
    author = "Krishna, Amrith  and
      Santra, Bishal  and
      Bandaru, Sasi Prasanth  and
      Sahu, Gaurav  and
      Sharma, Vishnu Dutt  and
      Satuluri, Pavankumar  and
      Goyal, Pawan",
    editor = "Riloff, Ellen  and
      Chiang, David  and
      Hockenmaier, Julia  and
      Tsujii, Jun{'}ichi",
    booktitle = "Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing",
    month = oct # "-" # nov,
    year = "2018",
    address = "Brussels, Belgium",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/D18-1276/",
    doi = "10.18653/v1/D18-1276",
    pages = "2550--2561",
    abstract = "The configurational information in sentences of a free word order language such as Sanskrit is of limited use. Thus, the context of the entire sentence will be desirable even for basic processing tasks such as word segmentation. We propose a structured prediction framework that jointly solves the word segmentation and morphological tagging tasks in Sanskrit. We build an energy based model where we adopt approaches generally employed in graph based parsing techniques (McDonald et al., 2005a; Carreras, 2007). Our model outperforms the state of the art with an F-Score of 96.92 (percentage improvement of 7.06{\%}) while using less than one tenth of the task-specific training data. We find that the use of a graph based approach instead of a traditional lattice-based sequential labelling approach leads to a percentage gain of 12.6{\%} in F-Score for the segmentation task."
}
```