shivrajanand committed
Commit 382124c · verified · 1 Parent(s): f6eee78

Add files using upload-large-folder tool

Files changed (50)
  1. README.md +96 -3
  2. dir/Baseline4_advSample.csv +0 -0
  3. dir/DCS.py +19 -0
  4. dir/ECL_MST.py +132 -0
  5. dir/Feature_Collector.py +83 -0
  6. dir/GetFeatures_gen.py +0 -0
  7. dir/MatDB.py +18 -0
  8. dir/TestPoolMP.py +53 -0
  9. dir/TestPool_Unit.py +54 -0
  10. dir/TestPool_Unit_clique.py +57 -0
  11. dir/Train_bron.py +830 -0
  12. dir/Train_clique.py +769 -0
  13. dir/Train_n_Save_NNet.py +680 -0
  14. dir/__pycache__/TestPool_Unit_clique.cpython-36.pyc +0 -0
  15. dir/__pycache__/heap_n_clique.cpython-36.pyc +0 -0
  16. dir/bronclique.py +362 -0
  17. dir/bucket_by_conflicting_nodes_0.py +85 -0
  18. dir/bucket_by_conflicting_nodes_1.py +72 -0
  19. dir/bucket_by_conflicting_nodes_2.py +73 -0
  20. dir/bucket_by_conflicting_nodes_3.py +73 -0
  21. dir/bucket_by_conflicting_nodes_4.py +73 -0
  22. dir/bz2_counter.py +8 -0
  23. dir/cliq.csv +0 -0
  24. dir/datainspect.py +37 -0
  25. dir/dcs_skt_bzipper.py +196 -0
  26. dir/evaluate.py +81 -0
  27. dir/generate_dcs_and_skt_csv.py +66 -0
  28. dir/gt2.py +148 -0
  29. dir/heap_n_PrimMST.py +201 -0
  30. dir/heap_n_clique.py +325 -0
  31. dir/heldoutmatchtest.py +37 -0
  32. dir/lemmawise_labeller.py +77 -0
  33. dir/nnet.py +371 -0
  34. dir/pvb.p +0 -0
  35. dir/rom.txt +33 -0
  36. dir/rom2.txt +12 -0
  37. dir/romtoslp.py +35 -0
  38. dir/romtoslp.pyc +0 -0
  39. dir/sandhiRules.p +0 -0
  40. dir/sentences.py +319 -0
  41. dir/sh_TestPool_MP_clique.py +168 -0
  42. dir/test_clique.py +174 -0
  43. dir/unpack.py +39 -0
  44. dir/utilities.py +323 -0
  45. dir/verbs_vs_cngs_matrix_countonly.p +0 -0
  46. dir/weighted.py +81 -0
  47. dir/wordTypeCheckFunction.py +281 -0
  48. dir/word_definite.py +0 -0
  49. dir/word_definite[d_1500_BM2_v12].py +0 -0
  50. requirements.txt +68 -0
README.md CHANGED
@@ -1,3 +1,96 @@
- ---
- license: apache-2.0
- ---
+ # Word Segmentation in Sanskrit Using Energy Based Models
+
+ This is a reconstruction of the paper [Free as in Free Word Order: An Energy Based Model for Word Segmentation and Morphological Tagging in Sanskrit](https://aclanthology.org/D18-1276/).
+
+ You can refer to the original repository [here](https://zenodo.org/records/1035413#.W35s8hjhUUs).
+
+ ## Folder Structure
+ ```
+ ├── dir/
+ └── wordsegmentation/
+     ├── skt_dcs_DS.bz2_4K_bigram_mir_10K/
+     └── skt_dcs_DS.bz2_4K_bigram_mir_heldout/
+ ```
+
+ ## Prerequisites
+ * Python 3
+ * scipy
+ * numpy
+ * csv, pickle, multiprocessing, bz2 (Python standard library)
+ ## Instructions for Training
+ 1. Change your current directory to `dir`.
+
+ 2. Run Train_clique.py with the following command:
+
+ ```bash
+ python Train_clique.py
+ ```
+ **TRAINING OUTPUTS** are already stored in `dir/outputs/train_t3896665073989`.
+ **NOTE**: To train on a different input feature (BM2, BM3, BR2, BR3, PM2, PM3, PR2, PR3), modify the `bz2_input_folder` value in the main function before starting training; a sketch of that edit follows the table below.
+
+ | Feature Code | `bz2_input_folder` Path |
+ |--------------|------------------------------------------------------------------|
+ | BM2 | wordsegmentation/skt_dcs_DS.bz2_4K_bigram_mir_10K/ |
+ | BM3 | wordsegmentation/skt_dcs_DS.bz2_1L_bigram_mir_10K/ |
+ | BR2 | wordsegmentation/skt_dcs_DS.bz2_4K_bigram_rfe_10K/ |
+ | BR3 | wordsegmentation/skt_dcs_DS.bz2_1L_bigram_rfe_10K/ |
+ | PM2 | wordsegmentation/skt_dcs_DS.bz2_4K_pmi_mir_10K/ |
+ | PM3 | wordsegmentation/skt_dcs_DS.bz2_1L_pmi_mir_10K2/ |
+ | PR2 | wordsegmentation/skt_dcs_DS.bz2_4K_pmi_rfe_10K/ |
+ | PR3 | wordsegmentation/skt_dcs_DS.bz2_1L_pmi_rfe_10K/ |
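
For reference, a minimal sketch of that edit inside `main()` of Train_clique.py. The code in this commit points at `../NewData/`-prefixed copies of these folders, so adjust the prefix to wherever the data actually sits:

```python
# Pick exactly one input-feature folder before starting training (BM2 shown active).
bz2_input_folder = '../NewData/skt_dcs_DS.bz2_4K_bigram_mir_10K/'    # BM2
# bz2_input_folder = '../NewData/skt_dcs_DS.bz2_1L_bigram_mir_10K/'  # BM3
# bz2_input_folder = '../NewData/skt_dcs_DS.bz2_4K_pmi_rfe_10K/'     # PR2
```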
+
+ ## Instructions for Testing
+
+ After training, update the `modelList` dictionary in `test_clique.py` with the name of the neural network saved during training; a minimal example follows below. When testing a feature, supply the name of the neural net that was trained for that same feature.
+
+ We provide a trained model only for the feature BM2, which was our best-performing feature. If the neural net name is left unchanged, testing runs against the pre-trained BM2 model provided in `outputs/train_t7978754709018`.
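
A hypothetical `modelList` entry (test_clique.py is not rendered in this diff, so treat the exact dictionary shape as illustrative; `nnet.p` is the filename the training script saves under its output directory):

```python
# Hypothetical shape: feature tag -> path of the model trained for that feature.
modelList = {
    'BM2': 'outputs/train_t7978754709018/nnet.p',  # pre-trained BM2 model shipped with the repo
}
```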
+
+ To test with a particular feature vector, pass the feature's tag at execution time:
+
+ ```bash
+ python test_clique.py -t <tag>
+ ```
+
+ For example:
+ ```bash
+ python test_clique.py -t BM2
+ ```
+ After testing finishes, run the following command to see the precision and recall values for both the word and word++ prediction tasks:
+ ```bash
+ python evaluate.py <tag>
+ ```
+ For example:
+ ```bash
+ python evaluate.py BM2
+ ```
+
+ ## Citing
+ ```bibtex
+ @inproceedings{krishna-etal-2018-free,
+     title = "Free as in Free Word Order: An Energy Based Model for Word Segmentation and Morphological Tagging in {S}anskrit",
+     author = "Krishna, Amrith and
+       Santra, Bishal and
+       Bandaru, Sasi Prasanth and
+       Sahu, Gaurav and
+       Sharma, Vishnu Dutt and
+       Satuluri, Pavankumar and
+       Goyal, Pawan",
+     editor = "Riloff, Ellen and
+       Chiang, David and
+       Hockenmaier, Julia and
+       Tsujii, Jun{'}ichi",
+     booktitle = "Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing",
+     month = oct # "-" # nov,
+     year = "2018",
+     address = "Brussels, Belgium",
+     publisher = "Association for Computational Linguistics",
+     url = "https://aclanthology.org/D18-1276/",
+     doi = "10.18653/v1/D18-1276",
+     pages = "2550--2561",
+     abstract = "The configurational information in sentences of a free word order language such as Sanskrit is of limited use. Thus, the context of the entire sentence will be desirable even for basic processing tasks such as word segmentation. We propose a structured prediction framework that jointly solves the word segmentation and morphological tagging tasks in Sanskrit. We build an energy based model where we adopt approaches generally employed in graph based parsing techniques (McDonald et al., 2005a; Carreras, 2007). Our model outperforms the state of the art with an F-Score of 96.92 (percentage improvement of 7.06{\%}) while using less than one tenth of the task-specific training data. We find that the use of a graph based approach instead of a traditional lattice-based sequential labelling approach leads to a percentage gain of 12.6{\%} in F-Score for the segmentation task."
+ }
+ ```
dir/Baseline4_advSample.csv ADDED
The diff for this file is too large to render. See raw diff
 
dir/DCS.py ADDED
@@ -0,0 +1,19 @@
+ import sys
+ import warnings
+ from romtoslp import *
+
+ class DCS:
+     def __init__(self, sent_id, sentence):
+         self.sent_id = sent_id
+         self.sentence = sentence
+         self.dcs_chunks = []
+         self.lemmas = []
+         self.cng = []
+
+     def SeeDCS(dcsObj):
+         # Note: takes the instance as 'dcsObj' in place of the usual 'self'.
+         print('DCS ANALYZE')
+         print('-' * 15)
+         print(dcsObj.sentence)
+         print(dcsObj.lemmas)
+         print("Lemmas:", [rom_slp(c) for arr in dcsObj.lemmas for c in arr])
+         print(dcsObj.cng)
+         print()
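
A minimal usage sketch of this class, assuming a hand-built object (all field values here are illustrative; real objects are unpickled from the DCS corpus files, and the CNG tags are numeric code strings):

```python
from DCS import DCS

# Hypothetical values, for illustration only.
d = DCS(sent_id='32517', sentence='rAmo vanaM gacCati')
d.lemmas = [['rAma'], ['vana'], ['gam']]
d.cng = [['29'], ['41'], ['90']]
d.SeeDCS()  # prints the sentence, lemmas (via rom_slp), and CNG tags
```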
dir/ECL_MST.py ADDED
@@ -0,0 +1,132 @@
+ # coding: utf-8
+ # In[38]:
+ #!/usr/bin/env python3
+ import numpy as np
+ import sys
+ from collections import defaultdict, namedtuple
+ from copy import deepcopy
+
+ # An Arc is the same thing as an edge: (tail, weight, head).
+ Arc = namedtuple('Arc', ('tail', 'weight', 'head'))
+
+
+ def min_spanning_arborescence(arcs, source):
+     good_arcs = []
+     quotient_map = {arc.tail: arc.tail for arc in arcs}
+     quotient_map[source] = source
+     while True:
+         min_arc_by_tail_rep = {}
+         successor_rep = {}
+         for arc in arcs:
+             if arc.tail == source:
+                 continue
+             tail_rep = quotient_map[arc.tail]
+             head_rep = quotient_map[arc.head]
+             if tail_rep == head_rep:
+                 continue
+             if tail_rep not in min_arc_by_tail_rep or min_arc_by_tail_rep[tail_rep].weight > arc.weight:
+                 min_arc_by_tail_rep[tail_rep] = arc
+                 successor_rep[tail_rep] = head_rep
+         cycle_reps = find_cycle(successor_rep, source)
+         if cycle_reps is None:
+             good_arcs.extend(min_arc_by_tail_rep.values())
+             return spanning_arborescence(good_arcs, source)
+         good_arcs.extend(min_arc_by_tail_rep[cycle_rep] for cycle_rep in cycle_reps)
+         cycle_rep_set = set(cycle_reps)
+         cycle_rep = cycle_rep_set.pop()
+         quotient_map = {node: cycle_rep if node_rep in cycle_rep_set else node_rep for node, node_rep in quotient_map.items()}
+
+
+ def find_cycle(successor, source):
+     visited = {source}
+     for node in successor:
+         cycle = []
+         while node not in visited:
+             visited.add(node)
+             cycle.append(node)
+             node = successor[node]
+         if node in cycle:
+             return cycle[cycle.index(node):]
+     return None
+
+
+ def spanning_arborescence(arcs, source):
+     arcs_by_head = defaultdict(list)
+     for arc in arcs:
+         if arc.tail == source:
+             continue
+         arcs_by_head[arc.head].append(arc)
+     solution_arc_by_tail = {}
+     stack = arcs_by_head[source]
+     while stack:
+         stack = sorted(stack)
+         arc = stack.pop(0)
+         if arc.tail in solution_arc_by_tail:
+             continue
+         solution_arc_by_tail[arc.tail] = arc
+         stack.extend(arcs_by_head[arc.tail])
+     return solution_arc_by_tail
+
+
+ def MST_ECL(nodelist, WScalarMat, conflicts_Dict1, source):
+     pre_nodes = []
+     nodes = []
+     WScalarMat1 = deepcopy(WScalarMat)
+
+     mst_nodes = defaultdict(lambda: [])
+     mst_nodes_bool = np.array([False] * len(nodelist))
+     mst_adj_graph = np.zeros(WScalarMat.shape, dtype=bool)
+
+     # Greedily add the lowest-weight compatible edges until every node is covered.
+     while len(nodes) < len(nodelist):
+         flat_min = np.argmin(WScalarMat1)
+         i = int(flat_min / len(nodelist))
+         j = flat_min % len(nodelist)
+         if i not in nodes and j not in nodes and i != j and i not in conflicts_Dict1[j]:
+             pre_nodes.extend([i, j])
+             nodes.extend([i, j])
+             for x in conflicts_Dict1[i]:
+                 if x not in nodes:
+                     nodes.append(x)
+             for x in conflicts_Dict1[j]:
+                 if x not in nodes:
+                     nodes.append(x)
+         elif i not in nodes and j in pre_nodes and i != j and i not in conflicts_Dict1[j]:
+             pre_nodes.append(i)
+             nodes.append(i)
+             for x in conflicts_Dict1[i]:
+                 if x not in nodes:
+                     nodes.append(x)
+         elif j not in nodes and i in pre_nodes and i != j and j not in conflicts_Dict1[i]:
+             pre_nodes.append(j)
+             nodes.append(j)
+             for x in conflicts_Dict1[j]:
+                 if x not in nodes:
+                     nodes.append(x)
+         WScalarMat1[i][j] = sys.maxsize
+
+     pre_nodes.sort()
+     for i in pre_nodes:
+         mst_nodes_bool[i] = True
+         mst_nodes[nodelist[i].chunk_id].append(nodelist[i])
+     mst_nodes = dict(mst_nodes)
+
+     # List of arcs (edges) between the selected nodes.
+     list_arcs = []
+     for i in range(len(nodelist)):
+         for j in range(len(nodelist)):
+             if i in pre_nodes and j in pre_nodes and WScalarMat[i][j] != 0.0:
+                 list_arcs.append(Arc(j, WScalarMat[i][j], i))
+
+     Resultant_Arcs = min_spanning_arborescence(list_arcs, pre_nodes[0])
+     for i in Resultant_Arcs.values():
+         mst_adj_graph[i.head][i.tail] = True
+
+     return (mst_nodes_bool, mst_nodes, mst_adj_graph)
+
+ # Example_Graph_1 = [Arc(1, 9, 0), Arc(2, 10, 0), Arc(3, 9, 0), Arc(2, 20, 1), Arc(3, 3, 1), Arc(1, 30, 2), Arc(3, 30, 2), Arc(2, 0, 3), Arc(1, 11, 3)]
+ # Example_Graph_2 = [Arc(1, 10, 0), Arc(2, 7, 0), Arc(1, 9, 2), Arc(4, 10, 1), Arc(3, 2, 2), Arc(3, 20, 4), Arc(4, 23, 2), Arc(5, 5, 2), Arc(3, 7, 5)]
+ # print(min_spanning_arborescence(Example_Graph_2, 0))
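
The commented-out examples at the bottom of the file can be exercised directly; a minimal sketch using Example_Graph_2:

```python
from ECL_MST import Arc, min_spanning_arborescence

# Example_Graph_2 from the comments above: min-cost arborescence rooted at node 0.
arcs = [Arc(1, 10, 0), Arc(2, 7, 0), Arc(1, 9, 2), Arc(4, 10, 1), Arc(3, 2, 2),
        Arc(3, 20, 4), Arc(4, 23, 2), Arc(5, 5, 2), Arc(3, 7, 5)]
result = min_spanning_arborescence(arcs, 0)
for tail, arc in sorted(result.items()):
    # One chosen incoming arc per non-root node.
    print(tail, '<-', arc.head, 'weight', arc.weight)
```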
dir/Feature_Collector.py ADDED
@@ -0,0 +1,83 @@
+ import word_definite as WD
+ from MatDB import *
+ import matplotlib.pyplot as plt
+ from IPython.display import display
+
+ import numpy as np
+ import math
+ import pickle
+ np.set_printoptions(suppress=False, precision=16)
+
+ matDB = MatDB()
+
+ WD.word_definite_extInit(matDB)
+
+ print(len(matDB.mat_tupCount_1D))
+
+ # Keep only the (lemma_cng) tuples seen more than 20 times.
+ smallTupList = []
+ for tup1, val in matDB.mat_tupCount_1D.items():
+     if val > 20:
+         smallTupList.append(tup1)
+
+ l = len(smallTupList)
+ print(l)
+
+ batchCount = 2000
+ perm = np.random.permutation(l)[:batchCount]
+
+ # Feature vector dimensionality.
+ fN = 4 * 443**2 + 9 * 443 + 9
+ print(fN)
+
+ all_pairs = {}
+
+ index = 0
+ for k in range(batchCount):
+     tup1 = smallTupList[perm[k]]
+     lem1 = tup1.split('_')
+     cng1 = lem1[1]
+     lem1 = lem1[0]
+
+     node1 = WD.word_definite(None, lem1, cng1, 0, 0)
+     for tup2, co_occurrence in matDB.mat_tup2tup_countonly[tup1].items():
+         if co_occurrence > 4:
+             lem2 = tup2.split('_')
+             cng2 = lem2[1]
+             lem2 = lem2[0]
+             node2 = WD.word_definite(None, lem2, cng2, 0, 1)
+             all_pairs[index] = (node1, node2)
+             index += 1
+
+ with open('outputs/log_001.txt', 'a') as log_handle:
+     log_handle.write('Will get feature vectors for {} pairs\n'.format(index))
+ total_examples = index
+
+ pairs_per_file = 500
+
+ def tryForVal(mat, key1, key2):
+     # Return mat[key1][key2], or 0 when either key is absent.
+     try:
+         v = mat[key1][key2]
+     except KeyError:
+         v = 0
+     return v
+
+ for pairx in range(math.ceil(len(all_pairs) / pairs_per_file)):
+     subset_pairs = range(pairx * pairs_per_file, min(len(all_pairs), (pairx + 1) * pairs_per_file))
+     featureMatrix = np.zeros((fN, len(subset_pairs)))
+     targetDict = {}
+     index = 0
+     current_pairs = {}
+     for hi in subset_pairs:
+         node1 = all_pairs[hi][0]
+         node2 = all_pairs[hi][1]
+         current_pairs[index] = '{}^{}'.format(node1.tup, node2.tup)
+         featureMatrix[:, index, None] = WD.Get_Features(node1, node2)
+         targetDict[index] = (tryForVal(matDB.mat_tup2tup_countonly, node1.tup, node2.tup),
+                              tryForVal(matDB.mat_lem2lem_countonly, node1.lemma, node2.lemma),
+                              tryForVal(matDB.mat_lem2tup_countonly, node1.lemma, node2.tup),
+                              tryForVal(matDB.mat_tup2lem_countonly, node1.tup, node2.lemma))
+         index += 1
+         if index % min(math.ceil(pairs_per_file / 2), 100) == 0:
+             with open('outputs/log_001.txt', 'a') as log_handle:
+                 log_handle.write('Checkpoint S{}E{} of {}\n'.format(pairx, index, pairs_per_file))
+     pickle.dump({'all_pairs': current_pairs, 'featureMatrix': featureMatrix, 'targetDict': targetDict},
+                 open('outputs/featureSet_{}samples_8L_{}.p'.format(pairs_per_file, pairx), 'wb'), protocol=4)
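
A small sketch of reading back one of the pickles this script writes (the keys match the `pickle.dump` call above; the filename shown instantiates `pairs_per_file=500` and `pairx=0`):

```python
import pickle

# Load the first feature batch written by Feature_Collector.py.
with open('outputs/featureSet_500samples_8L_0.p', 'rb') as f:
    batch = pickle.load(f)

featureMatrix = batch['featureMatrix']  # shape: (fN, n_pairs_in_batch)
targetDict = batch['targetDict']        # per-pair co-occurrence counts
print(featureMatrix.shape, len(batch['all_pairs']))
```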
dir/GetFeatures_gen.py ADDED
The diff for this file is too large to render. See raw diff
 
dir/MatDB.py ADDED
@@ -0,0 +1,18 @@
+ import pickle
+
+ class MatDB:
+     # Co-occurrence count matrices between lemmas, CNGs and (lemma, cng) tuples,
+     # loaded from pre-computed pickle files.
+     def __init__(self):
+         self.mat_lem2lem_countonly = pickle.load(open('../NewData/gauravs/mat_lem2lem_old_countonly.p', 'rb'), encoding=u'utf-8')
+         self.mat_lem2cng_countonly = pickle.load(open('../NewData/gauravs/mat_lemma2cng_new_countonly.p', 'rb'), encoding=u'utf-8')
+         self.mat_lem2tup_countonly = pickle.load(open('../NewData/gauravs/mat_lem2tup_old_countonly.p', 'rb'), encoding=u'utf-8')
+
+         self.mat_cng2lem_countonly = pickle.load(open('../NewData/gauravs/mat_cng2lemma_new_countonly.p', 'rb'), encoding=u'utf-8')
+         self.mat_cng2tup_countonly = pickle.load(open('../NewData/gauravs/mat_cng2tup_new_countonly.p', 'rb'), encoding=u'utf-8')
+         self.mat_cng2cng_countonly = pickle.load(open('../NewData/gauravs/mat_cng2cng_new_countonly.p', 'rb'), encoding=u'utf-8')
+
+         self.mat_tup2cng_countonly = pickle.load(open('../NewData/gauravs/mat_tup2cng_new_countonly.p', 'rb'), encoding=u'utf-8')
+         self.mat_tup2lem_countonly = pickle.load(open('../NewData/gauravs/mat_tup2lem_old_countonly.p', 'rb'), encoding=u'utf-8')
+         self.mat_tup2tup_countonly = pickle.load(open('../NewData/gauravs/mat_tup2tup_new_countonly.p', 'rb'), encoding=u'utf-8')
+
+         self.mat_lemCount_1D = pickle.load(open('../NewData/gauravs/Temporary_1D/mat_lemCount_1D.p', 'rb'), encoding=u'utf-8')
+         self.mat_cngCount_1D = pickle.load(open('../NewData/gauravs/Temporary_1D/mat_cngCount_1D.p', 'rb'), encoding=u'utf-8')
+         self.mat_tupCount_1D = pickle.load(open('../NewData/gauravs/Temporary_1D/mat_tupCount_1D.p', 'rb'), encoding=u'utf-8')
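
A minimal sketch of how these count matrices are queried elsewhere in this commit (the guarded lookup mirrors `tryForVal` in Feature_Collector.py; the lemma keys are illustrative, and the pickles under `../NewData/gauravs/` must exist):

```python
from MatDB import MatDB

matDB = MatDB()  # loads all pickled count matrices

def count_or_zero(mat, key1, key2):
    # Nested-dict lookup that treats a missing key as a zero count.
    try:
        return mat[key1][key2]
    except KeyError:
        return 0

# Hypothetical keys: how often lemma 'rAma' co-occurs with lemma 'vana'.
print(count_or_zero(matDB.mat_lem2lem_countonly, 'rAma', 'vana'))
```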
dir/TestPoolMP.py ADDED
@@ -0,0 +1,53 @@
+ import multiprocessing as mp
+ import numpy as np
+ import TestPool_Unit
+ from shutil import copyfile
+
+ def Evaluate(result_arr):
+     print('Files Processed: ', len(result_arr))
+     recalls = []
+     recalls_of_word = []
+     precisions = []
+     precisions_of_words = []
+     for entry in result_arr:
+         (word_match, lemma_match, n_dcsWords, n_output_nodes) = entry
+         recalls.append(lemma_match / n_dcsWords)
+         recalls_of_word.append(word_match / n_dcsWords)
+
+         precisions.append(lemma_match / n_output_nodes)
+         precisions_of_words.append(word_match / n_output_nodes)
+     print('Avg. Micro Recall of Lemmas: {}'.format(np.mean(np.array(recalls))))
+     print('Avg. Micro Recall of Words: {}'.format(np.mean(np.array(recalls_of_word))))
+     print('Avg. Micro Precision of Lemmas: {}'.format(np.mean(np.array(precisions))))
+     print('Avg. Micro Precision of Words: {}'.format(np.mean(np.array(precisions_of_words))))
+
+ modelFile = 'outputs/train_nnet_t764815831413.p'
+ # Back up the model file and run the test against the backup.
+ copyfile(modelFile, modelFile + '.bk')
+ modelFile = modelFile + '.bk'
+
+ # Create queue and result array.
+ queue = mp.Queue()
+ result_arr = []
+
+ # Start worker processes (tune proc_count to the machine; too many slows it down).
+ proc_count = 10
+ procs = [None] * proc_count
+ for i in range(proc_count):
+     vpid = i
+     # NOTE: the fourth positional argument maps to pooled_Test's testfolder parameter in TestPool_Unit.py.
+     procs[i] = mp.Process(target=TestPool_Unit.pooled_Test, args=(modelFile, vpid, queue, 700))
+
+ # Start processes.
+ for i in range(proc_count):
+     procs[i].start()
+
+ # Properly join.
+ for i in range(proc_count):
+     procs[i].join()
+
+ # Fetch partial results.
+ while not queue.empty():
+     result_arr.append(queue.get())
+
+ # Evaluate the results collected so far.
+ Evaluate(result_arr)
dir/TestPool_Unit.py ADDED
@@ -0,0 +1,54 @@
+ from multiprocessing import Process
+ import multiprocessing as mp
+ import os, sys
+ import pickle
+ from sentences import *
+ import numpy as np
+ from Train_n_Save_NNet import *
+
+ def pooled_Test(modelFile, vpid, queue, testfolder, filePerProcess=100, _dump=False, _outFile=None):
+     n_chkpt = 100
+     print('Child process with vpid:{}, pid:{} started.'.format(vpid, os.getpid()))
+     trainer = Trainer()
+     trainer.Load(modelFile)
+
+     TestFiles = []
+     for f in os.listdir(testfolder):
+         if '.ds.bz2' in f:
+             TestFiles.append(f)
+
+     print('vpid:{}: Range is {} -> {} / {}'.format(vpid, vpid * filePerProcess, vpid * filePerProcess + filePerProcess, len(TestFiles)))
+     if _dump:
+         _outFile = '{}_proc{}.csv'.format(_outFile, vpid)
+         with open(_outFile, 'w') as fh:
+             print('File refreshed', _outFile)
+
+     loaded_SKT = pickle.load(open('../Simultaneous_CompatSKT_ho.p', 'rb'))
+     loaded_DCS = pickle.load(open('../Simultaneous_DCS_ho.p', 'rb'))
+
+     # Each worker tests its own contiguous slice of the file list.
+     for i in range(vpid * filePerProcess, vpid * filePerProcess + filePerProcess):
+         fn = TestFiles[i]
+         fn = fn.replace('.ds.bz2', '.p2')
+
+         dsbz2_name = testfolder + TestFiles[i]
+
+         sentenceObj = loaded_SKT[fn]
+         dcsObj = loaded_DCS[fn]
+         results = None  # stays None if the file turns out to be unreadable
+         try:
+             if _dump:
+                 results = trainer.Test(sentenceObj, dcsObj, dsbz2_name, _dump=True, _outFile=_outFile)
+             else:
+                 results = trainer.Test(sentenceObj, dcsObj, dsbz2_name)
+         except EOFError as e:
+             print('BADFILE', dsbz2_name)
+
+         if results is not None:
+             queue.put(results)
+     print('Child process with vpid:{}, pid:{} closed.'.format(vpid, os.getpid()))
dir/TestPool_Unit_clique.py ADDED
@@ -0,0 +1,57 @@
+ from multiprocessing import Process
+ import multiprocessing as mp
+ import os, sys
+ import pickle
+ from sentences import *
+ import numpy as np
+ from Train_clique import *
+
+ def pooled_Test(modelFile, vpid, queue, testfolder, filePerProcess=100, _dump=False, _outFile=None):
+     n_chkpt = 100
+     print('Child process with vpid:{}, pid:{} started.'.format(vpid, os.getpid()))
+     trainer = Trainer()
+     trainer.Load(modelFile)
+
+     TestFiles = []
+     for f in os.listdir(testfolder):
+         if '.ds.bz2' in f:
+             TestFiles.append(f)
+     print('vpid:{}: Range is {} -> {} / {}'.format(vpid, vpid * filePerProcess, vpid * filePerProcess + filePerProcess, len(TestFiles)))
+     if _dump:
+         _outFile = '{}_proc{}.csv'.format(_outFile, vpid)
+         with open(_outFile, 'w') as fh:
+             print('File refreshed', _outFile)
+
+     loaded_SKT = pickle.load(open('Simultaneous_CompatSKT_ho.p', 'rb'))
+     loaded_DCS = pickle.load(open('Simultaneous_DCS_ho.p', 'rb'))
+
+     # Each worker tests its own contiguous slice of the file list.
+     for i in range(vpid * filePerProcess, vpid * filePerProcess + filePerProcess):
+         fn = TestFiles[i]
+         fn = fn.replace('.ds.bz2', '.p2')
+
+         dsbz2_name = testfolder + TestFiles[i]
+
+         sentenceObj = loaded_SKT[fn]
+         dcsObj = loaded_DCS[fn]
+         results = None  # stays None if the file turns out to be unreadable
+         try:
+             if _dump:
+                 results = trainer.Test(sentenceObj, dcsObj, dsbz2_name, _dump=True, _outFile=_outFile)
+             else:
+                 results = trainer.Test(sentenceObj, dcsObj, dsbz2_name)
+         except EOFError as e:
+             print('BADFILE', dsbz2_name)
+
+         if results is not None:
+             queue.put(results)
+     print('Child process with vpid:{}, pid:{} closed.'.format(vpid, os.getpid()))
dir/Train_bron.py ADDED
@@ -0,0 +1,830 @@
+ """
+ IMPORTS
+ """
+
+ ## Built-in packages
+ import sys, os, time, bz2, zlib, pickle, math, json, csv
+ from collections import defaultdict
+ import numpy as np
+ np.set_printoptions(suppress=True)
+ from IPython.display import display
+
+ ## Project modules
+ from romtoslp import *
+ from sentences import *
+ from DCS import *
+ import MatDB
+ from bronclique import *
+ from ECL_MST import *
+
+ import word_definite as WD
+ # from heap_n_PrimMST import *
+ from nnet import *
+
+ """
+ ################################################################################################
+ ########################### LOAD SENTENCE AND DCS OBJECT FILES ###############################
+ ################################################################################################
+ """
+ # loaded_SKT = pickle.load(open('../Simultaneous_CompatSKT_10K.p', 'rb'))
+ # loaded_DCS = pickle.load(open('../Simultaneous_DCS_10K.p', 'rb'))
+
+ """
+ ################################################################################################
+ ########################### OPENS AND EXTRACTS DCS AND SKT DATA STRUCTURES ###################
+ ###################################### FROM BZ2 FILES ########################################
+ """
+ def open_dsbz2(filename):
+     with bz2.BZ2File(filename, 'r') as f:
+         loader = pickle.load(f)
+
+     conflicts_Dict_correct = loader['conflicts_Dict_correct']
+     nodelist_to_correct_mapping = loader['nodelist_to_correct_mapping']
+     nodelist_correct = loader['nodelist_correct']
+     featVMat_correct = loader['featVMat_correct']
+     featVMat = loader['featVMat']
+     conflicts_Dict = loader['conflicts_Dict']
+     nodelist = loader['nodelist']
+
+     return (nodelist_correct, conflicts_Dict_correct, featVMat_correct, nodelist_to_correct_mapping,
+             nodelist, conflicts_Dict, featVMat)
+
+ """
+ ################################################################################################
+ ###################### CREATE SEVERAL DATA STRUCTURES FROM SENTENCE/DCS ######################
+ ########################### NODELIST, ADJACENCY LIST, GRAPH, HEAP #############################
+ ################################################################################################
+ """
+ def GetTrainingKit(sentenceObj, dcsObj):
+     nodelist = GetNodes(sentenceObj)
+
+     # Nodelist with only the correct nodes
+     nodelist2 = GetNodes(sentenceObj)
+     nodelist2_to_correct_mapping = {}
+     nodelist_correct = []
+     search_key = 0
+     first_key = 0
+     for chunk_id in range(len(dcsObj.lemmas)):
+         while nodelist2[first_key].chunk_id != chunk_id:
+             first_key += 1
+         for j in range(len(dcsObj.lemmas[chunk_id])):
+             search_key = first_key
+             while (nodelist2[search_key].lemma != rom_slp(dcsObj.lemmas[chunk_id][j])) or (nodelist2[search_key].cng != dcsObj.cng[chunk_id][j]):
+                 search_key += 1
+                 if search_key >= len(nodelist2) or nodelist2[search_key].chunk_id > chunk_id:
+                     break
+             nodelist2_to_correct_mapping[len(nodelist_correct)] = search_key
+             nodelist_correct.append(nodelist2[search_key])
+     return (nodelist, nodelist_correct, nodelist2_to_correct_mapping)
+
+
+ def GetGraph(nodelist, neuralnet):
+     if not neuralnet.outer_relu:
+         conflicts_Dict = Get_Conflicts(nodelist)
+         featVMat = Get_Feat_Vec_Matrix(nodelist, conflicts_Dict)
+         (WScalarMat, SigmoidGateOutput) = Get_W_Scalar_Matrix_from_FeatVect_Matrix(featVMat, nodelist, conflicts_Dict, neuralnet)
+         return (conflicts_Dict, featVMat, WScalarMat, SigmoidGateOutput)
+     else:
+         conflicts_Dict = Get_Conflicts(nodelist)
+         featVMat = Get_Feat_Vec_Matrix(nodelist, conflicts_Dict)
+         WScalarMat = Get_W_Scalar_Matrix_from_FeatVect_Matrix(featVMat, nodelist, conflicts_Dict, neuralnet)
+         return (conflicts_Dict, featVMat, WScalarMat)
+
+ # NEW LOSS FUNCTION (superseded by the GetLoss redefined further below)
+ def GetLoss(_mst_adj_graph, _mask_de_correct_edges, _negLogLikelies):
+     _negLogLikelies = _negLogLikelies.copy()
+     _negLogLikelies[~_mst_adj_graph] = 0
+     _negLogLikelies[~_mask_de_correct_edges] *= -1  # BAKA!!! Check before you try to fix this again
+     return np.sum(_negLogLikelies)
+
+ """
+ ################################################################################################
+ ############################## MAIN FUNCTION #################################################
+ ################################################################################################
+ """
+ trainingStatus = defaultdict(lambda: bool(False))
+
+ """
+ ################################################################################################
+ ############################## TRAIN FUNCTION ################################################
+ ################################################################################################
+ """
+ def train_generator(loaded_SKT, loaded_DCS, bz2_input_folder, n_trainset=-1, iterationPerBatch=10, filePerBatch=20, _debug=True, superEpochs=1):
+     # Train
+     if n_trainset == -1:
+         n_trainset = len(TrainFiles)
+         totalBatchToTrain = math.ceil(n_trainset / filePerBatch)
+     else:
+         totalBatchToTrain = math.ceil(n_trainset / filePerBatch)
+
+     register_nnet(trainer.neuralnet, bz2_input_folder)
+     print('Epochs: ' + str(superEpochs))
+     print('Batches per epoch: ' + str(totalBatchToTrain))
+     for _epoch in range(superEpochs):
+         for iterout in range(totalBatchToTrain):
+             startT = time.time()
+
+             # Change current batch; keep a separate snapshot every 50 batches
+             if iterout % 50 == 0:
+                 trainer.Save(p_name.replace('.p', '_e{}_i{}.p'.format(_epoch, iterout)))
+             else:
+                 trainer.Save(p_name)
+             print('Epoch: {}, Batch: {}'.format(_epoch, iterout))
+             files_for_batch = TrainFiles[iterout * filePerBatch:(iterout + 1) * filePerBatch]
+             try:
+                 # Run a few times on the same set of files
+                 for iterin in range(iterationPerBatch):
+                     print('ITERATION IN', iterin)
+                     for fn in files_for_batch:
+                         stime = time.time()
+                         trainFileName = fn.replace('.ds.bz2', '.p2')
+                         sentenceObj = loaded_SKT[trainFileName]
+                         dcsObj = loaded_DCS[trainFileName]
+                         if trainingStatus[sentenceObj.sent_id]:
+                             continue
+                         try:
+                             trainer.Train(sentenceObj, dcsObj, bz2_input_folder, _debug)
+                         except (IndexError, KeyError) as e:
+                             print(e)
+                             print('\x1b[31mFailed: {} \x1b[0m'.format(sentenceObj.sent_id))
+                         except EOFError as e:
+                             print('\x1b[31mBADFILE: {} \x1b[0m'.format(sentenceObj.sent_id))
+                         ftime = time.time()
+                         print('Time taken for file ' + str(trainFileName) + ' is ' + str(ftime - stime) + ' seconds')
+                         with open('bron.csv', 'a') as fh:
+                             rd = csv.writer(fh)
+                             rd.writerow([str(ftime - stime)])
+                     sys.stdout.flush()  # Flush IO buffer
+                 finishT = time.time()
+
+                 print('Avg. time taken by 1 file (1 iteration): {:.3f}'.format((finishT - startT) / (iterationPerBatch * filePerBatch)))
+             except KeyboardInterrupt:
+                 print('Training paused')
+                 trainer.Save(p_name)
+                 yield None
+     trainer.Save(p_name)
+
+ def test(loaded_SKT, loaded_DCS, n_testSet=-1, _testFiles=None, n_checkpt=100):
+     total_lemma = 0
+     correct_lemma = 0
+
+     total_word = 0
+     total_output_nodes = 0
+     correct_word = 0
+     file_counter = 0
+     if _testFiles is None:
+         if n_testSet == -1:
+             _testFiles = TestFiles
+         else:
+             _testFiles = TestFiles[0:n_testSet]
+     else:
+         if n_testSet != -1:
+             _testFiles = _testFiles[0:n_testSet]
+
+     recalls = []
+     recalls_of_word = []
+     precisions = []
+     precisions_of_words = []
+     for fn in _testFiles:
+         if file_counter % n_checkpt == 0:
+             print(file_counter, ' Checkpoint... ')
+             if file_counter > 0:
+                 print('Avg. Micro Recall of Lemmas: {}'.format(np.mean(np.array(recalls))))
+                 print('Avg. Micro Recall of Words: {}'.format(np.mean(np.array(recalls_of_word))))
+                 print('Avg. Micro Precision of Lemmas: {}'.format(np.mean(np.array(precisions))))
+                 print('Avg. Micro Precision of Words: {}'.format(np.mean(np.array(precisions_of_words))))
+             sys.stdout.flush()  # Flush IO buffer
+
+         file_counter += 1
+
+         testFileName = fn.replace('.ds.bz2', '.p2')
+         sentenceObj = loaded_SKT[testFileName]
+         dcsObj = loaded_DCS[testFileName]
+
+         try:
+             # NOTE: Trainer.Test below also expects a dsbz2_name argument; this helper appears to predate that signature.
+             (word_match, lemma_match, n_dcsWords, n_output_nodes) = trainer.Test(sentenceObj, dcsObj)
+
+             recalls.append(lemma_match / n_dcsWords)
+             recalls_of_word.append(word_match / n_dcsWords)
+
+             precisions.append(lemma_match / n_output_nodes)
+             precisions_of_words.append(word_match / n_output_nodes)
+
+             total_lemma += n_dcsWords
+             total_word += n_dcsWords
+
+             total_output_nodes += n_output_nodes
+
+             correct_lemma += lemma_match
+             correct_word += word_match
+         except (IndexError, KeyError) as e:
+             print('Failed!')
+
+     print('Avg. Micro Recall of Lemmas: {}'.format(np.mean(np.array(recalls))))
+     print('Avg. Micro Recall of Words: {}'.format(np.mean(np.array(recalls_of_word))))
+     print('Avg. Micro Precision of Lemmas: {}'.format(np.mean(np.array(precisions))))
+     print('Avg. Micro Precision of Words: {}'.format(np.mean(np.array(precisions_of_words))))
+
+     return (recalls, recalls_of_word, precisions, precisions_of_words)
+
+ # NEW FUNCTION (overrides the earlier GetLoss; this version works directly on the weight matrix)
+ def GetLoss(_mst_adj_graph, _mask_de_correct_edges, _WScalarMat):
+     _WScalarMat = _WScalarMat.copy()
+     _WScalarMat[_mst_adj_graph & (~_mask_de_correct_edges)] *= -1  # BAKA!!! Check before you try to fix this again
+     _WScalarMat[~_mst_adj_graph] = 0
+     return np.sum(_WScalarMat)
+
+ """
+ ################################################################################################
+ ############################# TRAINER CLASS DEFINITION ######################################
+ ################################################################################################
+ """
+ class Trainer:
+     def __init__(self, modelFile=None):
+         if modelFile is None:
+             singleLayer = True
+             self._edge_vector_dim = 1500
+             if singleLayer:
+                 self.hidden_layer_size = 1200
+                 keep_prob = 0.6
+                 self.neuralnet = NN(self._edge_vector_dim, self.hidden_layer_size, outer_relu=True, keep_prob=keep_prob)
+             else:
+                 # DeepR network (two hidden layers)
+                 self.hidden_layer_size = 800
+                 self.hidden_layer_size2 = 800
+                 self.neuralnet = NN_2(self._edge_vector_dim, self.hidden_layer_size,
+                                       hidden_layer_2_size=self.hidden_layer_size2, outer_relu=True)
+             self.history = defaultdict(lambda: list())
+         else:
+             # Recreate the network with the saved dimensions, then restore its weights
+             loader = pickle.load(open(modelFile, 'rb'))
+
+             self.hidden_layer_size = loader['n']
+             self._edge_vector_dim = loader['d']
+             self.neuralnet = NN(self._edge_vector_dim, self.hidden_layer_size, outer_relu=True)
+
+             self.neuralnet.U = loader['U']
+             self.neuralnet.W = loader['W']
+             self.neuralnet.B1 = loader['B1']
+             self.neuralnet.B2 = loader['B2']
+
+             self.history = defaultdict(lambda: list())
+
+         # SET LEARNING RATES
+         if self.neuralnet.version == 'h1':
+             self.neuralnet.etaW = 3e-5
+             self.neuralnet.etaB1 = 1e-5
+
+             self.neuralnet.etaU = 1e-5
+             self.neuralnet.etaB2 = 1e-5
+         elif self.neuralnet.version == 'h2':
+             self.neuralnet.etaW1 = 3e-4
+             self.neuralnet.etaB1 = 1e-4
+
+             self.neuralnet.etaW2 = 1e-4
+             self.neuralnet.etaB2 = 1e-4
+
+             self.neuralnet.etaU = 1e-4
+             self.neuralnet.etaB3 = 1e-4
+
+     def Reset(self):
+         self.neuralnet = NN(self._edge_vector_dim, self.hidden_layer_size)
+         self.history = defaultdict(lambda: list())
+
+     def Save(self, filename):
+         print('Weights Saved: ', filename)
+         if self.neuralnet.version == 'h1':
+             pickle.dump({
+                 'U': self.neuralnet.U,
+                 'W': self.neuralnet.W,
+                 'n': self.neuralnet.n,
+                 'd': self.neuralnet.d,
+                 'B1': self.neuralnet.B1,
+                 'B2': self.neuralnet.B2,
+                 'keep_prob': self.neuralnet.keep_prob,
+                 'version': self.neuralnet.version
+             }, open(filename, 'wb'))
+             return
+         elif self.neuralnet.version == 'h2':
+             pickle.dump({
+                 'U': self.neuralnet.U,
+                 'B3': self.neuralnet.B3,
+                 'W2': self.neuralnet.W2,
+                 'B2': self.neuralnet.B2,
+                 'W1': self.neuralnet.W1,
+                 'B1': self.neuralnet.B1,
+                 'h1': self.neuralnet.h1,
+                 'h2': self.neuralnet.h2,
+                 'd': self.neuralnet.d,
+                 'version': self.neuralnet.version
+             }, open(filename, 'wb'))
+             return
+
+     def Load(self, filename):
+         loader = pickle.load(open(filename, 'rb'))
+         if 'version' not in loader:  # means one hidden layer
+             self.neuralnet = NN(self._edge_vector_dim, self.hidden_layer_size, outer_relu=True)
+             self.neuralnet.U = loader['U']
+             self.neuralnet.W = loader['W']
+             self.neuralnet.B1 = loader['B1']
+             self.neuralnet.B2 = loader['B2']
+             self.neuralnet.hidden_layer_size = loader['n']
+             self.neuralnet._edge_vector_dim = loader['d']
+             if 'keep_prob' in loader:
+                 self.neuralnet.keep_prob = loader['keep_prob']
+                 self.neuralnet.dropout_prob = 1 - loader['keep_prob']
+                 print('Keep Prob = {}, Dropout = {}'.format(self.neuralnet.keep_prob, self.neuralnet.dropout_prob))
+         else:
+             if loader['version'] == 'h1':
+                 self.neuralnet = NN(self._edge_vector_dim, self.hidden_layer_size, outer_relu=True)
+                 self.neuralnet.U = loader['U']
+                 self.neuralnet.W = loader['W']
+                 self.neuralnet.B1 = loader['B1']
+                 self.neuralnet.B2 = loader['B2']
+                 self.neuralnet.hidden_layer_size = loader['n']
+                 self.neuralnet._edge_vector_dim = loader['d']
+                 if 'keep_prob' in loader:
+                     self.neuralnet.keep_prob = loader['keep_prob']
+                     self.neuralnet.dropout_prob = 1 - loader['keep_prob']
+                     print('Keep Prob = {}, Dropout = {}'.format(self.neuralnet.keep_prob, self.neuralnet.dropout_prob))
+             elif loader['version'] == 'h2':
+                 self.neuralnet = NN_2(self._edge_vector_dim, self.hidden_layer_size, outer_relu=True)
+
+                 self.neuralnet.U = loader['U']
+                 self.neuralnet.B3 = loader['B3']
+
+                 self.neuralnet.W2 = loader['W2']
+                 self.neuralnet.B2 = loader['B2']
+
+                 self.neuralnet.W1 = loader['W1']
+                 self.neuralnet.B1 = loader['B1']
+
+                 self.neuralnet.h1 = loader['h1']
+                 self.neuralnet.h2 = loader['h2']
+                 self.neuralnet.d = loader['d']
+
+     def CalculateLoss_n_Grads(self, WScalarMat, min_st_adj_worst, max_st_adj_gold, loss_type=0, min_marginalized_energy=None):
+         doBpp = True
+
+         # Calculate the energies
+         etg = np.sum(WScalarMat[max_st_adj_gold])
+         etq = np.sum(WScalarMat[min_st_adj_worst])
+
+         if loss_type == 0:
+             # Variable hinge loss - CHECKED
+             L = etg - min_marginalized_energy
+             if L > 0:
+                 dLdOut = np.zeros_like(WScalarMat)
+                 dLdOut[max_st_adj_gold & (~min_st_adj_worst)] = 1
+                 dLdOut[(~max_st_adj_gold) & min_st_adj_worst] = -1
+             else:
+                 doBpp = False
+                 return (L, None, doBpp)
+         elif loss_type == 1:
+             # Log loss
+             a = etg - etq
+             b = np.exp(a)
+             L = np.log(1 + b)
+
+             dLdOut = np.zeros_like(WScalarMat)
+             dLdOut[max_st_adj_gold & (~min_st_adj_worst)] = 1
+             dLdOut[(~max_st_adj_gold) & min_st_adj_worst] = -1
+
+             dLdOut *= (b / (1 + b))
+         elif loss_type == 2:
+             # Square-exponential loss
+             gamma = 1
+             b = np.exp(-etq)
+
+             L = etg**2 + gamma * b
+
+             dLdOut = np.zeros_like(WScalarMat)
+             dLdOut[max_st_adj_gold & (~min_st_adj_worst)] = 2 * etg
+             dLdOut[(~max_st_adj_gold) & min_st_adj_worst] = -gamma * b
+         return (L, dLdOut, doBpp)
+
+     def Test(self, sentenceObj, dcsObj, dsbz2_name, _dump=False, _outFile=None):
+         if _dump:
+             if _outFile is None:
+                 raise Exception('_outFile must be provided when _dump is True')
+         if self.neuralnet.version == 'h1':
+             self.neuralnet.ForTesting()
+
+         neuralnet = self.neuralnet
+         minScore = np.inf
+         minMst = None
+
+         (nodelist_correct, conflicts_Dict_correct, featVMat_correct, nodelist_to_correct_mapping,
+          nodelist, conflicts_Dict, featVMat) = open_dsbz2(dsbz2_name)
+
+         if not self.neuralnet.outer_relu:
+             (WScalarMat, SigmoidGateOutput) = Get_W_Scalar_Matrix_from_FeatVect_Matrix(featVMat, nodelist, conflicts_Dict, neuralnet)
+         else:
+             WScalarMat = Get_W_Scalar_Matrix_from_FeatVect_Matrix(featVMat, nodelist, conflicts_Dict, neuralnet)
+
+         # Get the MST for every source node; keep the minimum-energy one
+         for source in range(len(nodelist)):
+             (mst_nodes, mst_adj_graph, _) = clique(nodelist, WScalarMat, conflicts_Dict, source)
+             score = GetMSTWeight(mst_adj_graph, WScalarMat)
+             if score < minScore:
+                 minScore = score
+                 minMst = mst_nodes
+         dcsLemmas = [[rom_slp(l) for l in arr] for arr in dcsObj.lemmas]
+         word_match = 0
+         lemma_match = 0
+         n_output_nodes = 0
+
+         if _dump:
+             predicted_lemmas = [sentenceObj.sent_id]
+             predicted_cngs = [sentenceObj.sent_id]
+             predicted_chunk_id = [sentenceObj.sent_id]
+             predicted_pos = [sentenceObj.sent_id]
+             predicted_id = [sentenceObj.sent_id]
+
+         for chunk_id, wdSplit in minMst.items():
+             for wd in wdSplit:
+                 if _dump:
+                     predicted_lemmas.append(wd.lemma)
+                     predicted_cngs.append(wd.cng)
+                     predicted_chunk_id.append(wd.chunk_id)
+                     predicted_pos.append(wd.pos)
+                     predicted_id.append(wd.id)
+
+                 n_output_nodes += 1
+                 # Match lemma
+                 search_result = [i for i, j in enumerate(dcsLemmas[chunk_id]) if j == wd.lemma]
+                 if len(search_result) > 0:
+                     lemma_match += 1
+                 # Match CNG
+                 for i in search_result:
+                     if dcsObj.cng[chunk_id][i] == str(wd.cng):
+                         word_match += 1
+                         break
+         dcsLemmas = [l for arr in dcsObj.lemmas for l in arr]
+
+         if _dump:
+             with open(_outFile, 'a') as fh:
+                 dcsv = csv.writer(fh)
+                 dcsv.writerow(predicted_lemmas)
+                 dcsv.writerow(predicted_cngs)
+                 dcsv.writerow(predicted_chunk_id)
+                 dcsv.writerow(predicted_pos)
+                 dcsv.writerow(predicted_id)
+                 dcsv.writerow([sentenceObj.sent_id, word_match, lemma_match, len(dcsLemmas), n_output_nodes])
+
+         return (word_match, lemma_match, len(dcsLemmas), n_output_nodes)
+
+     def Train(self, sentenceObj, dcsObj, bz2_input_folder, _debug=True):
+         self.neuralnet.ForTraining()
+         self.neuralnet.new_dropout()  # renew dropout setting
+         # Hyperparameter for hinge loss: m
+         m_hinge_param = 14
+
+         dsbz2_name = sentenceObj.sent_id + '.ds.bz2'
+         (nodelist_correct, conflicts_Dict_correct, featVMat_correct, nodelist_to_correct_mapping,
+          nodelist, conflicts_Dict, featVMat) = open_dsbz2(bz2_input_folder + dsbz2_name)
+
+         sub = 0
+         for s in conflicts_Dict.keys():
+             sub = sub + len(conflicts_Dict[s])
+
+         with open('bron.csv', 'a') as fh:
+             rd = csv.writer(fh)
+             rd.writerow([str(dsbz2_name), str(len(nodelist)), str(sub)])
+
+         print(dsbz2_name)
+         print("NodeLength: " + str(len(nodelist)))
+
+         if len(nodelist) > 100:
+             print("Nodelength : " + str(len(nodelist)))
+             # Large graphs are meant to be trained separately
+
+         """ FORM MAXIMUM(ENERGY) SPANNING TREE OF THE GOLDEN GRAPH : WORST GOLD STRUCTURE """
+         WScalarMat_correct = Get_W_Scalar_Matrix_from_FeatVect_Matrix(featVMat_correct, nodelist_correct,
+                                                                       conflicts_Dict_correct, self.neuralnet)
+         source = 0
+
+         # Enumerate maximal cliques of the gold graph via Bron-Kerbosch
+         R = set()
+         P = set()
+         X = set()
+         for i in range(len(nodelist_correct)):
+             P.add(i)
+
+         bcliq = bron(R, P, X, nodelist_correct, conflicts_Dict_correct, 1)
+         with open('bron.csv', 'a') as fh:
+             rd = csv.writer(fh)
+             rd.writerow([str(dsbz2_name), str(len(bcliq))])
+         min_st_adj_gold_small = np.zeros(WScalarMat_correct.shape, dtype=bool)
+         for i in bcliq[0]:
+             for j in bcliq[0]:
+                 if i == j:
+                     continue
+                 min_st_adj_gold_small[i][j] = True
+
+         energy_gold_max_ST = np.sum(WScalarMat_correct[min_st_adj_gold_small])
+         """ Convert correct spanning tree graph adj matrix to full matrix dimensions """
+         """ Create full-size adjacency matrix for correct_mst_small """
+         nodelen = len(nodelist)
+         min_st_adj_gold = np.zeros((nodelen, nodelen), dtype=bool)  # T_STAR
+         for i in range(min_st_adj_gold_small.shape[0]):
+             for j in range(min_st_adj_gold_small.shape[1]):
+                 min_st_adj_gold[nodelist_to_correct_mapping[i], nodelist_to_correct_mapping[j]] = \
+                     min_st_adj_gold_small[i, j]
+
+         """ Delta (margin) function : mask for which nodes in nodelist belong to DCS """
+         gold_nodes_mask = np.array([False] * len(nodelist))
+         gold_nodes_mask[list(nodelist_to_correct_mapping.values())] = True
+         margin_f = lambda nodes_mask: np.sum(nodes_mask & (~gold_nodes_mask))**2
+
+         """ FOR ALL POSSIBLE MST FROM THE COMPLETE GRAPH """
+         WScalarMat = Get_W_Scalar_Matrix_from_FeatVect_Matrix(featVMat, nodelist, conflicts_Dict, self.neuralnet)
+         """ For each node - find the MST with that source """
+         min_STx = None  # Min-energy spanning tree with worst margin w.r.t. the gold structure
+         min_marginalized_energy = np.inf
+
+         # Generate a random set of nodes from which MSTs are to be considered
+         n_nodes = len(nodelist)
+         selection_prob = 0.4
+         select_flag = np.random.rand(n_nodes) < selection_prob
+         # Fix if all zeros
+         if np.sum(select_flag) == 0:
+             select_flag[np.random.randint(n_nodes)] = 1
+
+         best_node_diff = np.inf
+         best_energy = np.inf
+
+         # Enumerate maximal cliques of the full candidate graph
+         R = set()
+         P = set()
+         X = set()
+         for i in range(len(nodelist)):
+             P.add(i)
+         bcliq = bron(R, P, X, nodelist, conflicts_Dict, 1)
+         print('Enumerated number of cliques: ' + str(len(bcliq)))
+         for clique in bcliq:  # NOTE: the loop variable shadows the imported clique() function
+             mst_adj_graph = np.zeros(WScalarMat.shape, dtype=bool)
+             mst_nodes_bool = np.array([False] * len(nodelist))
+             for nd in clique:
+                 mst_nodes_bool[nd] = True
+                 for nd2 in clique:
+                     if nd == nd2:
+                         continue
+                     mst_adj_graph[nd][nd2] = True
+             en_st = np.sum(WScalarMat[mst_adj_graph])
+             # Pick up the node_diff with lowest energy
+             delta_st = margin_f(mst_nodes_bool)
+
+             if _debug:
+                 if best_energy > en_st:
+                     best_node_diff = delta_st
+                     best_energy = en_st
+             # Minimum marginalized energy calculation
+             marginalized_en = en_st - delta_st
+             # Minimum marginalized spanning tree
+             if marginalized_en < min_marginalized_energy:
+                 min_marginalized_energy = marginalized_en
+                 min_STx = mst_adj_graph
+             # Energy differences should all be negative
+             if _debug:
+                 print('Source: [{}], Node_Diff:{}, Max_Gold_En: {:.3f}, Energy: {:.3f}'.
+                       format(source, np.sum((~gold_nodes_mask) & mst_nodes_bool), energy_gold_max_ST, np.sum(WScalarMat[mst_adj_graph])))
+
+         if _debug:
+             print('Best Node diff: {} with EN: {}'.format(np.sqrt(best_node_diff), best_energy))
+         """ Gradient Descent """
+         # LOSS TYPES -> hinge(0), log-loss(1), square-exponential(2)
+         Total_Loss, dLdOut, doBpp = self.CalculateLoss_n_Grads(WScalarMat, min_STx, min_st_adj_gold,
+                                                                loss_type=0, min_marginalized_energy=min_marginalized_energy)
+         if doBpp:
+             if _debug:
+                 print('{}. '.format(sentenceObj.sent_id), end='')
+             self.neuralnet.Back_Prop(dLdOut, len(nodelist), featVMat, _debug)
+         else:
+             trainingStatus[sentenceObj.sent_id] = True
+         if _debug:
+             print("\nFileKey: %s, Loss: %6.3f" % (sentenceObj.sent_id, Total_Loss))
+
+ TrainFiles = None
+ trainer = None
+ p_name = ''
+ odir = ''
+ def InitModule():
+     global trainer
+     trainer = Trainer()
+
+ def register_nnet(nnet, bz2_input_folder):
+     if not os.path.isdir(odir):
+         os.mkdir(odir)
+     if not os.path.isfile('outputs/nnet_LOGS.csv'):
+         with open('outputs/nnet_LOGS.csv', 'a') as fh:
+             csv_r = csv.writer(fh)
+             csv_r.writerow(['odir', 'p_name', 'hidden_layer_size', '_edge_vector_dim'])
+     with open('outputs/nnet_LOGS.csv', 'a') as fh:
+         csv_r = csv.writer(fh)
+         if nnet.version == 'h1':
+             csv_r.writerow([odir, p_name, nnet.n, nnet.d, bz2_input_folder])
+         elif nnet.version == 'h2':
+             csv_r.writerow([odir, p_name, nnet.h1, nnet.h2, nnet.d, bz2_input_folder])
+
+ """
+ ################################################################################################
+ ################################################################################################
+ ################################################################################################
+ """
+ def main():
+     global TrainFiles, p_name, odir
+
+     """
+     ################################################################################################
+     ############################## GET A FILENAME TO SAVE WEIGHTS ################################
+     ################################################################################################
+     """
+     st = str(int((time.time() * 1e6) % 1e13))
+     log_name = 'logs/train_nnet_t{}.out'.format(st)
+     odir = 'outputs/train_t{}'.format(st)
+     p_name = 'outputs/train_t{}/nnet.p'.format(st)
+     print('Neural net will be saved here: ', p_name)
+
+     # Create training file list
+     excluded_files = []
+     with open('inputs/Baseline4_advSample.csv', 'r') as f_handle:
+         opener = csv.reader(f_handle)
+         for line in opener:
+             excluded_files.append(line[1].replace('.p', '.ds.bz2'))
+
+     # Load Simultaneous files
+     print('Loading Large Files')
+     loaded_SKT = pickle.load(open('../Simultaneous_CompatSKT_10K.p', 'rb'), encoding=u'utf-8')
+     loaded_DCS = pickle.load(open('../Simultaneous_DCS_10K.p', 'rb'), encoding=u'utf-8')
+
+     bz2_input_folder = '../NewData/skt_dcs_DS.bz2_4K_bigram_mir_10K/'    # BM2
+     # bz2_input_folder = '../NewData/skt_dcs_DS.bz2_1L_bigram_mir_10K/'  # BM3
+     # bz2_input_folder = '../NewData/skt_dcs_DS.bz2_4K_bigram_rfe_10K/'  # BR2
+     # bz2_input_folder = '../NewData/skt_dcs_DS.bz2_1L_bigram_rfe_10K/'  # BR3
+     # bz2_input_folder = '../NewData/skt_dcs_DS.bz2_4K_pmi_mir_10K/'     # PM2
+     # bz2_input_folder = '../NewData/skt_dcs_DS.bz2_1L_pmi_mir_10K2/'    # PM3
+     # bz2_input_folder = '../NewData/skt_dcs_DS.bz2_4K_pmi_rfe_10K/'     # PR2
+     # bz2_input_folder = '../NewData/skt_dcs_DS.bz2_1L_pmi_rfe_10K/'     # PR3
+     # bz2_input_folder = '/home/rs/15CS91R05/vishnu/Data/skt_dcs_DS.bz2_compat_10k_check_again/'
+     all_files = []
+     skipped = 0
+     for f in os.listdir(bz2_input_folder):
+         if '.ds.bz2' in f:
+             if f in excluded_files:
+                 skipped += 1
+                 continue
+             if f.replace('.ds.bz2', '.p2') not in loaded_DCS:
+                 print("Couldn't find ", f)
+                 continue
+             all_files.append(f)
+
+     print(skipped, 'files will not be used for training')
+     print('Size of training set:', len(all_files))
+
+     all_files = ['32517.p2']  # NOTE: restricts training to a single file (debug override left in by the authors)
+     TrainFiles = all_files
+     with open('bron.csv', 'w') as fh:
+         rd = csv.writer(fh)
+         rd.writerow(['FileName', 'Nodelength & NCliques'])
+
+     InitModule()
+     trainingStatus = defaultdict(lambda: bool(False))
+     train = train_generator(loaded_SKT, loaded_DCS, bz2_input_folder, n_trainset=-1, filePerBatch=20, iterationPerBatch=3, _debug=False, superEpochs=2)
+
+     # Complete training
+     train.__next__()
+
+     print('Training Complete')
+
+ if __name__ == '__main__':
+     main()
dir/Train_clique.py ADDED
@@ -0,0 +1,769 @@
+ """
+     IMPORTS
+ """
+
+ ## bUILT-iN pACKAGES
+ import sys, os, time, bz2, zlib, pickle, math, json, csv
+ from collections import defaultdict
+ import numpy as np
+ np.set_printoptions(suppress=True)
+ from IPython.display import display
+
+ ## lAST sUMMER
+ from romtoslp import *
+ from sentences import *
+ from DCS import *
+ import MatDB
+ from heap_n_clique import *
+ from ECL_MST import *
+
+ import word_definite as WD
+ # from heap_n_PrimMST import *
+ from nnet import *
+
+ """
+ ################################################################################################
+ ###########################  LOAD SENTENCE AND DCS OBJECT FILES  ###############################
+ ################################################################################################
+ """
+ # loaded_SKT = pickle.load(open('../Simultaneous_CompatSKT_10K.p', 'rb'))
+ # loaded_DCS = pickle.load(open('../Simultaneous_DCS_10K.p', 'rb'))
+
+ """
+ ################################################################################################
+ ###########################  OPENS AND EXTRACTS DCS AND SKT DATA STRUCTURES  ###################
+ ######################################  FROM BZ2 FILES  ########################################
+ """
+ def open_dsbz2(filename):
+     with bz2.BZ2File(filename, 'r') as f:
+         loader = pickle.load(f)
+
+     conflicts_Dict_correct = loader['conflicts_Dict_correct']
+     nodelist_to_correct_mapping = loader['nodelist_to_correct_mapping']
+     nodelist_correct = loader['nodelist_correct']
+     featVMat_correct = loader['featVMat_correct']
+     featVMat = loader['featVMat']
+     conflicts_Dict = loader['conflicts_Dict']
+     nodelist = loader['nodelist']
+
+     return (nodelist_correct, conflicts_Dict_correct, featVMat_correct, nodelist_to_correct_mapping,\
+             nodelist, conflicts_Dict, featVMat)
+
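+ # Illustrative round trip (a sketch; '100.ds.bz2' is a hypothetical file name):
+ #   (nl_c, cd_c, fv_c, mapping, nl, cd, fv) = open_dsbz2(bz2_input_folder + '100.ds.bz2')
+ # nl is the full candidate nodelist and nl_c only the gold (DCS) nodes; mapping sends
+ # indices of nl_c into nl, while cd/fv hold the conflict dict and edge feature vectors.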
+ """
+ ################################################################################################
+ ######################  CREATE SEVERAL DATA STRUCTURES FROM SENTENCE/DCS  ######################
+ ###########################  NODELIST, ADJACENCY LIST, GRAPH, HEAP  ############################
+ ################################################################################################
+ """
+ def GetTrainingKit(sentenceObj, dcsObj):
+     nodelist = GetNodes(sentenceObj)
+
+     # Nodelist with only the correct_nodes
+     nodelist2 = GetNodes(sentenceObj)
+     nodelist2_to_correct_mapping = {}
+     nodelist_correct = []
+     search_key = 0
+     first_key = 0
+     for chunk_id in range(len(dcsObj.lemmas)):
+         while nodelist2[first_key].chunk_id != chunk_id:
+             first_key += 1
+         for j in range(len(dcsObj.lemmas[chunk_id])):
+             search_key = first_key
+             while (nodelist2[search_key].lemma != rom_slp(dcsObj.lemmas[chunk_id][j])) or (nodelist2[search_key].cng != dcsObj.cng[chunk_id][j]):
+                 search_key += 1
+                 if search_key >= len(nodelist2) or nodelist2[search_key].chunk_id > chunk_id:
+                     break
+             # print((rom_slp(dcsObj.lemmas[chunk_id][j]), dcsObj.cng[chunk_id][j]))
+             # print(nodelist[search_key])
+             nodelist2_to_correct_mapping[len(nodelist_correct)] = search_key
+             nodelist_correct.append(nodelist2[search_key])
+     return (nodelist, nodelist_correct, nodelist2_to_correct_mapping)
+
+
+ def GetGraph(nodelist, neuralnet):
+     if not neuralnet.outer_relu:
+         conflicts_Dict = Get_Conflicts(nodelist)
+
+         featVMat = Get_Feat_Vec_Matrix(nodelist, conflicts_Dict)
+
+         (WScalarMat, SigmoidGateOutput) = Get_W_Scalar_Matrix_from_FeatVect_Matrix(featVMat, nodelist, conflicts_Dict, neuralnet)
+         return (conflicts_Dict, featVMat, WScalarMat, SigmoidGateOutput)
+     else:
+         conflicts_Dict = Get_Conflicts(nodelist)
+
+         featVMat = Get_Feat_Vec_Matrix(nodelist, conflicts_Dict)
+
+         WScalarMat = Get_W_Scalar_Matrix_from_FeatVect_Matrix(featVMat, nodelist, conflicts_Dict, neuralnet)
+         return (conflicts_Dict, featVMat, WScalarMat)
+
+ # NEW LOSS FUNCTION
+ def GetLoss(_mst_adj_graph, _mask_de_correct_edges, _negLogLikelies):
+     _negLogLikelies = _negLogLikelies.copy()
+     _negLogLikelies[~_mst_adj_graph] = 0
+     _negLogLikelies[~_mask_de_correct_edges] *= -1  # BAKA!!! Check before you try to fix this again
+     return np.sum(_negLogLikelies)
+
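+ # Masking sketch: edges outside the spanning structure are zeroed first, then entries
+ # on kept edges that are not gold edges get their sign flipped, so gold edges add
+ # their score to the sum and spurious edges subtract theirs.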
+
+
+ """
+ ################################################################################################
+ ##############################  MAIN FUNCTION  #################################################
+ ################################################################################################
+ """
+ trainingStatus = defaultdict(lambda: bool(False))
+
+
+ """
+ ################################################################################################
+ ##############################  TRAIN FUNCTION  ################################################
+ ################################################################################################
+ """
+
+ def train_generator(loaded_SKT, loaded_DCS, bz2_input_folder, n_trainset = -1, iterationPerBatch = 10, filePerBatch = 20, _debug = True, superEpochs = 1):
+     # Train
+     if n_trainset == -1:
+         n_trainset = len(TrainFiles)
+         totalBatchToTrain = math.ceil(n_trainset/filePerBatch)
+     else:
+         totalBatchToTrain = math.ceil(n_trainset/filePerBatch)
+
+     register_nnet(trainer.neuralnet, bz2_input_folder)
+     print('Epoch:' + str(superEpochs))
+     print('iters:' + str(totalBatchToTrain))
+     for _epoch in range(superEpochs):
+         for iterout in range(totalBatchToTrain):
+             # Add timer
+             startT = time.time()
+
+             # Change current batch
+             if(iterout % 50 == 0):
+                 trainer.Save(p_name.replace('.p', '_e{}_i{}.p'.format(_epoch, iterout)))
+             else:
+                 trainer.Save(p_name)
+             print('Epoch: {}, Batch: {}'.format(_epoch, iterout))
+             files_for_batch = TrainFiles[iterout*filePerBatch:(iterout + 1)*filePerBatch]
+             # print(files_for_batch)
+             # trainer.Load('outputs/neuralnet_trained.p')
+             try:
+                 # Run a few times on the same set of files
+                 for iterin in range(iterationPerBatch):
+                     print('ITERATION IN', iterin)
+                     for fn in files_for_batch:
+                         stime = time.time()
+                         trainFileName = fn.replace('.ds.bz2', '.p2')
+                         sentenceObj = loaded_SKT[trainFileName]
+                         # print(trainFileName)
+
+                         dcsObj = loaded_DCS[trainFileName]
+                         if trainingStatus[sentenceObj.sent_id]:
+                             continue
+                         # trainer.Save('outputs/saved_trainer.p')
+                         try:
+                             trainer.Train(sentenceObj, dcsObj, bz2_input_folder, _debug)
+                         except (IndexError, KeyError) as e:
+                             print(e)
+                             print('\x1b[31mFailed: {} \x1b[0m'.format(sentenceObj.sent_id))
+                         except EOFError as e:
+                             print('\x1b[31mBADFILE: {} \x1b[0m'.format(sentenceObj.sent_id))
+                         ftime = time.time()
+                         print('Time taken for file ' + str(trainFileName) + ' is ' + str(ftime - stime) + ' seconds')
+                         with open('cliq.csv', 'a') as fh:
+                             rd = csv.writer(fh)
+                             rd.writerow([str(ftime - stime)])
+                 sys.stdout.flush()  # Flush IO buffer
+                 finishT = time.time()
+                 print('Avg. time taken by 1 file(1 iteration): {:.3f}'.format((finishT - startT)/(iterationPerBatch*filePerBatch)))
+             except KeyboardInterrupt:
+                 print('Training paused')
+                 trainer.Save(p_name)
+                 yield None
+     trainer.Save(p_name)
+
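+ # Control-flow note: main() drives this generator with train.__next__(); on a
+ # KeyboardInterrupt the current weights are saved and the generator yields, so a
+ # further __next__() call would resume with the remaining batches.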
+ def test(loaded_SKT, loaded_DCS, n_testSet = -1, _testFiles = None, n_checkpt = 100):
+     total_lemma = 0
+     correct_lemma = 0
+
+     total_word = 0
+     total_output_nodes = 0
+     correct_word = 0
+     file_counter = 0
+     if _testFiles is None:
+         if n_testSet == -1:
+             _testFiles = TestFiles
+         else:
+             _testFiles = TestFiles[0:n_testSet]
+     else:
+         if n_testSet == -1:
+             _testFiles = _testFiles
+         else:
+             _testFiles = _testFiles[0:n_testSet]
+
+     recalls = []
+     recalls_of_word = []
+     precisions = []
+     precisions_of_words = []
+     for fn in _testFiles:
+         if file_counter % n_checkpt == 0:
+             print(file_counter, ' Checkpoint... ')
+             if file_counter > 0:
+                 print('Avg. Micro Recall of Lemmas: {}'.format(np.mean(np.array(recalls))))
+                 print('Avg. Micro Recall of Words: {}'.format(np.mean(np.array(recalls_of_word))))
+                 print('Avg. Micro Precision of Lemmas: {}'.format(np.mean(np.array(precisions))))
+                 print('Avg. Micro Precision of Words: {}'.format(np.mean(np.array(precisions_of_words))))
+             sys.stdout.flush()  # Flush IO buffer
+
+         file_counter += 1
+
+         testFileName = fn.replace('.ds.bz2', '.p2')
+         sentenceObj = loaded_SKT[testFileName]
+         # print(testFileName)
+         # print(type(testFileName))
+         dcsObj = loaded_DCS[testFileName]
+
+         try:
+             (word_match, lemma_match, n_dcsWords, n_output_nodes) = trainer.Test(sentenceObj, dcsObj)
+
+             recalls.append(lemma_match/n_dcsWords)
+             recalls_of_word.append(word_match/n_dcsWords)
+
+             precisions.append(lemma_match/n_output_nodes)
+             precisions_of_words.append(word_match/n_output_nodes)
+
+             total_lemma += n_dcsWords
+             total_word += n_dcsWords
+
+             total_output_nodes += n_output_nodes
+
+             correct_lemma += lemma_match
+             correct_word += word_match
+         except (IndexError, KeyError) as e:
+             print('Failed!')
+
+     print('Avg. Micro Recall of Lemmas: {}'.format(np.mean(np.array(recalls))))
+     print('Avg. Micro Recall of Words: {}'.format(np.mean(np.array(recalls_of_word))))
+     print('Avg. Micro Precision of Lemmas: {}'.format(np.mean(np.array(precisions))))
+     print('Avg. Micro Precision of Words: {}'.format(np.mean(np.array(precisions_of_words))))
+
+     return (recalls, recalls_of_word, precisions, precisions_of_words)
+
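+ # Metric convention above: recall divides matches by the number of gold DCS words,
+ # precision by the number of predicted nodes; the word-level variants additionally
+ # require the predicted cng (morphological class) to agree with the gold one.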
+ # NEW FUNCTION
+ def GetLoss(_mst_adj_graph, _mask_de_correct_edges, _WScalarMat):
+     _WScalarMat = _WScalarMat.copy()
+     _WScalarMat[_mst_adj_graph&(~_mask_de_correct_edges)] *= -1  # BAKA!!! Check before you try to fix this again
+     _WScalarMat[~_mst_adj_graph] = 0
+     return np.sum(_WScalarMat)
+
+ """
+ ################################################################################################
+ #############################  TRAINER CLASS DEFINITION  ######################################
+ ################################################################################################
+ """
+ class Trainer:
+     def __init__(self, modelFile = None):
+         if modelFile is None:
+             singleLayer = True
+             self._edge_vector_dim = 1500
+             if singleLayer:
+                 self.hidden_layer_size = 1200
+                 keep_prob = 0.6
+                 self.neuralnet = NN(self._edge_vector_dim, self.hidden_layer_size, outer_relu=True, keep_prob=keep_prob)
+             else:
+                 # DeepR Network
+                 self.hidden_layer_size = 800
+                 self.hidden_layer_size2 = 800
+                 self.neuralnet = NN_2(self._edge_vector_dim, self.hidden_layer_size,\
+                     hidden_layer_2_size = self.hidden_layer_size2, outer_relu=True)
+             self.history = defaultdict(lambda: list())
+         else:
+             loader = pickle.load(open(modelFile, 'rb'))
+
+             self.hidden_layer_size = loader['n']
+             self._edge_vector_dim = loader['d']
+
+             self.neuralnet = NN(self._edge_vector_dim, self.hidden_layer_size, outer_relu=True)
+
+             self.neuralnet.U = loader['U']
+             self.neuralnet.W = loader['W']
+             self.neuralnet.B1 = loader['B1']
+             self.neuralnet.B2 = loader['B2']
+
+             self.history = defaultdict(lambda: list())
+
+         # SET LEARNING RATES
+         if self.neuralnet.version == 'h1':
+             self.neuralnet.etaW = 3e-5
+             self.neuralnet.etaB1 = 1e-5
+
+             self.neuralnet.etaU = 1e-5
+             self.neuralnet.etaB2 = 1e-5
+         elif self.neuralnet.version == 'h2':
+             self.neuralnet.etaW1 = 3e-4
+             self.neuralnet.etaB1 = 1e-4
+
+             self.neuralnet.etaW2 = 1e-4
+             self.neuralnet.etaB2 = 1e-4
+
+             self.neuralnet.etaU = 1e-4
+             self.neuralnet.etaB3 = 1e-4
+
+
+     def Reset(self):
+         self.neuralnet = NN(self._edge_vector_dim, self.hidden_layer_size)
+         self.history = defaultdict(lambda: list())
+
+     def Save(self, filename):
+         print('Weights Saved: ', filename)
+         if self.neuralnet.version == 'h1':
+             pickle.dump({
+                 'U': self.neuralnet.U,
+                 'W': self.neuralnet.W,
+                 'n': self.neuralnet.n,
+                 'd': self.neuralnet.d,
+                 'B1': self.neuralnet.B1,
+                 'B2': self.neuralnet.B2,
+                 'keep_prob': self.neuralnet.keep_prob,
+                 'version': self.neuralnet.version
+             }, open(filename, 'wb'))
+             return
+         elif self.neuralnet.version == 'h2':
+             pickle.dump({
+                 'U': self.neuralnet.U,
+                 'B3': self.neuralnet.B3,
+                 'W2': self.neuralnet.W2,
+                 'B2': self.neuralnet.B2,
+                 'W1': self.neuralnet.W1,
+                 'B1': self.neuralnet.B1,
+                 'h1': self.neuralnet.h1,
+                 'h2': self.neuralnet.h2,
+                 'd': self.neuralnet.d,
+                 'version': self.neuralnet.version
+             }, open(filename, 'wb'))
+             return
+
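+     # Round-trip sketch (path illustrative): Save('outputs/train_t0/nnet.p') followed
+     # by Load('outputs/train_t0/nnet.p') restores U, W, B1, B2 (plus keep_prob for
+     # dropout) for an 'h1' net, or the W1/W2/U stack for an 'h2' net.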
+
+     def Load(self, filename):
+         loader = pickle.load(open(filename, 'rb'))
+         if 'version' not in loader:  # means 1 hidden layer
+             self.neuralnet = NN(self._edge_vector_dim, self.hidden_layer_size, outer_relu=True)
+             self.neuralnet.U = loader['U']
+             self.neuralnet.W = loader['W']
+             self.neuralnet.B1 = loader['B1']
+             self.neuralnet.B2 = loader['B2']
+             self.neuralnet.hidden_layer_size = loader['n']
+             self.neuralnet._edge_vector_dim = loader['d']
+             if 'keep_prob' in loader:
+                 self.neuralnet.keep_prob = loader['keep_prob']
+                 self.neuralnet.dropout_prob = 1 - loader['keep_prob']
+                 print('Keep Prob = {}, Dropout = {}'.format(self.neuralnet.keep_prob, self.neuralnet.dropout_prob))
+         else:
+             if loader['version'] == 'h1':
+                 self.neuralnet = NN(self._edge_vector_dim, self.hidden_layer_size, outer_relu=True)
+                 self.neuralnet.U = loader['U']
+                 self.neuralnet.W = loader['W']
+                 self.neuralnet.B1 = loader['B1']
+                 self.neuralnet.B2 = loader['B2']
+                 self.neuralnet.hidden_layer_size = loader['n']
+                 self.neuralnet._edge_vector_dim = loader['d']
+                 if 'keep_prob' in loader:
+                     self.neuralnet.keep_prob = loader['keep_prob']
+                     self.neuralnet.dropout_prob = 1 - loader['keep_prob']
+                     print('Keep Prob = {}, Dropout = {}'.format(self.neuralnet.keep_prob, self.neuralnet.dropout_prob))
+             elif loader['version'] == 'h2':
+                 self.neuralnet = NN_2(self._edge_vector_dim, self.hidden_layer_size, outer_relu=True)
+
+                 self.neuralnet.U = loader['U']
+                 self.neuralnet.B3 = loader['B3']
+
+                 self.neuralnet.W2 = loader['W2']
+                 self.neuralnet.B2 = loader['B2']
+
+                 self.neuralnet.W1 = loader['W1']
+                 self.neuralnet.B1 = loader['B1']
+
+                 self.neuralnet.h1 = loader['h1']
+                 self.neuralnet.h2 = loader['h2']
+                 self.neuralnet.d = loader['d']
+
+     def CalculateLoss_n_Grads(self, WScalarMat, min_st_adj_worst, max_st_adj_gold, loss_type = 0, min_marginalized_energy = None):
+         doBpp = True
+
+         # Calculate the energies
+         etg = np.sum(WScalarMat[max_st_adj_gold])
+         etq = np.sum(WScalarMat[min_st_adj_worst])
+
+         if loss_type == 0:
+             # Variable Hinge Loss - CHECKED
+             L = etg - min_marginalized_energy
+             if L > 0:
+                 dLdOut = np.zeros_like(WScalarMat)
+                 dLdOut[max_st_adj_gold&(~min_st_adj_worst)] = 1
+                 dLdOut[(~max_st_adj_gold)&min_st_adj_worst] = -1
+             else:
+                 doBpp = False
+                 return (L, None, doBpp)
+         elif loss_type == 1:
+             # Log Loss
+             a = etg - etq
+             b = np.exp(a)
+             L = np.log(1 + b)
+
+             dLdOut = np.zeros_like(WScalarMat)
+             dLdOut[max_st_adj_gold&(~min_st_adj_worst)] = 1
+             dLdOut[(~max_st_adj_gold)&min_st_adj_worst] = -1
+
+             dLdOut *= (b/(1 + b))
+         elif loss_type == 2:
+             # Square exponential loss
+             gamma = 1
+             b = np.exp(-etq)
+
+             L = etg**2 + gamma*b
+
+             dLdOut = np.zeros_like(WScalarMat)
+             dLdOut[max_st_adj_gold&(~min_st_adj_worst)] = 2*etg
+             dLdOut[(~max_st_adj_gold)&min_st_adj_worst] = -gamma*b
+         return (L, dLdOut, doBpp)
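+     # Loss type 0 above is the margin-rescaled hinge used during training:
+     #   L = E(T_gold) - min_T (E(T) - Delta(T)),
+     # where Delta is the squared count of predicted nodes outside the gold set;
+     # unit gradients flow only through edges on which the two structures disagree.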
+     def Test(self, sentenceObj, dcsObj, dsbz2_name, _dump = False, _outFile = None):
+         if _dump:
+             if _outFile is None:
+                 raise Exception('WTH r u thinking! pass me outFolder')
+         if self.neuralnet.version == 'h1':
+             self.neuralnet.ForTesting()
+
+         # with open('gt_cngs.csv', 'a') as fh:
+         #     for i in dcsObj.cng:
+         #         for j in i:
+         #             print(str(sentenceObj.sent_id) + ":" + str(j))
+         #             wr = csv.writer(fh)
+         #             wr.writerow([sentenceObj.sent_id, j])
+
+         # return
+         neuralnet = self.neuralnet
+         minScore = np.inf
+         minMst = None
+
+         # dsbz2_name = sentenceObj.sent_id + '.ds.bz2'
+         (nodelist_correct, conflicts_Dict_correct, featVMat_correct, nodelist_to_correct_mapping,\
+          nodelist, conflicts_Dict, featVMat) = open_dsbz2(dsbz2_name)
+
+         # if len(nodelist) > 50:
+         #     return None
+
+         if not self.neuralnet.outer_relu:
+             (WScalarMat, SigmoidGateOutput) = Get_W_Scalar_Matrix_from_FeatVect_Matrix(featVMat, nodelist, conflicts_Dict, neuralnet)
+         else:
+             WScalarMat = Get_W_Scalar_Matrix_from_FeatVect_Matrix(featVMat, nodelist, conflicts_Dict, neuralnet)
+
+         # print('NeuralNet Time: ', time.time() - startT)
+         # startT = time.time()
+
+         # Get all MST
+         # print('before getting all cliques')
+         for source in range(len(nodelist)):
+             (mst_nodes, mst_adj_graph, _) = clique(nodelist, WScalarMat, conflicts_Dict, source)
+             # print('.', end = '')
+             score = GetMSTWeight(mst_adj_graph, WScalarMat)
+             if(score < minScore):
+                 minScore = score
+                 minMst = mst_nodes
+         # print('after getting all cliques')
+         dcsLemmas = [[rom_slp(l) for l in arr] for arr in dcsObj.lemmas]
+         word_match = 0
+         lemma_match = 0
+         n_output_nodes = 0
+
+         if _dump:
+             predicted_lemmas = [sentenceObj.sent_id]
+             predicted_cngs = [sentenceObj.sent_id]
+             predicted_chunk_id = [sentenceObj.sent_id]
+             predicted_pos = [sentenceObj.sent_id]
+             predicted_id = [sentenceObj.sent_id]
+
+         for chunk_id, wdSplit in minMst.items():
+             for wd in wdSplit:
+                 if _dump:
+                     predicted_lemmas.append(wd.lemma)
+                     predicted_cngs.append(wd.cng)
+                     predicted_chunk_id.append(wd.chunk_id)
+                     predicted_pos.append(wd.pos)
+                     predicted_id.append(wd.id)
+
+                 n_output_nodes += 1
+                 # Match lemma
+                 search_result = [i for i, j in enumerate(dcsLemmas[chunk_id]) if j == wd.lemma]
+                 if len(search_result) > 0:
+                     lemma_match += 1
+                     # Match CNG
+                     for i in search_result:
+                         if(dcsObj.cng[chunk_id][i] == str(wd.cng)):
+                             word_match += 1
+                             # print(wd.lemma, wd.cng)
+                             break
+         dcsLemmas = [l for arr in dcsObj.lemmas for l in arr]
+
+         if _dump:
+             with open(_outFile, 'a') as fh:
+                 dcsv = csv.writer(fh)
+                 dcsv.writerow(predicted_lemmas)
+                 dcsv.writerow(predicted_cngs)
+                 dcsv.writerow(predicted_chunk_id)
+                 dcsv.writerow(predicted_pos)
+                 dcsv.writerow(predicted_id)
+                 dcsv.writerow([sentenceObj.sent_id, word_match, lemma_match, len(dcsLemmas), n_output_nodes])
+
+         # print('All MST Time: ', time.time() - startT)
+         # print('Node Count: ', len(nodelist))
+         # print('\nFull Match: {}, Partial Match: {}, OutOf {}, NodeCount: {}, '.\
+         #     format(word_match, lemma_match, len(dcsLemmas), len(nodelist)))
+         return (word_match, lemma_match, len(dcsLemmas), n_output_nodes)
+
+     def Train(self, sentenceObj, dcsObj, bz2_input_folder, _debug = True):
+         self.neuralnet.ForTraining()
+         self.neuralnet.new_dropout()  # renew dropout setting
+         # Hyperparameter for hinge loss: m
+         m_hinge_param = 14
+
+         dsbz2_name = sentenceObj.sent_id + '.ds.bz2'
+         (nodelist_correct, conflicts_Dict_correct, featVMat_correct, nodelist_to_correct_mapping,\
+          nodelist, conflicts_Dict, featVMat) = open_dsbz2(bz2_input_folder + dsbz2_name)
+         sub = 0
+         for s in conflicts_Dict.keys():
+             sub = sub + len(conflicts_Dict[s])
+         with open('cliq.csv', 'a') as fh:
+             rd = csv.writer(fh)
+             rd.writerow([str(dsbz2_name), str(len(nodelist)), str(sub)])
+
+         print(dsbz2_name)
+         print("NodeLength: " + str(len(nodelist)))
+
+         # if(len(nodelist) > 40):
+         #     with open('cliq.csv', 'a') as fh:
+         #         rd = csv.writer(fh)
+         #         rd.writerow([str(dsbz2_name), '0'])
+         #     return
+
+         if(len(nodelist) > 100):
+             print("Nodelength : " + str(len(nodelist)))
+         # Train for large graphs separately
+         # if len(nodelist) < 40:
+         #     return
+
+         """ FORM MAXIMUM(ENERGY) SPANNING TREE OF THE GOLDEN GRAPH : WORST GOLD STRUCTURE """
+         WScalarMat_correct = Get_W_Scalar_Matrix_from_FeatVect_Matrix(featVMat_correct, nodelist_correct,\
+             conflicts_Dict_correct, self.neuralnet)
+         source = 0
+         """ Find the max spanning tree : negative Weight matrix passed """
+         # (min_st_gold_ndict, min_st_adj_gold_small, _) =\
+         #     clique(nodelist_correct, -WScalarMat_correct, conflicts_Dict_correct, source)
+         # print('sentence start')
+         # print(WScalarMat_correct)
+         # print(nodelist_correct)
+         (min_st_gold_ndict, min_st_adj_gold_small, _) =\
+             clique(nodelist_correct, WScalarMat_correct, conflicts_Dict_correct, source)
+         energy_gold_max_ST = np.sum(WScalarMat_correct[min_st_adj_gold_small])
+         # print("Gold: " + str(energy_gold_max_ST))
+         """ Convert correct spanning tree graph adj matrix to full matrix dimensions """
+         """ Create full-size adjacency matrix for correct_mst_small """
+
+         nodelen = len(nodelist)
+         # print(nodelen)
+
+         min_st_adj_gold = np.zeros((nodelen, nodelen), dtype=bool)  # T_STAR
+         for i in range(min_st_adj_gold_small.shape[0]):
+             for j in range(min_st_adj_gold_small.shape[1]):
+                 min_st_adj_gold[nodelist_to_correct_mapping[i], nodelist_to_correct_mapping[j]] =\
+                     min_st_adj_gold_small[i, j]
+
+         """ Delta(Margin) Function : MASK FOR WHICH NODES IN NODELIST BELONG TO DCS """
+         gold_nodes_mask = np.array([False]*len(nodelist))
+         gold_nodes_mask[list(nodelist_to_correct_mapping.values())] = True
+         margin_f = lambda nodes_mask: np.sum(nodes_mask&(~gold_nodes_mask))**2
+
+         """ FOR ALL POSSIBLE MST FROM THE COMPLETE GRAPH """
+         WScalarMat = Get_W_Scalar_Matrix_from_FeatVect_Matrix(featVMat, nodelist, conflicts_Dict, self.neuralnet)
+         # print(WScalarMat)
+         """ For each node - Find MST with that source """
+         min_STx = None  # Min Energy spanning tree with worst margin with gold_STx
+         min_marginalized_energy = np.inf
+
+         # Generate random set of nodes from which mSTs are to be considered
+         n_nodes = len(nodelist)
+         selection_prob = 0.4
+         select_flag = np.random.rand(n_nodes) < selection_prob
+         # Fix if all zeros
+         if np.sum(select_flag) == 0:
+             select_flag[np.random.randint(n_nodes)] = 1
+
+         best_node_diff = np.Inf
+         best_energy = np.inf
+         # print('before###')
+
+         cliqset = set()
+         for source in range(len(nodelist)):
+             (mst_nodes, mst_adj_graph, mst_nodes_bool) = clique(nodelist, WScalarMat, conflicts_Dict, source)  # T_X
+             # Calculate energy of spanning tree
+             cst = ''
+             for i in mst_nodes_bool:
+                 if(i):
+                     cst = cst + '1'
+                 else:
+                     cst = cst + '0'
+             cliqset.add(cst)
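+             # (The 0/1 string built above serializes clique membership, so cliqset
+             # counts how many distinct cliques the per-source searches produced.)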
+             en_st = np.sum(WScalarMat[mst_adj_graph])
+             # Pick up the node_diff with lowest energy
+             delta_st = margin_f(mst_nodes_bool)
+
+             if _debug:
+                 if best_energy > en_st:
+                     best_node_diff = delta_st
+                     best_energy = en_st
+
+             # Minimum marginalized energy calculation
+             marginalized_en = en_st - delta_st
+             # print("Source:" + str(source) + "; Energy" + str(min_marginalized_energy))
+             # Minimum marginalized spanning tree : Randomization applied
+             # if marginalized_en < min_marginalized_energy and select_flag[source]:
+             if marginalized_en < min_marginalized_energy:
+                 min_marginalized_energy = marginalized_en
+                 min_STx = mst_adj_graph
+             # Energy diff should all be negative
+             if _debug:
+                 print('Source: [{}], Node_Diff:{}, Max_Gold_En: {:.3f}, Energy: {:.3f}'.\
+                     format(source, np.sum((~gold_nodes_mask)&mst_nodes_bool), energy_gold_max_ST, np.sum(WScalarMat[mst_adj_graph])))
+         with open('cliq.csv', 'a') as fh:
+             rd = csv.writer(fh)
+             rd.writerow([str(dsbz2_name), len(cliqset)])
+         # print('after###')
+         # print("Min-Energy:" + str(min_marginalized_energy))
+         # print("Gold-Energy" + str(np.sum(WScalarMat[min_st_adj_gold])))
+         # print("*"*40)
+         if _debug:
+             print('Best Node diff: {} with EN: {}'.format(np.sqrt(best_node_diff), best_energy))
+         """ Gradient Descent """
+         # LOSS TYPES -> hinge(0), log-loss(1), square-exponential(2)
+         Total_Loss, dLdOut, doBpp = self.CalculateLoss_n_Grads(WScalarMat, min_STx, min_st_adj_gold,\
+             loss_type = 0, min_marginalized_energy = min_marginalized_energy)
+         if doBpp:
+             if _debug:
+                 print('{}. '.format(sentenceObj.sent_id), end = '')
+             self.neuralnet.Back_Prop(dLdOut, len(nodelist), featVMat, _debug)
+         else:
+             trainingStatus[sentenceObj.sent_id] = True
+         if _debug:
+             print("\nFileKey: %s, Loss: %6.3f" % (sentenceObj.sent_id, Total_Loss))
+
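+ # One Train() call is a single structured update: score the gold clique, search all
+ # sources for the lowest margin-augmented competing clique, then back-propagate the
+ # hinge gradient through the edge-scoring network.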
+ TrainFiles = None
+ trainer = None
+ p_name = ''
+ odir = ''
+ def InitModule():
+     global trainer
+     trainer = Trainer()
+
+ def register_nnet(nnet, bz2_input_folder):
+     if not os.path.isdir(odir):
+         os.mkdir(odir)
+     if not os.path.isfile('outputs/nnet_LOGS.csv'):
+         with open('outputs/nnet_LOGS.csv', 'a') as fh:
+             csv_r = csv.writer(fh)
+             csv_r.writerow(['odir', 'p_name', 'hidden_layer_size', '_edge_vector_dim'])
+     with open('outputs/nnet_LOGS.csv', 'a') as fh:
+         csv_r = csv.writer(fh)
+         if nnet.version == 'h1':
+             csv_r.writerow([odir, p_name, nnet.n, nnet.d, bz2_input_folder])
+         elif nnet.version == 'h2':
+             csv_r.writerow([odir, p_name, nnet.h1, nnet.h2, nnet.d, bz2_input_folder])
+
+ """
+ ################################################################################################
+ ################################################################################################
+ ################################################################################################
+ """
+ def main():
+     global TrainFiles, p_name, odir
+
+     """
+     ################################################################################################
+     ##############################  GET A FILENAME TO SAVE WEIGHTS  ################################
+     ################################################################################################
+     """
+     st = str(int((time.time() * 1e6) % 1e13))
+     log_name = 'logs/train_nnet_t{}.out'.format(st)
+     odir = 'outputs/train_t{}'.format(st)
+     p_name = 'outputs/train_t{}/nnet.p'.format(st)
+     print('nEURAL nET wILL bE sAVED hERE: ', p_name)
+
+
+     if not os.path.isdir('inputs'):
+         os.mkdir('inputs')
+
+     if not os.path.isdir('outputs'):
+         os.mkdir('outputs')
+
+     if not os.path.isdir('NewData'):
+         os.mkdir('NewData')
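+     # Baseline4_advSample.csv lists held-out sentence files (its second column holds
+     # '<sent_id>.p' names); these are skipped when building the training set below.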
+     # Create Training File List
+     excluded_files = []
+     with open('Baseline4_advSample.csv', 'r') as f_handle:
+         opener = csv.reader(f_handle)
+         for line in opener:
+             excluded_files.append(line[1].replace('.p', '.ds.bz2'))
+
+     # Load Simultaneous files
+     print('Loading Large Files')
+     loaded_SKT = pickle.load(open('Simultaneous_CompatSKT_10K.p', 'rb'), encoding=u'utf-8')
+     loaded_DCS = pickle.load(open('Simultaneous_DCS_10K.p', 'rb'), encoding=u'utf-8')
+
+     # loaded_SKT = pickle.load(open('../Simultaneous_CompatSKT.p', 'rb'), encoding=u'utf-8')
+     # loaded_DCS = pickle.load(open('../Simultaneous_DCS.p', 'rb'), encoding=u'utf-8')
+
+     bz2_input_folder = '../wordsegmentation/skt_dcs_DS.bz2_4K_bigram_mir_10K/'   # bm2
+     # bz2_input_folder = '../wordsegmentation/skt_dcs_DS.bz2_1L_bigram_mir_10K/' # bm3
+     # bz2_input_folder = '../wordsegmentation/skt_dcs_DS.bz2_4K_bigram_rfe_10K/' # br2
+     # bz2_input_folder = '../wordsegmentation/skt_dcs_DS.bz2_1L_bigram_rfe_10K/' # br3
+     # bz2_input_folder = '../wordsegmentation/skt_dcs_DS.bz2_4K_pmi_mir_10K/'    # pm2
+     # bz2_input_folder = '../wordsegmentation/skt_dcs_DS.bz2_1L_pmi_mir_10K2/'   # pm3
+     # bz2_input_folder = '../wordsegmentation/skt_dcs_DS.bz2_4K_pmi_rfe_10K/'    # pr2
+     # bz2_input_folder = '../wordsegmentation/skt_dcs_DS.bz2_1L_pmi_rfe_10K/'    # pr3
+     # bz2_input_folder = '/home/rs/15CS91R05/vishnu/Data/skt_dcs_DS.bz2_compat_10k_check_again/'
+     all_files = []
+     skipped = 0
+     for f in os.listdir(bz2_input_folder):
+         if '.ds.bz2' in f:
+             if f in excluded_files:
+                 skipped += 1
+                 continue
+             if f.replace('.ds.bz2', '.p2') not in loaded_DCS:
+                 print('Couldnt find ', f)
+                 continue
+             all_files.append(f)
+
+     print(skipped, 'files will not be used for training')
+     print('Size of training set:', len(all_files))
+
+
+     # all_files = ['32517.p2']
+     TrainFiles = all_files
+     with open('cliq.csv', 'w') as fh:
+         rd = csv.writer(fh)
+         rd.writerow(['FileName', 'Nodelength & NCliques'])
+     InitModule()
+     trainingStatus = defaultdict(lambda: bool(False))
+     # train = train_generator(loaded_SKT, loaded_DCS, bz2_input_folder, n_trainset = -1, filePerBatch = 10, iterationPerBatch = 5, _debug=False, superEpochs = 5)
+     train = train_generator(loaded_SKT, loaded_DCS, bz2_input_folder, n_trainset = -1, filePerBatch = 20, iterationPerBatch = 3, _debug=False, superEpochs = 2)
+
+     # Complete Training
+     train.__next__()
+
+     print('Training Complete')
+
+ if __name__ == '__main__':
+     main()
dir/Train_n_Save_NNet.py ADDED
@@ -0,0 +1,680 @@
+ """
+     IMPORTS
+ """
+
+ ## bUILT-iN pACKAGES
+ import sys, os, time, bz2, zlib, pickle, math, json, csv
+ from collections import defaultdict
+ import numpy as np
+ np.set_printoptions(suppress=True)
+ from IPython.display import display
+
+ ## lAST sUMMER
+ from romtoslp import *
+ from sentences import *
+ from DCS import *
+ import MatDB
+ from heap_n_PrimMST import *
+ from ECL_MST import *
+
+ import word_definite as WD
+ from heap_n_PrimMST import *
+ from nnet import *
+
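+ # This trainer mirrors Train_clique.py but searches over spanning trees, via MST()
+ # from heap_n_PrimMST, instead of maximal cliques.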
+ """
+ ################################################################################################
+ ###########################  LOAD SENTENCE AND DCS OBJECT FILES  ###############################
+ ################################################################################################
+ """
+ # loaded_SKT = pickle.load(open('../Simultaneous_CompatSKT_10K.p', 'rb'))
+ # loaded_DCS = pickle.load(open('../Simultaneous_DCS_10K.p', 'rb'))
+
+ """
+ ################################################################################################
+ ###########################  OPENS AND EXTRACTS DCS AND SKT DATA STRUCTURES  ###################
+ ######################################  FROM BZ2 FILES  ########################################
+ """
+ def open_dsbz2(filename):
+     with bz2.BZ2File(filename, 'r') as f:
+         loader = pickle.load(f)
+
+     conflicts_Dict_correct = loader['conflicts_Dict_correct']
+     nodelist_to_correct_mapping = loader['nodelist_to_correct_mapping']
+     nodelist_correct = loader['nodelist_correct']
+     featVMat_correct = loader['featVMat_correct']
+     featVMat = loader['featVMat']
+     conflicts_Dict = loader['conflicts_Dict']
+     nodelist = loader['nodelist']
+
+     return (nodelist_correct, conflicts_Dict_correct, featVMat_correct, nodelist_to_correct_mapping,\
+             nodelist, conflicts_Dict, featVMat)
+
+ """
+ ################################################################################################
+ ######################  CREATE SEVERAL DATA STRUCTURES FROM SENTENCE/DCS  ######################
+ ###########################  NODELIST, ADJACENCY LIST, GRAPH, HEAP  ############################
+ ################################################################################################
+ """
+ def GetTrainingKit(sentenceObj, dcsObj):
+     nodelist = GetNodes(sentenceObj)
+
+     # Nodelist with only the correct_nodes
+     nodelist2 = GetNodes(sentenceObj)
+     nodelist2_to_correct_mapping = {}
+     nodelist_correct = []
+     search_key = 0
+     first_key = 0
+     for chunk_id in range(len(dcsObj.lemmas)):
+         while nodelist2[first_key].chunk_id != chunk_id:
+             first_key += 1
+         for j in range(len(dcsObj.lemmas[chunk_id])):
+             search_key = first_key
+             while (nodelist2[search_key].lemma != rom_slp(dcsObj.lemmas[chunk_id][j])) or (nodelist2[search_key].cng != dcsObj.cng[chunk_id][j]):
+                 search_key += 1
+                 if search_key >= len(nodelist2) or nodelist2[search_key].chunk_id > chunk_id:
+                     break
+             # print((rom_slp(dcsObj.lemmas[chunk_id][j]), dcsObj.cng[chunk_id][j]))
+             # print(nodelist[search_key])
+             nodelist2_to_correct_mapping[len(nodelist_correct)] = search_key
+             nodelist_correct.append(nodelist2[search_key])
+     return (nodelist, nodelist_correct, nodelist2_to_correct_mapping)
+
+
+ def GetGraph(nodelist, neuralnet):
+     if not neuralnet.outer_relu:
+         conflicts_Dict = Get_Conflicts(nodelist)
+
+         featVMat = Get_Feat_Vec_Matrix(nodelist, conflicts_Dict)
+
+         (WScalarMat, SigmoidGateOutput) = Get_W_Scalar_Matrix_from_FeatVect_Matrix(featVMat, nodelist, conflicts_Dict, neuralnet)
+         return (conflicts_Dict, featVMat, WScalarMat, SigmoidGateOutput)
+     else:
+         conflicts_Dict = Get_Conflicts(nodelist)
+
+         featVMat = Get_Feat_Vec_Matrix(nodelist, conflicts_Dict)
+
+         WScalarMat = Get_W_Scalar_Matrix_from_FeatVect_Matrix(featVMat, nodelist, conflicts_Dict, neuralnet)
+         return (conflicts_Dict, featVMat, WScalarMat)
+
+ # NEW LOSS FUNCTION
+ def GetLoss(_mst_adj_graph, _mask_de_correct_edges, _negLogLikelies):
+     _negLogLikelies = _negLogLikelies.copy()
+     _negLogLikelies[~_mst_adj_graph] = 0
+     _negLogLikelies[~_mask_de_correct_edges] *= -1  # BAKA!!! Check before you try to fix this again
+     return np.sum(_negLogLikelies)
+
+
+
+ """
+ ################################################################################################
+ ##############################  MAIN FUNCTION  #################################################
+ ################################################################################################
+ """
+ trainingStatus = defaultdict(lambda: bool(False))
+
+
+ """
+ ################################################################################################
+ ##############################  TRAIN FUNCTION  ################################################
+ ################################################################################################
+ """
+
+ def train_generator(loaded_SKT, loaded_DCS, bz2_input_folder, n_trainset = -1, iterationPerBatch = 10, filePerBatch = 20, _debug = True, superEpochs = 1):
+     # Train
+     if n_trainset == -1:
+         n_trainset = len(TrainFiles)
+         totalBatchToTrain = math.ceil(n_trainset/filePerBatch)
+     else:
+         totalBatchToTrain = math.ceil(n_trainset/filePerBatch)
+
+     register_nnet(trainer.neuralnet, bz2_input_folder)
+     print('Epoch:' + str(superEpochs))
+     print('iters:' + str(totalBatchToTrain))
+     for _epoch in range(superEpochs):
+         for iterout in range(totalBatchToTrain):
+             # Add timer
+             startT = time.time()
+
+             # Change current batch
+             if(iterout % 50 == 0):
+                 trainer.Save(p_name.replace('.p', '_e{}_i{}.p'.format(_epoch, iterout)))
+             else:
+                 trainer.Save(p_name)
+             print('Epoch: {}, Batch: {}'.format(_epoch, iterout))
+             files_for_batch = TrainFiles[iterout*filePerBatch:(iterout + 1)*filePerBatch]
+             # print(files_for_batch)
+             # trainer.Load('outputs/neuralnet_trained.p')
+             try:
+                 # Run a few times on the same set of files
+                 for iterin in range(iterationPerBatch):
+                     print('ITERATION IN', iterin)
+                     for fn in files_for_batch:
+                         trainFileName = fn.replace('.ds.bz2', '.p2')
+                         sentenceObj = loaded_SKT[trainFileName]
+                         dcsObj = loaded_DCS[trainFileName]
+                         if trainingStatus[sentenceObj.sent_id]:
+                             continue
+                         # trainer.Save('outputs/saved_trainer.p')
+                         try:
+                             trainer.Train(sentenceObj, dcsObj, bz2_input_folder, _debug)
+                         except (IndexError, KeyError) as e:
+                             print('\x1b[31mFailed: {} \x1b[0m'.format(sentenceObj.sent_id))
+                         except EOFError as e:
+                             print('\x1b[31mBADFILE: {} \x1b[0m'.format(sentenceObj.sent_id))
+                 sys.stdout.flush()  # Flush IO buffer
+                 finishT = time.time()
+                 print('Avg. time taken by 1 file(1 iteration): {:.3f}'.format((finishT - startT)/(iterationPerBatch*filePerBatch)))
+             except KeyboardInterrupt:
+                 print('Training paused')
+                 trainer.Save(p_name)
+                 yield None
+     trainer.Save(p_name)
+
+ def test(loaded_SKT, loaded_DCS, n_testSet = -1, _testFiles = None, n_checkpt = 100):
+     total_lemma = 0
+     correct_lemma = 0
+
+     total_word = 0
+     total_output_nodes = 0
+     correct_word = 0
+     file_counter = 0
+     if _testFiles is None:
+         if n_testSet == -1:
+             _testFiles = TestFiles
+         else:
+             _testFiles = TestFiles[0:n_testSet]
+     else:
+         if n_testSet == -1:
+             _testFiles = _testFiles
+         else:
+             _testFiles = _testFiles[0:n_testSet]
+
+     recalls = []
+     recalls_of_word = []
+     precisions = []
+     precisions_of_words = []
+     for fn in _testFiles:
+         if file_counter % n_checkpt == 0:
+             print(file_counter, ' Checkpoint... ')
+             if file_counter > 0:
+                 print('Avg. Micro Recall of Lemmas: {}'.format(np.mean(np.array(recalls))))
+                 print('Avg. Micro Recall of Words: {}'.format(np.mean(np.array(recalls_of_word))))
+                 print('Avg. Micro Precision of Lemmas: {}'.format(np.mean(np.array(precisions))))
+                 print('Avg. Micro Precision of Words: {}'.format(np.mean(np.array(precisions_of_words))))
+             sys.stdout.flush()  # Flush IO buffer
+
+         file_counter += 1
+
+         testFileName = fn.replace('.ds.bz2', '.p2')
+         sentenceObj = loaded_SKT[testFileName]
+         dcsObj = loaded_DCS[testFileName]
+
+         try:
+             (word_match, lemma_match, n_dcsWords, n_output_nodes) = trainer.Test(sentenceObj, dcsObj)
+
+             recalls.append(lemma_match/n_dcsWords)
+             recalls_of_word.append(word_match/n_dcsWords)
+
+             precisions.append(lemma_match/n_output_nodes)
+             precisions_of_words.append(word_match/n_output_nodes)
+
+             total_lemma += n_dcsWords
+             total_word += n_dcsWords
+
+             total_output_nodes += n_output_nodes
+
+             correct_lemma += lemma_match
+             correct_word += word_match
+         except (IndexError, KeyError) as e:
+             print('Failed!')
+
+     print('Avg. Micro Recall of Lemmas: {}'.format(np.mean(np.array(recalls))))
+     print('Avg. Micro Recall of Words: {}'.format(np.mean(np.array(recalls_of_word))))
+     print('Avg. Micro Precision of Lemmas: {}'.format(np.mean(np.array(precisions))))
+     print('Avg. Micro Precision of Words: {}'.format(np.mean(np.array(precisions_of_words))))
+
+     return (recalls, recalls_of_word, precisions, precisions_of_words)
+
+ # NEW FUNCTION
+ def GetLoss(_mst_adj_graph, _mask_de_correct_edges, _WScalarMat):
+     _WScalarMat = _WScalarMat.copy()
+     _WScalarMat[_mst_adj_graph&(~_mask_de_correct_edges)] *= -1  # BAKA!!! Check before you try to fix this again
+     _WScalarMat[~_mst_adj_graph] = 0
+     return np.sum(_WScalarMat)
+
+ """
+ ################################################################################################
+ #############################  TRAINER CLASS DEFINITION  ######################################
+ ################################################################################################
+ """
+ class Trainer:
+     def __init__(self, modelFile = None):
+         if modelFile is None:
+             singleLayer = True
+             self._edge_vector_dim = 1500
+             if singleLayer:
+                 self.hidden_layer_size = 1200
+                 keep_prob = 0.6
+                 self.neuralnet = NN(self._edge_vector_dim, self.hidden_layer_size, outer_relu=True, keep_prob=keep_prob)
+             else:
+                 # DeepR Network
+                 self.hidden_layer_size = 800
+                 self.hidden_layer_size2 = 800
+                 self.neuralnet = NN_2(self._edge_vector_dim, self.hidden_layer_size,\
+                     hidden_layer_2_size = self.hidden_layer_size2, outer_relu=True)
+             self.history = defaultdict(lambda: list())
+         else:
+             loader = pickle.load(open(modelFile, 'rb'))
+
+             self.hidden_layer_size = loader['n']
+             self._edge_vector_dim = loader['d']
+
+             self.neuralnet = NN(self._edge_vector_dim, self.hidden_layer_size, outer_relu=True)
+
+             self.neuralnet.U = loader['U']
+             self.neuralnet.W = loader['W']
+             self.neuralnet.B1 = loader['B1']
+             self.neuralnet.B2 = loader['B2']
+
+             self.history = defaultdict(lambda: list())
+
+         # SET LEARNING RATES
+         if self.neuralnet.version == 'h1':
+             self.neuralnet.etaW = 3e-5
+             self.neuralnet.etaB1 = 1e-5
+
+             self.neuralnet.etaU = 1e-5
+             self.neuralnet.etaB2 = 1e-5
+         elif self.neuralnet.version == 'h2':
+             self.neuralnet.etaW1 = 3e-4
+             self.neuralnet.etaB1 = 1e-4
+
+             self.neuralnet.etaW2 = 1e-4
+             self.neuralnet.etaB2 = 1e-4
+
+             self.neuralnet.etaU = 1e-4
+             self.neuralnet.etaB3 = 1e-4
+
+
+     def Reset(self):
+         self.neuralnet = NN(self._edge_vector_dim, self.hidden_layer_size)
+         self.history = defaultdict(lambda: list())
+
+     def Save(self, filename):
+         print('Weights Saved: ', filename)
+         if self.neuralnet.version == 'h1':
+             pickle.dump({
+                 'U': self.neuralnet.U,
+                 'W': self.neuralnet.W,
+                 'n': self.neuralnet.n,
+                 'd': self.neuralnet.d,
+                 'B1': self.neuralnet.B1,
+                 'B2': self.neuralnet.B2,
+                 'keep_prob': self.neuralnet.keep_prob,
+                 'version': self.neuralnet.version
+             }, open(filename, 'wb'))
+             return
+         elif self.neuralnet.version == 'h2':
+             pickle.dump({
+                 'U': self.neuralnet.U,
+                 'B3': self.neuralnet.B3,
+                 'W2': self.neuralnet.W2,
+                 'B2': self.neuralnet.B2,
+                 'W1': self.neuralnet.W1,
+                 'B1': self.neuralnet.B1,
+                 'h1': self.neuralnet.h1,
+                 'h2': self.neuralnet.h2,
+                 'd': self.neuralnet.d,
+                 'version': self.neuralnet.version
+             }, open(filename, 'wb'))
+             return
+
+
+     def Load(self, filename):
+         loader = pickle.load(open(filename, 'rb'))
+         if 'version' not in loader:  # means 1 hidden layer
+             self.neuralnet = NN(self._edge_vector_dim, self.hidden_layer_size, outer_relu=True)
+             self.neuralnet.U = loader['U']
+             self.neuralnet.W = loader['W']
+             self.neuralnet.B1 = loader['B1']
+             self.neuralnet.B2 = loader['B2']
+             self.neuralnet.hidden_layer_size = loader['n']
+             self.neuralnet._edge_vector_dim = loader['d']
+             if 'keep_prob' in loader:
+                 self.neuralnet.keep_prob = loader['keep_prob']
+                 self.neuralnet.dropout_prob = 1 - loader['keep_prob']
+                 print('Keep Prob = {}, Dropout = {}'.format(self.neuralnet.keep_prob, self.neuralnet.dropout_prob))
+         else:
+             if loader['version'] == 'h1':
+                 self.neuralnet = NN(self._edge_vector_dim, self.hidden_layer_size, outer_relu=True)
+                 self.neuralnet.U = loader['U']
+                 self.neuralnet.W = loader['W']
+                 self.neuralnet.B1 = loader['B1']
+                 self.neuralnet.B2 = loader['B2']
+                 self.neuralnet.hidden_layer_size = loader['n']
+                 self.neuralnet._edge_vector_dim = loader['d']
+                 if 'keep_prob' in loader:
+                     self.neuralnet.keep_prob = loader['keep_prob']
+                     self.neuralnet.dropout_prob = 1 - loader['keep_prob']
+                     print('Keep Prob = {}, Dropout = {}'.format(self.neuralnet.keep_prob, self.neuralnet.dropout_prob))
+             elif loader['version'] == 'h2':
+                 self.neuralnet = NN_2(self._edge_vector_dim, self.hidden_layer_size, outer_relu=True)
+
+                 self.neuralnet.U = loader['U']
+                 self.neuralnet.B3 = loader['B3']
+
+                 self.neuralnet.W2 = loader['W2']
+                 self.neuralnet.B2 = loader['B2']
+
+                 self.neuralnet.W1 = loader['W1']
+                 self.neuralnet.B1 = loader['B1']
+
+                 self.neuralnet.h1 = loader['h1']
+                 self.neuralnet.h2 = loader['h2']
+                 self.neuralnet.d = loader['d']
+
+     def CalculateLoss_n_Grads(self, WScalarMat, min_st_adj_worst, max_st_adj_gold, loss_type = 0, min_marginalized_energy = None):
+         doBpp = True
+
+         # Calculate the energies
+         etg = np.sum(WScalarMat[max_st_adj_gold])
+         etq = np.sum(WScalarMat[min_st_adj_worst])
+
+         if loss_type == 0:
+             # Variable Hinge Loss - CHECKED
+             L = etg - min_marginalized_energy
+             if L > 0:
+                 dLdOut = np.zeros_like(WScalarMat)
+                 dLdOut[max_st_adj_gold&(~min_st_adj_worst)] = 1
+                 dLdOut[(~max_st_adj_gold)&min_st_adj_worst] = -1
+             else:
+                 doBpp = False
+                 return (L, None, doBpp)
+         elif loss_type == 1:
+             # Log Loss
+             a = etg - etq
+             b = np.exp(a)
+             L = np.log(1 + b)
+
+             dLdOut = np.zeros_like(WScalarMat)
+             dLdOut[max_st_adj_gold&(~min_st_adj_worst)] = 1
+             dLdOut[(~max_st_adj_gold)&min_st_adj_worst] = -1
+
+             dLdOut *= (b/(1 + b))
+         elif loss_type == 2:
+             # Square exponential loss
+             gamma = 1
+             b = np.exp(-etq)
+
+             L = etg**2 + gamma*b
+
+             dLdOut = np.zeros_like(WScalarMat)
+             dLdOut[max_st_adj_gold&(~min_st_adj_worst)] = 2*etg
+             dLdOut[(~max_st_adj_gold)&min_st_adj_worst] = -gamma*b
+         return (L, dLdOut, doBpp)
+     def Test(self, sentenceObj, dcsObj, dsbz2_name, _dump = False, _outFile = None):
+         if _dump:
+             if _outFile is None:
+                 raise Exception('WTH r u thinking! pass me outFolder')
+         if self.neuralnet.version == 'h1':
+             self.neuralnet.ForTesting()
+         neuralnet = self.neuralnet
+         minScore = np.inf
+         minMst = None
+
+         # dsbz2_name = sentenceObj.sent_id + '.ds.bz2'
+         (nodelist_correct, conflicts_Dict_correct, featVMat_correct, nodelist_to_correct_mapping,\
+          nodelist, conflicts_Dict, featVMat) = open_dsbz2(dsbz2_name)
+
+         # if len(nodelist) > 50:
+         #     return None
+
+         if not self.neuralnet.outer_relu:
+             (WScalarMat, SigmoidGateOutput) = Get_W_Scalar_Matrix_from_FeatVect_Matrix(featVMat, nodelist, conflicts_Dict, neuralnet)
+         else:
+             WScalarMat = Get_W_Scalar_Matrix_from_FeatVect_Matrix(featVMat, nodelist, conflicts_Dict, neuralnet)
+
+         # print('NeuralNet Time: ', time.time() - startT)
+         # startT = time.time()
+
+         # Get all MST
+         for source in range(len(nodelist)):
+             (mst_nodes, mst_adj_graph, _) = MST(nodelist, WScalarMat, conflicts_Dict, source)
+             # print('.', end = '')
+             score = GetMSTWeight(mst_adj_graph, WScalarMat)
+             if(score < minScore):
+                 minScore = score
+                 minMst = mst_nodes
+         dcsLemmas = [[rom_slp(l) for l in arr] for arr in dcsObj.lemmas]
+         word_match = 0
+         lemma_match = 0
+         n_output_nodes = 0
+
+         if _dump:
+             predicted_lemmas = [sentenceObj.sent_id]
+             predicted_cngs = [sentenceObj.sent_id]
+             predicted_chunk_id = [sentenceObj.sent_id]
+             predicted_pos = [sentenceObj.sent_id]
+             predicted_id = [sentenceObj.sent_id]
+
+         for chunk_id, wdSplit in minMst.items():
+             for wd in wdSplit:
+                 if _dump:
+                     predicted_lemmas.append(wd.lemma)
+                     predicted_cngs.append(wd.cng)
+                     predicted_chunk_id.append(wd.chunk_id)
+                     predicted_pos.append(wd.pos)
+                     predicted_id.append(wd.id)
+
+                 n_output_nodes += 1
+                 # Match lemma
+                 search_result = [i for i, j in enumerate(dcsLemmas[chunk_id]) if j == wd.lemma]
+                 if len(search_result) > 0:
+                     lemma_match += 1
+                     # Match CNG
+                     for i in search_result:
+                         if(dcsObj.cng[chunk_id][i] == str(wd.cng)):
+                             word_match += 1
+                             # print(wd.lemma, wd.cng)
+                             break
+         dcsLemmas = [l for arr in dcsObj.lemmas for l in arr]
+
+         if _dump:
+             with open(_outFile, 'a') as fh:
+                 dcsv = csv.writer(fh)
+                 dcsv.writerow(predicted_lemmas)
+                 dcsv.writerow(predicted_cngs)
+                 dcsv.writerow(predicted_chunk_id)
+                 dcsv.writerow(predicted_pos)
+                 dcsv.writerow(predicted_id)
+                 dcsv.writerow([sentenceObj.sent_id, word_match, lemma_match, len(dcsLemmas), n_output_nodes])
+
+         # print('All MST Time: ', time.time() - startT)
+         # print('Node Count: ', len(nodelist))
+         # print('\nFull Match: {}, Partial Match: {}, OutOf {}, NodeCount: {}, '.\
+         #     format(word_match, lemma_match, len(dcsLemmas), len(nodelist)))
+         return (word_match, lemma_match, len(dcsLemmas), n_output_nodes)
+
+     def Train(self, sentenceObj, dcsObj, bz2_input_folder, _debug = True):
+         self.neuralnet.ForTraining()
+         self.neuralnet.new_dropout()  # renew dropout setting
+         # Hyperparameter for hinge loss: m
+         m_hinge_param = 14
+
+         dsbz2_name = sentenceObj.sent_id + '.ds.bz2'
+         (nodelist_correct, conflicts_Dict_correct, featVMat_correct, nodelist_to_correct_mapping,\
+          nodelist, conflicts_Dict, featVMat) = open_dsbz2(bz2_input_folder + dsbz2_name)
+         # Train for large graphs separately
+         # if len(nodelist) < 40:
+         #     return
+
+         """ FORM MAXIMUM(ENERGY) SPANNING TREE OF THE GOLDEN GRAPH : WORST GOLD STRUCTURE """
+         WScalarMat_correct = Get_W_Scalar_Matrix_from_FeatVect_Matrix(featVMat_correct, nodelist_correct,\
+             conflicts_Dict_correct, self.neuralnet)
+         source = 0
+         """ Find the max spanning tree : negative Weight matrix passed """
+         # (min_st_gold_ndict, min_st_adj_gold_small, _) =\
+         #     MST(nodelist_correct, -WScalarMat_correct, conflicts_Dict_correct, source)
+         (min_st_gold_ndict, min_st_adj_gold_small, _) =\
+             MST(nodelist_correct, WScalarMat_correct, conflicts_Dict_correct, source)
+         energy_gold_max_ST = np.sum(WScalarMat_correct[min_st_adj_gold_small])
+
+         """ Convert correct spanning tree graph adj matrix to full matrix dimensions """
+         """ Create full-size adjacency matrix for correct_mst_small """
+         nodelen = len(nodelist)
+         min_st_adj_gold = np.zeros((nodelen, nodelen), dtype=bool)  # T_STAR
+         for i in range(min_st_adj_gold_small.shape[0]):
+             for j in range(min_st_adj_gold_small.shape[1]):
+                 min_st_adj_gold[nodelist_to_correct_mapping[i], nodelist_to_correct_mapping[j]] =\
+                     min_st_adj_gold_small[i, j]
+
+         """ Delta(Margin) Function : MASK FOR WHICH NODES IN NODELIST BELONG TO DCS """
+         gold_nodes_mask = np.array([False]*len(nodelist))
+         gold_nodes_mask[list(nodelist_to_correct_mapping.values())] = True
+         margin_f = lambda nodes_mask: np.sum(nodes_mask&(~gold_nodes_mask))**2
+
+         """ FOR ALL POSSIBLE MST FROM THE COMPLETE GRAPH """
+         WScalarMat = Get_W_Scalar_Matrix_from_FeatVect_Matrix(featVMat, nodelist, conflicts_Dict, self.neuralnet)
+
+         """ For each node - Find MST with that source """
+         min_STx = None  # Min Energy spanning tree with worst margin with gold_STx
+         min_marginalized_energy = np.inf
+
+         # Generate random set of nodes from which mSTs are to be considered
+         n_nodes = len(nodelist)
+         selection_prob = 0.4
+         select_flag = np.random.rand(n_nodes) < selection_prob
+         # Fix if all zeros
+         if np.sum(select_flag) == 0:
+             select_flag[np.random.randint(n_nodes)] = 1
+
+         best_node_diff = np.Inf
+         best_energy = np.Inf
+         for source in range(len(nodelist)):
+             (mst_nodes, mst_adj_graph, mst_nodes_bool) = MST(nodelist, WScalarMat, conflicts_Dict, source)  # T_X
+             # Calculate energy of spanning tree
+             en_st = np.sum(WScalarMat[mst_adj_graph])
+
+             # Pick up the node_diff with lowest energy
+             delta_st = margin_f(mst_nodes_bool)
+
+             if _debug:
+                 if best_energy > en_st:
+                     best_node_diff = delta_st
+                     best_energy = en_st
+
+             # Minimum marginalized energy calculation
+             marginalized_en = en_st - delta_st
+             # Minimum marginalized spanning tree : Randomization applied
+             # if marginalized_en < min_marginalized_energy and select_flag[source]:
+             if marginalized_en < min_marginalized_energy:
+                 min_marginalized_energy = marginalized_en
+                 min_STx = mst_adj_graph
+             # Energy diff should all be negative
+             if _debug:
+                 print('Source: [{}], Node_Diff:{}, Max_Gold_En: {:.3f}, Energy: {:.3f}'.\
+                     format(source, np.sum((~gold_nodes_mask)&mst_nodes_bool), energy_gold_max_ST, np.sum(WScalarMat[mst_adj_graph])))
+
+
+         if _debug:
+             print('Best Node diff: {} with EN: {}'.format(np.sqrt(best_node_diff), best_energy))
+         """ Gradient Descent """
+         # LOSS TYPES -> hinge(0), log-loss(1), square-exponential(2)
+         Total_Loss, dLdOut, doBpp = self.CalculateLoss_n_Grads(WScalarMat, min_STx, min_st_adj_gold,\
+             loss_type = 0, min_marginalized_energy = min_marginalized_energy)
+         if doBpp:
+             if _debug:
+                 print('{}. '.format(sentenceObj.sent_id), end = '')
+             self.neuralnet.Back_Prop(dLdOut, len(nodelist), featVMat, _debug)
+         else:
+             trainingStatus[sentenceObj.sent_id] = True
+         if _debug:
+             print("\nFileKey: %s, Loss: %6.3f" % (sentenceObj.sent_id, Total_Loss))
+
595
+ TrainFiles = None
596
+ trainer = None
597
+ p_name = ''
598
+ odir = ''
599
+ def InitModule():
600
+ global trainer
601
+ trainer = Trainer()
602
+
603
+ def register_nnet(nnet, bz2_input_folder):
604
+ if not os.path.isdir(odir):
605
+ os.mkdir(odir)
606
+ if not os.path.isfile('outputs/nnet_LOGS.csv'):
607
+ with open('outputs/nnet_LOGS.csv', 'a') as fh:
608
+ csv_r = csv.writer(fh)
609
+ csv_r.writerow(['odir', 'p_name', 'hidden_layer_size', '_edge_vector_dim'])
610
+ with open('outputs/nnet_LOGS.csv', 'a') as fh:
611
+ csv_r = csv.writer(fh)
612
+ if nnet.version == 'h1':
613
+ csv_r.writerow([odir, p_name, nnet.n, nnet.d, bz2_input_folder])
614
+ elif nnet.version == 'h2':
615
+ csv_r.writerow([odir, p_name, nnet.h1, nnet.h2, nnet.d, bz2_input_folder])
616
+
617
+ """
618
+ ################################################################################################
619
+ ################################################################################################
620
+ ################################################################################################
621
+ """
622
+ def main():
623
+ global TrainFiles, p_name, odir
624
+ """
625
+ ################################################################################################
626
+ ############################## GET A FILENAME TO SAVE WEIGHTS ################################
627
+ ################################################################################################
628
+ """
629
+ st = str(int((time.time() * 1e6) % 1e13))
630
+ log_name = 'logs/train_nnet_t{}.out'.format(st)
631
+ odir = 'outputs/train_t{}'.format(st)
632
+ p_name = 'outputs/train_t{}/nnet.p'.format(st)
633
+ print('nEURAL nET wILL bE sAVED hERE: ', p_name)
634
+
635
+ # Create Training File List
636
+ excluded_files = []
637
+ with open('inputs/Baseline4_advSample.csv', 'r') as f_handle:
638
+ opener = csv.reader(f_handle)
639
+ for line in opener:
640
+ excluded_files.append(line[1].replace('.p', '.ds.bz2'))
641
+
642
+ # Load Simultaneous files
643
+ print('Loading Large Files')
644
+ loaded_SKT = pickle.load(open('../Simultaneous_CompatSKT_10K.p', 'rb'), encoding=u'utf-8')
645
+ loaded_DCS = pickle.load(open('../Simultaneous_DCS_10K.p', 'rb'), encoding=u'utf-8')
646
+
647
+ # loaded_SKT = pickle.load(open('../Simultaneous_CompatSKT.p', 'rb'), encoding=u'utf-8')
648
+ # loaded_DCS = pickle.load(open('../Simultaneous_DCS.p', 'rb'), encoding=u'utf-8')
649
+
650
+ bz2_input_folder = '../NewData/skt_dcs_DS.bz2_1L_bigram_10K/'
651
+ # bz2_input_folder = '/home/rs/15CS91R05/vishnu/Data/skt_dcs_DS.bz2_compat_10k_check_again/'
652
+ all_files = []
653
+ skipped = 0
654
+ for f in os.listdir(bz2_input_folder):
655
+ if '.ds.bz2' in f:
656
+ if f in excluded_files:
657
+ skipped += 1
658
+ continue
659
+ if f.replace('.ds.bz2', '.p2') not in loaded_DCS:
660
+ print('Couldnt find ', f)
661
+ continue
662
+ all_files.append(f)
663
+
664
+ print(skipped, 'files will not be used for training')
665
+ print('Size of training set:', len(all_files))
666
+
667
+ TrainFiles = all_files
668
+
669
+ InitModule()
670
+ trainingStatus = defaultdict(lambda: bool(False))
671
+ # train = train_generator(loaded_SKT, loaded_DCS, bz2_input_folder, n_trainset = -1, filePerBatch = 10, iterationPerBatch = 5, _debug=False, superEpochs = 5)
672
+ train = train_generator(loaded_SKT, loaded_DCS, bz2_input_folder, n_trainset = -1, filePerBatch = 20, iterationPerBatch = 3, _debug=False, superEpochs = 2)
673
+
674
+ # Complete Training
675
+ train.__next__()
676
+
677
+ print('Training Complete')
678
+
679
+ if __name__ == '__main__':
680
+ main()
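The loop above performs loss-augmented inference: among candidate spanning trees it keeps the one minimizing E(T) minus the margin Δ(T), and compares it against the gold tree's energy. As a reading aid, here is a minimal sketch of the margin-rescaled hinge loss that `loss_type = 0` plausibly corresponds to; the function name and signature are illustrative, not the actual `CalculateLoss_n_Grads` API.

```python
import numpy as np

def hinge_loss_sketch(WScalarMat, pred_adj, gold_adj, margin):
    """Margin-rescaled hinge loss, a minimal sketch of what loss_type=0 in
    CalculateLoss_n_Grads plausibly computes (the real signature differs)."""
    energy_gold = np.sum(WScalarMat[gold_adj])   # E(T*), as in the loop above
    energy_pred = np.sum(WScalarMat[pred_adj])   # E(T_x)
    # The gold tree should beat the margin-augmented prediction
    loss = max(0.0, energy_gold - (energy_pred - margin))
    # Subgradient w.r.t. each edge energy: +1 on gold edges, -1 on predicted
    # edges while the hinge is active, 0 otherwise
    dL_dW = np.zeros_like(WScalarMat)
    if loss > 0:
        dL_dW[gold_adj] += 1.0
        dL_dW[pred_adj] -= 1.0
    return loss, dL_dW

# Toy usage: 3 nodes, gold tree {0->1, 0->2}, predicted tree {1->0, 1->2}
W = np.array([[0., 2., 3.], [1., 0., 5.], [4., 2., 0.]])
gold = np.zeros((3, 3), dtype=bool); gold[0, 1] = gold[0, 2] = True
pred = np.zeros((3, 3), dtype=bool); pred[1, 0] = pred[1, 2] = True
loss, grad = hinge_loss_sketch(W, pred, gold, margin=2.0)
print(loss)   # 1.0: gold energy 5.0 vs predicted energy 6.0 with margin 2.0
```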
dir/__pycache__/TestPool_Unit_clique.cpython-36.pyc ADDED
Binary file (1.45 kB).
 
dir/__pycache__/heap_n_clique.cpython-36.pyc ADDED
Binary file (6.34 kB).
 
dir/bronclique.py ADDED
@@ -0,0 +1,362 @@
import numpy as np
from collections import defaultdict  # used by MST/clique below
from word_definite import *
import math

def Parent(i):
    return max(0, math.floor((i - 1)/2))

def Left(i):
    return 2*i + 1

def Right(i):
    return 2*(i + 1)


"""
################################################################################################
########################  NOMINAL NODE CLASS REQUIRED FOR USING  ##############################
#########################  WITH THE HEAP DATA STRUCTURE  ######################################
################################################################################################
"""

class Node:
    def __init__(self, id, dist):
        self.dist = dist
        self.id = id
        self.isConflicted = False
        self.src = -1

"""
################################################################################################
############################  IMPLEMENTATION OF HEAP  #########################################
################################################################################################
"""

class Heap:
    # It's a min-heap
    # Nodes are of type Word_definite
    def __init__(self, nodeList):
        self.nodeList = [n for n in nodeList]
        self.len = len(nodeList)
        self.idLocator = {}
        for i in range(self.len):
            self.idLocator[nodeList[i].id] = i
        self.Build()

    def Exchange(self, i, j):
        t = self.nodeList[i]
        self.nodeList[i] = self.nodeList[j]
        self.nodeList[j] = t
        self.idLocator[self.nodeList[i].id] = i
        self.idLocator[self.nodeList[j].id] = j

    def Decrease_Key(self, node, newDist, src):
        if node.isConflicted:
            return
        i = self.idLocator[node.id]
        if newDist > node.dist:
            # relaxation not possible
            return
        else:
            node.dist = newDist
            node.src = src
            parent = Parent(i)
            while ((i > 0) and (self.nodeList[parent].dist > self.nodeList[i].dist)):
                self.Exchange(i, parent)
                i = parent
                parent = Parent(i)

    def Pop(self):
        if(self.len == 0):
            return None
        if(self.nodeList[0].isConflicted):
            # print("Pop has seen conflict!!!")
            return None

        # Remove the entry from the top of the heap
        nMin = self.nodeList[0]
        self.idLocator[self.nodeList[0].id] = -1

        # Put the last node on top of the heap and heapify
        self.nodeList[0] = self.nodeList[self.len - 1]
        self.idLocator[self.nodeList[0].id] = 0
        self.len -= 1
        self.Min_Heapify(0)
        return nMin

    def Min_Heapify(self, i):
        nMin = self.nodeList[i]
        li = Left(i)
        if(li < self.len):
            if(self.nodeList[li].dist < nMin.dist):
                nMin = self.nodeList[li]
                min_i = li
        ri = Right(i)
        if(ri < self.len):
            if(self.nodeList[ri].dist < nMin.dist):
                nMin = self.nodeList[ri]
                min_i = ri
        if(nMin.id != self.nodeList[i].id):
            self.Exchange(i, min_i)
            self.Min_Heapify(min_i)

    def Delete(self, node):
        i = self.idLocator[node.id]
        self.nodeList[i].isConflicted = True
        self.nodeList[i].dist = np.inf
        self.Min_Heapify(i)

    def Build(self):
        self.len = len(self.nodeList)
        # Heapify bottom-up: start from the last internal node and move towards the root
        for i in reversed(range(int(Parent(self.len - 1)) + 1)):
            self.Min_Heapify(i)

    def Print(self):
        i = 0
        level = 1
        ilimit = 0
        while(i < self.len):
            print('N(%d, %2.1f)' % (self.nodeList[i].id, self.nodeList[i].dist), end=' ')
            i += 1
            if(i > ilimit):
                print('\n')
                level *= 2
                ilimit += level

"""
################################################################################################
######################  IMPLEMENTATION OF PRIM'S ALGO FOR FINDING MST  ########################
#############################  USES HEAP DEFINED ABOVE  #######################################
################################################################################################
"""
def MST(nodelist, WScalarMat, conflicts_Dict, source):
    # WTF Dude!!! This function should not be used... It is running Prim's on a directed graph!!!
    # Doesn't return MST
    mst_adj_graph = np.zeros(WScalarMat.shape, dtype=bool)
    # print(len(nodelist))
    # Reset nodes and put ids
    for id in range(len(nodelist)):
        nodelist[id].id = id
        nodelist[id].dist = np.inf
        nodelist[id].isConflicted = False
        nodelist[id].src = -1

    # Initialize graph and min-heap
    nodelist[source].dist = 0
    for neighbour in range(len(nodelist)):
        if neighbour != source:
            nodelist[neighbour].dist = WScalarMat[source][neighbour]
            nodelist[neighbour].src = source
    h = Heap(nodelist)

    mst_nodes = defaultdict(lambda: [])
    mst_nodes_bool = np.array([False]*len(nodelist))
    # Run MST only until the first conflicting node is seen
    # A conflicting node will have np.inf as dist
    while True:
        nextNode = h.Pop()
        if nextNode is None:
            break
        print("next-id:" + str(nextNode.id))
        print('picked by ' + str(nodelist[nextNode.id].dist))
        print()
        # print(nextNode.src, nextNode.id, nextNode)
        mst_nodes_bool[nextNode.id] = True
        mst_nodes[nextNode.chunk_id].append(nextNode)

        if nextNode.src != -1:
            mst_adj_graph[nextNode.src, nextNode.id] = True
            # mst_adj_graph[nextNode.id, nextNode.src] = True
        nid = nextNode.id
        for conId in conflicts_Dict[nid]:
            h.Delete(nodelist[conId])
        for neighbour in range(len(nodelist)):
            if neighbour != nextNode.id:
                print(WScalarMat[nextNode.id][neighbour])
                print(nodelist[neighbour].dist)
                h.Decrease_Key(nodelist[neighbour], WScalarMat[nextNode.id][neighbour], nextNode.id)

    print(mst_nodes_bool)
    print('#'*30)
    mst_nodes = dict(mst_nodes)

    return (mst_nodes, mst_adj_graph, mst_nodes_bool)

def clique(nodelist, WScalarMat, conflicts_Dict, source):
    # Greedy variant of the routine above: grows a set of mutually compatible
    # nodes, ranking each candidate by the sum of its edge weights to all
    # members added so far, and finally returns the member set as a clique.
    mst_adj_graph = np.zeros(WScalarMat.shape, dtype=bool)
    # Reset nodes and put ids
    for id in range(len(nodelist)):
        nodelist[id].id = id
        nodelist[id].dist = np.inf
        nodelist[id].isConflicted = False
        nodelist[id].src = -1
    # Initialize graph
    nodelist[source].dist = 0

    nodeset = set()
    for neighbour in range(len(nodelist)):
        if neighbour != source:
            nodelist[neighbour].dist = WScalarMat[source][neighbour]
            nodelist[neighbour].src = source
            # nodeset.add((nodelist[neighbour].dist, neighbour))

    # nodeset = sorted(nodeset)
    nodeset.add((0, source))
    nodesadded = []
    nodesavailable = np.zeros(len(nodelist), dtype=int)  # 0 if available, 1 if not available

    mst_nodes = defaultdict(lambda: [])
    mst_nodes_bool = np.array([False]*len(nodelist))

    it = 0
    nextNode = -1
    while True:
        it += 1
        if(it > 1000):
            break
        if(len(nodeset) == 0):
            break
        # Take the candidate with the smallest accumulated edge weight
        nextNode = next(iter(nodeset))
        nextNode = nodelist[nextNode[1]]

        nodesavailable[nextNode.id] = 1
        if nextNode is None:
            break
        mst_nodes_bool[nextNode.id] = True
        mst_nodes[nextNode.chunk_id].append(nextNode)

        nodeset = set()

        if nextNode.src != -1:
            mst_adj_graph[nextNode.src, nextNode.id] = True
            # mst_adj_graph[nextNode.id, nextNode.src] = True

        nid = nextNode.id
        nodesadded.append(nid)
        for conId in conflicts_Dict[nid]:
            # Conflicting nodes can no longer join the clique
            nodesavailable[conId] = 1

        for neighbour in range(len(nodelist)):
            if(nodesavailable[neighbour] == 1):
                continue
            if neighbour != nextNode.id:
                # Candidate weight = sum of edge weights to every member added so far
                edgewt = 0
                for nodepresent in nodesadded:
                    edgewt += WScalarMat[nodepresent][neighbour]
                nodeset.add((edgewt, neighbour))

        nodeset = sorted(nodeset)

    mst_nodes = dict(mst_nodes)
    if(it > 1000):
        print('!!!!*10')
    # Mark every pair of selected nodes as connected (clique adjacency)
    for i in range(len(mst_nodes_bool)):
        for j in range(len(mst_nodes_bool)):
            if(i == j):
                continue
            if(mst_nodes_bool[i] and mst_nodes_bool[j]):
                mst_adj_graph[i][j] = True
                mst_adj_graph[j][i] = True

    return (mst_nodes, mst_adj_graph, mst_nodes_bool)


def bron(R, P, X, nodelist, conflicts_Dict, level):
    # Bron-Kerbosch style recursion: enumerates all maximal sets of mutually
    # non-conflicting nodes; conflicts_Dict supplies the excluded pairs
    L = []
    if(len(P) == 0 and len(X) == 0):
        L.append(R)
        return L
    Pit = P.copy()
    for v in Pit:
        R1 = R.copy()
        P1 = P.copy()
        X1 = X.copy()
        R1.add(v)
        for i in conflicts_Dict[v]:
            if(i in P1):
                P1.remove(i)
            if(i in X1):
                X1.remove(i)
        if(v in P1):
            P1.remove(v)
        if(v in X1):
            X1.remove(v)
        G = bron(R1, P1, X1, nodelist, conflicts_Dict, level + 1)
        if(v in P):
            P.remove(v)
        X.add(v)
        for i in G:
            L.append(i)
    return L

def RandomST_GoldOnly(nodelist, WScalarMat, conflicts_Dict, source):
    (mst_nodes, mst_adj_graph, mst_nodes_bool) = MST(nodelist, WScalarMat, conflicts_Dict, source)

    mst_adj_graph = np.zeros_like(mst_adj_graph)
    nodelen = len(nodelist)

    ## Random mst_adj_graph
    free_set = list(range(nodelen))
    full_set = list(range(nodelen))
    st_set = []
    start_node = np.random.randint(nodelen)
    st_set.append(start_node)
    free_set.remove(start_node)
    for x in range(nodelen - 1):
        a = st_set[np.random.randint(len(st_set))]
        b = free_set[np.random.randint(len(free_set))]
        if b not in st_set:
            st_set.append(b)
            free_set.remove(b)
            mst_adj_graph[a, b] = 1
            # mst_adj_graph[b, a] = 1  # Directed spanning tree
    return (mst_nodes, mst_adj_graph, mst_nodes_bool)


def GetMSTWeight(mst_adj_graph, WScalarMat):
    return np.sum(WScalarMat[mst_adj_graph])
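`bron` above is a Bron-Kerbosch style recursion, except that `conflicts_Dict` supplies the edges to exclude: it enumerates all maximal sets of mutually non-conflicting nodes (maximal independent sets of the conflict graph). A toy sketch of how it behaves, assuming the module's own imports (`word_definite` etc.) resolve inside this repo:

```python
from bronclique import bron

# Toy conflict graph: node 0 conflicts with node 1; node 2 conflicts with nobody
conflicts_Dict = {0: [1], 1: [0], 2: []}

# R = current set, P = candidate nodes, X = already-processed nodes
maximal_sets = bron(set(), {0, 1, 2}, set(), None, conflicts_Dict, 0)
print(maximal_sets)   # [{0, 2}, {1, 2}]: every maximal mutually-compatible set
```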
dir/bucket_by_conflicting_nodes_0.py ADDED
@@ -0,0 +1,85 @@
import pickle
import os
import bz2


def harmonic(P, R):
    """For Calculation of F-Score since it is the HM of P and R"""
    return(2 * P * R / float(P + R))
# Test on a couple of files
base_path_csv = '/home/rs/15CS91R05/gaurav/myTryouts/init_results/prediction_csvs/'
base_path_bz2 = '/home/rs/15CS91R05/Bishal/NewData/skt_dcs_DS.bz2_1L_bigram_heldout_dev/'

pred_csvs = os.listdir(base_path_csv)


"""Task 5: See data from number of conflicts
Approach: Select a node from DCS, take count of conflicting nodes using the conflictsDict_correct
"""

# Function to open bz2 files (that contain both DCS & SKT info)


def open_dsbz2(filename):
    with bz2.BZ2File(filename, 'r') as f:
        loader = pickle.load(f)

    conflicts_Dict_correct = loader['conflicts_Dict_correct']
    nodelist_to_correct_mapping = loader['nodelist_to_correct_mapping']
    nodelist_correct = loader['nodelist_correct']
    featVMat_correct = loader['featVMat_correct']
    featVMat = loader['featVMat']
    conflicts_Dict = loader['conflicts_Dict']
    nodelist = loader['nodelist']

    return (nodelist_correct, conflicts_Dict_correct, featVMat_correct, nodelist_to_correct_mapping,
            nodelist, conflicts_Dict, featVMat)

bucket_by_conflicting_nodes = {}
num_conflicting_nodes = set()
csv = open(base_path_csv + pred_csvs[0], 'r').readlines()

for line in range(0, len(csv), 6):
    head_line = csv[line].strip().split(',')
    fname = head_line[0]
    print("Bz2 File number", fname, line / 6)

    (nodelist_correct, conflicts_Dict_correct, featVMat_correct, nodelist_to_correct_mapping,
     nodelist, conflicts_Dict, featVMat) = open_dsbz2(base_path_bz2 + fname + '.ds.bz2')

    assert len(nodelist_correct) == len(conflicts_Dict_correct)
    for node in conflicts_Dict_correct:
        lemma = nodelist_correct[node].lemma
        conflicting_nodes_count = len(conflicts_Dict_correct[node])
        if conflicting_nodes_count not in num_conflicting_nodes:
            num_conflicting_nodes.add(conflicting_nodes_count)
            bucket_by_conflicting_nodes[conflicting_nodes_count] = {'lemmas': set(), 'precision': [0, 0], 'recall': [0, 0]}

        bucket_by_conflicting_nodes[conflicting_nodes_count]['lemmas'].add(lemma)

        data = csv[line + 5].strip().split(',')
        word_recall = float(data[1]) / float(data[3])
        lemma_recall = float(data[2]) / float(data[3])
        word_precision = float(data[1]) / float(data[4])
        lemma_precision = float(data[2]) / float(data[4])

        bucket_by_conflicting_nodes[conflicting_nodes_count]['recall'][0] += word_recall
        bucket_by_conflicting_nodes[conflicting_nodes_count]['recall'][1] += lemma_recall
        bucket_by_conflicting_nodes[conflicting_nodes_count]['precision'][0] += word_precision
        bucket_by_conflicting_nodes[conflicting_nodes_count]['precision'][1] += lemma_precision

# for conflicting_count in bucket_by_conflicting_nodes:

#     # Average P & R
#     bucket_by_conflicting_nodes[conflicting_count]['precision'][0] /= bucket_by_conflicting_nodes[conflicting_count]['num_lemmas']
#     bucket_by_conflicting_nodes[conflicting_count]['precision'][1] /= bucket_by_conflicting_nodes[conflicting_count]['num_lemmas']
#     bucket_by_conflicting_nodes[conflicting_count]['recall'][0] /= bucket_by_conflicting_nodes[conflicting_count]['num_lemmas']
#     bucket_by_conflicting_nodes[conflicting_count]['recall'][1] /= bucket_by_conflicting_nodes[conflicting_count]['num_lemmas']
#
#     # Find F-Score
#     wrd_fscore = harmonic(bucket_by_conflicting_nodes[conflicting_count]['precision'][0], bucket_by_conflicting_nodes[conflicting_count]['recall'][0])
#     lma_fscore = harmonic(bucket_by_conflicting_nodes[conflicting_count]['precision'][1], bucket_by_conflicting_nodes[conflicting_count]['recall'][1])
#     bucket_by_conflicting_nodes[conflicting_count]['fscore'] = [wrd_fscore, lma_fscore]

with open('final_task_gaurav/bucket_by_conflicting_nodes_0.p', 'wb') as f:
    pickle.dump(bucket_by_conflicting_nodes, f)
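The script stores running sums of per-sentence precision/recall in each conflict-count bucket; the commented-out averaging block above never ran because it divides by a `num_lemmas` field that is never written. Below is a hedged sketch of how the pickled buckets could be summarized afterwards. The observation count `n` is an assumption you would need to track separately while bucketing; the lemma-set size is used here only as a stand-in.

```python
import pickle

def harmonic(P, R):
    """F-score as the harmonic mean of P and R (guarded against P = R = 0)."""
    return 2 * P * R / float(P + R) if (P + R) else 0.0

def summarize(bucket, n):
    word_P, lemma_P = bucket['precision'][0] / n, bucket['precision'][1] / n
    word_R, lemma_R = bucket['recall'][0] / n, bucket['recall'][1] / n
    return {'word_F': harmonic(word_P, word_R), 'lemma_F': harmonic(lemma_P, lemma_R)}

with open('final_task_gaurav/bucket_by_conflicting_nodes_0.p', 'rb') as f:
    buckets = pickle.load(f)
for k in sorted(buckets):
    # Stand-in denominator: number of distinct lemmas seen in this bucket
    print(k, summarize(buckets[k], n=max(1, len(buckets[k]['lemmas']))))
```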
dir/bucket_by_conflicting_nodes_1.py ADDED
@@ -0,0 +1,72 @@
import pickle
import os
import bz2


def harmonic(P, R):
    """For Calculation of F-Score since it is the HM of P and R"""
    return(2 * P * R / float(P + R))
# Test on a couple of files
base_path_csv = '/home/rs/15CS91R05/gaurav/myTryouts/init_results/prediction_csvs/'
base_path_bz2 = '/home/rs/15CS91R05/Bishal/NewData/skt_dcs_DS.bz2_1L_bigram_heldout_dev/'

pred_csvs = os.listdir(base_path_csv)


"""Task 5: See data from number of conflicts
Approach: Select a node from DCS, take count of conflicting nodes using the conflictsDict_correct
"""

# Function to open bz2 files (that contain both DCS & SKT info)


def open_dsbz2(filename):
    with bz2.BZ2File(filename, 'r') as f:
        loader = pickle.load(f)

    conflicts_Dict_correct = loader['conflicts_Dict_correct']
    nodelist_to_correct_mapping = loader['nodelist_to_correct_mapping']
    nodelist_correct = loader['nodelist_correct']
    featVMat_correct = loader['featVMat_correct']
    featVMat = loader['featVMat']
    conflicts_Dict = loader['conflicts_Dict']
    nodelist = loader['nodelist']

    return (nodelist_correct, conflicts_Dict_correct, featVMat_correct, nodelist_to_correct_mapping,
            nodelist, conflicts_Dict, featVMat)

bucket_by_conflicting_nodes = {}
num_conflicting_nodes = set()
csv = open(base_path_csv + pred_csvs[1], 'r').readlines()

for line in range(0, len(csv), 6):
    head_line = csv[line].strip().split(',')
    fname = head_line[0]
    print("Bz2 File number", fname, line / 6)

    (nodelist_correct, conflicts_Dict_correct, featVMat_correct, nodelist_to_correct_mapping,
     nodelist, conflicts_Dict, featVMat) = open_dsbz2(base_path_bz2 + fname + '.ds.bz2')

    assert len(nodelist_correct) == len(conflicts_Dict_correct)
    for node in conflicts_Dict_correct:
        lemma = nodelist_correct[node].lemma
        conflicting_nodes_count = len(conflicts_Dict_correct[node])
        if conflicting_nodes_count not in num_conflicting_nodes:
            num_conflicting_nodes.add(conflicting_nodes_count)
            bucket_by_conflicting_nodes[conflicting_nodes_count] = {'lemmas': set(), 'precision': [0, 0], 'recall': [0, 0]}

        bucket_by_conflicting_nodes[conflicting_nodes_count]['lemmas'].add(lemma)

        data = csv[line + 5].strip().split(',')
        word_recall = float(data[1]) / float(data[3])
        lemma_recall = float(data[2]) / float(data[3])
        word_precision = float(data[1]) / float(data[4])
        lemma_precision = float(data[2]) / float(data[4])

        bucket_by_conflicting_nodes[conflicting_nodes_count]['recall'][0] += word_recall
        bucket_by_conflicting_nodes[conflicting_nodes_count]['recall'][1] += lemma_recall
        bucket_by_conflicting_nodes[conflicting_nodes_count]['precision'][0] += word_precision
        bucket_by_conflicting_nodes[conflicting_nodes_count]['precision'][1] += lemma_precision

with open('final_task_gaurav/bucket_by_conflicting_nodes_1.p', 'wb') as f:
    pickle.dump(bucket_by_conflicting_nodes, f)
dir/bucket_by_conflicting_nodes_2.py ADDED
@@ -0,0 +1,73 @@
import pickle
import os
import bz2


def harmonic(P, R):
    """For Calculation of F-Score since it is the HM of P and R"""
    return(2 * P * R / float(P + R))
# Test on a couple of files
base_path_csv = '/home/rs/15CS91R05/gaurav/myTryouts/init_results/prediction_csvs/'
base_path_bz2 = '/home/rs/15CS91R05/Bishal/NewData/skt_dcs_DS.bz2_1L_bigram_heldout_dev/'

pred_csvs = os.listdir(base_path_csv)


"""Task 5: See data from number of conflicts
Approach: Select a node from DCS, take count of conflicting nodes using the conflictsDict_correct
"""

# Function to open bz2 files (that contain both DCS & SKT info)


def open_dsbz2(filename):
    with bz2.BZ2File(filename, 'r') as f:
        loader = pickle.load(f)

    conflicts_Dict_correct = loader['conflicts_Dict_correct']
    nodelist_to_correct_mapping = loader['nodelist_to_correct_mapping']
    nodelist_correct = loader['nodelist_correct']
    featVMat_correct = loader['featVMat_correct']
    featVMat = loader['featVMat']
    conflicts_Dict = loader['conflicts_Dict']
    nodelist = loader['nodelist']

    return (nodelist_correct, conflicts_Dict_correct, featVMat_correct, nodelist_to_correct_mapping,
            nodelist, conflicts_Dict, featVMat)

bucket_by_conflicting_nodes = {}
num_conflicting_nodes = set()
csv = open(base_path_csv + pred_csvs[2], 'r').readlines()

for line in range(0, len(csv), 6):
    head_line = csv[line].strip().split(',')
    fname = head_line[0]
    print("Bz2 File number", fname, line / 6)

    (nodelist_correct, conflicts_Dict_correct, featVMat_correct, nodelist_to_correct_mapping,
     nodelist, conflicts_Dict, featVMat) = open_dsbz2(base_path_bz2 + fname + '.ds.bz2')

    assert len(nodelist_correct) == len(conflicts_Dict_correct)
    for node in conflicts_Dict_correct:
        lemma = nodelist_correct[node].lemma
        conflicting_nodes_count = len(conflicts_Dict_correct[node])
        if conflicting_nodes_count not in num_conflicting_nodes:
            num_conflicting_nodes.add(conflicting_nodes_count)
            bucket_by_conflicting_nodes[conflicting_nodes_count] = {'lemmas': set(), 'precision': [0, 0], 'recall': [0, 0]}

        bucket_by_conflicting_nodes[conflicting_nodes_count]['lemmas'].add(lemma)

        data = csv[line + 5].strip().split(',')
        word_recall = float(data[1]) / float(data[3])
        lemma_recall = float(data[2]) / float(data[3])
        word_precision = float(data[1]) / float(data[4])
        lemma_precision = float(data[2]) / float(data[4])

        bucket_by_conflicting_nodes[conflicting_nodes_count]['recall'][0] += word_recall
        bucket_by_conflicting_nodes[conflicting_nodes_count]['recall'][1] += lemma_recall
        bucket_by_conflicting_nodes[conflicting_nodes_count]['precision'][0] += word_precision
        bucket_by_conflicting_nodes[conflicting_nodes_count]['precision'][1] += lemma_precision


with open('final_task_gaurav/bucket_by_conflicting_nodes_2.p', 'wb') as f:
    pickle.dump(bucket_by_conflicting_nodes, f)
dir/bucket_by_conflicting_nodes_3.py ADDED
@@ -0,0 +1,73 @@
import pickle
import os
import bz2


def harmonic(P, R):
    """For Calculation of F-Score since it is the HM of P and R"""
    return(2 * P * R / float(P + R))
# Test on a couple of files
base_path_csv = '/home/rs/15CS91R05/gaurav/myTryouts/init_results/prediction_csvs/'
base_path_bz2 = '/home/rs/15CS91R05/Bishal/NewData/skt_dcs_DS.bz2_1L_bigram_heldout_dev/'

pred_csvs = os.listdir(base_path_csv)


"""Task 5: See data from number of conflicts
Approach: Select a node from DCS, take count of conflicting nodes using the conflictsDict_correct
"""

# Function to open bz2 files (that contain both DCS & SKT info)


def open_dsbz2(filename):
    with bz2.BZ2File(filename, 'r') as f:
        loader = pickle.load(f)

    conflicts_Dict_correct = loader['conflicts_Dict_correct']
    nodelist_to_correct_mapping = loader['nodelist_to_correct_mapping']
    nodelist_correct = loader['nodelist_correct']
    featVMat_correct = loader['featVMat_correct']
    featVMat = loader['featVMat']
    conflicts_Dict = loader['conflicts_Dict']
    nodelist = loader['nodelist']

    return (nodelist_correct, conflicts_Dict_correct, featVMat_correct, nodelist_to_correct_mapping,
            nodelist, conflicts_Dict, featVMat)

bucket_by_conflicting_nodes = {}
num_conflicting_nodes = set()
csv = open(base_path_csv + pred_csvs[3], 'r').readlines()

for line in range(0, len(csv), 6):
    head_line = csv[line].strip().split(',')
    fname = head_line[0]
    print("Bz2 File number", fname, line / 6)

    (nodelist_correct, conflicts_Dict_correct, featVMat_correct, nodelist_to_correct_mapping,
     nodelist, conflicts_Dict, featVMat) = open_dsbz2(base_path_bz2 + fname + '.ds.bz2')

    assert len(nodelist_correct) == len(conflicts_Dict_correct)
    for node in conflicts_Dict_correct:
        lemma = nodelist_correct[node].lemma
        conflicting_nodes_count = len(conflicts_Dict_correct[node])
        if conflicting_nodes_count not in num_conflicting_nodes:
            num_conflicting_nodes.add(conflicting_nodes_count)
            bucket_by_conflicting_nodes[conflicting_nodes_count] = {'lemmas': set(), 'precision': [0, 0], 'recall': [0, 0]}

        bucket_by_conflicting_nodes[conflicting_nodes_count]['lemmas'].add(lemma)

        data = csv[line + 5].strip().split(',')
        word_recall = float(data[1]) / float(data[3])
        lemma_recall = float(data[2]) / float(data[3])
        word_precision = float(data[1]) / float(data[4])
        lemma_precision = float(data[2]) / float(data[4])

        bucket_by_conflicting_nodes[conflicting_nodes_count]['recall'][0] += word_recall
        bucket_by_conflicting_nodes[conflicting_nodes_count]['recall'][1] += lemma_recall
        bucket_by_conflicting_nodes[conflicting_nodes_count]['precision'][0] += word_precision
        bucket_by_conflicting_nodes[conflicting_nodes_count]['precision'][1] += lemma_precision


with open('final_task_gaurav/bucket_by_conflicting_nodes_3.p', 'wb') as f:
    pickle.dump(bucket_by_conflicting_nodes, f)
dir/bucket_by_conflicting_nodes_4.py ADDED
@@ -0,0 +1,73 @@
import pickle
import os
import bz2


def harmonic(P, R):
    """For Calculation of F-Score since it is the HM of P and R"""
    return(2 * P * R / float(P + R))
# Test on a couple of files
base_path_csv = '/home/rs/15CS91R05/gaurav/myTryouts/init_results/prediction_csvs/'
base_path_bz2 = '/home/rs/15CS91R05/Bishal/NewData/skt_dcs_DS.bz2_1L_bigram_heldout_dev/'

pred_csvs = os.listdir(base_path_csv)


"""Task 5: See data from number of conflicts
Approach: Select a node from DCS, take count of conflicting nodes using the conflictsDict_correct
"""

# Function to open bz2 files (that contain both DCS & SKT info)


def open_dsbz2(filename):
    with bz2.BZ2File(filename, 'r') as f:
        loader = pickle.load(f)

    conflicts_Dict_correct = loader['conflicts_Dict_correct']
    nodelist_to_correct_mapping = loader['nodelist_to_correct_mapping']
    nodelist_correct = loader['nodelist_correct']
    featVMat_correct = loader['featVMat_correct']
    featVMat = loader['featVMat']
    conflicts_Dict = loader['conflicts_Dict']
    nodelist = loader['nodelist']

    return (nodelist_correct, conflicts_Dict_correct, featVMat_correct, nodelist_to_correct_mapping,
            nodelist, conflicts_Dict, featVMat)

bucket_by_conflicting_nodes = {}
num_conflicting_nodes = set()
csv = open(base_path_csv + pred_csvs[4], 'r').readlines()

for line in range(0, len(csv), 6):
    head_line = csv[line].strip().split(',')
    fname = head_line[0]
    print("Bz2 File number", fname, line / 6)

    (nodelist_correct, conflicts_Dict_correct, featVMat_correct, nodelist_to_correct_mapping,
     nodelist, conflicts_Dict, featVMat) = open_dsbz2(base_path_bz2 + fname + '.ds.bz2')

    assert len(nodelist_correct) == len(conflicts_Dict_correct)
    for node in conflicts_Dict_correct:
        lemma = nodelist_correct[node].lemma
        conflicting_nodes_count = len(conflicts_Dict_correct[node])
        if conflicting_nodes_count not in num_conflicting_nodes:
            num_conflicting_nodes.add(conflicting_nodes_count)
            bucket_by_conflicting_nodes[conflicting_nodes_count] = {'lemmas': set(), 'precision': [0, 0], 'recall': [0, 0]}

        bucket_by_conflicting_nodes[conflicting_nodes_count]['lemmas'].add(lemma)

        data = csv[line + 5].strip().split(',')
        word_recall = float(data[1]) / float(data[3])
        lemma_recall = float(data[2]) / float(data[3])
        word_precision = float(data[1]) / float(data[4])
        lemma_precision = float(data[2]) / float(data[4])

        bucket_by_conflicting_nodes[conflicting_nodes_count]['recall'][0] += word_recall
        bucket_by_conflicting_nodes[conflicting_nodes_count]['recall'][1] += lemma_recall
        bucket_by_conflicting_nodes[conflicting_nodes_count]['precision'][0] += word_precision
        bucket_by_conflicting_nodes[conflicting_nodes_count]['precision'][1] += lemma_precision


with open('final_task_gaurav/bucket_by_conflicting_nodes_4.p', 'wb') as f:
    pickle.dump(bucket_by_conflicting_nodes, f)
+ pickle.dump(bucket_by_conflicting_nodes, f)
dir/bz2_counter.py ADDED
@@ -0,0 +1,8 @@
 
 
 
 
 
 
 
 
 
import os

folders = ['skt_dcs_DS.bz2_1L_bigram_mir_Large']

for folder in folders:
    path = os.path.join('../NewData', folder)
    c = len(os.listdir(path))
    print('Folder: {:35s} ------> File_Count: {}\n'.format(folder, c))
dir/cliq.csv ADDED
The diff for this file is too large to render.
 
dir/datainspect.py ADDED
@@ -0,0 +1,37 @@
# from Train_n_Save_NNet import *
import bz2, pickle

def open_dsbz2(filename):
    with bz2.BZ2File(filename, 'r') as f:
        loader = pickle.load(f)

    conflicts_Dict_correct = loader['conflicts_Dict_correct']
    nodelist_to_correct_mapping = loader['nodelist_to_correct_mapping']
    nodelist_correct = loader['nodelist_correct']
    featVMat_correct = loader['featVMat_correct']
    featVMat = loader['featVMat']
    conflicts_Dict = loader['conflicts_Dict']
    nodelist = loader['nodelist']
    print("conflicts_Dict_correct:")
    print(conflicts_Dict_correct)
    print("nodelist_to_correct_mapping: ")
    print(nodelist_to_correct_mapping)

    print("nodelist_correct")
    nc0 = nodelist_correct[0]
    print(type(nc0))
    print(nodelist_correct[0])

    print("featVMat_correct")
    print(featVMat_correct[0][1][0])

    print("featVMat")
    print(featVMat[0][0])

    print("conflicts_Dict")
    print(conflicts_Dict)

    print("nodelist")
    print(nodelist)
    # return (nodelist_correct, conflicts_Dict_correct, featVMat_correct, nodelist_to_correct_mapping, nodelist, conflicts_Dict, featVMat)

print(open_dsbz2("100004.ds.bz2"))
dir/dcs_skt_bzipper.py ADDED
@@ -0,0 +1,196 @@
## bUILT-iN pACKAGES
import sys, os, csv
import pickle
from collections import defaultdict
import json
import numpy as np
import math
np.set_printoptions(suppress=True)
from IPython.display import display

## lAST sUMMER
from romtoslp import *
from sentences import *
from DCS import *
import MatDB
import time
import bz2
import zlib

## lAST yEAR
# from word_definite import *
# from nnet import *
# from heap_n_PrimMST import *
# from word_definite import *

is10K = False

if is10K:
    loaded_SKT = pickle.load(open('../Simultaneous_CompatSKT_10K.p', 'rb'), encoding=u'utf-8')
    loaded_DCS = pickle.load(open('../Simultaneous_DCS_10K.p', 'rb'), encoding=u'utf-8')
    outFolder = '../NewData/skt_dcs_DS.bz2_10K/'
else:
    loaded_SKT = pickle.load(open('../Simultaneous_CompatSKT.p', 'rb'), encoding=u'utf-8')
    loaded_DCS = pickle.load(open('../Simultaneous_DCS.p', 'rb'), encoding=u'utf-8')
    outFolder = '../NewData/skt_dcs_DS.bz2/'

conversion_file_list = list(loaded_DCS.keys())
outFolder = '../NewData/skt_dcs_DS.bz2_1L_bigram_mir_Large/'

## SPECIAL - HELD-OUT DATASET - OVERWRITES
'''
outFolder = '../NewData/skt_dcs_DS.bz2_1L_bigram_rfe_heldout/'
baseline_filelist = []
with open('inputs/Baseline4_advSample.csv') as f:
    baseline_reader = csv.reader(f)
    for line in baseline_reader:
        baseline_filelist.append(line[1])

conversion_file_list = [f.replace('.p', '.p2') for f in baseline_filelist]
#'''
## SPECIAL CODE ENDS HERE


dataset_4k_1k = pickle.load(open('../SmallDataset_4K_1K.p', 'rb'))
TrainFiles = dataset_4k_1k['TrainFiles']
TestFiles = dataset_4k_1k['TestFiles']

dataset_6k_3k = pickle.load(open('../SmallDataset_6K_3K.p', 'rb'))
TrainFiles_2 = dataset_6k_3k['TrainFiles']
TestFiles_2 = dataset_6k_3k['TestFiles']

matDB = MatDB.MatDB()

# from MatDB import *
import word_definite as WD
from heap_n_PrimMST import *
from nnet import *
"""
################################################################################################
######################  CREATE SEVERAL DATA STRUCTURES FROM SENTENCE/DCS  #####################
###########################  NODELIST, ADJACENCY LIST, GRAPH, HEAP  ###########################
################################################################################################
"""
def GetTrainingKit(sentenceObj, dcsObj):
    nodelist = GetNodes(sentenceObj)

    # Nodelist with only the correct nodes
    nodelist2 = GetNodes(sentenceObj)
    nodelist2_to_correct_mapping = {}
    nodelist_correct = []
    search_key = 0
    first_key = 0
    for chunk_id in range(len(dcsObj.lemmas)):
        while nodelist2[first_key].chunk_id != chunk_id:
            first_key += 1
        for j in range(len(dcsObj.lemmas[chunk_id])):
            search_key = first_key
            while (nodelist2[search_key].lemma != rom_slp(dcsObj.lemmas[chunk_id][j])) or (nodelist2[search_key].cng != dcsObj.cng[chunk_id][j]):
                search_key += 1
                if search_key >= len(nodelist2) or nodelist2[search_key].chunk_id > chunk_id:
                    break
            # print((rom_slp(dcsObj.lemmas[chunk_id][j]), dcsObj.cng[chunk_id][j]))
            # print(nodelist[search_key])
            nodelist2_to_correct_mapping[len(nodelist_correct)] = search_key
            nodelist_correct.append(nodelist2[search_key])
    return (nodelist, nodelist_correct, nodelist2_to_correct_mapping)


def GetGraph(nodelist, neuralnet):
    if not neuralnet.outer_relu:
        conflicts_Dict = Get_Conflicts(nodelist)

        featVMat = Get_Feat_Vec_Matrix(nodelist, conflicts_Dict)

        (WScalarMat, SigmoidGateOutput) = Get_W_Scalar_Matrix_from_FeatVect_Matrix(featVMat, nodelist, conflicts_Dict, neuralnet)
        return (conflicts_Dict, featVMat, WScalarMat, SigmoidGateOutput)
    else:
        conflicts_Dict = Get_Conflicts(nodelist)

        featVMat = Get_Feat_Vec_Matrix(nodelist, conflicts_Dict)

        WScalarMat = Get_W_Scalar_Matrix_from_FeatVect_Matrix(featVMat, nodelist, conflicts_Dict, neuralnet)
        return (conflicts_Dict, featVMat, WScalarMat)

"""
################################################################################################
##############################  GET A FILENAME TO SAVE WEIGHTS  ###############################
################################################################################################
"""
trainingStatus = defaultdict(lambda: bool(False))

class Trainer:
    def __init__(self):
        self.hidden_layer_size = 300
        self._edge_vector_dim = WD._edge_vector_dim
        # self._full_cnglist = list(WD.mat_cngCount_1D)
        self.neuralnet = NN(self._edge_vector_dim, self.hidden_layer_size, outer_relu=True)
        self.history = defaultdict(lambda: list())

    def SaveToMem(self, sentenceObj, dcsObj, _debug=True):

        """ Pre-process DCS and SKT to get all nodes etc. """
        try:
            (nodelist, nodelist_correct, nodelist_to_correct_mapping) = GetTrainingKit(sentenceObj, dcsObj)
        except IndexError as e:
            # print('\x1b[31mError with {} \x1b[0m'.format(sentenceObj.sent_id))
            # print(e)
            return

        # startT = time.time()
        """ DCS (GOLD) FEATURE VECTOR MATRIX """
        conflicts_Dict_correct = Get_Conflicts(nodelist_correct)
        featVMat_correct = Get_Feat_Vec_Matrix(nodelist_correct, conflicts_Dict_correct)

        """ SKT FEATURE VECTOR MATRIX """
        conflicts_Dict = Get_Conflicts(nodelist)
        featVMat = Get_Feat_Vec_Matrix(nodelist, conflicts_Dict)
        # print('Nodelen: {}, Time taken to create: {}'.format(len(nodelist), time.time() - startT))

        with bz2.BZ2File(outFolder + sentenceObj.sent_id + '.ds.bz2', 'w') as f:
            pickle.dump({
                'nodelist': nodelist,
                'nodelist_correct': nodelist_correct,
                'nodelist_to_correct_mapping': nodelist_to_correct_mapping,
                'conflicts_Dict_correct': conflicts_Dict_correct,
                'featVMat_correct': featVMat_correct,
                'conflicts_Dict': conflicts_Dict,
                'featVMat': featVMat
            }, f)

trainer = None
def InitModule(_matDB):
    global WD, trainer
    _edge_vec_dim = 1500
    WD.word_definite_extInit(_matDB, _edge_vec_dim)
    trainer = Trainer()
InitModule(matDB)
trainingStatus = defaultdict(lambda: bool(False))
# trainer.Load('outputs/train_nnet_t427031523027.p')

"""
################################################################################################
##############################  TRAIN FUNCTION  ###############################################
################################################################################################
"""

def save_all_bz2(loaded_SKT, loaded_DCS, n_checkpt=100):
    file_counter = 0
    print('{} files to process'.format(len(conversion_file_list)))
    for fn in conversion_file_list:
        if file_counter % n_checkpt == 0:
            print(file_counter, ' Checkpoint... ')
            sys.stdout.flush()  # Flush IO buffer
        if os.path.isfile(outFolder + fn.replace('.p2', '.ds.bz2')):
            print('Skipping: ', fn)
            continue
        try:
            _ = trainer.SaveToMem(loaded_SKT[fn], loaded_DCS[fn])
        except (IndexError, KeyError) as e:
            pass
        file_counter += 1

if not os.path.isdir(outFolder):
    print('Creating directory: ', outFolder)
    os.mkdir(outFolder)
save_all_bz2(loaded_SKT, loaded_DCS, n_checkpt=100)
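`SaveToMem` writes seven structures into one bz2-compressed pickle per sentence. Below is a small sketch for sanity-checking one archive after a run, assuming `GetNodes` is deterministic so the alignment invariant from `GetTrainingKit` holds; the file name is hypothetical.

```python
import bz2, pickle

# File name is hypothetical; any *.ds.bz2 written to the output folder will do.
path = '../NewData/skt_dcs_DS.bz2_1L_bigram_mir_Large/100004.ds.bz2'
with bz2.BZ2File(path, 'r') as f:
    loader = pickle.load(f)

nodelist = loader['nodelist']
mapping = loader['nodelist_to_correct_mapping']
nodelist_correct = loader['nodelist_correct']

# Invariant from GetTrainingKit: the i-th gold node sits at index mapping[i]
# of the full candidate nodelist (both lists built by GetNodes on the same sentence).
for i, gold_node in enumerate(nodelist_correct):
    assert nodelist[mapping[i]].lemma == gold_node.lemma
    assert nodelist[mapping[i]].cng == gold_node.cng
print('mapping verified for', len(nodelist_correct), 'gold nodes')
```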
dir/evaluate.py ADDED
@@ -0,0 +1,81 @@

# coding: utf-8

# In[1]:

import pandas, sys


# In[7]:
fils = {
    'BM3': ["BM3_NLoss_proc0.csv", "BM3_NLoss_proc2.csv", "BM3_NLoss_proc1.csv", "BM3_NLoss_proc3.csv"],
    'BM2': ["BM2_NLoss_proc0.csv", "BM2_NLoss_proc2.csv", "BM2_NLoss_proc1.csv", "BM2_NLoss_proc3.csv"],
    'BR2': ["BR2_NLoss_proc0.csv", "BR2_NLoss_proc2.csv", "BR2_NLoss_proc1.csv", "BR2_NLoss_proc3.csv"],
    'BR3': ["BR3_NLoss_proc0.csv", "BR3_NLoss_proc2.csv", "BR3_NLoss_proc1.csv", "BR3_NLoss_proc3.csv"],
    'PM2': ["PM2_NLoss_proc0.csv", "PM2_NLoss_proc2.csv", "PM2_NLoss_proc1.csv", "PM2_NLoss_proc3.csv"],
    'PM3': ["PM3_NLoss_proc0.csv", "PM3_NLoss_proc2.csv", "PM3_NLoss_proc1.csv", "PM3_NLoss_proc3.csv"],
    'PR2': ["PR2_NLoss_proc0.csv", "PR2_NLoss_proc2.csv", "PR2_NLoss_proc1.csv", "PR2_NLoss_proc3.csv"],
    'PR3': ["PR3_NLoss_proc0.csv", "PR3_NLoss_proc2.csv", "PR3_NLoss_proc1.csv", "PR3_NLoss_proc3.csv"]
}
from collections import defaultdict

def predLoss(tag):
    gt = defaultdict(dict)

    for item in fils[tag]:
        fil = open('outputs/' + str(item)).read().splitlines()
        for i, line in enumerate(fil):
            # Prediction CSVs come in 6-line blocks per sentence (see the
            # block layout parsed below); line 0 carries the sentence id.
            if i % 6 == 0:
                setCol = line.split(',')
                gt[setCol[0]]['predLemma'] = setCol[1:]
            if i % 6 == 1:
                gt[setCol[0]]['predCNG'] = line.split(',')[1:]
                if len(gt[setCol[0]]['predLemma']) != len(gt[setCol[0]]['predCNG']):
                    print(gt[setCol[0]])
            if i % 6 == 2:
                gt[setCol[0]]['chunkID'] = line.split(',')[1:]
                if len(gt[setCol[0]]['predLemma']) != len(gt[setCol[0]]['chunkID']):
                    print(gt[setCol[0]])
            if i % 6 == 3:
                gt[setCol[0]]['chunkIDCNG'] = line.split(',')[1:]
                if len(gt[setCol[0]]['predLemma']) != len(gt[setCol[0]]['chunkIDCNG']):
                    print(gt[setCol[0]])
            if i % 6 == 4:
                gt[setCol[0]]['idInNodeID'] = line.split(',')[1:]
                if len(gt[setCol[0]]['predLemma']) != len(gt[setCol[0]]['idInNodeID']):
                    print(gt[setCol[0]])
            if i % 6 == 5:
                gt[setCol[0]]['params'] = line.split(',')[1:]

            if line.split(',')[0] != setCol[0]:
                print(i, setCol, line)
                print('breaking')
                break
    return gt

def pdframe(gt):
    params = defaultdict(dict)
    for item in gt.keys():
        tatkal = gt[item]['params']
        params[item]['corrWords'], params[item]['corrLemma'] = int(tatkal[0]), int(tatkal[1])
        params[item]['dcsSize'], params[item]['predictions'] = int(tatkal[2]), int(tatkal[3])
        params[item]['word++Precision'] = params[item]['corrWords']*1.0/params[item]['predictions']
        params[item]['word++Recall'] = params[item]['corrWords']*1.0/params[item]['dcsSize']
        params[item]['wordPrecision'] = params[item]['corrLemma']*1.0/params[item]['predictions']
        params[item]['wordRecall'] = params[item]['corrLemma']*1.0/params[item]['dcsSize']


    initRes = pandas.DataFrame.from_dict(params, orient='index')
    return initRes


# In[8]:

if(len(sys.argv) < 2):
    print("Provide an argument for the feature to be evaluated")

else:
    BM2gt = predLoss(str(sys.argv[1]))
    BM2pd = pdframe(BM2gt)
    print(BM2pd.mean())
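The script is invoked with a feature code, e.g. `python evaluate.py BM2`, and averages four per-sentence counters into precision/recall columns. Here is a worked example with made-up counts showing exactly what `pdframe` computes per sentence:

```python
# Made-up counts: 8 correct (word, CNG) pairs and 9 correct lemmas,
# out of 10 gold words (dcsSize) and 12 predicted words.
corrWords, corrLemma, dcsSize, predictions = 8, 9, 10, 12

word_pp_precision = corrWords / predictions   # 'word++Precision' ~ 0.667
word_pp_recall = corrWords / dcsSize          # 'word++Recall'    = 0.8
word_precision = corrLemma / predictions      # 'wordPrecision'   = 0.75
word_recall = corrLemma / dcsSize             # 'wordRecall'      = 0.9
print(word_pp_precision, word_pp_recall, word_precision, word_recall)
```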
dir/generate_dcs_and_skt_csv.py ADDED
@@ -0,0 +1,66 @@
base_path_bz2 = '/home/rs/15CS91R05/Bishal/NewData/skt_dcs_DS.bz2_1L_bigram_heldout_dev/'

import operator
import bz2
import os
import pickle


# Function to open bz2 files (that contain both DCS & SKT info)
def open_dsbz2(filename):
    with bz2.BZ2File(filename, 'r') as f:
        loader = pickle.load(f)

    conflicts_Dict_correct = loader['conflicts_Dict_correct']
    nodelist_to_correct_mapping = loader['nodelist_to_correct_mapping']
    nodelist_correct = loader['nodelist_correct']
    featVMat_correct = loader['featVMat_correct']
    featVMat = loader['featVMat']
    conflicts_Dict = loader['conflicts_Dict']
    nodelist = loader['nodelist']

    return (nodelist_correct, conflicts_Dict_correct, featVMat_correct, nodelist_to_correct_mapping,
            nodelist, conflicts_Dict, featVMat)


bz2_files = os.listdir(base_path_bz2)
dcs_heldout_csv = ''
skt_heldout_csv = ''
count = 0
for ds in bz2_files:

    (nodelist_correct, conflicts_Dict_correct, featVMat_correct, nodelist_to_correct_mapping,
     nodelist, conflicts_Dict, featVMat) = open_dsbz2(base_path_bz2 + ds)

    fname = ds.replace('.ds.bz2', '')
    print(count, fname)
    count += 1

    lemmas = ''
    cngs = ''

    lemmas_dcs = ''
    cngs_dcs = ''

    for node in nodelist:
        lemmas += ',' + node.lemma
        cngs += ',' + node.cng
    lemmas += '\n'
    cngs += '\n'

    for node in nodelist_correct:
        lemmas_dcs += ',' + node.lemma
        cngs_dcs += ',' + node.cng
    lemmas_dcs += '\n'
    cngs_dcs += '\n'

    entry = fname + lemmas + fname + cngs
    entry_dcs = fname + lemmas_dcs + fname + cngs_dcs

    dcs_heldout_csv += entry_dcs
    skt_heldout_csv += entry

with open("final_task_gaurav/dcs_heldout.csv", "w") as f:
    f.write(dcs_heldout_csv)
with open("final_task_gaurav/skt_heldout.csv", "w") as f:
    f.write(skt_heldout_csv)
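The CSVs above are assembled by repeated string concatenation, and lemmas are written unquoted. A hedged alternative sketch using the `csv` module, which streams rows and handles quoting; `fname_to_nodes` is a hypothetical mapping standing in for the loop variables above:

```python
import csv

# Alternative sketch: stream the same two-rows-per-sentence layout with the
# csv module instead of accumulating one big string. Quoting is then handled
# automatically for any lemma that happens to contain a comma.
def write_rows(path, fname_to_nodes):
    with open(path, 'w', newline='') as f:
        w = csv.writer(f)
        for fname, nodes in fname_to_nodes.items():
            w.writerow([fname] + [n.lemma for n in nodes])  # lemma row
            w.writerow([fname] + [n.cng for n in nodes])    # CNG row
```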
dir/gt2.py ADDED
@@ -0,0 +1,148 @@
import pandas as pd
import numpy as np
import csv, pickle, json, bz2
from romtoslp import *
loaded_DCS = pickle.load(open('../Simultaneous_DCS_ho.p', 'rb'))
folder = '../NewData/skt_dcs_DS.bz2_4K_bigram_mir_heldout/'

def open_dsbz2(filename):
    with bz2.BZ2File(filename, 'r') as f:
        loader = pickle.load(f)

    conflicts_Dict_correct = loader['conflicts_Dict_correct']
    nodelist_to_correct_mapping = loader['nodelist_to_correct_mapping']
    nodelist_correct = loader['nodelist_correct']
    featVMat_correct = loader['featVMat_correct']
    featVMat = loader['featVMat']
    conflicts_Dict = loader['conflicts_Dict']
    nodelist = loader['nodelist']

    return (nodelist_correct, conflicts_Dict_correct, featVMat_correct, nodelist_to_correct_mapping,
            nodelist, conflicts_Dict, featVMat)

# Snippet for forming the ground-truth csv file
with open('groundtruth2.csv', 'w') as fh:
    rd = csv.writer(fh)
    rd.writerow(['File', 'Lemma', 'CNG', 'lemmaCorr', 'lemmaCNGcorr', 'predCNG', 'Conflicts'])
count = 0
for ii in range(4):
    with open("BM2_NLoss_proc" + str(ii) + ".csv", 'r') as fh:
        rd = csv.reader(fh)
        while(True):
            try:
                print(count)
                count += 1
                x = next(rd)  # predicted lemmas
                sentid = x[0]
                dcsobj = loaded_DCS[str(sentid) + '.p2']
                # print(dcsobj.cng)
                # print(dcsobj.lemmas)
                # print(dcsobj.dcs_chunks)
                nodelist_correct, conflicts_Dict_correct, featVMat_correct, nodelist_to_correct_mapping,\
                    nodelist, conflicts_Dict, featVMat = open_dsbz2(folder + str(sentid) + '.ds.bz2')
                dll = 0
                for i in dcsobj.lemmas:
                    dll += len(i)
                if(dll != len(nodelist_correct)):
                    print('here')
                    print(dcsobj.lemmas)
                    print(nodelist_correct)
                gtlemmas = []
                for outerlist in dcsobj.lemmas:
                    for element in outerlist:
                        gtlemmas.append(rom_slp(element))
                pdlemmas = x[1:]

                x = next(rd)  # predicted cngs
                gtcngs = []
                i = 0
                for outerlist in dcsobj.cng:
                    for element in outerlist:
                        gtcngs.append((element, len(conflicts_Dict_correct[i])))
                        i += 1
                pdcngs = x[1:]
                for i in range(4):
                    x = (next(rd))
                pdldict = dict()
                gtldict = dict()
                for i in range(len(gtlemmas)):
                    if(gtlemmas[i] in gtldict):
                        gtldict[gtlemmas[i]].append(gtcngs[i])
                    else:
                        gtldict[gtlemmas[i]] = [gtcngs[i]]

                for i in range(len(pdlemmas)):
                    if(pdlemmas[i] in pdldict):
                        pdldict[pdlemmas[i]].append(pdcngs[i])
                    else:
                        pdldict[pdlemmas[i]] = [pdcngs[i]]

                lemmaround2 = []
                cnground2 = []
                for gtl in gtldict.keys():
                    for gtlcng in gtldict[gtl]:
                        lemmacorr = 0
                        lemmaCNGcorr = 0
                        predictedcng = 'nil'
                        confcount = gtlcng[1]
                        gtlcng = gtlcng[0]
                        if(gtl in pdldict.keys()):
                            if(len(pdldict[gtl]) > 0):
                                if(gtlcng in pdldict[gtl]):
                                    lemmacorr = 1
                                    predictedcng = gtlcng
                                    lemmaCNGcorr = 1
                                    pdldict[gtl].remove(gtlcng)
                                    with open('groundtruth2.csv', 'a') as fh2:
                                        rwd = csv.writer(fh2)
                                        row = [sentid, gtl, gtlcng, lemmacorr, lemmaCNGcorr, gtlcng, confcount]
                                        rwd.writerow(row)
                                else:
                                    lemmaround2.append(gtl)
                                    cnground2.append((gtlcng, confcount))
                            else:
                                with open('groundtruth2.csv', 'a') as fh2:
                                    rwd = csv.writer(fh2)
                                    row = [sentid, gtl, gtlcng, lemmacorr, lemmaCNGcorr, predictedcng, confcount]
                                    rwd.writerow(row)
                        else:
                            with open('groundtruth2.csv', 'a') as fh2:
                                rwd = csv.writer(fh2)
                                row = [sentid, gtl, gtlcng, lemmacorr, lemmaCNGcorr, predictedcng, confcount]
                                rwd.writerow(row)
                # Now all elements with lemmaCNGcorr == 1 are out of the way.
                # Re-iterate over the lemmas which did not get a CNG match but had a lemma match earlier.
                for i in range(len(lemmaround2)):
                    gtl = lemmaround2[i]
                    gtlcng = cnground2[i]
                    confcount = gtlcng[1]
                    gtlcng = gtlcng[0]
                    lemmacorr = 0
                    lemmaCNGcorr = 0
                    predictedcng = 'nil'
                    if(gtl in pdldict.keys()):
                        if(len(pdldict[gtl]) > 0):
                            lemmacorr = 1
                            predictedcng = pdldict[gtl][0]
                            pdldict[gtl].remove(pdldict[gtl][0])
                    with open('groundtruth2.csv', 'a') as fh2:
                        rwd = csv.writer(fh2)
                        row = [sentid, gtl, gtlcng, lemmacorr, lemmaCNGcorr, predictedcng, confcount]
                        rwd.writerow(row)
            except StopIteration:
                # End of this prediction CSV; without this, StopIteration would be
                # swallowed by the generic handler below and loop forever
                break
            except Exception as e:
                print(e)
                print('been there')
                continue
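The two passes above amount to a greedy match between gold and predicted (lemma, CNG) multisets: exact (lemma, CNG) pairs are consumed first, and leftover gold entries may then claim any remaining prediction with the same lemma. A minimal self-contained restatement (a sketch: row order differs slightly from the script, which emits misses already during round 1):

```python
def match(gtldict, pdldict):
    """Two-round matching as in gt2.py; rows are
    (lemma, gold_cng, lemmaCorr, lemmaCNGcorr, predCNG, conflicts)."""
    rows, round2 = [], []
    preds = {l: list(cs) for l, cs in pdldict.items()}   # mutable copy
    for gtl, entries in gtldict.items():
        for cng, conf in entries:
            if gtl in preds and cng in preds[gtl]:
                preds[gtl].remove(cng)
                rows.append((gtl, cng, 1, 1, cng, conf))         # exact match
            else:
                round2.append((gtl, cng, conf))
    for gtl, cng, conf in round2:
        if gtl in preds and preds[gtl]:
            rows.append((gtl, cng, 1, 0, preds[gtl].pop(0), conf))  # lemma-only match
        else:
            rows.append((gtl, cng, 0, 0, 'nil', conf))              # miss
    return rows

print(match({'rAma': [('92', 2), ('71', 1)]}, {'rAma': ['71', '109']}))
# -> [('rAma', '71', 1, 1, '71', 1), ('rAma', '92', 1, 0, '109', 2)]
```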
dir/heap_n_PrimMST.py ADDED
@@ -0,0 +1,201 @@
1
+ import numpy as np
2
+ from word_definite import *
3
+ import math
4
+
5
+ def Parent(i):
6
+ return max(0, math.floor((i - 1)/2))
7
+
8
+ def Left(i):
9
+ return 2*i + 1
10
+
11
+ def Right(i):
12
+ return 2*(i + 1)
13
+
14
+
15
+ """
16
+ ################################################################################################
17
+ ######################## NOMINAL NODE CLASS REQUIRED FOR USING ################################
18
+ ######################### WITH THE HEAP DATA STRUCTURE #######################################
19
+ ################################################################################################
20
+ """
21
+
22
+ class Node:
23
+ def __init__(self, id, dist):
24
+ self.dist = dist
25
+ self.id = id
26
+ self.isConflicted = False
27
+ self.src = -1
28
+
29
+ """
30
+ ################################################################################################
31
+ ############################ IMPLEMENTATION OF HEAP ##########################################
32
+ ################################################################################################
33
+ """
34
+
35
+ class Heap:
36
+ # It's a minHeap
37
+ # Nodes are of type Word_definite
38
+ def __init__(self, nodeList):
39
+ self.nodeList = [n for n in nodeList]
40
+ self.len = len(nodeList)
41
+ self.idLocator = {}
42
+ for i in range(self.len):
43
+ self.idLocator[nodeList[i].id] = i
44
+ self.Build()
45
+
46
+ def Exchange(self, i, j):
47
+ t = self.nodeList[i]
48
+ self.nodeList[i] = self.nodeList[j]
49
+ self.nodeList[j] = t
50
+ self.idLocator[self.nodeList[i].id] = i
51
+ self.idLocator[self.nodeList[j].id] = j
52
+
53
+ def Decrease_Key(self, node, newDist, src):
54
+ if node.isConflicted:
55
+ return
56
+ i = self.idLocator[node.id]
57
+ if newDist > node.dist:
58
+ # relaxation not possible
59
+ return
60
+ else:
61
+ node.dist = newDist
62
+ node.src = src
63
+ parent = Parent(i)
64
+ while ((i > 0) and (self.nodeList[parent].dist > self.nodeList[i].dist)):
65
+ self.Exchange(i, parent)
66
+ i = parent
67
+ parent = Parent(i)
68
+
69
+ def Pop(self):
70
+ if(self.len == 0):
71
+ return None
72
+ if(self.nodeList[0].isConflicted):
73
+ # print("Pop has seen conflict!!!")
74
+ return None
75
+
76
+ # Remove the entry from the top of the heap
77
+ nMin = self.nodeList[0]
78
+ self.idLocator[self.nodeList[0].id] = -1
79
+
80
+ # Put the last node on top of heap and heapify
81
+ self.nodeList[0] = self.nodeList[self.len - 1]
82
+ self.idLocator[self.nodeList[0].id] = 0
83
+ self.len -= 1
84
+ self.Min_Heapify(0)
85
+ return nMin
86
+
87
+ def Min_Heapify(self, i):
88
+ nMin = self.nodeList[i]
89
+ li = Left(i)
90
+ if(li < self.len):
91
+ if(self.nodeList[li].dist < nMin.dist):
92
+ nMin = self.nodeList[li]
93
+ min_i = li
94
+ ri = Right(i)
95
+ if(ri < self.len):
96
+ if(self.nodeList[ri].dist < nMin.dist):
97
+ nMin = self.nodeList[ri]
98
+ min_i = ri
99
+ if(nMin.id != self.nodeList[i].id):
100
+ self.Exchange(i, min_i)
101
+ self.Min_Heapify(min_i)
102
+
103
+ def Delete(self, node):
104
+ i = self.idLocator[node.id]
105
+ self.nodeList[i].isConflicted = True
106
+ self.nodeList[i].dist = np.inf
107
+ self.Min_Heapify(i)
108
+
109
+ def Build(self):
110
+ self.len = len(self.nodeList)
111
+ # heapify bottom-up, from the last internal node to the root
+ for i in range(int(Parent(self.len - 1)), -1, -1):
112
+ self.Min_Heapify(i)
113
+
114
+ def Print(self):
115
+ i = 0
116
+ level = 1
117
+ ilimit = 0
118
+ while(i < self.len):
119
+ print('N(%d, %2.1f)' % (self.nodeList[i].id, self.nodeList[i].dist), end = ' ')
120
+ i += 1
121
+ if(i > ilimit):
122
+ print('\n')
123
+ level *= 2
124
+ ilimit += level
125
+
126
+ """
127
+ ################################################################################################
128
+ ###################### IMPLEMENTATION OF PRIM'S ALGO FOR FINDING MST ##########################
129
+ ############################# USES HEAP DEFINED ABOVE ########################################
130
+ ################################################################################################
131
+ """
132
+ def MST(nodelist, WScalarMat, conflicts_Dict, source):
133
+ # Warning: this function runs Prim's algorithm on a directed weight matrix,
134
+ # so it does not return a true (undirected) MST
135
+ mst_adj_graph = np.zeros(WScalarMat.shape, dtype=bool)
136
+ # Reset nodes and put ids
137
+ for id in range(len(nodelist)):
138
+ nodelist[id].id = id
139
+ nodelist[id].dist = np.inf
140
+ nodelist[id].isConflicted = False
141
+ nodelist[id].src = -1
142
+
143
+ # Initialize Graph and min-Heap
144
+ nodelist[source].dist = 0
145
+ for neighbour in range(len(nodelist)):
146
+ if neighbour != source:
147
+ nodelist[neighbour].dist = WScalarMat[source][neighbour]
148
+ nodelist[neighbour].src = source
149
+ h = Heap(nodelist)
150
+
151
+ mst_nodes = defaultdict(lambda: [])
152
+ mst_nodes_bool = np.array([False]*len(nodelist))
153
+ # Run MST only until first conflicting node is seen
154
+ # Conflicting node will have np.inf as dist
155
+ while True:
156
+ nextNode = h.Pop()
157
+ if nextNode is None:
158
+ break
159
+ # print(nextNode.src, nextNode.id, nextNode)
160
+ mst_nodes_bool[nextNode.id] = True
161
+ mst_nodes[nextNode.chunk_id].append(nextNode)
162
+ if nextNode.src != -1:
163
+ mst_adj_graph[nextNode.src, nextNode.id] = True
164
+ # mst_adj_graph[nextNode.id, nextNode.src] = True
165
+ nid = nextNode.id
166
+ for conId in conflicts_Dict[nid]:
167
+ h.Delete(nodelist[conId])
168
+ for neighbour in range(len(nodelist)):
169
+ if neighbour != nextNode.id:
170
+ h.Decrease_Key(nodelist[neighbour], WScalarMat[nextNode.id][neighbour], nextNode.id)
171
+ mst_nodes = dict(mst_nodes)
172
+
173
+ return (mst_nodes, mst_adj_graph, mst_nodes_bool)
174
+
175
+ def RandomST_GoldOnly(nodelist, WScalarMat, conflicts_Dict, source):
176
+ (mst_nodes, mst_adj_graph, mst_nodes_bool) = MST(nodelist, WScalarMat, conflicts_Dict, source)
177
+
178
+ mst_adj_graph = np.zeros_like(mst_adj_graph)
179
+ nodelen = len(nodelist)
180
+
181
+ ## Random MST
182
+ free_set = list(range(nodelen))
183
+ full_set = list(range(nodelen))
184
+ st_set = []
185
+ start_node = np.random.randint(nodelen)
186
+ st_set.append(start_node)
187
+ free_set.remove(start_node)
188
+ for x in range(nodelen - 1):
189
+ a = st_set[np.random.randint(len(st_set))]
190
+ b = free_set[np.random.randint(len(free_set))]
191
+ if b not in st_set:
192
+ st_set.append(b)
193
+ free_set.remove(b)
194
+ mst_adj_graph[a, b] = 1
195
+ # mst_adj_graph[b, a] = 1 # Directed Spanning tree
196
+
197
+ return (mst_nodes, mst_adj_graph, mst_nodes_bool)
198
+
199
+
200
+ def GetMSTWeight(mst_adj_graph, WScalarMat):
201
+ return np.sum(WScalarMat[mst_adj_graph])
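
A small smoke test of the heap above (assuming heap_n_PrimMST imports cleanly, i.e. word_definite and its pickles are on the path); node ids and distances are arbitrary:

```python
from heap_n_PrimMST import Node, Heap

nodes = [Node(i, d) for i, d in enumerate([7.0, 3.0, 9.0, 1.0])]
h = Heap(nodes)                       # builds a min-heap keyed on dist

h.Decrease_Key(nodes[2], 0.5, src=3)  # relax node 2 from 9.0 down to 0.5
order = []
while True:
    n = h.Pop()
    if n is None:                     # heap exhausted (or top is conflicted)
        break
    order.append((n.id, n.dist))
print(order)                          # [(2, 0.5), (3, 1.0), (1, 3.0), (0, 7.0)]
```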
dir/heap_n_clique.py ADDED
@@ -0,0 +1,325 @@
1
+ import numpy as np
2
+ from word_definite import *
+ from collections import defaultdict  # used by MST() and clique() below
3
+ import math
4
+
5
+ def Parent(i):
6
+ return max(0, math.floor((i - 1)/2))
7
+
8
+ def Left(i):
9
+ return 2*i + 1
10
+
11
+ def Right(i):
12
+ return 2*(i + 1)
13
+
14
+
15
+ """
16
+ ################################################################################################
17
+ ######################## NOMINAL NODE CLASS REQUIRED FOR USING ################################
18
+ ######################### WITH THE HEAP DATA STRUCTURE #######################################
19
+ ################################################################################################
20
+ """
21
+
22
+ class Node:
23
+ def __init__(self, id, dist):
24
+ self.dist = dist
25
+ self.id = id
26
+ self.isConflicted = False
27
+ self.src = -1
28
+
29
+ """
30
+ ################################################################################################
31
+ ############################ IMPLEMENTATION OF HEAP ##########################################
32
+ ################################################################################################
33
+ """
34
+
35
+ class Heap:
36
+ # It's a minHeap
37
+ # Nodes are of type Word_definite
38
+ def __init__(self, nodeList):
39
+ self.nodeList = [n for n in nodeList]
40
+ self.len = len(nodeList)
41
+ self.idLocator = {}
42
+ for i in range(self.len):
43
+ self.idLocator[nodeList[i].id] = i
44
+ self.Build()
45
+
46
+ def Exchange(self, i, j):
47
+ t = self.nodeList[i]
48
+ self.nodeList[i] = self.nodeList[j]
49
+ self.nodeList[j] = t
50
+ self.idLocator[self.nodeList[i].id] = i
51
+ self.idLocator[self.nodeList[j].id] = j
52
+
53
+ def Decrease_Key(self, node, newDist, src):
54
+ if node.isConflicted:
55
+ return
56
+ i = self.idLocator[node.id]
57
+ if newDist > node.dist:
58
+ # relaxation not possible
59
+ return
60
+ else:
61
+ node.dist = newDist
62
+ node.src = src
63
+ parent = Parent(i)
64
+ while ((i > 0) and (self.nodeList[parent].dist > self.nodeList[i].dist)):
65
+ self.Exchange(i, parent)
66
+ i = parent
67
+ parent = Parent(i)
68
+
69
+ def Pop(self):
70
+ if(self.len == 0):
71
+ return None
72
+ if(self.nodeList[0].isConflicted):
73
+ # print("Pop has seen conflict!!!")
74
+ return None
75
+
76
+ # Remove the entry from the top of the heap
77
+ nMin = self.nodeList[0]
78
+ self.idLocator[self.nodeList[0].id] = -1
79
+
80
+ # Put the last node on top of heap and heapify
81
+ self.nodeList[0] = self.nodeList[self.len - 1]
82
+ self.idLocator[self.nodeList[0].id] = 0
83
+ self.len -= 1
84
+ self.Min_Heapify(0)
85
+ return nMin
86
+
87
+ def Min_Heapify(self, i):
88
+ nMin = self.nodeList[i]
89
+ li = Left(i)
90
+ if(li < self.len):
91
+ if(self.nodeList[li].dist < nMin.dist):
92
+ nMin = self.nodeList[li]
93
+ min_i = li
94
+ ri = Right(i)
95
+ if(ri < self.len):
96
+ if(self.nodeList[ri].dist < nMin.dist):
97
+ nMin = self.nodeList[ri]
98
+ min_i = ri
99
+ if(nMin.id != self.nodeList[i].id):
100
+ self.Exchange(i, min_i)
101
+ self.Min_Heapify(min_i)
102
+
103
+ def Delete(self, node):
104
+ i = self.idLocator[node.id]
105
+ self.nodeList[i].isConflicted = True
106
+ self.nodeList[i].dist = np.inf
107
+ self.Min_Heapify(i)
108
+
109
+ def Build(self):
110
+ self.len = len(self.nodeList)
111
+ # heapify bottom-up, from the last internal node to the root
+ for i in range(int(Parent(self.len - 1)), -1, -1):
112
+ self.Min_Heapify(i)
113
+
114
+ def Print(self):
115
+ i = 0
116
+ level = 1
117
+ ilimit = 0
118
+ while(i < self.len):
119
+ print('N(%d, %2.1f)' % (self.nodeList[i].id, self.nodeList[i].dist), end = ' ')
120
+ i += 1
121
+ if(i > ilimit):
122
+ print('\n')
123
+ level *= 2
124
+ ilimit += level
125
+
126
+ """
127
+ ################################################################################################
128
+ ###################### IMPLEMENTATION OF PRIM'S ALGO FOR FINDING MST ##########################
129
+ ############################# USES HEAP DEFINED ABOVE ########################################
130
+ ################################################################################################
131
+ """
132
+ def MST(nodelist, WScalarMat, conflicts_Dict, source):
133
+ # Warning: this function runs Prim's algorithm on a directed weight matrix,
134
+ # so it does not return a true (undirected) MST
135
+ mst_adj_graph = np.zeros(WScalarMat.shape, dtype=bool)
136
+ # print(len(nodelist))
137
+ # Reset nodes and put ids
138
+ for id in range(len(nodelist)):
139
+ nodelist[id].id = id
140
+ nodelist[id].dist = np.inf
141
+ nodelist[id].isConflicted = False
142
+ nodelist[id].src = -1
143
+
144
+ # Initialize Graph and min-Heap
145
+ nodelist[source].dist = 0
146
+ for neighbour in range(len(nodelist)):
147
+ if neighbour != source:
148
+ nodelist[neighbour].dist = WScalarMat[source][neighbour]
149
+ nodelist[neighbour].src = source
150
+ h = Heap(nodelist)
151
+
152
+ mst_nodes = defaultdict(lambda: [])
153
+ mst_nodes_bool = np.array([False]*len(nodelist))
154
+ # Run MST only until first conflicting node is seen
155
+ # Conflicting node will have np.inf as dist
156
+ while True:
157
+ nextNode = h.Pop()
158
+ if nextNode is None:
159
+ break
160
+ print("next-id:"+str(nextNode.id))
161
+ print('picked with dist ' + str(nodelist[nextNode.id].dist))
162
+ print()
163
+ # print(nextNode.src, nextNode.id, nextNode)
164
+ mst_nodes_bool[nextNode.id] = True
165
+ mst_nodes[nextNode.chunk_id].append(nextNode)
166
+
167
+
168
+ if nextNode.src != -1:
169
+ mst_adj_graph[nextNode.src, nextNode.id] = True
170
+ # mst_adj_graph[nextNode.id, nextNode.src] = True
171
+ nid = nextNode.id
172
+ for conId in conflicts_Dict[nid]:
173
+ h.Delete(nodelist[conId])
174
+ for neighbour in range(len(nodelist)):
175
+ if neighbour != nextNode.id:
176
+ print(WScalarMat[nextNode.id][neighbour])
177
+ print(nodelist[neighbour].dist)
178
+ h.Decrease_Key(nodelist[neighbour], WScalarMat[nextNode.id][neighbour], nextNode.id)
179
+
180
+ print(mst_nodes_bool)
181
+ # print(mst_nodes_bool)
182
+ print('#'*30)
183
+ mst_nodes = dict(mst_nodes)
184
+
185
+ return (mst_nodes, mst_adj_graph, mst_nodes_bool)
186
+
187
+ def clique(nodelist, WScalarMat, conflicts_Dict, source):
188
+ # WTF Dude!!! This function should not be used... It is running Prim's on a directed graph!!!
189
+ # Doesn't return MST
190
+ mst_adj_graph = np.zeros(WScalarMat.shape, dtype=bool)
191
+ # print(len(nodelist))
192
+ # Reset nodes and put ids
193
+ # print('node-ids')
194
+ for id in range(len(nodelist)):
195
+ # print(id)
196
+ nodelist[id].id = id
197
+ nodelist[id].dist = np.inf
198
+ nodelist[id].isConflicted = False
199
+ nodelist[id].src = -1
200
+ # print('*'*40)
201
+ # Initialize Graph and min-Heap
202
+ nodelist[source].dist = 0
203
+
204
+ nodeset=set()
205
+ for neighbour in range(len(nodelist)):
206
+ if neighbour != source:
207
+ nodelist[neighbour].dist = WScalarMat[source][neighbour]
208
+ nodelist[neighbour].src = source
209
+ # nodeset.add((nodelist[neighbour].dist,neighbour))
210
+
211
+ # nodeset = sorted(nodeset)
212
+ nodeset.add((0,source))
213
+ nodesadded=[]
214
+ nodesavailable = np.zeros(len(nodelist),dtype=int) # o if available, 1 if not available
215
+
216
+ mst_nodes = defaultdict(lambda: [])
217
+ mst_nodes_bool = np.array([False]*len(nodelist))
218
+ # Run MST only until first conflicting node is seen
219
+ # Conflicting node will have np.inf as dist
220
+
221
+ it=0
222
+ nextNode=-1
223
+ while True:
224
+ # print(nodeset)
225
+ it+=1
226
+ # print(it)
227
+ if(it>1000):
228
+ break
229
+ if(len(nodeset)==0):
230
+ break
231
+ # print('before nn assign: ')
232
+ # print(nextNode)
233
+ nextNode = next(iter(nodeset))
234
+ # print("after nn assign:")
235
+ # print(nextNode)
236
+ # print("Nextnode is :"+str(nextNode[1])+" Picked by :"+str(nextNode[0]))
237
+ nextNode=nodelist[nextNode[1]]
238
+ # print(type(nextNode))
239
+ # print(st_setr(nextNode.id)+"",)
240
+
241
+ # print(nextNode.id)
242
+ nodesavailable[nextNode.id]=1
243
+ # nodesavailable=1
244
+ if nextNode is None:
245
+ break
246
+ # print(nextNode.src, nextNode.id, nextNode)
247
+ mst_nodes_bool[nextNode.id] = True
248
+ mst_nodes[nextNode.chunk_id].append(nextNode)
249
+
250
+ nodeset = set()
251
+
252
+ if nextNode.src != -1:
253
+ mst_adj_graph[nextNode.src, nextNode.id] = True
254
+ # mst_adj_graph[nextNode.id, nextNode.src] = True
255
+
256
+ nid = nextNode.id
257
+ nodesadded.append(nid)
258
+ for conId in conflicts_Dict[nid]:
259
+ # h.Delete(nodelist[conId])
260
+ nodesavailable[conId]=1
261
+ # print('here')
262
+
263
+ for neighbour in range(len(nodelist)):
264
+ # print(type(nodesavailable))
265
+ # print(type(nodesavailable[0]))
266
+ if(nodesavailable[neighbour]==1):
267
+ continue
268
+ if neighbour != nextNode.id:
269
+ # h.Decrease_Key(nodelist[neighbour], WScalarMat[nextNode.id][neighbour], nextNode.id)
270
+ edgewt=0
271
+ # print(nodesadded)
272
+ for nodepresent in nodesadded:
273
+ edgewt+=WScalarMat[nodepresent][neighbour]
274
+ # print('adding '+str(neighbour))
275
+ nodeset.add((edgewt,neighbour))
276
+ # print(nodeset)
277
+
278
+ nodeset=sorted(nodeset)
279
+ # print(nodeset)
280
+ # print("#"*30)
281
+ # print(mst_nodes_bool)
282
+ # print('-'*20)
283
+ # print('#'*30)
284
+ mst_nodes = dict(mst_nodes)
285
+ if(it>1000):
286
+ print('!!!!*10')
287
+ for i in range(len(mst_nodes_bool)):
288
+ for j in range(len(mst_nodes_bool)):
289
+ if(i==j):
290
+ continue
291
+ if(mst_nodes_bool[i] and mst_nodes_bool[j]):
292
+ mst_adj_graph[i][j]=True
293
+ mst_adj_graph[j][i]=True
294
+
295
+ # print(mst_adj_graph)
296
+ # print("#")
297
+ return (mst_nodes, mst_adj_graph, mst_nodes_bool)
298
+
299
+ def RandomST_GoldOnly(nodelist, WScalarMat, conflicts_Dict, source):
300
+ (mst_nodes, mst_adj_graph, mst_nodes_bool) = MST(nodelist, WScalarMat, conflicts_Dict, source)
301
+
302
+ mst_adj_graph = np.zeros_like(mst_adj_graph)
303
+ nodelen = len(nodelist)
304
+
305
+ ## Random mst_adj_graph
306
+ free_set = list(range(nodelen))
307
+ full_set = list(range(nodelen))
308
+ st_set = []
309
+ start_node = np.random.randint(nodelen)
310
+ st_set.append(start_node)
311
+ free_set.remove(start_node)
312
+ for x in range(nodelen - 1):
313
+ a = st_set[np.random.randint(len(st_set))]
314
+ b = free_set[np.random.randint(len(free_set))]
315
+ if b not in st_set:
316
+ st_set.append(b)
317
+ free_set.remove(b)
318
+ mst_adj_graph[a, b] = 1
319
+ # mst_adj_graph[b, a] = 1 # Directed Spanning tree
320
+
321
+ return (mst_nodes, mst_adj_graph, mst_nodes_bool)
322
+
323
+
324
+ def GetMSTWeight(mst_adj_graph, WScalarMat):
325
+ return np.sum(WScalarMat[mst_adj_graph])
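
The clique() routine above differs from MST() in how it scores candidates: at each step it picks the still-available node whose *total* edge weight to all nodes chosen so far is smallest, then knocks out that node's conflicts. A toy illustration of just that selection rule (weights and conflicts are made up):

```python
import numpy as np

W = np.array([[0., 1., 4., 2.],
              [1., 0., 5., 9.],
              [4., 5., 0., 3.],
              [2., 9., 3., 0.]])
conflicts = {0: [], 1: [3], 2: [], 3: [1]}   # nodes 1 and 3 conflict

chosen = [0]                                 # start from node 0 as the source
blocked = {0} | set(conflicts[0])
while True:
    candidates = [(sum(W[c][j] for c in chosen), j)
                  for j in range(len(W)) if j not in blocked]
    if not candidates:
        break
    _, best = min(candidates)                # cheapest total attachment cost
    chosen.append(best)
    blocked.add(best)
    blocked.update(conflicts[best])          # conflicting nodes become unavailable
print(chosen)                                # [0, 1, 2]: picking 1 rules out 3
```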
dir/heldoutmatchtest.py ADDED
@@ -0,0 +1,37 @@
1
+ import os
2
+
3
+ testfolder='../NewData/skt_dcs_DS.bz2_4K_bigram_mir_heldout'
4
+ print('Loading testing files')
5
+ TestFiles=set()
6
+ Allfiles=set()
7
+ for f in os.listdir(testfolder):
8
+ if '.ds.bz2' in f:
9
+ f = f.replace('.ds.bz2', '.p2')
10
+ TestFiles.add(f)
11
+ Allfiles.add((f,1))
12
+ bz2_input_folder = '../NewData/skt_dcs_DS.bz2_4K_bigram_mir_10K/'
13
+ print('Loading training files')
14
+ TrainFiles = set()
15
+ for f in os.listdir(bz2_input_folder):
16
+ if '.ds.bz2' in f:
17
+ f = f.replace('.ds.bz2', '.p2')
18
+ TrainFiles.add(f)
19
+ Allfiles.add((f,2))
20
+
21
+
22
+
23
+ # TestFiles = sorted(TestFiles)
24
+ # print(TestFiles)
25
+
26
+ # print
27
+
28
+ # TrainFiles = sorted(TrainFiles)
29
+ # print(TrainFiles)
30
+
31
+ print(TestFiles&TrainFiles)
32
+
33
+
34
+ print(len(TrainFiles))
35
+ print(len(TestFiles))
36
+ # for i in sorted(Allfiles):
37
+ # print i
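
When the two splits are disjoint, the intersection printed above is the empty set. The same check as a reusable helper (this function is ours, not part of the original scripts):

```python
def assert_disjoint(train_files, test_files):
    # Raise if any heldout file also appears in the training split.
    leaked = train_files & test_files
    if leaked:
        raise ValueError('{} file(s) leaked across splits'.format(len(leaked)))

# assert_disjoint(TrainFiles, TestFiles)
```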
dir/lemmawise_labeller.py ADDED
@@ -0,0 +1,77 @@
1
+ import csv, os, pickle
2
+ import bz2
3
+ from optparse import OptionParser
4
+
5
+ def open_dsbz2(filename):
6
+ with bz2.BZ2File(filename, 'r') as f:
7
+ loader = pickle.load(f)
8
+
9
+ conflicts_Dict_correct = loader['conflicts_Dict_correct']
10
+ nodelist_to_correct_mapping = loader['nodelist_to_correct_mapping']
11
+ nodelist_correct = loader['nodelist_correct']
12
+ featVMat_correct = loader['featVMat_correct']
13
+ featVMat = loader['featVMat']
14
+ conflicts_Dict = loader['conflicts_Dict']
15
+ nodelist = loader['nodelist']
16
+
17
+ return (nodelist_correct, conflicts_Dict_correct, featVMat_correct, nodelist_to_correct_mapping,\
18
+ nodelist, conflicts_Dict, featVMat)
19
+
20
+ def main(small_tag):
21
+ ho_folders = {
22
+ 'PR2': 'skt_dcs_DS.bz2_4K_pmi_rfe_heldout',
23
+ 'BR2': 'skt_dcs_DS.bz2_4K_bigram_rfe_heldout',
24
+ 'PM2': 'skt_dcs_DS.bz2_4K_pmi_mir_heldout',
25
+ 'BM2': 'skt_dcs_DS.bz2_4K_bigram_mir_heldout',
26
+ 'PR3': 'skt_dcs_DS.bz2_1L_pmi_rfe_heldout',
27
+ 'BR3': 'skt_dcs_DS.bz2_1L_bigram_rfe_heldout',
28
+ 'PM3': 'skt_dcs_DS.bz2_1L_pmi_mir_heldout_again',
29
+ 'BM3': 'skt_dcs_DS.bz2_1L_bigram_heldout'
30
+ }
31
+ bz_folder = '../NewData/{}/'.format(ho_folders[small_tag])
32
+ files = []
33
+
34
+ tag = '{}_NLoss_'.format(small_tag)
35
+ outFile = 'outputs/dump_predictions/lemma_label_{}.csv'.format(small_tag)
36
+ print('Writing to ', outFile)
37
+
38
+ for f in os.listdir('outputs/dump_predictions/'):
39
+ if tag in f:
40
+ print('Adding ', f)
41
+ files.append(f)
42
+
43
+ with open(outFile, 'w') as out_fh:
44
+ out_fh_csv = csv.writer(out_fh)
45
+ fi = 0
46
+ for root_file in files:
47
+ with open(os.path.join('outputs/dump_predictions/', root_file)) as fh:
48
+ print('Processing File: ', root_file)
49
+ fh_csv = csv.reader(fh)
50
+ for lr in fh_csv:
51
+ if fi % 100 == 0:
52
+ print('Files done: ', fi)
53
+ fi += 1
54
+ sent_id = lr[0]
55
+ dcs_name = sent_id + '.ds.bz2'
56
+ (nodelist_correct, _, _, nodelist_to_correct_mapping,\
57
+ _, _, _) = open_dsbz2(os.path.join(bz_folder, dcs_name))
58
+ for rx in range(5):
59
+ lr = next(fh_csv)[1:]
60
+ if rx == 3:
61
+ iam = [int(x) for x in lr]
62
+ for i in range(len(nodelist_correct)):
63
+ out_fh_csv.writerow([sent_id, nodelist_correct[i].lemma, 1*(nodelist_to_correct_mapping[i] in iam)])
64
+
65
+ if __name__ == '__main__':
66
+ parser = OptionParser()
67
+ parser.add_option("-t", "--tag", dest="tag",
68
+ help="Tag for feature set to use", metavar="TAG")
69
+
70
+ (options, args) = parser.parse_args()
71
+
72
+ options = vars(options)
73
+ _tag = options['tag']
74
+ if _tag is None:
75
+ raise Exception('None is tag')
76
+ print(_tag)
77
+ main(_tag)
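
The open_dsbz2 helper above is also a convenient way to inspect a single graph file by hand. A sketch (the folder and file name are illustrative, and unpickling requires the word_definite classes to be importable):

```python
from lemmawise_labeller import open_dsbz2

path = '../NewData/skt_dcs_DS.bz2_4K_bigram_mir_heldout/4442.ds.bz2'
(nodes_gold, conf_gold, feats_gold, gold_map,
 nodes, conf, feats) = open_dsbz2(path)
print(len(nodes), 'candidate nodes,', len(nodes_gold), 'gold nodes')
print('first gold lemma:', nodes_gold[0].lemma)
```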
dir/nnet.py ADDED
@@ -0,0 +1,371 @@
1
+ import numpy as np
2
+
3
+ """
4
+ ################################################################################################
5
+ ################### METHODs: SIGMOID and DERIVATIVE OF SIGMOID ################################
6
+ ################################################################################################
7
+ """
8
+
9
+ def sigmoid(vec):
10
+ evec = 1 + np.exp(-vec)
11
+ return 1/evec
12
+
13
+ def d_sigmoid(output_of_gate):
14
+ return output_of_gate*(1-output_of_gate)
15
+
16
+ """
17
+ ################################################################################################
18
+ ################### METHODs: ReLU AND DERIVATE OF ReLU ########################################
19
+ ################################################################################################
20
+ """
21
+
22
+ def relu(vec_x):
23
+ relu_x = vec_x.copy()
24
+ relu_x[vec_x < 0] = 0
25
+ return relu_x
26
+
27
+ def lrelu(vec_x):
28
+ relu_x = vec_x.copy()
29
+ relu_x[vec_x < 0] = relu_x[vec_x < 0]/100
30
+ return relu_x
31
+
32
+ def d_relu(vec_x):
33
+ d_relu_x = vec_x.copy()
34
+ d_relu_x[vec_x > 0] = 1
35
+ d_relu_x[vec_x <= 0] = 0
36
+ return d_relu_x
37
+
38
+ def d_lrelu(vec_x):
39
+ d_relu_x = vec_x.copy()
40
+ d_relu_x[vec_x > 0] = 1
41
+ d_relu_x[vec_x <= 0] = 0.01
42
+ return d_relu_x
43
+
44
+ """
45
+ ################################################################################################
46
+ ################## IMPLEMENTATION OF NEURAL NETWORK ##########################################
47
+ ################################################################################################
48
+ """
49
+
50
+ class NN:
51
+ def __init__(self, input_dimension, hidden_layer_size, outer_relu = True, keep_prob = 1.0):
52
+ # d: Input feature dimension i.e. the dimension of the edge feature vectors
53
+ # n: Hidden layer size
54
+
55
+ # TODO: Add Bias terms
56
+ self.n = hidden_layer_size
57
+ self.d = input_dimension
58
+
59
+ rand_init_range = 1e-2
60
+ self.W = np.random.uniform(-rand_init_range, rand_init_range, (self.n, self.d))
61
+ self.B1 = np.random.uniform(-rand_init_range, rand_init_range, (self.n, 1))
62
+
63
+ rand_init_range = 1e-1
64
+ self.U = np.random.uniform(-rand_init_range, rand_init_range, (self.n, 1))
65
+ self.B2 = np.random.uniform(-rand_init_range, rand_init_range, (1, 1))
66
+
67
+ # Apply relu or sigmoid at the output layer
68
+ # If relu is applied it will be assumed that log is applied to the
69
+ # feature before passing it to the network
70
+ # Else in case of outer sigmoid
71
+ # log is applied after the neural network
72
+ self.outer_relu = outer_relu
73
+
74
+
75
+ # Learning Rates
76
+ self.etaW = None
77
+ self.etaB1 = None
78
+
79
+ self.etaU = None
80
+ self.etaB2 = None
81
+
82
+ self.version = 'h1'
83
+
84
+ # Dropout
85
+ self.keep_prob = keep_prob
86
+ self.dropout_prob = 1 - keep_prob
87
+ self.r1 = np.ones((input_dimension, 1)) # One hot for input layer
88
+ self.r2 = np.ones(self.B1.shape) # one hot for hidden layer
89
+
90
+ self.training_time = True
91
+
92
+ def new_dropout(self):
93
+ self.r1 = np.random.binomial(1, self.keep_prob, size=self.r1.shape)
94
+ self.r2 = np.random.binomial(1, self.keep_prob, size=self.r2.shape)
95
+ def ForTraining(self):
96
+ self.training_time = True
97
+ def ForTesting(self):
98
+ self.training_time = False
99
+ def Forward_Prop(self, x):
100
+ if self.training_time:
101
+ z2 = np.matmul(self.W, x*self.r1) + self.B1
102
+ a2 = lrelu(z2)*self.r2
103
+ o = np.matmul(self.U.transpose(), a2) + self.B2
104
+ else:
105
+ z2 = np.matmul(self.keep_prob*self.W, x) + self.B1
106
+ a2 = lrelu(z2)
107
+ o = np.matmul(self.keep_prob*self.U.transpose(), a2) + self.B2
108
+
109
+ if self.outer_relu:
110
+ # s = relu(o)
111
+ s = o
112
+ else:
113
+ raise Exception('Support for Non-Outer_Relu removed')
114
+ s = sigmoid(o)
115
+
116
+ return (z2, a2, s)
117
+
118
+ '''
119
+ def Forward_Prop(self, x):
120
+ z2 = np.matmul(self.keep_prob*self.W, x) + self.B1
121
+ a2 = lrelu(z2)
122
+ o = np.matmul(self.keep_prob*self.U.transpose(), a2) + self.B2
123
+ if self.outer_relu:
124
+ # s = relu(o)
125
+ s = o
126
+ else:
127
+ raise Exception('Support for Non-Outer_Relu removed')
128
+ s = sigmoid(o)
129
+ return (z2, a2, s)
130
+ '''
131
+ def Get_Energy(self, x):
132
+ # print("problem arises now")
133
+ x=x[0:1500]
134
+ # numpy.shape(self.W)
135
+ # numpy.shape(x)
136
+ z2 = np.matmul(self.W, x) + self.B1
137
+ # print(len(x))
138
+ a2 = lrelu(z2)
139
+ o = np.matmul(self.U.transpose(), a2) + self.B2
140
+ if self.outer_relu:
141
+ # s = relu(o)
142
+ s = o
143
+ else:
144
+ raise Exception('Support for Non-Outer_Relu removed')
145
+ s = sigmoid(o)
146
+ return s
147
+
148
+ # Back_Propagate gradient of Loss, L: Assuming S is the direct output of the network
149
+ def Back_Prop(self, dLdOut, nodeLen, featVMat, _debug = True):
150
+ N = nodeLen
151
+ dLdU = np.zeros(self.U.shape)
152
+ dLdB2 = np.zeros(self.B2.shape)
153
+
154
+ dLdW = np.zeros(self.W.shape)
155
+ dLdB1 = np.zeros(self.B1.shape)
156
+
157
+ if not self.outer_relu:
158
+ raise Exception('Support for Non-Outer_Relu removed')
159
+ return
160
+ else:
161
+ etaW = self.etaW
162
+ etaB1 = self.etaB1
163
+
164
+ etaU = self.etaU
165
+ etaB2 = self.etaB2
166
+
167
+ if (etaW is None) or (etaB1 is None) or (etaU is None) or (etaB2 is None):
168
+ raise Exception('Learning Rates Not Set...')
169
+
170
+ batch_size = 0
171
+ for i in range(N):
172
+ for j in range(N):
173
+ if dLdOut[i, j] != 0 and (featVMat[i][j] is not None):
174
+ batch_size += 1
175
+ x = featVMat[i][j][0:1500]
176
+ (z2, a2, s) = self.Forward_Prop(x)
177
+ # print(a2.transpose())
178
+ # print('o')
179
+ # print(np.matmul(self.U.transpose(), a2))
180
+
181
+ dLdU += dLdOut[i, j]*a2
182
+
183
+ dLdB2 += dLdOut[i, j]
184
+
185
+ dRelu = d_lrelu(z2)
186
+ dLdW += (dLdOut[i, j])*np.matmul((self.U*dRelu), (x*self.r1).transpose())
187
+
188
+ dLdB1 += dLdOut[i, j]*np.matmul(self.U.transpose(), dRelu)
189
+
190
+ if batch_size > 0:
191
+ delW = etaW*dLdW/(batch_size)
192
+ delU = etaU*dLdU/(batch_size)
193
+ delB1 = etaB1*dLdB1/batch_size
194
+ delB2 = etaB2*dLdB2/batch_size
195
+ if _debug:
196
+ print('Max(delW): %10.6f\tMax(delU): %10.6f'%(np.max(np.abs(delW)), np.max(np.abs(delU))))
197
+ self.W -= delW
198
+ self.B1 -= delB1
199
+
200
+ self.U -= delU
201
+ self.B2 -= delB2
202
+
203
+
204
+ class NN_2:
205
+ def __init__(self, input_dimension, hidden_layer_1_size, hidden_layer_2_size = None, outer_relu = True):
206
+ # d: Input feature dimension i.e. the dimension of the edge feature vectors
207
+ # n: Hidden layer size
208
+
209
+ if hidden_layer_2_size is None:
210
+ hidden_layer_2_size = hidden_layer_1_size
211
+
212
+ # TODO: Add Bias terms
213
+ self.h1 = hidden_layer_1_size
214
+ self.h2 = hidden_layer_2_size
215
+ self.d = input_dimension
216
+
217
+ rand_init_range = 1e-2
218
+ self.W1 = np.random.uniform(-rand_init_range, rand_init_range, (self.h1, self.d))
219
+ self.B1 = np.random.uniform(-rand_init_range, rand_init_range, (self.h1, 1))
220
+ self.W2 = np.random.uniform(-rand_init_range, rand_init_range, (self.h2, self.h1))
221
+ self.B2 = np.random.uniform(-rand_init_range, rand_init_range, (self.h2, 1))
222
+
223
+ rand_init_range = 1e-1
224
+ self.U = np.random.uniform(-rand_init_range, rand_init_range, (self.h2, 1))
225
+ self.B3 = np.random.uniform(-rand_init_range, rand_init_range, (1, 1))
226
+
227
+ # Apply relu or sigmoid at the output layer
228
+ # If relu is applied it will be assumed that log is applied to the
229
+ # feature before passing it to the network
230
+ # Else in case of outer sigmoid
231
+ # log is applied after the neural network
232
+ self.outer_relu = outer_relu
233
+
234
+
235
+ # Learning Rates
236
+ self.etaW1 = None
237
+ self.etaB1 = None
238
+ self.etaW2 = None
239
+ self.etaB2 = None
240
+
241
+ self.etaU = None
242
+ self.etaB3 = None
243
+
244
+ self.version = 'h2'
245
+
246
+ def Forward_Prop(self, x):
247
+ z2 = np.matmul(self.W1, x) + self.B1
248
+ a2 = lrelu(z2)
249
+
250
+ z3 = np.matmul(self.W2, a2) + self.B2
251
+ a3 = lrelu(z3)
252
+
253
+ o = np.matmul(self.U.transpose(), a3) + self.B3
254
+ if self.outer_relu:
255
+ # s = relu(o)
256
+ s = o
257
+ else:
258
+ raise Exception('Support for Non-Outer_Relu removed')
259
+ s = sigmoid(o)
260
+ return (z3, a3, z2, a2, s)
261
+ def Get_Energy(self, x):
262
+ z2 = np.matmul(self.W1, x) + self.B1
263
+ a2 = lrelu(z2)
264
+
265
+ z3 = np.matmul(self.W2, a2) + self.B2
266
+ a3 = lrelu(z3)
267
+
268
+ o = np.matmul(self.U.transpose(), a3) + self.B3
269
+ if self.outer_relu:
270
+ # s = relu(o)
271
+ s = o
272
+ else:
273
+ raise Exception('Support for Non-Outer_Relu removed')
274
+ s = sigmoid(o)
275
+ return s
276
+
277
+ # Back_Propagate gradient of Loss, L: Assuming S is the direct output of the network
278
+ def Back_Prop(self, dLdOut, nodeLen, featVMat, _debug = True):
279
+ N = nodeLen
280
+
281
+ dLdU = np.zeros(self.U.shape)
282
+ dLdB3 = np.zeros(self.B3.shape)
283
+
284
+ dLdW2 = np.zeros(self.W2.shape)
285
+ dLdB2 = np.zeros(self.B2.shape)
286
+
287
+ dLdW1 = np.zeros(self.W1.shape)
288
+ dLdB1 = np.zeros(self.B1.shape)
289
+
290
+
291
+ if not self.outer_relu:
292
+ raise Exception('Support for Non-Outer_Relu removed')
293
+ return
294
+ else:
295
+ etaW1 = self.etaW1
296
+ etaB1 = self.etaB1
297
+
298
+ etaW2 = self.etaW2
299
+ etaB2 = self.etaB2
300
+
301
+ etaU = self.etaU
302
+ etaB3 = self.etaB3
303
+
304
+ if (etaW1 is None) or (etaB1 is None) or (etaW2 is None) or (etaB2 is None) or (etaU is None) or (etaB3 is None):
305
+ raise Exception('Learning Rates Not Set...')
306
+
307
+ batch_size = 0
308
+ for i in range(N):
309
+ for j in range(N):
310
+ if dLdOut[i, j] != 0 and (featVMat[i][j] is not None):
311
+ batch_size += 1
312
+ (z3, a3, z2, a2, s) = self.Forward_Prop(featVMat[i][j])
313
+ # print(a2.transpose())
314
+ # print('o')
315
+ # print(np.matmul(self.U.transpose(), a2))
316
+
317
+ dLdU += dLdOut[i, j]*a3
318
+
319
+ dLdB3 += dLdOut[i, j]
320
+
321
+ dRelu_z3 = d_lrelu(z3)
322
+
323
+ dLdW2 += (dLdOut[i, j])*np.matmul((self.U*dRelu_z3), a2.transpose())
324
+
325
+ dLdB2 += dLdOut[i, j]*self.U*dRelu_z3
326
+
327
+ dRelu_z2 = d_lrelu(z2)
328
+
329
+ dLdW1 += (dLdOut[i, j])*np.matmul(np.matmul(self.W2.transpose(), self.U*dRelu_z3)*dRelu_z2, featVMat[i][j].transpose())
330
+
331
+ dLdB1 += (dLdOut[i, j])*np.matmul(self.W2.transpose(), self.U*dRelu_z3)*dRelu_z2
332
+
333
+
334
+ # for k in range(self.n):
335
+ # if dRelu[k] != 0:
336
+ # dLdW[k, :, None] += (dLdOut[i, j])*self.U[k]*dRelu[k]*(featVMat[i][j])
337
+ # print('dlDW:')
338
+ # print(dLdW/(batch_size))
339
+ # print('dlDU:')
340
+ # print(dLdU/(batch_size))
341
+ # print('Batch size: ', batch_size)
342
+ if batch_size > 0:
343
+ delW1 = etaW1*dLdW1/(batch_size)
344
+ delW2 = etaW1*dLdW2/(batch_size)
345
+ delU = etaU*dLdU/(batch_size)
346
+ delB1 = etaB1*dLdB1/batch_size
347
+ delB2 = etaB2*dLdB2/batch_size
348
+ delB3 = etaB2*dLdB3/batch_size
349
+ if _debug:
350
+ print('Max(delW2): %10.6f\tMax(delW1): %10.6f\tMax(delU): %10.6f'%(np.max(np.abs(delW2)), np.max(np.abs(delW1)), np.max(np.abs(delU))))
351
+
352
+ # Layer 1
353
+ self.W1 -= delW1
354
+ self.B1 -= delB1
355
+
356
+ # Layer 2
357
+ self.B2 -= delB2
358
+ self.W2 -= delW2
359
+
360
+ # Layer 3
361
+ self.U -= delU
362
+ self.B3 -= delB3
363
+
364
+
365
+
366
+
367
+
368
+
369
+
370
+
371
+
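
A quick shape check of the single-hidden-layer net: edge feature vectors are column vectors, and the energy comes out as a 1x1 array. The input dimension 1500 matches the slice applied inside Get_Energy; all values below are random.

```python
import numpy as np
from nnet import NN

net = NN(input_dimension=1500, hidden_layer_size=200)
net.ForTesting()                      # use the dropout-scaled inference path
x = np.random.rand(1500, 1)           # one edge feature vector (column)

z2, a2, s = net.Forward_Prop(x)
print(z2.shape, a2.shape, s.shape)    # (200, 1) (200, 1) (1, 1)
print('energy:', net.Get_Energy(x))   # equals s here, since keep_prob defaults to 1.0
```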
dir/pvb.p ADDED
Binary file (1.3 kB).
 
dir/rom.txt ADDED
@@ -0,0 +1,33 @@
1
+ a,a
2
+ ā,A
3
+ i,i
4
+ ī,I
5
+ u,u
6
+ ū,U
7
+ ṛ,f
8
+ e,e
9
+ o,o
10
+ ṃ,M
11
+ k,k
12
+ g,g
13
+ ṅ,N
14
+ c,c
15
+ j,j
16
+ ñ,Y
17
+ ṭ,w
18
+ ḍ,q
19
+ ṇ,R
20
+ t,t
21
+ d,d
22
+ n,n
23
+ p,p
24
+ b,b
25
+ y,y
26
+ r,r
27
+ l,l
28
+ v,v
29
+ ś,S
30
+ ṣ,z
31
+ s,s
32
+ h,h
33
+ ḥ,H
dir/rom2.txt ADDED
@@ -0,0 +1,12 @@
1
+ ai,E
2
+ au,O
3
+ kh,K
4
+ gh,G
5
+ ch,C
6
+ jh,J
7
+ ṭh,W
8
+ ḍh,Q
9
+ th,T
10
+ dh,D
11
+ ph,P
12
+ bh,B
dir/romtoslp.py ADDED
@@ -0,0 +1,35 @@
1
+ # -*- coding: utf-8 -*-
2
+ """
3
+ Created on Tue Apr 5 19:14:27 2016
4
+
5
+ @author: puneet
6
+ """
7
+
8
+
9
+
10
+ def rom_slp(a):
11
+
12
+ double_dict={}
13
+ f=open('rom2.txt','r')
14
+ for lines in f.readlines():
15
+ words=lines.split(',')
16
+ words[1]=words[1].replace('\n','')
17
+ double_dict[words[0]]=words[1]
18
+ f.close()
19
+ single_dict={}
20
+ q=open('rom.txt','r')
21
+ for lines in q.readlines():
22
+ words=lines.split(',')
23
+ words[1]=words[1].replace('\n','')
24
+ single_dict[words[0]]=words[1]
25
+ q.close()
26
+
27
+ for elem in double_dict:
28
+ if elem in a:
29
+ a=a.replace(elem,double_dict[elem])
30
+ for elem in single_dict:
31
+ if elem in a:
32
+ a=a.replace(elem,single_dict[elem])
33
+ return(a)
34
+
35
+
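
Example (run from inside dir/ so rom.txt and rom2.txt resolve). Digraphs from rom2.txt are substituted before the single characters from rom.txt, so 'dh' maps to 'D' rather than to 'd' followed by 'h':

```python
from romtoslp import rom_slp

print(rom_slp('dharma'))   # -> 'Darma'
```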
dir/romtoslp.pyc ADDED
Binary file (934 Bytes).
 
dir/sandhiRules.p ADDED
Binary file (95.8 kB).
 
dir/sentences.py ADDED
@@ -0,0 +1,319 @@
1
+ #Loading of SKT Pickles
2
+ from romtoslp import rom_slp
3
+ from json import *
4
+ import pprint
5
+ from utilities import *
+ import numpy as np  # used below; make the dependency explicit rather than relying on the star import
6
+ class word_new():
7
+ def __init__(self,names):
8
+ self.lemmas=[]
9
+ self.names=names
10
+ self.urls=[]
11
+ self.forms=[]
12
+
13
+ class chunks:
14
+ def __init__(self,chunk_name):
15
+ self.chunk_name=chunk_name
16
+ self.chunk_words={}
17
+
18
+ class sentences:
19
+ def __init__(self,sent_id,sentence):
20
+ self.sent_id=sent_id
21
+ self.sentence=sentence
22
+ self.chunk=[]
23
+
24
+ # def getCNGs(formsDict):
25
+ # l = []
26
+ # if type(formsDict) == int or type(formsDict) == str:
27
+ # return [int(formsDict)]
28
+ # else:
29
+ # for form, configs in formsDict.items():
30
+ # for c in configs:
31
+ # if(form == 'verbform'):
32
+ # continue
33
+ # else:
34
+ # l.append(wtc_recursive(form, configs))
35
+ # return list(set(l))
36
+
37
+ class SentenceError(Exception):
38
+ def __init__(self, message):
39
+
40
+ # Call the base class constructor with the parameters it needs
41
+ super(SentenceError, self).__init__(message)
42
+
43
+ def SeeSentence(sentenceObj):
44
+ print('SKT ANALYZE')
45
+ print('-'*15)
46
+ print(sentenceObj.sentence)
47
+ zz = 0
48
+ # (chunkDict, lemmaList, wordList, revMap2Chunk, qu, cngList, verbs, tuplesMain) = SentencePreprocess(sentenceObj)
49
+ # for cid in chunkDict.keys():
50
+ # print('Analyzing:', rom_slp(sentenceObj.chunk[cid].chunk_name))
51
+ # for pos in chunkDict[cid].keys():
52
+ # tupIds = chunkDict[cid][pos]
53
+ # for ti in tupIds:
54
+ # print('%d :' % (pos, ), end = ' ')
55
+ # print(tuplesMain[ti][0][1], end=' ')
56
+ # for tup in tuplesMain[ti]:
57
+ # print([zz, tup[2], tup[3]], end=' ')
58
+ # zz += 1
59
+ # print('')
60
+ # print('-'*25)
61
+
62
+ for chunk in sentenceObj.chunk:
63
+ print("Analyzing ", rom_slp(chunk.chunk_name))
64
+ for pos in chunk.chunk_words.keys():
65
+ for word_sense in chunk.chunk_words[pos]:
66
+ word_sense = fix_w_new(word_sense)
67
+ print(pos, ": ", rom_slp(word_sense.names), word_sense.lemmas, word_sense.forms)
68
+ # for formsDict in word_sense.forms:
69
+ # print(getCNGs(formsDict))
70
+ print()
71
+
72
+ def getWord(sentenceObj, cid, pos,kii):
73
+ ch = sentenceObj.chunk[cid]
74
+ word = ch.chunk_words[pos][kii]
75
+ return {'lemmas': word.lemmas, 'forms':word.forms, 'names':word.names}
76
+
77
+ # ---------------------------------------------------------------------------------------------------------------------
78
+ # ---------------------------------------------------------------------------------------------------------------------
79
+
80
+ # ---------------------------------------------------------------------------------------------------------------------
81
+ # ---------------------------------------------------------------------------------------------------------------------
82
+ from wordTypeCheckFunction import *
83
+ import pickle
84
+
85
+ """
86
+ SentencePreprocess:
87
+ -------------------
88
+ Read a sentence obj and create + return the following objects
89
+
90
+ -> chunkDict: chunk_id -> position -> index in lemmaList (nested dictionary)
91
+ -> lemmaList: list of possible words as a result of word segmentation
92
+ -> revMap2Chunk: Map word in wordList to (cid, position) in chunkDict
93
+ -> qu: Possible query nodes
94
+ """
95
+ v2t = pickle.load(open('verbs_vs_cngs_matrix_countonly.p', 'rb'), encoding=u'utf8')
96
+ def wtc_recursive(form, c):
97
+ if type(c) ==list:
98
+ for cc in c:
99
+ return wtc_recursive(form, cc)
100
+ else:
101
+ return wordTypeCheck(form, c)
102
+
103
+ def CanBeQuery(chunk):
104
+ allLemmas = []
105
+ for pos, words in chunk.chunk_words.items():
106
+ for word in words:
107
+ for lemma in word.lemmas:
108
+ if lemma != '':
109
+ allLemmas.append(lemma)
110
+ if(len(allLemmas) == 1):
111
+ return True
+ return False
112
+
113
+ def Get_QCs(tuplesMain, chunkDict):
114
+ # Form NON-competitor dictionary - Query - Candidate Pairs
115
+ qc_pairs = {}
116
+ nodeList = [t for ts in tuplesMain for t in ts]
117
+
118
+ for ni in range(len(nodeList)):
119
+ qc_pairs[ni] = set(range(len(nodeList))) - set([ni])
120
+
121
+ for cid in chunkDict.keys():
122
+ # Neighbours
123
+ for pos1 in chunkDict[cid].keys():
124
+ for pos2 in chunkDict[cid].keys():
125
+ if pos1 <= pos2:
126
+ nList1 = []
127
+ for ti1 in chunkDict[cid][pos1]:
128
+ for tup1 in tuplesMain[ti1]:
129
+ nList1.append(tup1[0])
130
+ nList2 = []
131
+ for ti2 in chunkDict[cid][pos2]:
132
+ for tup2 in tuplesMain[ti2]:
133
+ nList2.append(tup2[0])
134
+ nList1 = set(nList1)
135
+ nList2 = set(nList2)
136
+ for n1 in nList1:
137
+ qc_pairs[n1] = qc_pairs[n1] - nList1
138
+
139
+ for n2 in nList2:
140
+ qc_pairs[n2] = qc_pairs[n2] - nList2
141
+
142
+ if pos1 < pos2:
143
+ for n1 in nList1:
144
+ for n2 in nList2:
145
+ if not CanCoExist_sandhi(pos1, pos2, nodeList[n1][1], nodeList[n2][1]):
146
+ qc_pairs[n1] = qc_pairs[n1] - set([n2])
147
+ qc_pairs[n2] = qc_pairs[n2] - set([n1])
148
+
149
+ return qc_pairs
150
+
151
+ '''
152
+ ===================
153
+ SentencePreprocess
154
+ ===================
155
+ forceQuery: Setting it true will make the longest word available a query if no
156
+ other query is available
157
+ '''
158
+ def SentencePreprocess(sentenceObj, forceQuery = False):
159
+ """
160
+ Considering word names only
161
+ ***{Word forms or cngs can also be used}
162
+ """
163
+ def getCNGs(formsDict):
164
+ if type(formsDict) == int or type(formsDict) == str:
165
+ return [int(formsDict)]
166
+ else:
167
+ l = []
168
+ for form, configs in formsDict.items():
169
+ for c in configs:
170
+ if(form == 'verbform'):
171
+ continue
172
+ else:
173
+ l.append(wtc_recursive(form, c))
174
+ return list(set(l))
175
+
176
+ chunkDict = {}
177
+ lemmaList = []
178
+ wordList = []
179
+ cngList = []
180
+ revMap2Chunk = []
181
+ qu = []
182
+ tuplesMain = []
183
+
184
+ cid = -1
185
+ tidExclusive = 0
186
+
187
+ ## Traverse sentence and form data-structures
188
+ for chunk in sentenceObj.chunk:
189
+ # print(chunk.chunk_name)
190
+ cid = cid+1
191
+ chunkDict[cid] = {}
192
+ for pos in chunk.chunk_words.keys():
193
+ tupleSet = {}
194
+ chunkDict[cid][pos] = []
195
+ for word_sense in chunk.chunk_words[pos]:
196
+ # word_sense = fix_w_new(word_sense)
197
+ nama = rom_slp(word_sense.names)
198
+ if nama == '':
199
+ raise SentenceError('Empty Name Detected')
200
+ if(len(word_sense.lemmas) > 0 and len(word_sense.forms) > 0):
201
+ tuples = []
202
+ for lemmaI in range(len(word_sense.lemmas)):
203
+ # lemma = rom_slp(word_sense.lemmas[lemmaI].split('_')[0]) # NOT REQUIRED - DONE IN FIX_W_NEW
204
+ lemma = word_sense.lemmas[lemmaI]
205
+ if lemma == '':
206
+ continue
207
+ tempCNGs = getCNGs(word_sense.forms[lemmaI])
208
+ for cng in tempCNGs:
209
+ # UPDATE LISTS
210
+ newT_Key = (lemma, cng)
211
+ newT = (tidExclusive, nama, lemma, cng)
212
+ if(newT_Key not in tupleSet):
213
+ tupleSet[newT_Key] = 1
214
+ tuples.append(newT) # Remember the order
215
+ lemmaList.append(lemma)
216
+ wordList.append(nama)
217
+ cngList.append(cng)
218
+ revMap2Chunk.append((cid, pos, len(tuplesMain)))
219
+ tidExclusive += 1
220
+
221
+ if(len(tuples) > 0):
222
+ # print(tuples)
223
+ k = len(tuplesMain)
224
+ chunkDict[cid][pos].append(k)
225
+ tuplesMain.append(tuples)
226
+
227
+ ## Find QUERY nodes now
228
+ for cid in chunkDict.keys():
229
+ tuples = []
230
+ for pos in chunkDict[cid].keys():
231
+ tupIds = chunkDict[cid][pos]
232
+ for tupId in tupIds:
233
+ [tuples.append((pos, tup[0], tup[1])) for tup in tuplesMain[tupId]]
234
+ for u in range(len(tuples)):
235
+ tup1 = tuples[u]
236
+ quFlag = True
237
+ for v in range(len(tuples)):
238
+ if(u == v):
239
+ continue
240
+ tup2 = tuples[v]
241
+
242
+ # '''
243
+ # FIXME: REMOVE TRY CATCH
244
+ # '''
245
+ # try:
246
+ if(tup1[0] < tup2[0]):
247
+ if not CanCoExist_sandhi(tup1[0], tup2[0], tup1[2], tup2[2]):
248
+ ## Found a competing node - hence can't be a query
249
+ quFlag = False
250
+ break
251
+ elif(tup1[0] > tup2[0]):
252
+ if not CanCoExist_sandhi(tup2[0], tup1[0], tup2[2], tup1[2]):
253
+ ## Found a competing node - hence can't be a query
254
+ quFlag = False
255
+ break
256
+ else:
257
+ quFlag = False
258
+ break
259
+
260
+ # except IndexError:
261
+ # print('From SentencePreprocess IndexError:', sentenceObj.sent_id)
262
+ # raise IndexError
263
+
264
+ if quFlag:
265
+ qu.append(tup1[1])
266
+
267
+ # if len(qu) == 0:
268
+ # print('No query available')
269
+ # maxI = 0
270
+ # for i in range(len(wordList)):
271
+ # if len(wordList[i]) > len(wordList[maxI]):
272
+ # maxI = i
273
+ # elif len(wordList[i]) == len(wordList[maxI]):
274
+ # # Check the competitor count
275
+
276
+ # print(wordList[maxI], 'is forced query')
277
+
278
+ verbs = []
279
+ i = -1
280
+ for w in lemmaList:
281
+ i += 1
282
+ if w in list(v2t.keys()):
283
+ verbs.append(i)
284
+
285
+
286
+ # pprint.pprint(tuplesMain)
287
+ # pprint.pprint(chunkDict)
288
+ # pprint.pprint(revMap2Chunk)
289
+
290
+ qc_pairs = Get_QCs(tuplesMain, chunkDict)
291
+
292
+ '''
293
+ qu = [] # Have to remove it later
294
+ '''
295
+ # print(chunkDict)
296
+ if len(qu) == 0 and len(lemmaList) > 0:
297
+ lens = np.array([len(t[1]) for ts in tuplesMain for t in ts])
298
+ cw = [(t[0], t[1]) for ts in tuplesMain for t in ts]
299
+ round1 = np.where(lens == np.max(lens))[0]
300
+ hits = [len(qc_pairs[r]) for r in round1]
301
+ finalist = round1[np.where(hits == np.min(hits))][0]
302
+ qu.append(finalist)
303
+
304
+ return (chunkDict, lemmaList, wordList, revMap2Chunk, qu, cngList, verbs, tuplesMain, qc_pairs)
305
+
306
+
307
+
308
+
309
+
310
+
311
+
312
+
313
+
314
+
315
+
316
+
317
+
318
+
319
+
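
For orientation, this is the minimal object graph that SentencePreprocess consumes. The field values below are invented, the container shape of the forms entries is an assumption, and real objects come from the SKT pickles; run it from inside dir/ so the module-level pickle loads resolve. Treat it as a sketch only:

```python
from sentences import sentences, chunks, word_new

w = word_new('rāmo')            # surface form of one analysis
w.lemmas = ['rAma']
w.forms = [{'noun': [[29]]}]    # assumed container shape for one CNG analysis

ch = chunks('rāmo')
ch.chunk_words[0] = [w]         # analyses of the word starting at position 0

sent = sentences('toy1', 'rāmo gacchati')
sent.chunk.append(ch)
print(sent.sent_id, '->', len(sent.chunk), 'chunk(s)')
```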
dir/sh_TestPool_MP_clique.py ADDED
@@ -0,0 +1,168 @@
1
+ import multiprocessing as mp
2
+ import TestPool_Unit
3
+ from shutil import copyfile
4
+ import numpy as np
5
+ import time
6
+ import sys
7
+ from optparse import OptionParser
8
+
9
+ from collections import defaultdict
10
+
11
+ def Evaluate(result_arr):
12
+ print('Files Processed: ', len(result_arr))
13
+ recalls = []
14
+ recalls_of_word = []
15
+ precisions = []
16
+ precisions_of_words = []
17
+ fully_Correct_l = 0
18
+ fully_Correct_w = 0
19
+ for entry in result_arr:
20
+ (word_match, lemma_match, n_dcsWords, n_output_nodes) = entry
21
+ recalls.append(lemma_match/n_dcsWords)
22
+ recalls_of_word.append(word_match/n_dcsWords)
23
+
24
+ precisions.append(lemma_match/n_output_nodes)
25
+ precisions_of_words.append(word_match/n_output_nodes)
26
+ if lemma_match == n_dcsWords:
27
+ fully_Correct_l += 1
28
+ if word_match == n_dcsWords:
29
+ fully_Correct_w += 1
30
+ print('Avg. Micro Recall of Lemmas: {}'.format(np.mean(np.array(recalls))))
31
+ print('Avg. Micro Recall of Words: {}'.format(np.mean(np.array(recalls_of_word))))
32
+ print('Avg. Micro Precision of Lemmas: {}'.format(np.mean(np.array(precisions))))
33
+ print('Avg. Micro Precision of Words: {}'.format(np.mean(np.array(precisions_of_words))))
34
+ rl = np.mean(np.array(recalls))
35
+ pl = np.mean(np.array(precisions))
36
+ print('F-Score of Lemmas: ', (2*pl*rl)/(pl+rl))
37
+ print('Fully Correct Lemmawise: {}'.format(fully_Correct_l/len(recalls_of_word)))
38
+ print('Fully Correct Wordwise: {}'.format(fully_Correct_w/len(recalls_of_word)))
39
+ print('[{:0.2f}, {:0.2f}, {:0.2f}, {:0.2f}, {:0.2f}, {:0.2f}, {:0.2f}]'.format(100*np.mean(np.array(recalls)), 100*np.mean(np.array(recalls_of_word)), 100*np.mean(np.array(precisions)), \
40
+ 100*np.mean(np.array(precisions_of_words)), 100*(2*pl*rl)/(pl+rl), 100*fully_Correct_l/len(recalls_of_word),\
41
+ 100*fully_Correct_w/len(recalls_of_word)))
42
+ sys.stdout.flush()
43
+
44
+ tag = None
45
+ proc_count = 4
46
+
47
+ def main():
48
+ global proc_count, tag
49
+ ho_folders = {
50
+ 'PR2': 'skt_dcs_DS.bz2_4K_pmi_rfe_heldout',
51
+ 'BR2': 'skt_dcs_DS.bz2_4K_bigram_rfe_heldout',
52
+ 'PM2': 'skt_dcs_DS.bz2_4K_pmi_mir_heldout',
53
+ 'BM2': 'skt_dcs_DS.bz2_4K_bigram_mir_heldout',
54
+ 'PR3': 'skt_dcs_DS.bz2_1L_pmi_rfe_heldout',
55
+ 'BR3': 'skt_dcs_DS.bz2_1L_bigram_rfe_heldout',
56
+ 'PM3': 'skt_dcs_DS.bz2_1L_pmi_mir_heldout_again',
57
+ 'BM3': 'skt_dcs_DS.bz2_1L_bigram_heldout'
58
+ }
59
+ modelList = {
60
+ 'PR2': 'outputs/train_{}/nnet_e1_i400.p'.format('t2788294192566'),
61
+ 'BR2': 'outputs/train_{}/nnet_e1_i400.p'.format('t2789415023871'),
62
+ 'PM2': 'outputs/train_{}/nnet_e1_i400.p'.format('t2753954441900'),
63
+ 'BM2': 'outputs/train_{}/nnet_e1_i400.p'.format('t3401216067518'),
64
+ 'PR3': 'outputs/train_{}/nnet_e1_i400.p'.format('t2761370242287'),
65
+ 'BR3': 'outputs/train_{}/nnet_e1_i400.p'.format('t2779114903467'),
66
+ 'PM3': 'outputs/train_{}/nnet_e1_i400.p'.format('t2756013734745'),
67
+ 'BM3': 'outputs/train_{}/nnet_e1_i400.p'.format('t3471903174862')
68
+ }
69
+ modelFile = modelList[tag]
70
+ print('Tag: {}, ModelFile: {}'.format(tag, modelFile))
71
+ print('ProcCount: {}'.format(proc_count))
72
+ _dump = True
73
+ if _dump:
74
+ _outFile = 'outputs/dump_predictions/{}_NLoss'.format(tag)
75
+ else:
76
+ _outFile = None
77
+ print('OutFile: ', _outFile)
78
+
79
+ # Backup the model file
80
+ copyfile(modelFile, modelFile + '.bk')
81
+
82
+ # Create Queue, Result array
83
+ queue = mp.Queue()
84
+ result_arr = []
85
+
86
+ print('Source: ', '../NewData/{}/'.format(ho_folders[tag]))
87
+ # Start 6 workers - 8 slows down the pc
88
+ # proc_count = 4
89
+ procs = [None]*proc_count
90
+ for i in range(proc_count):
91
+ vpid = i
92
+ procs[i] = mp.Process(target = TestPool_Unit.pooled_Test, args = \
93
+ (modelFile, vpid, queue, '../NewData/{}/'.format(ho_folders[tag]), int(9600/proc_count), _dump, _outFile))
94
+ # Start Processes
95
+ for i in range(proc_count):
96
+ procs[i].start()
97
+
98
+ # Fetch partial results
99
+ stillRunning = True
100
+ printer_timer = 100
101
+ while stillRunning:
102
+ stillRunning = False
103
+ for i in range(proc_count):
104
+ p = procs[i]
105
+ # print('Process with\t vpid: {}\t ->\t pid: {}\t ->\t running status: {}'.format(i, p.pid, p.is_alive()))
106
+ if p.is_alive():
107
+ stillRunning = True
108
+
109
+
110
+ if printer_timer == 0:
111
+ printer_timer = 100
112
+ while not queue.empty():
113
+ result_arr.append(queue.get())
114
+ # Evaluate results till now
115
+ if len(result_arr) > 0:
116
+ Evaluate(result_arr)
117
+
118
+ printer_timer -= 1
119
+
120
+ time.sleep(1)
121
+ while not queue.empty():
122
+ result_arr.append(queue.get())
123
+ Evaluate(result_arr)
124
+ for i in range(proc_count):
125
+ procs[i].join()
126
+ def setArgs(_tag, _pc):
127
+ global proc_count, tag
128
+ tag = _tag
129
+ proc_count = _pc
130
+ print('Tag, ProcCount: {}, {}'.format(tag, proc_count))
131
+
132
+ if __name__ == '__main__':
133
+
134
+
135
+ #print('Number of arguments:', len(sys.argv), 'arguments.')
136
+ #print('Argument List:', str(sys.argv))
137
+ parser = OptionParser()
138
+ parser.add_option("-t", "--tag", dest="tag",
139
+ help="Tag for feature set to use", metavar="TAG")
140
+ parser.add_option("-p", "--procs", dest="proc_count", default = 4,
141
+ help="Number of child process", metavar="PROCS")
142
+
143
+ (options, args) = parser.parse_args()
144
+
145
+ options = vars(options)
146
+ _tag = options['tag']
147
+ if _tag is None:
148
+ raise Exception('None is tag')
149
+ pc = int(options['proc_count'])
150
+ setArgs(_tag, pc)
151
+
152
+ main()
153
+
154
+
155
+
156
+
157
+
158
+
159
+
160
+
161
+
162
+
163
+
164
+
165
+
166
+
167
+
168
+
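
Both multiprocessing test drivers are launched the same way; for example, to evaluate the BM2 model on its heldout set with four worker processes:

```python
python sh_TestPool_MP_clique.py -t BM2 -p 4
```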
dir/test_clique.py ADDED
@@ -0,0 +1,174 @@
1
+ import multiprocessing as mp
2
+ import TestPool_Unit_clique
3
+ import csv
4
+ from shutil import copyfile
5
+ import numpy as np
6
+ import time
7
+ import sys
8
+ from optparse import OptionParser
9
+
10
+ from collections import defaultdict
11
+ rwfinal=0
12
+ pwfinal=0
13
+ rlfinal=0
14
+ plfinal=0
15
+ fnum=0
16
+
17
+ def Evaluate(result_arr):
18
+ print('Files Processed: ', len(result_arr))
19
+ recalls = []
20
+ recalls_of_word = []
21
+ precisions = []
22
+ precisions_of_words = []
23
+ fully_Correct_l = 0
24
+ fully_Correct_w = 0
25
+ for entry in result_arr:
26
+ (word_match, lemma_match, n_dcsWords, n_output_nodes) = entry
27
+ recalls.append(lemma_match/n_dcsWords)
28
+ recalls_of_word.append(word_match/n_dcsWords)
29
+ precisions.append(lemma_match/n_output_nodes)
30
+ precisions_of_words.append(word_match/n_output_nodes)
31
+ if lemma_match == n_dcsWords:
32
+ fully_Correct_l += 1
33
+ if word_match == n_dcsWords:
34
+ fully_Correct_w += 1
35
+ print('Avg. Micro Recall of Words: {}'.format(np.mean(np.array(recalls))))
36
+ print('Avg. Micro Recall of Word++s: {}'.format(np.mean(np.array(recalls_of_word))))
37
+ print('Avg. Micro Precision of Words: {}'.format(np.mean(np.array(precisions))))
38
+ print('Avg. Micro Precision of Word++s: {}'.format(np.mean(np.array(precisions_of_words))))
39
+
40
+ rl = np.mean(np.array(recalls))
41
+ pl = np.mean(np.array(precisions))
42
+ print('F-Score of Words: ', (2*pl*rl)/(pl+rl))
43
+ print('Fully Correct Wordwise: {}'.format(fully_Correct_l/len(recalls_of_word)))
44
+ print('Fully Correct Word++wise: {}'.format(fully_Correct_w/len(recalls_of_word)))
45
+ print('[{:0.2f}, {:0.2f}, {:0.2f}, {:0.2f}, {:0.2f}, {:0.2f}, {:0.2f}]'.format(100*np.mean(np.array(recalls)), 100*np.mean(np.array(recalls_of_word)), 100*np.mean(np.array(precisions)), \
46
+ 100*np.mean(np.array(precisions_of_words)), 100*(2*pl*rl)/(pl+rl), 100*fully_Correct_l/len(recalls_of_word),\
47
+ 100*fully_Correct_w/len(recalls_of_word)))
48
+ sys.stdout.flush()
49
+
50
+ tag = None
51
+ proc_count = 4
52
+
53
+ def main():
54
+ global proc_count, tag
55
+ ho_folders = {
56
+ 'PR2': 'skt_dcs_DS.bz2_4K_pmi_rfe_heldout',
57
+ 'BR2': 'skt_dcs_DS.bz2_4K_bigram_rfe_heldout',
58
+ 'PM2': 'skt_dcs_DS.bz2_4K_pmi_mir_heldout',
59
+ 'BM2': 'skt_dcs_DS.bz2_4K_bigram_mir_heldout',
60
+ 'PR3': 'skt_dcs_DS.bz2_1L_pmi_rfe_heldout',
61
+ 'BR3': 'skt_dcs_DS.bz2_1L_bigram_rfe_heldout',
62
+ 'PM3': 'skt_dcs_DS.bz2_1L_pmi_mir_heldout2',
63
+ 'BM3': 'skt_dcs_DS.bz2_1L_bigram_heldout'
64
+ }
65
+ modelList = {
66
+ 'PR2': 'outputs/train_{}/nnet.p'.format('t8006684774222'),
67
+ 'BR2': 'outputs/train_{}/nnet.p'.format('t7978761528557'),
68
+ 'PM2': 'outputs/train_{}/nnet.p'.format('t7323235797178'),
69
+ 'BM2': 'outputs/train_{}/nnet.p'.format('t7978754709018'),
70
+ 'PR3': 'outputs/train_{}/nnet.p'.format('t8006711065860'),
71
+ 'BR3': 'outputs/train_{}/nnet.p'.format('t8103694133496'),
72
+ 'PM3': 'outputs/train_{}/nnet.p'.format('t8006607913382'),
73
+ 'BM3': 'outputs/train_{}/nnet.p'.format('t7274036680592')
74
+ }
75
+ modelFile = modelList[tag]
76
+ print('Tag: {}, ModelFile: {}'.format(tag, modelFile))
77
+ print('ProcCount: {}'.format(proc_count))
78
+ _dump = True
79
+ if _dump:
80
+ _outFile = 'outputs/{}_NLoss'.format(tag)
81
+ else:
82
+ _outFile = None
83
+ print('OutFile: ', _outFile)
84
+
85
+ # Backup the model file
86
+ copyfile(modelFile, modelFile + '.bk')
87
+
88
+ # Create Queue, Result array
89
+ queue = mp.Queue()
90
+ result_arr = []
91
+
92
+ print('Source: ', '../wordsegmentation/{}/'.format(ho_folders[tag]))
93
+ # start proc_count workers (too many can slow the machine down)
94
+ # proc_count = 4
95
+ procs = [None]*proc_count
96
+ for i in range(proc_count):
97
+ vpid = i
98
+ procs[i] = mp.Process(target = TestPool_Unit_clique.pooled_Test, args = \
99
+ (modelFile, vpid, queue, '../wordsegmentation/{}/'.format(ho_folders[tag]), int(9600/proc_count), _dump, _outFile))
100
+ # Start Processes
101
+ for i in range(proc_count):
102
+ procs[i].start()
103
+
104
+ # Fetch partial results
105
+ stillRunning = True
106
+ printer_timer = 100
107
+ while stillRunning:
108
+ stillRunning = False
109
+ for i in range(proc_count):
110
+ p = procs[i]
111
+ # print('Process with\t vpid: {}\t ->\t pid: {}\t ->\t running status: {}'.format(i, p.pid, p.is_alive()))
112
+ if p.is_alive():
113
+ stillRunning = True
114
+
115
+
116
+ if printer_timer == 0:
117
+ printer_timer = 100
118
+ while not queue.empty():
119
+ result_arr.append(queue.get())
120
+ # Evaluate results till now
121
+ if len(result_arr) > 0:
122
+ Evaluate(result_arr)
123
+
124
+ printer_timer -= 1
125
+
126
+ time.sleep(1)
127
+ while not queue.empty():
128
+ result_arr.append(queue.get())
129
+ Evaluate(result_arr)
130
+ for i in range(proc_count):
131
+ procs[i].join()
132
+
133
+
134
+ def setArgs(_tag, _pc):
135
+ global proc_count, tag
136
+ tag = _tag
137
+ proc_count = _pc
138
+ print('Tag, ProcCount: {}, {}'.format(tag, proc_count))
139
+
140
+ if __name__ == '__main__':
141
+
142
+
143
+ parser = OptionParser()
144
+ parser.add_option("-t", "--tag", dest="tag",
145
+ help="Tag for feature set to use", metavar="TAG")
146
+ parser.add_option("-p", "--procs", dest="proc_count", default = 4,
147
+ help="Number of child process", metavar="PROCS")
148
+
149
+ (options, args) = parser.parse_args()
150
+
151
+ options = vars(options)
152
+ _tag = options['tag']
153
+ if _tag is None:
154
+ raise Exception('None is tag')
155
+ pc = int(options['proc_count'])
156
+ setArgs(_tag, pc)
157
+
158
+ main()
159
+
160
+
161
+
162
+
163
+
164
+
165
+
166
+
167
+
168
+
169
+
170
+
171
+
172
+
173
+
174
+
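
The numbers printed by Evaluate() are per-sentence micro averages. A worked example with two toy result tuples of the form (word_match, lemma_match, n_dcsWords, n_output_nodes):

```python
import numpy as np

result_arr = [(3, 4, 5, 5),   # 4 of 5 gold lemmas found, 5 nodes predicted
              (2, 2, 2, 3)]   # all gold lemmas found, one spurious node

recall = np.mean([lm / nd for _, lm, nd, _ in result_arr])     # (0.8 + 1.0) / 2
precision = np.mean([lm / no for _, lm, _, no in result_arr])  # (0.8 + 0.667) / 2
f1 = 2 * precision * recall / (precision + recall)
print(round(recall, 3), round(precision, 3), round(f1, 3))     # 0.9 0.733 0.808
```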
dir/unpack.py ADDED
@@ -0,0 +1,39 @@
1
+ from Train_clique import *
2
+ from heap_n_clique import *
3
+ from nnet import *
4
+ from TestPool_Unit_clique import *
5
+ from sentences import *
6
+
7
+ bz2_input_folder = '../NewData/skt_dcs_DS.bz2_4K_bigram_mir_10K/' #bm2
8
+ # bz2_input_folder = '../NewData/skt_dcs_DS.bz2_1L_bigram_mir_10K/' #bm3
9
+ # bz2_input_folder = '../NewData/skt_dcs_DS.bz2_4K_bigram_rfe_10K/' #br2
10
+ # bz2_input_folder = '../NewData/skt_dcs_DS.bz2_1L_bigram_rfe_10K/' #br3
11
+ # bz2_input_folder = '../NewData/skt_dcs_DS.bz2_4K_pmi_mir_10K/' #pm2
12
+ # bz2_input_folder = '../NewData/skt_dcs_DS.bz2_1L_pmi_mir_10K2/' #pm3
13
+ # bz2_input_folder = '../NewData/skt_dcs_DS.bz2_4K_pmi_rfe_10K/' #pr2
14
+ # bz2_input_folder = '../NewData/skt_dcs_DS.bz2_1L_pmi_rfe_10K/' #pr3
15
+ loaded_SKT = pickle.load(open('../Simultaneous_CompatSKT_10K.p', 'rb'), encoding=u'utf-8')
16
+ loaded_DCS = pickle.load(open('../Simultaneous_DCS_10K.p', 'rb'), encoding=u'utf-8')
17
+
18
+ dsbz2_name = '4442.ds.bz2'
19
+
20
+ (nodelist_correct, conflicts_Dict_correct, featVMat_correct, nodelist_to_correct_mapping,\
21
+ nodelist, conflicts_Dict, featVMat) = open_dsbz2(bz2_input_folder + dsbz2_name)
22
+
23
+ # print(nodelist_correct)
24
+ # print(nodelist)
25
+
26
+
27
+
28
+ sentenceObj = loaded_SKT['4442.p2']
29
+
30
+ # SeeSentence(sentenceObj)
31
+ WScalarMat_correct = Get_W_Scalar_Matrix_from_FeatVect_Matrix(featVMat_correct, nodelist_correct,\
32
+ conflicts_Dict_correct, self.neuralnet)
33
+ source = 0
34
+
35
+ (min_st_gold_ndict, min_st_adj_gold_small, _) =MST(nodelist_correct, WScalarMat_correct, conflicts_Dict_correct, source)
36
+ energy_gold_max_ST = np.sum(WScalarMat_correct[min_st_adj_gold_small])
37
+
38
+
39
+ print(min_st_gold_ndict)
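To see how far the model's preferred analysis is from the gold one, the same two calls can be repeated on the full candidate graph. This is a sketch, assuming `Get_W_Scalar_Matrix_from_FeatVect_Matrix` and `MST` accept the full-graph triple (`featVMat`, `nodelist`, `conflicts_Dict`) exactly as they accept the gold triple above:

```python
# Sketch: score the full candidate graph with the same trained network
WScalarMat = Get_W_Scalar_Matrix_from_FeatVect_Matrix(featVMat, nodelist,
                                                      conflicts_Dict, neuralnet)
(min_st_ndict, min_st_adj_small, _) = MST(nodelist, WScalarMat, conflicts_Dict, source)
energy_pred = np.sum(WScalarMat[min_st_adj_small])
# A lower predicted energy than the gold energy means the model would
# prefer a wrong segmentation for this sentence
print(energy_pred, energy_gold_max_ST)
```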
dir/utilities.py ADDED
@@ -0,0 +1,323 @@
+ import sys as Sys
+ import pickle, re
+ import numpy as np
+ from romtoslp import *
+
+ # Print iterations progress
+ def printProgress(iteration, total, prefix='', suffix='', decimals=2, barLength=100):
+     """
+     Call in a loop to create a terminal progress bar
+     @params:
+         iteration   - Required : current iteration (Int)
+         total       - Required : total iterations (Int)
+         prefix      - Optional : prefix string (Str)
+         suffix      - Optional : suffix string (Str)
+     """
+     filledLength = int(round(barLength * iteration / float(total)))
+     percents = round(100.00 * (iteration / float(total)), decimals)
+     bar = '#' * filledLength + '-' * (barLength - filledLength)
+     Sys.stdout.write('%s [%s] %s%s %s\r' % (prefix, bar, percents, '%', suffix))
+     Sys.stdout.flush()
+     if iteration == total:
+         print("\n")
+
+ def pickleFixLoad(filename):
+     return pickle.load(open(filename, 'rb'), encoding=u'utf-8')
+
+ def validatePickleName(fName):
+     # The dot must be escaped; an unescaped '.' would match any character
+     m = re.search(r"^\w*\.p$", fName)
+     if m is not None:
+         return m.group(0)
+     return ""
+
+ sandhiRules = pickle.load(open('sandhiRules.p', 'rb'))
+ def CanCoExist_sandhi(p1, p2, name1, name2):
+     # p1 must be less than p2 - send the positions in the proper order
+     if p1 < p2:
+         overlap = max((p1 + len(name1)) - p2, 0)
+         if overlap == 0:
+             return True
+         if overlap == 1 or overlap == 2:
+             # Two candidate (left, right) string pairs around the overlap region
+             p1 = (name1[len(name1) - overlap:len(name1):], name2[0])
+             p2 = (name1[-1], name2[0:overlap:])
+             if p1 in sandhiRules:
+                 if sandhiRules[p1]['length'] < len(p1[0]) + len(p1[1]):
+                     return True
+             if p2 in sandhiRules:
+                 if sandhiRules[p2]['length'] < len(p2[0]) + len(p2[1]):
+                     return True
+     return False
+
+ def fix_w_new(word_new_obj):
+     # Lemma normalisation map (note: 'va' appears twice; the later entry, 'iva', wins)
+     dicto = {'asmad': 'mad', 'yuzmad': 'tvad', 'ayam': 'idam', 'agn': 'agni', 'ya': 'yad', 'eza': 'etad',
+              'vd': 'vid', 'va': 'vE', '-tva': 'tva', 'ptta': 'pitta', 'mahat': 'mahant', 'ndra': 'indra',
+              'ap': 'api', 'h': 'hi', 't': 'iti', 'tr': 'tri', 'va': 'iva'}
+
+     for i in range(0, len(word_new_obj.lemmas)):
+         word_new_obj.lemmas[i] = rom_slp(word_new_obj.lemmas[i])
+         word_new_obj.lemmas[i] = word_new_obj.lemmas[i].split('_')[0]
+
+         if word_new_obj.lemmas[i] in dicto:
+             word_new_obj.lemmas[i] = dicto[word_new_obj.lemmas[i]]
+
+         if word_new_obj.lemmas[i] == 'yad':
+             if word_new_obj.names == 'yadi':
+                 word_new_obj.lemmas[i] = 'yadi'
+
+     return word_new_obj
+
+ def FixSentence(sentenceObj):
+     for ci in range(len(sentenceObj.chunk)):
+         for pos in sentenceObj.chunk[ci].chunk_words.keys():
+             for wsi in range(len(sentenceObj.chunk[ci].chunk_words[pos])):
+                 sentenceObj.chunk[ci].chunk_words[pos][wsi] = fix_w_new(sentenceObj.chunk[ci].chunk_words[pos][wsi])
+
+     return sentenceObj
+
+ def FillMissing(sentenceObj, dcsObj):
+     for ci in range(len(sentenceObj.chunk)):
+         corrLemmas = dcsObj.lemmas[ci]
+         cli = 0
+         iamdone = False
+         for pos in sentenceObj.chunk[ci].chunk_words.keys():
+             for wsi in range(len(sentenceObj.chunk[ci].chunk_words[pos])):
+                 ws = sentenceObj.chunk[ci].chunk_words[pos][wsi]
+                 for li in range(len(ws.lemmas)):
+                     if ws.lemmas[li] == rom_slp(corrLemmas[cli]):
+                         # Grow the forms list if needed and fill the CNG from the DCS gold data
+                         if li >= len(ws.forms):
+                             a = [''] * (li + 1)
+                             for i in range(len(ws.forms)):
+                                 a[i] = ws.forms[i]
+                             a[li] = int(dcsObj.cng[ci][cli])
+                             sentenceObj.chunk[ci].chunk_words[pos][wsi].forms = a
+
+                         cli += 1
+                         if cli == len(corrLemmas):
+                             iamdone = True
+                             break
+                 if iamdone:
+                     break
+             if iamdone:
+                 break
+     return sentenceObj
+
+ def loadSentence_with_rom_slp(fName, sntcPath):
+     try:
+         try:
+             if fName[-1] == '2':
+                 dcsObj = pickleFixLoad('../Text Segmentation/DCS_pick/' + fName[:-1])
+             else:
+                 dcsObj = pickleFixLoad('../Text Segmentation/DCS_pick/' + fName)
+         except FileNotFoundError:
+             dcsObj = None
+         sentenceObj = pickleFixLoad(sntcPath)
+         sentenceObj = FixSentence(sentenceObj)
+     except (KeyError, EOFError, pickle.UnpicklingError):
+         print('Failed to load', sntcPath)
+         return None, None
+     return (sentenceObj, dcsObj)
+
+ def loadSentence_nopre(fName, sntcPath):
+     try:
+         if fName[-1] == '2':
+             dcsObj = pickleFixLoad('../Text Segmentation/DCS_pick/' + fName[:-1])
+         else:
+             dcsObj = pickleFixLoad('../Text Segmentation/DCS_pick/' + fName)
+         sentenceObj = pickleFixLoad(sntcPath)
+     except (KeyError, EOFError, pickle.UnpicklingError):
+         print('Failed to load', sntcPath)
+         return None, None
+     return (sentenceObj, dcsObj)
+
+ preList = pickle.load(open('pvb.p', 'rb'))
+ def removePrefix(lemma):
+     for pre in preList:
+         m = re.match(pre, lemma)
+         if m is not None:
+             s = m.span()
+             pat = lemma[s[0]:s[1]]
+             return lemma.split(pat)[1]
+     return lemma
+
+ def GetSolutions(dcsObj):
+     solution = [rom_slp(c) for arr in dcsObj.lemmas for c in arr]
+     solution_no_pvb = [removePrefix(l) for l in solution]
+     return (solution, solution_no_pvb)
+
+ def Accuracy(prediction, dcsObj):
+     # Percentage of gold lemmas that appear in the prediction
+     solution, solution_no_pvb = GetSolutions(dcsObj)
+     ac = 0
+     for x in range(len(solution)):
+         if solution[x] in prediction:
+             ac += 1
+     ac = 100 * ac / len(solution)
+     return ac
+
+ def FullCoverage(skt, dcs):
+     # True when every gold (DCS) lemma of every chunk also appears among the
+     # analyser's (SKT) candidate lemmas for that chunk
+     goodFlag = True
+     for ci in range(len(dcs.lemmas)):
+         dlemmas = [rom_slp(l) for l in dcs.lemmas[ci]]
+         slemmas = []
+         chunk = skt.chunk[ci]
+         for pos in chunk.chunk_words.keys():
+             for wsi in range(len(chunk.chunk_words[pos])):
+                 ws = chunk.chunk_words[pos][wsi]
+                 [slemmas.append(wsl) for wsl in ws.lemmas]
+         for l in dlemmas:
+             if l not in slemmas:
+                 goodFlag = False
+                 break
+         if not goodFlag:
+             break
+     return goodFlag
+
+ def GetFeatNameSet():
+     mat_cngCount_1D = pickle.load(open('../NewData/gauravs/Temporary_1D/mat_cngCount_1D.p', 'rb'), encoding=u'utf-8')
+
+     _full_cnglist = list(mat_cngCount_1D)
+     _cg_count = len(mat_cngCount_1D)
+
+     feats = {}
+     fIndex = 0
+     feats[fIndex] = ('L', 'L'); fIndex += 1
+     feats[fIndex] = ('L', 'C'); fIndex += 1
+     feats[fIndex] = ('L', 'T'); fIndex += 1
+
+     feats[fIndex] = ('C', 'L'); fIndex += 1
+     feats[fIndex] = ('C', 'C'); fIndex += 1
+     feats[fIndex] = ('C', 'T'); fIndex += 1
+
+     feats[fIndex] = ('T', 'L'); fIndex += 1
+     feats[fIndex] = ('T', 'C'); fIndex += 1
+     feats[fIndex] = ('T', 'T'); fIndex += 1
+
+     # Path constraint - length 2 - _cg_count features per block
+
+     # LEMMA->CNG->LEMMA
+     for k in range(0, _cg_count):
+         cng_k = _full_cnglist[k]
+         feats[fIndex + k] = ('L', cng_k, 'L')
+     fIndex += _cg_count
+
+     # LEMMA->CNG->CNG
+     for k in range(0, _cg_count):
+         cng_k = _full_cnglist[k]
+         feats[fIndex + k] = ('L', cng_k, 'C')
+     fIndex += _cg_count
+
+     # LEMMA->CNG->TUP
+     for k in range(0, _cg_count):
+         cng_k = _full_cnglist[k]
+         feats[fIndex + k] = ('L', cng_k, 'T')
+     fIndex += _cg_count
+
+     # CNG->CNG->LEMMA
+     for k in range(0, _cg_count):
+         cng_k = _full_cnglist[k]
+         feats[fIndex + k] = ('C', cng_k, 'L')
+     fIndex += _cg_count
+
+     # CNG->CNG->CNG
+     for k in range(0, _cg_count):
+         cng_k = _full_cnglist[k]
+         feats[fIndex + k] = ('C', cng_k, 'C')
+     fIndex += _cg_count
+
+     # CNG->CNG->TUP
+     for k in range(0, _cg_count):
+         cng_k = _full_cnglist[k]
+         feats[fIndex + k] = ('C', cng_k, 'T')
+     fIndex += _cg_count
+
+     # TUP->CNG->LEMMA :: TOO MANY ZEROS
+     for k in range(0, _cg_count):
+         cng_k = _full_cnglist[k]
+         feats[fIndex + k] = ('T', cng_k, 'L')
+     fIndex += _cg_count
+
+     # TUP->CNG->CNG :: TOO MANY ZEROS
+     for k in range(0, _cg_count):
+         cng_k = _full_cnglist[k]
+         feats[fIndex + k] = ('T', cng_k, 'C')
+     fIndex += _cg_count
+
+     # TUP->CNG->TUP :: TOO MANY ZEROS
+     for k in range(0, _cg_count):
+         cng_k = _full_cnglist[k]
+         feats[fIndex + k] = ('T', cng_k, 'T')
+     fIndex += _cg_count
+
+     # Path constraint - length 3 - _cg_count**2 features per block
+
+     # LEMMA->CGS->CGS->LEMMA
+     for k1 in range(0, _cg_count):
+         cng_k1 = _full_cnglist[k1]
+         for k2 in range(0, _cg_count):
+             cng_k2 = _full_cnglist[k2]
+             feats[fIndex + k1*_cg_count + k2] = ('L', cng_k1, cng_k2, 'L')
+     fIndex += _cg_count**2
+
+     # LEMMA->CGS->CGS->TUP
+     for k1 in range(0, _cg_count):
+         cng_k1 = _full_cnglist[k1]
+         for k2 in range(0, _cg_count):
+             cng_k2 = _full_cnglist[k2]
+             feats[fIndex + k1*_cg_count + k2] = ('L', cng_k1, cng_k2, 'T')
+     fIndex += _cg_count**2
+
+     # TUP->CGS->CGS->LEMMA
+     for k1 in range(0, _cg_count):
+         cng_k1 = _full_cnglist[k1]
+         for k2 in range(0, _cg_count):
+             cng_k2 = _full_cnglist[k2]
+             feats[fIndex + k1*_cg_count + k2] = ('T', cng_k1, cng_k2, 'L')
+     fIndex += _cg_count**2
+
+     # TUP->CGS->CGS->TUP
+     for k1 in range(0, _cg_count):
+         cng_k1 = _full_cnglist[k1]
+         for k2 in range(0, _cg_count):
+             cng_k2 = _full_cnglist[k2]
+             feats[fIndex + k1*_cg_count + k2] = ('T', cng_k1, cng_k2, 'T')
+     fIndex += _cg_count**2
+     return feats
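`GetFeatNameSet` lays the feature vector out in fixed blocks: 9 node-type pair features, nine length-2 path blocks of `_cg_count` features each, and four length-3 path blocks of `_cg_count**2` features each, for `9 + 9*_cg_count + 4*_cg_count**2` features in total. A small sanity check of that layout (it needs the same `mat_cngCount_1D` pickle the function loads):

```python
feats = GetFeatNameSet()
# The ('L', cng, 'L') block contributes exactly one entry per CNG group
c = sum(1 for f in feats.values() if len(f) == 3 and f[0] == 'L' and f[2] == 'L')
assert len(feats) == 9 + 9 * c + 4 * c ** 2
print('CNG groups:', c, '-> total features:', len(feats))
```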
dir/verbs_vs_cngs_matrix_countonly.p ADDED
The diff for this file is too large to render. See raw diff
 
dir/weighted.py ADDED
@@ -0,0 +1,81 @@
+ # coding: utf-8
+
+ import pandas, sys
+ from collections import defaultdict
+
+ # Per-process output CSVs of the held-out test run, grouped by feature tag
+ fils = {
+     'BM3': ["BM3_NLoss_proc0.csv", "BM3_NLoss_proc2.csv", "BM3_NLoss_proc1.csv", "BM3_NLoss_proc3.csv"],
+     'BM2': ["BM2_NLoss_proc0.csv", "BM2_NLoss_proc2.csv", "BM2_NLoss_proc1.csv", "BM2_NLoss_proc3.csv"],
+     'BR2': ["BR2_NLoss_proc0.csv", "BR2_NLoss_proc2.csv", "BR2_NLoss_proc1.csv", "BR2_NLoss_proc3.csv"],
+     'BR3': ["BR3_NLoss_proc0.csv", "BR3_NLoss_proc2.csv", "BR3_NLoss_proc1.csv", "BR3_NLoss_proc3.csv"],
+     'PM2': ["PM2_NLoss_proc0.csv", "PM2_NLoss_proc2.csv", "PM2_NLoss_proc1.csv", "PM2_NLoss_proc3.csv"],
+     'PM3': ["PM3_NLoss_proc0.csv", "PM3_NLoss_proc2.csv", "PM3_NLoss_proc1.csv", "PM3_NLoss_proc3.csv"],
+     'PR2': ["PR2_NLoss_proc0.csv", "PR2_NLoss_proc2.csv", "PR2_NLoss_proc1.csv", "PR2_NLoss_proc3.csv"],
+     'PR3': ["PR3_NLoss_proc0.csv", "PR3_NLoss_proc2.csv", "PR3_NLoss_proc1.csv", "PR3_NLoss_proc3.csv"]
+ }
+
+ def predLoss(tag):
+     gt = defaultdict(dict)
+
+     # Each sentence occupies 6 consecutive CSV lines, all starting with the
+     # sentence ID: predicted lemmas, predicted CNGs, chunk IDs, chunk-ID CNGs,
+     # node IDs, and the summary parameters
+     for item in fils[tag]:
+         fil = open('outputs/' + str(item)).read().splitlines()
+         for i, line in enumerate(fil):
+             if i % 6 == 0:
+                 setCol = line.split(',')
+                 gt[setCol[0]]['predLemma'] = setCol[1:]
+             if i % 6 == 1:
+                 gt[setCol[0]]['predCNG'] = line.split(',')[1:]
+                 if len(gt[setCol[0]]['predLemma']) != len(gt[setCol[0]]['predCNG']):
+                     print(gt[setCol[0]])
+             if i % 6 == 2:
+                 gt[setCol[0]]['chunkID'] = line.split(',')[1:]
+                 if len(gt[setCol[0]]['predLemma']) != len(gt[setCol[0]]['chunkID']):
+                     print(gt[setCol[0]])
+             if i % 6 == 3:
+                 gt[setCol[0]]['chunkIDCNG'] = line.split(',')[1:]
+                 if len(gt[setCol[0]]['predLemma']) != len(gt[setCol[0]]['chunkIDCNG']):
+                     print(gt[setCol[0]])
+             if i % 6 == 4:
+                 gt[setCol[0]]['idInNodeID'] = line.split(',')[1:]
+                 if len(gt[setCol[0]]['predLemma']) != len(gt[setCol[0]]['idInNodeID']):
+                     print(gt[setCol[0]])
+             if i % 6 == 5:
+                 gt[setCol[0]]['params'] = line.split(',')[1:]
+
+                 # Sanity check: all 6 lines of a record carry the same sentence ID
+                 if line.split(',')[0] != setCol[0]:
+                     print(i, setCol, line)
+                     print('breaking')
+                     break
+     return gt
+
+ def pdframe(gt):
+     # Precision = correct / number of predictions; recall = correct / gold size
+     params = defaultdict(dict)
+     for item in gt.keys():
+         tatkal = gt[item]['params']
+         params[item]['corrWords'], params[item]['corrLemma'] = int(tatkal[0]), int(tatkal[1])
+         params[item]['dcsSize'], params[item]['predictions'] = int(tatkal[2]), int(tatkal[3])
+         params[item]['wordPrec'] = params[item]['corrWords'] * 1.0 / params[item]['predictions']
+         params[item]['wordReca'] = params[item]['corrWords'] * 1.0 / params[item]['dcsSize']
+         params[item]['lemmaPrec'] = params[item]['corrLemma'] * 1.0 / params[item]['predictions']
+         params[item]['lemmaReca'] = params[item]['corrLemma'] * 1.0 / params[item]['dcsSize']
+
+     initRes = pandas.DataFrame.from_dict(params, orient='index')
+     return initRes
+
+ if len(sys.argv) < 2:
+     print("Provide an argument for the feature to be evaluated")
+ else:
+     BM2gt = predLoss(str(sys.argv[1]))
+     BM2pd = pdframe(BM2gt)
+     print(BM2pd.mean())
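`weighted.py` merges the four per-process CSVs for one feature tag and prints the mean word- and lemma-level precision and recall, for example:

```python
python weighted.py BM2
```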
dir/wordTypeCheckFunction.py ADDED
@@ -0,0 +1,281 @@
+ def splitTillPeriod(config, listInput):
+     # `config` must be a non-empty string. Appends the part of `config`
+     # before the first period (spaces removed) to `listInput` and returns
+     # the remainder, or '' if no period was found.
+     configList = list(config)
+     out = ''
+     periodIndex = 0
+     val = ''
+     for i, val in enumerate(configList):
+         periodIndex = i
+         if val == '.':
+             break
+         if val != " ":
+             out = out + val
+     if val != ".":
+         config1 = " ".join(config.split())
+         listInput.append(config1)
+         return ""
+     else:
+         config1 = "".join(configList[(periodIndex + 1):])
+         listInput.append(out)
+         return config1
+
+ def wordTypeCheck(form, config):
+     # `form` is the word class ('indeclinable', 'compound', 'undetermined',
+     # 'noun' or 'verb'); a noun config is assumed to have 3 period-separated parts
+
+     nounMapping={28: 'xt?', 29: 'Nom. sg. masc.', 30: 'Nom. sg. fem.', 31: 'Nom. sg. neutr.', 32: 'Nom. sg. adj.', 33: 'xt?', 34: 'Nom. du. masc.', 35: 'Nom. du. fem.', 36: 'Nom. du. neutr.', 37: 'Nom. du. adj.', 38: 'xt?', 39: 'Nom. pl. masc.', 40: 'Nom. pl. fem.', 41: 'Nom. pl. neutr.', 42: 'Nom. pl. adj.', 48: 'xt?', 49: 'Voc. sg. masc.', 50: 'Voc. sg. fem.', 51: 'Voc. sg. neutr.', 54: 'Voc. du. masc.', 55: 'Voc. du. fem.', 56: 'Voc. du. neutr.', 58: 'xt?', 59: 'Voc. pl. masc.', 60: 'Voc. pl. fem.', 61: 'Voc. pl. neutr.', 68: 'xt?', 69: 'Acc. sg. masc.', 70: 'Acc. sg. fem.', 71: 'Acc. sg. neutr.', 72: 'Acc. sg. adj.', 73: 'xt?', 74: 'Acc. du. masc.', 75: 'Acc. du. fem.', 76: 'Acc. du. neutr.', 77: 'Acc. du. adj.', 78: 'xt?', 79: 'Acc. pl. masc.', 80: 'Acc. pl. fem.', 81: 'Acc. pl. neutr.', 82: 'Acc. pl. adj.', 88: 'xt?', 89: 'Instr. sg. masc.', 90: 'Instr. sg. fem.', 91: 'Instr. sg. neutr.', 92: 'Instr. sg. adj.', 93: 'xt?', 94: 'Instr. du. masc.', 95: 'Instr. du. fem.', 96: 'Instr. du. neutr.', 97: 'Instr. du. adj.', 98: 'xt?', 99: 'Instr. pl. masc.', 100: 'Instr. pl. fem.', 101: 'Instr. pl. neutr.', 102: 'Instr. pl. adj.', 108: 'xt?', 109: 'Dat. sg. masc.', 110: 'Dat. sg. fem.', 111: 'Dat. sg. neutr.', 112: 'Dat. sg. adj.', 114: 'Dat. du. masc.', 115: 'Dat. du. fem.', 116: 'Dat. du. neutr.', 117: 'Dat. du. adj.', 118: 'xt?', 119: 'Dat. pl. masc.', 120: 'Dat. pl. fem.', 121: 'Dat. pl. neutr.', 122: 'Dat. pl. adj.', 128: 'xt?', 129: 'Abl. sg. masc.', 130: 'Abl. sg. fem.', 131: 'Abl. sg. neutr.', 132: 'Abl. sg. adj.', 134: 'Abl. du. masc.', 135: 'Abl. du. fem.', 136: 'Abl. du. neutr.', 137: 'Abl. du. adj.', 138: 'xt?', 139: 'Abl. pl. masc.', 140: 'Abl. pl. fem.', 141: 'Abl. pl. neutr.', 142: 'Abl. pl. adj.', 148: 'xt?', 149: 'Gen. sg. masc.', 150: 'Gen. sg. fem.', 151: 'Gen. sg. neutr.', 152: 'Gen. sg. adj.', 153: 'xt?', 154: 'Gen. du. masc.', 155: 'Gen. du. fem.', 156: 'Gen. du. neutr.', 157: 'Gen. du. adj.', 158: 'xt?', 159: 'Gen. pl. masc.', 160: 'Gen. pl. fem.', 161: 'Gen. pl. neutr.', 162: 'Gen. pl. adj.', 168: 'xt?', 169: 'Loc. sg. masc.', 170: 'Loc. sg. fem.', 171: 'Loc. sg. neutr.', 172: 'Loc. sg. adj.', 173: 'xt?', 174: 'Loc. du. masc.', 175: 'Loc. du. fem.', 176: 'Loc. du. neutr.', 177: 'Loc. du. adj.', 178: 'xt?', 179: 'Loc. pl. masc.', 180: 'Loc. pl. fem.', 181: 'Loc. pl. neutr.', 182: 'Loc. pl. adj.'}
+     verbMapping1={1: 'pr. [*] ac.', 2: 'opt. [*] ac.', 3: 'imp. [*] ac', 4: 'impft. [*] ac.', 5: 'fut. ac/ps.', 6: 'cond. ac/ps.', 7: 'per. fut. ac/ps.', 8: 'aor. [1] ac/ps.', 9: 'aor. [2] ac/ps.', 10: 'aor. [3] ac/ps.', 11: 'aor. [4] ac/ps.', 12: 'aor. [5] ac/ps.', 13: 'aor. [7] ac/ps.', 14: 'ben. ac/ps.', 15: 'pft. ac.', 16: 'per. pft.', 19: 'pp.', 20: 'ppa.', 21: 'pfp.', 22: 'inf.', 23: 'abs.', 24: 'pr. ps.', 26: 'imp. ps.', 27: 'impft. ps.', 28: 'aor. ps.', 29: 'opt. ps.', 30: 'ou'}
+     verbMapping2={1: 'sg. 1', 2: 'sg. 2', 3: 'sg. 3', 4: 'du. 1', 5: 'du. 2', 6: 'du. 3', 7: 'pl. 1', 8: 'pl. 2', 9: 'pl. 3'}
+
+     if form == 'indeclinable':
+         if config == 'part.':
+             return 2
+         elif config == 'conj.':
+             return 2
+         elif config == 'abs.':
+             return -230
+         elif config == 'prep.':
+             return 2
+         elif config == 'ind.':
+             return 2
+         elif config == 'ca. abs.':
+             return -230
+         else:
+             return 'config is invalid'
+
+     elif form == 'compound':
+         if config == 'iic.':
+             return 3
+         elif config == 'iiv.':
+             return 3
+         else:
+             return 'config is invalid'
+
+     elif form == 'undetermined':
+         if config == 'adv.':
+             return 2
+         elif config == 'und.':
+             return 1
+         elif config == 'tasil':
+             return 1
+         else:
+             return 'config is invalid'
+
+     elif form == 'noun':
+         config1 = config
+         x = []
+         config1 = splitTillPeriod(config1, x)
+         one = x[0]
+         x = []
+         config1 = splitTillPeriod(config1, x)
+         two = x[0]
+         x = []
+         config1 = splitTillPeriod(config1, x)
+         three = x[0]
+
+         # '*' in the gender slot marks an adjective, which maps to the cell
+         # right after the corresponding neuter entry
+         isAdj = 0
+         if three == '*':
+             three = 'n'
+             isAdj = 1
+
+         # Single-letter case abbreviations 'i' (Instr.) and 'g' (Gen.) are
+         # special-cased, since plain substring matching would be ambiguous
+         for i in nounMapping.keys():
+             if one != 'i' and one != 'g':
+                 if one[len(one)-2:] in nounMapping[i]:
+                     if two in nounMapping[i]:
+                         if three in nounMapping[i]:
+                             if isAdj == 0:
+                                 return i
+                             else:
+                                 return i + 1
+             elif one == 'i':
+                 if 'Instr' in nounMapping[i]:
+                     if two in nounMapping[i]:
+                         if three == 'n':
+                             if 'neutr' in nounMapping[i]:
+                                 if isAdj == 0:
+                                     return i
+                                 else:
+                                     return i + 1
+                         elif three in nounMapping[i]:
+                             return i
+             elif one == 'g':
+                 if 'Gen' in nounMapping[i]:
+                     if two in nounMapping[i]:
+                         if three == 'n':
+                             if 'neutr' in nounMapping[i]:
+                                 if isAdj == 0:
+                                     return i
+                                 else:
+                                     return i + 1
+                         elif three in nounMapping[i]:
+                             return i
+
+     elif form == 'verb':
+         unit = 0
+         ten = 0
+         # Strip a leading causative/desiderative/intensive marker (ca/des/int)
+         x = []
+         configActual = config
+         config = splitTillPeriod(config, x)
+         if x[0] == 'ca' or x[0] == 'des' or x[0] == 'int':
+             pass  # marker stripped; keep the shortened config
+         else:
+             config = configActual
+         # Drop a 'vn.' marker if present
+         if 'vn.' in config:
+             config = config.replace('vn.', '')
+
+         x = []
+         config = splitTillPeriod(config, x)
+
+         one = x[0]
+         two = ''
+         three = ''
+         ONE = ''
+         TWO = ''
+
+         # Walk the remaining period-separated tokens: tense/voice tokens go to
+         # two/three, the number (sg/du/pl) to ONE and the person (1/2/3) to TWO
+         if config != '':
+             x = []
+             config = splitTillPeriod(config, x)
+             temp = x[0]
+             if temp != 'sg' and temp != 'pl' and temp != 'du':
+                 two = temp
+             else:
+                 ONE = temp
+
+         if config != '':
+             x = []
+             config = splitTillPeriod(config, x)
+             temp = x[0]
+             if temp != 'sg' and temp != 'pl' and temp != 'du':
+                 if ONE == '':
+                     three = temp
+                 elif ONE != '':
+                     TWO = temp
+             else:
+                 ONE = temp
+         if config != '':
+             x = []
+             config = splitTillPeriod(config, x)
+             temp = x[0]
+             if temp == 'sg' or temp == 'pl' or temp == 'du':
+                 ONE = temp
+             elif temp == '1' or temp == '2' or temp == '3':
+                 TWO = temp
+
+         if config != '':
+             x = []
+             config = splitTillPeriod(config, x)
+             temp = x[0]
+             if temp == '1' or temp == '2' or temp == '3':
+                 TWO = temp
+
+         # Number-person cell (cf. verbMapping2)
+         for i in verbMapping2.keys():
+             if ONE != '':
+                 if ONE in verbMapping2[i] and TWO in verbMapping2[i]:
+                     unit = i
+                     break
+
+         # Tense/mode class (cf. verbMapping1)
+         if one == 'pp':
+             ten = 19
+         if one == 'ppa':
+             ten = 20
+         if one == 'pfp':
+             ten = 21
+         if one == 'inf':
+             ten = 22
+         if one == 'abs':
+             ten = 23
+         if one == 'inj':
+             ten = 30
+
+         if one == 'pr' or one == 'ppr':
+             if two == 'ps':
+                 ten = 24
+         if one == 'imp':
+             if two == 'ps':
+                 ten = 26
+         if one == 'impft':
+             if two == 'ps':
+                 ten = 27
+         if one == 'aor':
+             if two == 'ps':
+                 ten = 28
+         if one == 'opt':
+             if two == 'ps':
+                 ten = 29
+
+         if one == 'pr' or one == 'ppr':
+             if 'ac' in two or 'md' in two:
+                 ten = 1
+         if one == 'opt':
+             if 'ac' in two or 'md' in two:
+                 ten = 2
+         if one == 'imp':
+             if 'ac' in two or 'md' in two:
+                 ten = 3
+         if one == 'impft':
+             if 'ac' in two or 'md' in two:
+                 ten = 4
+         if one == 'pft' or one == 'ppf':
+             if 'ac' in two or 'md' in two:
+                 ten = 15
+
+         if one == 'per':
+             if two == 'pft':
+                 ten = 16
+
+         if one == 'fut' or one == 'pfu':
+             if 'ac' in two or 'ps' in two or 'md' in two:
+                 ten = 5
+         if one == 'cond':
+             if 'ac' in two or 'ps' in two or 'md' in two:
+                 ten = 6
+         if one == 'ben':
+             if 'ac' in two or 'ps' in two or 'md' in two:
+                 ten = 14
+
+         if one == 'aor':
+             if 'ac' in two or 'ps' in two or 'md' in two:
+                 if '1' in two:
+                     ten = 8
+                 if '2' in two:
+                     ten = 9
+                 if '3' in two:
+                     ten = 10
+                 if '4' in two:
+                     ten = 11
+                 if '5' in two or '6' in two:
+                     ten = 12
+                 if '7' in two:
+                     ten = 13
+
+         if one == 'per':
+             if two == 'fut':
+                 if ('ac' in three) or ('ps' in three) or ('md' in three):
+                     ten = 7
+
+         # Verb codes are negative: -(tense-class * 10 + number-person cell)
+         if ten != 0:
+             output = -1 * (ten * 10 + unit)
+             return output
+         # Unrecognised verb form: fall through (the function returns None)
+
+     else:
+         return 'none'
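`wordTypeCheck` compresses a verb analysis into a single negative code, `-(ten*10 + unit)`, where `ten` is the tense/mode class (cf. `verbMapping1`) and `unit` the number-person cell (cf. `verbMapping2`). Two illustrative calls; the verb config string is an assumed example following the `pr. [*] ac.` pattern from `verbMapping1`:

```python
print(wordTypeCheck('indeclinable', 'abs.'))       # -230
print(wordTypeCheck('verb', 'pr. [1] ac. sg. 1'))  # -(1*10 + 1) = -11
```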
dir/word_definite.py ADDED
The diff for this file is too large to render. See raw diff
 
dir/word_definite[d_1500_BM2_v12].py ADDED
The diff for this file is too large to render. See raw diff
 
requirements.txt ADDED
@@ -0,0 +1,68 @@
+ asttokens==3.0.0
+ bz2file==0.98
+ certifi==2025.7.14
+ charset-normalizer==3.4.2
+ comm==0.2.3
+ debugpy==1.8.15
+ decorator==5.2.1
+ dill==0.4.0
+ exceptiongroup==1.3.0
+ executing==2.2.0
+ filelock==3.18.0
+ fsspec==2025.7.0
+ hf-xet==1.1.5
+ huggingface-hub==0.34.1
+ idna==3.10
+ ipykernel==6.30.0
+ ipython==8.37.0
+ jedi==0.19.2
+ Jinja2==3.1.6
+ jupyter_client==8.6.3
+ jupyter_core==5.8.1
+ matplotlib-inline==0.1.7
+ mpmath==1.3.0
+ nest-asyncio==1.6.0
+ networkx==3.4.2
+ numpy==1.26.4
+ nvidia-cublas-cu12==12.6.4.1
+ nvidia-cuda-cupti-cu12==12.6.80
+ nvidia-cuda-nvrtc-cu12==12.6.77
+ nvidia-cuda-runtime-cu12==12.6.77
+ nvidia-cudnn-cu12==9.5.1.17
+ nvidia-cufft-cu12==11.3.0.4
+ nvidia-cufile-cu12==1.11.1.6
+ nvidia-curand-cu12==10.3.7.77
+ nvidia-cusolver-cu12==11.7.1.2
+ nvidia-cusparse-cu12==12.5.4.2
+ nvidia-cusparselt-cu12==0.6.3
+ nvidia-nccl-cu12==2.26.2
+ nvidia-nvjitlink-cu12==12.6.85
+ nvidia-nvtx-cu12==12.6.77
+ packaging==25.0
+ pandas==2.3.1
+ parso==0.8.4
+ prompt_toolkit==3.0.51
+ pure_eval==0.2.3
+ Pygments==2.19.2
+ python-dateutil==2.9.0.post0
+ pytz==2025.2
+ PyYAML==6.0.2
+ pyzmq==27.0.0
+ regex==2024.11.6
+ requests==2.32.4
+ safetensors==0.5.3
+ scipy==1.15.3
+ sentencepiece==0.2.0
+ stack-data==0.6.3
+ sympy==1.14.0
+ tokenizers==0.21.2
+ torch==2.7.1
+ tornado==6.5.1
+ tqdm==4.67.1
+ traitlets==5.14.3
+ transformers==4.54.0
+ triton==3.3.1
+ typing_extensions==4.14.1
+ tzdata==2025.2
+ urllib3==2.5.0
+ wcwidth==0.2.13