nullHawk committed on
Commit
c184121
·
verified ·
0 Parent(s):

add: notebook

semantic_search_using_word2vec_runpod_latest.ipynb ADDED
@@ -0,0 +1,2358 @@
1
+ {
2
+ "cells": [
3
+ {
4
+ "cell_type": "markdown",
5
+ "metadata": {
6
+ "id": "z9VPy5izbVuc"
7
+ },
8
+ "source": [
9
+ "### Semantic search on arXiv research papers using word2vec skip-gram: https://arxiv.org/abs/1301.3781\n",
10
+ "- **Use Case:** Build a semantic search engine for research papers\n",
11
+ "- **Model Used:** Skip-gram word2vec for generating embeddings\n",
12
+ "- **Search Technique:** Cosine similarity, FAISS\n",
13
+ "- **Tokenizer Used:** `sent_tokenize` from the nltk library"
14
+ ]
15
+ },
16
+ {
17
+ "cell_type": "code",
18
+ "execution_count": 30,
19
+ "metadata": {
20
+ "colab": {
21
+ "base_uri": "https://localhost:8080/"
22
+ },
23
+ "id": "mKW7m--JbVue",
24
+ "outputId": "8464bdee-2f79-4696-9c12-957d17a0fdd7"
25
+ },
26
+ "outputs": [],
27
+ "source": [
28
+ "!pip install -q langdetect\n",
29
+ "!pip install -q gensim\n",
30
+ "!pip install -q nltk\n",
31
+ "!pip install -q spacy\n",
32
+ "!pip install -q kagglehub\n",
34
+ "!pip install -q pandas\n",
35
+ "!pip install -q numpy\n",
36
+ "!pip install -q tqdm\n",
37
+ "!pip install -q gdown\n",
38
+ "!pip install -q fastparquet\n",
39
+ "!pip install -q pyarrow\n",
40
+ "!pip install -q faiss-cpu"
41
+ ]
42
+ },
43
+ {
44
+ "cell_type": "code",
45
+ "execution_count": 8,
46
+ "metadata": {},
47
+ "outputs": [
48
+ {
49
+ "name": "stdout",
50
+ "output_type": "stream",
51
+ "text": [
52
+ "Downloading...\n",
53
+ "From (original): https://drive.google.com/uc?id=1wTB4q0o2qREFNXcS2HZ3Y5tBlu97kF2o\n",
54
+ "From (redirected): https://drive.google.com/uc?id=1wTB4q0o2qREFNXcS2HZ3Y5tBlu97kF2o&confirm=t&uuid=13f5bd40-8f1d-469a-9b4e-4410b512530b\n",
55
+ "To: /workspace/data.pkl\n",
56
+ "100%|██████████████████████████████████████| 3.99G/3.99G [00:57<00:00, 69.8MB/s]\n",
57
+ "Downloading...\n",
58
+ "From: https://drive.google.com/uc?id=1lqo3Gyubefop6DlSnXEru65-sEnx2m9y\n",
59
+ "To: /workspace/word2vec_arxiv_skipgram.model\n",
60
+ "100%|██████████████████████████████████████| 11.3M/11.3M [00:00<00:00, 36.8MB/s]\n",
61
+ "Downloading...\n",
62
+ "From (original): https://drive.google.com/uc?id=1LrZaN2h5TVLC30Yz-VzYRsTQFNk7zSfC\n",
63
+ "From (redirected): https://drive.google.com/uc?id=1LrZaN2h5TVLC30Yz-VzYRsTQFNk7zSfC&confirm=t&uuid=3b8d4172-5e26-48ff-8593-0da1d51240f9\n",
64
+ "To: /workspace/word2vec_arxiv_skipgram.model.syn1neg.npy\n",
65
+ "100%|████████████████████████████████████████| 135M/135M [00:04<00:00, 27.8MB/s]\n",
66
+ "Downloading...\n",
67
+ "From (original): https://drive.google.com/uc?id=1QA_7_YwC6p2_7HhJ19-mqyIKRV-yfFiG\n",
68
+ "From (redirected): https://drive.google.com/uc?id=1QA_7_YwC6p2_7HhJ19-mqyIKRV-yfFiG&confirm=t&uuid=7186517e-df73-4662-85b3-84890b962fc2\n",
69
+ "To: /workspace/word2vec_arxiv_skipgram.model.wv.vectors.npy\n",
70
+ "100%|████████████████████████████████████████| 135M/135M [00:03<00:00, 35.4MB/s]\n"
71
+ ]
72
+ }
73
+ ],
74
+ "source": [
75
+ "!gdown 1wTB4q0o2qREFNXcS2HZ3Y5tBlu97kF2o\n",
76
+ "!gdown 1lqo3Gyubefop6DlSnXEru65-sEnx2m9y\n",
77
+ "!gdown 1LrZaN2h5TVLC30Yz-VzYRsTQFNk7zSfC\n",
78
+ "!gdown 1QA_7_YwC6p2_7HhJ19-mqyIKRV-yfFiG"
79
+ ]
80
+ },
81
+ {
82
+ "cell_type": "code",
83
+ "execution_count": 2,
84
+ "metadata": {
85
+ "colab": {
86
+ "base_uri": "https://localhost:8080/"
87
+ },
88
+ "id": "V_EtGue3cMXF",
89
+ "outputId": "7914db9f-8050-429c-ebbe-1234080ae59c"
90
+ },
91
+ "outputs": [
92
+ {
93
+ "name": "stderr",
94
+ "output_type": "stream",
95
+ "text": [
96
+ "[nltk_data] Downloading package punkt_tab to /root/nltk_data...\n",
97
+ "[nltk_data] Unzipping tokenizers/punkt_tab.zip.\n"
98
+ ]
99
+ },
100
+ {
101
+ "data": {
102
+ "text/plain": [
103
+ "True"
104
+ ]
105
+ },
106
+ "execution_count": 2,
107
+ "metadata": {},
108
+ "output_type": "execute_result"
109
+ }
110
+ ],
111
+ "source": [
112
+ "import nltk\n",
113
+ "nltk.download('punkt_tab')"
114
+ ]
115
+ },
116
+ {
117
+ "cell_type": "code",
118
+ "execution_count": 9,
119
+ "metadata": {
120
+ "id": "0O2gP26JbVuf"
121
+ },
122
+ "outputs": [],
123
+ "source": [
124
+ "import spacy\n",
125
+ "import re\n",
126
+ "\n",
127
+ "import pandas as pd\n",
128
+ "import numpy as np\n",
129
+ "import kagglehub\n",
130
+ "\n",
131
+ "from tqdm.notebook import tqdm\n",
132
+ "from gensim.models import Word2Vec\n",
133
+ "from langdetect import DetectorFactory, detect\n",
134
+ "from nltk.tokenize import sent_tokenize, word_tokenize"
135
+ ]
136
+ },
137
+ {
138
+ "cell_type": "code",
139
+ "execution_count": 4,
140
+ "metadata": {
141
+ "id": "kXES0Z6ybVuf"
142
+ },
143
+ "outputs": [],
144
+ "source": [
145
+ "tqdm.pandas()"
146
+ ]
147
+ },
148
+ {
149
+ "cell_type": "markdown",
150
+ "metadata": {},
151
+ "source": [
152
+ "## Building a Word2Vec model for the arXiv dataset"
153
+ ]
154
+ },
155
+ {
156
+ "cell_type": "code",
157
+ "execution_count": 16,
158
+ "metadata": {
159
+ "colab": {
160
+ "base_uri": "https://localhost:8080/"
161
+ },
162
+ "id": "A0QAH1qubfoT",
163
+ "outputId": "76a9f028-3706-4dc5-e8bd-3bf24b69ac53"
164
+ },
165
+ "outputs": [
166
+ {
167
+ "name": "stdout",
168
+ "output_type": "stream",
169
+ "text": [
170
+ "✅ Dataset downloaded at: /root/.cache/kagglehub/datasets/Cornell-University/arxiv/versions/259\n"
171
+ ]
172
+ }
173
+ ],
174
+ "source": [
175
+ "# Download the dataset locally (returns the local path)\n",
176
+ "dataset_path = kagglehub.dataset_download(\"Cornell-University/arxiv\")\n",
177
+ "\n",
178
+ "print(\"✅ Dataset downloaded at:\", dataset_path)\n",
179
+ "\n",
180
+ "# Load using pandas with lines=True\n",
181
+ "file_path = f\"{dataset_path}/arxiv-metadata-oai-snapshot.json\"\n",
182
+ "\n",
183
+ "chunks = pd.read_json(file_path, lines=True, chunksize=10000)\n"
184
+ ]
185
+ },
186
+ {
187
+ "cell_type": "code",
188
+ "execution_count": null,
189
+ "metadata": {},
190
+ "outputs": [],
191
+ "source": [
192
+ "chunks = pd.read_json(\n",
193
+ " \"/kaggle/input/arxiv/arxiv-metadata-oai-snapshot.json\",\n",
194
+ " lines=True,\n",
195
+ " chunksize=10000 # tune this based on memory\n",
196
+ ")"
197
+ ]
198
+ },
199
+ {
200
+ "cell_type": "code",
201
+ "execution_count": 17,
202
+ "metadata": {
203
+ "colab": {
204
+ "base_uri": "https://localhost:8080/",
205
+ "height": 695
206
+ },
207
+ "id": "LDMbjWV1bVug",
208
+ "outputId": "a75a4470-8f17-4d66-ced6-4b02ddce8df3"
209
+ },
210
+ "outputs": [
211
+ {
212
+ "data": {
213
+ "text/html": [
214
+ "<div>\n",
215
+ "<style scoped>\n",
216
+ " .dataframe tbody tr th:only-of-type {\n",
217
+ " vertical-align: middle;\n",
218
+ " }\n",
219
+ "\n",
220
+ " .dataframe tbody tr th {\n",
221
+ " vertical-align: top;\n",
222
+ " }\n",
223
+ "\n",
224
+ " .dataframe thead th {\n",
225
+ " text-align: right;\n",
226
+ " }\n",
227
+ "</style>\n",
228
+ "<table border=\"1\" class=\"dataframe\">\n",
229
+ " <thead>\n",
230
+ " <tr style=\"text-align: right;\">\n",
231
+ " <th></th>\n",
232
+ " <th>id</th>\n",
233
+ " <th>submitter</th>\n",
234
+ " <th>authors</th>\n",
235
+ " <th>title</th>\n",
236
+ " <th>comments</th>\n",
237
+ " <th>journal-ref</th>\n",
238
+ " <th>doi</th>\n",
239
+ " <th>report-no</th>\n",
240
+ " <th>categories</th>\n",
241
+ " <th>license</th>\n",
242
+ " <th>abstract</th>\n",
243
+ " <th>versions</th>\n",
244
+ " <th>update_date</th>\n",
245
+ " <th>authors_parsed</th>\n",
246
+ " </tr>\n",
247
+ " </thead>\n",
248
+ " <tbody>\n",
249
+ " <tr>\n",
250
+ " <th>0</th>\n",
251
+ " <td>704.0001</td>\n",
252
+ " <td>Pavel Nadolsky</td>\n",
253
+ " <td>C. Bal\\'azs, E. L. Berger, P. M. Nadolsky, C.-...</td>\n",
254
+ " <td>Calculation of prompt diphoton production cros...</td>\n",
255
+ " <td>37 pages, 15 figures; published version</td>\n",
256
+ " <td>Phys.Rev.D76:013009,2007</td>\n",
257
+ " <td>10.1103/PhysRevD.76.013009</td>\n",
258
+ " <td>ANL-HEP-PR-07-12</td>\n",
259
+ " <td>hep-ph</td>\n",
260
+ " <td>None</td>\n",
261
+ " <td>A fully differential calculation in perturba...</td>\n",
262
+ " <td>[{'version': 'v1', 'created': 'Mon, 2 Apr 2007...</td>\n",
263
+ " <td>2008-11-26</td>\n",
264
+ " <td>[[Balázs, C., ], [Berger, E. L., ], [Nadolsky,...</td>\n",
265
+ " </tr>\n",
266
+ " <tr>\n",
267
+ " <th>1</th>\n",
268
+ " <td>704.0002</td>\n",
269
+ " <td>Louis Theran</td>\n",
270
+ " <td>Ileana Streinu and Louis Theran</td>\n",
271
+ " <td>Sparsity-certifying Graph Decompositions</td>\n",
272
+ " <td>To appear in Graphs and Combinatorics</td>\n",
273
+ " <td>None</td>\n",
274
+ " <td>None</td>\n",
275
+ " <td>None</td>\n",
276
+ " <td>math.CO cs.CG</td>\n",
277
+ " <td>http://arxiv.org/licenses/nonexclusive-distrib...</td>\n",
278
+ " <td>We describe a new algorithm, the $(k,\\ell)$-...</td>\n",
279
+ " <td>[{'version': 'v1', 'created': 'Sat, 31 Mar 200...</td>\n",
280
+ " <td>2008-12-13</td>\n",
281
+ " <td>[[Streinu, Ileana, ], [Theran, Louis, ]]</td>\n",
282
+ " </tr>\n",
283
+ " <tr>\n",
284
+ " <th>2</th>\n",
285
+ " <td>704.0003</td>\n",
286
+ " <td>Hongjun Pan</td>\n",
287
+ " <td>Hongjun Pan</td>\n",
288
+ " <td>The evolution of the Earth-Moon system based o...</td>\n",
289
+ " <td>23 pages, 3 figures</td>\n",
290
+ " <td>None</td>\n",
291
+ " <td>None</td>\n",
292
+ " <td>None</td>\n",
293
+ " <td>physics.gen-ph</td>\n",
294
+ " <td>None</td>\n",
295
+ " <td>The evolution of Earth-Moon system is descri...</td>\n",
296
+ " <td>[{'version': 'v1', 'created': 'Sun, 1 Apr 2007...</td>\n",
297
+ " <td>2008-01-13</td>\n",
298
+ " <td>[[Pan, Hongjun, ]]</td>\n",
299
+ " </tr>\n",
300
+ " <tr>\n",
301
+ " <th>3</th>\n",
302
+ " <td>704.0004</td>\n",
303
+ " <td>David Callan</td>\n",
304
+ " <td>David Callan</td>\n",
305
+ " <td>A determinant of Stirling cycle numbers counts...</td>\n",
306
+ " <td>11 pages</td>\n",
307
+ " <td>None</td>\n",
308
+ " <td>None</td>\n",
309
+ " <td>None</td>\n",
310
+ " <td>math.CO</td>\n",
311
+ " <td>None</td>\n",
312
+ " <td>We show that a determinant of Stirling cycle...</td>\n",
313
+ " <td>[{'version': 'v1', 'created': 'Sat, 31 Mar 200...</td>\n",
314
+ " <td>2007-05-23</td>\n",
315
+ " <td>[[Callan, David, ]]</td>\n",
316
+ " </tr>\n",
317
+ " <tr>\n",
318
+ " <th>4</th>\n",
319
+ " <td>704.0005</td>\n",
320
+ " <td>Alberto Torchinsky</td>\n",
321
+ " <td>Wael Abu-Shammala and Alberto Torchinsky</td>\n",
322
+ " <td>From dyadic $\\Lambda_{\\alpha}$ to $\\Lambda_{\\a...</td>\n",
323
+ " <td>None</td>\n",
324
+ " <td>Illinois J. Math. 52 (2008) no.2, 681-689</td>\n",
325
+ " <td>None</td>\n",
326
+ " <td>None</td>\n",
327
+ " <td>math.CA math.FA</td>\n",
328
+ " <td>None</td>\n",
329
+ " <td>In this paper we show how to compute the $\\L...</td>\n",
330
+ " <td>[{'version': 'v1', 'created': 'Mon, 2 Apr 2007...</td>\n",
331
+ " <td>2013-10-15</td>\n",
332
+ " <td>[[Abu-Shammala, Wael, ], [Torchinsky, Alberto, ]]</td>\n",
333
+ " </tr>\n",
334
+ " </tbody>\n",
335
+ "</table>\n",
336
+ "</div>"
337
+ ],
338
+ "text/plain": [
339
+ " id submitter \\\n",
340
+ "0 704.0001 Pavel Nadolsky \n",
341
+ "1 704.0002 Louis Theran \n",
342
+ "2 704.0003 Hongjun Pan \n",
343
+ "3 704.0004 David Callan \n",
344
+ "4 704.0005 Alberto Torchinsky \n",
345
+ "\n",
346
+ " authors \\\n",
347
+ "0 C. Bal\\'azs, E. L. Berger, P. M. Nadolsky, C.-... \n",
348
+ "1 Ileana Streinu and Louis Theran \n",
349
+ "2 Hongjun Pan \n",
350
+ "3 David Callan \n",
351
+ "4 Wael Abu-Shammala and Alberto Torchinsky \n",
352
+ "\n",
353
+ " title \\\n",
354
+ "0 Calculation of prompt diphoton production cros... \n",
355
+ "1 Sparsity-certifying Graph Decompositions \n",
356
+ "2 The evolution of the Earth-Moon system based o... \n",
357
+ "3 A determinant of Stirling cycle numbers counts... \n",
358
+ "4 From dyadic $\\Lambda_{\\alpha}$ to $\\Lambda_{\\a... \n",
359
+ "\n",
360
+ " comments \\\n",
361
+ "0 37 pages, 15 figures; published version \n",
362
+ "1 To appear in Graphs and Combinatorics \n",
363
+ "2 23 pages, 3 figures \n",
364
+ "3 11 pages \n",
365
+ "4 None \n",
366
+ "\n",
367
+ " journal-ref doi \\\n",
368
+ "0 Phys.Rev.D76:013009,2007 10.1103/PhysRevD.76.013009 \n",
369
+ "1 None None \n",
370
+ "2 None None \n",
371
+ "3 None None \n",
372
+ "4 Illinois J. Math. 52 (2008) no.2, 681-689 None \n",
373
+ "\n",
374
+ " report-no categories \\\n",
375
+ "0 ANL-HEP-PR-07-12 hep-ph \n",
376
+ "1 None math.CO cs.CG \n",
377
+ "2 None physics.gen-ph \n",
378
+ "3 None math.CO \n",
379
+ "4 None math.CA math.FA \n",
380
+ "\n",
381
+ " license \\\n",
382
+ "0 None \n",
383
+ "1 http://arxiv.org/licenses/nonexclusive-distrib... \n",
384
+ "2 None \n",
385
+ "3 None \n",
386
+ "4 None \n",
387
+ "\n",
388
+ " abstract \\\n",
389
+ "0 A fully differential calculation in perturba... \n",
390
+ "1 We describe a new algorithm, the $(k,\\ell)$-... \n",
391
+ "2 The evolution of Earth-Moon system is descri... \n",
392
+ "3 We show that a determinant of Stirling cycle... \n",
393
+ "4 In this paper we show how to compute the $\\L... \n",
394
+ "\n",
395
+ " versions update_date \\\n",
396
+ "0 [{'version': 'v1', 'created': 'Mon, 2 Apr 2007... 2008-11-26 \n",
397
+ "1 [{'version': 'v1', 'created': 'Sat, 31 Mar 200... 2008-12-13 \n",
398
+ "2 [{'version': 'v1', 'created': 'Sun, 1 Apr 2007... 2008-01-13 \n",
399
+ "3 [{'version': 'v1', 'created': 'Sat, 31 Mar 200... 2007-05-23 \n",
400
+ "4 [{'version': 'v1', 'created': 'Mon, 2 Apr 2007... 2013-10-15 \n",
401
+ "\n",
402
+ " authors_parsed \n",
403
+ "0 [[Balázs, C., ], [Berger, E. L., ], [Nadolsky,... \n",
404
+ "1 [[Streinu, Ileana, ], [Theran, Louis, ]] \n",
405
+ "2 [[Pan, Hongjun, ]] \n",
406
+ "3 [[Callan, David, ]] \n",
407
+ "4 [[Abu-Shammala, Wael, ], [Torchinsky, Alberto, ]] "
408
+ ]
409
+ },
410
+ "execution_count": 17,
411
+ "metadata": {},
412
+ "output_type": "execute_result"
413
+ }
414
+ ],
415
+ "source": [
416
+ "next(chunks).head()"
417
+ ]
418
+ },
419
+ {
420
+ "cell_type": "markdown",
421
+ "metadata": {
422
+ "id": "tiIgyjkK4q3t"
423
+ },
424
+ "source": [
425
+ "#### Using a subset of the dataset"
426
+ ]
427
+ },
428
+ {
429
+ "cell_type": "code",
430
+ "execution_count": 18,
431
+ "metadata": {
432
+ "colab": {
433
+ "base_uri": "https://localhost:8080/"
434
+ },
435
+ "id": "rsA5QxtHbVug",
436
+ "outputId": "3b7820bc-f46e-4328-bd62-ee04fda760bc"
437
+ },
438
+ "outputs": [
439
+ {
440
+ "name": "stdout",
441
+ "output_type": "stream",
442
+ "text": [
443
+ "Processed 0 records\n",
444
+ "Processed 100000 records\n",
445
+ "Processed 200000 records\n",
446
+ "Processed 300000 records\n",
447
+ "Processed 400000 records\n",
448
+ "Processed 500000 records\n",
449
+ "Processed 600000 records\n",
450
+ "Processed 700000 records\n",
451
+ "Processed 800000 records\n",
452
+ "Processed 900000 records\n",
453
+ "Processed 1000000 records\n",
454
+ "(1010000, 4)\n",
455
+ " id title \\\n",
456
+ "0 706.1314 When Did Cosmic Acceleration Start ? \n",
457
+ "1 706.1315 The Dirac system on the Anti-de Sitter Universe \n",
458
+ "2 706.1316 Coupling of Optical Lumped Nanocircuit Element... \n",
459
+ "3 706.1317 A model for learning to segment temporal seque... \n",
460
+ "4 706.1318 Constructing a maximum utility slate of on-lin... \n",
461
+ "\n",
462
+ " abstract categories \n",
463
+ "0 A precise determination, and comparison, of ... astro-ph gr-qc hep-ph \n",
464
+ "1 We investigate the global solutions of the D... math-ph math.AP math.MP \n",
465
+ "2 We present here a model for the coupling amo... cond-mat.mtrl-sci \n",
466
+ "3 This paper proposes a novel learning method ... nlin.AO \n",
467
+ "4 We present an algorithm for constructing an ... cs.DM cs.DS \n"
468
+ ]
469
+ }
470
+ ],
471
+ "source": [
472
+ "data = []\n",
473
+ "for i, chunk in enumerate(chunks):\n",
474
+ "    # Keep only the columns needed downstream; the embeddings are trained on the abstracts\n",
474
+ "    subset = chunk[[\"id\", \"title\", \"abstract\", \"categories\"]]\n",
475
+ "    data.append(subset)\n",
476
+ "\n",
477
+ "    if i % 10 == 0:\n",
478
+ "        print(f\"Processed {i*10000} records\")\n",
479
+ "\n",
480
+ "    # Stop early: 101 chunks of 10,000 rows is roughly 1 million records\n",
481
+ "    if i == 100:\n",
482
+ "        break\n",
484
+ "\n",
485
+ "df = pd.concat(data, ignore_index=True)\n",
486
+ "print(df.shape)\n",
487
+ "print(df.head())\n"
488
+ ]
489
+ },
490
+ {
491
+ "cell_type": "code",
492
+ "execution_count": 19,
493
+ "metadata": {
494
+ "colab": {
495
+ "base_uri": "https://localhost:8080/",
496
+ "height": 458
497
+ },
498
+ "id": "g1jwkMB_bVuh",
499
+ "outputId": "ba0c426a-3829-4d05-c28f-f1573672c657"
500
+ },
501
+ "outputs": [
502
+ {
503
+ "data": {
504
+ "text/html": [
505
+ "<div>\n",
506
+ "<style scoped>\n",
507
+ " .dataframe tbody tr th:only-of-type {\n",
508
+ " vertical-align: middle;\n",
509
+ " }\n",
510
+ "\n",
511
+ " .dataframe tbody tr th {\n",
512
+ " vertical-align: top;\n",
513
+ " }\n",
514
+ "\n",
515
+ " .dataframe thead th {\n",
516
+ " text-align: right;\n",
517
+ " }\n",
518
+ "</style>\n",
519
+ "<table border=\"1\" class=\"dataframe\">\n",
520
+ " <thead>\n",
521
+ " <tr style=\"text-align: right;\">\n",
522
+ " <th></th>\n",
523
+ " <th>id</th>\n",
524
+ " <th>title</th>\n",
525
+ " <th>abstract</th>\n",
526
+ " <th>categories</th>\n",
527
+ " </tr>\n",
528
+ " </thead>\n",
529
+ " <tbody>\n",
530
+ " <tr>\n",
531
+ " <th>0</th>\n",
532
+ " <td>706.13140</td>\n",
533
+ " <td>When Did Cosmic Acceleration Start ?</td>\n",
534
+ " <td>A precise determination, and comparison, of ...</td>\n",
535
+ " <td>astro-ph gr-qc hep-ph</td>\n",
536
+ " </tr>\n",
537
+ " <tr>\n",
538
+ " <th>1</th>\n",
539
+ " <td>706.13150</td>\n",
540
+ " <td>The Dirac system on the Anti-de Sitter Universe</td>\n",
541
+ " <td>We investigate the global solutions of the D...</td>\n",
542
+ " <td>math-ph math.AP math.MP</td>\n",
543
+ " </tr>\n",
544
+ " <tr>\n",
545
+ " <th>2</th>\n",
546
+ " <td>706.13160</td>\n",
547
+ " <td>Coupling of Optical Lumped Nanocircuit Element...</td>\n",
548
+ " <td>We present here a model for the coupling amo...</td>\n",
549
+ " <td>cond-mat.mtrl-sci</td>\n",
550
+ " </tr>\n",
551
+ " <tr>\n",
552
+ " <th>3</th>\n",
553
+ " <td>706.13170</td>\n",
554
+ " <td>A model for learning to segment temporal seque...</td>\n",
555
+ " <td>This paper proposes a novel learning method ...</td>\n",
556
+ " <td>nlin.AO</td>\n",
557
+ " </tr>\n",
558
+ " <tr>\n",
559
+ " <th>4</th>\n",
560
+ " <td>706.13180</td>\n",
561
+ " <td>Constructing a maximum utility slate of on-lin...</td>\n",
562
+ " <td>We present an algorithm for constructing an ...</td>\n",
563
+ " <td>cs.DM cs.DS</td>\n",
564
+ " </tr>\n",
565
+ " <tr>\n",
566
+ " <th>...</th>\n",
567
+ " <td>...</td>\n",
568
+ " <td>...</td>\n",
569
+ " <td>...</td>\n",
570
+ " <td>...</td>\n",
571
+ " </tr>\n",
572
+ " <tr>\n",
573
+ " <th>1009995</th>\n",
574
+ " <td>1808.10849</td>\n",
575
+ " <td>On sets defining few ordinary hyperplanes</td>\n",
576
+ " <td>Let $P$ be a set of $n$ points in real proje...</td>\n",
577
+ " <td>math.CO math.AG</td>\n",
578
+ " </tr>\n",
579
+ " <tr>\n",
580
+ " <th>1009996</th>\n",
581
+ " <td>1808.10850</td>\n",
582
+ " <td>Quantum walks in external gauge fields</td>\n",
583
+ " <td>Describing a particle in an external electro...</td>\n",
584
+ " <td>quant-ph cond-mat.other math-ph math.MP</td>\n",
585
+ " </tr>\n",
586
+ " <tr>\n",
587
+ " <th>1009997</th>\n",
588
+ " <td>1808.10851</td>\n",
589
+ " <td>Solving the muon g-2 anomaly in CMSSM extensio...</td>\n",
590
+ " <td>We propose to generate non-universal gaugino...</td>\n",
591
+ " <td>hep-ph hep-ex</td>\n",
592
+ " </tr>\n",
593
+ " <tr>\n",
594
+ " <th>1009998</th>\n",
595
+ " <td>1808.10852</td>\n",
596
+ " <td>Towards Asynchronous Motor Imagery-Based Brain...</td>\n",
597
+ " <td>In this paper, the deep learning (DL) approa...</td>\n",
598
+ " <td>eess.SP cs.HC q-bio.NC</td>\n",
599
+ " </tr>\n",
600
+ " <tr>\n",
601
+ " <th>1009999</th>\n",
602
+ " <td>1808.10853</td>\n",
603
+ " <td>Charged impurity scattering in two-dimensional...</td>\n",
604
+ " <td>The singular density of states and the two F...</td>\n",
605
+ " <td>cond-mat.mes-hall</td>\n",
606
+ " </tr>\n",
607
+ " </tbody>\n",
608
+ "</table>\n",
609
+ "<p>1010000 rows × 4 columns</p>\n",
610
+ "</div>"
611
+ ],
612
+ "text/plain": [
613
+ " id title \\\n",
614
+ "0 706.13140 When Did Cosmic Acceleration Start ? \n",
615
+ "1 706.13150 The Dirac system on the Anti-de Sitter Universe \n",
616
+ "2 706.13160 Coupling of Optical Lumped Nanocircuit Element... \n",
617
+ "3 706.13170 A model for learning to segment temporal seque... \n",
618
+ "4 706.13180 Constructing a maximum utility slate of on-lin... \n",
619
+ "... ... ... \n",
620
+ "1009995 1808.10849 On sets defining few ordinary hyperplanes \n",
621
+ "1009996 1808.10850 Quantum walks in external gauge fields \n",
622
+ "1009997 1808.10851 Solving the muon g-2 anomaly in CMSSM extensio... \n",
623
+ "1009998 1808.10852 Towards Asynchronous Motor Imagery-Based Brain... \n",
624
+ "1009999 1808.10853 Charged impurity scattering in two-dimensional... \n",
625
+ "\n",
626
+ " abstract \\\n",
627
+ "0 A precise determination, and comparison, of ... \n",
628
+ "1 We investigate the global solutions of the D... \n",
629
+ "2 We present here a model for the coupling amo... \n",
630
+ "3 This paper proposes a novel learning method ... \n",
631
+ "4 We present an algorithm for constructing an ... \n",
632
+ "... ... \n",
633
+ "1009995 Let $P$ be a set of $n$ points in real proje... \n",
634
+ "1009996 Describing a particle in an external electro... \n",
635
+ "1009997 We propose to generate non-universal gaugino... \n",
636
+ "1009998 In this paper, the deep learning (DL) approa... \n",
637
+ "1009999 The singular density of states and the two F... \n",
638
+ "\n",
639
+ " categories \n",
640
+ "0 astro-ph gr-qc hep-ph \n",
641
+ "1 math-ph math.AP math.MP \n",
642
+ "2 cond-mat.mtrl-sci \n",
643
+ "3 nlin.AO \n",
644
+ "4 cs.DM cs.DS \n",
645
+ "... ... \n",
646
+ "1009995 math.CO math.AG \n",
647
+ "1009996 quant-ph cond-mat.other math-ph math.MP \n",
648
+ "1009997 hep-ph hep-ex \n",
649
+ "1009998 eess.SP cs.HC q-bio.NC \n",
650
+ "1009999 cond-mat.mes-hall \n",
651
+ "\n",
652
+ "[1010000 rows x 4 columns]"
653
+ ]
654
+ },
655
+ "execution_count": 19,
656
+ "metadata": {},
657
+ "output_type": "execute_result"
658
+ }
659
+ ],
660
+ "source": [
661
+ "df.dropna()"
662
+ ]
663
+ },
664
+ {
665
+ "cell_type": "markdown",
666
+ "metadata": {
667
+ "id": "n2J6AYWQbVuh"
668
+ },
669
+ "source": [
670
+ "#### Cleaning abstracts"
671
+ ]
672
+ },
673
+ {
674
+ "cell_type": "code",
675
+ "execution_count": 27,
676
+ "metadata": {
677
+ "colab": {
678
+ "base_uri": "https://localhost:8080/",
679
+ "height": 325,
680
+ "referenced_widgets": [
681
+ "7d5629fd92c543bcaafe7fd352c079c6",
682
+ "a815fb290af249cbb23679e34d1fdaa3",
683
+ "b042bdc131654d25bac957a4e921f59d",
684
+ "5510fb4117fd4954aeeb0de2214b1ed4",
685
+ "4de066fc521f493cab302c8d5ca255e2",
686
+ "06d6872e3d1f4086a4f16b8caababccb",
687
+ "88ba2081fc004bbdb6f438c5031458d5",
688
+ "52f12329e31645888592d0120a05f6ea",
689
+ "214288b8cb6c453084c3cce01982eff7",
690
+ "4df933c8001c4a27b351c5b66df23eb5",
691
+ "80071f18a3fd4ddc865a8a536d83cd2e"
692
+ ]
693
+ },
694
+ "id": "6Lx5uI74bVui",
695
+ "outputId": "97a945eb-38a5-4e23-94ea-4e2bb5e17d6d"
696
+ },
697
+ "outputs": [
698
+ {
699
+ "data": {
700
+ "application/vnd.jupyter.widget-view+json": {
701
+ "model_id": "0c66c831aedf446ea20c323688c43d4b",
702
+ "version_major": 2,
703
+ "version_minor": 0
704
+ },
705
+ "text/plain": [
706
+ " 0%| | 0/1010000 [00:00<?, ?it/s]"
707
+ ]
708
+ },
709
+ "metadata": {},
710
+ "output_type": "display_data"
711
+ },
712
+ {
713
+ "data": {
714
+ "text/html": [
715
+ "<div>\n",
716
+ "<style scoped>\n",
717
+ " .dataframe tbody tr th:only-of-type {\n",
718
+ " vertical-align: middle;\n",
719
+ " }\n",
720
+ "\n",
721
+ " .dataframe tbody tr th {\n",
722
+ " vertical-align: top;\n",
723
+ " }\n",
724
+ "\n",
725
+ " .dataframe thead th {\n",
726
+ " text-align: right;\n",
727
+ " }\n",
728
+ "</style>\n",
729
+ "<table border=\"1\" class=\"dataframe\">\n",
730
+ " <thead>\n",
731
+ " <tr style=\"text-align: right;\">\n",
732
+ " <th></th>\n",
733
+ " <th>id</th>\n",
734
+ " <th>title</th>\n",
735
+ " <th>abstract</th>\n",
736
+ " <th>categories</th>\n",
737
+ " <th>processed_abstract</th>\n",
738
+ " <th>tokenized_abstract</th>\n",
739
+ " </tr>\n",
740
+ " </thead>\n",
741
+ " <tbody>\n",
742
+ " <tr>\n",
743
+ " <th>0</th>\n",
744
+ " <td>706.1314</td>\n",
745
+ " <td>When Did Cosmic Acceleration Start ?</td>\n",
746
+ " <td>A precise determination, and comparison, of ...</td>\n",
747
+ " <td>astro-ph gr-qc hep-ph</td>\n",
748
+ " <td>a precise determination and comparison of the ...</td>\n",
749
+ " <td>[[a, precise, determination, ,, and, compariso...</td>\n",
750
+ " </tr>\n",
751
+ " <tr>\n",
752
+ " <th>1</th>\n",
753
+ " <td>706.1315</td>\n",
754
+ " <td>The Dirac system on the Anti-de Sitter Universe</td>\n",
755
+ " <td>We investigate the global solutions of the D...</td>\n",
756
+ " <td>math-ph math.AP math.MP</td>\n",
757
+ " <td>we investigate the global solutions of the dir...</td>\n",
758
+ " <td>[[we, investigate, the, global, solutions, of,...</td>\n",
759
+ " </tr>\n",
760
+ " <tr>\n",
761
+ " <th>2</th>\n",
762
+ " <td>706.1316</td>\n",
763
+ " <td>Coupling of Optical Lumped Nanocircuit Element...</td>\n",
764
+ " <td>We present here a model for the coupling amo...</td>\n",
765
+ " <td>cond-mat.mtrl-sci</td>\n",
766
+ " <td>we present here a model for the coupling among...</td>\n",
767
+ " <td>[[we, present, here, a, model, for, the, coupl...</td>\n",
768
+ " </tr>\n",
769
+ " <tr>\n",
770
+ " <th>3</th>\n",
771
+ " <td>706.1317</td>\n",
772
+ " <td>A model for learning to segment temporal seque...</td>\n",
773
+ " <td>This paper proposes a novel learning method ...</td>\n",
774
+ " <td>nlin.AO</td>\n",
775
+ " <td>this paper proposes a novel learning method fo...</td>\n",
776
+ " <td>[[this, paper, proposes, a, novel, learning, m...</td>\n",
777
+ " </tr>\n",
778
+ " <tr>\n",
779
+ " <th>4</th>\n",
780
+ " <td>706.1318</td>\n",
781
+ " <td>Constructing a maximum utility slate of on-lin...</td>\n",
782
+ " <td>We present an algorithm for constructing an ...</td>\n",
783
+ " <td>cs.DM cs.DS</td>\n",
784
+ " <td>we present an algorithm for constructing an op...</td>\n",
785
+ " <td>[[we, present, an, algorithm, for, constructin...</td>\n",
786
+ " </tr>\n",
787
+ " </tbody>\n",
788
+ "</table>\n",
789
+ "</div>"
790
+ ],
791
+ "text/plain": [
792
+ " id title \\\n",
793
+ "0 706.1314 When Did Cosmic Acceleration Start ? \n",
794
+ "1 706.1315 The Dirac system on the Anti-de Sitter Universe \n",
795
+ "2 706.1316 Coupling of Optical Lumped Nanocircuit Element... \n",
796
+ "3 706.1317 A model for learning to segment temporal seque... \n",
797
+ "4 706.1318 Constructing a maximum utility slate of on-lin... \n",
798
+ "\n",
799
+ " abstract categories \\\n",
800
+ "0 A precise determination, and comparison, of ... astro-ph gr-qc hep-ph \n",
801
+ "1 We investigate the global solutions of the D... math-ph math.AP math.MP \n",
802
+ "2 We present here a model for the coupling amo... cond-mat.mtrl-sci \n",
803
+ "3 This paper proposes a novel learning method ... nlin.AO \n",
804
+ "4 We present an algorithm for constructing an ... cs.DM cs.DS \n",
805
+ "\n",
806
+ " processed_abstract \\\n",
807
+ "0 a precise determination and comparison of the ... \n",
808
+ "1 we investigate the global solutions of the dir... \n",
809
+ "2 we present here a model for the coupling among... \n",
810
+ "3 this paper proposes a novel learning method fo... \n",
811
+ "4 we present an algorithm for constructing an op... \n",
812
+ "\n",
813
+ " tokenized_abstract \n",
814
+ "0 [[a, precise, determination, ,, and, compariso... \n",
815
+ "1 [[we, investigate, the, global, solutions, of,... \n",
816
+ "2 [[we, present, here, a, model, for, the, coupl... \n",
817
+ "3 [[this, paper, proposes, a, novel, learning, m... \n",
818
+ "4 [[we, present, an, algorithm, for, constructin... "
819
+ ]
820
+ },
821
+ "execution_count": 27,
822
+ "metadata": {},
823
+ "output_type": "execute_result"
824
+ }
825
+ ],
826
+ "source": [
827
+ "import re\n",
828
+ "\n",
829
+ "def pre_processor(sentence):\n",
830
+ " if not isinstance(sentence, str):\n",
831
+ " return \"\"\n",
832
+ " # lowercase\n",
833
+ " sentence = sentence.lower()\n",
834
+ " # remove punctuation using regex\n",
835
+ " sentence = re.sub(r\"[^\\w\\s]\", \"\", sentence)\n",
836
+ " # collapse repeated whitespace into single spaces\n",
837
+ " sentence = re.sub(r\"\\s+\", \" \", sentence).strip()\n",
838
+ " return sentence\n",
839
+ "\n",
840
+ "df[\"processed_abstract\"] = df[\"abstract\"].progress_apply(pre_processor)\n",
841
+ "df.head()\n"
842
+ ]
843
+ },
844
+ {
845
+ "cell_type": "markdown",
846
+ "metadata": {
847
+ "id": "d-CU5YOfbVui"
848
+ },
849
+ "source": [
850
+ "#### Converting sentences into lists of words"
851
+ ]
852
+ },
853
+ {
854
+ "cell_type": "code",
855
+ "execution_count": 29,
856
+ "metadata": {
857
+ "colab": {
858
+ "base_uri": "https://localhost:8080/",
859
+ "height": 49,
860
+ "referenced_widgets": [
861
+ "3048b4405eb44748b38b5237facafade",
862
+ "26fbd3641bb142d0a42a3e9e692b1689",
863
+ "d1404cdae4624c619c7793ac55e7917f",
864
+ "1362965a367146ff94a368a343b39ee8",
865
+ "99baec9dffac45eea0d573e9b61f81c1",
866
+ "a5482757a62545f19ade21ca6f6c4240",
867
+ "b0828f5f2f204222a40c7e37f827b5b8",
868
+ "93a30d3ff5084c58a264f1e1cb6e214c",
869
+ "421f7f03e06546d390c74bd2069883bd",
870
+ "87140c7e325a4fef97173c83ef14dcea",
871
+ "35a3cef4a7ca461a8edc089a9db655a1"
872
+ ]
873
+ },
874
+ "id": "Kz0i9reWbVui",
875
+ "outputId": "7b79cddb-82a0-4e0a-f5d0-af8e5c1a0e94"
876
+ },
877
+ "outputs": [
878
+ {
879
+ "data": {
880
+ "application/vnd.jupyter.widget-view+json": {
881
+ "model_id": "7dfca3237c814703aa915efee7af1032",
882
+ "version_major": 2,
883
+ "version_minor": 0
884
+ },
885
+ "text/plain": [
886
+ " 0%| | 0/1010000 [00:00<?, ?it/s]"
887
+ ]
888
+ },
889
+ "metadata": {},
890
+ "output_type": "display_data"
891
+ }
892
+ ],
893
+ "source": [
894
+ "abstracts = df['processed_abstract'].values\n",
895
+ "\n",
896
+ "def tokenize_sentences(text):\n",
897
+ " if not isinstance(text, str) or not text.strip():\n",
898
+ " return []\n",
899
+ " sentences = sent_tokenize(text)\n",
900
+ " return [word_tokenize(sent) for sent in sentences]\n",
901
+ "\n",
902
+ "df['tokenized_abstract'] = df['processed_abstract'].progress_apply(tokenize_sentences)\n",
903
+ "\n",
904
+ "corpus_data = df['tokenized_abstract'].to_list()\n",
905
+ "word2vec_corpus = [item for items in corpus_data for item in items]"
906
+ ]
907
+ },
908
+ {
909
+ "cell_type": "markdown",
910
+ "metadata": {
911
+ "id": "K1MFdpWibVui"
912
+ },
913
+ "source": [
914
+ "#### Using Gensim's API to train Word2Vec\n",
915
+ "\n",
916
+ "#### Parameters:\n",
917
+ "- **sentences** : Corpus data, passed as the first argument: an iterable of tokenized sentences (lists of words).\n",
918
+ "- **min_count** : Ignores all words with total frequency lower than this.\n",
919
+ "- **vector_size** : Dimensionality of the output word vectors.\n",
920
+ "- **workers** : Number of worker threads used to train the model (faster training on multicore machines).\n",
921
+ "- **sg** : Training algorithm: 1 for skip-gram; otherwise CBOW.\n",
922
+ "- **negative** : If > 0, negative sampling is used; the value specifies how many \"noise words\" are drawn for each positive example."
923
+ ]
924
+ },
925
+ {
926
+ "cell_type": "code",
927
+ "execution_count": 33,
928
+ "metadata": {
929
+ "id": "iC7gIDHwbVui"
930
+ },
931
+ "outputs": [],
932
+ "source": [
933
+ "model = Word2Vec(word2vec_corpus, min_count=3, vector_size=100, workers=4, window=5, sg=1, negative=5)"
934
+ ]
935
+ },
936
+ {
937
+ "cell_type": "code",
938
+ "execution_count": 41,
939
+ "metadata": {},
940
+ "outputs": [],
941
+ "source": [
942
+ "model.save(\"word2vec_arxiv_skipgram.model\")"
943
+ ]
944
+ },
945
+ {
946
+ "cell_type": "markdown",
947
+ "metadata": {},
948
+ "source": [
949
+ "#### Loading model weights"
950
+ ]
951
+ },
952
+ {
953
+ "cell_type": "code",
954
+ "execution_count": 10,
955
+ "metadata": {},
956
+ "outputs": [],
957
+ "source": [
958
+ "model = Word2Vec.load(\"word2vec_arxiv_skipgram.model\")"
959
+ ]
960
+ },
961
+ {
962
+ "cell_type": "markdown",
963
+ "metadata": {},
964
+ "source": [
965
+ "#### Looking up the embedding of each word in an abstract and computing the centroid of the whole document"
966
+ ]
967
+ },
968
+ {
969
+ "cell_type": "code",
970
+ "execution_count": 39,
971
+ "metadata": {
972
+ "id": "lfe_zzszbVui"
973
+ },
974
+ "outputs": [],
975
+ "source": [
976
+ "# initialize centroid column\n",
977
+ "df[\"centroid\"] = [ [0.0]*100 ] * df.shape[0]\n",
978
+ "\n",
979
+ "for index, row in df.iterrows():\n",
980
+ " tokenized = row['tokenized_abstract'] # list of list of tokens\n",
981
+ " \n",
982
+ " # flatten the list of lists\n",
983
+ " words = [w for sent in tokenized for w in sent]\n",
984
+ "\n",
985
+ " centroid = np.zeros(100, dtype=float)\n",
986
+ " word_count = 0\n",
987
+ "\n",
988
+ " for word in words:\n",
989
+ " try:\n",
990
+ " vec = model.wv[word] # get vector from trained Word2Vec\n",
991
+ " centroid += vec\n",
992
+ " word_count += 1\n",
993
+ " except KeyError:\n",
994
+ " # word not in vocabulary\n",
995
+ " continue\n",
996
+ "\n",
997
+ " # average\n",
998
+ " if word_count > 0:\n",
999
+ " centroid = centroid / word_count\n",
1000
+ "\n",
1001
+ " df.at[index, \"centroid\"] = centroid.tolist()\n"
1002
+ ]
1003
+ },
1004
+ {
1005
+ "cell_type": "code",
1006
+ "execution_count": 42,
1007
+ "metadata": {},
1008
+ "outputs": [
1009
+ {
1010
+ "data": {
1011
+ "text/html": [
1012
+ "<div>\n",
1013
+ "<style scoped>\n",
1014
+ " .dataframe tbody tr th:only-of-type {\n",
1015
+ " vertical-align: middle;\n",
1016
+ " }\n",
1017
+ "\n",
1018
+ " .dataframe tbody tr th {\n",
1019
+ " vertical-align: top;\n",
1020
+ " }\n",
1021
+ "\n",
1022
+ " .dataframe thead th {\n",
1023
+ " text-align: right;\n",
1024
+ " }\n",
1025
+ "</style>\n",
1026
+ "<table border=\"1\" class=\"dataframe\">\n",
1027
+ " <thead>\n",
1028
+ " <tr style=\"text-align: right;\">\n",
1029
+ " <th></th>\n",
1030
+ " <th>id</th>\n",
1031
+ " <th>title</th>\n",
1032
+ " <th>abstract</th>\n",
1033
+ " <th>categories</th>\n",
1034
+ " <th>processed_abstract</th>\n",
1035
+ " <th>tokenized_abstract</th>\n",
1036
+ " <th>centroid</th>\n",
1037
+ " </tr>\n",
1038
+ " </thead>\n",
1039
+ " <tbody>\n",
1040
+ " <tr>\n",
1041
+ " <th>0</th>\n",
1042
+ " <td>706.1314</td>\n",
1043
+ " <td>When Did Cosmic Acceleration Start ?</td>\n",
1044
+ " <td>A precise determination, and comparison, of ...</td>\n",
1045
+ " <td>astro-ph gr-qc hep-ph</td>\n",
1046
+ " <td>a precise determination and comparison of the ...</td>\n",
1047
+ " <td>[[a, precise, determination, and, comparison, ...</td>\n",
1048
+ " <td>[-0.08764898034384136, 0.11816140904464356, -0...</td>\n",
1049
+ " </tr>\n",
1050
+ " <tr>\n",
1051
+ " <th>1</th>\n",
1052
+ " <td>706.1315</td>\n",
1053
+ " <td>The Dirac system on the Anti-de Sitter Universe</td>\n",
1054
+ " <td>We investigate the global solutions of the D...</td>\n",
1055
+ " <td>math-ph math.AP math.MP</td>\n",
1056
+ " <td>we investigate the global solutions of the dir...</td>\n",
1057
+ " <td>[[we, investigate, the, global, solutions, of,...</td>\n",
1058
+ " <td>[-0.2019515800796124, 0.14643365251905624, -0....</td>\n",
1059
+ " </tr>\n",
1060
+ " <tr>\n",
1061
+ " <th>2</th>\n",
1062
+ " <td>706.1316</td>\n",
1063
+ " <td>Coupling of Optical Lumped Nanocircuit Element...</td>\n",
1064
+ " <td>We present here a model for the coupling amo...</td>\n",
1065
+ " <td>cond-mat.mtrl-sci</td>\n",
1066
+ " <td>we present here a model for the coupling among...</td>\n",
1067
+ " <td>[[we, present, here, a, model, for, the, coupl...</td>\n",
1068
+ " <td>[-0.10665342995032136, 0.15606039330525942, -0...</td>\n",
1069
+ " </tr>\n",
1070
+ " <tr>\n",
1071
+ " <th>3</th>\n",
1072
+ " <td>706.1317</td>\n",
1073
+ " <td>A model for learning to segment temporal seque...</td>\n",
1074
+ " <td>This paper proposes a novel learning method ...</td>\n",
1075
+ " <td>nlin.AO</td>\n",
1076
+ " <td>this paper proposes a novel learning method fo...</td>\n",
1077
+ " <td>[[this, paper, proposes, a, novel, learning, m...</td>\n",
1078
+ " <td>[-0.018284802690839336, 0.009443171508610248, ...</td>\n",
1079
+ " </tr>\n",
1080
+ " <tr>\n",
1081
+ " <th>4</th>\n",
1082
+ " <td>706.1318</td>\n",
1083
+ " <td>Constructing a maximum utility slate of on-lin...</td>\n",
1084
+ " <td>We present an algorithm for constructing an ...</td>\n",
1085
+ " <td>cs.DM cs.DS</td>\n",
1086
+ " <td>we present an algorithm for constructing an op...</td>\n",
1087
+ " <td>[[we, present, an, algorithm, for, constructin...</td>\n",
1088
+ " <td>[-0.07965052640159434, 0.05054909288165002, -0...</td>\n",
1089
+ " </tr>\n",
1090
+ " </tbody>\n",
1091
+ "</table>\n",
1092
+ "</div>"
1093
+ ],
1094
+ "text/plain": [
1095
+ " id title \\\n",
1096
+ "0 706.1314 When Did Cosmic Acceleration Start ? \n",
1097
+ "1 706.1315 The Dirac system on the Anti-de Sitter Universe \n",
1098
+ "2 706.1316 Coupling of Optical Lumped Nanocircuit Element... \n",
1099
+ "3 706.1317 A model for learning to segment temporal seque... \n",
1100
+ "4 706.1318 Constructing a maximum utility slate of on-lin... \n",
1101
+ "\n",
1102
+ " abstract categories \\\n",
1103
+ "0 A precise determination, and comparison, of ... astro-ph gr-qc hep-ph \n",
1104
+ "1 We investigate the global solutions of the D... math-ph math.AP math.MP \n",
1105
+ "2 We present here a model for the coupling amo... cond-mat.mtrl-sci \n",
1106
+ "3 This paper proposes a novel learning method ... nlin.AO \n",
1107
+ "4 We present an algorithm for constructing an ... cs.DM cs.DS \n",
1108
+ "\n",
1109
+ " processed_abstract \\\n",
1110
+ "0 a precise determination and comparison of the ... \n",
1111
+ "1 we investigate the global solutions of the dir... \n",
1112
+ "2 we present here a model for the coupling among... \n",
1113
+ "3 this paper proposes a novel learning method fo... \n",
1114
+ "4 we present an algorithm for constructing an op... \n",
1115
+ "\n",
1116
+ " tokenized_abstract \\\n",
1117
+ "0 [[a, precise, determination, and, comparison, ... \n",
1118
+ "1 [[we, investigate, the, global, solutions, of,... \n",
1119
+ "2 [[we, present, here, a, model, for, the, coupl... \n",
1120
+ "3 [[this, paper, proposes, a, novel, learning, m... \n",
1121
+ "4 [[we, present, an, algorithm, for, constructin... \n",
1122
+ "\n",
1123
+ " centroid \n",
1124
+ "0 [-0.08764898034384136, 0.11816140904464356, -0... \n",
1125
+ "1 [-0.2019515800796124, 0.14643365251905624, -0.... \n",
1126
+ "2 [-0.10665342995032136, 0.15606039330525942, -0... \n",
1127
+ "3 [-0.018284802690839336, 0.009443171508610248, ... \n",
1128
+ "4 [-0.07965052640159434, 0.05054909288165002, -0... "
1129
+ ]
1130
+ },
1131
+ "execution_count": 42,
1132
+ "metadata": {},
1133
+ "output_type": "execute_result"
1134
+ }
1135
+ ],
1136
+ "source": [
1137
+ "df.head()"
1138
+ ]
1139
+ },
1140
+ {
1141
+ "cell_type": "code",
1142
+ "execution_count": 43,
1143
+ "metadata": {},
1144
+ "outputs": [],
1145
+ "source": [
1146
+ "df.to_pickle(\"data.pkl\")"
1147
+ ]
1148
+ },
1149
+ {
1150
+ "cell_type": "markdown",
1151
+ "metadata": {},
1152
+ "source": [
1153
+ "#### Saving Dataset as parquet for efficient loading and compression"
1154
+ ]
1155
+ },
1156
+ {
1157
+ "cell_type": "code",
1158
+ "execution_count": 17,
1159
+ "metadata": {},
1160
+ "outputs": [],
1161
+ "source": [
1162
+ "df.to_parquet(\"data.parquet\", engine=\"fastparquet\")"
1163
+ ]
1164
+ },
1165
+ {
1166
+ "cell_type": "markdown",
1167
+ "metadata": {},
1168
+ "source": [
1169
+ "#### Loading dataset"
1170
+ ]
1171
+ },
1172
+ {
1173
+ "cell_type": "code",
1174
+ "execution_count": 11,
1175
+ "metadata": {},
1176
+ "outputs": [],
1177
+ "source": [
1178
+ "df = pd.read_pickle(\"data.pkl\")"
1179
+ ]
1180
+ },
1181
+ {
1182
+ "cell_type": "code",
1183
+ "execution_count": 20,
1184
+ "metadata": {},
1185
+ "outputs": [],
1186
+ "source": [
1187
+ "df = pd.read_parquet(\"data.parquet\", engine=\"fastparquet\") ## loading parquet dataset"
1188
+ ]
1189
+ },
1190
+ {
1191
+ "cell_type": "code",
1192
+ "execution_count": 12,
1193
+ "metadata": {},
1194
+ "outputs": [
1195
+ {
1196
+ "data": {
1197
+ "text/html": [
1198
+ "<div>\n",
1199
+ "<style scoped>\n",
1200
+ " .dataframe tbody tr th:only-of-type {\n",
1201
+ " vertical-align: middle;\n",
1202
+ " }\n",
1203
+ "\n",
1204
+ " .dataframe tbody tr th {\n",
1205
+ " vertical-align: top;\n",
1206
+ " }\n",
1207
+ "\n",
1208
+ " .dataframe thead th {\n",
1209
+ " text-align: right;\n",
1210
+ " }\n",
1211
+ "</style>\n",
1212
+ "<table border=\"1\" class=\"dataframe\">\n",
1213
+ " <thead>\n",
1214
+ " <tr style=\"text-align: right;\">\n",
1215
+ " <th></th>\n",
1216
+ " <th>id</th>\n",
1217
+ " <th>title</th>\n",
1218
+ " <th>abstract</th>\n",
1219
+ " <th>categories</th>\n",
1220
+ " <th>processed_abstract</th>\n",
1221
+ " <th>tokenized_abstract</th>\n",
1222
+ " <th>centroid</th>\n",
1223
+ " </tr>\n",
1224
+ " </thead>\n",
1225
+ " <tbody>\n",
1226
+ " <tr>\n",
1227
+ " <th>0</th>\n",
1228
+ " <td>706.1314</td>\n",
1229
+ " <td>When Did Cosmic Acceleration Start ?</td>\n",
1230
+ " <td>A precise determination, and comparison, of ...</td>\n",
1231
+ " <td>astro-ph gr-qc hep-ph</td>\n",
1232
+ " <td>a precise determination and comparison of the ...</td>\n",
1233
+ " <td>[[a, precise, determination, and, comparison, ...</td>\n",
1234
+ " <td>[-0.08764898034384136, 0.11816140904464356, -0...</td>\n",
1235
+ " </tr>\n",
1236
+ " <tr>\n",
1237
+ " <th>1</th>\n",
1238
+ " <td>706.1315</td>\n",
1239
+ " <td>The Dirac system on the Anti-de Sitter Universe</td>\n",
1240
+ " <td>We investigate the global solutions of the D...</td>\n",
1241
+ " <td>math-ph math.AP math.MP</td>\n",
1242
+ " <td>we investigate the global solutions of the dir...</td>\n",
1243
+ " <td>[[we, investigate, the, global, solutions, of,...</td>\n",
1244
+ " <td>[-0.2019515800796124, 0.14643365251905624, -0....</td>\n",
1245
+ " </tr>\n",
1246
+ " <tr>\n",
1247
+ " <th>2</th>\n",
1248
+ " <td>706.1316</td>\n",
1249
+ " <td>Coupling of Optical Lumped Nanocircuit Element...</td>\n",
1250
+ " <td>We present here a model for the coupling amo...</td>\n",
1251
+ " <td>cond-mat.mtrl-sci</td>\n",
1252
+ " <td>we present here a model for the coupling among...</td>\n",
1253
+ " <td>[[we, present, here, a, model, for, the, coupl...</td>\n",
1254
+ " <td>[-0.10665342995032136, 0.15606039330525942, -0...</td>\n",
1255
+ " </tr>\n",
1256
+ " <tr>\n",
1257
+ " <th>3</th>\n",
1258
+ " <td>706.1317</td>\n",
1259
+ " <td>A model for learning to segment temporal seque...</td>\n",
1260
+ " <td>This paper proposes a novel learning method ...</td>\n",
1261
+ " <td>nlin.AO</td>\n",
1262
+ " <td>this paper proposes a novel learning method fo...</td>\n",
1263
+ " <td>[[this, paper, proposes, a, novel, learning, m...</td>\n",
1264
+ " <td>[-0.018284802690839336, 0.009443171508610248, ...</td>\n",
1265
+ " </tr>\n",
1266
+ " <tr>\n",
1267
+ " <th>4</th>\n",
1268
+ " <td>706.1318</td>\n",
1269
+ " <td>Constructing a maximum utility slate of on-lin...</td>\n",
1270
+ " <td>We present an algorithm for constructing an ...</td>\n",
1271
+ " <td>cs.DM cs.DS</td>\n",
1272
+ " <td>we present an algorithm for constructing an op...</td>\n",
1273
+ " <td>[[we, present, an, algorithm, for, constructin...</td>\n",
1274
+ " <td>[-0.07965052640159434, 0.05054909288165002, -0...</td>\n",
1275
+ " </tr>\n",
1276
+ " </tbody>\n",
1277
+ "</table>\n",
1278
+ "</div>"
1279
+ ],
1280
+ "text/plain": [
1281
+ " id title \\\n",
1282
+ "0 706.1314 When Did Cosmic Acceleration Start ? \n",
1283
+ "1 706.1315 The Dirac system on the Anti-de Sitter Universe \n",
1284
+ "2 706.1316 Coupling of Optical Lumped Nanocircuit Element... \n",
1285
+ "3 706.1317 A model for learning to segment temporal seque... \n",
1286
+ "4 706.1318 Constructing a maximum utility slate of on-lin... \n",
1287
+ "\n",
1288
+ " abstract categories \\\n",
1289
+ "0 A precise determination, and comparison, of ... astro-ph gr-qc hep-ph \n",
1290
+ "1 We investigate the global solutions of the D... math-ph math.AP math.MP \n",
1291
+ "2 We present here a model for the coupling amo... cond-mat.mtrl-sci \n",
1292
+ "3 This paper proposes a novel learning method ... nlin.AO \n",
1293
+ "4 We present an algorithm for constructing an ... cs.DM cs.DS \n",
1294
+ "\n",
1295
+ " processed_abstract \\\n",
1296
+ "0 a precise determination and comparison of the ... \n",
1297
+ "1 we investigate the global solutions of the dir... \n",
1298
+ "2 we present here a model for the coupling among... \n",
1299
+ "3 this paper proposes a novel learning method fo... \n",
1300
+ "4 we present an algorithm for constructing an op... \n",
1301
+ "\n",
1302
+ " tokenized_abstract \\\n",
1303
+ "0 [[a, precise, determination, and, comparison, ... \n",
1304
+ "1 [[we, investigate, the, global, solutions, of,... \n",
1305
+ "2 [[we, present, here, a, model, for, the, coupl... \n",
1306
+ "3 [[this, paper, proposes, a, novel, learning, m... \n",
1307
+ "4 [[we, present, an, algorithm, for, constructin... \n",
1308
+ "\n",
1309
+ " centroid \n",
1310
+ "0 [-0.08764898034384136, 0.11816140904464356, -0... \n",
1311
+ "1 [-0.2019515800796124, 0.14643365251905624, -0.... \n",
1312
+ "2 [-0.10665342995032136, 0.15606039330525942, -0... \n",
1313
+ "3 [-0.018284802690839336, 0.009443171508610248, ... \n",
1314
+ "4 [-0.07965052640159434, 0.05054909288165002, -0... "
1315
+ ]
1316
+ },
1317
+ "execution_count": 12,
1318
+ "metadata": {},
1319
+ "output_type": "execute_result"
1320
+ }
1321
+ ],
1322
+ "source": [
1323
+ "df.head()"
1324
+ ]
1325
+ },
1326
+ {
1327
+ "cell_type": "markdown",
1328
+ "metadata": {},
1329
+ "source": [
1330
+ "#### Using simple cosine similarity to search for the top K papers"
1331
+ ]
1332
+ },
1333
+ {
1334
+ "cell_type": "code",
1335
+ "execution_count": 23,
1336
+ "metadata": {
1337
+ "id": "YUtUbO4ebVuj"
1338
+ },
1339
+ "outputs": [],
1340
+ "source": [
1341
+ "def rank_docs(model, query, df_covid, num):\n",
1342
+ "\n",
1343
+ " cosine_list = []\n",
1344
+ "\n",
1345
+ " a = []\n",
1346
+ " query_words = query.lower().split(\" \") # Lowercase query words\n",
1347
+ " found_words = []\n",
1348
+ " not_found_words = []\n",
1349
+ "\n",
1350
+ " for q_word in query_words:\n",
1351
+ " try:\n",
1352
+ " a.append(model.wv[q_word]) # Use model.wv for word vectors\n",
1353
+ " found_words.append(q_word)\n",
1354
+ " except KeyError:\n",
1355
+ " not_found_words.append(q_word)\n",
1356
+ " continue\n",
1357
+ "\n",
1358
+ " if not_found_words:\n",
1359
+ " print(f\"Warning: The following query words were not found in the model's vocabulary: {', '.join(not_found_words)}\")\n",
1360
+ "\n",
1361
+ " if not found_words:\n",
1362
+ " print(\"No query words were found in the model's vocabulary. Returning empty list.\")\n",
1363
+ " return []\n",
1364
+ "\n",
1365
+ " for index, row in df_covid.iterrows():\n",
1366
+ " centroid = row['centroid']\n",
1367
+ " total_sim = 0\n",
1368
+ " # Ensure centroid is not all zeros to avoid division by zero\n",
1369
+ " if np.linalg.norm(centroid) == 0:\n",
1370
+ " continue\n",
1371
+ "\n",
1372
+ " for a_i in a:\n",
1373
+ " # Ensure a_i is not all zeros to avoid division by zero\n",
1374
+ " if np.linalg.norm(a_i) == 0:\n",
1375
+ " continue\n",
1376
+ " cos_sim = np.dot(a_i, centroid)/(np.linalg.norm(a_i)*np.linalg.norm(centroid))\n",
1377
+ " total_sim += cos_sim\n",
1378
+ " cosine_list.append((row['title'], total_sim, row['abstract'], row['id']))\n",
1379
+ "\n",
1380
+ " cosine_list.sort(key=lambda x:x[1], reverse=True) ## in Descending order\n",
1381
+ "\n",
1382
+ " papers_list = []\n",
1383
+ " for item in cosine_list[:num]:\n",
1384
+ " papers_list.append((item[0], item[1], item[2], item[3])) # (title, total_sim, abstract, id)\n",
1385
+ " return papers_list"
1386
+ ]
1387
+ },
1388
+ {
1389
+ "cell_type": "code",
1390
+ "execution_count": 24,
1391
+ "metadata": {
1392
+ "id": "yCo-1JTmbVuj"
1393
+ },
1394
+ "outputs": [],
1395
+ "source": [
1396
+ "def query(query, top_matches=10):\n",
1397
+ " model_to_use = model\n",
1398
+ " df_to_use = df\n",
1399
+ " return rank_docs(model_to_use, query, df_to_use, top_matches)"
1400
+ ]
1401
+ },
1402
+ {
1403
+ "cell_type": "code",
1404
+ "execution_count": 29,
1405
+ "metadata": {
1406
+ "colab": {
1407
+ "base_uri": "https://localhost:8080/"
1408
+ },
1409
+ "id": "PMBh22TBbVuj",
1410
+ "outputId": "07f8412a-47e3-4c5b-a2c9-1234a485085c"
1411
+ },
1412
+ "outputs": [
1413
+ {
1414
+ "name": "stdout",
1415
+ "output_type": "stream",
1416
+ "text": [
1417
+ "[('Reversibility of laser filamentation', np.float64(1.2172915391534613), ' We investigate the reversibility of laser filamentation, a self-sustained,\\nnon-linear propagation regime including dissipation and time-retarded effects.\\nWe show that even losses related to ionization marginally affect the\\npossibility of reverse propagating ultrashort pulses back to the initial\\nconditions, although they make it prone to finite-distance blow-up susceptible\\nto prevent backward propagation.\\n', 1401.284), ('Droplet-shaped waves: Causal finite-support analogs of X-shaped waves', np.float64(1.206201484806961), ' A model of steady-state X-shaped wave generation by a superluminal\\n(supersonic) pointlike source infinitely moving along a straight line is\\nextended to a more realistic causal scenario of a source pulse launched at time\\nzero and propagating rectilinearly at constant superluminal speed. In the case\\nof infinitely short (delta) pulse, the new model yields an analytical solution,\\ncorresponding to the propagation-invariant X-shaped wave clipped by a\\ndroplet-shaped support, which perpetually expands along the propagation and\\ntransversal directions, thus tending the droplet-shaped wave to the X-shaped\\none.\\n', 1110.3494), ('Role of beam propagation in Goos-H\\\\\"{a}nchen and Imbert-Fedorov shifts', np.float64(1.1974150038736082), ' We derive the polarization-dependent displacements parallel and perpendicular\\nto the plane of incidence, for a Gaussian light beam reflected from a planar\\ninterface, taking into account the propagation of the beam. Using a\\nclassical-optics formalism we show that beam propagation may greatly affect\\nboth Goos-H\\\\\"{a}nchen and Imbert-Fedorov shifts when the incident beam is\\nfocussed.\\n', 804.1895), ('The falling slinky', np.float64(1.1963728894486865), ' The slinky, released from rest hanging under its own weight, falls in a\\npeculiar manner. The bottom stays at rest until a wave hits it from above. 
Two\\ncases -- one unphysical one where the slinky is able to pass through itself,\\nand the other where the coils of the slinky collide creating a shock wave\\ntravelling down the slinky -- are analysed. In the former case, the bottom\\nbegins to move much later than in the latter.\\n', 1110.4368), ('MHD Mode Conversion around a 2D Magnetic Null Point', np.float64(1.1959848595837723), ' Mode conversion occurs when a wave passes through a region where the sound\\nand Alfven speeds are equal. At this point there is a resonance, which allows\\nsome of the incident wave to be converted into a different mode. We study this\\nphenomenon in the vicinity of a two-dimensional, coronal null point. As a wave\\napproaches the null it passes from low- to high-beta plasma, allowing\\nconversion to take place. We simulate this numerically by sending in a slow\\nmagnetoacoustic wave from the upper boundary; as this passes through the\\nconversion layer a fast wave can clearly be seen propagating ahead. Numerical\\nsimulations combined with an analytical WKB investigation allow us to determine\\nand track both the incident and converted waves throughout the domain.\\n', 907.1541), ('An analog fluid model for some tachyonic effects in field theory', np.float64(1.1954942312587349), ' We consider the sound radiation from an acoustic point-like source moving\\nalong a supersonic (\"space-like\") trajectory in a fluid at rest. We call it an\\nacoustic \"tachyonic\" source. We describe the radiation emitted by this\\nsupersonic source. After quantizing the acoustic perturbations, we present the\\ndistribution of phonons generated by this classical tachyonic source and the\\nclassical wave interference pattern.\\n', 1109.098), ('Interactions in an acoustic world', np.float64(1.1949165744874333), ' The present paper aims to complete an earlier paper where the acoustic world\\nwas introduced. 
This is accomplished by analyzing the interactions which occur\\nbetween the inhomogeneities of the acoustic medium, which are induced by the\\nacoustic vibrations traveling in the medium. When a wave packet travels in a\\nmedium, the medium becomes inhomogeneous. The spherical wave packet behaves\\nlike an acoustic spherical lens for the acoustic plane waves. According to the\\nprinciple of causality, there is an interaction between the wave and plane wave\\npacket. In specific conditions the wave packet behaves as an acoustic black\\nhole.\\n', 1612.00294), ('Peculiar Behavior of Si Cluster Ions in Solid Al', np.float64(1.193792102194581), ' A peculiar ion behavior is found in a Si cluster, moving with a speed of\\n~0.22c (c: speed of light) in a solid Al plasma: the Si ion, moving behind the\\nforward moving Si ion closely in a several angstrom distance in the cluster,\\nfeels the wake field generated by the forward Si. The interaction potential on\\nthe rear Si may balance the deceleration backward force by itself with the\\nacceleration forward force by the forward Si in the longitudinal moving\\ndirection. The forward Si would be decelerated normally. However, the\\ndeceleration of the rear Si, moving behind closely, would be reduced\\nsignificantly, and the rear Si may catch up and overtake the forward moving Si\\nin the cluster during the Si cluster interaction with the high-density Al\\nplasma.\\n', 1808.02649), ('The effective Hamiltonian which governs the propagation dynamics of\\n nonspreading wave packets', np.float64(1.1935298414558697), \" We discuss the propagation dynamics of nonspreading wave packets. We\\ndecompose the Hamiltonian into two parts. The first part is such that wave\\npackets is its instantaneous eigenstate and is therefore irrelevant to the\\npropagation of the packet. The second part is shown to be the effective\\nHamiltonian governing the motion of the packet both classically and quantum\\nmechanically. 
Thus, analogous to Ehrenfest's theorem, nonspreading wave packets\\noffer another view point directly connecting quantum mechanics and classical\\nmechanics. This analysis also works for non-square-integrable packets, such as\\nAiry packets.\\n\", 1306.1311), ('Multiple trains of same-color surface plasmon-polaritons guided by the\\n planar interface of a metal and a sculptured nematic thin film. Part IV:\\n Canonical problem', np.float64(1.1905086316212172), ' The canonical problem of the propagation of surface-plasmon-polariton (SPP)\\nwaves localized to the planar interface of a metal and a sculptured nematic\\nthin film (SNTF) that is periodically nonhomogeneous along the direction normal\\nto the interface was formulated. Solution of the dispersion equation obtained\\nthereby confirmed the possibility of exciting multiple SPP waves of the same\\nfrequency or color. However, these SPP waves differ in phase speed, field\\nstructure, and the e-folding distance along the direction of propagation.\\n', 1002.2435)]\n",
1418
+ "Total Time taken by query: 91.10928773880005\n"
1419
+ ]
1420
+ }
1421
+ ],
1422
+ "source": [
1423
+ "import time\n",
1424
+ "start_time = time.time()\n",
1425
+ "print(query('back propagation'))\n",
1426
+ "end_time = time.time()\n",
1427
+ "print(f\"Total Time taken by query: {end_time-start_time}\")"
1428
+ ]
1429
+ },
1430
+ {
1431
+ "cell_type": "markdown",
1432
+ "metadata": {},
1433
+ "source": [
1434
+ "## Search Optimizations"
1435
+ ]
1436
+ },
1437
+ {
1438
+ "cell_type": "markdown",
1439
+ "metadata": {},
1440
+ "source": [
1441
+ "#### Using FAISS for getting top K Result"
1442
+ ]
1443
+ },
1444
+ {
1445
+ "cell_type": "code",
1446
+ "execution_count": 31,
1447
+ "metadata": {},
1448
+ "outputs": [],
1449
+ "source": [
1450
+ "import faiss"
1451
+ ]
1452
+ },
1453
+ {
1454
+ "cell_type": "code",
1455
+ "execution_count": 32,
1456
+ "metadata": {},
1457
+ "outputs": [],
1458
+ "source": [
1459
+ "# Convert centroid column to matrix\n",
1460
+ "X = np.vstack(df['centroid'].values).astype('float32')"
1461
+ ]
1462
+ },
1463
+ {
1464
+ "cell_type": "code",
1465
+ "execution_count": 33,
1466
+ "metadata": {},
1467
+ "outputs": [],
1468
+ "source": [
1469
+ "# Normalize for cosine similarity\n",
1470
+ "faiss.normalize_L2(X)"
1471
+ ]
1472
+ },
1473
+ {
1474
+ "cell_type": "code",
1475
+ "execution_count": 36,
1476
+ "metadata": {},
1477
+ "outputs": [],
1478
+ "source": [
1479
+ "# Build FAISS index\n",
1480
+ "dim = X.shape[1]\n",
1481
+ "flat_index = faiss.IndexFlatIP(dim)\n",
1482
+ "index = faiss.IndexIDMap(flat_index)\n",
1483
+ "\n",
1484
+ "ids = np.arange(X.shape[0]).astype('int64')\n",
1485
+ "index.add_with_ids(X, ids)"
1486
+ ]
1487
+ },
1488
+ {
1489
+ "cell_type": "code",
1490
+ "execution_count": 37,
1491
+ "metadata": {},
1492
+ "outputs": [
1493
+ {
1494
+ "name": "stdout",
1495
+ "output_type": "stream",
1496
+ "text": [
1497
+ "FAISS index built with 1010000 vectors\n"
1498
+ ]
1499
+ }
1500
+ ],
1501
+ "source": [
1502
+ "print(\"FAISS index built with\", index.ntotal, \"vectors\")"
1503
+ ]
1504
+ },
1505
+ {
1506
+ "cell_type": "code",
1507
+ "execution_count": 60,
1508
+ "metadata": {},
1509
+ "outputs": [],
1510
+ "source": [
1511
+ "def faiss_query(query, top_k=10):\n",
1512
+ " words = word_tokenize(query.lower()) # split the lowercased query into word tokens\n",
1513
+ " print(words)\n",
1514
+ " vecs = []\n",
1515
+ " for w in words:\n",
1516
+ " if w in model.wv:\n",
1517
+ " vecs.append(model.wv[w])\n",
1518
+ " if len(vecs) == 0:\n",
1519
+ " return []\n",
1520
+ " qvec = np.mean(vecs, axis=0).astype('float32').reshape(1, -1)\n",
1521
+ " faiss.normalize_L2(qvec)\n",
1522
+ " scores, neighbors = index.search(qvec, top_k)\n",
1523
+ " return df.iloc[neighbors[0]]\n"
1524
+ ]
1525
+ },
1526
+ {
1527
+ "cell_type": "code",
1528
+ "execution_count": 63,
1529
+ "metadata": {},
1530
+ "outputs": [
1531
+ {
1532
+ "name": "stdout",
1533
+ "output_type": "stream",
1534
+ "text": [
1535
+ "['black', 'holes']\n",
1536
+ " id title \\\n",
1537
+ "593935 1503.01221 Superradiantly stable non-extremal Reissner-No... \n",
1538
+ "174856 1004.29160 Area spectra of near extremal black holes \n",
1539
+ "245022 1104.14680 Exact solutions of higher dimensional black holes \n",
1540
+ "111775 905.01790 Thermodynamics of Ho\\v{r}ava-Lifshitz black holes \n",
1541
+ "112553 905.09570 Thermodynamics of black holes in the deformed ... \n",
1542
+ "75657 810.00780 Regular black hole in three dimensions \n",
1543
+ "34761 801.24340 Phase transition for black holes with scalar h... \n",
1544
+ "1001755 1808.02609 Instability of Reissner-Nordstr\\\"{o}m black ho... \n",
1545
+ "192983 1007.37450 A scalar field condensation instability of rot... \n",
1546
+ "470646 1311.69850 Thermodynamic and classical instability of AdS... \n",
1547
+ "\n",
1548
+ " abstract categories \\\n",
1549
+ "593935 The superradiant stability is investigated f... gr-qc \n",
1550
+ "174856 Motivated by Maggiore's new interpretation o... gr-qc \n",
1551
+ "245022 We review exact solutions of black holes in ... hep-th gr-qc \n",
1552
+ "111775 We study black holes in the Ho\\v{r}ava-Lifsh... hep-th gr-qc \n",
1553
+ "112553 We study thermodynamics of black holes in th... hep-th \n",
1554
+ "75657 We find a new black hole in three dimensiona... gr-qc \n",
1555
+ "34761 We study phase transitions between black hol... hep-th gr-qc \n",
1556
+ "1001755 The scalarization of Reissner-Nordstr\\\"{o}m ... gr-qc hep-th \n",
1557
+ "192983 Near-extreme Reissner-Nordstrom-anti-de Sitt... hep-th gr-qc \n",
1558
+ "470646 We study thermodynamic and classical instabi... hep-th gr-qc \n",
1559
+ "\n",
1560
+ " processed_abstract \\\n",
1561
+ "593935 the superradiant stability is investigated for... \n",
1562
+ "174856 motivated by maggiores new interpretation of q... \n",
1563
+ "245022 we review exact solutions of black holes in hi... \n",
1564
+ "111775 we study black holes in the hovravalifshitz gr... \n",
1565
+ "112553 we study thermodynamics of black holes in the ... \n",
1566
+ "75657 we find a new black hole in three dimensional ... \n",
1567
+ "34761 we study phase transitions between black holes... \n",
1568
+ "1001755 the scalarization of reissnernordstrom black h... \n",
1569
+ "192983 nearextreme reissnernordstromantide sitter bla... \n",
1570
+ "470646 we study thermodynamic and classical instabili... \n",
1571
+ "\n",
1572
+ " tokenized_abstract \\\n",
1573
+ "593935 [[the, superradiant, stability, is, investigat... \n",
1574
+ "174856 [[motivated, by, maggiores, new, interpretatio... \n",
1575
+ "245022 [[we, review, exact, solutions, of, black, hol... \n",
1576
+ "111775 [[we, study, black, holes, in, the, hovravalif... \n",
1577
+ "112553 [[we, study, thermodynamics, of, black, holes,... \n",
1578
+ "75657 [[we, find, a, new, black, hole, in, three, di... \n",
1579
+ "34761 [[we, study, phase, transitions, between, blac... \n",
1580
+ "1001755 [[the, scalarization, of, reissnernordstrom, b... \n",
1581
+ "192983 [[nearextreme, reissnernordstromantide, sitter... \n",
1582
+ "470646 [[we, study, thermodynamic, and, classical, in... \n",
1583
+ "\n",
1584
+ " centroid \n",
1585
+ "593935 [-0.1548219741116343, 0.07305222587574674, -0.... \n",
1586
+ "174856 [-0.15942226781044155, 0.09534141645417549, -0... \n",
1587
+ "245022 [-0.16427493500797188, -0.021989740924361872, ... \n",
1588
+ "111775 [-0.185662076366134, 0.1500076398253441, -0.01... \n",
1589
+ "112553 [-0.1874230935668143, 0.0645052320861186, -0.0... \n",
1590
+ "75657 [-0.140232238308366, 0.11955675896450853, -0.0... \n",
1591
+ "34761 [-0.18028071506373716, 0.18077764537577568, -0... \n",
1592
+ "1001755 [-0.1625649823457934, 0.08357451187912375, -0.... \n",
1593
+ "192983 [-0.20853021891077397, 0.17061649511522387, -0... \n",
1594
+ "470646 [-0.22651504412189954, 0.09521500566215427, -0... \n",
1595
+ "Total Time taken by query: 0.0438535213470459\n"
1596
+ ]
1597
+ }
1598
+ ],
1599
+ "source": [
1600
+ "import time\n",
1601
+ "start_time = time.perf_counter()\n",
1602
+ "print(faiss_query('black holes'))\n",
1603
+ "end_time = time.perf_counter()\n",
1604
+ "print(f\"Total Time taken by query: {end_time-start_time}\")"
1605
+ ]
1606
+ },
1607
+ {
1608
+ "cell_type": "markdown",
1609
+ "metadata": {},
1610
+ "source": [
1611
+ "As we can see, using FAISS cuts the search time drastically, from 91.1 seconds to about 0.04 seconds."
1612
+ ]
1613
+ },
1614
+ {
1615
+ "cell_type": "code",
1616
+ "execution_count": 64,
1617
+ "metadata": {},
1618
+ "outputs": [],
1619
+ "source": [
1620
+ "faiss.write_index(index, \"faiss_search_index.bin\")"
1621
+ ]
1622
+ }
1623
+ ],
1624
+ "metadata": {
1625
+ "colab": {
1626
+ "provenance": []
1627
+ },
1628
+ "kaggle": {
1629
+ "accelerator": "gpu",
1630
+ "dataSources": [
1631
+ {
1632
+ "datasetId": 612177,
1633
+ "sourceId": 13661773,
1634
+ "sourceType": "datasetVersion"
1635
+ },
1636
+ {
1637
+ "isSourceIdPinned": true,
1638
+ "modelId": 499524,
1639
+ "modelInstanceId": 484022,
1640
+ "sourceId": 641799,
1641
+ "sourceType": "modelInstanceVersion"
1642
+ }
1643
+ ],
1644
+ "dockerImageVersionId": 31192,
1645
+ "isGpuEnabled": true,
1646
+ "isInternetEnabled": true,
1647
+ "language": "python",
1648
+ "sourceType": "notebook"
1649
+ },
1650
+ "kernelspec": {
1651
+ "display_name": "Python 3 (ipykernel)",
1652
+ "language": "python",
1653
+ "name": "python3"
1654
+ },
1655
+ "language_info": {
1656
+ "codemirror_mode": {
1657
+ "name": "ipython",
1658
+ "version": 3
1659
+ },
1660
+ "file_extension": ".py",
1661
+ "mimetype": "text/x-python",
1662
+ "name": "python",
1663
+ "nbconvert_exporter": "python",
1664
+ "pygments_lexer": "ipython3",
1665
+ "version": "3.12.3"
1666
+ },
1667
+ "widgets": {
1668
+ "application/vnd.jupyter.widget-state+json": {
1669
+ "06d6872e3d1f4086a4f16b8caababccb": {
1670
+ "model_module": "@jupyter-widgets/base",
1671
+ "model_module_version": "1.2.0",
1672
+ "model_name": "LayoutModel",
1673
+ "state": {
1674
+ "_model_module": "@jupyter-widgets/base",
1675
+ "_model_module_version": "1.2.0",
1676
+ "_model_name": "LayoutModel",
1677
+ "_view_count": null,
1678
+ "_view_module": "@jupyter-widgets/base",
1679
+ "_view_module_version": "1.2.0",
1680
+ "_view_name": "LayoutView",
1681
+ "align_content": null,
1682
+ "align_items": null,
1683
+ "align_self": null,
1684
+ "border": null,
1685
+ "bottom": null,
1686
+ "display": null,
1687
+ "flex": null,
1688
+ "flex_flow": null,
1689
+ "grid_area": null,
1690
+ "grid_auto_columns": null,
1691
+ "grid_auto_flow": null,
1692
+ "grid_auto_rows": null,
1693
+ "grid_column": null,
1694
+ "grid_gap": null,
1695
+ "grid_row": null,
1696
+ "grid_template_areas": null,
1697
+ "grid_template_columns": null,
1698
+ "grid_template_rows": null,
1699
+ "height": null,
1700
+ "justify_content": null,
1701
+ "justify_items": null,
1702
+ "left": null,
1703
+ "margin": null,
1704
+ "max_height": null,
1705
+ "max_width": null,
1706
+ "min_height": null,
1707
+ "min_width": null,
1708
+ "object_fit": null,
1709
+ "object_position": null,
1710
+ "order": null,
1711
+ "overflow": null,
1712
+ "overflow_x": null,
1713
+ "overflow_y": null,
1714
+ "padding": null,
1715
+ "right": null,
1716
+ "top": null,
1717
+ "visibility": null,
1718
+ "width": null
1719
+ }
1720
+ },
1721
+ "1362965a367146ff94a368a343b39ee8": {
1722
+ "model_module": "@jupyter-widgets/controls",
1723
+ "model_module_version": "1.5.0",
1724
+ "model_name": "HTMLModel",
1725
+ "state": {
1726
+ "_dom_classes": [],
1727
+ "_model_module": "@jupyter-widgets/controls",
1728
+ "_model_module_version": "1.5.0",
1729
+ "_model_name": "HTMLModel",
1730
+ "_view_count": null,
1731
+ "_view_module": "@jupyter-widgets/controls",
1732
+ "_view_module_version": "1.5.0",
1733
+ "_view_name": "HTMLView",
1734
+ "description": "",
1735
+ "description_tooltip": null,
1736
+ "layout": "IPY_MODEL_87140c7e325a4fef97173c83ef14dcea",
1737
+ "placeholder": "​",
1738
+ "style": "IPY_MODEL_35a3cef4a7ca461a8edc089a9db655a1",
1739
+ "value": " 110000/110000 [01:35&lt;00:00, 710.46it/s]"
1740
+ }
1741
+ },
1742
+ "214288b8cb6c453084c3cce01982eff7": {
1743
+ "model_module": "@jupyter-widgets/controls",
1744
+ "model_module_version": "1.5.0",
1745
+ "model_name": "ProgressStyleModel",
1746
+ "state": {
1747
+ "_model_module": "@jupyter-widgets/controls",
1748
+ "_model_module_version": "1.5.0",
1749
+ "_model_name": "ProgressStyleModel",
1750
+ "_view_count": null,
1751
+ "_view_module": "@jupyter-widgets/base",
1752
+ "_view_module_version": "1.2.0",
1753
+ "_view_name": "StyleView",
1754
+ "bar_color": null,
1755
+ "description_width": ""
1756
+ }
1757
+ },
1758
+ "26fbd3641bb142d0a42a3e9e692b1689": {
1759
+ "model_module": "@jupyter-widgets/controls",
1760
+ "model_module_version": "1.5.0",
1761
+ "model_name": "HTMLModel",
1762
+ "state": {
1763
+ "_dom_classes": [],
1764
+ "_model_module": "@jupyter-widgets/controls",
1765
+ "_model_module_version": "1.5.0",
1766
+ "_model_name": "HTMLModel",
1767
+ "_view_count": null,
1768
+ "_view_module": "@jupyter-widgets/controls",
1769
+ "_view_module_version": "1.5.0",
1770
+ "_view_name": "HTMLView",
1771
+ "description": "",
1772
+ "description_tooltip": null,
1773
+ "layout": "IPY_MODEL_a5482757a62545f19ade21ca6f6c4240",
1774
+ "placeholder": "​",
1775
+ "style": "IPY_MODEL_b0828f5f2f204222a40c7e37f827b5b8",
1776
+ "value": "100%"
1777
+ }
1778
+ },
1779
+ "3048b4405eb44748b38b5237facafade": {
1780
+ "model_module": "@jupyter-widgets/controls",
1781
+ "model_module_version": "1.5.0",
1782
+ "model_name": "HBoxModel",
1783
+ "state": {
1784
+ "_dom_classes": [],
1785
+ "_model_module": "@jupyter-widgets/controls",
1786
+ "_model_module_version": "1.5.0",
1787
+ "_model_name": "HBoxModel",
1788
+ "_view_count": null,
1789
+ "_view_module": "@jupyter-widgets/controls",
1790
+ "_view_module_version": "1.5.0",
1791
+ "_view_name": "HBoxView",
1792
+ "box_style": "",
1793
+ "children": [
1794
+ "IPY_MODEL_26fbd3641bb142d0a42a3e9e692b1689",
1795
+ "IPY_MODEL_d1404cdae4624c619c7793ac55e7917f",
1796
+ "IPY_MODEL_1362965a367146ff94a368a343b39ee8"
1797
+ ],
1798
+ "layout": "IPY_MODEL_99baec9dffac45eea0d573e9b61f81c1"
1799
+ }
1800
+ },
1801
+ "35a3cef4a7ca461a8edc089a9db655a1": {
1802
+ "model_module": "@jupyter-widgets/controls",
1803
+ "model_module_version": "1.5.0",
1804
+ "model_name": "DescriptionStyleModel",
1805
+ "state": {
1806
+ "_model_module": "@jupyter-widgets/controls",
1807
+ "_model_module_version": "1.5.0",
1808
+ "_model_name": "DescriptionStyleModel",
1809
+ "_view_count": null,
1810
+ "_view_module": "@jupyter-widgets/base",
1811
+ "_view_module_version": "1.2.0",
1812
+ "_view_name": "StyleView",
1813
+ "description_width": ""
1814
+ }
1815
+ },
1816
+ "421f7f03e06546d390c74bd2069883bd": {
1817
+ "model_module": "@jupyter-widgets/controls",
1818
+ "model_module_version": "1.5.0",
1819
+ "model_name": "ProgressStyleModel",
1820
+ "state": {
1821
+ "_model_module": "@jupyter-widgets/controls",
1822
+ "_model_module_version": "1.5.0",
1823
+ "_model_name": "ProgressStyleModel",
1824
+ "_view_count": null,
1825
+ "_view_module": "@jupyter-widgets/base",
1826
+ "_view_module_version": "1.2.0",
1827
+ "_view_name": "StyleView",
1828
+ "bar_color": null,
1829
+ "description_width": ""
1830
+ }
1831
+ },
1832
+ "4de066fc521f493cab302c8d5ca255e2": {
1833
+ "model_module": "@jupyter-widgets/base",
1834
+ "model_module_version": "1.2.0",
1835
+ "model_name": "LayoutModel",
1836
+ "state": {
1837
+ "_model_module": "@jupyter-widgets/base",
1838
+ "_model_module_version": "1.2.0",
1839
+ "_model_name": "LayoutModel",
1840
+ "_view_count": null,
1841
+ "_view_module": "@jupyter-widgets/base",
1842
+ "_view_module_version": "1.2.0",
1843
+ "_view_name": "LayoutView",
1844
+ "align_content": null,
1845
+ "align_items": null,
1846
+ "align_self": null,
1847
+ "border": null,
1848
+ "bottom": null,
1849
+ "display": null,
1850
+ "flex": null,
1851
+ "flex_flow": null,
1852
+ "grid_area": null,
1853
+ "grid_auto_columns": null,
1854
+ "grid_auto_flow": null,
1855
+ "grid_auto_rows": null,
1856
+ "grid_column": null,
1857
+ "grid_gap": null,
1858
+ "grid_row": null,
1859
+ "grid_template_areas": null,
1860
+ "grid_template_columns": null,
1861
+ "grid_template_rows": null,
1862
+ "height": null,
1863
+ "justify_content": null,
1864
+ "justify_items": null,
1865
+ "left": null,
1866
+ "margin": null,
1867
+ "max_height": null,
1868
+ "max_width": null,
1869
+ "min_height": null,
1870
+ "min_width": null,
1871
+ "object_fit": null,
1872
+ "object_position": null,
1873
+ "order": null,
1874
+ "overflow": null,
1875
+ "overflow_x": null,
1876
+ "overflow_y": null,
1877
+ "padding": null,
1878
+ "right": null,
1879
+ "top": null,
1880
+ "visibility": null,
1881
+ "width": null
1882
+ }
1883
+ },
1884
+ "4df933c8001c4a27b351c5b66df23eb5": {
1885
+ "model_module": "@jupyter-widgets/base",
1886
+ "model_module_version": "1.2.0",
1887
+ "model_name": "LayoutModel",
1888
+ "state": {
1889
+ "_model_module": "@jupyter-widgets/base",
1890
+ "_model_module_version": "1.2.0",
1891
+ "_model_name": "LayoutModel",
1892
+ "_view_count": null,
1893
+ "_view_module": "@jupyter-widgets/base",
1894
+ "_view_module_version": "1.2.0",
1895
+ "_view_name": "LayoutView",
1896
+ "align_content": null,
1897
+ "align_items": null,
1898
+ "align_self": null,
1899
+ "border": null,
1900
+ "bottom": null,
1901
+ "display": null,
1902
+ "flex": null,
1903
+ "flex_flow": null,
1904
+ "grid_area": null,
1905
+ "grid_auto_columns": null,
1906
+ "grid_auto_flow": null,
1907
+ "grid_auto_rows": null,
1908
+ "grid_column": null,
1909
+ "grid_gap": null,
1910
+ "grid_row": null,
1911
+ "grid_template_areas": null,
1912
+ "grid_template_columns": null,
1913
+ "grid_template_rows": null,
1914
+ "height": null,
1915
+ "justify_content": null,
1916
+ "justify_items": null,
1917
+ "left": null,
1918
+ "margin": null,
1919
+ "max_height": null,
1920
+ "max_width": null,
1921
+ "min_height": null,
1922
+ "min_width": null,
1923
+ "object_fit": null,
1924
+ "object_position": null,
1925
+ "order": null,
1926
+ "overflow": null,
1927
+ "overflow_x": null,
1928
+ "overflow_y": null,
1929
+ "padding": null,
1930
+ "right": null,
1931
+ "top": null,
1932
+ "visibility": null,
1933
+ "width": null
1934
+ }
1935
+ },
1936
+ "52f12329e31645888592d0120a05f6ea": {
1937
+ "model_module": "@jupyter-widgets/base",
1938
+ "model_module_version": "1.2.0",
1939
+ "model_name": "LayoutModel",
1940
+ "state": {
1941
+ "_model_module": "@jupyter-widgets/base",
1942
+ "_model_module_version": "1.2.0",
1943
+ "_model_name": "LayoutModel",
1944
+ "_view_count": null,
1945
+ "_view_module": "@jupyter-widgets/base",
1946
+ "_view_module_version": "1.2.0",
1947
+ "_view_name": "LayoutView",
1948
+ "align_content": null,
1949
+ "align_items": null,
1950
+ "align_self": null,
1951
+ "border": null,
1952
+ "bottom": null,
1953
+ "display": null,
1954
+ "flex": null,
1955
+ "flex_flow": null,
1956
+ "grid_area": null,
1957
+ "grid_auto_columns": null,
1958
+ "grid_auto_flow": null,
1959
+ "grid_auto_rows": null,
1960
+ "grid_column": null,
1961
+ "grid_gap": null,
1962
+ "grid_row": null,
1963
+ "grid_template_areas": null,
1964
+ "grid_template_columns": null,
1965
+ "grid_template_rows": null,
1966
+ "height": null,
1967
+ "justify_content": null,
1968
+ "justify_items": null,
1969
+ "left": null,
1970
+ "margin": null,
1971
+ "max_height": null,
1972
+ "max_width": null,
1973
+ "min_height": null,
1974
+ "min_width": null,
1975
+ "object_fit": null,
1976
+ "object_position": null,
1977
+ "order": null,
1978
+ "overflow": null,
1979
+ "overflow_x": null,
1980
+ "overflow_y": null,
1981
+ "padding": null,
1982
+ "right": null,
1983
+ "top": null,
1984
+ "visibility": null,
1985
+ "width": null
1986
+ }
1987
+ },
1988
+ "5510fb4117fd4954aeeb0de2214b1ed4": {
1989
+ "model_module": "@jupyter-widgets/controls",
1990
+ "model_module_version": "1.5.0",
1991
+ "model_name": "HTMLModel",
1992
+ "state": {
1993
+ "_dom_classes": [],
1994
+ "_model_module": "@jupyter-widgets/controls",
1995
+ "_model_module_version": "1.5.0",
1996
+ "_model_name": "HTMLModel",
1997
+ "_view_count": null,
1998
+ "_view_module": "@jupyter-widgets/controls",
1999
+ "_view_module_version": "1.5.0",
2000
+ "_view_name": "HTMLView",
2001
+ "description": "",
2002
+ "description_tooltip": null,
2003
+ "layout": "IPY_MODEL_4df933c8001c4a27b351c5b66df23eb5",
2004
+ "placeholder": "​",
2005
+ "style": "IPY_MODEL_80071f18a3fd4ddc865a8a536d83cd2e",
2006
+ "value": " 110000/110000 [00:03&lt;00:00, 39633.11it/s]"
2007
+ }
2008
+ },
2009
+ "7d5629fd92c543bcaafe7fd352c079c6": {
2010
+ "model_module": "@jupyter-widgets/controls",
2011
+ "model_module_version": "1.5.0",
2012
+ "model_name": "HBoxModel",
2013
+ "state": {
2014
+ "_dom_classes": [],
2015
+ "_model_module": "@jupyter-widgets/controls",
2016
+ "_model_module_version": "1.5.0",
2017
+ "_model_name": "HBoxModel",
2018
+ "_view_count": null,
2019
+ "_view_module": "@jupyter-widgets/controls",
2020
+ "_view_module_version": "1.5.0",
2021
+ "_view_name": "HBoxView",
2022
+ "box_style": "",
2023
+ "children": [
2024
+ "IPY_MODEL_a815fb290af249cbb23679e34d1fdaa3",
2025
+ "IPY_MODEL_b042bdc131654d25bac957a4e921f59d",
2026
+ "IPY_MODEL_5510fb4117fd4954aeeb0de2214b1ed4"
2027
+ ],
2028
+ "layout": "IPY_MODEL_4de066fc521f493cab302c8d5ca255e2"
2029
+ }
2030
+ },
2031
+ "80071f18a3fd4ddc865a8a536d83cd2e": {
2032
+ "model_module": "@jupyter-widgets/controls",
2033
+ "model_module_version": "1.5.0",
2034
+ "model_name": "DescriptionStyleModel",
2035
+ "state": {
2036
+ "_model_module": "@jupyter-widgets/controls",
2037
+ "_model_module_version": "1.5.0",
2038
+ "_model_name": "DescriptionStyleModel",
2039
+ "_view_count": null,
2040
+ "_view_module": "@jupyter-widgets/base",
2041
+ "_view_module_version": "1.2.0",
2042
+ "_view_name": "StyleView",
2043
+ "description_width": ""
2044
+ }
2045
+ },
2046
+ "87140c7e325a4fef97173c83ef14dcea": {
2047
+ "model_module": "@jupyter-widgets/base",
2048
+ "model_module_version": "1.2.0",
2049
+ "model_name": "LayoutModel",
2050
+ "state": {
2051
+ "_model_module": "@jupyter-widgets/base",
2052
+ "_model_module_version": "1.2.0",
2053
+ "_model_name": "LayoutModel",
2054
+ "_view_count": null,
2055
+ "_view_module": "@jupyter-widgets/base",
2056
+ "_view_module_version": "1.2.0",
2057
+ "_view_name": "LayoutView",
2058
+ "align_content": null,
2059
+ "align_items": null,
2060
+ "align_self": null,
2061
+ "border": null,
2062
+ "bottom": null,
2063
+ "display": null,
2064
+ "flex": null,
2065
+ "flex_flow": null,
2066
+ "grid_area": null,
2067
+ "grid_auto_columns": null,
2068
+ "grid_auto_flow": null,
2069
+ "grid_auto_rows": null,
2070
+ "grid_column": null,
2071
+ "grid_gap": null,
2072
+ "grid_row": null,
2073
+ "grid_template_areas": null,
2074
+ "grid_template_columns": null,
2075
+ "grid_template_rows": null,
2076
+ "height": null,
2077
+ "justify_content": null,
2078
+ "justify_items": null,
2079
+ "left": null,
2080
+ "margin": null,
2081
+ "max_height": null,
2082
+ "max_width": null,
2083
+ "min_height": null,
2084
+ "min_width": null,
2085
+ "object_fit": null,
2086
+ "object_position": null,
2087
+ "order": null,
2088
+ "overflow": null,
2089
+ "overflow_x": null,
2090
+ "overflow_y": null,
2091
+ "padding": null,
2092
+ "right": null,
2093
+ "top": null,
2094
+ "visibility": null,
2095
+ "width": null
2096
+ }
2097
+ },
2098
+ "88ba2081fc004bbdb6f438c5031458d5": {
2099
+ "model_module": "@jupyter-widgets/controls",
2100
+ "model_module_version": "1.5.0",
2101
+ "model_name": "DescriptionStyleModel",
2102
+ "state": {
2103
+ "_model_module": "@jupyter-widgets/controls",
2104
+ "_model_module_version": "1.5.0",
2105
+ "_model_name": "DescriptionStyleModel",
2106
+ "_view_count": null,
2107
+ "_view_module": "@jupyter-widgets/base",
2108
+ "_view_module_version": "1.2.0",
2109
+ "_view_name": "StyleView",
2110
+ "description_width": ""
2111
+ }
2112
+ },
2113
+ "93a30d3ff5084c58a264f1e1cb6e214c": {
2114
+ "model_module": "@jupyter-widgets/base",
2115
+ "model_module_version": "1.2.0",
2116
+ "model_name": "LayoutModel",
2117
+ "state": {
2118
+ "_model_module": "@jupyter-widgets/base",
2119
+ "_model_module_version": "1.2.0",
2120
+ "_model_name": "LayoutModel",
2121
+ "_view_count": null,
2122
+ "_view_module": "@jupyter-widgets/base",
2123
+ "_view_module_version": "1.2.0",
2124
+ "_view_name": "LayoutView",
2125
+ "align_content": null,
2126
+ "align_items": null,
2127
+ "align_self": null,
2128
+ "border": null,
2129
+ "bottom": null,
2130
+ "display": null,
2131
+ "flex": null,
2132
+ "flex_flow": null,
2133
+ "grid_area": null,
2134
+ "grid_auto_columns": null,
2135
+ "grid_auto_flow": null,
2136
+ "grid_auto_rows": null,
2137
+ "grid_column": null,
2138
+ "grid_gap": null,
2139
+ "grid_row": null,
2140
+ "grid_template_areas": null,
2141
+ "grid_template_columns": null,
2142
+ "grid_template_rows": null,
2143
+ "height": null,
2144
+ "justify_content": null,
2145
+ "justify_items": null,
2146
+ "left": null,
2147
+ "margin": null,
2148
+ "max_height": null,
2149
+ "max_width": null,
2150
+ "min_height": null,
2151
+ "min_width": null,
2152
+ "object_fit": null,
2153
+ "object_position": null,
2154
+ "order": null,
2155
+ "overflow": null,
2156
+ "overflow_x": null,
2157
+ "overflow_y": null,
2158
+ "padding": null,
2159
+ "right": null,
2160
+ "top": null,
2161
+ "visibility": null,
2162
+ "width": null
2163
+ }
2164
+ },
2165
+ "99baec9dffac45eea0d573e9b61f81c1": {
2166
+ "model_module": "@jupyter-widgets/base",
2167
+ "model_module_version": "1.2.0",
2168
+ "model_name": "LayoutModel",
2169
+ "state": {
2170
+ "_model_module": "@jupyter-widgets/base",
2171
+ "_model_module_version": "1.2.0",
2172
+ "_model_name": "LayoutModel",
2173
+ "_view_count": null,
2174
+ "_view_module": "@jupyter-widgets/base",
2175
+ "_view_module_version": "1.2.0",
2176
+ "_view_name": "LayoutView",
2177
+ "align_content": null,
2178
+ "align_items": null,
2179
+ "align_self": null,
2180
+ "border": null,
2181
+ "bottom": null,
2182
+ "display": null,
2183
+ "flex": null,
2184
+ "flex_flow": null,
2185
+ "grid_area": null,
2186
+ "grid_auto_columns": null,
2187
+ "grid_auto_flow": null,
2188
+ "grid_auto_rows": null,
2189
+ "grid_column": null,
2190
+ "grid_gap": null,
2191
+ "grid_row": null,
2192
+ "grid_template_areas": null,
2193
+ "grid_template_columns": null,
2194
+ "grid_template_rows": null,
2195
+ "height": null,
2196
+ "justify_content": null,
2197
+ "justify_items": null,
2198
+ "left": null,
2199
+ "margin": null,
2200
+ "max_height": null,
2201
+ "max_width": null,
2202
+ "min_height": null,
2203
+ "min_width": null,
2204
+ "object_fit": null,
2205
+ "object_position": null,
2206
+ "order": null,
2207
+ "overflow": null,
2208
+ "overflow_x": null,
2209
+ "overflow_y": null,
2210
+ "padding": null,
2211
+ "right": null,
2212
+ "top": null,
2213
+ "visibility": null,
2214
+ "width": null
2215
+ }
2216
+ },
2217
+ "a5482757a62545f19ade21ca6f6c4240": {
2218
+ "model_module": "@jupyter-widgets/base",
2219
+ "model_module_version": "1.2.0",
2220
+ "model_name": "LayoutModel",
2221
+ "state": {
2222
+ "_model_module": "@jupyter-widgets/base",
2223
+ "_model_module_version": "1.2.0",
2224
+ "_model_name": "LayoutModel",
2225
+ "_view_count": null,
2226
+ "_view_module": "@jupyter-widgets/base",
2227
+ "_view_module_version": "1.2.0",
2228
+ "_view_name": "LayoutView",
2229
+ "align_content": null,
2230
+ "align_items": null,
2231
+ "align_self": null,
2232
+ "border": null,
2233
+ "bottom": null,
2234
+ "display": null,
2235
+ "flex": null,
2236
+ "flex_flow": null,
2237
+ "grid_area": null,
2238
+ "grid_auto_columns": null,
2239
+ "grid_auto_flow": null,
2240
+ "grid_auto_rows": null,
2241
+ "grid_column": null,
2242
+ "grid_gap": null,
2243
+ "grid_row": null,
2244
+ "grid_template_areas": null,
2245
+ "grid_template_columns": null,
2246
+ "grid_template_rows": null,
2247
+ "height": null,
2248
+ "justify_content": null,
2249
+ "justify_items": null,
2250
+ "left": null,
2251
+ "margin": null,
2252
+ "max_height": null,
2253
+ "max_width": null,
2254
+ "min_height": null,
2255
+ "min_width": null,
2256
+ "object_fit": null,
2257
+ "object_position": null,
2258
+ "order": null,
2259
+ "overflow": null,
2260
+ "overflow_x": null,
2261
+ "overflow_y": null,
2262
+ "padding": null,
2263
+ "right": null,
2264
+ "top": null,
2265
+ "visibility": null,
2266
+ "width": null
2267
+ }
2268
+ },
2269
+ "a815fb290af249cbb23679e34d1fdaa3": {
2270
+ "model_module": "@jupyter-widgets/controls",
2271
+ "model_module_version": "1.5.0",
2272
+ "model_name": "HTMLModel",
2273
+ "state": {
2274
+ "_dom_classes": [],
2275
+ "_model_module": "@jupyter-widgets/controls",
2276
+ "_model_module_version": "1.5.0",
2277
+ "_model_name": "HTMLModel",
2278
+ "_view_count": null,
2279
+ "_view_module": "@jupyter-widgets/controls",
2280
+ "_view_module_version": "1.5.0",
2281
+ "_view_name": "HTMLView",
2282
+ "description": "",
2283
+ "description_tooltip": null,
2284
+ "layout": "IPY_MODEL_06d6872e3d1f4086a4f16b8caababccb",
2285
+ "placeholder": "​",
2286
+ "style": "IPY_MODEL_88ba2081fc004bbdb6f438c5031458d5",
2287
+ "value": "100%"
2288
+ }
2289
+ },
2290
+ "b042bdc131654d25bac957a4e921f59d": {
2291
+ "model_module": "@jupyter-widgets/controls",
2292
+ "model_module_version": "1.5.0",
2293
+ "model_name": "FloatProgressModel",
2294
+ "state": {
2295
+ "_dom_classes": [],
2296
+ "_model_module": "@jupyter-widgets/controls",
2297
+ "_model_module_version": "1.5.0",
2298
+ "_model_name": "FloatProgressModel",
2299
+ "_view_count": null,
2300
+ "_view_module": "@jupyter-widgets/controls",
2301
+ "_view_module_version": "1.5.0",
2302
+ "_view_name": "ProgressView",
2303
+ "bar_style": "success",
2304
+ "description": "",
2305
+ "description_tooltip": null,
2306
+ "layout": "IPY_MODEL_52f12329e31645888592d0120a05f6ea",
2307
+ "max": 110000,
2308
+ "min": 0,
2309
+ "orientation": "horizontal",
2310
+ "style": "IPY_MODEL_214288b8cb6c453084c3cce01982eff7",
2311
+ "value": 110000
2312
+ }
2313
+ },
2314
+ "b0828f5f2f204222a40c7e37f827b5b8": {
2315
+ "model_module": "@jupyter-widgets/controls",
2316
+ "model_module_version": "1.5.0",
2317
+ "model_name": "DescriptionStyleModel",
2318
+ "state": {
2319
+ "_model_module": "@jupyter-widgets/controls",
2320
+ "_model_module_version": "1.5.0",
2321
+ "_model_name": "DescriptionStyleModel",
2322
+ "_view_count": null,
2323
+ "_view_module": "@jupyter-widgets/base",
2324
+ "_view_module_version": "1.2.0",
2325
+ "_view_name": "StyleView",
2326
+ "description_width": ""
2327
+ }
2328
+ },
2329
+ "d1404cdae4624c619c7793ac55e7917f": {
2330
+ "model_module": "@jupyter-widgets/controls",
2331
+ "model_module_version": "1.5.0",
2332
+ "model_name": "FloatProgressModel",
2333
+ "state": {
2334
+ "_dom_classes": [],
2335
+ "_model_module": "@jupyter-widgets/controls",
2336
+ "_model_module_version": "1.5.0",
2337
+ "_model_name": "FloatProgressModel",
2338
+ "_view_count": null,
2339
+ "_view_module": "@jupyter-widgets/controls",
2340
+ "_view_module_version": "1.5.0",
2341
+ "_view_name": "ProgressView",
2342
+ "bar_style": "success",
2343
+ "description": "",
2344
+ "description_tooltip": null,
2345
+ "layout": "IPY_MODEL_93a30d3ff5084c58a264f1e1cb6e214c",
2346
+ "max": 110000,
2347
+ "min": 0,
2348
+ "orientation": "horizontal",
2349
+ "style": "IPY_MODEL_421f7f03e06546d390c74bd2069883bd",
2350
+ "value": 110000
2351
+ }
2352
+ }
2353
+ }
2354
+ }
2355
+ },
2356
+ "nbformat": 4,
2357
+ "nbformat_minor": 4
2358
+ }
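The `faiss_query` cell in the notebook above averages the word vectors of the query, L2-normalizes the result, and searches an inner-product index. On unit-length vectors the inner product equals cosine similarity, which is why `IndexFlatIP` is paired with `normalize_L2`. The following is a minimal pure-Python sketch of that idea, not the notebook's implementation: it does not use FAISS, the three-dimensional word vectors are invented for illustration, and the helper names (`embed`, `flat_ip_search`, `toy_word_vectors`) are hypothetical.

```python
import math

def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def l2_normalize(vec):
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def inner_product(a, b):
    return sum(x * y for x, y in zip(a, b))

def flat_ip_search(doc_vectors, query_vec, top_k):
    """Exhaustive inner-product search: score every document, keep the best top_k."""
    scored = [(inner_product(query_vec, d), i) for i, d in enumerate(doc_vectors)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]

# Made-up word vectors standing in for a trained Word2Vec model (assumption).
toy_word_vectors = {
    "black": [1.0, 0.0, 0.0],
    "holes": [0.0, 1.0, 0.0],
    "protein": [0.0, 0.0, 1.0],
}

def embed(tokens):
    """Mean of the known word vectors, L2-normalized; None if no token is known."""
    known = [toy_word_vectors[t] for t in tokens if t in toy_word_vectors]
    if not known:
        return None
    return l2_normalize(mean_vector(known))

# Three toy "documents", already tokenized.
docs = [["black", "holes"], ["protein"], ["black", "protein"]]
doc_vectors = [embed(d) for d in docs]

query_vec = embed("black holes".lower().split())
results = flat_ip_search(doc_vectors, query_vec, top_k=2)
for score, doc_id in results:
    print(doc_id, round(score, 3), docs[doc_id])
```

Because both the document vectors and the query vector are normalized, each score is a cosine similarity in the range [-1, 1]. This sketch scores documents one at a time in pure Python, so it only illustrates the ranking logic, not the speed of a FAISS index.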