LedZeppe1in committed on
Commit
758b9d4
·
1 Parent(s): 1507360

Updated README.md

Files changed (1)
  1. README.md +25 -6
README.md CHANGED
@@ -3,21 +3,40 @@ language: ru
3
  tags:
4
  - rutabert
5
  - semantic-table-interpretation
 
6
  license: mit
7
  ---
8
 
9
  # RuTaBERT (base model)
10
 
11
- RuTaBERT is a model solving the problem of Column Type Annotation with pre-trained large language model (BERT), trained on the Russian corpus. The original repo can be found [here](https://github.com/STI-Team/RuTaBERT).
12
 
13
  ## Model description
14
 
15
- ...
16
 
17
  ## How to Use
18
 
19
- Here is how to use this model in transformers:
20
 
21
- ```python
22
- #
23
- ```
 
3
  tags:
4
  - rutabert
5
  - semantic-table-interpretation
6
+ - column-type-annotation
7
  license: mit
8
  ---
9
 
10
  # RuTaBERT (base model)
11
 
12
+ RuTaBERT is a model for Column Type Annotation, built on a pre-trained language model (BERT) and fine-tuned on a corpus of Russian-language tables. The original repository can be found [here](https://github.com/STI-Team/RuTaBERT).
13
 
14
  ## Model description
15
 
16
+ RuTaBERT is a 12-layer multilingual BERT ([bert-base-multilingual-cased](https://huggingface.co/google-bert/bert-base-multilingual-cased)) language model fine-tuned for Column Type Annotation (CTA).
17
+
18
+ We trained RuTaBERT on a labeled tabular dataset, [Russian Web Tables (RWT)](https://arxiv.org/abs/2210.06353). RWT was built from the Russian-language Wikipedia as of September 13, 2021 and contains 1.2 million tables (7.4 million columns). The RWT tables were labeled automatically using a set of 356 semantic types (classes, data properties, and object properties) taken from the general-purpose knowledge graph [DBpedia](https://www.dbpedia.org/) and translated into Russian. Table preprocessing yielded a dataset of 1,441,349 labeled columns, in which only 170 semantic types (labels) were used.
19
+
20
+ ## Intended Uses
21
+
22
+ An input table is a two-dimensional array of rows and columns. Each cell holds a value that can be represented as text, a number, a date, and so on. Raw vertical tables (e.g., tables in CSV format) can be used as input data. A vertical table is organized in columns, each of which may include a header. In such tables, each column is one of two types:
23
+ 1. *A named entity (categorical) column* contains mentions of entities from some domain (e.g., persons, organizations, events).
24
+ 2. *A literal column* contains values of simple data types (e.g., date, time, cardinal number).
25
+
26
+ **Assumption 1.** *The first row of a source table is a header containing attribute (column) names.*
27
+
28
+ **Assumption 2.** *All cell values within a column of a source table share the same entity type or data type.*
29
+
30
+ Thus, RuTaBERT predicts semantic types for each column in a vertical table.
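As an illustration of the vertical-table layout and Assumption 1 described above, here is a minimal sketch of splitting a CSV table into per-column value lists before annotation. It is not part of this commit or of the RuTaBERT repository: the function name `read_columns` and the sample table are hypothetical, and only the Python standard library is used.

```python
import csv
import io


def read_columns(csv_text):
    """Split a vertical CSV table into columns keyed by header name."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    if not rows:
        return {}
    # Assumption 1: the first row is the header with column names.
    header, body = rows[0], rows[1:]
    # Note: duplicate header names would be merged into one column in this sketch.
    columns = {name: [] for name in header}
    for row in body:
        for index, name in enumerate(header):
            # Pad short rows with empty strings so all columns have equal length.
            columns[name].append(row[index] if index < len(row) else "")
    return columns


# Hypothetical sample: one named entity column and one literal column.
sample = "name,birth_year\nLeo Tolstoy,1828\nAnton Chekhov,1860\n"
columns = read_columns(sample)
for name, values in columns.items():
    print(name, values)
```

Each resulting list holds the cell values of one column, which is the unit that a column type annotation model assigns a semantic type to.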
31
 
32
  ## How to Use
33
 
34
+ An example of using this model is given in the `rutabert_pipeline` folder:
35
+ * `data` contains a table example in the CSV format for testing;
36
+ * `dataset`, `model`, `sem_types`, and `pipeline` contain the implementation of a custom pipeline for RuTaBERT;
37
+ * `inference_example` contains an example of pipeline registration and inference based on it.
38
+
39
+ ## Authors
40
 
41
+ * [Kirill V. Tobola](mailto:kirilltobola@gmail.com)
42
+ * [Nikita O. Dorodnykh](mailto:nikidorny@icc.ru)