aidalab/rutabert-base


LedZeppe1in committed on Jun 12, 2024

Commit 758b9d4 · 1 Parent(s): 1507360

Updated README.md

Files changed (1)

README.md +25 -6


@@ -3,21 +3,40 @@ language: ru
 tags:
 - rutabert
 - semantic-table-interpretation
 license: mit
 ---
 # RuTaBERT (base model)
-RuTaBERT is a model solving the problem of Column Type Annotation with pre-trained large language model (BERT), trained on the Russian corpus. The original repo can be found [here](https://github.com/STI-Team/RuTaBERT).
 ## Model description
-...
 ## How to Use
-Here is how to use this model in transformers:
-```python
-#
-```

 tags:
 - rutabert
 - semantic-table-interpretation
+- column-type-annotation
 license: mit
 ---
 # RuTaBERT (base model)
+RuTaBERT is a model for Column Type Annotation (CTA). It is built on a pre-trained language model (BERT) and fine-tuned on a corpus of Russian-language tables. The original repository can be found [here](https://github.com/STI-Team/RuTaBERT).
 ## Model description
+RuTaBERT is a 12-layer multilingual BERT language model ([bert-base-multilingual-cased](https://huggingface.co/google-bert/bert-base-multilingual-cased)) fine-tuned for Column Type Annotation (CTA).
+We trained RuTaBERT on a labeled tabular dataset, [Russian Web Tables (RWT)](https://arxiv.org/abs/2210.06353). RWT was built from the Russian-language Wikipedia as of September 13, 2021 and contains 1.2 million tables (7.4 million columns). The RWT tables were labeled automatically using a set of 356 semantic types (classes, data properties and object properties) taken from the general-purpose knowledge graph [DBpedia](https://www.dbpedia.org/) and translated into Russian. The table preprocessing stage yielded a dataset of 1,441,349 labeled columns, in which only 170 semantic types (labels) are used.
+## Intended Uses
+An input table is a two-dimensional array composed of rows and columns. Each cell holds a value that can be represented as text, a number, a date, and so on. You can use raw vertical tables (e.g., tables in the CSV format) as input data. A vertical table is a data structure organized in vertical columns, each of which may include a header. In such tables, each column belongs to one of two types:
+1. *A named entity (categorical) column* contains entity mentions of some domain (e.g., persons, organizations, events).
+2. *A literal column* contains values of simple datatypes (e.g., date, time, cardinal number).
+**Assumption 1.** *The first row of a source table is a header containing attribute (column) names.*
+**Assumption 2.** *All cell values within a column of a source table have the same entity type or data type.*
+Thus, RuTaBERT predicts semantic types for each column in a vertical table.
 ## How to Use
+An example of using this model is given in the `rutabert_pipeline` folder:
+* `data` contains a table example in the CSV format for testing;
+* `dataset`, `model`, `sem_types`, and `pipeline` contain the implementation of a custom pipeline for RuTaBERT;
+* `inference_example` contains an example of pipeline registration and inference based on it.
+## Authors
+* [Kirill V. Tobola](mailto:kirilltobola@gmail.com)
+* [Nikita O. Dorodnykh](mailto:nikidorny@icc.ru)
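
The input format described under "Intended Uses" in the diff above (a vertical CSV table whose first row is a header, per Assumption 1) can be illustrated with a small sketch. This is not the code from the `rutabert_pipeline` folder: the function name `read_columns` and the sample table `SAMPLE` are hypothetical, and the sketch only shows how a CSV table might be split into per-column value lists before any model is applied.

```python
from __future__ import annotations

import csv
import io


def read_columns(csv_text: str) -> list[tuple[str, list[str]]]:
    """Split a vertical CSV table into (header, cell values) pairs.

    Assumes the first row is the header (Assumption 1 in the README).
    Cells beyond the header width are ignored.
    """
    rows = list(csv.reader(io.StringIO(csv_text)))
    if not rows:
        return []
    header, body = rows[0], rows[1:]
    values: list[list[str]] = [[] for _ in header]
    for row in body:
        for index, cell in enumerate(row[: len(header)]):
            values[index].append(cell)
    return list(zip(header, values))


# Hypothetical sample: one named entity column and one literal column.
SAMPLE = "Имя,Год рождения\nЛев Толстой,1828\nАнна Ахматова,1889\n"

columns = read_columns(SAMPLE)
for name, cells in columns:
    print(name, cells)
```

Each resulting pair corresponds to one column for which a semantic type would be predicted. Returning pairs rather than a dictionary keeps columns with duplicate header names separate.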