---
license: apache-2.0
tags:
- math
- ocr
- typst
- latex
size_categories:
- 1M<n<10M
---
# Typst Image Dataset
This dataset was generated with a [fork](https://github.com/JeppeKlitgaard/tex2typ) of [tex2typ] and the [hoang-quoc-trung/fusion-image-to-latex-datasets] dataset, which itself is a compilation of LaTeX labels and images of equations.
The hoang-quoc-trung dataset is difficult to work with because its image data is stored in a large compressed RAR archive, which does not permit efficient random read access. Additionally, the archive appears to contain a large number of corrupted filenames, which have been repaired in this dataset.
This dataset instead opts to use a WebDataset for convenient and efficient storage of the image files and associated metadata.
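As a rough illustration of why the WebDataset layout is convenient: a WebDataset shard is a plain tar file in which all files belonging to one sample share a basename and sit next to each other. The sketch below reads such a shard with only the Python standard library. The function names `split_key` and `iter_samples` are hypothetical helpers, and the extensions actually used inside this dataset's shards are not stated here, so inspect the keys of a yielded sample before relying on any particular one.

```python
import io
import tarfile
from typing import Dict, Iterator, Tuple


def split_key(name: str) -> Tuple[str, str]:
    """Split a tar member name into (sample key, extension).

    Follows the WebDataset convention: the extension starts at the
    first dot of the basename, everything before it is the key.
    """
    directory, _, base = name.rpartition("/")
    stem, _, ext = base.partition(".")
    key = f"{directory}/{stem}" if directory else stem
    return key, ext


def iter_samples(fileobj) -> Iterator[Dict[str, object]]:
    """Yield one dict per sample from a WebDataset-style tar stream.

    Each dict maps extension -> raw bytes, plus "__key__" -> sample key.
    Assumes files of the same sample are adjacent in the archive.
    """
    current_key = None
    sample: Dict[str, object] = {}
    # "r|*" reads the tar sequentially, so it also works on non-seekable streams.
    with tarfile.open(fileobj=fileobj, mode="r|*") as tar:
        for member in tar:
            if not member.isfile():
                continue
            key, ext = split_key(member.name)
            if key != current_key:
                if sample:
                    yield sample
                current_key, sample = key, {"__key__": key}
            sample[ext] = tar.extractfile(member).read()
    if sample:
        yield sample
```

To use it, open a downloaded shard in binary mode and pass the file object to `iter_samples`. In practice the `webdataset` package, which reads the same tar layout, is likely the more convenient choice; the sketch only shows what the format looks like underneath.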
The code used to generate this dataset can be found here: https://github.com/JeppeKlitgaard/DTU-02456-Deep-Learning-Project (the repository is currently private but should be released after examination; if it has not been, prod me at `huggingface@jeppe.science`).
[tex2typ]: https://github.com/ParaN3xus/tex2typ
[hoang-quoc-trung/fusion-image-to-latex-datasets]: https://huggingface.co/datasets/hoang-quoc-trung/fusion-image-to-latex-datasets