File size: 1,230 Bytes
b4c706a
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
---
license: apache-2.0
tags:
- math
- ocr
- typst
- latex
size_categories:
- 1M<n<10M
---

# Typst Image Dataset

This dataset was generated with a [fork](https://github.com/JeppeKlitgaard/tex2typ) of [tex2typ] and the [hoang-quoc-trung/fusion-image-to-latex-datasets] dataset, which itself is a compilation of LaTeX labels and images of equations.

The hoang-quoc-trung dataset is difficult to work with in that it has the image data stored in a large compressed RAR archive, which does not permit efficient random read access. Additionally, it appears to have a larger number of corrupted filenames inside the archive, which has been mended in this dataset.

This dataset instead opts to use a WebDataset for convenient and efficient storage of the image files and associated metadata.

The code used to generate this dataset can be found at here: https://github.com/JeppeKlitgaard/DTU-02456-Deep-Learning-Project (this is currently private but should be released after examination. If this is not the case prod me at `huggingface@jeppe.science`)

[tex2typ]: https://github.com/ParaN3xus/tex2typ
[hoang-quoc-trung/fusion-image-to-latex-datasets]: https://huggingface.co/datasets/hoang-quoc-trung/fusion-image-to-latex-datasets