Improve dataset hygiene for literary corpora

#16
by jbakerx - opened

For Russian Gutenberg/other sources:

Strip editorial notes, footnotes, chapter headers, and license boilerplate
Detect and remove OCR artifacts (e.g., stray end-of-line hyphenation, broken Cyrillic). This matters more for Russian texts, where OCR noise can be substantial.
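
A rough sketch of what both steps could look like, in case it helps the discussion. Everything here is an assumption rather than existing code in this repo: the function names (`clean`, `fix_mixed_script`), the regexes for chapter headers and footnote markers, and the Latin-to-Cyrillic look-alike table are all hypothetical. It also assumes Project Gutenberg style `*** START OF ... ***` / `*** END OF ... ***` delimiter lines, and it reads "broken Cyrillic" as Latin look-alike letters inside Cyrillic words, which is only one kind of OCR damage.

```python
import re

# Assumed Project Gutenberg body delimiters; other sources will need their own rules.
PG_START = re.compile(r"^\*\*\* ?START OF (?:THE|THIS) PROJECT GUTENBERG EBOOK.*$", re.MULTILINE)
PG_END = re.compile(r"^\*\*\* ?END OF (?:THE|THIS) PROJECT GUTENBERG EBOOK.*$", re.MULTILINE)

# Hypothetical patterns: a header line such as "Глава I" / "Часть 2", and "[1]"-style markers.
CHAPTER_HEADER = re.compile(r"^[ \t]*(?:глава|часть)[ \t]+[\dIVXLC]+\.?[ \t]*$",
                            re.IGNORECASE | re.MULTILINE)
FOOTNOTE_MARK = re.compile(r"\[\d+\]")

# Heuristic: a lowercase Cyrillic word split by a hyphen at a line break.
# Caveat: this also joins genuine compounds such as "кто-" / "то" when they wrap.
HYPHEN_BREAK = re.compile(r"([а-яё])-\n([а-яё])")

# Latin letters that OCR commonly confuses with Cyrillic look-alikes (illustrative subset).
LATIN_TO_CYRILLIC = str.maketrans({
    "a": "а", "c": "с", "e": "е", "o": "о", "p": "р", "x": "х", "y": "у",
    "A": "А", "B": "В", "C": "С", "E": "Е", "H": "Н", "K": "К",
    "M": "М", "O": "О", "P": "Р", "T": "Т", "X": "Х",
})

WORD = re.compile(r"[^\W\d_]+")  # runs of letters in any script


def fix_mixed_script(text: str) -> str:
    """Replace Latin look-alikes, but only inside words that also contain Cyrillic."""
    def repl(match: re.Match) -> str:
        word = match.group(0)
        if re.search(r"[а-яёА-ЯЁ]", word) and re.search(r"[a-zA-Z]", word):
            return word.translate(LATIN_TO_CYRILLIC)
        return word
    return WORD.sub(repl, text)


def clean(text: str) -> str:
    """Strip boilerplate, headers, and footnote markers, then repair simple OCR noise."""
    start, end = PG_START.search(text), PG_END.search(text)
    if start and end and start.end() < end.start():
        text = text[start.end():end.start()]  # keep only the body between the markers
    text = CHAPTER_HEADER.sub("", text)
    text = FOOTNOTE_MARK.sub("", text)
    text = HYPHEN_BREAK.sub(r"\1\2", text)
    text = fix_mixed_script(text)
    return re.sub(r"\n{3,}", "\n\n", text).strip()
```

Calling `clean(raw_text)` on one book at a time should be enough to try it out. The hyphenation rule is the riskiest part, so it may be worth logging what it changes and spot-checking a sample before applying it to the whole corpus.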
