Dataset Card: MK-LLM
Sources
- Macedonian Wikipedia
- Selected Macedonian news, government, education, culture, and technology sites (see data/process_all_data.py)
Processing
- HTML extraction (BeautifulSoup)
- Language filtering (mk)
- Cleaning: removal of templates, references, and markup
- Consolidation into data/cleaned/mk_combined_data.txt
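
The cleaning and language-filtering steps above can be sketched roughly as follows. This is a minimal illustration, not the code in data/process_all_data.py: the helper names `clean_markup` and `looks_macedonian`, the regular expressions, and the 0.6 threshold are all assumptions. The Cyrillic-ratio check is only a crude stand-in for a real language identifier, since it cannot tell Macedonian apart from other Cyrillic-script languages.

```python
import re

# Hypothetical helpers for illustration; names, patterns, and thresholds
# are assumptions, not taken from data/process_all_data.py.
REF_RE = re.compile(r"<ref[^>]*?/>|<ref[^>]*>.*?</ref>", re.DOTALL | re.IGNORECASE)
TEMPLATE_RE = re.compile(r"\{\{[^{}]*\}\}")
LINK_RE = re.compile(r"\[\[(?:[^\]|]*\|)?([^\]]*)\]\]")
TAG_RE = re.compile(r"<[^>]+>")
CYRILLIC_RE = re.compile(r"[\u0400-\u04FF]")


def clean_markup(text: str) -> str:
    """Strip wiki-style refs, templates, links, and leftover tags."""
    text = REF_RE.sub("", text)
    # Remove innermost templates repeatedly so nested ones collapse too.
    prev = None
    while prev != text:
        prev = text
        text = TEMPLATE_RE.sub("", text)
    # Keep the visible label of [[target|label]] or [[target]] links.
    text = LINK_RE.sub(r"\1", text)
    text = TAG_RE.sub("", text)
    # Drop bold/italic quote markup.
    text = text.replace("'''", "").replace("''", "")
    return re.sub(r"\s+", " ", text).strip()


def looks_macedonian(text: str, min_ratio: float = 0.6) -> bool:
    """Crude script-based filter: share of letters that are Cyrillic."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return False
    cyrillic = sum(1 for ch in letters if CYRILLIC_RE.match(ch))
    return cyrillic / len(letters) >= min_ratio
```

A line would be kept only if `looks_macedonian(clean_markup(line))` is true. A production filter would more likely rely on a trained language-identification model than on script detection alone.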
Licensing
- Respect original site licenses (CC-BY/CC-BY-SA or site terms). Provide attribution where required.
Known issues
- Site boilerplate may leak into the cleaned text; deduplication is recommended for larger crawls.
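
One possible way to approach the recommended deduplication is exact-match filtering on normalized lines, which tends to catch repeated navigation or footer text. The sketch below is an assumption about how this could be done, not part of the existing pipeline. The function names are hypothetical, and near-duplicate detection (for example MinHash) would need a different approach.

```python
import hashlib
import re


def normalize(line: str) -> str:
    """Collapse whitespace and lowercase so trivial variants compare equal."""
    return re.sub(r"\s+", " ", line).strip().lower()


def dedupe_lines(lines):
    """Keep the first occurrence of each normalized line; drop empty lines."""
    seen = set()
    unique = []
    for line in lines:
        key = normalize(line)
        if not key:
            continue
        # Store a fixed-size digest rather than the full line to bound memory.
        digest = hashlib.sha1(key.encode("utf-8")).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        unique.append(line)
    return unique
```

Applied to the consolidated text file, this would be run over its lines before writing the output back. Note that line-level deduplication can also remove legitimately repeated short sentences, so a minimum-length condition may be worth adding.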