
Dataset Card: MK-LLM

Sources

  • Macedonian Wikipedia
  • Selected Macedonian news, government, education, culture, and technology sites (listed in data/process_all_data.py)

Processing

  • HTML extraction (BeautifulSoup)
  • Language filtering (mk)
  • Cleaning: removal of templates, reference tags, and markup
  • Consolidation into data/cleaned/mk_combined_data.txt
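
The cleaning and language-filtering steps above can be sketched as follows. This is a minimal illustration, not the code in data/process_all_data.py: the function names `clean_wikitext` and `looks_macedonian`, the regular expressions, and the 0.8 threshold are all assumptions made for the example. The Cyrillic-ratio check is only a crude stand-in for a real language identifier, since it cannot tell Macedonian apart from other languages written in Cyrillic.

```python
import re

def clean_wikitext(text: str) -> str:
    """Strip wiki templates, <ref> tags, and simple markup (illustrative only)."""
    # Self-closing refs first, so the paired pattern cannot start at one of them.
    text = re.sub(r"<ref\b[^>]*/>", "", text)
    text = re.sub(r"<ref\b[^>]*>.*?</ref>", "", text, flags=re.DOTALL)
    # Remove {{templates}} from the innermost level outwards until none remain.
    while True:
        text, count = re.subn(r"\{\{[^{}]*\}\}", "", text)
        if count == 0:
            break
    # [[target|label]] -> label, [[target]] -> target
    text = re.sub(r"\[\[(?:[^\[\]|]*\|)?([^\[\]]*)\]\]", r"\1", text)
    # Bold/italic quote runs and any leftover HTML-style tags.
    text = re.sub(r"'{2,}", "", text)
    text = re.sub(r"<[^>]+>", "", text)
    # Collapse runs of whitespace.
    return re.sub(r"\s+", " ", text).strip()

def looks_macedonian(text: str, threshold: float = 0.8) -> bool:
    """Hypothetical heuristic: share of alphabetic characters that are Cyrillic."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return False
    cyrillic = sum(1 for ch in letters if "\u0400" <= ch <= "\u04ff")
    return cyrillic / len(letters) >= threshold

raw = (
    "'''Скопје''' е [[главен град|главниот град]] на [[Македонија]]."
    '<ref name="a"/>{{Инфокутија|име={{мал|x}}}}<ref>Извор</ref>'
)
cleaned = clean_wikitext(raw)
print(cleaned)
print(looks_macedonian(cleaned))
```

Running the filter after cleaning, as here, keeps markup characters and template names from skewing the character statistics.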

Licensing

  • Respect the original license of each source site (CC-BY, CC-BY-SA, or the site's own terms), and provide attribution where required.

Known issues

  • Boilerplate text may leak into the corpus; deduplication is recommended for larger crawls.
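
One simple way to approach the deduplication mentioned above is exact-match removal of repeated lines after light normalization. The sketch below is an assumption about how this could be done, not part of the project's pipeline; `dedupe_lines` is a hypothetical helper. It catches only exact repeats, such as a line that recurs on every page of a site, and would need a near-duplicate method on top for paraphrased or lightly edited copies.

```python
import hashlib

def dedupe_lines(lines):
    """Keep the first occurrence of each line, comparing case- and whitespace-insensitively."""
    seen = set()
    kept = []
    for line in lines:
        # Normalize: lowercase and collapse whitespace before hashing.
        normalized = " ".join(line.lower().split())
        if not normalized:
            continue  # drop empty lines
        key = hashlib.sha1(normalized.encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            kept.append(line)
    return kept

sample = ["Почетна  страница", "Вест еден", "почетна страница", "", "Вест два"]
unique = dedupe_lines(sample)
print(unique)
```

Storing hashes rather than the lines themselves keeps memory use bounded per entry, which matters once the crawl grows.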