GPT2-Medium-BG 2021

Update 13.1.2025: the dataset was found, so more details about it have been added.

  • GPT2-Medium 345M for Bulgarian
  • The model was created and trained from scratch, using TensorFlow on a free Google Colab T4. The research experiment started in June 2021; the video on YouTube was published on 17.9.2021.
  • The "last modified" dates of the most recently added data in the corpus are between 24.7.2021 and 28.7.2021, about 11 MiB of text.
  • The dataset was quite small, at most about 141 MiB of UTF-8 text (148.585M bytes, ~82.48M characters); it includes some words and texts in other languages (English, Latin?). IMO the results were still decent for the size (a subjective impression, no systematic formal test).
  • The model is meant to be used with the code provided here and in the notebook. Read the comments in gen_comments-1-2023-clean.py.
  • As far as I knew, this was the biggest GPT/Transformer model in Bulgarian at the time, possibly the only one, except for one of unknown size which was demoed for a few seconds in a video on LinkedIn* in mid-2019 (more in the footnote below).
  • In 2025 I discovered that it was possibly one of the biggest 6-7? models for languages other than English at the time, although trained on a small dataset, as GPT2-SMALL was preferred for experiments. [Chinese CPM-2.6B, 2020; Arabic 1.46B; French 1B; Romanian 774M; Spanish 570M, 2021-2022; Japanese 336M: https://artificial-mind.blogspot.com/2025/04/the-worlds-first-ai-strategy-was-published-in-2003-by-an-18-years-old-bulgarian.html ... Search "236" for the table; I discovered the Chinese and Spanish models only now.]
  • A method for unlimited-length multi-step generation with hidden injections of tokens for directed topic change (though it needed more smoothing etc.). The method is explained in videos on YouTube; a sketch of the idea follows below.
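A minimal sketch of the injection idea, not the original code: the model handle, the tokenizer and the sampling function are assumptions standing in for the project's TensorFlow implementation. Generation proceeds token by token over a sliding window of at most BLOCK_SIZE tokens, and steering tokens are inserted into the context while being withheld from the emitted text:

```python
import tensorflow as tf

BLOCK_SIZE = 160  # context window, matching the value used in late training

def sample_next_token(model, window, temperature=0.8):
    # Assumes the model returns logits of shape [batch, seq_len, vocab_size].
    logits = model(tf.constant([window]))[:, -1, :] / temperature
    return int(tf.random.categorical(logits, num_samples=1)[0, 0])

def generate_unlimited(model, tokenizer, prompt, injections, total_tokens=1000):
    """Sliding-window generation of arbitrary length.

    `injections` maps a step index to token ids that are added to the
    *context* (steering the topic) but never emitted to the reader.
    """
    context = tokenizer.encode(prompt)  # visible + hidden tokens
    emitted = []                        # visible tokens only

    for step in range(total_tokens):
        if step in injections:
            context.extend(injections[step])   # hidden steering tokens
        window = context[-BLOCK_SIZE:]         # keep only what fits the model
        next_id = sample_next_token(model, window)
        context.append(next_id)
        emitted.append(next_id)

    return tokenizer.decode(emitted)
```

The "more smoothing" mentioned above refers to the abruptness of this approach: injected tokens shift the next-token distribution immediately, so blending them in gradually would give softer topic transitions.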

Dataset, preprocessing and training

  • Various selected books from "My Library", "Chitanka" (Моята Библиотека, Читанка), with some cleaning of the metadata markers in the books, of notes and footnotes ([34]... etc.), IDs, links to the books, etc.
  • Various works, books, publications and texts written by the author himself. The biggest file was the then-current draft of a big survey book on AGI & Transhumanism, called "The Prophets of the Thinking Machines"*, plus several other interdisciplinary books, and many articles and works from the e-zine "The Sacred Computer" (Свещеният сметач).
  • Some poetry by Hristo Botev.
  • A few articles about computers from forums and web pages, some written in Bulgarian, some machine-translated from English to Bulgarian.
  • Some articles from magazines on political topics.
  • Some of the titles of the files can be seen in the training video.
  • During training, the dataset and its sampling were incrementally updated or changed after observing the generations, whenever I recognized the source of the style of the outputs. Some books seemed "poisonous" with their patterns and were reduced or removed, e.g. I. Asimov's extensive repetition of the characters' names.
  • Some items were added, others removed; some smaller documents were consumed multiple times, while from items which were too big a shorter random section was selected, a different one in each iteration, etc.
  • Due to the use of a free Colab notebook, with a limited and unpredictable span of uninterruptible hours (maybe up to 3 hours or so, sometimes less, occasionally a few hours more), it was impossible to perform a complete pass over the entire dataset in one session (a dataset that is too big may also fail to fit in RAM at once).
  • For that reason the individual training iterations sliced and shuffled the text, e.g. picking say 200K characters from each long document: from the beginning, then from the end, or the first half, then the second half, or at random, etc. Smaller documents were usually "ingested" completely (see the slicing sketch after this list).
  • For the training process, please refer to the video instructions, as the notebook has draft and intermediate cells which were not cleaned up and shouldn't always be invoked. There is also an updated version of the notebook, due to a discovered incompatibility in the initial one.
  • Some data "augmentation" was applied: changing names, besides the removal of the repetitive patterns noticed in the generated text.
  • As the dataset was dynamically changed, unknown special characters appeared here and there, so some tokens were missing from the vocabulary, which resulted in errors during the preparation of the dataset by TensorFlow. This was worked around with a hack that ignored these fragments, as I didn't want to start from scratch (a filtering sketch in that spirit follows the list).
  • A few hyperparameters can be seen in the instruction video; these were the ones used in some of the late parts of the process: BLOCK_SIZE = 160; BUFFER_SIZE = 3200 (see the pipeline sketch below for how such values are typically wired in).
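A reconstruction of the per-session slicing described above, not the original code; the function name, MAX_SLICE and the rotation of strategies across sessions are assumptions:

```python
import random

MAX_SLICE = 200_000  # ~200K characters taken from a long document per session

def load_session_corpus(paths, session):
    """Assemble the training text for one Colab session.

    Long documents contribute a different slice each session (beginning,
    end, or a random section); short documents are ingested completely.
    """
    parts = []
    for path in paths:
        with open(path, encoding="utf-8") as f:
            text = f.read()
        if len(text) <= MAX_SLICE:
            parts.append(text)                      # small: take it whole
        elif session % 3 == 0:
            parts.append(text[:MAX_SLICE])          # from the beginning
        elif session % 3 == 1:
            parts.append(text[-MAX_SLICE:])         # from the end
        else:
            start = random.randrange(len(text) - MAX_SLICE)
            parts.append(text[start:start + MAX_SLICE])  # random section
    random.shuffle(parts)                           # shuffle document order
    return "\n".join(parts)
```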
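The exact workaround for the missing tokens isn't published in the card; a character-level filter in that spirit might look like this (the function and the `vocab` argument are hypothetical):

```python
def drop_unknown_fragments(fragments, vocab):
    """Silently skip fragments containing characters that have no token in
    the fixed vocabulary, instead of rebuilding the tokenizer and retraining."""
    known = set(vocab)
    return [f for f in fragments if all(ch in known for ch in f)]
```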
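For context, here is how BLOCK_SIZE and BUFFER_SIZE are typically used in a tf.data next-token pipeline, in the style of the TensorFlow text-generation tutorials; this is a generic reconstruction, not the notebook's actual cell, and BATCH_SIZE is an assumption:

```python
import tensorflow as tf

BLOCK_SIZE = 160    # tokens per training example (from the instruction video)
BUFFER_SIZE = 3200  # shuffle buffer size (from the instruction video)
BATCH_SIZE = 16     # assumed; not stated in the card

def make_dataset(token_ids):
    ds = tf.data.Dataset.from_tensor_slices(token_ids)
    # Group into sequences of BLOCK_SIZE + 1, so that input and target
    # can be formed by shifting one position.
    ds = ds.batch(BLOCK_SIZE + 1, drop_remainder=True)
    ds = ds.map(lambda seq: (seq[:-1], seq[1:]))  # (input, next-token target)
    ds = ds.shuffle(BUFFER_SIZE).batch(BATCH_SIZE, drop_remainder=True)
    return ds.prefetch(tf.data.AUTOTUNE)
```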

Links

The Sacred Computer: Thinking Machines, Creativity and Human Development

...

  • Other Bulgarian autoregressive models: an earlier one appeared in a few-second display of generation in Bulgarian by a startup called BAIHUI AI in mid-2019. In my blog I wrote 1.5B, but I don't remember whether they actually mentioned a size, and now it seems unlikely and unreasonable; they just showed that they could train a model, a team of 3 people, only one of them an ML engineer. A few records survive: my blog post: https://artificial-mind.blogspot.com/2019/07/baihuiai-baihuiai-new-bulgarian-ai.html and info here: https://www.eu-startups.com/directory/baihui-ai/ The company didn't live long; it was a show-off. Now it seems reasonable that their model was GPT2-SMALL, as that was the usual choice even 4 years later, and even the Bulgarian Academy of Sciences' 2023 model was the small one. I found several other GPT2-SMALL models trained later than this one: one for poetry, the BAS one from 2023, and maybe a few others. I couldn't get info from the ML engineer of the BAIHUI project, M.V.