AI & ML interests
Creativity in language is ubiquitous. It is abundantly present in work with an explicit creative intention - such as literary novels or poems - but hefty doses of creativity also pervade everyday language use. We believe that a computational model of creativity that focuses on language will shed light on the enigmatic processes and interactions that come into play when humans express themselves in creative ways. Moreover, natural language generation systems need to be endowed with a certain capacity for creativity in order to produce realistic utterances. The main goal of this research project is to develop unsupervised models of language that exhibit creativity.

To do so, we propose an integrated approach that combines a number of important and innovative techniques. First, we rely on constructs from linear algebra called tensors to express language content according to different parameters. Using tensors, we can induce latent semantics from multi-way co-occurrences of textual content, which can subsequently be used for the generation of creative expressions. Second, we rely on advanced machine learning techniques, notably neural networks, which have recently shown impressive performance on a number of natural language processing tasks. Yet these techniques mainly mimic human language production and therefore show little creativity in language generation. By adapting neural network approaches in various ways, and by integrating them with our tensor-based approach, we expect to develop algorithms that grasp the meaning of textual content in a more profound and elaborate way, and that can at the same time express it with creative intent.

The project has the potential for groundbreaking results, not only because it would deepen our understanding of creativity, but also because of practical applications within the field of natural language processing.
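(Purely as an illustration, not the project's actual method: one simple way to induce latent semantics from a multi-way co-occurrence tensor is to unfold it along the word mode and apply a truncated SVD. The toy data and dimensions below are made up.)

```python
# Illustrative sketch only: derive low-dimensional "latent semantic" word vectors
# from a toy 3-way co-occurrence tensor (word x context x relation).
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical counts: 100 words, 80 context words, 5 syntactic relations.
cooc = rng.poisson(lam=0.3, size=(100, 80, 5)).astype(float)

# PPMI-style weighting is common; here we simply log-scale the counts.
cooc = np.log1p(cooc)

# Mode-1 unfolding: each word becomes a row over all (context, relation) pairs.
unfolded = cooc.reshape(cooc.shape[0], -1)      # shape (100, 80 * 5)

# Truncated SVD yields compact word representations.
U, S, Vt = np.linalg.svd(unfolded, full_matrices=False)
k = 10
word_vectors = U[:, :k] * S[:k]                  # shape (100, k)

print(word_vectors.shape)
```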
- C5f (BramVanroy/CommonCrawl-CreativeCommons-fine): only retains high-quality samples that are also present in FineWeb or FineWeb-2;
- C5r (https://huggingface.co/datasets/BramVanroy/CommonCrawl-CreativeCommons-recommended): additional strict filtering that removes samples with license disagreement, non-commercial licenses, and Wikipedia samples. Wikipedia is excluded because you should probably get that content from a more reliable source that provides better-parsed text.
It goes without saying that these filters lead to a massive reduction in quantity. Doc and token counts are given on the dataset pages.
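As a quick sketch of how one of these variants might be loaded with the datasets library (the language configuration name below is a guess; check the dataset card for the actual configs):

```python
# Sketch: load the strictly filtered "recommended" variant for one language.
# The config name "nld_Latn" is an assumption; see the dataset card for real names.
from datasets import load_dataset

c5r = load_dataset(
    "BramVanroy/CommonCrawl-CreativeCommons-recommended",
    "nld_Latn",          # hypothetical language config
    split="train",
    streaming=True,       # avoid downloading everything at once
)

for doc in c5r.take(3):
    print(doc.keys())
```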
C5 is a large-scale effort to heavily filter web-crawled data, as collected by the non-profit Common Crawl, down to only those documents that carry a Creative Commons license (such as cc-by-4.0) or are in the public domain (cc0). At this stage, 150 billion tokens have been collected.
---
data: BramVanroy/CommonCrawl-CreativeCommons
software: https://github.com/BramVanroy/CommonCrawl-CreativeCommons
---
To build C5, HTML pages are scrutinized and all links (if any) to CC licenses are collected, both in regular hyperlinks and in metadata. Additional data fields are included, such as "was the license found in the head?" or "if multiple licenses were found, do they contradict each other?", which makes further filtering a breeze.

In this first version of C5, 8 languages are included (Afrikaans, German, English, French, Frisian, Italian, Dutch and Spanish). The language set was limited for two reasons: computational and storage limitations, and a collaboration with GPT-NL, which requested CC data for these languages to train a Dutch-focused, copyright-conscious LLM. In total, this V1 release contains almost 150 thousand documents and 150 billion tokens. The data was not filtered on quality nor deduplicated, so that you can decide for yourself how much data to keep. To give some indication of quality, a dataset field describes whether a document is also included in the FineWeb(-2) datasets, which are of high quality.
More work needs to be done! Only 7 out of 100+ Common Crawl crawls have been processed so far. That's encouraging, because it means there is a lot more Creative Commons data to be collected! But to get there I need help in terms of compute. The current processing was already heavily sponsored by the Flemish Supercomputer Center, but more is needed. If you have compute available and wish to collaborate in an open and transparent manner, please get in touch!
GEITje 7B Ultra: A Conversational Model for Dutch (2412.04092)
While the paper discusses the model a little bit, I especially wanted to write about the datasets, which to this day seem an important asset for Dutch LLM training (SFT and preference tuning). We have a long way to go for Dutch, but publishing transparent and reproducible artefacts seems an important step to me, alongside having open discussions about data, bias, and architectures.
In that spirit, thanks are in order for the creation of GEITje 7B Ultra and all related datasets:
- Michiel Buisman and UWV for providing the means to create the datasets
- Flemish Supercomputer Center (VSC) for the compute
- The Hugging Face Fellows and the rest of the team for their discussions and insights
- The Dutch NLP community, notably @Rijgersberg for building the base GEITje model and the fruitful discussions we've had
More to come, step by step!
BramVanroy/geitje-7b-ultra-65c1ee010ad80fd1f6a8f208
I have been struggling mentally for many months now with the OpenAI terms of use that indicate that their model outputs cannot be used to build "competing models". This leads to many questions:
- what is the definition of competing? Is it the same as "commercial"?
- since this is part of the terms of use between OpenAI and the API user, can a third party still use the generated dataset to build competing models?
- are such restrictions even legal in the first place?
Trying to "follow the rules" as much as possible despite wanting to be as open as possible, I kept releasing my datasets under non-commercial licenses (which are too restrictive anyhow - nothing should prevent you from using the data in non-LM commercial settings), just like models trained on these datasets. This has put me at a competitive disadvantage compared to creators who do not follow the same approach and release their data/models on apache 2.0 despite the OpenAI "restrictions". Moreover, I fear (https://twitter.com/BramVanroy/status/1780220420316164246) that my approach blocks adaptation of my data/models for (commercial) applications/integrations.
Thankfully @Rijgersberg noted that these OpenAI terms of use are NOT explicit in the Azure OpenAI API (https://twitter.com/E_Rijgersberg/status/1780308971762450725). Since my latest datasets were created via Azure, this comes as a relief. As far as I can tell after digging through Azure docs, this allows me to change all recent GPT4-generated datasets to apache 2.0!
- BramVanroy/ultrachat_200k_dutch
- BramVanroy/orca_dpo_pairs_dutch
- BramVanroy/ultra_feedback_dutch
- BramVanroy/ultra_feedback_dutch_cleaned
- BramVanroy/no_robots_dutch
I will have to mull over what I'll do with the older GPT-3.5 datasets. What do you think I should do?
**tl;dr: do not depend on benchmark leaderboards to choose your "chatbot" model! (Especially for non-English languages.)**
First of all, I'm discontinuing the Open #Dutch #LLM Leaderboard (https://lnkd.in/eFnsaFR6). It will stay online for now, but I urge you to use the ScandEval leaderboard instead (https://scandeval.com/dutch-nlg/) by @saattrupdan. It contains more tasks, has better reproducibility and statistics (confidence intervals), and comes with a flexible back-end library (scandeval) to run your own benchmarks with. As part of project "Leesplank" (with Michiel Buisman and Maarten Lens-FitzGerald), we recently added GPT-4-1106-preview scores to give the leaderboard a good "target".

An important note here is that benchmark leaderboards are not a golden truth. Evaluating generative models in particular is hard. You run into issues like prompt engineering (and the sensitivity of models to one prompt or another), structured output generation, and - quite simply - "how to automatically evaluate open-ended generation".
Another important but under-discussed facet is the discrepancy between models' capability of understanding vs. generating *in different languages* (so the NLU part of NLG benchmarking). In other words: some of the listed models score really well on, e.g., MCQ benchmarks but are not suitable to use as DUTCH chat bots. Interestingly, some of these models seem to understand questions in Dutch and are able to pick the right answer (because they have good knowledge or reasoning skills), but generating fluent and grammatical Dutch is something else entirely! This is perhaps also true for humans: it's easier to sort-of grasp the meaning of a new language and answer with "Yes" or "No", but answering fluently in the language is much harder! Yet, your language production fluency does not necessarily say anything about your knowledge and reasoning skills.
Hopefully we can get a chat arena for Dutch some day - user feedback is the most powerful metric!
Do you have any prior experience you can share, or suggestions for improvement?
After being in touch with the HPLT folks, I've transferred the data to their org. That only makes sense. You can find it below.
HPLT/hplt_monolingual_v1_2
https://huggingface.co/datasets/BramVanroy/hplt_monolingual_v1_2
In December of last year, HPLT (https://hplt-project.org/) released version 1.2 of their dataset. It covers web-crawled data for 75 languages, in raw form as well as in deduplicated and cleaned versions. In total, we're talking about over 40TB of data! This data was already accessible via their website, but I figured accessibility could be improved by an integration with Hugging Face tooling. So I added the dataset to the Hugging Face hub, enabling direct use in your conventional training pipelines for LLMs or other language technologies. The data will automatically be downloaded and optimised with a single call:
```python
from datasets import load_dataset

# Load the cleaned Dutch subset of HPLT v1.2
ds = load_dataset("BramVanroy/hplt_mono_v1_2", "nl_cleaned")
```

Let's use this big blob of data to build something awesome in our languages!
After teasing for a while, I am finally releasing **GEITje 7B Ultra**, building upon the great GEITje 7B by @Rijgersberg . New contributions include: large new datasets for SFT (instruction/chat), two datasets for DPO training (i.e. RLAIF), and an SFT and DPO version of GEITje. The READMEs describe everything well (I hope), and I'll also share more info on social media tomorrow.
For me this is a huge release, the datasets more so than the models. I'm especially pleased with UltraChat, which I created with the intent of having a diverse dataset - the model must be able to communicate with different types of users. So the user questions are created as if they were written by different personas, e.g. language learners, young children, experts, critics, etc. The focus with this is "building a good communication bot that is accessible and can handle different kinds of user input".
I wish I could find the time to also write a paper to get some "academic recognition" but that'll have to wait for now. I just want to bring it to the public so that others can play with it and use it to build new, cool stuff!
I hope that you can all appreciate the work. Let's build some cool stuff with it!
Models:
- Demo: https://huggingface.co/spaces/BramVanroy/GEITje-7B-ultra
- DPO Model: BramVanroy/GEITje-7B-ultra
- SFT model (not recommended): BramVanroy/GEITje-7B-ultra-sft
Datasets with GPT-4 turbo completions:
- No robots (~10k instructions): BramVanroy/no_robots_dutch
- UltraChat (~200k instructions): BramVanroy/ultrachat_200k_dutch
- UltraFeedback (DPO with GPT4+GEITje chat, ~50k): BramVanroy/ultra_feedback_dutch
- Orca DPO Pairs (DPO with GPT4+GEITje chat, ~10k): BramVanroy/orca_dpo_pairs_dutch
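If you want to experiment with these, a quick way to inspect them before training (split and column names differ per dataset, so print them rather than assuming them):

```python
# Sketch: inspect the released SFT and DPO datasets before plugging them into a
# training pipeline. Printing the DatasetDict shows the available splits/columns.
from datasets import load_dataset

sft = load_dataset("BramVanroy/ultrachat_200k_dutch")
dpo = load_dataset("BramVanroy/ultra_feedback_dutch")

print(sft)
print(dpo)
```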
In my previous post (https://huggingface.co/posts/BramVanroy/633544255876795), I described how, despite high reward accuracies and low losses, my model would sometimes just output repeating random tokens (/*****/). There were some useful brainstorms in that thread. I think the dataset is relatively easy for the model, leading it to quickly overfit when the beta is very small, which allows the model to step further away from its initial outputs.

So, I ran a hyperparameter search for learning rate (1e-7 vs. 5e-7), batch size (32, 64, 96, 128) and, most importantly, beta (0.01, 0.1, 0.2, 0.5). You can have a look at the results for yourself here: https://wandb.ai/bramvanroy/dpo-geitje-ultra-hyperparams
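For context, a beta sweep like that can be set up with TRL's DPOTrainer roughly as sketched below. This is a hedged sketch: the model and dataset names are placeholders, and argument names may differ slightly between TRL versions.

```python
# Hedged sketch of a beta sweep with TRL's DPOTrainer. Model and dataset names are
# placeholders; argument names may differ slightly between TRL versions.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

base_model = "BramVanroy/GEITje-7B-ultra-sft"                              # placeholder starting point
train_ds = load_dataset("BramVanroy/ultra_feedback_dutch", split="train")  # placeholder split name

tokenizer = AutoTokenizer.from_pretrained(base_model)

for beta in (0.01, 0.1, 0.2, 0.5):
    # Reload the model for every run so the sweeps do not influence each other.
    model = AutoModelForCausalLM.from_pretrained(base_model)

    args = DPOConfig(
        output_dir=f"dpo-beta-{beta}",
        beta=beta,                          # strength of the KL penalty towards the reference model
        learning_rate=5e-7,
        per_device_train_batch_size=8,
        gradient_accumulation_steps=8,      # effective batch size of 64
        num_train_epochs=1,
    )
    trainer = DPOTrainer(
        model=model,
        args=args,
        train_dataset=train_ds,
        processing_class=tokenizer,         # older TRL versions use tokenizer= instead
    )
    trainer.train()
```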
Interpreting the results, I think beta=0.5 is the better choice for this dataset. Reasons:
- markedly higher reward margins compared to all other betas
- better balance between positive chosen and negative rejected rewards
- log probabilities are not as extremely low as for beta=0.01, which seems too low for this dataset
Of course, that is just looking at the numbers without running any benchmarks. However, I am hesitant to evaluate all the models on benchmarks, because that would mean literally optimising my hyperparameters on a test set (which is very bad!). So I will just play with some of the most promising models and see which one feels "best" qualitatively.
If you have other insights, thoughts, or opinions, let me know!
I have a dataset with gpt-4-turbo completions as "chosen" and a lower-performing model's completions as "rejected". The objective should therefore be fairly easy, because the two are easy to discern. As a consequence, the model achieves very low losses (0.021 train; 0.013 validation) and high reward accuracies (0.995). **However**, when using the model in practice, it often deteriorates after the first one or two tokens and continuously outputs sequences of /*****/.

So despite the good performance on the DPO objective and strong scores on the validation set (no overfitting), something seems to go wrong. Perhaps the outputs are too different and the task is too easy, in which case DPO is not useful. But why then would the model start hallucinating and repeating the same token over and over again? Any thoughts? Any suggestions to get around this? All discussions are welcome!
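(Not from the original post, just an illustration: a cheap sanity check that flags this kind of repeated-token degeneration in generated outputs, which you could run alongside the DPO metrics.)

```python
# Illustrative helper: flag generations that collapse into the same token repeated
# over and over, as described above.
from collections import Counter

def looks_degenerate(text: str, max_token_share: float = 0.5) -> bool:
    """Return True if a single whitespace-separated token dominates the output."""
    tokens = text.split()
    if len(tokens) < 10:
        return False
    most_common_count = Counter(tokens).most_common(1)[0][1]
    return most_common_count / len(tokens) > max_token_share

print(looks_degenerate("Het antwoord is " + "/*****/ " * 50))  # True
print(looks_degenerate("Dit is een normale, vloeiende Nederlandse zin zonder herhaling van tokens."))  # False
```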