Bram Vanroy PRO

BramVanroy

https://bramvanroy.github.io/

AI & ML interests

Artificial intelligence, natural language processing, computational linguistics

Recent Activity

updated a dataset about 4 hours ago

BramVanroy/finewiki-nl-sections-propella

upvoted a paper 43 minutes ago

F2LLM-v2: Inclusive, Performant, and Efficient Embeddings for a Multilingual World

Paper • 2603.19223 • Published Mar 19 • 36

upvoted a collection 43 minutes ago

F2LLM

23 items • Updated Mar 20 • 7

upvoted an article 13 days ago

Granite 4.1 LLMs: How They’re Built

Article • ibm-granite • Apr 29 • 81

upvoted an article 14 days ago

MTEB Leaderboard: From a slow demo to feature-rich leaderboard

Article • Samoed • 14 days ago • 22

upvoted a paper 14 days ago

jina-embeddings-v5-text: Task-Targeted Embedding Distillation

Paper • 2602.15547 • Published Feb 17 • 31

upvoted a changelog 16 days ago

Publish models from CI without HF_TOKEN

Hugging Face Changelog • 18 days ago • 108

upvoted an article 29 days ago

Introduction to Trimming ✂

Article • lbourdois • 29 days ago • 41

upvoted a paper about 1 month ago

propella-1: Multi-Property Document Annotation for LLM Data Curation at Scale

Paper • 2602.12414 • Published Feb 12 • 4

upvoted a collection 2 months ago

MTEB-NL

Massive Text Embedding Benchmark for Dutch. Check https://github.com/nikolay-banar/mteb-nl-dev to evaluate your models. • 26 items • Updated Nov 7, 2025 • 4

upvoted a paper 2 months ago

GPT-NL Public Corpus: A Permissively Licensed, Dutch-First Dataset for LLM Pre-training

Paper • 2604.00920 • Published Apr 1 • 1

upvoted a collection 2 months ago

GPT-NL Pretraining Corpus

Public Corpus data + Private Corpus metadata • 2 items • Updated Nov 18, 2025 • 6

upvoted 2 collections 6 months ago

Sparse Auto-Encoders (SAEs) for Mechanistic Interpretability

A compilation of sparse auto-encoders trained on large language models. • 37 items • Updated Dec 16, 2025 • 24

Nemotron-Post-Training-v3

Collection of datasets used in the post-training phase of Nemotron Nano, Super, and Ultra v3. • 50 items • Updated 15 days ago • 168

upvoted an article 7 months ago

We Got Claude to Fine-Tune an Open Source LLM

Article • burtenshaw, evalstate • Dec 4, 2025 • 630

upvoted a paper 7 months ago

The German Commons - 154 Billion Tokens of Openly Licensed Text for German Language Models

Paper • 2510.13996 • Published Oct 15, 2025 • 9

upvoted a paper 8 months ago

Common Corpus: The Largest Collection of Ethical Data for LLM Pre-Training

Paper • 2506.01732 • Published Jun 2, 2025 • 6

upvoted a collection 8 months ago

Leesplank Wim

18 items • Updated Nov 12, 2025 • 1

upvoted a paper 8 months ago

Phi-4 Technical Report

Paper • 2412.08905 • Published Dec 12, 2024 • 124

upvoted a collection 9 months ago

Qwen3

84 items • Updated Dec 31, 2025 • 1.82k

upvoted an article 10 months ago

mmBERT: ModernBERT goes Multilingual

Article • mmarone, orionweller, will-fleshman, eugene-yang, dlawrie, vandurme • Sep 9, 2025 • 148