wzh

hg2wzh


AI & ML interests

None yet

Recent Activity

liked a Space 14 days ago

HuggingFaceM4/encoder-free-vlm

liked a model about 1 month ago

reacted to erikkaum's post with ❤️ about 2 months ago

Releasing my first kernel 🔥 MaxSim

Late-interaction retrieval (ColBERT / PyLate) bottlenecks on materializing the full similarity matrix. This kernel avoids it by using tiled scoring with simdgroup_matrix (Metal) and WMMA. The result is a 3–5× speedup compared to a naive PyTorch baseline 🔥

Benchmarks:
- SmallRerank (B=32, C=10): up to 3.2× (M3 Pro) / 2.8× (A100)
- HeavyRerank (B=32, C=100): up to 3.8× (M3 Pro) / 5.3× (A100)
- LongDocStress (Ld=1024): up to 6.2× (L4)

Try it out 👇 https://huggingface.co/kernels/erikkaum/maxsim
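For readers unfamiliar with MaxSim, the idea in the post can be illustrated with a minimal pure-Python sketch. It is not the kernel's implementation and uses no Metal or WMMA; the function names (`maxsim_naive`, `maxsim_tiled`) and the `tile` parameter are hypothetical, chosen only for illustration. The assumption is the usual late-interaction score: for each query token take the maximum dot product over document tokens, then sum over query tokens. The naive version builds the full similarity matrix first, while the tiled version walks the document in blocks and keeps only a running maximum per query token.

```python
from typing import List, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def maxsim_naive(query: List[Vector], doc: List[Vector]) -> float:
    """Materialize the full (query tokens x doc tokens) similarity matrix."""
    sim = [[dot(q, d) for d in doc] for q in query]
    return sum(max(row) for row in sim)


def maxsim_tiled(query: List[Vector], doc: List[Vector], tile: int = 2) -> float:
    """Same score, but only one running max per query token is stored."""
    running_max = [float("-inf")] * len(query)
    # Walk the document in blocks of `tile` tokens (hypothetical tile size).
    for start in range(0, len(doc), tile):
        block = doc[start:start + tile]
        for i, q in enumerate(query):
            for d in block:
                score = dot(q, d)
                if score > running_max[i]:
                    running_max[i] = score
    return sum(running_max)


# Toy example: 2 query tokens, 3 document tokens, 2 dimensions.
query = [[1.0, 0.0], [0.0, 1.0]]
doc = [[0.5, 0.5], [1.0, 0.0], [0.0, 2.0]]

print(maxsim_naive(query, doc))          # 3.0
print(maxsim_tiled(query, doc, tile=2))  # 3.0
```

Both functions return the same score; the tiled one needs memory proportional to the number of query tokens rather than to the full matrix, which is the property a GPU kernel can exploit with hardware matrix tiles.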


Organizations

None yet

upvoted 2 collections about 2 months ago

jina-embeddings-v5-omni

Multimodal (text + image + video + audio) embedding models aligned with jina-embeddings-v5-text-*. Two sizes, four task variants each. • 27 items • Updated May 12 • 36

GenLIP

Model weights of paper "Let ViT Speak: Generative Language-Image Pre-training" • 6 items • Updated May 5 • 8

upvoted 3 collections 3 months ago

TIPSv2

TIPSv2 foundational vision-language models. Webpage: https://gdm-tipsv2.github.io/ • 9 items • Updated Apr 14 • 38

Qwen-Image

14 items • Updated Dec 31, 2025 • 115

Falcon Perception

Falcon-Perception and Falcon-OCR model: early-fusion, natively multimodal, dense Autoregressive Transformer models. • 5 items • Updated Apr 6 • 14

upvoted a collection 6 months ago

GME Models

General Multimodal Embedding Models Released by Tongyi Lab of Alibaba Group • 3 items • Updated Dec 24, 2024 • 10

upvoted 2 collections 7 months ago

Searching for Better ViT Baselines

Exploring ViT hparams and model shapes for the GPU poor (between tiny and base). • 36 items • Updated Jan 28 • 20

MM Grounding DINO

8 items • Updated Aug 1, 2025 • 6

upvoted 2 articles 10 months ago

Article

SmolVLM2: Bringing Video Understanding to Every Device

orrzohar, mfarre, andito, merve, pcuenq, cyrilzakka, Xenova +5 • Feb 20, 2025 • 344

Article

SmolVLM - small yet mighty Vision Language Model

andito, merve, mfarre, eliebak, pcuenq +3 • Nov 26, 2024 • 422

upvoted 2 articles 11 months ago

Article

Vision Language Model Alignment in TRL ⚡️

sergiopaniego, merve, qgallouedec, kashif, ariG23498 +3 • Aug 7, 2025 • 112

Article

No GPU left behind: Unlocking Efficiency with Co-located vLLM in TRL

toslali-ibm, mirinflim, qgallouedec, esnible, rganti, mudhakar +4 • Jun 3, 2025 • 101

upvoted a collection over 1 year ago

VideoLLaMA3

Frontier Multimodal Foundation Models for Video Understanding • 9 items • Updated Mar 2 • 16

upvoted a paper over 1 year ago

Multimodal Latent Language Modeling with Next-Token Diffusion

Paper • 2412.08635 • Published Dec 11, 2024 • 50

upvoted a collection over 1 year ago

Idefics2 🐶

Idefics2-8B is a foundation vision-language model. In this collection, you will find the models, datasets and demo related to its creation. • 11 items • Updated May 6, 2024 • 93

upvoted an article over 1 year ago

Article

Deriving DPO's Loss

hba123 • Dec 24, 2024 • 30

upvoted 4 collections over 1 year ago

ModernBERT

Bringing BERT into modernity via both architecture changes and scaling • 3 items • Updated Dec 19, 2024 • 159

Embedding

4 items • Updated Mar 19 • 1

DeepSeek-VL2

5 items • Updated Nov 27, 2025 • 82

Qwen2.5

Qwen2.5 language models, including pretrained and instruction-tuned models of 7 sizes, including 0.5B, 1.5B, 3B, 7B, 14B, 32B, and 72B. • 43 items • Updated Mar 2 • 729