Speech Recognition Community Event Version 2

non-profit

Activity Feed

AI & ML interests

Multi-Lingual Speech Recognition

Recent Activity

gagan3012 authored a paper 7 days ago

Who Annotates in NLP? A Large-scale Assessment of Human Annotation Reporting between 2018 and 2025

gagan3012 submitted a paper 7 days ago

Who Annotates in NLP? A Large-scale Assessment of Human Annotation Reporting between 2018 and 2025

gagan3012 authored a paper about 2 months ago

Beyond LLM-as-a-Judge: Deterministic Metrics for Multilingual Generative Text Evaluation

View all activity

flozi00

posted an update 6 months ago

Post

665

We have covered Tensor Parallelism for slicing matrices and Pipeline Parallelism for stacking layers. But what if your model isn't just deep or wide—it's a sprawling Mixture-of-Experts (MoE) architecture like Mixtral or DeepSeek, with trillions of parameters that are mostly idle per token?

Replicating those experts wastes VRAM. Slicing them with TP wastes bandwidth. The solution is Expert Parallelism (EP), which distributes the experts themselves across GPUs and routes tokens to wherever their "chosen" expert lives.

The hardware catch? It is not matrix splitting or pipeline bubbles—it's the "Router's Dilemma." You must shuffle massive volumes of tokens across the cluster using All-to-All communication, and any imbalance can leave expensive GPUs idle.

My latest guide dives into the mechanics of EP and why the interconnect becomes the ultimate bottleneck.

In this breakdown, we explore:

The Token Routing Lifecycle
A four-step hardware flow: Local routing to pick experts, Dispatch (All-to-All shuffle), Expert computation on the "home" GPU, and Combine (another All-to-All to return results).

The All-to-All Primitive
Unlike the ring-based syncs in TP, All-to-All creates a dense mesh of personalized data transfers. We compare it to All-Reduce and show why uneven token distribution (load imbalance) causes network congestion and compute skew.

Load Balancing: The Hardware Nightmare
If one expert gets 90% of the tokens, its GPU bottlenecks while others stall. We discuss mitigation strategies like token dropping and auxiliary losses to keep utilization high.

The article includes a raw PyTorch implementation of an EP layer using torch.distributed.all_to_all_single to reveal exactly how the data shuffles and where the stalls happen.

Read the full hardware-centric guide here:
https://flozi.net/en/guides/ai/scaling/expert_parallel

1 reply

flozi00

posted an update 6 months ago

Post

3151

We recently discussed how Tensor Parallelism slices matrices to reduce latency within a single node. But what happens when you need to scale beyond that, where the bandwidth drops?

That is where Pipeline Parallelism (PP) takes over.

Instead of slicing the operation, PP slices the model depth. It turns your GPU cluster into an assembly line: GPU 0 handles layers 1-12, GPU 1 handles 13-24, and so on.

The hardware challenge here isn't the interconnect speed—it is the "Pipeline Bubble." In a naive setup, expensive H100s sit idle for most of the cycle waiting for data to flow through the chain.

My latest guide breaks down the scheduling strategies used to minimize this idle silicon time.

In this deep dive, we cover:

The Hardware Mechanics: Vertical Slicing
Unlike TP which requires "chatty" All-Reduce operations, PP relies on lightweight Point-to-Point (Send/Recv) communication. This makes it the only viable strategy for crossing node boundaries over Ethernet or InfiniBand.

Fighting the Bubble: 1F1B vs. GPipe
We analyze the scheduling algorithms that keep the GPUs fed:

GPipe: The "flush and fill" approach. Simple, but memory-intensive.
1F1B (One-Forward-One-Backward): The industry standard. By interleaving forward and backward passes, we aggressively free up memory and reduce the bubble size.
The Math of Efficiency
The "Bubble" is a mathematical inevitability. We look at the efficiency formula
M+N−1
M

to understand why you need massive global batch sizes to make PP worth the effort.

The article includes a conceptual PyTorch implementation of the 1F1B state machine to illustrate exactly how the data is handed off between stages.

Read the full breakdown here:
https://flozi.net/en/guides/ai/scaling/pipeline_parallel

flozi00

posted an update 7 months ago

Post

3591

When models get too large for a single GPU, simply stacking layers vertically (Pipeline Parallelism) isn't always the answer. Sometimes, you need to slice the matrices themselves.

My latest guide breaks down the hardware mechanics of Tensor Parallelism (TP). We look at how to shard individual operations across devices to make a cluster function as one massive accelerator.

This isn't high-level theory—it is a look at the bare metal implementation.

Here is what is covered in the deep dive:

The Strategies: Column vs. Row Parallelism
We analyze how to split weight matrices (W) and inputs (X).

Column-Linear: Splits weights by columns. Requires an All-Gather to reconstruct the output.
Row-Linear: Splits weights by rows. Requires an All-Reduce to sum partial results.
The "Megatron-LM" Optimization
Efficiency comes from minimizing communication. By sandwiching the non-linearity (GeLU) between a Column-Parallel layer and a Row-Parallel layer, we can skip synchronization entirely during the activation phase. This cuts communication events by 50% per block.

The Hardware Reality: The Bandwidth Wall
In TP, the dist.all_reduce operation sits on the critical path. The CUDA cores effectively stall while waiting for the ring-reduce to finish.

Intra-Node: Works well because NVLink provides enough bandwidth to hide this latency.
Inter-Node: Fails at scale. Standard networking (Ethernet/InfiniBand) is too slow for the high-frequency syncs required by TP.
The article includes a raw PyTorch implementation using torch.distributed primitives to show exactly where the data moves and where the bottlenecks sit.

Read the full hardware-centric guide here:
https://flozi.net/en/guides/ai/scaling/tensor_parallel

flozi00

posted an update 7 months ago

Post

2777

Running large language models efficiently is more than just raw GPU power. The latest guide breaks down the essential math to determine if your LLM workload is compute-bound or memory-bound.

We apply these principles to a real-world example: Qwen's 32B parameter model on the new NVIDIA RTX PRO 6000 Blackwell Edition.

In this guide, you will learn how to:

Calculate your GPU's operational intensity (Ops:Byte Ratio)
Determine your model's arithmetic intensity
Identify whether your workload is memory-bound or compute-bound

Read the full guide here: https://flozi.net/en/guides/ai/llm-inference-math

theainerd

posted an update 7 months ago

Post

5012

Hindi Speech to Text just crossed 20 million downloads. Grateful for everyone using it.

theainerd/Wav2Vec2-large-xlsr-hindi

flozi00

posted an update 7 months ago

Post

311

Struggling with NVIDIA drivers on Ubuntu 24.04?
Can't use your GPUs with CUDA installed, or only half of them work?
Black screen after startup or nvidia-smi fails?

The nokaslr boot option might be the cause—and the solution.
Find out why disabling KASLR can fix these GPU issues until a permanent driver update is available.

https://flozi.net/en/guides/linux/solving-nvidia-driver-issues-on-ubuntu-24-04-with-nokaslr

flozi00

posted an update 7 months ago

Post

1962

I just got asked about the differences between Blackwell systems and Grace Blackwell systems. What's the difference and how much of a performance gap is there between them?

https://flozi.net/en/hardware/nvidia/benchmarks/b200-vs-gb200-efficiency-comparison

Here's a summary of the key points from the article:

GB200 (Grace Blackwell) is a Superchip: It integrates a Grace CPU and two Blackwell GPUs into a single package.
B200 is a GPU-only module: It's designed to be paired with x86 or ARM CPUs in more traditional server setups.

Performance and Efficiency:

Based on MLPerf Training v5.0 benchmarks, the article concludes:

GB200 systems are approximately 42% more efficient than B200 systems on average. This is especially true in large-scale deployments (100+ GPUs), where the GB200's integrated design and high-speed NVLink interconnect provide a significant advantage.

In smaller, single-node systems (e.g., 8 GPUs), the performance difference is much smaller, around 10-15%.

Use Cases:

Choose GB200 for large-scale AI clusters, training massive models, and when maximum efficiency is the top priority.

Choose B200 for smaller deployments, when you need the flexibility to choose your own CPU, or for mixed AI and HPC workloads.

flozi00

posted an update 7 months ago

Post

3206

Some weeks ago, i've just decide its time to leave LinkedIn for me.
It got silent around my open source activities the last year, so i thought something has to change.

That's why my focus will move to share experiences and insights about hardware, drivers, kernels and linux. I won't post about how to use models, built agents or do prompting. I want to share about some deeper layers the actual hypes are built on.

I will start posting summarizations of my articles here on the hub.

English version:
https://flozi.net/en

German translated version:
https://flozi.net/de

Feel free to reach me if you want to read something specific.

2 replies

versae

authored a paper 8 months ago

BOE-XSUM: Extreme Summarization in Clear Language of Spanish Legal Decrees and Notifications

Paper • 2509.24908 • Published Sep 29, 2025 • 3

DrishtiSharma

authored a paper about 1 year ago

Behind Maya: Building a Multilingual Vision Language Model

Paper • 2505.08910 • Published May 13, 2025 • 2

phantomcoder1996

authored a paper about 1 year ago

ConvoGen: Enhancing Conversational AI with Synthetic Data: A Multi-Agent Approach

Paper • 2503.17460 • Published Mar 21, 2025 • 2

DrishtiSharma

authored a paper about 1 year ago

Robust and Fine-Grained Detection of AI Generated Texts

Paper • 2504.11952 • Published Apr 16, 2025 • 12

emre

authored a paper about 1 year ago

BLOOM: A 176B-Parameter Open-Access Multilingual Language Model

Paper • 2211.05100 • Published Nov 9, 2022 • 39

emre

posted an update about 1 year ago

Post

3917

having trouble with auto train
hello there this is the first time i am testing auto train with a 1.8k SFT dataset. Howevery i am not quite sure the training is going smooth. Logs seem quite confusing, token did not match can not auth, generates confusing train splits, do you know how i can check my running job properly?
what is being used for training as data?
any ideas?

1 reply

FremyCompany

posted an update over 1 year ago

Post

1204

🔀 Very cool demo of word-level alignment of paraphrased or cross-lingual sentences, from the new Fairly Multilingual ModernBERT embedding model:

Parallia/Fairly-Multilingual-ModernBERT-Token-Alignment