AI & ML interests

None defined yet.

alvarobartt 
posted an update 1 day ago
view post
Post
1592
Latest hf-mem release added a breakdown of Mixture-of-Experts (MoE) memory usage!

TL; DR MoEs can be misleading to reason about from active parameters alone, since each token only activates a subset of experts, while the serving setup still needs to account for the full resident memory footprint.

🧠 hf-mem now splits MoE memory into base model weights, routed experts, and KV cache
🏗️ Dense models usually load and use most weights every forward pass, while MoEs load many experts but only route each token to a few of them
⚡ Active params isn't the same as memory footprint, especially for sparse architectures
📦 Runtime memory is about what is used per request/token, while loading memory also includes the expert weights that need to be resident
📚 KV cache can still dominate depending on context length, batch size, and concurrency
🔀 Expert Parallelism (EP) helps shard experts across accelerators when expert weights dominate
🚀 Data Parallelism (DP) + EP is often a good fit for throughput-oriented MoE serving

Check the repository at https://github.com/alvarobartt/hf-mem
alvarobartt 
posted an update 3 months ago
view post
Post
3730
Learn how to deploy Microsoft Research VibeVoice ASR on Microsoft Azure Foundry with Hugging Face to generate rich audio transcriptions with Who, When, and What! 💥

> 🕒 60-minute single-pass processing, no chunking or stitching
> 👤 Customized hotwords to guide recognition on domain-specific content
> 📝 Rich transcription: joint ASR + diarization + timestamping in one pass
> 🌍 50+ languages with automatic detection and code-switching support
> 🤗 Deployed on Microsoft Foundry via an OpenAI-compatible Chat Completions API

https://huggingface.co/docs/microsoft-azure/foundry/examples/deploy-vibevoice-asr
alvarobartt 
posted an update 4 months ago
view post
Post
3256
💥 hf-mem v0.4.1 now also estimates KV cache memory requirements for any context length and batch size with the --experimental flag!

uvx hf-mem --model-id ... --experimental will automatically pull the required information from the Hugging Face Hub to include the KV cache estimation, when applicable.

💡 Alternatively, you can also set the --max-model-len, --batch-size and --kv-cache-dtype arguments (à la vLLM) manually if preferred.
  • 1 reply
·
jeffboudier 
posted an update 9 months ago
view post
Post
3374
Quick 30s demo of the new Hub > Azure AI integration to deploy HF models in your own Azure account. Now with Py and CLI!

GG @alvarobartt @kramp @pagezyhf
jeffboudier 
posted an update 11 months ago
view post
Post
616
AMD summer hackathons are here!
A chance to get hands-on with MI300X GPUs and accelerate models.
🇫🇷 Paris - Station F - July 5-6
🇮🇳 Mumbai - July 12-13
🇮🇳 Bengaluru - July 19-20

Hugging Face and GPU Mode will be on site and on July 6 in Paris @ror will share lessons learned while building new kernels to accelerate Llama 3.1 405B on ROCm

Register to Paris event: https://lu.ma/fmvdjmur?tk=KeAbiP
All dates: https://lu.ma/calendar/cal-3sxhD5FdxWsMDIz
jeffboudier 
posted an update 11 months ago
view post
Post
1775
Today we launched Training Cluster as a Service, to make the new DGX Cloud Lepton supercloud easily accessible to AI researchers.

Hugging Face will collaborate with NVIDIA to provision and set up GPU training clusters to make them available for the duration of training runs.

Hugging Face organizations can sign up here: https://huggingface.co/training-cluster
jeffboudier 
posted an update 12 months ago
jeffboudier 
posted an update 12 months ago
view post
Post
522
Wrapping up a week of shipping and announcements with Dell Enterprise Hub now featuring AI Applications, on-device models for AI PCs, a new CLI and Python SDK... all you need for building AI on premises!

Blog post has all the details: https://huggingface.co/blog/dell-ai-applications
jeffboudier 
posted an update about 1 year ago
view post
Post
2645
Transcribing 1 hour of audio for less than $0.01 🤯

@mfuntowicz cooked with 8x faster Whisper speech recognition - whisper-large-v3-turbo transcribes at 100x real time on a $0.80/hr L4 GPU!

How they did it: https://huggingface.co/blog/fast-whisper-endpoints

1-click deploy with HF Inference Endpoints: https://endpoints.huggingface.co/new?repository=openai%2Fwhisper-large-v3-turbo&vendor=aws&region=us-east&accelerator=gpu&instance_id=aws-us-east-1-nvidia-l4-x1&task=automatic-speech-recognition&no_suggested_compute=true
jeffboudier 
posted an update about 1 year ago
jeffboudier 
posted an update about 1 year ago
view post
Post
2231
Llama4 is out and Scout is already on the Dell Enterprise Hub to deploy on Dell systems 👉 dell.huggingface.co
jeffboudier 
posted an update about 1 year ago
view post
Post
1618
Enterprise orgs now enable serverless Inference Providers for all members
- includes $2 free usage per org member (e.g. an Enterprise org with 1,000 members share $2,000 free credit each month)
- admins can set a monthly spend limit for the entire org
- works today with Together, fal, Novita, Cerebras and HF Inference.

Here's the doc to bill Inference Providers usage to your org: https://huggingface.co/docs/inference-providers/pricing#organization-billing
  • 2 replies
·
alvarobartt 
posted an update about 1 year ago
view post
Post
3646
🔥 Agents can do anything! @microsoft Research just announced the release of Magma 8B!

Magma is a new Visual Language Model (VLM) with 8B parameters for multi-modal agents designed to handle complex interactions across virtual and real environments; and it's MIT licensed!

Magma comes with exciting new features such as:
- Introduces the Set-of-Mark and Trace-of-Mark techniques for fine-tuning
- Leverages a large amount of unlabeled video data to learn the spatial-temporal grounding and planning
- A strong generalization and ability to be fine-tuned for other agentic tasks
- SOTA in different multi-modal benchmarks spanning across UI navigation, robotics manipulation, image / video understanding and spatial understanding and reasoning
- Generates goal-driven visual plans and actions for agentic use cases

Model: microsoft/Magma-8B
Technical Report: Magma: A Foundation Model for Multimodal AI Agents (2502.13130)
jeffboudier 
posted an update over 1 year ago
view post
Post
769
NVIDIA just announced the Cosmos World Foundation Models, available on the Hub: nvidia/cosmos-6751e884dc10e013a0a0d8e6

Cosmos is a family of pre-trained models purpose-built for generating physics-aware videos and world states to advance physical AI development.
The release includes Tokenizers nvidia/cosmos-tokenizer-672b93023add81b66a8ff8e6

Learn more in this great community article by @mingyuliutw and @PranjaliJoshi https://huggingface.co/blog/mingyuliutw/nvidia-cosmos
  • 1 reply
·
jeffboudier 
posted an update over 1 year ago