google-cloud-partnership (Hugging Face on Google Cloud)

posted an update about 1 month ago

Open agents on AWS SageMaker AI with open models from the Hugging Face Hub!

> Deploy an open model from the Hugging Face Hub on SageMaker AI
> Connect the deployed model to Strands Agents
> Add built-in and custom tools for tool calling
> Expose external capabilities through MCP integration
> Bonus: talk to your agent and visualize traces with Gradio

https://alvarobartt.com/agents-on-aws-sagemaker

alvarobartt

posted an update about 1 month ago

The latest hf-mem release adds a breakdown of Mixture-of-Experts (MoE) memory usage!

TL;DR: MoEs are easy to misjudge if you look only at active parameters. Each token activates just a subset of experts, but the serving setup still has to account for the full resident memory footprint, including every expert.

🧠 hf-mem now splits MoE memory into base model weights, routed experts, and KV cache
🏗️ Dense models usually load and use most weights every forward pass, while MoEs load many experts but only route each token to a few of them
⚡ Active parameters are not the same as memory footprint, especially for sparse architectures
📦 Runtime memory is about what is used per request/token, while loading memory also includes the expert weights that need to be resident
📚 KV cache can still dominate depending on context length, batch size, and concurrency
🔀 Expert Parallelism (EP) helps shard experts across accelerators when expert weights dominate
🚀 Data Parallelism (DP) + EP is often a good fit for throughput-oriented MoE serving
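To make the gap between active and resident weights concrete, here is a back-of-the-envelope sketch. It is not hf-mem's implementation: the `MoEConfig` fields and all numbers are hypothetical, every expert is assumed to have the same size, shared/non-routed weights are lumped into `base_params`, and KV cache, activations, and framework overhead are ignored.

```python
from dataclasses import dataclass

# Bytes per parameter for a few common weight dtypes.
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2, "float16": 2, "float8": 1}


@dataclass
class MoEConfig:
    # Hypothetical fields for illustration, not a real config schema.
    base_params: int        # non-routed weights: embeddings, attention, norms, router, ...
    num_moe_layers: int     # layers that contain routed experts
    experts_per_layer: int  # routed experts in each MoE layer
    params_per_expert: int  # parameters of a single expert FFN
    top_k: int              # experts each token is routed to, per MoE layer


def moe_weight_bytes(cfg: MoEConfig, dtype: str = "bfloat16") -> dict:
    """Split weight memory into base weights, routed experts, and per-token active weights."""
    nbytes = BYTES_PER_PARAM[dtype]
    base = cfg.base_params * nbytes
    # Every expert must be resident in memory, whether or not a given token uses it.
    routed = cfg.num_moe_layers * cfg.experts_per_layer * cfg.params_per_expert * nbytes
    # A single token only touches top_k experts per MoE layer (plus the base weights).
    active = base + cfg.num_moe_layers * cfg.top_k * cfg.params_per_expert * nbytes
    return {"base": base, "routed_experts": routed, "resident": base + routed, "active_per_token": active}


# Hypothetical model: 2B base params, 48 MoE layers x 64 experts x 8M params, top-4 routing.
cfg = MoEConfig(base_params=2_000_000_000, num_moe_layers=48, experts_per_layer=64,
                params_per_expert=8_000_000, top_k=4)
for name, value in moe_weight_bytes(cfg).items():
    print(f"{name:>16}: {value / 1e9:.3f} GB")
```

With these made-up numbers, a token only touches about 7 GB of weights in bfloat16, yet roughly 53 GB must stay resident, which is why sharding experts with EP becomes attractive once the routed experts dominate.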

Check the repository at https://github.com/alvarobartt/hf-mem

alvarobartt

posted an update 4 months ago

Learn how to deploy Microsoft Research VibeVoice ASR on Microsoft Azure Foundry with Hugging Face to generate rich audio transcriptions with Who, When, and What! 💥

> 🕒 Up to 60 minutes of audio processed in a single pass, with no chunking or stitching
> 👤 Customized hotwords to guide recognition on domain-specific content
> 📝 Rich transcription: joint ASR + diarization + timestamping in one pass
> 🌍 50+ languages with automatic detection and code-switching support
> 🤗 Deployed on Microsoft Foundry via an OpenAI-compatible Chat Completions API

https://huggingface.co/docs/microsoft-azure/foundry/examples/deploy-vibevoice-asr

alvarobartt

updated a collection 4 months ago

100,000 FineWiki Embeddings with EmbeddingGemma

Collection

Material for the example at https://huggingface.co/docs/google-cloud/examples/cloud-run-finewiki-embeddings-with-embedding-gemma • 3 items • Updated Feb 26

alvarobartt

updated a dataset 4 months ago

google-cloud-partnership/finewiki-en-100k

Viewer • Updated Feb 20 • 100k • 29

alvarobartt

published a dataset 4 months ago

google-cloud-partnership/finewiki-en-100k

Viewer • Updated Feb 20 • 100k • 29

alvarobartt

updated a dataset 4 months ago

google-cloud-partnership/finewiki-en-100k-embeddings

Viewer • Updated Feb 20 • 100k • 21

alvarobartt

published a dataset 4 months ago

google-cloud-partnership/finewiki-en-100k-embeddings

Viewer • Updated Feb 20 • 100k • 21

alvarobartt

updated a dataset 4 months ago

google-cloud-partnership/finewiki-en-100-embeddings

Viewer • Updated Feb 16 • 52 • 11

alvarobartt

published a dataset 4 months ago

google-cloud-partnership/finewiki-en-100-embeddings

Viewer • Updated Feb 16 • 52 • 11

alvarobartt

updated a dataset 4 months ago

google-cloud-partnership/finewiki-en-1m

Viewer • Updated Feb 16 • 1M • 11

alvarobartt

published a dataset 4 months ago

google-cloud-partnership/finewiki-en-1m

Viewer • Updated Feb 16 • 1M • 11

alvarobartt

posted an update 5 months ago

💥 hf-mem v0.4.1 now also estimates KV cache memory requirements for any context length and batch size with the --experimental flag!

Running uvx hf-mem --model-id ... --experimental automatically pulls the required information from the Hugging Face Hub to include the KV cache estimate, when applicable.

💡 Alternatively, you can set the --max-model-len, --batch-size, and --kv-cache-dtype arguments (à la vLLM) manually.
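For intuition about what such an estimate scales with, here is a minimal sketch of the common KV cache formula for a standard decoder-only transformer with multi-head or grouped-query attention. It is not hf-mem's code: the config keys mirror Hugging Face transformers `config.json` field names, but the dtype-to-bytes mapping and the example values are assumptions, and it does not model MLA, sliding-window attention, or paged-block rounding.

```python
# Bytes per element for a few KV cache dtypes (assumed mapping, for illustration).
KV_DTYPE_BYTES = {"float32": 4, "bfloat16": 2, "float16": 2, "fp8": 1}


def kv_cache_bytes(config: dict, max_model_len: int, batch_size: int,
                   kv_cache_dtype: str = "bfloat16") -> int:
    """Estimate KV cache size for a standard (non-MLA) decoder-only transformer."""
    num_layers = config["num_hidden_layers"]
    # Grouped-query attention stores fewer KV heads than query heads.
    num_kv_heads = config.get("num_key_value_heads", config["num_attention_heads"])
    head_dim = config.get("head_dim", config["hidden_size"] // config["num_attention_heads"])
    # Factor of 2: one tensor for keys and one for values, in every layer.
    per_token = 2 * num_layers * num_kv_heads * head_dim * KV_DTYPE_BYTES[kv_cache_dtype]
    return per_token * max_model_len * batch_size


# Hypothetical 32-layer model with 8 KV heads of size 128.
config = {"num_hidden_layers": 32, "num_attention_heads": 32,
          "num_key_value_heads": 8, "hidden_size": 4096}
print(f"{kv_cache_bytes(config, max_model_len=32768, batch_size=4) / 1024**3:.1f} GiB")  # → 16.0 GiB
```

The estimate grows linearly in both context length and batch size, and halving the bytes per element (for example, an 8-bit KV cache dtype) halves it, which is why these three knobs are exposed.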

pagezyhf

posted an update 8 months ago

🚀 Big news for AI builders!

We’re thrilled to announce that the Qwen3-VL family of vision-language models is now available on Azure AI Foundry, thanks to our collaboration with Microsoft.

We bring open-source innovation to enterprise-grade AI infrastructure, making it easier than ever for enterprises to securely deploy and scale the latest models from Hugging Face within Azure.

🔍 Highlights:

- Deploy Qwen3-VL instantly via managed endpoints
- Built-in governance, telemetry, and lifecycle management
- True multimodal reasoning — vision, language, and code understanding
- State-of-the-art performance, reported to outperform closed-source models such as Gemini 2.5 Pro and GPT-5 on several benchmarks
- Available in both *Instruct* and *Thinking* modes, across 24 model sizes

👉 Get started today: search for Qwen3-VL in the Hugging Face Collection on Azure AI Foundry.