This is a multimodal model trained from scratch on text, speech, and vision data.
It supports any-to-any pipelines (e.g., text-to-text, speech-to-text, vision-to-text).
import torch
from transformers import AutoTokenizer, AutoModel

model_name = "AarambhAI/gemma-like-multimodal-speech-vision-text"

# Load the tokenizer and model weights from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

# Encode text into PyTorch tensors
inputs = tokenizer("Hello, world!", return_tensors="pt")

# Run a forward pass without tracking gradients (inference only)
with torch.no_grad():
    outputs = model(**inputs)
print(outputs)
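Printing `outputs` directly dumps full tensors, which can be hard to read. The sketch below lists only the shape of each field. It assumes the model returns a dict-like `transformers` output object (the usual `ModelOutput` behavior), which this card does not confirm for this checkpoint. `summarize_outputs` is a hypothetical helper, not part of any library.

```python
from collections.abc import Mapping


def summarize_outputs(outputs):
    """Return {field_name: shape_tuple} for every tensor-like field in `outputs`.

    Assumption: `outputs` is dict-like, as transformers ModelOutput objects are.
    Fields without a `.shape` attribute (None, tuples of tensors) are skipped.
    """
    if not isinstance(outputs, Mapping):
        raise TypeError("expected a dict-like model output")
    summary = {}
    for name, value in outputs.items():
        shape = getattr(value, "shape", None)
        if shape is not None:
            # torch.Size is converted to a plain tuple for easy printing
            summary[name] = tuple(shape)
    return summary
```

After the forward pass above, `print(summarize_outputs(outputs))` would show which fields the model returned and their dimensions. The field names depend on the model and are not documented here.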