Complete HuggingFace Beginner Guide - Everything You Need to Know (Comprehensive Tutorial)
Author: AYI-NEDJIMI | AI & Cybersecurity Consultant
This comprehensive tutorial walks you through the entire HuggingFace ecosystem: from the Hub to Python libraries, community features, and Pro/Enterprise capabilities. Whether you're a beginner or experienced developer, you'll find everything you need to leverage the world's leading open-source AI platform.
1. What is HuggingFace?
HuggingFace is the world's leading platform for open-source artificial intelligence. Founded in 2016, it has become the "GitHub of AI", hosting over 500,000 models, 100,000 datasets, and 300,000 Spaces.
1.1 The HuggingFace Hub
The Hub is the heart of the ecosystem. It's a collaborative platform where you can:
- Discover pre-trained models for every AI task
- Share your own models, datasets, and applications
- Collaborate with the community via discussions and pull requests
- Deploy AI applications in just a few clicks
```python
# Explore the Hub programmatically
from huggingface_hub import HfApi

api = HfApi()

# List the most downloaded models
models = api.list_models(sort="downloads", direction=-1, limit=5)
for model in models:
    print(f"{model.id} - {model.downloads:,} downloads")

# List the most downloaded datasets
datasets = api.list_datasets(sort="downloads", direction=-1, limit=5)
for ds in datasets:
    print(f"{ds.id} - {ds.downloads:,} downloads")
```
1.2 The Complete Ecosystem
HuggingFace is more than just a hub. It's a complete ecosystem comprising:
| Component | Description |
|---|---|
| Hub | Platform for sharing models, datasets, and Spaces |
| Transformers | Python library for deep learning models |
| Datasets | Large-scale data loading and processing |
| Gradio | Quick creation of web interfaces for your models |
| Accelerate | Simplified distributed training |
| PEFT | Efficient fine-tuning (LoRA, QLoRA) |
| TRL | Reinforcement learning training for LLMs |
| Inference API | Hosted inference API |
1.3 The Community
With over 5 million users, HuggingFace is the world's largest AI community. You'll find:
- Researchers from Google, Meta, Microsoft, OpenAI
- Innovative AI startups
- Independent developers
- Students and educators
2. Creating Your Account and Setting Up
2.1 Registration
- Go to huggingface.co
- Click "Sign Up"
- Fill in your information (email, username, password)
- Confirm your email
2.2 Configure Your Profile
A good HuggingFace profile is essential for visibility:
- Profile picture: add a professional photo
- Bio: describe your areas of expertise
- Links: add your website, GitHub, LinkedIn
- Organizations: join or create organizations
2.3 API Tokens
Tokens are required to interact with the Hub via the API:
```python
# Method 1: login via the CLI (run in your terminal):
#   huggingface-cli login

# Method 2: token as an environment variable
import os
os.environ["HF_TOKEN"] = "hf_your_token_here"

# Method 3: pass the token directly in code
from huggingface_hub import HfApi
api = HfApi(token="hf_your_token_here")

# Method 4: programmatic login
from huggingface_hub import login
login(token="hf_your_token_here")
```
Token types:
- Read: download and read repos, including private repos you have access to
- Write: push models, datasets, and Spaces
- Fine-grained: granular permissions scoped to specific repos
2.4 Installing Libraries
```bash
# Complete HuggingFace ecosystem installation
pip install transformers datasets huggingface_hub gradio accelerate peft trl
pip install torch torchvision torchaudio   # PyTorch
pip install evaluate scikit-learn          # Evaluation

# Verify installations
python -c "import transformers; print(f'Transformers: {transformers.__version__}')"
python -c "import datasets; print(f'Datasets: {datasets.__version__}')"
python -c "import gradio; print(f'Gradio: {gradio.__version__}')"
```
3. Navigating the Hub
3.1 Exploring Models
The Hub provides powerful filters to find the right model:
- Task (pipeline_tag): text-generation, text-classification, image-classification...
- Library: PyTorch, TensorFlow, JAX, ONNX
- Language: en, fr, multilingual
- License: MIT, Apache 2.0, CC-BY
- Size: number of parameters
```python
from huggingface_hub import HfApi

api = HfApi()

# Search for text generation models
models = api.list_models(
    filter="text-generation",
    sort="downloads",
    direction=-1,
    limit=10,
)
for m in models:
    print(f"{m.id} ({m.downloads:,} downloads)")
```
3.2 Exploring Datasets
```python
# Search for popular datasets
datasets_list = api.list_datasets(
    sort="downloads",
    direction=-1,
    limit=10,
)
for ds in datasets_list:
    print(f"{ds.id}")
```
3.3 Exploring Spaces
Spaces are web applications hosted for free. Available types:
- Gradio: interactive ML interfaces (most popular)
- Streamlit: data dashboards and apps
- Docker: custom containers
- Static: static websites (HTML/CSS/JS)
```python
# List popular Spaces
spaces = api.list_spaces(sort="likes", direction=-1, limit=10)
for s in spaces:
    print(f"{s.id} - {s.likes} likes")
```
3.4 Exploring Papers
HuggingFace now integrates AI research papers with:
- Direct links to associated models and datasets
- Community discussions on each paper
- Daily Papers: the best articles of the day voted by the community
4. HuggingFace Python Libraries in Detail
4.1 Transformers - The Core of the Ecosystem
```python
from transformers import pipeline

# Text classification pipeline
classifier = pipeline("text-classification", model="nlptown/bert-base-multilingual-uncased-sentiment")
result = classifier("This product is excellent, I highly recommend it!")
print(result)
# [{'label': '5 stars', 'score': 0.73}]

# Text generation pipeline
generator = pipeline("text-generation", model="gpt2")
text = generator("Artificial intelligence will", max_length=50)
print(text[0]['generated_text'])

# Translation pipeline
translator = pipeline("translation_en_to_fr", model="Helsinki-NLP/opus-mt-en-fr")
result = translator("HuggingFace is the best platform for AI.")
print(result[0]['translation_text'])

# Question answering pipeline
qa = pipeline("question-answering", model="deepset/roberta-base-squad2")
result = qa(
    question="What is HuggingFace?",
    context="HuggingFace is an open-source AI platform founded in 2016.",
)
print(result['answer'])
```
4.2 Datasets - Load and Process Data
```python
from datasets import load_dataset

# Load a popular dataset
dataset = load_dataset("squad_v2")
print(dataset)
print(dataset['train'][0])

# Load with streaming (for large datasets)
dataset = load_dataset("wikipedia", "20220301.en", streaming=True)
for example in dataset['train']:
    print(example['title'])
    break

# Filter and transform
dataset = load_dataset("imdb")
small_dataset = dataset['train'].select(range(100))
filtered = small_dataset.filter(lambda x: x['label'] == 1)
print(f"Positive reviews: {len(filtered)}")
```
4.3 HuggingFace Hub - Interact with the Hub
```python
from huggingface_hub import HfApi, hf_hub_download

api = HfApi()

# Download a single file from a repo (returns the local cached path)
config_path = hf_hub_download(repo_id="gpt2", filename="config.json")

# List files in a repo
files = api.list_repo_files("gpt2")
print(files[:5])

# Get model info
info = api.model_info("gpt2")
print(f"Model: {info.id}")
print(f"Downloads: {info.downloads:,}")
print(f"Likes: {info.likes}")
```
4.4 Gradio - Create Web Interfaces
```python
import gradio as gr

def greet(name):
    return f"Hello {name}! Welcome to HuggingFace."

demo = gr.Interface(
    fn=greet,
    inputs=gr.Textbox(label="Your name"),
    outputs=gr.Textbox(label="Message"),
    title="My First Gradio Space",
    description="A simple application to discover Gradio",
)

# demo.launch()             # Launch locally
# demo.launch(share=True)   # Launch with a temporary public link
```
4.5 Accelerate - Distributed Training
```python
from accelerate import Accelerator

accelerator = Accelerator()

# Accelerate automatically handles:
# - multi-GPU training
# - mixed precision (fp16, bf16)
# - DeepSpeed
# - FSDP
# without changing your training loop beyond:
# model, optimizer, dataloader = accelerator.prepare(
#     model, optimizer, dataloader
# )
```
4.6 PEFT - Efficient Fine-tuning
```python
from peft import LoraConfig, get_peft_model

# LoRA configuration for efficient fine-tuning
lora_config = LoraConfig(
    r=16,                                  # decomposition rank
    lora_alpha=32,                         # scaling factor
    target_modules=["q_proj", "v_proj"],   # modules to adapt
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

# Apply LoRA to a loaded model
# peft_model = get_peft_model(model, lora_config)
# peft_model.print_trainable_parameters()
# "trainable params: 4,194,304 || all params: 6,742,609,920 || trainable%: 0.0622%"
```
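The dramatic drop in trainable parameters follows from simple arithmetic: instead of updating a full `d_out × d_in` weight matrix, LoRA trains two low-rank factors of shapes `d_out × r` and `r × d_in`. A back-of-the-envelope check, with a layer size chosen purely for illustration:

```python
# Parameter count for one square d x d projection (illustrative size)
d, r = 4096, 16
full = d * d            # full weight update: 16,777,216 params
lora = r * d + d * r    # two low-rank factors:    131,072 params
print(f"trainable fraction: {lora / full:.4%}")  # trainable fraction: 0.7813%
```

Because only a handful of projections are adapted and the rest of the model stays frozen, the overall trainable fraction reported by `print_trainable_parameters()` is smaller still.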
4.7 TRL - Reinforcement Learning Training
```python
from trl import SFTTrainer, SFTConfig

# SFTTrainer for supervised fine-tuning
# trainer = SFTTrainer(
#     model=model,
#     train_dataset=dataset,
#     args=SFTConfig(output_dir="./results", max_seq_length=512),
#     peft_config=lora_config,
# )
# trainer.train()
```
5. Social and Community Features
5.1 Likes and Bookmarks
- Like: show your appreciation for a model/dataset/Space
- Bookmark: save for easy retrieval later
```python
# Like a model programmatically
api.like("meta-llama/Llama-3.1-8B")

# View a user's liked repos
likes = api.list_liked_repos("AYI-NEDJIMI")
```
5.2 Following Users and Organizations
Follow your favorite researchers and organizations to stay informed about their publications.
5.3 Organizations
Organizations allow you to:
- Group models/datasets under a single entity
- Manage team permissions
- Have a professional public profile
5.4 Collections
Collections let you organize and share curated sets of repos:
```python
# Create a collection
# api.create_collection(
#     title="My NLP Models",
#     description="Curated collection of NLP models"
# )
```
Discover our CyberSec AI collection: CyberSec AI Portfolio
6. HuggingFace Pro and Enterprise
6.1 HuggingFace Pro ($9/month)
- Inference API: increased limits
- Spaces: improved hardware (GPU)
- Private repos: unlimited
- Early access: new features before everyone else
- Pro badge on your profile
6.2 HuggingFace Enterprise
For businesses, HuggingFace offers:
- Private Hub: dedicated Hub instance
- SSO/SAML: enterprise authentication
- Audit logs: complete traceability
- SLA: availability guarantee
- Premium support: dedicated team
- Inference Endpoints: production deployment
- GPU hardware: A100, H100, etc.
6.3 Spaces Pricing
| Tier | Hardware | Price |
|---|---|---|
| Free | 2 vCPU, 16 GB RAM | Free |
| CPU Upgrade | 8 vCPU, 32 GB RAM | $0.03/h |
| T4 small | Nvidia T4, 4 GB VRAM | $0.06/h |
| T4 medium | Nvidia T4, 16 GB VRAM | $0.09/h |
| A10G small | Nvidia A10G, 24 GB VRAM | $0.10/h |
| A100 large | Nvidia A100, 80 GB VRAM | $4.13/h |
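Since Space hardware is billed per hour, budgeting is a straightforward multiplication. For example, a T4 small Space kept awake 8 hours a day (rate taken from the table above, usage figures illustrative):

```python
# Monthly cost estimate for a T4 small Space
rate_per_hour = 0.06   # $/h, from the pricing table
hours_per_day = 8      # illustrative usage
days_per_month = 30
monthly = rate_per_hour * hours_per_day * days_per_month
print(f"${monthly:.2f}/month")  # $14.40/month
```

Free-tier Spaces also sleep automatically after inactivity, so paid hardware is only worth it once you need GPUs or guaranteed responsiveness.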
7. Practical First Steps
7.1 Your First Model
```python
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load a model and its tokenizer
model_name = "distilgpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Generate text
input_text = "Artificial intelligence"
inputs = tokenizer(input_text, return_tensors="pt")
outputs = model.generate(**inputs, max_length=100)
result = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(result)
```
7.2 Your First Dataset
```python
from datasets import Dataset

# Create a dataset from a dictionary
data = {
    "question": ["What is HuggingFace?", "What is LoRA?"],
    "answer": ["Open-source AI platform", "Efficient fine-tuning method"],
}
dataset = Dataset.from_dict(data)
print(dataset)
print(dataset[0])

# Save locally, or push to the Hub
dataset.save_to_disk("./my_dataset")
# dataset.push_to_hub("my-username/my-dataset")
```
7.3 Your First Space
- Go to huggingface.co/new-space
- Choose a name and an SDK (Gradio recommended)
- Edit `app.py` directly in the browser:
```python
import gradio as gr
from transformers import pipeline

classifier = pipeline("sentiment-analysis")

def analyze(text):
    result = classifier(text)
    return f"Sentiment: {result[0]['label']} (confidence: {result[0]['score']:.2%})"

demo = gr.Interface(
    fn=analyze,
    inputs=gr.Textbox(placeholder="Enter text..."),
    outputs="text",
    title="Sentiment Analysis",
)

demo.launch()
```
8. Best Practices and Tips
8.1 For Beginners
- Start with pipelines: they abstract away complexity
- Explore the Hub: filter by task and language
- Read model cards: they explain usage and limitations
- Join the community: Discord, forums, discussions
- Follow the courses: huggingface.co/learn
8.2 For Developers
- Use fine-grained tokens for security
- Enable caching to avoid re-downloading models
- Use streaming for large datasets
- Document your repos with complete model/dataset cards
- Version your models with git branches
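For the caching tip above: the cache location can be moved with the `HF_HOME` environment variable, and `huggingface-cli` includes commands to inspect and prune it. A sketch, where the path is a placeholder for your own disk layout:

```shell
# Point the HuggingFace cache at a larger disk (placeholder path)
export HF_HOME=/data/hf-cache

# Inspect what is cached and how much space it uses
huggingface-cli scan-cache

# Interactively delete cached revisions you no longer need
huggingface-cli delete-cache
```

Because all the libraries share this cache, a model downloaded once by `transformers` is reused by `huggingface_hub` and vice versa.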
8.3 Learning Resources
- HuggingFace Course: huggingface.co/learn/nlp-course
- Documentation: huggingface.co/docs
- Blog: huggingface.co/blog
- YouTube: official HuggingFace channel
- Discord: active and welcoming community
Conclusion
HuggingFace is much more than just a platform: it's a complete ecosystem that democratizes AI. With its open-source libraries, collaborative Hub, and active community, it offers everything you need to start, experiment, and deploy AI solutions.
Feel free to explore our portfolio of models and Spaces dedicated to cybersecurity: CyberSec AI Portfolio
Tutorial written by AYI-NEDJIMI - AI & Cybersecurity Consultant
For more resources, check out our other tutorials in this series.