Complete HuggingFace Beginner Guide - Everything You Need to Know (Comprehensive Tutorial)


Author: AYI-NEDJIMI | AI & Cybersecurity Consultant

This comprehensive tutorial walks you through the entire HuggingFace ecosystem: from the Hub to Python libraries, community features, and Pro/Enterprise capabilities. Whether you're a beginner or experienced developer, you'll find everything you need to leverage the world's leading open-source AI platform.


1. What is HuggingFace?

HuggingFace is the world's leading platform for open-source artificial intelligence. Founded in 2016, it has become the "GitHub of AI" with over 500,000 models, 100,000 datasets, and 300,000 Spaces hosted.

1.1 The HuggingFace Hub

The Hub is the heart of the ecosystem. It's a collaborative platform where you can:

  • Discover pre-trained models for every AI task
  • Share your own models, datasets, and applications
  • Collaborate with the community via discussions and pull requests
  • Deploy AI applications in just a few clicks

# Explore the Hub programmatically
from huggingface_hub import HfApi

api = HfApi()

# List most downloaded models
models = api.list_models(sort="downloads", direction=-1, limit=5)
for model in models:
    print(f"{model.id} - {model.downloads:,} downloads")

# List popular datasets
datasets = api.list_datasets(sort="downloads", direction=-1, limit=5)
for ds in datasets:
    print(f"{ds.id} - {ds.downloads:,} downloads")

1.2 The Complete Ecosystem

HuggingFace is more than just a hub. It's a complete ecosystem comprising:

| Component | Description |
| --- | --- |
| Hub | Platform for sharing models, datasets, and Spaces |
| Transformers | Python library for deep learning models |
| Datasets | Large-scale data loading and processing |
| Gradio | Quick creation of web interfaces for your models |
| Accelerate | Simplified distributed training |
| PEFT | Efficient fine-tuning (LoRA, QLoRA) |
| TRL | Reinforcement learning training for LLMs |
| Inference API | Hosted inference API |

1.3 The Community

With over 5 million users, HuggingFace is the world's largest AI community. You'll find:

  • Researchers from Google, Meta, Microsoft, OpenAI
  • Innovative AI startups
  • Independent developers
  • Students and educators

2. Creating Your Account and Setting Up

2.1 Registration

  1. Go to huggingface.co
  2. Click "Sign Up"
  3. Fill in your information (email, username, password)
  4. Confirm your email

2.2 Configure Your Profile

A good HuggingFace profile is essential for visibility:

  • Profile picture: add a professional photo
  • Bio: describe your areas of expertise
  • Links: add your website, GitHub, LinkedIn
  • Organizations: join or create organizations

2.3 API Tokens

Tokens are required to interact with the Hub via the API:

# Method 1: Login via CLI
# In your terminal:
# huggingface-cli login

# Method 2: Token as environment variable
import os
os.environ["HF_TOKEN"] = "hf_your_token_here"

# Method 3: Direct token in code
from huggingface_hub import HfApi
api = HfApi(token="hf_your_token_here")

# Method 4: Programmatic login
from huggingface_hub import login
login(token="hf_your_token_here")

Token types:

  • Read: read private repos
  • Write: push models/datasets
  • Fine-grained: specific permissions per repo
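Regardless of which login method you use, it helps to know which token your code will actually pick up. Recent versions of huggingface_hub expose `get_token`, which resolves the token locally (the `HF_TOKEN` environment variable, then the token saved by `huggingface-cli login`) without making a network call. A minimal sketch:

```python
from huggingface_hub import get_token

# Resolves the active token: HF_TOKEN env var first,
# then the token cached by `huggingface-cli login`.
token = get_token()
if token is None:
    print("No token configured - anonymous access only")
else:
    print(f"Token found (starts with {token[:6]}...)")
```

This is a quick sanity check before pushing to the Hub: if it prints `None`, any write operation will fail with an authentication error.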

2.4 Installing Libraries

# Complete HuggingFace ecosystem installation
pip install transformers datasets huggingface_hub gradio accelerate peft trl
pip install torch torchvision torchaudio  # PyTorch
pip install evaluate scikit-learn  # Evaluation

# Verify installations
python -c "import transformers; print(f'Transformers: {transformers.__version__}')"
python -c "import datasets; print(f'Datasets: {datasets.__version__}')"
python -c "import gradio; print(f'Gradio: {gradio.__version__}')"

3. Navigating the Hub

3.1 Exploring Models

The Hub provides powerful filters to find the right model:

  • Task (pipeline_tag): text-generation, text-classification, image-classification...
  • Library: PyTorch, TensorFlow, JAX, ONNX
  • Language: en, fr, multilingual
  • License: MIT, Apache 2.0, CC-BY
  • Size: number of parameters

from huggingface_hub import HfApi

api = HfApi()

# Search for text generation models
models = api.list_models(
    filter="text-generation",
    sort="downloads",
    direction=-1,
    limit=10
)

for m in models:
    print(f"  {m.id} ({m.downloads:,} DL)")

3.2 Exploring Datasets

# Search for popular datasets
datasets_list = api.list_datasets(
    sort="downloads",
    direction=-1,
    limit=10
)

for ds in datasets_list:
    print(f"  {ds.id}")

3.3 Exploring Spaces

Spaces are web applications hosted for free. Available types:

  • Gradio: interactive ML interfaces (most popular)
  • Streamlit: data dashboards and apps
  • Docker: custom containers
  • Static: static websites (HTML/CSS/JS)

# List popular Spaces
spaces = api.list_spaces(sort="likes", direction=-1, limit=10)
for s in spaces:
    print(f"  {s.id} - {s.likes} likes")

3.4 Exploring Papers

HuggingFace now integrates AI research papers with:

  • Direct links to associated models and datasets
  • Community discussions on each paper
  • Daily Papers: the best papers of the day, voted on by the community

4. HuggingFace Python Libraries in Detail

4.1 Transformers - The Core of the Ecosystem

from transformers import pipeline

# Text classification pipeline
classifier = pipeline("text-classification", model="nlptown/bert-base-multilingual-uncased-sentiment")
result = classifier("This product is excellent, I highly recommend it!")
print(result)
# [{'label': '5 stars', 'score': 0.73}]

# Text generation pipeline
generator = pipeline("text-generation", model="gpt2")
text = generator("Artificial intelligence will", max_length=50)
print(text[0]['generated_text'])

# Translation pipeline
translator = pipeline("translation_en_to_fr", model="Helsinki-NLP/opus-mt-en-fr")
result = translator("HuggingFace is the best platform for AI.")
print(result[0]['translation_text'])

# Question answering pipeline
qa = pipeline("question-answering", model="deepset/roberta-base-squad2")
result = qa(
    question="What is HuggingFace?",
    context="HuggingFace is an open-source AI platform founded in 2016."
)
print(result['answer'])

4.2 Datasets - Load and Process Data

from datasets import load_dataset

# Load a popular dataset
dataset = load_dataset("squad_v2")
print(dataset)
print(dataset['train'][0])

# Load with streaming (for large datasets)
# Note: the legacy "wikipedia" script dataset is deprecated in recent
# versions of datasets; use the Parquet-based "wikimedia/wikipedia" instead.
dataset = load_dataset("wikimedia/wikipedia", "20231101.en", streaming=True)
for example in dataset['train']:
    print(example['title'])
    break

# Filter and transform
dataset = load_dataset("imdb")
small_dataset = dataset['train'].select(range(100))
filtered = small_dataset.filter(lambda x: x['label'] == 1)
print(f"Positive reviews: {len(filtered)}")

4.3 HuggingFace Hub - Interact with the Hub

from huggingface_hub import HfApi

api = HfApi()

# Download a file
api.hf_hub_download(repo_id="gpt2", filename="config.json")

# List files in a repo
files = api.list_repo_files("gpt2")
print(files[:5])

# Get model info
info = api.model_info("gpt2")
print(f"Model: {info.id}")
print(f"Downloads: {info.downloads:,}")
print(f"Likes: {info.likes}")

4.4 Gradio - Create Web Interfaces

import gradio as gr

def greet(name):
    return f"Hello {name}! Welcome to HuggingFace."

demo = gr.Interface(
    fn=greet,
    inputs=gr.Textbox(label="Your name"),
    outputs=gr.Textbox(label="Message"),
    title="My First Gradio Space",
    description="A simple application to discover Gradio"
)

# demo.launch()  # Launch locally
# demo.launch(share=True)  # Launch with public link

4.5 Accelerate - Distributed Training

from accelerate import Accelerator

accelerator = Accelerator()

# Accelerate automatically handles:
# - Multi-GPU
# - Mixed precision (fp16, bf16)
# - DeepSpeed
# - FSDP
# Without changing your training code!

# model, optimizer, dataloader = accelerator.prepare(
#     model, optimizer, dataloader
# )

4.6 PEFT - Efficient Fine-tuning

from peft import LoraConfig, get_peft_model

# LoRA configuration for efficient fine-tuning
lora_config = LoraConfig(
    r=16,           # Decomposition rank
    lora_alpha=32,  # Scale factor
    target_modules=["q_proj", "v_proj"],  # Target modules
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

# Apply LoRA to the model
# peft_model = get_peft_model(model, lora_config)
# peft_model.print_trainable_parameters()
# Output: "trainable params: 4,194,304 || all params: 6,742,609,920 || trainable%: 0.0622%"

4.7 TRL - Reinforcement Learning Training

from trl import SFTTrainer, SFTConfig

# SFTTrainer for Supervised Fine-Tuning
# trainer = SFTTrainer(
#     model=model,
#     train_dataset=dataset,
#     args=SFTConfig(output_dir="./results", max_seq_length=512),
#     peft_config=lora_config,
# )
# trainer.train()

5. Social and Community Features

5.1 Likes and Bookmarks

  • Like: show your appreciation for a model/dataset/Space
  • Bookmark: save for easy retrieval later

# Like a model programmatically
api.like("meta-llama/Llama-3.1-8B")

# View your likes
likes = api.list_liked_repos("AYI-NEDJIMI")

5.2 Following Users and Organizations

Follow your favorite researchers and organizations to stay informed about their publications.

5.3 Organizations

Organizations allow you to:

  • Group models/datasets under a single entity
  • Manage team permissions
  • Have a professional public profile

5.4 Collections

Collections let you organize and share curated sets of repos:

# Create a collection
# api.create_collection(
#     title="My NLP Models",
#     description="Curated collection of NLP models"
# )

Discover our CyberSec AI collection: CyberSec AI Portfolio


6. HuggingFace Pro and Enterprise

6.1 HuggingFace Pro ($9/month)

  • Inference API: increased limits
  • Spaces: improved hardware (GPU)
  • Private repos: unlimited
  • Early access: new features before everyone else
  • Pro badge on your profile

6.2 HuggingFace Enterprise

For businesses, HuggingFace offers:

  • Private Hub: dedicated Hub instance
  • SSO/SAML: enterprise authentication
  • Audit logs: complete traceability
  • SLA: availability guarantee
  • Premium support: dedicated team
  • Inference Endpoints: production deployment
  • GPU hardware: A100, H100, etc.

6.3 Spaces Pricing

| Tier | Hardware | Price |
| --- | --- | --- |
| Free | 2 vCPU, 16 GB RAM | Free |
| CPU Upgrade | 8 vCPU, 32 GB RAM | $0.03/h |
| T4 small | Nvidia T4, 4 GB VRAM | $0.06/h |
| T4 medium | Nvidia T4, 16 GB VRAM | $0.09/h |
| A10G small | Nvidia A10G, 24 GB VRAM | $0.10/h |
| A100 large | Nvidia A100, 80 GB VRAM | $4.13/h |
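The hourly rates above make it easy to estimate what a Space will cost before you upgrade its hardware. A quick back-of-the-envelope helper (the rates are hard-coded from the table above and will change over time; the tier keys are just labels chosen for this sketch):

```python
# Estimate the monthly cost of a Space from hourly rates.
# Rates are a snapshot of the pricing table above and will change over time.
RATES = {
    "cpu-upgrade": 0.03,
    "t4-small": 0.06,
    "t4-medium": 0.09,
    "a10g-small": 0.10,
    "a100-large": 4.13,
}

def monthly_cost(tier: str, hours_per_day: float = 24, days: int = 30) -> float:
    """Estimated cost in USD for running a Space for one month."""
    return RATES[tier] * hours_per_day * days

# A T4-small Space running around the clock:
print(f"${monthly_cost('t4-small'):.2f}/month")  # $43.20/month
```

Note that Spaces can be configured to sleep after a period of inactivity, which is why `hours_per_day` is a parameter: a demo that is only awake a few hours a day costs a fraction of the always-on figure.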

7. Practical First Steps

7.1 Your First Model

from transformers import AutoTokenizer, AutoModelForCausalLM

# Load a model and its tokenizer
model_name = "distilgpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Generate text
input_text = "Artificial intelligence"
inputs = tokenizer(input_text, return_tensors="pt")
outputs = model.generate(**inputs, max_length=100)
result = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(result)

7.2 Your First Dataset

from datasets import load_dataset, Dataset

# Create a dataset from a dictionary
data = {
    "question": ["What is HuggingFace?", "What is LoRA?"],
    "answer": ["Open-source AI platform", "Efficient fine-tuning method"]
}
dataset = Dataset.from_dict(data)
print(dataset)
print(dataset[0])

# Save and load
dataset.save_to_disk("./my_dataset")
# dataset.push_to_hub("my-username/my-dataset")

7.3 Your First Space

  1. Go to huggingface.co/new-space
  2. Choose a name and SDK (Gradio recommended)
  3. Edit app.py directly in the browser:
import gradio as gr
from transformers import pipeline

classifier = pipeline("sentiment-analysis")

def analyze(text):
    result = classifier(text)
    return f"Sentiment: {result[0]['label']} (confidence: {result[0]['score']:.2%})"

demo = gr.Interface(
    fn=analyze,
    inputs=gr.Textbox(placeholder="Enter text..."),
    outputs="text",
    title="Sentiment Analysis",
)
demo.launch()

8. Best Practices and Tips

8.1 For Beginners

  1. Start with pipelines: they abstract away complexity
  2. Explore the Hub: filter by task and language
  3. Read model cards: they explain usage and limitations
  4. Join the community: Discord, forums, discussions
  5. Follow the courses: huggingface.co/learn

8.2 For Developers

  1. Use fine-grained tokens for security
  2. Enable caching to avoid re-downloading models
  3. Use streaming for large datasets
  4. Document your repos with complete model/dataset cards
  5. Version your models with git branches
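Tips 2 and 3 above are mostly a matter of configuration: the HuggingFace libraries read a few documented environment variables at import time. A minimal sketch (`HF_HOME` and `HF_HUB_OFFLINE` are real variables; the cache path here is just an example, and these must be set before importing transformers/datasets/huggingface_hub):

```python
import os

# Point all HuggingFace caches at a single directory (e.g. a large disk),
# so models and datasets are downloaded once and reused across projects.
os.environ["HF_HOME"] = "/data/hf-cache"

# Force fully offline mode: the libraries will only use already-cached files
# instead of re-downloading (useful on air-gapped or metered machines).
os.environ["HF_HUB_OFFLINE"] = "1"

print(os.environ["HF_HOME"])  # /data/hf-cache
```

Setting these in your shell profile or container image keeps the behavior consistent across scripts without touching the code itself.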

8.3 Learning Resources

  • Official courses: huggingface.co/learn
  • Documentation: huggingface.co/docs
  • Community forum: discuss.huggingface.co

Conclusion

HuggingFace is much more than just a platform: it's a complete ecosystem that democratizes AI. With its open-source libraries, collaborative Hub, and active community, it offers everything you need to start, experiment, and deploy AI solutions.

Feel free to explore our portfolio of models and Spaces dedicated to cybersecurity: CyberSec AI Portfolio


Tutorial written by AYI-NEDJIMI - AI & Cybersecurity Consultant
For more resources, check out our other tutorials in this series.
