Complete HuggingFace Beginner Guide - Everything You Need to Know (Comprehensive Tutorial)


Author: AYI-NEDJIMI | AI & Cybersecurity Consultant

This comprehensive tutorial walks you through the entire HuggingFace ecosystem: from the Hub to Python libraries, community features, and Pro/Enterprise capabilities. Whether you're a beginner or experienced developer, you'll find everything you need to leverage the world's leading open-source AI platform.


1. What is HuggingFace?

HuggingFace is the world's leading platform for open-source artificial intelligence. Founded in 2016, it has become the "GitHub of AI" with over 500,000 models, 100,000 datasets, and 300,000 Spaces hosted.

1.1 The HuggingFace Hub

The Hub is the heart of the ecosystem. It's a collaborative platform where you can:

  • Discover pre-trained models for every AI task
  • Share your own models, datasets, and applications
  • Collaborate with the community via discussions and pull requests
  • Deploy AI applications in just a few clicks

# Explore the Hub programmatically
from huggingface_hub import HfApi

api = HfApi()

# List most downloaded models
models = api.list_models(sort="downloads", direction=-1, limit=5)
for model in models:
    print(f"{model.id} - {model.downloads:,} downloads")

# List popular datasets
datasets = api.list_datasets(sort="downloads", direction=-1, limit=5)
for ds in datasets:
    print(f"{ds.id} - {ds.downloads:,} downloads")

1.2 The Complete Ecosystem

HuggingFace is more than just a hub. It's a complete ecosystem comprising:

| Component | Description |
| --- | --- |
| Hub | Platform for sharing models, datasets, and Spaces |
| Transformers | Python library for deep learning models |
| Datasets | Large-scale data loading and processing |
| Gradio | Quick creation of web interfaces for your models |
| Accelerate | Simplified distributed training |
| PEFT | Efficient fine-tuning (LoRA, QLoRA) |
| TRL | Reinforcement learning training for LLMs |
| Inference API | Hosted inference API |

1.3 The Community

With over 5 million users, HuggingFace is the world's largest AI community. You'll find:

  • Researchers from Google, Meta, Microsoft, OpenAI
  • Innovative AI startups
  • Independent developers
  • Students and educators

2. Creating Your Account and Setting Up

2.1 Registration

  1. Go to huggingface.co
  2. Click "Sign Up"
  3. Fill in your information (email, username, password)
  4. Confirm your email

2.2 Configure Your Profile

A good HuggingFace profile is essential for visibility:

  • Profile picture: add a professional photo
  • Bio: describe your areas of expertise
  • Links: add your website, GitHub, LinkedIn
  • Organizations: join or create organizations

2.3 API Tokens

Tokens are required to interact with the Hub via the API:

# Method 1: Login via CLI
# In your terminal:
# huggingface-cli login

# Method 2: Token as environment variable
import os
os.environ["HF_TOKEN"] = "hf_your_token_here"

# Method 3: Direct token in code
from huggingface_hub import HfApi
api = HfApi(token="hf_your_token_here")

# Method 4: Programmatic login
from huggingface_hub import login
login(token="hf_your_token_here")

Token types:

  • Read: read private repos
  • Write: push models/datasets
  • Fine-grained: specific permissions per repo
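Regardless of which login method you use, it helps to know which token your code will actually pick up. Recent versions of huggingface_hub expose `get_token`, which resolves the token locally (the `HF_TOKEN` environment variable, then the token saved by `huggingface-cli login`) without making a network call. A minimal sketch:

```python
from huggingface_hub import get_token

# Resolves the active token: HF_TOKEN env var first,
# then the token cached by `huggingface-cli login`.
token = get_token()
if token is None:
    print("No token configured - anonymous access only")
else:
    print(f"Token found (starts with {token[:6]}...)")
```

This is a quick sanity check before pushing to the Hub: if it prints `None`, any write operation will fail with an authentication error.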

2.4 Installing Libraries

# Complete HuggingFace ecosystem installation
pip install transformers datasets huggingface_hub gradio accelerate peft trl
pip install torch torchvision torchaudio  # PyTorch
pip install evaluate scikit-learn  # Evaluation

# Verify installations
python -c "import transformers; print(f'Transformers: {transformers.__version__}')"
python -c "import datasets; print(f'Datasets: {datasets.__version__}')"
python -c "import gradio; print(f'Gradio: {gradio.__version__}')"

3. Navigating the Hub

3.1 Exploring Models

The Hub provides powerful filters to find the right model:

  • Task (pipeline_tag): text-generation, text-classification, image-classification...
  • Library: PyTorch, TensorFlow, JAX, ONNX
  • Language: en, fr, multilingual
  • License: MIT, Apache 2.0, CC-BY
  • Size: number of parameters

from huggingface_hub import HfApi

api = HfApi()

# Search for text generation models
models = api.list_models(
    filter="text-generation",
    sort="downloads",
    direction=-1,
    limit=10
)

for m in models:
    print(f"  {m.id} ({m.downloads:,} DL)")

3.2 Exploring Datasets

# Search for popular datasets
datasets_list = api.list_datasets(
    sort="downloads",
    direction=-1,
    limit=10
)

for ds in datasets_list:
    print(f"  {ds.id}")

3.3 Exploring Spaces

Spaces are web applications hosted for free. Available types:

  • Gradio: interactive ML interfaces (most popular)
  • Streamlit: data dashboards and apps
  • Docker: custom containers
  • Static: static websites (HTML/CSS/JS)

# List popular Spaces
spaces = api.list_spaces(sort="likes", direction=-1, limit=10)
for s in spaces:
    print(f"  {s.id} - {s.likes} likes")

3.4 Exploring Papers

HuggingFace now integrates AI research papers with:

  • Direct links to associated models and datasets
  • Community discussions on each paper
  • Daily Papers: the best papers of the day, voted on by the community

4. HuggingFace Python Libraries in Detail

4.1 Transformers - The Core of the Ecosystem

from transformers import pipeline

# Text classification pipeline
classifier = pipeline("text-classification", model="nlptown/bert-base-multilingual-uncased-sentiment")
result = classifier("This product is excellent, I highly recommend it!")
print(result)
# [{'label': '5 stars', 'score': 0.73}]

# Text generation pipeline
generator = pipeline("text-generation", model="gpt2")
text = generator("Artificial intelligence will", max_length=50)
print(text[0]['generated_text'])

# Translation pipeline
translator = pipeline("translation_en_to_fr", model="Helsinki-NLP/opus-mt-en-fr")
result = translator("HuggingFace is the best platform for AI.")
print(result[0]['translation_text'])

# Question answering pipeline
qa = pipeline("question-answering", model="deepset/roberta-base-squad2")
result = qa(
    question="What is HuggingFace?",
    context="HuggingFace is an open-source AI platform founded in 2016."
)
print(result['answer'])

4.2 Datasets - Load and Process Data

from datasets import load_dataset

# Load a popular dataset
dataset = load_dataset("squad_v2")
print(dataset)
print(dataset['train'][0])

# Load with streaming (for large datasets)
# Note: the legacy "wikipedia" script dataset is deprecated in recent
# versions of datasets; use the Parquet-based "wikimedia/wikipedia" instead.
dataset = load_dataset("wikimedia/wikipedia", "20231101.en", streaming=True)
for example in dataset['train']:
    print(example['title'])
    break

# Filter and transform
dataset = load_dataset("imdb")
small_dataset = dataset['train'].select(range(100))
filtered = small_dataset.filter(lambda x: x['label'] == 1)
print(f"Positive reviews: {len(filtered)}")

4.3 HuggingFace Hub - Interact with the Hub

from huggingface_hub import HfApi

api = HfApi()

# Download a file
api.hf_hub_download(repo_id="gpt2", filename="config.json")

# List files in a repo
files = api.list_repo_files("gpt2")
print(files[:5])

# Get model info
info = api.model_info("gpt2")
print(f"Model: {info.id}")
print(f"Downloads: {info.downloads:,}")
print(f"Likes: {info.likes}")

4.4 Gradio - Create Web Interfaces

import gradio as gr

def greet(name):
    return f"Hello {name}! Welcome to HuggingFace."

demo = gr.Interface(
    fn=greet,
    inputs=gr.Textbox(label="Your name"),
    outputs=gr.Textbox(label="Message"),
    title="My First Gradio Space",
    description="A simple application to discover Gradio"
)

# demo.launch()  # Launch locally
# demo.launch(share=True)  # Launch with public link

4.5 Accelerate - Distributed Training

from accelerate import Accelerator

accelerator = Accelerator()

# Accelerate automatically handles:
# - Multi-GPU
# - Mixed precision (fp16, bf16)
# - DeepSpeed
# - FSDP
# Without changing your training code!

# model, optimizer, dataloader = accelerator.prepare(
#     model, optimizer, dataloader
# )

4.6 PEFT - Efficient Fine-tuning

from peft import LoraConfig, get_peft_model

# LoRA configuration for efficient fine-tuning
lora_config = LoraConfig(
    r=16,           # Decomposition rank
    lora_alpha=32,  # Scale factor
    target_modules=["q_proj", "v_proj"],  # Target modules
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

# Apply LoRA to the model
# peft_model = get_peft_model(model, lora_config)
# peft_model.print_trainable_parameters()
# Output: "trainable params: 4,194,304 || all params: 6,742,609,920 || trainable%: 0.0622%"

4.7 TRL - Reinforcement Learning Training

from trl import SFTTrainer, SFTConfig

# SFTTrainer for Supervised Fine-Tuning
# trainer = SFTTrainer(
#     model=model,
#     train_dataset=dataset,
#     args=SFTConfig(output_dir="./results", max_seq_length=512),
#     peft_config=lora_config,
# )
# trainer.train()

5. Social and Community Features

5.1 Likes and Bookmarks

  • Like: show your appreciation for a model/dataset/Space
  • Bookmark: save for easy retrieval later

# Like a model programmatically
api.like("meta-llama/Llama-3.1-8B")

# View your likes
likes = api.list_liked_repos("AYI-NEDJIMI")

5.2 Following Users and Organizations

Follow your favorite researchers and organizations to stay informed about their publications.

5.3 Organizations

Organizations allow you to:

  • Group models/datasets under a single entity
  • Manage team permissions
  • Have a professional public profile

5.4 Collections

Collections let you organize and share curated sets of repos:

# Create a collection
# api.create_collection(
#     title="My NLP Models",
#     description="Curated collection of NLP models"
# )

Discover our CyberSec AI collection: CyberSec AI Portfolio


6. HuggingFace Pro and Enterprise

6.1 HuggingFace Pro ($9/month)

  • Inference API: increased limits
  • Spaces: improved hardware (GPU)
  • Private repos: unlimited
  • Early access: new features before everyone else
  • Pro badge on your profile

6.2 HuggingFace Enterprise

For businesses, HuggingFace offers:

  • Private Hub: dedicated Hub instance
  • SSO/SAML: enterprise authentication
  • Audit logs: complete traceability
  • SLA: availability guarantee
  • Premium support: dedicated team
  • Inference Endpoints: production deployment
  • GPU hardware: A100, H100, etc.

6.3 Spaces Pricing

| Tier | Hardware | Price |
| --- | --- | --- |
| Free | 2 vCPU, 16 GB RAM | Free |
| CPU Upgrade | 8 vCPU, 32 GB RAM | $0.03/h |
| T4 small | Nvidia T4, 4 GB VRAM | $0.06/h |
| T4 medium | Nvidia T4, 16 GB VRAM | $0.09/h |
| A10G small | Nvidia A10G, 24 GB VRAM | $0.10/h |
| A100 large | Nvidia A100, 80 GB VRAM | $4.13/h |
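The hourly rates above make it easy to estimate what a Space will cost before you upgrade its hardware. A quick back-of-the-envelope helper (the rates are hard-coded from the table above and will change over time; the tier keys are just labels chosen for this sketch):

```python
# Estimate the monthly cost of a Space from hourly rates.
# Rates are a snapshot of the pricing table above and will change over time.
RATES = {
    "cpu-upgrade": 0.03,
    "t4-small": 0.06,
    "t4-medium": 0.09,
    "a10g-small": 0.10,
    "a100-large": 4.13,
}

def monthly_cost(tier: str, hours_per_day: float = 24, days: int = 30) -> float:
    """Estimated cost in USD for running a Space for one month."""
    return RATES[tier] * hours_per_day * days

# A T4-small Space running around the clock:
print(f"${monthly_cost('t4-small'):.2f}/month")  # $43.20/month
```

Note that Spaces can be configured to sleep after a period of inactivity, which is why `hours_per_day` is a parameter: a demo that is only awake a few hours a day costs a fraction of the always-on figure.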

7. Practical First Steps

7.1 Your First Model

from transformers import AutoTokenizer, AutoModelForCausalLM

# Load a model and its tokenizer
model_name = "distilgpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Generate text
input_text = "Artificial intelligence"
inputs = tokenizer(input_text, return_tensors="pt")
outputs = model.generate(**inputs, max_length=100)
result = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(result)

7.2 Your First Dataset

from datasets import load_dataset, Dataset

# Create a dataset from a dictionary
data = {
    "question": ["What is HuggingFace?", "What is LoRA?"],
    "answer": ["Open-source AI platform", "Efficient fine-tuning method"]
}
dataset = Dataset.from_dict(data)
print(dataset)
print(dataset[0])

# Save and load
dataset.save_to_disk("./my_dataset")
# dataset.push_to_hub("my-username/my-dataset")

7.3 Your First Space

  1. Go to huggingface.co/new-space
  2. Choose a name and SDK (Gradio recommended)
  3. Edit app.py directly in the browser:
import gradio as gr
from transformers import pipeline

classifier = pipeline("sentiment-analysis")

def analyze(text):
    result = classifier(text)
    return f"Sentiment: {result[0]['label']} (confidence: {result[0]['score']:.2%})"

demo = gr.Interface(
    fn=analyze,
    inputs=gr.Textbox(placeholder="Enter text..."),
    outputs="text",
    title="Sentiment Analysis",
)
demo.launch()

8. Best Practices and Tips

8.1 For Beginners

  1. Start with pipelines: they abstract away complexity
  2. Explore the Hub: filter by task and language
  3. Read model cards: they explain usage and limitations
  4. Join the community: Discord, forums, discussions
  5. Follow the courses: huggingface.co/learn

8.2 For Developers

  1. Use fine-grained tokens for security
  2. Enable caching to avoid re-downloading models
  3. Use streaming for large datasets
  4. Document your repos with complete model/dataset cards
  5. Version your models with git branches
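Tips 2 and 3 above are mostly a matter of configuration: the HuggingFace libraries read a few documented environment variables at import time. A minimal sketch (`HF_HOME` and `HF_HUB_OFFLINE` are real variables; the cache path here is just an example, and these must be set before importing transformers/datasets/huggingface_hub):

```python
import os

# Point all HuggingFace caches at a single directory (e.g. a large disk),
# so models and datasets are downloaded once and reused across projects.
os.environ["HF_HOME"] = "/data/hf-cache"

# Force fully offline mode: the libraries will only use already-cached files
# instead of re-downloading (useful on air-gapped or metered machines).
os.environ["HF_HUB_OFFLINE"] = "1"

print(os.environ["HF_HOME"])  # /data/hf-cache
```

Setting these in your shell profile or container image keeps the behavior consistent across scripts without touching the code itself.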

8.3 Learning Resources

  • Official courses: huggingface.co/learn
  • Documentation: huggingface.co/docs
  • Community forum: discuss.huggingface.co

Conclusion

HuggingFace is much more than just a platform: it's a complete ecosystem that democratizes AI. With its open-source libraries, collaborative Hub, and active community, it offers everything you need to start, experiment, and deploy AI solutions.

Feel free to explore our portfolio of models and Spaces dedicated to cybersecurity: CyberSec AI Portfolio


Tutorial written by AYI-NEDJIMI - AI & Cybersecurity Consultant
For more resources, check out our other tutorials in this series.
