Instructions to use Stefano-M-Community/aixpa_no_ground with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use Stefano-M-Community/aixpa_no_ground with PEFT:

from peft import PeftModel
from transformers import AutoModelForCausalLM

base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3.1-8B-Instruct")
model = PeftModel.from_pretrained(base_model, "Stefano-M-Community/aixpa_no_ground")

Transformers

How to use Stefano-M-Community/aixpa_no_ground with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="Stefano-M-Community/aixpa_no_ground")

# Load model directly
from transformers import AutoModel
model = AutoModel.from_pretrained("Stefano-M-Community/aixpa_no_ground", dtype="auto")

Notebooks
Google Colab
Kaggle
Local Apps Settings

vLLM

How to use Stefano-M-Community/aixpa_no_ground with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "Stefano-M-Community/aixpa_no_ground"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "Stefano-M-Community/aixpa_no_ground",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'

Use Docker

docker model run hf.co/Stefano-M-Community/aixpa_no_ground

SGLang

How to use Stefano-M-Community/aixpa_no_ground with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "Stefano-M-Community/aixpa_no_ground" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "Stefano-M-Community/aixpa_no_ground",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "Stefano-M-Community/aixpa_no_ground" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "Stefano-M-Community/aixpa_no_ground",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'

Docker Model Runner
How to use Stefano-M-Community/aixpa_no_ground with Docker Model Runner:
```
docker model run hf.co/Stefano-M-Community/aixpa_no_ground
```

AiXPA Fine-tuned Llama 3.1 8B Model (No Ground Document)

This model is a fine-tuned version of Meta-Llama-3.1-8B-Instruct, specialized for the AiXPA project in the domain of Italian Public Administration (PA). It was trained using supervised fine-tuning (SFT) with LoRA (Low-Rank Adaptation) techniques on a dialogue dataset between an assistant and a PA user, without reference documents as context.

Model Details

Model Description

This model is based on Meta-Llama-3.1-8B-Instruct and has been fine-tuned using the Stefano-M-Community/final_all_no_ground dataset for Italian Public Administration dialogue tasks. The model uses 4-bit quantization and LoRA adapters for efficient training and inference, making it suitable for deployment on consumer hardware while maintaining strong performance in PA-specific conversations without reference documents as context.

Developed by: LanD (FBK)
Model type: Causal Language Model (Fine-tuned)
Language(s) (NLP): Italian (primarily)
License: Please refer to the original Llama 3.1 license
Finetuned from model: meta-llama/Meta-Llama-3.1-8B-Instruct

Model Sources [optional]

Repository: [More Information Needed]
Paper [optional]: [More Information Needed]
Demo [optional]: [More Information Needed]

Uses

Direct Use

This model can be used directly for text generation tasks, particularly those related to the domain it was fine-tuned on. The model maintains the instruction-following capabilities of the base Llama 3.1 model while being specialized for specific use cases defined in the training dataset. This variant is particularly suited for scenarios where reference documents are not available as context.

Downstream Use

The model can be further fine-tuned for specific tasks or integrated into larger applications that require text generation capabilities. The LoRA adapters make it easy to switch between different specialized versions. This variant may be particularly useful for applications that need to operate without reference ground truth data.

Out-of-Scope Use

This model should not be used for generating harmful, misleading, or inappropriate content. It may not perform well on tasks significantly different from its training domain without additional fine-tuning. The model is specifically designed for scenarios without ground truth, so it may not be optimal for tasks that heavily rely on reference data.

Bias, Risks, and Limitations

This model inherits the biases and limitations present in the base Llama 3.1 model and may have additional biases introduced through the fine-tuning dataset. Key considerations include:

Domain Specificity: The model has been fine-tuned on a specific dataset and may not generalize well to domains outside its training scope
No Ground Document Dependency: This variant is trained without reference documents as context, which may affect its performance on tasks requiring document-based evaluation
Quantization Effects: 4-bit quantization may introduce minor degradation in model performance compared to full precision
Context Limitations: Maximum context length of 4,200 tokens may limit performance on very long documents
Language Bias: Primarily trained on Italian content, may have limited performance in other languages

Recommendations

Thoroughly evaluate the model on your specific use case before deployment
Consider the potential for biased outputs and implement appropriate safeguards
Monitor model performance and outputs in production environments
Be aware of the model's training domain when applying to new tasks
Consider additional fine-tuning for specialized applications outside the training domain
This variant is particularly suitable for scenarios where reference documents are not available as context

How to Get Started with the Model

Use the code below to get started with the model:

from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel
import torch

# Load the base model and tokenizer
base_model_id = "meta-llama/Meta-Llama-3.1-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(base_model_id)
base_model = AutoModelForCausalLM.from_pretrained(
    base_model_id,
    torch_dtype=torch.float16,
    device_map="auto"
)

# Load the LoRA adapter
model = PeftModel.from_pretrained(base_model, "Stefano-M-Community/aixpa_no_ground")

# Generate text
prompt = "Ciao, mi aiuti a scrivere un'azione sullo sport?"
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=100, do_sample=True, temperature=0.7)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)

Training Details

Training Data

The model was fine-tuned on the Stefano-M-Community/final_all_no_ground dataset from Hugging Face, which contains Italian Public Administration dialogue data between an assistant and PA users without reference documents. This dataset was used for both training and evaluation.

Training Procedure

The model was trained using supervised fine-tuning (SFT) with LoRA (Low-Rank Adaptation) techniques. The training utilized 4-bit quantization for memory efficiency and multi-GPU training with 4 processes.

Training Hyperparameters

Training regime: Mixed precision training with 4-bit quantization
LoRA Configuration:
- Rank: 16
- Alpha: 32
- Dropout: 0.0
Sequence Length: 4,200 tokens
Learning Rate: 5e-5
Scheduler: Cosine annealing
Batch Size: 4 (training), 1 (evaluation)
Gradient Accumulation Steps: 2
Number of Epochs: 10
Weight Decay: 0.01
Warmup Ratio: 0.03
Early Stopping Patience: 5 epochs

Training Infrastructure

Hardware: Multi-GPU setup (4 processes)
Framework:
- Accelerate for distributed training
- DeepSpeed for optimization
- PEFT for LoRA implementation
Logging: Weights & Biases (WandB)
Evaluation Frequency: Every 35 steps
Checkpoint Saving: Every 35 steps

Evaluation

Testing Data, Factors & Metrics

Testing Data

The model was evaluated using the same dataset used for training: Stefano-M-Community/final_all_no_ground. Evaluation was performed every 35 training steps to monitor training progress and prevent overfitting.

Factors

Training Progress: Monitored throughout training with early stopping patience of 5 epochs
Loss Metrics: Custom loss function implementation for supervised fine-tuning
Computational Efficiency: Evaluated performance with 4-bit quantization
No Ground Document Scenarios: Specialized evaluation for scenarios without reference documents as context

Metrics

Training Loss: Monitored during training with logging every 10 steps
Evaluation Loss: Computed every 35 steps on the evaluation dataset
Early Stopping: Implemented with patience of 5 epochs to prevent overfitting

Results

Evaluation results are logged in Weights & Biases during training. The model was trained for up to 10 epochs with early stopping mechanism to ensure optimal performance without overfitting.

Evaluation Loss Performance:

The model (purple line in eval/loss graph) shows a rapid decrease from ~1.23 at step 0 to ~0.86 around step 18-20
Minimum loss achieved: approximately 0.86 around step 18-20
Loss then increases to ~0.97-0.98 between steps 35-40, and ~1.03 at step 43
The model shows signs of overfitting after the minimum point, which is typical for this training approach

Summary

The fine-tuned model demonstrates improved performance on Italian Public Administration dialogue tasks while maintaining the general capabilities of the base Llama 3.1 model. The LoRA adaptation approach allows for efficient fine-tuning while preserving most of the original model's knowledge. This variant is specifically optimized for PA conversations without reference documents as context.

Model Examination

The model uses LoRA (Low-Rank Adaptation) which allows for parameter-efficient fine-tuning. This approach:

Preserves the original model weights while adding small adapter modules
Enables efficient switching between different task-specific adaptations
Reduces memory requirements during training and inference
Maintains interpretability by keeping the base model architecture intact
This variant is specifically designed for Italian language tasks without reference documents as context

Environmental Impact

Carbon emissions can be estimated using the Machine Learning Impact calculator presented in Lacoste et al. (2019).

The environmental impact of this model is reduced compared to training from scratch due to:

Efficient Training: LoRA adaptation requires significantly less compute than full model training
4-bit Quantization: Reduces memory usage and energy consumption during training
Hardware Type: Multi-GPU setup (specific hardware configuration may vary)
Training Approach: Parameter-efficient fine-tuning reduces overall computational requirements

Note: Specific carbon emission calculations would require detailed hardware specifications and training duration measurements.

Technical Specifications

Model Architecture and Objective

Base Architecture: Llama 3.1 (8B parameters)
Adaptation Method: LoRA (Low-Rank Adaptation)
Objective: Supervised Fine-tuning for Italian Public Administration dialogue tasks without reference documents as context
Quantization: 4-bit quantization for efficient training and inference
Maximum Context Length: 4,200 tokens

Compute Infrastructure

Hardware

Training Setup: Multi-GPU configuration (4 processes)
Memory Optimization: 4-bit quantization with LoRA adapters
Distributed Training: Accelerate framework for multi-GPU coordination

Software

Framework: PyTorch with Transformers library
Training Libraries:
- PEFT 0.17.1 (Parameter-Efficient Fine-Tuning)
- Accelerate (distributed training)
- DeepSpeed (optimization)
- TRL (Transformer Reinforcement Learning)
Monitoring: Weights & Biases (WandB)
Configuration Management: DeepSpeed configuration for memory optimization

Citation

BibTeX:

@misc{aixpa_llama31_8b_lora_no_ground,
  title={AiXPA Fine-tuned Llama 3.1 8B Model (No Ground Document)},
  author={LanD (FBK)},
  year={2025},
  howpublished={Hugging Face Model Repository},
  note={Fine-tuned from meta-llama/Meta-Llama-3.1-8B-Instruct using LoRA, trained on Italian Public Administration dialogue data without reference documents}
}

APA:

LanD (FBK). (2025). AiXPA Fine-tuned Llama 3.1 8B Model (No Ground Document). Hugging Face Model Repository. Fine-tuned from meta-llama/Meta-Llama-3.1-8B-Instruct using LoRA, trained on Italian Public Administration dialogue data without reference documents.

Glossary

LoRA (Low-Rank Adaptation): A parameter-efficient fine-tuning technique that adds trainable low-rank matrices to existing model weights
SFT (Supervised Fine-Tuning): Training method using labeled data to improve model performance on specific tasks
4-bit Quantization: Technique to reduce model memory usage by representing weights with 4-bit precision
Multi-GPU Training: Distributed training approach using multiple GPUs to accelerate training
No Ground Document: Training approach that does not rely on reference documents as context