How to use Geraldine/FineLlama-3.2-3B-Instruct-ead with libraries, inference providers, notebooks, and local apps.
- Libraries
- Transformers
How to use Geraldine/FineLlama-3.2-3B-Instruct-ead with Transformers:
```python
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="Geraldine/FineLlama-3.2-3B-Instruct-ead")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)
```

```python
# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("Geraldine/FineLlama-3.2-3B-Instruct-ead")
model = AutoModelForCausalLM.from_pretrained("Geraldine/FineLlama-3.2-3B-Instruct-ead")
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))
```

- Notebooks
- Google Colab
- Kaggle
- Local Apps
- vLLM
How to use Geraldine/FineLlama-3.2-3B-Instruct-ead with vLLM:
Install from pip and serve model
```shell
# Install vLLM from pip:
pip install vllm

# Start the vLLM server:
vllm serve "Geraldine/FineLlama-3.2-3B-Instruct-ead"

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "Geraldine/FineLlama-3.2-3B-Instruct-ead",
    "messages": [
      {"role": "user", "content": "What is the capital of France?"}
    ]
  }'
```

Use Docker

```shell
docker model run hf.co/Geraldine/FineLlama-3.2-3B-Instruct-ead
```
- SGLang
How to use Geraldine/FineLlama-3.2-3B-Instruct-ead with SGLang:
Install from pip and serve model
```shell
# Install SGLang from pip:
pip install sglang

# Start the SGLang server:
python3 -m sglang.launch_server \
  --model-path "Geraldine/FineLlama-3.2-3B-Instruct-ead" \
  --host 0.0.0.0 \
  --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "Geraldine/FineLlama-3.2-3B-Instruct-ead",
    "messages": [
      {"role": "user", "content": "What is the capital of France?"}
    ]
  }'
```

Use Docker images
```shell
docker run --gpus all \
  --shm-size 32g \
  -p 30000:30000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HF_TOKEN=<secret>" \
  --ipc=host \
  lmsysorg/sglang:latest \
  python3 -m sglang.launch_server \
    --model-path "Geraldine/FineLlama-3.2-3B-Instruct-ead" \
    --host 0.0.0.0 \
    --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "Geraldine/FineLlama-3.2-3B-Instruct-ead",
    "messages": [
      {"role": "user", "content": "What is the capital of France?"}
    ]
  }'
```

- Docker Model Runner
How to use Geraldine/FineLlama-3.2-3B-Instruct-ead with Docker Model Runner:
```shell
docker model run hf.co/Geraldine/FineLlama-3.2-3B-Instruct-ead
```
FineLlama-3.2-3B-Instruct-ead
This repository contains a fine-tuned version of Llama-3.2-3B-Instruct, trained specifically to understand and generate EAD (Encoded Archival Description) XML for describing archival records.
Model Description
- Base Model: meta-llama/Llama-3.2-3B-Instruct
- Training Dataset: Geraldine/Ead-Instruct-38k
- Task: Generation of EAD/XML compliant archival descriptions
- Training Type: Instruction fine-tuning with PEFT (Parameter Efficient Fine-Tuning) using LoRA
Key Features
- Specialized in generating EAD/XML format for archival metadata
- Trained on a comprehensive dataset of EAD/XML examples
- Optimized for archival description tasks
- Memory efficient through 4-bit quantization
Training Details
Technical Specifications
Quantization: 4-bit quantization using bitsandbytes
- NF4 quantization type
- Double quantization enabled
- bfloat16 compute dtype
LoRA Configuration
- r: 256
- alpha: 128
- dropout: 0.05
- target modules: all-linear
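The LoRA settings above can be sketched as a `peft` configuration. This is a reconstruction from the listed hyperparameters, not the exact training script; see the Kaggle notebook for the authoritative version:

```python
from peft import LoraConfig

# LoRA configuration matching the hyperparameters listed above
lora_config = LoraConfig(
    r=256,                        # rank of the low-rank update matrices
    lora_alpha=128,               # scaling factor (effective scale = alpha / r = 0.5)
    lora_dropout=0.05,
    target_modules="all-linear",  # apply LoRA adapters to every linear layer
    task_type="CAUSAL_LM",
)
```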
Training parameters
- Epochs: 3
- Batch Size: 3
- Gradient Accumulation Steps: 2
- Learning Rate: 2e-4
- Warmup Ratio: 0.03
- Max Sequence Length: 4096
- Scheduler: Constant
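These hyperparameters map onto a `trl` SFT configuration roughly as follows. This is a hedged sketch (argument names follow recent `trl` releases, and `output_dir` is a placeholder), not the exact arguments used in the training notebook:

```python
from trl import SFTConfig

# Training arguments reconstructed from the list above
training_args = SFTConfig(
    num_train_epochs=3,
    per_device_train_batch_size=3,
    gradient_accumulation_steps=2,  # effective batch size = 3 * 2 = 6 per device
    learning_rate=2e-4,
    warmup_ratio=0.03,
    max_seq_length=4096,
    lr_scheduler_type="constant",
    optim="adamw_torch_fused",
    bf16=True,                      # or fp16=True, depending on hardware support
    output_dir="outputs",           # placeholder path
)
```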
Training Infrastructure
- Libraries: transformers, peft, trl
- Mixed Precision: FP16/BF16 (based on hardware support)
- Optimizer: Fused AdamW
Training Notebook
The training notebook is available on Kaggle.
Usage
Installation
```shell
pip install transformers torch bitsandbytes accelerate
```
Loading the model
```python
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
import torch

# Configure 4-bit quantization
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model_name = "Geraldine/FineLlama-3.2-3B-Instruct-ead"

# Load model and tokenizer (device_map places the quantized model on GPU;
# calling .to("cuda") on a 4-bit quantized model is not supported)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    quantization_config=bnb_config,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
```
Example usage
```python
messages = [
    {"role": "system", "content": "You are an expert in EAD/XML generation for archival records metadata."},
    {"role": "user", "content": "Generate a minimal and compliant <eadheader> template with all required EAD/XML tags"},
]

inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,  # Must add for generation
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=4096,
    pad_token_id=tokenizer.eos_token_id,
    use_cache=True,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
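Since generated output can occasionally be truncated or malformed, it may be worth checking well-formedness before storing a result. A minimal check using only the standard library (the sample strings are illustrative, not actual model output):

```python
import xml.etree.ElementTree as ET

def is_well_formed(xml_text: str) -> bool:
    """Return True if the text parses as well-formed XML."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

# Illustrative snippets, not actual model output
print(is_well_formed("<eadheader><eadid>example</eadid></eadheader>"))  # True
print(is_well_formed("<eadheader><eadid>example</eadheader>"))          # False (mismatched tag)
```

Note that this only checks XML well-formedness, not compliance with the EAD schema; schema validation would require an external validator and the EAD DTD or XSD.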
Limitations
- The model is specifically trained for EAD/XML format and may not perform well on general archival tasks
- Performance depends on the quality and specificity of the input prompts
- Maximum sequence length is limited to 4096 tokens
Citation
BibTeX:
```bibtex
@misc{ead-llama,
  author = {Géraldine Geoffroy},
  title = {EAD-XML LLaMa: Fine-tuned LLaMa Model for Archival Description},
  year = {2024},
  publisher = {HuggingFace},
  journal = {HuggingFace Repository},
  howpublished = {\url{https://huggingface.co/Geraldine/FineLlama-3.2-3B-Instruct-ead}}
}
```
Licence
This model is subject to the same license as the base Llama model. Please refer to Meta's Llama 3.2 license for usage terms and conditions.