---
library_name: transformers
license: mit
language:
- gl
- es
base_model:
- BSC-LT/salamandra-7b-instruct
datasets:
- proxectonos/corpus_dominio_cientifico
---
# Carballo-Science

## Table of Contents

<details>
<summary>Click to expand</summary>

- [Carballo-Science](#carballo-science)
  - [Table of Contents](#table-of-contents)
  - [Model description](#model-description)
  - [Intended uses and limitations](#intended-uses-and-limitations)
  - [How to use](#how-to-use)
  - [Training](#training)
    - [Training data](#training-data)
    - [Training hyperparameters](#training-hyperparameters)
    - [Framework](#framework)
  - [Evaluation](#evaluation)
  - [Additional information](#additional-information)
    - [Funding](#funding)
    - [Cite this model](#cite-this-model)

</details>
## Model description

**Carballo-Science** is a specialized 7B-parameter instruction-tuned model designed for **scientific text understanding and generation** in **Galician (GL)** and **Spanish (ES)**.

It is based on the foundation model [BSC-LT/salamandra-7b-instruct](https://huggingface.co/BSC-LT/salamandra-7b-instruct) and has been further trained on high-quality scientific corpora extracted from diverse sources.
## Intended uses and limitations

**Intended uses**

- Scientific-oriented text generation (summaries, rephrasing, explanations).
- Chat-style scientific assistance (non-professional).

**Limitations**

- May produce incomplete or incorrect scientific statements.
- Not suitable for high-stakes applications or scientific decision-making.
- Works best for GL and ES; other languages were not reinforced in this checkpoint.
## How to use

```python
from datetime import datetime

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "proxectonos/Carballo-Science"
text = "Qué sabes sobre o Proxecto Nós?"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype=torch.bfloat16
)

# Build the chat prompt with the model's chat template (the template takes a date string)
message = [{"role": "user", "content": text}]
date_string = datetime.today().strftime('%Y-%m-%d')
prompt = tokenizer.apply_chat_template(
    message,
    tokenize=False,
    add_generation_prompt=True,
    date_string=date_string
)

# Tokenize the prompt and generate a completion
inputs = tokenizer.encode(prompt, add_special_tokens=False, return_tensors="pt")
outputs = model.generate(input_ids=inputs.to(model.device), max_new_tokens=200)

# Keep only the newly generated tokens and cut at the end-of-turn reserved token
generated_tokens = outputs[0][len(inputs[0]):]
response = tokenizer.decode(generated_tokens, skip_special_tokens=False).strip()
response = response.split("<|reserved_token_1|>")[0].strip()
print(response)
```
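For quick experiments, the same checkpoint can also be loaded through the Transformers `pipeline` helper. This is a minimal sketch assuming a recent Transformers release that accepts chat-style message input; the generation settings are illustrative, not tuned recommendations.

```python
import torch
from transformers import pipeline

# Minimal sketch: load the model through the high-level text-generation pipeline.
pipe = pipeline(
    "text-generation",
    model="proxectonos/Carballo-Science",
    torch_dtype=torch.bfloat16,   # assumes a GPU with bf16 support
    device_map="auto",
)

messages = [{"role": "user", "content": "Qué sabes sobre o Proxecto Nós?"}]
out = pipe(messages, max_new_tokens=200)

# With chat-style input, recent pipeline versions return the full message list;
# the last message holds the assistant's reply.
print(out[0]["generated_text"][-1]["content"])
```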
## Training

### Training data

The model was trained on a mixture of general instructions and domain-specific scientific texts.

| **Dataset type** | **Languages** | **Sources** |
|------------------|---------------|-------------|
| Instruction set | GL, ES, PT, CAT, EN | [Galician Instruction Datasets](https://github.com/proxectonos/instruction_datasets) |
| Scientific corpus | GL, ES | Wikipedia, PhD theses |
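As an illustration of how chat-style instruction records can be turned into training text, the sketch below renders a conversation with the model's chat template and tokenizes it to the block size reported under the hyperparameters that follow. The record layout and example contents are hypothetical, not the published schema of the datasets listed above.

```python
from datetime import datetime
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("proxectonos/Carballo-Science")

# Hypothetical instruction record; field names and contents are illustrative only.
example = {
    "messages": [
        {"role": "user", "content": "Resume este abstract científico en galego: ..."},
        {"role": "assistant", "content": "O estudo analiza ..."},
    ]
}

# Render the conversation with the model's chat template (which takes a date string,
# as in the inference example above) and tokenize it, truncating to the 2048-token
# block size reported for training.
text = tokenizer.apply_chat_template(
    example["messages"],
    tokenize=False,
    date_string=datetime.today().strftime("%Y-%m-%d"),
)
input_ids = tokenizer(text, truncation=True, max_length=2048)["input_ids"]
print(len(input_ids))
```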
### Training hyperparameters

- **epochs:** 0.5
- **dtype:** bf16
- **block size:** 2048
- **total batch size:** 128
- **learning rate:** 2e-6
- **scheduler:** linear
- **optimizations:**
  - gradient checkpointing: True
  - flash attention: True
  - Liger kernels: True
  - DeepSpeed stage: 2
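As a rough illustration only, the sketch below maps these settings onto Hugging Face `TrainingArguments`. The actual training script, the per-device batch-size split, and the DeepSpeed configuration file are not published here, so those details are assumptions.

```python
from transformers import TrainingArguments

# Illustrative mapping of the reported hyperparameters; the per-device batch size,
# accumulation steps, and the DeepSpeed config path are assumptions, not the
# released training setup.
training_args = TrainingArguments(
    output_dir="carballo-science-ft",
    num_train_epochs=0.5,
    bf16=True,                          # dtype: bf16
    per_device_train_batch_size=8,      # 8 x 4 accumulation steps x 4 GPUs = 128 total
    gradient_accumulation_steps=4,
    learning_rate=2e-6,
    lr_scheduler_type="linear",
    gradient_checkpointing=True,
    deepspeed="ds_zero2_config.json",   # hypothetical DeepSpeed ZeRO stage-2 config file
)
# Flash attention is typically enabled when loading the model, e.g. with
# attn_implementation="flash_attention_2" in AutoModelForCausalLM.from_pretrained;
# Liger kernels can be toggled via the use_liger_kernel flag in recent Transformers releases.
```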
### Framework

Training was performed at the **Galician Supercomputing Center (CESGA)** on **2 nodes** with **2× NVIDIA A100 40GB** each, totaling **4 GPUs**, over **2 days**.
## Evaluation

Formal evaluation is in progress. Early observations show improved handling of scientific terminology, structured documents, and technical phrasing in GL and ES.
## Additional information

### Funding

This work is funded by the Ministerio para la Transformación Digital y de la Función Pública and by the EU – NextGenerationEU, within the framework of the project Desarrollo de Modelos ALIA.

### Cite this model

Please cite the model as follows:
```bibtex
@misc{carballo_science_2025,
  title        = {Carballo-Science: A Science Domain Instruction-Tuned Model for Galician and Spanish},
  author       = {Proxecto Nós Team},
  year         = {2025},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/proxectonos/Carballo-Science}},
}
```