---
library_name: transformers
license: mit
language:
- gl
- es
base_model:
- BSC-LT/salamandra-7b-instruct
datasets:
- proxectonos/corpus_dominio_cientifico
---
# Carballo-Science
## Table of Contents
<details>
<summary>Click to expand</summary>
- [Carballo-Science](#carballo-science)
- [Table of Contents](#table-of-contents)
- [Model description](#model-description)
- [Intended uses and limitations](#intended-uses-and-limitations)
- [How to use](#how-to-use)
- [Training](#training)
- [Tools](#tools)
- [Training data](#training-data)
- [Training hyperparameters](#training-hyperparameters)
- [Framework](#framework)
- [Evaluation](#evaluation)
- [Additional information](#additional-information)
- [Funding](#funding)
- [Cite this model](#cite-this-model)
</details>
## Model description
**Carballo-Science** is a specialized 7B-parameter instruction-tuned model designed for **scientific text understanding and generation** in **Galician (GL)** and **Spanish (ES)**.
It is based on the foundation model [BSC-LT/salamandra-7b-instruct](https://huggingface.co/BSC-LT/salamandra-7b-instruct) and has been further trained on high-quality scientific corpora extracted from diverse sources.
## Intended uses and limitations
**Intended uses**
- Scientific-oriented text generation (summaries, rephrasing, explanations).
- Chat-style scientific assistance (non-professional).
**Limitations**
- May produce incomplete or incorrect scientific statements.
- Not suitable for high-stakes scientific decision-making.
- Works best for GL and ES; other languages are not reinforced in this checkpoint.
## How to use
```python
from datetime import datetime

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "proxectonos/Carballo-Science"
text = "Qué sabes sobre o Proxecto Nós?"

# Load the tokenizer and the model in bfloat16, placing weights automatically
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

# Build the chat prompt; date_string is passed through to the chat template
message = [{"role": "user", "content": text}]
date_string = datetime.today().strftime("%Y-%m-%d")
prompt = tokenizer.apply_chat_template(
    message,
    tokenize=False,
    add_generation_prompt=True,
    date_string=date_string,
)

# Generate, then keep only the newly generated tokens
inputs = tokenizer.encode(prompt, add_special_tokens=False, return_tensors="pt")
outputs = model.generate(input_ids=inputs.to(model.device), max_new_tokens=200)
generated_tokens = outputs[0][len(inputs[0]):]

# Decode and cut the reply at the first end marker
response = tokenizer.decode(generated_tokens, skip_special_tokens=False).strip()
response = response.split("<|reserved_token_1|>")[0].strip()
print(response)
```
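
For chat-style use, the message list and the final trimming step from the snippet above can be factored into small helpers. The following is a minimal sketch, not part of the released code: `add_turn` and `trim_response` are hypothetical names, and the stop marker is simply the string the snippet above splits on.

```python
from typing import Dict, List, Optional

# Same marker the snippet above splits on (taken from this model card)
STOP_MARKER = "<|reserved_token_1|>"


def trim_response(decoded: str, stop_marker: str = STOP_MARKER) -> str:
    """Keep only the text before the first stop marker."""
    return decoded.split(stop_marker)[0].strip()


def add_turn(
    history: List[Dict[str, str]],
    user_text: str,
    assistant_text: Optional[str] = None,
) -> List[Dict[str, str]]:
    """Return a new message list with one more user turn and, optionally, its reply."""
    messages = list(history)  # copy so the caller's list is not mutated
    messages.append({"role": "user", "content": user_text})
    if assistant_text is not None:
        messages.append({"role": "assistant", "content": assistant_text})
    return messages


# Hypothetical two-turn conversation: one completed exchange plus a follow-up
history = add_turn([], "Que é a fotosíntese?", "É o proceso polo que as plantas converten luz en enerxía química.")
history = add_turn(history, "Explícalo en español.")
print(history)
print(trim_response("Resposta do modelo<|reserved_token_1|>texto sobrante"))
```

The resulting `history` list can be passed to `tokenizer.apply_chat_template` in place of `message`, and `trim_response` can replace the final `split` call.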
## Training
### Training data
The model was trained on a mixture of general instructions and domain-specific scientific texts.
| **Dataset Type** | **Languages** | **Sources** |
|------------------|---------------|-------------|
| Instruction set | GL, ES, PT, CAT, EN | [Galician Instruction Datasets](https://github.com/proxectonos/instruction_datasets) |
| Scientific corpus | GL, ES | Wikipedia, PhD theses |
### Training hyperparameters
- **epochs:** 0.5
- **dtype:** bf16
- **block size:** 2048
- **total batch size:** 128
- **learning rate:** 2e-6
- **scheduler:** Linear
- **optimizations:**
- gradient checkpointing: True
- flash attention: True
- liger kernels: True
- DeepSpeed stage: 2
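
As a rough illustration of what these values imply, the sketch below computes the number of tokens per optimizer step and the shape of a linear learning-rate schedule. It rests on assumptions not stated in this card: every sequence is taken to fill the full block size, the schedule is assumed to decay to zero with no warmup, and `total_steps` is a hypothetical value chosen only for the example.

```python
BLOCK_SIZE = 2048        # block size from the list above
TOTAL_BATCH_SIZE = 128   # total batch size from the list above
PEAK_LR = 2e-6           # learning rate from the list above

# Upper bound on tokens per optimizer step, assuming fully packed sequences
tokens_per_step = BLOCK_SIZE * TOTAL_BATCH_SIZE


def linear_lr(step: int, total_steps: int, peak_lr: float = PEAK_LR, warmup_steps: int = 0) -> float:
    """Linear warmup (optional) followed by linear decay to zero."""
    if warmup_steps and step < warmup_steps:
        return peak_lr * step / warmup_steps
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)


total_steps = 1000  # hypothetical; the real step count is not given in this card
print(tokens_per_step)  # 262144
for step in (0, 250, 500, 750, 1000):
    print(step, linear_lr(step, total_steps))
```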
### Framework
Training was performed at the **Galician Supercomputing Center (CESGA)** on **2 nodes** with **2× NVIDIA A100 40GB** GPUs each (**4 GPUs** in total) over **2 days**.
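
With 4 GPUs and a total batch size of 128, each GPU processes 32 sequences per optimizer step. The card does not say how this is split between micro-batch size and gradient accumulation, so the sketch below only enumerates the possible splits and builds an illustrative DeepSpeed-style configuration for one of them. The chosen split (4 × 8) and the helper names are assumptions, not the settings actually used.

```python
NODES = 2
GPUS_PER_NODE = 2
TOTAL_BATCH_SIZE = 128

world_size = NODES * GPUS_PER_NODE                 # 4 GPUs
per_gpu_sequences = TOTAL_BATCH_SIZE // world_size  # sequences per GPU per optimizer step


def possible_splits(per_gpu: int):
    """All (micro_batch, grad_accum) pairs whose product equals per_gpu."""
    return [(m, per_gpu // m) for m in range(1, per_gpu + 1) if per_gpu % m == 0]


def deepspeed_config(micro_batch: int, grad_accum: int) -> dict:
    """Illustrative ZeRO stage 2 + bf16 config; the batch keys must multiply out to the total."""
    assert micro_batch * grad_accum * world_size == TOTAL_BATCH_SIZE
    return {
        "train_batch_size": TOTAL_BATCH_SIZE,
        "train_micro_batch_size_per_gpu": micro_batch,
        "gradient_accumulation_steps": grad_accum,
        "bf16": {"enabled": True},
        "zero_optimization": {"stage": 2},
    }


print(per_gpu_sequences)
print(possible_splits(per_gpu_sequences))
print(deepspeed_config(micro_batch=4, grad_accum=8))  # hypothetical split
```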
## Evaluation
Formal evaluation is in progress. Early observations suggest improved handling of scientific terminology in GL and ES.
## Additional information
### Funding
This work is funded by the Ministerio para la Transformación Digital y de la Función Pública and by the EU (NextGenerationEU), within the framework of the project Desarrollo de Modelos ALIA.
### Cite this model
Please cite the model as follows:
```
@misc{carballo_science_2025,
title = {Carballo-Science: A Science Domain Instruction-Tuned Model for Galician and Spanish},
author = {Proxecto Nós Team},
year = {2025},
publisher = {HuggingFace},
howpublished = {\url{https://huggingface.co/proxectonos/Carballo-Science}},
}
```