Building Tucano 2: Open-Source Language Models That Actually Think in Portuguese

Community Article Published March 5, 2026

[Figure: Tucano 2 banner]

What does it take to build truly open AI models for a language spoken by over 250 million people? Portuguese is one of the most widely spoken languages on the planet, yet when it comes to open-source large language models (LLMs), there are serious gaps in transparency, performance, and reproducibility. At the same time, most multilingual models sprinkle some Portuguese on top of their huge English pile and call it a day. We wanted to do better.

Meet Tucano 2 — a family of open-source language models (0.5B to 3.7B parameters) built from the ground up for Portuguese. Everything is out in the open: the training data, the code, the recipes, and the evaluation tools. No black boxes. In this post, we'll walk you through how we built it, what we learned, and why it matters.

TL;DR (what are we releasing?)

  • Large-scale Portuguese corpus. A ~320B-token cleaned corpus with rich educational and toxicity annotations (GigaVerbo-v2), plus 9.3B synthetic tokens for domain coverage (GigaVerbo-v2 Synth).
  • Annotation and filtering models. Educational, toxicity, and instruction-quality classifiers trained from new auxiliary datasets and released for community use.
  • Diverse SFT data. A large supervised fine-tuning collection spanning coding, tool calling, structured output, reasoning, and mathematics (GigaVerbo-v2 SFT).
  • Dual-reasoning preference dataset. A new resource for preference optimization that captures complementary aspects of reasoning quality (GigaVerbo-v2 Preferences).
  • Tucano 2 model family. Open Portuguese foundation models from 0.5B–3.7B parameters that outperform most prior Portuguese models of similar size.
  • Portuguese evaluation suite. Benchmarks covering early- and late-stage training, including long-context evaluation.
  • Fully open release. All datasets, models, training recipes, and evaluation code are publicly released under permissive licenses.

Why Portuguese Needs Its Own Models

You might be wondering: don't the big multilingual models already handle Portuguese? Sort of. Models like Qwen3, Gemma 3, and Falcon 3 do include Portuguese in their massive training corpora. But there's a catch — they're not optimized for it (and that is why almost any fine-tuning on Portuguese data can yield significant improvements). And the Portuguese-specific models that do exist? Many of them don't share their data, their training details, or their evaluation setups. That makes it nearly impossible for other researchers to build on their work.

We set out to change that. With Tucano 2, we're releasing not just the models themselves, but the entire development stack — datasets, filtering tools, training configurations, evaluation benchmarks, and more — under permissive licenses. If you want to reproduce our results, adapt our models, or build something entirely new for Portuguese, you can. This is not about "sovereignty". This is about openness and collaboration.

Starting With the Data: GigaVerbo-v2

Good models start with good data, and getting high-quality Portuguese text at scale is harder than you might think. We built GigaVerbo-v2, a cleaned Portuguese corpus of roughly 320 billion tokens across 372 million documents.

How did we clean all that text? Besides importing the whole FineWeb-2 pipeline (the goat), we used a large language model (Qwen2.5-32B) to rate 700,000 documents for educational quality and toxicity. Then we distilled those judgments into smaller, faster classifiers that could tag the entire corpus. Think of it like having an expert review a sample of essays, then training a teaching assistant to grade the rest (the good old FineWeb-Edu approach).
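To make the distillation idea concrete, here is a minimal, dependency-free sketch: a handful of documents with labels standing in for the LLM judge's verdicts, and a tiny Naive Bayes "student" fit on them. The documents, labels, and classifier choice are illustrative placeholders, not the actual GigaVerbo-v2 annotation models.

```python
# Minimal sketch of the distillation idea: an LLM judge scores a small
# sample of documents, then a tiny Naive Bayes "student" learns those
# judgments so the full corpus can be tagged cheaply.
# All documents and labels below are illustrative placeholders.
import math
from collections import Counter

def train_student(docs, labels):
    """Fit per-class word counts (multinomial Naive Bayes, add-one smoothing)."""
    counts = {0: Counter(), 1: Counter()}
    class_totals = Counter(labels)
    for doc, label in zip(docs, labels):
        counts[label].update(doc.lower().split())
    vocab = set(counts[0]) | set(counts[1])
    return counts, class_totals, vocab

def score(doc, counts, class_totals, vocab):
    """Return the most likely class for a document."""
    best_class, best_logp = None, -math.inf
    total_docs = sum(class_totals.values())
    for c in (0, 1):
        logp = math.log(class_totals[c] / total_docs)
        denom = sum(counts[c].values()) + len(vocab)
        for word in doc.lower().split():
            logp += math.log((counts[c][word] + 1) / denom)
        if logp > best_logp:
            best_class, best_logp = c, logp
    return best_class

# Pretend these labels came from a large LLM judge (1 = educational).
docs = [
    "a fotossintese converte luz solar em energia quimica",
    "clique aqui para ganhar premios incriveis",
    "o teorema de pitagoras relaciona os lados de um triangulo",
    "promocao imperdivel clique e ganhe oferta limitada",
]
llm_labels = [1, 0, 1, 0]

counts, class_totals, vocab = train_student(docs, llm_labels)
# The cheap student can now tag new documents at corpus scale.
print(score("clique para ganhar uma oferta", counts, class_totals, vocab))  # → 0
```

The real pipeline replaces the toy student with trained classifiers and the four hand-written labels with 700,000 Qwen2.5-32B judgments, but the teacher-to-student flow is the same.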

On top of that, we created GigaVerbo-v2 Synth — 9.3 billion tokens of synthetic (AI-generated) data designed to fill in the gaps. If the web-crawled data was thin on, say, scientific reasoning or structured coding examples, the synthetic data helped patch those holes.

Our ablation experiments made it clear: models trained on a blend of educational and synthetic data significantly outperformed those trained on raw, unfiltered web text.

[Figure: GigaVerbo-v2 ablation comparison]

A Tokenizer That Actually Gets Portuguese

Here's a detail that often gets overlooked: tokenization. Before a language model can read text, it needs to break it into smaller pieces called tokens. Most existing tokenizers were designed with English in mind, so they chop Portuguese words into more pieces than necessary. Or they're large multilingual tokenizers that will fry any GPU the moment you try to materialize their logits.

More pieces means more compute, slower training, and slower inference.

We trained a custom tokenizer on a balanced mix of Portuguese, English, and code (40/40/20 split), with a vocabulary of about 49,000 tokens. When we benchmarked it against tokenizers from Qwen3, Llama 3.2, and SmolLM3 on 600,000 Portuguese words, ours came out on top — the fewest sub-word pieces per word (1.51) and the best compression (2.88 characters per token).

In practice, this translates to roughly 30% compute savings compared to running the same training with Qwen3's tokenizer. That's a big deal when you're processing hundreds of billions of tokens.
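The two metrics above are easy to compute yourself. The sketch below shows how, using a toy fixed-size chunker as a stand-in for a real tokenizer (the word list and chunker are placeholders, not the Tucano 2 tokenizer):

```python
# Sketch of the two tokenizer metrics from the post: fertility
# (sub-word pieces per word, lower is better) and compression
# (characters per token, higher is better).
# `toy_tokenize` is a stand-in splitter, not a learned BPE tokenizer.

def toy_tokenize(word, max_piece=4):
    """Chop a word into fixed-size chunks; real tokenizers use learned merges."""
    return [word[i:i + max_piece] for i in range(0, len(word), max_piece)]

def fertility(words, tokenize):
    """Average number of sub-word pieces per word."""
    pieces = sum(len(tokenize(w)) for w in words)
    return pieces / len(words)

def compression(words, tokenize):
    """Average characters per token."""
    chars = sum(len(w) for w in words)
    tokens = sum(len(tokenize(w)) for w in words)
    return chars / tokens

words = ["tucano", "linguagem", "aberto", "modelo", "portugues"]
print(f"fertility:   {fertility(words, toy_tokenize):.2f}")
print(f"compression: {compression(words, toy_tokenize):.2f}")
```

Because training compute scales roughly linearly with the number of tokens processed, a tokenizer with lower fertility on Portuguese text directly cuts cost, which is where the ~30% savings comes from.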

Rethinking How We Evaluate Portuguese Models

Training a model is one thing. Knowing whether it's actually getting better? That's surprisingly tricky.

Many existing Portuguese benchmarks are noisy — scores bounce around between checkpoints without showing a clear trend. We dug into this systematically and found that several benchmarks based on open-ended text generation often don't produce meaningful signals until a model has seen over a trillion tokens. That's fine for frontier labs training massive models, but useless if you're running shorter experiments on a tighter budget.

So we designed a two-tier evaluation suite:

  • Easy Set: Benchmarks that give reliable signals early in training (under 200B tokens). Great for ablation studies and quick experiments.
  • Hard Set: Tougher evaluations that only start to differentiate models after 1T+ tokens of training.

We also adapted well-known English benchmarks — IFEval, GSM8K, RULER, and HumanEval — for Portuguese, covering instruction-following, math reasoning, long-context understanding, and coding. All of it is publicly available.

Training: Two Paths to a Great Model

We took two complementary approaches to building Tucano 2.

Path 1: Training From Scratch

Our smallest model, Tucano2-0.6-Base (670M parameters), was trained from scratch on about 408 billion tokens. We used a Llama-style architecture with a hybrid optimizer (AdamW + Muon) and a three-stage training schedule that progressively shifted toward higher-quality data.

[Figure: learning curve]

The key insight here was the curriculum: in early stages we exposed the model to a broad mix of data, then gradually increased the proportion of educational and reasoning-heavy content. Think of it like starting a student with general reading before moving on to textbooks.
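A staged curriculum like this boils down to stage-dependent sampling weights over data sources. Here is an illustrative sketch; the stage proportions below are placeholders for intuition, not the actual Tucano2-0.6-Base recipe:

```python
# Illustrative sketch of a staged data curriculum: each stage reweights
# how often each data source is sampled, shifting toward educational and
# synthetic data late in training. Proportions are placeholders, not the
# actual Tucano 2 training mixture.
import random

STAGE_MIX = {
    # stage: {source: sampling weight}
    1: {"web": 0.70, "educational": 0.20, "synthetic": 0.10},
    2: {"web": 0.45, "educational": 0.35, "synthetic": 0.20},
    3: {"web": 0.20, "educational": 0.45, "synthetic": 0.35},
}

def sample_source(stage, rng=random):
    """Pick a data source according to the current stage's mixture."""
    sources, weights = zip(*STAGE_MIX[stage].items())
    return rng.choices(sources, weights=weights, k=1)[0]

random.seed(0)
draws = [sample_source(3) for _ in range(10_000)]
print(f"stage 3 web share: {draws.count('web') / len(draws):.2f}")  # roughly 0.20
```

In a real training loop, the stage would be a function of tokens seen so far, and each draw would select which shard the next batch is read from.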

Even with a fraction of the compute budget of comparable models, this approach delivered strong early-training benchmark scores — and used 92% less energy than our first Tucano model (Tucano-2b4).

[Figure: model comparison]

Path 2: Building on Giants (Continual Pretraining)

For our larger models, training from scratch was too expensive. Instead, we took the already-powerful Qwen3 base models and adapted them to Portuguese through continual pretraining.

The clever part: we swapped out Qwen3's massive 151K-token vocabulary for our lean 49K-token Portuguese tokenizer, cutting the embedding layer by about 68% without losing the knowledge Qwen3 had already learned. Then we continued training on 50–100 billion tokens of purely Portuguese data.
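One common way to do such a vocabulary swap is to copy the learned embedding rows for tokens that exist in both vocabularies and initialize genuinely new tokens from the mean of the old embeddings. The sketch below illustrates that remapping on toy data; the vocabularies and sizes are placeholders (the real swap was 151K → 49K), and this is one reasonable scheme, not necessarily the exact initialization used for Tucano 2:

```python
# Sketch of re-initializing embeddings after a vocabulary swap: tokens
# shared between old and new vocabularies keep their learned vectors,
# while genuinely new tokens start from the mean of the old embeddings.
# Vocabularies and dimensions are toy placeholders.
import numpy as np

def remap_embeddings(old_emb, old_vocab, new_vocab):
    """Build a new embedding matrix aligned with new_vocab."""
    mean_vec = old_emb.mean(axis=0)            # fallback for unseen tokens
    new_emb = np.tile(mean_vec, (len(new_vocab), 1))
    for token, new_id in new_vocab.items():
        if token in old_vocab:                 # copy learned vectors where possible
            new_emb[new_id] = old_emb[old_vocab[token]]
    return new_emb

rng = np.random.default_rng(0)
old_vocab = {"ola": 0, "mundo": 1, "the": 2, "cat": 3}
old_emb = rng.normal(size=(4, 8))
new_vocab = {"ola": 0, "mundo": 1, "obrigado": 2}

new_emb = remap_embeddings(old_emb, old_vocab, new_vocab)
print(new_emb.shape)  # (3, 8)
```

Since BPE vocabularies overlap heavily on frequent sub-words, most rows are straight copies, which is why the model keeps its pretrained knowledge through the swap.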

The results speak for themselves. Our largest continually pretrained model, Tucano2-qwen-3.7-Base, scored 59.21 on our aggregate benchmark — beating both Qwen3-4B-Base (57.86) and the much larger Qwen2.5-7B (57.97). And the additional compute cost? Less than 0.13% of what Qwen3 originally required for pretraining.

[Figure: before and after continual pretraining, 3.7B model]

Teaching the Models to Follow Instructions (and Think)

A base model is great at predicting text, but it takes more work to make it actually useful. This is where post-training comes in.

We built two Portuguese-language datasets for this:

  • GigaVerbo-v2 SFT: About 4 million examples spanning 12 task types — coding, tool use, structured output, retrieval-augmented generation, math, and more.
  • GigaVerbo-v2 Preferences: Around 28,000 pairs of "better vs. worse" responses, used to teach the model which answer styles to prefer.

Rather than trying to cram everything into one model, we trained two separate families:

  • Instruct models: Optimized for following instructions clearly and accurately.
  • Think models: Designed to reason step-by-step, producing chain-of-thought traces entirely in Portuguese — something most open models cannot do. Try making Qwen3 reason in Portuguese and you'll see what we mean.

Each model went through supervised fine-tuning (SFT) followed by Anchored Preference Optimization (APO), a technique that steers the model toward preferred response styles.
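APO's anchored objective differs in its details from vanilla DPO, but both share the same core mechanic: push the policy's margin for the preferred response above its margin for the rejected one, measured relative to a frozen reference model. The sketch below shows that shared DPO-style core for intuition only; it is NOT the exact APO loss:

```python
# Simplified DPO-style preference loss: -log sigmoid of the scaled reward
# margin between chosen and rejected responses, relative to a frozen
# reference model. For intuition only; not the exact APO objective.
import math

def preference_loss(logp_chosen, logp_rejected,
                    ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Loss shrinks as the policy prefers the chosen response more strongly."""
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# Policy already prefers the chosen answer more than the reference does:
low = preference_loss(-10.0, -30.0, -15.0, -25.0)
# Policy prefers the rejected answer instead: the loss grows.
high = preference_loss(-30.0, -10.0, -25.0, -15.0)
print(low < high)  # True
```

In practice the log-probabilities come from summing per-token logits over each response, and the loss is minimized over the preference dataset after the SFT stage.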

The payoff? Tucano2-qwen-3.7-Instruct achieved the highest Knowledge & Reasoning score among all models in the 3-4B parameter range — beating Qwen3-4B, SmolLM3-3B, and other competitors. And Tucano2-qwen-3.7-Think also topped the reasoning leaderboard (in Knowledge & Reasoning) among thinking models, while doing all its reasoning in Portuguese.

[Figure: instruct model comparison, 3.7B scale]

The Energy Bill: Being Honest About Costs

We tracked energy consumption and carbon emissions throughout the project using CodeCarbon. Here's what the numbers look like:

Phase                                 Energy (kWh)   CO2 Equivalent (kg)
Synthetic data generation                   14,400                 5,472
Continual pretraining (all models)           2,326                   884
Data ablations                               1,600                   608
Evaluations (all models)                     1,000                   380
Post-training (SFT + APO)                      530                   201
Pretraining (from scratch)                     873                   332
Total (all phases)                         ~20,856                ~7,929

Generating synthetic data was by far the most expensive part — about 73% of total energy. Actual model training was comparatively modest (but we didn't even train for that long, so no surprise there). These numbers are also a lower bound. Things like data preprocessing on CPUs, cluster cooling, and networking overhead aren't captured by CodeCarbon, so the real footprint is higher. We think this kind of transparency is important: if the community is going to make good decisions about when and how to train models, we need honest accounting.
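As a sanity check, the CO2 column in the table is consistent with a single grid-intensity factor of roughly 0.38 kg CO2-eq per kWh applied to each phase's energy (CodeCarbon applies a location-specific factor; the 0.38 here is inferred from the table's ratios, not stated in the post):

```python
# Consistency check on the energy table: every CO2 figure matches
# energy (kWh) times a grid-intensity factor of ~0.38 kg CO2-eq/kWh.
# The factor is inferred from the table's ratios, not stated in the post.
PHASES_KWH = {
    "Synthetic data generation": 14_400,
    "Continual pretraining (all models)": 2_326,
    "Data ablations": 1_600,
    "Evaluations (all models)": 1_000,
    "Post-training (SFT + APO)": 530,
    "Pretraining (from scratch)": 873,
}
INTENSITY = 0.38  # kg CO2-eq per kWh (inferred)

for phase, kwh in PHASES_KWH.items():
    print(f"{phase}: {kwh * INTENSITY:.0f} kg CO2-eq")
```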

[Figure: performance vs. compute]

But we didn't stop at carbon. Training AI models doesn't just consume electricity — it also consumes physical materials. The GPUs that power these runs are built from copper, iron, tin, silicon, nickel, and dozens of other elements, and every training run wears down a fraction of that hardware's useful life.

Using the mineral composition analysis from Falk et al. (2025) and a FLOP-based resource allocation methodology, we estimated the aggregated material footprint tied to manufacturing the NVIDIA A100 GPUs used across all 10 Tucano 2 training runs (Base, Instruct, and Think models at every scale). The result: 0.307 kg of copper, 0.010 kg of iron, and smaller quantities of tin, silicon, and nickel — with copper dominating due to its large share in the GPU heatsink.

To our knowledge, Tucano 2 is the first model release to report an elemental footprint alongside its carbon emissions. And if we account for the upstream Qwen3 base models we built upon, the material cost jumps by roughly two orders of magnitude (e.g., ~106 kg of copper for the three Qwen3 base models alone).

These are still lower-bound estimates — they cover only the material content of the manufactured accelerators, not the resource losses during mining, processing, or production. But we believe this kind of reporting should become standard practice. Carbon is only one dimension of the environmental cost of AI; the materials that go into (and eventually out of) training hardware matter too.

What's Next?

Tucano 2 is a big step forward for Portuguese NLP, but there's plenty of room to grow:

  • Smarter synthetic data generation. It works, but at 73% of total energy costs, we need more efficient methods (smaller models!).
  • Bigger preference datasets. Our current 28K pairs are a start — scaling this by 10x and exploring reinforcement learning methods like GRPO could significantly boost reasoning performance.
  • Longer contexts. Extending the models to handle longer documents and multi-turn conversations is a high priority.
  • Agentic capabilities. Training models for improved tool use, multi-step planning, and dynamic interactions would open the door to building native AI agents for Portuguese users.

Peace Out, Tucano 2 In

Everything we built — every dataset, model, training recipe, and benchmark — is publicly available. We hope it serves as both a useful toolkit and a blueprint for anyone working on language models for Portuguese or other languages that don't yet get the attention they deserve.

👉 collections/Polygl0t/tucano2

You can also find all the source code used for this little project at:

👉 github.com/Polygl0t

For thoroughly detailed documentation of Tucano 2, please read our paper:

👉 arXiv:2603.03543

Acknowledgments

Tucano 2 was developed as part of Polyglot (Polygl0t). The methodology and findings presented here extend to additional language-specific studies conducted within the same framework, including Bengali (e.g., LilTii) and Hindi (e.g., LilMoo). For further details on these parallel efforts and associated resources, please refer to the Polyglot project page: huggingface.co/Polygl0t.

Polyglot is a project funded by the Federal Ministry of Education and Research (BMBF) and the Ministry of Culture and Science of the State of North Rhine-Westphalia (MWK) as part of TRA Sustainable Futures (University of Bonn) and the Excellence Strategy of the federal and state governments.

We also gratefully acknowledge the granted access to the Marvin cluster hosted by University of Bonn along with the support provided by its High Performance Computing & Analytics Lab.

Citation

@misc{correa2026tucano2cool,
      title={{Tucano 2 Cool: Better Open Source LLMs for Portuguese}},
      author={Nicholas Kluge Corr{\^e}a and Aniket Sen and Shiza Fatimah and Sophia Falk and Lennard Landgraf and Julia Kastner and Lucie Flek},
      year={2026},
      eprint={2603.03543},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2603.03543},
}
