
Maintain the unmaintainable:
1M python loc, 400+ models


As you can see, the GenerationMixin node is already very heavy. It encompasses all of the utilities around .generate, and it is second only to nn.Module. That means every decision we make to abstract something else has to be extremely careful.

The following pull request to standardize placeholder masking is a good example of the kind of change that is acceptable. In a VLM, we always need to insert embeddings from various encoders at various positions, so we can have a shared function to do it. For Qwen2 VL, for instance, it looks like this:

    def get_placeholder_mask(
        self,
 

But this is within the modeling file, not in the PreTrainedModel base class. It will not move away from it, because it’d break the self-contained logic of the model.
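Since the snippet above is cut off, here is a minimal, dependency-free sketch of what such a placeholder-splicing helper does. The function name and the list-based "embeddings" are purely illustrative, not the actual Qwen2 VL code:

```python
def splice_placeholder_embeddings(input_ids, text_embeds, image_embeds, placeholder_id):
    """Replace the embedding at every placeholder position with the next image embedding.

    input_ids    : list[int]          token ids for one sequence
    text_embeds  : list[list[float]]  one embedding vector per token
    image_embeds : list[list[float]]  one embedding vector per image placeholder
    """
    # Boolean mask: True where the sequence holds an image placeholder token
    mask = [tok == placeholder_id for tok in input_ids]
    if sum(mask) != len(image_embeds):
        raise ValueError("number of placeholders must match number of image embeddings")
    it = iter(image_embeds)
    # Keep text embeddings everywhere except at placeholder positions
    return [next(it) if is_image else emb for emb, is_image in zip(text_embeds, mask)]
```

The real helper additionally validates shapes against the model's config and works on batched tensors, but the masking logic is the same.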

Keep VLM embedding mix in the modeling file (semantics), standardize safe helpers (e.g., placeholder masking), don’t migrate behavior to PreTrainedModel. Next: pipeline-level wins that came from PyTorch-first choices (fast processors).

On image processing and processors


Choosing to be torch-first software meant shedding a tremendous amount of support for JAX and TensorFlow, and it also meant that we could be more liberal about the number of torch-dependent utilities we add. One of these is the fast processing of images. Where inputs were previously assumed to be minimal ndarrays, making stronger assumptions and accepting torch- and torchvision-native inputs allowed us to speed up the processing time massively for each model.

The performance gains are immense: up to a 20x speedup for most models when using compiled torchvision ops. Further, it allows the whole pipeline to run solely on the GPU.

Fast Image Processors Performance

Thanks to Yoni Gozlan for the great work!
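To illustrate why tensor-native processing is faster, here is a hedged sketch (the function name is ours, not the library's) that resizes and normalizes a whole batch in a few tensor ops, on whatever device the images already live on. Batched ops like these, optionally wrapped in torch.compile, are the kind of thing that replaces per-image ndarray loops:

```python
import torch

def fast_resize_normalize(images, size, mean, std):
    """Batch-resize and normalize on the device `images` lives on (CPU or GPU).

    images : float tensor of shape (N, C, H, W), values in [0, 1]
    size   : (height, width) target resolution
    """
    # One batched op instead of a per-image numpy loop; stays on-device
    images = torch.nn.functional.interpolate(
        images, size=size, mode="bilinear", align_corners=False
    )
    mean = torch.tensor(mean, device=images.device).view(1, -1, 1, 1)
    std = torch.tensor(std, device=images.device).view(1, -1, 1, 1)
    return (images - mean) / std
```

Because nothing leaves the device, a GPU-resident pipeline never pays a host round-trip for preprocessing.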


This is an overall objective: there’s no transformers without its community.

Having a framework means forcing users into it. It restrains flexibility and creativity, which are the fertile soil for new ideas to grow.

Among the most valuable contributions to transformers is of course the addition of new models. Very recently, OpenAI added GPT-OSS, which prompted the addition of many new features to the library in order to support their model.


A second one is the ability to fine-tune these models and pipe them into many other pieces of software. Check here on the Hub how many fine-tunes are registered for gpt-oss 120b, despite its size!

The shape of a contribution: add a model (or variant) with a small modular shard; the community and serving stacks pick it up immediately. Popularity trends (encoders/embeddings) guide where we invest. Next: power tools enabled by a consistent API.

Model popularity

Speaking of dependencies, we can take a look at download numbers to gauge model popularity. One thing we see is the prominence of encoders: encoder usage is centered on embeddings (check out EmbeddingGemma for a modern recap). Hence, it is vital to keep the encoder side of the library viable, usable, and fine-tunable.



As the codebase grows, we also need to maintain our friend codebase Sentence Transformers. Retrieval use cases and smart databases, such as FAISS-based indexing, rely on it, and thus indirectly on transformers.
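As a toy illustration of the retrieval pattern these libraries serve (pure Python and purely illustrative; real systems embed with Sentence Transformers and index with FAISS):

```python
def nearest(query, index):
    """Return the position of the index vector most similar to `query`
    under dot-product similarity, the core operation of embedding retrieval."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))
    # FAISS does exactly this, but over millions of vectors with ANN indexes
    return max(range(len(index)), key=lambda i: dot(query, index[i]))
```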

In that regard, we DO want to be a modular toolbox, being minimal enough and well documented enough so any ML/AI developer can use transformers without having to think about it. We aim to reduce the cognitive load brought about by model development, not increase it.

So, how do these design choices, these “tenets” influence development of models and overall usage of transformers?

Encoders remain critical for embeddings and retrieval; maintaining them well benefits the broader ecosystem (e.g., Sentence Transformers, FAISS). Next: dev tools that leverage unified attention APIs and PyTorch-only internals.


Attention visualisation

All models have the same internal API for attention computation, thanks to the externalisation of attention classes. This allows us to build cool tools to visualize the inner workings of the attention mechanism.

One particular piece of machinery is the attention mask. Here you see the famous bidirectional attention pattern for the whole prefix (text + image) in PaliGemma and all Gemma2+ models, contrasting with the usual “causal-only” models.
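That pattern can be reproduced in a few lines. Here is a minimal sketch (plain Python, our own helper name) of a PaliGemma-style mask: bidirectional attention within the prefix (text + image), causal attention afterwards:

```python
def prefix_lm_mask(seq_len, prefix_len):
    """mask[i][j] is True when position i may attend to position j.

    Prefix tokens see the whole prefix (bidirectional); tokens after the
    prefix attend causally to everything up to and including themselves.
    """
    mask = []
    for i in range(seq_len):
        row = []
        for j in range(seq_len):
            row.append(j < prefix_len if i < prefix_len else j <= i)
        mask.append(row)
    return mask
```

A plain causal-only model is the special case prefix_len = 0.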


Forward interception and nested JSON logging align ports to reference implementations, reinforcing “Source of Truth.” Next: CUDA warmup reduces load-time stalls without touching modeling semantics.

Cooking faster CUDA warmups

Having a clean external API lets us work on the true inner workings of transformers. One recent addition was the CUDA warmup via caching_allocator_warmup, which massively improved the loading footprint by pre-allocating GPU memory to avoid malloc bottlenecks during model loading, achieving a 7x speedup for an 8B model and 6x for a 32B one; check out the source!
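The idea can be sketched as follows, assuming torch. This is a simplified stand-in, not the actual caching_allocator_warmup, which derives how much memory to reserve from the checkpoint being loaded:

```python
import torch

def warmup_caching_allocator(num_bytes, device="cuda"):
    """Pre-allocate one large block so subsequent per-parameter allocations
    hit PyTorch's caching allocator instead of repeated cudaMalloc calls
    (an illustrative sketch of the warmup idea)."""
    if not torch.cuda.is_available():
        return  # nothing to warm on CPU-only machines
    block = torch.empty(num_bytes, dtype=torch.uint8, device=device)
    # Freed memory stays cached by the allocator, ready for instant reuse
    del block
```

One large allocation followed by many cached reuses is far cheaper than many small cudaMalloc calls, which is where the load-time speedup comes from.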
