What's the difference between "first" and "last"?

#1 opened by andykensin

Could you explain the difference between "first" and "last" from the file name?

@andykensin

MobileClip Model Postfixes: .first vs .last Explained

Overview

In my TensorFlow Lite MobileClip conversion project, I created models with .first and .last postfixes. These refer to different text pooling strategies I implemented in the text encoder. They determine how my converted models extract the final text representation from the transformer's token sequence.

What I Implemented

.first Pooling

  • I configured the model to use the first token from the text transformer's output sequence
  • In BERT-style encoders this position is a special [CLS] (classification) token; in CLIP-style text encoders it is the start-of-text token
  • The model relies on learning to aggregate information into this initial position

.last Pooling

  • I set the model to use the last token from the text transformer's output sequence
  • This leverages the sequential nature of transformer attention
  • The last token has "seen" the entire input sequence through causal self-attention
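
To make the two strategies concrete, here is a minimal sketch of how each pooling reduces to tensor indexing (my own PyTorch illustration; the function name is hypothetical, not from the notebook):

import torch

def pool_text_tokens(x: torch.Tensor, pool_type: str) -> torch.Tensor:
    """Reduce (batch, seq_len, width) token features to one embedding per text."""
    if pool_type == 'first':
        return x[:, 0]   # first position, e.g. a [CLS]/start-of-text token
    if pool_type == 'last':
        return x[:, -1]  # final position, which has attended to the whole input
    raise ValueError(f"unknown pool_type: {pool_type}")

features = torch.randn(2, 77, 512)               # dummy transformer output
print(pool_text_tokens(features, 'last').shape)  # torch.Size([2, 512])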

My Technical Implementation

In my notebook code, I controlled this with:

model.text.pool_type = 'last'  # or 'first'

This setting determines which token position my converted model uses as the final 512-dimensional text embedding.
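
For context, this is roughly how the toggle looks when the model is loaded through OpenCLIP; note the model and pretrained-tag names below are placeholders for illustration, not necessarily the exact ones in my notebook:

import open_clip

# Placeholder names; substitute the actual MobileClip variant and tag
model, _, preprocess = open_clip.create_model_and_transforms(
    'MobileCLIP-S1', pretrained='datacompdr')

model.text.pool_type = 'first'  # pool position 0 of the token sequence
# ... export a .first variant ...
model.text.pool_type = 'last'   # pool the final token position
# ... export a .last variant ...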

Why I Found .last Performs Better

Based on my evaluation results, the .last models consistently outperformed the .first models because:

1. Complete Context Access

  • The last token has processed the entire text sequence through attention mechanisms
  • It contains richer contextual information compared to the first token

2. Training Optimization

  • MobileClip models are specifically trained and optimized for last token pooling
  • The model architecture expects this pooling strategy for optimal performance

3. Sequential Information Flow

  • In causally masked text transformers, such as CLIP's text encoder, information flows from left to right
  • Later positions therefore naturally accumulate more comprehensive representations (illustrated below)
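
The causal attention mask makes this accumulation easy to see: row i marks which positions token i may attend to, and only the last row covers the entire input. A small sketch:

import numpy as np

seq_len = 5
causal_mask = np.tril(np.ones((seq_len, seq_len), dtype=int))
print(causal_mask)
# The final row is all ones: the last token attends to every position,
# while the first token (row 0) only ever sees itself.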

My Experimental Setup

My notebook demonstrates this difference through:

  1. Converting both versions: I created .tflite files for both pooling strategies
  2. Systematic evaluation: I tested on image-text pairs using cosine similarity (sketched after this list)
  3. Performance comparison: I measured how well each version matches images to their captions
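
The similarity check in step 2 boils down to a normalized dot product. Here is a minimal sketch with random placeholder embeddings standing in for the real model outputs:

import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Placeholder 512-d embeddings; the real evaluation uses model outputs
image_emb = np.random.rand(512)
text_emb = np.random.rand(512)
print(cosine_similarity(image_emb, text_emb))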

What I Observed

  • .last models: Achieved higher cosine similarity scores with ground truth captions
  • .first models: Showed lower performance in image-text matching tasks
  • File size: Both versions have identical model size (only pooling strategy differs)

My Recommendation

Based on my experiments, use .last pooling for:

  • Better image-text retrieval performance
  • More accurate semantic matching
  • Optimal results with pre-trained MobileClip weights

I kept the .first variants mainly for research comparisons, or for specific use cases where first-token pooling is preferred for architectural reasons.

Code Context

You can see this implementation in my conversion loop:

import logging

POOLING_TYPES = ['last', 'first']

for pool_type in POOLING_TYPES:
    logging.info(f"--- Configuring for Pooling Type: {pool_type} ---")

    # Switch the text tower's pooling strategy before conversion
    if hasattr(model, 'text') and hasattr(model.text, 'pool_type'):
        model.text.pool_type = pool_type

    # Convert and save with the pooling type as a postfix;
    # model_name and pretrained_tag are defined earlier in the notebook
    output_filename = f"{model_name.lower().replace('-', '_')}_{pretrained_tag}_{pool_type}.tflite"

This generates the model files you see with both postfixes, allowing direct performance comparison.
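
If you want to try the comparison yourself, loading one of the converted files follows the standard TF Lite interpreter flow. The filename below is a placeholder, and I'm assuming the text model takes token ids as input:

import numpy as np
import tensorflow as tf

# Placeholder filename; use either the _first or _last variant
interpreter = tf.lite.Interpreter(model_path='mobileclip_..._last.tflite')
interpreter.allocate_tensors()
inp = interpreter.get_input_details()[0]
out = interpreter.get_output_details()[0]

tokens = np.zeros(inp['shape'], dtype=inp['dtype'])  # put tokenized text here
interpreter.set_tensor(inp['index'], tokens)
interpreter.invoke()
text_embedding = interpreter.get_tensor(out['index'])  # e.g. shape (1, 512)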
