What's the difference between "first" and "last"?

#1 opened by andykensin

Could you explain the difference between "first" and "last" from the file name?

@andykensin

MobileClip Model Postfixes: .first vs .last Explained

Overview

In my TensorFlow Lite MobileClip conversion project, I created models with .first and .last postfixes. These refer to different text pooling strategies I implemented in the text encoder. They determine how my converted models extract the final text representation from the transformer's token sequence.

What I Implemented

.first Pooling

  • I configured the model to use the first token from the text transformer's output sequence
  • In BERT-style encoders this position is a special [CLS] (classification) token; in CLIP-style text encoders it is the start-of-text token
  • The model relies on learning to aggregate information into this initial position

.last Pooling

  • I set the model to use the last token from the text transformer's output sequence
  • This leverages the sequential nature of transformer attention
  • The last token has "seen" the entire input sequence through causal self-attention
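
To make the two strategies concrete, here is a minimal sketch of how each pooling reduces to tensor indexing (my own PyTorch illustration; the function name is hypothetical, not from the notebook):

import torch

def pool_text_tokens(x: torch.Tensor, pool_type: str) -> torch.Tensor:
    """Reduce (batch, seq_len, width) token features to one embedding per text."""
    if pool_type == 'first':
        return x[:, 0]   # first position, e.g. a [CLS]/start-of-text token
    if pool_type == 'last':
        return x[:, -1]  # final position, which has attended to the whole input
    raise ValueError(f"unknown pool_type: {pool_type}")

features = torch.randn(2, 77, 512)               # dummy transformer output
print(pool_text_tokens(features, 'last').shape)  # torch.Size([2, 512])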

My Technical Implementation

In my notebook code, I controlled this with:

model.text.pool_type = 'last'  # or 'first'

This setting determines which token position my converted model uses as the final 512-dimensional text embedding.
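
For context, this is roughly how the toggle looks when the model is loaded through OpenCLIP; note the model and pretrained-tag names below are placeholders for illustration, not necessarily the exact ones in my notebook:

import open_clip

# Placeholder names; substitute the actual MobileClip variant and tag
model, _, preprocess = open_clip.create_model_and_transforms(
    'MobileCLIP-S1', pretrained='datacompdr')

model.text.pool_type = 'first'  # pool position 0 of the token sequence
# ... export a .first variant ...
model.text.pool_type = 'last'   # pool the final token position
# ... export a .last variant ...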

Why I Found .last Performs Better

Based on my evaluation results, the .last models consistently outperformed the .first models because:

1. Complete Context Access

  • The last token has processed the entire text sequence through attention mechanisms
  • It contains richer contextual information compared to the first token

2. Training Optimization

  • MobileClip models are specifically trained and optimized for last token pooling
  • The model architecture expects this pooling strategy for optimal performance

3. Sequential Information Flow

  • In causally masked text transformers, such as CLIP's text encoder, information flows from left to right
  • Later positions therefore naturally accumulate more comprehensive representations (illustrated below)
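
The causal attention mask makes this accumulation easy to see: row i marks which positions token i may attend to, and only the last row covers the entire input. A small sketch:

import numpy as np

seq_len = 5
causal_mask = np.tril(np.ones((seq_len, seq_len), dtype=int))
print(causal_mask)
# The final row is all ones: the last token attends to every position,
# while the first token (row 0) only ever sees itself.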

My Experimental Setup

My notebook demonstrates this difference through:

  1. Converting both versions: I created .tflite files for both pooling strategies
  2. Systematic evaluation: I tested on image-text pairs using cosine similarity (sketched after this list)
  3. Performance comparison: I measured how well each version matches images to their captions
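
The similarity check in step 2 boils down to a normalized dot product. Here is a minimal sketch with random placeholder embeddings standing in for the real model outputs:

import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Placeholder 512-d embeddings; the real evaluation uses model outputs
image_emb = np.random.rand(512)
text_emb = np.random.rand(512)
print(cosine_similarity(image_emb, text_emb))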

What I Observed

  • .last models: Achieved higher cosine similarity scores with ground truth captions
  • .first models: Showed lower performance in image-text matching tasks
  • File size: Both versions have identical model size (only pooling strategy differs)

My Recommendation

Based on my experiments, use .last pooling for:

  • Better image-text retrieval performance
  • More accurate semantic matching
  • Optimal results with pre-trained MobileClip weights

I kept the .first variants mainly for research comparisons, or for specific use cases where first-token pooling is preferred for architectural reasons.

Code Context

You can see this implementation in my conversion loop:

import logging

POOLING_TYPES = ['last', 'first']

for pool_type in POOLING_TYPES:
    logging.info(f"--- Configuring for Pooling Type: {pool_type} ---")

    # Switch the text tower's pooling strategy before conversion
    if hasattr(model, 'text') and hasattr(model.text, 'pool_type'):
        model.text.pool_type = pool_type

    # Convert and save with the pooling type as a postfix;
    # model_name and pretrained_tag are defined earlier in the notebook
    output_filename = f"{model_name.lower().replace('-', '_')}_{pretrained_tag}_{pool_type}.tflite"

This generates the model files you see with both postfixes, allowing direct performance comparison.
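
If you want to try the comparison yourself, loading one of the converted files follows the standard TF Lite interpreter flow. The filename below is a placeholder, and I'm assuming the text model takes token ids as input:

import numpy as np
import tensorflow as tf

# Placeholder filename; use either the _first or _last variant
interpreter = tf.lite.Interpreter(model_path='mobileclip_..._last.tflite')
interpreter.allocate_tensors()
inp = interpreter.get_input_details()[0]
out = interpreter.get_output_details()[0]

tokens = np.zeros(inp['shape'], dtype=inp['dtype'])  # put tokenized text here
interpreter.set_tensor(inp['index'], tokens)
interpreter.invoke()
text_embedding = interpreter.get_tensor(out['index'])  # e.g. shape (1, 512)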
