+---
+license: apache-2.0
+base_model:
+- unsloth/SmolLM2-1.7B-Instruct
+pipeline_tag: text-generation
+tags:
+  - text-to-image-evaluation
+  - faithfulness
+  - lora
+  - tifa
+  - unsloth
+language: en
+---
+# SmolLM2-1.7B-Instruct-TIFA
+## Model Description
+SmolLM2-1.7B-Instruct-TIFA is a fine-tuned version of [unsloth/SmolLM2-1.7B-Instruct](https://huggingface.co/unsloth/SmolLM2-1.7B-Instruct) trained for **TIFA (Text-to-Image Faithfulness Assessment)**. It generates structured evaluation questions that test how faithfully a text-to-image model renders a given text description. This is the most capable version in my series: it has 1.7B parameters, was trained with a validation split, and produces far fewer duplicate questions.
+**Previous versions**: [135M](https://huggingface.co/kawchar85/SmolLM2-135M-Instruct-TIFA) | [360M](https://huggingface.co/kawchar85/SmolLM2-360M-Instruct-TIFA)
+## Intended Use
+This model is designed to automatically generate evaluation questions for text-to-image models by creating four specific types of questions:
+1. **Negative question**: Should have "no" as the answer (testing for contradictory elements)
+2. **Object/attribute identification**: Should have a single-word answer taken directly from the description
+3. **Alternative object/attribute identification**: Should have a single-word answer, also taken from the description, that differs from the answer to question 2
+4. **Positive question**: Should have "yes" as the answer (testing for present elements)
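The four rules above can be checked mechanically once a question set has been parsed. The following is a minimal sketch, not part of the released model or its training code: `check_question_set` and the dict layout (`question`, `choices`, `answer`) are hypothetical names chosen for illustration, and "a word from the description" is approximated by lowercase alphanumeric tokens.

```python
import re

def check_question_set(description, questions):
    """Return a list of rule violations; an empty list means the set looks valid.

    `questions` is a list of dicts with keys "question", "choices", "answer"
    (a hypothetical structure chosen for this sketch).
    """
    if len(questions) != 4:
        return [f"expected 4 questions, got {len(questions)}"]

    problems = []
    # Approximate "words in the description" as lowercase alphanumeric tokens.
    words = set(re.findall(r"[a-z0-9]+", description.lower()))
    q1, q2, q3, q4 = questions

    # Q1 and Q4 are binary questions with fixed answers.
    for label, q, expected in (("Q1", q1, "no"), ("Q4", q4, "yes")):
        if sorted(q["choices"]) != ["no", "yes"]:
            problems.append(f"{label}: choices must be no/yes")
        if q["answer"] != expected:
            problems.append(f"{label}: answer must be '{expected}'")

    # Q2 and Q3 need four unique choices and an answer that is both
    # one of the choices and a word from the description.
    for label, q in (("Q2", q2), ("Q3", q3)):
        if len(q["choices"]) != 4 or len(set(q["choices"])) != 4:
            problems.append(f"{label}: needs 4 unique choices")
        if q["answer"] not in q["choices"]:
            problems.append(f"{label}: answer is not among the choices")
        if q["answer"].lower() not in words:
            problems.append(f"{label}: answer is not a word from the description")

    if q2["answer"].lower() == q3["answer"].lower():
        problems.append("Q3: answer must differ from the Q2 answer")
    return problems
```

A check like this is useful as a filter before sending questions to a VQA model: any set that returns a non-empty list can be regenerated or discarded.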
+## Model Details
+- **Base Model**: unsloth/SmolLM2-1.7B-Instruct
+- **Model Size**: 1.7B parameters
+- **Fine-tuning Method**: LoRA (Low-Rank Adaptation) with enhanced configuration
+- **Training Framework**: Transformers + TRL + PEFT + Unsloth
+- **License**: apache-2.0
+## Training Details
+### Training Configuration
+- **Training Method**: Supervised Fine-Tuning (SFT) with LoRA and validation
+- **Enhanced LoRA Configuration**:
+  - r: 24
+  - lora_alpha: 48
+  - lora_dropout: 0.05
+  - Target modules: `["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"]`
+- **Training Parameters**:
+  - Epochs: 5
+  - Learning Rate: 1e-4
+  - Batch Size: 8 (per device)
+  - Gradient Accumulation Steps: 2
+  - Max Sequence Length: 512
+  - Optimizer: AdamW
+  - LR Scheduler: Cosine (improved from linear)
+  - Weight Decay: 0.01
+  - Warmup Steps: 200
+  - **Validation Setup**: 10% holdout with early stopping based on eval_loss
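To make the training parameters concrete, the sketch below works out the effective batch size, the approximate number of optimizer steps, and the learning rate at a few points in the schedule. It rests on assumptions the card does not state: a single device, 9,000 training examples (90% of the 10,000 described in the Dataset section), and the common "linear warmup, then cosine decay to zero" shape. The exact step count depends on how the trainer rounds the last partial batch, and early stopping may end training before the final step.

```python
import math

# Values taken from the training configuration listed above.
PER_DEVICE_BATCH = 8
GRAD_ACCUM_STEPS = 2
EPOCHS = 5
WARMUP_STEPS = 200
PEAK_LR = 1e-4

# Assumptions made for this sketch (not stated in the card).
NUM_DEVICES = 1
TRAIN_EXAMPLES = 9_000  # 90% of the 10,000-example dataset

effective_batch = PER_DEVICE_BATCH * GRAD_ACCUM_STEPS * NUM_DEVICES
steps_per_epoch = math.ceil(TRAIN_EXAMPLES / effective_batch)
total_steps = steps_per_epoch * EPOCHS

def lr_at(step):
    """Linear warmup to PEAK_LR, then cosine decay to zero at total_steps."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / max(1, total_steps - WARMUP_STEPS)
    return PEAK_LR * 0.5 * (1.0 + math.cos(math.pi * progress))

print("effective batch size:", effective_batch)
print("steps per epoch:", steps_per_epoch)
print("total steps:", total_steps)
for step in (0, 100, WARMUP_STEPS, total_steps // 2, total_steps):
    print(f"step {step:>5}: lr = {lr_at(step):.2e}")
```

Under these assumptions the 200 warmup steps cover well under the first epoch, so most of training runs on the cosine part of the schedule.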
+### Dataset
+The model was trained on the same structured dataset as the previous versions: 10,000 examples created using Gemini. The difference is the training methodology, which now uses a 90%/10% train/validation split to improve generalization and reduce overfitting.
+## Usage
+### Installation
+```bash
+pip install transformers torch accelerate
+```
+### Basic Usage
+```python
+from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
+import torch
+model_path = "kawchar85/SmolLM2-1.7B-Instruct-TIFA"
+# Load model and tokenizer
+tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
+model = AutoModelForCausalLM.from_pretrained(
+    model_path,
+    torch_dtype=torch.float16,
+    trust_remote_code=True,
+    device_map="auto"
+)
+# Create pipeline
+chat_pipe = pipeline(
+    "text-generation",
+    model=model,
+    tokenizer=tokenizer,
+    return_full_text=False,
+)
+def get_message(desc):
+    system_msg = """\
+You are a helpful assistant. Your job is to write exactly four DIFFERENT multiple-choice questions that test if an image matches its description.
+Rules:
+Q1: Focus on something contradictory to the description. Answer must be 'no' (choices: no, yes).
+Q2: Answer must be one exact word from the description; provide 4 UNIQUE choices.
+Q3: Answer must be a DIFFERENT exact word from the description than what was used in Q2; provide 4 UNIQUE choices.
+Q4: Focus on something present in the description. Answer must be 'yes' (choices: no, yes).
+Make each question cover a distinct detail. Ensure all questions are meaningful, valid, and relevant to the description.
+For description "a red car parked near a tall building":
+Q1: Is the car black?
+C: no, yes
+A: no
+Q2: What is the vehicle in the image?
+C: motorcycle, car, bicycle, truck
+A: car
+Q3: What type of structure is near the car?
+C: house, building, garage, tree
+A: building
+Q4: Is there a car in the image?
+C: no, yes
+A: yes
+"""
+    user_msg = f'Create four DIFFERENT multiple-choice questions for this description: "{desc}".'
+    return [
+        {"role": "system", "content": system_msg},
+        {"role": "user", "content": user_msg}
+    ]
+# Generate evaluation questions
+description = "a man sleeping in the park"
+messages = get_message(description)
+output = chat_pipe(
+    messages,
+    max_new_tokens=256,
+    do_sample=False,
+)
+print(output[0]["generated_text"])
+```
+### Example Output
+For the description "a man sleeping in the park", the model generates:
+```
+Q1: Is the man standing up?
+C: no, yes
+A: no
+Q2: What is the person doing?
+C: running, sleeping, walking, eating
+A: sleeping
+Q3: Where is the man located?
+C: beach, park, house, store
+A: park
+Q4: Is there a person in the image?
+C: no, yes
+A: yes
+```
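For downstream use, the generated text usually needs to be turned into structured records before it is passed to a VQA model. The following is a minimal sketch that assumes the output follows the `Qn:` / `C:` / `A:` layout shown above exactly; `parse_questions` and the dict keys are hypothetical names, and malformed blocks are silently skipped rather than repaired.

```python
import re

# Matches one "Qn: ... / C: ... / A: ..." block in the output format shown above.
BLOCK_RE = re.compile(
    r"Q(?P<num>\d+):\s*(?P<question>.+?)\s*\n"
    r"C:\s*(?P<choices>.+?)\s*\n"
    r"A:\s*(?P<answer>.+?)\s*(?:\n|$)"
)

def parse_questions(text):
    """Parse generated text into a list of question dicts, in order of appearance."""
    parsed = []
    for match in BLOCK_RE.finditer(text):
        parsed.append({
            "number": int(match.group("num")),
            "question": match.group("question"),
            "choices": [c.strip() for c in match.group("choices").split(",")],
            "answer": match.group("answer"),
        })
    return parsed

# Usage with the example output for "a man sleeping in the park".
sample = """Q1: Is the man standing up?
C: no, yes
A: no
Q2: What is the person doing?
C: running, sleeping, walking, eating
A: sleeping
Q3: Where is the man located?
C: beach, park, house, store
A: park
Q4: Is there a person in the image?
C: no, yes
A: yes"""

for item in parse_questions(sample):
    print(item["number"], item["question"], item["choices"], item["answer"])
```

If fewer than four records come back, the generation did not follow the expected format and is best regenerated.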
+## Major Improvements Over Previous Versions
+This 1.7B parameter model offers significant advantages over the [360M](https://huggingface.co/kawchar85/SmolLM2-360M-Instruct-TIFA) and [135M](https://huggingface.co/kawchar85/SmolLM2-135M-Instruct-TIFA) versions:
+### Training Improvements
+- **Validation-based training**: 90/10 train/validation split with early stopping
+- **Enhanced LoRA**: Higher rank (24) and alpha (48) for better adaptation
+- **Better scheduling**: Cosine learning rate schedule for improved convergence
+- **More training**: 5 epochs with validation monitoring
+### Performance Improvements
+- **Near-zero duplication**: Duplicate questions within a set are now very rare
+- **Better question diversity**: More varied and contextually appropriate questions
+- **Enhanced consistency**: More reliable adherence to the four-question structure
+- **Improved reasoning**: Better understanding of description nuances
+- **Higher quality**: More natural and meaningful question formulations
+### Technical Improvements
+- **Larger capacity**: 1.7B parameters for better language understanding
+- **Optimized prompting**: Enhanced system prompt emphasizing "DIFFERENT" questions
+- **Better examples**: Improved training examples in the system prompt
+## Limitations
+- The model is specialized for TIFA evaluation and may not perform well on general conversation tasks
+- Limited to generating 4-question evaluation sets in the trained format
+- Requires specific prompt formatting for optimal performance
+## Technical Specifications
+- **Architecture**: Transformer-based language model (1.7B parameters)
+- **Precision**: FP16
+- **Max Sequence Length (training)**: 512 tokens
+- **Training**: Validation-based with early stopping
+- **Optimization**: Enhanced LoRA with cosine scheduling
+## Citation
+```bibtex
+@misc{smollm2-1-7b-it-tifa-2025,
+  title={SmolLM2-1.7B-Instruct-TIFA: A Large Fine-tuned Model for Text-to-Image Faithfulness Assessment},
+  author={kawchar85},
+  year={2025},
+  url={https://huggingface.co/kawchar85/SmolLM2-1.7B-Instruct-TIFA}
+}
+```