Commit a7e60b1 by kawchar85 (verified) · 1 parent: 4a9d587

Update README.md

  1. README.md +206 -3
README.md CHANGED
@@ -1,3 +1,206 @@
1
- ---
2
- license: apache-2.0
3
- ---
1
+ ---
2
+ license: apache-2.0
3
+ base_model:
4
+ - unsloth/SmolLM2-1.7B-Instruct
5
+ pipeline_tag: text-generation
6
+ tags:
7
+ - text-to-image-evaluation
8
+ - faithfulness
9
+ - lora
10
+ - tifa
11
+ - unsloth
12
+ language: en
13
+ ---
14
+
15
+ # SmolLM2-1.7B-Instruct-TIFA
16
+
17
+ ## Model Description
18
+
19
+ SmolLM2-1.7B-Instruct-TIFA is a fine-tuned version of [unsloth/SmolLM2-1.7B-Instruct](https://huggingface.co/unsloth/SmolLM2-1.7B-Instruct) trained specifically for **TIFA (Text-to-Image Faithfulness Assessment)**. It generates structured evaluation questions for assessing how faithfully a text-to-image model renders a given text description. This is the most capable version in my series: it has 1.7B parameters, was trained with a validation split, and produces far fewer duplicated questions.
20
+
21
+ **Previous versions**: [135M](https://huggingface.co/kawchar85/SmolLM2-135M-Instruct-TIFA) | [360M](https://huggingface.co/kawchar85/SmolLM2-360M-Instruct-TIFA)
22
+
23
+ ## Intended Use
24
+
25
+ This model is designed to automatically generate evaluation questions for text-to-image models by creating four specific types of questions:
26
+
27
+ 1. **Negative question**: Should have "no" as the answer (testing for contradictory elements)
28
+ 2. **Object/attribute identification**: Should have a single-word answer taken directly from the description
29
+ 3. **Alternative object/attribute identification**: Should have a single-word answer from the description that differs from the answer to question 2
30
+ 4. **Positive question**: Should have "yes" as the answer (testing for present elements)
31
+
32
+ ## Model Details
33
+
34
+ - **Base Model**: unsloth/SmolLM2-1.7B-Instruct
35
+ - **Model Size**: 1.7B parameters
36
+ - **Fine-tuning Method**: LoRA (Low-Rank Adaptation) with enhanced configuration
37
+ - **Training Framework**: Transformers + TRL + PEFT + Unsloth
38
+ - **License**: apache-2.0
39
+
40
+ ## Training Details
41
+
42
+ ### Training Configuration
43
+ - **Training Method**: Supervised Fine-Tuning (SFT) with LoRA and validation
44
+ - **Enhanced LoRA Configuration**:
45
+   - r: 24
46
+   - lora_alpha: 48
47
+   - lora_dropout: 0.05
48
+   - Target modules: `["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"]`
49
+
50
+ - **Training Parameters**:
51
+   - Epochs: 5
52
+   - Learning Rate: 1e-4
53
+   - Batch Size: 8 (per device)
54
+   - Gradient Accumulation Steps: 2
55
+   - Max Sequence Length: 512
56
+   - Optimizer: AdamW
57
+   - LR Scheduler: Cosine (improved from linear)
58
+   - Weight Decay: 0.01
59
+   - Warmup Steps: 200
60
+ - **Validation Setup**: 10% holdout with early stopping based on eval_loss
61
+
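To make these settings concrete, the sketch below works out a few derived quantities: the LoRA scaling factor (`lora_alpha / r`), the effective batch size, and a learning-rate curve with linear warmup followed by cosine decay. It is only an illustration, not the actual training code. It assumes a single device, 9,000 training examples (90% of the 10,000-example dataset), decay to zero at the final step, and no early stop; `lr_at` and the step estimate are hypothetical helpers, and a real trainer may round the step count slightly differently.

```python
import math

# Values taken from the configuration listed above.
R, LORA_ALPHA = 24, 48
PER_DEVICE_BATCH, GRAD_ACCUM = 8, 2
EPOCHS, PEAK_LR, WARMUP_STEPS = 5, 1e-4, 200

# Assumptions (not stated in the README): one device, 9,000 training examples.
NUM_DEVICES = 1
TRAIN_EXAMPLES = 9_000

# LoRA updates are scaled by alpha / r in the standard formulation.
lora_scale = LORA_ALPHA / R

# Examples consumed per optimizer step.
effective_batch = PER_DEVICE_BATCH * GRAD_ACCUM * NUM_DEVICES

# Approximate optimizer steps if training runs all epochs.
steps_per_epoch = math.ceil(TRAIN_EXAMPLES / effective_batch)
total_steps = steps_per_epoch * EPOCHS


def lr_at(step: int) -> float:
    """Linear warmup to PEAK_LR, then half-cosine decay to zero."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / max(1, total_steps - WARMUP_STEPS)
    return PEAK_LR * 0.5 * (1.0 + math.cos(math.pi * progress))


print(f"LoRA scale: {lora_scale}, effective batch: {effective_batch}")
print(f"Estimated optimizer steps: {total_steps}")
for step in (0, 100, WARMUP_STEPS, total_steps // 2, total_steps):
    print(step, f"{lr_at(step):.2e}")
```

Under these assumptions the 200 warmup steps cover well under one epoch, so most of the training runs on the cosine-decay part of the schedule.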
62
+ ### Dataset
63
+ The model was trained on the same structured dataset as the previous versions: 10,000 examples created using Gemini. This time the data was split 90%/10% into training and validation sets to improve generalization and reduce overfitting.
64
+
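A 90%/10% holdout like the one described above can be sketched in a few lines. This is a minimal illustration only: the helper name `train_val_split`, the seed, and the placeholder examples are assumptions, and the actual split may have been produced differently (for example with a dataset library's own split utility).

```python
import random


def train_val_split(examples, val_fraction=0.1, seed=42):
    """Shuffle a copy of `examples` and hold out `val_fraction` for validation."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    n_val = int(round(len(shuffled) * val_fraction))
    return shuffled[n_val:], shuffled[:n_val]


# Placeholder records standing in for the 10,000 training examples.
examples = [{"id": i} for i in range(10_000)]
train, val = train_val_split(examples)
print(len(train), len(val))
```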
65
+ ## Usage
66
+
67
+ ### Installation
68
+
69
+ ```bash
70
+ pip install transformers torch accelerate
71
+ ```
72
+
73
+ ### Basic Usage
74
+
75
+ ```python
76
+ from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
77
+ import torch
78
+
79
+ model_path = "kawchar85/SmolLM2-1.7B-Instruct-TIFA"
80
+
81
+ # Load model and tokenizer
82
+ tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
83
+ model = AutoModelForCausalLM.from_pretrained(
84
+     model_path,
85
+     torch_dtype=torch.float16,
86
+     trust_remote_code=True,
87
+     device_map="auto"
88
+ )
89
+
90
+ # Create pipeline
91
+ chat_pipe = pipeline(
92
+     "text-generation",
93
+     model=model,
94
+     tokenizer=tokenizer,
95
+     return_full_text=False,
96
+ )
97
+
98
+ def get_message(desc):
99
+     system_msg = """\
100
+ You are a helpful assistant. Your job is to write exactly four DIFFERENT multiple-choice questions that test if an image matches its description.
101
+ Rules:
102
+ Q1: Focus on something contradictory to the description. Answer must be 'no' (choices: no, yes).
103
+ Q2: Answer must be one exact word from the description; provide 4 UNIQUE choices.
104
+ Q3: Answer must be a DIFFERENT exact word from the description than what was used in Q2; provide 4 UNIQUE choices.
105
+ Q4: Focus on something present in the description. Answer must be 'yes' (choices: no, yes).
106
+ Make each question cover a distinct detail. Ensure all questions are meaningful, valid, and relevant to the description.
107
+
108
+ For description "a red car parked near a tall building":
109
+ Q1: Is the car black?
110
+ C: no, yes
111
+ A: no
112
+ Q2: What is the vehicle in the image?
113
+ C: motorcycle, car, bicycle, truck
114
+ A: car
115
+ Q3: What type of structure is near the car?
116
+ C: house, building, garage, tree
117
+ A: building
118
+ Q4: Is there a car in the image?
119
+ C: no, yes
120
+ A: yes
121
+ """
122
+
123
+     user_msg = f'Create four DIFFERENT multiple-choice questions for this description: "{desc}".'
124
+     return [
125
+         {"role": "system", "content": system_msg},
126
+         {"role": "user", "content": user_msg}
127
+     ]
128
+
129
+ # Generate evaluation questions
130
+ description = "a man sleeping in the park"
131
+ messages = get_message(description)
132
+
133
+ output = chat_pipe(
134
+     messages,
135
+     max_new_tokens=256,
136
+     do_sample=False,
137
+ )
138
+
139
+ print(output[0]["generated_text"])
140
+ ```
141
+
142
+ ### Example Output
143
+
144
+ For the description "a man sleeping in the park", the model generates:
145
+
146
+ ```
147
+ Q1: Is the man standing up?
148
+ C: no, yes
149
+ A: no
150
+ Q2: What is the person doing?
151
+ C: running, sleeping, walking, eating
152
+ A: sleeping
153
+ Q3: Where is the man located?
154
+ C: beach, park, house, store
155
+ A: park
156
+ Q4: Is there a person in the image?
157
+ C: no, yes
158
+ A: yes
159
+ ```
160
+
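Because the output follows a fixed `Q`/`C`/`A` layout, it can be parsed and checked against the four rules from the system prompt before it is used for evaluation. The sketch below is one possible way to do that; `parse_questions` and `check_questions` are hypothetical helpers, not part of this model or of any library, and the checks are a simplified reading of the rules (for example, word matching is case-insensitive and ignores punctuation).

```python
import re


def parse_questions(text):
    """Parse 'Q<n>: ...', 'C: ...', 'A: ...' lines into a list of dicts."""
    questions = []
    for line in text.strip().splitlines():
        line = line.strip()
        if re.match(r"Q\d+:", line):
            questions.append({"question": line.split(":", 1)[1].strip()})
        elif line.startswith("C:") and questions:
            questions[-1]["choices"] = [c.strip() for c in line[2:].split(",")]
        elif line.startswith("A:") and questions:
            questions[-1]["answer"] = line[2:].strip()
    return questions


def check_questions(questions, description):
    """Return a list of rule violations; an empty list means all checks pass."""
    if len(questions) != 4:
        return [f"expected 4 questions, got {len(questions)}"]
    if any("choices" not in q or "answer" not in q for q in questions):
        return ["a question is missing its choices or answer"]

    problems = []
    q1, q2, q3, q4 = questions
    if q1["answer"] != "no":
        problems.append("Q1 answer must be 'no'")
    if q4["answer"] != "yes":
        problems.append("Q4 answer must be 'yes'")

    words = set(re.findall(r"\w+", description.lower()))
    for name, q in (("Q2", q2), ("Q3", q3)):
        if len(set(q["choices"])) != 4:
            problems.append(f"{name} must have 4 unique choices")
        if q["answer"] not in q["choices"]:
            problems.append(f"{name} answer is not among its choices")
        if q["answer"].lower() not in words:
            problems.append(f"{name} answer is not a word from the description")
    if q2["answer"] == q3["answer"]:
        problems.append("Q2 and Q3 must have different answers")
    if len({q["question"].lower() for q in questions}) != 4:
        problems.append("duplicated question text")
    return problems


sample = """\
Q1: Is the man standing up?
C: no, yes
A: no
Q2: What is the person doing?
C: running, sleeping, walking, eating
A: sleeping
Q3: Where is the man located?
C: beach, park, house, store
A: park
Q4: Is there a person in the image?
C: no, yes
A: yes
"""

parsed = parse_questions(sample)
issues = check_questions(parsed, "a man sleeping in the park")
print(issues)
```

Running such a check over a batch of generations is also a simple way to measure how often duplicated or malformed question sets still occur.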
161
+ ## Major Improvements Over Previous Versions
162
+
163
+ This 1.7B parameter model offers significant advantages over the [360M](https://huggingface.co/kawchar85/SmolLM2-360M-Instruct-TIFA) and [135M](https://huggingface.co/kawchar85/SmolLM2-135M-Instruct-TIFA) versions:
164
+
165
+ ### Training Improvements
166
+ - **Validation-based training**: 90/10 train/validation split with early stopping
167
+ - **Enhanced LoRA**: Higher rank (24) and alpha (48) for better adaptation
168
+ - **Better scheduling**: Cosine learning rate schedule for improved convergence
169
+ - **More training**: 5 epochs with validation monitoring
170
+
171
+ ### Performance Improvements
172
+ - **Near-zero duplication**: Duplicated questions within a set are now very rare
173
+ - **Better question diversity**: More varied and contextually appropriate questions
174
+ - **Enhanced consistency**: More reliable adherence to the four-question structure
175
+ - **Improved reasoning**: Better understanding of description nuances
176
+ - **Higher quality**: More natural and meaningful question formulations
177
+
178
+ ### Technical Improvements
179
+ - **Larger capacity**: 1.7B parameters for better language understanding
180
+ - **Optimized prompting**: Enhanced system prompt emphasizing "DIFFERENT" questions
181
+ - **Better examples**: Improved training examples in the system prompt
182
+
183
+ ## Limitations
184
+
185
+ - The model is specialized for TIFA evaluation and may not perform well on general conversation tasks
186
+ - Limited to generating 4-question evaluation sets in the trained format
187
+ - Requires specific prompt formatting for optimal performance
188
+
189
+ ## Technical Specifications
190
+
191
+ - **Architecture**: Transformer-based language model (1.7B parameters)
192
+ - **Precision**: FP16
193
+ - **Context Length**: 512 tokens (maximum sequence length used during fine-tuning)
194
+ - **Training**: Validation-based with early stopping
195
+ - **Optimization**: Enhanced LoRA with cosine scheduling
196
+
197
+ ## Citation
198
+
199
+ ```bibtex
200
+ @misc{smollm2-1-7b-it-tifa-2025,
201
+ title={SmolLM2-1.7B-Instruct-TIFA: A Large Fine-tuned Model for Text-to-Image Faithfulness Assessment},
202
+ author={kawchar85},
203
+ year={2025},
204
+ url={https://huggingface.co/kawchar85/SmolLM2-1.7B-Instruct-TIFA}
205
+ }
206
+ ```