Similarity between embeddings
Hello,
when comparing the two embeddings from the original T5 and the distilled one, how much similarity should I expect? And if the two embeddings are similar, would doing LoRA fine-tuning on the student model be beneficial?
I didn't calculate the similarity between the embeddings, but they should be rather different. The student model learns a local minimum that works well for Flux but does not extend to other diffusion models.
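If you want to measure this yourself, here is a minimal sketch that computes the mean per-token cosine similarity between two embedding tensors. The tensors below are random stand-ins; in practice they would be the encoder outputs of the original T5 and the distilled student for the same prompts (and if the two models use different hidden sizes, you would first need a learned projection to compare them).

```python
import torch
import torch.nn.functional as F

def mean_cosine_similarity(teacher_emb: torch.Tensor, student_emb: torch.Tensor) -> float:
    """Mean per-token cosine similarity between two (batch, seq, dim) embedding tensors."""
    return F.cosine_similarity(teacher_emb, student_emb, dim=-1).mean().item()

# Random stand-ins for the two models' encoder outputs (hypothetical shapes).
torch.manual_seed(0)
teacher = torch.randn(2, 8, 4096)
student = torch.randn(2, 8, 4096)

sim = mean_cosine_similarity(teacher, student)
```

For unrelated high-dimensional vectors this value sits near zero, while well-aligned embeddings push it toward 1, so it gives a quick sanity check on how close the student really is.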
Thank you so much for replying! I have many questions and I'm very interested.
Hope I'm not asking too many questions 😅
Can I fine-tune the student with this loss for a specific domain where I observed prompt adherence is not as good, while skipping the VLoss, since I have limited compute resources?
Loss = MSE(teacher_embeddings, student_embeddings) + λ * MSE(student_embeddings, original_student_embeddings)
where
teacher_embeddings: T5-XXL embeddings for furnishings prompts
student_embeddings: your distilled T5's embeddings for the same prompts
original_student_embeddings: embeddings from the original (pre-fine-tuning) student for the same prompts
λ: regularization weight
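The loss above can be sketched in a few lines of PyTorch. This is only an illustration of the proposed objective, not tested code; the embedding tensors and the λ value are placeholders, and in practice `original_student_emb` would come from a frozen copy of the student evaluated without gradients.

```python
import torch
import torch.nn.functional as F

def distill_loss(teacher_emb: torch.Tensor,
                 student_emb: torch.Tensor,
                 original_student_emb: torch.Tensor,
                 lam: float = 0.1) -> torch.Tensor:
    # Main term: pull the student toward the teacher on the domain prompts.
    match_term = F.mse_loss(student_emb, teacher_emb)
    # Regularizer: keep the student close to its original (frozen) outputs
    # to reduce the risk of drifting away from what Flux was trained against.
    reg_term = F.mse_loss(student_emb, original_student_emb)
    return match_term + lam * reg_term

# Toy usage with random stand-in embeddings.
torch.manual_seed(0)
teacher = torch.randn(2, 8, 4096)
student = torch.randn(2, 8, 4096)
frozen_student = torch.randn(2, 8, 4096)

loss = distill_loss(teacher, student, frozen_student, lam=0.1)
```

The λ knob trades off adapting to the new domain against staying compatible with the embeddings Flux already expects.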
You are welcome to ask any questions. I think the regularization term may help mitigate mode collapse, but it could also limit the student model’s ability to fully capture the capacity of T5-XXL. In addition, since T5 is trained using a cross-entropy loss, you may also consider experimenting with that objective.
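If you do experiment with a cross-entropy-style objective, one common form is soft-target distillation over logits, sketched below. Note this is an assumption on my part about what such an objective might look like here: it requires token-level logits (e.g. from a decoder or LM head), which an encoder-embedding-only setup may not expose, and the temperature `T` is a hypothetical knob.

```python
import torch
import torch.nn.functional as F

def soft_cross_entropy(student_logits: torch.Tensor,
                       teacher_logits: torch.Tensor,
                       T: float = 1.0) -> torch.Tensor:
    # Cross-entropy between the teacher's softened distribution (soft targets)
    # and the student's predicted distribution, averaged over positions.
    log_p = F.log_softmax(student_logits / T, dim=-1)
    q = F.softmax(teacher_logits / T, dim=-1)
    return -(q * log_p).sum(dim=-1).mean()

# Toy usage with random stand-in logits over a small vocabulary.
torch.manual_seed(0)
t_logits = torch.randn(2, 8, 32)
s_logits = torch.randn(2, 8, 32)

ce = soft_cross_entropy(s_logits, t_logits, T=2.0)
```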