Indeed. If you want longer text, I would split it into 64-token chunks (perhaps even overlapping), embed each one separately, and then either average the resulting embeddings or, depending on your use case, take the dot product of each one with the image embedding and compute the maximum or average score.
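Here is a rough sketch of that idea in Python with NumPy. It only covers the chunking and pooling; the embedding step is left out because it depends on how you run the model. The function names, the 16-token overlap, and the choice to L2-normalize before pooling are my own assumptions for illustration, not anything SigLIP prescribes, and the code assumes no embedding is all zeros.

```python
import numpy as np


def chunk_tokens(token_ids, size=64, overlap=16):
    """Split a token-id list into windows of at most `size` tokens.

    Consecutive windows share `overlap` tokens (an arbitrary default).
    """
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    step = size - overlap
    chunks = []
    for start in range(0, len(token_ids), step):
        chunks.append(token_ids[start:start + size])
        if start + size >= len(token_ids):
            break  # this window already reaches the end of the text
    return chunks


def l2_normalize(x):
    """Scale vectors to unit length along the last axis."""
    x = np.asarray(x, dtype=np.float64)
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def mean_pooled_embedding(chunk_embs):
    """Option 1: average the per-chunk embeddings into one text vector."""
    return l2_normalize(l2_normalize(chunk_embs).mean(axis=0))


def score_against_image(chunk_embs, image_emb, reduce="max"):
    """Option 2: dot each chunk with the image embedding, then reduce."""
    sims = l2_normalize(chunk_embs) @ l2_normalize(image_emb)
    return float(sims.max() if reduce == "max" else sims.mean())
```

You would tokenize the long text, pass it through `chunk_tokens`, embed each chunk with the text tower, and stack the results into an array of shape `(num_chunks, dim)` before calling either pooling function. Max tends to suit "does any part of the text match this image", while mean suits "does the text as a whole match".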
Actually, I'm curious what kinds of queries longer than 64 tokens you're dealing with. Almost every SigLIP use case that comes to mind falls well below 64 tokens.