Nice work & Recommendation

#1
by tomaarsen - opened

Hello!

I've just read through your model card, and this looks very impressive. I'm always glad to see low-resource languages get dedicated models!

I also have a recommendation: with modern sentence-transformers versions, you can use model.encode_query and model.encode_document, which automatically use the "query" and "document" prompt names. This means that users can do this instead:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("LocalDoc/LocRet-small")

queries = ["Azərbaycanın paytaxtı hansı şəhərdir?"]
passages = [
    "Bakı Azərbaycan Respublikasının paytaxtı və ən böyük şəhəridir.",
    "Gəncə Azərbaycanın ikinci böyük şəhəridir.",
]

query_embeddings = model.encode_query(queries)
passage_embeddings = model.encode_document(passages)

similarities = model.similarity(query_embeddings, passage_embeddings)
print(similarities)

Once you update these lines (https://huggingface.co/LocalDoc/LocRet-small/blob/main/config_sentence_transformers.json#L8-L11) to:

  "prompts": {
    "query": "query: ",
    "document": "passage: "
  },

Optionally, you can also set default_prompt_name to "document", which means that "passage: " will always be prepended if the user calls model.encode without a prompt or prompt_name. There are some more details here: https://sbert.net/examples/sentence_transformer/applications/computing-embeddings/README.html#prompt-templates
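To make the config change concrete, here is a minimal sketch in plain Python of the suggested settings and the prefix each call would end up with. The helper names update_config and resolve_prefix are hypothetical, the starting config is an assumed stand-in for the real file, and resolve_prefix only illustrates the lookup described above; it is not the library's implementation.

```python
import json
from typing import Optional

# Prompt settings quoted from the suggestion above.
PROMPTS = {"query": "query: ", "document": "passage: "}


def update_config(config: dict, default_prompt_name: Optional[str] = "document") -> dict:
    """Return a copy of the config with the suggested prompt settings applied."""
    updated = dict(config)
    updated["prompts"] = dict(PROMPTS)
    updated["default_prompt_name"] = default_prompt_name
    return updated


def resolve_prefix(config: dict, prompt_name: Optional[str] = None) -> str:
    """Illustration only: choose the prefix that gets prepended to each text."""
    name = prompt_name or config.get("default_prompt_name")
    if name is None:
        return ""
    return config["prompts"][name]


# Assumed starting point; the real config_sentence_transformers.json has
# other keys, which this sketch leaves untouched.
original = {"prompts": {}, "default_prompt_name": None, "similarity_fn_name": "cosine"}
config = update_config(original)

print(json.dumps(config, indent=2, ensure_ascii=False))
print(repr(resolve_prefix(config, "query")))  # explicit prompt name
print(repr(resolve_prefix(config)))           # falls back to default_prompt_name
```

With default_prompt_name left as None, the last call would return an empty string, which corresponds to plain model.encode adding no prefix.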

  • Tom Aarsen
LocalDoc org

Hi Tom,

Thank you for the kind words and the excellent recommendation.
I've updated config_sentence_transformers.json with the query and document prompt names, and set default_prompt_name to "document".
The model card usage example is also updated to use encode_query / encode_document.
