---
language:
- zh
- en
tags:
- sentence-transformers
- sentence-similarity
- feature-extraction
- transformers
pipeline_tag: sentence-similarity
library_name: sentence-transformers
license: apache-2.0
---

This is the CodeR model trained on both the text-only data and the full code data.

## Usage

### Using FlagEmbedding

```
git clone https://github.com/FlagOpen/FlagEmbedding.git
cd FlagEmbedding
pip install -e .
```

```python
from FlagEmbedding import FlagLLMModel

queries = [
    "Delete the record with ID 4 from the 'Staff' table.",
    'Delete all records in the "Livestock" table where age is greater than 5'
]
documents = [
    "DELETE FROM Staff WHERE StaffID = 4;",
    "DELETE FROM Livestock WHERE age > 5;"
]

model = FlagLLMModel(
    'nebula2025/CodeR-synthetic',
    query_instruction_format="{}\n{}",
    query_instruction_for_retrieval="Given a question in text, retrieve SQL queries that are appropriate responses to the question.",
    trust_remote_code=True,
    use_fp16=True  # Setting use_fp16 to True speeds up computation at the cost of a slight performance degradation
)

embeddings_1 = model.encode_queries(queries)
embeddings_2 = model.encode_corpus(documents)
similarity = embeddings_1 @ embeddings_2.T
print(similarity)
```

By default, FlagLLMModel uses all available GPUs when encoding. Set `os.environ["CUDA_VISIBLE_DEVICES"]` to select specific GPUs, or set `os.environ["CUDA_VISIBLE_DEVICES"]=""` to make all GPUs unavailable.
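For example, a minimal sketch of restricting encoding to a single GPU (the device index `"0"` is only illustrative, and the variable must be set before the model is created):

```python
import os

# Illustrative: make only GPU 0 visible to the encoder.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"
# os.environ["CUDA_VISIBLE_DEVICES"] = ""  # uncomment to disable all GPUs

from FlagEmbedding import FlagLLMModel

model = FlagLLMModel(
    'nebula2025/CodeR-synthetic',
    query_instruction_format="{}\n{}",
    query_instruction_for_retrieval="Given a question in text, retrieve SQL queries that are appropriate responses to the question.",
    trust_remote_code=True,
    use_fp16=True
)
```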
### Using Sentence Transformers

```python
from sentence_transformers import SentenceTransformer
import torch

# Load the model, optionally in float16 precision for faster inference
model = SentenceTransformer(
    "nebula2025/CodeR-synthetic",
    model_kwargs={"torch_dtype": torch.float16, "trust_remote_code": True},
    tokenizer_kwargs={"trust_remote_code": True},
)

# Prepare a prompt given an instruction
instruction = 'Given a question in text, retrieve SQL queries that are appropriate responses to the question.'
prompt = f'{instruction}\n'

# Prepare queries and documents
queries = [
    "Delete the record with ID 4 from the 'Staff' table.",
    'Delete all records in the "Livestock" table where age is greater than 5'
]
documents = [
    "DELETE FROM Staff WHERE StaffID = 4;",
    "DELETE FROM Livestock WHERE age > 5;"
]

# Compute the query and document embeddings
query_embeddings = model.encode(queries, prompt=prompt)
document_embeddings = model.encode(documents)

# Compute the cosine similarity between the query and document embeddings
similarities = model.similarity(query_embeddings, document_embeddings)
print(similarities)
```

### Using HuggingFace Transformers

```python
import torch
import torch.nn.functional as F
from torch import Tensor
from transformers import AutoTokenizer, AutoModel


def last_token_pool(last_hidden_states: Tensor,
                    attention_mask: Tensor) -> Tensor:
    # If the batch is left-padded, the last token of every sequence is at the final position
    left_padding = (attention_mask[:, -1].sum() == attention_mask.shape[0])
    if left_padding:
        return last_hidden_states[:, -1]
    else:
        sequence_lengths = attention_mask.sum(dim=1) - 1
        batch_size = last_hidden_states.shape[0]
        return last_hidden_states[torch.arange(batch_size, device=last_hidden_states.device), sequence_lengths]


def get_detailed_instruct(task_description: str, query: str) -> str:
    return f'{task_description}\n{query}'


instruction = 'Given a question in text, retrieve SQL queries that are appropriate responses to the question.'

# Prepend the task instruction to each query; documents need no instruction
queries = [
    get_detailed_instruct(instruction, "Delete the record with ID 4 from the 'Staff' table."),
    get_detailed_instruct(instruction, 'Delete all records in the "Livestock" table where age is greater than 5')
]
documents = [
    "DELETE FROM Staff WHERE StaffID = 4;",
    "DELETE FROM Livestock WHERE age > 5;"
]
input_texts = queries + documents

tokenizer = AutoTokenizer.from_pretrained('nebula2025/CodeR-synthetic', trust_remote_code=True)
model = AutoModel.from_pretrained('nebula2025/CodeR-synthetic', trust_remote_code=True)
model.eval()

max_length = 4096

# Tokenize the input texts
batch_dict = tokenizer(input_texts, max_length=max_length, padding=True, truncation=True,
                       return_tensors='pt', pad_to_multiple_of=8)

with torch.no_grad():
    outputs = model(**batch_dict)
    embeddings = last_token_pool(outputs.last_hidden_state, batch_dict['attention_mask'])

# normalize embeddings
embeddings = F.normalize(embeddings, p=2, dim=1)
scores = (embeddings[:2] @ embeddings[2:].T) * 100
print(scores.tolist())
```
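The similarity matrix can be turned directly into a retrieval ranking. The snippet below is a small, illustrative follow-up (not part of the original example) that reuses `scores` and `documents` from the block above:

```python
import torch

# Illustrative only: rank the candidate SQL statements for each query by similarity score.
ranking = torch.argsort(scores, dim=1, descending=True).tolist()
for qi, order in enumerate(ranking):
    best = order[0]
    print(f"Query {qi}: best match -> {documents[best]} (score: {scores[qi, best].item():.2f})")
```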