This is BiCA-Base, a SOTA dense retriever fine-tuned from thenlper/gte-base. It maps sentences and paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.
First install the Sentence Transformers library:
pip install -U sentence-transformers
Then you can load this model and run inference.
from sentence_transformers import SentenceTransformer
# Download from the 🤗 Hub
model = SentenceTransformer("bisectgroup/BiCA-base")
# Run inference
sentences = [
'Phylogenetic analysis of mitochondrial genes in Macquarie perch from three river basins',
'Genetic variation in mitochondrial genes could underlie metabolic adaptations because mitochondrially encoded proteins are directly involved in a pathway supplying energy to metabolism. Macquarie perch from river basins exposed to different climates differ in size and growth rate, suggesting potential presence of adaptive metabolic differences. We used complete mitochondrial genome sequences to build a phylogeny, estimate lineage divergence times and identify signatures of purifying and positive selection acting on mitochondrial genes for 25 Macquarie perch from three basins: Murray-Darling Basin (MDB), Hawkesbury-Nepean Basin (HNB) and Shoalhaven Basin (SB). Phylogenetic analysis resolved basin-level clades, supporting incipient speciation previously inferred from differentiation in allozymes, microsatellites and mitochondrial control region. The estimated time of lineage divergence suggested an early- to mid-Pleistocene split between SB and the common ancestor of HNB+MDB, followed by mid-to-late Pleistocene splitting between HNB and MDB. These divergence estimates are more recent than previous ones. Our analyses suggested that evolutionary drivers differed between inland MDB and coastal HNB. In the cooler and more climatically variable MDB, mitogenomes evolved under strong purifying selection, whereas in the warmer and more climatically stable HNB, purifying selection was relaxed. Evidence for relaxed selection in the HNB includes elevated transfer RNA and 16S ribosomal RNA polymorphism, presence of potentially mildly deleterious mutations and a codon (ATP6',
'An improved Bayesian method is presented for estimating phylogenetic trees using DNA sequence data. The birth-death process with species sampling is used to specify the prior distribution of phylogenies and ancestral speciation times, and the posterior probabilities of phylogenies are used to estimate the maximum posterior probability (MAP) tree. Monte Carlo integration is used to integrate over the ancestral speciation times for particular trees. A Markov Chain Monte Carlo method is used to generate the set of trees with the highest posterior probabilities. Methods are described for an empirical Bayesian analysis, in which estimates of the speciation and extinction rates are used in calculating the posterior probabilities, and a hierarchical Bayesian analysis, in which these parameters are removed from the model by an additional integration. The Markov Chain Monte Carlo method avoids the requirement of our earlier method for calculating MAP trees to sum over all possible topologies (which limited the number of taxa in an analysis to about five). The methods are applied to analyze DNA sequences for nine species of primates, and the MAP tree, which is identical to a maximum-likelihood estimate of topology, has a probability of approximately 95%.',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# (3, 768)
# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities)
# tensor([[1.0000, 0.9449, 0.8056],
# [0.9449, 1.0000, 0.7868],
# [0.8056, 0.7868, 1.0000]])
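The scores printed above are pairwise similarities between embeddings. The following is a minimal sketch of that computation, assuming cosine similarity (the Sentence Transformers default; this card does not state which similarity function the model is configured with). It uses toy 3-dimensional vectors in place of real 768-dimensional embeddings, and the helper names `cosine_similarity` and `similarity_matrix` are illustrative, not part of the library.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def similarity_matrix(vectors):
    """Pairwise cosine similarities; entry [i][j] compares vectors i and j."""
    return [[cosine_similarity(u, v) for v in vectors] for u in vectors]


# Toy vectors standing in for model embeddings (assumption: illustrative data only).
toy_embeddings = [
    [1.0, 0.0, 0.0],
    [1.0, 1.0, 0.0],
    [0.0, 0.0, 2.0],
]

matrix = similarity_matrix(toy_embeddings)
for row in matrix:
    print([round(score, 4) for score in row])
```

As in the tensor above, the matrix is symmetric and its diagonal is 1 up to floating-point rounding, because every vector is maximally similar to itself. Vector length does not affect the score, only direction.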
Non-default hyperparameters:
per_device_train_batch_size: 16
per_device_eval_batch_size: 16
num_train_epochs: 1
max_steps: 20
multi_dataset_batch_sampler: round_robin

All hyperparameters:
overwrite_output_dir: False
do_predict: False
eval_strategy: no
prediction_loss_only: True
per_device_train_batch_size: 16
per_device_eval_batch_size: 16
per_gpu_train_batch_size: None
per_gpu_eval_batch_size: None
gradient_accumulation_steps: 1
eval_accumulation_steps: None
torch_empty_cache_steps: None
learning_rate: 5e-05
weight_decay: 0.0
adam_beta1: 0.9
adam_beta2: 0.999
adam_epsilon: 1e-08
max_grad_norm: 1
num_train_epochs: 1
max_steps: 20
lr_scheduler_type: linear
lr_scheduler_kwargs: {}
warmup_ratio: 0.0
warmup_steps: 0
log_level: passive
log_level_replica: warning
log_on_each_node: True
logging_nan_inf_filter: True
save_safetensors: True
save_on_each_node: False
save_only_model: False
restore_callback_states_from_checkpoint: False
no_cuda: False
use_cpu: False
use_mps_device: False
seed: 42
data_seed: None
jit_mode_eval: False
use_ipex: False
bf16: False
fp16: False
fp16_opt_level: O1
half_precision_backend: auto
bf16_full_eval: False
fp16_full_eval: False
tf32: None
local_rank: 0
ddp_backend: None
tpu_num_cores: None
tpu_metrics_debug: False
debug: []
dataloader_drop_last: False
dataloader_num_workers: 0
dataloader_prefetch_factor: None
past_index: -1
disable_tqdm: False
remove_unused_columns: True
label_names: None
load_best_model_at_end: False
ignore_data_skip: False
fsdp: []
fsdp_min_num_params: 0
fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
fsdp_transformer_layer_cls_to_wrap: None
accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
deepspeed: None
label_smoothing_factor: 0.0
optim: adamw_torch
optim_args: None
adafactor: False
group_by_length: False
length_column_name: length
ddp_find_unused_parameters: None
ddp_bucket_cap_mb: None
ddp_broadcast_buffers: False
dataloader_pin_memory: True
dataloader_persistent_workers: False
skip_memory_metrics: True
use_legacy_prediction_loop: False
push_to_hub: False
resume_from_checkpoint: None
hub_model_id: None
hub_strategy: every_save
hub_private_repo: None
hub_always_push: False
gradient_checkpointing: False
gradient_checkpointing_kwargs: None
include_inputs_for_metrics: False
include_for_metrics: []
eval_do_concat_batches: True
fp16_backend: auto
push_to_hub_model_id: None
push_to_hub_organization: None
mp_parameters:
auto_find_batch_size: False
full_determinism: False
torchdynamo: None
ray_scope: last
ddp_timeout: 1800
torch_compile: False
torch_compile_backend: None
torch_compile_mode: None
include_tokens_per_second: False
include_num_input_tokens_seen: False
neftune_noise_alpha: None
optim_target_modules: None
batch_eval_metrics: False
eval_on_start: False
use_liger_kernel: False
eval_use_gather_object: False
average_tokens_across_devices: False
prompts: None
batch_sampler: batch_sampler
multi_dataset_batch_sampler: round_robin
router_mapping: {}
learning_rate_mapping: {}

@inproceedings{reimers-2019-sentence-bert,
title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
author = "Reimers, Nils and Gurevych, Iryna",
booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
month = "11",
year = "2019",
publisher = "Association for Computational Linguistics",
url = "https://arxiv.org/abs/1908.10084",
}
@misc{henderson2017efficient,
title={Efficient Natural Language Response Suggestion for Smart Reply},
author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
year={2017},
eprint={1705.00652},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
@misc{sinha2025bicaeffectivebiomedicaldense,
title={BiCA: Effective Biomedical Dense Retrieval with Citation-Aware Hard Negatives},
author={Aarush Sinha and Pavan Kumar S and Roshan Balaji and Nirav Pravinbhai Bhatt},
year={2025},
eprint={2511.08029},
archivePrefix={arXiv},
primaryClass={cs.IR},
url={https://arxiv.org/abs/2511.08029},
}
Base model: thenlper/gte-base
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("bisectgroup/BiCA-base")
sentences = [
    "Molecular phylogenetic resolution of the mega-diverse clade Apoditrysia",
    "In a previous study of higher-level arthropod phylogeny, analyses of nucleotide sequences from 62 protein-coding nuclear genes for 80 panarthopod species yielded significantly higher bootstrap support for selected nodes than did amino acids. This study investigates the cause of that discrepancy. The hypothesis is tested that failure to distinguish the serine residues encoded by two disjunct clusters of codons (TCN, AGY) in amino acid analyses leads to this discrepancy. In one test, the two clusters of serine codons (Ser1, Ser2) are conceptually translated as separate amino acids. Analysis of the resulting 21-amino-acid data matrix shows striking increases in bootstrap support, in some cases matching that in nucleotide analyses. In a second approach, nucleotide and 20-amino-acid data sets are artificially altered through targeted deletions, modifications, and replacements, revealing the pivotal contributions of distinct Ser1 and Ser2 codons. We confirm that previous methods of coding nonsynonymous nucleotide change are robust and computationally efficient by introducing two new degeneracy coding methods. We demonstrate for degeneracy coding that neither compositional heterogeneity at the level of nucleotides nor codon usage bias between Ser1 and Ser2 clusters of codons (or their separately coded amino acids) is a major source of non-phylogenetic signal. The incongruity in support between amino-acid and nucleotide analyses of the forementioned arthropod data set is resolved by showing that \"standard\" 20-amino-acid analyses yield lower node support specifically when serine provides crucial signal. Separate coding of Ser1 and Ser2 residues yields support commensurate with that found by degenerated nucleotides, without introducing phylogenetic artifacts. While exclusion of all serine data leads to reduced support for serine-sensitive nodes, these nodes are still recovered in the ML topology, indicating that the enhanced signal from Ser1 and Ser2 is not qualitatively different from that of the other amino acids.",
    "Recent molecular phylogenetic studies of the insect order Lepidoptera have robustly resolved family-level divergences within most superfamilies, and most divergences among the relatively species-poor early-arising superfamilies. In sharp contrast, relationships among the superfamilies of more advanced moths and butterflies that comprise the mega-diverse clade Apoditrysia (ca. 145,000 spp.) remain mostly poorly supported. This uncertainty, in turn, limits our ability to discern the origins, ages and evolutionary consequences of traits hypothesized to promote the spectacular diversification of Apoditrysia. Low support along the apoditrysian \"backbone\" probably reflects rapid diversification. If so, it may be feasible to strengthen resolution by radically increasing the gene sample, but case studies have been few. We explored the potential of next-generation sequencing to conclusively resolve apoditrysian relationships. We used transcriptome RNA-Seq to generate 1579 putatively orthologous gene sequences across a broad sample of 40 apoditrysians plus four outgroups, to which we added two taxa from previously published data. Phylogenetic analysis of a 46-taxon, 741-gene matrix, resulting from a strict filter that eliminated ortholog groups containing any apparent paralogs, yielded dramatic overall increase in bootstrap support for deeper nodes within Apoditrysia as compared to results from previous and concurrent 19-gene analyses. High support was restricted mainly to the huge subclade Obtectomera broadly defined, in which 11 of 12 nodes subtending multiple superfamilies had bootstrap support of 100%. The strongly supported nodes showed little conflict with groupings from previous studies, and were little affected by changes in taxon sampling, suggesting that they reflect true signal rather than artifacts of massive gene sampling. In contrast, strong support was seen at only 2 of 11 deeper nodes among the \"lower\", non-obtectomeran apoditrysians. These represent a much harder phylogenetic problem, for which one path to resolution might include further increase in gene sampling, together with improved orthology assignments. ",
    "One of the major challenges in cell implantation therapies is to promote integration of the microcirculation between the implanted cells and the host. We used adipose-derived stromal vascular fraction (SVF) cells to vascularize a human liver cell (HepG2) implant. We hypothesized that the SVF cells would form a functional microcirculation via vascular assembly and inosculation with the host vasculature. Initially, we assessed the extent and character of neovasculatures formed by freshly isolated and cultured SVF cells and found that freshly isolated cells have a higher vascularization potential. Generation of a 3D implant containing fresh SVF and HepG2 cells formed a tissue in which HepG2 cells were entwined with a network of microvessels. Implanted HepG2 cells sequestered labeled LDL delivered by systemic intravascular injection only in SVF-vascularized implants demonstrating that SVF cell-derived vasculatures can effectively integrate with host vessels and interface with parenchymal cells to form a functional tissue mimic. ",
]
embeddings = model.encode(sentences)
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# torch.Size([4, 4])
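In the example above, the first string appears to be a title used as a query and the remaining three are candidate abstracts. For retrieval, one would typically take the query's row of the similarity matrix, drop the self-similarity entry, and sort the rest. The sketch below shows only that ranking step in plain Python. The function `rank_documents` is a hypothetical helper, and the scores are made-up illustrative values, not output from this model.

```python
def rank_documents(query_scores, top_k=None):
    """Return (document_index, score) pairs sorted from most to least similar."""
    ranked = sorted(enumerate(query_scores), key=lambda pair: pair[1], reverse=True)
    return ranked if top_k is None else ranked[:top_k]


# Hypothetical query-to-document similarities (assumption: illustrative values only).
example_scores = [0.62, 0.91, 0.35]

top = rank_documents(example_scores, top_k=2)
for doc_index, score in top:
    print(f"document {doc_index}: {score:.2f}")
```

With real embeddings, `example_scores` would be replaced by the query's similarities to each document, for instance the first row of the similarity matrix without its first entry.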