Update README.md

---
license: mit
datasets:
- sentence-transformers/all-nli
- sentence-transformers/stsb
base_model:
- rootxhacker/arthemis-instruct
tags:
- bert
- embedding
---

# rootxhacker/arthemis-embedding

This is a text embedding model fine-tuned from **arthemislm-base** on the **all-nli-pair**, **all-nli-pair-class**, **all-nli-pair-score**, **all-nli-triplet**, **stsb**, **quora**, and **natural-questions** datasets. It maps sentences and paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

The **Arthemis Embedding** model is a 155.8M-parameter text embedding model that incorporates **Spiking Neural Networks (SNNs)** and **Liquid Time Constants (LTCs)** for enhanced temporal dynamics and semantic representation learning. This neuromorphic architecture provides distinctive advantages on classification tasks while maintaining competitive performance across a range of text understanding benchmarks.

On MTEB, this embedding model performs on par with jinaai/jina-embeddings-v2-base-en.

## Model Details

- **Model Type**: Text Embedding
- **Supported Languages**: English
- **Number of Parameters**: 155.8M
- **Context Length**: 1024 tokens
- **Embedding Dimension**: 768
- **Base Model**: arthemislm-base
- **Training Data**: all-nli-pair, all-nli-pair-class, all-nli-pair-score, all-nli-triplet, stsb, quora, natural-questions

### Architecture Features

- **Spiking Neural Networks** in attention mechanisms for temporal processing
- **Liquid Time Constants** in feed-forward layers for adaptive dynamics
- **12-layer transformer backbone** with neuromorphic enhancements
- **RoPE positional encoding** for sequence understanding
- **Surrogate gradient training** for differentiable spike computation
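
The spiking and surrogate-gradient pieces live in the model's custom code, but the underlying mechanism can be illustrated with a short, hypothetical sketch (not the model's actual implementation): a leaky integrate-and-fire activation emits a hard Heaviside spike in the forward pass and uses a smooth surrogate in the backward pass, which is what makes spike-based layers trainable with backpropagation.

```python
import torch
import torch.nn as nn


class SurrogateSpike(torch.autograd.Function):
    """Heaviside spike forward, fast-sigmoid surrogate gradient backward."""

    @staticmethod
    def forward(ctx, membrane, threshold):
        ctx.save_for_backward(membrane)
        ctx.threshold = threshold
        return (membrane >= threshold).float()

    @staticmethod
    def backward(ctx, grad_output):
        (membrane,) = ctx.saved_tensors
        # Gradient peaks at the threshold and decays smoothly away from it.
        surrogate = 1.0 / (1.0 + 10.0 * (membrane - ctx.threshold).abs()) ** 2
        return grad_output * surrogate, None


class LIFActivation(nn.Module):
    """Hypothetical leaky integrate-and-fire activation over the sequence dimension."""

    def __init__(self, threshold=1.0, decay=0.9):
        super().__init__()
        self.threshold = threshold
        self.decay = decay

    def forward(self, x):
        # x: (batch, seq_len, hidden); treat the sequence axis as "time".
        membrane = torch.zeros_like(x[:, 0])
        spikes = []
        for t in range(x.size(1)):
            membrane = self.decay * membrane + x[:, t]
            spike = SurrogateSpike.apply(membrane, self.threshold)
            membrane = membrane - spike * self.threshold  # soft reset after spiking
            spikes.append(spike)
        return torch.stack(spikes, dim=1)


lif = LIFActivation(threshold=1.0)          # threshold matches the spec below
print(lif(torch.randn(2, 8, 768)).shape)    # torch.Size([2, 8, 768])
```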

## Usage (Python)

Using this model with the custom implementation:

```python
# Load the model through the custom MTEBLlamaSNNLTCEncoder wrapper
from mteb_benchmark_snn_ltc import MTEBLlamaSNNLTCEncoder

model = MTEBLlamaSNNLTCEncoder('rootxhacker/arthemis-embedding')

# Encode sentences
sentences = ["This is an example sentence", "Each sentence is converted"]
embeddings = model.encode(sentences, task_name="similarity")

print(f"Embeddings shape: {embeddings.shape}")  # (2, 768)
print(f"Embedding dimension: {embeddings.shape[1]}")
```
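
Assuming `model.encode` returns a NumPy array (as the shape printout above suggests), pairwise cosine similarities can be obtained by L2-normalizing the embeddings and taking a dot product:

```python
import numpy as np

# L2-normalize each row; the dot product of unit vectors is the cosine similarity.
normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
similarity_matrix = normed @ normed.T
print(similarity_matrix.shape)  # (2, 2); diagonal entries are 1.0
```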

## Usage (Custom Implementation)

For direct usage with the neuromorphic architecture:

```python
from transformers import AutoTokenizer
from mteb_benchmark_snn_ltc import MTEBLlamaSNNLTCEncoder
from scipy.spatial.distance import cosine

# Initialize the tokenizer (optional: model.encode accepts raw strings, so this
# is only needed if you want to inspect or pre-process tokenization yourself)
tokenizer = AutoTokenizer.from_pretrained("microsoft/DialoGPT-small")
tokenizer.pad_token = tokenizer.eos_token

# Load the model
model = MTEBLlamaSNNLTCEncoder('rootxhacker/arthemis-embedding')

# Process text
sentences = ['This is an example sentence', 'Each sentence is converted']
embeddings = model.encode(sentences, task_name="embedding_task")

# Use the embeddings for similarity
similarity = 1 - cosine(embeddings[0], embeddings[1])
print(f"Cosine similarity: {similarity:.4f}")
```

## Evaluation

The model has been evaluated on 41 tasks from the **MTEB (Massive Text Embedding Benchmark)**.
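
Results of this kind are typically produced with the `mteb` package; a minimal sketch (assuming the `MTEBLlamaSNNLTCEncoder` wrapper exposes the `encode` interface that `mteb` expects) looks like this:

```python
from mteb import MTEB
from mteb_benchmark_snn_ltc import MTEBLlamaSNNLTCEncoder

model = MTEBLlamaSNNLTCEncoder('rootxhacker/arthemis-embedding')

# Run a single task as an example; the full benchmark covers many more.
evaluation = MTEB(tasks=["Banking77Classification"])
results = evaluation.run(model, output_folder="results")
print(results)
```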

### MTEB Performance

| Task Type | Average Score | Task Count | Best Individual Score |
|-----------|---------------|------------|------------------------|
| **Classification** | **42.78** | 8 | Amazon Counterfactual: 65.43 |
| **STS** | **39.96** | 8 | STS17: 58.48 |
| **Clustering** | **28.54** | 8 | ArXiv Hierarchical: 49.82 |
| **Retrieval** | **12.41** | 5 | Twitter URL: 53.78 |
| **Other** | **13.07** | 12 | Ask Ubuntu: 43.56 |

**Overall MTEB Score: 27.05** (across 41 tasks)

### Notable Individual Results

| Task | Score | Task Type |
|------|-------|-----------|
| Amazon Counterfactual Classification | 65.43 | Classification |
| STS17 | 58.48 | Semantic Similarity |
| Toxic Conversations Classification | 55.54 | Classification |
| IMDB Classification | 51.69 | Classification |
| SICK-R | 49.24 | Semantic Similarity |
| ArXiv Hierarchical Clustering | 49.82 | Clustering |
| Banking77 Classification | 29.98 | Classification |
| STSBenchmark | 36.82 | Semantic Similarity |

## Model Strengths

- **Classification Excellence**: Strongest task family, averaging 42.78 across eight classification tasks
- **Semantic Understanding**: Strong semantic textual similarity capabilities (39.96 average)
- **Neuromorphic Advantages**: The spiking neural architecture provides enhanced pattern recognition
- **Temporal Processing**: Liquid time constants enable adaptive sequence processing
- **Robust Embeddings**: 768-dimensional vectors capture rich semantic representations

## Applications

- **Text Classification**: Financial intent detection, sentiment analysis, content moderation
- **Semantic Search**: Document retrieval and similarity matching
- **Clustering**: Automatic text organization and topic discovery
- **Content Safety**: Toxic content detection and content moderation
- **Question Answering**: Similarity-based answer retrieval
- **Paraphrase Mining**: Finding semantically equivalent text pairs
- **Semantic Textual Similarity**: Measuring text similarity for various applications
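
As an illustration of the semantic-search use case, the sketch below reuses the `MTEBLlamaSNNLTCEncoder` wrapper from the usage examples and ranks a small corpus against a query by cosine similarity (the corpus, the query, and the `task_name` string are placeholders for this example):

```python
import numpy as np
from mteb_benchmark_snn_ltc import MTEBLlamaSNNLTCEncoder

model = MTEBLlamaSNNLTCEncoder('rootxhacker/arthemis-embedding')

corpus = [
    "How do I reset my online banking password?",
    "The weather in Paris is mild in spring.",
    "Steps to recover access to your bank account.",
]
query = "I forgot my bank login credentials"

# Embed the corpus and the query with the same model.
corpus_emb = model.encode(corpus, task_name="semantic_search")
query_emb = model.encode([query], task_name="semantic_search")[0]

# L2-normalize so dot products become cosine similarities, then rank.
corpus_emb = corpus_emb / np.linalg.norm(corpus_emb, axis=1, keepdims=True)
query_emb = query_emb / np.linalg.norm(query_emb)

scores = corpus_emb @ query_emb
for idx in np.argsort(-scores):
    print(f"{scores[idx]:.3f}  {corpus[idx]}")
```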

## Training Details

The model was fine-tuned from the **arthemislm-base** foundation model using multiple high-quality datasets:

- **all-nli-pair**: Natural Language Inference pair datasets
- **all-nli-pair-class**: Classification variants of NLI pairs
- **all-nli-pair-score**: Scored NLI pairs for similarity learning
- **all-nli-triplet**: Triplet learning from NLI data
- **stsb**: Semantic Textual Similarity Benchmark
- **quora**: Quora Question Pairs for paraphrase detection
- **natural-questions**: Google's Natural Questions dataset

The neuromorphic enhancements were integrated during training to provide:

- Spiking neuron dynamics in attention layers
- Liquid time constant adaptation in feed-forward networks
- Surrogate gradient optimization for spike-based learning
- Enhanced temporal pattern recognition capabilities
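
The card does not state the exact training objective. For pair and triplet data like all-nli, a common choice is an in-batch-negatives (multiple negatives ranking) contrastive loss; the sketch below is an illustration of that generic recipe in plain PyTorch, not a description of this model's actual training code.

```python
import torch
import torch.nn.functional as F


def in_batch_negatives_loss(anchor_emb, positive_emb, temperature=0.05):
    """Each anchor is pulled toward its positive; other positives in the batch act as negatives."""
    anchor = F.normalize(anchor_emb, dim=-1)
    positive = F.normalize(positive_emb, dim=-1)
    # (batch, batch) cosine-similarity matrix, scaled by a temperature.
    logits = anchor @ positive.T / temperature
    # The matching pair for row i sits on the diagonal.
    labels = torch.arange(logits.size(0), device=logits.device)
    return F.cross_entropy(logits, labels)


# Stand-in embeddings at the model's 768-dimensional size.
loss = in_batch_negatives_loss(torch.randn(16, 768), torch.randn(16, 768))
print(loss.item())
```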

## Technical Specifications

```
Architecture: Transformer with SNN/LTC enhancements
Hidden Size: 768
Intermediate Size: 2048
Attention Heads: 12
Layers: 12
Max Position Embeddings: 1024
Vocabulary Size: 50,257
Spiking Threshold: 1.0
LTC Hidden Size: 256
Training Precision: FP32
```
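
The LTC layers are part of the model's custom implementation; as a rough, hypothetical illustration of what a liquid-time-constant unit does (and how an `LTC Hidden Size` of 256 could sit next to the 768-dimensional hidden states), a discretized cell with an input-dependent time constant might look like this:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class LTCCell(nn.Module):
    """Hypothetical liquid-time-constant cell: the decay rate depends on the current input."""

    def __init__(self, input_size=768, hidden_size=256):
        super().__init__()
        self.tau_net = nn.Linear(input_size + hidden_size, hidden_size)
        self.signal_net = nn.Linear(input_size + hidden_size, hidden_size)

    def forward(self, x, h, dt=1.0):
        combined = torch.cat([x, h], dim=-1)
        # Input-dependent ("liquid") time constant, kept strictly positive.
        tau = F.softplus(self.tau_net(combined)) + 1e-3
        target = torch.tanh(self.signal_net(combined))
        # One Euler step of leaky integration toward the input-driven target.
        return h + dt * (target - h) / tau


cell = LTCCell()
h = torch.zeros(2, 256)
for _ in range(4):                       # unroll over a few "time" steps
    h = cell(torch.randn(2, 768), h)
print(h.shape)                           # torch.Size([2, 256])
```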

## Citation

```bibtex
@misc{arthemis-embedding-2024,
  title={Arthemis Embedding: A Neuromorphic Text Embedding Model},
  author={rootxhacker},
  year={2024},
  howpublished={\url{https://huggingface.co/rootxhacker/arthemis-embedding}}
}
```

## License

This model is released under the MIT license (see the `license: mit` metadata above); refer to the repository files for full licensing details.