---
license: openrail
language:
  - en
  - ja
  - de
  - zh
model-index:
  - name: Socrates-embedding
    results:
      - task:
          type: classification
          name: Multilingual Classification
        dataset:
          name: AmazonCounterfactualClassification
          type: mteb/amazon_counterfactual
        metrics:
          - name: Accuracy (Japanese)
            type: accuracy
            value: 54.83
          - name: Accuracy (German)
            type: accuracy
            value: 52.57
          - name: Accuracy (English)
            type: accuracy
            value: 49.7
          - name: Accuracy (English-Ext)
            type: accuracy
            value: 49.15
      - task:
          type: clustering
          name: Clustering
        dataset:
          name: StackExchangeClustering
          type: mteb/stackexchange_clustering
        metrics:
          - name: V-measure x 100
            type: v_measure
            value: 8.92
co2_footprint:
  emissions: 0.17
  source: >-
    Estimated based on 1.2 hours of training on a single NVIDIA RTX 6000 (TDP
    ~300W).
  training_type: from_scratch
  geographical_location: Zaozhuang, China
  hardware_used: 1 x NVIDIA RTX 6000
  training_duration: 1.2
---

# Model Card for Socrates-embedding

Note: Training was stopped after one epoch because we could not afford further AutoDL compute. The current checkpoint reflects training from scratch for 18,000 steps.


## Model Details

Socrates-embedding is a lightweight, high-density text embedding model. Unlike contemporary models that rely on massive parameter counts to brute-force semantic understanding, Socrates-embedding leverages Low-Rank Decay (LoRD) to achieve high-quality vector representations with minimal computational overhead.

This model is part of the Chunjiang Intelligence edge-computing initiative, aiming to bring retrieval-augmented generation (RAG) and semantic search capabilities to consumer-grade hardware.

- **Developed by:** Chunjiang Intelligence
- **Model Type:** Dual-Encoder Transformer
- **Languages:** English, Japanese, German, Chinese
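
A minimal usage sketch is shown below. It assumes the checkpoint can be loaded through sentence-transformers and uses an illustrative repo id; check the repository files for the actual loading path.

```python
# Minimal usage sketch. Assumptions: the checkpoint loads with sentence-transformers,
# and the repo id below is illustrative, not guaranteed.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("imbue2025/Socrates-embedding")  # illustrative repo id

sentences = [
    "How do I index my notes for semantic search?",
    "Der Akku meines Laptops hält nicht lange.",
    "ローカルデバイスで検索を実行する",
]
# Encode to 512-dim vectors (embedding_dim per the architecture table below)
embeddings = model.encode(sentences, normalize_embeddings=True)

# Cosine similarity between the first sentence and the rest
scores = util.cos_sim(embeddings[0], embeddings[1:])
print(scores)
```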

The model was evaluated on the AmazonCounterfactualClassification dataset across multiple languages.

| Language | Accuracy |
|---|---|
| Japanese (ja) | 54.83 |
| German (de) | 52.57 |
| English (en) | 49.70 |
| English-Ext (en-ext) | 49.15 |
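
These scores follow the MTEB evaluation protocol. The sketch below shows how such a run can be reproduced with the `mteb` harness; it assumes a sentence-transformers-compatible model, and the exact mteb version and language selection used for the reported numbers are not recorded here.

```python
# Sketch for reproducing the classification scores with the MTEB harness.
# Assumptions: the model exposes `encode()` (e.g. via sentence-transformers);
# the tasks-as-strings API varies slightly across mteb versions.
from mteb import MTEB
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("imbue2025/Socrates-embedding")  # illustrative repo id

evaluation = MTEB(tasks=["AmazonCounterfactualClassification"])
results = evaluation.run(model, output_folder="results/socrates-embedding")
print(results)
```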

To put the model's efficiency into perspective, we compare its single-task score on Japanese classification against the overall MTEB average scores of much larger models. (Our budget cannot cover the GPU bill for the remaining tests.)



Figure 1: Our 83M model's score on a single challenging task rivals the average performance of models up to 85x larger.


Clustering performance was evaluated using the V-measure score (multiplied by 100) on the StackExchangeClustering task.

We compared Socrates-embedding against other popular lightweight models (<110M params).

| Model | Parameters | Clustering Score (V-measure x 100) |
|---|---|---|
| Socrates-embedding | 83M | 8.92 🏆 |
| snowflake-arctic-embed-m | 109M | 7.25 |
| KartonBERT-USE-base-v1 | 104M | 6.93 |
| jina-embedding-s-en-v1 | 35M | 6.64 |
| all-MiniLM-L6-v2 | 23M | 6.62 |
- **Observation:** Our model achieves the highest clustering score in its weight class, demonstrating a superior vector space structure compared to established baselines.
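
For reference, the sketch below shows how a V-measure clustering score of this form is computed: k-means over the embeddings, then `v_measure_score` against the gold labels, scaled by 100. It is not the exact MTEB pipeline, which averages over multiple runs and splits.

```python
# Sketch of a "V-measure x 100" clustering score (simplified vs. the MTEB pipeline).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import v_measure_score

def clustering_score(embeddings: np.ndarray, labels: np.ndarray, seed: int = 0) -> float:
    """Cluster embeddings into as many clusters as there are gold labels,
    then compare the assignment against the gold labels."""
    n_clusters = len(set(labels.tolist()))
    preds = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit_predict(embeddings)
    return v_measure_score(labels, preds) * 100

# Toy example with random vectors; in real usage: embeddings = model.encode(texts)
rng = np.random.default_rng(0)
print(clustering_score(rng.normal(size=(200, 512)), rng.integers(0, 5, size=200)))
```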


Figure 2: Leading clustering performance among lightweight embedding models.


## Model Architecture

The model utilizes a custom Transformer Encoder architecture optimized for inference latency on Apple MPS and NVIDIA TensorRT backends.

| Parameter | Value |
|---|---|
| vocab_size | 50295 |
| hidden_size | 768 |
| embedding_dim | 512 |
| n_layer | 12 |
| n_head | 6 |
| n_kv_head | 2 |
| max_seq_len | 512 |
| pooling | Mean |
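
The table implies mask-aware mean pooling over the 768-dim hidden states followed by a projection down to the 512-dim embedding space. The sketch below illustrates such a head; the module names, projection layer, and final L2 normalization are assumptions for illustration, not the published implementation.

```python
# Sketch of the pooling head implied by the table above.
# Assumptions: layer names, the Linear projection, and the output normalization
# are illustrative; only hidden_size=768, embedding_dim=512, and mean pooling
# come from the card.
import torch
import torch.nn as nn

class MeanPoolingHead(nn.Module):
    def __init__(self, hidden_size: int = 768, embedding_dim: int = 512):
        super().__init__()
        self.proj = nn.Linear(hidden_size, embedding_dim)

    def forward(self, hidden_states: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, seq_len, hidden_size); attention_mask: (batch, seq_len)
        mask = attention_mask.unsqueeze(-1).to(hidden_states.dtype)
        summed = (hidden_states * mask).sum(dim=1)
        counts = mask.sum(dim=1).clamp(min=1e-6)
        pooled = summed / counts                       # mask-aware mean pooling
        return nn.functional.normalize(self.proj(pooled), dim=-1)

head = MeanPoolingHead()
emb = head(torch.randn(2, 16, 768), torch.ones(2, 16))
print(emb.shape)  # torch.Size([2, 512])
```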

## Model Size & Efficiency

| Metric | Value |
|---|---|
| Total Parameters | 83.23 M |
| Trainable Parameters | 83.23 M |
| Model File Size | 328.99 MB |
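
These totals can be checked directly from the downloaded checkpoint, for example with the sketch below (the filename `model.safetensors` is an assumption; adjust to the actual weights file in the repo).

```python
# Sketch for verifying the parameter count and tensor payload size above.
# Assumption: the weights file is named "model.safetensors".
from safetensors.torch import load_file

state_dict = load_file("model.safetensors")
total_params = sum(t.numel() for t in state_dict.values())
size_mb = sum(t.numel() * t.element_size() for t in state_dict.values()) / 1e6

print(f"Total parameters: {total_params / 1e6:.2f} M")
print(f"Tensor payload:   {size_mb:.2f} MB")
```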

## Environmental Impact & Carbon Footprint

At Chunjiang Intelligence, we are committed to sustainable AGI development.

The training of Socrates-embedding was conducted with extreme energy discipline.

- **Hardware:** 1 × NVIDIA RTX 6000
- **Training Duration:** 1.2 hours
- **Compute Region:** Zaozhuang, China
- **Total Energy Consumption:** ~0.36 kWh

### Carbon Emissions

- **Estimated CO₂ Emissions:** 0.17 kg (equivalent to driving a Tesla for 0.8 miles).
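
The estimate follows from simple arithmetic: ~300 W for 1.2 hours gives ~0.36 kWh, and the 0.17 kg figure implies a grid carbon intensity of roughly 0.47 kg CO₂e/kWh. That intensity is back-calculated from the numbers above, not a measured value for the local grid.

```python
# Back-of-the-envelope reproduction of the figures above.
# Assumption: GRID_KG_PER_KWH is back-calculated (0.17 kg / 0.36 kWh), not measured.
TDP_KW = 0.300            # ~300 W GPU TDP
HOURS = 1.2               # training duration
GRID_KG_PER_KWH = 0.47    # assumed grid carbon intensity

energy_kwh = TDP_KW * HOURS                  # ~0.36 kWh
emissions_kg = energy_kwh * GRID_KG_PER_KWH  # ~0.17 kg CO2e
print(f"{energy_kwh:.2f} kWh, {emissions_kg:.2f} kg CO2e")
```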

### Carbon Offset Strategy (Net Zero Achievement)

To strictly adhere to our carbon-neutral commitment, the following offset measures were implemented during the 1.2-hour training session:

- **Optical Energy Conservation:** All illumination devices within the laboratory (i.e., the bedroom) were deactivated. The researcher operated solely under the photon emission from the terminal display.
- **Biological Metabolism Suppression:** The lead researcher voluntarily reduced their respiratory frequency by approximately 15% during the backpropagation phase to minimize biological CO₂ exhalation.
- **Thermal Regulation:** No air conditioning was used; the ambient temperature was regulated solely by the waste heat generated by the GPU fan and the researcher's anxiety.

Based on these rigorous countermeasures, we certify this model as Carbon Negative by a margin of 0.02 grams.

## Intended Use

- **Semantic Search:** Efficiently indexing personal knowledge bases on local devices (see the sketch after this list).
- **RAG Pipelines:** Providing vector retrieval for Socrates-Nano generation.
- **Edge Deployment:** Running on mobile phones, Raspberry Pis, or browser-based WASM environments.
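
A minimal local semantic-search sketch for these use cases is given below; the repo id and corpus are illustrative, and any encode-compatible loader works.

```python
# Minimal local semantic-search sketch. Assumptions: the checkpoint loads with
# sentence-transformers; repo id and corpus are illustrative.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("imbue2025/Socrates-embedding")  # illustrative repo id

corpus = [
    "Meeting notes: ship the edge build by Friday.",
    "Recipe: tomato and egg stir-fry.",
    "GPU rental invoice for 1.2 hours on an RTX 6000.",
]
doc_vecs = model.encode(corpus, normalize_embeddings=True)

def search(query: str, top_k: int = 2):
    q = model.encode([query], normalize_embeddings=True)[0]
    scores = doc_vecs @ q                  # cosine similarity (vectors are normalized)
    best = np.argsort(-scores)[:top_k]
    return [(corpus[i], float(scores[i])) for i in best]

print(search("how much did training cost?"))
```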

## Out-of-Scope Use

- **Planetary-Scale Indexing:** Please do not use this model to index the entire internet; it's 83M parameters, have some mercy.
- **Heating:** This model is too efficient to generate significant heat during inference. If you are cold, please buy a heater or run LLaMA-70B instead.