davidldahl committed on
Commit c417f72 · verified · 1 Parent(s): 4fd9bfd

Update README with 512 token model information

Files changed (1)
  1. README.md +15 -98
README.md CHANGED
@@ -1,109 +1,26 @@
- ---
- license: apache-2.0
- tags:
- - coreml
- - sentence-embeddings
- - multilingual
- - ios
- - macos
- - sentence-transformers
- language:
- - multilingual
- - en
- - de
- - fr
- - es
- - it
- - pt
- - nl
- - pl
- - ru
- - zh
- - ja
- - ko
- - ar
- - tr
- library_name: coreml
- pipeline_tag: sentence-similarity
- ---
+ # Contex.st Multilingual Embeddings
 
- # Contex.st Multilingual Embeddings (CoreML)
-
- This repository contains CoreML-converted versions of popular multilingual sentence embedding models for use in iOS and macOS applications.
+ CoreML models for multilingual text embeddings in iOS apps.
 
  ## Models
 
- ### 1. Paraphrase Multilingual MiniLM L12 v2
- - **Original model**: [sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2](https://huggingface.co/sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2)
- - **File**: `sentence_transformers_paraphrase_multilingual_MiniLM_L12_v2.mlmodel`
- - **Size**: 447.6 MB
- - **Dimensions**: 384
- - **Languages**: 50+ languages
-
- ### 2. DistilUSE Base Multilingual Cased
- - **Original model**: [sentence-transformers/distiluse-base-multilingual-cased](https://huggingface.co/sentence-transformers/distiluse-base-multilingual-cased)
- - **File**: `sentence_transformers_distiluse_base_multilingual_cased.mlmodel`
- - **Size**: 512.8 MB
- - **Dimensions**: 512
- - **Languages**: 15 languages
-
- ## Usage
-
- These models are designed for use in the [Contex.st](https://contex.st) iOS app but can be used in any iOS/macOS application that supports CoreML.
-
- ### iOS/macOS Integration
-
- ```swift
- import CoreML
-
- // Load the model
- let modelURL = // Path to downloaded .mlmodel file
- let model = try MLModel(contentsOf: modelURL)
-
- // Prepare input
- let input = // Tokenized text as MLMultiArray
+ ### 512 Token Versions (RECOMMENDED)
+ These models support the full 512 token context window for high-quality embeddings:
 
- // Get embeddings
- let output = try model.prediction(from: input)
- ```
+ - `paraphrase-multilingual-MiniLM-L12-v2-512tokens.mlmodel` - 384 dimensions, ~449 MB
+ - `distiluse-base-multilingual-cased-512tokens.mlmodel` - 768 dimensions, ~514 MB
 
- ## Model Details
+ ### Legacy 32 Token Versions (NOT RECOMMENDED)
+ These models only support 32 tokens and will produce lower quality embeddings:
 
- ### Conversion Process
+ - `sentence_transformers_paraphrase_multilingual_MiniLM_L12_v2.mlmodel` - 32 tokens only
+ - `sentence_transformers_distiluse_base_multilingual_cased.mlmodel` - 32 tokens only
 
- These models were converted from PyTorch to CoreML format using:
- - Python 3.13
- - PyTorch 2.7.1
- - CoreMLTools
- - Sentence Transformers
-
- The conversion maintains the original model architecture while optimizing for Apple devices.
-
- ### Performance
-
- - Optimized for Apple Neural Engine (ANE)
- - Support for CPU fallback
- - Batch processing capable
- - Real-time inference on modern iOS devices
-
- ## License
-
- These converted models maintain the original Apache 2.0 license from the source models.
-
- ## Citation
-
- If you use these models, please cite the original sentence-transformers work:
+ ## Usage
 
- ```bibtex
- @inproceedings{reimers-2019-sentence-bert,
- title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
- author = "Reimers, Nils and Gurevych, Iryna",
- booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
- year = "2019",
- publisher = "Association for Computational Linguistics",
- }
- ```
+ Use the 512 token versions for production. The 32 token versions are kept for backward compatibility only.
 
- ## Contact
+ ## Source Models
 
- For issues or questions about these CoreML conversions, please open an issue in this repository.
+ - [sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2](https://huggingface.co/sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2)
+ - [sentence-transformers/distiluse-base-multilingual-cased](https://huggingface.co/sentence-transformers/distiluse-base-multilingual-cased)