FronyAI committed on
Commit ee4738e · verified · 1 Parent(s): 92f8f4f

Update README.md

Files changed (1)
  1. README.md +66 -102
README.md CHANGED
@@ -9,39 +9,73 @@ tags:
  - feature-extraction
  pipeline_tag: sentence-similarity
  library_name: sentence-transformers
  ---

- # FronyAI/frony-embed-medium-ko-v1

- This is a [sentence-transformers](https://www.SBERT.net) model trained. It maps sentences & paragraphs to a 1024-dimensional dense vector space and can be used for Retrieval.

  ## Model Details

  ### Model Description
  - **Model Type:** Sentence Transformer
- <!-- - **Base model:** [Unknown](https://huggingface.co/unknown) -->
  - **Maximum Sequence Length:** 512 tokens
- - **Output Dimensionality:** 1024 dimensions
  - **Similarity Function:** Cosine Similarity
- <!-- - **Training Dataset:** Unknown -->
  - **Languages:** ko, en
  - **License:** apache-2.0

- ### Model Sources
-
- - **Documentation:** [Sentence Transformers Documentation](https://sbert.net)
- - **Repository:** [Sentence Transformers on GitHub](https://github.com/UKPLab/sentence-transformers)
- - **Hugging Face:** [Sentence Transformers on Hugging Face](https://huggingface.co/models?library=sentence-transformers)
-
- ### Full Model Architecture
-
- ```
- SentenceTransformer(
-   (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: RobertaModel
-   (1): Pooling({'word_embedding_dimension': 1024, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
-   (2): Normalize()
- )
- ```

  ## Usage
@@ -58,88 +92,18 @@ Then you can load this model and run inference.
  from sentence_transformers import SentenceTransformer

  # Download from the 🤗 Hub
- model = SentenceTransformer("FronyAI/frony-embed-medium-ko-v1")
  # Run inference
- sentences = [
-     'The weather is lovely today.',
-     "It's so sunny outside!",
-     'He drove to the stadium.',
- ]
- embeddings = model.encode(sentences)
- print(embeddings.shape)
- # [3, 1024]
-
- # Get the similarity scores for the embeddings
- similarities = model.similarity(embeddings, embeddings)
- print(similarities.shape)
- # [3, 3]
- ```
-
- <!--
- ### Direct Usage (Transformers)
-
- <details><summary>Click to see the direct usage in Transformers</summary>
-
- </details>
- -->
-
- <!--
- ### Downstream Usage (Sentence Transformers)
-
- You can finetune this model on your own dataset.
-
- <details><summary>Click to expand</summary>
-
- </details>
- -->
-
- <!--
- ### Out-of-Scope Use
-
- *List how the model may foreseeably be misused and address what users ought not to do with the model.*
- -->
-
- <!--
- ## Bias, Risks and Limitations
-
- *What are the known or foreseeable issues stemming from this model? You could also flag here known failure cases or weaknesses of the model.*
- -->
-
- <!--
- ### Recommendations
-
- *What are recommendations with respect to the foreseeable issues? For example, filtering explicit content.*
- -->
-
- ## Training Details
-
- ### Framework Versions
- - Python: 3.10.16
- - Sentence Transformers: 4.0.2
- - Transformers: 4.47.1
- - PyTorch: 2.5.1+cu121
- - Accelerate: 1.2.1
- - Datasets: 2.21.0
- - Tokenizers: 0.21.0
-
- ## Citation
-
- ### BibTeX
-
- <!--
- ## Glossary
-
- *Clearly define terms in order to be accessible across audiences.*
- -->
-
- <!--
- ## Model Card Authors
-
- *Lists the people who create the model card, providing recognition and accountability for the detailed work that goes into its construction.*
- -->
-
- <!--
- ## Model Card Contact
-
- *Provides a way for people who have updates to the Model Card, suggestions, or questions, to contact the Model Card authors.*
- -->
  - feature-extraction
  pipeline_tag: sentence-similarity
  library_name: sentence-transformers
+ base_model:
+ - klue/roberta-large
  ---

+ # FronyAI Embedding (medium)
+ This is a lightweight and efficient embedding model designed specifically for Korean.<br>
+ It was trained on a diverse set of data sources, including **AI 허브** (AI Hub), to ensure robust performance across a wide range of retrieval tasks.<br>
+ The model demonstrates strong retrieval capability in three directions:<br>

+ * Korean–Korean
+ * Korean–English
+ * English–Korean
+
+ To support resource-constrained environments, the model is also compatible with Matryoshka embeddings, enabling retrieval at reduced dimensions **(e.g., half the original size)** without significant performance loss; a usage sketch follows below.<br>
+ All training and data preprocessing were performed on **a single GPU (46GB of VRAM)**, showcasing the model's efficiency as well as its effectiveness.<br>
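A minimal sketch of the half-dimension retrieval mentioned above, assuming only sentence-transformers' standard `truncate_dim` loading option; the snippet is illustrative, not from the card:

```python
from sentence_transformers import SentenceTransformer

# Matryoshka-style truncation: keep only the first 512 of 1024 dimensions.
# `truncate_dim` is a standard sentence-transformers loading option (v2.7+).
model = SentenceTransformer("FronyAI/frony-embed-medium-ko-v1", truncate_dim=512)

embeddings = model.encode(["<Q>안녕하세요"])  # "Hello"
print(embeddings.shape)  # (1, 512)
```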

  ## Model Details

  ### Model Description
  - **Model Type:** Sentence Transformer
+ - **Base Model:** klue/roberta-large
  - **Maximum Sequence Length:** 512 tokens
+ - **Output Dimensionality:** 1024 / 512 dimensions
  - **Similarity Function:** Cosine Similarity
  - **Languages:** ko, en
  - **License:** apache-2.0

+ ### Datasets
+ This model was trained on data from many sources, including **AI 허브** (AI Hub).<br>
+ In total, 100,000 query–document pairs were used for training.<br>
+
+ ### Training Details
+ The overall training process was designed with reference to **snowflake-arctic-embed 2.0**.<br>
+ Training was divided into two stages: pre-training and post-training.<br>
+ In the pre-training stage, the model was trained with in-batch negatives.<br>
+ In the post-training stage, we used the multilingual-e5-large model to mine hard negatives: for each query, the top 4 candidates whose similarity score fell below a **99% threshold**, which filters out likely false negatives.<br>
+ Given the increasing prevalence of LLM-generated content, we also converted existing data into Markdown-style passages to improve retrieval performance on such formats.<br>
+ The types of data augmentation applied are as follows (sketches of both training stages appear after the table):<br>
+
+ | Augmentation* | Description |
+ |---------------|-------------|
+ | Pair concatenation | Multi-query & multi-passage |
+ | Language transfer | Korean to English on query & passage |
+ | Style transfer | Plain sentences to Markdown description |
+
+ *Augmentation was carried out using Gemma-3-12B.
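A minimal sketch of the pre-training stage described above, assuming the standard in-batch-negatives objective in sentence-transformers (MultipleNegativesRankingLoss); the training pairs and hyperparameters here are invented for illustration:

```python
from torch.utils.data import DataLoader
from sentence_transformers import InputExample, SentenceTransformer, losses

# Base model named on the card; sentence-transformers wraps a plain
# Hugging Face checkpoint with mean pooling automatically.
model = SentenceTransformer("klue/roberta-large")

# Illustrative (query, positive passage) pairs using the card's prefixes.
train_examples = [
    InputExample(texts=["<Q>대한민국의 수도는?", "<P>서울은 대한민국의 수도이다."]),
    InputExample(texts=["<Q>물의 끓는점은?", "<P>물은 100도에서 끓는다."]),
]
loader = DataLoader(train_examples, shuffle=True, batch_size=2)

# In-batch negatives: every other passage in the batch acts as a negative.
loss = losses.MultipleNegativesRankingLoss(model)
model.fit(train_objectives=[(loader, loss)], epochs=1)
```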
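And a sketch of the post-training hard-negative mining step, under my reading that "top 4 below a 99% threshold" keeps, per query, the four highest-scoring non-gold passages scoring below 0.99 of the gold passage's score; the data is illustrative:

```python
import torch
from sentence_transformers import SentenceTransformer

# Mining model named on the card; "query:"/"passage:" prefixes are e5's convention.
miner = SentenceTransformer("intfloat/multilingual-e5-large")

query = ["query: 대한민국의 수도는?"]
passages = [
    "passage: 서울은 대한민국의 수도이다.",   # gold passage (index 0)
    "passage: 부산은 대한민국의 항구 도시이다.",
    "passage: 한강은 서울을 가로지른다.",
]

q = miner.encode(query, convert_to_tensor=True, normalize_embeddings=True)
p = miner.encode(passages, convert_to_tensor=True, normalize_embeddings=True)
scores = (q @ p.T).squeeze(0)  # cosine similarity of the query to each passage

keep = scores < 0.99 * scores[0]  # drop likely false negatives near the gold score
keep[0] = False                   # never sample the gold passage itself
ranked = scores.masked_fill(~keep, float("-inf"))
k = min(4, int(keep.sum()))       # top 4 remaining candidates become hard negatives
hard_negative_ids = ranked.topk(k).indices.tolist()
```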
+ ### Evaluation
+ The evaluation consists of five dataset groups; the table below reports the average retrieval performance across them.<br>
+ Three groups are subsets extracted from **AI 허브** (AI Hub) datasets.<br>
+ One group is based on a specific sports-regulation PDF, for which synthetic query and **Markdown-style passage** pairs were generated using GPT-4o-mini.<br>
+ The final group is a concatenation of the four aforementioned groups, providing a comprehensive mixed set.<br>
+
+ | Models | Open/Closed | Size | Accuracy@1 | Accuracy@3 | Accuracy@5 | Accuracy@10 |
+ |--------|-------------|------|------------|------------|------------|-------------|
+ | frony-embed-medium | **Open** | 337M | 0.6649 | **0.8040** | 0.8458 | 0.8876 |
+ | frony-embed-medium (half dim) | Open | 337M | 0.6520 | 0.7923 | 0.8361 | 0.8796 |
+ | frony-embed-small | Open | 111M | 0.6152 | 0.7616 | 0.8056 | 0.8559 |
+ | frony-embed-small (half dim) | Open | 111M | 0.5988 | 0.7478 | 0.7984 | 0.8461 |
+ | frony-embed-tiny | Open | 21M* | 0.5084 | 0.6757 | 0.7278 | 0.7845 |
+ | frony-embed-tiny (half dim) | Open | 21M* | 0.4710 | 0.6390 | 0.6933 | 0.7596 |
+ | bge-m3 | **Open** | 560M | 0.5852 | **0.7763** | 0.8418 | 0.8987 |
+ | multilingual-e5-large | Open | 560M | 0.5764 | 0.7630 | 0.8267 | 0.8891 |
+ | snowflake-arctic-embed-l-v2.0 | Open | 568M | 0.5726 | 0.7591 | 0.8232 | 0.8917 |
+ | jina-embeddings-v3 | Open | 572M | 0.5270 | 0.7246 | 0.7953 | 0.8649 |
+ | upstage-large | **Closed** | - | 0.6334 | **0.8527** | 0.9065 | 0.9478 |
+ | openai-text-embedding-3-large | Closed | - | 0.4907 | 0.6617 | 0.7311 | 0.8148 |
+
+ *Transformer blocks only.
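For reference, Accuracy@k is read here as the usual top-k hit rate: the fraction of queries whose gold passage appears among the k highest-ranked results. A toy illustration under that assumed definition, not the card's evaluation code:

```python
import numpy as np

def accuracy_at_k(ranked_ids, gold_ids, k):
    """Fraction of queries whose gold passage appears in the top-k results."""
    hits = [gold in ranking[:k] for ranking, gold in zip(ranked_ids, gold_ids)]
    return float(np.mean(hits))

# ranked_ids[i] lists passage ids sorted by similarity for query i (toy data).
ranked_ids = [[3, 1, 7], [2, 5, 0]]
gold_ids = [1, 9]
print(accuracy_at_k(ranked_ids, gold_ids, k=3))  # 0.5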

  ## Usage

  from sentence_transformers import SentenceTransformer

  # Download from the 🤗 Hub
+ model = SentenceTransformer("FronyAI/frony-embed-medium-ko-v1")
  # Run inference

+ # '<Q>' is the special prefix token for queries.
+ queries = [
+     '<Q>안녕하세요',  # "Hello"
+ ]
+ query_embeddings = model.encode(queries)

+ # '<P>' is the special prefix token for passages.
+ passages = [
+     '<P>반갑습니다',  # "Nice to meet you"
+ ]
+ passage_embeddings = model.encode(passages)
+ ```
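To score the passage against the query, one can use the model's built-in similarity method (cosine, per the Model Description above); a minimal follow-on sketch, not part of the commit:

```python
# Cosine similarity between query and passage embeddings.
scores = model.similarity(query_embeddings, passage_embeddings)
print(scores)  # tensor of shape (1, 1); higher means more relevant
```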