FronyAI/frony-embed-tiny-ko-v1

@@ -1,41 +1,65 @@
----
-language:
-- ko
-- en
-license: apache-2.0
-tags:
-- sentence-transformers
-- sentence-similarity
-- feature-extraction
-pipeline_tag: sentence-similarity
-library_name: sentence-transformers
----
 # FronyAI Embedding (tiny)
 ## Model Details
 ### Model Description
 - **Model Type:** Sentence Transformer
 - **Base Model:** microsoft/Multilingual-MiniLM-L12-H384
-<!-- - **Base model:** [Unknown](https://huggingface.co/unknown) -->
 - **Maximum Sequence Length:** 512 tokens
 - **Output Dimensionality:** 384 / 192 dimensions
 - **Similarity Function:** Cosine Similarity
-<!-- - **Training Dataset:** Unknown -->
 - **Languages:** ko, en
 - **License:** apache-2.0
 ### Datasets
-This model is trained from many sources data including **AI 허브**.
-Total trained query and document pair is 100,000.
 ### Evaluation
-The evaluation consists of five dataset groups, and the results in the table represent the average retrieval performance across these five groups.
-Three groups are subsets extracted from **AI 허브** datasets.
-One group is based on a specific sports regulation PDF, for which synthetic query and **markdown-style passage** pairs were generated using GPT-4o-mini.
-The final group is a concatenation of all four aforementioned groups, providing a comprehensive mixed set.
-The following table presents the average retrieval performance across five dataset groups.
 | Models | Open/Closed | Size | Accuracy@1 | Accuracy@3 | Accuracy@5 | Accuracy@10 |
 |--------------|-----------|-----------|-----------|------------|------------|-------------|
@@ -43,16 +67,15 @@ The following table presents the average retrieval performance across five datas
 | frony-embed-medium (half dim) | Open | 337M | 0.6520 | 0.7923 | 0.8361 | 0.8796 |
 | frony-embed-small | Open | 111M | 0.6152 | 0.7616 | 0.8056 | 0.8559 |
 | frony-embed-small (half dim) | Open | 111M | 0.5988 | 0.7478 | 0.7984 | 0.8461 |
-| frony-embed-tiny | **Open** | 0.5084 | **0.6757** | 0.7278 | 0.7845 |
-| frony-embed-tiny (half dim) | Open | 0.4710 | 0.6390 | 0.6933 | 0.7596 |
-| bge-m3 | **Open** | 0.5852 | **0.7763** | 0.8418 | 0.8987 |
-| multilingual-e5-large | Open | 0.5764 | 0.7630 | 0.8267 | 0.8891 |
-| snowflake-arctic-embed-l-v2.0 | Open | 0.5726 | 0.7591 | 0.8232 | 0.8917 |
-| jina-embeddings-v3 | Open | 0.5270 | 0.7246 | 0.7953 | 0.8649 |
-| upstage-large | **Closed** | 0.6334 | **0.8527** | 0.9065 | 0.9478 |
-| openai-text-embedding-3-large | Closed | 0.4907 | 0.6617 | 0.7311 | 0.8148 |
-## Training
 ## Usage
@@ -71,86 +94,16 @@ from sentence_transformers import SentenceTransformer
 # Download from the 🤗 Hub
 model = SentenceTransformer("FronyAI/frony-embed-tiny-ko-v1")
 # Run inference
-sentences = [
-    'The weather is lovely today.',
-    "It's so sunny outside!",
-    'He drove to the stadium.',
-]
-embeddings = model.encode(sentences)
-print(embeddings.shape)
-# [3, 384]
-# Get the similarity scores for the embeddings
-similarities = model.similarity(embeddings, embeddings)
-print(similarities.shape)
-# [3, 3]
-```
-<!--
-### Direct Usage (Transformers)
-<details><summary>Click to see the direct usage in Transformers</summary>
-</details>
--->
-<!--
-### Downstream Usage (Sentence Transformers)
-You can finetune this model on your own dataset.
-<details><summary>Click to expand</summary>
-</details>
--->
-<!--
-### Out-of-Scope Use
-*List how the model may foreseeably be misused and address what users ought not to do with the model.*
--->
-<!--
-## Bias, Risks and Limitations
-*What are the known or foreseeable issues stemming from this model? You could also flag here known failure cases or weaknesses of the model.*
--->
-<!--
-### Recommendations
-*What are recommendations with respect to the foreseeable issues? For example, filtering explicit content.*
--->
-## Training Details
-### Framework Versions
-- Python: 3.10.16
-- Sentence Transformers: 4.0.2
-- Transformers: 4.47.1
-- PyTorch: 2.5.1+cu121
-- Accelerate: 1.2.1
-- Datasets: 2.21.0
-- Tokenizers: 0.21.0
-## Citation
-### BibTeX
-<!--
-## Glossary
-*Clearly define terms in order to be accessible across audiences.*
--->
-<!--
-## Model Card Authors
-*Lists the people who create the model card, providing recognition and accountability for the detailed work that goes into its construction.*
--->
-<!--
-## Model Card Contact
-*Provides a way for people who have updates to the Model Card, suggestions, or questions, to contact the Model Card authors.*
--->

+---
+language:
+- ko
+- en
+license: apache-2.0
+tags:
+- sentence-transformers
+- sentence-similarity
+- feature-extraction
+pipeline_tag: sentence-similarity
+library_name: sentence-transformers
+base_model:
+- microsoft/Multilingual-MiniLM-L12-H384
+---
 # FronyAI Embedding (tiny)
+This is a lightweight and efficient embedding model designed specifically for the Korean language.<br>
+It has been trained on a diverse set of data sources, including **AI 허브**, to ensure robust performance in a wide range of retrieval tasks.<br>
+The model demonstrates strong retrieval capabilities across:<br>
+* Korean–Korean
+* Korean–English
+* English–Korean
+To support resource-constrained environments, the model also supports Matryoshka Embeddings, enabling retrieval at reduced dimensions **(e.g., half of the original size)** with only a modest performance loss (Accuracy@1 of 0.4710 at half size versus 0.5084 at full size in the evaluation below).<br>
+All training and data preprocessing were performed on **a single GPU (46 GB VRAM)**, showcasing not only the model’s effectiveness but also its efficiency.<br>
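Illustrative sketch (not part of the model card diff): the reduced-dimension retrieval described above amounts to keeping the leading dimensions of an embedding and re-normalizing before comparison. The snippet assumes the conventional Matryoshka layout, in which the first dimensions carry the truncated representation. The helper names `truncate_embedding` and `cosine_similarity` are hypothetical, and the toy 4-dimensional vectors stand in for the model's 384-dimensional output.

```python
import math


def truncate_embedding(vector, dim):
    """Keep the first `dim` values and rescale the result to unit length."""
    head = list(vector[:dim])
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return head
    return [x / norm for x in head]


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy vectors standing in for 384-dimensional model outputs (assumption).
query = [0.6, 0.8, 0.1, -0.2]
passage = [0.5, 0.9, -0.3, 0.4]

# Half-size embeddings: 2 of 4 dimensions here, 192 of 384 for the model.
half_query = truncate_embedding(query, 2)
half_passage = truncate_embedding(passage, 2)
half_score = cosine_similarity(half_query, half_passage)
```

With the Sentence Transformers library, the truncation itself can usually be delegated to the library by passing `truncate_dim=192` when constructing `SentenceTransformer`, rather than slicing by hand.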
 ## Model Details
 ### Model Description
 - **Model Type:** Sentence Transformer
 - **Base Model:** microsoft/Multilingual-MiniLM-L12-H384
 - **Maximum Sequence Length:** 512 tokens
 - **Output Dimensionality:** 384 / 192 dimensions
 - **Similarity Function:** Cosine Similarity
 - **Languages:** ko, en
 - **License:** apache-2.0
 ### Datasets
+This model was trained on data from many sources, including **AI 허브**.<br>
+The training set contains 100,000 query-document pairs in total.<br>
+### Training Details
+The overall training process was conducted with reference to **snowflake-arctic-2.0**.<br>
+Training was divided into two stages: Pre-training and Post-training.<br>
+In the pre-training stage, the model was trained using in-batch negatives.<br>
+In the post-training stage, we used the multilingual-e5-large model to mine hard negatives: for each query, the top 4 samples whose similarity score fell below a **99% threshold**.<br>
+Given the increasing prevalence of LLM-generated content, we also converted existing data into Markdown-style passages to improve retrieval performance on such formats.<br>
+The types of data augmentation applied are as follows:<br>
+| Augmentation* | Description |
+|-----------|-----------|
+| Pair concatenation | Multi-query & Multi-passage |
+| Language transfer | Korean to English on query & passage |
+| Style transfer | Plain sentences to Markdown description |
+**Augmentation was carried out using Gemma-3-12B.*
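Illustrative sketch (not part of the model card diff): the hard-negative step described above could look roughly like the following. This is an interpretation rather than the authors' code. It ASSUMES the 99% threshold is relative to the positive passage's score, so a candidate qualifies only if its similarity is below 0.99 times that of the positive; the card does not state whether the threshold is relative or absolute. The function name `select_hard_negatives` and all scores are hypothetical.

```python
def select_hard_negatives(scores, positive_id, ratio=0.99, top_k=4):
    """Pick the top_k highest-scoring candidates that fall below the cutoff.

    scores: dict mapping passage id -> similarity to the query.
    ASSUMPTION: cutoff = ratio * (similarity of the positive passage).
    """
    cutoff = ratio * scores[positive_id]
    candidates = [
        (pid, score)
        for pid, score in scores.items()
        if pid != positive_id and score < cutoff
    ]
    # Highest similarity first: these are the hardest negatives.
    candidates.sort(key=lambda item: item[1], reverse=True)
    return [pid for pid, _ in candidates[:top_k]]


# Hypothetical teacher-model similarities for a single query.
scores = {
    "pos": 0.90,
    "near_dup": 0.895,  # above 0.99 * 0.90, so treated as a likely false negative
    "n1": 0.85,
    "n2": 0.80,
    "n3": 0.75,
    "n4": 0.70,
    "n5": 0.40,
}
hard_negatives = select_hard_negatives(scores, "pos")
```

The cutoff exists to skip candidates that score almost as high as the positive, since those are often unlabeled positives rather than true negatives.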
 ### Evaluation
+The evaluation consists of five dataset groups.<br>
+Three groups are subsets extracted from **AI 허브** datasets.<br>
+One group is based on a specific sports regulation PDF, for which synthetic query and **Markdown-style passage** pairs were generated using GPT-4o-mini.<br>
+The final group is a concatenation of the four groups above, providing a comprehensive mixed set.<br>
+The following table presents the average retrieval performance across these five dataset groups.<br>
 | Models | Open/Closed | Size | Accuracy@1 | Accuracy@3 | Accuracy@5 | Accuracy@10 |
 |--------------|-----------|-----------|-----------|------------|------------|-------------|
 | frony-embed-medium (half dim) | Open | 337M | 0.6520 | 0.7923 | 0.8361 | 0.8796 |
 | frony-embed-small | Open | 111M | 0.6152 | 0.7616 | 0.8056 | 0.8559 |
 | frony-embed-small (half dim) | Open | 111M | 0.5988 | 0.7478 | 0.7984 | 0.8461 |
+| frony-embed-tiny | **Open** | 21M* | 0.5084 | **0.6757** | 0.7278 | 0.7845 |
+| frony-embed-tiny (half dim) | Open | 21M* | 0.4710 | 0.6390 | 0.6933 | 0.7596 |
+| bge-m3 | **Open** | 560M | 0.5852 | **0.7763** | 0.8418 | 0.8987 |
+| multilingual-e5-large | Open | 560M | 0.5764 | 0.7630 | 0.8267 | 0.8891 |
+| snowflake-arctic-embed-l-v2.0 | Open | 568M | 0.5726 | 0.7591 | 0.8232 | 0.8917 |
+| jina-embeddings-v3 | Open | 572M | 0.5270 | 0.7246 | 0.7953 | 0.8649 |
+| upstage-large | **Closed** | - | 0.6334 | **0.8527** | 0.9065 | 0.9478 |
+| openai-text-embedding-3-large | Closed | - | 0.4907 | 0.6617 | 0.7311 | 0.8148 |
+**Transformer blocks only*
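Illustrative sketch (not part of the model card diff): Accuracy@k is commonly defined as the fraction of queries whose relevant passage appears among the top k retrieved results. The snippet below computes it under that common definition, assuming exactly one relevant passage per query. The card does not include the evaluation script, so the function name `accuracy_at_k` and the toy rankings are illustrative only.

```python
def accuracy_at_k(rankings, relevant, k):
    """Fraction of queries whose relevant passage is within the top k.

    rankings: list of ranked passage-id lists, one per query.
    relevant: list holding the relevant passage id for each query.
    """
    if not rankings:
        return 0.0
    hits = sum(
        1 for ranked, gold in zip(rankings, relevant) if gold in ranked[:k]
    )
    return hits / len(rankings)


# Toy example: three queries, each with a ranked list of passage ids.
rankings = [
    ["p1", "p7", "p3"],
    ["p4", "p2", "p9"],
    ["p8", "p6", "p5"],
]
relevant = ["p1", "p2", "p5"]

acc_at_1 = accuracy_at_k(rankings, relevant, 1)
acc_at_3 = accuracy_at_k(rankings, relevant, 3)
```

Because a hit at rank 1 is also a hit at every larger k, the metric can only stay level or rise as k grows, which matches the left-to-right pattern in each table row.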
 ## Usage
 # Download from the 🤗 Hub
 model = SentenceTransformer("FronyAI/frony-embed-tiny-ko-v1")
 # Run inference
+# '<Q>' is the special token that marks a query.
+queries = [
+    '<Q>안녕하세요',  # "Hello"
+]
+query_embeddings = model.encode(queries)
+# '<P>' is the special token that marks a passage.
+passages = [
+    '<P>반갑습니다',  # "Nice to meet you"
+]
+passage_embeddings = model.encode(passages)
+```
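Illustrative sketch (not part of the model card diff): the usage snippet stops at encoding, so the following shows one way the `<Q>` and `<P>` prefixes and the resulting vectors might be combined for retrieval. The helpers `with_prefix` and `rank_passages` are hypothetical, the second passage is an added sample text, and the toy 3-dimensional vectors stand in for the output of `model.encode`, so the block runs without downloading the model.

```python
import math

QUERY_PREFIX = "<Q>"
PASSAGE_PREFIX = "<P>"


def with_prefix(texts, prefix):
    """Prepend the special token to each text unless it is already there."""
    return [t if t.startswith(prefix) else prefix + t for t in texts]


def rank_passages(query_vec, passage_vecs):
    """Return passage indices ordered by descending cosine similarity."""
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        return dot / (math.hypot(*a) * math.hypot(*b))

    scores = [cosine(query_vec, p) for p in passage_vecs]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


queries = with_prefix(["안녕하세요"], QUERY_PREFIX)  # "Hello"
# "Nice to meet you" and "Thank you"; the second already carries the prefix.
passages = with_prefix(["반갑습니다", "<P>감사합니다"], PASSAGE_PREFIX)

# Toy vectors standing in for model.encode(queries)[0] and model.encode(passages).
query_vec = [0.9, 0.1, 0.0]
passage_vecs = [[0.8, 0.2, 0.1], [0.0, 0.3, 0.9]]
order = rank_passages(query_vec, passage_vecs)
```

With the real model, `query_vec` and `passage_vecs` would be replaced by the arrays returned from `model.encode(queries)` and `model.encode(passages)`.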