FronyAI committed
Commit 73dce49 · verified · 1 Parent(s): d649a20

Update README.md

Files changed (1): README.md +78 -102
README.md CHANGED
@@ -1,4 +1,7 @@
  ---
  license: apache-2.0
  tags:
  - sentence-transformers
@@ -6,39 +9,78 @@ tags:
  - feature-extraction
  pipeline_tag: sentence-similarity
  library_name: sentence-transformers
  ---

- # FronyAI/frony-embed-medium-ko-v2

- This is a [sentence-transformers](https://www.SBERT.net) model trained. It maps sentences & paragraphs to a 1024-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

  ## Model Details

  ### Model Description
  - **Model Type:** Sentence Transformer
- <!-- - **Base model:** [Unknown](https://huggingface.co/unknown) -->
  - **Maximum Sequence Length:** 512 tokens
- - **Output Dimensionality:** 1024 dimensions
  - **Similarity Function:** Cosine Similarity
- <!-- - **Training Dataset:** Unknown -->
- <!-- - **Language:** Unknown -->
  - **License:** apache-2.0

- ### Model Sources
-
- - **Documentation:** [Sentence Transformers Documentation](https://sbert.net)
- - **Repository:** [Sentence Transformers on GitHub](https://github.com/UKPLab/sentence-transformers)
- - **Hugging Face:** [Sentence Transformers on Hugging Face](https://huggingface.co/models?library=sentence-transformers)
-
- ### Full Model Architecture
-
- ```
- SentenceTransformer(
-   (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: RobertaModel
-   (1): Pooling({'word_embedding_dimension': 1024, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
-   (2): Normalize()
- )
- ```

  ## Usage
 
@@ -55,88 +97,22 @@ Then you can load this model and run inference.
  from sentence_transformers import SentenceTransformer

  # Download from the 🤗 Hub
- model = SentenceTransformer("FronyAI/frony-embed-medium-ko-v2")
  # Run inference
- sentences = [
-     'The weather is lovely today.',
-     "It's so sunny outside!",
-     'He drove to the stadium.',
- ]
- embeddings = model.encode(sentences)
- print(embeddings.shape)
- # [3, 1024]
-
- # Get the similarity scores for the embeddings
- similarities = model.similarity(embeddings, embeddings)
- print(similarities.shape)
- # [3, 3]
- ```
-
- <!--
- ### Direct Usage (Transformers)
-
- <details><summary>Click to see the direct usage in Transformers</summary>
-
- </details>
- -->
-
- <!--
- ### Downstream Usage (Sentence Transformers)
-
- You can finetune this model on your own dataset.
-
- <details><summary>Click to expand</summary>
-
- </details>
- -->
-
- <!--
- ### Out-of-Scope Use
-
- *List how the model may foreseeably be misused and address what users ought not to do with the model.*
- -->
-
- <!--
- ## Bias, Risks and Limitations
-
- *What are the known or foreseeable issues stemming from this model? You could also flag here known failure cases or weaknesses of the model.*
- -->
-
- <!--
- ### Recommendations
-
- *What are recommendations with respect to the foreseeable issues? For example, filtering explicit content.*
- -->
-
- ## Training Details
-
- ### Framework Versions
- - Python: 3.10.16
- - Sentence Transformers: 4.0.2
- - Transformers: 4.47.1
- - PyTorch: 2.5.1+cu121
- - Accelerate: 1.2.1
- - Datasets: 2.21.0
- - Tokenizers: 0.21.0
-
- ## Citation
-
- ### BibTeX
-
- <!--
- ## Glossary
-
- *Clearly define terms in order to be accessible across audiences.*
- -->
-
- <!--
- ## Model Card Authors
-
- *Lists the people who create the model card, providing recognition and accountability for the detailed work that goes into its construction.*
- -->
-
- <!--
- ## Model Card Contact
-
- *Provides a way for people who have updates to the Model Card, suggestions, or questions, to contact the Model Card authors.*
- -->
  ---
+ language:
+ - ko
+ - en
  license: apache-2.0
  tags:
  - sentence-transformers
  - feature-extraction
  pipeline_tag: sentence-similarity
  library_name: sentence-transformers
+ base_model:
+ - klue/roberta-large
  ---

+ # Frony Embed V2 (medium)
+ This is an efficient embedding model designed specifically for the Korean language.
+ It has been trained on a diverse set of data sources, including AI 허브, to ensure robust performance across a wide range of retrieval tasks.
+ The model demonstrates strong retrieval capabilities across:<br>
+
+ * Korean–Korean
+ * Korean–English
+ * English–Korean
+
+ To support resource-constrained environments, the model is also compatible with Matryoshka Embeddings, enabling retrieval at reduced dimensions **(e.g., half of the original size)** without significant performance loss.
+ All training and data preprocessing were performed on **a single GPU (46 GB VRAM)**, showcasing not only the model's effectiveness but also its efficiency.
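With Matryoshka-style embeddings, reduced-dimension retrieval just keeps the leading components of each vector and re-normalizes. A minimal NumPy sketch of the 1024 → 512 reduction described above (the `truncate_matryoshka` helper name is illustrative, not part of this repo; real vectors come from `model.encode`):

```python
import numpy as np

def truncate_matryoshka(embeddings: np.ndarray, dim: int = 512) -> np.ndarray:
    """Keep the first `dim` components and re-normalize each row to unit
    length, so cosine similarity remains meaningful at the reduced size."""
    reduced = embeddings[:, :dim]
    return reduced / np.linalg.norm(reduced, axis=1, keepdims=True)

# Random vectors stand in for real 1024-dim model outputs.
full = np.random.default_rng(0).normal(size=(3, 1024))
half = truncate_matryoshka(full, dim=512)
print(half.shape)  # (3, 512)
```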

  ## Model Details

  ### Model Description
  - **Model Type:** Sentence Transformer
+ - **Base Model:** klue/roberta-large
  - **Maximum Sequence Length:** 512 tokens
+ - **Output Dimensionality:** 1024 / 512 dimensions
  - **Similarity Function:** Cosine Similarity
+ - **Languages:** ko, en
  - **License:** apache-2.0

+ ### Datasets
+ This model was trained on data from many sources, including **AI 허브**.<br>
+ In total, 500,000 query–document pairs were used for training.<br>
+
+ ### Training Details
+ The overall training process was designed with reference to snowflake-arctic-embed 2.0.<br>
+ In V2, a three-stage training process was introduced as a key component of the overall learning strategy: Adaptation-training, Pre-training, and Post-training.
+
+ * In the adaptation-training stage, preliminary experiments showed that multi-vector retrieval consistently outperformed standard dense retrieval, so we first trained the model with a multi-vector retrieval objective.
+ * In the pre-training stage, we introduced knowledge distillation, **where the multi-vector retrieval scores were distilled into the dense retrieval scores**. This allowed the model to capture fine-grained token-level similarity signals while being trained with in-batch negatives.
+ * In the post-training stage, we used the multilingual-e5-large model to mine hard negatives (specifically, the top 4 samples with a similarity score below a 99% threshold) and fine-tuned the model further on these examples.
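The post-training mining step can be sketched as follows, under one plausible reading of the description: for each query, take the top 4 non-positive passages whose teacher similarity stays below the 0.99 threshold. The similarity matrix is a toy stand-in for multilingual-e5-large scores, positives are assumed to sit on the diagonal, and all names are illustrative:

```python
import numpy as np

def mine_hard_negatives(sim: np.ndarray, top_k: int = 4, threshold: float = 0.99):
    """For each query (row), return the top_k highest-scoring non-positive
    passages whose teacher similarity is below `threshold` (this filters
    near-duplicates of the positive). Positives sit on the diagonal."""
    negatives = []
    for i, row in enumerate(sim):
        candidates = [(score, j) for j, score in enumerate(row)
                      if j != i and score < threshold]
        candidates.sort(reverse=True)  # hardest (highest-scoring) first
        negatives.append([j for _, j in candidates[:top_k]])
    return negatives

# Toy teacher scores: 3 queries x 5 passages, positives on the diagonal.
sim = np.array([
    [0.95, 0.90, 0.60, 0.94, 0.30],
    [0.40, 0.92, 0.91, 0.20, 0.50],
    [0.10, 0.30, 0.88, 0.87, 0.86],
])
print(mine_hard_negatives(sim, top_k=2))  # [[3, 1], [2, 4], [3, 4]]
```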
+
+ Given the increasing prevalence of LLM-generated content, we also converted existing data into Markdown-style passages to improve retrieval performance on such formats.<br>
+ The types of data augmentation applied are as follows:
+
+ | Augmentation* | Description |
+ |-----------|-----------|
+ | Pair concatenation | Multi-query & multi-passage |
+ | Language transfer | Korean to English on query & passage |
+ | Style transfer | Plain sentences to Markdown description |
+
+ **Augmentation was carried out using Gemma-3-12B*
+
+ ### Evaluation
+ The evaluation consists of five dataset groups, and the table below reports the average retrieval performance across them.
+ Three groups are subsets extracted from AI 허브 datasets.
+ One group is based on a specific sports regulation PDF, for which synthetic query and **markdown-style passage** pairs were generated using GPT-4o-mini.
+ The final group is a concatenation of the four aforementioned groups, providing a comprehensive mixed set.<br>
+
+ | Models | Open/Closed | Size | Accuracy@1 | Accuracy@3 | Accuracy@5 | Accuracy@10 |
+ |--------------|-----------|-----------|-----------|------------|------------|-------------|
+ | frony-embed-medium | **Open** | 337M | 0.6649 | **0.8040** | 0.8458 | 0.8876 |
+ | frony-embed-medium (half dim) | Open | 337M | 0.6520 | 0.7923 | 0.8361 | 0.8796 |
+ | frony-embed-small | Open | 111M | 0.6152 | 0.7616 | 0.8056 | 0.8559 |
+ | frony-embed-small (half dim) | Open | 111M | 0.5988 | 0.7478 | 0.7984 | 0.8461 |
+ | frony-embed-tiny | Open | 21M* | 0.5084 | 0.6757 | 0.7278 | 0.7845 |
+ | frony-embed-tiny (half dim) | Open | 21M* | 0.4710 | 0.6390 | 0.6933 | 0.7596 |
+ | bge-m3 | **Open** | 560M | 0.5852 | **0.7763** | 0.8418 | 0.8987 |
+ | multilingual-e5-large | Open | 560M | 0.5764 | 0.7630 | 0.8267 | 0.8891 |
+ | snowflake-arctic-embed-l-v2.0 | Open | 568M | 0.5726 | 0.7591 | 0.8232 | 0.8917 |
+ | jina-embeddings-v3 | Open | 572M | 0.5270 | 0.7246 | 0.7953 | 0.8649 |
+ | upstage-large | **Closed** | - | 0.6334 | **0.8527** | 0.9065 | 0.9478 |
+ | openai-text-embedding-3-large | Closed | - | 0.4907 | 0.6617 | 0.7311 | 0.8148 |
+ **Transformer blocks only*
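Accuracy@k, as used in the table, is the fraction of queries whose gold passage appears among the top-k retrieved candidates. A minimal sketch with toy scores (function and variable names are illustrative):

```python
import numpy as np

def accuracy_at_k(scores: np.ndarray, gold: np.ndarray, k: int) -> float:
    """scores: (n_queries, n_passages) similarity matrix.
    gold: index of the correct passage for each query.
    Returns the fraction of queries whose gold passage ranks in the top k."""
    topk = np.argsort(-scores, axis=1)[:, :k]       # indices of k best passages
    hits = (topk == gold[:, None]).any(axis=1)      # gold found among them?
    return float(hits.mean())

scores = np.array([
    [0.9, 0.2, 0.8],
    [0.1, 0.3, 0.7],
])
gold = np.array([0, 1])
print(accuracy_at_k(scores, gold, k=1))  # 0.5
print(accuracy_at_k(scores, gold, k=2))  # 1.0
```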

  ## Usage

  from sentence_transformers import SentenceTransformer

  # Download from the 🤗 Hub
+ model = SentenceTransformer("FronyAI/frony-embed-medium-ko-v2")
  # Run inference

+ # '<Q>' is the special token for queries.
+ queries = [
+     '<Q>안녕하세요',
+ ]
+ query_embeddings = model.encode(queries)

+ # '<P>' is the special token for passages.
+ passages = [
+     '<P>반갑습니다',
+ ]
+ passage_embeddings = model.encode(passages)
+ ```
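Given the cosine similarity function listed in the model description, query–passage relevance reduces to normalized dot products of the encoded vectors. A NumPy sketch with toy vectors standing in for the encoded queries and passages above (real outputs are 1024-dimensional; the helper name is illustrative):

```python
import numpy as np

def cosine_similarity(queries: np.ndarray, passages: np.ndarray) -> np.ndarray:
    """Cosine similarity matrix between query and passage embeddings."""
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    p = passages / np.linalg.norm(passages, axis=1, keepdims=True)
    return q @ p.T

# Toy stand-ins for the encoded queries and passages.
q_emb = np.array([[1.0, 0.0, 1.0]])
p_emb = np.array([[1.0, 0.0, 1.0], [0.0, 1.0, 0.0]])
print(cosine_similarity(q_emb, p_emb))  # [[1. 0.]]
```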

+ ## Contact
+ Feel free to open an issue or pull request if you have any questions or suggestions about this project.
+ You can also email flash659@gmail.com.