surazbhandari committed

Commit e190deb · verified · 1 Parent(s): adc0ea3

Update to Hugging Face standard model format
.gitattributes CHANGED
@@ -1,2 +1,4 @@
  models/mini/model.pt filter=lfs diff=lfs merge=lfs -text
  models/mini/model.safetensors filter=lfs diff=lfs merge=lfs -text
+ model.pt filter=lfs diff=lfs merge=lfs -text
+ model.safetensors filter=lfs diff=lfs merge=lfs -text
LICENSE DELETED
@@ -1,21 +0,0 @@
- MIT License
-
- Copyright (c) 2024
-
- Permission is hereby granted, free of charge, to any person obtaining a copy
- of this software and associated documentation files (the "Software"), to deal
- in the Software without restriction, including without limitation the rights
- to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
- copies of the Software, and to permit persons to whom the Software is
- furnished to do so, subject to the following conditions:
-
- The above copyright notice and this permission notice shall be included in all
- copies or substantial portions of the Software.
-
- THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
- IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
- FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
- AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
- LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
- OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
- SOFTWARE.
MODEL_CARD.md DELETED
@@ -1,164 +0,0 @@
- ---
- language: en
- license: mit
- tags:
- - text-embedding
- - sentence-similarity
- - semantic-search
- - product-matching
- - transformer
- - pytorch
- - from-scratch
- library_name: pytorch
- pipeline_tag: sentence-similarity
- model-index:
- - name: MiniEmbed-Mini
-   results: []
- ---
-
- # MiniEmbed: Tiny, Powerful Embedding Models from Scratch
-
- **MiniEmbed** is an ultra-compact text embedding model (Bi-Encoder) built entirely from scratch in PyTorch. No HuggingFace Transformers, no pre-trained weights -- just pure PyTorch.
-
- **GitHub:** [github.com/bhandarisuraz/miniembed](https://github.com/bhandarisuraz/miniembed) (full repo with examples, tests, interactive demo, and documentation)
-
- | Spec | Value |
- |---|---|
- | Parameters | ~10.8M |
- | Model Size | ~42 MB |
- | Embedding Dim | 256 |
- | Vocab Size | 30,000 |
- | Max Seq Length | 128 tokens |
- | Architecture | 4-layer Transformer Encoder |
- | Pooling | Mean Pooling + L2 Normalization |
- | Training Loss | MNRL (Multiple Negatives Ranking Loss) |
- | Training Data | ~3.8M pairs (NQ, GooAQ, MSMARCO, WDC, ECInstruct) |
-
- ## Quick Start
-
- ```bash
- pip install torch numpy scikit-learn huggingface_hub
- ```
-
- ```python
- from huggingface_hub import snapshot_download
-
- # Download model (one-time)
- model_dir = snapshot_download("surazbhandari/miniembed")
-
- # Add src to path
- import sys
- sys.path.insert(0, model_dir)
-
- from src.inference import EmbeddingInference
-
- # Load -- just like sentence-transformers!
- model = EmbeddingInference.from_pretrained("surazbhandari/miniembed")
-
- # Similarity
- score = model.similarity("Machine learning is great", "AI is wonderful")
- print(f"Similarity: {score:.4f}")  # 0.4287
-
- # Semantic Search
- docs = ["Python is great for AI", "I love pizza", "Neural networks learn patterns"]
- results = model.search("deep learning frameworks", docs, top_k=2)
- for r in results:
-     print(f"  [{r['score']:.3f}] {r['text']}")
- # [0.498] Neural networks learn patterns
- # [0.413] Python is great for AI
-
- # Clustering
- result = model.cluster_texts(["ML is cool", "Pizza is food", "AI rocks"], n_clusters=2)
- # Cluster 1: ['Pizza is food']
- # Cluster 2: ['ML is cool', 'AI rocks']
- ```
-
- ## Also Available via GitHub
-
- ```bash
- git clone https://github.com/bhandarisuraz/miniembed.git
- cd miniembed
- pip install -r requirements.txt
-
- python -c "
- from src.inference import EmbeddingInference
- model = EmbeddingInference.from_pretrained('models/mini')
- print(model.similarity('hello world', 'hi there'))
- "
- ```
-
- ## Capabilities
-
- - **Semantic Search** -- Find meaning-based matches, not keyword overlap.
- - **Re-Ranking** -- Sort candidates by true semantic relevance.
- - **Clustering** -- Group texts into logical categories automatically.
- - **Product Matching** -- Match items across platforms with messy titles.
-
- ## Architecture
-
- Custom 4-layer Transformer encoder built from first principles:
-
- - Token Embedding (30K vocab) + Sinusoidal Positional Encoding
- - 4x Pre-LayerNorm Transformer Encoder Layers
- - Multi-Head Self-Attention (4 heads, d_k=64)
- - Position-wise Feed-Forward (GELU activation, d_ff=1024)
- - Mean Pooling over non-padded tokens
- - L2 Normalization (unit hypersphere projection)
-
- ## Training
-
- Trained on ~3.8 million text pairs from public datasets:
-
- | Dataset | Type |
- |---|---|
- | Natural Questions (NQ) | Q&A / General |
- | GooAQ | Knowledge Search |
- | WDC Product Matching | E-commerce |
- | ECInstruct | E-commerce Tasks |
- | MS MARCO | Web Search |
-
- **Training details:**
- - Training time: ~49 hours
- - Final loss: 0.0748
- - Optimizer: AdamW
- - Batch size: 256
-
- ## Files
-
- ```
- surazbhandari/miniembed
- |-- README.md           # This model card
- |-- config.json         # Architecture config
- |-- model.safetensors   # Pre-trained weights (Safe & Fast)
- |-- model.pt            # Pre-trained weights (Legacy PyTorch)
- |-- tokenizer.json      # 30K word-level vocabulary
- |-- training_info.json  # Training metadata
- |-- src/
-     |-- __init__.py
-     |-- model.py        # Full architecture code
-     |-- tokenizer.py    # Tokenizer implementation
-     |-- inference.py    # High-level API (supports HF auto-download)
- ```
-
- ## Limitations
-
- - Word-level tokenizer (no subword/BPE) -- unknown words map to [UNK]
- - 128 token max sequence length
- - Trained primarily on English text
- - Best suited for short-form text (queries, product titles, sentences)
-
- ## Citation
-
- ```bibtex
- @software{Bhandari_MiniEmbed_2026,
-   author = {Bhandari, Suraj},
-   title = {{MiniEmbed: Tiny, Powerful Embedding Models from Scratch}},
-   url = {https://github.com/bhandarisuraz/miniembed},
-   version = {1.0.0},
-   year = {2026}
- }
- ```
-
- ## License
-
- MIT
README.md CHANGED
@@ -1,206 +1,159 @@
- # MiniEmbed: Tiny, Powerful Embedding Models from Scratch
-
- **MiniEmbed** is a research-grade toolkit for training and deploying ultra-compact text embedding models (Bi-Encoders) built entirely from scratch in PyTorch. While the industry chases billion-parameter giants, MiniEmbed proves that a **~42 MB / 10.8M parameter** model can deliver production-grade semantic intelligence for specialized domains.
-
- [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](LICENSE)
- [![Python 3.8+](https://img.shields.io/badge/Python-3.8+-blue.svg)](https://python.org)
- [![PyTorch 2.0+](https://img.shields.io/badge/PyTorch-2.0+-ee4c2c.svg)](https://pytorch.org)
- [![Hugging Face](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Models-orange)](https://huggingface.co/surazbhandari/miniembed)
-
  ---
-
- ## What Can MiniEmbed Do?
-
- | Capability | Description |
- |---|---|
- | **Semantic Search** | Find meaning, not just keywords. Understands that *"kitten"* is similar to *"cat"*. |
- | **Re-Ranking** | Sort candidates by true semantic relevance. Eliminates false positives. |
- | **Clustering** | Group thousands of texts into logical categories automatically. |
- | **Product Matching** | Match identical items across stores, even with messy or inconsistent titles. |
- | **Text Encoding** | Convert any text into a dense 256-dimensional vector for downstream tasks. |
-
  ---
  
- ## Project Structure
-
- ```
- miniembed/
- |-- README.md              # You are here
- |-- LICENSE                # MIT License
- |-- requirements.txt       # Python dependencies
- |-- demo.py                # Interactive Streamlit demo
- |-- src/                   # Core library
- |   |-- __init__.py
- |   |-- model.py           # Transformer architecture (from scratch)
- |   |-- tokenizer.py       # Custom word-level tokenizer
- |   |-- inference.py       # High-level API for encoding & search
- |-- models/
- |   |-- mini/              # Pre-trained Mini model
- |       |-- model.safetensors   # Pre-trained weights (Safe & Fast)
- |       |-- model.pt            # Pre-trained weights (Legacy)
- |       |-- config.json         # Architecture blueprint
- |       |-- tokenizer.json      # 30K vocabulary
- |       |-- training_info.json  # Training metadata
- |-- examples/              # Ready-to-run scripts
- |   |-- basic_usage.py     # Encoding & similarity
- |   |-- semantic_search.py # Document retrieval
- |   |-- clustering.py      # Text clustering with K-Means
- |-- data/
-     |-- sample_data.jsonl  # 10-pair demo dataset
- ```
-
- > **Note:** Pre-trained weights (`model.safetensors` / `model.pt`, ~42 MB) are included in this repository. Clone and use immediately. `.safetensors` is recommended for security and faster loading.
-
- ---
  
  ## Quick Start
  
- ### 1. Install Dependencies
  ```bash
- git clone https://github.com/bhandarisuraz/miniembed.git
- cd miniembed
- pip install -r requirements.txt
  ```
  
- ### 2. Use the Model
-
- The pre-trained Mini model is included in `models/mini/`. Alternatively, you can load it directly from Hugging Face:
-
  ```python
- from src.inference import EmbeddingInference
  
- # Option A: From local files
- model = EmbeddingInference.from_pretrained("models/mini")
  
- # Option B: Direct from Hugging Face (auto-downloads)
- model = EmbeddingInference.from_pretrained("surazbhandari/miniembed")
- ```
  
- ### 3. Try It Instantly
- ```python
  from src.inference import EmbeddingInference
  
- model = EmbeddingInference.from_pretrained("models/mini")
  
- # Similarity
  score = model.similarity("Machine learning is great", "AI is wonderful")
  print(f"Similarity: {score:.4f}")  # 0.4287
  
- # Semantic Search
  docs = ["Python is great for AI", "I love pizza", "Neural networks learn patterns"]
  results = model.search("deep learning frameworks", docs, top_k=2)
  for r in results:
      print(f"  [{r['score']:.3f}] {r['text']}")
- ```
  
- For full Hugging Face integration, ensure you have `huggingface_hub` installed:
- ```bash
- pip install huggingface_hub
  ```
  
- ---
-
- ## Interactive Demo (`demo.py`)
-
- A full-featured Streamlit dashboard for exploring the model's capabilities without writing code:
-
- - **Similarity** -- Real-time cosine similarity between any two texts.
- - **Semantic Search** -- Rank a custom document set against your query.
- - **Clustering** -- Automatically categorize items using K-Means.
- - **Text Encoding** -- Inspect raw 256-D vectors and their statistics.
- - **CSV Matcher** -- Match records between two CSV files for deduplication or cross-platform product mapping.
  
  ```bash
- streamlit run demo.py
- ```
-
- ---
-
- ## Architecture
-
- MiniEmbed uses a **custom 4-layer Transformer encoder** built from scratch -- no HuggingFace, no pre-trained weights:
-
- | Component | Specification |
- |---|---|
- | Embedding Dimension | 256 |
- | Attention Heads | 4 |
- | Transformer Layers | 4 |
- | Feed-Forward Dimension | 1,024 |
- | Vocabulary Size | 30,000 |
- | Max Sequence Length | 128 tokens |
- | Total Parameters | ~10.8M |
- | Model Size on Disk | ~42 MB |
- | Pooling Strategy | Mean Pooling + L2 Normalization |
-
- ### Training Objective
-
- Training uses **Multiple Negatives Ranking Loss (MNRL)**, the industry-standard contrastive objective for Bi-Encoders:
-
- $$\mathcal{L} = -\sum_{i=1}^{n} \log \frac{e^{sim(q_i, p_i) / \tau}}{\sum_{j=1}^{n} e^{sim(q_i, p_j) / \tau}}$$
-
- All embeddings are **L2-normalized**, projecting text onto a unit hypersphere where cosine similarity equals dot product -- enabling ultra-fast nearest-neighbor search.
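Editor's note: the MNRL formula above is exactly an in-batch softmax cross-entropy over a similarity matrix, with each query's paired passage as the positive and every other passage in the batch as a negative. A minimal PyTorch sketch of that objective (an illustration only, not the repository's actual training code; the `temperature` value and random toy batch are assumptions):

```python
import torch
import torch.nn.functional as F

def mnrl_loss(q_emb: torch.Tensor, p_emb: torch.Tensor, temperature: float = 0.05) -> torch.Tensor:
    """Multiple Negatives Ranking Loss: positive for query i is passage i;
    all other passages in the batch act as in-batch negatives."""
    q = F.normalize(q_emb, dim=-1)   # L2-normalize, so dot product == cosine sim
    p = F.normalize(p_emb, dim=-1)
    scores = q @ p.T / temperature   # (n, n) similarity matrix, scaled by 1/tau
    labels = torch.arange(q.size(0), device=q.device)  # diagonal entries are positives
    return F.cross_entropy(scores, labels)             # softmax over each row

# Toy batch of 4 pairs: positives are noisy copies of their queries.
q = torch.randn(4, 256)
p = q + 0.1 * torch.randn(4, 256)
loss = mnrl_loss(q, p)
```

Because each row of `scores` is handled by a standard cross-entropy, the batch size directly controls how many negatives each query sees.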
  
- ---
  
- ## Training Data Sources
  
- The pre-trained model was trained on ~3.8 million text pairs from the following open-source datasets:
  
- | Dataset | Type | Source |
- |---|---|---|
- | **Natural Questions (NQ)** | Q&A / General | [HuggingFace](https://huggingface.co/datasets/google-research-datasets/natural_questions) |
- | **GooAQ** | Knowledge Search | [HuggingFace](https://huggingface.co/datasets/sentence-transformers/gooaq) |
- | **WDC Product Matching** | E-commerce | [HuggingFace](https://huggingface.co/datasets/wdc/products-2017) |
- | **ECInstruct** | E-commerce Tasks | [HuggingFace](https://huggingface.co/datasets/NingLab/ECInstruct) |
- | **MS MARCO** | Web Search | [HuggingFace](https://huggingface.co/datasets/microsoft/ms_marco) |
  
- > **Legal Disclaimer**: These public datasets belong to their respective stakeholders and creators. Any copyright, licensing, or legal usage constraints must be consulted with the original authors individually.
  
- ---
  
- ## Performance
  
- Results from the pre-trained Mini model:
  
- | Metric | Value |
  |---|---|
- | **Training Loss** | 0.0748 (final) |
- | **Training Samples** | 3,817,707 pairs |
- | **Throughput** | ~1,000 samples/sec |
- | **Encoding Latency** | ~3-5 ms per text |
- | **Training Epochs** | 10 |
-
- ---
-
- ## Examples
  
- Ready-to-run scripts in the `examples/` folder:
  
- ```bash
- cd examples
-
- # Basic encoding and similarity
- python basic_usage.py
  
- # Document retrieval
- python semantic_search.py
-
- # Text clustering with K-Means
- python clustering.py
  ```
  
- ---
-
- ## Roadmap
-
- - **mini-product** -- A further fine-tuned version of the Mini model, specialized for high-accuracy **product matching** is Coming soon...
  
- ---
  
  ## Citation
  
- If you use MiniEmbed in your research, please cite:
-
  ```bibtex
  @software{Bhandari_MiniEmbed_2026,
    author = {Bhandari, Suraj},
@@ -211,10 +164,6 @@ If you use MiniEmbed in your research, please cite:
  }
  ```
  
- ---
-
  ## License
  
- MIT License. See [LICENSE](LICENSE) for details.
-
- Explore, learn, and build smaller, smarter AI.
  ---
+ language: en
+ license: mit
+ tags:
+ - text-embedding
+ - sentence-similarity
+ - semantic-search
+ - product-matching
+ - transformer
+ - pytorch
+ - from-scratch
+ library_name: pytorch
+ pipeline_tag: sentence-similarity
+ model-index:
+ - name: MiniEmbed-Mini
+   results: []
  ---
  
+ # MiniEmbed: Tiny, Powerful Embedding Models from Scratch
  
+ **MiniEmbed** is an ultra-compact text embedding model (Bi-Encoder) built entirely from scratch in PyTorch. No HuggingFace Transformers, no pre-trained weights -- just pure PyTorch.
  
+ **GitHub:** [github.com/bhandarisuraz/miniembed](https://github.com/bhandarisuraz/miniembed) (full repo with examples, tests, interactive demo, and documentation)
  
+ | Spec | Value |
+ |---|---|
+ | Parameters | ~10.8M |
+ | Model Size | ~42 MB |
+ | Embedding Dim | 256 |
+ | Vocab Size | 30,000 |
+ | Max Seq Length | 128 tokens |
+ | Architecture | 4-layer Transformer Encoder |
+ | Pooling | Mean Pooling + L2 Normalization |
+ | Training Loss | MNRL (Multiple Negatives Ranking Loss) |
+ | Training Data | ~3.8M pairs (NQ, GooAQ, MSMARCO, WDC, ECInstruct) |
  
  ## Quick Start
  
  ```bash
+ pip install torch numpy scikit-learn huggingface_hub
  ```
  
  ```python
+ from huggingface_hub import snapshot_download
  
+ # Download model (one-time)
+ model_dir = snapshot_download("surazbhandari/miniembed")
  
+ # Add src to path
+ import sys
+ sys.path.insert(0, model_dir)
  
  from src.inference import EmbeddingInference
  
+ # Load -- just like sentence-transformers!
+ model = EmbeddingInference.from_pretrained("surazbhandari/miniembed")
  
+ # 1. Similarity
  score = model.similarity("Machine learning is great", "AI is wonderful")
  print(f"Similarity: {score:.4f}")  # 0.4287
  
+ # 2. Normal Embeddings
+ embeddings = model.encode(["Machine learning is great", "AI is wonderful"])
+ import numpy as np
+ manual_score = np.dot(embeddings[0], embeddings[1])  # Dot product = Cosine Similarity
+
+ # 3. Semantic Search
  docs = ["Python is great for AI", "I love pizza", "Neural networks learn patterns"]
  results = model.search("deep learning frameworks", docs, top_k=2)
  for r in results:
      print(f"  [{r['score']:.3f}] {r['text']}")
+ # [0.498] Neural networks learn patterns
+ # [0.413] Python is great for AI
  
+ # 4. Clustering
+ result = model.cluster_texts(["ML is cool", "Pizza is food", "AI rocks"], n_clusters=2)
+ # Cluster 1: ['Pizza is food']
+ # Cluster 2: ['ML is cool', 'AI rocks']
  ```
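Editor's note on the `manual_score` line in the added snippet above: the dot product only equals cosine similarity because the model returns L2-normalized embeddings. A self-contained check with synthetic vectors (numpy only, no model download needed; the vector size 256 mirrors the model's embedding dimension):

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(size=256)
b = rng.normal(size=256)

# L2-normalize, as the model does before returning embeddings
a_n = a / np.linalg.norm(a)
b_n = b / np.linalg.norm(b)

dot = float(a_n @ b_n)  # plain dot product of unit vectors
cosine = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))  # textbook cosine similarity

# Identical once the vectors live on the unit hypersphere
assert abs(dot - cosine) < 1e-9
```

This is also why nearest-neighbor search over such embeddings can use fast dot-product indexes directly.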
  
+ ## Also Available via GitHub
  
  ```bash
+ git clone https://github.com/bhandarisuraz/miniembed.git
+ cd miniembed
+ pip install -r requirements.txt
  
+ python -c "
+ from src.inference import EmbeddingInference
+ model = EmbeddingInference.from_pretrained('models/mini')
+ print(model.similarity('hello world', 'hi there'))
+ "
+ ```
  
+ ## Capabilities
  
+ - **Semantic Search** -- Find meaning-based matches, not keyword overlap.
+ - **Re-Ranking** -- Sort candidates by true semantic relevance.
+ - **Clustering** -- Group texts into logical categories automatically.
+ - **Product Matching** -- Match items across platforms with messy titles.
  
+ ## Architecture
  
+ Custom 4-layer Transformer encoder built from first principles:
  
+ - Token Embedding (30K vocab) + Sinusoidal Positional Encoding
+ - 4x Pre-LayerNorm Transformer Encoder Layers
+ - Multi-Head Self-Attention (4 heads, d_k=64)
+ - Position-wise Feed-Forward (GELU activation, d_ff=1024)
+ - Mean Pooling over non-padded tokens
+ - L2 Normalization (unit hypersphere projection)
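Editor's note: the last two architecture bullets (mean pooling over non-padded tokens, then L2 normalization) can be sketched as follows. This is a minimal illustration, not the repository's actual `model.py` code; the `(batch, seq, d_model)` layout and 0/1 attention mask are assumptions:

```python
import torch
import torch.nn.functional as F

def mean_pool(hidden: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Average token states over real (non-padded) positions, then L2-normalize.
    hidden: (batch, seq, d_model); attention_mask: (batch, seq), 1 = real token."""
    mask = attention_mask.unsqueeze(-1).float()   # (batch, seq, 1), broadcastable
    summed = (hidden * mask).sum(dim=1)           # padded positions contribute zero
    counts = mask.sum(dim=1).clamp(min=1e-9)      # per-sequence token counts
    pooled = summed / counts                      # mean over real tokens only
    return F.normalize(pooled, dim=-1)            # project onto the unit hypersphere

# Two sequences of 128 tokens; the second one is half padding.
hidden = torch.randn(2, 128, 256)
mask = torch.ones(2, 128)
mask[1, 64:] = 0
emb = mean_pool(hidden, mask)
```

Masked pooling matters here: averaging over padded positions as well would drag embeddings of short texts toward the padding representation.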
 
+ ## Training
  
+ Trained on ~3.8 million text pairs from public datasets:
  
+ | Dataset | Type |
  |---|---|
+ | Natural Questions (NQ) | Q&A / General |
+ | GooAQ | Knowledge Search |
+ | WDC Product Matching | E-commerce |
+ | ECInstruct | E-commerce Tasks |
+ | MS MARCO | Web Search |
  
+ **Training details:**
+ - Training time: ~49 hours
+ - Final loss: 0.0748
+ - Optimizer: AdamW
+ - Batch size: 256
  
+ ## Files
  
+ ```
+ surazbhandari/miniembed
+ |-- README.md           # This model card
+ |-- config.json         # Architecture config
+ |-- model.safetensors   # Pre-trained weights (Safe & Fast)
+ |-- model.pt            # Pre-trained weights (Legacy PyTorch)
+ |-- tokenizer.json      # 30K word-level vocabulary
+ |-- training_info.json  # Training metadata
+ |-- src/
+     |-- __init__.py
+     |-- model.py        # Full architecture code
+     |-- tokenizer.py    # Tokenizer implementation
+     |-- inference.py    # High-level API (supports HF auto-download)
+ ```
  
+ ## Limitations
  
+ - Word-level tokenizer (no subword/BPE) -- unknown words map to [UNK]
+ - 128 token max sequence length
+ - Trained primarily on English text
+ - Best suited for short-form text (queries, product titles, sentences)
  
  ## Citation
  
  ```bibtex
  @software{Bhandari_MiniEmbed_2026,
    author = {Bhandari, Suraj},
  }
  ```
  
  ## License
  
+ MIT
models/mini/config.json → config.json RENAMED
File without changes
data/sample_data.jsonl DELETED
@@ -1,10 +0,0 @@
- {"query": "how to train an embedding model", "passage": "Training an embedding model involves using contrastive learning on query-passage pairs.", "source": "sample"}
- {"query": "what is a transformer", "passage": "The Transformer is a deep learning model that uses self-attention mechanisms to process sequence data.", "source": "sample"}
- {"query": "nike air max 90", "passage": "Men's Nike Air Max 90 Casual Shoes in Black and White.", "source": "sample"}
- {"query": "samsung galaxy s21", "passage": "Samsung Galaxy S21 5G 128GB Unlocked Smartphone - Phantom Gray.", "source": "sample"}
- {"query": "best winter coats", "passage": "The North Face Gotham Jacket III is one of the warmest winter parkas for heavy snow.", "source": "sample"}
- {"query": "python programming for beginners", "passage": "Learn Python with this comprehensive guide covering variables, loops, and functions.", "source": "sample"}
- {"query": "benefits of meditation", "passage": "Meditation can reduce stress, improve concentration, and increase happiness.", "source": "sample"}
- {"query": "how to bake chocolate cake", "passage": "Whisk eggs and sugar, then fold in flour and melted chocolate for a perfect moist cake.", "source": "sample"}
- {"query": "what is machine learning", "passage": "Machine learning is a field of AI that allows systems to learn patterns from data without explicit programming.", "source": "sample"}
- {"query": "running shoes for flat feet", "passage": "Brooks Adrenaline GTS 22 provides excellent stability and support for runners with low arches.", "source": "sample"}
demo.py DELETED
@@ -1,510 +0,0 @@
- """
- MiniEmbed - Interactive Demo
- ================================
- Explore the embedding model's capabilities through a Streamlit dashboard.
-
- Features:
- - Pairwise text similarity (cosine distance)
- - Semantic document search with ranked results
- - Unsupervised text clustering via K-Means
- - Raw embedding vector inspection and visualization
- - Bulk CSV-to-CSV record matching
-
- Run: streamlit run demo.py
- """
-
- import streamlit as st
- import numpy as np
- import pandas as pd
- import os
- import sys
- import io
-
- # Add src to path
- sys.path.insert(0, os.path.dirname(os.path.abspath(__file__)))
-
- from src.inference import EmbeddingInference, EmbeddingModelManager
-
- # ============================================================================
- # PAGE CONFIG
- # ============================================================================
-
- st.set_page_config(
-     page_title="MiniEmbed Demo",
-     page_icon="M",
-     layout="wide"
- )
-
- # Custom CSS
- st.markdown("""
- <style>
-     .main-header {
-         font-size: 2.5rem;
-         font-weight: 700;
-         background: linear-gradient(135deg, #667eea 0%, #764ba2 100%);
-         -webkit-background-clip: text;
-         -webkit-text-fill-color: transparent;
-         text-align: center;
-         margin-bottom: 1rem;
-     }
-     .sub-header {
-         text-align: center;
-         color: #888;
-         margin-bottom: 2rem;
-     }
-     .result-box {
-         background: rgba(100, 100, 100, 0.1);
-         border-radius: 10px;
-         padding: 1rem;
-         margin: 0.5rem 0;
-         color: inherit;
-     }
-     .high-score { border-left: 4px solid #28a745; background: rgba(40, 167, 69, 0.1); }
-     .medium-score { border-left: 4px solid #ffc107; background: rgba(255, 193, 7, 0.1); }
-     .low-score { border-left: 4px solid #dc3545; background: rgba(220, 53, 69, 0.1); }
-     .score-text { font-weight: bold; }
- </style>
- """, unsafe_allow_html=True)
-
- # ============================================================================
- # LOAD MODEL
- # ============================================================================
-
- @st.cache_resource
- def load_model(model_name):
-     """Load the embedding model from disk."""
-     model_dir = f"models/{model_name}"
-     if model_name == "Legacy (model/)":
-         model_dir = "model"
-     return EmbeddingInference.from_pretrained(model_dir)
-
-
- # Header
- st.markdown('<h1 class="main-header">MiniEmbed Demo</h1>', unsafe_allow_html=True)
- st.markdown('<p class="sub-header">Explore semantic similarity, search, clustering, and bulk matching</p>', unsafe_allow_html=True)
-
- # -----------------------------------------------------------------------------
- # Model Selection
- # -----------------------------------------------------------------------------
- available_models = EmbeddingModelManager.list_models()
- if os.path.exists("model/model.pt"):
-     available_models.append("Legacy (model/)")
-
- if not available_models:
-     st.error("No models found. Train a model first or place weights in models/mini/model.pt.")
-     st.info("Models should be located in the `models/` directory (e.g., `models/mini/`).")
-     st.stop()
-
- selected_model_name = st.sidebar.selectbox(
-     "Select Model",
-     available_models,
-     index=0,
-     help="Select which trained model to load for inference."
- )
-
- model = load_model(selected_model_name)
-
- if model is None:
-     st.error("Model not found. Please train the model first.")
-     st.stop()
-
- # Model info
- with st.expander("Model Info", expanded=False):
-     st.markdown("""
-     This panel shows the architecture of the currently loaded model.
-     - **Embedding Dim**: The size of each output vector (higher = more expressive).
-     - **Layers**: Number of Transformer encoder layers stacked in the model.
-     - **Vocab Size**: Total number of unique tokens the model can recognize.
-     """)
-     col1, col2, col3 = st.columns(3)
-     with col1:
-         st.metric("Embedding Dim", model.model.d_model)
-     with col2:
-         st.metric("Layers", len(model.model.layers))
-     with col3:
-         st.metric("Vocab Size", len(model.tokenizer.word_to_id))
-
- # ============================================================================
- # TABS
- # ============================================================================
-
- tab1, tab2, tab3, tab4, tab5 = st.tabs([
-     "Similarity",
-     "Semantic Search",
-     "Clustering",
-     "Encode Text",
-     "CSV Matcher"
- ])
-
- # ============================================================================
- # TAB 1: SIMILARITY
- # ============================================================================
-
- with tab1:
-     st.markdown("### Pairwise Text Similarity")
-     st.markdown("""
-     Enter two texts to compute their **cosine similarity** (range: 0 to 1).
-     The model encodes each text into a 256-dimensional vector and measures
-     the angular distance between them. A score close to 1.0 means the texts
-     are semantically equivalent; a score near 0.0 means they are unrelated.
-     """)
-
-     col1, col2 = st.columns(2)
-
-     with col1:
-         text1 = st.text_area(
-             "Text 1",
-             "Machine learning is a branch of artificial intelligence",
-             height=100,
-             key="sim_text1"
-         )
-
-     with col2:
-         text2 = st.text_area(
-             "Text 2",
-             "AI systems can learn patterns from data",
-             height=100,
-             key="sim_text2"
-         )
-
-     if st.button("Compute Similarity", type="primary", key="sim_btn"):
-         if text1 and text2:
-             with st.spinner("Computing..."):
-                 similarity = model.similarity(text1, text2)
-
-             if similarity > 0.7:
-                 color = "#28a745"
-                 label = "Very Similar"
-             elif similarity > 0.4:
-                 color = "#ffc107"
-                 label = "Somewhat Similar"
-             else:
-                 color = "#dc3545"
-                 label = "Not Similar"
-
-             st.markdown(f"""
-             <div style="text-align: center; padding: 2rem;">
-                 <div style="font-size: 4rem; font-weight: bold; color: {color};">
-                     {similarity:.3f}
-                 </div>
-                 <div style="font-size: 1.2rem; color: {color};">
-                     {label}
-                 </div>
-             </div>
-             """, unsafe_allow_html=True)
-
-     # Example pairs
-     st.markdown("---")
-     st.markdown("#### Example Pairs")
-     st.markdown("These pairs demonstrate how the model distinguishes related from unrelated content:")
-
-     examples = [
-         ("Python is a programming language", "Java is used for software development"),
-         ("The cat sat on the mat", "A feline rested on the rug"),
-         ("Machine learning is fascinating", "I love eating pizza"),
-     ]
-
-     for t1, t2 in examples:
-         similarity = model.similarity(t1, t2)
-
-         if similarity > 0.5:
-             css_class = "high-score"
-         elif similarity > 0.3:
-             css_class = "medium-score"
-         else:
-             css_class = "low-score"
-
-         st.markdown(f"""
-         <div class="result-box {css_class}">
-             <strong>{similarity:.3f}</strong> | "{t1}" vs "{t2}"
-         </div>
-         """, unsafe_allow_html=True)
-
- # ============================================================================
- # TAB 2: SEMANTIC SEARCH
- # ============================================================================
-
- with tab2:
-     st.markdown("### Semantic Document Search")
-     st.markdown("""
-     Enter a natural-language query. The model encodes your query and all
-     documents into the same vector space, then ranks documents by cosine
-     similarity. This finds **meaning-based** matches, not just keyword overlap.
-     """)
-
-     default_docs = """Python is a high-level programming language
- Machine learning algorithms learn patterns from data
- The weather today is sunny and warm
- Neural networks are inspired by the human brain
- JavaScript is used for web development
- Deep learning has transformed computer vision
- Cats are popular pets around the world
- TensorFlow and PyTorch are ML frameworks
- The stock market had a volatile day
- Natural language processing understands text"""
-
-     query = st.text_input(
-         "Search Query",
-         "How do AI systems learn from examples?",
-         key="search_query"
-     )
-
-     documents_text = st.text_area(
-         "Documents (one per line)",
-         default_docs,
-         height=200,
-         key="search_docs"
-     )
-
-     top_k = st.slider("Number of results", 1, 10, 5, key="search_topk")
-
-     if st.button("Search", type="primary", key="search_btn"):
-         documents = [d.strip() for d in documents_text.split('\n') if d.strip()]
-
-         if query and documents:
-             with st.spinner("Searching..."):
-                 results = model.search(query, documents, top_k=top_k)
-
-             st.markdown("### Results")
-             st.markdown("Documents ranked by semantic relevance to your query:")
-
-             for r in results:
-                 score = r['score']
-                 if score > 0.6:
-                     indicator = "[HIGH]"
-                     css_class = "high-score"
-                 elif score > 0.4:
-                     indicator = "[MED]"
-                     css_class = "medium-score"
-                 else:
-                     indicator = "[LOW]"
-                     css_class = "low-score"
-
-                 st.markdown(f"""
-                 <div class="result-box {css_class}">
-                     <strong>{indicator} #{r['rank']}</strong> (score: {score:.4f})<br>
-                     {r['text']}
-                 </div>
-                 """, unsafe_allow_html=True)
-
- # ============================================================================
- # TAB 3: CLUSTERING
- # ============================================================================
-
- with tab3:
-     st.markdown("### Unsupervised Text Clustering")
-     st.markdown("""
-     The model encodes each text into a dense vector. K-Means clustering
-     then groups these vectors by proximity in the embedding space.
-     Texts that are semantically similar end up in the same cluster,
-     even if they share no common words.
-     """)
-
-     default_cluster_texts = """Python programming language
- Machine learning algorithms
- Deep learning neural networks
- JavaScript web development
- Cats and dogs as pets
- Pizza and pasta Italian food
- Sunny weather today
- Rainy day forecast
- Stock market trends
- Financial news update"""
-
-     cluster_texts = st.text_area(
-         "Texts to cluster (one per line)",
-         default_cluster_texts,
-         height=200,
-         key="cluster_texts"
-     )
-
-     n_clusters = st.slider("Number of clusters", 2, 10, 3, key="n_clusters")
-
-     if st.button("Run Clustering", type="primary", key="cluster_btn"):
-         texts = [t.strip() for t in cluster_texts.split('\n') if t.strip()]
325
-
326
- if len(texts) >= n_clusters:
327
- with st.spinner("Clustering..."):
328
- result = model.cluster_texts(texts, n_clusters=n_clusters)
329
-
330
- st.markdown("### Cluster Assignments")
331
- st.markdown("Each group contains texts that the model considers semantically related:")
332
-
333
- colors = ["#667eea", "#28a745", "#ffc107", "#dc3545", "#17a2b8",
334
- "#6f42c1", "#fd7e14", "#20c997", "#e83e8c", "#6c757d"]
335
-
336
- for cluster_id in sorted(result['texts_by_cluster'].keys()):
337
- cluster_texts_list = result['texts_by_cluster'][cluster_id]
338
- color = colors[cluster_id % len(colors)]
339
-
340
- st.markdown(f"""
341
- <div style="background: {color}15; border-left: 4px solid {color};
342
- padding: 1rem; border-radius: 5px; margin: 0.5rem 0;">
343
- <strong style="color: {color};">Cluster {cluster_id + 1}</strong>
344
- ({len(cluster_texts_list)} texts)
345
- </div>
346
- """, unsafe_allow_html=True)
347
-
348
- for text in cluster_texts_list:
349
- st.markdown(f" - {text}")
350
- else:
351
- st.warning(f"Need at least {n_clusters} texts to create {n_clusters} clusters.")
352
-
353
- # ============================================================================
354
- # TAB 4: ENCODE TEXT
355
- # ============================================================================
356
-
357
- with tab4:
358
- st.markdown("### Raw Embedding Inspector")
359
- st.markdown("""
360
- Convert any text into its dense vector representation. The output is a
361
- 256-dimensional float vector that is **L2-normalized** (unit length = 1.0).
362
- This is the same representation used internally for similarity and search.
363
- """)
364
-
365
- encode_text = st.text_area(
366
- "Text to encode",
367
- "Machine learning is a fascinating field of study.",
368
- height=100,
369
- key="encode_text"
370
- )
371
-
372
- if st.button("Encode", type="primary", key="encode_btn"):
373
- if encode_text:
374
- with st.spinner("Encoding..."):
375
- embedding = model.encode(encode_text)
376
-
377
- st.markdown("### Embedding Vector")
378
-
379
- col1, col2, col3 = st.columns(3)
380
- with col1:
381
- st.metric("Dimensions", embedding.shape[1])
382
- with col2:
383
- st.metric("L2 Norm", f"{np.linalg.norm(embedding[0]):.4f}")
384
- with col3:
385
- st.metric("Mean Value", f"{embedding[0].mean():.4f}")
386
-
387
- st.markdown("#### First 20 values:")
388
- st.code(str(embedding[0][:20].round(4).tolist()))
389
-
390
- st.markdown("#### Value Distribution")
391
- st.markdown("A well-trained model produces a roughly Gaussian distribution centered near zero:")
392
- import plotly.express as px
393
- fig = px.histogram(
394
- x=embedding[0],
395
- nbins=50,
396
- title="Embedding Value Distribution",
397
- labels={'x': 'Value', 'y': 'Count'}
398
- )
399
- fig.update_layout(showlegend=False)
400
- st.plotly_chart(fig, width="stretch")
401
-
402
- # ============================================================================
403
- # TAB 5: CSV MATCHER
404
- # ============================================================================
405
-
406
- with tab5:
407
- st.markdown("### Bulk CSV Record Matcher")
408
- st.markdown("""
409
- Upload two CSV files and match rows across them using semantic similarity.
410
- This is useful for:
411
- - **Product deduplication** across e-commerce platforms
412
- - **Record linkage** between databases with inconsistent naming
413
- - **Cross-platform mapping** (e.g., matching supplier catalogs to your inventory)
414
-
415
- The model encodes the selected text column from each CSV, then ranks
416
- every row in CSV 2 against each row in CSV 1 by cosine similarity.
417
- """)
418
-
419
- col1, col2 = st.columns(2)
420
-
421
- with col1:
422
- st.markdown("#### Upload CSV 1 (Queries)")
423
- file1 = st.file_uploader("Upload primary CSV", type=['csv'], key="csv_file_1")
424
-
425
- with col2:
426
- st.markdown("#### Upload CSV 2 (Knowledge Base)")
427
- file2 = st.file_uploader("Upload secondary CSV", type=['csv'], key="csv_file_2")
428
-
429
- if file1 and file2:
430
- df1 = pd.read_csv(file1)
431
- df2 = pd.read_csv(file2)
432
-
433
- st.markdown("---")
434
- col_m1, col_m2 = st.columns(2)
435
-
436
- with col_m1:
437
- col1_name = st.selectbox("Select column to match from CSV 1", df1.columns, key="col1_sel")
438
-
439
- with col_m2:
440
- col2_name = st.selectbox("Select column to search in CSV 2", df2.columns, key="col2_sel")
441
-
442
- col_p1, col_p2 = st.columns(2)
443
- with col_p1:
444
- top_n_candidates = st.slider("Step 1: Top candidates to fetch", 1, 50, 10, help="Initial semantic search depth")
445
- with col_p2:
446
- top_m_final = st.slider("Step 2: Top matches to keep", 1, 10, 3, help="Final number of matches per row")
447
-
448
- if st.button("Start Bulk Matching", type="primary"):
449
- progress_bar = st.progress(0)
450
- status_text = st.empty()
451
-
452
- queries = df1[col1_name].fillna("").astype(str).tolist()
453
- corpus = df2[col2_name].fillna("").astype(str).tolist()
454
-
455
- status_text.text("Encoding search corpus (CSV 2)...")
456
- corpus_embs = model.encode(corpus, batch_size=128)
457
- progress_bar.progress(20)
458
-
459
- status_text.text("Encoding queries (CSV 1)...")
460
- query_embs = model.encode(queries, batch_size=128)
461
- progress_bar.progress(50)
462
-
463
- status_text.text("Computing similarities and mapping...")
464
- similarities = np.dot(query_embs, corpus_embs.T)
465
- progress_bar.progress(80)
466
-
467
- all_results = []
468
- for i in range(len(queries)):
469
- row_scores = similarities[i]
470
- top_indices = np.argsort(row_scores)[::-1][:top_m_final]
471
-
472
- res_row = df1.iloc[i].to_dict()
473
- for rank, idx in enumerate(top_indices, 1):
474
- res_row[f'Match_{rank}_{col2_name}'] = corpus[idx]
475
- res_row[f'Match_{rank}_Score'] = round(float(row_scores[idx]), 4)
476
- all_results.append(res_row)
477
-
478
- res_df = pd.DataFrame(all_results)
479
-
480
- progress_bar.progress(100)
481
- status_text.text("Matching complete.")
482
-
483
- st.markdown("### Results Preview")
484
- st.dataframe(res_df.head(50), width="stretch")
485
-
486
- output = io.StringIO()
487
- res_df.to_csv(output, index=False)
488
- csv_string = output.getvalue()
489
-
490
- st.download_button(
491
- label="Download Full Results CSV",
492
- data=csv_string,
493
- file_name="semantic_matching_results.csv",
494
- mime="text/csv",
495
- )
496
- else:
497
- st.info("Upload both CSV files to begin matching.")
498
-
499
-
500
- # ============================================================================
501
- # FOOTER
502
- # ============================================================================
503
-
504
- st.markdown("---")
505
- st.markdown("""
506
- <div style="text-align: center; color: #666; padding: 1rem;">
507
- <strong>MiniEmbed</strong> | Lightweight Text Embeddings |
508
- <a href="https://github.com/bhandarisuraz/miniembed">GitHub</a>
509
- </div>
510
- """, unsafe_allow_html=True)
examples/basic_usage.py DELETED
@@ -1,85 +0,0 @@
- """
- Basic Usage Example
- ===================
- Demonstrates encoding texts and computing similarity using MiniEmbed.
-
- This script shows the three core operations:
- 1. Encoding raw text into dense vectors
- 2. Computing pairwise similarity between two texts
- 3. Building a full similarity matrix across sets of texts
- """
-
- import sys
- sys.path.insert(0, '..')
-
- from src.inference import EmbeddingInference
-
-
- def main():
-     print("=" * 60)
-     print("MiniEmbed - Basic Usage Example")
-     print("=" * 60)
-
-     # Load the model
-     print("\nLoading model...")
-     model = EmbeddingInference.from_pretrained("../models/mini")
-     print("Model loaded.\n")
-
-     # -------------------------------------------------------------------------
-     # Example 1: Encode texts
-     # -------------------------------------------------------------------------
-     print("-" * 40)
-     print("Example 1: Encoding Texts")
-     print("-" * 40)
-
-     texts = [
-         "Machine learning is a branch of artificial intelligence",
-         "Deep learning uses neural networks with many layers",
-         "I love eating pizza on weekends"
-     ]
-
-     embeddings = model.encode(texts)
-     print(f"Input: {len(texts)} texts")
-     print(f"Output: {embeddings.shape}")  # (3, 256)
-
-     # -------------------------------------------------------------------------
-     # Example 2: Compute similarity
-     # -------------------------------------------------------------------------
-     print("\n" + "-" * 40)
-     print("Example 2: Computing Similarity")
-     print("-" * 40)
-
-     pairs = [
-         ("Machine learning is great", "AI is wonderful"),
-         ("Machine learning is great", "I love pizza"),
-         ("The cat sat on the mat", "A feline rested on the rug"),
-     ]
-
-     for text1, text2 in pairs:
-         similarity = model.similarity(text1, text2)
-         tag = "MATCH" if similarity > 0.5 else " LOW"
-         print(f" [{tag}] {similarity:.4f} | '{text1}' vs '{text2}'")
-
-     # -------------------------------------------------------------------------
-     # Example 3: Pairwise similarity matrix
-     # -------------------------------------------------------------------------
-     print("\n" + "-" * 40)
-     print("Example 3: Pairwise Similarity Matrix")
-     print("-" * 40)
-
-     texts_a = ["Machine learning", "Deep learning", "Natural language"]
-     texts_b = ["AI models", "Neural networks", "Text processing"]
-
-     similarity_matrix = model.pairwise_similarity(texts_a, texts_b)
-
-     print("\nSimilarity Matrix:")
-     print(" ", " ".join(f"{t[:10]:>10}" for t in texts_b))
-     for i, text in enumerate(texts_a):
-         row = " ".join(f"{similarity_matrix[i, j]:>10.4f}" for j in range(len(texts_b)))
-         print(f"{text[:12]:>12}: {row}")
-
-     print("\nDone.")
-
-
- if __name__ == "__main__":
-     main()
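Example 2 in the deleted script reports a cosine similarity for each text pair. Under the hood this is presumably the cosine of the angle between the two embedding vectors; a self-contained sketch with NumPy, using toy vectors rather than real MiniEmbed embeddings:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Cosine similarity = dot product divided by the product of norms.
    # For L2-normalized embeddings the denominator is 1, so the raw dot
    # product alone already gives the similarity.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([1.0, 0.0, 0.0])
b = np.array([1.0, 1.0, 0.0])
print(f"{cosine_similarity(a, b):.4f}")  # 0.7071 (i.e. 1/sqrt(2), a 45-degree angle)
```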
examples/clustering.py DELETED
@@ -1,109 +0,0 @@
- """
- Text Clustering Example
- =======================
- Demonstrates how to cluster texts by semantic similarity using MiniEmbed.
-
- The model encodes each text into a dense vector. K-Means clustering then
- groups these vectors by proximity in the embedding space, even if the texts
- share no common words.
- """
-
- import sys
- sys.path.insert(0, '..')
-
- from src.inference import EmbeddingInference
-
-
- def main():
-     print("=" * 60)
-     print("MiniEmbed - Text Clustering Example")
-     print("=" * 60)
-
-     # Load the model
-     print("\nLoading model...")
-     model = EmbeddingInference.from_pretrained("../models/mini")
-     print("Model loaded.\n")
-
-     # -------------------------------------------------------------------------
-     # Text collection (mixed topics)
-     # -------------------------------------------------------------------------
-     texts = [
-         # Technology
-         "Python is a versatile programming language",
-         "Machine learning models learn from data",
-         "JavaScript is used for web development",
-         "Neural networks process information like the brain",
-         "Software engineering involves designing systems",
-
-         # Food
-         "Pizza is my favorite Italian dish",
-         "Sushi is a traditional Japanese cuisine",
-         "Tacos are delicious Mexican street food",
-         "Pasta with marinara sauce is comforting",
-         "Ramen noodles are popular in Japan",
-
-         # Sports
-         "Football is the most popular sport worldwide",
-         "Basketball requires teamwork and skill",
-         "Tennis is an exciting individual sport",
-         "Swimming is great for cardiovascular health",
-         "Soccer World Cup attracts billions of viewers",
-
-         # Nature
-         "Mountains offer breathtaking scenic views",
-         "Oceans cover most of the Earth's surface",
-         "Forests are home to diverse wildlife",
-         "Rivers provide fresh water to ecosystems",
-         "Deserts have extreme temperature variations",
-     ]
-
-     print(f"Text Collection: {len(texts)} texts (4 topics)")
-
-     # -------------------------------------------------------------------------
-     # Cluster texts
-     # -------------------------------------------------------------------------
-     print("\nClustering texts into 4 groups...")
-
-     result = model.cluster_texts(texts, n_clusters=4)
-
-     # -------------------------------------------------------------------------
-     # Display results
-     # -------------------------------------------------------------------------
-     print("\n" + "=" * 60)
-     print("Clustering Results")
-     print("=" * 60)
-
-     for cluster_id in sorted(result['texts_by_cluster'].keys()):
-         cluster_texts = result['texts_by_cluster'][cluster_id]
-
-         print(f"\n Cluster {cluster_id + 1} ({len(cluster_texts)} texts)")
-         print("-" * 40)
-
-         for text in cluster_texts:
-             print(f" - {text}")
-
-     # -------------------------------------------------------------------------
-     # Evaluate clustering (simple check)
-     # -------------------------------------------------------------------------
-     print("\n" + "=" * 60)
-     print("Clustering Analysis")
-     print("=" * 60)
-
-     # Expected groupings (approximate)
-     expected = {
-         "Technology": texts[0:5],
-         "Food": texts[5:10],
-         "Sports": texts[10:15],
-         "Nature": texts[15:20],
-     }
-
-     print("\nLabels assigned to each text:")
-     for i, (text, label) in enumerate(zip(texts, result['labels'])):
-         topic = list(expected.keys())[i // 5]
-         print(f" [{label}] ({topic}) {text[:50]}...")
-
-     print("\nDone.")
-
-
- if __name__ == "__main__":
-     main()
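The `cluster_texts` call in the deleted script presumably wraps K-Means over the encoded vectors (scikit-learn is listed in `requirements.txt`, though the internals of `cluster_texts` are not shown in this diff). A toy sketch of that grouping step, with hand-made 2-D points standing in for embeddings:

```python
import numpy as np
from sklearn.cluster import KMeans

# Two well-separated groups of 2-D points stand in for text embeddings.
embeddings = np.array([
    [0.0, 0.1], [0.1, 0.0], [0.05, 0.05],   # group A
    [5.0, 5.1], [5.1, 5.0], [5.05, 5.05],   # group B
])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(embeddings)
labels = kmeans.labels_

# Group item indices by cluster id, mirroring result['texts_by_cluster']
by_cluster = {}
for i, label in enumerate(labels):
    by_cluster.setdefault(int(label), []).append(i)
print(by_cluster)
```

With real embeddings the same pattern applies: K-Means only sees vectors, so any texts whose embeddings land close together end up in the same cluster regardless of shared vocabulary.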
examples/semantic_search.py DELETED
@@ -1,108 +0,0 @@
- """
- Semantic Search Example
- =======================
- Demonstrates how to use MiniEmbed for document retrieval.
-
- The model encodes a query and a corpus of documents into the same vector space,
- then ranks documents by cosine similarity to the query. This finds results based
- on meaning, not keyword overlap.
- """
-
- import sys
- sys.path.insert(0, '..')
-
- from src.inference import EmbeddingInference
-
-
- def main():
-     print("=" * 60)
-     print("MiniEmbed - Semantic Search Example")
-     print("=" * 60)
-
-     # Load the model
-     print("\nLoading model...")
-     model = EmbeddingInference.from_pretrained("../models/mini")
-     print("Model loaded.\n")
-
-     # -------------------------------------------------------------------------
-     # Document collection
-     # -------------------------------------------------------------------------
-     documents = [
-         "Python is a high-level programming language known for its simplicity",
-         "Machine learning algorithms can learn patterns from data",
-         "The weather today is sunny with a high of 75 degrees",
-         "Neural networks are computational models inspired by the brain",
-         "JavaScript is widely used for web development",
-         "Deep learning has revolutionized computer vision and NLP",
-         "Cats are popular pets known for their independence",
-         "TensorFlow and PyTorch are popular deep learning frameworks",
-         "The stock market showed strong gains today",
-         "Natural language processing helps computers understand text"
-     ]
-
-     print(f"Document Collection: {len(documents)} documents")
-     for i, doc in enumerate(documents, 1):
-         print(f" {i}. {doc[:60]}...")
-
-     # -------------------------------------------------------------------------
-     # Search queries
-     # -------------------------------------------------------------------------
-     queries = [
-         "How do AI systems learn from examples?",
-         "What programming language is good for beginners?",
-         "Tell me about artificial neural networks",
-     ]
-
-     print("\n" + "=" * 60)
-     print("Search Results")
-     print("=" * 60)
-
-     for query in queries:
-         print(f"\n Query: \"{query}\"")
-         print("-" * 50)
-
-         results = model.search(query, documents, top_k=3)
-
-         for r in results:
-             score = r['score']
-             if score > 0.6:
-                 tag = "[HIGH]"
-             elif score > 0.4:
-                 tag = "[ MED]"
-             else:
-                 tag = "[ LOW]"
-
-             print(f" {tag} #{r['rank']} (score: {score:.4f})")
-             print(f" {r['text']}")
-
-     # -------------------------------------------------------------------------
-     # Interactive search (optional)
-     # -------------------------------------------------------------------------
-     print("\n" + "=" * 60)
-     print("Interactive Search")
-     print("=" * 60)
-     print("Enter your own queries (type 'quit' to exit):\n")
-
-     while True:
-         try:
-             query = input(" Query: ").strip()
-             if query.lower() in ['quit', 'exit', 'q']:
-                 break
-             if not query:
-                 continue
-
-             results = model.search(query, documents, top_k=3)
-
-             print("\n Results:")
-             for r in results:
-                 print(f" - [{r['score']:.3f}] {r['text'][:70]}...")
-             print()
-
-         except (KeyboardInterrupt, EOFError):
-             break
-
-     print("\nDone.")
-
-
- if __name__ == "__main__":
-     main()
models/mini/model.pt → model.pt RENAMED
File without changes
models/mini/model.safetensors → model.safetensors RENAMED
File without changes
models/large/README.md DELETED
@@ -1,5 +0,0 @@
- # MiniEmbed - Large
-
- Full-scale variant for maximum accuracy on complex semantic tasks.
-
- Coming soon...
models/medium/README.md DELETED
@@ -1,5 +0,0 @@
- # MiniEmbed - Medium
-
- Balanced variant offering higher accuracy with moderate compute requirements.
-
- Coming soon...
models/product/README.md DELETED
@@ -1,5 +0,0 @@
- # MiniEmbed - Product
-
- Fine-tuned variant of Mini, specialized for high-accuracy product matching.
-
- Coming soon...
models/small/README.md DELETED
@@ -1,5 +0,0 @@
- # MiniEmbed - Small
-
- A larger variant with increased capacity for general-purpose embeddings.
-
- Coming soon...
requirements.txt DELETED
@@ -1,14 +0,0 @@
- # Core
- torch>=2.0.0
- numpy>=1.21.0
- tqdm>=4.64.0
-
- # Demo UI
- streamlit>=1.30.0
- plotly>=5.0.0
-
- # Optional (for clustering, CSV processing, & Benchmarking)
- scikit-learn>=1.0.0
- pandas>=2.0.0
- psutil>=5.9.0
- sentence-transformers>=2.2.0
src/__pycache__/__init__.cpython-313.pyc ADDED
Binary file (500 Bytes). View file
 
src/__pycache__/inference.cpython-313.pyc ADDED
Binary file (14.7 kB). View file
 
src/__pycache__/model.cpython-313.pyc ADDED
Binary file (15 kB). View file
 
src/__pycache__/tokenizer.cpython-313.pyc ADDED
Binary file (7.06 kB). View file
 
models/mini/tokenizer.json → tokenizer.json RENAMED
File without changes
models/mini/training_info.json → training_info.json RENAMED
File without changes