akhil-vaidya committed
Commit 1b04a15 · verified · 1 Parent(s): 4402c16

Upload 30 files

.dockerignore ADDED
@@ -0,0 +1,84 @@
# Docker
.dockerignore
Dockerfile*
docker-compose*.yml

# Git
.git/
.gitignore

# Python Virtual Environment
.venv/
venv/
env/
ENV/
env.bak/
venv.bak/

# Python cache
__pycache__/
*.pyc
*.py[cod]
*$py.class
*.so

# Build artifacts
dist/
build/
develop-eggs/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
share/python-wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST

# IDE
.vscode/
.idea/
*.swp
*.swo
*~

# OS
.DS_Store
.DS_Store?
._*
.Spotlight-V100
.Trashes
ehthumbs.db
Thumbs.db

# Documentation (keep README.md)
docs/
*.md
!README.md

# Tests
tests/
test_*/
*_test.py
**/test_*.py

# Data files (you may want to adjust these based on your needs)
*.csv
*.json
*.pkl
*.parquet

# Logs
*.log
logs/

# Temporary files
tmp/
temp/
.tmp/
.gitignore ADDED
@@ -0,0 +1,12 @@
# Python-generated files
__pycache__/
*.py[oc]
build/
dist/
wheels/
*.egg-info

# Virtual environments
.venv
*.pdf
*.csv
.python-version ADDED
@@ -0,0 +1 @@
3.12.10
Dockerfile CHANGED
@@ -1,21 +1,40 @@
- FROM python:3.13.5-slim
-
- WORKDIR /app
-
- RUN apt-get update && apt-get install -y \
-     build-essential \
-     curl \
-     git \
-     && rm -rf /var/lib/apt/lists/*
-
- COPY requirements.txt ./
- COPY src/ ./src/
- COPY app/ ./app/
-
- RUN pip3 install -r requirements.txt
-
- EXPOSE 8501
-
- HEALTHCHECK CMD curl --fail http://localhost:8501/_stcore/health
-

# Dockerfile for QualiVec Streamlit Demo

# 1. Base image
FROM python:3.12-slim

# 2. Set the working directory
WORKDIR /app

# 3. Install system dependencies
RUN apt-get update && apt-get install -y \
    build-essential \
    curl \
    && rm -rf /var/lib/apt/lists/*

# 4. Install uv, a fast Python package manager
RUN pip install --no-cache-dir uv

# 5. Copy dependency definition files and README (required for the package build)
COPY pyproject.toml uv.lock README.md ./

# 6. Copy source code (needed for package installation)
COPY src/ ./src/

# 7. Install Python dependencies using uv
#    'uv pip install .' reads pyproject.toml and installs the project dependencies
RUN uv pip install --system --no-cache-dir .

# 8. Copy the rest of the application source code
#    Make sure you have a .dockerignore file to exclude .venv
COPY . .

# 9. Expose the port Streamlit runs on
EXPOSE 8501

# 10. Add a health check to verify the app is running
HEALTHCHECK CMD curl --fail http://localhost:8501/_stcore/health

# 11. Define the entry point (alternatively run via uv)
# ENTRYPOINT ["uv", "run", "app/run_demo.py"]
ENTRYPOINT ["python", "app/run_demo.py"]
README.md CHANGED
@@ -1,19 +1,712 @@
- ---
- title: Quailvec
- emoji: 🚀
- colorFrom: red
- colorTo: red
- sdk: docker
- app_port: 8501
- tags:
-   - streamlit
- pinned: false
- short_description: Streamlit template space
- ---
-
- # Welcome to Streamlit!
-
- Edit `/src/streamlit_app.py` to customize this app to your heart's desire. :heart:
-
- If you have any questions, checkout our [documentation](https://docs.streamlit.io) and [community forums](https://discuss.streamlit.io).
# QualiVec

**QualiVec** is a Python library for scalable qualitative content analysis powered by Large Language Model (LLM) embeddings. It bridges qualitative content analysis with machine learning by leveraging the semantic understanding capabilities of LLMs. Instead of relying on simple keyword matching or manually coding large datasets, QualiVec uses embedding vectors to capture semantic meaning and performs classification based on similarity to reference vectors.

Key features:
- LLM-based embedding generation
- Semantic similarity assessment using cosine similarity
- Deductive and inductive coding support
- Reference vector creation from labeled corpora
- Corpus-driven clustering for robust semantic anchor construction
- Large-scale document classification
- Domain-agnostic and model-flexible design
- Human-level performance in multi-domain content analysis
- Bootstrap evaluation with confidence intervals
- Threshold optimization for classification performance

## 💻 Installation

```bash
pip install qualivec
```

For a development installation:

```bash
git clone https://github.com/AkhilVaidya91/QualiVec.git
cd qualivec
pip install -e .
```

## 🖥️ Interactive Demo

QualiVec includes a comprehensive Streamlit web application that provides an interactive demonstration of the library's capabilities. The demo lets users upload their own data and walk through the full workflow of qualitative content analysis using LLM embeddings.

### Demo Features

- **Interactive Data Upload**: Upload your own CSV files for reference and labeled data
- **Model Configuration**: Choose from different pre-trained embedding models
- **Threshold Optimization**: Automatically find the optimal similarity threshold
- **Real-time Classification**: See classification results as they happen
- **Comprehensive Evaluation**: View detailed performance metrics and visualizations
- **Bootstrap Analysis**: Get confidence intervals for robust evaluation
- **Download Results**: Export classification results and metrics

### Getting Started with the Demo

1. **Install dependencies**:
   ```bash
   pip install -e .
   ```

2. **Run the demo**:
   ```bash
   cd app
   uv run run_demo.py
   ```

3. **Access the demo**:
   Open your browser and navigate to `http://localhost:8501`.

### Demo Walkthrough

#### 1. Data Upload Page
Upload your reference and labeled data files. The demo validates file formats and shows data statistics.

![Data Upload Interface](assets/data_upload.png)

#### 2. Configuration Page
Configure embedding models and optimization parameters. Choose from multiple pre-trained models and set classification thresholds.

![Configuration Interface](assets/config.png)

#### 3. Classification Page
Run the classification process with real-time progress updates. View optimization results and threshold analysis.

![Classification Process](assets/optim.png)

#### 4. Results Page
Examine detailed evaluation metrics, confusion matrices, bootstrap confidence intervals, and sample predictions.

![Results Dashboard](assets/bootstrap.png)

### Data Format Requirements

#### Reference Data (CSV)
Your reference data should contain:
- `tag`: The class/category label
- `sentence`: The example text for that category

Example:

| tag      | sentence                           |
|----------|------------------------------------|
| Positive | This is absolutely fantastic!      |
| Negative | This is terrible and disappointing |
| Neutral  | This is okay I guess               |

#### Labeled Data (CSV)
Your labeled data should contain:
- `sentence`: The text to be classified
- `Label`: The true class/category (for evaluation)

Example:

| sentence                         | Label    |
|----------------------------------|----------|
| I love this product so much!     | Positive |
| Not very good quality            | Negative |
| Average product nothing special  | Neutral  |
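Before uploading, it can help to sanity-check both CSVs with pandas; a minimal sketch (the `check_columns` helper below is illustrative, not part of QualiVec):

```python
import pandas as pd

def check_columns(df: pd.DataFrame, required: set) -> None:
    """Raise if any required column is missing from the DataFrame."""
    missing = required - set(df.columns)
    if missing:
        raise ValueError(f"Missing columns: {sorted(missing)}")

# Tiny stand-ins for the two CSVs described above
reference_df = pd.DataFrame({
    "tag": ["Positive", "Negative", "Neutral"],
    "sentence": ["This is absolutely fantastic!",
                 "This is terrible and disappointing",
                 "This is okay I guess"],
})
labeled_df = pd.DataFrame({
    "sentence": ["I love this product so much!"],
    "Label": ["Positive"],
})

check_columns(reference_df, {"tag", "sentence"})
check_columns(labeled_df, {"sentence", "Label"})
```

This mirrors the format validation the demo performs at upload time.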

## 🚀 Quick Start

Here's a simple example that classifies text data using reference vectors:

```python
from qualivec.classification import Classifier

# Initialize classifier
classifier = Classifier(verbose=True)

# Load models
classifier.load_models(model_name="sentence-transformers/all-MiniLM-L6-v2", threshold=0.7)

# Prepare reference vectors
reference_data = classifier.prepare_reference_vectors(
    reference_path="path/to/reference_vectors.csv",
    class_column="class",
    node_column="matching_node"
)

# Classify corpus
results_df = classifier.classify(
    corpus_path="path/to/corpus.csv",
    reference_data=reference_data,
    sentence_column="sentence",
    output_path="path/to/results.csv"
)

# Display distribution of classifications
print(results_df["predicted_class"].value_counts())
```

![QualiVec Classification Results](assets/distributions.png)

## 🧩 Core Concepts

| Concept | Description |
|---------|-------------|
| **Reference Vectors** | Semantic anchors that define each class or category, curated as representative example texts. |
| **Similarity Threshold** | Determines how similar a text must be to a reference vector to be classified as that class; higher values are more restrictive. |
| **Embedding** | Numerical vector representation of text that captures semantic meaning; similar texts have similar embeddings. |
| **Semantic Matching** | Uses cosine similarity between embeddings to assess how close texts are to reference vectors. |
| **Bootstrap Evaluation** | Statistical method for estimating uncertainty in evaluation metrics by resampling with replacement. |
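To make the matching concept concrete, here is a minimal, self-contained sketch of cosine-similarity classification against reference vectors (plain numpy, not QualiVec's actual implementation), including the fallback to an "Other" label when no similarity clears the threshold:

```python
import numpy as np

def classify_by_similarity(query, references, threshold=0.7):
    """Assign the label of the most similar reference vector,
    or "Other" if no cosine similarity reaches the threshold."""
    best_label, best_sim = "Other", threshold
    q = query / np.linalg.norm(query)
    for label, ref in references.items():
        r = ref / np.linalg.norm(ref)
        sim = float(np.dot(q, r))  # cosine similarity of unit vectors
        if sim >= best_sim:
            best_label, best_sim = label, sim
    return best_label, best_sim

# Toy 3-dimensional "embeddings" standing in for real LLM embeddings
references = {
    "Positive": np.array([1.0, 0.2, 0.0]),
    "Negative": np.array([0.0, 0.2, 1.0]),
}
label, sim = classify_by_similarity(np.array([0.9, 0.3, 0.1]), references)
```

Raising the threshold makes the "Other" outcome more likely, which is exactly the trade-off the threshold optimizer explores.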

## 🧰 Components

### Data Loading and Preparation

The `DataLoader` class handles loading and validation of data:

```python
from qualivec.data import DataLoader

# Initialize data loader
data_loader = DataLoader(verbose=True)

# Load corpus
corpus_df = data_loader.load_corpus(
    filepath="path/to/corpus.csv",
    sentence_column="sentence"
)

# Load reference vectors
reference_df = data_loader.load_reference_vectors(
    filepath="path/to/reference_vectors.csv",
    class_column="class",
    node_column="matching_node"
)

# Load labeled data for evaluation
labeled_df = data_loader.load_labeled_data(
    filepath="path/to/labeled_data.csv",
    label_column="label"
)

# Save results
data_loader.save_dataframe(df=results_df, filepath="path/to/output.csv")
```

### Embedding Generation

The `EmbeddingModel` class generates embeddings from text:

```python
from qualivec.embedding import EmbeddingModel

# Initialize embedding model
model = EmbeddingModel(
    model_name="sentence-transformers/all-MiniLM-L6-v2",
    device=None,  # Auto-selects CPU or GPU
    cache_dir=None,
    verbose=True
)

# Generate embeddings for a list of texts
texts = ["This is a sample text", "Another example text"]
embeddings = model.embed_texts(texts, batch_size=32)

# Generate embeddings from a DataFrame column
embeddings = model.embed_dataframe(df, text_column="sentence", batch_size=32)

# Generate embeddings for reference vectors
reference_data = model.embed_reference_vectors(
    df=reference_df,
    class_column="class",
    node_column="matching_node",
    batch_size=32
)
```

### Semantic Matching

The `SemanticMatcher` class performs semantic matching using cosine similarity:

```python
from qualivec.matching import SemanticMatcher

# Initialize matcher with similarity threshold
matcher = SemanticMatcher(threshold=0.7, verbose=True)

# Match query embeddings against reference vectors
match_results = matcher.match(
    query_embeddings=query_embeddings,
    reference_data=reference_data,
    return_similarities=False
)

# Classify an entire corpus
classified_df = matcher.classify_corpus(
    corpus_embeddings=corpus_embeddings,
    reference_data=reference_data,
    corpus_df=corpus_df
)
```

### Classification

The `Classifier` class combines embedding and matching for end-to-end classification:

```python
from qualivec.classification import Classifier

# Initialize classifier
classifier = Classifier(verbose=True)

# Load models
classifier.load_models(
    model_name="sentence-transformers/all-MiniLM-L6-v2",
    threshold=0.7
)

# Prepare reference vectors
reference_data = classifier.prepare_reference_vectors(
    reference_path="path/to/reference_vectors.csv",
    class_column="class",
    node_column="matching_node"
)

# Classify corpus
results_df = classifier.classify(
    corpus_path="path/to/corpus.csv",
    reference_data=reference_data,
    sentence_column="sentence",
    output_path="path/to/results.csv"
)

# Evaluate classification performance
eval_results = classifier.evaluate_classification(
    labeled_path="path/to/labeled_data.csv",
    reference_data=reference_data,
    sentence_column="sentence",
    label_column="label",
    optimize_threshold=False
)
```

### Evaluation

The `Evaluator` class evaluates classification performance:

```python
from qualivec.evaluation import Evaluator

# Initialize evaluator
evaluator = Evaluator(verbose=True)

# Simple evaluation
results = evaluator.evaluate(
    true_labels=true_labels,
    predicted_labels=predicted_labels,
    class_names=class_names
)

# Bootstrap evaluation with confidence intervals
bootstrap_results = evaluator.bootstrap_evaluate(
    true_labels=true_labels,
    predicted_labels=predicted_labels,
    n_iterations=1000,
    confidence_levels=[0.9, 0.95, 0.99],
    random_seed=42
)

# Plot confusion matrix
evaluator.plot_confusion_matrix(
    confusion_matrix=results['confusion_matrix'],
    class_names=results['confusion_matrix_labels']
)

# Plot bootstrap distributions
evaluator.plot_bootstrap_distributions(bootstrap_results)
```

![QualiVec Confusion Matrix](assets/confusion_matrix.png)

### Threshold Optimization

The `ThresholdOptimizer` class finds the optimal similarity threshold:

```python
from qualivec.optimization import ThresholdOptimizer

# Initialize optimizer
optimizer = ThresholdOptimizer(verbose=True)

# Optimize threshold
optimization_results = optimizer.optimize(
    query_embeddings=query_embeddings,
    reference_data=reference_data,
    true_labels=true_labels,
    start=0.5,
    end=0.9,
    step=0.01,
    metric="f1_macro",
    bootstrap=True,
    n_bootstrap=100,
    confidence_level=0.95
)

# Plot optimization results
optimizer.plot_optimization_results(
    results=optimization_results,
    metrics=["accuracy", "precision_macro", "recall_macro", "f1_macro"]
)

# Plot class distribution at different thresholds
optimizer.plot_class_distribution(
    results=optimization_results,
    top_n=10
)
```

### Sampling

The `Sampler` class helps create samples for manual coding:

```python
from qualivec.sampling import Sampler

# Initialize sampler
sampler = Sampler(verbose=True)

# Random sampling
random_sample = sampler.sample(
    df=corpus_df,
    sampling_type="random",
    sample_size=0.1,  # 10% of corpus
    seed=42,
    label_column="Label"
)

# Stratified sampling
stratified_sample = sampler.sample(
    df=corpus_df,
    sampling_type="stratified",
    sample_size=0.1,
    stratify_column="category",
    seed=42,
    label_column="Label"
)
```
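Under the hood, stratified sampling draws the same fraction from each stratum so that class proportions are preserved. A minimal pandas sketch of the idea (not QualiVec's actual implementation):

```python
import pandas as pd

# An imbalanced corpus: 80 documents in stratum "A", 20 in stratum "B"
df = pd.DataFrame({
    "sentence": [f"text {i}" for i in range(100)],
    "category": ["A"] * 80 + ["B"] * 20,
})

# Draw 10% from each category, preserving the 80/20 split
sample = df.groupby("category", group_keys=False).sample(frac=0.1, random_state=42)
```

Conceptually, `sampling_type="stratified"` performs the same proportion-preserving draw over the `stratify_column`.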

## 📚 Usage Examples

### Preparing Reference Vectors

Reference vectors are the foundation of classification in QualiVec. Here's how to prepare them:

```python
# Step 1: Sample data for manual coding
from qualivec.sampling import Sampler

sampler = Sampler(verbose=True)
sample_df = sampler.sample(
    df=corpus_df,
    sampling_type="stratified",
    sample_size=0.05,  # 5% of corpus
    stratify_column="document_type"
)

# Step 2: Save sample for manual coding
sample_df.to_csv("sample_for_coding.csv", index=False)

# Step 3: After manual coding, load the coded data
from qualivec.data import DataLoader

data_loader = DataLoader(verbose=True)
coded_df = data_loader.load_labeled_data(
    filepath="coded_sample.csv",
    label_column="coded_class"
)

# Step 4: Generate embeddings for reference vectors
from qualivec.embedding import EmbeddingModel

model = EmbeddingModel(verbose=True)
reference_data = model.embed_reference_vectors(
    df=coded_df,
    class_column="coded_class",
    node_column="sentence"
)

# Step 5: Save reference data for future use
import pickle
with open("reference_data.pkl", "wb") as f:
    pickle.dump(reference_data, f)
```

### Classifying New Data

Once reference vectors are prepared, you can classify new data:

```python
# Load reference data
import pickle
with open("reference_data.pkl", "rb") as f:
    reference_data = pickle.load(f)

# Initialize classifier
from qualivec.classification import Classifier

classifier = Classifier(verbose=True)
classifier.load_models(threshold=0.7)

# Classify corpus
results_df = classifier.classify(
    corpus_path="new_corpus.csv",
    reference_data=reference_data,
    sentence_column="sentence",
    output_path="classified_corpus.csv"
)

# Analyze results
import pandas as pd
import matplotlib.pyplot as plt

# Distribution of classes
plt.figure(figsize=(10, 6))
results_df["predicted_class"].value_counts().plot(kind="bar")
plt.title("Distribution of Predicted Classes")
plt.tight_layout()
plt.show()

# Average similarity by class
results_df.groupby("predicted_class")["similarity_score"].mean().sort_values().plot(kind="barh")
plt.title("Average Similarity Score by Class")
plt.tight_layout()
plt.show()
```

### Evaluating Classification Performance

To assess how well your classification is performing:

```python
# Load labeled data
from qualivec.data import DataLoader

data_loader = DataLoader(verbose=True)
labeled_df = data_loader.load_labeled_data(
    filepath="labeled_test_set.csv",
    label_column="true_label"
)

# Generate embeddings
from qualivec.embedding import EmbeddingModel

model = EmbeddingModel(verbose=True)
labeled_embeddings = model.embed_dataframe(
    df=labeled_df,
    text_column="sentence"
)

# Match against reference vectors
from qualivec.evaluation import Evaluator
from qualivec.matching import SemanticMatcher

matcher = SemanticMatcher(threshold=0.7, verbose=True)
match_results = matcher.match(labeled_embeddings, reference_data)
predicted_labels = match_results["predicted_class"].tolist()
true_labels = labeled_df["true_label"].tolist()

evaluator = Evaluator(verbose=True)

# Simple evaluation
eval_results = evaluator.evaluate(
    true_labels=true_labels,
    predicted_labels=predicted_labels
)

# Bootstrap evaluation
bootstrap_results = evaluator.bootstrap_evaluate(
    true_labels=true_labels,
    predicted_labels=predicted_labels,
    n_iterations=1000
)

# Plot confusion matrix
evaluator.plot_confusion_matrix(
    confusion_matrix=eval_results['confusion_matrix'],
    class_names=eval_results['confusion_matrix_labels']
)

# Plot bootstrap distributions
evaluator.plot_bootstrap_distributions(bootstrap_results)
```

### Optimizing Similarity Thresholds

To find the optimal similarity threshold for your classification:

```python
# Initialize optimizer
from qualivec.optimization import ThresholdOptimizer

optimizer = ThresholdOptimizer(verbose=True)

# Optimize threshold
optimization_results = optimizer.optimize(
    query_embeddings=labeled_embeddings,
    reference_data=reference_data,
    true_labels=true_labels,
    start=0.5,
    end=0.9,
    step=0.01,
    metric="f1_macro"
)

# Plot optimization results
optimizer.plot_optimization_results(
    results=optimization_results,
    metrics=["accuracy", "f1_macro"]
)

# Plot class distribution
optimizer.plot_class_distribution(
    results=optimization_results,
    top_n=5
)

# Use the optimal threshold
optimal_threshold = optimization_results["optimal_threshold"]
print(f"Optimal threshold: {optimal_threshold}")

# Create a new matcher with the optimal threshold
matcher = SemanticMatcher(threshold=optimal_threshold, verbose=True)
```

### Sampling for Manual Coding

To create samples for manual coding or validation:

```python
from qualivec.sampling import Sampler

sampler = Sampler(verbose=True)

# Random sampling
random_sample = sampler.sample(
    df=corpus_df,
    sampling_type="random",
    sample_size=100,  # 100 documents
    seed=42
)

# Stratified sampling by predicted class
stratified_sample = sampler.sample(
    df=results_df,
    sampling_type="stratified",
    sample_size=0.1,  # 10% of corpus
    stratify_column="predicted_class",
    seed=42
)

# Save samples for manual coding
random_sample.to_csv("random_sample_for_coding.csv", index=False)
stratified_sample.to_csv("stratified_sample_for_coding.csv", index=False)
```

## 📖 API Reference

### DataLoader

```python
class DataLoader:
    def __init__(self, verbose=True)
    def load_corpus(self, filepath, sentence_column="sentence")
    def load_reference_vectors(self, filepath, class_column="class", node_column="matching_node")
    def load_labeled_data(self, filepath, label_column="label")
    def save_dataframe(self, df, filepath)
    def validate_labels(self, labeled_df, reference_df, label_column="label", class_column="class")
```

### Sampler

```python
class Sampler:
    def __init__(self, verbose=True)
    def sample(self, df, sampling_type="random", sample_size=0.1, stratify_column=None,
               seed=None, label_column="Label")
```

### EmbeddingModel

```python
class EmbeddingModel:
    def __init__(self, model_name="sentence-transformers/all-MiniLM-L6-v2",
                 device=None, cache_dir=None, verbose=True)
    def embed_texts(self, texts, batch_size=32)
    def embed_dataframe(self, df, text_column, batch_size=32)
    def embed_reference_vectors(self, df, class_column="class",
                                node_column="matching_node", batch_size=32)
```

### SemanticMatcher

```python
class SemanticMatcher:
    def __init__(self, threshold=0.7, verbose=True)
    def match(self, query_embeddings, reference_data, return_similarities=False)
    def classify_corpus(self, corpus_embeddings, reference_data, corpus_df)
```

### Evaluator

```python
class Evaluator:
    def __init__(self, verbose=True)
    def evaluate(self, true_labels, predicted_labels, class_names=None)
    def bootstrap_evaluate(self, true_labels, predicted_labels, n_iterations=1000,
                           confidence_levels=[0.9, 0.95, 0.99], random_seed=None)
    def plot_confusion_matrix(self, confusion_matrix, class_names,
                              figsize=(10, 8), title="Confusion Matrix")
    def plot_bootstrap_distributions(self, bootstrap_results, figsize=(12, 8))
```

### ThresholdOptimizer

```python
class ThresholdOptimizer:
    def __init__(self, verbose=True)
    def optimize(self, query_embeddings, reference_data, true_labels,
                 start=0.0, end=1.0, step=0.01, metric="f1_macro",
                 bootstrap=True, n_bootstrap=100, confidence_level=0.95, random_seed=None)
    def plot_optimization_results(self, results, metrics=None, figsize=(12, 6))
    def plot_class_distribution(self, results, top_n=10, figsize=(12, 8))
```

### Classifier

```python
class Classifier:
    def __init__(self, embedding_model=None, matcher=None, verbose=True)
    def load_models(self, model_name="sentence-transformers/all-MiniLM-L6-v2", threshold=0.7)
    def prepare_reference_vectors(self, reference_path, class_column="class",
                                  node_column="matching_node")
    def classify(self, corpus_path, reference_data, sentence_column="sentence",
                 output_path=None)
    def evaluate_classification(self, labeled_path, reference_data,
                                sentence_column="sentence", label_column="label",
                                optimize_threshold=False, start=0.5, end=0.9, step=0.01)
```

## 💡 Best Practices

1. **Reference Vector Quality**: The quality of your reference vectors greatly impacts classification performance. Ensure they are representative and distinct.

2. **Model Selection**: Larger models generally provide better semantic understanding but are slower. For simple tasks, smaller models like MiniLM may be sufficient.

3. **Threshold Tuning**: Always optimize the similarity threshold for your specific dataset and task.

4. **Evaluation**: Use bootstrap evaluation to get confidence intervals around your metrics, especially for smaller datasets.
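The bootstrap idea in practice, as a plain-numpy sketch of a percentile confidence interval (illustrative, not QualiVec's implementation): resample prediction/label pairs with replacement, recompute the metric each time, and take percentiles of the resulting distribution:

```python
import numpy as np

def bootstrap_accuracy_ci(true, pred, n_iterations=1000, confidence=0.95, seed=42):
    """Percentile bootstrap CI for accuracy: resample (true, pred) pairs
    with replacement and recompute accuracy each iteration."""
    rng = np.random.default_rng(seed)
    true, pred = np.asarray(true), np.asarray(pred)
    n = len(true)
    scores = []
    for _ in range(n_iterations):
        idx = rng.integers(0, n, size=n)  # sample indices with replacement
        scores.append(np.mean(true[idx] == pred[idx]))
    alpha = (1 - confidence) / 2
    return np.quantile(scores, alpha), np.quantile(scores, 1 - alpha)

# Toy labels with 80% agreement
low, high = bootstrap_accuracy_ci(
    ["Pos", "Neg", "Pos", "Neg", "Pos"] * 20,
    ["Pos", "Neg", "Neg", "Neg", "Pos"] * 20,
)
```

The smaller the dataset, the wider this interval gets, which is why bootstrapping matters most for small evaluation sets.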

5. **Class Imbalance**: Be aware of class imbalance in your data. Consider using stratified sampling for creating evaluation sets.

6. **Preprocessing**: Clean and preprocess your text data before embedding for best results.

7. **Out-of-Domain Detection**: Use the "Other" class (assigned when similarity falls below the threshold) to identify texts that might need new reference vectors.

## 📄 License

This project is licensed under the MIT License; see the LICENSE file for details.
app/app.py CHANGED
@@ -1,916 +1,916 @@
1
- import streamlit as st
2
- import pandas as pd
3
- import numpy as np
4
- import matplotlib.pyplot as plt
5
- import seaborn as sns
6
- import tempfile
7
- import os
8
- import sys
9
- from io import StringIO
10
- import plotly.express as px
11
- import plotly.graph_objects as go
12
- from plotly.subplots import make_subplots
13
-
14
- # Add the parent directory to sys.path to import the module
15
- sys.path.append(os.path.abspath(os.path.join(os.path.dirname(__file__), '..')))
16
-
17
- from .data import DataLoader
18
- from .embedding import EmbeddingModel
19
- from .matching import SemanticMatcher
20
- from .classification import Classifier
21
- from .evaluation import Evaluator
22
- from .optimization import ThresholdOptimizer
23
-
24
- # Set page config
25
- st.set_page_config(
26
- page_title="QualiVec Demo",
27
- page_icon="πŸ”",
28
- layout="wide",
29
- initial_sidebar_state="expanded"
30
- )
31
-
32
- # Custom CSS for better styling
33
- st.markdown("""
34
- <style>
35
- .main-header {
36
- font-size: 2.5rem;
37
- font-weight: bold;
38
- color: #2E4057;
39
- text-align: center;
40
- margin-bottom: 2rem;
41
- }
42
- .section-header {
43
- font-size: 1.5rem;
44
- font-weight: bold;
45
- color: #048A81;
46
- margin-top: 2rem;
47
- margin-bottom: 1rem;
48
- }
49
- .metric-card {
50
- background-color: #f0f2f6;
51
- padding: 1rem;
52
- border-radius: 0.5rem;
53
- margin: 0.5rem 0;
54
- }
55
- .success-message {
56
- background-color: #d4edda;
57
- color: #155724;
58
- padding: 1rem;
59
- border-radius: 0.5rem;
60
- margin: 1rem 0;
61
- }
62
- .warning-message {
63
- background-color: #fff3cd;
64
- color: #856404;
65
- padding: 1rem;
66
- border-radius: 0.5rem;
67
- margin: 1rem 0;
68
- }
69
- </style>
70
- """, unsafe_allow_html=True)
71
-
72
- def main():
73
- st.markdown('<div class="main-header">πŸ” QualiVec Demo</div>', unsafe_allow_html=True)
74
- st.markdown("""
75
- <div style="text-align: center; margin-bottom: 2rem;">
76
- <p style="font-size: 1.2rem; color: #666;">
77
- Qualitative Content Analysis with LLM Embeddings
78
- </p>
79
- </div>
80
- """, unsafe_allow_html=True)
81
-
82
- # Sidebar for navigation
83
- st.sidebar.title("Navigation")
84
- page = st.sidebar.selectbox(
85
- "Choose a page",
86
- ["🏠 Home", "πŸ“Š Data Upload", "πŸ”§ Configuration", "🎯 Classification", "πŸ“ˆ Results"]
87
- )
88
-
89
- # Initialize session state
90
- if 'classifier' not in st.session_state:
91
- st.session_state.classifier = None
92
- if 'reference_data' not in st.session_state:
93
- st.session_state.reference_data = None
94
- if 'labeled_data' not in st.session_state:
95
- st.session_state.labeled_data = None
96
- if 'optimization_results' not in st.session_state:
97
- st.session_state.optimization_results = None
98
- if 'evaluation_results' not in st.session_state:
99
- st.session_state.evaluation_results = None
100
-
101
- # Route to different pages
102
- if page == "🏠 Home":
103
- show_home_page()
104
- elif page == "πŸ“Š Data Upload":
105
- show_data_upload_page()
106
- elif page == "πŸ”§ Configuration":
107
- show_configuration_page()
108
- elif page == "🎯 Classification":
109
- show_classification_page()
110
- elif page == "πŸ“ˆ Results":
111
- show_results_page()
112
-
113
- def show_home_page():
114
- st.markdown('<div class="section-header">Welcome to QualiVec</div>', unsafe_allow_html=True)
115
-
116
- col1, col2, col3 = st.columns([1, 2, 1])
117
-
118
- with col2:
119
- st.markdown("""
120
- ### What is QualiVec?
121
-
122
- QualiVec is a Python library that uses Large Language Model (LLM) embeddings for qualitative content analysis. It helps researchers and analysts classify text data by comparing it against reference examples.
123
-
124
- ### Key Features:
125
- - **Semantic Matching**: Uses advanced embedding models to find semantic similarity
126
- - **Threshold Optimization**: Automatically finds the best similarity threshold
127
- - **Comprehensive Evaluation**: Provides detailed metrics and visualizations
128
- - **Bootstrap Analysis**: Confidence intervals for robust evaluation
129
-
130
- ### How It Works:
131
- 1. **Upload Data**: Provide reference examples and data to classify
132
- 2. **Configure**: Set up embedding models and parameters
133
- 3. **Optimize**: Find the best threshold for classification
134
- 4. **Classify**: Apply the model to your data
135
- 5. **Evaluate**: Get detailed performance metrics
136
-
137
- ### Getting Started:
138
- Use the sidebar to navigate through the demo. Start with **Data Upload** to begin your analysis.
139
- """)
140
-
141
- # Add sample data info
142
- st.markdown('<div class="section-header">Sample Data Format</div>', unsafe_allow_html=True)
143
-
144
- col1, col2 = st.columns(2)
145
-
146
- with col1:
147
- st.markdown("**Reference Data Format:**")
148
- sample_ref = pd.DataFrame({
149
- 'tag': ['Positive', 'Negative', 'Neutral'],
150
- 'sentence': ['This is great!', 'This is terrible', 'This is okay']
151
- })
152
- st.dataframe(sample_ref, use_container_width=True)
153
-
154
- with col2:
155
- st.markdown("**Labeled Data Format:**")
156
- sample_labeled = pd.DataFrame({
157
- 'sentence': ['I love this product', 'Not very good', 'Average quality'],
158
- 'Label': ['Positive', 'Negative', 'Neutral']
159
- })
160
- st.dataframe(sample_labeled, use_container_width=True)
161
-
162
- def show_data_upload_page():
163
- st.markdown('<div class="section-header">Data Upload</div>', unsafe_allow_html=True)
164
-
165
- col1, col2 = st.columns(2)
166
-
167
- with col1:
168
- st.markdown("### Reference Data")
169
- st.markdown("Upload a CSV file containing reference examples with columns: `tag` (class) and `sentence` (example text)")
170
-
171
- reference_file = st.file_uploader(
172
- "Choose reference data file",
173
- type=['csv'],
174
- key='reference_file'
175
- )
176
-
177
- if reference_file is not None:
178
- try:
179
- reference_df = pd.read_csv(reference_file)
180
- st.success("Reference data loaded successfully!")
181
- st.dataframe(reference_df.head(), use_container_width=True)
182
-
183
- # Validate columns
184
- required_cols = ['tag', 'sentence']
185
- missing_cols = [col for col in required_cols if col not in reference_df.columns]
186
-
187
- if missing_cols:
188
- st.error(f"Missing required columns: {missing_cols}")
189
- else:
190
- # Prepare reference data
191
- reference_df = reference_df.rename(columns={
192
- 'tag': 'class',
193
- 'sentence': 'matching_node'
194
- })
195
- st.session_state.reference_data = reference_df
196
-
197
- # Show statistics
198
- st.markdown("**Data Statistics:**")
199
- st.write(f"- Total examples: {len(reference_df)}")
200
- st.write(f"- Unique classes: {reference_df['class'].nunique()}")
201
- st.write(f"- Class distribution:")
202
- st.write(reference_df['class'].value_counts())
203
-
204
- except Exception as e:
205
- st.error(f"Error loading reference data: {str(e)}")
206
-
207
- with col2:
208
- st.markdown("### Labeled Data")
209
- st.markdown("Upload a CSV file containing data to classify with columns: `sentence` (text) and `Label` (true class)")
210
-
211
- labeled_file = st.file_uploader(
212
- "Choose labeled data file",
213
- type=['csv'],
214
- key='labeled_file'
215
- )
216
-
217
- if labeled_file is not None:
218
- try:
219
- labeled_df = pd.read_csv(labeled_file)
220
- st.success("Labeled data loaded successfully!")
221
- st.dataframe(labeled_df.head(), use_container_width=True)
222
-
223
- # Validate columns
224
- required_cols = ['sentence', 'Label']
225
- missing_cols = [col for col in required_cols if col not in labeled_df.columns]
226
-
227
- if missing_cols:
228
- st.error(f"Missing required columns: {missing_cols}")
229
- else:
230
- # Prepare labeled data
231
- labeled_df = labeled_df.rename(columns={'Label': 'label'})
232
- labeled_df['label'] = labeled_df['label'].replace({0: 'Other', '0': 'Other'})  # handle numeric and string zeros
233
- st.session_state.labeled_data = labeled_df
234
-
235
- # Show statistics
236
- st.markdown("**Data Statistics:**")
237
- st.write(f"- Total samples: {len(labeled_df)}")
238
- st.write(f"- Unique labels: {labeled_df['label'].nunique()}")
239
- st.write(f"- Label distribution:")
240
- st.write(labeled_df['label'].value_counts())
241
-
242
- except Exception as e:
243
- st.error(f"Error loading labeled data: {str(e)}")
244
-
245
- # Show data compatibility check
246
- if st.session_state.reference_data is not None and st.session_state.labeled_data is not None:
247
- st.markdown('<div class="section-header">Data Compatibility Check</div>', unsafe_allow_html=True)
248
-
249
- ref_classes = set(st.session_state.reference_data['class'].unique())
250
- labeled_classes = set(st.session_state.labeled_data['label'].unique())
251
-
252
- # Check for unknown classes
253
- unknown_classes = labeled_classes - ref_classes
254
-
255
- if unknown_classes:
256
- st.warning(f"Warning: Labels in labeled data not found in reference data: {unknown_classes}")
257
- else:
258
- st.success("βœ… Data compatibility check passed!")
259
-
260
- # Show class overlap
261
- st.markdown("**Class Overlap Analysis:**")
262
- col1, col2, col3 = st.columns(3)
263
-
264
- with col1:
265
- st.metric("Reference Classes", len(ref_classes))
266
- with col2:
267
- st.metric("Labeled Classes", len(labeled_classes))
268
- with col3:
269
- st.metric("Common Classes", len(ref_classes.intersection(labeled_classes)))
270
-
271
- def show_configuration_page():
272
- st.markdown('<div class="section-header">Model Configuration</div>', unsafe_allow_html=True)
273
-
274
- # Check if data is loaded
275
- if st.session_state.reference_data is None or st.session_state.labeled_data is None:
276
- st.warning("Please upload both reference and labeled data first.")
277
- return
278
-
279
- col1, col2 = st.columns(2)
280
-
281
- with col1:
282
- st.markdown("### Embedding Model")
283
-
284
- # Model type selection
285
- model_type = st.selectbox(
286
- "Choose model type",
287
- ["HuggingFace", "Gemini"],
288
- help="Select the type of embedding model to use"
289
- )
290
-
291
- # Model selection based on type
292
- if model_type == "HuggingFace":
293
- model_options = [
294
- "sentence-transformers/all-MiniLM-L6-v2",
295
- "sentence-transformers/all-mpnet-base-v2",
296
- "sentence-transformers/distilbert-base-nli-mean-tokens"
297
- ]
298
-
299
- selected_model = st.selectbox(
300
- "Choose HuggingFace model",
301
- model_options,
302
- help="Select the pre-trained HuggingFace model for generating embeddings"
303
- )
304
- else: # Gemini
305
- gemini_models = [
306
- "gemini-embedding-001",
307
- "text-embedding-004"
308
- ]
309
-
310
- selected_model = st.selectbox(
311
- "Choose Gemini model",
312
- gemini_models,
313
- help="Select the Gemini embedding model for generating embeddings"
314
- )
315
-
316
- # Calculate total texts to process
317
- total_texts = 0
318
- if st.session_state.reference_data is not None:
319
- total_texts += len(st.session_state.reference_data)
320
- if st.session_state.labeled_data is not None:
321
- total_texts += len(st.session_state.labeled_data)
322
-
323
- st.warning(
324
- f"⚠️ **Gemini API Rate Limits (Free Tier)**\n\n"
325
- f"- 1,500 requests per day\n"
326
- f"- Each batch of 100 texts = 1 request\n"
327
- f"- Your current dataset: ~{total_texts} texts\n"
328
- f"- Estimated requests needed: ~{(total_texts // 100) + 1}\n\n"
329
- f"If you exceed quota, consider:\n"
330
- f"1. Using a smaller dataset\n"
331
- f"2. Switching to HuggingFace models (no limits)\n"
332
- f"3. Upgrading to a paid API plan"
333
- )
334
-
335
- st.info("πŸ’‘ Note: Using Gemini embeddings requires GOOGLE_API_KEY environment variable to be set.")
336
-
337
- st.markdown("### Initial Threshold")
338
- initial_threshold = st.slider(
339
- "Initial similarity threshold",
340
- min_value=0.0,
341
- max_value=1.0,
342
- value=0.7,
343
- step=0.05,
344
- help="Cosine similarity threshold for classification"
345
- )
346
-
347
- with col2:
348
- st.markdown("### Optimization Parameters")
349
-
350
- optimize_threshold = st.checkbox(
351
- "Enable threshold optimization",
352
- value=True,
353
- help="Automatically find the best threshold"
354
- )
355
-
356
- if optimize_threshold:
357
- col2_1, col2_2 = st.columns(2)
358
-
359
- with col2_1:
360
- start_threshold = st.slider(
361
- "Start threshold",
362
- min_value=0.0,
363
- max_value=1.0,
364
- value=0.5,
365
- step=0.05
366
- )
367
-
368
- end_threshold = st.slider(
369
- "End threshold",
370
- min_value=0.0,
371
- max_value=1.0,
372
- value=0.9,
373
- step=0.05
374
- )
375
-
376
- with col2_2:
377
- step_size = st.slider(
378
- "Step size",
379
- min_value=0.005,
380
- max_value=0.05,
381
- value=0.01,
382
- step=0.005
383
- )
384
-
385
- optimization_metric = st.selectbox(
386
- "Optimization metric",
387
- ["f1_macro", "accuracy", "precision_macro", "recall_macro"]
388
- )
389
-
390
- # Load models button
391
- if st.button("Initialize Models", type="primary"):
392
- with st.spinner("Loading models... This may take a few minutes."):
393
- try:
394
- # Initialize classifier
395
- classifier = Classifier(verbose=False)
396
-
397
- # Determine model type parameter
398
- model_type_param = "gemini" if model_type == "Gemini" else "huggingface"
399
-
400
- classifier.load_models(
401
- model_name=selected_model,
402
- model_type=model_type_param,
403
- threshold=initial_threshold
404
- )
405
-
406
- # Prepare reference vectors
407
- with tempfile.NamedTemporaryFile(mode='w', suffix='.csv', delete=False) as tmp_ref:
408
- tmp_ref_path = tmp_ref.name
409
- st.session_state.reference_data.to_csv(tmp_ref_path, index=False)
410
-
411
- try:
412
- reference_data = classifier.prepare_reference_vectors(
413
- reference_path=tmp_ref_path,
414
- class_column='class',
415
- node_column='matching_node'
416
- )
417
- finally:
418
- # Ensure file is deleted even if an error occurs
419
- try:
420
- os.unlink(tmp_ref_path)
421
- except (OSError, PermissionError):
422
- pass # File might already be deleted or locked
423
-
424
- st.session_state.classifier = classifier
425
- st.session_state.reference_vectors = reference_data
426
- st.session_state.config = {
427
- 'model_type': model_type,
428
- 'model_name': selected_model,
429
- 'initial_threshold': initial_threshold,
430
- 'optimize_threshold': optimize_threshold,
431
- 'start_threshold': start_threshold if optimize_threshold else None,
432
- 'end_threshold': end_threshold if optimize_threshold else None,
433
- 'step_size': step_size if optimize_threshold else None,
434
- 'optimization_metric': optimization_metric if optimize_threshold else None
435
- }
436
-
437
- st.success("βœ… Models initialized successfully!")
438
-
439
- except Exception as e:
440
- st.error(f"Error initializing models: {str(e)}")
441
-
442
- # Show current configuration
443
- if st.session_state.classifier is not None:
444
- st.markdown('<div class="section-header">Current Configuration</div>', unsafe_allow_html=True)
445
-
446
- config = st.session_state.config
447
-
448
- col1, col2, col3 = st.columns(3)
449
-
450
- with col1:
451
- st.markdown("**Model Settings:**")
452
- st.write(f"- Model type: {config['model_type']}")
453
- st.write(f"- Model: {config['model_name']}")
454
- st.write(f"- Initial threshold: {config['initial_threshold']}")
455
-
456
- with col2:
457
- st.markdown("**Optimization:**")
458
- st.write(f"- Enabled: {config['optimize_threshold']}")
459
- if config['optimize_threshold']:
460
- st.write(f"- Range: {config['start_threshold']:.2f} - {config['end_threshold']:.2f}")
461
- st.write(f"- Step: {config['step_size']:.3f}")
462
-
463
- with col3:
464
- st.markdown("**Data:**")
465
- st.write(f"- Reference examples: {len(st.session_state.reference_data)}")
466
- st.write(f"- Labeled samples: {len(st.session_state.labeled_data)}")
467
-
468
- def show_classification_page():
469
- st.markdown('<div class="section-header">Classification & Optimization</div>', unsafe_allow_html=True)
470
-
471
- # Check if models are loaded
472
- if st.session_state.classifier is None:
473
- st.warning("Please configure and initialize models first.")
474
- return
475
-
476
- # Run classification
477
- if st.button("Run Classification", type="primary"):
478
- with st.spinner("Running classification and optimization..."):
479
- try:
480
- # Save labeled data to temporary file
481
- with tempfile.NamedTemporaryFile(mode='w', suffix='.csv', delete=False) as tmp_labeled:
482
- tmp_labeled_path = tmp_labeled.name
483
- st.session_state.labeled_data.to_csv(tmp_labeled_path, index=False)
484
-
485
- try:
486
- # Run optimization if enabled
487
- if st.session_state.config['optimize_threshold']:
488
- optimization_results = st.session_state.classifier.evaluate_classification(
489
- labeled_path=tmp_labeled_path,
490
- reference_data=st.session_state.reference_vectors,
491
- sentence_column='sentence',
492
- label_column='label',
493
- optimize_threshold=True,
494
- start=st.session_state.config['start_threshold'],
495
- end=st.session_state.config['end_threshold'],
496
- step=st.session_state.config['step_size']
497
- )
498
-
499
- st.session_state.optimization_results = optimization_results
500
- optimal_threshold = optimization_results["optimal_threshold"]
501
-
502
- # Update classifier with optimal threshold
503
- st.session_state.classifier.matcher = SemanticMatcher(
504
- threshold=optimal_threshold,
505
- verbose=False
506
- )
507
-
508
- st.success(f"βœ… Optimization completed! Optimal threshold: {optimal_threshold:.4f}")
509
-
510
- else:
511
- optimal_threshold = st.session_state.config['initial_threshold']
512
-
513
- # Run evaluation
514
- embedding_model = st.session_state.classifier.embedding_model
515
- data_loader = DataLoader(verbose=False)
516
- full_df = data_loader.load_labeled_data(tmp_labeled_path, label_column='label')
517
-
518
- # Generate embeddings
519
- full_embeddings = embedding_model.embed_dataframe(full_df, text_column='sentence')
520
-
521
- # Classify
522
- match_results = st.session_state.classifier.matcher.match(
523
- full_embeddings,
524
- st.session_state.reference_vectors
525
- )
526
- predicted_labels = match_results["predicted_class"].tolist()
527
- true_labels = full_df['label'].tolist()
528
-
529
- # Evaluate
530
- evaluator = Evaluator(verbose=False)
531
- eval_results = evaluator.evaluate(
532
- true_labels=true_labels,
533
- predicted_labels=predicted_labels,
534
- class_names=list(set(true_labels) | set(predicted_labels))
535
- )
536
-
537
- # Bootstrap evaluation
538
- bootstrap_results = evaluator.bootstrap_evaluate(
539
- true_labels=true_labels,
540
- predicted_labels=predicted_labels,
541
- n_iterations=100
542
- )
543
-
544
- st.session_state.evaluation_results = eval_results
545
- st.session_state.bootstrap_results = bootstrap_results
546
- st.session_state.predictions = {
547
- 'true_labels': true_labels,
548
- 'predicted_labels': predicted_labels,
549
- 'match_results': match_results,
550
- 'full_df': full_df
551
- }
552
-
553
- finally:
554
- # Ensure temporary file is deleted
555
- try:
556
- os.unlink(tmp_labeled_path)
557
- except (OSError, PermissionError):
558
- pass # File might already be deleted or locked
559
-
560
- st.success("βœ… Classification completed successfully!")
561
-
562
- except Exception as e:
563
- st.error(f"Error during classification: {str(e)}")
564
-
565
- # Show optimization results if available
566
- if st.session_state.optimization_results is not None:
567
- st.markdown('<div class="section-header">Optimization Results</div>', unsafe_allow_html=True)
568
-
569
- results = st.session_state.optimization_results
570
-
571
- col1, col2, col3, col4 = st.columns(4)
572
-
573
- with col1:
574
- st.metric(
575
- "Optimal Threshold",
576
- f"{results['optimal_threshold']:.4f}"
577
- )
578
-
579
- with col2:
580
- st.metric(
581
- "Accuracy",
582
- f"{results['optimal_metrics']['accuracy']:.4f}"
583
- )
584
-
585
- with col3:
586
- st.metric(
587
- "F1 Score",
588
- f"{results['optimal_metrics']['f1_macro']:.4f}"
589
- )
590
-
591
- with col4:
592
- st.metric(
593
- "Precision",
594
- f"{results['optimal_metrics']['precision_macro']:.4f}"
595
- )
596
-
597
- # Plot optimization curve
598
- st.markdown("### Optimization Curve")
599
-
600
- opt_results = results["results_by_threshold"]
601
-
602
- fig = make_subplots(
603
- rows=2, cols=2,
604
- subplot_titles=('Accuracy', 'F1 Score', 'Precision', 'Recall'),
605
- vertical_spacing=0.1
606
- )
607
-
608
- thresholds = opt_results["thresholds"]
609
-
610
- # Add traces
611
- fig.add_trace(
612
- go.Scatter(x=thresholds, y=opt_results["accuracy"], name="Accuracy"),
613
- row=1, col=1
614
- )
615
- fig.add_trace(
616
- go.Scatter(x=thresholds, y=opt_results["f1_macro"], name="F1 Score"),
617
- row=1, col=2
618
- )
619
- fig.add_trace(
620
- go.Scatter(x=thresholds, y=opt_results["precision_macro"], name="Precision"),
621
- row=2, col=1
622
- )
623
- fig.add_trace(
624
- go.Scatter(x=thresholds, y=opt_results["recall_macro"], name="Recall"),
625
- row=2, col=2
626
- )
627
-
628
- # Add optimal threshold line to each subplot using shapes
629
- optimal_thresh = results['optimal_threshold']
630
-
631
- # Add vertical line as shapes to each subplot
632
- shapes = []
633
- for row in range(1, 3):
634
- for col in range(1, 3):
635
- # Calculate the subplot domain
636
- xaxis = f'x{(row-1)*2 + col}' if (row-1)*2 + col > 1 else 'x'
637
- shapes.append(
638
- dict(
639
- type="line",
640
- x0=optimal_thresh, x1=optimal_thresh,
641
- y0=0, y1=1,
642
- yref=f"y{(row-1)*2 + col} domain" if (row-1)*2 + col > 1 else "y domain",
643
- xref=xaxis,
644
- line=dict(color="red", width=2, dash="dash")
645
- )
646
- )
647
-
648
- fig.update_layout(shapes=shapes)
649
-
650
- fig.update_layout(
651
- title="Threshold Optimization Results",
652
- showlegend=False,
653
- height=600
654
- )
655
-
656
- st.plotly_chart(fig, use_container_width=True)
657
-
658
- def show_results_page():
659
- st.markdown('<div class="section-header">Results & Evaluation</div>', unsafe_allow_html=True)
660
-
661
- # Check if evaluation results are available
662
- if st.session_state.evaluation_results is None:
663
- st.warning("Please run classification first to see results.")
664
- return
665
-
666
- eval_results = st.session_state.evaluation_results
667
-
668
- # Performance metrics
669
- st.markdown("### Performance Metrics")
670
-
671
- col1, col2, col3, col4 = st.columns(4)
672
-
673
- with col1:
674
- st.metric(
675
- "Overall Accuracy",
676
- f"{eval_results['accuracy']:.4f}"
677
- )
678
-
679
- with col2:
680
- st.metric(
681
- "Macro F1 Score",
682
- f"{eval_results['f1_macro']:.4f}"
683
- )
684
-
685
- with col3:
686
- st.metric(
687
- "Macro Precision",
688
- f"{eval_results['precision_macro']:.4f}"
689
- )
690
-
691
- with col4:
692
- st.metric(
693
- "Macro Recall",
694
- f"{eval_results['recall_macro']:.4f}"
695
- )
696
-
697
- # Class-wise metrics
698
- st.markdown("### Class-wise Performance")
699
-
700
- class_metrics_df = pd.DataFrame({
701
- 'Class': list(eval_results['class_metrics']['precision'].keys()),
702
- 'Precision': list(eval_results['class_metrics']['precision'].values()),
703
- 'Recall': list(eval_results['class_metrics']['recall'].values()),
704
- 'F1-Score': list(eval_results['class_metrics']['f1'].values()),
705
- 'Support': list(eval_results['class_metrics']['support'].values())
706
- })
707
-
708
- st.dataframe(class_metrics_df, use_container_width=True)
709
-
710
- # Confusion Matrix
711
- st.markdown("### Confusion Matrix")
712
-
713
- cm = eval_results['confusion_matrix']
714
- class_names = eval_results['confusion_matrix_labels']
715
-
716
- fig = px.imshow(
717
- cm,
718
- labels=dict(x="Predicted", y="True", color="Count"),
719
- x=class_names,
720
- y=class_names,
721
- color_continuous_scale='Blues',
722
- text_auto=True,
723
- title="Confusion Matrix"
724
- )
725
-
726
- fig.update_layout(
727
- width=600,
728
- height=600
729
- )
730
-
731
- st.plotly_chart(fig, use_container_width=True)
732
-
733
- # Bootstrap Results
734
- if st.session_state.bootstrap_results is not None:
735
- st.markdown("### Bootstrap Confidence Intervals")
736
-
737
- bootstrap_results = st.session_state.bootstrap_results
738
-
739
- # Debug: show available keys
740
- if 'confidence_intervals' in bootstrap_results:
741
- metrics = ['accuracy', 'precision_macro', 'recall_macro', 'f1_macro']
742
-
743
- for metric in metrics:
744
- if metric in bootstrap_results['confidence_intervals']:
745
- ci_data = bootstrap_results['confidence_intervals'][metric]
746
- st.markdown(f"**{metric.replace('_', ' ').title()}:**")
747
-
748
- col1, col2, col3 = st.columns(3)
749
-
750
- # Check available confidence levels
751
- available_levels = list(ci_data.keys())
752
-
753
- with col1:
754
- if '0.95' in ci_data:
755
- ci_95 = ci_data['0.95']
756
- if isinstance(ci_95, dict):
757
- st.write(f"95% CI: [{ci_95['lower']:.4f}, {ci_95['upper']:.4f}]")
758
- elif isinstance(ci_95, (list, tuple)) and len(ci_95) >= 2:
759
- st.write(f"95% CI: [{ci_95[0]:.4f}, {ci_95[1]:.4f}]")
760
- else:
761
- st.write("95% CI: Format not recognized")
762
- elif 0.95 in ci_data:
763
- ci_95 = ci_data[0.95]
764
- if isinstance(ci_95, dict):
765
- st.write(f"95% CI: [{ci_95['lower']:.4f}, {ci_95['upper']:.4f}]")
766
- elif isinstance(ci_95, (list, tuple)) and len(ci_95) >= 2:
767
- st.write(f"95% CI: [{ci_95[0]:.4f}, {ci_95[1]:.4f}]")
768
- else:
769
- st.write("95% CI: Format not recognized")
770
- else:
771
- st.write("95% CI: Not available")
772
-
773
- with col2:
774
- if '0.99' in ci_data:
775
- ci_99 = ci_data['0.99']
776
- if isinstance(ci_99, dict):
777
- st.write(f"99% CI: [{ci_99['lower']:.4f}, {ci_99['upper']:.4f}]")
778
- elif isinstance(ci_99, (list, tuple)) and len(ci_99) >= 2:
779
- st.write(f"99% CI: [{ci_99[0]:.4f}, {ci_99[1]:.4f}]")
780
- else:
781
- st.write("99% CI: Format not recognized")
782
- elif 0.99 in ci_data:
783
- ci_99 = ci_data[0.99]
784
- if isinstance(ci_99, dict):
785
- st.write(f"99% CI: [{ci_99['lower']:.4f}, {ci_99['upper']:.4f}]")
786
- elif isinstance(ci_99, (list, tuple)) and len(ci_99) >= 2:
787
- st.write(f"99% CI: [{ci_99[0]:.4f}, {ci_99[1]:.4f}]")
788
- else:
789
- st.write("99% CI: Format not recognized")
790
- else:
791
- st.write("99% CI: Not available")
792
-
793
- with col3:
794
- if 'point_estimates' in bootstrap_results and metric in bootstrap_results['point_estimates']:
795
- st.write(f"Point Estimate: {bootstrap_results['point_estimates'][metric]:.4f}")
796
- else:
797
- st.write("Point Estimate: Not available")
798
- else:
799
- st.info("Bootstrap confidence intervals not available.")
800
-
801
- # Bootstrap Distribution Plot
802
- st.markdown("### Bootstrap Distributions")
803
-
804
- if 'bootstrap_distribution' in bootstrap_results:
805
- fig = make_subplots(
806
- rows=2, cols=2,
807
- subplot_titles=('Accuracy', 'F1 Score', 'Precision', 'Recall')
808
- )
809
-
810
- distributions = bootstrap_results['bootstrap_distribution']
811
-
812
- if 'accuracy' in distributions:
813
- fig.add_trace(
814
- go.Histogram(x=distributions['accuracy'], name="Accuracy", nbinsx=30),
815
- row=1, col=1
816
- )
817
- if 'f1_macro' in distributions:
818
- fig.add_trace(
819
- go.Histogram(x=distributions['f1_macro'], name="F1 Score", nbinsx=30),
820
- row=1, col=2
821
- )
822
- if 'precision_macro' in distributions:
823
- fig.add_trace(
824
- go.Histogram(x=distributions['precision_macro'], name="Precision", nbinsx=30),
825
- row=2, col=1
826
- )
827
- if 'recall_macro' in distributions:
828
- fig.add_trace(
829
- go.Histogram(x=distributions['recall_macro'], name="Recall", nbinsx=30),
830
- row=2, col=2
831
- )
832
-
833
- fig.update_layout(
834
- title="Bootstrap Distributions",
835
- showlegend=False,
836
- height=600
837
- )
838
-
839
- st.plotly_chart(fig, use_container_width=True)
840
- else:
841
- st.info("Bootstrap distributions not available.")
842
-
843
- # Sample predictions
844
- if 'predictions' in st.session_state:
845
- st.markdown("### Sample Predictions")
846
-
847
- predictions = st.session_state.predictions
848
- sample_df = predictions['full_df'].copy()
849
- sample_df['predicted_class'] = predictions['predicted_labels']
850
- sample_df['true_class'] = predictions['true_labels']
851
- sample_df['similarity_score'] = predictions['match_results']['similarity_score']
852
- sample_df['correct'] = sample_df['predicted_class'] == sample_df['true_class']
853
-
854
- # Filter options
855
- col1, col2 = st.columns(2)
856
-
857
- with col1:
858
- show_correct = st.checkbox("Show correct predictions", value=True)
859
-
860
- with col2:
861
- show_incorrect = st.checkbox("Show incorrect predictions", value=True)
862
-
863
- # Filter data
864
- if show_correct and show_incorrect:
865
- filtered_df = sample_df
866
- elif show_correct:
867
- filtered_df = sample_df[sample_df['correct'] == True]
868
- elif show_incorrect:
869
- filtered_df = sample_df[sample_df['correct'] == False]
870
- else:
871
- filtered_df = pd.DataFrame()
872
-
873
- if not filtered_df.empty:
874
- # Sample random rows
875
- n_samples = min(20, len(filtered_df))
876
- sample_rows = filtered_df.sample(n=n_samples) if len(filtered_df) > n_samples else filtered_df
877
-
878
- display_df = sample_rows[['sentence', 'true_class', 'predicted_class', 'similarity_score', 'correct']].reset_index(drop=True)
879
-
880
- st.dataframe(display_df, use_container_width=True)
881
- else:
882
- st.info("No predictions to show with current filters.")
883
-
884
- # Download results
885
- st.markdown("### Download Results")
886
-
887
- col1, col2 = st.columns(2)
888
-
889
- with col1:
890
- # Download class-wise metrics
891
- csv_metrics = class_metrics_df.to_csv(index=False)
892
- st.download_button(
893
- label="Download Class Metrics",
894
- data=csv_metrics,
895
- file_name="class_metrics.csv",
896
- mime="text/csv"
897
- )
898
-
899
- with col2:
900
- # Download predictions
901
- if 'predictions' in st.session_state:
902
- predictions = st.session_state.predictions
903
- results_df = predictions['full_df'].copy()
904
- results_df['predicted_class'] = predictions['predicted_labels']
905
- results_df['similarity_score'] = predictions['match_results']['similarity_score']
906
-
907
- csv_results = results_df.to_csv(index=False)
908
- st.download_button(
909
- label="Download Predictions",
910
- data=csv_results,
911
- file_name="predictions.csv",
912
- mime="text/csv"
913
- )
914
-
915
- if __name__ == "__main__":
916
- main()
 
1
+ import streamlit as st
2
+ import pandas as pd
3
+ import numpy as np
4
+ import matplotlib.pyplot as plt
5
+ import seaborn as sns
6
+ import tempfile
7
+ import os
8
+ import sys
9
+ from io import StringIO
10
+ import plotly.express as px
11
+ import plotly.graph_objects as go
12
+ from plotly.subplots import make_subplots
13
+
14
+ # Add the parent directory to sys.path to import the module
15
+ sys.path.append(os.path.abspath(os.path.join(os.path.dirname(__file__), '..')))
16
+
17
+ from src.qualivec.data import DataLoader
18
+ from src.qualivec.embedding import EmbeddingModel
19
+ from src.qualivec.matching import SemanticMatcher
20
+ from src.qualivec.classification import Classifier
21
+ from src.qualivec.evaluation import Evaluator
22
+ from src.qualivec.optimization import ThresholdOptimizer
23
+
24
+ # Set page config
25
+ st.set_page_config(
26
+ page_title="QualiVec Demo",
27
+ page_icon="πŸ”",
28
+ layout="wide",
29
+ initial_sidebar_state="expanded"
30
+ )
31
+
32
+ # Custom CSS for better styling
33
+ st.markdown("""
34
+ <style>
35
+ .main-header {
36
+ font-size: 2.5rem;
37
+ font-weight: bold;
38
+ color: #2E4057;
39
+ text-align: center;
40
+ margin-bottom: 2rem;
41
+ }
42
+ .section-header {
43
+ font-size: 1.5rem;
44
+ font-weight: bold;
45
+ color: #048A81;
46
+ margin-top: 2rem;
47
+ margin-bottom: 1rem;
48
+ }
49
+ .metric-card {
50
+ background-color: #f0f2f6;
51
+ padding: 1rem;
52
+ border-radius: 0.5rem;
53
+ margin: 0.5rem 0;
54
+ }
55
+ .success-message {
56
+ background-color: #d4edda;
57
+ color: #155724;
58
+ padding: 1rem;
59
+ border-radius: 0.5rem;
60
+ margin: 1rem 0;
61
+ }
62
+ .warning-message {
63
+ background-color: #fff3cd;
64
+ color: #856404;
65
+ padding: 1rem;
66
+ border-radius: 0.5rem;
67
+ margin: 1rem 0;
68
+ }
69
+ </style>
70
+ """, unsafe_allow_html=True)
71
+
72
+ def main():
73
+ st.markdown('<div class="main-header">πŸ” QualiVec Demo</div>', unsafe_allow_html=True)
74
+ st.markdown("""
75
+ <div style="text-align: center; margin-bottom: 2rem;">
76
+ <p style="font-size: 1.2rem; color: #666;">
77
+ Qualitative Content Analysis with LLM Embeddings
78
+ </p>
79
+ </div>
80
+ """, unsafe_allow_html=True)
81
+
82
+ # Sidebar for navigation
83
+ st.sidebar.title("Navigation")
84
+ page = st.sidebar.selectbox(
85
+ "Choose a page",
86
+ ["🏠 Home", "πŸ“Š Data Upload", "πŸ”§ Configuration", "🎯 Classification", "πŸ“ˆ Results"]
87
+ )
88
+
89
+ # Initialize session state
90
+ if 'classifier' not in st.session_state:
91
+ st.session_state.classifier = None
92
+ if 'reference_data' not in st.session_state:
93
+ st.session_state.reference_data = None
94
+ if 'labeled_data' not in st.session_state:
95
+ st.session_state.labeled_data = None
96
+ if 'optimization_results' not in st.session_state:
97
+ st.session_state.optimization_results = None
98
+ if 'evaluation_results' not in st.session_state:
99
+ st.session_state.evaluation_results = None
100
+
101
+ # Route to different pages
102
+ if page == "🏠 Home":
103
+ show_home_page()
104
+ elif page == "πŸ“Š Data Upload":
105
+ show_data_upload_page()
106
+ elif page == "πŸ”§ Configuration":
107
+ show_configuration_page()
108
+ elif page == "🎯 Classification":
109
+ show_classification_page()
110
+ elif page == "πŸ“ˆ Results":
111
+ show_results_page()
112
+
113
+ def show_home_page():
114
+ st.markdown('<div class="section-header">Welcome to QualiVec</div>', unsafe_allow_html=True)
115
+
116
+ col1, col2, col3 = st.columns([1, 2, 1])
117
+
118
+ with col2:
119
+ st.markdown("""
120
+ ### What is QualiVec?
121
+
122
+ QualiVec is a Python library that uses Large Language Model (LLM) embeddings for qualitative content analysis. It helps researchers and analysts classify text data by comparing it against reference examples.
123
+
124
+ ### Key Features:
125
+ - **Semantic Matching**: Uses advanced embedding models to find semantic similarity
126
+ - **Threshold Optimization**: Automatically finds the best similarity threshold
127
+ - **Comprehensive Evaluation**: Provides detailed metrics and visualizations
128
+ - **Bootstrap Analysis**: Confidence intervals for robust evaluation
129
+
130
+ ### How It Works:
131
+ 1. **Upload Data**: Provide reference examples and data to classify
132
+ 2. **Configure**: Set up embedding models and parameters
133
+ 3. **Optimize**: Find the best threshold for classification
134
+ 4. **Classify**: Apply the model to your data
135
+ 5. **Evaluate**: Get detailed performance metrics
136
+
137
+ ### Getting Started:
138
+ Use the sidebar to navigate through the demo. Start with **Data Upload** to begin your analysis.
139
+ """)
140
+
141
+ # Add sample data info
142
+ st.markdown('<div class="section-header">Sample Data Format</div>', unsafe_allow_html=True)
143
+
144
+ col1, col2 = st.columns(2)
145
+
146
+ with col1:
147
+ st.markdown("**Reference Data Format:**")
148
+ sample_ref = pd.DataFrame({
149
+ 'tag': ['Positive', 'Negative', 'Neutral'],
150
+ 'sentence': ['This is great!', 'This is terrible', 'This is okay']
151
+ })
152
+ st.dataframe(sample_ref, use_container_width=True)
153
+
154
+ with col2:
155
+ st.markdown("**Labeled Data Format:**")
156
+ sample_labeled = pd.DataFrame({
157
+ 'sentence': ['I love this product', 'Not very good', 'Average quality'],
158
+ 'Label': ['Positive', 'Negative', 'Neutral']
159
+ })
160
+ st.dataframe(sample_labeled, use_container_width=True)
161
+
162
+ def show_data_upload_page():
163
+ st.markdown('<div class="section-header">Data Upload</div>', unsafe_allow_html=True)
164
+
165
+ col1, col2 = st.columns(2)
166
+
167
+ with col1:
168
+ st.markdown("### Reference Data")
169
+ st.markdown("Upload a CSV file containing reference examples with columns: `tag` (class) and `sentence` (example text)")
170
+
171
+ reference_file = st.file_uploader(
172
+ "Choose reference data file",
173
+ type=['csv'],
174
+ key='reference_file'
175
+ )
176
+
177
+ if reference_file is not None:
178
+ try:
179
+ reference_df = pd.read_csv(reference_file)
180
+ st.success("Reference data loaded successfully!")
181
+ st.dataframe(reference_df.head(), use_container_width=True)
182
+
183
+ # Validate columns
184
+ required_cols = ['tag', 'sentence']
185
+ missing_cols = [col for col in required_cols if col not in reference_df.columns]
186
+
187
+ if missing_cols:
188
+ st.error(f"Missing required columns: {missing_cols}")
189
+ else:
190
+ # Prepare reference data
191
+ reference_df = reference_df.rename(columns={
192
+ 'tag': 'class',
193
+ 'sentence': 'matching_node'
194
+ })
195
+ st.session_state.reference_data = reference_df
196
+
197
+ # Show statistics
198
+ st.markdown("**Data Statistics:**")
199
+ st.write(f"- Total examples: {len(reference_df)}")
200
+ st.write(f"- Unique classes: {reference_df['class'].nunique()}")
201
+ st.write(f"- Class distribution:")
202
+ st.write(reference_df['class'].value_counts())
203
+
204
+ except Exception as e:
205
+ st.error(f"Error loading reference data: {str(e)}")
206
+
207
+ with col2:
208
+ st.markdown("### Labeled Data")
209
+ st.markdown("Upload a CSV file containing data to classify with columns: `sentence` (text) and `Label` (true class)")
210
+
211
+ labeled_file = st.file_uploader(
212
+ "Choose labeled data file",
213
+ type=['csv'],
214
+ key='labeled_file'
215
+ )
216
+
217
+ if labeled_file is not None:
218
+ try:
219
+ labeled_df = pd.read_csv(labeled_file)
220
+ st.success("Labeled data loaded successfully!")
221
+ st.dataframe(labeled_df.head(), use_container_width=True)
222
+
223
+ # Validate columns
224
+ required_cols = ['sentence', 'Label']
225
+ missing_cols = [col for col in required_cols if col not in labeled_df.columns]
226
+
227
+ if missing_cols:
228
+ st.error(f"Missing required columns: {missing_cols}")
229
+ else:
230
+ # Prepare labeled data
231
+ labeled_df = labeled_df.rename(columns={'Label': 'label'})
232
+ labeled_df['label'] = labeled_df['label'].replace('0', 'Other')
233
+ st.session_state.labeled_data = labeled_df
234
+
235
+ # Show statistics
236
+ st.markdown("**Data Statistics:**")
237
+ st.write(f"- Total samples: {len(labeled_df)}")
238
+ st.write(f"- Unique labels: {labeled_df['label'].nunique()}")
239
+ st.write(f"- Label distribution:")
240
+ st.write(labeled_df['label'].value_counts())
241
+
242
+ except Exception as e:
243
+ st.error(f"Error loading labeled data: {str(e)}")
244
+
245
+ # Show data compatibility check
246
+ if st.session_state.reference_data is not None and st.session_state.labeled_data is not None:
247
+ st.markdown('<div class="section-header">Data Compatibility Check</div>', unsafe_allow_html=True)
248
+
249
+ ref_classes = set(st.session_state.reference_data['class'].unique())
250
+ labeled_classes = set(st.session_state.labeled_data['label'].unique())
251
+
252
+ # Check for unknown classes
253
+ unknown_classes = labeled_classes - ref_classes
254
+
255
+ if unknown_classes:
256
+ st.warning(f"Warning: Labels in labeled data not found in reference data: {unknown_classes}")
257
+ else:
258
+ st.success("βœ… Data compatibility check passed!")
259
+
260
+ # Show class overlap
261
+ st.markdown("**Class Overlap Analysis:**")
262
+ col1, col2, col3 = st.columns(3)
263
+
264
+ with col1:
265
+ st.metric("Reference Classes", len(ref_classes))
266
+ with col2:
267
+ st.metric("Labeled Classes", len(labeled_classes))
268
+ with col3:
269
+ st.metric("Common Classes", len(ref_classes.intersection(labeled_classes)))
270
+
271
+ def show_configuration_page():
272
+ st.markdown('<div class="section-header">Model Configuration</div>', unsafe_allow_html=True)
273
+
274
+ # Check if data is loaded
275
+ if st.session_state.reference_data is None or st.session_state.labeled_data is None:
276
+ st.warning("Please upload both reference and labeled data first.")
277
+ return
278
+
279
+ col1, col2 = st.columns(2)
280
+
281
+ with col1:
282
+ st.markdown("### Embedding Model")
283
+
284
+ # Model type selection
285
+ model_type = st.selectbox(
286
+ "Choose model type",
287
+ ["HuggingFace", "Gemini"],
288
+ help="Select the type of embedding model to use"
289
+ )
290
+
291
+ # Model selection based on type
292
+ if model_type == "HuggingFace":
293
+ model_options = [
294
+ "sentence-transformers/all-MiniLM-L6-v2",
295
+ "sentence-transformers/all-mpnet-base-v2",
296
+ "sentence-transformers/distilbert-base-nli-mean-tokens"
297
+ ]
298
+
299
+ selected_model = st.selectbox(
300
+ "Choose HuggingFace model",
301
+ model_options,
302
+ help="Select the pre-trained HuggingFace model for generating embeddings"
303
+ )
304
+ else: # Gemini
305
+ gemini_models = [
306
+ "gemini-embedding-001",
307
+ "text-embedding-004"
308
+ ]
309
+
310
+ selected_model = st.selectbox(
311
+ "Choose Gemini model",
312
+ gemini_models,
313
+ help="Select the Gemini embedding model for generating embeddings"
314
+ )
315
+
316
+ # Calculate total texts to process
317
+ total_texts = 0
318
+ if st.session_state.reference_data is not None:
319
+ total_texts += len(st.session_state.reference_data)
320
+ if st.session_state.labeled_data is not None:
321
+ total_texts += len(st.session_state.labeled_data)
322
+
323
+ st.warning(
+ f"⚠️ **Gemini API Rate Limits (Free Tier)**\n\n"
+ f"- 1,500 requests per day\n"
+ f"- Each batch of 100 texts = 1 request\n"
+ f"- Your current dataset: ~{total_texts} texts\n"
+ f"- Estimated requests needed: ~{(total_texts // 100) + 1}\n\n"
+ f"If you exceed the quota, consider:\n"
+ f"1. Using a smaller dataset\n"
+ f"2. Switching to HuggingFace models (no limits)\n"
+ f"3. Upgrading to a paid API plan"
+ )
334
+
335
+ st.info("πŸ’‘ Note: Using Gemini embeddings requires GOOGLE_API_KEY environment variable to be set.")
336
+
337
+ st.markdown("### Initial Threshold")
338
+ initial_threshold = st.slider(
339
+ "Initial similarity threshold",
340
+ min_value=0.0,
341
+ max_value=1.0,
342
+ value=0.7,
343
+ step=0.05,
344
+ help="Cosine similarity threshold for classification"
345
+ )
346
+
347
+ with col2:
348
+ st.markdown("### Optimization Parameters")
349
+
350
+ optimize_threshold = st.checkbox(
351
+ "Enable threshold optimization",
352
+ value=True,
353
+ help="Automatically find the best threshold"
354
+ )
355
+
356
+ if optimize_threshold:
357
+ col2_1, col2_2 = st.columns(2)
358
+
359
+ with col2_1:
360
+ start_threshold = st.slider(
361
+ "Start threshold",
362
+ min_value=0.0,
363
+ max_value=1.0,
364
+ value=0.5,
365
+ step=0.05
366
+ )
367
+
368
+ end_threshold = st.slider(
369
+ "End threshold",
370
+ min_value=0.0,
371
+ max_value=1.0,
372
+ value=0.9,
373
+ step=0.05
374
+ )
375
+
376
+ with col2_2:
377
+ step_size = st.slider(
378
+ "Step size",
379
+ min_value=0.005,
380
+ max_value=0.05,
381
+ value=0.01,
382
+ step=0.005
383
+ )
384
+
385
+ optimization_metric = st.selectbox(
386
+ "Optimization metric",
387
+ ["f1_macro", "accuracy", "precision_macro", "recall_macro"]
388
+ )
389
+
390
+ # Load models button
391
+ if st.button("Initialize Models", type="primary"):
392
+ with st.spinner("Loading models... This may take a few minutes."):
393
+ try:
394
+ # Initialize classifier
395
+ classifier = Classifier(verbose=False)
396
+
397
+ # Determine model type parameter
398
+ model_type_param = "gemini" if model_type == "Gemini" else "huggingface"
399
+
400
+ classifier.load_models(
401
+ model_name=selected_model,
402
+ model_type=model_type_param,
403
+ threshold=initial_threshold
404
+ )
405
+
406
+ # Prepare reference vectors
407
+ with tempfile.NamedTemporaryFile(mode='w', suffix='.csv', delete=False) as tmp_ref:
408
+ tmp_ref_path = tmp_ref.name
409
+ st.session_state.reference_data.to_csv(tmp_ref_path, index=False)
410
+
411
+ try:
412
+ reference_data = classifier.prepare_reference_vectors(
413
+ reference_path=tmp_ref_path,
414
+ class_column='class',
415
+ node_column='matching_node'
416
+ )
417
+ finally:
418
+ # Ensure file is deleted even if an error occurs
419
+ try:
420
+ os.unlink(tmp_ref_path)
421
+ except (OSError, PermissionError):
422
+ pass # File might already be deleted or locked
423
+
424
+ st.session_state.classifier = classifier
425
+ st.session_state.reference_vectors = reference_data
426
+ st.session_state.config = {
427
+ 'model_type': model_type,
428
+ 'model_name': selected_model,
429
+ 'initial_threshold': initial_threshold,
430
+ 'optimize_threshold': optimize_threshold,
431
+ 'start_threshold': start_threshold if optimize_threshold else None,
432
+ 'end_threshold': end_threshold if optimize_threshold else None,
433
+ 'step_size': step_size if optimize_threshold else None,
434
+ 'optimization_metric': optimization_metric if optimize_threshold else None
435
+ }
436
+
437
+ st.success("βœ… Models initialized successfully!")
438
+
439
+ except Exception as e:
440
+ st.error(f"Error initializing models: {str(e)}")
441
+
442
+ # Show current configuration
443
+ if st.session_state.classifier is not None:
444
+ st.markdown('<div class="section-header">Current Configuration</div>', unsafe_allow_html=True)
445
+
446
+ config = st.session_state.config
447
+
448
+ col1, col2, col3 = st.columns(3)
449
+
450
+ with col1:
451
+ st.markdown("**Model Settings:**")
452
+ st.write(f"- Model type: {config['model_type']}")
453
+ st.write(f"- Model: {config['model_name']}")
454
+ st.write(f"- Initial threshold: {config['initial_threshold']}")
455
+
456
+ with col2:
457
+ st.markdown("**Optimization:**")
458
+ st.write(f"- Enabled: {config['optimize_threshold']}")
459
+ if config['optimize_threshold']:
460
+ st.write(f"- Range: {config['start_threshold']:.2f} - {config['end_threshold']:.2f}")
461
+ st.write(f"- Step: {config['step_size']:.3f}")
462
+
463
+ with col3:
464
+ st.markdown("**Data:**")
465
+ st.write(f"- Reference examples: {len(st.session_state.reference_data)}")
466
+ st.write(f"- Labeled samples: {len(st.session_state.labeled_data)}")
467
+
468
+ def show_classification_page():
469
+ st.markdown('<div class="section-header">Classification & Optimization</div>', unsafe_allow_html=True)
470
+
471
+ # Check if models are loaded
472
+ if st.session_state.classifier is None:
473
+ st.warning("Please configure and initialize models first.")
474
+ return
475
+
476
+ # Run classification
477
+ if st.button("Run Classification", type="primary"):
478
+ with st.spinner("Running classification and optimization..."):
479
+ try:
480
+ # Save labeled data to temporary file
481
+ with tempfile.NamedTemporaryFile(mode='w', suffix='.csv', delete=False) as tmp_labeled:
482
+ tmp_labeled_path = tmp_labeled.name
483
+ st.session_state.labeled_data.to_csv(tmp_labeled_path, index=False)
484
+
485
+ try:
486
+ # Run optimization if enabled
487
+ if st.session_state.config['optimize_threshold']:
488
+ optimization_results = st.session_state.classifier.evaluate_classification(
489
+ labeled_path=tmp_labeled_path,
490
+ reference_data=st.session_state.reference_vectors,
491
+ sentence_column='sentence',
492
+ label_column='label',
493
+ optimize_threshold=True,
494
+ start=st.session_state.config['start_threshold'],
495
+ end=st.session_state.config['end_threshold'],
496
+ step=st.session_state.config['step_size']
497
+ )
498
+
499
+ st.session_state.optimization_results = optimization_results
500
+ optimal_threshold = optimization_results["optimal_threshold"]
501
+
502
+ # Update classifier with optimal threshold
503
+ st.session_state.classifier.matcher = SemanticMatcher(
504
+ threshold=optimal_threshold,
505
+ verbose=False
506
+ )
507
+
508
+ st.success(f"βœ… Optimization completed! Optimal threshold: {optimal_threshold:.4f}")
509
+
510
+ else:
511
+ optimal_threshold = st.session_state.config['initial_threshold']
512
+
513
+ # Run evaluation
514
+ embedding_model = st.session_state.classifier.embedding_model
515
+ data_loader = DataLoader(verbose=False)
516
+ full_df = data_loader.load_labeled_data(tmp_labeled_path, label_column='label')
517
+
518
+ # Generate embeddings
519
+ full_embeddings = embedding_model.embed_dataframe(full_df, text_column='sentence')
520
+
521
+ # Classify
522
+ match_results = st.session_state.classifier.matcher.match(
523
+ full_embeddings,
524
+ st.session_state.reference_vectors
525
+ )
526
+ predicted_labels = match_results["predicted_class"].tolist()
527
+ true_labels = full_df['label'].tolist()
528
+
529
+ # Evaluate
530
+ evaluator = Evaluator(verbose=False)
531
+ eval_results = evaluator.evaluate(
532
+ true_labels=true_labels,
533
+ predicted_labels=predicted_labels,
534
+ class_names=list(set(true_labels) | set(predicted_labels))
535
+ )
536
+
537
+ # Bootstrap evaluation
538
+ bootstrap_results = evaluator.bootstrap_evaluate(
539
+ true_labels=true_labels,
540
+ predicted_labels=predicted_labels,
541
+ n_iterations=100
542
+ )
543
+
544
+ st.session_state.evaluation_results = eval_results
545
+ st.session_state.bootstrap_results = bootstrap_results
546
+ st.session_state.predictions = {
547
+ 'true_labels': true_labels,
548
+ 'predicted_labels': predicted_labels,
549
+ 'match_results': match_results,
550
+ 'full_df': full_df
551
+ }
552
+
553
+ finally:
554
+ # Ensure temporary file is deleted
555
+ try:
556
+ os.unlink(tmp_labeled_path)
557
+ except (OSError, PermissionError):
558
+ pass # File might already be deleted or locked
559
+
560
+ st.success("βœ… Classification completed successfully!")
561
+
562
+ except Exception as e:
563
+ st.error(f"Error during classification: {str(e)}")
564
+
565
+ # Show optimization results if available
566
+ if st.session_state.optimization_results is not None:
567
+ st.markdown('<div class="section-header">Optimization Results</div>', unsafe_allow_html=True)
568
+
569
+ results = st.session_state.optimization_results
570
+
571
+ col1, col2, col3, col4 = st.columns(4)
572
+
573
+ with col1:
574
+ st.metric(
575
+ "Optimal Threshold",
576
+ f"{results['optimal_threshold']:.4f}"
577
+ )
578
+
579
+ with col2:
580
+ st.metric(
581
+ "Accuracy",
582
+ f"{results['optimal_metrics']['accuracy']:.4f}"
583
+ )
584
+
585
+ with col3:
586
+ st.metric(
587
+ "F1 Score",
588
+ f"{results['optimal_metrics']['f1_macro']:.4f}"
589
+ )
590
+
591
+ with col4:
592
+ st.metric(
593
+ "Precision",
594
+ f"{results['optimal_metrics']['precision_macro']:.4f}"
595
+ )
596
+
597
+ # Plot optimization curve
598
+ st.markdown("### Optimization Curve")
599
+
600
+ opt_results = results["results_by_threshold"]
601
+
602
+ fig = make_subplots(
603
+ rows=2, cols=2,
604
+ subplot_titles=('Accuracy', 'F1 Score', 'Precision', 'Recall'),
605
+ vertical_spacing=0.1
606
+ )
607
+
608
+ thresholds = opt_results["thresholds"]
609
+
610
+ # Add traces
611
+ fig.add_trace(
612
+ go.Scatter(x=thresholds, y=opt_results["accuracy"], name="Accuracy"),
613
+ row=1, col=1
614
+ )
615
+ fig.add_trace(
616
+ go.Scatter(x=thresholds, y=opt_results["f1_macro"], name="F1 Score"),
617
+ row=1, col=2
618
+ )
619
+ fig.add_trace(
620
+ go.Scatter(x=thresholds, y=opt_results["precision_macro"], name="Precision"),
621
+ row=2, col=1
622
+ )
623
+ fig.add_trace(
624
+ go.Scatter(x=thresholds, y=opt_results["recall_macro"], name="Recall"),
625
+ row=2, col=2
626
+ )
627
+
628
+ # Add optimal threshold line to each subplot using shapes
629
+ optimal_thresh = results['optimal_threshold']
630
+
631
+ # Add vertical line as shapes to each subplot
632
+ shapes = []
633
+ for row in range(1, 3):
634
+ for col in range(1, 3):
635
+ # Calculate the subplot domain
636
+ xaxis = f'x{(row-1)*2 + col}' if (row-1)*2 + col > 1 else 'x'
637
+ shapes.append(
638
+ dict(
639
+ type="line",
640
+ x0=optimal_thresh, x1=optimal_thresh,
641
+ y0=0, y1=1,
642
+ yref=f"y{(row-1)*2 + col} domain" if (row-1)*2 + col > 1 else "y domain",
643
+ xref=xaxis,
644
+ line=dict(color="red", width=2, dash="dash")
645
+ )
646
+ )
647
+
648
+ fig.update_layout(
+ title="Threshold Optimization Results",
+ showlegend=False,
+ height=600,
+ shapes=shapes
+ )
655
+
656
+ st.plotly_chart(fig, use_container_width=True)
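The threshold sweep that the classification step runs (via `evaluate_classification(..., optimize_threshold=True)`) boils down to trying each candidate threshold and keeping the one that scores best. A minimal, self-contained sketch of that idea in plain Python follows; the helper name and toy data are illustrative, not QualiVec internals:

```python
# Sketch of a threshold sweep: for each candidate threshold, keep a sample's
# best-matching class only if its similarity clears the threshold (else "Other"),
# then score accuracy and remember the best threshold seen.
def sweep_thresholds(scores, true_labels, predicted, start=0.5, end=0.9, step=0.05):
    """scores[i]: best similarity for sample i; predicted[i]: its best-matching class."""
    best = (None, -1.0)  # (threshold, accuracy)
    t = start
    while t <= end + 1e-9:
        labels = [p if s >= t else "Other" for s, p in zip(scores, predicted)]
        acc = sum(l == y for l, y in zip(labels, true_labels)) / len(true_labels)
        if acc > best[1]:
            best = (t, acc)
        t += step
    return best

scores = [0.92, 0.55, 0.71, 0.40]
predicted = ["Positive", "Positive", "Negative", "Negative"]
true_labels = ["Positive", "Other", "Negative", "Other"]
print(sweep_thresholds(scores, true_labels, predicted))  # best threshold ~0.6, accuracy 1.0
```

The real optimizer scores a configurable metric (e.g. `f1_macro`) at each step rather than raw accuracy, but the sweep structure is the same.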
657
+
658
+ def show_results_page():
659
+ st.markdown('<div class="section-header">Results & Evaluation</div>', unsafe_allow_html=True)
660
+
661
+ # Check if evaluation results are available
662
+ if st.session_state.evaluation_results is None:
663
+ st.warning("Please run classification first to see results.")
664
+ return
665
+
666
+ eval_results = st.session_state.evaluation_results
667
+
668
+ # Performance metrics
669
+ st.markdown("### Performance Metrics")
670
+
671
+ col1, col2, col3, col4 = st.columns(4)
672
+
673
+ with col1:
674
+ st.metric(
675
+ "Overall Accuracy",
676
+ f"{eval_results['accuracy']:.4f}"
677
+ )
678
+
679
+ with col2:
680
+ st.metric(
681
+ "Macro F1 Score",
682
+ f"{eval_results['f1_macro']:.4f}"
683
+ )
684
+
685
+ with col3:
686
+ st.metric(
687
+ "Macro Precision",
688
+ f"{eval_results['precision_macro']:.4f}"
689
+ )
690
+
691
+ with col4:
692
+ st.metric(
693
+ "Macro Recall",
694
+ f"{eval_results['recall_macro']:.4f}"
695
+ )
696
+
697
+ # Class-wise metrics
698
+ st.markdown("### Class-wise Performance")
699
+
700
+ class_metrics_df = pd.DataFrame({
701
+ 'Class': list(eval_results['class_metrics']['precision'].keys()),
702
+ 'Precision': list(eval_results['class_metrics']['precision'].values()),
703
+ 'Recall': list(eval_results['class_metrics']['recall'].values()),
704
+ 'F1-Score': list(eval_results['class_metrics']['f1'].values()),
705
+ 'Support': list(eval_results['class_metrics']['support'].values())
706
+ })
707
+
708
+ st.dataframe(class_metrics_df, use_container_width=True)
709
+
710
+ # Confusion Matrix
711
+ st.markdown("### Confusion Matrix")
712
+
713
+ cm = eval_results['confusion_matrix']
714
+ class_names = eval_results['confusion_matrix_labels']
715
+
716
+ fig = px.imshow(
717
+ cm,
718
+ labels=dict(x="Predicted", y="True", color="Count"),
719
+ x=class_names,
720
+ y=class_names,
721
+ color_continuous_scale='Blues',
722
+ text_auto=True,
723
+ title="Confusion Matrix"
724
+ )
725
+
726
+ fig.update_layout(
727
+ width=600,
728
+ height=600
729
+ )
730
+
731
+ st.plotly_chart(fig, use_container_width=True)
732
+
733
+ # Bootstrap Results
734
+ if st.session_state.bootstrap_results is not None:
735
+ st.markdown("### Bootstrap Confidence Intervals")
736
+
737
+ bootstrap_results = st.session_state.bootstrap_results
738
+
739
+ # Debug: show available keys
740
+ if 'confidence_intervals' in bootstrap_results:
741
+ metrics = ['accuracy', 'precision_macro', 'recall_macro', 'f1_macro']
742
+
743
+ for metric in metrics:
744
+ if metric in bootstrap_results['confidence_intervals']:
745
+ ci_data = bootstrap_results['confidence_intervals'][metric]
746
+ st.markdown(f"**{metric.replace('_', ' ').title()}:**")
747
+
748
+ col1, col2, col3 = st.columns(3)
749
+
750
+ # Confidence intervals may be keyed by "0.95"/"0.99" as strings or floats,
+ # and each interval may be a dict or a (lower, upper) sequence
+ def format_ci(ci_data, level):
+ ci = ci_data.get(str(level), ci_data.get(level))
+ if ci is None:
+ return "Not available"
+ if isinstance(ci, dict):
+ return f"[{ci['lower']:.4f}, {ci['upper']:.4f}]"
+ if isinstance(ci, (list, tuple)) and len(ci) >= 2:
+ return f"[{ci[0]:.4f}, {ci[1]:.4f}]"
+ return "Format not recognized"
+
+ with col1:
+ st.write(f"95% CI: {format_ci(ci_data, 0.95)}")
+
+ with col2:
+ st.write(f"99% CI: {format_ci(ci_data, 0.99)}")
792
+
793
+ with col3:
794
+ if 'point_estimates' in bootstrap_results and metric in bootstrap_results['point_estimates']:
795
+ st.write(f"Point Estimate: {bootstrap_results['point_estimates'][metric]:.4f}")
796
+ else:
797
+ st.write("Point Estimate: Not available")
798
+ else:
799
+ st.info("Bootstrap confidence intervals not available.")
800
+
801
+ # Bootstrap Distribution Plot
802
+ st.markdown("### Bootstrap Distributions")
803
+
804
+ if 'bootstrap_distribution' in bootstrap_results:
805
+ fig = make_subplots(
806
+ rows=2, cols=2,
807
+ subplot_titles=('Accuracy', 'F1 Score', 'Precision', 'Recall')
808
+ )
809
+
810
+ distributions = bootstrap_results['bootstrap_distribution']
811
+
812
+ if 'accuracy' in distributions:
813
+ fig.add_trace(
814
+ go.Histogram(x=distributions['accuracy'], name="Accuracy", nbinsx=30),
815
+ row=1, col=1
816
+ )
817
+ if 'f1_macro' in distributions:
818
+ fig.add_trace(
819
+ go.Histogram(x=distributions['f1_macro'], name="F1 Score", nbinsx=30),
820
+ row=1, col=2
821
+ )
822
+ if 'precision_macro' in distributions:
823
+ fig.add_trace(
824
+ go.Histogram(x=distributions['precision_macro'], name="Precision", nbinsx=30),
825
+ row=2, col=1
826
+ )
827
+ if 'recall_macro' in distributions:
828
+ fig.add_trace(
829
+ go.Histogram(x=distributions['recall_macro'], name="Recall", nbinsx=30),
830
+ row=2, col=2
831
+ )
832
+
833
+ fig.update_layout(
834
+ title="Bootstrap Distributions",
835
+ showlegend=False,
836
+ height=600
837
+ )
838
+
839
+ st.plotly_chart(fig, use_container_width=True)
840
+ else:
841
+ st.info("Bootstrap distributions not available.")
842
+
843
+ # Sample predictions
844
+ if 'predictions' in st.session_state:
845
+ st.markdown("### Sample Predictions")
846
+
847
+ predictions = st.session_state.predictions
848
+ sample_df = predictions['full_df'].copy()
849
+ sample_df['predicted_class'] = predictions['predicted_labels']
850
+ sample_df['true_class'] = predictions['true_labels']
851
+ sample_df['similarity_score'] = predictions['match_results']['similarity_score']
852
+ sample_df['correct'] = sample_df['predicted_class'] == sample_df['true_class']
853
+
854
+ # Filter options
855
+ col1, col2 = st.columns(2)
856
+
857
+ with col1:
858
+ show_correct = st.checkbox("Show correct predictions", value=True)
859
+
860
+ with col2:
861
+ show_incorrect = st.checkbox("Show incorrect predictions", value=True)
862
+
863
+ # Filter data
864
+ if show_correct and show_incorrect:
+ filtered_df = sample_df
+ elif show_correct:
+ filtered_df = sample_df[sample_df['correct']]
+ elif show_incorrect:
+ filtered_df = sample_df[~sample_df['correct']]
+ else:
+ filtered_df = pd.DataFrame()
872
+
873
+ if not filtered_df.empty:
874
+ # Sample random rows
875
+ n_samples = min(20, len(filtered_df))
876
+ sample_rows = filtered_df.sample(n=n_samples) if len(filtered_df) > n_samples else filtered_df
877
+
878
+ display_df = sample_rows[['sentence', 'true_class', 'predicted_class', 'similarity_score', 'correct']].reset_index(drop=True)
879
+
880
+ st.dataframe(display_df, use_container_width=True)
881
+ else:
882
+ st.info("No predictions to show with current filters.")
883
+
884
+ # Download results
885
+ st.markdown("### Download Results")
886
+
887
+ col1, col2 = st.columns(2)
888
+
889
+ with col1:
890
+ # Download class-wise metrics
891
+ csv_metrics = class_metrics_df.to_csv(index=False)
892
+ st.download_button(
893
+ label="Download Class Metrics",
894
+ data=csv_metrics,
895
+ file_name="class_metrics.csv",
896
+ mime="text/csv"
897
+ )
898
+
899
+ with col2:
900
+ # Download predictions
901
+ if 'predictions' in st.session_state:
902
+ predictions = st.session_state.predictions
903
+ results_df = predictions['full_df'].copy()
904
+ results_df['predicted_class'] = predictions['predicted_labels']
905
+ results_df['similarity_score'] = predictions['match_results']['similarity_score']
906
+
907
+ csv_results = results_df.to_csv(index=False)
908
+ st.download_button(
909
+ label="Download Predictions",
910
+ data=csv_results,
911
+ file_name="predictions.csv",
912
+ mime="text/csv"
913
+ )
914
+
915
+ if __name__ == "__main__":
916
+ main()
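The core matching rule the app demonstrates — assign each text the class of its most similar reference embedding, falling back to "Other" when the best similarity is below the threshold — can be sketched without any dependencies beyond the standard library. The toy 2-D vectors below stand in for real sentence embeddings, and all names are illustrative rather than QualiVec's actual API:

```python
# Sketch of threshold-based semantic matching over cosine similarity.
import math

def cosine(a, b):
    # Cosine similarity between two equal-length vectors
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def classify(vec, references, threshold=0.7):
    """references: list of (class_name, vector) pairs."""
    best_class, best_score = "Other", -1.0
    for cls, ref in references:
        score = cosine(vec, ref)
        if score > best_score:
            best_class, best_score = cls, score
    # Below the threshold, the match is considered too weak to assign a class
    if best_score < threshold:
        return ("Other", best_score)
    return (best_class, best_score)

refs = [("Positive", [1.0, 0.1]), ("Negative", [0.1, 1.0])]
print(classify([0.9, 0.2], refs, threshold=0.9))  # assigned "Positive"
print(classify([0.5, 0.5], refs, threshold=0.9))  # ambiguous, falls back to "Other"
```

Real embeddings have hundreds of dimensions and come from the selected HuggingFace or Gemini model, but the assignment logic is the same shape.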
app/run_demo.py CHANGED
@@ -1,38 +1,38 @@
1
- #!/usr/bin/env python3
2
- """
3
- Quick launcher script for the QualiVec Streamlit demo.
4
- """
5
-
6
- import subprocess
7
- import sys
8
- import os
9
-
10
- def main():
11
- """Launch the Streamlit app."""
12
-
13
- # Get the directory of this script
14
- script_dir = os.path.dirname(os.path.abspath(__file__))
15
- app_path = os.path.join(script_dir, "app.py")
16
-
17
- print("πŸš€ Starting QualiVec Demo...")
18
- print("πŸ“ App will be available at: http://localhost:8501")
19
- print("⏹️ Press Ctrl+C to stop the app")
20
- print("-" * 50)
21
-
22
- try:
23
- # Run streamlit
24
- subprocess.run([
25
- sys.executable, "-m", "streamlit", "run", app_path,
26
- "--server.headless", "true",
27
- "--server.address=0.0.0.0",
28
- "--server.port=8501",
29
- "--server.enableCORS", "false",
30
- "--server.enableXsrfProtection", "false"
31
- ])
32
- except KeyboardInterrupt:
33
- print("\nπŸ›‘ App stopped by user")
34
- except Exception as e:
35
- print(f"❌ Error starting app: {e}")
36
-
37
- if __name__ == "__main__":
38
- main()
 
1
+ #!/usr/bin/env python3
2
+ """
3
+ Quick launcher script for the QualiVec Streamlit demo.
4
+ """
5
+
6
+ import subprocess
7
+ import sys
8
+ import os
9
+
10
+ def main():
11
+ """Launch the Streamlit app."""
12
+
13
+ # Get the directory of this script
14
+ script_dir = os.path.dirname(os.path.abspath(__file__))
15
+ app_path = os.path.join(script_dir, "app.py")
16
+
17
+ print("πŸš€ Starting QualiVec Demo...")
18
+ print("πŸ“ App will be available at: http://localhost:8501")
19
+ print("⏹️ Press Ctrl+C to stop the app")
20
+ print("-" * 50)
21
+
22
+ try:
23
+ # Run streamlit
24
+ subprocess.run([
25
+ sys.executable, "-m", "streamlit", "run", app_path,
26
+ "--server.headless", "true",
27
+ # "--server.address=0.0.0.0",
28
+ "--server.port=8501",
29
+ "--server.enableCORS", "false",
30
+ "--server.enableXsrfProtection", "false"
31
+ ])
32
+ except KeyboardInterrupt:
33
+ print("\nπŸ›‘ App stopped by user")
34
+ except Exception as e:
35
+ print(f"❌ Error starting app: {e}")
36
+
37
+ if __name__ == "__main__":
38
+ main()
dist/.gitignore ADDED
@@ -0,0 +1 @@
 
 
1
+ *
dist/qualivec-0.1.0-py3-none-any.whl ADDED
Binary file (19.9 kB). View file
 
dist/qualivec-0.1.0.tar.gz ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:80b1c1f4ac5470593b6b873c82f620ec544e8f9d8ac2834d23ef81521e65625c
3
+ size 46670
src/qualivec/__pycache__/embedding.cpython-312.pyc CHANGED
Binary files a/src/qualivec/__pycache__/embedding.cpython-312.pyc and b/src/qualivec/__pycache__/embedding.cpython-312.pyc differ
 
uv.lock ADDED
The diff for this file is too large to render. See raw diff