Spaces: aneeb15 / Auto-FineTune-Ops

aneeb15 committed on Feb 12

Commit

9ec2fac

1 Parent(s): 335cd41

feat: implement dynamic evaluation page (remove static charts)

Files changed (2)

PROJECT_HIGHLIGHTS.md +0 -54
app.py +186 -127

PROJECT_HIGHLIGHTS.md DELETED

@@ -1,54 +0,0 @@
-# 🚀 Auto-FineTune-Ops: Project Highlights
-**Autonomous Machine Learning Pipeline for Production-Grade LLM Fine-Tuning**
-Auto-FineTune-Ops is a comprehensive, no-code/low-code platform that democratizes access to state-of-the-art LLM fine-tuning. It automates the complex lifecycle of data preparation, training, evaluation, and deployment.
----
-## 🌟 Key Features
-### 1. 🧠 Intelligent Preprocessing Engine
-A modular, production-ready data pipeline with 10+ specialized modules:
-- **Text Cleaning:** Auto-strip HTML, emojis, URLs, and normalize whitespace.
-- **PII Redaction:** Detect and mask emails, phone numbers, and keys for security.
-- **Deduplication:** Remove exact and semantic duplicates (using TF-IDF/Cosine Similarity).
-- **Quality Filtering:** Filter by language, toxicity, and length constraints.
-- **Advanced Formatting:** Auto-convert loose CSV/JSON into strict Chat Templates (ShareGPT/OpenAI).
-### 2. ⚡ Hybrid Training Ecosystem
-Flexible training workflows designed for all hardware setups:
-- **Local GPU Power:** Leverages **Unsloth** for 2x faster training and 70% less memory usage (4-bit quantization).
-- **Google Colab Bridge:** Seamless "No-GPU" fallback flow. Generate a ready-to-run Colab notebook to train on free cloud GPUs if local hardware is insufficient.
-- **Custom Model Support:** Fine-tune any HuggingFace model (Llama 3, Mistral, Gemma, Phi-3, etc.).
-### 3. ⚖️ Multi-Provider AI Judge Arena
-Production-grade model evaluation using LLM-as-a-Judge:
-- **Provider Agnostic:** Supports OpenAI (GPT-4o), Anthropic (Claude 3.5), Google (Gemini 1.5), and Groq (Llama 3).
-- **Custom Endpoints:** Connect to local LLMs (Ollama/vLLM) as judges.
-- **Comprehensive Metrics:** Automated scoring for Accuracy, Helpfulness, Clarity, and Tone.
-- **Head-to-Head:** Win-rate visualization comparing Base Model vs. Fine-Tuned Model.
-### 4. 🖥️ Interactive Streamlit Dashboard
-A premium, dark-mode UI that abstracts away CLI complexity:
-- **Project Management:** Manage datasets, models, and logs visually.
-- **Real-time Monitoring:** Track training loss and progress live.
-- **Visualization:** Interactive Plotly charts for evaluation results.
-### 5. 🚀 One-Click Deployment
-- **Instant API:** Export trained models as a production-ready **FastAPI** microservice.
-- **Standardized Interface:** OpenAI-compatible `/generate` endpoints for easy integration into apps.
----
-## 🔧 Technical Stack
-- **Frontend:** Streamlit, Plotly
-- **Core ML:** PyTorch, Transformers, PEFT, Unsloth, TRL
-- **Data:** Pandas, NumPy, Scikit-learn
-- **API:** FastAPI, Uvicorn
-- **LLM Clients:** OpenAI SDK, Anthropic SDK
-## 🛡️ Production Readiness
-- **Modular Architecture:** Agent-based design (DataArchitect, TrainingPilot, TheJudge) allows easy extensibility.
-- **Error Handling:** Robust fallback mechanisms and detailed logging.
-- **Security:** PII masking and API key management best practices.

app.py CHANGED

@@ -1213,178 +1213,237 @@ def render_training():
 def render_evaluation():
     st.markdown('<p class="gradient-header">⚖️ Model Evaluation</p>', unsafe_allow_html=True)
     # ── Judge Provider Selection ──
     st.markdown("### 🤖 Select AI Judge Provider")
-    st.caption("Choose which LLM provider to use as the evaluation judge. You can use any model you have API access to.")
     judge_provider = st.selectbox("AI Provider", [
         "OpenAI (GPT-4o, GPT-4-turbo, etc.)",
         "Anthropic (Claude 3.5, Claude 3 Opus, etc.)",
-        "Google Gemini (Gemini Pro, Gemini 1.5, etc.)",
-        "Groq (Llama, Mixtral, Gemma, etc.)",
         "Custom OpenAI-Compatible Endpoint"
-    ], help="Select the AI provider whose model will act as the judge for evaluating your fine-tuned model.")
     st.markdown("---")
     st.markdown("### 🔑 API Configuration")
     if "OpenAI" in judge_provider:
         col1, col2 = st.columns(2)
         with col1:
-            openai_key = st.text_input("OpenAI API Key", type="password",
-                help="Your OpenAI API key (starts with sk-)")
-            if openai_key:
-                os.environ["OPENAI_API_KEY"] = openai_key
         with col2:
-            judge_model = st.selectbox("Judge Model", [
-                "gpt-4o", "gpt-4o-mini", "gpt-4-turbo", "gpt-4", "gpt-3.5-turbo"
-            ])
     elif "Anthropic" in judge_provider:
         col1, col2 = st.columns(2)
         with col1:
-            anthropic_key = st.text_input("Anthropic API Key", type="password",
-                help="Your Anthropic API key")
-            if anthropic_key:
-                os.environ["ANTHROPIC_API_KEY"] = anthropic_key
-        with col2:
-            judge_model = st.selectbox("Judge Model", [
-                "claude-3-5-sonnet-20241022", "claude-3-opus-20240229",
-                "claude-3-sonnet-20240229", "claude-3-haiku-20240307"
-            ])
-    elif "Gemini" in judge_provider:
-        col1, col2 = st.columns(2)
-        with col1:
-            gemini_key = st.text_input("Google AI API Key", type="password",
-                help="Your Google AI Studio API key for Gemini models")
-            if gemini_key:
-                os.environ["GOOGLE_API_KEY"] = gemini_key
         with col2:
-            judge_model = st.selectbox("Judge Model", [
-                "gemini-1.5-pro", "gemini-1.5-flash", "gemini-pro"
-            ])
     elif "Groq" in judge_provider:
         col1, col2 = st.columns(2)
         with col1:
-            groq_key = st.text_input("Groq API Key", type="password",
-                help="Your Groq API key for fast inference")
-            if groq_key:
-                os.environ["GROQ_API_KEY"] = groq_key
         with col2:
-            judge_model = st.selectbox("Judge Model", [
-                "llama-3.1-70b-versatile", "llama-3.1-8b-instant",
-                "mixtral-8x7b-32768", "gemma2-9b-it"
-            ])
-    else:  # Custom endpoint
         col1, col2 = st.columns(2)
         with col1:
-            custom_base_url = st.text_input("API Base URL",
-                placeholder="https://api.your-provider.com/v1",
-                help="OpenAI-compatible API endpoint (e.g., vLLM, Ollama, LM Studio)")
-            custom_api_key = st.text_input("API Key", type="password",
-                help="API key for the custom endpoint (use 'none' for local servers)")
-            if custom_api_key:
-                os.environ["OPENAI_API_KEY"] = custom_api_key
-            if custom_base_url:
-                os.environ["OPENAI_BASE_URL"] = custom_base_url
         with col2:
-            judge_model = st.text_input("Model Name",
-                placeholder="e.g., my-model, llama-3-8b",
-                help="Model identifier used by your custom endpoint")
     st.markdown("---")
-    # ── Model / Results to Evaluate ──
     st.markdown("### 📊 Evaluation Data")
-    if st.session_state.model_path:
-        st.info(f"📦 Model / Results: `{st.session_state.model_path}`")
-    else:
-        st.warning("⚠️ No trained model or uploaded results found. You can upload evaluation data below or train a model first.")
-    # Upload evaluation data
-    eval_upload = st.file_uploader("Upload evaluation data (JSONL with instruction + model output)",
-        type=['jsonl', 'json'], key="eval_data_upload",
-        help="Upload a JSONL file containing instruction-response pairs to evaluate")
     if eval_upload:
         try:
-            eval_df = pd.read_json(eval_upload, lines=eval_upload.name.endswith('.jsonl'))
-            st.success(f"✅ Loaded **{len(eval_df):,}** samples for evaluation")
-            st.dataframe(eval_df.head(5), use_container_width=True)
-            st.session_state['eval_data'] = eval_df
         except Exception as e:
-            st.error(f"Error loading evaluation data: {e}")
     st.markdown("---")
-    # ── Demo Charts ──
-    st.markdown("### 📈 Evaluation Results")
-    col1, col2 = st.columns(2)
-    with col1:
-        fig = go.Figure(data=[go.Pie(
-            values=[72, 18, 10],
-            labels=['Fine-tuned Wins', 'Base Model Wins', 'Ties'],
-            hole=0.6,
-            marker_colors=['#6366f1', '#ef4444', '#94a3b8']
-        )])
-        fig.update_layout(
-            title="Win Rate Distribution",
-            paper_bgcolor='rgba(0,0,0,0)',
-            plot_bgcolor='rgba(0,0,0,0)',
-            font_color='#e2e8f0',
-            showlegend=True
-        )
-        st.plotly_chart(fig, use_container_width=True)
-    with col2:
-        fig = go.Figure(data=[
-            go.Bar(name='Base Model', x=['Accuracy', 'Helpfulness', 'Clarity', 'Relevance'], y=[6.2, 5.8, 6.5, 6.0], marker_color='#ef4444'),
-            go.Bar(name='Fine-tuned', x=['Accuracy', 'Helpfulness', 'Clarity', 'Relevance'], y=[7.8, 8.1, 7.5, 8.2], marker_color='#6366f1')
-        ])
-        fig.update_layout(
-            title="Score Comparison by Category",
-            barmode='group',
-            paper_bgcolor='rgba(0,0,0,0)',
-            plot_bgcolor='rgba(0,0,0,0)',
-            font_color='#e2e8f0',
-            yaxis_title="Score (1-10)"
-        )
-        st.plotly_chart(fig, use_container_width=True)
-    # Summary metrics
-    col1, col2, col3, col4 = st.columns(4)
-    with col1:
-        st.metric("Win Rate", "72%", "+22%")
-    with col2:
-        st.metric("Base Avg Score", "6.4/10")
-    with col3:
-        st.metric("Fine-tuned Avg", "7.8/10", "+1.4")
-    with col4:
-        st.metric("Comparisons", "50")
-    st.markdown("---")
-    # Run evaluation
-    col1, col2, col3 = st.columns([1, 2, 1])
-    with col2:
-        if st.button("🏃 Run Full Evaluation", type="primary", use_container_width=True):
-            has_key = any([
-                os.environ.get("OPENAI_API_KEY"),
-                os.environ.get("ANTHROPIC_API_KEY"),
-                os.environ.get("GOOGLE_API_KEY"),
-                os.environ.get("GROQ_API_KEY"),
-            ])
-            if not has_key:
-                st.error("❌ Please provide an API key for your selected judge provider.")
-            elif not st.session_state.model_path and not st.session_state.get('eval_data') is not None:
-                st.error("❌ Please either train a model, upload fine-tuned results, or upload evaluation data.")
-            else:
-                st.info(f"🏃 Starting evaluation with **{judge_model}** as judge...")
-                st.warning("⏳ Full evaluation pipeline integration coming soon. Demo results shown above.")
 # ============================================================================

 def render_evaluation():
     st.markdown('<p class="gradient-header">⚖️ Model Evaluation</p>', unsafe_allow_html=True)
+    # Initialize session state for results if not present
+    if 'eval_results' not in st.session_state:
+        st.session_state.eval_results = None
     # ── Judge Provider Selection ──
     st.markdown("### 🤖 Select AI Judge Provider")
+    st.caption("Choose which LLM provider to use as the evaluation judge.")
     judge_provider = st.selectbox("AI Provider", [
         "OpenAI (GPT-4o, GPT-4-turbo, etc.)",
         "Anthropic (Claude 3.5, Claude 3 Opus, etc.)",
+        "Groq (Llama 3, Mixtral, Gemma, etc.)",
         "Custom OpenAI-Compatible Endpoint"
+    ], help="Select the AI provider whose model will act as the judge.")
     st.markdown("---")
     st.markdown("### 🔑 API Configuration")
+    api_key = None
+    base_url = None
     if "OpenAI" in judge_provider:
         col1, col2 = st.columns(2)
         with col1:
+            api_key = st.text_input("OpenAI API Key", type="password", key="openai_key_input")
+            if api_key: os.environ["OPENAI_API_KEY"] = api_key
         with col2:
+            judge_model = st.selectbox("Judge Model", ["gpt-4o", "gpt-4-turbo", "gpt-3.5-turbo"])
     elif "Anthropic" in judge_provider:
         col1, col2 = st.columns(2)
         with col1:
+            api_key = st.text_input("Anthropic API Key", type="password", key="anthropic_key_input")
+            if api_key: os.environ["ANTHROPIC_API_KEY"] = api_key
         with col2:
+            judge_model = st.selectbox("Judge Model", ["claude-3-5-sonnet-20241022", "claude-3-opus-20240229", "claude-3-sonnet-20240229"])
     elif "Groq" in judge_provider:
         col1, col2 = st.columns(2)
         with col1:
+            api_key = st.text_input("Groq API Key", type="password", key="groq_key_input")
+            if api_key: os.environ["GROQ_API_KEY"] = api_key
         with col2:
+            judge_model = st.selectbox("Judge Model", ["llama3-70b-8192", "llama3-8b-8192", "mixtral-8x7b-32768", "gemma-7b-it"])
+        base_url = "https://api.groq.com/openai/v1"
+    else:  # Custom
         col1, col2 = st.columns(2)
         with col1:
+            base_url = st.text_input("API Base URL", placeholder="https://api.your-provider.com/v1")
+            api_key = st.text_input("API Key", type="password", key="custom_key_input")
+            if api_key: os.environ["OPENAI_API_KEY"] = api_key
         with col2:
+            judge_model = st.text_input("Model Name", placeholder="e.g., my-model")
     st.markdown("---")
+    # ── Evaluation Data ──
     st.markdown("### 📊 Evaluation Data")
+    # 1. Use data from training (if available)
+    if st.session_state.model_path and "finetuned_outputs" in str(st.session_state.model_path):
+        st.info(f"Using results from training: `{st.session_state.model_path}`")
+        try:
+            st.session_state['eval_data'] = pd.read_json(st.session_state.model_path, lines=True)
+        except Exception as e:
+            st.warning(f"Could not load training results: {e}")
+    # 2. Upload new data
+    eval_upload = st.file_uploader("Upload JSONL (Must contain: 'instruction', 'base_output', 'finetuned_output')",
+        type=['jsonl', 'json'], key="eval_uploader")
     if eval_upload:
         try:
+            df = pd.read_json(eval_upload, lines=eval_upload.name.endswith('.jsonl'))
+            required_cols = ['instruction', 'base_output', 'finetuned_output']
+            if all(col in df.columns for col in required_cols):
+                st.session_state['eval_data'] = df
+                st.success(f"✅ Loaded {len(df)} samples")
+            else:
+                st.error(f"❌ Missing columns! Found: {list(df.columns)}. Required: {required_cols}")
         except Exception as e:
+            st.error(f"Error loading file: {e}")
+    # Show Preview
+    if st.session_state.get('eval_data') is not None:
+        with st.expander("👁️ View Data Preview"):
+            st.dataframe(st.session_state['eval_data'].head(3), use_container_width=True)
     st.markdown("---")
+    # ── Run Evaluation ──
+    if st.button("🚀 Run Dynamic Evaluation", type="primary", use_container_width=True):
+        if not api_key:
+            st.error("❌ Please provide an API Key above!")
+            return
+        if st.session_state.get('eval_data') is None:
+            st.error("❌ No evaluation data loaded!")
+            return
+        # Prepare Judge
+        st.session_state.pipeline_status['evaluation'] = 'running'
+        progress_bar = st.progress(0)
+        status_text = st.empty()
+        results = []
+        df = st.session_state['eval_data']
+        total = len(df)
+        try:
+            # Initialize Client
+            client = None
+            if "Anthropic" in judge_provider:
+                from anthropic import Anthropic
+                client = Anthropic(api_key=api_key)
+            else:
+                from openai import OpenAI
+                client = OpenAI(api_key=api_key, base_url=base_url)
+            JUDGE_PROMPT = """You are an expert evaluator comparing two AI responses.
+Query: {prompt}
+Response A (Base Model):
+{response_a}
+Response B (Fine-tuned Model):
+{response_b}
+Compare them on: Accuracy, Helpfulness, Clarity.
+Return a valid JSON object ONLY:
+{{
+    "winner": "A" or "B" or "TIE",
+    "score_a": <1-10>,
+    "score_b": <1-10>,
+    "reasoning": "short explanation",
+    "accuracy": {{"A": <1-10>, "B": <1-10>}},
+    "helpfulness": {{"A": <1-10>, "B": <1-10>}},
+    "clarity": {{"A": <1-10>, "B": <1-10>}}
+}}
+"""
+            for i, (_, row) in enumerate(df.iterrows()):  # i is the row position, not the index label
+                status_text.text(f"Evaluating sample {i+1}/{total}...")
+                prompt_text = JUDGE_PROMPT.format(
+                    prompt=row['instruction'],
+                    response_a=row['base_output'],
+                    response_b=row['finetuned_output']
+                )
+                # Call API
+                if "Anthropic" in judge_provider:
+                    resp = client.messages.create(
+                        model=judge_model, max_tokens=1000,
+                        messages=[{"role": "user", "content": prompt_text}]
+                    ).content[0].text
+                else:
+                    resp = client.chat.completions.create(
+                        model=judge_model, max_tokens=1000,
+                        messages=[{"role": "user", "content": prompt_text}],
+                        response_format={"type": "json_object"}
+                    ).choices[0].message.content
+                # Parse
+                try:
+                    import json
+                    # Clean json string if needed
+                    if "```json" in resp: resp = resp.split("```json")[1].split("```")[0]
+                    if "```" in resp: resp = resp.split("```")[1]
+                    data = json.loads(resp.strip())
+                    data['instruction'] = row['instruction']
+                    results.append(data)
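+                except Exception as e:
+                    print(f"Parse error: {e}")
+                    # Keep 'instruction' so the results table can always select that column
+                    results.append({"instruction": row['instruction'], "winner": "TIE", "score_a": 5, "score_b": 5, "reasoning": "Error parsing judge response"})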
+                progress_bar.progress((i + 1) / total)
+            st.session_state.eval_results = results
+            st.session_state.pipeline_status['evaluation'] = 'complete'
+            status_text.text("✅ Evaluation Complete!")
+        except Exception as e:
+            st.error(f"Evaluation Failed: {str(e)}")
+            st.session_state.pipeline_status['evaluation'] = 'error'
+    # ── Display Results ──
+    if st.session_state.get('eval_results'):
+        results = st.session_state.eval_results
+        df_res = pd.DataFrame(results)
+        # Metrics
+        wins_b = len(df_res[df_res['winner'] == 'B'])
+        wins_a = len(df_res[df_res['winner'] == 'A'])
+        ties = len(df_res[df_res['winner'] == 'TIE'])
+        win_rate = (wins_b / len(df_res)) * 100
+        col1, col2, col3, col4 = st.columns(4)
+        col1.metric("Fine-tuned Win Rate", f"{win_rate:.1f}%")
+        col2.metric("Fine-Tuned Wins", wins_b)
+        col3.metric("Base Model Wins", wins_a)
+        col4.metric("Avg Score Improvement", f"{df_res['score_b'].mean() - df_res['score_a'].mean():.2f}")
+        # Charts
+        c1, c2 = st.columns(2)
+        with c1:
+            fig = px.pie(values=[wins_b, wins_a, ties], names=['Fine-tuned', 'Base', 'Ties'],
+                         title="Win Distribution", color_discrete_sequence=['#6366f1', '#ef4444', '#94a3b8'])
+            st.plotly_chart(fig, use_container_width=True)
+        with c2:
+            avg_scores = pd.DataFrame({
+                'Model': ['Base', 'Fine-tuned'],
+                'Score': [df_res['score_a'].mean(), df_res['score_b'].mean()]
+            })
+            fig2 = px.bar(avg_scores, x='Model', y='Score', color='Model',
+                          title="Average Overall Score", color_discrete_map={'Base': '#ef4444', 'Fine-tuned': '#6366f1'})
+            st.plotly_chart(fig2, use_container_width=True)
+        # Detailed Table
+        st.markdown("### 📝 Detailed Verdicts")
+        st.dataframe(df_res[['instruction', 'winner', 'score_a', 'score_b', 'reasoning']], use_container_width=True)
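+        # Download
+        # The import inside the button branch makes `json` local to this function,
+        # so bind it here as well for reruns where that branch does not execute.
+        import json
+        st.download_button("⬇️ Download Report (JSON)",
+                           data=json.dumps(results, indent=2),
+                           file_name="evaluation_report.json",
+                           mime="application/json")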
 # ============================================================================
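
As a rough companion to the parsing and metric steps in the new `render_evaluation`, the sketch below isolates them into two pure functions that can be tested without Streamlit or an API key. `parse_verdict`, `win_rate`, and `FENCE` are hypothetical names introduced here, and the extra check on the `winner` field is an added assumption; the fallback verdict mirrors the one in the diff.

```python
import json
import re

# Three backticks, built programmatically so this listing stays fence-safe.
FENCE = "`" * 3
FENCE_RE = re.compile(FENCE + r"(?:json)?\s*(.*?)" + FENCE, re.DOTALL)


def parse_verdict(raw: str, instruction: str) -> dict:
    """Parse a judge reply into a verdict dict, falling back to a neutral TIE."""
    match = FENCE_RE.search(raw)
    payload = match.group(1) if match else raw
    try:
        data = json.loads(payload.strip())
        if not isinstance(data, dict) or data.get("winner") not in {"A", "B", "TIE"}:
            raise ValueError("unexpected verdict shape")
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        data = {"winner": "TIE", "score_a": 5, "score_b": 5,
                "reasoning": "Error parsing judge response"}
    # Always attach the instruction so downstream tables have the column.
    data["instruction"] = instruction
    return data


def win_rate(verdicts: list[dict]) -> float:
    """Percentage of verdicts won by response B (the fine-tuned model)."""
    if not verdicts:
        return 0.0
    wins_b = sum(1 for v in verdicts if v["winner"] == "B")
    return 100.0 * wins_b / len(verdicts)
```

Keeping these steps free of UI calls means a malformed judge reply can be reproduced in a unit test instead of by rerunning a paid evaluation.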