Spaces:
Running
Running
Initial release: DataView MCP - HuggingFace Dataset Explorer
Browse filesFeatures 10 MCP tools:
- search_datasets, search_by_columns
- get_dataset_info, get_schema
- sample_rows
- get_statistics, profile_quality
- find_similar, suggest_tasks, compare_datasets
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- .env.example +3 -0
- .gitignore +18 -0
- README.md +183 -6
- app.py +252 -0
- requirements.txt +7 -0
- tools/__init__.py +20 -0
- tools/discovery.py +384 -0
- tools/metadata.py +99 -0
- tools/profiling.py +283 -0
- tools/sampling.py +51 -0
- tools/search.py +116 -0
- utils/__init__.py +28 -0
- utils/formatting.py +199 -0
- utils/hf_client.py +164 -0
.env.example
ADDED
|
@@ -0,0 +1,3 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
# Hugging Face API Token (optional, but recommended for higher rate limits)
|
| 2 |
+
# Get yours at: https://huggingface.co/settings/tokens
|
| 3 |
+
HF_TOKEN=your_token_here
|
.gitignore
ADDED
|
@@ -0,0 +1,18 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
# Environment
|
| 2 |
+
.env
|
| 3 |
+
.venv/
|
| 4 |
+
venv/
|
| 5 |
+
__pycache__/
|
| 6 |
+
*.pyc
|
| 7 |
+
|
| 8 |
+
# IDE
|
| 9 |
+
.idea/
|
| 10 |
+
.vscode/
|
| 11 |
+
*.swp
|
| 12 |
+
|
| 13 |
+
# OS
|
| 14 |
+
.DS_Store
|
| 15 |
+
Thumbs.db
|
| 16 |
+
|
| 17 |
+
# Gradio
|
| 18 |
+
flagged/
|
README.md
CHANGED
|
@@ -1,12 +1,189 @@
|
|
| 1 |
---
|
| 2 |
-
title:
|
| 3 |
-
emoji:
|
| 4 |
-
colorFrom:
|
| 5 |
-
colorTo:
|
| 6 |
sdk: gradio
|
| 7 |
-
sdk_version:
|
| 8 |
app_file: app.py
|
| 9 |
pinned: false
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 10 |
---
|
| 11 |
|
| 12 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
---
|
| 2 |
+
title: DataView MCP
|
| 3 |
+
emoji: 🔍
|
| 4 |
+
colorFrom: blue
|
| 5 |
+
colorTo: purple
|
| 6 |
sdk: gradio
|
| 7 |
+
sdk_version: 5.0.0
|
| 8 |
app_file: app.py
|
| 9 |
pinned: false
|
| 10 |
+
license: mit
|
| 11 |
+
tags:
|
| 12 |
+
- mcp
|
| 13 |
+
- datasets
|
| 14 |
+
- huggingface
|
| 15 |
+
- exploration
|
| 16 |
+
- gradio
|
| 17 |
---
|
| 18 |
|
| 19 |
+
# DataView MCP 🔍
|
| 20 |
+
|
| 21 |
+
A comprehensive **Model Context Protocol (MCP) server** for exploring Hugging Face datasets. Give your AI assistant the power to search, sample, analyze, and discover datasets on the Hub.
|
| 22 |
+
|
| 23 |
+
## Features
|
| 24 |
+
|
| 25 |
+
| Tool | Description |
|
| 26 |
+
|------|-------------|
|
| 27 |
+
| `search_datasets` | Find datasets by keyword, task, or domain |
|
| 28 |
+
| `search_by_columns` | Find datasets with specific column names |
|
| 29 |
+
| `get_dataset_info` | Get detailed metadata and README |
|
| 30 |
+
| `get_schema` | Get column names and data types |
|
| 31 |
+
| `sample_rows` | Get actual data samples |
|
| 32 |
+
| `get_statistics` | Compute column statistics |
|
| 33 |
+
| `profile_quality` | Assess data quality issues |
|
| 34 |
+
| `find_similar` | Find similar datasets |
|
| 35 |
+
| `suggest_tasks` | Suggest ML tasks for a dataset |
|
| 36 |
+
| `compare_datasets` | Compare two datasets side-by-side |
|
| 37 |
+
|
| 38 |
+
## Quick Start
|
| 39 |
+
|
| 40 |
+
### Use with Claude Desktop
|
| 41 |
+
|
| 42 |
+
Add to your `claude_desktop_config.json`:
|
| 43 |
+
|
| 44 |
+
```json
|
| 45 |
+
{
|
| 46 |
+
"mcpServers": {
|
| 47 |
+
"dataview": {
|
| 48 |
+
"url": "https://efecelik-dataview-mcp.hf.space/gradio_api/mcp/sse"
|
| 49 |
+
}
|
| 50 |
+
}
|
| 51 |
+
}
|
| 52 |
+
```
|
| 53 |
+
|
| 54 |
+
### Use with Claude Code
|
| 55 |
+
|
| 56 |
+
Add to your MCP settings:
|
| 57 |
+
|
| 58 |
+
```json
|
| 59 |
+
{
|
| 60 |
+
"mcpServers": {
|
| 61 |
+
"dataview": {
|
| 62 |
+
"command": "npx",
|
| 63 |
+
"args": ["mcp-remote", "https://efecelik-dataview-mcp.hf.space/gradio_api/mcp/sse"]
|
| 64 |
+
}
|
| 65 |
+
}
|
| 66 |
+
}
|
| 67 |
+
```
|
| 68 |
+
|
| 69 |
+
### Run Locally
|
| 70 |
+
|
| 71 |
+
```bash
|
| 72 |
+
# Clone the repository
|
| 73 |
+
git clone https://huggingface.co/spaces/efecelik/dataview-mcp
|
| 74 |
+
cd dataview-mcp
|
| 75 |
+
|
| 76 |
+
# Install dependencies
|
| 77 |
+
pip install -r requirements.txt
|
| 78 |
+
|
| 79 |
+
# Optional: Set HF token for higher rate limits
|
| 80 |
+
export HF_TOKEN=your_token_here
|
| 81 |
+
|
| 82 |
+
# Run the server
|
| 83 |
+
python app.py
|
| 84 |
+
```
|
| 85 |
+
|
| 86 |
+
Then connect to `http://localhost:7860/gradio_api/mcp/sse`
|
| 87 |
+
|
| 88 |
+
## Example Usage
|
| 89 |
+
|
| 90 |
+
Once connected, ask your AI assistant:
|
| 91 |
+
|
| 92 |
+
- *"Search for sentiment analysis datasets"*
|
| 93 |
+
- *"Show me 5 sample rows from the IMDB dataset"*
|
| 94 |
+
- *"What's the schema of the SQuAD dataset?"*
|
| 95 |
+
- *"Find datasets similar to IMDB"*
|
| 96 |
+
- *"What ML tasks could I do with the IMDB dataset?"*
|
| 97 |
+
- *"Compare IMDB and Rotten Tomatoes datasets"*
|
| 98 |
+
- *"Check the data quality of this dataset"*
|
| 99 |
+
|
| 100 |
+
## Tool Details
|
| 101 |
+
|
| 102 |
+
### search_datasets
|
| 103 |
+
|
| 104 |
+
Find datasets matching your criteria.
|
| 105 |
+
|
| 106 |
+
```
|
| 107 |
+
Query: "sentiment analysis"
|
| 108 |
+
Filter: text-classification
|
| 109 |
+
Limit: 10
|
| 110 |
+
```
|
| 111 |
+
|
| 112 |
+
### sample_rows
|
| 113 |
+
|
| 114 |
+
See actual data from a dataset.
|
| 115 |
+
|
| 116 |
+
```
|
| 117 |
+
Dataset: imdb
|
| 118 |
+
Rows: 5
|
| 119 |
+
Split: train
|
| 120 |
+
```
|
| 121 |
+
|
| 122 |
+
### get_statistics
|
| 123 |
+
|
| 124 |
+
Get statistical overview of columns.
|
| 125 |
+
|
| 126 |
+
```
|
| 127 |
+
Dataset: imdb
|
| 128 |
+
Sample Size: 1000
|
| 129 |
+
```
|
| 130 |
+
|
| 131 |
+
### profile_quality
|
| 132 |
+
|
| 133 |
+
Check for data quality issues.
|
| 134 |
+
|
| 135 |
+
```
|
| 136 |
+
Dataset: imdb
|
| 137 |
+
Sample Size: 500
|
| 138 |
+
```
|
| 139 |
+
|
| 140 |
+
Returns quality score, missing values, duplicates, class imbalance.
|
| 141 |
+
|
| 142 |
+
### suggest_tasks
|
| 143 |
+
|
| 144 |
+
AI-powered task suggestions based on dataset structure.
|
| 145 |
+
|
| 146 |
+
```
|
| 147 |
+
Dataset: imdb
|
| 148 |
+
```
|
| 149 |
+
|
| 150 |
+
Returns suggested ML tasks with confidence levels.
|
| 151 |
+
|
| 152 |
+
## Development
|
| 153 |
+
|
| 154 |
+
```bash
|
| 155 |
+
# Install dev dependencies
|
| 156 |
+
pip install -r requirements.txt
|
| 157 |
+
|
| 158 |
+
# Run in development mode
|
| 159 |
+
gradio app.py --reload
|
| 160 |
+
```
|
| 161 |
+
|
| 162 |
+
## Architecture
|
| 163 |
+
|
| 164 |
+
```
|
| 165 |
+
dataview-mcp/
|
| 166 |
+
├── app.py # Main Gradio MCP server
|
| 167 |
+
├── tools/
|
| 168 |
+
│ ├── search.py # search_datasets, search_by_columns
|
| 169 |
+
│ ├── metadata.py # get_dataset_info, get_schema
|
| 170 |
+
│ ├── sampling.py # sample_rows
|
| 171 |
+
│ ├── profiling.py # get_statistics, profile_quality
|
| 172 |
+
│ └── discovery.py # find_similar, suggest_tasks, compare_datasets
|
| 173 |
+
├── utils/
|
| 174 |
+
│ ├── hf_client.py # HF API wrapper
|
| 175 |
+
│ └── formatting.py # Output formatters
|
| 176 |
+
└── requirements.txt
|
| 177 |
+
```
|
| 178 |
+
|
| 179 |
+
## License
|
| 180 |
+
|
| 181 |
+
MIT
|
| 182 |
+
|
| 183 |
+
## Contributing
|
| 184 |
+
|
| 185 |
+
Contributions welcome! Please open an issue or PR.
|
| 186 |
+
|
| 187 |
+
---
|
| 188 |
+
|
| 189 |
+
Built with Gradio and Hugging Face Hub
|
app.py
ADDED
|
@@ -0,0 +1,252 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
"""
|
| 2 |
+
DataView MCP - A comprehensive MCP server for exploring Hugging Face datasets.
|
| 3 |
+
|
| 4 |
+
This MCP server provides 10 tools for searching, sampling, profiling, and
|
| 5 |
+
discovering datasets on the Hugging Face Hub.
|
| 6 |
+
|
| 7 |
+
Tools:
|
| 8 |
+
1. search_datasets - Find datasets by keyword, task, or domain
|
| 9 |
+
2. search_by_columns - Find datasets with specific column names
|
| 10 |
+
3. get_dataset_info - Get detailed metadata and README
|
| 11 |
+
4. get_schema - Get column names and data types
|
| 12 |
+
5. sample_rows - Get actual data samples
|
| 13 |
+
6. get_statistics - Compute column statistics
|
| 14 |
+
7. profile_quality - Assess data quality issues
|
| 15 |
+
8. find_similar - Find similar datasets
|
| 16 |
+
9. suggest_tasks - Suggest ML tasks for a dataset
|
| 17 |
+
10. compare_datasets - Compare two datasets side-by-side
|
| 18 |
+
|
| 19 |
+
Usage:
|
| 20 |
+
# Run locally
|
| 21 |
+
python app.py
|
| 22 |
+
|
| 23 |
+
# Or with Gradio CLI
|
| 24 |
+
gradio app.py
|
| 25 |
+
|
| 26 |
+
# Connect via MCP
|
| 27 |
+
Add to your MCP client config:
|
| 28 |
+
{
|
| 29 |
+
"mcpServers": {
|
| 30 |
+
"dataview": {
|
| 31 |
+
"url": "http://localhost:7860/gradio_api/mcp/sse"
|
| 32 |
+
}
|
| 33 |
+
}
|
| 34 |
+
}
|
| 35 |
+
"""
|
| 36 |
+
|
| 37 |
+
import gradio as gr
|
| 38 |
+
from typing import Optional, List
|
| 39 |
+
|
| 40 |
+
# Import all tools
|
| 41 |
+
from tools.search import search_datasets, search_by_columns
|
| 42 |
+
from tools.metadata import get_dataset_info, get_schema
|
| 43 |
+
from tools.sampling import sample_rows
|
| 44 |
+
from tools.profiling import get_statistics, profile_quality
|
| 45 |
+
from tools.discovery import find_similar, suggest_tasks, compare_datasets
|
| 46 |
+
|
| 47 |
+
|
| 48 |
+
# Create Gradio interfaces for each tool
|
| 49 |
+
# Note: Gradio will automatically convert these to MCP tools
|
| 50 |
+
|
| 51 |
+
def create_demo():
|
| 52 |
+
"""Create the Gradio demo with all tools."""
|
| 53 |
+
|
| 54 |
+
with gr.Blocks(
|
| 55 |
+
title="DataView MCP - HuggingFace Dataset Explorer",
|
| 56 |
+
theme=gr.themes.Soft()
|
| 57 |
+
) as demo:
|
| 58 |
+
gr.Markdown("""
|
| 59 |
+
# DataView MCP
|
| 60 |
+
## Explore Hugging Face Datasets with AI
|
| 61 |
+
|
| 62 |
+
This MCP server provides tools for AI assistants to explore, analyze, and
|
| 63 |
+
understand datasets on the Hugging Face Hub.
|
| 64 |
+
|
| 65 |
+
**10 Tools Available:**
|
| 66 |
+
- Search & Discovery: `search_datasets`, `search_by_columns`, `find_similar`
|
| 67 |
+
- Metadata: `get_dataset_info`, `get_schema`
|
| 68 |
+
- Data Access: `sample_rows`
|
| 69 |
+
- Analysis: `get_statistics`, `profile_quality`
|
| 70 |
+
- Intelligence: `suggest_tasks`, `compare_datasets`
|
| 71 |
+
|
| 72 |
+
---
|
| 73 |
+
### Try the tools below or connect via MCP
|
| 74 |
+
""")
|
| 75 |
+
|
| 76 |
+
with gr.Tabs():
|
| 77 |
+
# Search Tab
|
| 78 |
+
with gr.Tab("Search"):
|
| 79 |
+
with gr.Row():
|
| 80 |
+
with gr.Column():
|
| 81 |
+
search_query = gr.Textbox(
|
| 82 |
+
label="Search Query",
|
| 83 |
+
placeholder="e.g., sentiment analysis, medical imaging"
|
| 84 |
+
)
|
| 85 |
+
search_limit = gr.Slider(1, 50, value=10, step=1, label="Max Results")
|
| 86 |
+
search_task = gr.Dropdown(
|
| 87 |
+
choices=[
|
| 88 |
+
None, "text-classification", "question-answering",
|
| 89 |
+
"summarization", "translation", "image-classification",
|
| 90 |
+
"object-detection", "text-generation"
|
| 91 |
+
],
|
| 92 |
+
label="Filter by Task (optional)"
|
| 93 |
+
)
|
| 94 |
+
search_btn = gr.Button("Search Datasets", variant="primary")
|
| 95 |
+
with gr.Column():
|
| 96 |
+
search_output = gr.Markdown(label="Results")
|
| 97 |
+
|
| 98 |
+
search_btn.click(
|
| 99 |
+
search_datasets,
|
| 100 |
+
inputs=[search_query, search_limit, search_task],
|
| 101 |
+
outputs=search_output
|
| 102 |
+
)
|
| 103 |
+
|
| 104 |
+
# Dataset Info Tab
|
| 105 |
+
with gr.Tab("Dataset Info"):
|
| 106 |
+
with gr.Row():
|
| 107 |
+
with gr.Column():
|
| 108 |
+
info_dataset_id = gr.Textbox(
|
| 109 |
+
label="Dataset ID",
|
| 110 |
+
placeholder="e.g., imdb, squad, huggingface/documentation-images"
|
| 111 |
+
)
|
| 112 |
+
info_btn = gr.Button("Get Info", variant="primary")
|
| 113 |
+
schema_btn = gr.Button("Get Schema")
|
| 114 |
+
with gr.Column():
|
| 115 |
+
info_output = gr.Markdown(label="Dataset Info")
|
| 116 |
+
|
| 117 |
+
info_btn.click(get_dataset_info, inputs=[info_dataset_id], outputs=info_output)
|
| 118 |
+
schema_btn.click(get_schema, inputs=[info_dataset_id], outputs=info_output)
|
| 119 |
+
|
| 120 |
+
# Sample Data Tab
|
| 121 |
+
with gr.Tab("Sample Data"):
|
| 122 |
+
with gr.Row():
|
| 123 |
+
with gr.Column():
|
| 124 |
+
sample_dataset_id = gr.Textbox(
|
| 125 |
+
label="Dataset ID",
|
| 126 |
+
placeholder="e.g., imdb"
|
| 127 |
+
)
|
| 128 |
+
sample_n_rows = gr.Slider(1, 20, value=5, step=1, label="Number of Rows")
|
| 129 |
+
sample_split = gr.Dropdown(
|
| 130 |
+
choices=["train", "test", "validation"],
|
| 131 |
+
value="train",
|
| 132 |
+
label="Split"
|
| 133 |
+
)
|
| 134 |
+
sample_btn = gr.Button("Get Sample", variant="primary")
|
| 135 |
+
with gr.Column():
|
| 136 |
+
sample_output = gr.Markdown(label="Sample Data")
|
| 137 |
+
|
| 138 |
+
sample_btn.click(
|
| 139 |
+
sample_rows,
|
| 140 |
+
inputs=[sample_dataset_id, sample_n_rows, gr.State(None), sample_split],
|
| 141 |
+
outputs=sample_output
|
| 142 |
+
)
|
| 143 |
+
|
| 144 |
+
# Analysis Tab
|
| 145 |
+
with gr.Tab("Analysis"):
|
| 146 |
+
with gr.Row():
|
| 147 |
+
with gr.Column():
|
| 148 |
+
analysis_dataset_id = gr.Textbox(
|
| 149 |
+
label="Dataset ID",
|
| 150 |
+
placeholder="e.g., imdb"
|
| 151 |
+
)
|
| 152 |
+
analysis_sample_size = gr.Slider(
|
| 153 |
+
100, 2000, value=500, step=100,
|
| 154 |
+
label="Sample Size for Analysis"
|
| 155 |
+
)
|
| 156 |
+
stats_btn = gr.Button("Get Statistics", variant="primary")
|
| 157 |
+
quality_btn = gr.Button("Profile Quality")
|
| 158 |
+
with gr.Column():
|
| 159 |
+
analysis_output = gr.Markdown(label="Analysis Results")
|
| 160 |
+
|
| 161 |
+
stats_btn.click(
|
| 162 |
+
get_statistics,
|
| 163 |
+
inputs=[analysis_dataset_id, gr.State(None), gr.State("train"), analysis_sample_size],
|
| 164 |
+
outputs=analysis_output
|
| 165 |
+
)
|
| 166 |
+
quality_btn.click(
|
| 167 |
+
profile_quality,
|
| 168 |
+
inputs=[analysis_dataset_id, gr.State(None), gr.State("train"), analysis_sample_size],
|
| 169 |
+
outputs=analysis_output
|
| 170 |
+
)
|
| 171 |
+
|
| 172 |
+
# Discovery Tab
|
| 173 |
+
with gr.Tab("Discovery"):
|
| 174 |
+
with gr.Row():
|
| 175 |
+
with gr.Column():
|
| 176 |
+
discovery_dataset_id = gr.Textbox(
|
| 177 |
+
label="Dataset ID",
|
| 178 |
+
placeholder="e.g., imdb"
|
| 179 |
+
)
|
| 180 |
+
discovery_top_k = gr.Slider(1, 10, value=5, step=1, label="Number of Results")
|
| 181 |
+
similar_btn = gr.Button("Find Similar", variant="primary")
|
| 182 |
+
suggest_btn = gr.Button("Suggest Tasks")
|
| 183 |
+
with gr.Column():
|
| 184 |
+
discovery_output = gr.Markdown(label="Discovery Results")
|
| 185 |
+
|
| 186 |
+
similar_btn.click(
|
| 187 |
+
find_similar,
|
| 188 |
+
inputs=[discovery_dataset_id, discovery_top_k],
|
| 189 |
+
outputs=discovery_output
|
| 190 |
+
)
|
| 191 |
+
suggest_btn.click(
|
| 192 |
+
suggest_tasks,
|
| 193 |
+
inputs=[discovery_dataset_id],
|
| 194 |
+
outputs=discovery_output
|
| 195 |
+
)
|
| 196 |
+
|
| 197 |
+
# Compare Tab
|
| 198 |
+
with gr.Tab("Compare"):
|
| 199 |
+
with gr.Row():
|
| 200 |
+
with gr.Column():
|
| 201 |
+
compare_dataset_a = gr.Textbox(
|
| 202 |
+
label="Dataset A",
|
| 203 |
+
placeholder="e.g., imdb"
|
| 204 |
+
)
|
| 205 |
+
compare_dataset_b = gr.Textbox(
|
| 206 |
+
label="Dataset B",
|
| 207 |
+
placeholder="e.g., rotten_tomatoes"
|
| 208 |
+
)
|
| 209 |
+
compare_btn = gr.Button("Compare Datasets", variant="primary")
|
| 210 |
+
with gr.Column():
|
| 211 |
+
compare_output = gr.Markdown(label="Comparison Results")
|
| 212 |
+
|
| 213 |
+
compare_btn.click(
|
| 214 |
+
compare_datasets,
|
| 215 |
+
inputs=[compare_dataset_a, compare_dataset_b],
|
| 216 |
+
outputs=compare_output
|
| 217 |
+
)
|
| 218 |
+
|
| 219 |
+
gr.Markdown("""
|
| 220 |
+
---
|
| 221 |
+
### MCP Connection
|
| 222 |
+
|
| 223 |
+
To use with Claude or other MCP clients, add this to your config:
|
| 224 |
+
|
| 225 |
+
```json
|
| 226 |
+
{
|
| 227 |
+
"mcpServers": {
|
| 228 |
+
"dataview": {
|
| 229 |
+
"url": "https://YOUR-SPACE.hf.space/gradio_api/mcp/sse"
|
| 230 |
+
}
|
| 231 |
+
}
|
| 232 |
+
}
|
| 233 |
+
```
|
| 234 |
+
|
| 235 |
+
---
|
| 236 |
+
Built with Gradio MCP
|
| 237 |
+
""")
|
| 238 |
+
|
| 239 |
+
return demo
|
| 240 |
+
|
| 241 |
+
|
| 242 |
+
# Create the demo
|
| 243 |
+
demo = create_demo()
|
| 244 |
+
|
| 245 |
+
# Launch with MCP server enabled
|
| 246 |
+
if __name__ == "__main__":
|
| 247 |
+
demo.launch(
|
| 248 |
+
mcp_server=True,
|
| 249 |
+
share=False,
|
| 250 |
+
server_name="0.0.0.0",
|
| 251 |
+
server_port=7860
|
| 252 |
+
)
|
requirements.txt
ADDED
|
@@ -0,0 +1,7 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
gradio>=5.0.0
|
| 2 |
+
huggingface_hub>=0.25.0
|
| 3 |
+
datasets>=3.0.0
|
| 4 |
+
pandas>=2.0.0
|
| 5 |
+
numpy>=1.24.0
|
| 6 |
+
sentence-transformers>=3.0.0
|
| 7 |
+
python-dotenv>=1.0.0
|
tools/__init__.py
ADDED
|
@@ -0,0 +1,20 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
"""MCP Tools for dataset exploration."""
|
| 2 |
+
|
| 3 |
+
from .search import search_datasets, search_by_columns
|
| 4 |
+
from .metadata import get_dataset_info, get_schema
|
| 5 |
+
from .sampling import sample_rows
|
| 6 |
+
from .profiling import get_statistics, profile_quality
|
| 7 |
+
from .discovery import find_similar, suggest_tasks, compare_datasets
|
| 8 |
+
|
| 9 |
+
__all__ = [
|
| 10 |
+
"search_datasets",
|
| 11 |
+
"search_by_columns",
|
| 12 |
+
"get_dataset_info",
|
| 13 |
+
"get_schema",
|
| 14 |
+
"sample_rows",
|
| 15 |
+
"get_statistics",
|
| 16 |
+
"profile_quality",
|
| 17 |
+
"find_similar",
|
| 18 |
+
"suggest_tasks",
|
| 19 |
+
"compare_datasets",
|
| 20 |
+
]
|
tools/discovery.py
ADDED
|
@@ -0,0 +1,384 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
"""Discovery tools for finding similar datasets and suggesting ML tasks."""
|
| 2 |
+
|
| 3 |
+
from typing import Optional, List, Dict, Any
|
| 4 |
+
from utils.hf_client import get_client
|
| 5 |
+
from utils.formatting import format_similar_datasets, format_task_suggestions, format_comparison
|
| 6 |
+
|
| 7 |
+
|
| 8 |
+
# Common ML task patterns based on column names and types
|
| 9 |
+
TASK_PATTERNS = {
|
| 10 |
+
"text-classification": {
|
| 11 |
+
"columns": ["text", "label", "sentence", "review", "comment", "content"],
|
| 12 |
+
"name": "Text Classification",
|
| 13 |
+
"target_hints": ["label", "class", "category", "sentiment", "target"]
|
| 14 |
+
},
|
| 15 |
+
"question-answering": {
|
| 16 |
+
"columns": ["question", "answer", "context", "response"],
|
| 17 |
+
"name": "Question Answering",
|
| 18 |
+
"target_hints": ["answer", "response"]
|
| 19 |
+
},
|
| 20 |
+
"summarization": {
|
| 21 |
+
"columns": ["article", "summary", "document", "highlights", "abstract"],
|
| 22 |
+
"name": "Text Summarization",
|
| 23 |
+
"target_hints": ["summary", "highlights", "abstract"]
|
| 24 |
+
},
|
| 25 |
+
"translation": {
|
| 26 |
+
"columns": ["source", "target", "en", "de", "fr", "es", "translation"],
|
| 27 |
+
"name": "Machine Translation",
|
| 28 |
+
"target_hints": ["target", "translation"]
|
| 29 |
+
},
|
| 30 |
+
"image-classification": {
|
| 31 |
+
"columns": ["image", "label", "img", "photo"],
|
| 32 |
+
"name": "Image Classification",
|
| 33 |
+
"target_hints": ["label", "class", "category"]
|
| 34 |
+
},
|
| 35 |
+
"named-entity-recognition": {
|
| 36 |
+
"columns": ["tokens", "ner_tags", "tags", "entities"],
|
| 37 |
+
"name": "Named Entity Recognition",
|
| 38 |
+
"target_hints": ["ner_tags", "tags", "entities", "labels"]
|
| 39 |
+
},
|
| 40 |
+
"token-classification": {
|
| 41 |
+
"columns": ["tokens", "labels", "tags", "pos_tags"],
|
| 42 |
+
"name": "Token Classification",
|
| 43 |
+
"target_hints": ["labels", "tags"]
|
| 44 |
+
},
|
| 45 |
+
"text-generation": {
|
| 46 |
+
"columns": ["prompt", "completion", "input", "output", "instruction"],
|
| 47 |
+
"name": "Text Generation / Instruction Following",
|
| 48 |
+
"target_hints": ["completion", "output", "response"]
|
| 49 |
+
},
|
| 50 |
+
"tabular-classification": {
|
| 51 |
+
"columns": ["target", "label", "class"],
|
| 52 |
+
"name": "Tabular Classification",
|
| 53 |
+
"target_hints": ["target", "label", "class", "y"]
|
| 54 |
+
},
|
| 55 |
+
"tabular-regression": {
|
| 56 |
+
"columns": ["target", "price", "value", "score", "rating"],
|
| 57 |
+
"name": "Tabular Regression",
|
| 58 |
+
"target_hints": ["target", "price", "value", "score", "rating"]
|
| 59 |
+
}
|
| 60 |
+
}
|
| 61 |
+
|
| 62 |
+
|
| 63 |
+
def find_similar(
|
| 64 |
+
dataset_id: str,
|
| 65 |
+
top_k: int = 5
|
| 66 |
+
) -> str:
|
| 67 |
+
"""
|
| 68 |
+
Find datasets similar to a given dataset based on tags, domain, and structure.
|
| 69 |
+
|
| 70 |
+
Use this tool to discover alternative or complementary datasets for your task.
|
| 71 |
+
Similarity is based on shared tags, similar column structures, and domain overlap.
|
| 72 |
+
|
| 73 |
+
Args:
|
| 74 |
+
dataset_id: The dataset to find similar datasets for (e.g., "imdb", "squad")
|
| 75 |
+
top_k: Number of similar datasets to return (1-10, default: 5)
|
| 76 |
+
|
| 77 |
+
Returns:
|
| 78 |
+
List of similar datasets with:
|
| 79 |
+
- Dataset ID and download count
|
| 80 |
+
- Similarity score (0-1)
|
| 81 |
+
- Reason for similarity (shared tags, similar structure, etc.)
|
| 82 |
+
|
| 83 |
+
How similarity is computed:
|
| 84 |
+
- Tag overlap (same task categories, languages, domains)
|
| 85 |
+
- Similar column names and structures
|
| 86 |
+
- Same author/organization
|
| 87 |
+
- Related task types
|
| 88 |
+
"""
|
| 89 |
+
top_k = max(1, min(10, top_k))
|
| 90 |
+
|
| 91 |
+
client = get_client()
|
| 92 |
+
|
| 93 |
+
# Get info about the source dataset
|
| 94 |
+
source_info = client.get_dataset_info(dataset_id)
|
| 95 |
+
if "error" in source_info:
|
| 96 |
+
return f"Error: Could not fetch info for dataset '{dataset_id}': {source_info.get('error')}"
|
| 97 |
+
|
| 98 |
+
source_tags = set(source_info.get('tags', []))
|
| 99 |
+
|
| 100 |
+
# Get schema for structure comparison
|
| 101 |
+
source_schema = client.get_schema(dataset_id)
|
| 102 |
+
source_columns = set(source_schema.get('columns', [])) if "error" not in source_schema else set()
|
| 103 |
+
|
| 104 |
+
# Extract key tags for search
|
| 105 |
+
search_terms = []
|
| 106 |
+
for tag in source_tags:
|
| 107 |
+
if ':' in tag:
|
| 108 |
+
# Task category tags like "task_categories:text-classification"
|
| 109 |
+
if tag.startswith('task_categories:'):
|
| 110 |
+
search_terms.append(tag.split(':')[1])
|
| 111 |
+
elif tag.startswith('language:'):
|
| 112 |
+
search_terms.append(tag.split(':')[1])
|
| 113 |
+
elif len(tag) > 2:
|
| 114 |
+
search_terms.append(tag)
|
| 115 |
+
|
| 116 |
+
# Search for candidates
|
| 117 |
+
candidates = []
|
| 118 |
+
for term in search_terms[:3]: # Use top 3 terms
|
| 119 |
+
results = client.search_datasets(term, limit=20)
|
| 120 |
+
candidates.extend(results)
|
| 121 |
+
|
| 122 |
+
# Remove duplicates and source dataset
|
| 123 |
+
seen = {dataset_id}
|
| 124 |
+
unique_candidates = []
|
| 125 |
+
for ds in candidates:
|
| 126 |
+
if ds['id'] not in seen:
|
| 127 |
+
seen.add(ds['id'])
|
| 128 |
+
unique_candidates.append(ds)
|
| 129 |
+
|
| 130 |
+
# Score candidates
|
| 131 |
+
scored = []
|
| 132 |
+
for ds in unique_candidates[:30]: # Limit processing
|
| 133 |
+
try:
|
| 134 |
+
ds_info = client.get_dataset_info(ds['id'])
|
| 135 |
+
ds_tags = set(ds_info.get('tags', []))
|
| 136 |
+
|
| 137 |
+
# Compute similarity score
|
| 138 |
+
tag_overlap = len(source_tags & ds_tags)
|
| 139 |
+
tag_score = tag_overlap / max(len(source_tags), 1)
|
| 140 |
+
|
| 141 |
+
# Check column similarity
|
| 142 |
+
ds_schema = client.get_schema(ds['id'])
|
| 143 |
+
ds_columns = set(ds_schema.get('columns', [])) if "error" not in ds_schema else set()
|
| 144 |
+
col_overlap = len(source_columns & ds_columns)
|
| 145 |
+
col_score = col_overlap / max(len(source_columns), 1) if source_columns else 0
|
| 146 |
+
|
| 147 |
+
# Combined score
|
| 148 |
+
similarity = (tag_score * 0.6) + (col_score * 0.4)
|
| 149 |
+
|
| 150 |
+
# Determine reason
|
| 151 |
+
reasons = []
|
| 152 |
+
if tag_overlap > 0:
|
| 153 |
+
common_tags = list(source_tags & ds_tags)[:3]
|
| 154 |
+
reasons.append(f"Shared tags: {', '.join(common_tags)}")
|
| 155 |
+
if col_overlap > 0:
|
| 156 |
+
common_cols = list(source_columns & ds_columns)[:3]
|
| 157 |
+
reasons.append(f"Similar columns: {', '.join(common_cols)}")
|
| 158 |
+
if ds_info.get('author') == source_info.get('author'):
|
| 159 |
+
reasons.append("Same author")
|
| 160 |
+
similarity += 0.1
|
| 161 |
+
|
| 162 |
+
if similarity > 0.1:
|
| 163 |
+
scored.append({
|
| 164 |
+
"id": ds['id'],
|
| 165 |
+
"downloads": ds.get('downloads', 0),
|
| 166 |
+
"similarity_score": min(1.0, similarity),
|
| 167 |
+
"reason": "; ".join(reasons) if reasons else "Related domain"
|
| 168 |
+
})
|
| 169 |
+
except Exception:
|
| 170 |
+
continue
|
| 171 |
+
|
| 172 |
+
# Sort by similarity and return top_k
|
| 173 |
+
scored.sort(key=lambda x: x['similarity_score'], reverse=True)
|
| 174 |
+
return format_similar_datasets(scored[:top_k])
|
| 175 |
+
|
| 176 |
+
|
| 177 |
+
def suggest_tasks(dataset_id: str) -> str:
    """
    Analyze a dataset and suggest suitable machine learning tasks.

    Matches the dataset's column names and Hub tags against known task
    patterns (``TASK_PATTERNS``) and ranks the matches by confidence.

    Args:
        dataset_id: The dataset to analyze (e.g., "imdb", "squad", "cnn_dailymail").

    Returns:
        Formatted list of suggested ML tasks, each with:
        - Task name and confidence level (high/medium/low)
        - Reasoning for the suggestion
        - Recommended target column
        - Recommended feature columns

        If no pattern matches, falls back to generic text/numeric suggestions
        derived from the schema's feature types. At most five suggestions are
        returned.
    """
    client = get_client()

    schema = client.get_schema(dataset_id)
    if "error" in schema:
        return format_task_suggestions({"error": f"Could not load schema: {schema['error']}"})

    lowered_cols = [name.lower() for name in schema.get('columns', [])]
    feature_types = schema.get('features', {})

    info = client.get_dataset_info(dataset_id)
    lowered_tags = [] if "error" in info else [tag.lower() for tag in info.get('tags', [])]

    found: List[Dict[str, Any]] = []

    for task_id, pattern in TASK_PATTERNS.items():
        # Substring match of pattern column hints against the actual columns.
        hint_cols = [hint.lower() for hint in pattern['columns']]
        col_hits = [name for name in lowered_cols if any(hint in name for hint in hint_cols)]
        # A tag counts if the task id appears anywhere inside it.
        tag_hit = any(task_id in tag for tag in lowered_tags)

        if not (col_hits or tag_hit):
            continue

        # Confidence: both signals -> high; either strong signal -> medium.
        if tag_hit and len(col_hits) >= 2:
            confidence = "high"
        elif tag_hit or len(col_hits) >= 2:
            confidence = "medium"
        else:
            confidence = "low"

        # First column whose name contains one of the pattern's target hints.
        target_hints = [hint.lower() for hint in pattern['target_hints']]
        target = next(
            (name for name in lowered_cols if any(hint in name for hint in target_hints)),
            None,
        )

        why = []
        if col_hits:
            why.append(f"Found columns: {', '.join(col_hits[:3])}")
        if tag_hit:
            why.append("Dataset tags indicate this task")

        found.append({
            "name": pattern['name'],
            "confidence": confidence,
            "reason": "; ".join(why),
            "target_column": target,
            # Everything except the target, capped at five columns.
            "feature_columns": [name for name in lowered_cols if name != target][:5],
        })

    # Stable sort keeps TASK_PATTERNS order within the same confidence tier.
    rank = {"high": 0, "medium": 1, "low": 2}
    found.sort(key=lambda item: rank.get(item['confidence'], 3))

    if not found:
        # Fallback: infer generic suggestions from the declared feature types.
        raw_cols = schema.get('columns', [])

        def type_text(col_name):
            # Stringified HF feature type, lowercased for substring checks.
            return str(feature_types.get(col_name, '')).lower()

        if any('string' in type_text(c) for c in raw_cols):
            found.append({
                "name": "Text Analysis (Generic)",
                "confidence": "low",
                "reason": "Dataset contains text columns",
                "target_column": None,
                "feature_columns": lowered_cols[:5]
            })
        if any('int' in type_text(c) or 'float' in type_text(c) for c in raw_cols):
            found.append({
                "name": "Regression/Classification (Generic)",
                "confidence": "low",
                # Heuristic: last column is often the label in tabular data.
                "reason": "Dataset contains numeric columns",
                "target_column": lowered_cols[-1] if lowered_cols else None,
                "feature_columns": lowered_cols[:-1] if len(lowered_cols) > 1 else lowered_cols
            })

    return format_task_suggestions({
        "dataset_id": dataset_id,
        "tasks": found[:5]  # cap at the five best suggestions
    })
|
| 295 |
+
|
| 296 |
+
|
| 297 |
+
def compare_datasets(
    dataset_a: str,
    dataset_b: str
) -> str:
    """
    Compare two datasets side-by-side to understand their differences.

    Use this tool when deciding between similar datasets or understanding
    how datasets differ in structure, size, and content.

    Args:
        dataset_a: First dataset ID to compare (e.g., "imdb")
        dataset_b: Second dataset ID to compare (e.g., "rotten_tomatoes")

    Returns:
        Comparison table covering downloads, likes, license, column counts,
        author, shared tags, plus lists of common and dataset-specific columns.

    Use cases:
        - Choosing between similar datasets for a task
        - Understanding dataset versions or variants
        - Finding complementary datasets
    """
    client = get_client()

    meta_a = client.get_dataset_info(dataset_a)
    meta_b = client.get_dataset_info(dataset_b)

    # Bail out early with a plain-text error if either lookup failed.
    if "error" in meta_a:
        return f"Error loading dataset A ({dataset_a}): {meta_a.get('error')}"
    if "error" in meta_b:
        return f"Error loading dataset B ({dataset_b}): {meta_b.get('error')}"

    def column_set(ds_id):
        # Empty set when the schema cannot be loaded — comparison still works.
        sch = client.get_schema(ds_id)
        return set() if "error" in sch else set(sch.get('columns', []))

    cols_a = column_set(dataset_a)
    cols_b = column_set(dataset_b)

    def pair(value_a, value_b):
        return {"a": value_a, "b": value_b}

    side_by_side = {
        "Downloads": pair(f"{meta_a.get('downloads', 0):,}", f"{meta_b.get('downloads', 0):,}"),
        "Likes": pair(str(meta_a.get('likes', 0)), str(meta_b.get('likes', 0))),
        "License": pair(meta_a.get('license') or "N/A", meta_b.get('license') or "N/A"),
        "Columns": pair(str(len(cols_a)), str(len(cols_b))),
        "Author": pair(meta_a.get('author') or "N/A", meta_b.get('author') or "N/A"),
    }

    comparison = {
        "dataset_a": dataset_a,
        "dataset_b": dataset_b,
        "comparison": side_by_side,
        "common_columns": list(cols_a & cols_b),
        "unique_to_a": list(cols_a - cols_b),
        "unique_to_b": list(cols_b - cols_a),
    }

    shared_tags = set(meta_a.get('tags', [])) & set(meta_b.get('tags', []))
    if shared_tags:
        # NOTE(review): asymmetric cell — 'a' holds the shared-tag count while
        # 'b' holds up to three tag names; confirm this is the intended layout.
        side_by_side["Common Tags"] = pair(
            str(len(shared_tags)),
            ", ".join(list(shared_tags)[:3])
        )

    return format_comparison(comparison)
|
tools/metadata.py
ADDED
|
@@ -0,0 +1,99 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
"""Metadata tools for getting dataset information and schemas."""
|
| 2 |
+
|
| 3 |
+
from typing import Optional
|
| 4 |
+
from utils.hf_client import get_client
|
| 5 |
+
from utils.formatting import format_dataset_info, format_schema
|
| 6 |
+
|
| 7 |
+
|
| 8 |
+
def get_dataset_info(dataset_id: str) -> str:
    """
    Get detailed information about a specific dataset on Hugging Face Hub.

    Use this tool to learn about a dataset's metadata, including its author,
    download count, license, tags, and a summary of its dataset card/README.

    Args:
        dataset_id: The full dataset identifier (e.g., "squad", "imdb",
            "huggingface/documentation-images", "username/dataset-name")

    Returns:
        Formatted dataset information (author, creation date, downloads,
        likes, license, tags, and a card summary), or a plain-text error
        message when the dataset cannot be fetched.

    Example dataset IDs:
        - "squad" - Stanford Question Answering Dataset
        - "imdb" - IMDB movie reviews for sentiment
        - "cnn_dailymail" - News summarization
        - "imagenet-1k" - Image classification benchmark
    """
    info = get_client().get_dataset_info(dataset_id)

    # Success path first; failures produce a human-readable hint.
    if "error" not in info:
        return format_dataset_info(info)
    return f"Error fetching dataset info: {info['error']}\n\nMake sure the dataset ID is correct and the dataset exists."
|
| 40 |
+
|
| 41 |
+
|
| 42 |
+
def get_schema(
    dataset_id: str,
    config: Optional[str] = None,
    split: str = "train"
) -> str:
    """
    Get the schema (columns and data types) of a dataset.

    Use this tool to understand the structure of a dataset before loading
    samples or performing analysis. Shows all column names and their types,
    plus the dataset's available configurations and splits.

    Args:
        dataset_id: The full dataset identifier (e.g., "squad", "imdb")
        config: Optional dataset configuration name; many datasets ship
            several configs. Leave empty for the default.
        split: The dataset split to examine ("train", "test", "validation").
            Default: "train"

    Returns:
        Formatted schema (column count, names, Hugging Face feature types)
        followed by the list of available configurations. On failure, an
        error message that also lists valid config/split combinations when
        they are known.

    Common feature types:
        - Value(dtype='string') - Text data
        - Value(dtype='int64') - Integer numbers
        - Value(dtype='float32') - Decimal numbers
        - ClassLabel - Categorical labels with names
        - Image / Audio - Media data
        - Sequence - Lists/arrays of values
    """
    client = get_client()

    # Fetched up front so both the error and success paths can list what exists.
    available = client.get_configs_and_splits(dataset_id)

    schema = client.get_schema(dataset_id, config, split)

    if "error" in schema:
        # Build a helpful error that enumerates valid config/split pairs.
        parts = [f"Error getting schema: {schema['error']}\n\n"]
        if available:
            parts.append("Available configurations and splits:\n")
            parts.extend(
                f"- Config '{cfg}': {', '.join(splits)}\n"
                for cfg, splits in available.items()
            )
            parts.append("\nTry specifying a valid config and split.")
        return "".join(parts)

    rendered = format_schema(schema)

    if available:
        sections = [rendered, "\n\n### Available Configurations\n"]
        sections.extend(
            f"- **{cfg}**: {', '.join(splits)}\n"
            for cfg, splits in available.items()
        )
        rendered = "".join(sections)

    return rendered
|
tools/profiling.py
ADDED
|
@@ -0,0 +1,283 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
"""Profiling tools for analyzing dataset statistics and quality."""
|
| 2 |
+
|
| 3 |
+
from typing import Optional, Dict, Any, List
|
| 4 |
+
from utils.hf_client import get_client
|
| 5 |
+
from utils.formatting import format_statistics, format_quality_report
|
| 6 |
+
import json
|
| 7 |
+
|
| 8 |
+
|
| 9 |
+
def get_statistics(
    dataset_id: str,
    config: Optional[str] = None,
    split: str = "train",
    sample_size: int = 1000
) -> str:
    """
    Compute basic statistics for each column in a dataset sample.

    Loads up to ``sample_size`` rows and summarizes every column according to
    the Python type of its first non-null value.

    Args:
        dataset_id: The full dataset identifier (e.g., "squad", "imdb")
        config: Optional dataset configuration name. Leave empty for default.
        split: The dataset split to analyze ("train", "test", "validation"). Default: "train"
        sample_size: Number of rows to sample (clamped to 100-5000, default: 1000).
            Larger samples are more accurate but slower.

    Returns:
        Formatted statistics including:
        - Sampled row count
        - Per-column statistics:
          - Numeric: count, min, max, mean, unique count
          - Text: count, avg/min/max length, unique count, sample values
          - Boolean: true/false counts and true percentage
          - Lists: count and element-count range
          - Nested objects: count and sample keys
          - Other/binary types: type name only

    Notes:
        - Statistics are computed on a sample for efficiency
        - Column type is inferred from the first non-null value observed
        - Binary data columns (images, audio) show type info only
    """
    sample_size = min(5000, max(100, sample_size))  # enforce documented bounds

    client = get_client()
    samples = client.load_sample(
        dataset_id=dataset_id,
        config=config,
        split=split,
        n_rows=sample_size,
    )

    if not samples or "error" in samples[0]:
        failure = samples[0].get('error', 'Unknown error') if samples else 'No data'
        return f"Error loading data for statistics: {failure}"

    def summarize(values):
        """Stats dict for one column's non-null values; None means omit the column."""
        first = values[0]  # type is inferred from the first non-null value

        # bool checked before int/float because bool subclasses int in Python.
        if isinstance(first, bool):
            positives = sum(1 for v in values if v is True)
            return {
                "type": "boolean",
                "count": len(values),
                "true_count": positives,
                "false_count": len(values) - positives,
                "true_pct": (positives / len(values)) * 100
            }

        if isinstance(first, (int, float)):
            nums = [v for v in values if isinstance(v, (int, float))]
            if not nums:
                return None  # defensive guard; values[0] is numeric so this is rare
            return {
                "type": "numeric",
                "count": len(nums),
                "min": min(nums),
                "max": max(nums),
                "mean": sum(nums) / len(nums),
                "unique": len(set(nums))
            }

        if isinstance(first, str):
            sizes = [len(v) for v in values if isinstance(v, str)]
            distinct = set(values)
            return {
                "type": "text",
                "count": len(values),
                "avg_length": sum(sizes) / len(sizes) if sizes else 0,
                "min_length": min(sizes) if sizes else 0,
                "max_length": max(sizes) if sizes else 0,
                "unique": len(distinct),
                "sample_values": list(distinct)[:3]
            }

        if isinstance(first, list):
            sizes = [len(v) for v in values if isinstance(v, list)]
            return {
                "type": "list/sequence",
                "count": len(values),
                "avg_length": sum(sizes) / len(sizes) if sizes else 0,
                "min_length": min(sizes) if sizes else 0,
                "max_length": max(sizes) if sizes else 0
            }

        if isinstance(first, dict):
            return {
                "type": "object/nested",
                "count": len(values),
                "sample_keys": list(first.keys())[:5] if first else []
            }

        # Anything else (PIL images, raw bytes, ...) — report the type only.
        return {
            "type": str(type(first).__name__),
            "count": len(values),
            "note": "Binary/special data type"
        }

    stats = {
        "total_rows": f"~{len(samples):,}+ (sampled)",
        "column_stats": {}
    }

    for col in samples[0].keys():
        present = [row.get(col) for row in samples if row.get(col) is not None]
        if not present:
            stats["column_stats"][col] = {"status": "all null"}
            continue
        summary = summarize(present)
        if summary is not None:
            stats["column_stats"][col] = summary

    return format_statistics(stats)
|
| 142 |
+
|
| 143 |
+
|
| 144 |
+
def profile_quality(
    dataset_id: str,
    config: Optional[str] = None,
    split: str = "train",
    sample_size: int = 500
) -> str:
    """
    Assess the data quality of a dataset and identify potential issues.

    Use this tool to check for common data quality problems like missing values,
    duplicates, imbalanced classes, and constant columns before using a dataset.

    Args:
        dataset_id: The full dataset identifier (e.g., "squad", "imdb")
        config: Optional dataset configuration name. Leave empty for default.
        split: The dataset split to analyze ("train", "test", "validation"). Default: "train"
        sample_size: Number of rows to sample for the quality check
            (clamped to 100-2000, default: 500).

    Returns:
        Data quality report including:
        - Overall quality score (0-100)
        - List of identified issues
        - Per-column quality metrics:
          - Missing value percentage
          - Unique value percentage
          - Specific issues (constant values, high cardinality, etc.)

    Quality checks performed:
        - Missing/null values
        - Duplicate rows
        - Constant columns (single value)
        - Probable ID columns / high-cardinality text columns
        - Class imbalance for categorical columns
    """
    sample_size = max(100, min(2000, sample_size))

    client = get_client()

    # All checks below run on this bounded sample only.
    samples = client.load_sample(
        dataset_id=dataset_id,
        config=config,
        split=split,
        n_rows=sample_size
    )

    if not samples or "error" in samples[0]:
        error_msg = samples[0].get('error', 'Unknown error') if samples else 'No data'
        return format_quality_report({"error": error_msg})

    report: Dict[str, Any] = {
        "dataset_id": dataset_id,
        "sample_size": len(samples),
        "issues": [],
        "column_quality": {},
        "overall_score": 100
    }

    # Duplicate-row check. Deliberately best-effort: rows containing values that
    # json can't serialize just skip the whole check instead of failing the report.
    try:
        row_strings = [json.dumps(row, sort_keys=True, default=str) for row in samples]
        unique_rows = len(set(row_strings))
        duplicate_pct = ((len(samples) - unique_rows) / len(samples)) * 100
        if duplicate_pct > 5:
            report["issues"].append(f"High duplicate rate: {duplicate_pct:.1f}% duplicate rows")
            report["overall_score"] -= min(20, duplicate_pct)
    except Exception:
        pass

    for col in samples[0].keys():
        col_values = [row.get(col) for row in samples]
        # Empty strings are treated as missing alongside None.
        non_null_values = [v for v in col_values if v is not None and v != ""]

        col_quality: Dict[str, Any] = {
            "missing_pct": ((len(samples) - len(non_null_values)) / len(samples)) * 100,
            "issues": []
        }

        # Missing-value checks. BUG FIX: the original tested `> 20` before
        # `> 50`, so the severe (>50%) branch was unreachable; the severe
        # threshold must be checked first.
        if col_quality["missing_pct"] > 50:
            col_quality["issues"].append("High missing rate")
            report["issues"].append(f"Column '{col}' has {col_quality['missing_pct']:.0f}% missing values")
            report["overall_score"] -= 10
        elif col_quality["missing_pct"] > 20:
            col_quality["issues"].append("High missing rate")
            report["overall_score"] -= 5

        if non_null_values:
            # Uniqueness is measured on stringified values so unhashable
            # types (lists, dicts) don't crash the check.
            unique_count = len(set(str(v) for v in non_null_values))
            col_quality["unique_pct"] = (unique_count / len(non_null_values)) * 100

            if unique_count == 1:
                col_quality["issues"].append("Constant value")
                report["issues"].append(f"Column '{col}' has only one unique value")
                report["overall_score"] -= 5
            # Nearly-all-unique columns are probably identifiers, not features.
            elif col_quality["unique_pct"] > 99 and len(non_null_values) > 10:
                col_quality["issues"].append("Possibly ID column")
            elif isinstance(non_null_values[0], str) and unique_count > len(samples) * 0.8:
                col_quality["issues"].append("High cardinality")

            # Class-imbalance check for small categorical domains (<= 20 classes).
            sample_val = non_null_values[0]
            if isinstance(sample_val, (str, int, bool)) and unique_count <= 20:
                value_counts: Dict[str, int] = {}
                for v in non_null_values:
                    key = str(v)
                    value_counts[key] = value_counts.get(key, 0) + 1

                if value_counts:
                    max_count = max(value_counts.values())
                    min_count = min(value_counts.values())
                    # Majority class more than 10x the minority class.
                    if max_count > min_count * 10:
                        col_quality["issues"].append("Class imbalance")
                        # Only penalize the score when the column looks like a label.
                        if "label" in col.lower() or "class" in col.lower():
                            report["issues"].append(f"Significant class imbalance in '{col}'")
                            report["overall_score"] -= 10
        else:
            col_quality["unique_pct"] = 0

        # Collapse the per-column issue list into a display string ("-" when clean).
        col_quality["issues"] = ", ".join(col_quality["issues"]) if col_quality["issues"] else "-"
        report["column_quality"][col] = col_quality

    # Clamp the score to the documented 0-100 range.
    report["overall_score"] = max(0, min(100, report["overall_score"]))

    if not report["issues"]:
        report["issues"].append("No major issues detected")

    return format_quality_report(report)
|
tools/sampling.py
ADDED
|
@@ -0,0 +1,51 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
"""Sampling tools for loading actual data from datasets."""
|
| 2 |
+
|
| 3 |
+
from typing import Optional
|
| 4 |
+
from utils.hf_client import get_client
|
| 5 |
+
from utils.formatting import format_sample
|
| 6 |
+
|
| 7 |
+
|
| 8 |
+
def sample_rows(
    dataset_id: str,
    n_rows: int = 5,
    config: Optional[str] = None,
    split: str = "train",
    random_seed: Optional[int] = None
) -> str:
    """
    Get a sample of actual rows from a dataset to inspect the data.

    Use this tool to see real examples from a dataset: what the data looks
    like, the format of each column, and typical values.

    Args:
        dataset_id: The full dataset identifier (e.g., "squad", "imdb")
        n_rows: Number of rows to sample (clamped to 1-20, default: 5).
            Keep small for large datasets.
        config: Optional dataset configuration name. Leave empty for default config.
        split: The dataset split to sample from ("train", "test", "validation"). Default: "train"
        random_seed: Accepted for interface compatibility; see the note below.

    Returns:
        Formatted sample showing actual data rows in JSON format, with each
        row numbered and clearly separated.

    Notes:
        - Large binary data (images, audio) is shown as placeholders
        - Very long text is truncated for readability
        - Use get_schema first to understand column types before sampling
    """
    # Clamp to the supported window of 1..20 rows.
    n_rows = min(20, max(1, n_rows))

    # NOTE(review): random_seed is accepted but never forwarded to
    # load_sample, so the result is always the first n_rows of the split —
    # confirm whether the client is meant to support seeded random sampling.
    fetched = get_client().load_sample(
        dataset_id=dataset_id,
        config=config,
        split=split,
        n_rows=n_rows,
    )

    return format_sample(fetched, dataset_id)
|
tools/search.py
ADDED
|
@@ -0,0 +1,116 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
"""Search tools for finding datasets on Hugging Face Hub."""
|
| 2 |
+
|
| 3 |
+
from typing import Optional, List
|
| 4 |
+
from utils.hf_client import get_client
|
| 5 |
+
from utils.formatting import format_dataset_list
|
| 6 |
+
from huggingface_hub import list_datasets
|
| 7 |
+
|
| 8 |
+
|
| 9 |
+
def search_datasets(
    query: str,
    limit: int = 10,
    filter_task: Optional[str] = None,
    sort_by: str = "downloads"
) -> str:
    """
    Search for datasets on Hugging Face Hub by keyword, task, or domain.

    Use this tool to find datasets matching specific criteria. You can search
    by name or description, and optionally filter by ML task category.

    Args:
        query: Search query string (e.g., "sentiment analysis", "medical")
        limit: Maximum number of results to return (clamped to 1-50, default: 10)
        filter_task: Optional ML task filter (e.g., "text-classification",
            "image-classification", "question-answering", "summarization")
        sort_by: Sort results by "downloads", "likes", or "created"
            (default: "downloads")

    Returns:
        Formatted list of matching datasets with IDs, download counts, and tags.

    Example queries:
        - "sentiment" - Find sentiment analysis datasets
        - "medical imaging" - Find medical image datasets
        - "french translation" - Find French translation datasets
    """
    limit = min(50, max(1, limit))  # keep within the 1..50 window

    results = get_client().search_datasets(
        query=query,
        limit=limit,
        filter_task=filter_task,
        sort=sort_by,
    )

    return format_dataset_list(results)
|
| 47 |
+
|
| 48 |
+
|
| 49 |
+
def search_by_columns(
    column_names: List[str],
    data_types: Optional[List[str]] = None,
    limit: int = 10
) -> str:
    """
    Find datasets that contain specific column names or data types.

    Use this tool when you need datasets with particular features or columns,
    such as finding all datasets with a "label" column or "image" type.

    Args:
        column_names: List of column names to search for (e.g., ["text", "label"], ["image", "caption"])
        data_types: Optional list of data types to filter by (e.g., ["Image", "Audio", "ClassLabel"])
        limit: Maximum number of results to return (1-30, default: 10)

    Returns:
        Formatted list of datasets containing the specified columns/types.

    Common column patterns:
        - Text classification: ["text", "label"]
        - Image classification: ["image", "label"]
        - Question answering: ["question", "answer", "context"]
        - Translation: ["source", "target"] or ["en", "fr"]
    """
    limit = max(1, min(30, limit))

    # Build search query from column names
    search_query = " ".join(column_names)

    # Over-fetch so that schema-based filtering can still fill `limit` slots.
    client = get_client()
    all_datasets = client.search_datasets(query=search_query, limit=limit * 3)

    # Filter by actually checking schemas (best effort)
    matching_datasets = []
    for ds in all_datasets:
        if len(matching_datasets) >= limit:
            break

        try:
            schema = client.get_schema(ds['id'])
        except Exception:
            # Schema lookup is best-effort; skip datasets that fail to load.
            continue
        if "error" in schema:
            continue

        columns = schema.get('columns', [])
        columns_lower = [c.lower() for c in columns]

        # Check if any requested columns match (case-insensitive)
        matches = sum(1 for col in column_names if col.lower() in columns_lower)
        if matches == 0:
            continue

        # Apply the optional dtype filter: keep the dataset only if at least
        # one feature type mentions a requested type. (Previously the
        # `data_types` argument was accepted but silently ignored.)
        if data_types:
            feature_types = " ".join(
                str(t) for t in schema.get('features', {}).values()
            ).lower()
            if not any(dt.lower() in feature_types for dt in data_types):
                continue

        ds['matched_columns'] = matches
        ds['total_columns'] = len(columns)
        matching_datasets.append(ds)

    if not matching_datasets:
        return f"No datasets found with columns matching: {', '.join(column_names)}\n\nTry broader search terms or check column naming conventions."

    # Format results
    lines = [f"## Datasets with columns: {', '.join(column_names)}\n"]
    for i, ds in enumerate(matching_datasets, 1):
        downloads = ds.get('downloads')
        # The ',' format spec raises on the 'N/A' fallback (and on None);
        # only apply it to real numbers.
        downloads_str = f"{downloads:,}" if isinstance(downloads, int) else "N/A"
        lines.append(f"### {i}. {ds['id']}")
        lines.append(f"- Matched columns: {ds.get('matched_columns', 'N/A')}/{len(column_names)}")
        lines.append(f"- Total columns: {ds.get('total_columns', 'N/A')}")
        lines.append(f"- Downloads: {downloads_str}")
        lines.append("")

    return "\n".join(lines)
|
utils/__init__.py
ADDED
|
@@ -0,0 +1,28 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
"""Utility modules for dataview-mcp."""
|
| 2 |
+
|
| 3 |
+
from .hf_client import get_client, HFDatasetClient
|
| 4 |
+
from .formatting import (
|
| 5 |
+
format_dataset_list,
|
| 6 |
+
format_dataset_info,
|
| 7 |
+
format_schema,
|
| 8 |
+
format_sample,
|
| 9 |
+
format_statistics,
|
| 10 |
+
format_quality_report,
|
| 11 |
+
format_comparison,
|
| 12 |
+
format_similar_datasets,
|
| 13 |
+
format_task_suggestions,
|
| 14 |
+
)
|
| 15 |
+
|
| 16 |
+
__all__ = [
|
| 17 |
+
"get_client",
|
| 18 |
+
"HFDatasetClient",
|
| 19 |
+
"format_dataset_list",
|
| 20 |
+
"format_dataset_info",
|
| 21 |
+
"format_schema",
|
| 22 |
+
"format_sample",
|
| 23 |
+
"format_statistics",
|
| 24 |
+
"format_quality_report",
|
| 25 |
+
"format_comparison",
|
| 26 |
+
"format_similar_datasets",
|
| 27 |
+
"format_task_suggestions",
|
| 28 |
+
]
|
utils/formatting.py
ADDED
|
@@ -0,0 +1,199 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
"""Formatting utilities for MCP tool outputs."""
|
| 2 |
+
|
| 3 |
+
import json
|
| 4 |
+
from typing import Any, Dict, List
|
| 5 |
+
|
| 6 |
+
|
| 7 |
+
def format_dataset_list(datasets: List[Dict[str, Any]]) -> str:
    """Format a list of datasets for display.

    Args:
        datasets: Dataset dicts with at least an 'id' key; 'downloads',
            'likes' and 'tags' are optional and may be None.

    Returns:
        Markdown summary, or "No datasets found." for an empty list.
    """
    if not datasets:
        return "No datasets found."

    lines = ["## Datasets Found\n"]
    for i, ds in enumerate(datasets, 1):
        downloads = ds.get('downloads')
        # The ',' format spec raises ValueError on the 'N/A' fallback string
        # (and TypeError on None); only apply it to real numbers.
        downloads_str = f"{downloads:,}" if isinstance(downloads, int) else "N/A"
        lines.append(f"### {i}. {ds['id']}")
        lines.append(f"- Downloads: {downloads_str}")
        lines.append(f"- Likes: {ds.get('likes', 'N/A')}")
        if ds.get('tags'):
            lines.append(f"- Tags: {', '.join(ds['tags'][:5])}")
        lines.append("")

    return "\n".join(lines)
|
| 22 |
+
|
| 23 |
+
|
| 24 |
+
def format_dataset_info(info: Dict[str, Any]) -> str:
    """Render dataset metadata (and an optional card summary) as Markdown."""
    out = [
        f"## Dataset: {info['id']}\n",
        f"- **Author**: {info.get('author', 'N/A')}",
        f"- **Downloads**: {info.get('downloads', 0):,}",
        f"- **Likes**: {info.get('likes', 0)}",
        f"- **License**: {info.get('license', 'N/A')}",
    ]

    tags = info.get('tags')
    if tags:
        out.append(f"- **Tags**: {', '.join(tags[:10])}")

    summary = info.get('card_summary')
    if summary:
        out.append("\n### Dataset Card (Summary)")
        # Keep long cards readable by truncating to 1500 characters.
        out.append(summary[:1500] + "..." if len(summary) > 1500 else summary)

    return "\n".join(out)
|
| 40 |
+
|
| 41 |
+
|
| 42 |
+
def format_schema(schema: Dict[str, Any]) -> str:
    """Render a schema dict as a Markdown table of column names and types."""
    if "error" in schema:
        return f"Error: {schema['error']}"

    header = [
        "## Dataset Schema\n",
        f"**Number of columns**: {schema.get('num_columns', 'N/A')}\n",
        "### Columns\n",
        "| Column | Type |",
        "|--------|------|",
    ]
    rows = [
        f"| `{name}` | {dtype} |"
        for name, dtype in schema.get('features', {}).items()
    ]
    return "\n".join(header + rows)
|
| 57 |
+
|
| 58 |
+
|
| 59 |
+
def format_sample(samples: List[Dict[str, Any]], dataset_id: str) -> str:
    """Render sample rows as numbered, fenced JSON blocks."""
    if not samples:
        return "No samples available."

    if "error" in samples[0]:
        return f"Error loading samples: {samples[0]['error']}"

    parts = [
        f"## Sample from `{dataset_id}`\n",
        f"Showing {len(samples)} row(s):\n",
    ]
    for idx, record in enumerate(samples, 1):
        # Cap each rendered row at 1000 characters to keep output compact.
        body = json.dumps(record, indent=2, default=str, ensure_ascii=False)[:1000]
        parts.extend([f"### Row {idx}", "```json", body, "```\n"])

    return "\n".join(parts)
|
| 77 |
+
|
| 78 |
+
|
| 79 |
+
def format_statistics(stats: Dict[str, Any]) -> str:
    """Format statistics for display.

    Args:
        stats: Dict with optional 'total_rows' (int) and 'column_stats'
            (column name -> {stat name: value}), or an 'error' key
            describing a failure.

    Returns:
        Markdown statistics report, or an error message.
    """
    if "error" in stats:
        return f"Error: {stats['error']}"

    total = stats.get('total_rows')
    # The ',' format spec raises on the 'N/A' fallback string (and on None);
    # only apply it to a real row count.
    total_str = f"{total:,}" if isinstance(total, int) else "N/A"

    lines = ["## Dataset Statistics\n"]
    lines.append(f"**Total rows**: {total_str}\n")

    if stats.get('column_stats'):
        lines.append("### Column Statistics\n")
        for col, col_stats in stats['column_stats'].items():
            lines.append(f"#### `{col}`")
            for key, value in col_stats.items():
                # Round floats for readability; pass everything else through.
                if isinstance(value, float):
                    lines.append(f"- {key}: {value:.2f}")
                else:
                    lines.append(f"- {key}: {value}")
            lines.append("")

    return "\n".join(lines)
|
| 99 |
+
|
| 100 |
+
|
| 101 |
+
def format_quality_report(report: Dict[str, Any]) -> str:
    """Format data quality report for display.

    Args:
        report: Dict with optional 'overall_score' (0-100), 'issues'
            (list of strings) and 'column_quality' (column ->
            {missing_pct, unique_pct, issues}), or an 'error' key.

    Returns:
        Markdown quality report, or an error message.
    """
    if "error" in report:
        return f"Error: {report['error']}"

    lines = ["## Data Quality Report\n"]

    # Overall score with a traffic-light marker. The original ternary
    # compared against three identical empty strings (the emoji characters
    # had been lost in encoding), making the branch dead code.
    if "overall_score" in report:
        score = report['overall_score']
        emoji = "🟢" if score >= 80 else "🟡" if score >= 60 else "🔴"
        lines.append(f"**Overall Quality Score**: {emoji} {score}/100\n")

    # Issues
    if report.get('issues'):
        lines.append("### Issues Found\n")
        for issue in report['issues']:
            lines.append(f"- {issue}")
        lines.append("")

    # Column-level quality
    if report.get('column_quality'):
        lines.append("### Column Quality\n")
        lines.append("| Column | Missing % | Unique % | Issues |")
        lines.append("|--------|-----------|----------|--------|")
        for col, quality in report['column_quality'].items():
            missing = quality.get('missing_pct', 0)
            unique = quality.get('unique_pct', 0)
            issues = quality.get('issues', '-')
            lines.append(f"| `{col}` | {missing:.1f}% | {unique:.1f}% | {issues} |")

    return "\n".join(lines)
|
| 133 |
+
|
| 134 |
+
|
| 135 |
+
def format_comparison(comparison: Dict[str, Any]) -> str:
    """Render a two-dataset comparison as a Markdown table plus column notes."""
    if "error" in comparison:
        return f"Error: {comparison['error']}"

    out = [
        "## Dataset Comparison\n",
        f"Comparing **{comparison['dataset_a']}** vs **{comparison['dataset_b']}**\n",
        "| Aspect | Dataset A | Dataset B |",
        "|--------|-----------|-----------|",
    ]

    for aspect, values in comparison.get('comparison', {}).items():
        out.append(f"| {aspect} | {values.get('a', 'N/A')} | {values.get('b', 'N/A')} |")

    common = comparison.get('common_columns')
    if common:
        out.append(f"\n**Common columns**: {', '.join(common)}")

    only_a = comparison.get('unique_to_a')
    if only_a:
        out.append(f"**Unique to A**: {', '.join(only_a)}")

    only_b = comparison.get('unique_to_b')
    if only_b:
        out.append(f"**Unique to B**: {', '.join(only_b)}")

    return "\n".join(out)
|
| 159 |
+
|
| 160 |
+
|
| 161 |
+
def format_similar_datasets(similar: List[Dict[str, Any]]) -> str:
    """Format similar datasets list for display.

    Args:
        similar: Dataset dicts with 'id' plus optional
            'similarity_score', 'downloads' and 'reason'.

    Returns:
        Markdown list, or "No similar datasets found." for empty input.
    """
    if not similar:
        return "No similar datasets found."

    lines = ["## Similar Datasets\n"]

    for i, ds in enumerate(similar, 1):
        score = ds.get('similarity_score', 0)
        downloads = ds.get('downloads')
        # The ',' format spec raises on the 'N/A' fallback (and on None);
        # only apply it to real numbers.
        downloads_str = f"{downloads:,}" if isinstance(downloads, int) else "N/A"
        lines.append(f"### {i}. {ds['id']} (similarity: {score:.2f})")
        lines.append(f"- Downloads: {downloads_str}")
        if ds.get('reason'):
            lines.append(f"- Why similar: {ds['reason']}")
        lines.append("")

    return "\n".join(lines)
|
| 177 |
+
|
| 178 |
+
|
| 179 |
+
def format_task_suggestions(suggestions: Dict[str, Any]) -> str:
    """Format ML task suggestions for display.

    Args:
        suggestions: Dict with 'dataset_id' and 'tasks' (each task has
            'name' plus optional confidence/reason/target_column/
            feature_columns), or an 'error' key.

    Returns:
        Markdown list of suggested tasks, or an error message.
    """
    if "error" in suggestions:
        return f"Error: {suggestions['error']}"

    lines = [f"## Suggested ML Tasks for `{suggestions.get('dataset_id', 'dataset')}`\n"]

    if suggestions.get('tasks'):
        for i, task in enumerate(suggestions['tasks'], 1):
            confidence = task.get('confidence', 'medium')
            # Traffic-light marker per confidence level. The original ternary
            # compared three identical empty strings (the emoji characters
            # had been lost in encoding), making the branch dead code.
            emoji = "🟢" if confidence == 'high' else "🟡" if confidence == 'medium' else "🔴"
            lines.append(f"### {i}. {task['name']} {emoji}")
            lines.append(f"- **Confidence**: {confidence}")
            lines.append(f"- **Reason**: {task.get('reason', 'Based on dataset structure')}")
            if task.get('target_column'):
                lines.append(f"- **Target column**: `{task['target_column']}`")
            if task.get('feature_columns'):
                lines.append(f"- **Feature columns**: {', '.join(f'`{c}`' for c in task['feature_columns'][:5])}")
            lines.append("")

    return "\n".join(lines)
|
utils/hf_client.py
ADDED
|
@@ -0,0 +1,164 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
"""Hugging Face API client wrapper for dataset operations."""
|
| 2 |
+
|
| 3 |
+
import os
|
| 4 |
+
from typing import Optional, List, Dict, Any
|
| 5 |
+
from huggingface_hub import HfApi, list_datasets, DatasetCard
|
| 6 |
+
from datasets import load_dataset, get_dataset_config_names, get_dataset_split_names
|
| 7 |
+
from dotenv import load_dotenv
|
| 8 |
+
|
| 9 |
+
load_dotenv()
|
| 10 |
+
|
| 11 |
+
|
| 12 |
+
class HFDatasetClient:
    """Client for interacting with Hugging Face datasets.

    Thin wrapper around ``huggingface_hub`` and ``datasets`` that returns
    plain, JSON-serializable dicts suitable for MCP tool output. All
    network failures in the helper methods are converted to fallback
    values or ``{"error": ...}`` payloads rather than raised.
    """

    def __init__(self, token: Optional[str] = None):
        # Explicit token wins; otherwise fall back to the HF_TOKEN env var
        # (loaded from .env via load_dotenv at module import). A token is
        # optional but raises API rate limits.
        self.token = token or os.getenv("HF_TOKEN")
        self.api = HfApi(token=self.token)

    def search_datasets(
        self,
        query: str,
        limit: int = 10,
        filter_task: Optional[str] = None,
        sort: str = "downloads"
    ) -> List[Dict[str, Any]]:
        """Search for datasets on Hugging Face Hub.

        Args:
            query: Free-text search string passed to the Hub.
            limit: Maximum number of results to fetch.
            filter_task: Optional task-category filter (Hub taxonomy).
            sort: Hub sort key, e.g. "downloads" or "likes".

        Returns:
            List of dicts with id/downloads/likes/tags/created_at.
        """
        datasets = list(list_datasets(
            search=query,
            limit=limit,
            sort=sort,
            task_categories=filter_task if filter_task else None
        ))

        return [
            {
                "id": ds.id,
                "downloads": ds.downloads,
                "likes": ds.likes,
                # Keep only the first few tags to keep payloads small.
                "tags": ds.tags[:5] if ds.tags else [],
                # datetimes are stringified so results stay JSON-serializable.
                "created_at": str(ds.created_at) if ds.created_at else None,
            }
            for ds in datasets
        ]

    def get_dataset_info(self, dataset_id: str) -> Dict[str, Any]:
        """Get detailed information about a dataset.

        Returns Hub metadata plus a truncated dataset-card summary when a
        card can be loaded.
        """
        info = self.api.dataset_info(dataset_id)

        # Try to get the dataset card; it is optional, so any failure
        # (missing card, network error) is deliberately ignored.
        card_content = None
        try:
            card = DatasetCard.load(dataset_id)
            card_content = card.text[:2000] if card.text else None  # Limit card size
        except Exception:
            pass

        return {
            "id": info.id,
            "author": info.author,
            "downloads": info.downloads,
            "likes": info.likes,
            "tags": info.tags,
            # Not every hub_hub version exposes .license; default to None.
            "license": getattr(info, 'license', None),
            "created_at": str(info.created_at) if info.created_at else None,
            "last_modified": str(info.last_modified) if info.last_modified else None,
            "card_summary": card_content,
        }

    def get_configs_and_splits(self, dataset_id: str) -> Dict[str, List[str]]:
        """Get available configs and splits for a dataset.

        Falls back to ["default"] / ["train"] when metadata cannot be
        resolved, so callers always get a usable mapping.

        NOTE(review): trust_remote_code=True runs Hub-provided loading
        scripts; acceptable only in a sandboxed Space.
        """
        try:
            configs = get_dataset_config_names(dataset_id, trust_remote_code=True)
        except Exception:
            configs = ["default"]

        result = {}
        for config in configs[:3]:  # Limit to first 3 configs
            try:
                splits = get_dataset_split_names(dataset_id, config, trust_remote_code=True)
                result[config] = splits
            except Exception:
                result[config] = ["train"]

        return result

    def load_sample(
        self,
        dataset_id: str,
        config: Optional[str] = None,
        split: str = "train",
        n_rows: int = 5,
        streaming: bool = True
    ) -> List[Dict[str, Any]]:
        """Load a sample of rows from a dataset.

        Streaming (the default) avoids downloading the full dataset. On
        any failure a single-element list ``[{"error": message}]`` is
        returned instead of raising.
        """
        try:
            ds = load_dataset(
                dataset_id,
                config,
                split=split,
                streaming=streaming,
                trust_remote_code=True
            )

            if streaming:
                samples = []
                for i, row in enumerate(ds):
                    if i >= n_rows:
                        break
                    # Convert row to serializable format
                    samples.append(self._serialize_row(row))
                return samples
            else:
                return [self._serialize_row(row) for row in ds.select(range(min(n_rows, len(ds))))]
        except Exception as e:
            return [{"error": str(e)}]

    def get_schema(self, dataset_id: str, config: Optional[str] = None, split: str = "train") -> Dict[str, Any]:
        """Get the schema/features of a dataset.

        Returns {"columns", "features", "num_columns"} on success or
        {"error": message} on failure.
        """
        try:
            ds = load_dataset(
                dataset_id,
                config,
                split=split,
                streaming=True,
                trust_remote_code=True
            )

            features = ds.features
            schema = {}
            for name, feature in features.items():
                # Feature objects are not JSON-serializable; keep their
                # string representation (e.g. "Value(dtype='string')").
                schema[name] = str(feature)

            return {
                "columns": list(features.keys()),
                "features": schema,
                "num_columns": len(features)
            }
        except Exception as e:
            return {"error": str(e)}

    def _serialize_row(self, row: Dict[str, Any]) -> Dict[str, Any]:
        """Convert a row to JSON-serializable format.

        Values that cannot be serialized directly are replaced with short
        placeholder strings describing their type/size.
        """
        result = {}
        for key, value in row.items():
            if hasattr(value, 'tolist'):  # numpy array
                result[key] = value.tolist()
            elif hasattr(value, '__dict__'):  # PIL Image or similar
                result[key] = f"<{type(value).__name__}>"
            elif isinstance(value, bytes):
                result[key] = f"<bytes: {len(value)} bytes>"
            else:
                result[key] = value
        return result
|
| 154 |
+
|
| 155 |
+
|
| 156 |
+
# Singleton instance shared by all MCP tools so the HF token and HfApi
# are initialized once per process.
_client = None


def get_client() -> HFDatasetClient:
    """Get or create the HF client singleton.

    Lazily constructs HFDatasetClient on first call; later calls return
    the same instance. Not thread-safe, which is fine for the single
    Gradio worker this Space runs in.
    """
    global _client
    if _client is None:
        _client = HFDatasetClient()
    return _client
|