Initial README for Internal RAG CX Data Preprocessing Demo
README.md
ADDED
---
tags: [model]
---
# Internal RAG CX Data Preprocessing Demo

A data preprocessing pipeline for Retrieval-Augmented Generation (RAG) and Context-Augmented Generation (CAG) systems, hosted on Hugging Face as a Model repository (free tier). Drawing on more than five years of AI work (since 2020), this demo cleans and prepares call center datasets for enterprise-grade CX applications in SaaS, HealthTech, FinTech, and eCommerce. It uses Pandas for data wrangling to produce high-quality FAQs for downstream RAG/CAG pipelines, and its CSV output can be used for modeling on Amazon SageMaker and Azure AI.

## Technical Architecture

### Data Preprocessing Pipeline

The core of this demo is a data preprocessing pipeline that cleans raw call center datasets:

- **Data Ingestion**:
  - Parses CSVs with `pd.read_csv`, wrapping embedded data in `io.StringIO` and setting `quotechar` and `escapechar` explicitly so that quoted fields containing commas or escaped quotes parse correctly.
  - Expects the columns `call_id`, `question`, `answer`, and `language`.

- **Junk Data Cleanup**:
  - **Null Handling**: Drops rows with missing `question` or `answer` using `df.dropna(subset=['question', 'answer'])`, so rows that are only missing `language` survive to the standardization step.
  - **Duplicate Removal**: Eliminates repeated questions via `df[~df['question'].duplicated()]`, keeping the first occurrence.
  - **Short Entry Filtering**: Excludes questions shorter than 10 characters or answers shorter than 20 characters with `df[(df['question'].str.len() >= 10) & (df['answer'].str.len() >= 20)]`.
  - **Malformed Detection**: Uses the regex `[!?]{2,}|(Invalid|N/A)` to filter out invalid questions, that is, questions containing a run of two or more `!` or `?` characters or the literal markers `Invalid` or `N/A`.
  - **Standardization**: Normalizes text (e.g., "mo" to "month") and fills missing `language` with "en".

- **Output**:
  - Generates `cleaned_call_center_faqs.csv` for downstream modeling.
  - Provides cleanup stats: nulls removed, duplicates removed, short entries filtered, malformed entries detected.
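
The ingestion, cleanup, and output steps above can be sketched end to end as follows. This is a minimal illustration, not the code in `app.py`: the sample rows, the names `RAW_CSV` and `clean_faqs`, and the choice to apply the "mo" to "month" rule to the `answer` column are all assumptions. The regex uses a non-capturing group, which matches the same strings as the pattern quoted above while avoiding pandas' warning about match groups.

```python
import io

import pandas as pd

# Hypothetical embedded sample; the real demo dataset is not shown in this README.
RAW_CSV = '''call_id,question,answer,language
1,"How do I reset my password?","Open Settings, choose Security, then select Reset Password.",en
2,"How do I reset my password?","Use the Reset Password link on the login page to start over.",en
3,,"This answer has no question attached to it at all.",en
4,"What does the plan cost?","The standard plan costs $20 per mo with annual billing.",
5,"Refund?","Refunds are processed within five business days.",en
6,"Why is this broken???","Please contact support so we can investigate the issue.",en
'''

# Same matches as `[!?]{2,}|(Invalid|N/A)`, written with a non-capturing group.
MALFORMED_PATTERN = r"[!?]{2,}|(?:Invalid|N/A)"


def clean_faqs(raw_csv: str) -> tuple[pd.DataFrame, dict[str, int]]:
    """Apply the cleanup steps in the order listed above and count removals."""
    # Ingestion: parse embedded text with explicit quoting and escaping rules.
    df = pd.read_csv(io.StringIO(raw_csv), quotechar='"', escapechar="\\")
    stats: dict[str, int] = {}

    # Null handling: only question/answer are required at this point.
    before = len(df)
    df = df.dropna(subset=["question", "answer"])
    stats["nulls"] = before - len(df)

    # Duplicate removal: keep the first occurrence of each question.
    before = len(df)
    df = df[~df["question"].duplicated()]
    stats["duplicates"] = before - len(df)

    # Short entry filtering: question >= 10 chars and answer >= 20 chars.
    before = len(df)
    df = df[(df["question"].str.len() >= 10) & (df["answer"].str.len() >= 20)]
    stats["short"] = before - len(df)

    # Malformed detection: drop questions matching the junk pattern.
    before = len(df)
    df = df[~df["question"].str.contains(MALFORMED_PATTERN, regex=True)]
    stats["malformed"] = before - len(df)

    # Standardization: expand "mo" (assumed: in answers) and default the language.
    df = df.copy()
    df["answer"] = df["answer"].str.replace(r"\bmo\b", "month", regex=True)
    df["language"] = df["language"].fillna("en")
    return df, stats


if __name__ == "__main__":
    cleaned, cleanup_stats = clean_faqs(RAW_CSV)
    cleaned.to_csv("cleaned_call_center_faqs.csv", index=False)
    print(cleanup_stats)
```

Counting each category as the difference in row count before and after its filter keeps the stats consistent with the order of the steps: a row that is both a duplicate and too short is attributed to whichever filter runs first.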

### Enterprise-Grade Modeling Compatibility

The cleaned dataset is suitable for:

- **Amazon SageMaker**: Training BERT-based models (e.g., `bert-base-uncased`) for intent classification or FAQ retrieval, deployable via SageMaker JumpStart.
- **Azure AI**: Azure Machine Learning pipelines that fine-tune models such as DistilBERT, with the dataset stored in Azure Blob Storage, enabling scalable CX automation.
- **LLM Integration**: Fine-tuning LLMs (e.g., `distilgpt2`) for generative tasks, with FastAPI as an option for API-driven inference.

## Performance Monitoring and Visualization

The demo includes a performance monitoring suite:

- **Processing Time Tracking**: Measures data ingestion, cleaning, and output times using `time.perf_counter()`, reported in milliseconds.
- **Cleanup Metrics**: Tracks the number of nulls, duplicates, short entries, and malformed entries removed.
- **Visualization**: Uses Matplotlib to plot a bar chart (`cleanup_stats.png`):
  - Bars: Number of entries removed per category (Nulls, Duplicates, Short, Malformed).
  - Palette: Professional muted colors for enterprise aesthetics.
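
A rough sketch of the timing and plotting pieces described above, assuming hypothetical helper names (`timed_ms`, `plot_cleanup_stats`) and an illustrative muted palette; the actual colors and function layout in `app.py` may differ.

```python
import time

import matplotlib

matplotlib.use("Agg")  # headless backend, suitable for a CPU-only server
import matplotlib.pyplot as plt


def timed_ms(func, *args, **kwargs):
    """Run func and return (result, elapsed time in milliseconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    return result, elapsed_ms


def plot_cleanup_stats(stats, path="cleanup_stats.png"):
    """Save a bar chart of entries removed per category and return the file path."""
    labels = list(stats)
    counts = [stats[label] for label in labels]
    # Assumed muted palette; the README does not list the actual colors.
    colors = ["#6b7a8f", "#8fa3ad", "#a9b7a0", "#c2a98b"]
    fig, ax = plt.subplots(figsize=(6, 4))
    ax.bar(labels, counts, color=colors[: len(labels)])
    ax.set_ylabel("Entries removed")
    ax.set_title("Cleanup stats")
    fig.tight_layout()
    fig.savefig(path)
    plt.close(fig)  # release the figure so repeated calls do not accumulate memory
    return path
```

Each pipeline stage can then be wrapped as `result, ms = timed_ms(stage_fn, data)`, and the per-category counts passed to `plot_cleanup_stats({"Nulls": 2, "Duplicates": 1, "Short": 1, "Malformed": 0})`.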

## Gradio Interface for Interactive Demo

The demo is accessible via Gradio, providing an interactive data preprocessing experience:

- **Input**: Upload a sample call center CSV or use the embedded demo dataset.
- **Outputs**:
  - **Cleaned Dataset**: Download `cleaned_call_center_faqs.csv`.
  - **Cleanup Stats**: Detailed breakdown (e.g., “Cleaned FAQs: 6; removed 4 junk entries: 2 nulls, 1 duplicates, 1 short, 0 malformed”).
  - **Performance Plot**: Visual metrics for processing time and cleanup stats.
- **Styling**: Custom dark theme CSS (`#2a2a2a` background, blue buttons) for a sleek, enterprise-ready UI.
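
One way such an interface could be wired up is sketched below. Everything here is illustrative: the handler `preprocess`, the `DEMO_CSV` rows, the CSS selectors, and the button color are assumptions (only the `#2a2a2a` background comes from this README), the handler reproduces just the null check rather than the full cleanup, and the performance plot output is omitted for brevity.

```python
import io
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical stand-in for the embedded demo dataset.
DEMO_CSV = """call_id,question,answer,language
1,How do I update my billing address?,Go to Account then Billing and edit the address on file.,en
2,,This row has no question and should be dropped.,en
"""

# Assumed selectors and button color; only the background value is from the README.
DARK_CSS = """
.gradio-container { background-color: #2a2a2a; }
button { background-color: #1f6feb; color: white; }
"""


def preprocess(csv_path=None):
    """Gradio handler: clean an uploaded CSV (or the demo data) and return (file path, stats text)."""
    source = csv_path if csv_path else io.StringIO(DEMO_CSV)
    df = pd.read_csv(source)
    before = len(df)
    # Stand-in for the full cleanup: only the null check is reproduced here.
    cleaned = df.dropna(subset=["question", "answer"])
    out_path = Path(tempfile.mkdtemp()) / "cleaned_call_center_faqs.csv"
    cleaned.to_csv(out_path, index=False)
    summary = f"Cleaned FAQs: {len(cleaned)}; removed {before - len(cleaned)} junk entries"
    return str(out_path), summary


def build_demo():
    """Assemble the interface; Gradio is imported lazily so preprocess() can be tested without it."""
    import gradio as gr

    return gr.Interface(
        fn=preprocess,
        inputs=gr.File(label="Call center CSV", file_types=[".csv"], type="filepath"),
        outputs=[gr.File(label="Cleaned dataset"), gr.Textbox(label="Cleanup stats")],
        title="Internal RAG CX Data Preprocessing Demo",
        css=DARK_CSS,
    )


if __name__ == "__main__":
    build_demo().launch()
```

Keeping the handler free of Gradio types means the same function could later be reused behind another front end without changes.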

## Setup

- Clone this repository into a public Hugging Face Model repository (free tier).
- Add `requirements.txt` with the dependencies (`gradio==4.44.0`, `pandas==2.2.3`, `matplotlib==3.9.2`, etc.).
- Upload `app.py`, which includes the embedded demo dataset so the app runs without extra files.
- Configure it to run with Python 3.9+ on CPU hardware (no GPU).

## Usage

- **Upload CSV**: Provide a call center CSV in the Gradio UI, or use the default demo dataset.
- **Output**:
  - **Cleaned Dataset**: Download the processed `cleaned_call_center_faqs.csv`.
  - **Cleanup Stats**: “Cleaned FAQs: 6; removed 4 junk entries: 2 nulls, 1 duplicates, 1 short, 0 malformed”.
  - **Performance Plot**: Visual metrics for processing time and cleanup stats.

**Example**:
- **Input CSV**: Sample dataset with 10 FAQs, including 2 nulls, 1 duplicate, 1 short entry.
- **Output**:
  - Cleaned Dataset: 6 FAQs in `cleaned_call_center_faqs.csv`.
  - Cleanup Stats: “Cleaned FAQs: 6; removed 4 junk entries: 2 nulls, 1 duplicates, 1 short, 0 malformed”.
  - Plot: Processing Time (Ingestion: 50ms, Cleaning: 30ms, Output: 10ms), Cleanup Stats (Nulls: 2, Duplicates: 1, Short: 1, Malformed: 0).
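
The arithmetic behind the stats line in this example (10 input rows minus 4 removed leaves 6) can be reproduced with a small helper. The function name `cleanup_summary` is hypothetical, and the string format simply mirrors the example output quoted above.

```python
def cleanup_summary(total_rows: int, removed: dict[str, int]) -> str:
    """Build the stats line from the input row count and per-category removal counts."""
    junk = sum(removed.values())
    parts = ", ".join(f"{count} {label}" for label, count in removed.items())
    return f"Cleaned FAQs: {total_rows - junk}; removed {junk} junk entries: {parts}"


example = cleanup_summary(10, {"nulls": 2, "duplicates": 1, "short": 1, "malformed": 0})
print(example)
# Cleaned FAQs: 6; removed 4 junk entries: 2 nulls, 1 duplicates, 1 short, 0 malformed
```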

## Technical Details

**Stack**:
- **Pandas**: Data wrangling and preprocessing for call center CSVs.
- **Gradio**: Interactive UI for real-time data preprocessing demos.
- **Matplotlib**: Performance visualization with bar charts.
- **FastAPI Compatibility**: Designed with API-driven preprocessing in mind, so the pipeline can be wrapped in FastAPI endpoints for scalable deployments.

**Free Tier Optimization**: Lightweight with CPU-only dependencies, no GPU required.

**Extensibility**: Ready for integration with RAG/CAG pipelines and for cloud deployments on AWS Lambda or Azure Functions.

## Purpose

This demo showcases data preprocessing for AI-driven CX automation, with a focus on call center data quality. Built on more than five years of experience in AI, data engineering, and enterprise-grade deployments, it shows how Pandas-based data cleaning prepares call center data for RAG/CAG pipelines.

## Latest Update

**Status Update**: Placeholder update - January 01, 2025 📝
- Placeholder update text.

## Future Enhancements

- **Real-Time Streaming**: Add support for real-time data streaming from Kafka for live preprocessing.
- **FastAPI Deployment**: Expose preprocessing pipeline via FastAPI endpoints for production-grade use.
- **Advanced Validation**: Implement stricter data validation rules using machine learning-based outlier detection.
- **Cloud Integration**: Enhance compatibility with AWS Glue or Azure Data Factory for enterprise data pipelines.

**Website**: https://ghostainews.com/
**Discord**: https://discord.gg/BfA23aYz