ghostai1 committed on
Commit fe7ddde · verified · 1 Parent(s): 0e474e6

Initial README for Internal RAG CX Data Preprocessing Demo

Files changed (1)
  1. README.md +109 -0
README.md ADDED
@@ -0,0 +1,109 @@
1
+ ---
2
+ tags: [model]
3
+ ---
4
+ # Internal RAG CX Data Preprocessing Demo
5
+
6
+ A data preprocessing pipeline for Retrieval-Augmented Generation (RAG) and Context-Augmented Generation (CAG) systems, hosted on Hugging Face as a Model repository (free tier). Built on more than 5 years of AI experience (since 2020), this demo cleans and prepares call center datasets for enterprise CX applications in SaaS, HealthTech, FinTech, and eCommerce. It uses Pandas to produce high-quality FAQs for downstream RAG/CAG pipelines, and its CSV output can be used for modeling on Amazon SageMaker and Azure AI.
7
+
8
+ ## Technical Architecture
9
+
10
+ ### Data Preprocessing Pipeline
11
+
12
+ The core of this demo is a comprehensive data preprocessing pipeline designed to clean raw call center datasets:
13
+
14
+ - **Data Ingestion**:
15
+ - Parses CSVs with `pd.read_csv`, using `io.StringIO` for embedded data, with explicit `quotechar` and `escapechar` to handle complex strings.
16
+ - Handles datasets with columns: `call_id`, `question`, `answer`, `language`.
17
+
18
+ - **Junk Data Cleanup**:
19
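+ - **Null Handling**: Drops rows with missing `question` or `answer` using `df.dropna(subset=['question', 'answer'])`, so rows that only lack `language` are kept for the standardization step.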
20
+ - **Duplicate Removal**: Eliminates redundant FAQs via `df[~df['question'].duplicated()]`.
21
+ - **Short Entry Filtering**: Excludes questions <10 chars or answers <20 chars with `df[(df['question'].str.len() >= 10) & (df['answer'].str.len() >= 20)]`.
22
+ - **Malformed Detection**: Removes questions that match the regex `[!?]{2,}|(Invalid|N/A)`, i.e. runs of two or more `!`/`?` characters or the literal markers `Invalid` and `N/A`.
23
+ - **Standardization**: Normalizes text (e.g., "mo" to "month") and fills missing `language` with "en".
24
+
25
+ - **Output**:
26
+ - Generates `cleaned_call_center_faqs.csv` for downstream modeling.
27
+ - Provides cleanup stats: nulls removed, duplicates removed, short entries filtered, malformed entries detected.
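
The ingestion step described above can be sketched as follows. This is a minimal illustration, not the code from `app.py`: the `EMBEDDED_CSV` sample and the `load_faqs` helper are hypothetical, and only the `pd.read_csv`, `io.StringIO`, `quotechar`, and `escapechar` usage comes from the description.

```python
import io

import pandas as pd

# Hypothetical embedded sample; the real demo dataset is not shown in this README.
EMBEDDED_CSV = r'''call_id,question,answer,language
1,"How do I reset my password?","Open Settings, choose Security, then select Reset Password.",en
2,"What does \"Pro\" cost per mo?","The Pro plan is billed at a flat rate per mo.",
'''


def load_faqs(source=None):
    """Read a call center CSV from a path or file-like object.

    Falls back to the embedded sample when no source is given.
    """
    if source is None:
        source = io.StringIO(EMBEDDED_CSV)
    # Explicit quotechar/escapechar so commas and escaped quotes inside
    # quoted fields are parsed as part of the field.
    return pd.read_csv(source, quotechar='"', escapechar="\\")


df = load_faqs()
print(df.columns.tolist())
print(df.loc[1, "question"])
```

The second row leaves `language` empty on purpose, which is the case the standardization step later fills with "en".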
28
+
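
A possible shape for the junk data cleanup, assuming the steps run in the order they are listed. The `clean_faqs` function, the stats keys, and the sample rows are hypothetical; the regex uses a non-capturing group (`(?:...)`) in place of the capturing one shown above only to avoid a pandas warning, and the "mo" rule is applied as a whole-word replacement, which is also an assumption.

```python
import pandas as pd

# Same pattern as in the README, with a non-capturing group.
MALFORMED = r"[!?]{2,}|(?:Invalid|N/A)"


def clean_faqs(df):
    """Apply the cleanup steps in order and count the rows each step removes."""
    stats = {}

    before = len(df)
    df = df.dropna(subset=["question", "answer"])  # null handling
    stats["nulls"] = before - len(df)

    before = len(df)
    df = df[~df["question"].duplicated()]  # keep first occurrence of each question
    stats["duplicates"] = before - len(df)

    before = len(df)
    df = df[(df["question"].str.len() >= 10) & (df["answer"].str.len() >= 20)]
    stats["short"] = before - len(df)

    before = len(df)
    df = df[~df["question"].str.contains(MALFORMED, regex=True)]
    stats["malformed"] = before - len(df)

    # Standardization: expand the abbreviation "mo" and default the language.
    df = df.copy()
    for col in ("question", "answer"):
        df[col] = df[col].str.replace(r"\bmo\b", "month", regex=True)
    df["language"] = df["language"].fillna("en")
    return df.reset_index(drop=True), stats


# Hypothetical sample rows: one null, one duplicate, one short question.
raw = pd.DataFrame({
    "call_id": [1, 2, 3, 4, 5],
    "question": [
        "How do I reset my password?",
        None,
        "How do I reset my password?",
        "Refund?",
        "What is the price per mo?",
    ],
    "answer": [
        "Open Settings and choose Reset Password.",
        "Contact support for help with this issue.",
        "Use the reset link on the login page.",
        "Refunds are processed within five business days.",
        "The plan costs ten dollars per mo.",
    ],
    "language": ["en", "en", "en", "en", None],
})

cleaned, stats = clean_faqs(raw)
print(stats)
cleaned.to_csv("cleaned_call_center_faqs.csv", index=False)
```

Counting rows before and after each filter keeps the per-category stats independent of each other, at the cost of making them depend on step order: a row that is both null and short is counted only as a null.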
29
+ ### Enterprise-Grade Modeling Compatibility
30
+
31
+ The cleaned dataset is optimized for:
32
+
33
+ - **Amazon SageMaker**: Ready for training BERT-based models (e.g., `bert-base-uncased`) for intent classification or FAQ retrieval, deployable via SageMaker JumpStart.
34
+ - **Azure AI**: Compatible with Azure Machine Learning pipelines for fine-tuning models like DistilBERT, with the cleaned dataset stored in Azure Blob Storage, enabling scalable CX automation.
35
+ - **LLM Integration**: Supports fine-tuning LLMs (e.g., `distilgpt2`) for generative tasks, which can then be served through FastAPI for API-driven inference.
36
+
37
+ ## Performance Monitoring and Visualization
38
+
39
+ The demo includes a performance monitoring suite:
40
+
41
+ - **Processing Time Tracking**: Measures data ingestion, cleaning, and output times using `time.perf_counter()`, reported in milliseconds.
42
+ - **Cleanup Metrics**: Tracks the number of nulls, duplicates, short entries, and malformed entries removed.
43
+ - **Visualization**: Uses Matplotlib to plot a bar chart (`cleanup_stats.png`):
44
+ - Bars: Number of entries removed per category (Nulls, Duplicates, Short, Malformed).
45
+ - Palette: Professional muted colors for enterprise aesthetics.
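
One way to collect the per-stage timings with `time.perf_counter()` is a small context manager. The `timed` helper and the `timings_ms` dictionary are hypothetical names, and the three stages below are stand-ins for the real ingestion, cleaning, and output code.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(stage, sink):
    """Record the wall-clock duration of the enclosed block in milliseconds."""
    start = time.perf_counter()
    try:
        yield
    finally:
        sink[stage] = (time.perf_counter() - start) * 1000.0


timings_ms = {}

# Stand-in workloads; the real stages would read, clean, and write the CSV.
with timed("ingestion", timings_ms):
    rows = [{"question": f"question number {i}"} for i in range(1000)]
with timed("cleaning", timings_ms):
    rows = [r for r in rows if len(r["question"]) >= 10]
with timed("output", timings_ms):
    text = "\n".join(r["question"] for r in rows)

for stage, ms in timings_ms.items():
    print(f"{stage}: {ms:.2f} ms")
```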
46
+
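
A sketch of the bar chart, assuming the stats arrive as a dictionary with the four category counts. The `plot_cleanup_stats` function and the hex colors are hypothetical (the README only says "muted"); the counts are the ones from the example in the Usage section.

```python
import matplotlib

matplotlib.use("Agg")  # headless backend, suitable for CPU-only hosting
import matplotlib.pyplot as plt


def plot_cleanup_stats(stats, path="cleanup_stats.png"):
    """Save a bar chart of entries removed per cleanup category."""
    labels = ["Nulls", "Duplicates", "Short", "Malformed"]
    values = [stats["nulls"], stats["duplicates"], stats["short"], stats["malformed"]]
    colors = ["#6c8ebf", "#82b366", "#d6b656", "#b85450"]  # assumed muted palette

    fig, ax = plt.subplots(figsize=(6, 4))
    ax.bar(labels, values, color=colors)
    ax.set_ylabel("Entries removed")
    ax.set_title("Cleanup stats")
    fig.tight_layout()
    fig.savefig(path, dpi=150)
    plt.close(fig)  # free the figure so repeated runs do not accumulate memory
    return path


out_path = plot_cleanup_stats({"nulls": 2, "duplicates": 1, "short": 1, "malformed": 0})
print(out_path)
```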
47
+ ## Gradio Interface for Interactive Demo
48
+
49
+ The demo is accessible via Gradio, providing an interactive data preprocessing experience:
50
+
51
+ - **Input**: Upload a sample call center CSV or use the embedded demo dataset.
52
+ - **Outputs**:
53
+ - **Cleaned Dataset**: Download `cleaned_call_center_faqs.csv`.
54
+ - **Cleanup Stats**: Detailed breakdown (e.g., “Cleaned FAQs: 6; removed 4 junk entries: 2 nulls, 1 duplicates, 1 short, 0 malformed”).
55
+ - **Performance Plot**: Visual metrics for processing time and cleanup stats.
56
+ - **Styling**: Custom dark theme CSS (`#2a2a2a` background, blue buttons) for a sleek, enterprise-ready UI.
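
The interface could be wired up roughly as below. This is a simplified sketch, not the contents of `app.py`: `process_csv` and `build_demo` are hypothetical names, the handler performs only the null-handling step, and the CSS covers only the background color. Gradio is imported inside `build_demo` so the handler can be exercised without starting a UI.

```python
import os
import tempfile

import pandas as pd


def process_csv(file_path):
    """Clean an uploaded CSV and return (path to cleaned file, summary text)."""
    df = pd.read_csv(file_path)
    before = len(df)
    df = df.dropna(subset=["question", "answer"])  # simplified: nulls only
    out_path = os.path.join(tempfile.mkdtemp(), "cleaned_call_center_faqs.csv")
    df.to_csv(out_path, index=False)
    summary = f"Cleaned FAQs: {len(df)}; removed {before - len(df)} junk entries"
    return out_path, summary


def build_demo():
    """Build the Gradio UI around process_csv."""
    import gradio as gr  # imported lazily; only needed when serving the UI

    return gr.Interface(
        fn=process_csv,
        inputs=gr.File(label="Call center CSV", file_types=[".csv"], type="filepath"),
        outputs=[gr.File(label="Cleaned Dataset"), gr.Textbox(label="Cleanup Stats")],
        title="Internal RAG CX Data Preprocessing Demo",
        css=".gradio-container { background: #2a2a2a; }",
    )


# To serve the UI locally: build_demo().launch()
```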
57
+
58
+ ## Setup
59
+
60
+ - Clone this repository to a Hugging Face Model repository (free tier, public).
61
+ - Add `requirements.txt` with dependencies (`gradio==4.44.0`, `pandas==2.2.3`, `matplotlib==3.9.2`, etc.).
62
+ - Upload `app.py` (includes embedded demo dataset for seamless deployment).
63
+ - Configure to run with Python 3.9+, CPU hardware (no GPU).
64
+
65
+ ## Usage
66
+
67
+ - **Upload CSV**: Provide a call center CSV in the Gradio UI, or use the default demo dataset.
68
+ - **Output**:
69
+ - **Cleaned Dataset**: Download the processed `cleaned_call_center_faqs.csv`.
70
+ - **Cleanup Stats**: “Cleaned FAQs: 6; removed 4 junk entries: 2 nulls, 1 duplicates, 1 short, 0 malformed”.
71
+ - **Performance Plot**: Visual metrics for processing time and cleanup stats.
72
+
73
+ **Example**:
74
+ - **Input CSV**: Sample dataset with 10 FAQs, including 2 nulls, 1 duplicate, 1 short entry.
75
+ - **Output**:
76
+ - Cleaned Dataset: 6 FAQs in `cleaned_call_center_faqs.csv`.
77
+ - Cleanup Stats: “Cleaned FAQs: 6; removed 4 junk entries: 2 nulls, 1 duplicates, 1 short, 0 malformed”.
78
+ - Plot: Processing Time (Ingestion: 50ms, Cleaning: 30ms, Output: 10ms), Cleanup Stats (Nulls: 2, Duplicates: 1, Short: 1, Malformed: 0).
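
The arithmetic behind the summary line is simply rows kept = rows in minus rows removed across the four categories. A small sketch, where `format_cleanup_summary` is a hypothetical helper that reproduces the message format quoted above:

```python
def format_cleanup_summary(total_rows, stats):
    """Build the cleanup summary line from the input row count and per-category counts."""
    removed = stats["nulls"] + stats["duplicates"] + stats["short"] + stats["malformed"]
    kept = total_rows - removed
    return (
        f"Cleaned FAQs: {kept}; removed {removed} junk entries: "
        f"{stats['nulls']} nulls, {stats['duplicates']} duplicates, "
        f"{stats['short']} short, {stats['malformed']} malformed"
    )


# The example above: 10 input FAQs with 2 nulls, 1 duplicate, 1 short entry.
summary = format_cleanup_summary(10, {"nulls": 2, "duplicates": 1, "short": 1, "malformed": 0})
print(summary)
```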
79
+
80
+ ## Technical Details
81
+
82
+ **Stack**:
83
+ - **Pandas**: Data wrangling and preprocessing for call center CSVs.
84
+ - **Gradio**: Interactive UI for real-time data preprocessing demos.
85
+ - **Matplotlib**: Performance visualization with bar charts.
86
+ - **FastAPI Compatibility**: Designed with API-driven preprocessing in mind, so the pipeline can be exposed through FastAPI for scalable deployments.
87
+
88
+ **Free Tier Optimization**: Lightweight with CPU-only dependencies, no GPU required.
89
+
90
+ **Extensibility**: Ready for integration with RAG/CAG pipelines, and cloud deployments on AWS Lambda or Azure Functions.
91
+
92
+ ## Purpose
93
+
94
+ This demo showcases data preprocessing for AI-driven CX automation, with a focus on call center data quality. Built on over 5 years of experience in AI, data engineering, and enterprise-grade deployments, it shows how Pandas-based data cleaning prepares FAQ data for RAG/CAG pipelines in call center environments.
95
+
96
+ ## Latest Update
97
+
98
+ **Status Update**: Placeholder update - January 01, 2025 📝
99
+ - Placeholder update text.
100
+
101
+ ## Future Enhancements
102
+
103
+ - **Real-Time Streaming**: Add support for real-time data streaming from Kafka for live preprocessing.
104
+ - **FastAPI Deployment**: Expose preprocessing pipeline via FastAPI endpoints for production-grade use.
105
+ - **Advanced Validation**: Implement stricter data validation rules using machine learning-based outlier detection.
106
+ - **Cloud Integration**: Enhance compatibility with AWS Glue or Azure Data Factory for enterprise data pipelines.
107
+
108
+ **Website**: https://ghostainews.com/
109
+ **Discord**: https://discord.gg/BfA23aYz