neelnsoni13/LargeMultiExcelsCleanDeepSeek


neelnsoni13 committed on Feb 16, 2025

Commit cc412d8 · verified · 1 Parent(s): d9198d9

Create README.md

Files changed (1)

README.md +79 -0

README.md ADDED

	@@ -0,0 +1,79 @@

+---
+license: mit
+---
+# CSV Data Processing & Merging Pipeline
+## Overview
+This project merges multiple CSV datasets into a single structured dataset. It maps each file's column headers to a predefined set of target columns using semantic similarity, normalizes column names, and filters out invalid entries.
+## Features
+- **Recursive CSV File Search:** Collects all CSV files from a specified directory.
+- **AI-Powered Header Mapping:** Uses Sentence-BERT (`paraphrase-MiniLM-L6-v2`) and DeepSeek-R1:14B to map headers to predefined target columns.
+- **Data Cleaning & Validation:**
+  - Keeps only valid phone numbers (7 to 15 digits).
+  - Normalizes column names for consistency.
+- **Merging & Structuring:** Combines multiple datasets into a structured format.
+## Installation
+### Step 1: Create a Virtual Environment
+```sh
+python -m venv venv
+source venv/bin/activate   # On macOS/Linux
+venv\Scripts\activate     # On Windows
+```
+### Step 2: Install Dependencies
+```sh
+pip install -r requirements.txt
+```
+### Step 3: Ensure Required Models are Installed
+```sh
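+# Pre-download the Sentence-BERT model (it is otherwise fetched automatically on first use)
+python -c "from sentence_transformers import SentenceTransformer; SentenceTransformer('paraphrase-MiniLM-L6-v2')"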
+```
+## Usage
+### Running the Script
+```sh
+python 1_CleanLargeMultipleExcels.py --root_directory path/to/csv/files
+```
+### Arguments
+- `--root_directory`: Path to the folder containing CSV files.
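The `--root_directory` flag could be parsed along these lines. This is a minimal sketch using the standard `argparse` module; that the script uses `argparse`, that the flag is required, and the helper name `parse_args` are all assumptions.

```python
import argparse


def parse_args(argv=None):
    """Parse command-line arguments (hypothetical helper, not from the script)."""
    parser = argparse.ArgumentParser(
        description="Merge CSV files found under a directory."
    )
    parser.add_argument(
        "--root_directory",
        required=True,  # assumption: the script cannot run without it
        help="Path to the folder containing CSV files.",
    )
    return parser.parse_args(argv)


# Passing an explicit list makes the example runnable without a real command line.
args = parse_args(["--root_directory", "path/to/csv/files"])
print(args.root_directory)
```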
+## Project Structure
+```
+project/
+└── 1_CleanLargeMultipleExcels.py                # Main script
+```
+## Functions
+### `read_csv_files(root_directory)`
+Recursively scans a directory to find all CSV files.
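A minimal sketch of such a recursive scan, using only the standard library. The case-insensitive extension match and the sorted return value are assumptions made for the illustration, not details confirmed by the script.

```python
from pathlib import Path


def read_csv_files(root_directory):
    """Return a sorted list of paths to every .csv file under root_directory."""
    root = Path(root_directory)
    return sorted(
        str(path)
        for path in root.rglob("*")  # rglob walks all subdirectories
        if path.is_file() and path.suffix.lower() == ".csv"
    )
```

Sorting keeps the processing order stable between runs, which makes the merged output easier to compare.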
+### `calculate_similarity(source_list, target_list)`
+Computes cosine similarity between the embeddings of dataset headers and target headers.
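Cosine similarity between two vectors `u` and `v` is `dot(u, v) / (|u| * |v|)`. The sketch below applies it to every source/target pair. It assumes the headers have already been turned into numeric vectors (in this project that would presumably be done by the Sentence-BERT model); the toy vectors and the helper names `cosine_similarity` and `similarity_matrix` are illustrative only.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # treat a zero vector as similar to nothing
    return dot / (norm_u * norm_v)


def similarity_matrix(source_vectors, target_vectors):
    """Row i, column j holds the similarity of source i to target j."""
    return [
        [cosine_similarity(s, t) for t in target_vectors]
        for s in source_vectors
    ]


# Toy 2-D "embeddings" standing in for real header embeddings.
scores = similarity_matrix([[1.0, 0.0], [1.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
print(scores)
```

Each source header can then be mapped to the target column with the highest score in its row.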
+### `get_header_similarity(headers)`
+Uses DeepSeek-R1:14B to match dataset headers to predefined target columns.
+### `is_valid_mobile(value)`
+Checks whether a given value is a valid phone number (7 to 15 digits).
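One way to implement the 7-to-15-digit rule is sketched below. The README only states the digit range, so the handling of separators (spaces, hyphens, parentheses) and of a leading `+` is an assumption.

```python
import re


def is_valid_mobile(value):
    """Return True if value holds 7 to 15 digits once separators are removed."""
    if value is None:
        return False
    text = str(value).strip()
    # Assumption: spaces, hyphens and parentheses are harmless formatting.
    cleaned = re.sub(r"[\s\-()]", "", text)
    # Assumption: a single leading "+" (international prefix) is allowed.
    if cleaned.startswith("+"):
        cleaned = cleaned[1:]
    return re.fullmatch(r"\d{7,15}", cleaned) is not None


for sample in ["+1 (555) 123-4567", "12345", "abc1234567"]:
    print(sample, is_valid_mobile(sample))
```

Values that pandas has parsed as floats (for example `9876543210.0`) are rejected by this sketch, so phone columns are best read as strings.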
+### `merge_dataframes(file_paths)`
+Reads, maps, cleans, and merges CSV data.
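The rename-then-stack core of such a merge can be sketched with pandas as follows. The target columns, the header mapping, the sample rows, and the helper name `merge_frames` are all made up for the illustration; in the real script the mapping would come from the header-matching step, and the frames from the CSV files on disk.

```python
import io

import pandas as pd

# Hypothetical target columns and header mapping, for illustration only.
TARGET_COLUMNS = ["name", "mobile"]
HEADER_MAP = {
    "Full Name": "name",
    "Customer": "name",
    "Phone": "mobile",
    "Contact No": "mobile",
}


def merge_frames(frames):
    """Rename mapped headers, keep only target columns, and stack the frames."""
    cleaned = []
    for df in frames:
        df = df.rename(columns=HEADER_MAP)
        # reindex adds any missing target column as NaN and drops the others
        cleaned.append(df.reindex(columns=TARGET_COLUMNS))
    return pd.concat(cleaned, ignore_index=True)


# Two in-memory CSVs with different headers stand in for files on disk.
a = pd.read_csv(io.StringIO("Full Name,Phone\nAsha,9876543210\n"), dtype=str)
b = pd.read_csv(io.StringIO("Customer,Contact No,City\nRavi,5551234567,Pune\n"), dtype=str)
merged = merge_frames([a, b])
print(merged)
```

Reading with `dtype=str` keeps phone numbers as text, so leading zeros and `+` prefixes survive until validation.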
+## Future Enhancements
+- **Parallel Processing:** Implement multiprocessing for handling large datasets.
+- **Improved Header Mapping:** Train a custom model for better accuracy.
+- **Batch Processing:** Process CSV files in chunks to optimize memory usage.
+## License
+This project is licensed under the MIT License.
+---
+**NEEL N SONI:**