|
|
--- |
|
|
license: mit |
|
|
--- |
|
|
|
|
|
# CSV Data Processing & Merging Pipeline |
|
|
|
|
|
## Overview |
|
|
This project processes and merges multiple CSV datasets by intelligently mapping column headers, cleaning data, and ensuring consistency. It leverages AI-based semantic similarity to standardize headers and applies filtering rules for valid data entries. |
|
|
|
|
|
## Features |
|
|
- **Recursive CSV File Search:** Collects all CSV files from a specified directory. |
|
|
- **AI-Powered Header Mapping:** Uses Sentence-BERT (`paraphrase-MiniLM-L6-v2`) and DeepSeek-R1:14B to map headers to predefined target columns. |
|
|
- **Data Cleaning & Validation:** |
|
|
- Filters valid phone numbers (7 to 15 digits). |
|
|
- Normalizes column names for consistency. |
|
|
- **Merging & Structuring:** Combines multiple datasets into a structured format. |
|
|
|
|
|
## Installation |
|
|
|
|
|
### Step 1: Create a Virtual Environment |
|
|
```sh |
|
|
python -m venv venv |
|
|
source venv/bin/activate # On macOS/Linux |
|
|
venv\Scripts\activate # On Windows |
|
|
``` |
|
|
|
|
|
### Step 2: Install Dependencies |
|
|
```sh |
|
|
pip install -r requirements.txt |
|
|
``` |
|
|
|
|
|
### Step 3: Ensure Required Models are Installed |
|
|
```sh |
|
|
python -m sentence_transformers.sentence_transformer |
|
|
``` |
|
|
|
|
|
## Usage |
|
|
|
|
|
### Running the Script |
|
|
```sh |
|
|
python main.py --root_directory path/to/csv/files |
|
|
``` |
|
|
|
|
|
### Arguments |
|
|
- `--root_directory`: Path to the folder containing CSV files. |
|
|
|
|
|
## Project Structure |
|
|
``` |
|
|
project/ |
|
|
│── 1_CleanLargeMultipleExcels.py # Main script |
|
|
``` |
|
|
|
|
|
## Functions |
|
|
|
|
|
### `read_csv_files(root_directory)` |
|
|
Recursively scans a directory to find all CSV files. |
|
|
|
|
|
### `calculate_similarity(source_list, target_list)` |
|
|
Computes cosine similarity between dataset headers and target headers. |
|
|
|
|
|
### `get_header_similarity(headers)` |
|
|
Uses DeepSeek AI to match dataset headers to predefined target columns. |
|
|
|
|
|
### `is_valid_mobile(value)` |
|
|
Checks if a given value is a valid phone number. |
|
|
|
|
|
### `merge_dataframes(file_paths)` |
|
|
Reads, maps, cleans, and merges CSV data. |
|
|
|
|
|
## Future Enhancements |
|
|
- **Parallel Processing:** Implement multiprocessing for handling large datasets. |
|
|
- **Improved Header Mapping:** Train a custom model for better accuracy. |
|
|
- **Batch Processing:** Process CSV files in chunks to optimize memory usage. |
|
|
|
|
|
## License |
|
|
This project is licensed under the MIT License. |
|
|
|
|
|
--- |
|
|
**NEEL N SONI:** |
|
|
|