---
license: mit
---

# CSV Data Processing & Merging Pipeline

## Overview

This project processes and merges multiple CSV datasets by mapping their column headers to a predefined set of target columns, cleaning the data, and keeping column names consistent. It uses semantic similarity to standardize headers and applies filtering rules so that only valid entries are kept.

## Features

- **Recursive CSV File Search:** Collects all CSV files from a specified directory and its subdirectories.
- **AI-Powered Header Mapping:** Uses Sentence-BERT (`paraphrase-MiniLM-L6-v2`) and DeepSeek-R1:14B to map headers to predefined target columns.
- **Data Cleaning & Validation:**
  - Keeps only valid phone numbers (7 to 15 digits).
  - Normalizes column names for consistency.
- **Merging & Structuring:** Combines multiple datasets into a structured format.

## Installation

### Step 1: Create a Virtual Environment

```sh
python -m venv venv
source venv/bin/activate  # On macOS/Linux
venv\Scripts\activate     # On Windows
```

### Step 2: Install Dependencies

```sh
pip install -r requirements.txt
```

### Step 3: Ensure Required Models are Installed

The Sentence-BERT model is downloaded automatically the first time it is loaded. The command below triggers that download ahead of time:

```sh
python -c "from sentence_transformers import SentenceTransformer; SentenceTransformer('paraphrase-MiniLM-L6-v2')"
```

DeepSeek-R1:14B must also be available through whatever runtime the script is configured to use (for example, `ollama pull deepseek-r1:14b` if the model is served through Ollama).

## Usage

### Running the Script

```sh
python 1_CleanLargeMultipleExcels.py --root_directory path/to/csv/files
```

### Arguments

- `--root_directory`: Path to the folder containing CSV files. Subfolders are searched recursively.

## Project Structure

```
project/
│── 1_CleanLargeMultipleExcels.py  # Main script
```

## Functions

### `read_csv_files(root_directory)`

Recursively scans a directory and returns all CSV files found.

### `calculate_similarity(source_list, target_list)`

Computes cosine similarity between the embeddings of dataset headers and target headers.

### `get_header_similarity(headers)`

Uses DeepSeek AI to match dataset headers to predefined target columns.

### `is_valid_mobile(value)`

Checks whether a given value is a valid phone number (7 to 15 digits).

### `merge_dataframes(file_paths)`

Reads, maps, cleans, and merges CSV data.

## Future Enhancements

- **Parallel Processing:** Implement multiprocessing for handling large datasets.
- **Improved Header Mapping:** Train a custom model for better accuracy.
- **Batch Processing:** Process CSV files in chunks to optimize memory usage.

## License

This project is licensed under the MIT License.

---

**NEEL N SONI:**
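
## Appendix: Example Helper Sketch

The two simplest helpers listed under Functions, `read_csv_files` and `is_valid_mobile`, can be illustrated with the standard library alone. This is a minimal sketch, not the code from `1_CleanLargeMultipleExcels.py`: the function names and the 7 to 15 digit rule come from this README, while the function bodies, the case-insensitive `.csv` match, and the decision to ignore spaces, dashes, parentheses, and a leading `+` are assumptions made for illustration.

```python
import re
from pathlib import Path


def read_csv_files(root_directory):
    """Return the paths of all .csv files under root_directory (recursive)."""
    root = Path(root_directory)
    # rglob("*") walks every subdirectory; comparing the lowercased suffix
    # also picks up files named like DATA.CSV (assumed behavior).
    return sorted(
        str(path)
        for path in root.rglob("*")
        if path.is_file() and path.suffix.lower() == ".csv"
    )


def is_valid_mobile(value):
    """Return True if value holds 7 to 15 digits once formatting is removed."""
    if value is None:
        return False
    # Assumption: whitespace, dashes, and parentheses are formatting only.
    cleaned = re.sub(r"[\s\-()]", "", str(value))
    # Assumption: a single leading "+" (international prefix) is allowed.
    if cleaned.startswith("+"):
        cleaned = cleaned[1:]
    # Only ASCII digits count toward the 7 to 15 digit rule.
    return re.fullmatch(r"[0-9]{7,15}", cleaned) is not None
```

A typical call would be `paths = read_csv_files("path/to/csv/files")` followed by filtering a phone column with `is_valid_mobile`. Note that values read as floats (for example `9876543210.0`) are rejected by this sketch because of the decimal point, so a phone column is best read as text before validation.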