CSV Data Processing & Merging Pipeline

Overview

This project merges multiple CSV datasets by mapping their column headers to a predefined set of target columns, cleaning the data, and keeping only valid entries. Header mapping relies on semantic similarity between header names, and filtering rules remove invalid records.

Features

  • Recursive CSV File Search: Collects all CSV files from a specified directory.
  • AI-Powered Header Mapping: Uses Sentence-BERT (paraphrase-MiniLM-L6-v2) and DeepSeek-R1:14B to map headers to predefined target columns.
  • Data Cleaning & Validation:
    • Filters valid phone numbers (7 to 15 digits).
    • Normalizes column names for consistency.
  • Merging & Structuring: Combines multiple datasets into a structured format.

Installation

Step 1: Create a Virtual Environment

python -m venv venv
source venv/bin/activate   # On macOS/Linux
venv\Scripts\activate     # On Windows

Step 2: Install Dependencies

pip install -r requirements.txt

Step 3: Ensure Required Models are Installed

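python -c "from sentence_transformers import SentenceTransformer; SentenceTransformer('paraphrase-MiniLM-L6-v2')"   # Downloads and caches the Sentence-BERT model on first load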

Usage

Running the Script

python 1_CleanLargeMultipleExcels.py --root_directory path/to/csv/files

Arguments

  • --root_directory: Path to the folder containing CSV files.
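
As an illustration of how the argument above could be parsed, here is a minimal sketch using the standard argparse module. The function name parse_args and the choice to make the flag required are assumptions; the original script may handle this differently.

```python
import argparse


def parse_args(argv=None):
    """Parse command-line arguments; argv=None falls back to sys.argv[1:]."""
    parser = argparse.ArgumentParser(
        description="Process and merge CSV files found under a root directory."
    )
    parser.add_argument(
        "--root_directory",
        required=True,  # assumption: the original script may define a default instead
        help="Path to the folder containing CSV files.",
    )
    return parser.parse_args(argv)


# Example with an explicit argument list (the path is a placeholder).
args = parse_args(["--root_directory", "path/to/csv/files"])
print(args.root_directory)
```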

Project Structure

project/
│── 1_CleanLargeMultipleExcels.py                # Main script

Functions

read_csv_files(root_directory)

Recursively scans a directory to find all CSV files.
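
A recursive scan of this kind might look like the sketch below. It uses pathlib from the standard library; the case-insensitive extension match and the sorted return value are assumptions, not confirmed behaviour of the original function.

```python
from pathlib import Path


def read_csv_files(root_directory):
    """Return the sorted paths of all .csv files under root_directory."""
    root = Path(root_directory)
    return sorted(
        str(path)
        for path in root.rglob("*")  # rglob walks every subdirectory
        # Assumption: match the extension case-insensitively (.csv, .CSV).
        if path.is_file() and path.suffix.lower() == ".csv"
    )
```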

calculate_similarity(source_list, target_list)

Computes cosine similarity between dataset headers and target headers.
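
The idea can be sketched without any model download. In the example below, trigram_embed is a toy stand-in (character-trigram counts) for the Sentence-BERT embeddings the project uses, and the return shape (best target and score per source header) is an assumption. Only the cosine-similarity step is meant to reflect the description above.

```python
import math
from collections import Counter


def trigram_embed(text):
    """Toy stand-in for a sentence embedding: character-trigram counts."""
    padded = f"  {text.lower().strip()} "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(value * b.get(key, 0) for key, value in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def calculate_similarity(source_list, target_list, embed=trigram_embed):
    """Map each source header to (best_target, score); target_list must be non-empty."""
    targets = [(target, embed(target)) for target in target_list]
    result = {}
    for source in source_list:
        vec = embed(source)
        result[source] = max(
            ((target, cosine(vec, tvec)) for target, tvec in targets),
            key=lambda pair: pair[1],
        )
    return result


matches = calculate_similarity(["Phone No", "E-mail"], ["phone", "email"])
for header, (target, score) in matches.items():
    print(f"{header!r} -> {target!r} ({score:.2f})")
```

To use real embeddings, pass a different embed callable that returns vectors in the same dict form, or swap the cosine helper for one that works on dense arrays.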

get_header_similarity(headers)

Uses DeepSeek-R1:14B to match dataset headers to the predefined target columns.

is_valid_mobile(value)

Checks whether a value is a valid phone number (7 to 15 digits).
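
A minimal sketch of the 7-to-15-digit rule from the Features list follows. Treating spaces, dashes, dots, parentheses, and a leading plus sign as ignorable formatting is an assumption; the original function may be stricter or looser.

```python
import re


def is_valid_mobile(value):
    """Return True if value holds 7 to 15 digits once formatting is stripped."""
    if value is None:
        return False
    text = str(value).strip()
    # Assumption: only digits, whitespace, '-', '.', '(' and ')' may appear,
    # optionally preceded by a single '+'.
    if not re.fullmatch(r"\+?[\d\s\-().]+", text):
        return False
    digits = re.sub(r"\D", "", text)  # drop everything that is not a digit
    return 7 <= len(digits) <= 15


print(is_valid_mobile("+91 98765-43210"))
print(is_valid_mobile("12345"))
```

Numbers that were parsed as floats (for example 9876543210.0) would gain a spurious trailing digit under this rule, so reading phone columns as strings is the safer choice.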

merge_dataframes(file_paths)

Reads, maps, cleans, and merges CSV data.
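
The read, map, clean, and merge flow can be sketched as below. To stay dependency-free, this version uses the standard csv module rather than dataframes, so it is a stand-in and not the project's implementation. The names merge_rows, TARGET_COLUMNS, and header_map are hypothetical, and the header mapping is passed in as a plain dict instead of being computed by a model.

```python
import csv

TARGET_COLUMNS = ["name", "mobile"]  # hypothetical target schema


def merge_rows(file_paths, header_map, is_valid):
    """Read each CSV, rename mapped headers, and keep rows with a valid 'mobile'.

    header_map maps normalized (lower-cased, stripped) source headers to target
    column names; is_valid is a predicate applied to the mobile value.
    """
    merged = []
    for path in file_paths:
        with open(path, newline="", encoding="utf-8") as handle:
            for row in csv.DictReader(handle):
                # Start from an empty record so every output row has the same keys.
                clean = {column: "" for column in TARGET_COLUMNS}
                for source, value in row.items():
                    target = header_map.get((source or "").strip().lower())
                    if target in clean:
                        clean[target] = (value or "").strip()
                if is_valid(clean["mobile"]):
                    merged.append(clean)
    return merged
```

Columns with no entry in header_map are dropped, and rows whose mobile value fails the predicate are skipped.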

Future Enhancements

  • Parallel Processing: Implement multiprocessing for handling large datasets.
  • Improved Header Mapping: Train a custom model for better accuracy.
  • Batch Processing: Process CSV files in chunks to optimize memory usage.

License

This project is licensed under the MIT License.


NEEL N SONI:
