neelnsoni13 committed on
Commit
cc412d8
·
verified ·
1 Parent(s): d9198d9

Create README.md

  1. README.md +79 -0
README.md ADDED
@@ -0,0 +1,79 @@
1
+ ---
2
+ license: mit
3
+ ---
4
+
5
+ # CSV Data Processing & Merging Pipeline
6
+
7
+ ## Overview
8
+ This project merges multiple CSV datasets into a single structured dataset. It maps each file's column headers to a predefined set of target columns, cleans the data, and filters out invalid entries. Header mapping combines embedding-based semantic similarity (Sentence-BERT) with an LLM (DeepSeek-R1:14B).
9
+
10
+ ## Features
11
+ - **Recursive CSV File Search:** Collects all CSV files from a specified directory.
12
+ - **AI-Powered Header Mapping:** Uses Sentence-BERT (`paraphrase-MiniLM-L6-v2`) and DeepSeek-R1:14B to map headers to predefined target columns.
13
+ - **Data Cleaning & Validation:**
14
+   - Filters valid phone numbers (7 to 15 digits).
15
+   - Normalizes column names for consistency.
16
+ - **Merging & Structuring:** Combines multiple datasets into a structured format.
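
The README does not say which normalization rules the script applies to column names. As a minimal sketch, assuming the goal is lowercase snake_case names, a helper (the hypothetical `normalize_column_name`, not a function from the project) might look like this:

```python
import re


def normalize_column_name(name: str) -> str:
    """Hypothetical helper: lowercase, trim, and collapse punctuation/spaces to '_'."""
    # Replace every run of characters that is not a-z or 0-9 with one underscore.
    cleaned = re.sub(r"[^0-9a-z]+", "_", name.strip().lower())
    # Drop underscores left over at either end (e.g. from a trailing period).
    return cleaned.strip("_")


print(normalize_column_name("  Mobile No. "))   # mobile_no
print(normalize_column_name("E-Mail Address"))  # e_mail_address
```

Normalizing before header mapping means that variants such as `Mobile No.` and `mobile_no` compare as the same string.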
17
+
18
+ ## Installation
19
+
20
+ ### Step 1: Create a Virtual Environment
21
+ ```sh
22
+ python -m venv venv
23
+ source venv/bin/activate # On macOS/Linux
24
+ venv\Scripts\activate # On Windows
25
+ ```
26
+
27
+ ### Step 2: Install Dependencies
28
+ ```sh
29
+ pip install -r requirements.txt
30
+ ```
31
+
32
+ ### Step 3: Ensure Required Models are Installed
33
+ ```sh
34
+ python -c "from sentence_transformers import SentenceTransformer; SentenceTransformer('paraphrase-MiniLM-L6-v2')"
35
+ ```
36
+
37
+ ## Usage
38
+
39
+ ### Running the Script
40
+ ```sh
41
+ python 1_CleanLargeMultipleExcels.py --root_directory path/to/csv/files
42
+ ```
43
+
44
+ ### Arguments
45
+ - `--root_directory`: Path to the folder containing CSV files.
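
For illustration, the argument could be declared with `argparse` roughly as follows. This is a sketch, not the script's actual code: the `parse_args` wrapper and the choice to make the flag required are assumptions.

```python
import argparse


def parse_args(argv=None):
    """Hypothetical wrapper around argparse for the single documented flag."""
    parser = argparse.ArgumentParser(
        description="Merge CSV files found under a root directory."
    )
    parser.add_argument(
        "--root_directory",
        required=True,  # assumption: the README does not state a default
        help="Path to the folder containing CSV files.",
    )
    return parser.parse_args(argv)


# Passing a list instead of reading sys.argv makes the parser easy to try out.
args = parse_args(["--root_directory", "path/to/csv/files"])
print(args.root_directory)  # path/to/csv/files
```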
46
+
47
+ ## Project Structure
48
+ ```
49
+ project/
50
+ └── 1_CleanLargeMultipleExcels.py # Main script
51
+ ```
52
+
53
+ ## Functions
54
+
55
+ ### `read_csv_files(root_directory)`
56
+ Recursively scans a directory to find all CSV files.
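
A minimal sketch of such a scan using `pathlib`. The case-insensitive extension match and the sorted return value are assumptions, not documented behavior:

```python
from pathlib import Path


def read_csv_files(root_directory):
    """Return the paths of all CSV files under root_directory, at any depth."""
    root = Path(root_directory)
    return sorted(
        str(path)
        for path in root.rglob("*")  # rglob walks subdirectories recursively
        if path.is_file() and path.suffix.lower() == ".csv"  # also matches .CSV
    )
```

Sorting keeps the processing order stable between runs, which makes the merged output easier to compare.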
57
+
58
+ ### `calculate_similarity(source_list, target_list)`
59
+ Computes cosine similarity between dataset headers and target headers.
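
The sketch below shows the cosine-similarity matching step in plain Python. To stay self-contained it swaps in a toy character-bigram embedding (`embed_char_bigrams`, a hypothetical stand-in) for the Sentence-BERT vectors. In the real pipeline the vectors would presumably come from `SentenceTransformer("paraphrase-MiniLM-L6-v2").encode(...)`. The return shape is also an assumption.

```python
import math
from collections import Counter


def embed_char_bigrams(text):
    """Toy stand-in for a sentence embedding: counts of character bigrams."""
    s = text.lower().replace("_", " ").strip()
    return Counter(s[i:i + 2] for i in range(len(s) - 1))


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as Counters."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def calculate_similarity(source_list, target_list, embed=embed_char_bigrams):
    """Map each source header to (best matching target header, similarity score)."""
    target_vecs = {t: embed(t) for t in target_list}
    result = {}
    for source in source_list:
        vec = embed(source)
        best = max(target_list, key=lambda t: cosine(vec, target_vecs[t]))
        result[source] = (best, cosine(vec, target_vecs[best]))
    return result


matches = calculate_similarity(["Phone Number", "Full_Name"], ["phone", "name", "email"])
for header, (target, score) in matches.items():
    print(f"{header!r} -> {target!r} ({score:.2f})")
```

A practical version would likely also apply a minimum score threshold so that unrelated headers are left unmapped rather than forced onto the nearest target.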
60
+
61
+ ### `get_header_similarity(headers)`
62
+ Uses DeepSeek-R1:14B to match dataset headers to the predefined target columns.
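
The README does not describe how the model is called or what it returns, so the following is only a sketch under stated assumptions: the model is reached through a caller-supplied `ask_model` function, it is prompted to answer with a JSON object, and its reply may begin with a `<think>...</think>` reasoning block, which DeepSeek-R1 models typically emit. `TARGET_COLUMNS`, `build_prompt`, and `parse_mapping` are hypothetical names.

```python
import json
import re

TARGET_COLUMNS = ["name", "mobile", "email"]  # hypothetical target schema


def build_prompt(headers, targets=TARGET_COLUMNS):
    """Ask for a JSON object mapping source headers to target columns."""
    return (
        "Map each source header to one target column, or omit it if none fits.\n"
        f"Source headers: {json.dumps(headers)}\n"
        f"Target columns: {json.dumps(targets)}\n"
        "Reply with a single JSON object only."
    )


def parse_mapping(reply, headers, targets=TARGET_COLUMNS):
    """Extract a validated {header: target} mapping from the model's reply."""
    # Remove the reasoning block so braces inside it cannot confuse the parser.
    visible = re.sub(r"<think>.*?</think>", "", reply, flags=re.DOTALL)
    match = re.search(r"\{.*\}", visible, flags=re.DOTALL)
    if not match:
        return {}
    try:
        raw = json.loads(match.group(0))
    except json.JSONDecodeError:
        return {}
    # Keep only pairs that refer to known headers and known target columns.
    return {h: t for h, t in raw.items() if h in headers and t in targets}


def get_header_similarity(headers, ask_model):
    """ask_model is any callable that takes a prompt string and returns text."""
    return parse_mapping(ask_model(build_prompt(headers)), headers)
```

Validating the reply against the known headers and targets guards against the model inventing column names.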
63
+
64
+ ### `is_valid_mobile(value)`
65
+ Checks whether a given value is a valid phone number (7 to 15 digits).
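
One way such a check could be written is sketched below. The 7 to 15 digit rule comes from the Features list. Treating spaces, dashes, parentheses, and a leading `+` as ignorable formatting is an assumption.

```python
import re


def is_valid_mobile(value) -> bool:
    """Return True if value contains 7 to 15 digits once formatting is removed."""
    if value is None:
        return False
    text = str(value).strip()
    # Assumption: whitespace, dashes, and parentheses are formatting only.
    digits = re.sub(r"[\s\-()]", "", text)
    # Assumption: a single leading "+" (international prefix) is allowed.
    if digits.startswith("+"):
        digits = digits[1:]
    return re.fullmatch(r"[0-9]{7,15}", digits) is not None


print(is_valid_mobile("+91 98765-43210"))  # True
print(is_valid_mobile("12345"))            # False
```

Note that a number parsed as a float and stringified (for example `9876543210.0`) fails this check, so phone columns are best read as strings.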
66
+
67
+ ### `merge_dataframes(file_paths)`
68
+ Reads, maps, cleans, and merges CSV data.
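
The function name suggests pandas, but the README does not show the implementation. To illustrate the read, map, and merge flow without extra dependencies, the sketch below uses only the standard-library `csv` module. `merge_csv_rows`, `header_map`, and `target_columns` are hypothetical names, and the header mapping is passed in rather than computed.

```python
import csv
import os
import tempfile


def merge_csv_rows(file_paths, header_map, target_columns):
    """Read each CSV, rename mapped headers, and collect rows in one list."""
    merged = []
    for path in file_paths:
        with open(path, newline="", encoding="utf-8") as handle:
            for row in csv.DictReader(handle):
                out = {col: "" for col in target_columns}
                for source, value in row.items():
                    target = header_map.get(source)
                    # Skip unmapped columns and empty cells.
                    if target in out and value:
                        out[target] = value.strip()
                merged.append(out)
    return merged


# Small demonstration with two files that use different header names.
with tempfile.TemporaryDirectory() as tmp:
    first = os.path.join(tmp, "a.csv")
    second = os.path.join(tmp, "b.csv")
    with open(first, "w", encoding="utf-8") as f:
        f.write("Full Name,Phone No\nAsha,9876543210\n")
    with open(second, "w", encoding="utf-8") as f:
        f.write("mobile_number,customer\n 9123456789 ,Ravi\n")
    merged = merge_csv_rows(
        [first, second],
        {"Full Name": "name", "customer": "name",
         "Phone No": "mobile", "mobile_number": "mobile"},
        ["name", "mobile"],
    )

print(merged)
```

In the actual pipeline, the mapping would presumably come from the header-matching functions, and invalid phone numbers would be filtered out before or after the merge.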
69
+
70
+ ## Future Enhancements
71
+ - **Parallel Processing:** Implement multiprocessing for handling large datasets.
72
+ - **Improved Header Mapping:** Train a custom model for better accuracy.
73
+ - **Batch Processing:** Process CSV files in chunks to optimize memory usage.
74
+
75
+ ## License
76
+ This project is licensed under the MIT License.
77
+
78
+ ---
79
+ **NEEL N SONI**