{ "repository_url": "https://github.com/ronelsolomon/ETL-pipeline.git", "owner": "ronelsolomon", "name": "ETL-pipeline.git", "extracted_at": "2026-03-02T22:43:46.867502", "files": { "requirements.txt": { "content": "pandas>=1.5.0\npython-dotenv>=0.19.0\npsycopg2-binary>=2.9.0\nsqlalchemy>=1.4.0\npyarrow>=7.0.0\nrequests>=2.28.0\npython-dateutil>=2.8.2\n", "size": 132, "language": "text" }, "README.md": { "content": "# ETL Pipeline with Open Data\n\nThis project demonstrates a simple ETL (Extract, Transform, Load) pipeline that processes public data and loads it into a database.\n\n## Features\n\n- Extracts data from public CSV datasets\n- Transforms and cleans data using Pandas\n- Loads data into PostgreSQL database\n- Environment variable configuration\n- Logging for monitoring the ETL process\n\n## Setup\n\n1. Clone the repository\n2. Install dependencies:\n ```bash\n pip install -r requirements.txt\n ```\n3. Set up your environment variables in `.env` (use `.env.example` as a template)\n4. Run the ETL pipeline:\n ```bash\n python main.py\n ```\n\n## Project Structure\n\n```\nETL-pipeline/\n├── data/ # For storing raw and processed data\n├── src/\n│ ├── __init__.py\n│ ├── extract.py # Data extraction logic\n│ ├── transform.py # Data transformation logic\n│ ├── load.py # Data loading logic\n│ └── utils.py # Utility functions\n├── .env.example # Example environment variables\n├── requirements.txt # Project dependencies\n├── main.py # Main script to run the ETL pipeline\n└── README.md # This file\n```\n\n## Data Source\n\nThis project uses [New York City Taxi Trip Data](https://www.kaggle.com/datasets/elemento/nyc-taxi-trip-dataset) as an example dataset.\n\n## License\n\nMIT\n", "size": 1359, "language": "markdown" }, ".gitattributes": { "content": "# Auto detect text files and perform LF normalization\n* text=auto\n", "size": 66, "language": "unknown" }, "main.py": { "content": "#!/usr/bin/env python3\n\"\"\"\nMain entry point for the ETL pipeline.\n\"\"\"\nimport os\nimport logging\nfrom dotenv import load_dotenv\nfrom src.extract import extract_data\nfrom src.transform import transform_data\nfrom src.load import load_data\nfrom src.utils import setup_logging, get_db_connection\n\ndef main():\n \"\"\"Main function to run the ETL pipeline.\"\"\"\n # Load environment variables\n load_dotenv()\n \n # Set up logging\n log_level = os.getenv('LOG_LEVEL', 'INFO')\n log_file = os.getenv('LOG_FILE', 'etl_pipeline.log')\n setup_logging(log_level=log_level, log_file=log_file)\n logger = logging.getLogger(__name__)\n \n try:\n logger.info(\"Starting ETL pipeline\")\n \n # Extract data\n logger.info(\"Extracting data...\")\n data_url = os.getenv('DATA_SOURCE_URL')\n raw_data = extract_data(data_url)\n \n # Transform data\n logger.info(\"Transforming data...\")\n transformed_data = transform_data(raw_data)\n \n # Load data\n logger.info(\"Loading data to database...\")\n table_name = 'taxi_trips'\n load_data(transformed_data, table_name)\n \n logger.info(\"ETL pipeline completed successfully\")\n \n except Exception as e:\n logger.error(f\"Error in ETL pipeline: {str(e)}\", exc_info=True)\n raise\n\nif __name__ == \"__main__\":\n main()\n", "size": 1376, "language": "python" }, "etl_pipeline.log": { "content": "2025-06-17 21:05:41,456 - root - INFO - Logging to file: /Users/ronel/Downloads/dev/templates/ETL-pipeline/etl_pipeline.log\n2025-06-17 21:05:41,456 - __main__ - INFO - Starting ETL pipeline\n2025-06-17 21:05:41,456 - __main__ - INFO - Extracting data...\n2025-06-17 21:05:41,456 - src.extract - 
"etl_pipeline.log": { "content": "2025-06-17 21:05:41,456 - root - INFO - Logging to file: /Users/ronel/Downloads/dev/templates/ETL-pipeline/etl_pipeline.log\n2025-06-17 21:05:41,456 - __main__ - INFO - Starting ETL pipeline\n2025-06-17 21:05:41,456 - __main__ - INFO - Extracting data...\n2025-06-17 21:05:41,456 - src.extract - INFO - Extracting data from: None\n2025-06-17 21:05:41,456 - src.extract - ERROR - Error extracting data from None: stat: path should be string, bytes, os.PathLike or integer, not NoneType\n2025-06-17 21:05:41,456 - __main__ - ERROR - Error in ETL pipeline: stat: path should be string, bytes, os.PathLike or integer, not NoneType\nTraceback (most recent call last):\n File \"/Users/ronel/Downloads/dev/templates/ETL-pipeline/main.py\", line 30, in main\n raw_data = extract_data(data_url)\n ^^^^^^^^^^^^^^^^^^^^^^\n File \"/Users/ronel/Downloads/dev/templates/ETL-pipeline/src/extract.py\", line 32, in extract_data\n elif os.path.exists(source):\n ^^^^^^^^^^^^^^^^^^^^^^\n File \"\", line 19, in exists\nTypeError: stat: path should be string, bytes, os.PathLike or integer, not NoneType\n2025-06-17 21:09:20,929 - root - INFO - Logging to file: /Users/ronel/Downloads/dev/templates/ETL-pipeline/etl_pipeline.log\n2025-06-17 21:09:20,929 - __main__ - INFO - Starting ETL pipeline\n2025-06-17 21:09:20,929 - __main__ - INFO - Extracting data...\n2025-06-17 21:09:20,929 - src.extract - INFO - Extracting data from: https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2023-01.parquet\n2025-06-17 21:09:25,181 - __main__ - INFO - Transforming data...\n2025-06-17 21:09:25,181 - src.transform - INFO - Starting data transformation\n2025-06-17 21:09:25,276 - src.transform - INFO - Data transformation complete. Shape after transformation: (3066766, 19)\n2025-06-17 21:09:25,277 - __main__ - INFO - Loading data to database...\n2025-06-17 21:09:25,277 - src.load - INFO - Loading data into taxi_trips\n2025-06-17 21:09:25,306 - src.load - ERROR - Error loading data into taxi_trips: No module named 'psycopg2'\n2025-06-17 21:09:25,306 - __main__ - ERROR - Error in ETL pipeline: No module named 'psycopg2'\nTraceback (most recent call last):\n File \"/Users/ronel/Downloads/dev/templates/ETL-pipeline/main.py\", line 39, in main\n load_data(transformed_data, table_name)\n File \"/Users/ronel/Downloads/dev/templates/ETL-pipeline/src/load.py\", line 60, in load_data\n engine = get_db_connection()\n ^^^^^^^^^^^^^^^^^^^\n File \"/Users/ronel/Downloads/dev/templates/ETL-pipeline/src/utils.py\", line 84, in get_db_connection\n return create_engine(connection_string)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File \"\", line 2, in create_engine\n File \"/opt/anaconda3/lib/python3.12/site-packages/sqlalchemy/dialects/postgresql/psycopg2.py\", line 690, in import_dbapi\n import psycopg2\nModuleNotFoundError: No module named 'psycopg2'\n2025-06-17 22:11:51,014 - root - INFO - Logging to file: /Users/ronel/Downloads/dev/templates/ETL-pipeline/etl_pipeline.log\n2025-06-17 22:11:51,014 - __main__ - INFO - Starting ETL pipeline\n2025-06-17 22:11:51,014 - __main__ - INFO - Extracting data...\n2025-06-17 22:11:51,014 - src.extract - INFO - Extracting data from: https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2023-01.parquet\n2025-06-17 22:11:54,433 - __main__ - INFO - Transforming data...\n2025-06-17 22:11:54,433 - src.transform - INFO - Starting data transformation\n2025-06-17 22:11:54,510 - src.transform - INFO - Data transformation complete. Shape after transformation: (3066766, 19)\n2025-06-17 22:11:54,510 - __main__ - INFO - Loading data to database...\n2025-06-17 22:11:54,510 - src.load - INFO - Loading data into taxi_trips\n2025-06-17 22:11:55,209 - src.load - ERROR - Error loading data into taxi_trips: 'Engine' object has no attribute 'has_table'\n2025-06-17 22:11:55,209 - __main__ - ERROR - Error in ETL pipeline: 'Engine' object has no attribute 'has_table'\nTraceback (most recent call last):\n File \"/Users/ronel/Downloads/dev/templates/ETL-pipeline/main.py\", line 39, in main\n load_data(transformed_data, table_name)\n File \"/Users/ronel/Downloads/dev/templates/ETL-pipeline/src/load.py\", line 64, in load_data\n create_table_from_dataframe(\n File \"/Users/ronel/Downloads/dev/templates/ETL-pipeline/src/load.py\", line 109, in create_table_from_dataframe\n if engine.has_table(table_name, schema=schema):\n ^^^^^^^^^^^^^^^^\nAttributeError: 'Engine' object has no attribute 'has_table'\n2025-06-17 22:14:02,942 - root - INFO - Logging to file: /Users/ronel/Downloads/dev/templates/ETL-pipeline/etl_pipeline.log\n2025-06-17 22:14:02,942 - __main__ - INFO - Starting ETL pipeline\n2025-06-17 22:14:02,942 - __main__ - INFO - Extracting data...\n2025-06-17 22:14:02,942 - src.extract - INFO - Extracting data from: https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2023-01.parquet\n2025-06-17 22:14:04,625 - __main__ - INFO - Transforming data...\n2025-06-17 22:14:04,625 - src.transform - INFO - Starting data transformation\n2025-06-17 22:14:04,699 - src.transform - INFO - Data transformation complete. Shape after transformation: (3066766, 19)\n2025-06-17 22:14:04,699 - __main__ - INFO - Loading data to database...\n2025-06-17 22:14:04,699 - src.load - INFO - Loading data into taxi_trips\n2025-06-17 22:14:05,331 - src.load - ERROR - Error loading data into taxi_trips: (psycopg2.OperationalError) connection to server at \"ep-cool-darkness-a1b2c3d4-pooler.us-east-2.aws.neon.tech\" (3.131.64.200), port 5432 failed: ERROR: password authentication failed for user 'alex'\n\n(Background on this error at: https://sqlalche.me/e/20/e3q8)\n2025-06-17 22:14:05,331 - __main__ - ERROR - Error in ETL pipeline: (psycopg2.OperationalError) connection to server at \"ep-cool-darkness-a1b2c3d4-pooler.us-east-2.aws.neon.tech\" (3.131.64.200), port 5432 failed: ERROR: password authentication failed for user 'alex'\n\n(Background on this error at: https://sqlalche.me/e/20/e3q8)\nTraceback (most recent call last):\n File \"/opt/anaconda3/lib/python3.12/site-packages/psycopg2/__init__.py\", line 122, in connect\n conn = _connect(dsn, connection_factory=connection_factory, **kwasync)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\npsycopg2.OperationalError: connection to server at \"ep-cool-darkness-a1b2c3d4-pooler.us-east-2.aws.neon.tech\" (3.131.64.200), port 5432 failed: ERROR: password authentication failed for user 'alex'\n\n\nThe above exception was the direct cause of the following exception:\n\nTraceback (most recent call last):\n File \"/Users/ronel/Downloads/dev/templates/ETL-pipeline/main.py\", line 39, in main\n load_data(transformed_data, table_name)\n File \"/Users/ronel/Downloads/dev/templates/ETL-pipeline/src/load.py\", line 64, in load_data\n create_table_from_dataframe(\n File \"/Users/ronel/Downloads/dev/templates/ETL-pipeline/src/load.py\", line 111, in create_table_from_dataframe\n inspector = inspect(engine)\n ^^^^^^^^^^^^^^^\nsqlalchemy.exc.OperationalError: (psycopg2.OperationalError) connection to server at \"ep-cool-darkness-a1b2c3d4-pooler.us-east-2.aws.neon.tech\" (3.131.64.200), port 5432 failed: ERROR: password authentication failed for user 'alex'\n\n(Background on this error at: https://sqlalche.me/e/20/e3q8)\n2025-06-17 22:16:08,322 - root - INFO - Logging to file: /Users/ronel/Downloads/dev/templates/ETL-pipeline/etl_pipeline.log\n2025-06-17 22:16:08,322 - __main__ - INFO - Starting ETL pipeline\n2025-06-17 22:16:08,322 - __main__ - INFO - Extracting data...\n2025-06-17 22:16:08,322 - src.extract - INFO - Extracting data from: https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2023-01.parquet\n2025-06-17 22:16:09,971 - __main__ - INFO - Transforming data...\n2025-06-17 22:16:09,971 - src.transform - INFO - Starting data transformation\n2025-06-17 22:16:10,048 - src.transform - INFO - Data transformation complete. Shape after transformation: (3066766, 19)\n2025-06-17 22:16:10,048 - __main__ - INFO - Loading data to database...\n2025-06-17 22:16:10,048 - src.load - INFO - Loading data into taxi_trips\n2025-06-17 22:16:12,558 - src.load - INFO - Created table taxi_trips\n2025-06-17 22:25:26,193 - src.load - ERROR - Error loading data into taxi_trips: (psycopg2.errors.DiskFull) could not extend file because project size limit (512 MB) has been exceeded\nHINT: This limit is defined by neon.max_cluster_size GUC\n\n[SQL: INSERT INTO taxi_trips (\"VendorID\", tpep_pickup_datetime, tpep_dropoff_datetime, passenger_count, trip_distance, \"RatecodeID\", store_and_fwd_flag, \"PULocationID\", \"DOLocationID\", payment_type, fare_amount, extra, mta_tax, tip_amount, tolls_amount, im ... 458885 characters truncated ... vement_surcharge__999)s, %(total_amount__999)s, %(congestion_surcharge__999)s, %(airport_fee__999)s)]\n[parameters: {'congestion_surcharge__0': 2.5, 'RatecodeID__0': 1.0, 'airport_fee__0': 0.0, 'extra__0': 2.5, 'tip_amount__0': 2.9, 'fare_amount__0': 5.8, 'mta_tax__0': 0.5, 'total_amount__0': 12.7, 'trip_distance__0': 0.7, 'tolls_amount__0': 0.0, 'store_and_fwd_flag__0': 'N', 'passenger_count__0': 1.0, 'improvement_surcharge__0': 1.0, 'tpep_dropoff_datetime__0': datetime.datetime(2023, 1, 29, 15, 6, 34), 'tpep_pickup_datetime__0': datetime.datetime(2023, 1, 29, 15, 3, 37), 'VendorID__0': 1, 'payment_type__0': 1, 'DOLocationID__0': 237, 'PULocationID__0': 229 ... 18900 parameters truncated ... 'tpep_dropoff_datetime__999': datetime.datetime(2023, 1, 29, 15, 35, 28), 'tpep_pickup_datetime__999': datetime.datetime(2023, 1, 29, 15, 27, 23), 'VendorID__999': 1, 'payment_type__999': 1, 'DOLocationID__999': 236, 'PULocationID__999': 262}]\n(Background on this error at: https://sqlalche.me/e/20/e3q8)\n2025-06-17 22:25:26,206 - __main__ - ERROR - Error in ETL pipeline: (psycopg2.errors.DiskFull) could not extend file because project size limit (512 MB) has been exceeded\nHINT: This limit is defined by neon.max_cluster_size GUC\nTraceback (most recent call last):\n File \"/opt/anaconda3/lib/python3.12/site-packages/sqlalchemy/engine/default.py\", line 941, in do_execute\n cursor.execute(statement, parameters)\npsycopg2.errors.DiskFull: could not extend file because project size limit (512 MB) has been exceeded\nHINT: This limit is defined by neon.max_cluster_size GUC\n\n\nThe above exception was the direct cause of the following exception:\n\nTraceback (most recent call last):\n File \"/Users/ronel/Downloads/dev/templates/ETL-pipeline/main.py\", line 39, in main\n load_data(transformed_data, table_name)\n File \"/Users/ronel/Downloads/dev/templates/ETL-pipeline/src/load.py\", line 73, in load_data\n df.to_sql(\n File \"/opt/anaconda3/lib/python3.12/site-packages/pandas/core/generic.py\", line 3087, in to_sql\n return sql.to_sql(\n ^^^^^^^^^^^\n File \"/opt/anaconda3/lib/python3.12/site-packages/pandas/io/sql.py\", line 1119, in insert\n num_inserted = exec_insert(conn, keys, chunk_iter)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File \"/opt/anaconda3/lib/python3.12/site-packages/sqlalchemy/engine/base.py\", line 2118, in _exec_insertmany_context\n dialect.do_execute(\n File \"/opt/anaconda3/lib/python3.12/site-packages/sqlalchemy/engine/default.py\", line 941, in do_execute\n cursor.execute(statement, parameters)\nsqlalchemy.exc.OperationalError: (psycopg2.errors.DiskFull) could not extend file because project size limit (512 MB) has been exceeded\nHINT: This limit is defined by neon.max_cluster_size GUC\n(Background on this error at: https://sqlalche.me/e/20/e3q8)\n", "size": 36476, "language": "unknown" }, "src/__init__.py": { "content": "\"\"\"\nETL Pipeline package.\n\nThis package contains modules for extracting, transforming, and loading 
data.\n\"\"\"\n\n__version__ = '0.1.0'\n", "size": 132, "language": "python" }, "src/utils.py": { "content": "\"\"\"\nUtility functions for the ETL pipeline.\n\"\"\"\nimport os\nimport sys\nimport logging\nfrom logging.handlers import RotatingFileHandler\nfrom typing import Optional, Union, Dict, Any\nimport pandas as pd\nfrom sqlalchemy import create_engine\nfrom sqlalchemy.engine import Engine\n\ndef setup_logging(\n log_level: str = 'INFO',\n log_file: Optional[str] = None,\n log_format: str = '%(asctime)s - %(name)s - %(levelname)s - %(message)s'\n) -> None:\n \"\"\"\n Set up logging configuration.\n \n Args:\n log_level: Logging level (DEBUG, INFO, WARNING, ERROR, CRITICAL)\n log_file: Optional path to log file. If None, logs will only go to console.\n log_format: Format string for log messages\n \"\"\"\n # Convert string log level to logging constant\n numeric_level = getattr(logging, log_level.upper(), None)\n if not isinstance(numeric_level, int):\n raise ValueError(f'Invalid log level: {log_level}')\n \n # Configure root logger\n root_logger = logging.getLogger()\n root_logger.setLevel(numeric_level)\n \n # Clear any existing handlers\n for handler in root_logger.handlers[:]:\n root_logger.removeHandler(handler)\n \n # Create console handler\n console_handler = logging.StreamHandler(sys.stdout)\n console_handler.setFormatter(logging.Formatter(log_format))\n root_logger.addHandler(console_handler)\n \n # Add file handler if log_file is specified\n if log_file and log_file.strip():\n try:\n # Create directory if it doesn't exist\n log_dir = os.path.dirname(log_file)\n if log_dir: # Only try to create directory if path isn't just a filename\n os.makedirs(log_dir, exist_ok=True)\n \n file_handler = RotatingFileHandler(\n log_file,\n maxBytes=10*1024*1024, # 10MB\n backupCount=5\n )\n file_handler.setFormatter(logging.Formatter(log_format))\n root_logger.addHandler(file_handler)\n logging.info(f\"Logging to file: {os.path.abspath(log_file)}\")\n except Exception as e:\n logging.warning(f\"Could not set up file logging: {e}\")\n \n # Set up SQLAlchemy logging if in debug mode\n if log_level.upper() == 'DEBUG':\n logging.getLogger('sqlalchemy.engine').setLevel(logging.INFO)\n\ndef get_db_connection() -> Engine:\n \"\"\"\n Create and return a database connection using SQLAlchemy.\n \n First tries to use DB_CONNECTION_STRING if provided.\n Otherwise falls back to individual connection parameters.\n \n Returns:\n SQLAlchemy engine instance\n \"\"\"\n # Try to get connection string from environment\n connection_string = os.getenv('DB_CONNECTION_STRING')\n \n if connection_string:\n # If using Neon.tech, ensure sslmode is set\n if 'neon.tech' in connection_string and 'sslmode' not in connection_string:\n connection_string += '?sslmode=require' if '?' 
not in connection_string else '&sslmode=require'\n return create_engine(connection_string)\n \n # Fall back to individual parameters if no connection string is provided\n db_config = {\n 'host': os.getenv('DB_HOST', 'localhost'),\n 'port': os.getenv('DB_PORT', '5432'),\n 'database': os.getenv('DB_NAME', 'etl_pipeline'),\n 'user': os.getenv('DB_USER', 'postgres'),\n 'password': os.getenv('DB_PASSWORD', ''),\n 'sslmode': os.getenv('DB_SSLMODE', 'require' if 'neon.tech' in os.getenv('DB_HOST', '') else 'prefer')\n }\n \n connection_string = (\n f\"postgresql://{db_config['user']}:{db_config['password']}@\"\n f\"{db_config['host']}:{db_config['port']}/{db_config['database']}?\"\n f\"sslmode={db_config['sslmode']}\"\n )\n \n return create_engine(connection_string)\n\ndef save_to_csv(df: pd.DataFrame, filepath: str) -> None:\n \"\"\"\n Save a DataFrame to a CSV file.\n \n Args:\n df: Pandas DataFrame to save\n filepath: Path to save the CSV file\n \"\"\"\n # Create the parent directory only when one is present; os.makedirs('') raises\n directory = os.path.dirname(filepath)\n if directory:\n os.makedirs(directory, exist_ok=True)\n df.to_csv(filepath, index=False)\n logging.info(f\"Data saved to {filepath}\")\n\ndef read_csv(filepath: str, **kwargs) -> pd.DataFrame:\n \"\"\"\n Read a CSV file into a pandas DataFrame.\n \n Args:\n filepath: Path to the CSV file\n **kwargs: Additional arguments to pass to pandas.read_csv()\n \n Returns:\n Loaded DataFrame\n \"\"\"\n return pd.read_csv(filepath, **kwargs)\n", "size": 4486, "language": "python" }, "src/transform.py": { "content": "\"\"\"\nData transformation module for the ETL pipeline.\n\"\"\"\nimport logging\nfrom typing import Dict, Any, Optional, List\nimport pandas as pd\nimport numpy as np\nfrom datetime import datetime\n\nlogger = logging.getLogger(__name__)\n\ndef transform_data(\n df: pd.DataFrame,\n config: Optional[Dict[str, Any]] = None\n) -> pd.DataFrame:\n \"\"\"\n Transform the input DataFrame by applying various cleaning and transformation steps.\n \n Args:\n df: Input DataFrame to transform\n config: Optional configuration dictionary with transformation settings\n \n Returns:\n Transformed DataFrame\n \"\"\"\n if config is None:\n config = {}\n \n logger.info(\"Starting data transformation\")\n \n # Make a copy of the DataFrame to avoid modifying the original\n df_transformed = df.copy()\n \n # Apply transformations based on configuration or default behavior\n df_transformed = handle_missing_values(df_transformed, config.get('missing_values', {}))\n df_transformed = convert_data_types(df_transformed, config.get('dtypes', {}))\n df_transformed = standardize_columns(df_transformed, config.get('standardize', {}))\n df_transformed = derive_features(df_transformed, config.get('derived_features', []))\n \n # Apply any custom transformations if specified in config\n if 'custom_transforms' in config and callable(config['custom_transforms']):\n df_transformed = config['custom_transforms'](df_transformed)\n \n logger.info(f\"Data transformation complete. 
Shape after transformation: {df_transformed.shape}\")\n return df_transformed\n\ndef handle_missing_values(\n df: pd.DataFrame,\n config: Dict[str, Any]\n) -> pd.DataFrame:\n \"\"\"\n Handle missing values in the DataFrame based on the configuration.\n \n Args:\n df: Input DataFrame\n config: Configuration for handling missing values\n - 'strategy': 'drop', 'fill', or 'ignore'\n - 'fill_value': Value to use when strategy is 'fill'\n - 'columns': List of columns to apply to (all if None)\n \"\"\"\n if not config or config.get('strategy') == 'ignore':\n return df\n \n columns = config.get('columns', df.columns)\n strategy = config.get('strategy', 'drop')\n \n logger.info(f\"Handling missing values with strategy: {strategy}\")\n \n if strategy == 'drop':\n # Drop rows with missing values in specified columns\n return df.dropna(subset=columns)\n \n elif strategy == 'fill':\n # Fill missing values with specified fill value\n fill_value = config.get('fill_value')\n if fill_value is None:\n # Default fill values based on column type\n for col in columns:\n if col in df.columns:\n if pd.api.types.is_numeric_dtype(df[col]):\n df[col] = df[col].fillna(0)\n elif pd.api.types.is_datetime64_any_dtype(df[col]):\n df[col] = df[col].fillna(pd.NaT)\n else:\n df[col] = df[col].fillna('')\n else:\n df[columns] = df[columns].fillna(fill_value)\n \n return df\n\ndef convert_data_types(\n df: pd.DataFrame,\n type_mapping: Dict[str, str]\n) -> pd.DataFrame:\n \"\"\"\n Convert data types of DataFrame columns based on the provided mapping.\n \n Args:\n df: Input DataFrame\n type_mapping: Dictionary mapping column names to target data types\n Supported types: 'int', 'float', 'str', 'bool', 'datetime', 'category'\n \"\"\"\n if not type_mapping:\n return df\n \n logger.info(\"Converting data types\")\n \n for col, dtype in type_mapping.items():\n if col in df.columns:\n try:\n if dtype == 'datetime':\n df[col] = pd.to_datetime(df[col], errors='coerce')\n elif dtype == 'category':\n df[col] = df[col].astype('category')\n else:\n df[col] = df[col].astype(dtype)\n except Exception as e:\n logger.warning(f\"Could not convert column '{col}' to {dtype}: {str(e)}\")\n \n return df\n\ndef standardize_columns(\n df: pd.DataFrame,\n config: Dict[str, Any]\n) -> pd.DataFrame:\n \"\"\"\n Standardize column names and values.\n \n Args:\n df: Input DataFrame\n config: Configuration for standardization\n - 'lowercase': bool - Convert column names to lowercase\n - 'replace_spaces': str or bool - Replace spaces in column names\n - 'rename': dict - Mapping of old column names to new names\n \"\"\"\n if not config:\n return df\n \n logger.info(\"Standardizing columns\")\n \n # Rename columns if mapping is provided\n if 'rename' in config and isinstance(config['rename'], dict):\n df = df.rename(columns=config['rename'])\n \n # Apply standardizations to column names\n if config.get('lowercase', True):\n df.columns = df.columns.str.lower()\n \n replace_with = config.get('replace_spaces')\n if replace_with is not None and replace_with is not False:\n if replace_with is True:\n replace_with = '_' # Default to underscore if True is provided\n df.columns = df.columns.str.replace(r'\\s+', replace_with, regex=True)\n \n return df\n\ndef derive_features(\n df: pd.DataFrame,\n features_config: List[Dict[str, Any]]\n) -> pd.DataFrame:\n \"\"\"\n Derive new features based on the configuration.\n \n Args:\n df: Input DataFrame\n features_config: List of feature configurations\n Each config should have 'name', 'type', and 'params' keys\n Supported 
types: 'datetime', 'categorical', 'numeric'\n \"\"\"\n if not features_config:\n return df\n \n logger.info(\"Deriving new features\")\n \n for feature in features_config:\n name = feature.get('name')\n feature_type = feature.get('type')\n params = feature.get('params', {})\n \n if not name or not feature_type:\n continue\n \n try:\n if feature_type == 'datetime':\n # Extract datetime components\n column = params.get('source_column')\n if column and column in df.columns:\n if 'extract' in params:\n if params['extract'] == 'hour':\n df[name] = pd.to_datetime(df[column]).dt.hour\n elif params['extract'] == 'day_of_week':\n df[name] = pd.to_datetime(df[column]).dt.dayofweek\n elif params['extract'] == 'month':\n df[name] = pd.to_datetime(df[column]).dt.month\n elif params['extract'] == 'year':\n df[name] = pd.to_datetime(df[column]).dt.year\n elif params['extract'] == 'date':\n df[name] = pd.to_datetime(df[column]).dt.date\n \n elif feature_type == 'categorical':\n # Create categorical features\n column = params.get('source_column')\n if column and column in df.columns:\n if 'bins' in params:\n bins = params['bins']\n labels = params.get('labels', range(len(bins) - 1))\n df[name] = pd.cut(df[column], bins=bins, labels=labels, include_lowest=True)\n \n elif feature_type == 'numeric':\n # Create numeric features\n operation = params.get('operation')\n columns = params.get('columns', [])\n \n if operation == 'sum' and all(col in df.columns for col in columns):\n df[name] = df[columns].sum(axis=1)\n elif operation == 'mean' and all(col in df.columns for col in columns):\n df[name] = df[columns].mean(axis=1)\n elif operation == 'difference' and len(columns) == 2 and all(col in df.columns for col in columns):\n df[name] = df[columns[0]] - df[columns[1]]\n elif operation == 'ratio' and len(columns) == 2 and all(col in df.columns for col in columns):\n df[name] = df[columns[0]] / df[columns[1]].replace(0, np.nan)\n \n except Exception as e:\n logger.warning(f\"Failed to create feature '{name}': {str(e)}\")\n \n return df\n", "size": 8515, "language": "python" }, "src/load.py": { "content": "\"\"\"\nData loading module for the ETL pipeline.\n\"\"\"\nimport logging\nfrom typing import Dict, Any, Optional, Union\nimport pandas as pd\nfrom sqlalchemy import Table, Column, MetaData, exc, inspect\nfrom sqlalchemy.types import (\n Integer, Float, String, DateTime, Boolean, Date, Numeric\n)\n\nfrom .utils import get_db_connection\n\nlogger = logging.getLogger(__name__)\n\n# Map pandas dtypes to SQLAlchemy types\nTYPE_MAPPING = {\n 'int64': Integer,\n 'float64': Float,\n 'object': String(255),\n 'bool': Boolean,\n 'datetime64[ns]': DateTime,\n 'datetime64[ns, UTC]': DateTime(timezone=True),\n 'timedelta64[ns]': String(50), # Storing as string for simplicity\n 'category': String(255),\n 'date': Date,\n}\n\ndef load_data(\n df: pd.DataFrame,\n table_name: str,\n connection_string: Optional[str] = None,\n if_exists: str = 'replace',\n index: bool = False,\n dtype: Optional[Dict] = None,\n chunksize: Optional[int] = None,\n method: Optional[str] = None\n) -> None:\n \"\"\"\n Load a pandas DataFrame into a database table.\n \n Args:\n df: DataFrame to load\n table_name: Name of the target table\n connection_string: Database connection string. If None, uses get_db_connection()\n if_exists: What to do if table exists. 
Options: 'fail', 'replace', 'append'\n index: Whether to write DataFrame index as a column\n dtype: Dictionary specifying the datatype for columns\n chunksize: Number of rows to write at a time\n method: Method to use for SQL insertion\n \"\"\"\n logger.info(f\"Loading data into {table_name}\")\n \n try:\n # Get database connection\n if connection_string:\n from sqlalchemy import create_engine\n engine = create_engine(connection_string)\n else:\n engine = get_db_connection()\n \n # Create table if it doesn't exist and if_exists is 'replace' or 'fail'\n if if_exists in ['replace', 'fail']:\n create_table_from_dataframe(\n df=df,\n table_name=table_name,\n engine=engine,\n if_exists=if_exists,\n dtype=dtype\n )\n \n # Load the data\n df.to_sql(\n name=table_name,\n con=engine,\n if_exists=if_exists,\n index=index,\n dtype=dtype,\n chunksize=chunksize,\n method=method\n )\n \n logger.info(f\"Successfully loaded {len(df)} rows into {table_name}\")\n \n except Exception as e:\n logger.error(f\"Error loading data into {table_name}: {str(e)}\")\n raise\n\ndef create_table_from_dataframe(\n df: pd.DataFrame,\n table_name: str,\n engine,\n if_exists: str = 'fail',\n dtype: Optional[Dict] = None,\n schema: Optional[str] = None\n) -> None:\n \"\"\"\n Create a SQL table from a pandas DataFrame.\n \n Args:\n df: DataFrame to create table from\n table_name: Name of the table to create\n engine: SQLAlchemy engine\n if_exists: What to do if table exists. Options: 'fail', 'replace', 'append'\n dtype: Dictionary specifying the datatype for columns\n schema: Optional schema name\n \"\"\"\n # Table, Column, MetaData, and inspect are already imported at module level\n from sqlalchemy import text\n \n # Create an inspector to check if table exists\n inspector = inspect(engine)\n table_exists = table_name in inspector.get_table_names(schema=schema)\n \n # If table exists and we're not replacing, raise an error\n if table_exists:\n if if_exists == 'fail':\n raise ValueError(f\"Table {table_name} already exists and if_exists='fail'\")\n elif if_exists == 'replace':\n with engine.connect() as conn:\n # SQLAlchemy 2.0 requires textual SQL to be wrapped in text()\n conn.execute(text(f\"DROP TABLE IF EXISTS {table_name}\"))\n conn.commit()\n \n # If table doesn't exist or we're replacing it, create the table\n if not table_exists or if_exists == 'replace':\n # Prepare column definitions\n columns = []\n metadata = MetaData()\n \n for column_name, dtype_name in df.dtypes.items():\n # Get SQLAlchemy type from mapping or use String as default\n sql_type = TYPE_MAPPING.get(str(dtype_name), String(255))\n \n # Override with user-specified type if provided\n if dtype and column_name in dtype:\n sql_type = dtype[column_name]\n \n columns.append(Column(column_name, sql_type))\n \n # Create table\n Table(table_name, metadata, *columns, schema=schema)\n \n try:\n metadata.create_all(engine)\n logger.info(f\"Created table {table_name}\")\n except Exception as e:\n logger.error(f\"Error creating table {table_name}: {str(e)}\")\n raise\n\ndef execute_sql(\n sql: str,\n connection_string: Optional[str] = None,\n params: Optional[Dict] = None,\n return_results: bool = False\n) -> Optional[pd.DataFrame]:\n \"\"\"\n Execute a SQL query and optionally return results as a DataFrame.\n \n Args:\n sql: SQL query to execute\n connection_string: Database connection string. 
If None, uses get_db_connection()\n params: Parameters for the SQL query\n return_results: Whether to return results as a DataFrame\n \n Returns:\n DataFrame with query results if return_results is True, else None\n \"\"\"\n try:\n # Get database connection\n if connection_string:\n from sqlalchemy import create_engine\n engine = create_engine(connection_string)\n else:\n engine = get_db_connection()\n \n if return_results:\n # Execute query and return results as DataFrame\n return pd.read_sql_query(sql, engine, params=params)\n else:\n # Execute SQL without returning results; SQLAlchemy 2.0 needs text()\n # and an explicit commit (connections are no longer autocommit)\n from sqlalchemy import text\n with engine.connect() as connection:\n connection.execute(text(sql), params or {})\n connection.commit()\n \n except Exception as e:\n logger.error(f\"Error executing SQL: {str(e)}\")\n raise\n", "size": 6103, "language": "python" }, "src/extract.py": { "content": "\"\"\"\nData extraction module for the ETL pipeline.\n\"\"\"\nimport os\nimport logging\nimport pandas as pd\nfrom typing import Union, Optional\nfrom urllib.parse import urlparse\n\nlogger = logging.getLogger(__name__)\n\ndef extract_data(source: str, **kwargs) -> pd.DataFrame:\n \"\"\"\n Extract data from a source (URL or file path).\n \n Args:\n source: URL or file path to the data source\n **kwargs: Additional arguments to pass to the appropriate reader\n \n Returns:\n Extracted data as a pandas DataFrame\n \n Raises:\n ValueError: If the source is not a valid URL or file path\n Exception: If there's an error during data extraction\n \"\"\"\n try:\n logger.info(f\"Extracting data from: {source}\")\n \n # Guard against a missing source; os.path.exists(None) raises a TypeError\n if not source:\n raise ValueError(\"No data source provided\")\n \n if is_url(source):\n return extract_from_url(source, **kwargs)\n elif os.path.exists(source):\n return extract_from_file(source, **kwargs)\n else:\n raise ValueError(f\"Source not found or not accessible: {source}\")\n \n except Exception as e:\n logger.error(f\"Error extracting data from {source}: {str(e)}\")\n raise\n\ndef extract_from_url(url: str, **kwargs) -> pd.DataFrame:\n \"\"\"\n Extract data from a URL.\n \n Args:\n url: URL to the data source\n **kwargs: Additional arguments to pass to pandas.read_* functions\n \n Returns:\n Extracted data as a pandas DataFrame\n \"\"\"\n _, ext = os.path.splitext(urlparse(url).path)\n ext = ext.lower()\n \n if ext == '.csv':\n return pd.read_csv(url, **kwargs)\n elif ext in ['.xls', '.xlsx']:\n return pd.read_excel(url, **kwargs)\n elif ext == '.parquet':\n return pd.read_parquet(url, **kwargs)\n elif ext == '.json':\n return pd.read_json(url, **kwargs)\n else:\n raise ValueError(f\"Unsupported file format: {ext}\")\n\ndef extract_from_file(filepath: str, **kwargs) -> pd.DataFrame:\n \"\"\"\n Extract data from a local file.\n \n Args:\n filepath: Path to the data file\n **kwargs: Additional arguments to pass to pandas.read_* functions\n \n Returns:\n Extracted data as a pandas DataFrame\n \"\"\"\n _, ext = os.path.splitext(filepath)\n ext = ext.lower()\n \n if ext == '.csv':\n return pd.read_csv(filepath, **kwargs)\n elif ext in ['.xls', '.xlsx']:\n return pd.read_excel(filepath, **kwargs)\n elif ext == '.parquet':\n return pd.read_parquet(filepath, **kwargs)\n elif ext == '.json':\n return pd.read_json(filepath, **kwargs)\n else:\n raise ValueError(f\"Unsupported file format: {ext}\")\n\ndef is_url(string: str) -> bool:\n \"\"\"\n Check if a string is a valid URL.\n \n Args:\n string: String to check\n \n Returns:\n True if the string is a valid URL, False otherwise\n \"\"\"\n try:\n result = urlparse(string)\n return all([result.scheme, result.netloc])\n except ValueError:\n return False\n", "size": 2986, "language": "python" } }, 
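The committed log ends with the pipeline's final failure mode: inserting all 3,066,766 transformed rows into a Neon project with a 512 MB size cap raises `psycopg2.errors.DiskFull`. Since `load_data` already forwards `chunksize` to `DataFrame.to_sql`, a bounded load only needs the caller to use it. A hypothetical driver, assuming the `.env` variables above are configured; the row cap and batch size are illustrative, not from the repository:

```python
# Sketch: run the same extract -> transform -> load flow, but cap the row
# count and insert in 10,000-row batches instead of one huge statement.
from src.extract import extract_data
from src.transform import transform_data
from src.load import load_data

URL = "https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2023-01.parquet"

df = transform_data(extract_data(URL))
sample = df.head(100_000)  # illustrative cap; the full dataset exceeded the 512 MB limit
load_data(sample, "taxi_trips", chunksize=10_000)
```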
"_cache_metadata": { "url": "https://github.com/ronelsolomon/ETL-pipeline.git", "content_type": "github", "cached_at": "2026-03-02T22:43:46.868441", "cache_key": "37987ddade14ae2a6b6fa13bc2ff9450" } }