Installing and Setting Up Apache Airflow
This guide explains how to install and configure Apache Airflow with support for asynchronous tasks, Celery, PostgreSQL, and Kubernetes. The steps cover installing Airflow, initializing its metadata database, creating an admin user, and starting the scheduler and webserver. The result works as a local development environment, and the same steps extend to a scalable production setup once the PostgreSQL, Celery, or Kubernetes backends are configured.
Prerequisites
Before proceeding, ensure you have the following:
- Python 3.12: Airflow 2.10.3 supports Python 3.12 and publishes a matching constraint file (constraints-3.12.txt) for reproducible installs.
- pip: The Python package manager to install Airflow and its dependencies.
- PostgreSQL: If using PostgreSQL as the metadata database (recommended for production).
- Celery: For distributed task execution (optional, included in the installation).
- Kubernetes: For running Airflow in a Kubernetes cluster (optional, included in the installation).
- Sufficient permissions: To create directories and run background processes.
- Virtual environment (recommended): To isolate dependencies. Create one with:
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
Installation Steps
1. Install Apache Airflow
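Install Airflow 2.10.3 with the async, Celery, PostgreSQL, and Kubernetes extras, pinning dependencies with the official constraint file for Python 3.12:
pip install "apache-airflow[async,celery,postgres,cncf.kubernetes]==2.10.3" --constraint "https://raw.githubusercontent.com/apache/airflow/constraints-2.10.3/constraints-3.12.txt"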
2. Set Up the Airflow Home Directory
Airflow requires a home directory to store its configuration, logs, and DAGs. The following Python script sets the AIRFLOW_HOME environment variable and creates the directory if it doesn't exist.
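import os
# Point Airflow at the project-local home directory
os.environ['AIRFLOW_HOME'] = '<your_project_path>/airflow'
# Create the directory at that exact path (not relative to the current directory)
os.makedirs(os.environ['AIRFLOW_HOME'], exist_ok=True)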
Replace <your_project_path> with the absolute path to your project directory (e.g., /home/user/BTC-USDT-ETL-Pipeline). For example:
os.environ['AIRFLOW_HOME'] = '/home/user/BTC-USDT-ETL-Pipeline/airflow'
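The airflow directory in your project path stores Airflow's configuration files, logs, and SQLite database (if not using PostgreSQL). A variable set through os.environ applies only to that Python process and its children, so also export it in the shell where you run the airflow commands below (export AIRFLOW_HOME=<your_project_path>/airflow). Otherwise Airflow falls back to ~/airflow.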
3. Initialize the Airflow Database
Initialize the Airflow metadata database, which stores DAG runs, task instances, and other metadata. This step is required before starting the scheduler or webserver.
# Initialize the metadata database (safe to re-run; existing metadata and DAG files are kept)
airflow db init
Note:
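- This command creates a default airflow.cfg configuration file in AIRFLOW_HOME if one does not exist yet.
- If using PostgreSQL, ensure the database is running and set sql_alchemy_conn in the [database] section of airflow.cfg to point to your PostgreSQL instance (e.g., postgresql+psycopg2://user:password@localhost:5432/airflow) before initializing.
- Running airflow db init does not reset anything: it creates or upgrades the metadata tables and never touches the dags folder. To wipe the metadata deliberately, use airflow db reset.
- Since Airflow 2.7, airflow db init is deprecated in favor of airflow db migrate. It still works in 2.10.3 but prints a deprecation warning.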
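Passwords containing characters such as @ or / break the connection string unless they are percent-encoded. The sketch below shows one way to build the URI safely. The helper name build_postgres_uri and the credentials are placeholders, not part of Airflow. AIRFLOW__DATABASE__SQL_ALCHEMY_CONN is the environment-variable form of the sql_alchemy_conn setting and can be used instead of editing airflow.cfg.

```python
from urllib.parse import quote


def build_postgres_uri(user, password, host="localhost", port=5432, database="airflow"):
    """Return a SQLAlchemy PostgreSQL URI with user and password percent-encoded.

    Hypothetical helper for illustration only.
    """
    safe_user = quote(user, safe="")
    safe_password = quote(password, safe="")
    return f"postgresql+psycopg2://{safe_user}:{safe_password}@{host}:{port}/{database}"


# Placeholder credentials; substitute your own.
uri = build_postgres_uri("airflow_user", "p@ss/word")

# Print a line that can be pasted into the shell before running `airflow db init`.
print(f"export AIRFLOW__DATABASE__SQL_ALCHEMY_CONN='{uri}'")
```

Setting the value through the environment keeps the password out of airflow.cfg, which may matter if that file is committed to version control.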
4. Create an Admin User
The Airflow webserver requires at least one admin user for login. Create an admin user with the following command:
# Create admin user (critical—webserver needs this for login)
airflow users create \
--username admin \
--firstname Admin \
--lastname User \
--role Admin \
--email admin@example.com \
--password admin
This command creates a user with:
- Username: admin
- Password: admin (change this in production for security)
- Role: Admin (grants full access to the Airflow UI)
To verify the user was created successfully, list all users:
# Verify user creation
airflow users list
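For anything beyond a throwaway local setup, a random password is safer than admin. The following is a minimal sketch that generates one and prints the matching airflow users create command. The function name admin_create_command is hypothetical; the CLI flags are the ones used in the step above.

```python
import secrets
import shlex


def admin_create_command(username, email, password):
    """Build the `airflow users create` command line as a shell-safe string."""
    args = [
        "airflow", "users", "create",
        "--username", username,
        "--firstname", "Admin",
        "--lastname", "User",
        "--role", "Admin",
        "--email", email,
        "--password", password,
    ]
    return shlex.join(args)


# token_hex yields only 0-9 and a-f, so the value can never start with "-"
# and be mistaken for a command-line option.
password = secrets.token_hex(16)
print(admin_create_command("admin", "admin@example.com", password))
```

Store the generated password in a password manager; Airflow keeps only a hash of it in the metadata database.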
5. Start the Airflow Scheduler
The scheduler is responsible for scheduling and executing DAGs. Start it in the background using nohup to ensure it continues running.
# Start scheduler first (it needs DB)
nohup airflow scheduler > airflow/scheduler.log 2>&1 &
Notes:
- The scheduler requires the database to be initialized first.
- Logs are redirected to scheduler.log in the specified directory.
- The airflow/ prefix in the log path is relative to the current directory. Replace it with your AIRFLOW_HOME path if you run the command from elsewhere.
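If you prefer to launch the scheduler from Python rather than with nohup, the sketch below is one possible approach. It passes AIRFLOW_HOME explicitly to the child process and writes the log to an absolute path, which avoids the relative-path issue noted above. The names scheduler_launch_spec and start_scheduler are illustrative, and the code assumes the airflow executable is on PATH.

```python
import os
import subprocess
from pathlib import Path


def scheduler_launch_spec(airflow_home):
    """Return (command, environment, log path) for a background scheduler."""
    home = Path(airflow_home).expanduser().resolve()
    env = {**os.environ, "AIRFLOW_HOME": str(home)}
    return ["airflow", "scheduler"], env, home / "scheduler.log"


def start_scheduler(airflow_home):
    """Start `airflow scheduler` detached from the current terminal session."""
    cmd, env, log_path = scheduler_launch_spec(airflow_home)
    log_path.parent.mkdir(parents=True, exist_ok=True)
    with open(log_path, "ab") as log_file:
        return subprocess.Popen(
            cmd,
            env=env,
            stdout=log_file,
            stderr=subprocess.STDOUT,
            start_new_session=True,  # POSIX only: survive the terminal closing
        )


# Usage (requires Airflow to be installed):
# process = start_scheduler("/home/user/BTC-USDT-ETL-Pipeline/airflow")
# print(process.pid)
```

Keeping the process ID makes it easier to stop the scheduler later than searching for it with ps.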
6. Start the Airflow Webserver
The webserver provides the Airflow UI for managing DAGs, viewing task logs, and monitoring runs. Start it on port 8081 (or another port if needed).
airflow webserver --port 8081 > airflow/airflow.log 2>&1 &
Notes:
- The webserver listens on http://localhost:8081 because of the --port flag; Airflow's default port is 8080.
- Logs are redirected to airflow.log in the AIRFLOW_HOME directory, assuming the command is run from the directory that contains it.
- Access the UI by navigating to http://localhost:8081 in your browser and logging in with the admin credentials (username: admin, password: admin).
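Before opening the browser, you can confirm that the webserver, metadata database, and scheduler are up by querying the webserver's /health endpoint, which returns a JSON status per component. The following is a rough sketch; the function names are illustrative, and treating a null status as "component not running" is an assumption made here for optional components such as the triggerer.

```python
import json
from urllib.request import urlopen


def unhealthy_components(health):
    """Return the sorted names of components reporting a non-healthy status.

    A status of None is assumed to mean the component is not running at all
    and is skipped rather than reported.
    """
    return sorted(
        name
        for name, info in health.items()
        if info.get("status") not in ("healthy", None)
    )


def fetch_health(base_url="http://localhost:8081", timeout=5.0):
    """Fetch and decode the /health payload from a running webserver."""
    with urlopen(f"{base_url}/health", timeout=timeout) as response:
        return json.load(response)


# Hand-written sample payload for illustration, not real server output.
sample = {
    "metadatabase": {"status": "healthy"},
    "scheduler": {"status": "unhealthy", "latest_scheduler_heartbeat": None},
}
print(unhealthy_components(sample))
```

Against a live instance, call unhealthy_components(fetch_health()); an empty list suggests that the running components are reporting healthy.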
Additional Notes
- Configuration: After running airflow db init, review and modify airflow.cfg in the AIRFLOW_HOME directory to customize settings (e.g., executor type, database connection, or Celery broker).
- Celery Setup: If using the Celery executor, ensure a message broker (e.g., Redis or RabbitMQ) is running and configured in airflow.cfg.
- Kubernetes Executor: For Kubernetes, configure the Kubernetes executor in airflow.cfg and ensure your Kubernetes cluster is accessible.
- Security: Change the default admin password and secure the database connection in production environments.
- Logs: Check scheduler.log and airflow.log for troubleshooting.
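Any airflow.cfg option can also be set through an environment variable named AIRFLOW__{SECTION}__{KEY}, which takes precedence over the file. As a hedged illustration, the sketch below builds the variables for a Celery executor setup. The helper names are hypothetical, and the Redis and PostgreSQL URLs are placeholders for your own broker and result backend.

```python
import shlex


def airflow_env_var(section, key):
    """Return the environment-variable name for an airflow.cfg option."""
    return f"AIRFLOW__{section.upper()}__{key.upper()}"


def celery_executor_env(broker_url, result_backend):
    """Return the environment overrides for running with the Celery executor."""
    return {
        airflow_env_var("core", "executor"): "CeleryExecutor",
        airflow_env_var("celery", "broker_url"): broker_url,
        airflow_env_var("celery", "result_backend"): result_backend,
    }


# Placeholder URLs; point these at your own services.
env = celery_executor_env(
    "redis://localhost:6379/0",
    "db+postgresql://user:password@localhost:5432/airflow",
)
for name, value in env.items():
    print(f"export {name}={shlex.quote(value)}")
```

The same pattern applies to the Kubernetes executor by setting the executor value to KubernetesExecutor.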
Next Steps
- Place your DAGs in the AIRFLOW_HOME/dags folder to start defining workflows.
- Explore the Airflow UI to monitor and manage your DAGs.
- Refer to the Apache Airflow documentation for advanced configurations.