Cron jobs


PostgreSQL Automation: pg_cron

In PostgreSQL, the de facto standard for scheduling tasks is the pg_cron extension. Because Postgres has no native internal scheduler, pg_cron lets you run maintenance and ETL tasks directly from SQL using standard cron syntax.

Setup and Configuration

pg_cron runs as a background worker and must be loaded when the database starts:

  1. Modify postgresql.conf: Add pg_cron to shared_preload_libraries.

    shared_preload_libraries = 'pg_cron'
    cron.database_name = 'postgres'  # The DB where job metadata is stored
  2. Restart Postgres: Required for the library to load.

  3. Create the Extension: Run CREATE EXTENSION pg_cron; in your database.
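Putting steps 1–3 together, a minimal sketch of the SQL side (the GRANT is optional; etl_user is a placeholder for one of your own roles — pg_cron runs each job as the user that scheduled it):

```sql
-- Run once, as a superuser, in the database named in cron.database_name:
CREATE EXTENSION pg_cron;

-- Optionally allow a non-superuser role to schedule its own jobs:
GRANT USAGE ON SCHEMA cron TO etl_user;
```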

Scheduling Syntax

The syntax follows the standard Linux cron format: (minute, hour, day-of-month, month, day-of-week).

-- Schedule a task: SELECT cron.schedule('job_name', 'schedule', 'SQL_command');
SELECT cron.schedule('nightly-maintenance', '0 3 * * *', 'VACUUM ANALYZE;');
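Jobs are managed by name: calling cron.schedule again with the same name updates the existing job, and cron.unschedule removes it.

```sql
-- Change the schedule of the job created above (same name = update in place):
SELECT cron.schedule('nightly-maintenance', '30 3 * * *', 'VACUUM ANALYZE;');

-- Remove the job entirely:
SELECT cron.unschedule('nightly-maintenance');
```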

Vital Maintenance: The VACUUM

Postgres uses MVCC (Multi-Version Concurrency Control). When you UPDATE or DELETE a row, the old version is not immediately removed—it remains on disk as a "dead tuple" (bloat).

  • VACUUM: Reclaims the space occupied by dead tuples so it can be reused by new data.

  • ANALYZE: Updates the statistics used by the query planner. Without fresh stats, the optimizer may choose a slow sequential scan over a fast index scan.

  • The "Pro" Strategy: While Postgres has an autovacuum daemon, Data Engineers often schedule a manual VACUUM ANALYZE at 3:00 AM or immediately following a massive bulk data load to ensure peak performance.
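To judge whether a manual VACUUM is actually needed, you can check the dead-tuple counters Postgres already tracks in pg_stat_user_tables:

```sql
-- Tables carrying the most bloat (dead tuples), and when autovacuum last ran:
SELECT relname,
       n_live_tup,
       n_dead_tup,
       last_autovacuum
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC
LIMIT 10;
```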


Common Data Engineering Use Cases

| Use Case       | SQL Command Example                                              | Frequency |
| -------------- | ---------------------------------------------------------------- | --------- |
| Reporting      | REFRESH MATERIALIZED VIEW CONCURRENTLY sales_sum;                | Hourly    |
| Partitioning   | CALL create_next_month_partitions();                             | Monthly   |
| Data Retention | DELETE FROM logs WHERE created_at < NOW() - INTERVAL '90 days';  | Weekly    |
| System Health  | VACUUM ANALYZE;                                                  | Nightly   |
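Each row in the table above is just a cron.schedule call. For example, the data-retention job could be registered like this (the logs table and 90-day window come from the example above; dollar-quoting avoids escaping the inner quotes):

```sql
SELECT cron.schedule(
  'purge-old-logs',   -- job name
  '0 2 * * 0',        -- 02:00 every Sunday
  $$DELETE FROM logs WHERE created_at < NOW() - INTERVAL '90 days'$$
);
```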


Monitoring and Alerting

Since cron jobs run in the background, you must monitor the pg_cron metadata tables to catch failures:

  • cron.job: View all currently scheduled tasks.

  • cron.job_run_details: The most critical table for DEs. It shows:

    • Status: (succeeded or failed)

    • Return Message: Contains the specific error if a job crashed.

    • Start/End Time: Helps identify jobs that are taking longer than expected.

Simple Alerting Query

You can create a view or a script that checks for recent failures:
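A minimal sketch of such a check, querying cron.job_run_details for anything that failed in the last 24 hours:

```sql
SELECT jobid,
       command,
       status,
       return_message,
       start_time,
       end_time
FROM cron.job_run_details
WHERE status = 'failed'
  AND start_time > NOW() - INTERVAL '24 hours'
ORDER BY start_time DESC;
```

Note that cron.job_run_details grows without bound; periodically deleting its old rows is itself a good candidate for a scheduled job.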
