# Job Execution Logging System

## Overview
The Job Execution Logging System records every scheduler job execution, whether or not it creates alert instances. This provides complete visibility into scheduler performance, along with debugging and monitoring capabilities.
## Key Features

- ✅ **Complete Job Tracking** - Logs every job execution, not just triggered alerts
- ✅ **Performance Monitoring** - Tracks execution duration, success rates, and failure patterns
- ✅ **Rich Context** - Stores execution data, results, and metadata
- ✅ **Error Handling** - Captures error messages and stack traces
- ✅ **Statistics & Analytics** - Provides aggregated statistics and trends
- ✅ **API Access** - RESTful endpoints for querying logs programmatically
## Database Schema

### Main Table: `logs.job_execution_logs`
```sql
-- Core identification
id UUID PRIMARY KEY
job_id VARCHAR NOT NULL       -- APScheduler job ID (e.g., "alert_123")
job_name VARCHAR NOT NULL     -- Human-readable job name
job_type VARCHAR NOT NULL     -- 'alert', 'cleanup', 'sync', etc.

-- Context
alert_id UUID                 -- Reference to app.alerts.id
client_id UUID                -- Reference to app.clients.id
execution_id UUID NOT NULL    -- Unique execution instance

-- Execution details
started_at TIMESTAMP NOT NULL
completed_at TIMESTAMP
duration_ms INTEGER           -- Execution duration in milliseconds

-- Status tracking
status VARCHAR NOT NULL       -- 'running', 'completed', 'failed', 'cancelled', 'timeout'
result_status VARCHAR         -- 'triggered', 'no_trigger', 'error', 'skipped'

-- Results
instances_created INTEGER DEFAULT 0
tasks_created INTEGER DEFAULT 0
records_processed INTEGER DEFAULT 0

-- Error handling
error_message TEXT
error_traceback TEXT
retry_count INTEGER DEFAULT 0

-- Data storage
execution_data JSONB          -- Data used during execution
result_data JSONB             -- Results from execution
metadata JSONB                -- Additional metadata
```
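As an illustration of how `duration_ms` relates to the timestamp columns, here is a minimal sketch of building a row for this table. The `build_log_entry` helper and its exact field choices are hypothetical, not part of the actual codebase:

```python
from datetime import datetime, timezone
from uuid import uuid4


def build_log_entry(job_id: str, job_name: str, job_type: str,
                    started_at: datetime, completed_at: datetime) -> dict:
    """Assemble a dict mirroring the logs.job_execution_logs columns.

    Hypothetical helper for illustration; duration_ms is derived from
    started_at/completed_at, as the schema comment describes.
    """
    return {
        "id": str(uuid4()),
        "job_id": job_id,
        "job_name": job_name,
        "job_type": job_type,
        "execution_id": str(uuid4()),
        "started_at": started_at,
        "completed_at": completed_at,
        "duration_ms": int((completed_at - started_at).total_seconds() * 1000),
        "status": "completed",
    }


start = datetime(2024, 1, 15, 10, 30, 0, tzinfo=timezone.utc)
end = datetime(2024, 1, 15, 10, 30, 5, tzinfo=timezone.utc)
entry = build_log_entry("alert_456", "Alert: High CPL Alert", "alert", start, end)
print(entry["duration_ms"])  # 5000
```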
### Views

- `logs.recent_job_executions` - Easy querying with alert/client names
- `logs.job_execution_stats` - Aggregated statistics per job
## API Endpoints

### 1. Get Job Executions

```
GET /api/v1/alerts/job-executions
```
Query Parameters:
- `job_id` - Filter by job ID
- `alert_id` - Filter by alert ID
- `client_id` - Filter by client ID
- `status` - Filter by status (`running`, `completed`, `failed`, etc.)
- `job_type` - Filter by job type (`alert`, `cleanup`, etc.)
- `limit` - Number of results (default: 50, max: 100)
- `offset` - Pagination offset (default: 0)
Example Response:
```json
{
  "status": "success",
  "data": [
    {
      "id": "123e4567-e89b-12d3-a456-426614174000",
      "job_id": "alert_456",
      "job_name": "Alert: High CPL Alert",
      "job_type": "alert",
      "alert_id": "456e7890-e89b-12d3-a456-426614174001",
      "alert_name": "High CPL Alert",
      "client_id": "789e0123-e89b-12d3-a456-426614174002",
      "client_name": "Acme Corp",
      "execution_id": "abc12345-e89b-12d3-a456-426614174003",
      "started_at": "2024-01-15T10:30:00Z",
      "completed_at": "2024-01-15T10:30:05Z",
      "duration_ms": 5000,
      "status": "completed",
      "result_status": "triggered",
      "instances_created": 2,
      "tasks_created": 1,
      "records_processed": 1,
      "error_message": null,
      "retry_count": 0
    }
  ],
  "pagination": {
    "limit": 50,
    "offset": 0,
    "count": 1
  }
}
```
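To retrieve more than one page of results, a client advances `offset` by `limit` until a page comes back with fewer rows than `limit`. A minimal sketch of that loop, where `fetch_page` is a stand-in for whatever HTTP client you use against this endpoint:

```python
from typing import Callable, Iterator


def iter_job_executions(fetch_page: Callable[[int, int], list],
                        limit: int = 50) -> Iterator[dict]:
    """Yield executions across pages using the limit/offset parameters.

    fetch_page(limit, offset) stands in for a call to
    GET /api/v1/alerts/job-executions; stop when a short page comes back.
    """
    offset = 0
    while True:
        page = fetch_page(limit, offset)
        yield from page
        if len(page) < limit:
            break
        offset += limit


# Fake backend with 120 rows to exercise the pagination logic.
rows = [{"id": i} for i in range(120)]
fake_fetch = lambda limit, offset: rows[offset:offset + limit]
fetched = list(iter_job_executions(fake_fetch, limit=50))
print(len(fetched))  # 120
```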
### 2. Get Job Execution Statistics

```
GET /api/v1/alerts/job-executions/statistics
```
Query Parameters:
- `job_id` - Filter by job ID
- `alert_id` - Filter by alert ID
- `client_id` - Filter by client ID
- `days` - Number of days to analyze (default: 30, max: 365)
Example Response:
```json
{
  "status": "success",
  "data": {
    "total_executions": 150,
    "successful_executions": 145,
    "failed_executions": 5,
    "triggered_count": 23,
    "no_trigger_count": 122,
    "avg_duration_ms": 3250.5,
    "last_execution": "2024-01-15T10:30:00Z",
    "first_execution": "2024-01-01T00:00:00Z",
    "success_rate": 96.67,
    "trigger_rate": 15.33,
    "recent_executions": [...]
  }
}
```
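The `success_rate` and `trigger_rate` fields follow directly from the counts. A quick sketch of the assumed formulas (percentage of total executions, rounded to two decimals), checked against the example numbers above:

```python
def pct(part: int, total: int) -> float:
    """Percentage rounded to two decimals; 0.0 when there are no executions."""
    return round(part / total * 100, 2) if total else 0.0


total = 150
successful = 145
triggered = 23

print(pct(successful, total))  # 96.67 -> success_rate
print(pct(triggered, total))   # 15.33 -> trigger_rate
```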
### 3. Get Failed Job Executions

```
GET /api/v1/alerts/job-executions/failed
```
Query Parameters:
- `hours` - Hours to look back (default: 24, max: 168)
- `limit` - Number of results (default: 50, max: 100)
## Usage Examples

### 1. Check Recent Job Executions

```bash
curl -H "Authorization: Bearer YOUR_TOKEN" \
  "http://localhost:8000/api/v1/alerts/job-executions?limit=10"
```
### 2. Get Statistics for Specific Alert

```bash
curl -H "Authorization: Bearer YOUR_TOKEN" \
  "http://localhost:8000/api/v1/alerts/job-executions/statistics?alert_id=456e7890-e89b-12d3-a456-426614174001&days=7"
```
### 3. Check Failed Executions

```bash
curl -H "Authorization: Bearer YOUR_TOKEN" \
  "http://localhost:8000/api/v1/alerts/job-executions/failed?hours=48"
```
### 4. Monitor Specific Client

```bash
curl -H "Authorization: Bearer YOUR_TOKEN" \
  "http://localhost:8000/api/v1/alerts/job-executions?client_id=789e0123-e89b-12d3-a456-426614174002&status=completed"
```
## Database Queries

### 1. Get All Job Executions for Last 24 Hours

```sql
SELECT
  jel.job_id,
  jel.job_name,
  jel.status,
  jel.result_status,
  jel.started_at,
  jel.duration_ms,
  a.name AS alert_name,
  c.name AS client_name
FROM logs.job_execution_logs jel
LEFT JOIN app.alerts a ON jel.alert_id = a.id
LEFT JOIN app.clients c ON jel.client_id = c.id
WHERE jel.started_at >= NOW() - INTERVAL '24 hours'
ORDER BY jel.started_at DESC;
```
### 2. Get Job Performance Statistics

```sql
SELECT
  job_id,
  job_name,
  COUNT(*) AS total_executions,
  COUNT(CASE WHEN status = 'completed' THEN 1 END) AS successful,
  COUNT(CASE WHEN status = 'failed' THEN 1 END) AS failed,
  AVG(duration_ms) AS avg_duration_ms,
  COUNT(CASE WHEN result_status = 'triggered' THEN 1 END) AS triggered_count
FROM logs.job_execution_logs
WHERE started_at >= NOW() - INTERVAL '7 days'
GROUP BY job_id, job_name
ORDER BY total_executions DESC;
```
### 3. Find Jobs That Haven't Run Recently

```sql
SELECT
  a.name AS alert_name,
  a.frequency,
  a.last_checked,
  a.next_check,
  jel.last_execution
FROM app.alerts a
LEFT JOIN (
  SELECT
    alert_id,
    MAX(started_at) AS last_execution
  FROM logs.job_execution_logs
  GROUP BY alert_id
) jel ON a.id = jel.alert_id
WHERE a.is_active = true
  AND (jel.last_execution IS NULL OR jel.last_execution < NOW() - INTERVAL '2 days');
```
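The same staleness check can be applied in application code when deciding whether to flag an alert. A hypothetical helper mirroring the 2-day window used in the query above:

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def is_stale(last_execution: Optional[datetime],
             max_age: timedelta = timedelta(days=2),
             now: Optional[datetime] = None) -> bool:
    """True if an active alert has never run, or last ran before the window."""
    now = now or datetime.now(timezone.utc)
    return last_execution is None or last_execution < now - max_age


now = datetime(2024, 1, 15, 12, 0, tzinfo=timezone.utc)
print(is_stale(None, now=now))                                               # True
print(is_stale(datetime(2024, 1, 14, 12, 0, tzinfo=timezone.utc), now=now))  # False
print(is_stale(datetime(2024, 1, 10, 12, 0, tzinfo=timezone.utc), now=now))  # True
```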
## Monitoring & Alerting

### 1. High Failure Rate Alert

```sql
-- Jobs with >10% failure rate in last 24 hours
SELECT
  job_id,
  job_name,
  COUNT(*) AS total_executions,
  COUNT(CASE WHEN status = 'failed' THEN 1 END) AS failed_executions,
  (COUNT(CASE WHEN status = 'failed' THEN 1 END)::float / COUNT(*) * 100) AS failure_rate
FROM logs.job_execution_logs
WHERE started_at >= NOW() - INTERVAL '24 hours'
GROUP BY job_id, job_name
HAVING (COUNT(CASE WHEN status = 'failed' THEN 1 END)::float / COUNT(*) * 100) > 10;
```
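If you aggregate execution logs in application code instead of SQL, the same threshold check might look like the following sketch (the `failing_jobs` helper is hypothetical; it groups rows by `job_id` and keeps jobs whose failure rate exceeds the threshold, just as the query above does):

```python
from collections import Counter


def failing_jobs(executions: list, threshold_pct: float = 10.0) -> dict:
    """Return job_id -> failure rate (%) for jobs above the threshold."""
    totals = Counter()
    failed = Counter()
    for ex in executions:
        totals[ex["job_id"]] += 1
        if ex["status"] == "failed":
            failed[ex["job_id"]] += 1
    rates = {job: failed[job] * 100 / totals[job] for job in totals}
    return {job: rate for job, rate in rates.items() if rate > threshold_pct}


sample = (
    [{"job_id": "alert_1", "status": "completed"}] * 18
    + [{"job_id": "alert_1", "status": "failed"}] * 2   # exactly 10% -> not flagged
    + [{"job_id": "alert_2", "status": "completed"}] * 7
    + [{"job_id": "alert_2", "status": "failed"}] * 3   # 30% -> flagged
)
print(failing_jobs(sample))  # {'alert_2': 30.0}
```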
### 2. Slow Job Performance Alert

```sql
-- Jobs taking longer than 30 seconds
SELECT
  job_id,
  job_name,
  started_at,
  duration_ms,
  error_message
FROM logs.job_execution_logs
WHERE duration_ms > 30000
  AND started_at >= NOW() - INTERVAL '24 hours'
ORDER BY duration_ms DESC;
```
### 3. Missing Job Executions

```sql
-- Alerts that should have run but didn't
SELECT
  a.name,
  a.frequency,
  a.next_check,
  jel.last_execution
FROM app.alerts a
LEFT JOIN (
  SELECT
    alert_id,
    MAX(started_at) AS last_execution
  FROM logs.job_execution_logs
  WHERE status = 'completed'
  GROUP BY alert_id
) jel ON a.id = jel.alert_id
WHERE a.is_active = true
  AND a.next_check < NOW()
  AND (jel.last_execution IS NULL OR jel.last_execution < a.next_check);
```
## Maintenance

### Cleanup Old Logs

```python
from services.job_execution_logger import cleanup_old_logs

# Clean up logs older than 30 days
deleted_count = await cleanup_old_logs(days_to_keep=30)
print(f"Cleaned up {deleted_count} old log entries")
```
### Automated Cleanup (Cron Job)

```bash
# Add to crontab to run daily at 2 AM
0 2 * * * cd /path/to/backend && python -c "import asyncio; from services.job_execution_logger import cleanup_old_logs; asyncio.run(cleanup_old_logs(30))"
```
## Migration

To set up the job execution logging system:

1. Run the migration:

   ```bash
   cd /path/to/backend
   python run_job_logs_migration.py
   ```

2. Restart the application to enable job logging.

3. Verify the setup:

   ```bash
   curl -H "Authorization: Bearer YOUR_TOKEN" \
     "http://localhost:8000/api/v1/alerts/job-executions?limit=5"
   ```
## Benefits

- **Complete Visibility** - See every job execution, not just successful ones
- **Performance Monitoring** - Track execution times and identify slow jobs
- **Error Tracking** - Capture and analyze failure patterns
- **Debugging** - Rich context data for troubleshooting
- **Analytics** - Historical trends and statistics
- **Compliance** - Audit trail of all automated activities
- **Optimization** - Data-driven insights for improving performance
This system gives you complete visibility into scheduler operations, so you can monitor, debug, and optimize your alert system effectively.