# Job Execution Logging System

## Overview
The Job Execution Logging System records every scheduler job execution, whether or not it creates alert instances. This provides complete visibility into scheduler performance, along with debugging and monitoring capabilities.
## Key Features

- ✅ **Complete Job Tracking** - Logs every job execution, not just triggered alerts
- ✅ **Performance Monitoring** - Tracks execution duration, success rates, and failure patterns
- ✅ **Rich Context** - Stores execution data, results, and metadata
- ✅ **Error Handling** - Captures error messages and stack traces
- ✅ **Statistics & Analytics** - Provides aggregated statistics and trends
- ✅ **API Access** - RESTful endpoints for querying logs programmatically
## Database Schema

### Main Table: `logs.job_execution_logs`
```sql
-- Core identification
id UUID PRIMARY KEY
job_id VARCHAR NOT NULL       -- APScheduler job ID (e.g., "alert_123")
job_name VARCHAR NOT NULL     -- Human-readable job name
job_type VARCHAR NOT NULL     -- 'alert', 'cleanup', 'sync', etc.

-- Context
alert_id UUID                 -- Reference to app.alerts.id
client_id UUID                -- Reference to app.clients.id
execution_id UUID NOT NULL    -- Unique execution instance

-- Execution details
started_at TIMESTAMP NOT NULL
completed_at TIMESTAMP
duration_ms INTEGER           -- Execution duration in milliseconds

-- Status tracking
status VARCHAR NOT NULL       -- 'running', 'completed', 'failed', 'cancelled', 'timeout'
result_status VARCHAR         -- 'triggered', 'no_trigger', 'error', 'skipped'

-- Results
instances_created INTEGER DEFAULT 0
tasks_created INTEGER DEFAULT 0
records_processed INTEGER DEFAULT 0

-- Error handling
error_message TEXT
error_traceback TEXT
retry_count INTEGER DEFAULT 0

-- Data storage
execution_data JSONB          -- Data used during execution
result_data JSONB             -- Results from execution
metadata JSONB                -- Additional metadata
```
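As an illustration of how `duration_ms` relates to the timestamp columns, here is a minimal sketch of building a row for this table. The `build_log_entry` helper and its exact field choices are hypothetical, not part of the actual codebase:

```python
from datetime import datetime, timezone
from uuid import uuid4


def build_log_entry(job_id: str, job_name: str, job_type: str,
                    started_at: datetime, completed_at: datetime) -> dict:
    """Assemble a dict mirroring the logs.job_execution_logs columns.

    Hypothetical helper for illustration; duration_ms is derived from
    started_at/completed_at, as the schema comment describes.
    """
    return {
        "id": str(uuid4()),
        "job_id": job_id,
        "job_name": job_name,
        "job_type": job_type,
        "execution_id": str(uuid4()),
        "started_at": started_at,
        "completed_at": completed_at,
        "duration_ms": int((completed_at - started_at).total_seconds() * 1000),
        "status": "completed",
    }


start = datetime(2024, 1, 15, 10, 30, 0, tzinfo=timezone.utc)
end = datetime(2024, 1, 15, 10, 30, 5, tzinfo=timezone.utc)
entry = build_log_entry("alert_456", "Alert: High CPL Alert", "alert", start, end)
print(entry["duration_ms"])  # 5000
```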
### Views

- `logs.recent_job_executions` - Easy querying with alert/client names
- `logs.job_execution_stats` - Aggregated statistics per job
## API Endpoints

### 1. Get Job Executions

```
GET /api/v1/alerts/job-executions
```
Query Parameters:
- `job_id` - Filter by job ID
- `alert_id` - Filter by alert ID
- `client_id` - Filter by client ID
- `status` - Filter by status (`running`, `completed`, `failed`, etc.)
- `job_type` - Filter by job type (`alert`, `cleanup`, etc.)
- `limit` - Number of results (default: 50, max: 100)
- `offset` - Pagination offset (default: 0)
Example Response:
```json
{
  "status": "success",
  "data": [
    {
      "id": "123e4567-e89b-12d3-a456-426614174000",
      "job_id": "alert_456",
      "job_name": "Alert: High CPL Alert",
      "job_type": "alert",
      "alert_id": "456e7890-e89b-12d3-a456-426614174001",
      "alert_name": "High CPL Alert",
      "client_id": "789e0123-e89b-12d3-a456-426614174002",
      "client_name": "Acme Corp",
      "execution_id": "abc12345-e89b-12d3-a456-426614174003",
      "started_at": "2024-01-15T10:30:00Z",
      "completed_at": "2024-01-15T10:30:05Z",
      "duration_ms": 5000,
      "status": "completed",
      "result_status": "triggered",
      "instances_created": 2,
      "tasks_created": 1,
      "records_processed": 1,
      "error_message": null,
      "retry_count": 0
    }
  ],
  "pagination": {
    "limit": 50,
    "offset": 0,
    "count": 1
  }
}
```
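To retrieve more than one page of results, a client advances `offset` by `limit` until a page comes back with fewer rows than `limit`. A minimal sketch of that loop, where `fetch_page` is a stand-in for whatever HTTP client you use against this endpoint:

```python
from typing import Callable, Iterator


def iter_job_executions(fetch_page: Callable[[int, int], list],
                        limit: int = 50) -> Iterator[dict]:
    """Yield executions across pages using the limit/offset parameters.

    fetch_page(limit, offset) stands in for a call to
    GET /api/v1/alerts/job-executions; stop when a short page comes back.
    """
    offset = 0
    while True:
        page = fetch_page(limit, offset)
        yield from page
        if len(page) < limit:
            break
        offset += limit


# Fake backend with 120 rows to exercise the pagination logic.
rows = [{"id": i} for i in range(120)]
fake_fetch = lambda limit, offset: rows[offset:offset + limit]
fetched = list(iter_job_executions(fake_fetch, limit=50))
print(len(fetched))  # 120
```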
### 2. Get Job Execution Statistics

```
GET /api/v1/alerts/job-executions/statistics
```
Query Parameters:
- `job_id` - Filter by job ID
- `alert_id` - Filter by alert ID
- `client_id` - Filter by client ID
- `days` - Number of days to analyze (default: 30, max: 365)
Example Response:
```json
{
  "status": "success",
  "data": {
    "total_executions": 150,
    "successful_executions": 145,
    "failed_executions": 5,
    "triggered_count": 23,
    "no_trigger_count": 122,
    "avg_duration_ms": 3250.5,
    "last_execution": "2024-01-15T10:30:00Z",
    "first_execution": "2024-01-01T00:00:00Z",
    "success_rate": 96.67,
    "trigger_rate": 15.33,
    "recent_executions": [...]
  }
}
```
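The `success_rate` and `trigger_rate` fields follow directly from the counts. A quick sketch of the assumed formulas (percentage of total executions, rounded to two decimals), checked against the example numbers above:

```python
def pct(part: int, total: int) -> float:
    """Percentage rounded to two decimals; 0.0 when there are no executions."""
    return round(part / total * 100, 2) if total else 0.0


total = 150
successful = 145
triggered = 23

print(pct(successful, total))  # 96.67 -> success_rate
print(pct(triggered, total))   # 15.33 -> trigger_rate
```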
### 3. Get Failed Job Executions

```
GET /api/v1/alerts/job-executions/failed
```
Query Parameters:
- `hours` - Hours to look back (default: 24, max: 168)
- `limit` - Number of results (default: 50, max: 100)
## Usage Examples

### 1. Check Recent Job Executions

```bash
curl -H "Authorization: Bearer YOUR_TOKEN" \
  "http://localhost:8000/api/v1/alerts/job-executions?limit=10"
```
### 2. Get Statistics for Specific Alert

```bash
curl -H "Authorization: Bearer YOUR_TOKEN" \
  "http://localhost:8000/api/v1/alerts/job-executions/statistics?alert_id=456e7890-e89b-12d3-a456-426614174001&days=7"
```
### 3. Check Failed Executions

```bash
curl -H "Authorization: Bearer YOUR_TOKEN" \
  "http://localhost:8000/api/v1/alerts/job-executions/failed?hours=48"
```
### 4. Monitor Specific Client

```bash
curl -H "Authorization: Bearer YOUR_TOKEN" \
  "http://localhost:8000/api/v1/alerts/job-executions?client_id=789e0123-e89b-12d3-a456-426614174002&status=completed"
```
## Database Queries

### 1. Get All Job Executions for Last 24 Hours

```sql
SELECT
  jel.job_id,
  jel.job_name,
  jel.status,
  jel.result_status,
  jel.started_at,
  jel.duration_ms,
  a.name AS alert_name,
  c.name AS client_name
FROM logs.job_execution_logs jel
LEFT JOIN app.alerts a ON jel.alert_id = a.id
LEFT JOIN app.clients c ON jel.client_id = c.id
WHERE jel.started_at >= NOW() - INTERVAL '24 hours'
ORDER BY jel.started_at DESC;
```
### 2. Get Job Performance Statistics

```sql
SELECT
  job_id,
  job_name,
  COUNT(*) AS total_executions,
  COUNT(CASE WHEN status = 'completed' THEN 1 END) AS successful,
  COUNT(CASE WHEN status = 'failed' THEN 1 END) AS failed,
  AVG(duration_ms) AS avg_duration_ms,
  COUNT(CASE WHEN result_status = 'triggered' THEN 1 END) AS triggered_count
FROM logs.job_execution_logs
WHERE started_at >= NOW() - INTERVAL '7 days'
GROUP BY job_id, job_name
ORDER BY total_executions DESC;
```
### 3. Find Jobs That Haven't Run Recently

```sql
SELECT
  a.name AS alert_name,
  a.frequency,
  a.last_checked,
  a.next_check,
  jel.last_execution
FROM app.alerts a
LEFT JOIN (
  SELECT
    alert_id,
    MAX(started_at) AS last_execution
  FROM logs.job_execution_logs
  GROUP BY alert_id
) jel ON a.id = jel.alert_id
WHERE a.is_active = true
  AND (jel.last_execution IS NULL OR jel.last_execution < NOW() - INTERVAL '2 days');
```
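The same staleness check can be applied in application code when deciding whether to flag an alert. A hypothetical helper mirroring the 2-day window used in the query above:

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def is_stale(last_execution: Optional[datetime],
             max_age: timedelta = timedelta(days=2),
             now: Optional[datetime] = None) -> bool:
    """True if an active alert has never run, or last ran before the window."""
    now = now or datetime.now(timezone.utc)
    return last_execution is None or last_execution < now - max_age


now = datetime(2024, 1, 15, 12, 0, tzinfo=timezone.utc)
print(is_stale(None, now=now))                                               # True
print(is_stale(datetime(2024, 1, 14, 12, 0, tzinfo=timezone.utc), now=now))  # False
print(is_stale(datetime(2024, 1, 10, 12, 0, tzinfo=timezone.utc), now=now))  # True
```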
## Monitoring & Alerting

### 1. High Failure Rate Alert

```sql
-- Jobs with >10% failure rate in last 24 hours
SELECT
  job_id,
  job_name,
  COUNT(*) AS total_executions,
  COUNT(CASE WHEN status = 'failed' THEN 1 END) AS failed_executions,
  (COUNT(CASE WHEN status = 'failed' THEN 1 END)::float / COUNT(*) * 100) AS failure_rate
FROM logs.job_execution_logs
WHERE started_at >= NOW() - INTERVAL '24 hours'
GROUP BY job_id, job_name
HAVING (COUNT(CASE WHEN status = 'failed' THEN 1 END)::float / COUNT(*) * 100) > 10;
```
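If you aggregate execution logs in application code instead of SQL, the same threshold check might look like the following sketch (the `failing_jobs` helper is hypothetical; it groups rows by `job_id` and keeps jobs whose failure rate exceeds the threshold, just as the query above does):

```python
from collections import Counter


def failing_jobs(executions: list, threshold_pct: float = 10.0) -> dict:
    """Return job_id -> failure rate (%) for jobs above the threshold."""
    totals = Counter()
    failed = Counter()
    for ex in executions:
        totals[ex["job_id"]] += 1
        if ex["status"] == "failed":
            failed[ex["job_id"]] += 1
    rates = {job: failed[job] * 100 / totals[job] for job in totals}
    return {job: rate for job, rate in rates.items() if rate > threshold_pct}


sample = (
    [{"job_id": "alert_1", "status": "completed"}] * 18
    + [{"job_id": "alert_1", "status": "failed"}] * 2   # exactly 10% -> not flagged
    + [{"job_id": "alert_2", "status": "completed"}] * 7
    + [{"job_id": "alert_2", "status": "failed"}] * 3   # 30% -> flagged
)
print(failing_jobs(sample))  # {'alert_2': 30.0}
```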
### 2. Slow Job Performance Alert

```sql
-- Jobs taking longer than 30 seconds
SELECT
  job_id,
  job_name,
  started_at,
  duration_ms,
  error_message
FROM logs.job_execution_logs
WHERE duration_ms > 30000
  AND started_at >= NOW() - INTERVAL '24 hours'
ORDER BY duration_ms DESC;
```
### 3. Missing Job Executions

```sql
-- Alerts that should have run but didn't
SELECT
  a.name,
  a.frequency,
  a.next_check,
  jel.last_execution
FROM app.alerts a
LEFT JOIN (
  SELECT
    alert_id,
    MAX(started_at) AS last_execution
  FROM logs.job_execution_logs
  WHERE status = 'completed'
  GROUP BY alert_id
) jel ON a.id = jel.alert_id
WHERE a.is_active = true
  AND a.next_check < NOW()
  AND (jel.last_execution IS NULL OR jel.last_execution < a.next_check);
```
## Maintenance

### Cleanup Old Logs

```python
from services.job_execution_logger import cleanup_old_logs

# Clean up logs older than 30 days
deleted_count = await cleanup_old_logs(days_to_keep=30)
print(f"Cleaned up {deleted_count} old log entries")
```
### Automated Cleanup (Cron Job)

```bash
# Add to crontab to run daily at 2 AM
0 2 * * * cd /path/to/backend && python -c "import asyncio; from services.job_execution_logger import cleanup_old_logs; asyncio.run(cleanup_old_logs(30))"
```
## Migration

To set up the job execution logging system:

1. Run the migration:

   ```bash
   cd /path/to/backend
   python run_job_logs_migration.py
   ```

2. Restart the application to enable job logging.

3. Verify the setup:

   ```bash
   curl -H "Authorization: Bearer YOUR_TOKEN" \
     "http://localhost:8000/api/v1/alerts/job-executions?limit=5"
   ```
## Benefits

- **Complete Visibility** - See every job execution, not just successful ones
- **Performance Monitoring** - Track execution times and identify slow jobs
- **Error Tracking** - Capture and analyze failure patterns
- **Debugging** - Rich context data for troubleshooting
- **Analytics** - Historical trends and statistics
- **Compliance** - Audit trail of all automated activities
- **Optimization** - Data-driven insights for improving performance
This system gives you complete visibility into scheduler operations, so you can monitor, debug, and optimize your alert system effectively.