Datadog, Inc. announced the general availability of Data Jobs Monitoring, a new product that helps data platform teams and data engineers detect problematic Spark and Databricks jobs anywhere in their data pipelines, remediate failed and long-running jobs faster, and optimize overprovisioned compute resources to reduce costs. Data Jobs Monitoring immediately surfaces specific jobs that need optimization and reliability improvements, and lets teams drill down into job execution traces to correlate job telemetry with their cloud infrastructure for fast debugging.

Data Jobs Monitoring helps teams to:

- Detect job failures and latency spikes: Out-of-the-box alerts immediately notify teams when jobs have failed or are running beyond automatically detected baselines, so issues can be addressed before they affect the end-user experience. Recommended filters surface the most important issues impacting job and cluster health so that they can be prioritized. (An illustrative sketch of a custom alert appears at the end of this article.)

- Pinpoint and resolve erroneous jobs faster: Detailed trace views show teams exactly where a job failed in its execution flow, giving them full context for faster troubleshooting. Multiple job runs can be compared to one another to expedite root cause analysis and identify trends and changes in run duration, Spark performance metrics, cluster utilization, and configuration.

- Identify opportunities for cost savings: Resource utilization and Spark application metrics help teams identify ways to lower compute costs for overprovisioned clusters and optimize inefficient job runs.

Data Jobs Monitoring is now generally available.
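For teams that want to codify additional thresholds alongside the product's out-of-the-box alerts, monitors can also be created programmatically through Datadog's public API. The following is a minimal, hypothetical sketch using the datadog-api-client Python library; the Spark metric name, threshold, and notification handle are illustrative assumptions, not details taken from the announcement.

```python
# Minimal sketch: create a custom metric monitor through Datadog's public API.
# Assumes DD_API_KEY and DD_APP_KEY are set in the environment. The metric
# name, threshold, and @slack handle below are illustrative assumptions.
from datadog_api_client import ApiClient, Configuration
from datadog_api_client.v1.api.monitors_api import MonitorsApi
from datadog_api_client.v1.model.monitor import Monitor
from datadog_api_client.v1.model.monitor_type import MonitorType

configuration = Configuration()  # reads API and application keys from the environment

with ApiClient(configuration) as api_client:
    monitors_api = MonitorsApi(api_client)
    monitor = Monitor(
        name="Spark job task failures",
        type=MonitorType("metric alert"),
        # Alert when any failed Spark tasks are reported over the last 10 minutes.
        query="sum(last_10m):sum:spark.job.num_failed_tasks{*} > 0",
        message="Spark tasks are failing. Investigate the affected job. @slack-data-eng",
        tags=["team:data-platform"],
    )
    created = monitors_api.create_monitor(body=monitor)
    print(f"Created monitor {created.id}")
```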