
Auto-Investigate Datadog Alerts

Wire PagerDuty or Datadog alerts to Devin for automatic incident investigation.
Author: Cognition
Category: Incident Response
Features: API, MCP
1

Enable the Datadog MCP

Devin needs access to your Datadog account to query logs, metrics, and monitors during an investigation.
  1. Go to Settings > MCP Marketplace and find Datadog
  2. Click Enable and enter your Datadog API key and application key. Generate these in Datadog under Organization Settings > API Keys and Organization Settings > Application Keys
  3. Click Test listing tools to verify Devin can connect
Once enabled, Devin can query error logs, pull metric timeseries, list active monitors, and search traces — all within a session. Learn more about connecting MCP servers.
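Before pasting the API key into the marketplace form, it can help to confirm that Datadog accepts it. The sketch below is a minimal check, assuming Datadog's key-validation endpoint (`/api/v1/validate`) and the default `datadoghq.com` site; the function names are illustrative, and the endpoint checks only the API key, not the application key.

```python
import json
import urllib.error
import urllib.request


def build_validate_request(api_key: str, site: str = "datadoghq.com") -> urllib.request.Request:
    """Build a GET request for Datadog's API-key validation endpoint.

    `site` is an assumption: change it if your account lives on another
    Datadog site (for example datadoghq.eu).
    """
    return urllib.request.Request(
        f"https://api.{site}/api/v1/validate",
        headers={"DD-API-KEY": api_key},
        method="GET",
    )


def api_key_is_valid(api_key: str, site: str = "datadoghq.com") -> bool:
    """Return True if Datadog reports the key as valid. Makes a network call."""
    try:
        with urllib.request.urlopen(build_validate_request(api_key, site), timeout=10) as resp:
            return bool(json.load(resp).get("valid"))
    except urllib.error.HTTPError:
        # Datadog rejects unknown keys with an HTTP error status
        return False
```

Calling `api_key_is_valid(os.environ["DD_API_KEY"])` from a shell where the key is exported (the variable name is hypothetical) should return `True` for a working key.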
2

Build the alert-to-Devin bridge

You need a small service that receives alert webhooks and starts a Devin session via the Devin API. Deploy this as a serverless function (AWS Lambda, Cloudflare Worker) or a lightweight container:
from flask import Flask, request, jsonify
import os
import requests

app = Flask(__name__)

@app.route("/alert", methods=["POST"])
def handle_alert():
    # silent=True yields None instead of an error on a missing or malformed JSON body
    payload = request.get_json(silent=True) or {}

    # Datadog webhook payload fields
    alert_title = payload.get("title", "Unknown alert")
    tags_str = payload.get("tags") or ""
    # Tags arrive as a comma-separated string; use the first service:<name> tag
    service = next(
        (t.strip().split(":", 1)[1] for t in tags_str.split(",") if t.strip().startswith("service:")),
        "unknown-service"
    )
    alert_url = payload.get("link", "")

    org_id = os.environ["DEVIN_ORG_ID"]
    response = requests.post(
        f"https://api.devin.ai/v3/organizations/{org_id}/sessions",
        headers={"Authorization": f"Bearer {os.environ['DEVIN_API_KEY']}"},
        json={
            "prompt": (
                f"Datadog alert fired: '{alert_title}'\n"
                f"Service: {service}\n"
                f"Alert link: {alert_url}\n\n"
                "Using the Datadog MCP:\n"
                "1. Pull error logs for this service from the past 30 min\n"
                "2. Identify the top error messages and stack traces\n"
                "3. Check if this correlates with a recent deploy\n"
                "4. If the root cause is clear, open a hotfix PR\n"
                "5. Post your findings to #incidents on Slack"
            ),
            "playbook_id": "14fed18b89d44713a26e673cf258f548",
        },
        timeout=30,  # avoid hanging the webhook if the Devin API is slow
    )
    # Pass the Devin API status through so a failed session start is not reported as 200
    return jsonify(response.json()), response.status_code
Create a service user in Settings > Service Users at app.devin.ai with the ManageOrgSessions permission. Copy the API token shown after creation and store it as DEVIN_API_KEY on your bridge service. Set DEVIN_ORG_ID to your organization ID, which you can get by calling GET https://api.devin.ai/v3/enterprise/organizations with your token.
The code above uses the !triage template playbook. Duplicate it and customize the investigation steps for your stack, then update the playbook_id in your bridge service.
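The organization lookup mentioned above can be sketched as follows. This is a minimal example that assumes the same Bearer-token header the bridge uses; the function names are illustrative. It makes no assumption about the response shape, so it returns the parsed JSON for you to inspect and find the ID.

```python
import json
import urllib.request

DEVIN_API_BASE = "https://api.devin.ai/v3"


def build_org_list_request(token: str) -> urllib.request.Request:
    """Build the GET request for the organizations endpoint named in this guide."""
    return urllib.request.Request(
        f"{DEVIN_API_BASE}/enterprise/organizations",
        headers={"Authorization": f"Bearer {token}"},
        method="GET",
    )


def fetch_organizations(token: str):
    """Return the parsed JSON body as-is. Makes a network call."""
    with urllib.request.urlopen(build_org_list_request(token), timeout=10) as resp:
        return json.load(resp)
```

Running `print(json.dumps(fetch_organizations(os.environ["DEVIN_API_KEY"]), indent=2))` once from a trusted machine is enough to copy the ID into DEVIN_ORG_ID.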
3

Route alerts to the webhook

From Datadog directly:
  1. In your Datadog dashboard, go to Integrations > Webhooks
  2. Click New Webhook, name it devin-bridge, and set the URL to your bridge endpoint (e.g., https://your-bridge.example.com/alert)
  3. In any monitor’s notification message, add @webhook-devin-bridge. The mention must match the webhook name, and Devin investigates whenever that monitor fires
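The bridge from step 2 reads the keys `title`, `tags`, and `link`, so the webhook's custom payload needs to send them. The sketch below prints a payload template you could paste into the webhook's payload field. It assumes Datadog's `$EVENT_TITLE`, `$TAGS`, and `$LINK` template variables; check them against the variable list in your Datadog webhook settings before relying on them.

```python
import json

# Keys match what the bridge reads; values are Datadog webhook template
# variables (assumed names, substituted by Datadog when the monitor fires).
DATADOG_WEBHOOK_PAYLOAD = {
    "title": "$EVENT_TITLE",
    "tags": "$TAGS",
    "link": "$LINK",
}


def render_payload_template() -> str:
    """Return the payload template as JSON text for the webhook form."""
    return json.dumps(DATADOG_WEBHOOK_PAYLOAD, indent=2)


print(render_payload_template())
```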
From PagerDuty:
  1. In PagerDuty, go to Services > [your service] > Integrations
  2. Add a Generic Webhooks (v3) integration
  3. Set the webhook URL to your bridge endpoint and filter by event type incident.triggered
Start with warning-level monitors to test the pipeline before routing critical alerts.
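The bridge in step 2 reads Datadog's field names, so PagerDuty events need to be mapped onto the same fields before the prompt is built. The helper below is one possible sketch. It assumes the PagerDuty v3 webhook layout in which the incident sits under `event.data` with `title`, `html_url`, and `service.summary`; `normalize_alert` is a hypothetical name, and the field paths should be verified against a real delivery.

```python
def normalize_alert(payload: dict) -> dict:
    """Map a Datadog or PagerDuty webhook body to the fields the bridge prompt uses."""
    event = payload.get("event")
    if isinstance(event, dict):
        # Assumed PagerDuty v3 shape: the incident lives under event.data
        data = event.get("data") or {}
        service = (data.get("service") or {}).get("summary") or "unknown-service"
        return {
            "source": "pagerduty",
            "title": data.get("title") or "Unknown alert",
            "service": service,
            "link": data.get("html_url") or "",
        }

    # Otherwise treat the body as the Datadog payload handled in step 2
    tags = (payload.get("tags") or "").split(",")
    service = next(
        (t.strip().split(":", 1)[1] for t in tags if t.strip().startswith("service:")),
        "unknown-service",
    )
    return {
        "source": "datadog",
        "title": payload.get("title") or "Unknown alert",
        "service": service,
        "link": payload.get("link") or "",
    }
```

Inside the Flask handler, calling `alert = normalize_alert(payload)` and building the prompt from `alert["title"]`, `alert["service"]`, and `alert["link"]` would let one endpoint serve both sources.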
4

What Devin investigates

When an alert triggers a session, Devin uses the Datadog MCP to run a structured investigation: querying logs, correlating with deploys, and tracing the error to source code.
Example investigation Devin posts to Slack:
Alert Investigation: payments-service error rate spike

Timeline:
- 14:28 UTC — Deploy #492 released (commit abc123f)
- 14:31 UTC — Error rate jumped from 0.3% to 5.2%
- 14:32 UTC — Alert triggered

Root cause: Deploy #492 refactored the Stripe webhook handler
(src/webhooks/stripe.ts) to async/await but removed the try/catch
around handlePaymentIntent(). Unhandled rejections are returning
500s on ~4% of checkout requests.

Fix: Added error boundary with structured logging and proper 4xx
responses for client errors.

PR #493 opened → https://github.com/acme/payments/pull/493
5

Extend the pipeline

Once basic investigation works, layer on more automation:
Customize the triage playbook. The bridge code already uses the !triage template playbook. Duplicate it and tailor the investigation checklist to your team’s stack: add service-specific runbooks, escalation paths, and conventions for hotfix PRs.
Scope by severity. Route P1 alerts for immediate investigation and hotfix. Route P3 alerts for root-cause analysis only. Use different prompts or playbooks per severity level.
Add Knowledge about your services (normal thresholds, architecture, on-call runbooks) so Devin’s investigation starts from your team’s context instead of from scratch.
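Severity scoping could be sketched as a small lookup that picks the instructions and playbook for each priority before the bridge calls the Devin API. Everything here is illustrative: the playbook IDs are placeholders, and how the priority reaches the bridge (a monitor tag, a PagerDuty priority field) is an assumption left to your setup.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    instructions: str
    playbook_id: str  # placeholder values below; replace with your own playbook IDs


ROUTES = {
    "P1": Route(
        "Investigate immediately. If the root cause is clear, open a hotfix PR.",
        "HYPOTHETICAL_P1_PLAYBOOK_ID",
    ),
    "P3": Route(
        "Perform root-cause analysis only. Do not open a PR.",
        "HYPOTHETICAL_P3_PLAYBOOK_ID",
    ),
}
# Unknown priorities fall back to the least invasive route
DEFAULT_ROUTE = ROUTES["P3"]


def build_session_body(priority: str, alert_title: str, service: str) -> dict:
    """Return the JSON body for the session request, chosen by priority."""
    route = ROUTES.get(priority.upper(), DEFAULT_ROUTE)
    prompt = (
        f"Alert fired: '{alert_title}'\n"
        f"Service: {service}\n"
        f"Priority: {priority.upper()}\n\n"
        f"{route.instructions}"
    )
    return {"prompt": prompt, "playbook_id": route.playbook_id}
```

The returned dict can be passed as the `json=` argument of the `requests.post` call in the bridge. Defaulting unknown priorities to analysis-only keeps a mislabeled alert from triggering an unwanted hotfix PR.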