Alerts

Alerting isn’t just a feature. It is a keystone of monitoring infrastructure, and an infamously difficult thing to do well. When it goes wrong, it has a profoundly negative impact on the productivity and happiness of the engineers who rely on it. We aim to provide tools that allow you to create actionable alerts that have a high signal-to-noise ratio.

Our alerting system lets you link alerts with multiple notification services, so you can send alerts via email to pagers, via web services to your favorite escalation partner, or both at the same time.

Alerts are first-class entities in AppOptics and can be accessed from the Alerts icon in your navigation bar.

Alerts Central

Clicking on the Alerts icon in the menu bar takes you to the main view - a list of all enabled alerts and their current state. From here you can drill down into alerts, edit them, create new ones, sort, search by name… you get the picture - that’s why we call this Alerts Central.

alerts-central

Create New Alert

Clicking on the “Create Alert” button opens up a wizard that starts with a choice of which type of metrics you want to alert on.

alerts-new

  • Host Metrics: Monitor health across one or more hosts to understand vitals and know when host(s) have stopped.
  • APM Metrics: Alert on service health metrics like latency, error rate, and request volume.
  • Custom Metrics: Set up alerts based on any metric, including APM and Host metrics.

Select which type of metric you want to alert on and click the corresponding tile to proceed to defining the alert conditions.

Defining Alert Conditions

The second step is to define the Alert Conditions. Here you can create new conditions or edit existing ones. If an alert has several conditions, ALL of them have to be met before the alert triggers.

Note

Alert conditions are completely independent, so if condition 1 is triggered by tag set X and condition 2 is triggered by tag set Y, the alert will fire. To create alerts that are tag-set dependent we recommend using Alerts on Composite Metrics.

alerts-condition

  • Select a metric to monitor: Select a metric (or persisted composite metric). If you chose Host or APM metrics, you will see some common metrics available for 1-click selection.

  • Filter metric: Define a tag set to filter the metric on. For example, adding the tag environment with the value production will only evaluate metric streams that match that condition. Each stream is checked independently against the alert conditions, which means the alert will fire when any one of the streams violates them.

  • Select aggregation function: For gauges, choose the statistic to alert on: average, sum, minimum, maximum, count, or derivative. Please note that if you select “derivative” for a gauge, the derivative of the sums will be used. For the stops reporting condition, this field is not shown. The condition can also be evaluated per stream (Any) or against an aggregation of all streams that match the tag filter (All).

  • Trigger alert: The threshold settings for the alert.

    • condition type: The type of threshold you are setting. You can trigger alerts when values exceed or fall below a threshold or you can use the “stops reporting” option to check a metric’s “heartbeat”.
      • goes above: Alert will fire if the incoming measurement value exceeds the threshold.
      • falls below: Alert will fire if the incoming measurement value falls below the threshold.
      • stops reporting: Alert will fire if a metric that has reported at least once in the past hour is now absent.
    • crosses threshold: the value to be checked against (for each metric stream) which determines whether the alert will be fired. There are two trigger options:
      • immediately: Fire the alert as soon as any measurement violates the condition.
      • for a duration: Set a time window; every measurement arriving within that window must meet the trigger condition. For example, if you only want to be notified when CPU stays above 80 for at least 5 minutes, set the field to 5 minutes. NOTE: The maximum value of this field is 60 minutes.
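The trigger options above can be sketched as a per-stream check. The following Python sketch is illustrative only — the Condition class and violates function are hypothetical names, not part of AppOptics:

```python
from dataclasses import dataclass

@dataclass
class Condition:
    """A simplified "goes above" alert condition (illustrative only)."""
    threshold: float
    duration_s: int  # 0 means "immediately"

def violates(cond: Condition, window: list) -> bool:
    """Return True if one metric stream fires the "goes above" condition.

    `window` holds (unix_ts, value) measurements, newest last.
    """
    if cond.duration_s == 0:
        # "immediately": a single measurement over the threshold fires.
        return any(v > cond.threshold for _, v in window)
    # "for a duration": every measurement in the trailing window must violate.
    newest_ts = window[-1][0]
    recent = [v for ts, v in window if newest_ts - ts < cond.duration_s]
    return bool(recent) and all(v > cond.threshold for v in recent)
```

For example, `Condition(threshold=80, duration_s=300)` models “CPU goes above 80 for a duration of 5 minutes”: every measurement in the trailing 5-minute window must exceed 80 before the alert fires.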

There is also a preview that shows the recent activity for the selected metric. This way you can use previous data to set a relevant threshold.

In the right-hand column, a sentence explains how the condition will be met. Example: The average of system.cpu.utilization for any host goes above 20 for a duration of 5 mins.

Click the “Next” button in the lower right hand corner to select which notification services you would like to use.

Notification Services

Under the Notification Services step you can link an alert to any number of notification services. To tie an alert to a notification service, click the service and select any of the configured destinations. You must select at least one notification service before you can save the alert.

notification_services

These are the services that AppOptics supports:

Adding Details to an Alert

The final step is to finish adding details to the alert.

alerts-details

  • alert name: Pick a name for your alert. Alert names now follow the same naming conventions as metrics. We suggest names that clearly describe the environment, application tier, and alert function, such as production.frontend.response_rate.slow or staging.frontend.response_rate.slow.
  • description: This is an optional field - keep in mind that others may need to know what this alert is for so we recommend using it if you are setting up alerts for a team.
  • rearm time: This re-notify timer lets you specify how long to wait before the alert can fire again. NOTE: The re-notify timer is global to the alert, so if the timer is set to 60 minutes and you are alerting on a metric with cardinality > 1, and two streams trigger the alert within a few minutes of each other, the first stream will trigger a notification but the second will not.
  • runbook URL: To ensure that alerts are actionable, add a url to a document (wiki page, gist, etc.) that describes what actions should be taken when the alert fires.
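The global behavior of the rearm timer described above can be sketched as follows (a minimal sketch; the Rearm class is a hypothetical name, not an AppOptics API):

```python
class Rearm:
    """Sketch of the global re-notify (rearm) timer for one alert.

    The timer is global to the alert, not per metric stream: once any
    stream fires a notification, further triggers are suppressed until
    the rearm window has elapsed.
    """

    def __init__(self, rearm_minutes: int):
        self.window_s = rearm_minutes * 60
        self.last_notified = None  # unix timestamp of the last notification

    def should_notify(self, now: float) -> bool:
        if self.last_notified is None or now - self.last_notified >= self.window_s:
            self.last_notified = now
            return True
        return False  # still inside the rearm window: suppress
```

With a 60-minute rearm, a second stream triggering a few minutes after the first is suppressed, matching the behavior described above.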

Clicking “Enable” in the lower right-hand corner sets the alert active and AppOptics will begin monitoring the conditions.

Automated Alert Annotations

Every time an alert triggers a notification, we automatically create a matching annotation event in a special appoptics.alerts annotation stream. This enables you to overlay the history of any given alert (or set of alerts) on any chart as depicted below:

alerts-annotation

The naming convention for a given alert will take the form: appoptics.alerts.#{alert_name}. For each notification that has fired, you will find an annotation in the annotation stream with its title set to the alert name, its description set to the list of conditions that triggered the notification, and a link back to the original alert definition. This feature will enable you to quickly and easily correlate your alerts against any combination of metrics.
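Following the convention above, an annotation event for a notification might be assembled like this (illustrative only — the alert_annotation helper and the exact field names are assumptions, not the AppOptics API schema):

```python
def alert_annotation(alert_name: str, conditions: list, alert_url: str) -> dict:
    """Build an annotation event for an alert's annotation stream.

    Mirrors the convention described above: a stream named
    appoptics.alerts.#{alert_name}, the title set to the alert name,
    the description listing the triggering conditions, and a link
    back to the alert definition.
    """
    return {
        "stream": f"appoptics.alerts.{alert_name}",
        "title": alert_name,
        "description": "; ".join(conditions),
        "links": [{"rel": "alert", "href": alert_url}],
    }
```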

Automatic Clearing of Triggered Alerts

With alert clearing you will receive a clear notification when the alert is no longer in a triggered state. For threshold and windowed alert conditions, any measurement that has returned to normal levels for all affected metric streams will clear the alert. For absent alert conditions, any measurement that reappears for all affected metric streams will clear the alert.

When an alert clears it sends a special clear notification. These are handled differently based on the integration. Email and Slack integrations will see an “alert has cleared” message. For PagerDuty customers, the open incident will be resolved. OpsGenie customers will see their open alert closed. Webhook integrations will contain a clear attribute in the payload as follows:

{
  "payload": {
    "alert": {
       "id": 6268092,
       "name": "a.test.name",
       "runbook_url": "",
       "version": 2
    },
    "account": "youremail@yourdomain.com",
    "trigger_time": 1457040045,
    "clear": "normal"
  }
}
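A webhook receiver can use the presence of the clear attribute to tell clear notifications apart from triggers. A minimal sketch (the handle_webhook function is hypothetical; trigger payloads are assumed to simply omit clear):

```python
import json

def handle_webhook(body: str) -> str:
    """Classify an incoming alert webhook as a trigger or a clear."""
    payload = json.loads(body)["payload"]
    alert = payload["alert"]
    if "clear" in payload:
        # Clear notifications carry a "clear" attribute, as in the example above.
        return f'alert {alert["name"]} (id {alert["id"]}) cleared: {payload["clear"]}'
    return f'alert {alert["name"]} (id {alert["id"]}) triggered'
```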

When you view Alerts Central, the alerts are grouped by status (Triggered, All Good, Disabled). Under normal conditions there will be no Triggered alerts, which indicates that everything is all good.

alert-ok

If the alert has actively triggered and has not cleared yet, it will include a resolve button that will manually clear the alert. This can be useful in cases where a source reports a measurement that violates a threshold, but then subsequently stops reporting.

alert-triggered

If the condition of the alert is still actively triggering, the alert will return to the triggered state on the next incoming measurement(s).

Auto-clearing when metric streams stop reporting on a threshold alert

For time-windowed threshold alerts, if all metric streams that were in violation subsequently stop reporting, the alert is cleared after one threshold period. Example: if the alert condition is “goes above 42 for 5 minutes” and metric stream 1 violates the condition, the alert will trigger; if metric stream 1 later stops reporting measurements, the alert is cleared 5 minutes afterward. Similarly, if metric streams 1 and 2 are both in violation, and stream 1 stops reporting while stream 2 remains in violation, the alert stays in the triggered state. If stream 2 then stops reporting as well, the alert is cleared. All triggered alerts are automatically cleared after 10 days if they fail to report any new measurements.
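The clearing timeline above reduces to a small calculation (illustrative; auto_clear_at is a hypothetical helper, not an AppOptics API):

```python
def auto_clear_at(last_seen: dict, period_s: int) -> int:
    """When a time-windowed threshold alert auto-clears once every
    violating stream has gone silent: one threshold period after the
    final measurement from the last remaining violating stream.

    `last_seen` maps each violating stream to the unix timestamp of
    its final measurement.
    """
    return max(last_seen.values()) + period_s
```

For instance, with a 5-minute (300 s) threshold period, if stream 1 last reports at t=100 and stream 2 at t=400, the alert clears at t=700.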

Alerts on Composite Metrics

alert_composite_metric

Under the Metrics view you can use the blue Create Composite Metric button to create a composite metric with a persisted definition, which will be available globally like any other metric. It can also be used inside an alert.

Warning

When the alerting system polls for values on a composite metric, it is currently limited to the past 60 minutes of data. Therefore, if, for example, you have an alert set on a derive() function, there must be at least 2 data points within the last 60 minutes for your alert to trigger.
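For example, a persisted composite built on derive() (composite DSL; the metric name is illustrative) needs at least two data points in the trailing hour before it produces a value the alert can evaluate:

derive(s("my.requests.count", {"name":"*"}))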

Saved composites must resolve to a single series expression to be alerted on, so…

[s("metric1", {"name":"foo"}), s("metric2", {"name":"bar"})]

…will not work. An expression that returns multiple metric streams will work, e.g.:

s("metric1", {"name":"*"})