# Alerting Overview

Contents

* [Alerting Overview](/alerting-guide/alerting-overview.md)
  * [Creating An Alert](#creating-an-alert)
    * [Alert Name and Metric](#alert-name-and-metric)
    * [Alert Criteria Panel](#alert-criteria-panel)
      * [Composite Alerts](#composite-alerts)
    * [Notification Panel](#notification-panel)
      * [Email](#email)
      * [PagerDuty](#pagerduty)
      * [Slack](#slack)
      * [Microsoft Teams](#microsoft-teams)
      * [VictorOps](#victorops)
      * [OpsGenie](#opsgenie)
      * [Webhook](#webhook)
      * [Notification JSON Model](#notification-json-model)
  * [Alert States](#alert-states)
  * [Managing An Alert](#managing-an-alert)
  * [Scheduled Mutes](#scheduled-mutes)
  * [Troubleshooting Your Alerts](#troubleshooting-your-alerts)

> **You can create alerts in Hosted Graphite, Hosted Grafana, or both. However, we offer limited support for Grafana Alerting because it runs on a separate alerting engine that we do not manage.** Hosted Graphite's internal alerting system also responds faster because it evaluates values on ingestion rather than on render.

### [Creating An Alert](#creating-an-alert)

<figure><img src="/files/pC7wSK2G1GKlTuV6wGxt" alt=""><figcaption><p>Create an Alert</p></figcaption></figure>

#### [Alert Name and Metric](#alert-name-and-metric)

From within your Hosted Graphite account, click the “Alert” icon to open the alert creation panel.

> * **Alert Name**
>
>   This name is used in notifications. It serves as a reminder of why you added the alert, so make it clear and descriptive, e.g. “EU Servers CPU usage”.
> * **Graphite Alerting Metric**
>
>   This queries the data that is tested against your criteria:
>
>   * **Wildcard** patterns are accepted and evaluated, and a list of Triggered Metrics is returned
>   * **Tagged Metrics** can also be used for alerting metrics, e.g. `seriesByTag("name=myapp.response","code=400")`
> * **Alert Info**
>
>   The alert message sent with notifications. It can contain arbitrary text, such as a description of the alert, resolution steps to follow, or links to documentation.

It is **recommended** to check your alerting metric with the “Check Metric Graph” button to confirm it is rendering the data that you expect. When you’re finished, proceed to the Alert Criteria tab.
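
If you generate alerting metrics from a script, a small helper can keep the quoting of tagged queries consistent. The sketch below is illustrative only: `series_by_tag` is a hypothetical helper name, not part of any Hosted Graphite client, and it simply formats the `seriesByTag(...)` expression shown above.

```python
def series_by_tag(**tags: str) -> str:
    """Build a Graphite seriesByTag() expression from tag=value pairs.

    Hypothetical helper for illustration; it only formats a string.
    """
    expressions = ",".join(f'"{key}={value}"' for key, value in tags.items())
    return f"seriesByTag({expressions})"


# Reproduces the tagged-metric example from this section.
print(series_by_tag(name="myapp.response", code="400"))
```

The resulting string can be pasted into the Graphite Alerting Metric field and verified with the “Check Metric Graph” button.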

#### [Alert Criteria Panel](#alert-criteria-panel)

There are three ways to define the criteria that will result in a notification being sent.

* **Below / Above a Threshold**

  If you enter only an above value or only a below value, the alert checks only that threshold. This is useful when there’s an upper or lower bound that this data should not cross. You can configure the alert to trigger after the metric has crossed the threshold **FOR** x minutes, or if it **EVER** crosses it.
* **Missing**

  An alert notification will be sent to you if the metric does not arrive at all for a certain time period. This is useful for detecting when a system goes down entirely.
* **Outside of Bounds**

  An alert notification will be sent if the metric data you’ve selected goes either above the “above” threshold, or below the “below” threshold. This is useful when your data fits inside an expected range, and can be configured to trigger after crossing the threshold FOR x minutes, or EVER.
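
To make the three criteria types concrete, here is a minimal sketch of how they could be evaluated over a list of recent datapoints. It is not Hosted Graphite's implementation: the function names, the strict `>` / `<` comparisons, and the mapping of "FOR x minutes" to a count of consecutive datapoints (`for_points`) are all assumptions made for illustration.

```python
from typing import Optional, Sequence


def breaches(value: float, above: Optional[float] = None,
             below: Optional[float] = None) -> bool:
    """True if value is over the 'above' bound or under the 'below' bound."""
    if above is not None and value > above:
        return True
    if below is not None and value < below:
        return True
    return False


def should_trigger(values: Sequence[Optional[float]],
                   above: Optional[float] = None,
                   below: Optional[float] = None,
                   for_points: int = 1) -> bool:
    """Trigger when the most recent `for_points` datapoints all breach.

    Assumption: for_points=1 stands in for "EVER" (one breaching datapoint
    is enough) and a larger value stands in for "FOR x minutes".
    Setting both bounds gives the "outside of bounds" behaviour.
    """
    recent = values[-for_points:]
    if len(recent) < for_points:
        return False
    return all(v is not None and breaches(v, above, below) for v in recent)


def is_missing(values: Sequence[Optional[float]], for_points: int) -> bool:
    """'Missing' criteria: no datapoint arrived in the last `for_points` slots."""
    recent = values[-for_points:]
    return len(recent) == for_points and all(v is None for v in recent)


cpu = [42.0, 91.5, 93.0, 95.2]
print(should_trigger(cpu, above=90, for_points=3))       # True
print(should_trigger(cpu, above=90, for_points=4))       # False
print(should_trigger([50.0, 5.0], above=90, below=10))   # True
print(is_missing([1.0, None, None], for_points=2))       # True
```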

#### [Composite Alerts](#composite-alerts)

Composite alerts allow you to combine multiple conditions into a single alert using AND / OR logic, and up to four conditions can be configured. This helps reduce alert fatigue by correlating multiple metric evaluations before triggering a notification. So instead of alerting on one noisy metric in isolation, composite alerts allow you to build higher-confidence alerts around overall service health.

For example:

* Condition A: CPU usage is elevated
* Condition B: request latency is increasing
* Condition C: database connection count is high

These conditions can then be evaluated together using AND / OR expressions such as:

* `A && B`
* `A || B || C`
* `(A && B) || C`
* `A || (B && C)`

Each condition also supports the same alert criteria types available in standard alerts (above, below, missing, outside of bounds).
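
The AND / OR expressions above can be reasoned about with a small evaluator. The following is a sketch under stated assumptions, not the service's evaluation engine: `evaluate` is a hypothetical function, and it assumes the conventional precedence where `&&` binds more tightly than `||` when no parentheses are given.

```python
import re
from typing import Dict, List, Optional


def evaluate(expression: str, states: Dict[str, bool]) -> bool:
    """Evaluate an expression such as "(A && B) || C" against condition states."""
    # Characters outside this token set are ignored by findall.
    tokens: List[str] = re.findall(r"&&|\|\||[()]|[A-Za-z]\w*", expression)
    pos = 0

    def peek() -> Optional[str]:
        return tokens[pos] if pos < len(tokens) else None

    def take() -> str:
        nonlocal pos
        token = peek()
        if token is None:
            raise ValueError("unexpected end of expression")
        pos += 1
        return token

    def parse_or() -> bool:
        result = parse_and()
        while peek() == "||":
            take()
            right = parse_and()  # parse first so every token is consumed
            result = result or right
        return result

    def parse_and() -> bool:
        result = parse_atom()
        while peek() == "&&":
            take()
            right = parse_atom()
            result = result and right
        return result

    def parse_atom() -> bool:
        token = take()
        if token == "(":
            result = parse_or()
            if take() != ")":
                raise ValueError("expected ')'")
            return result
        if token in states:
            return states[token]
        raise ValueError(f"unexpected token {token!r}")

    result = parse_or()
    if peek() is not None:
        raise ValueError(f"unexpected token {peek()!r}")
    return result


# CPU is elevated (A), but latency (B) and connection count (C) are normal.
states = {"A": True, "B": False, "C": False}
print(evaluate("A && B", states))         # False
print(evaluate("A || B || C", states))    # True
print(evaluate("(A && B) || C", states))  # False
print(evaluate("A || (B && C)", states))  # True
```

With these states, only the expressions that accept condition A on its own would notify, which shows how `&&` can be used to suppress a single noisy signal.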

Composite alerts still support wildcard metric queries. When wildcard conditions are used, all matching metrics are evaluated and a Triggered Metrics list is sent within the alert notification payload. This additional context makes it easier to identify exactly which services, hosts, or metric groups contributed to the alert state.

<figure><img src="/files/OTKMF6V9cggL1BrYdMGo" alt=""><figcaption></figcaption></figure>

**Service-Level Alerting**

Composite alerts allow a more **service**-level approach to alerting by combining multiple metric **signals** into a single evaluation. This helps reduce noisy alerts while improving signal quality and incident visibility.

For example, a temporary CPU spike on a single host may not represent a real incident. However, elevated CPU combined with increased latency and database saturation may indicate a genuine service degradation event worth notifying on. For a deeper overview of this alerting philosophy and real-world examples from our internal alerting infrastructure, see our [case study](https://www.metricfire.com/blog/a-real-world-graphite-alerting-case-study-reducing-noise-at-metricfire/).

#### [Notification Panel](#notification-panel)

Defining a notification channel allows you to receive notifications when an alert triggers. The available notification channel types are described below, and you can create new channels on your [Notification Channel Page](https://www.hostedgraphite.com/app/alerts/notification-channels/). Click the '+ Add Channel' button to configure a new alerting channel, then click 'Save' and apply it to any of your Graphite alerts.

“If the query fails” lets you control the behavior if the Graphite function query fails. This option only appears for alerts that use Graphite functions as part of their metrics. Graphite function queries can fail because they time out after matching too many metrics, are malformed, or return duplicate metrics due to aliasing.

* **Notify me**

  A notification is sent when the query fails with a description of the reason.
* **Ignore**

  No notification is sent, but the alert still changes state and the failure is visible in the event history log.

"Alerting Notification Interval" lets you control how often you want to be notified of an alert:

* **On state change**

  A notification will be sent only when the alert transitions state from healthy to triggered, or vice versa. An alert that continues alerting will not send subsequent notifications.
* **Every**

  A notification will be sent each time the alert triggers and recovers. Subsequent notifications will then be paused for the configured time period. This allows you to stop the ‘flapping’ behavior that would give you lots of notifications in a short period of time.

#### [Email](#email)

Send one or multiple emails to your team when an alert is triggered.

<figure><img src="/files/wfmPg9HUExdQo4pxkXLl" alt="" width="306"><figcaption></figcaption></figure>

#### [PagerDuty](#pagerduty)

Send your alerts to your centralized PagerDuty incident monitoring and alerting system. Refer to the PagerDuty documentation to create or locate the service or integration key required to configure this notification channel.

<figure><img src="/files/514SU1IGFGzAmRffO5J9" alt="" width="306"><figcaption></figcaption></figure>

#### [Slack](#slack)

Send an immediate notification to one of your Slack channels. This requires a Slack Webhook endpoint for your channel; see the [Slack documentation](https://slack.com/apps/new/A0F7XDUAZ-incoming-webhooks) for details on how to create one.

<figure><img src="/files/3zfHNUZCMrr7sNkPaDOk" alt="" width="306"><figcaption></figcaption></figure>

#### [Microsoft Teams](#microsoft-teams)

Send alerts to your chosen Microsoft Teams channel by creating a Microsoft Teams webhook. See their [documentation](https://learn.microsoft.com/en-us/microsoftteams/platform/webhooks-and-connectors/how-to/add-incoming-webhook?tabs=dotnet) for details on creating a Teams webhook.

<figure><img src="/files/OmH7C66JNTwYG0mQTUtM" alt="" width="306"><figcaption></figcaption></figure>

#### [VictorOps](#victorops)

(Now owned by Splunk) Send your alerts into your VictorOps hub to integrate with all your existing monitoring and alerting infrastructure. See the Splunk [documentation](https://help.victorops.com/knowledge-base/escalation-webhooks/) for more details on creating escalation webhooks.

<figure><img src="/files/gSnQI5DoMYeg47LnjoRS" alt="" width="306"><figcaption></figcaption></figure>

#### [OpsGenie](#opsgenie)

Send alerts to this incident management tool, allowing your team to respond to incidents, outages, and other events. Refer to the OpsGenie [docs](https://support.atlassian.com/opsgenie/docs/integrate-opsgenie-with-hosted-graphite/) for details on API key management.

<figure><img src="/files/sPVRM4aEe5M6idVGiLy9" alt="" width="306"><figcaption></figcaption></figure>

#### [Webhook](#webhook)

Configure your own webhook, which we will notify with real-time information about your defined alerts. The JSON format below is the data reported to all notification channels, including your webhook URL.

<figure><img src="/files/T1oV4ejznu8du9Qh2fB9" alt="" width="306"><figcaption></figcaption></figure>

#### [Notification JSON Model](#notification-json-model)

Each notification will be JSON encoded in the following format:

```json
{
 "name": "The name of the triggered alert.",
 "criteria": "The defined alert criteria for the alert.",
 "graph": "PNG of the rendered graph.",
 "value": "The last evaluated value of the alerting metric.",
 "metric": "The graphite metric that triggered an alert.",
 "logical_expression": "The AND / OR expression used to evaluate composite alerts.",
 "condition_met": {
   "A": ["Metric query rule for condition A"],
   "B": ["Metric query rule for condition B"]
 },
 "triggered_metrics": [
   "List of metrics contributing to the triggered alert state."
 ],
 "status": "The current status of the alert.",
 "backoff_minutes": false | 123,
 "info": null | "Info saved with the alert."
}
```

**NOTE**: Composite alerts include additional notification payload fields for **logical expressions** and **conditions met**. When wildcard queries are used, the triggered metrics list contains the metrics that contributed to the alert state.
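
If you consume these notifications in your own webhook handler, the body can be parsed with any JSON library. The sketch below assumes only the field names from the model above; `summarize` is a hypothetical helper, and the sample values (including the `status` and `criteria` strings) are invented for illustration rather than taken from a real notification.

```python
import json


def summarize(raw_body: str) -> str:
    """Turn a notification body into a short, human-readable summary."""
    payload = json.loads(raw_body)
    lines = [
        f"[{payload.get('status')}] {payload.get('name')}: {payload.get('criteria')}"
    ]
    # Present only on composite alerts.
    if payload.get("logical_expression"):
        lines.append(f"expression: {payload['logical_expression']}")
    # Populated when wildcard queries match several metrics.
    for metric in payload.get("triggered_metrics") or []:
        lines.append(f"  - {metric}")
    # "info" may be null when no Alert Info was saved.
    if payload.get("info"):
        lines.append(f"info: {payload['info']}")
    return "\n".join(lines)


# Invented sample body; real values will differ.
sample = json.dumps({
    "name": "EU Servers CPU usage",
    "criteria": "above 90",
    "status": "triggered",
    "metric": "servers.eu.*.cpu",
    "triggered_metrics": ["servers.eu.web1.cpu"],
    "backoff_minutes": False,
    "info": None,
})
print(summarize(sample))
```

Using `.get()` for every field keeps the handler tolerant of fields that are absent or null, such as the composite-only fields on a standard alert.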

Slack notifications also include a Condition Matrix view which displays the evaluation status of each configured condition, making it easier to understand exactly why the composite alert triggered.

<figure><img src="/files/rupRpIETrQWiGSXKptQg" alt="" width="482"><figcaption></figcaption></figure>

### [Alert States](#alert-states)

Your alerts are listed in the [Alert Overview](https://www.hostedgraphite.com/app/alerts/) section of the Hosted Graphite application. We list them in four categories:

> * **Healthy Alerts**
>
>   Alerts that are currently running and within acceptable boundaries.
> * **Triggered Alerts**
>
>   Alerts that are currently running and outside acceptable boundaries. These alerts will have already notified you via the configured notification channel.
> * **Muted Alerts**
>
>   Alerts which have been silenced manually or by schedule. These alerts will not notify you until they become active again.
> * **Inactive Alerts**
>
>   Alerts that use Graphite function metrics but have failed because the query took too long, was malformed, or returned duplicate metrics due to aliasing.

**NOTE**: Manually updating an alert will reset the state to *Healthy*.

### [Managing An Alert](#managing-an-alert)

From the Alert Overview page, you can hover your mouse over an individual alert to see actions related to managing it.

<figure><img src="/files/GzQgzWvizGPGtbMJQ17a" alt=""><figcaption><p>Managing an Alert</p></figcaption></figure>

* **View an Alert**

  Click the eye icon to open the overview popup for an alert. This displays an embedded graph and a history log of the last 3 days of data. There is also a link to the dashboard composer allowing you to view more detailed information on the metric being alerted on. From within the dashboard composer view, alert events will be displayed as annotations. You can hover over the base of the annotation to see the details of the alerting event.

<figure><img src="/files/NTPn2h5SY0x20rCejm0K" alt=""><figcaption><p>View an Alert</p></figcaption></figure>

* **Edit an Alert**

  An alert can be edited to change its metric, criteria, or notification channel. Changes may take several minutes to take effect. Updating alert criteria will place the alert back into the ‘Healthy’ list in the Graphite Alerts UI, but does not change the state of the alert.
* **Mute an Alert**

  An alert can be silenced from notifying you for a specified time period. Currently, the available times are 30 minutes, 6 hours, 1 day, and 1 week.
* **Delete an Alert**

  An alert can be deleted from your panel here; this action is irreversible. If an alert was built within the Dashboard UI, you will be unable to edit or delete it from within the Hosted Graphite UI. Feel free to contact our [support](mailto:support%40hostedgraphite.com) for advice on managing alerts using the Hosted Graphite [*alert API*](/alerting-guide/alerts-api.md), or the [Dashboard API](https://grafana.com/docs/grafana/v7.5/http_api/alerting/).

### [Scheduled Mutes](#scheduled-mutes)

Defining a scheduled mute allows you to silence alerts on a one-time or recurring basis for scheduled maintenance or downtime. You can see the available scheduled mutes and add new ones in the [Alerts UI](https://www.hostedgraphite.com/app/alerts/scheduled-mutes/).

Once a scheduled mute is created, it must be attached to alerts so that they may be silenced by the scheduled mute - this can be done at the alert [create](#creating-an-alert) and [update](/alerting-guide/alerts-api.md#updating-alerts) endpoints, or the Hosted Graphite [UI](https://www.hostedgraphite.com/app/alerts/scheduled-mutes/).

* **One-time**

  You can silence alerts on a one-time basis by creating a scheduled mute with no repeat days.
* **Recurring**

  By providing a list of days of the week for the scheduled mute to repeat, you can silence alerts on a recurring basis.
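
The difference between the two mute types can be sketched as a simple time-window check. This is an illustration of the concept only, not the service's scheduling logic: `is_muted`, its parameters, the lowercase weekday names, and the use of naive datetimes (no timezone handling) are all assumptions.

```python
from datetime import datetime, timedelta
from typing import Sequence

WEEKDAYS = ["mon", "tue", "wed", "thu", "fri", "sat", "sun"]


def is_muted(now: datetime, start: datetime, duration: timedelta,
             repeat_days: Sequence[str] = ()) -> bool:
    """Return True if `now` falls inside the scheduled mute window."""
    if now < start:
        return False
    if not repeat_days:
        # One-time mute: a single window beginning at `start`.
        return now < start + duration
    # Recurring mute: check the window opening today and the one that opened
    # yesterday, so a window that spans midnight is still honoured.
    for days_back in (0, 1):
        day = now.date() - timedelta(days=days_back)
        window_start = datetime.combine(day, start.time())
        if window_start < start:
            continue
        if (WEEKDAYS[day.weekday()] in repeat_days
                and window_start <= now < window_start + duration):
            return True
    return False


start = datetime(2024, 1, 1, 2, 0)   # a Monday, 02:00
hour = timedelta(hours=1)
print(is_muted(datetime(2024, 1, 8, 2, 30), start, hour))           # False
print(is_muted(datetime(2024, 1, 8, 2, 30), start, hour, ["mon"]))  # True
```

With no repeat days the mute applies once and then expires; with `["mon"]` the same window reopens every Monday.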

### [Troubleshooting Your Alerts](#troubleshooting-your-alerts)

Please contact [support](mailto:support%40hostedgraphite.com) if you think you’ve found a bug, or have any questions, concerns, or suggestions.

> * **Is your metric arriving?**
>
>   If you are not receiving notifications as expected, please check the [Alert Overview](https://www.hostedgraphite.com/app/alerts/) page and select the alert in question. You can use this to check that the metric values for the last few hours are as expected. You can also inspect the Alert History for any recent alerting events.
> * **Are some events being ignored?**
>
>   We alert on a 30-second resolution. This means the finer data (5s) is averaged and we alert off the 30-second aggregate.
> * **Is your alert not triggering as expected?**
>
>   Alerts built in the Grafana UI may not work as expected. A simple fix is to recreate the alert in the Hosted Graphite Alerts UI.
> * **Is your alert not resolving as expected?**
>
>   Unless your alert criteria is set to 'data is missing', Graphite alerts will not trigger or resolve from *null* data. If your alerting metric reports intermittent data (for example, 1 datapoint every 10 minutes), null values can be reported between each datapoint. Try wrapping your alerting metric in a Graphite function like [transformNull()](https://graphite.readthedocs.io/en/latest/functions.html?highlight=transformNull#graphite.render.functions.transformNull), [keepLastValue()](https://graphite.readthedocs.io/en/latest/functions.html?highlight=keeplast#graphite.render.functions.keepLastValue), or [movingAverage()](https://graphite.readthedocs.io/en/latest/functions.html#graphite.render.functions.movingAverage). If the alert criteria is set too low (FOR less than 10 minutes), your alert might not resolve as expected.
> * **Is your alert not resolving after updating the criteria?**
>
>   An alert’s state is not changed after the criteria is updated. So while your alert might move to the ‘healthy’ list in our UI, it will remain in a triggered state until new data resolves the alert naturally. If you are looking to quickly resolve an alert by updating the criteria, you could simply delete and recreate the alert.
> * **Is your alert triggering but not sending Slack notifications?**
>
>   Check the ‘alert description’ field on the alert configuration. If the description contains an invalid character, such as a double quotation mark ("), it can break the JSON payload of the Slack webhook and cause the request to return an error. Test your webhook with the following command:

```
curl -X POST -H 'Content-type: application/json' --data '{"text":"Hello, World!"}' <your-slack-webhook-url>
```
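
If you want to test your webhook with a message that itself contains quotation marks, hand-writing the JSON is error-prone. The sketch below builds the request body with Python's standard `json` module, which escapes the quotes for you. `slack_body` is a hypothetical helper and the message text is invented; the `{"text": ...}` shape matches the command above.

```python
import json


def slack_body(description: str) -> str:
    """Serialize a Slack incoming-webhook body; json.dumps escapes quotes."""
    return json.dumps({"text": description})


description = 'Disk "root" volume is above 90%'
body = slack_body(description)
print(body)

# Round-trips cleanly, so the payload is valid JSON despite the quotes.
assert json.loads(body)["text"] == description
```

The printed body can be passed to the `--data` option of the command above, wrapped in single quotes.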


