
OCI: Receive notification when your instance is down

Absolutely everybody:

– I would like to be notified when my instances go down, please? (≧︿≦)

This is one of the most common questions I’ve heard, so let’s see what’s required to put such notifications in place in Oracle Cloud Infrastructure (OCI).

0. What are the prerequisites?

  • Quite obviously, a subscription to OCI. If you don’t have one yet, please go and grab one: you’ll get $300 in credits to spend on any IaaS or PaaS service for one month, and once the month is over (or the credits run out) you will still have access to the “Always Free” resources. More details here:
    Everything described here works with an Always Free subscription; none of the services used requires upgrading to a paid subscription.
  • You already have basic knowledge of OCI – you know how to navigate the OCI console and create and run instances.
  • Rights to “Manage” the Notifications Service and rights to “Use” the OCI Metrics. If you are the tenancy admin, you already have those; if you aren’t, section 1.2 lists the policies you need.
  • Bare metal or virtual machine instances whose state you’re going to track. The instances should have metrics enabled.
  • A valid email address to receive the notifications.

TL;DR: We will send notifications when the instance stops producing metrics. This typically happens in three cases: 1) the instance is stopped; 2) it has crashed or lost its network connection for some reason; 3) the metrics agent has stopped working.

1. Initial setup

1.1 Monitored instances

In a real-world scenario, you’ll already have your instances running, but for the sake of the tutorial let’s create a couple. I’ll be creating them in a compartment “Sandbox” according to this diagram:

Of the two instances, I want to track the status of the webserver only.

By default, OCI instances created from Oracle-provided images already have the metrics agent installed, and metrics collection is enabled at launch. Verify this when creating the instance: on the “Create Compute Instance” page, scroll to the bottom and open the “Show advanced options” section, then make sure that “Enable Monitoring” and “Use Oracle Cloud Agent to manage this instance” are both checked.

Let’s capture the instances’ OCIDs (you can also see that the Cloud Agent is enabled, allowing the metrics collection):


1.2 Security policies

If you are the tenancy admin, you can skip this section.

If you aren’t, ask your admin to attach the following policies to your user group (suppose this group is called “UserGroup1”):

Allow group UserGroup1 to read metrics in tenancy
Allow group UserGroup1 to manage alarms in tenancy
Allow group UserGroup1 to manage ons-topics in tenancy
Allow group UserGroup1 to manage ons-subscriptions in tenancy

Wish to know more about how policies work? Follow this documentation link:

2. Making it work

2.1 Choosing a metric

We will base our alert on one of the compute agent’s metrics – but which one?

If you go to Main menu > Monitoring > Service Metrics, you’ll see that the oci_computeagent namespace has 8 metrics collected, covering CPU, disk and network utilization. But there is no metric for “instance up” or “down”, right?

Or is there? Let’s take a look: in this example below, I had two instances in the “sandbox” compartment that were stopped between 12:00 and 15:10. While there is no specific metric that tracks the instance’s state, we can instead use the “absence” of the metric as an equivalent to the “Down” state.

Any of the compute metrics can be used for “instance down” detection. I’ll use “CPU utilization”, but you can select any of the eight available: they all go absent when the instance is down.
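To make the “absence equals down” idea concrete, here is a minimal sketch in plain Python. It does not call OCI at all: the datapoint timestamps are made up, and the function name is_absent and the 3-minute window are my own choices for illustration.

```python
from datetime import datetime, timedelta

def is_absent(datapoints, now, window=timedelta(minutes=3)):
    """Return True when no datapoint falls inside the last `window` before `now`."""
    cutoff = now - window
    return not any(cutoff < ts <= now for ts in datapoints)

# Made-up CPU datapoints, one per minute, until the instance is stopped at 12:00.
start = datetime(2024, 1, 1, 11, 55)
datapoints = [start + timedelta(minutes=i) for i in range(6)]  # 11:55 .. 12:00

print(is_absent(datapoints, datetime(2024, 1, 1, 12, 2)))  # False: last datapoint is 2 minutes old
print(is_absent(datapoints, datetime(2024, 1, 1, 12, 5)))  # True: silent for 5 minutes
```

The same check works for any of the metrics, since the values themselves are never looked at, only whether datapoints keep arriving.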

2.2 Creating Notification topic and Email Subscription

OCI Notifications is a managed service built around Topics, which collect the individual notifications, and Subscriptions, which process the messages from the topics. A subscription of type “email” sends an email every time there is a new message in the Notification topic’s queue. More about the Notifications service:

Let’s go ahead and create a new Notification topic:
Go to Main Menu > Application Integration > Notifications

Then click the “Create Topic” button

I’ll name mine “Instance_State_Notification”

Next, let’s create an email Subscription for the “Instance_State_Notification” topic:

Go to Main Menu > Application Integration > Subscriptions and click the “Create Subscription” button

Then select the topic that was created in the previous step, choose “Email” as the protocol and fill in the address the alerts should be sent to.

An email with a confirmation link will be sent to this address. Once you confirm, the subscription is activated; until then it stays in the “Pending Confirmation” state:

2.3 Creating alarm definition

Now that the message delivery infrastructure is created, let’s create the actual alarm that will detect the instance down events and send the message to the notification topic.

Go to Main menu > Monitoring > Alarm Definitions, then press the “Create Alarm” button:

I’ll name this alarm “Webserver is offline”

An appropriate metric is one whose alarm condition becomes “true” when the state we are looking for (“instance down”) is reached.
I’m choosing the oci_computeagent metric namespace, the CpuUtilization metric, a 1-minute collection interval and the Mean statistic.

My “sandbox” compartment contains 2 instances, but I would like to track the state of only the webserver instance. To do that, I’ll add a filter using “Dimension Name” and “Dimension Value” fields.
As the dimension name I’m choosing “resourceId” (the OCID), and as the value the webserver’s OCID collected earlier (section 1.1).

The alarm will be triggered when the CPU Utilization metrics are absent for more than 3 minutes:
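The console builds the alarm query for you from these fields. As a rough sketch of what the resulting expression looks like, the snippet below assembles a query string in Python. Treat it as my approximation: the helper absence_query and the placeholder OCID are hypothetical, the statistic is left out, and the exact text should be checked against what the console shows for your alarm before reusing it anywhere.

```python
def absence_query(metric, resource_id, interval="1m"):
    """Build an expression that is true when `metric` is absent for one resource."""
    return f'{metric}[{interval}]{{resourceId = "{resource_id}"}}.absent()'

# Placeholder OCID: replace it with the webserver OCID captured in section 1.1.
WEBSERVER_OCID = "ocid1.instance.oc1..exampleuniqueID"

print(absence_query("CpuUtilization", WEBSERVER_OCID))
```

The resourceId filter is what keeps the alarm tied to the webserver only, even though both instances live in the same compartment.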

And finally, the Notifications section underneath lets you configure the destination of the alarm: the Notification Service topic “Instance_State_Notification”

2.4 Testing the alarm

First, let’s go to the Main menu > Monitoring > Alarm Status

There are no alarms currently in the “Firing” state – the webserver instance is still running and happily producing its metrics, so the alarm criterion of “absent” metrics isn’t satisfied yet:

Let’s stop the webserver:

Once the instance has stopped producing metrics for more than 3 minutes (+1 minute for the aggregation period), the alarm goes into the “Firing” state
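As a quick worked example of that timing, here is the arithmetic in Python. It only applies the rule of thumb from this article (3 minutes of absence plus 1 minute of aggregation); the function name and the timestamps are made up, and real delivery can take a little longer.

```python
from datetime import datetime, timedelta

def earliest_firing(stopped_at,
                    absence=timedelta(minutes=3),
                    aggregation=timedelta(minutes=1)):
    """Earliest moment the alarm can fire after the last metric was produced."""
    return stopped_at + absence + aggregation

# Instance stopped at 12:00, so the alarm is expected no earlier than 12:04.
print(earliest_firing(datetime(2024, 1, 1, 12, 0)))  # 2024-01-01 12:04:00
```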

And you receive an email with the alert

The email contains a JSON payload that can be used for further analysis or automatic ingestion by incident management systems:
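If you want to feed that payload into your own tooling, a small parser is enough to get started. The sketch below is an assumption-heavy example: the sample message is invented, and the field names (title, type, severity, alarmMetaData) reflect my understanding of the alarm message format, so compare them with a real email before relying on them.

```python
import json

# Invented sample: field names and values are assumptions, not a captured message.
SAMPLE = """
{
  "title": "Webserver is offline",
  "type": "OK_TO_FIRING",
  "severity": "CRITICAL",
  "alarmMetaData": [{"status": "FIRING"}]
}
"""

def summarize(raw):
    """Extract the few fields an incident system would usually need."""
    msg = json.loads(raw)
    transition = msg.get("type", "")
    return {
        "alarm": msg.get("title"),
        "severity": msg.get("severity"),
        "transition": transition,
        # Assumed convention: a transition ending in _TO_FIRING means "went down".
        "is_down": transition.endswith("_TO_FIRING"),
    }

print(summarize(SAMPLE))
```

Using .get() keeps the parser from crashing if a field is missing or named differently in the real message.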

You’ll receive another email when the alarm transitions from the “Firing” back to the “Non-Firing” state (after the instance has been restarted).

You can also try stopping the other instance, “appserver”, and observe that the alarm doesn’t fire: this proves that the filtering we’ve put in place in the alarm definition is working.

This is all for today, I hope this was useful!

Keep hacking (•̀ᴗ•́)و ̑̑
