Device Down Issue Incident Automation

Riverbed IQ Ops surfaces Device Down Issue Incidents when it detects anomalies in Key Measurements streaming from Entity type: Device (e.g. Device Status and Device Uptime). This topic explains the workflow and processing logic of the Device Down Issue and shows you how to navigate to it in Riverbed IQ Ops.

Workflow and Processing Logic of the Device Down Issue

This section gives a high-level explanation of the parallel execution paths in the Runbook used by Automation Name: “Device Down Issue (Default)”.

Runbook: “Device Analysis” - Annotated

Entry: Initialization
All Runbooks require a Triggering Entity as an “entry-point”. This entry-point is the mechanism by which Riverbed IQ Ops passes all data/context gathered by the Analytics Pipeline (which generated the Incident) into the Runbook Automation. For Triggering Entity: Devices, the gathered data/context includes Primary and Correlated Indicators for the following Key Measurements:

  1. Device Status

  2. Device Uptime

Path-1: Impacted Locations
Immediately tags the Location of the Source Entity as impacted.

Path-2: Gather Device Data
Gathers additional Device details for the Source Entity, and:

  1. Visualizes the Device detail as a table in the Runbook Analysis.

  2. Checks the Device details to determine whether the Source Entity “Is Gateway” (i.e. a critical resource), and sets either:

    1. Priority: Critical, if the Source Entity “Is Gateway”.

    2. Priority: Low, otherwise.
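The Path-2 priority decision can be sketched in a few lines of Python. This illustrates the branching logic only and is not Riverbed IQ Ops code or API: `DeviceDetail`, its `is_gateway` field, `Priority`, and `incident_priority` are hypothetical names chosen to mirror the node labels above.

```python
from dataclasses import dataclass
from enum import Enum


class Priority(Enum):
    # Hypothetical enum for the two priorities Path-2 can set.
    CRITICAL = "Critical"
    LOW = "Low"


@dataclass
class DeviceDetail:
    # Hypothetical subset of the Device details gathered in Path-2.
    name: str
    is_gateway: bool


def incident_priority(device: DeviceDetail) -> Priority:
    """Gateways are treated as critical resources; every other device gets Low."""
    return Priority.CRITICAL if device.is_gateway else Priority.LOW


print(incident_priority(DeviceDetail("edge-rtr-01", is_gateway=True)).value)  # prints Critical
```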

Path-3: Gather Primary Indicator Detail
Decides the path of execution based on the Primary Indicator (i.e. the Key Measurement metric that triggered this event) and gathers additional relevant data/context:

  1. If the triggering metric is Device Status, then:

    1. The Runbook gathers Device Status information for the period prior to the event (and comparison data from the prior week).

    2. The Runbook visualizes the gathered Device Status information as a Timeseries chart.

  2. If the triggering metric is Device Uptime, then:

    1. The Runbook gathers Device Uptime information for the period just prior to the event (and comparison data from the previous day).

    2. The Runbook visualizes the gathered Device Uptime information as a Timeseries chart.
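The Path-3 branching can be modeled as a lookup from the Primary Indicator to a query plan. This is a minimal sketch, assuming a fixed one-hour look-back window (the actual window length is not stated here); `QueryPlan`, `COMPARISON_OFFSETS`, `DEFAULT_LOOKBACK`, and `plan_primary_indicator_query` are hypothetical names. Only the comparison offsets (prior week for Device Status, previous day for Device Uptime) come from the description above.

```python
from datetime import datetime, timedelta
from typing import NamedTuple


class QueryPlan(NamedTuple):
    # Hypothetical description of what Path-3 gathers for one metric.
    metric: str
    window_start: datetime
    window_end: datetime
    comparison_offset: timedelta  # shift applied to the window to get comparison data


# Offsets from the Path-3 description: prior week for Device Status,
# previous day for Device Uptime.
COMPARISON_OFFSETS = {
    "Device Status": timedelta(weeks=1),
    "Device Uptime": timedelta(days=1),
}

# Assumption: the look-back length is not documented; one hour is a placeholder.
DEFAULT_LOOKBACK = timedelta(hours=1)


def plan_primary_indicator_query(metric: str, event_time: datetime,
                                 lookback: timedelta = DEFAULT_LOOKBACK) -> QueryPlan:
    """Pick the execution branch for the Primary Indicator that triggered the Incident."""
    if metric not in COMPARISON_OFFSETS:
        raise ValueError(f"no branch for primary indicator {metric!r}")
    return QueryPlan(metric, event_time - lookback, event_time, COMPARISON_OFFSETS[metric])


plan = plan_primary_indicator_query("Device Uptime", datetime(2024, 5, 1, 12, 0))
comparison_start = plan.window_start - plan.comparison_offset
print(comparison_start)  # 2024-04-30 11:00:00
```

Returning a plan rather than fetching data keeps the branch selection separate from the gather and visualize steps, mirroring the two sub-steps listed for each metric.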

Path-4 through Path-7 each visualize the gathered data as a Timeseries chart if the information is available.

Path-4: Is CPU a possible cause?
Gathers additional CPU Utilization data/context. For example, a sudden CPU spike might indicate a problem that leads to the device becoming unresponsive.

Path-5: Is Memory a possible cause?
Gathers additional Memory Usage data/context. For example, high memory usage might indicate a memory leak or an overload that leads to the device becoming unresponsive.

Path-6: Is Disk a possible cause?
Gathers additional Disk Utilization data/context. For example, a sudden increase in disk activity might indicate a problem that leads to the device becoming unresponsive.

Path-7: Is config-change a possible cause?
Gathers additional Configuration Changes data/context. For example, a configuration change on a device might make the device temporarily unavailable.
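The “visualize only if the information is available” rule shared by Path-4 through Path-7 amounts to skipping a chart whenever the query returns nothing. This is a minimal sketch, assuming a caller-supplied `fetch` function that returns a list of `(timestamp, value)` points or `None`; `build_cause_charts`, `CANDIDATE_CAUSES`, and the chart dictionary layout are hypothetical.

```python
from typing import Callable, Dict, List, Optional, Tuple

Series = List[Tuple[float, float]]  # (timestamp, value) points

# The four candidate causes checked by Path-4 through Path-7.
CANDIDATE_CAUSES = {
    "Path-4": "CPU Utilization",
    "Path-5": "Memory Usage",
    "Path-6": "Disk Utilization",
    "Path-7": "Configuration Changes",
}


def build_cause_charts(fetch: Callable[[str], Optional[Series]]) -> Dict[str, dict]:
    """Return one Timeseries chart spec per path whose data is available."""
    charts: Dict[str, dict] = {}
    for path, metric in CANDIDATE_CAUSES.items():
        series = fetch(metric)
        if not series:  # None or empty: no data, so this path produces no chart
            continue
        charts[path] = {"type": "timeseries", "title": metric, "points": series}
    return charts


# Only CPU data is available in this example, so only Path-4 yields a chart.
sample = {"CPU Utilization": [(0.0, 35.0), (60.0, 97.5)]}
charts = build_cause_charts(sample.get)
print(sorted(charts))  # ['Path-4']
```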

Path-8: Determine any potentially impacted Apps
Gathers additional top-100 Application-related data/context for the Source Entity (based on Throughput) to assess potential impacts. If the data is available, it will:

  1. Visualize the list of top-100 potentially impacted Applications as a Table.

  2. Tag the top-100 Applications as potentially impacted.

Path-9: Determine any potentially impacted Clients/Users
Gathers additional top-100 Client-related data/context for the Source Entity (based on Throughput) to assess potential impacts. If the data is available, it will:

  1. Visualize the list of top-100 potentially impacted Clients as a Table.

  2. Tag the top-100 Clients as potentially impacted Users.

  3. Tag the top-100 Client Locations as potentially impacted Locations.

Path-10: Determine any potentially impacted Locations
Gathers the top-10 Client-Server Location Pairs for the Source Entity (based on Throughput) to assess potential impacts. If the data is available, it will:

  1. Visualize the list of top-10 potentially impacted Locations as a Table.

  2. Tag the top-10 Locations as potentially impacted Locations.
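Path-8 through Path-10 share one pattern: rank the related rows by Throughput, keep the top N (100, 100, and 10 respectively), then tag the survivors. This is a minimal sketch using Python's `heapq.nlargest`; the row dictionaries and their keys (`name`, `location`, `client_location`, `server_location`, `throughput`) are hypothetical, and tagging both ends of each Client-Server Location Pair in Path-10 is an assumption that the text does not confirm.

```python
import heapq
from typing import Dict, Iterable, List, Set


def top_n_by_throughput(rows: Iterable[dict], n: int) -> List[dict]:
    """Return the n rows with the highest throughput, highest first."""
    return heapq.nlargest(n, rows, key=lambda row: row["throughput"])


def impacted_tags(apps: Iterable[dict], clients: Iterable[dict],
                  location_pairs: Iterable[dict]) -> Dict[str, Set[str]]:
    top_apps = top_n_by_throughput(apps, 100)             # Path-8
    top_clients = top_n_by_throughput(clients, 100)       # Path-9
    top_pairs = top_n_by_throughput(location_pairs, 10)   # Path-10
    client_locations = {row["location"] for row in top_clients}
    # Assumption: Path-10 tags both the client and the server side of each pair.
    pair_locations = {loc for row in top_pairs
                      for loc in (row["client_location"], row["server_location"])}
    return {
        "applications": {row["name"] for row in top_apps},
        "users": {row["name"] for row in top_clients},
        "locations": client_locations | pair_locations,
    }


tags = impacted_tags(
    apps=[{"name": f"app{i}", "throughput": i} for i in range(150)],
    clients=[{"name": "alice", "location": "Austin", "throughput": 5.0}],
    location_pairs=[{"client_location": "Austin", "server_location": "DC-East",
                     "throughput": 12.0}],
)
print(len(tags["applications"]), sorted(tags["locations"]))  # 100 ['Austin', 'DC-East']
```

`heapq.nlargest` avoids sorting the full result set when only the top N rows are needed, and using sets means a Location reached through both Path-9 and Path-10 is tagged once.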

Location of the Device Down Issue in Riverbed IQ Ops

You can find the associated Device Down Issue Incident Runbook automation in the Riverbed IQ Ops UI. From the main menu:

  1. Click Automation, then select Automation Management to open the Automation Management page.

    • The Automation Management page contains a summary view of all supported Automations and their associated Runbooks.

  2. In the “New Incident Triggers” area of the Automation Management page, click the Device Down Issue panel to open a detail view for this type of Automation.

    • Each row in this detail view represents an Automation that can execute to investigate this type of Incident.

  3. Find the row with Automation Name: “Device Down Issue (Default)”. Click the cell with Runbook: “Device Analysis” to open this “out-of-the-box” automation in the Runbook Editor and see its constituent nodes and structure.

Constituent Nodes and Structure

This section explains the constituent nodes and structure of the Device Down Issue Incident Automation. The following diagram shows all of the automation's constituent nodes and their structure.

The automation contains:

  1. The required single entry-point.

    • The left-most light-green node: Triggering Entity: Devices. This entry-point passes supporting data/context gathered in the Analytics Pipeline (e.g. the Primary Indicator) into the Runbook.

  2. A set of interconnected nodes which stem from the entry-point.
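As a mental model only (it does not describe how Riverbed IQ Ops schedules nodes internally), the single entry-point fanning out into independent paths resembles handing one shared context to several functions that run concurrently. In this sketch, `run_paths`, the context keys, and the two sample path functions are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable, Dict

Context = Dict[str, Any]           # data/context passed in by the entry-point
PathFn = Callable[[Context], Any]  # one execution path stemming from the entry-point


def run_paths(context: Context, paths: Dict[str, PathFn]) -> Dict[str, Any]:
    """Give every path the same entry-point context and run the paths concurrently."""
    with ThreadPoolExecutor(max_workers=max(1, len(paths))) as pool:
        futures = {name: pool.submit(fn, context) for name, fn in paths.items()}
        return {name: future.result() for name, future in futures.items()}


context = {"entity": "edge-rtr-01", "location": "Austin",
           "primary_indicator": "Device Status"}
results = run_paths(context, {
    "Path-1": lambda ctx: {"impacted_location": ctx["location"]},
    "Path-3": lambda ctx: f"branch for {ctx['primary_indicator']}",
})
print(results["Path-3"])  # branch for Device Status
```

The sketch relies on each path reading the shared context without depending on another path's output, so the paths can finish in any order.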