Understanding how Alerts 2.0, health check frequencies, and cloud ping checks work with time delays can create a more robust method to gauge network health and reduce noise for your support teams.
This document will help you understand how health checks determine if a device is offline, how the health check priority works, how time delays give you better real-time accuracy, and how site suppression can be used to reduce noise. It will also give you a summary of the tests and checks that can be utilized to test the health of your network using entities outside of your network.
Historically, noise reduction came from a prolonged health check strategy. Essentially, a device must fail multiple health checks before Auvik determines that it is offline. A device could be offline for 5 minutes before Auvik displays it as offline. This can be frustrating for Auvik customers, who may get a call about something being offline before Auvik displays it or alerts on it.
Another problem arises when Alerts 2.0 is combined with an extended health check frequency and a high failure count. The alert time delay only adds to the time during which a customer may be unaware of the device's status.
How a device is determined to be offline
- 1st Ping fails
- After 15 seconds, 2nd ping fails
- After 30 seconds, 3rd ping fails
- After 45 seconds, 4th ping fails
- After 60 seconds, 5th ping fails
- Device is determined to be offline.
In Alerts 1.0, an alert fires immediately.
In Alerts 2.0, a time delay may be added so that if the device self-heals, no alert will fire. Think of the time delay as a “must be true for X number of minutes.”
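The "must be true for X number of minutes" behavior can be sketched as a small state machine. This is a toy model for illustration only: the class, field, and function names are hypothetical and do not reflect Auvik's implementation. It assumes a fixed 15-second ping interval and that any successful ping clears the offline state.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class OfflineAlertRule:
    """Toy model of an offline alert rule (hypothetical, not Auvik's code)."""
    min_failures: int                    # consecutive failed pings before "offline"
    alert_delay_s: int                   # Alerts 2.0 time delay; 0 acts like Alerts 1.0
    failures: int = 0
    offline_since: Optional[int] = None
    alerted: bool = False                # True once an alert has fired

    def record_ping(self, now_s: int, success: bool) -> None:
        if success:
            # The device answered (self-healed): clear the offline state.
            self.failures = 0
            self.offline_since = None
            return
        self.failures += 1
        if self.failures >= self.min_failures and self.offline_since is None:
            self.offline_since = now_s
        # The alert fires only once the device has stayed offline for the delay.
        if (self.offline_since is not None
                and now_s - self.offline_since >= self.alert_delay_s):
            self.alerted = True


def simulate(rule: OfflineAlertRule, results: List[bool], interval_s: int = 15) -> bool:
    """Feed ping results (True = success) at a fixed interval; report if an alert fired."""
    for i, ok in enumerate(results):
        rule.record_ping(i * interval_s, ok)
    return rule.alerted


# Device is down for about two minutes, then recovers on its own.
short_outage = [False] * 9 + [True]
print(simulate(OfflineAlertRule(min_failures=5, alert_delay_s=0), short_outage))    # True
print(simulate(OfflineAlertRule(min_failures=5, alert_delay_s=300), short_outage))  # False
```

With no delay, the short outage produces an alert as soon as the fifth ping fails. With a 5-minute delay, the same outage clears before the alert would have fired.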
Fixing the Health Check Priority and Defaults
By default, Auvik gives Access Points a higher priority than Network Elements. If the network is large or if the customer utilizes a shared collector, health checks may not be run against lower-priority devices such as Network Elements, which are often billable devices.
In the above example, access points have the highest priority. They may go offline, but it won’t be reflected as offline until 5 minutes have passed. Network Elements will be checked second in priority but in 15-second intervals.
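To see why priority order matters on a busy or shared collector, consider a deliberately simplified model in which a collector can only check a limited number of devices per cycle and works through device types in priority order (0 first). The capacity figure, device counts, and function name below are assumptions made for illustration; Auvik's actual scheduling is not described here.

```python
from typing import List, Tuple


def run_health_checks(
    device_types: List[Tuple[str, int, int]], capacity: int
) -> Tuple[List[str], List[str]]:
    """Toy scheduler (hypothetical): device_types holds (name, priority, device_count).

    Lower priority numbers are checked first. A device type is skipped when
    the remaining capacity for this cycle cannot cover it.
    """
    checked: List[str] = []
    skipped: List[str] = []
    remaining = capacity
    for name, _priority, count in sorted(device_types, key=lambda d: d[1]):
        if count <= remaining:
            checked.append(name)
            remaining -= count
        else:
            skipped.append(name)
    return checked, skipped


# Default-style ordering: Access Points ahead of Network Elements.
defaults = [("Access Points", 0, 80), ("Network Elements", 1, 40)]
print(run_health_checks(defaults, capacity=100))
# (['Access Points'], ['Network Elements'])

# Re-prioritized so that Network Elements are checked first.
reprioritized = [("Network Elements", 0, 40), ("Access Points", 1, 80)]
print(run_health_checks(reprioritized, capacity=100))
# (['Network Elements'], ['Access Points'])
```

In this model, whichever device type sits lower in the priority list is the one that misses its health checks when capacity runs out.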
Alerts 2.0 “Time Delays” Allow for Better Real-Time Accuracy
Additionally, shorter health check frequencies (that is, more frequent checks) can improve Auvik's real-time accuracy. Users can create rules for higher-priority devices with shorter frequencies and fewer minimum failures.
In the example below, the user adjusts the minimum failures and utilizes the Time Delay in Alerts 2.0.
| 5 Failures + Instant Alert (Alerts 1.0) (1-minute frequency and minimum failure of 5) | 3 Failures + 5 min Alert Time Delay (Alerts 2.0) (15-second frequency and minimum failure of 3) |
| --- | --- |
| Auvik does not display the device as offline for 1 minute. A device may self-heal shortly after the alert triggers. | Auvik displays the device as offline within 30 seconds. Alert triggers after 5 minutes in case the device heals. |
Functionally, the two do similar things, but the first masks offline devices until the frequency and failure count are met. It may then alert prematurely.
With alert delays, a device can appear offline sooner, but the alert fires later. In the example, a device is marked as offline after 30 seconds, and the alert fires 5 minutes and 30 seconds after the device initially goes offline.
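The timing can be worked out with simple arithmetic. The sketch below assumes the first failed check happens at the moment the device goes offline (t = 0) and that each further check follows one interval later. The function names are hypothetical.

```python
def seconds_until_offline(interval_s: int, min_failures: int) -> int:
    """Time from the first failed check until the device is shown as offline."""
    # The first failure is at t = 0, so only (min_failures - 1) intervals elapse.
    return (min_failures - 1) * interval_s


def seconds_until_alert(interval_s: int, min_failures: int, alert_delay_s: int = 0) -> int:
    """Time from the first failed check until the alert fires."""
    return seconds_until_offline(interval_s, min_failures) + alert_delay_s


# 15-second checks, 5 failures, no delay: offline and alerting at the same moment.
print(seconds_until_offline(15, 5))      # 60
print(seconds_until_alert(15, 5))        # 60

# 15-second checks, 3 failures, 5-minute alert time delay.
print(seconds_until_offline(15, 3))      # 30
print(seconds_until_alert(15, 3, 300))   # 330 (5 minutes and 30 seconds)
```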
SNMP resets Health Check counts
Network devices will often cease responding to pings if they are overloaded, while SNMP polls may still succeed. A collector polling a device successfully will reset any health check failure counts. Allowing SNMP polls to reset the health checks prevents false alarms.
How to Improve Real-Time Accuracy by Adjusting Health Check Frequencies and Minimum Failures
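A minimal sketch of this reset behavior follows. The class and method names are hypothetical and only illustrate the idea that a successful SNMP poll clears the failure count in the same way a successful ping would.

```python
class HealthCheckCounter:
    """Toy failure counter (hypothetical names, not Auvik's implementation)."""

    def __init__(self, min_failures: int) -> None:
        self.min_failures = min_failures
        self.failures = 0

    def ping_failed(self) -> None:
        self.failures += 1

    def ping_succeeded(self) -> None:
        self.failures = 0

    def snmp_poll_succeeded(self) -> None:
        # The device answered SNMP, so it is reachable: discard ping failures.
        self.failures = 0

    @property
    def offline(self) -> bool:
        return self.failures >= self.min_failures


counter = HealthCheckCounter(min_failures=3)
counter.ping_failed()
counter.ping_failed()
counter.snmp_poll_succeeded()   # overloaded device still answers SNMP
counter.ping_failed()
print(counter.failures, counter.offline)   # 1 False
```

Without the reset, the third failed ping would have marked the device as offline even though it was still answering SNMP.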
You can adjust health check frequencies and minimum failures in combination with the alert time delay.
- Adjust the health check frequencies so that Network Elements are re-prioritized to be zero, above Access Points.
- Change the minimum failures to 3.
- Check for non-default values. This is the perfect opportunity to start utilizing the time delay functionality. If your organization has a 5-minute delay set in health check frequencies, optimize it for better efficiency.
- Modify the offline alerts in Alerts 2.0 for a 3-5 minute Alert Time Delay.
Don’t Forget Internet Connection Tests and Cloud Ping Checks!
Internet Connection Test Improvements
Internet connection tests and cloud ping checks follow similar logic. An IP address can be checked at a specific frequency and must fail a certain number of times before it is determined to be offline.
In the scenario above, three checks will be performed, one per minute. In Alerts 2.0, this can be shortened and combined with the Time Delay.
Internet Connection Tests with Alert Delays
A valuable method to improve alerting here would be to have the internet connection show up offline after 2 minutes and have the alert fire after 3-5 minutes.
This addresses the problem of the internet connection being down without Auvik showing it.
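One way to pick settings is to work backwards from the point at which the connection should show as offline. The helper below is a hypothetical sketch: it assumes the first failed check lands at t = 0, and the candidate intervals are placeholders, not a list of values Auvik offers.

```python
from typing import List, Tuple


def settings_for_target(
    target_offline_s: int, candidate_intervals_s: List[int]
) -> List[Tuple[int, int]]:
    """Return (interval_s, min_failures) pairs that reach 'offline' at the target time."""
    options = []
    for interval in candidate_intervals_s:
        # Only intervals that divide the target evenly hit it exactly.
        if target_offline_s % interval == 0:
            options.append((interval, target_offline_s // interval + 1))
    return options


# Show the internet connection as offline after 2 minutes.
print(settings_for_target(120, [15, 30, 60]))
# [(15, 9), (30, 5), (60, 3)]
```

The `(60, 3)` option corresponds to three checks performed one minute apart. The alert time delay of 3-5 minutes is then applied on top of whichever option is chosen.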
Shared Collectors and Site Suppression
The health check frequencies can be modified to declare an entity on a site offline sooner than other devices. This is important for suppression to be effective. Auvik needs to know that the site's parent device is offline before alerts for the other devices can be suppressed effectively.
Health Check Frequencies should look something like this at the site level:
- Parent device
- Network Elements
- Access Points
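The reason the parent device needs to be detected first can be illustrated with a small sketch of suppression logic. The function, the device names, and the parent mapping are all hypothetical. The sketch assumes an alert is suppressed only when the device's parent is already known to be offline.

```python
from typing import Dict, List


def alerts_to_fire(offline_devices: List[str], parent_of: Dict[str, str]) -> List[str]:
    """Return the offline devices whose alerts are not suppressed by an offline parent."""
    fire = []
    for device in offline_devices:
        parent = parent_of.get(device)
        if parent is not None and parent in offline_devices:
            continue  # parent is already known to be offline: suppress this alert
        fire.append(device)
    return fire


# Hypothetical site: a firewall is the parent of a switch and an access point.
parent_of = {"switch-1": "firewall", "ap-1": "firewall"}

# Parent not yet detected as offline: every child alerts on its own.
print(alerts_to_fire(["switch-1", "ap-1"], parent_of))
# ['switch-1', 'ap-1']

# Parent detected first: only the parent alerts.
print(alerts_to_fire(["firewall", "switch-1", "ap-1"], parent_of))
# ['firewall']
```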