Wednesday, March 07, 2018

How to use vROps to find a VM causing a broadcast storm

A customer recently reached out to me asking whether vROps could help him find a particular VM that was apparently causing a broadcast storm.
The only information he got from the networking department was the host name and NIC of one of his ESXi hosts. The vCenter metrics did not reveal anything useful, and he had to check every single VM (more than 30 per ESXi host) one by one. In the end he asked whether vROps is capable of finding that bad guy.

The first step was to check whether vROps has broadcast metrics for VMs and whether they are active (collecting data) or not.
As you can see in the following figure, vROps knows these metrics, but they are disabled by default.

Fig. 1: disabled metrics in the default policy

Now, you could just activate these metrics for all of your VMs and check the values. But in an environment with several thousand VMs this adds load, and you will need these metrics only for a few VMs and for a limited period of time.
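To put rough numbers on that trade-off, here is a back-of-the-envelope sketch. The figure of 30 VMs per host comes from the customer case above; the total VM count and the number of broadcast metrics per VM are assumptions made only for illustration.

```python
# Rough comparison of how many extra metric series get collected.
# ASSUMPTIONS (not from the post): 5000 VMs in total, 2 broadcast
# metrics per VM. Only the 30 VMs per host comes from the case above.
TOTAL_VMS = 5000
VMS_ON_SUSPECT_HOST = 30
BROADCAST_METRICS_PER_VM = 2

series_if_enabled_everywhere = TOTAL_VMS * BROADCAST_METRICS_PER_VM
series_if_scoped_to_host = VMS_ON_SUSPECT_HOST * BROADCAST_METRICS_PER_VM

print(series_if_enabled_everywhere, series_if_scoped_to_host)
```

With these assumed numbers, scoping the policy to one host means collecting 60 additional series instead of 10000.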

Let us make it more dynamic and configurable for future use, in case your NOC comes to you with another ESXi host you have to check.

The idea is pretty simple: we need a policy which enables the needed metrics, and we need a group of VMs this policy should be applied to.

Step 1: create a new policy and activate the broadcast metrics for the Virtual Machine object type.

The following figure shows you the filters and settings to activate the right things.

Fig. 2: enable metrics in a new policy

We want to get this policy applied only to a dynamic group of VMs we would like to investigate. This is where the concept of Custom Groups comes into play.

Custom Groups work as containers for any objects you may have, and the settings of a Custom Group allow the membership to be dynamic, based on a wide range of properties, relations etc.

Now we could go to vROps, create a new custom group and define that group to contain all VMs which are children of a particular predefined ESXi host. This would be semi-dynamic.

Let's re-think this strategy.

In many cases the admin dealing with a broadcast storm in a vSphere environment is not the vROps admin in his organization.
Wouldn't it be better if the vSphere admin set "something" in vCenter and then saw a dashboard or received a report in/from vROps?

Exactly, so we go for vSphere Tags.

Our new tag will designate a host as being "under investigation". Time for the next step.

Step 2: create a vSphere Tag

Fig. 3: create a vSphere Tag

Now that we have our tag, we continue with a "two-staged-custom-group".
The first group will dynamically contain the ESXi hosts under investigation, and the second group will contain the Virtual Machines which run on those hosts.
This gives us the freedom to create multiple "second-stage-groups" with different policies assigned, in case we would like to investigate another behaviour which requires other metrics etc.
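The membership logic of the two stages can be sketched in a few lines of Python. This is only a model of the idea, not how vROps evaluates custom groups internally; the tag name, host names and VM names are hypothetical.

```python
from dataclasses import dataclass, field

# Hypothetical tag name, chosen only for this sketch.
INVESTIGATION_TAG = "under-investigation"


@dataclass
class Host:
    name: str
    tags: set = field(default_factory=set)


@dataclass
class VM:
    name: str
    host: str  # name of the ESXi host the VM currently runs on


def first_stage_group(hosts):
    """Stage 1: all hosts that carry the investigation tag."""
    return {h.name for h in hosts if INVESTIGATION_TAG in h.tags}


def second_stage_group(vms, tagged_hosts):
    """Stage 2: all VMs whose parent host is a member of stage 1."""
    return {vm.name for vm in vms if vm.host in tagged_hosts}


# Hypothetical inventory: only esx01 has been tagged.
hosts = [Host("esx01", {INVESTIGATION_TAG}), Host("esx02")]
vms = [VM("vm-a", "esx01"), VM("vm-b", "esx02"), VM("vm-c", "esx01")]

tagged = first_stage_group(hosts)
members = second_stage_group(vms, tagged)
print(sorted(members))
```

Tagging or untagging a host changes only the first set; the second set follows automatically, which is the behaviour we want from the two custom groups.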

Step 3: create a new custom group for the ESXi host(s)

Fig. 4: "first-stage-custom-group" - Host System

Anytime we assign our new vSphere Tag to an ESXi host, this host will become a member of this group.

Step 4: assign the tag to an ESXi host and wait one collection cycle for the custom group to be populated:

Fig. 5: vSphere tag assigned to a host

Fig. 6: "first-stage-custom-group" - dynamically populated

Time for the "second-stage", the custom group containing the VMs.

Step 5: create a new custom group for the VMs:

Fig. 7: "second-stage-custom-group" - dynamically populated

Once we have created the custom group for the VMs, it gets populated with the VMs which run on tagged ESXi hosts.
We can see that the metrics we need for our investigation are being collected:

Fig. 8: Collecting metrics

At this point we have everything we need to create a dashboard that helps our vSphere admin quickly find the bad guy:

Fig. 9: Dashboard with the results of the investigation
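What the dashboard does visually is essentially a ranking of the VMs by their broadcast traffic. A minimal sketch of that ranking, assuming the collected values have already been exported into a plain dictionary (the VM names and sample values below are made up for illustration, not taken from the customer environment):

```python
from statistics import mean

# Hypothetical samples: broadcast packets per collection interval,
# one list per VM in the second-stage group.
samples = {
    "vm-a": [12, 15, 11],
    "vm-b": [9800, 10400, 9900],
    "vm-c": [30, 25, 41],
}


def rank_by_broadcast(samples):
    """Return (vm, average) pairs, highest average first."""
    averages = {vm: mean(values) for vm, values in samples.items() if values}
    return sorted(averages.items(), key=lambda item: item[1], reverse=True)


ranking = rank_by_broadcast(samples)
top_vm, top_avg = ranking[0]
print(top_vm)
```

In this made-up data set vm-b stands out by orders of magnitude, which is the kind of pattern to look for in the real dashboard.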

I hope this post will help others during their RCAs.

