
Telnet Networks News

Telnet Networks News - We'll keep you up to date with what's happening in the industry.


The 3 Most Important KPIs to Monitor on Your Windows Servers

Network Management

Monday, 17 August 2015


Much like monitoring the health of your body, monitoring the health of your IT systems can get complicated. There are potentially hundreds of data points that you could monitor, and customers often ask me to help them decide which ones they should monitor, mostly because so many KPI options are available.

However, once you begin to monitor a particular KPI, you are implicitly stating that this KPI is important (since you are monitoring it) and that you must respond when it raises an alarm. This can easily (and quickly) lead to “monitor sprawl”, where you monitor so many data points and generate so many alerts that you can’t really understand what is happening or, worse yet, you begin to ignore some alarms because you have too many to look at.

In the end, one of the most important aspects of designing a sustainable IT monitoring system is to determine what the critical performance indicators are, and then focus on those. In this blog post, I will highlight the 3 most important KPIs to monitor on your Windows servers, although, as you will see, these same KPIs are suited to any server platform.

1. Processor Utilization

Most monitoring systems have a statically defined threshold for processor utilization somewhere between 75% and 85%. In general, I agree that 80% should be the “simple” baseline threshold for core utilization.

However, there is more than meets the eye to this KPI. It is very common for a CPU to exceed this threshold for a short period of time. Without some consideration for the length of time that this mark is broken, a system could easily generate a large number of alerts that are not actionable by the response team.

I usually recommend a “grace period” of about 5 minutes before an alarm should be created. This provides enough time for a common CPU spike to return to an OK state, but is also short enough that when a real bottleneck occurs due to CPU utilization, the monitoring team is alerted promptly.
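The grace period idea can be sketched in a few lines of Python. This is only an illustration, not how any particular monitoring product implements it: the class name `SustainedThreshold`, its fields, and the 60-second sampling interval are all assumptions made for the example. The alarm becomes active only after every sample has stayed above the limit for the full grace period.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SustainedThreshold:
    """Hypothetical rule: alarm only when a metric stays above a limit
    for at least `grace_seconds`."""
    limit: float                            # e.g. 80.0 for 80% CPU
    grace_seconds: float                    # e.g. 300 for a 5-minute grace period
    breach_started: Optional[float] = None  # time of the first sample over the limit

    def update(self, timestamp: float, value: float) -> bool:
        """Feed one sample; return True while the alarm should be active."""
        if value <= self.limit:
            # Back under the limit: the spike is over, so reset the timer.
            self.breach_started = None
            return False
        if self.breach_started is None:
            self.breach_started = timestamp
        return timestamp - self.breach_started >= self.grace_seconds


cpu_rule = SustainedThreshold(limit=80.0, grace_seconds=300)

# A short spike (samples every 60 s, an assumed interval) never alarms.
spike = [cpu_rule.update(t, v) for t, v in [(0, 95), (60, 97), (120, 40)]]

# Sustained load alarms once it has lasted the full 5 minutes.
sustained = [cpu_rule.update(t, 92) for t in range(180, 541, 60)]
```

The same rule object could be reused for other counters by changing the two parameters, for example a limit of 10 and a grace period of 600 seconds for interrupt time.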

It is also important to take into consideration the type of server that you are monitoring. A well-scoped VM should in fact see high average utilization. In that case, it may be useful to also monitor a value like the total percentage of interrupt time. You may want to alarm when total interrupt time is greater than 10% for 10 minutes. This value, combined with the standard CPU utilization mentioned above, can provide a simple but effective KPI for CPU health.

2. Memory Utilization

Similar to CPU, memory bottlenecks are usually considered to take place at around 80% memory utilization. Again, memory utilization spikes are common enough (especially in VMs) that we want to allow for some time before we raise an alarm. Typically, memory utilization over 80-85% for 5 minutes is a good criterion to start with.

This can be adjusted over time as you get to understand the performance of particular servers or groups of servers. For example, Exchange servers typically have a different memory usage pattern compared to Web servers or traditional file servers. It is important to baseline these various systems and make appropriate deviations in the alert criteria for each.
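One possible way to turn a baseline into a per-group threshold is sketched below. The approach itself is an assumption for illustration, not a recommendation from this article: it takes the 95th percentile of historical memory samples, adds some headroom, and keeps the result between the default 80% and a ceiling. The function name, the headroom and ceiling values, and the sample data are all hypothetical.

```python
import statistics
from typing import Sequence


def suggest_threshold(samples: Sequence[float],
                      default: float = 80.0,
                      headroom: float = 5.0,
                      ceiling: float = 95.0) -> float:
    """Suggest an alert threshold (in percent) from baseline samples.

    Assumed heuristic: 95th percentile of the baseline plus headroom,
    never below the default and never above the ceiling.
    """
    # quantiles(n=20) returns 19 cut points; the last one is the 95th percentile.
    p95 = statistics.quantiles(samples, n=20, method="inclusive")[-1]
    return max(default, min(ceiling, p95 + headroom))


# Made-up baseline data, in percent memory utilization.
web_samples = [30, 35, 40, 45, 50] * 4
exchange_samples = [85, 86, 87, 88, 89, 90] * 4

web_threshold = suggest_threshold(web_samples)            # stays at the default
exchange_threshold = suggest_threshold(exchange_samples)  # raised above the default
```

A group that normally runs well below 80% keeps the default threshold, while a group that normally runs hot gets a higher one, so its alerts stay actionable.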

The amount of paging on a server is also a memory-related KPI that is important to track. If your monitoring system is able to track memory pages per second, then I recommend also including this KPI in your monitoring views. Together with standard committed memory utilization, these KPIs provide a solid picture of memory health on a server.

3. Disk Utilization

Disk Drive monitoring encompasses a few different aspects of the drives. The most basic of course is drive utilization. This is commonly measured as an amount of free disk space (and not as an amount of used disk space).

This KPI should be measured both as a percentage of free space (10% is the most common threshold I see) and as an absolute value, for example 200 MB free. Both of these metrics are important to watch, and each should have its own alert. It is also key to understand that a system drive might need a different threshold than non-system drives.
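As a minimal sketch of checking both criteria, the Python below uses the standard library's `shutil.disk_usage`. The function names and the default limits (10% and 200 MB, taken from the examples above) are assumptions for illustration; a real monitoring system would have its own configuration.

```python
import shutil
from typing import List

MB = 1024 * 1024


def free_space_alerts(free_bytes: int, total_bytes: int,
                      min_free_percent: float = 10.0,
                      min_free_bytes: int = 200 * MB) -> List[str]:
    """Return one message per violated free-space criterion."""
    alerts = []
    free_percent = 100.0 * free_bytes / total_bytes if total_bytes else 0.0
    if free_percent < min_free_percent:
        alerts.append(f"free space {free_percent:.1f}% is below {min_free_percent:.1f}%")
    if free_bytes < min_free_bytes:
        alerts.append(f"free space {free_bytes // MB} MB is below {min_free_bytes // MB} MB")
    return alerts


def check_drive(path: str, **limits) -> List[str]:
    """Check a mounted drive or directory, e.g. check_drive('C:\\\\')."""
    usage = shutil.disk_usage(path)
    return free_space_alerts(usage.free, usage.total, **limits)


# Large drive: 50 GB free of 1 TB trips only the percentage check.
large = free_space_alerts(50 * 1024 * MB, 1024 * 1024 * MB)

# Small drive: 150 MB free of 1 GB trips only the absolute check.
small = free_space_alerts(150 * MB, 1024 * MB)
```

The two sample drives show why both checks are worth having: each one trips a different criterion, and neither would be caught by the other check alone.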

A second aspect of HDD performance is the KPIs associated with the time it takes for disk reads and writes. This is commonly described as “average disk seconds per transfer”, although you may see it described in other terms. In this case the hardware that is used greatly influences the correct thresholds, so I cannot make a recommendation here. However, most HDD manufacturers publish appropriate values for their drives. You can usually find this information on the vendor’s website for your specific drives.

The last component of drive monitoring is pure logical drive availability. It seems obvious, but I have seen many monitoring systems that unfortunately ignore it, usually because it is not enabled by default and nobody ever thinks to check. An example is checking that the C:\, D:\ and E:\ drives (or whatever should exist) are available on a server. This is simple, but it can be a lifesaver when a drive is lost for some reason and you want to be alerted quickly.
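A drive availability check can be as simple as the sketch below, which only tests whether each expected drive root is reachable as a directory. The function names and the list of expected drives are hypothetical; adjust the list to whatever should exist on each server.

```python
import os
from typing import Dict, Iterable, List


def drive_availability(expected_roots: Iterable[str]) -> Dict[str, bool]:
    """Map each expected drive root to True if it is currently reachable."""
    return {root: os.path.isdir(root) for root in expected_roots}


def missing_drives(expected_roots: Iterable[str]) -> List[str]:
    """Return the expected drive roots that are not available."""
    status = drive_availability(expected_roots)
    return [root for root, available in status.items() if not available]


# Assumed list for a Windows server; on other platforms use mount points.
EXPECTED_DRIVES = ["C:\\", "D:\\", "E:\\"]

for drive in missing_drives(EXPECTED_DRIVES):
    print(f"ALERT: drive {drive} is not available")
```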

Summary:

In order to make sure that your Windows servers are fully operational, there are a few really critical KPIs that I think you should focus on. By eliminating some of the “alert noise” you can make sure that important alerts are not lost.

Of course each server has some application / service functions that also need to be monitored. We will explore the best practices for server application monitoring in a future blog post.

Thanks to NMSaaS for the article.


Contact Us

Address:

Telnet Networks Inc.
4145 North Service Rd. Suite 200
Burlington, ON L7L 6A3
Canada

Phone:

(800) 561-4019

Fax:

613-498-0075

Email:

sales@telnetnetworks.ca

For More Information about Telnet Networks, our products, or our services, or to request a quote please feel free to contact us directly.
