Too often, companies scramble for fixes only after they’ve been bitten by a meltdown in their IT infrastructure. Collaboration between IT managers and those in the IT trenches is the key to a proactive approach, but the complex landscape of technologies and monitoring solutions makes it hard to frame that conversation for everyone involved. The following tips provide a framework for discussing and implementing an effective, proactive monitoring strategy for the year ahead.
1) Monitor every level of your IT infrastructure.
Often the first outwardly visible sign of a problem in your technology infrastructure is that something, hopefully only metaphorically, “blows up.” An outage occurs, an application slows, a website goes down, or a service (e-mail, etc.) stops working. A flurry of e-mails, calls, texts and (heaven forbid!) tweets ensues. One of the major challenges in mounting a measured response is that the root cause can originate anywhere in the infrastructure stack, so unless you can immediately identify what happened where, the problem will persist.
For example, an application slowing down may be the fault of the application itself. But it could also be a network issue, or the VM where the application lives, or the storage device that underlies the VM, and so on. Bottom line: all of the primary levels of your IT infrastructure (storage, servers, networks, applications, and virtualization) are interrelated, so make double sure every critical node, service and app is being properly monitored.
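To make the idea concrete, here is a minimal sketch in Python (standard library only) of what checking more than just the application layer might look like. The hostnames, port and health-check URL are hypothetical placeholders, and a real monitoring product does far more than this – the point is simply that each layer gets its own check.

#!/usr/bin/env python3
"""Minimal multi-layer health-check sketch (stdlib only).

All hosts, ports and URLs below are hypothetical placeholders;
substitute your own infrastructure.
"""
import shutil
import socket
import urllib.request

def check_network(host: str, port: int, timeout: float = 3.0) -> bool:
    """Network layer: can we open a TCP connection at all?"""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def check_storage(path: str = "/", min_free_pct: float = 10.0) -> bool:
    """Storage layer: is there still free headroom on the volume?"""
    usage = shutil.disk_usage(path)
    return (usage.free / usage.total) * 100 >= min_free_pct

def check_application(url: str, timeout: float = 5.0) -> bool:
    """Application layer: does the service answer with HTTP 2xx?"""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except OSError:  # covers URLError/HTTPError as well
        return False

if __name__ == "__main__":
    checks = {
        "network: db-server:5432": check_network("db-server.example.com", 5432),
        "storage: / free space":   check_storage("/"),
        "app: web front end":      check_application("https://www.example.com/health"),
    }
    for name, ok in checks.items():
        print(f"{'OK  ' if ok else 'FAIL'} {name}")

A real deployment would also cover the server and virtualization layers (hypervisor APIs, SNMP, agent-based collectors); the skeleton above just shows that each layer deserves an explicit check of its own.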
2) Tune alert thresholds; ensure critical alerts go out (to the right person!) well ahead of disaster.
Effective monitoring enables you to quickly isolate and resolve root causes, while effective alerting will ensure you do so before a problem impacts business.
Setting alert thresholds is a critical, yet often overlooked, step in minimizing the likelihood of IT performance problems. Set alert thresholds too high, and IT gets alerted too late. Set them too low, and IT gets blasted with dozens of meaningless alerts (noise) that it soon learns to ignore. Getting thresholds just right is a lengthy topic of its own – there’s no magic pill. It may involve consulting your monitoring solution provider (some offer pre-set thresholds), referring to manufacturer guidelines, or conducting your own research on best practices – all while taking into account your unique infrastructure and business needs.
The next step is to make sure alerts get routed to the right people at the right time. Take into account who the go-to person is for each layer and each node monitored – sending a storage alert to the network admin, for example, is not optimal. Also account for daily schedules, vacations and out-of-office time: a critical alert sent to the database admin while she’s off the grid backpacking won’t prevent an outage.
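As a rough illustration of both halves of this tip, the sketch below pairs warn/critical thresholds with a per-layer routing table that falls back to a backup contact when the primary is unavailable. Every metric name, threshold value and address here is invented for the example, not a recommendation – real thresholds must come from your own environment and vendor guidance.

"""Sketch of threshold-based alerting with routing (illustrative only)."""
from dataclasses import dataclass
from typing import Optional

@dataclass
class Threshold:
    warn: float      # nuisance level: log it, watch it
    critical: float  # page someone before users notice

# Too low and you drown in noise; too high and the first
# alert arrives after the outage. These values are made up.
THRESHOLDS = {
    "disk_used_pct":  Threshold(warn=80, critical=90),
    "cpu_load_avg":   Threshold(warn=4.0, critical=8.0),
    "app_latency_ms": Threshold(warn=250, critical=1000),
}

# Route each layer's alerts to the person who owns that layer,
# with a fallback in case the primary is off the grid.
ON_CALL = {
    "disk_used_pct":  ["storage-admin@example.com", "ops-backup@example.com"],
    "cpu_load_avg":   ["sysadmin@example.com", "ops-backup@example.com"],
    "app_latency_ms": ["app-team@example.com", "ops-backup@example.com"],
}

def evaluate(metric: str, value: float) -> Optional[str]:
    t = THRESHOLDS[metric]
    if value >= t.critical:
        return "CRITICAL"
    if value >= t.warn:
        return "WARN"
    return None

def route(metric: str, severity: str, value: float, available: set) -> None:
    # Page the first on-call contact who is actually reachable today.
    recipient = next((p for p in ON_CALL[metric] if p in available), None)
    print(f"[{severity}] {metric}={value} -> {recipient or 'ESCALATE: nobody on call!'}")

if __name__ == "__main__":
    available_today = {"ops-backup@example.com", "app-team@example.com"}
    for metric, value in [("disk_used_pct", 92), ("app_latency_ms", 300)]:
        severity = evaluate(metric, value)
        if severity:
            route(metric, severity, value, available_today)

Note the two-tier design: warn-level events can be logged for later review, while only critical events page a human – one simple way to keep the noise down while still getting alerts out ahead of disaster.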
3) Ensure you can quickly cross-correlate performance metrics across technologies.
Effectively monitoring every level of your IT is one thing; being able to cross-correlate the resulting data can be another matter entirely. You may have great network monitoring and great database monitoring, but if they live in two separate tools owned by two separate company divisions that don’t regularly coordinate, your uptime may be at risk.
Every level of your IT stack is interrelated, so when a problem occurs you may need to cross-correlate data to quickly home in on the root cause – and to avoid a costly and frustrating finger-pointing session between departments.
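As a toy illustration of what cross-correlation buys you, the snippet below computes a simple Pearson correlation between application latency and storage I/O wait over the same time window. The sample numbers are fabricated; in practice the two series would come from your separate monitoring tools, aligned on timestamps.

"""Toy cross-correlation sketch: do app latency and storage I/O wait
move together? Sample data is fabricated for illustration."""
from statistics import mean, pstdev

def pearson(xs, ys):
    """Pearson correlation coefficient for two equal-length series."""
    mx, my = mean(xs), mean(ys)
    cov = mean((x - mx) * (y - my) for x, y in zip(xs, ys))
    return cov / (pstdev(xs) * pstdev(ys))

# Same five-minute window, sampled once a minute, from two different tools.
app_latency_ms  = [120, 135, 410, 980, 450]  # application monitoring
disk_io_wait_ms = [4, 5, 48, 130, 52]        # storage monitoring

r = pearson(app_latency_ms, disk_io_wait_ms)
print(f"correlation = {r:.2f}")  # near 1.0 -> look at the storage layer first

A strong correlation doesn’t prove causation, but it tells the application team and the storage team to investigate together rather than blame each other – which is exactly the conversation a unified monitoring view makes possible.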
4) Ensure that you have a monitoring system, not a monitoring person.
Helpful hint: Ensure that systems and processes are well-documented, and that a disaster recovery plan is in place not just for your IT, but for your IT people. Employ monitoring solutions that can be easily learned and accessed by other people in the organization, and not solutions that require such a high level of expertise that they make one or two individuals virtually irreplaceable.
As a stakeholder in your organization’s future success, ask yourself this critical question: Where does the “intellectual property” of our monitoring system(s) reside? If the answer is “in the minds of one or two people,” you don’t have a monitoring system, you have a monitoring person! And how scalable, accessible, and reliable will this paradigm prove to be over the long run? Are you betting your company’s future on the answer to these questions?
5) Consider integrated monitoring when moving into the cloud.
Following the tips listed above can be a tall order. Move part of your infrastructure to another location or into the cloud and complexity can suddenly multiply by an order of magnitude. Are your current monitoring tools adequate for the job or will you need to seek out yet another solution to monitor your cloud infrastructure?
It may be time to rethink your overall strategy and consider a more holistic approach to monitoring – one that has the ability to unify monitoring/alerting for your entire IT infrastructure regardless of what technology it incorporates or where it lives.
Kevin McGibben, CEO of LogicMonitor
Kevin brings strong global experience in market strategy and channel implementation. He previously founded 32 South to assist mobile and new-media technology companies in strategically expanding into international markets. Kevin also founded firms in the VoIP, consumer products and Web 2.0 industries. His experience includes international marketing, channel development and general management positions with technology companies Fujitsu, TEKELEC and CIDCO.