Keeping the Trains Running With Application Performance Monitoring

To keep up with the growing dependence on always-on applications, Harrison College upgraded its application performance monitoring and found increased user satisfaction in the process.

At Harrison College, a for-profit institution with 11 campuses in Indiana, one campus in Ohio, a culinary program, and an online studies program, rising online enrollment has led to increased reliance on technology for delivering education. For example, last year the school rolled out a $2 million initiative called KnowU to provide a Facebook-like social media setting with online tools and resources to help better engage students and improve retention. But what happens when the IT infrastructure behind those operations breaks down? Particularly in an online world, where students often are working after standard campus hours, it's critical for IT to monitor application performance and make sure networks and servers are running the way they should.

While Harrison had an application monitoring system in place, recent IT infrastructure improvement projects--including an extensive network redesign, a refresh of switches and routers, migration of e-mail systems, and a shift to server virtualization in the central IT operations--had rendered it inadequate, says Systems Engineer Damien Solodow. For one, the legacy monitoring system was Windows-centric, which meant it wasn't useful for the institution's VMware or Linux systems. In addition, it lacked template capabilities to allow an administrator to specify what to watch for on all servers dedicated to a specific task, such as those running SharePoint or IIS. Plus, its alerting capabilities were "severely limited," he adds. "It's a necessary thing for the monitoring system to be able to know that something is down or not working. But if it can't tell the administrator that something went down, then it's not very useful if it keeps that information to itself."

Some IT staff members were familiar with Network Performance Monitor from SolarWinds, so the college brought that into the IT environment to stay on top of network issues. Convinced of the advantages of that tool, Solodow also made the case for adding the company's Server and Application Monitor, or SAM (previously named Orion Application Performance Monitor).

How It Works
The use of SAM allows Solodow and his colleagues to monitor the performance of software and hardware, including hard drives and power supplies, domain controllers, common applications such as Exchange and SQL Server, as well as more arcane systems, such as the school's self-service password reset program.

When the administrator points the monitoring software to a particular server, it looks at the list of services and performance counters to establish which application they belong to. Once the application is identified, the program applies the appropriate template, whether that be for an operating system, an application, some kind of common computing service, or something custom. (SolarWinds runs a community site, Thwack, where network engineers can share their own templates and get and give answers to technical questions.)
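To make the template idea concrete, here is a minimal sketch of how discovered services might be matched to templates. It is an illustration only, not SolarWinds' actual logic: the TEMPLATES catalog, the match_templates function, and the service lists chosen for each template are all assumptions.

```python
# Hypothetical template catalog. The template names and the services each
# one requires are illustrative assumptions, not SAM's real definitions.
TEMPLATES = {
    "SQL Server": {"MSSQLSERVER", "SQLSERVERAGENT"},
    "IIS": {"W3SVC"},
    "Exchange": {"MSExchangeIS", "MSExchangeTransport"},
}


def match_templates(discovered_services):
    """Return the names of templates whose required services are all present."""
    found = set(discovered_services)
    return sorted(
        name for name, required in TEMPLATES.items() if required <= found
    )


if __name__ == "__main__":
    # A server running a web service and a database, plus an unrelated service.
    services = ["W3SVC", "MSSQLSERVER", "SQLSERVERAGENT", "Spooler"]
    print(match_templates(services))
```

A server that runs only some of a template's services matches nothing in this sketch, which is one simple way to avoid applying a template to the wrong machine.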

Once the template is applied for a specific server, SAM monitors those services for failure, slowdown, port unavailability, and other potential problems. For example, says Solodow, in Harrison's environment, the SQL Server templates monitor "not just the services necessary for SQL to operate, but also for SQL-related performance counters." If a service is under an unusually high load, there could be something going on--such as a lock on a record--"that is maybe a little more subtle than a whole server being down," notes Solodow. Those kinds of problems can cause "significant performance and user experience issues," he continues. "Or in some cases this is a warning sign: If you don't take care of this, it's going to eventually lead to a system down event."
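The difference between "a whole server being down" and a subtler warning sign can be sketched as a two-level threshold check on a performance counter. The counter name and threshold values below are hypothetical, and the CounterRule class and evaluate function are assumptions made for illustration rather than anything taken from SAM.

```python
from dataclasses import dataclass


@dataclass
class CounterRule:
    """Warning and critical thresholds for one performance counter."""
    name: str
    warning: float
    critical: float


def evaluate(rule, value):
    """Classify a sampled counter value as 'ok', 'warning', or 'critical'."""
    if value >= rule.critical:
        return "critical"
    if value >= rule.warning:
        return "warning"
    return "ok"


if __name__ == "__main__":
    # Hypothetical rule: the counter name and numbers are illustrative only.
    lock_waits = CounterRule(name="lock_waits_per_sec", warning=5, critical=20)
    for sample in (1, 7, 42):
        print(sample, evaluate(lock_waits, sample))
```

The warning level corresponds to the case Solodow describes, where the service is still up but its behavior suggests trouble ahead.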

The alerts generated by SAM--typically texts or e-mails--can be delegated to specific IT staffers, depending on their roles. "If a switch goes down, it only notifies our network and telecoms team," explains Solodow. "It doesn't bother our application support people because that wouldn't be their responsibility."
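Role-based alerting of this kind amounts to a lookup from the type of failed component to the team that owns it. The sketch below is a simplified illustration under assumed names: the ROUTES table, the team labels, and the fallback to an incident response team for unlisted component types are hypothetical and do not reflect Harrison's actual configuration.

```python
# Hypothetical routing table mapping component types to responsible teams.
ROUTES = {
    "switch": ["network-telecom"],
    "router": ["network-telecom"],
    "application": ["application-support"],
}

# Assumed fallback for component types that have no explicit owner.
DEFAULT_TEAMS = ["incident-response"]


def recipients_for(component_type):
    """Return the teams that should be notified for a failed component."""
    return ROUTES.get(component_type, DEFAULT_TEAMS)


if __name__ == "__main__":
    # A switch failure reaches the network team but not application support.
    print(recipients_for("switch"))
    print(recipients_for("application"))
```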

Those alerts also provide a specific description of what the issue is. For example, power at one of Harrison's campuses recently went out due to weather problems. SAM detected that a battery backup was getting its power not from the wall but from its battery, and it alerted the college's incident response team to that fact. That team responded by checking in with the campus to confirm the power outage and to determine just how long it might last. The team also sent out a communication notifying the whole college community about the outage.

SAM works just as effectively in the college's virtual environment as it does in the physical environment, says Solodow. The program can monitor virtual machine hardware and performance metrics, including CPU, memory, storage, and network bandwidth utilization. "It's actually able to make use of the hardware agents that are included with vSphere. It's able to tell us, there's a power supply failure on this VMware host server." (Although Harrison is running VMware, SAM also works with Microsoft Hyper-V 2008 and 2008 R2, the company reports.)

Also, SAM can receive communications from agents provided with major server platforms, such as Harrison's HP ProLiant hardware, for reporting on hardware events, such as a hard drive or power supply failure. When the college brings new servers online, Solodow explains, IT staffers configure SNMP or WMI depending on what operating system is being used. Then the node is added to SAM, which polls the server to find out what type it is (physical or virtual), what version of OS it's running, what drivers it contains, and where it fits into the network topology.

Faster Fixes and More Satisfied Users
Before SAM was put into place at Harrison, the clue that something was wrong "would have been calls or tickets coming into the helpdesk," says Solodow. "You'd have a bunch of people from different locations saying, 'Hey, I can't log into this,' or 'This isn't working right,' or 'This isn't loading.'"

Now, the automated notifications arrive "almost as soon as the system goes down." Frequently, Solodow adds, "We've been able to take corrective action even before there's an issue."

Recently, for example, one of the college's SQL servers had a hardware failure, taking that server offline. Because the server was part of a cluster, the cluster moved that SQL instance over to another server. "But there was a brief period where that SQL instance was not available, and some of the applications that depended on databases on that SQL server didn't handle that brief outage gracefully," he says. In other words, they stopped running. SAM notified the appropriate IT people that the applications weren't responding anymore. "We were able to then go in and restart the appropriate services and bring everything back online."

That kind of service has been "very beneficial," Solodow notes. "Not only does it really reduce the amount of downtime the users have to deal with, but it enables us to be a great deal more proactive." By the time users have noticed the problem, "We've already fixed it or we can tell them, we're already aware of it and we're working on it. Our uptime and our user satisfaction has definitely reflected positive change as a result of this."
