Simplify monitoring console for optimal performance
1/4/2017 12:00:48 AM
By Ernie Gilman
While working with many customers as an IBM technical specialist, I have discovered what makes a monitoring console effective and simple to use. Rather than providing recommended guidelines, I will highlight several customer examples where I saw a dramatic improvement in customer satisfaction and effectiveness.
Although these examples leveraged the IBM OMEGAMON Tivoli Enterprise Portal (TEP), I am focusing on the concepts that produced the best results.
A Monitoring View of Storage
In my first example, the system programmer was telling me it was taking too long to navigate through key volumes or storage groups, as there were just too many mouse clicks to get to the resource in question. My solution was a custom technical storage dashboard that highlighted where to look for possible issues. The system programmer was so impressed that he created many new custom views of his own.
Although the system programmer fell in love with the custom dashboard, the manager wanted something simpler. The manager told us, “Just tell me that everything is working.” We therefore came up with what we thought would do the trick: What could be simpler than colored dots (see Figure 1)? The manager loved it, as did the system programmers. One glance and everyone knew the high-level status of the enterprise storage environment on their z Systems platform.
Although not a new concept, “less is more” is a lesson that has to be relearned with every new generation. Sometimes even very technical people can be confused by too many technical details.
Figure 1: Simplified Enterprise Storage Dashboard
Understand Application Impact From a CICS Issue
With my next customer, the operators had trouble understanding when a given application was being impacted by an issue with a given CICS region. In many cases, they would wait for the users to call and complain before they knew which application was slow or down. Management wanted something more proactive. To accomplish this, we created application views showing which application was being impacted by any CICS issue. Again, the solution was to simplify operations with colored dots representing key application health. Now the operators do not have to call anyone to figure out the impact or wait for users to complain. Operators loved it.
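The idea behind such an application view can be prototyped outside TEP. The following is a minimal Python sketch, not the TEP implementation: it assumes each CICS region already has a green, yellow, or red status, and it rolls those up so that an application shows the worst status of the regions it depends on. The application names, region names, and the `application_health` function are all hypothetical.

```python
from enum import IntEnum


class Status(IntEnum):
    """Dot colors, ordered so that a higher value means a worse state."""
    GREEN = 0
    YELLOW = 1
    RED = 2


# Hypothetical mapping of applications to the CICS regions they rely on.
APP_REGIONS = {
    "payments": ["CICSPA01", "CICSPA02"],
    "claims": ["CICSCL01"],
}


def application_health(app_regions, region_status):
    """Return {application: worst Status among its regions}.

    Assumption: a region with no reported status is treated as GREEN.
    """
    health = {}
    for app, regions in app_regions.items():
        statuses = [region_status.get(r, Status.GREEN) for r in regions]
        health[app] = max(statuses, default=Status.GREEN)
    return health


if __name__ == "__main__":
    current = {
        "CICSPA01": Status.GREEN,
        "CICSPA02": Status.RED,
        "CICSCL01": Status.YELLOW,
    }
    for app, status in application_health(APP_REGIONS, current).items():
        print(f"{app}: {status.name}")
```

Taking the worst status is only one possible rollup rule. It suits an operator view because a single red dot is enough to show which application is affected.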
Visualize Impact of High CPU Utilization on the CPC and LPARs
One common challenge with many of my customers is understanding how the central processing complexes (CPCs) are doing and what could be impacting their overall CPU utilization. I developed CPU utilization trend plot charts for the CPCs and the LPARs that make up the CPCs (see Figure 2). For operators and managers, I used the simple dots health view (see Figure 3) and then they could drill down to the trend charts if they saw high CPU on the CPCs or LPARs.
Figure 2: CPCs and LPAR Health View
Figure 3: CPCs and LPAR Health View
Shooting Star Chart
Continuing with the same customer, operators wanted to identify more quickly which address space could be impacting the LPAR CPU utilization. A chart showing all running address spaces over time was too busy. To simplify this, I came up with the “shooting star chart.” This chart quickly identifies the address spaces causing the highest CPU utilization at different points in time. I did this by filtering the history plot graph to show only applications above a given CPU utilization (see Figure 4).
Figure 4: Application CPU Shooting Star Chart
With a shooting star chart, if only a dot appears, then the high application CPU wasn’t persistent and therefore isn’t a problem. However, if a line, or what I call a “shooting star,” appears, then the high CPU is persistent and may be the problem. If the line is angled up, the CPU utilization is getting worse. Sometimes you see two shooting stars dancing with each other, which can result from applications calling each other. Now we had a quick visual indication of which address spaces were impacting the LPAR at any given time.
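The same reading of the chart can be approximated in code. This is a minimal Python sketch under stated assumptions, not how TEP builds the plot: samples are `(interval, address_space, cpu_percent)` tuples, the threshold plays the role of the chart filter, and a run of at least `min_run` consecutive intervals above the threshold counts as a line rather than a dot. The function name, the `min_run` value, and the sample data are hypothetical.

```python
from collections import defaultdict


def classify_shooting_stars(samples, threshold, min_run=3):
    """Classify address spaces that exceed a CPU threshold.

    samples: iterable of (interval_index, address_space, cpu_percent).
    Returns {address_space: "dot" | "star" | "rising star"}; address
    spaces that never exceed the threshold are left out, as on the chart.
    """
    # Step 1: the chart filter. Keep only samples above the threshold.
    by_space = defaultdict(list)
    for interval, space, cpu in samples:
        if cpu > threshold:
            by_space[space].append((interval, cpu))

    result = {}
    for space, points in by_space.items():
        points.sort()
        # Step 2: find the longest run of consecutive intervals.
        best = run = [points[0]]
        for prev, cur in zip(points, points[1:]):
            run = run + [cur] if cur[0] == prev[0] + 1 else [cur]
            if len(run) > len(best):
                best = run
        # Step 3: a short run is a dot; a long run is a shooting star,
        # and it is rising if CPU is higher at the end than at the start.
        if len(best) < min_run:
            result[space] = "dot"
        elif best[-1][1] > best[0][1]:
            result[space] = "rising star"
        else:
            result[space] = "star"
    return result


if __name__ == "__main__":
    history = [
        (0, "BATCH1", 95),
        (1, "CICSA", 82), (2, "CICSA", 88), (3, "CICSA", 93),
        (1, "DB2X", 85), (2, "DB2X", 84), (3, "DB2X", 83),
        (0, "IDLE", 5),
    ]
    print(classify_shooting_stars(history, threshold=80))
```

In this sample data, `BATCH1` is a one-off dot, `CICSA` is a shooting star angled up, and `DB2X` is persistent but not getting worse.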
With my next customer, the z/OS networking team wanted to see if there were any weekly spikes in TCP/IP connections. With historical baselines, users can see hidden trends that appear at specific times every day (such as the TCP/IP spike three times a day in Figure 5). The z/OS networking team traced the cause of these spikes to hard-coded DB2 connect requests.
Figure 5: TCP/IP Connections Historical Baseline
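One simple way to reason about a historical baseline is to average the same hour across many days and then look for hours that stand well above the overall level. The following Python sketch illustrates that idea only; it is not how TEP computes its baselines. The `(day, hour, connections)` sample layout, the function names, and the `factor` of 1.5 are assumptions chosen for illustration.

```python
from collections import defaultdict
from statistics import mean


def hourly_baseline(samples):
    """Average connection count per hour of day.

    samples: iterable of (day, hour, connections).
    Returns {hour: mean connections across all days}.
    """
    by_hour = defaultdict(list)
    for _day, hour, count in samples:
        by_hour[hour].append(count)
    return {hour: mean(counts) for hour, counts in sorted(by_hour.items())}


def recurring_spikes(baseline, factor=1.5):
    """Hours whose baseline exceeds `factor` times the overall average."""
    overall = mean(list(baseline.values()))
    return [hour for hour, avg in baseline.items() if avg > factor * overall]


if __name__ == "__main__":
    # Synthetic data: two days, 100 connections per hour,
    # with a jump to 400 at 06:00, 12:00, and 18:00.
    data = [
        (day, hour, 400 if hour in (6, 12, 18) else 100)
        for day in range(2)
        for hour in range(24)
    ]
    print(recurring_spikes(hourly_baseline(data)))
```

Because each hour is averaged over many days, a one-time burst is diluted while a spike that recurs at the same time every day stays visible.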
My final client had a health view of their whole enterprise with different components represented with colored dots. However, no one normally looked at the health view because they knew they would be automatically notified if there was an issue with their component. This automatic notification was developed over time by reviewing every major problem, making certain that an alert was always issued and that it was sent to the right team to resolve. Finally, here was a customer who had found their nirvana.
Simplification Increases Usability
The discovery common to all my customers was that simplification is the key to a more usable and effective monitoring console. The other key is the ability to easily customize the monitoring console, which is the case for the OMEGAMON TEP GUI.
Feel free to contact me (firstname.lastname@example.org) if you have questions about some of the TEP views highlighted in this article.
Ernie Gilman is an IBM senior consulting technical specialist of z Systems Middleware solutions. Ernie can be reached at email@example.com.