Improving CPU utilization on zEC12
4/10/2013
By Bob Rogers
It’s time again to take a look at one of the ideas that keeps System z ahead of the competition in performance and efficient utilization of resources. This idea, first implemented on the IBM zEnterprise EC12 (zEC12) processor, helps maximize physical CPU utilization without causing the so-called “short engine” effect. Historically, the short engine effect has arisen in two forms: “horizontal short engines” before HiperDispatch and “vertical short engines” in HiperDispatch mode.
Short Engines’ Long History
Horizontal short engines have been a problem since the introduction of Processor Resource/Systems Manager (PR/SM) itself. The problem arises when trying to optimize physical CPU utilization on a system where the capacity requirements of the different partitions vary over time. From the beginning, PR/SM provided a choice for what it should do when the OS on a logical processor entered a wait state. One option keeps the logical processor bound to the physical processor until the end of the time slice, even though no work is being done while the logical processor is in a wait state. This option is almost never used. The other option is to remove the logical processor from the physical one so the physical processor can be used to dispatch some other logical processor. This enables much more efficient use of the physical processors because the CPU capacity that’s been allocated for a partition, but isn’t used by that partition, can potentially be used by other partitions if they have work to consume it.
However, for a partition to consume additional capacity that's "left on the table" by other partitions, it needs enough logical processors to absorb that extra capacity. This means defining more logical processors than are needed to consume the partition's guaranteed share, which is the percentage of total processor capacity indicated by the weight assigned to the partition. Over-configuring the logical processors like this decreases the percentage of time each will actually be running, because PR/SM traditionally distributed a partition's capacity share equally across all of its online logical processors. So, if a small partition has a guaranteed share equal to 25 percent of one CPU and is defined with two logical processors, each will typically be running only 12.5 percent of the time. This might be sufficient to run the workload, but the way PR/SM delivers it can be problematic.
To reduce the overhead caused by dispatching logical processors, PR/SM is willing to give a fairly generous time slice, ranging from 12.5 to 25 milliseconds, to a logical processor once it's dispatched. The downside is that after having run for its time slice, a logical processor must wait its turn while others run for their time slices. Since each of the logical processors in this example is guaranteed to run only one-eighth of the time, PR/SM might not run it again for a full tenth of a second. This can be bad because an OS like z/OS doesn't know the logical processor isn't actually running.
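The arithmetic behind this example can be made concrete. The following sketch is illustrative only, not PR/SM's actual dispatching algorithm; the function names are invented for this example:

```python
# Illustrative arithmetic only -- not PR/SM's actual dispatching algorithm.

def per_lp_share(guaranteed_share: float, logical_cpus: int) -> float:
    """Traditional (horizontal) PR/SM: a partition's guaranteed share is
    split evenly across all of its online logical processors."""
    return guaranteed_share / logical_cpus

def worst_case_gap_ms(per_lp: float, time_slice_ms: float) -> float:
    """If a logical processor runs only per_lp of the time, in slices of
    time_slice_ms, it can wait the rest of each rotation before running."""
    rotation = time_slice_ms / per_lp     # length of one full rotation
    return rotation - time_slice_ms       # idle time per rotation

share = per_lp_share(0.25, 2)             # 25% of one CPU over 2 LPs
print(share)                              # 0.125 -- runs 1/8 of the time
print(worst_case_gap_ms(share, 12.5))     # 87.5 ms idle per 100 ms rotation
```

With a 12.5-millisecond slice and a one-eighth run fraction, the full rotation works out to 100 milliseconds, matching the tenth-of-a-second wait described above.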
The short engine phenomenon can lead to poor response time because supposedly expeditious work can be stranded on a logical processor that won’t run again for what, at modern CPU speeds, is a very long time. It also wastes cycles on the running processors spinning for system locks held by processors that aren’t running. HiperDispatch can manage the number of online logical processors to mitigate the horizontal short engine effect.
HiperDispatch’s Own Short Engine Effect
Unfortunately, while reducing the horizontal short engine effect, HiperDispatch creates the vertical short engine effect. To see how, here’s a quick sketch of how HiperDispatch works.
Among other things, HiperDispatch entails a different way of distributing CPU share to the logical processors of a partition to reduce PR/SM overhead and get better value from the processor caches. Basically, it tries to give a 100-percent share to some of the logical processors and distribute the remainder of the guaranteed share across one or two additional logical processors, each of which has a share of at least 50 percent. Any logical processors beyond these are kept "parked" unless and until some additional CPU share becomes available that's been left on the table by other partitions (sometimes called white space). When in HiperDispatch mode, the guaranteed share is distributed in a way that minimizes the horizontal short-engine effect. There's no short-engine problem for a partition as long as it only consumes its guaranteed share. However, the additional logical processors that are unparked to consume share beyond the guaranteed share create an even worse short-engine problem.
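The polarization policy described above can be sketched roughly in code. This is a simplified model of the stated rules only (100-percent "vertical high" processors, one or two "vertical medium" processors with at least a 50-percent share); the real PR/SM and z/OS algorithm is more involved, and the function name is invented:

```python
import math

def polarize(guaranteed_cpus: float):
    """Hypothetical sketch of HiperDispatch vertical polarization as
    described in the article: as many 100% 'high' logical processors as
    the guaranteed share allows, with the remainder spread over one or
    two 'medium' processors that each get at least a 50% share."""
    highs = math.floor(guaranteed_cpus)
    remainder = guaranteed_cpus - highs
    mediums = []
    if remainder >= 0.5:
        mediums = [remainder]             # one medium is enough
    elif remainder > 0:
        if highs > 0:
            # too small for a single >=50% medium: fold in one high's
            # share and split it across two mediums of >=50% each
            highs -= 1
            mediums = [(remainder + 1.0) / 2] * 2
        else:
            mediums = [remainder]         # tiny partition: nothing to fold
    return highs, mediums                 # any further LPs stay parked
```

For example, a partition entitled to 2.5 CPUs would get two high processors and one medium at 50 percent, while 3.4 CPUs would get two highs and two mediums at 70 percent each.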
The processors that z/OS unparks to consume the extra share are running on borrowed time, which really belongs to another partition not currently able to consume its guaranteed share. The word “currently” here is very important. At any moment (and completely out of the control of the partition using the extra share) the rightful owner may step up and start using it. This leaves the unparked processor high and dry, and strands any z/OS unit of work running on it for what could be a considerable length of time. That’s bad if the work is a critical transaction-processing region or important OS function. Even if the extra capacity the unparked engine is running on doesn’t completely evaporate, it still might be so small it creates a severely short engine.
For this reason, the original HiperDispatch design was very conservative about unparking processors to consume instantaneously available white space. This reluctance to unpark processors for small dollops of white space left physical CPU utilization lower than optimal.
Warning Track’s Simple Solution
The solution to both short-engine problems is maddening in its simplicity. Like several other recent inventions on System z, it involves communication between the PR/SM hypervisor and z/OS. PR/SM knows when it's about to undispatch a logical processor from its physical processor. Likewise, z/OS controls when it's running work on a logical processor. So, if PR/SM can let z/OS know it's about to undispatch one of z/OS's logical processors, z/OS can, in turn, undispatch the work from that logical processor so no z/OS work will be stranded. This communication is accomplished by a "Warning Track Interrupt," which gives the solution its name. It's that simple, yet it's an elegant solution to an age-old problem. It improves HiperDispatch physical CPU utilization without the risk of leaving important work stranded on a processor that just won't run.
I’ll finish by giving credit to the IBMers that worked it out: Mark Farrell, Chuck Gainey, Jeff Kubala, Jim Mulder, Bernie Pierce and Don Schmidt.
Bob Rogers worked on mainframe system software for 43 years at IBM before retiring as a Distinguished Engineer in 2012. He started with IBM as a computer operator in 1969. After receiving a B.A. in mathematics from Marist College two years later, he became a computer programmer at the IBM Poughkeepsie Programming Center, where he worked on the OS/360 operating system. Rogers continued to work on mainframe OS development for his entire career at IBM. He contributed to the transitions to XA-370 and ESA/370, and was lead software designer for the transition to the 64-bit z/Architecture. More recently, he implemented the support for single z/OS images with more than 16 CPUs and was a lead designer of the z/OS support for the zAAP and zIIP specialty engines. He has been a popular and frequent speaker at SHARE for many years.
Warning Track Facts
The term “short CP” was coined by the IBM Washington Systems Center staff for the performance phenomenon created by the Processor Resource/Systems Manager (PR/SM) hypervisor enforcing partition weights on busy processors. PR/SM ensures each partition has access to the amount of processor capacity specified by the partition weight. This can reduce the capacity delivered to the logical processors in the partition and cause performance problems that are now addressed by Warning Track on the zEnterprise EC12.
For those interested in more details, here are some facts provided by IBM Distinguished Engineer Kathy Walsh of the Washington Systems Center from her Hot Topics presentation at the February SHARE conference in San Francisco.
When PR/SM determines a logical processor has consumed its current time slice, it tries to present a Warning Track Interrupt (external interrupt code 1007) to the guest z/OS system. It then gives z/OS 50 microseconds to react before it rips the physical processor away. If z/OS is enabled for interrupts within this grace period, it will receive the interrupt and react by undispatching the unit of work currently running on that logical processor. z/OS then signals to PR/SM (Diagnose 49C) that it’s now safe to remove the logical processor from the physical processor it’s running on.
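The handshake above can be modeled as a small state machine. This is a toy sketch of the protocol as the article describes it; the interrupt code 1007, the 50-microsecond grace period, and Diagnose 49C come from the text, while the class and function names are invented for illustration:

```python
# Toy model of the Warning Track handshake -- not real PR/SM or z/OS code.

WARNING_TRACK_IRQ = 1007   # external interrupt code, per the article
GRACE_US = 50              # microseconds PR/SM waits for z/OS to react

class LogicalProcessor:
    def __init__(self):
        self.enabled_for_interrupts = True
        self.current_work = "critical transaction-region task"

    def handle_warning_track(self):
        """z/OS side: undispatch the running work so nothing is
        stranded, then signal PR/SM (Diagnose 49C) that it's now safe
        to remove this logical processor from the physical one."""
        self.current_work = None      # unit of work goes back on the queue
        return "DIAG_49C"             # voluntary yield signal to PR/SM

def prsm_end_of_timeslice(lp, zos_response_us):
    """PR/SM side: present the Warning Track Interrupt and wait up to
    GRACE_US for the yield; otherwise take the physical processor back
    anyway, possibly stranding whatever was running."""
    if lp.enabled_for_interrupts and zos_response_us <= GRACE_US:
        return lp.handle_warning_track()   # clean hand-back
    return "PREEMPTED"                     # work may be stranded
```

In the common case z/OS responds within the grace period and the hand-back is clean; only when z/OS is disabled for interrupts or too slow does PR/SM fall back to the old behavior of simply taking the processor away.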
If z/OS is on time, which it will be almost all of the time, it's not possible for important work to be stranded on a processor that's not running. z/OS no longer needs to be so conservative about unparking processors to consume white space. As an additional benefit, it's also impossible for a z/OS system lock to be held by a processor while it's not running. For processors that are running, this eliminates the wasted cycles spent spinning for locks held by processors that are not running. Both benefits lead to superior virtualization and optimal CPU utilization for System z with z/OS.