The Jiu Jitsu of Reduced Preemption
How MVS designers turned one multiprocessing challenge on its head
1/9/2013 12:01:10 AM
By Bob Rogers
I’ve retired from IBM, but retirement has not given me nearly the free time I had expected. I have a moment now, though, and I’d like to tell the story of a critical time for the mainframe and MVS, and how we got through it.
The background was IBM’s decision long ago to deliver capacity growth not just through faster processors but also by deploying multiple processors in a symmetric multiprocessing (SMP) configuration. With SMP, multiple processors share main memory and work together to run a workload. This arrangement causes a number of overheads that must be mitigated as the number of processors in the complex increases. But, before discussing them in detail, there’s another concern that we need to address.
It’s often said that z/OS can successfully run mixed workloads on a single system. One reason is that MVS had a “preemptive dispatcher.” That means higher-priority work that has become ready to run will preempt lower-priority work that’s already running. This is a simple process on a uniprocessor. The supervisor compares the priority of the work that has become ready to the priority of the work already running. If the running work is lower priority, it’s taken off the processor and the work that became ready is dispatched in its place.
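The uniprocessor decision is small enough to sketch in a few lines. This is only an illustration of the comparison described above, not MVS code: the `Work` type, the `should_preempt` name, and the convention that a larger number means higher priority are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Work:
    """Hypothetical unit of work; not an MVS control block."""
    name: str
    priority: int  # assumption: a larger number means higher priority


def should_preempt(running: Optional[Work], ready: Work) -> bool:
    """Return True if the newly ready work should take over the processor."""
    if running is None:
        # Nothing is running, so the ready work is simply dispatched.
        return True
    # Preempt only when the running work is strictly lower priority.
    return ready.priority > running.priority
```

For instance, `should_preempt(Work("batch", 10), Work("online", 50))` is true, while swapping the two arguments gives false.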
On a multiprocessor, things are a bit more complex. Now, the dispatcher must compare the priority of the work that has become ready to the priority of the work that is running on each of the processors. It then can decide if the work running on one of the processors (possibly itself) should be preempted in order to run the work that has become ready. If work on one of the other processors should be preempted, then the current processor must signal that processor to cause an interrupt to get the dispatcher invoked on that processor. There are other complications, but for our purposes it’s enough to understand that this process entails a considerable amount of overhead. The overhead is not just the processing of the algorithm but also the cache disruption caused by accessing and updating shared data fields, and the hardware signaling itself.
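A rough sketch of the multiprocessor version shows where the extra work comes from: every processor's current priority has to be examined, and a different processor may have to be signaled. The function names, the dictionary of per-CPU priorities, and the "larger number is higher priority" convention are assumptions for illustration; the real dispatcher dealt with locking, shared data and hardware signaling that are left out here.

```python
from typing import Dict, Optional


def choose_processor_to_preempt(
    running: Dict[int, Optional[int]], ready_priority: int
) -> Optional[int]:
    """Pick the CPU that should run the newly ready work.

    `running` maps a CPU id to the priority of the work running there,
    or None if that CPU is idle. Returns None if no CPU should be disturbed.
    """
    victim = None
    victim_priority = ready_priority
    for cpu in sorted(running):
        priority = running[cpu]
        if priority is None:
            return cpu  # an idle CPU can take the work without preempting anything
        if priority < victim_priority:
            # Remember the lowest-priority work seen so far.
            victim, victim_priority = cpu, priority
    return victim


def needs_signal(current_cpu: int, victim: Optional[int]) -> bool:
    """True if the deciding CPU must interrupt a different CPU."""
    return victim is not None and victim != current_cpu
```

Even in this toy form the scan touches one entry per processor every time work becomes ready, which hints at why the cost climbs as processors are added.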
The crisis arose because the overhead of this approach to preemptive dispatching grows faster than linearly. This approach consumes an ever-greater percentage of the total capacity of the system as the number of processors increases. In the late 1980s, for some workloads, the percentage overhead was already in double digits on an image that had six CPUs. At the time, MVS was designed to support up to 16 CPUs in a configuration, but there was no way that was ever going to be achieved unless something was done about preemptive dispatching. Fortunately, something was done—something very clever.
My longtime friend and esteemed colleague Bernie Pierce, who recently passed away, was, without question, the best MVS performance designer at the time and perhaps of all time. I was not privy to his early thinking on the problem, but I think I can pretty well reconstruct it from what followed.
The problem, again, was that increasing the number of processors caused preemptive dispatching to become ever more expensive. But, if you have a lot of processors, is preemptive dispatching really needed? All of the processors are constantly interrupted for I/O and timer events. If there are only a few processors, it could be a while before one of them happens to be interrupted. However, the average delay until one of them is naturally interrupted gets shorter and shorter as the number of processors increases. Here lies the basis of a solution to the problem. If the processors stop interrupting each other when work becomes ready and instead rely on some processor looking at the work queue after a naturally occurring interrupt, the overhead caused by preemptive dispatching flattens out.
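To put a number on that intuition, here is a small sketch under a simplifying assumption that is mine, not part of the original design: each processor is interrupted independently and at random (a Poisson process) at the same average rate. Under that assumption, the expected wait until any one of n processors is interrupted is 1 / (n × rate). The function name and the rate used below are illustrative only.

```python
def expected_wait_for_any_interrupt(
    n_processors: int, interrupts_per_second: float
) -> float:
    """Expected seconds until the first interrupt on any processor.

    Assumes independent Poisson interrupt streams with the same rate on
    every processor (a modeling assumption, not a measured MVS property).
    """
    if n_processors < 1 or interrupts_per_second <= 0:
        raise ValueError("need at least one processor and a positive rate")
    return 1.0 / (n_processors * interrupts_per_second)


# Illustrative rate only: 100 interrupts per second per processor.
for n in (1, 2, 6, 16):
    wait_ms = expected_wait_for_any_interrupt(n, 100.0) * 1000
    print(f"{n:2d} processors: about {wait_ms:.2f} ms")
```

In this model, doubling the number of processors halves the expected wait, which is the effect the solution relies on.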
However, complete elimination of preemption would be going too far. A mechanism was required to ensure that high-priority work was not delayed too long before being dispatched on some processor. The mechanism that Bernie chose was the “minor timeslice.” That is, the CPU timer is set to a relatively low value so if no other interrupt naturally occurs before the timer expires, then this “timer pop” provides the opportunity to see if there’s higher-priority work that should preempt the currently running work. Bernie invented algorithms to maintain appropriate responsiveness for high-priority work while minimizing the number of extra interrupts caused by these minor timeslice timer pops. Of course, the full solution has other complexities that need not concern us here.
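The effect of the minor timeslice on a single processor can be sketched as follows. This is a toy model and not Bernie's algorithm: it assumes the timer is re-armed every time the processor takes any interrupt, it measures wall-clock time, and the function and parameter names are hypothetical. It shows only that the processor looks at the work queue at the next natural interrupt or the next timer pop, whichever comes first.

```python
from typing import Iterable


def next_queue_check(
    ready_at: float,
    last_armed_at: float,
    natural_interrupts: Iterable[float],
    minor_timeslice: float,
) -> float:
    """Time at which this processor next examines the work queue.

    ready_at: when the high-priority work became ready.
    last_armed_at: when the timer was last armed (assumed: at every interrupt).
    natural_interrupts: times of I/O or other interrupts on this processor.
    """
    if minor_timeslice <= 0:
        raise ValueError("minor_timeslice must be positive")
    armed = last_armed_at
    for t in sorted(x for x in natural_interrupts if x > last_armed_at):
        # Timer pops that happen before this natural interrupt arrives.
        while armed + minor_timeslice < t:
            armed += minor_timeslice
            if armed > ready_at:
                return armed  # a timer pop sees the ready work first
        armed = t  # the natural interrupt re-arms the timer
        if t > ready_at:
            return t  # a natural interrupt sees the ready work first
    # No natural interrupts left: the timer pop is the backstop.
    while True:
        armed += minor_timeslice
        if armed > ready_at:
            return armed
```

With a timeslice of 2.0, work that becomes ready at 1.0 is noticed at 1.5 if a natural interrupt arrives then, and at 2.0 at the latest if none does. In this model the delay on any one processor never exceeds the timeslice.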
What I find so beautiful about this idea is that it’s what I call a “jiu jitsu solution.” Jiu jitsu is said to turn the enemy’s strength against him. I call something a jiu jitsu solution if it takes the underlying problem and uses it as the basis of the solution, as Bernie did here. With more processors, there’s less need for the expensive shoulder-tapping that is required if there are only a few processors.
This improved dispatching approach is called reduced preemption. It was extremely effective when it was introduced, and today’s systems of 10 or more processors couldn’t even run without it. It was introduced in MVS V3.1, which was said to have a performance delta of minus-3 percent to plus-12 percent compared to the previous version of MVS. I was asked at the time, “How much of that 12 percent is due to reduced preemption?” My answer was, “15 of that 12. Reduced preemption improves performance by 15 percent on a six-way processor, but other changes cost 3 percent. That is where the plus-12 percent comes from. Reduced preemption does not help a uniprocessor so that’s where the minus-3 percent comes from.”
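The arithmetic behind that answer can be written out directly. The sketch below treats the percentages as simply additive, as the quoted answer does; the function name is made up for the example.

```python
def net_delta_percent(reduced_preemption_gain: int, other_changes_cost: int) -> int:
    """Net performance change, treating the percentages as additive."""
    return reduced_preemption_gain - other_changes_cost


six_way = net_delta_percent(15, 3)      # reduced preemption helps a six-way
uniprocessor = net_delta_percent(0, 3)  # no benefit on a uniprocessor

print(six_way, uniprocessor)  # prints "12 -3"
```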
This was just one of many bumps in the road that MVS has had to get over to maintain its position as the OS that Western Civilization runs on. As time permits and spirit moves, I will try to give the inside story of some of the more interesting ones.
Bob Rogers worked on mainframe system software for 43 years at IBM before retiring as a Distinguished Engineer in 2012. He started with IBM as a computer operator in 1969. After receiving a B.A. in mathematics from Marist College two years later, he became a computer programmer at the IBM Poughkeepsie Programming Center, where he worked on the OS/360 OS. Rogers continued to work on mainframe OS development for his entire career at IBM. He contributed to the transitions to XA-370 and ESA/370, and was lead software designer for the transition to the 64-bit z/Architecture. More recently, he implemented the support for single z/OS images with more than 16 CPUs and was a lead designer of the z/OS support for the zAAP and zIIP specialty engines. He has been a popular and frequent speaker at SHARE for many years.