2+2=4. Really.
Many companies are deploying packaged multi-function ERP applications, including one from a three-letter company that's not IBM. My colleagues and I are hearing of an interesting problem: the numbers sometimes don't add up. After running through millions of calculations to balance the corporate books, the books don't balance. Is it a software problem? If so, where? Is there fraud within the company? If running the numbers again provides a different result, which result is correct?
I'll let you in on a dirty little industry secret: most server hardware makes mistakes. A billion years ago, a billion light years away, a star fell into a black hole. The star screamed, expelling intense radiation in its last mortal moments. A tiny fraction of that radiation dodged Earth's atmosphere to tickle your server's CPU. Or maybe it's a bad solar flare day. Or maybe a jolt of static electricity knocked a bit out of place. Whatever the reason, CPUs make mistakes from time to time. Unfortunately, these mistakes may be getting more frequent as clock speeds, densities, heat, and electromagnetic interference all increase. Back in 1994 the error rate for microprocessors, e.g. x86, was estimated at once per year per 75 CPUs.
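To put that 1994 estimate in perspective, here is a rough back-of-the-envelope sketch. It takes the "once per year per 75 CPUs" figure at face value and assumes, purely for illustration, that errors are independent and Poisson-distributed and that the rate scales linearly with fleet size. The function names are made up for this example.

```python
import math

# Figure quoted above: roughly one error per year per 75 CPUs (1994 estimate).
ERRORS_PER_CPU_YEAR = 1 / 75


def expected_errors(cpus: int, years: float = 1.0) -> float:
    """Expected error count for a fleet, assuming the rate scales linearly."""
    return cpus * years * ERRORS_PER_CPU_YEAR


def prob_at_least_one(cpus: int, years: float = 1.0) -> float:
    """Chance of one or more errors. ASSUMPTION: independent, Poisson-distributed errors."""
    return 1.0 - math.exp(-expected_errors(cpus, years))


if __name__ == "__main__":
    for fleet in (1, 75, 1000):
        print(fleet, expected_errors(fleet), prob_at_least_one(fleet))
```

Under those assumptions a single CPU rarely sees an error in a given year, but a data center with hundreds of CPUs should expect several.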
There are few systems you can buy that don't make mistakes. The IBM mainframe is by far the most popular of these precision systems. (The others tend not to be general purpose computers.) In effect every instruction runs twice for comparison, providing execution integrity. There's no way you can shut off this feature. z/OS, Linux, and all the other mainframe operating systems enjoy this benefit transparently, continuously. 2+2 always equals 4, even if the cleaning crew commits the grave sin of carrying a mobile telephone within six feet of your mainframe.
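The "run it twice and compare" idea can be sketched in software, with the loud caveat that the mainframe does this in circuitry and this is only an analogy, not how the hardware is built. `run_checked`, `ExecutionMismatch`, and the retry policy are all hypothetical names and choices invented for this illustration.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


class ExecutionMismatch(RuntimeError):
    """Raised when two runs of the same computation keep disagreeing."""


def run_checked(operation: Callable[[], T], retries: int = 1) -> T:
    """Run `operation` twice and return the result only if both runs agree.

    ASSUMPTION: on a mismatch the pair is re-run up to `retries` times
    before giving up. This retry policy is illustrative only.
    """
    for _ in range(retries + 1):
        first = operation()
        second = operation()
        if first == second:
            return first
    raise ExecutionMismatch("results disagreed on every attempt")


if __name__ == "__main__":
    print(run_checked(lambda: 2 + 2))
```

The point of the comparison is that a transient fault is very unlikely to corrupt both runs in exactly the same way, so a silent wrong answer turns into a detected mismatch.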
Do you care? Maybe. It depends on the application. If Shrek gets an extra one pixel mole on his face in the next digitally rendered sequel, no worries. If your stock portfolio trade gets a zero added, well....
The "three letter ERP" package runs extremely well on mainframe Linux, accessing DB2 via HiperSockets. No one running this way has any mysterious math errors, but we are hearing such reports on other platforms. Anyone else hearing the same? Does the boss care?
by Timothy Sipples | April 22, 2006
Comments
"My colleagues and I are hearing" - the very definition of hearsay evidence? It's a very interesting theory - but I haven't ever heard of it before - can you provide any more evidence at all?
Posted by: James Governor | Apr 24, 2006 6:57:07 AM
I am not sure what math errors you may be hearing about James. I have not heard of any since the infamous Pentium floating point division bug. That arose because of corner cases and miscoding of the internal tables used by the microprocessor's divide unit for interpolation, rather than by external events.
It is absolutely true that random cosmic rays can flip a bit or a dipole in any semi-conductor. It is also true that the IBM System z has probably the most astonishing level of RAS engineering. Tim only touched the surface. For more in depth information readers may want to review mainframe-related articles published in the IBM Journal of Research and Development over recent years.
That said (pulling out my former hardware-guy hat), the probability of such data corruption is related to the memory density, and most systems, even lowly PC systems, have at least parity checking if not full EDC/ECC circuitry. Modern microprocessors themselves are also subject to these concerns, but the guys that build them are savvy enough to make RAS provisions in the circuit design. In the worst case, there will be a parity fault and the processor will check-stop or the memory controller will check-stop. In the best case, with EDC/ECC, the error is corrected by the hardware and software doesn't even hear about it.
In more elaborate systems like mainframes and (probably most of) the competitive server platforms, particularly the IBM-sourced ones, EDC/ECC (error detecting and error correcting codes) are used to detect multi-bit errors and correct (typically) single bit errors seamlessly. This doesn't mean you won't get math errors, but you're billions of times more likely to encounter math errors from the carbon-based processor units than the silicon ones. When math errors appear - suspect the software first and second and third...
Posted by: Chris Craddock | Apr 24, 2006 4:20:22 PM
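The "correct single-bit errors, detect multi-bit errors" behaviour described in the comment above can be shown with a toy code. This is a minimal sketch using an extended Hamming(8,4) code, chosen here only because it is small enough to read. It is an assumption for illustration, not the code any particular memory controller or mainframe uses, and the function names are invented.

```python
from typing import List, Optional, Tuple


def parity(bits: List[int]) -> int:
    """Even-parity bit: 1 when an odd number of bits are set."""
    return sum(bits) % 2


def encode(data: List[int]) -> List[int]:
    """Encode 4 data bits as an 8-bit codeword (7 Hamming bits + overall parity)."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4  # covers positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # covers positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4  # covers positions 4, 5, 6, 7
    word = [p1, p2, d1, p3, d2, d3, d4]
    return word + [parity(word)]


def decode(word: List[int]) -> Tuple[str, Optional[List[int]]]:
    """Return (status, data); status is 'ok', 'corrected', or 'uncorrectable'."""
    w = list(word[:7])
    s1 = w[0] ^ w[2] ^ w[4] ^ w[6]
    s2 = w[1] ^ w[2] ^ w[5] ^ w[6]
    s3 = w[3] ^ w[4] ^ w[5] ^ w[6]
    syndrome = s1 + 2 * s2 + 4 * s3  # 1-based position of a single flipped bit
    overall_ok = parity(list(word)) == 0
    if syndrome == 0 and overall_ok:
        status = "ok"
    elif not overall_ok:
        # Odd number of flips: treat as a single-bit error and repair it.
        if syndrome:
            w[syndrome - 1] ^= 1
        status = "corrected"
    else:
        # Non-zero syndrome with good overall parity: two bits flipped.
        return "uncorrectable", None
    return status, [w[2], w[4], w[5], w[6]]


if __name__ == "__main__":
    word = encode([1, 0, 1, 1])
    word[5] ^= 1  # simulate a single flipped bit
    print(decode(word))
```

A plain parity bit on its own would only tell you that something flipped, which is the check-stop case; the extra check bits are what let the hardware locate and repair a single flipped bit without software ever noticing.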
Any information to support your claim, Timothy?
Posted by: James Governor | May 4, 2006 9:22:55 AM
The real issue is not whether microprocessors make mistakes but whether they can detect and correct them. Let's assume that everything from workstations through mid-range systems to mainframes has the same intrinsic failure rate, as they are all based upon common technology.
The difference comes in the higher level design layers. System z (nee zSeries) has much more error detection and correction circuitry and logic than any other commercial processor. The conscious design trade-off of mainframe engineers is not to compromise integrity and reliability for sheer single CP performance, but to balance performance with both mixed workload support and RAS functions.
And the changes do not stop there. The I/O design of System z has continuous validation of the I/O path end-to-end to ensure that DATA WITH INTEGRITY is delivered to the RIGHT DEVICE. The FICON architecture confirms the I/O device is the correct address, even after transiting a fibre fabric. The data is checked by redundant hardware in the channel subsystem for accuracy using state-of-the-art algorithms.
Then the layer above the hardware, the operating system also provides integrity. Since 1973 IBM has honoured its public commitment to system integrity for the MVS system (and its derivatives).
The earlier conversations on IMS paranoia about data integrity are spot on. Without detracting from IMS, this maniacal focus on data integrity would not be possible without the same standards in the hardware design and the operating system. And IMS is not alone. Subsystems such as DB2 for z/OS are built with the same emphasis and the tight integration of WebSphere for z/OS with z/OS and the System z architecture are the staples which provide the intrinsic security characteristics of this approach.
Regardless of this underlying infrastructure, integrity is still dependent upon application design and operational processes. The former is becoming platform independent but the latter is much simpler [and therefore inherently more trustworthy and robust] when fewer independent elements are involved. By definition, consolidating as many elements as possible into a single trusted environment is a best-of-breed approach. This to me is the mainframe.
John Crooks
Posted by: John Crooks | Jun 6, 2006 6:57:15 PM
There is one area that can lead to different results - this is where there is no "single copy of the truth" for corporate data. As data is dispersed for processing and later aggregated for compliance reporting there is the potential for its integrity to be compromised; in fact, unless great effort is invested to architect data management correctly, it is almost a certainty.
Within this context I have "heard" of challenges in complying with emerging regulations - the data is in many places, in many forms and may have been summarised in a manner which is not conducive to reporting.
Posted by: John Crooks | Jun 6, 2006 7:02:39 PM
