Concise Explanation of the Pentium Flaw (fwd)

Author: Chad Turek (turek_at_pontryagin.aa.washington.edu)
Date: Fri 16 Dec 1994 - 00:31:14 MET


Below I am passing along some information about the Pentium bug and a
patch for MATLAB that corrects the problem.

                -------------------?-------------------
                Chad Turek
                turek_at_aa.washington.edu
                University of Washington
                dept. of Aeronautics and Astronautics
                Seattle, WA
                -------------------?-------------------
---------- Forwarded message ----------
Date: Thu, 15 Dec 94 14:05:58 -0800
From: brian_at_aa.washington.edu
To: students_at_aa.washington.edu
Subject: Concise Explanation of the Pentium Flaw

Following is an excerpt from a Mathworks digest explaining in detail
the technical origin and history of the recent flap over the Intel
Pentium chip design flaw.

-Brian
------- Forwarded Message

Drea's Desk: Pentium problems.

 By now, many of you have heard about the problems doing
 certain floating point operations on the Pentium, Intel's
 flagship CPU. Here is a summary of where we are, how we
 got here, and where we're going.

 It all began with a posting to a CompuServe forum of a
 personal e-mail from Prof. Thomas Nicely (which was
 then cross-posted to comp.sys.intel):

   It appears that there is a bug in the floating point unit
   (numeric coprocessor) of many, and perhaps all, Pentium
   processors.

   In short, the Pentium FPU is returning erroneous values for
   certain division operations. For example,

                        1/824633702441.0

   is calculated incorrectly (all digits beyond the eighth
   significant digit are in error).

 That is, the Pentium produced results that looked as if
 the division had been carried out in no better than
 single precision.

 You might ask where the number 824633702441 came from and how
 Prof. Nicely noticed that the result was in error. Nicely
 was working on an area of number theory that involves twin primes
 (pairs of prime numbers that differ by 2, like 11 and 13). The
 sum of 1/n, where n goes from 1 to infinity, diverges. The sum of
 1/p, where the p's are the prime numbers, also diverges. But the
 sum of 1/t, where the t's are twin primes, converges. 824633702441
 and 824633702443 turn out, of course, to be twin primes.
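
 As a rough illustration of the convergent series, here is a small
 Python sketch that sums the reciprocals of twin primes up to a modest
 bound. It is not Nicely's code. The function name
 twin_prime_reciprocal_sum and the convention of adding both members
 of every pair (so 5 is counted twice) are assumptions made for this
 example.

```python
def twin_prime_reciprocal_sum(limit):
    """Sum 1/p + 1/(p+2) over twin prime pairs (p, p+2) with p + 2 <= limit."""
    # Sieve of Eratosthenes: sieve[n] is 1 when n is prime.
    sieve = bytearray([1]) * (limit + 1)
    sieve[0:2] = b"\x00\x00"
    for n in range(2, int(limit ** 0.5) + 1):
        if sieve[n]:
            # Clear every multiple of n starting at n*n.
            sieve[n * n :: n] = bytearray(len(range(n * n, limit + 1, n)))

    total = 0.0
    for p in range(3, limit - 1):
        if sieve[p] and sieve[p + 2]:
            total += 1.0 / p + 1.0 / (p + 2)
    return total


if __name__ == "__main__":
    for bound in (10**3, 10**4, 10**5):
        print(bound, twin_prime_reciprocal_sum(bound))
```

 The partial sums grow very slowly, so a sketch like this only hints
 at the kind of computation involved; it comes nowhere near bounds
 such as 824633702441.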

 Partial sums of this series have been published and Nicely
 was comparing his results with them. He discovered
 that his results differed and he started a long search for
 problems in his code, compiler bugs, hardware problems, etc.
 Finally, by the process of elimination and extensive testing, he
 concluded that the problem was with the Pentium chip itself.

 The post created a firestorm on comp.sys.intel. In the
 midst of the storm, hundreds of messages a week poured in.
 The signal-to-noise ratio got progressively smaller, but
 there were some gems.

 First, Terje Mathisen, a PC programming expert from Norsk Hydro in
 Norway, confirmed Nicely's result and wrote a test program,
 p87test, that he posted to comp.sys.intel.

 Then, after a series of postings about other numbers that
 were computed incorrectly, Tim Coe, a semiconductor design
 engineer from Vitesse Semiconductor, found a pattern that
 led him to the worst case pair of operands,

  5244795/3932159

 For these numbers,

    x = 5244795
    y = 3932159
    z = x - (x/y)*y

 should be zero (to within eps*x, which is about 1e-9), but
 on the Pentium,
 
    z = 256

 The relative error in this case was 5e-5, which represents
 an error in the 4th decimal digit. That is more than 10 orders of
 magnitude greater than you would expect from roundoff error.
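
 The same check is easy to repeat on any machine. The Python sketch
 below is an illustration only: the helper name division_residual is
 made up for this example, and the value 256 is simply the flawed
 result quoted above, not something the code can reproduce on a
 correct FPU.

```python
import sys


def division_residual(x, y):
    """Return x - (x/y)*y, which should be zero up to roundoff."""
    return x - (x / y) * y


x, y = 5244795.0, 3932159.0
z = division_residual(x, y)

# Roundoff-level tolerance: machine epsilon scaled by x (about 1e-9 here).
tolerance = sys.float_info.epsilon * x

# Residual reported for the flawed Pentium in the text above.
reported_flawed_z = 256.0

if __name__ == "__main__":
    print("residual on this machine:", z)
    print("roundoff tolerance      :", tolerance)
    print("reported relative error :", reported_flawed_z / x)
```

 On a correct FPU the residual stays at roundoff level, while 256/x
 works out to about 5e-5.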

 By now, word of the problem had spread into the mainstream press.
 The New York Times, Associated Press, The San Jose Mercury News,
 and countless other papers began carrying stories about it.
 CNN even came to the MathWorks to interview Cleve Moler (our
 chief scientist).

 We believe that the full extent and cause of the problem are
 now known. To explain what happened, let me first describe how
 division is done on the Pentium. The Pentium does division
 much the way you would do it by hand. Take the most
 significant digits of the numerator and denominator and, from
 those, guess the first digit of the quotient. Then multiply
 the quotient guess by the divisor and subtract it from the
 dividend. Now repeat the process on the remainder. [The
 details are more complex, but the basic idea is the same.]
 The next quotient digit is chosen by consulting a
 lookup table. You see 8 divided by 3, look in a table, and the
 (8,3) element is a 2. The problem with the Pentium was that the
 lookup table was missing 5 entries (well, they were zero when
 they should have been something else). Unlike regular long
 division, there is some margin for error in the choice of
 quotient digits, so most of the time a bad choice
 is corrected by subsequent guesses. So a necessary
 condition for a "bad divisor" is that it has one of the 5
 missing bit patterns somewhere in it, followed by an unfortunate
 series of bits. In this case, that means a series
 of 1's.
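
 The digit-at-a-time idea can be sketched in a few lines of Python.
 This is a simplified illustration, not the Pentium's algorithm: each
 quotient digit here is chosen by an exact integer comparison rather
 than by a lookup table, so there is no margin for error and nothing
 to go wrong. The function name digit_recurrence_divide, the radix of
 4, and the step count are assumptions for the example.

```python
from fractions import Fraction


def digit_recurrence_divide(dividend, divisor, steps=27, radix=4):
    """Build dividend/divisor one radix-`radix` digit at a time.

    Both operands are positive integers. The result is an exact
    Fraction holding the truncated quotient after `steps` digits.
    """
    if dividend < 0 or divisor <= 0:
        raise ValueError("expected dividend >= 0 and divisor > 0")

    # Integer part of the quotient, and the first remainder.
    whole, remainder = divmod(dividend, divisor)
    quotient = Fraction(whole)
    weight = Fraction(1)

    for _ in range(steps):
        # Shift the remainder up one digit position ...
        remainder *= radix
        # ... pick the next quotient digit (0 .. radix-1) and subtract
        # digit * divisor, leaving the new remainder.
        digit, remainder = divmod(remainder, divisor)
        weight /= radix
        quotient += digit * weight
    return quotient


if __name__ == "__main__":
    q = digit_recurrence_divide(5244795, 3932159)
    print(float(q))
```

 In a table-driven scheme, the divmod call that picks each digit is
 the step a lookup table would stand in for.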

 Let's take a look at Nicely's prime in hex (format hex in MATLAB),

   824633702441 <=> 4267fffff7052000
                       ^^^^^^^^^^^^^
 The first 3 hex digits hold the sign and exponent (you can see
 that by multiplying the number by 2: only those digits change).
 The remaining 13 digits, marked above, are the mantissa. The five
 missing entries in the lookup table on the chip correspond to
 values of the first mantissa digit of 1, 4, 7, a, and d. For the
 bug to produce the largest relative error, the suspect bit
 pattern has to occur in the most significant mantissa digit
 and must be followed by a string of binary 1's (f's in hex).
 For comparison, look at the divisor from Coe's widely quoted pair,
 4195835/3145727,

        3145727 <=> 4147ffff80000000
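
 If MATLAB is not at hand, the same hex view can be produced with
 Python's standard struct module. This is a minimal sketch; the
 helper names double_to_hex and split_fields are made up for this
 example.

```python
import struct


def double_to_hex(value):
    """Return the 16 hex digits of an IEEE double, most significant first."""
    return struct.pack(">d", value).hex()


def split_fields(value):
    """Split the hex string into (sign+exponent, mantissa) digit groups."""
    digits = double_to_hex(value)
    return digits[:3], digits[3:]


if __name__ == "__main__":
    for number in (824633702441.0, 3145727.0, 3932159.0):
        exponent, mantissa = split_fields(number)
        print(f"{number:>15.0f} <=> {exponent} {mantissa}")
```

 All three divisors show a leading mantissa digit from the list above
 (7, 7, and d) followed by a run of f's.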

 With the extent of the problem known, Cleve Moler has been
 working with Tim Coe, Terje Mathisen, and Intel to develop a
 software workaround that produces a minimum degradation in
 performance. The current proposal is to detect whether a
 divisor is "at risk" by examining its bit pattern. If a
 divisor is found to be at risk, rescale both the numerator and
 the denominator by 15/16 before doing the division. The quotient
 is unchanged, and the rescaled divisor is sure to lie outside the
 region of risk.
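
 A rough Python sketch of that proposal follows. It is not the actual
 MathWorks or Intel workaround. The names is_at_risk and
 guarded_divide are hypothetical, and so is the rule that "a series
 of 1's" means at least six 1 bits after the leading mantissa digit
 (the min_ones parameter); the text above does not give the exact
 length. Zeros, infinities, NaNs, and denormals are ignored.

```python
import struct

# Leading mantissa hex digits named in the text as missing table entries.
AT_RISK_LEADING_DIGITS = {0x1, 0x4, 0x7, 0xA, 0xD}


def mantissa_bits(value):
    """Return the 52 stored mantissa bits of an IEEE double as an int."""
    raw = struct.unpack(">Q", struct.pack(">d", value))[0]
    return raw & ((1 << 52) - 1)


def is_at_risk(divisor, min_ones=6):
    """Heuristic test: risky leading digit followed by `min_ones` 1 bits.

    min_ones=6 is an assumption made for this sketch.
    """
    mantissa = mantissa_bits(divisor)
    leading = mantissa >> 48
    ones_mask = (1 << min_ones) - 1
    following = (mantissa >> (48 - min_ones)) & ones_mask
    return leading in AT_RISK_LEADING_DIGITS and following == ones_mask


def guarded_divide(x, y):
    """Divide x by y, rescaling both by 15/16 first if y looks at risk."""
    if is_at_risk(y):
        scale = 15.0 / 16.0
        x, y = x * scale, y * scale
    return x / y


if __name__ == "__main__":
    for divisor in (824633702441.0, 3932159.0, 3.0):
        print(divisor, is_at_risk(divisor), is_at_risk(divisor * 15.0 / 16.0))
    print(guarded_divide(5244795.0, 3932159.0))
```

 Scaling both operands by the same factor leaves the quotient
 unchanged apart from possible rounding in the two multiplications.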

 There has been a lot of "discussion" about the frequency with
 which one might encounter this bug. You'll hear estimates from
 once every 27,000 years to once an hour. Who is right? Well,
 both. The error occurs in 1 out of every 9 billion random
 mantissas (the exponent is irrelevant). The 27,000 year estimate
 is for a spreadsheet user doing 1000 divisions a day. For the
 worst case, take a 90 MHz Pentium doing nothing but random
 divisions. Double precision divisions take something like 30 clock
 cycles, so you can do about 3 million a second. Using that
 number, you could get reduced precision about once an hour.
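
 Both estimates come out of the same back-of-the-envelope arithmetic,
 reproduced here as a small Python sketch. The function names are
 made up for this example, and the 365-day year is an assumption, so
 the first figure lands near 25,000 years, the same order as the
 27,000 quoted above rather than an exact match.

```python
ONE_ERROR_IN = 9e9  # one bad result per 9 billion random divisions


def years_between_errors(divisions_per_day, one_in=ONE_ERROR_IN):
    """Expected years between errors for a light user (365-day year)."""
    return one_in / divisions_per_day / 365.0


def minutes_between_errors(clock_hz, cycles_per_divide, one_in=ONE_ERROR_IN):
    """Expected minutes between errors for a CPU doing only divisions."""
    divides_per_second = clock_hz / cycles_per_divide
    return one_in / divides_per_second / 60.0


if __name__ == "__main__":
    print("spreadsheet user:", years_between_errors(1000), "years")
    print("90 MHz worst case:", minutes_between_errors(90e6, 30), "minutes")
```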
 
 We have announced that we are going to release a
 "Pentium-aware" version of MATLAB that provides a software workaround
 for the bug as soon as possible. When it is ready, we'll
 announce it here and on comp.soft-sys.matlab.

 For a comprehensive archive of Pentium-related articles and
 information take a look at our "Pentium Papers", accessible
 through our web site,

   http://www.mathworks.com/

 or via anonymous ftp from,

   ftp.mathworks.com in /pub/pentium/

 or by e-mail, by sending a message to matlib_at_mathworks.com
 whose body contains the commands to be executed. For example:

   cd /pub/pentium
   dir
   get FAQ.txt

  -----------------------------------------------------------

------- End of Forwarded Message



This archive was generated by hypermail 2.1.7 : Wed 19 May 2004 - 15:47:24 MET DST