polip: Concise Explanation of the Pentium Flaw (fwd)

Author: Chad Turek (turek_at_pontryagin.aa.washington.edu)
Date: Fri 16 Dec 1994 - 00:31:14 MET

Below I am forwarding some information about the Pentium bug and a
MATLAB patch that fixes the problem.

                -------------------?-------------------
                Chad Turek
                turek_at_aa.washington.edu
                University of Washington
                dept. of Aeronautics and Astronautics
                Seattle, WA
                -------------------?-------------------
---------- Forwarded message ----------
Date: Thu, 15 Dec 94 14:05:58 -0800
From: brian_at_aa.washington.edu
To: students_at_aa.washington.edu
Subject: Concise Explanation of the Pentium Flaw

Following is an excerpt from a MathWorks digest explaining in detail
the technical origin and history of the recent flap over the Intel
Pentium chip design flaw.

-Brian
------- Forwarded Message

Drea's Desk: Pentium problems.

By now, many of you have heard about the problems doing
certain floating point operations on the Pentium, Intel's
flagship CPU. Here is a summary of where we are, how we
got here, and where we're going.

It all began with a posting to a CompuServe forum of a
personal e-mail from Prof. Thomas Nicely (which was
then cross-posted to comp.sys.intel):

   It appears that there is a bug in the floating point unit
   (numeric coprocessor) of many, and perhaps all, Pentium
   processors.

In short, the Pentium FPU is returning erroneous values for
certain division operations. For example,

1/824633702441.0

is calculated incorrectly (all digits beyond the eighth
significant digit are in error).

That is, the Pentium produced results that indicated that
the division had been carried out with no greater than
single precision.

You might ask where the number 824633702441 came from and how
Prof. Nicely noticed that the result was in error. Nicely
was working in an area of number theory that involves twin primes
(pairs of prime numbers that differ by 2, like 11 and 13). The
sum of 1/n, where n goes from 1 to infinity, diverges. The sum of
1/p, where the p's are the prime numbers, also diverges. But the
sum of 1/t, where the t's are twin primes, converges. 824633702441
and 824633702443 turn out, of course, to be twin primes.
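
As an illustration only (this is not Nicely's code), here is a small
Python sketch that checks the twin-prime claim by trial division and
forms a partial sum of the reciprocals of twin primes. The function
names and the choice of trial division are assumptions made for this
example.

```python
from math import isqrt

def is_prime(n: int) -> bool:
    """Trial division; adequate for numbers up to about 1e12."""
    if n < 2:
        return False
    if n % 2 == 0:
        return n == 2
    for f in range(3, isqrt(n) + 1, 2):
        if n % f == 0:
            return False
    return True

def twin_prime_reciprocal_sum(limit: int) -> float:
    """Sum 1/p + 1/(p+2) over twin prime pairs (p, p+2) with p + 2 <= limit."""
    total = 0.0
    for p in range(3, limit - 1, 2):
        if is_prime(p) and is_prime(p + 2):
            total += 1.0 / p + 1.0 / (p + 2)
    return total

# Both members of the pair from the text should be prime.
print(is_prime(824633702441), is_prime(824633702443))
# Partial sum over the pairs (3,5), (5,7), (11,13), (17,19).
print(twin_prime_reciprocal_sum(20))
```

Every term of such a sum is a reciprocal, so a single bad division
shows up directly in the running total.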

Partial sums of this series have been published and Nicely
was comparing his results with them. He discovered
that his results differed and he started a long search for
problems in his code, compiler bugs, hardware problems, etc.
Finally, by the process of elimination and extensive testing, he
concluded that the problem was with the Pentium chip itself.

The post created a firestorm on comp.sys.intel. At the height
of the storm, hundreds of messages a week poured in.
The signal-to-noise ratio got progressively smaller, but
there were some gems.

First, Terje Mathisen, a PC programming expert from Norsk Hydro in
Norway, confirmed Nicely's result and wrote a test program,
p87test, that he posted to comp.sys.intel.

Then, after a series of postings about other numbers that
were computed incorrectly, Tim Coe, a semiconductor design
engineer from Vitesse Semiconductor, found a pattern that
led him to the worst-case pair of operands,

5244795/3932159

For these numbers,

    x = 5244795
    y = 3932159
    z = x - (x/y)*y

should be zero (to within eps*x, which is about 1e-9), but
on the Pentium,

z = 256

The relative error in this case was 5e-5, which represents
an error in the 4th decimal digit. That is more than 10 orders
of magnitude greater than you would expect from roundoff error.
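
The check above is easy to run on any machine. The following is a
minimal Python sketch, assuming Python floats are IEEE double precision
(true on common platforms); the function name and the pass/fail wording
are made up for this example.

```python
import sys

def division_residual(x: float, y: float) -> float:
    """Return x - (x/y)*y, which should vanish up to roundoff."""
    return x - (x / y) * y

x, y = 5244795.0, 3932159.0
z = division_residual(x, y)
tolerance = sys.float_info.epsilon * x   # eps*x, about 1e-9 for this x

print("z =", z, "(ok)" if abs(z) <= tolerance else "(SUSPECT)")
# Relative error that a residual of 256 would represent, as in the text.
print("256/x =", 256.0 / x)
```

On a correctly working FPU the residual stays within the tolerance; a
value like 256 would stand out immediately.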

By now, word of the problem had spread into the mainstream press.
The New York Times, Associated Press, The San Jose Mercury News,
and countless other papers began carrying stories about it.
CNN even came to the MathWorks to interview Cleve Moler (our
chief scientist).

We believe that the full extent and cause of the problem are
now known. To explain what happened, let me first describe how
division is done on the Pentium. The Pentium does division
much the way you would do it by hand. Take the most
significant digits of the numerator and denominator and, from
those, guess the first digit of the quotient. Then multiply
the guessed digit by the divisor and subtract the product from
the dividend. Now repeat the process on the remainder. [The
details are more complex, but the basic idea is the same.]

The next quotient digit is chosen by consulting a lookup
table. You see 8 divided by 3, look in the table, and the
(8,3) element is a 2. The problem with the Pentium was that the
lookup table was missing 5 elements (well, they were zero when
they should have been something else). Unlike regular long
division, there is some margin for error in the choice of
quotient digits, so most of the time a bad choice is
corrected by subsequent guesses. A necessary condition for a
"bad divisor" is therefore that it has one of the 5 missing bit
patterns somewhere in it, followed by an unfortunate series of
bits. In this case, that means a series of 1's.
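
To make the idea concrete, here is a toy Python sketch of this kind of
digit-by-digit division in base 4 with quotient digits from -2 to 2.
It is not the Pentium's actual circuit: the digit is picked by an exact
comparison standing in for the lookup table, and the names, the use of
exact fractions, and the operand range 1 <= x, d < 2 are all
assumptions made for this example.

```python
from fractions import Fraction

def select_digit(shifted_rem: Fraction, divisor: Fraction) -> int:
    """Pick a quotient digit in {-2, ..., 2}.

    Stand-in for the hardware lookup table, which would be indexed by a
    few leading bits of the remainder and divisor instead of dividing.
    """
    # floor(shifted_rem/divisor + 1/2): the nearest integer to the ratio
    guess = (2 * shifted_rem + divisor) // (2 * divisor)
    return max(-2, min(2, guess))

def digit_recurrence_divide(x: Fraction, d: Fraction, steps: int = 30):
    """Approximate x/d for 1 <= x, d < 2, one base-4 digit per step."""
    rem = x / 4                      # pre-scale so |rem| <= (2/3)*d at the start
    quotient = Fraction(0)
    for j in range(1, steps + 1):
        digit = select_digit(4 * rem, d)
        rem = 4 * rem - digit * d    # new partial remainder
        quotient += Fraction(digit, 4 ** j)
    return 4 * quotient, rem         # undo the pre-scaling

# Operands from the text, scaled by powers of 2 into [1, 2).
x = Fraction(5244795, 2 ** 22)
d = Fraction(3932159, 2 ** 21)
q, rem = digit_recurrence_divide(x, d)
print(float(q), float(x / d))
```

Because the digits may be negative, a digit that is slightly too large
can be compensated by a later negative digit; that is the margin for
error mentioned above. The margin only holds while the remainder stays
within (2/3) of the divisor, so a table entry that returns 0 where 2 is
needed can push the remainder out of the range from which later digits
can recover.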

Let's take a look at Nicely's prime in hex (format hex in MATLAB),

824633702441 <=> 4267fffff7052000
                    ^^^^^^^^^^^^^
The first 3 digits of the hex number hold the sign and exponent
(you can see that by multiplying the number by 2 and watching
those digits change); the remaining 13 digits are the mantissa.
The five missing entries in the lookup table on the chip
correspond to values of the first mantissa digit of 1, 4, 7, a,
and d. For the bug to produce the largest relative error, the
suspect bit pattern has to occur in the most significant
mantissa digit and must be followed by a string of binary 1's
(f's in hex). For comparison, look at the divisor from another
of Coe's examples, 4195835/3145727,

3145727 <=> 4147ffff80000000
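
Outside MATLAB, the same hex view can be produced with a few lines of
Python. This is a small sketch assuming IEEE 754 double precision; the
helper names are made up for this example.

```python
import struct

def double_to_hex(value: float) -> str:
    """Return the 16 hex digits of the IEEE 754 double encoding of value."""
    return struct.pack(">d", value).hex()

def split_fields(value: float):
    """Split into (sign and exponent, mantissa): 3 + 13 hex digits."""
    digits = double_to_hex(value)
    return digits[:3], digits[3:]

for n in (824633702441.0, 3145727.0, 3932159.0):
    head, mantissa = split_fields(n)
    print(f"{n:.0f} <=> {head} {mantissa}")
```

The third number is the divisor 3932159 from the x/y example earlier in
this message; its mantissa starts with d followed by f's.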

With the extent of the problem known, Cleve Moler has been
working with Tim Coe, Terje Mathisen, and Intel to develop a
software workaround that costs as little performance as
possible. The current proposal is to detect whether a
divisor is "at risk" by examining its bit pattern. If a
divisor is found to be at risk, rescale both the numerator and
the denominator by 15/16 before doing the division. The quotient
is unchanged, and the rescaled divisor lies outside the region
of risk.
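
A rough Python sketch of that proposal follows. It is not the actual
MathWorks or Intel code. The set of risky leading digits comes from the
description in this message; the number of 1 bits that must follow
(ONES_RUN) is an assumption chosen for illustration, as are all the
names.

```python
import struct

RISKY_LEADING_DIGITS = {0x1, 0x4, 0x7, 0xA, 0xD}  # first mantissa hex digit
ONES_RUN = 6  # ASSUMPTION: count of 1 bits that must follow; not given in the text

def mantissa_bits(value: float) -> int:
    """Return the 52 stored mantissa bits of an IEEE 754 double."""
    (raw,) = struct.unpack(">Q", struct.pack(">d", value))
    return raw & ((1 << 52) - 1)

def is_at_risk(divisor: float) -> bool:
    """True if the mantissa starts with a risky digit followed by 1 bits."""
    m = mantissa_bits(divisor)
    leading = m >> 48                                   # top 4 mantissa bits
    all_ones = (1 << ONES_RUN) - 1
    following = (m >> (48 - ONES_RUN)) & all_ones       # next ONES_RUN bits
    return leading in RISKY_LEADING_DIGITS and following == all_ones

def safe_divide(x: float, y: float) -> float:
    """Divide, rescaling both operands by 15/16 when the divisor looks risky."""
    if is_at_risk(y):
        scale = 15.0 / 16.0
        return (x * scale) / (y * scale)
    return x / y

print(is_at_risk(3145727.0), is_at_risk(3145727.0 * 15.0 / 16.0))
print(safe_divide(5244795.0, 3932159.0))
```

Scaling by 15/16 changes the leading mantissa bits of the divisor while
leaving the mathematical quotient untouched, so the extra cost is one
test plus, rarely, two multiplications.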

There has been a lot of "discussion" about the frequency with
which one might encounter this bug. You'll hear estimates from
once every 27,000 years to once an hour. Who is right? Well,
both. The error occurs in 1 out of every 9 billion random
mantissas (the exponent is irrelevant). The 27,000 year estimate
is for a spreadsheet user doing 1000 divisions a day. For the
worst case, take a 90 MHz Pentium doing nothing but random
divisions. Double precision divisions take something like 30 clock
cycles, so you can do about 3 million a second. At that
rate, you could get reduced precision about once an hour.
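
The two estimates are simple arithmetic on the figures quoted above, as
this short Python sketch shows. The constant names are made up for this
example, and every input is a rough estimate.

```python
# Figures taken from the text; all of them are rough estimates.
DIVISIONS_PER_ERROR = 9e9        # one bad result per ~9 billion random divisions
SPREADSHEET_DIVS_PER_DAY = 1000
CLOCK_HZ = 90e6                  # 90 MHz Pentium
CYCLES_PER_DIVISION = 30         # "something like 30" per the text

years_between_errors = DIVISIONS_PER_ERROR / SPREADSHEET_DIVS_PER_DAY / 365.25
divisions_per_second = CLOCK_HZ / CYCLES_PER_DIVISION
minutes_between_errors = DIVISIONS_PER_ERROR / divisions_per_second / 60

print(f"spreadsheet user: one error every {years_between_errors:,.0f} years")
print(f"flat-out divider: {divisions_per_second:,.0f} divisions per second")
print(f"flat-out divider: one error every {minutes_between_errors:.0f} minutes")
```

With these inputs the spreadsheet case comes out near 24,600 years,
the same order as the 27,000-year figure, and the flat-out case comes
out at about 50 minutes.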

We have announced that we are going to release a
"Pentium-aware" version of MATLAB that provides a software
workaround for the bug as soon as possible. When it is ready,
we'll announce it here and on comp.soft-sys.matlab.

For a comprehensive archive of Pentium-related articles and
information take a look at our "Pentium Papers", accessible
through our web site,

http://www.mathworks.com/

or via anonymous ftp from,

ftp.mathworks.com in /pub/pentium/

or by e-mail, by sending a message to matlib_at_mathworks.com
with the body containing the commands to be executed. Example:

   cd /pub/pentium
   dir
   get FAQ.txt

-----------------------------------------------------------

------- End of Forwarded Message

This archive was generated by hypermail 2.1.7 : Wed 19 May 2004 - 15:47:24 MET DST