The following text is a replica of chapter 7.2 of the Intel Architecture Software Developer's Manual Volume 1: Basic Architecture. If you are interested in the complete manual you can download it from http://www.intel.com, Order Number 243190.
1. Real Numbers and Floating-Point Formats
1.1 Real Number System
As shown in Figure 1, the real-number system comprises the continuum of real numbers from minus infinity (-∞) to plus infinity (+∞).
Figure 1: Binary Real Number System
Because the size and number of registers that any computer can have are limited, only a subset of the real-number continuum can be used in real-number calculations. As shown at the bottom of Figure 1, the subset of real numbers that a particular FPU supports represents an approximation of the real number system. The range and precision of this real-number subset are determined by the format that the FPU uses to represent real numbers.
1.2 Floating-Point Format
To increase the speed and efficiency of real-number computations, computers or FPUs typically represent real numbers in a binary floating-point format. In this format, a real number has three parts: a sign, a significand, and an exponent. Figure 2 shows the binary floating-point format that the IA FPU uses. This format conforms to the IEEE standard.
Figure 2: Binary Floating-Point Format
The sign is a binary value that indicates whether the number is positive (0) or negative (1). The significand has two parts: a 1-bit binary integer (also referred to as the J-bit) and a binary fraction. The J-bit is often not represented, but instead is an implied value. The exponent is a binary integer that represents the power of 2 by which the significand is multiplied.
Table 1 shows how the real number 178.125 (in ordinary decimal format) is stored in floating-point format. The table lists a progression of real number notations that leads to the single-real, 32-bit floating-point format (which is one of the floating-point formats that the FPU supports). In this format, the significand is normalized (refer to Section 1.3, Normalized Numbers) and the exponent is biased (refer to Section 1.4, Biased Exponent). For the single-real format, the biasing constant is +127 (decimal).
Table 1: Real Number Notation
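The progression from 178.125 to its single-real encoding can be checked with a short sketch. It assumes Python's standard struct module; the helper name single_fields is hypothetical and is introduced here only for illustration.

```python
import struct

def single_fields(x):
    """Round x to single-real and split it into (sign, biased exponent, fraction)."""
    # Pack as a big-endian 32-bit float, then reread the same bytes as an integer.
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> 31                      # 1 bit
    biased_exponent = (bits >> 23) & 0xFF  # 8 bits
    fraction = bits & 0x7FFFFF             # 23 bits, J-bit not stored
    return sign, biased_exponent, fraction

sign, exponent, fraction = single_fields(178.125)
print(sign)                       # 0
print(format(exponent, "08b"))    # 10000110 (134 = 7 + 127)
print(format(fraction, "023b"))   # 01100100010000000000000
```

In binary, 178.125 is 10110010.001, which normalizes to 1.0110010001 times 2 to the power 7. The leading 1 is the implied J-bit, so only the fraction 0110010001 (padded with zeros) is stored.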
1.3 Normalized Numbers
In most cases, the FPU represents real numbers in normalized form. This means that except for zero, the significand is always made up of an integer of 1 and the following fraction:
1.fff...ff
For values less than 1, leading zeros are eliminated. (For each leading zero eliminated, the exponent is decremented by one.) Representing numbers in normalized form maximizes the number of significant digits that can be accommodated in a significand of a given width. To summarize, a normalized real number consists of a normalized significand that represents a real number of at least 1 and less than 2, and an exponent that specifies the position of the number's binary point.
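As an illustration of normalization, the following sketch rescales a value so that its significand lies between 1 and 2. It relies on Python's math.frexp, which returns a significand between 0.5 and 1; the function name normalize is hypothetical, and the sign is ignored for simplicity.

```python
import math

def normalize(x):
    """Return (significand, exponent) with 1 <= significand < 2
    and abs(x) == significand * 2**exponent."""
    if x == 0:
        raise ValueError("zero has no normalized form")
    m, e = math.frexp(abs(x))  # abs(x) == m * 2**e with 0.5 <= m < 1
    return m * 2, e - 1        # shift the binary point one place

print(normalize(178.125))  # (1.3916015625, 7)
print(normalize(0.15625))  # (1.25, -3)
```

The second call shows the rule for values less than 1: 0.15625 is 0.00101 in binary, and eliminating the three leading zeros (including the integer zero) decrements the exponent to -3.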
1.4 Biased Exponent
The FPU represents exponents in a biased form. This means that a constant is added to the actual exponent so that the biased exponent is always a positive number. The value of the biasing constant depends on the number of bits available for representing exponents in the floating-point format being used. The biasing constant is chosen so that the smallest normalized number can be reciprocated without overflow. For 32-bit real numbers the bias of the exponent is +127 (decimal).
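The relationship between exponent width and bias can be sketched as follows. The formula 2**(k-1) - 1 is the IEEE 754 convention and is an assumption here, since the text above states only the 32-bit value; the function name bias and the biased exponent range 1 to 254 for normalized single-real numbers are likewise supplied for illustration.

```python
def bias(exponent_bits):
    """Biasing constant for a k-bit exponent field (assumed rule: 2**(k-1) - 1)."""
    return 2 ** (exponent_bits - 1) - 1

print(bias(8))   # 127 (single-real)

# Normalized single-real numbers are assumed to use biased exponents 1..254.
min_unbiased = 1 - bias(8)     # -126
max_unbiased = 254 - bias(8)   # +127

# The reciprocal of the smallest normalized number, 2**-126, is 2**126.
# Its exponent is within range, so the reciprocal does not overflow.
print(-min_unbiased <= max_unbiased)  # True
```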
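1.5 Real Number and Non-number Encodings
A variety of real numbers and special values can be encoded in the FPU's floating-point format. These numbers and values are generally divided into the following classes: signed zeros, denormalized finite numbers, normalized finite numbers, signed infinities, NaNs, and indefinite numbers.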
(The term NaN stands for "Not a Number.") Figure 3 shows how the encodings for these numbers and non-numbers fit into the real number continuum. The encodings shown here are for the IEEE single-precision (32-bit) format, where the term "S" indicates the sign bit, "E" the biased exponent, and "F" the fraction. (The exponent values are given in decimal.) The FPU can operate on and/or return any of these values, depending on the type of computation being performed. The following sections describe these number and non-number classes.
Figure 3: Real Numbers and NaNs
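The way the S, E, and F fields select a class can be sketched as a small classifier over 32-bit patterns. This is a minimal illustration for the single-precision format only; the function name classify_single and the class labels are hypothetical.

```python
def classify_single(bits):
    """Classify a 32-bit single-precision pattern by its S, E, and F fields."""
    s = bits >> 31             # sign bit
    e = (bits >> 23) & 0xFF    # biased exponent, 0..255
    f = bits & 0x7FFFFF        # 23-bit fraction
    prefix = "-" if s else "+"
    if e == 0:
        return prefix + ("zero" if f == 0 else "denormalized finite")
    if e == 255:
        if f == 0:
            return prefix + "infinity"
        # Most significant fraction bit set: quiet NaN; clear: signaling NaN.
        return "QNaN" if f & 0x400000 else "SNaN"
    return prefix + "normalized finite"

print(classify_single(0x80000000))  # -zero
print(classify_single(0x00000001))  # +denormalized finite
print(classify_single(0x43322000))  # +normalized finite
print(classify_single(0xFF800000))  # -infinity
print(classify_single(0x7FC00000))  # QNaN
print(classify_single(0x7F800001))  # SNaN
```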
1.6 Signed Zeros
Zero can be represented as a +0 or a -0 depending on the sign bit. Both encodings are equal in value. The sign of a zero result depends on the operation being performed and the rounding mode being used. Signed zeros have been provided to aid in implementing interval arithmetic. The sign of a zero may indicate the direction from which underflow occurred, or it may indicate the sign of an infinity (∞) that has been reciprocated.
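The behavior of signed zeros can be observed from Python, whose float type is the 64-bit double format rather than the FPU's single-real format. The sketch below assumes the default round-to-nearest mode, and the variable names are illustrative only.

```python
import math

pos_zero, neg_zero = 0.0, -0.0
print(pos_zero == neg_zero)           # True: both encodings are equal in value
print(math.copysign(1.0, neg_zero))   # -1.0: the sign bit is still recorded

reciprocal = 1.0 / float("-inf")      # reciprocal of negative infinity
print(reciprocal)                     # -0.0

underflowed = -1e-200 * 1e-200        # true result is too small to represent
print(underflowed)                    # -0.0: underflow from the negative side
```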
1.7 Normalized and Denormalized Finite Numbers
Non-zero, finite numbers are divided into two classes: normalized and denormalized. The normalized finite numbers comprise all the non-zero finite values that can be encoded in a normalized real number format between zero and infinity (∞).
When real numbers become very close to zero, the normalized-number format can no longer be used to represent the numbers. This is because the range of the exponent is not large enough to compensate for shifting the binary point to the right to eliminate leading zeros.
When the biased exponent is zero, smaller numbers can only be represented by making the integer bit (and perhaps other leading bits) of the significand zero. The numbers in this range are called denormalized (or tiny) numbers. The use of leading zeros with denormalized numbers allows smaller numbers to be represented. However, this denormalization causes a loss of precision (the number of significant bits in the fraction is reduced by the leading zeros).
When performing normalized floating-point computations, an FPU normally operates on normalized numbers and produces normalized numbers as results. Denormalized numbers represent an underflow condition. A denormalized number is computed through a technique called gradual underflow. Table 2 gives an example of gradual underflow in the denormalization process. Here the single-real format is being used, so the minimum exponent (unbiased) is -126 (decimal). The true result in this example requires an exponent of -129 in order to have a normalized number. Since -129 is beyond the allowable exponent range, the result is denormalized by inserting leading zeros until the minimum exponent of -126 is reached.
Table 2: Denormalization Process
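Gradual underflow in the single-real format can be illustrated by packing successively smaller powers of two. This is a sketch using Python's struct module, not a reproduction of Table 2; the helper name exponent_and_fraction is hypothetical.

```python
import struct

def exponent_and_fraction(x):
    """Round x to single-real and return (biased exponent, fraction field)."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    return (bits >> 23) & 0xFF, bits & 0x7FFFFF

# Smallest normalized number: exponent -126, biased exponent 1.
print(exponent_and_fraction(2.0 ** -126))   # (1, 0)

# An exponent of -129 is out of range, so the significand is shifted right
# by three places: 0.001 (binary) times 2**-126, with biased exponent 0.
e, f = exponent_and_fraction(2.0 ** -129)
print(e, format(f, "023b"))                 # 0 00100000000000000000000

# Smallest denormalized number: a single significant bit remains.
print(exponent_and_fraction(2.0 ** -149))   # (0, 1)

# Extreme case: every significant bit is shifted out, leaving zero.
print(exponent_and_fraction(2.0 ** -160))   # (0, 0)
```

Each leading zero inserted in the fraction costs one bit of precision, which is why the smallest denormalized number carries only one significant bit.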
In the extreme case, all the significant bits are shifted out to the right by leading zeros, creating a zero result. The FPU deals with denormal values in the following ways:
When a denormal number in single- or double-real format is used as a source operand and the denormal exception is masked, the FPU automatically normalizes the number when it is converted to extended-real format.
1.8 Signed Infinities
The two infinities are +∞ and -∞. The signs of infinities are observed, and comparisons are possible. Infinities are always interpreted in the affine sense; that is, -∞ is less than any finite number and +∞ is greater than any finite number.
Whereas denormalized numbers represent an underflow condition, the two infinity numbers represent the result of an overflow condition. Here, the normalized result of a computation has a biased exponent greater than the largest allowable exponent for the selected result format.
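The affine ordering and the overflow behavior can be observed with a brief sketch. It uses Python floats, which are in the 64-bit double format, so the overflow threshold differs from the single-real format discussed above; the variable names are illustrative.

```python
import math
import sys

pos_inf, neg_inf = float("inf"), float("-inf")
largest = sys.float_info.max   # largest finite double

# Affine interpretation: -inf is below and +inf is above every finite number.
print(neg_inf < -largest < largest < pos_inf)   # True

# A product whose exponent exceeds the largest allowable one yields an
# infinity that carries the sign of the true result.
overflowed = largest * 2.0
print(overflowed)              # inf
print(-largest * 2.0)          # -inf
```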
1.9 NaNs
Since NaNs are non-numbers, they are not part of the real number line. In Figure 3, the encoding space for NaNs in the FPU floating-point formats is shown above the ends of the real number line. This space includes any value with the maximum allowable biased exponent and a non-zero fraction. (The sign bit is ignored for NaNs.) The IEEE standard defines two classes of NaN: quiet NaNs (QNaNs) and signaling NaNs (SNaNs). A QNaN is a NaN with the most significant fraction bit set; an SNaN is a NaN with the most significant fraction bit clear. QNaNs are allowed to propagate through most arithmetic operations without signaling an exception. SNaNs generally signal an invalid-operation exception whenever they appear as operands in arithmetic operations.
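Propagation of a quiet NaN can be sketched in a few lines. The example assumes that the value produced by float("nan") in Python is a quiet NaN, which holds on common platforms but is not guaranteed by the language; signaling NaNs are not demonstrated because Python offers no portable way to operate on them.

```python
import math

qnan = float("nan")            # assumed to be a quiet NaN on this platform

# The NaN passes through a chain of arithmetic without raising an exception.
result = (qnan + 1.0) * 2.0 - 3.0
print(math.isnan(result))      # True

# A NaN is not on the real number line, so ordered comparisons are all false.
print(qnan < 0.0, qnan > 0.0, qnan == qnan)   # False False False
```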
1.10 Indefinite
For each FPU data type, one unique encoding is reserved for representing the special value indefinite. For example, when operating on real values, the real indefinite value is a QNaN. The FPU produces indefinite values as responses to masked floating-point exceptions.