Floating-Point Representation in C Programming

Floating-point representation is the method computers use to approximate real numbers with a fixed number of significant digits and a bounded exponent range. This lets a single type cover both very small and very large magnitudes, at the cost of rounding error. This article discusses conversion between double and decimal representations and gives two examples of floating-point representations: one that meets the minimum requirements of the C International Standard, and one that also meets the requirements of IEC 60559.

Conversion Between Double and Decimal

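Converting a value of type double to a decimal string with `DECIMAL_DIG` significant digits, and then converting that string back to double, should be the identity function: the final value should equal the initial value exactly. `DECIMAL_DIG` is defined for the widest supported floating type, so it is always sufficient for double; `DBL_DECIMAL_DIG` is the corresponding value for double alone.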

Example 1: Artificial Floating-Point Representation

The following describes an artificial floating-point representation that meets the minimum requirements of the International Standard. The appropriate values in a `<float.h>` header for type float are:
$$x = s \cdot 16^{e} \sum_{k=1}^{6} f_k \, 16^{-k}, \qquad -31 \le e \le +32$$
FLT_RADIX 16
FLT_MANT_DIG 6
FLT_EPSILON 9.53674316E-07F
FLT_DECIMAL_DIG 9
FLT_DIG 6
FLT_MIN_EXP -31
FLT_MIN 2.93873588E-39F
FLT_MIN_10_EXP -38
FLT_MAX_EXP +32
FLT_MAX 3.40282347E+38F
FLT_MAX_10_EXP +38
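
The values above are not independent: they follow from the radix, precision, and exponent range of the model. The derivation below is a sketch using the defining formulas the C standard gives for these macros, with the symbols $b$ (radix), $p$ (number of base-$b$ digits), $e_{\min}$ and $e_{\max}$ introduced here for illustration. For this example, $b = 16$, $p = 6$, $e_{\min} = -31$, and $e_{\max} = +32$.

```latex
\begin{aligned}
\texttt{FLT\_EPSILON} &= b^{\,1-p} = 16^{-5} = 2^{-20} \approx 9.53674316 \times 10^{-7} \\
\texttt{FLT\_MIN} &= b^{\,e_{\min}-1} = 16^{-32} = 2^{-128} \approx 2.93873588 \times 10^{-39} \\
\texttt{FLT\_MAX} &= \left(1 - b^{-p}\right) b^{\,e_{\max}} = \left(1 - 16^{-6}\right) 16^{32} \approx 3.40282347 \times 10^{38} \\
\texttt{FLT\_DIG} &= \left\lfloor (p-1)\log_{10} b \right\rfloor = \left\lfloor 5 \times 1.20412 \right\rfloor = \lfloor 6.02 \rfloor = 6 \\
\texttt{FLT\_DECIMAL\_DIG} &= \left\lceil 1 + p\log_{10} b \right\rceil = \left\lceil 1 + 6 \times 1.20412 \right\rceil = \lceil 8.22 \rceil = 9 \\
\texttt{FLT\_MIN\_10\_EXP} &= \left\lceil \log_{10} b^{\,e_{\min}-1} \right\rceil = \lceil -38.53 \rceil = -38 \\
\texttt{FLT\_MAX\_10\_EXP} &= \left\lfloor \log_{10}\!\left( \left(1 - b^{-p}\right) b^{\,e_{\max}} \right) \right\rfloor = \lfloor 38.53 \rfloor = 38
\end{aligned}
```

Because $16^{32} = 2^{128}$ and $16^{-6} = 2^{-24}$, `FLT_MAX` in this artificial format coincides with the IEC 60559 single-precision maximum shown in Example 2.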

Example 2: Single-Precision and Double-Precision Numbers

The following describes floating-point representations that also meet the requirements for single-precision and double-precision numbers in IEC 60559. The appropriate values in a `<float.h>` header for types float and double are:
$$x_f = s \cdot 2^{e} \sum_{k=1}^{24} f_k \, 2^{-k}, \qquad -125 \le e \le +128$$

$$x_d = s \cdot 2^{e} \sum_{k=1}^{53} f_k \, 2^{-k}, \qquad -1021 \le e \le +1024$$
FLT_RADIX 2
DECIMAL_DIG 17
FLT_MANT_DIG 24
FLT_EPSILON 1.19209290E-07F // decimal constant
FLT_EPSILON 0X1P-23F // hex constant
FLT_DECIMAL_DIG 9
FLT_DIG 6
FLT_MIN_EXP -125
FLT_MIN 1.17549435E-38F // decimal constant
FLT_MIN 0X1P-126F // hex constant
FLT_TRUE_MIN 1.40129846E-45F // decimal constant
FLT_TRUE_MIN 0X1P-149F // hex constant
FLT_HAS_SUBNORM 1
FLT_MIN_10_EXP -37
FLT_MAX_EXP +128
FLT_MAX 3.40282347E+38F // decimal constant
FLT_MAX 0X1.fffffeP127F // hex constant
FLT_MAX_10_EXP +38
DBL_MANT_DIG 53
DBL_EPSILON 2.2204460492503131E-16 // decimal constant
DBL_EPSILON 0X1P-52 // hex constant
DBL_DECIMAL_DIG 17
DBL_DIG 15
DBL_MIN_EXP -1021
DBL_MIN 2.2250738585072014E-308 // decimal constant
DBL_MIN 0X1P-1022 // hex constant
DBL_TRUE_MIN 4.9406564584124654E-324 // decimal constant
DBL_TRUE_MIN 0X1P-1074 // hex constant
DBL_HAS_SUBNORM 1
DBL_MIN_10_EXP -307
DBL_MAX_EXP +1024
DBL_MAX 1.7976931348623157E+308 // decimal constant
DBL_MAX 0X1.fffffffffffffP1023 // hex constant
DBL_MAX_10_EXP +308
If a type wider than double were supported, then `DECIMAL_DIG` would be greater than 17. For example, if the widest type were to use the minimal-width IEC 60559 double-extended format (64 bits of precision), then `DECIMAL_DIG` would be 21.