Floating Point and the support for IEEE 754 in H++   Leave a comment

As promised, the post that will reveal the details of the support of IEEE 754 in H++.

If after reading this post, you think you will need a more in depth discusion of IEEE 754 and its relation with the programming languages, please read Michael L. Overton’s excellent book called: Numerical Computing with IEEE Floating Point Arithmetic. A revision of the current IEEE standard can be located in IEEE 754: Revision of 2008.

Numerical Computing with IEEE Floating Point Arithmetic

Numerical Computing with IEEE Floating Point Arithmetic

As Overton explains in his book, the support for IEEE 754 and floating point in general has been inconsistent in both the hardware (the cpus), and finally the software (the programming languages). Programming Languages like C/C++ and Fortran are among others the ones with full support for the IEEE 754 specification. But that was not this way from the beginning.

The Floating Point support in H++ is very simple: I just have float and double types both with the same size and precision: 64 bits (1 bit for sign, 11 bits for the exponent, 52 bits for the significant or the mantissa). This was made this way just for simplicity when implementing the H++ Code Generator for x86 32-bits Assembly.

So, you can declare local or static variables with the keywords float and double, and will have the same double precision known from the double data type. In the future, that will change to reflect the standard: float will have 32-bits and double 64-bits of size.

Some of the floating point support tests I was writing in sample H++ programs, are provided as C programs in the Michael L. Overton’s book; if you want to do the same testing for C/C++ programs, please visit the links : Programs in this book.

Another lack of feature in H++ is in the print format; H++ supports only Fixed Format, which is very restrictive, but it works. The Exponential Format is the missing feature actually in my implementation. Here are examples of what I’m saying:

Fixed Format:            0.000542368921   (the only print format supported in H++ actually)

Exponential Format:    5.42368921E-4     (a more easy to handle print format, also known as Scientific Notation)

I’m thinking in dedicating a couple of days to implement  that format, but I really don’t know when I’ll start to make it.

Testing the H++ Compiler

One of the tests you can find in this book, is the reciprocal power of 2, like in this H++ program:

Calculus of Reciprocal Power Of 2

Calculus of Reciprocal Power Of 2

This little program outputs the reciprocal powers of 2; so one detail I have to explain that shows the differences of this implementation with the original found in Overton, is that the condition was made against this tiny number: 0.00000000000000001 and not against 0.0 as the original one; the reason is simple: because I just implemented the Fixed Format for output, all the number below this 1.0E-17 are all printed as 0.0 until x really underflows.

The output:

Resuls of Reciprocal Power of 2

Resuls of Reciprocal Power of 2

The final results are:

The problem with Fixed Format in H++

The problem with Fixed Format in H++

After the power 2^-55 and down, the number printed is zero;  if the number would be 0.0 in the while condition expression, the results are all zero from here until it really underflows to 0.0; the reason is the precision I’m using: 64-bits of precision.

A test that passed Ok, was the one involving 1.0 added to the reciprocal powers of 2:

Calc One Plus Reciprocal Power Of 2

Calc One Plus Reciprocal Power Of 2

The results were as expected:

Results passed Ok for 1.0 + 2^x

Results passed Ok for 1.0 + 2^x

This phenomena is very interesting because when you have isolated the reciprocal power in H++, it does not behave very well; but when you add it to 1.0, the results are as expected.

One interesting test, is the one related to the approximations as explained in Overton; here is the test H++ program:

Find One From Quotient

Find One From Quotient

As you can see from the formula, 1/ (1/x), the result should be X in all cases:

One Div Reciprocal X Results

One Div Reciprocal X Results

This test was successfull in H++. Here is why this test is very important: The binary representation for 1.0 and 0.99999999 are totally different besides they are the same value; let’s see these numbers as 32-bits float values in their binary format:

Short Real    sign exponent mantissa                                                  hex-value

1.000           =  0        01111111    00000000000000000000000   0x3F800000  <—- 1.0
1.000           =  0        01111110   11111111111111111111110   0x3F7FFFFE   <—- 0.9999999

As you could see, the number and its approximation have different binary representations; but if you continue adding more digits to the approximation, the process of rounding will take place here, and the number finally will change from .9999999…. to 1.0; you just have to add another digit 9, to convert the 0.99999999 in 1.0:

Short Real    sign exponent mantissa                                                  hex-value

1.000           =  0        01111111    00000000000000000000000   0x3F800000  <—- 1.0
1.000           =  0        01111111    00000000000000000000000   0x3F800000  <—- was once 0.99999999

If you want to have an in depth explanation of this phenomena, please read Overton’s book: Numerical Computing with IEEE Floating Point Arithmetic.

Fractions with Expressions of Zero Result, as the Divisor

Another interesting subject, is the one that relates to fractions with zero as divisor, one operation which is invalid in mathematics.

It’s well known that in an expression with fractions or divisions, where the denominator expression would result in zero, cannot be resolved inmediatelly; so, from Calculus, you know that some limits must be approximated, when they cannot be solved by simple substitution of x in the polynomiuns.

They are a lot of tools in calculus to solve that problem, and one is approximation; if you cannot use 3 as value of x, you could use values from the left like these: 2.4, 2.5, 2.7, 2.9, 2.95, 2.98, 2.995, 2.9995; also you can use values from the right, like these: 3.0000099, 3.00009, 3.0001, 3.001, 3.01, 3.1. Also, tools like L’Hopital’s rule, is well known to be applied in this cases, to avoid the approximation or numerical analysis.

But, one of the features of the IEEE 754 is that you can have two different approaches: allow this result to become infinite, or throw an exception, following the rules of Division by Zero.

In the following example, what you can have as result, is zero; that will help you understand the problem in your formula, or in your program:

y = 1/x;  //this will result in y = +oo|-oo (plus|minus infinite)

Based on the first approach, that expression would result in +oo (plus infinite) or -oo (minus infinite); so, if you continue operating with this value, the result could be infinite. If you then do this to the previous result:

z = 1/y;  //this will result in z == 0.0

The new result would be zero.

The second approach, is to throw an IEEE 754 Floating Point exception called: Division by Zero; so you’ll need to catch the exception to understand the problem, and avoid the program to terminate unexpectedly:

try{

y = 1/x;  //this expression causes the exception when x == 0.0;

z = 1/y;

}catch()
{
printf(“Division by Zero has ocurred!”);

}

Here is an H++ program that shows the results of the Total Resistance, for different values of R1 and R2 (the two resistances in the circuit):

Total Resistance in H++

Total Resistance in H++

To get the programs in C, please visit Michael L. Overton’s site.

The output of this little program is:

Results of Total Resistance in H++

Results of Total Resistance in H++

As you can see, when R2 == 0.0, the total resistance in the circuit is 0.0. That means that:

Total Resistance Formula

Total Resistance Formula

More advanced tests like Approximating a Derivative by a Difference Quotient, can be found in the Overton’s book. Also, a phenomena called Cancellation is well explained in that book, so I recommend you to read it to have a better understanding of the issues involved with Compilers and the expected results.

Here is the output:

Approximating a Derivative from A Diff Quotient

Approximating a Derivative from A Diff Quotient

This is the H++ program that generates that output:

AproxADerivByDiffQuot Program in H++

ApproxADerivByDiffQuot Program in H++

Approximating the Euler’s e Constant

Let’s do another test: let’s find Euler’s e constant in an H++ program:

Approximating Euler e Constant

Approximating Euler e Constant

As you observe, the result depends on the size of n; when n–>oo, e will have a more accurate value; here is the result:

Approximating Euler e Results

Approximating Euler e Results

You can see that, when I applied the Natural Logarithm to the value of e I found with n = 10E+8, the result was close to 1.0; the approximation was very good for that value of n.

Now, let’s do the same thing, but this time, using the Taylor series formula:

Approximating e using Taylor Series

Approximating e using Taylor Series

Now, making a few changes to return the e value from each function, and compare the cancellation digits, to determine how many digits are correct in the two aproximations, we get these results:

Approximating e Methods

Approximating e Methods

Now you can notice that the two aproximations are valid for the first 6 digits after the decimal point. In this example, the more accurate was the Taylor Series formula.

H++ Floating Point Math Library

As a explained in older posts, the Math library supports almost all the fundamental functions, primitives and intrinsic functions; most of them were implemented at the x86 assembly level, to leaverage the existing implementation in the Floating Point Unit (FPU) in the x86 microprocessors.

Fundamental Math Functions

Here are some of the implemented functions:

Hpp Math Lib - Part I

Hpp Math Lib - Part I

Most of these functions are already implemented in the FPU.

Hpp Math Lib - Part II

Hpp Math Lib - Part II

You can also note the constants that are already computed in the FPU, and the most common Trigonometric Functions.

I also implemented the y^x function, and e^x:

Hpp Math Lib - Part III

Hpp Math Lib - Part III

Inverse Functions

Hpp Math Lib - Part IV

Hpp Math Lib - Part IV

Here you can see, how I implemented the Inverse ArcSin in x86 Assembly Language, for the H++ Library:

Implementation of Inverse ArcSin Function

Implementation of Inverse ArcSin Function

Hyperbolic Functions

Finally, the other more advanced functions, I decided to implement them directly in H++ because of matter of time:

Hyperbolic Functions

Hyperbolic Functions

This code is compiled and is available in the H++ library for any math operations that require them.

Inverse Hyperbolic Functions

Hyperbolic Inverse Functions

Some aliases you can use, without referring to the full name of a static function for these hyperbolic and inverse hyperbolic functions, was defined:

Hyperbolic and Inverse Hyperbolic Function Shortcuts

Having implemented these later functions in H++ saved me a lot time, which I dedicated to finish some important issues including bug fixes.

Conclusion

The full support of IEEE 754 is still very distant in H++, but as you could see, there is a lot of work already done in the H++ compiler, and fewer things are missing. I highly recommend you to read Overton‘s book: Numerical Computing with IEEE Floating Point Arithmetic, if you are serious about Floating Point and the IEEE 754 Standard.

I hope this post can help you understand the basics of Floating Point, and the support in H++.

Bye, bye.

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: