A floating-point issue I faced in practice!

I wanted to brush up on a few machine learning concepts, and one of the basics I was keen to practice in depth is linear regression. A typical project for this is the Bike Sharing demand project, which has 730 rows of data. I won't go into the details of the project here; please refer to Kaggle for further information.

I wanted to split these 730 rows with train_size = 0.7 and test_size = 0.3, i.e. a 70:30 ratio.

Expected train size = 730 x 0.7 = 511 and expected test size = 730 x 0.3 = 219. To my surprise, the train size came out as 510 and the test size as 219, which add up to 729 instead of 730.

df_train, df_test = train_test_split(bike, train_size = 0.7, test_size=0.3, random_state = 100)        

If I give only the train size or only the test size, the library calculates that one and obtains the other by subtracting it from the total number of rows.

So first I gave only test_size = 0.3:

df_train, df_test = train_test_split(bike, test_size = 0.3, random_state = 100)        

It calculated the test size as 219 and the train size as 511. That's great, this is what I expected. It first calculated the test sample size as 219 and then subtracted 219 from 730 to get the train sample size, hence 511.

Then I tried with only the train size (train_size = 0.7):

df_train, df_test = train_test_split(bike, train_size = 0.7, random_state = 100)        

To my surprise, it calculated the train size as 510 and the test size as 220. In this case it first calculated the train sample size as 510 and then subtracted 510 from 730, hence 220 as the test size.

From a machine learning perspective, training on 511 samples or 510 samples should not make a difference, but as a software engineer I wanted to dig deeper into why this discrepancy occurs and where I am losing that one row when I pass train_size = 0.7.

if test_size_type == "f":
    n_test = ceil(test_size * n_samples)
elif test_size_type == "i":
    n_test = float(test_size)

if train_size_type == "f":
    n_train = floor(train_size * n_samples)
elif train_size_type == "i":
    n_train = float(train_size)

if train_size is None:
    n_train = n_samples - n_test
elif test_size is None:
    n_test = n_samples - n_train        

The code above is from https://github.com/scikit-learn/scikit-learn/blob/main/sklearn/model_selection/_split.py

When I debugged, I saw the library flooring the value of (train_size * n_samples). The floor of 511.0 should be 511, but I got 510. Digging further, I found the issue below.
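To see how these branches interact, here is a minimal pure-Python sketch of the same size arithmetic. The function name split_sizes is hypothetical (it is not part of scikit-learn), it only handles fractional sizes, and the final int() conversion is my own simplification.

```python
from math import ceil, floor


def split_sizes(n_samples, train_size=None, test_size=None):
    """Hypothetical helper mirroring the quoted logic (fractional sizes only).

    At least one of train_size / test_size must be given.
    """
    n_train = n_test = None
    if test_size is not None:
        n_test = ceil(test_size * n_samples)     # test size is rounded up
    if train_size is not None:
        n_train = floor(train_size * n_samples)  # train size is rounded down
    if train_size is None:
        n_train = n_samples - n_test             # remainder goes to train
    elif test_size is None:
        n_test = n_samples - n_train             # remainder goes to test
    return int(n_train), int(n_test)


both = split_sizes(730, train_size=0.7, test_size=0.3)
only_test = split_sizes(730, test_size=0.3)
only_train = split_sizes(730, train_size=0.7)
print(both, only_test, only_train)
```

With standard 64-bit floats this should reproduce the three results described above: (510, 219), (511, 219) and (510, 220).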

In Python, multiplying 0.7 * 730 does not give exactly 511.0; it gives 510.99999999999994, and flooring that, as the library does, yields 510!
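The snippet below is a small sketch, using only the standard library, that makes this visible. Decimal(0.7) prints the exact binary value that Python stores for the literal 0.7, which is slightly below 0.7.

```python
import math
from decimal import Decimal

product = 0.7 * 730
print(product)               # the float product, slightly below 511
print(math.floor(product))   # flooring drops it to 510
print(Decimal(0.7))          # exact value stored for the literal 0.7
print(product == 511.0)      # False: the product is not exactly 511
```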

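This is a floating-point rounding error: 0.7 has no exact binary representation, so the product falls just below 511. To check whether it is specific to Python, I ran the same calculation in several languages: Java, C and Go. (Of course, what each language shows depends on the floating-point type and the print format used.)

Below is the code I used in each language.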

Java Code

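class Calculate {
    public static void main(String[] args) {
        // Note: float is 32-bit single precision, unlike Python's 64-bit float
        float x = 0.7f;
        int y = 730;
        float z = x * y;   // y is promoted to float before multiplying
        System.out.println(z);
    }
}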

C Code

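#include <stdio.h>

int main(){

	int x = 730;
	float y = 0.7;   /* single precision, unlike Python's 64-bit float */
	/* %f prints six digits after the decimal point */
	printf("%f\n", x*y);
	return 0;
}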

Golang Code

package main

func main() {
	var a int
	var b float64
	a = 730
	b = 0.7
	result := float64(a) * b
	// The builtin println prints floats in scientific notation
	// with a limited number of digits
	println(result)
}

Python code

x = 730
y = 0.7
print(x * y)        

The results from all the languages:

[Screenshots of the program output are not preserved]

This clarified my doubt: Python produces 510.99999999999994 instead of 511, and the library floors this value, hence we get 510.
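One point worth keeping in mind when comparing the languages: the Java and C programs above use the 32-bit float type, while Python's float is a 64-bit double, so the programs are not doing quite the same calculation. The sketch below emulates single precision in Python with the struct module; the helper name to_float32 is hypothetical and only serves this illustration.

```python
import struct


def to_float32(value):
    """Round a Python float (64-bit) to the nearest 32-bit float value."""
    return struct.unpack("f", struct.pack("f", value))[0]


single = to_float32(to_float32(0.7) * 730)  # emulates 32-bit float arithmetic
double = 0.7 * 730                          # Python's native 64-bit arithmetic

print(single)  # single precision rounds the product to 511.0
print(double)  # double precision keeps the tiny shortfall below 511
```

In other words, a result of exactly 511 in single precision is a side effect of coarser rounding, not a sign that the language avoids the problem.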


How to resolve this in Python?

One way is to use the decimal module:

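from decimal import Decimal

x = 730
y = 0.7
# Convert the float product to its exact decimal value, then format it
# with 5 significant digits, which rounds it to 511.00 for display
print(format(Decimal.from_float(x * y), '.5'))

Output: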

jithrock@tech:~/fp_issues$ python3 Calculate.py 
511.00
jithrock@tech:~/fp_issues$         
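A few other options, sketched here with the standard library only. Building the Decimal from the string "0.7" avoids the binary rounding in the first place, Fraction does exact rational arithmetic, and round() picks the nearest integer instead of flooring. Which of these fits is a judgment call and depends on the project; the variable names are mine.

```python
from decimal import Decimal
from fractions import Fraction

n_samples = 730

exact_decimal = Decimal("0.7") * n_samples     # decimal arithmetic, no binary error
exact_fraction = Fraction(7, 10) * n_samples   # exact rational arithmetic
n_train = round(0.7 * n_samples)               # nearest integer instead of floor

print(exact_decimal, exact_fraction, n_train)
```

Since train_size also accepts an absolute number of samples, one possible use is to pass the integer n_train to train_test_split instead of the fraction 0.7.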

If you have come across these kinds of issues, please share! Please also suggest better ways to solve this. Python is used predominantly in the scientific and data science world, and these areas are full of number crunching, so how does Python handle these issues?


In the next article, I will explore how floating-point issues are handled in various languages and share my observations.


