A floating-point issue I faced in practice!
I wanted to brush up on a few machine learning concepts, and one of the basics I was keen to practice in depth is linear regression. A typical project for this is the Bike Sharing demand project, which has 730 rows of data. I won't go into the details of the project here; please refer to Kaggle for further information.
I wanted to split these 730 rows with train_size as 0.7 and test_size as 0.3, i.e. a 70:30 ratio.
Expected train size = 730 × 0.7 = 511 and test size = 730 × 0.3 = 219. To my surprise, the train size came out as 510 and the test size as 219, which add up to 729 instead of 730.
df_train, df_test = train_test_split(bike, train_size = 0.7, test_size=0.3, random_state = 100)
If I give only the train size or only the test size, the library calculates that one and gets the other by subtracting it from the total number of rows.
So first I gave only test_size = 0.3:
df_train, df_test = train_test_split(bike, test_size = 0.3, random_state = 100)
It calculated the test size as 219 and the train size as 511. That's great, this is what I expected. It first calculated the test sample size as 219 and then subtracted 219 from 730 to get the train sample size, hence 511.
Then I tried with only the train size (train_size = 0.7) to check:
df_train, df_test = train_test_split(bike, train_size = 0.7, random_state = 100)
To my surprise, it calculated the train size as 510 and the test size as 220. In this case it first calculated the train sample size as 510 and then subtracted 510 from 730, hence 220 as the test size.
From a machine learning perspective, training on 511 samples or 510 samples should not make a difference. But from a software engineer's perspective, I wanted to dig deeper into why this discrepancy occurs and where I am losing that one row when I pass train_size = 0.7.
if test_size_type == "f":
    n_test = ceil(test_size * n_samples)
elif test_size_type == "i":
    n_test = float(test_size)

if train_size_type == "f":
    n_train = floor(train_size * n_samples)
elif train_size_type == "i":
    n_train = float(train_size)

if train_size is None:
    n_train = n_samples - n_test
elif test_size is None:
    n_test = n_samples - n_train
This is the relevant code from https://github.com/scikit-learn/scikit-learn/blob/main/sklearn/model_selection/_split.py
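To reproduce the three cases without scikit-learn, here is a minimal sketch that wraps the same floor/ceil logic in a small function. The name split_sizes is my own hypothetical helper, not part of the library, and it assumes only fractional sizes are passed.

```python
from math import ceil, floor


def split_sizes(n_samples, train_size=None, test_size=None):
    """Hypothetical helper mirroring the floor/ceil logic quoted above.

    Only fractional (float) sizes are handled; this is not scikit-learn's API.
    """
    if train_size is None and test_size is None:
        raise ValueError("give train_size, test_size, or both")
    n_train = n_test = None
    if test_size is not None:
        n_test = ceil(test_size * n_samples)     # test count is rounded up
    if train_size is not None:
        n_train = floor(train_size * n_samples)  # train count is rounded down
    if train_size is None:
        n_train = n_samples - n_test
    elif test_size is None:
        n_test = n_samples - n_train
    return n_train, n_test


print(split_sizes(730, train_size=0.7, test_size=0.3))  # (510, 219)
print(split_sizes(730, test_size=0.3))                  # (511, 219)
print(split_sizes(730, train_size=0.7))                 # (510, 220)
```

The three calls match the three train_test_split calls in this article, so the missing row can be studied with plain Python arithmetic.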
When I debugged, I saw the library flooring the value of (train_size * n_samples). The floor of 511.0 should be 511, but I got 510, so I dug further and found the issue below.
In Python, multiplying 0.7 * 730 does not give exactly 511.0. It gives 510.99999999999994, and flooring that, as the library does, brings the value down to 510!
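A quick way to see this for yourself is the short check below. It is only a sketch using the standard library: Decimal(0.7) prints the exact value stored for the literal 0.7, which is slightly below 0.7.

```python
from decimal import Decimal
from math import floor

product = 0.7 * 730

print(Decimal(0.7))    # exact stored value of 0.7, slightly below 0.7
print(repr(product))   # 510.99999999999994
print(floor(product))  # 510
```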
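This is a floating-point representation error: 0.7 has no exact binary representation, so the product lands just below 511. To check whether this is specific to Python, I ran the same multiplication in several languages: Java, C and Go. (Of course, different languages and floating-point types handle this in different ways.)
See the code and outputs from the various languages below.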
Java Code
class Calculate {
    public static void main(String[] args) {
        float x = 0.7f;
        int y = 730;
        float z = x * y;
        System.out.println(z);
    }
}
C Code
#include <stdio.h>

int main() {
    int x = 730;
    float y = 0.7;
    printf("%f\n", x * y);
    return 0;
}
Golang Code
package main

func main() {
    var a int
    var b float64
    a = 730
    b = 0.7
    result := float64(a) * b
    println(result)
}
Python code
x = 730
y = 0.7
print(x * y)
The result from all the languages,
This clarified my doubt: Python produces 510.99999999999994 instead of 511, and the library floors this value, hence we get 510.
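One thing worth noting about the comparison: the Java and C snippets declare a 32-bit float, while the Go and Python snippets use 64-bit floats, so they are not doing the same arithmetic. The sketch below emulates 32-bit rounding in Python with the standard struct module. The helper to_f32 is my own hypothetical name, and the emulation assumes strict 32-bit arithmetic (a C compiler may keep extra precision in intermediate results).

```python
import struct


def to_f32(value):
    """Round a Python float (64-bit) to the nearest 32-bit float."""
    return struct.unpack("f", struct.pack("f", value))[0]


f32_result = to_f32(to_f32(0.7) * 730)  # emulates: float x = 0.7f; x * 730
f64_result = 0.7 * 730                  # 64-bit arithmetic, as in Python

print(f32_result)  # 511.0
print(f64_result)  # 510.99999999999994
```

In 32-bit precision the rounding happens to land exactly on 511, while in 64-bit precision it lands just below. Neither type stores 0.7 exactly.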
How do we resolve this in Python?
One way is to use the decimal module:
from decimal import Decimal

x = 730
y = 0.7
# Format the product to 5 significant digits, which rounds it back to 511.00
print(format(Decimal.from_float(x * y), '.5'))
output
jithrock@tech:~/fp_issues$ python3 Calculate.py
511.00
jithrock@tech:~/fp_issues$
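The formatting approach above rounds the value only when displaying it. Another option, sketched here under the assumption that the ratio is known as a decimal string or as a pair of integers, is to avoid the binary float altogether so that the product is exactly 511 before flooring.

```python
from decimal import Decimal
from fractions import Fraction
from math import floor

n_samples = 730

# Build the ratio from a string or from integers so it is exactly 7/10.
n_train_decimal = floor(Decimal("0.7") * n_samples)
n_train_fraction = floor(Fraction(7, 10) * n_samples)

print(n_train_decimal, n_train_fraction)  # 511 511
```

Since the library code quoted earlier also has a branch for integer sizes, passing the resulting integer (train_size=511) instead of 0.7 should sidestep the float multiplication entirely.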
If you have come across issues like this, please share! Please also suggest better ways to solve it. Python is used predominantly in the scientific and data science world, and these areas are full of number crunching, so how does Python handle these issues?
In the next article I will explore how floating-point issues are handled in various languages and share my observations.