An Applied Machine Learning Approach: Classifying wearable device data into movements and body postures


Globally, wearable devices have had a significant impact on our lives, primarily through fitness monitoring and other health-related metrics. However, their accuracy and correctness will play a major role in their popularity and adoption rate [1]. The wearable devices market is expected to exceed USD $51.0 billion by 2022, so ensuring that these devices can accurately measure their users’ health metrics is of immense relevance. One of the major challenges in measuring a user’s metrics is activity recognition. This paper proposes and implements an applied machine learning approach to classify a user’s activity as one of: sitting, sitting down, standing, standing up, or walking. A dataset (165,663 instances) collected from 4 subjects wearing accelerometers placed on the waist, left thigh, right arm and right ankle is used. The classifier model is built using the J48 decision tree algorithm.

Introduction

The future of genuine, legitimate personalized care relies largely on wearable technology that can (1) monitor physical activity, (2) collect data and (3) deliver real-time feedback to users [2]. With exponential growth in the creation of functional and useful wearables, there have been changes in health care and in the process of detecting ailments such as diabetes, obesity and heart disease. This, combined with improvements in microelectronics, has led to two new active research areas: Activities of Daily Living (ADL) and Human Activity Recognition (HAR) [2].

[3] ADL is a way to describe the functional status of a person. Research has been done in developing systems for monitoring human subjects over long periods of time using wearable units. A number of researchers have used accelerometers to implement and understand ADL. HAR aims to recognize the actions and goals of one or more agents from a series of observations on agents’ actions and environmental conditions.

Two approaches to measuring and predicting ADL and HAR are image processing and wearable sensors [3]. The image-processing method has proved more challenging to implement due to the difficulty of installing cameras in several rooms, variable lighting, the possibility of poor image quality from faulty camera lenses, and privacy issues. Wearable sensors have been the more popular approach: they can provide personalized feedback and only require the user to wear the device for a particular time period. Possible flaws with this approach are the battery life and calibration of the sensors. This paper focuses on optimizing and improving the process of classifying a user’s body posture and movement using wearable technology.

The Prediction Problem

The applications and use of wearable devices have been seen in industries such as consumer applications, lifestyle, fitness and sports, enterprise and defense. [4] The wearable technology market has been growing at a rate of 16.2%. However, current mainstream wearable devices have poor validity and accuracy in their measurements [5]. Several comparative studies done using these devices have concluded that devices like the Apple Watch 2 and Samsung Gear S3 have irregular accuracies on different measurements and devices like the Jawbone UP3, Fitbit Surge, Huawei Talk Band B3, Xiaomi Mi Band 2, Ledongli, and APP-2 were only able to accurately measure distance and number of steps but had a large amount of error in the other measurements [5].

The current process of data measurement, activity recognition, classification and calculations is summarized in the data pipeline diagram below:

In the current data pipeline, stage 2 is where a machine learning classifier is applied to the data to predict the activity of the user. It is clear that the accuracy (percent correct and kappa) of stage 2 greatly influences the output of stage 4. In stage 2, the most common classification technique is a decision tree model combined with the AdaBoost algorithm [3]. At this stage, the user’s activity is classified as “sitting, sitting down, standing, walking or exercising” [5]. This then influences the number of steps walked, calories burned and heart rate, since a user engaged in heavy exercise will have a higher heart rate, more calories burned and more steps walked than someone who has been sitting down. Once the calculations are made in stage 3, it is verified that they are in range for the user’s activity.

This paper aims to improve and optimize stage 2 of this pipeline, the most crucial step for accurately understanding the user’s activity. This work uses an applied machine learning approach to create an efficient, accurate and low-latency process for classifying a person’s activity and posture into one of {sitting, sitting down, standing, standing up, walking}, given data from sensors placed on the user.

[2][3][5] Understanding and improving this process has implications for research in wearable technology, HAR, ADL and using Machine Learning. This can lead to advances in primarily the healthcare and fitness industries. Accurate recognition has the potential benefit of developing technology that can be used to support chronically ill patients, elderly people or people with special needs. Therefore, accurately predicting and classifying activities such as sitting down, sitting up, or walking will be very useful to provide feedback to caregivers about the patient’s behavior.

The Applied Machine Learning Approach used in this paper is summarized below:

1.    Dataset Pre-processing and Preparation: cleaning the data and splitting the dataset into a development set, training set, cross-validation set and final holdout set

2.    Data Exploration: Feature selection, experimenting with different machine learning algorithms on the dataset in order to gain insights/information about the data.

3.    Recording baseline performance on the development data, using a model built from the training data. Baseline performance was calculated using several machine learning algorithms; only the best/most accurate ones are discussed

4.    Optimization:

a.    Error Analysis: Analyzing the confusion matrix and the incorrectly classified instances to improve feature selection and the models built

b.    Parameter Tuning: Tuning the different parameters used in machine learning algorithms to improve performance specifically for this dataset

5.    Final Result: Performance on the final holdout/untouched test data.


Background

This section provides basic knowledge of the sensors, data used and the process of determining a person’s activity/position, which is required for understanding the approach used in this paper.

Related Work

This section provides a review of work done in the field of HAR using wearable accelerometers. In the last few years there have been several published works in this field, specifically in three areas:

1.    Detecting human activity using smartphones

2.    Detecting activity through body-worn inertial sensors

3.    Classifying body movement data using pattern discovery, metric learning or the R transform

At present, researchers and companies are focused on using this technology to improve a user’s quality of life. Companies like Apple, Samsung, Fitbit and Jawbone are all attempting to create the smallest, most efficient, accurate and precise method of predicting a user’s activity [4]. To better understand the present state of HAR and ADL, I conducted a quantitative analysis of published HAR IEEE papers. This provided insights into work done using applied machine learning and accelerometers or sensors. The following metadata was drawn from the IEEE database articles: research title, usage of accelerometers, and machine learning technique. The metadata shows an exponentially growing number of publications on HAR using wearable accelerometers and applied machine learning.

The data collected is presented below: 

In the surveyed works, I observed the use of up to 4 accelerometers for data collection. The most widely used type of algorithm for building a model was the decision tree. The most prevalent testing method was k-fold cross-validation, though some works used less reliable methodologies. From the literature, I concluded that most methodologies achieve a success rate of around 89% accuracy in activity classification.

[3] The data used in this paper was collected for ongoing research in the fields of HAR and ADL. The research was done to build a wearable device using 4 accelerometers to collect human activity data in different static postures and dynamic movements.

Data Collection

This paper uses data collected by a research group at the Pontifical Catholic University of Rio de Janeiro in 2013 [6]. The data was collected from 4 subjects (2 men, 2 women), with 8 hours of data for each activity. The subjects were healthy adults, and each activity was performed separately [3]. Although data was collected from only four subjects, the amount of data is extensive and well represented: the participants span different age groups, genders and heights. Information about the subjects is summarized in the table below:

The data was collected by using a wearable device composed of 4 tri-axial ADXL335 accelerometers connected to an ATmega328V microcontroller [6]. The 4 sensors were placed on the following locations.

1.    Sensor on waist

2.    Sensor on left thigh

3.    Sensor on right ankle

4.    Sensor on right arm

The following positioning and orientation was used:

Each sensor has x, y and z attributes. All four sensors were calibrated based on their original values and for a particular user.
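As a concrete sketch of what one labelled instance looks like, the record below uses illustrative field names and values; the actual CSV headers and calibrated readings follow the dataset’s own documentation [6].

```python
from dataclasses import dataclass

@dataclass
class SensorInstance:
    """One labelled reading: four tri-axial accelerometers give 12 axis values.
    Sensor 1 = waist, 2 = left thigh, 3 = right ankle, 4 = right arm."""
    user: str
    gender: str
    age: int
    height: float   # metres
    weight: float   # kilograms
    x1: int; y1: int; z1: int
    x2: int; y2: int; z2: int
    x3: int; y3: int; z3: int
    x4: int; y4: int; z4: int
    label: str      # sitting, sittingdown, standing, standingup or walking

# illustrative values, not a real row from the dataset
row = SensorInstance("subject1", "Woman", 46, 1.62, 75.0,
                     -3, 92, -63, -23, 18, -19, 5, 104, -92,
                     -150, -103, -147, "sitting")
```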

Data Preparation:

Data Cleaning and Preprocessing:

The collected data had a few corrupt records and an incompatible format. Preparation involved writing a Python script to remove typographical errors and create a compatible .csv file.

The dataset had pre-existing clusters of users and genders; for example, instances 1 to 15616 all belonged to a single user. If the dataset were split into training, development and test sets without pre-processing, the models built would be biased, because the learning algorithm would make erroneous assumptions. A high bias can cause a model to miss relevant relations between features and target outputs; here, it would lead to overfitting on the user the model was trained on and under-fitting on any other user. To avoid this, the data was randomized by inserting a new attribute called “random” and then sorting the dataset by it. (“Random” was not used for any predictions made.)
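The randomization trick described above can be sketched in a few lines; the helper name and fixed seed are illustrative, not part of the original pipeline.

```python
import random

def shuffle_rows(rows, seed=42):
    """Mirror the paper's randomization: attach a throwaway 'random'
    attribute to each row, sort by it, then discard it. This breaks the
    per-user clustering before any train/test split is made."""
    rng = random.Random(seed)                       # fixed seed: reproducible
    keyed = [(rng.random(), row) for row in rows]   # the temporary attribute
    keyed.sort(key=lambda pair: pair[0])            # sort by it ...
    return [row for _, row in keyed]                # ... then drop it

rows = list(range(10))
shuffled = shuffle_rows(rows)
```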

Dividing the Data:  

I divided the dataset into four components. In total, there were 165,632 instances:

1.    Development set, used for understanding the data and data exploration (20% of data)

2.    Training set (30% of data)

3.    Cross Validation Set (30% of data)

4.    Holdout Set (20% of data)
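A minimal sketch of this 20/30/30/20 split over already-shuffled rows (the helper name and contiguous-slicing strategy are assumptions):

```python
def split_dataset(rows):
    """Slice the (already shuffled) instances into the four parts above:
    20% development, 30% training, 30% cross-validation, 20% holdout."""
    n = len(rows)
    a, b, c = int(0.20 * n), int(0.50 * n), int(0.80 * n)
    return rows[:a], rows[a:b], rows[b:c], rows[c:]

dev, train, cv, holdout = split_dataset(list(range(100)))
```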

Class Attribute:

I decided that the attribute “class” would be the class value, since this paper aims to classify a user’s activity and posture. The class attribute can be sitting down, standing, walking, etc. The class value was hand-labelled by the researchers to record the activity of the user [3].

In order to gain insights about the data I visualized the distribution of the class values in the development data.

The distribution of the class value in the development data is as follows: 


Feature Space Representation and Data Insights:

In order to better understand what performance and accuracy I should expect, I ran the following machine learning algorithms on the development data using WEKA to classify the user’s activity.

The performance using 10-fold cross validation was:

Reasons for using a J48 Tree as the Machine Learning algorithm

These results supported my initial hypothesis of using a J48 tree to classify instances. I believed a decision tree classifier would be best suited, as this is a “divide and conquer” classification problem. At each node, the J48 algorithm chooses the attribute that most effectively splits the samples into subsets of a class value [7]. In this dataset, the J48 tree would choose the attribute that most effectively splits the sample into sitting, sitting down, standing, standing up or walking.

The J48 algorithm is used in this work for the following reasons:

1.    The algorithm recursively traverses the tree until a leaf node is reached. Since at each stage J48 chooses the split that leads to the largest information gain, the possible consequences of each decision are considered [8].

2.    Since the tree is an in-memory classification model, the computation cost is low. As we are trying to create an efficient and optimized classifier, this is ideal [8].

3.    J48 handles both numeric and nominal input attributes. Since this dataset consists of nominal attributes like gender and numeric attributes such as sensor measurements, J48 is well suited [7].

Feature Selection:

Since this method relies largely on which attributes are included in the dataset, I conducted experiments on the development data in WEKA to understand the importance of each attribute. This information was then used to create an accurate predictive model. Since the J48 decision tree is used in this work, the InfoGainAttributeEval evaluator was used to score each attribute, with Ranker as the search method. The results of the experiment are summarized in the table below:

The results indicate that “user” ranks highly when classifying an activity. This could lead to overfitting, since the algorithm would learn a model that fits a particular user too well. For example, if in the training set user 1 only had instances of “sitting”, the model might consider it highly likely that user 1 is always sitting, even though this holds only in the training data. Furthermore, if “user” is highly weighted and the model is built for a particular set of users, it may perform poorly (or fail entirely) when predicting the activity of a new user. The attribute “user” therefore introduces overfitting and bias, and provides no information relevant to the classification problem, so I removed it using WEKA. Removing this attribute had the following advantages:

1.    A reduced amount of complexity in the model built. This would lead to the creation of an efficient, low latency and space efficient model.

2.    Improved accuracy of the predictive model.
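WEKA’s InfoGainAttributeEval scores attributes by information gain, the same criterion J48 splits on. A small pure-Python sketch of the measure, and of why an attribute that identifies the user perfectly ranks so high:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of the class distribution, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(values, labels):
    """Information gain of a nominal attribute: H(class) minus the
    weighted entropy of the class within each attribute value."""
    n = len(labels)
    by_value = {}
    for v, l in zip(values, labels):
        by_value.setdefault(v, []).append(l)
    remainder = sum(len(ls) / n * entropy(ls) for ls in by_value.values())
    return entropy(labels) - remainder

labels = ["sitting", "sitting", "walking", "walking"]
# 'user' predicts the class perfectly here, so its gain is the full class entropy
gain_user = info_gain(["u1", "u1", "u2", "u2"], labels)
# an uninformative attribute gains nothing
gain_noise = info_gain(["a", "b", "a", "b"], labels)
```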

Baseline Performance and Error Analysis

This section discusses the baseline analysis that was conducted using WEKA. The models that were built and used in this section were trained on the training dataset and the models were then used to classify instances of the test dataset. As discussed in the earlier sections, the J48 algorithm will be used to build the model. The settings used to build the baseline model were the default WEKA settings listed below.

Baseline Performance: 
Accuracy: 0.8011
Kappa: 0.7902
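For reference, the two baseline metrics can be computed from a list of predictions as follows; the toy labels are illustrative:

```python
from collections import Counter

def accuracy_and_kappa(actual, predicted):
    """Accuracy (percent correct) and Cohen's kappa. Kappa corrects
    accuracy for chance agreement: kappa = (p_o - p_e) / (1 - p_e)."""
    n = len(actual)
    p_o = sum(a == p for a, p in zip(actual, predicted)) / n
    act, pred = Counter(actual), Counter(predicted)
    p_e = sum(act[c] * pred[c] for c in act) / (n * n)   # chance agreement
    return p_o, (p_o - p_e) / (1 - p_e)

actual    = ["sitting", "sitting", "walking", "standing"]
predicted = ["sitting", "walking", "walking", "standing"]
acc, kappa = accuracy_and_kappa(actual, predicted)
```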
 
  

Error Analysis Process:

The stages of this process are summarized below

In order to gain a better understanding of incorrectly classified instances I created a .csv file in the predict labels section of WEKA. This file enabled me to see instances along with their actual classes and predicted/classified classes by the baseline model. I proceeded to analyze instances where the predictions were incorrect.

Confusion Matrix: 

Process of Identifying Problematic features (This stage was conducted in the Explore Results Tab of Weka):

1.    Checked for high frequency: a high-frequency feature appears most often in incorrectly classified instances, so fixing it would most significantly improve the model.

2.    Checked the horizontal absolute difference (sorting by it) to see how often a feature’s values differed from those of the class it was incorrectly classified as. This compared the 5,017 instances that were incorrectly classified as positive to the 24,844 correctly classified negative instances.

3.    Vertical absolute difference: since the results being explored were positives predicted as negatives (vertically separated on the confusion matrix), it was important to find features with a substantial vertical absolute difference. To do this I sorted the feature table by “Vertical Absolute Difference”.

4.    Checked the feature weight to understand whether the model was treating the feature as positive or negative. Selecting a feature with a sufficiently large negative or positive weight would be useful.
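The horizontal-difference check in step 2 can be sketched for numeric features; the toy rows below (height in metres, weight in kilograms) are illustrative:

```python
def mean_abs_feature_diff(misclassified, correct, feature_names):
    """Horizontal absolute difference, roughly: for each feature, compare
    its mean over instances wrongly classified as class B with its mean
    over instances correctly classified as B. Large gaps flag features
    the model may be mis-weighting."""
    def col_mean(rows, i):
        return sum(r[i] for r in rows) / len(rows)
    return {name: abs(col_mean(misclassified, i) - col_mean(correct, i))
            for i, name in enumerate(feature_names)}

wrong = [(1.80, 60.0), (1.75, 62.0)]   # actually standing, predicted sitting
right = [(1.60, 80.0), (1.65, 78.0)]   # correctly predicted sitting
diffs = mean_abs_feature_diff(wrong, right, ["height", "weight"])
```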

For Horizontal Difference

Features were sorted by Horizontal difference to get the highest difference on top. Average cell value was used since most of these features were numerical.

“Height” was a significant problem: it had a high frequency (329), a substantial horizontal absolute difference (0.2855), a feature weight of 0.822 and an average cell value of 0.3293. The positive feature weight indicates that the baseline model treated height as an indicator of class = sitting. We are, however, considering instances that were classified as sitting but are actually standing, so this could be causing errors during classification. Height represents the original height of the user and remains constant across all of a user’s instances. The model was confused by extreme values of height, i.e. when the user was notably shorter or taller than average.

For Vertical Difference

Features were sorted by Vertical Absolute Difference to get the lowest difference on top. Average cell value was used since most of these features were numerical.

Weight was a significant problem since:

It had a high frequency (429), a low vertical absolute difference (0.018), a high-magnitude feature weight (-0.2634) and an average cell value of 0.0593.

Weight was measured in kilograms and remained constant across instances for a particular user.

The model predicted class = sitting down when class = sitting several times. However, the baseline model did not consider how a user’s weight related to the x1, y1, x2, y2 coordinates. Depending on a user’s weight, there is a specific configuration of x1, x2, y1, y2 that determines the angle best associated with sitting, and this angle calculation depends on the weight.

The problem with correctly considering weight: when the model incorrectly classified a user as sitting down instead of sitting, the user’s weight did not correspond correctly to the body mass index. This feature is important to consider because, if the model ever has to classify users at the extremes of the weight spectrum (approaching obesity or anorexia), the body mass index may not be within the expected range either, since the user’s center of gravity may not be in the expected location [9].

Improvements:

a.    Restructuring of data by adding a feature of (Height – y4):

This considers the difference between the user’s actual height and the current measured height. y4 is the height measured along the y-axis of the accelerometer mounted on the user’s right upper arm. Considering this difference helps determine whether a user is standing or sitting: if sitting, the height difference should be > 0; if standing, the difference should approach 0.

b.    Restructuring of data by adding feature (x1-x2)

This feature helps to better place the center of gravity of a user. x1 is the measurement along the x-axis of the accelerometer around the user’s waist; x2 is the measurement along the x-axis of the accelerometer around the user’s upper left thigh. This feature helps particularly for users whose weight and body mass index are in the extreme range (very low or very high). By considering it, the model gets a better understanding of the location of a user’s center of gravity.

c.    Restructuring of data to calculate plunge angle:

This feature is represented in the figure below: 

This feature is the angle between z4, the measurement along the z-axis of the sensor mounted on the user’s right arm, and z2, the measurement along the z-axis of the sensor mounted on the user’s left thigh. The plunge angle is useful for understanding a user’s posture: poor posture corresponds to acute (lower) angles, while good posture approaches 90 degrees. The angle was calculated using the following equation [10]:

d.    Standardization of Weight

Weight is a numerical attribute; for it to be considered accurately, it is useful to rescale it into the range (0, 1]. To do this, I created a new column in the test set called WeightScaled: I took the maximum value of weight in the development dataset and, for each row, divided the weight by MaxWeight to fill in WeightScaled. This helps with attributes that have extreme values and helps the model better distinguish whether a user is standing, sitting or walking.
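The four restructuring steps (a)-(d) can be sketched together. The dictionary keys, and the use of the standard dip formula atan(dz / horizontal distance) for the plunge angle, are assumptions standing in for the paper’s own figure and equation [10].

```python
import math

def engineer_features(row, max_weight):
    """Sketch of restructuring steps (a)-(d). Keys like 'y4' (arm-sensor
    y-axis, here treated as a height, as in the paper) are assumed names."""
    height_diff = row["height"] - row["y4"]          # (a) ~0 standing, > 0 sitting
    x_offset = row["x1"] - row["x2"]                 # (b) center-of-gravity proxy
    horizontal = math.hypot(row["x1"] - row["x2"],   # (c) plunge angle, degrees,
                            row["y1"] - row["y2"])   #     via the dip formula
    plunge = math.degrees(math.atan2(abs(row["z4"] - row["z2"]), horizontal))
    weight_scaled = row["weight"] / max_weight       # (d) rescale into (0, 1]
    return height_diff, x_offset, plunge, weight_scaled

row = {"height": 1.70, "y4": 1.70, "x1": 3, "x2": 0,
       "y1": 4, "y2": 0, "z2": 5, "z4": 5, "weight": 50}
features = engineer_features(row, max_weight=100)
```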

After these changes were made the new model performance was:

New Metrics:
Accuracy: 0.8984
Kappa: 0.88

To check for improvement, I used the compare models tab. I compared the accuracy and kappa of the baseline model and the new model with the improvements discussed above; the new model had higher values of both. I also compared the confusion matrices: the new model made fewer errors when classifying class = sittingdown and class = standingup, indicating improved performance in classifying a user’s posture. To test whether this was significant, I examined the difference matrix to see where the new model correctly classified instances that the baseline one did not, and to check whether my proposed changes had a significant impact on how the classification was being done. The p-value was 0.045, indicating a significant improvement over the baseline performance.

Parameter Tuning

Once I was able to achieve a significant improvement with my new model, the next step was to find a parametric value in the J48 algorithm that would result in the best performance for my dataset. In order to correctly and efficiently execute this process I did the following: 

I decided to tune confidenceFactor, a numeric value that the J48 algorithm uses during pruning [9]; smaller values of confidenceFactor mean more pruning. Pruning is a technique that reduces the size of the decision tree model by removing sections of the tree that provide little power to classify instances, which improves accuracy on the test set by reducing overfitting. From the development data, it seemed that modifying confidenceFactor could produce smoother, more defined boundaries between classes. This is important since at each node the J48 algorithm considers which decision will lead to the largest information gain.

Default value of confidenceFactor = 0.25

Settings Considered

(confidenceFactor = 0.15), (confidenceFactor = 0.25), (confidenceFactor = 0.30), (confidenceFactor = 0.35)
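The sweep itself is a simple argmax over cross-validated scores. The scoring callable (e.g. a wrapper that runs WEKA’s J48 with each confidenceFactor over the folds) is assumed, and the scores below are illustrative, not the paper’s measurements.

```python
def tune_parameter(values, cross_val_accuracy):
    """Return the parameter value with the best mean cross-validation
    accuracy, plus that score."""
    best_score, best_value = max((cross_val_accuracy(v), v) for v in values)
    return best_value, best_score

# stand-in fold-averaged accuracies for each confidenceFactor setting
fake_scores = {0.15: 0.8810, 0.25: 0.8984, 0.30: 0.9105, 0.35: 0.9020}
best, score = tune_parameter([0.15, 0.25, 0.30, 0.35], fake_scores.__getitem__)
```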

I then performed a comparison of the baseline model, built using J48 default settings, and the optimized J48 model. A t-test was conducted to compare the performance of the two models across the 5 folds. The test gave p-value = 0.0421, which is lower than 0.05, indicating that the tuning had a significant impact on improving the baseline performance. The Experimenter in WEKA also indicated that the kappa statistics were statistically different.
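The fold-wise comparison can be sketched as a paired t statistic over per-fold accuracies; the fold scores below are illustrative, not the paper’s.

```python
import math

def paired_t_statistic(scores_a, scores_b):
    """Paired t statistic over per-fold accuracies; compare against the
    t distribution with n-1 degrees of freedom for a p-value."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)   # sample variance
    return mean / math.sqrt(var / n)

baseline = [0.79, 0.80, 0.81, 0.80, 0.79]   # per-fold accuracies (illustrative)
tuned    = [0.90, 0.91, 0.89, 0.92, 0.90]
t = paired_t_statistic(tuned, baseline)
```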

Final Result

To test the new model and generate a final result, the training and test datasets used earlier were combined to form a single training set, which was used to train the new model. The test set used in this section is the final holdout set, which had remained unused and unseen until this point.

The model was created using all the tuned settings, and features from the work done in this paper.

Performance metrics on the final test/holdout set:

Performance of baseline model: 

Accuracy 0.7811
Kappa 0.7612

Performance of new model, with confidenceFactor = 0.30
Accuracy 0.9212
Kappa 0.9016

The new model correctly classified 30,516 instances. This final performance was compared to that of the initial baseline model built at the beginning of this paper, using the numbers of correctly classified instances.

It was determined that, p-value = 0.04011, indicating that there was a statistically significant improvement from the performance of the baseline model.

Conclusion

This paper has explored using an applied machine learning approach to an important classification problem. This work used several machine learning tools, algorithms and techniques to classify a user’s activity and posture. The sequence of tools and techniques used included: Data Pre-processing, Data preparation, feature space analysis, feature selection, baseline performance, error analysis, parameter tuning and model testing/comparison.

The main contributions of this paper are:

·     An analysis of HAR and ADL

·     An evaluation of machine learning algorithms for activity recognition

·     A method of error analysis that can be used to improve models built using this dataset

·     A feature space that can be used to predict/classify human activity

·     The best settings for confidenceFactor when building a J48 tree for this dataset or any dataset for predicting human activity

·     A final tuned, efficient, low latency, accurate model for this dataset/type of data

A limitation of this paper is that the data was not collected by me but by a research group in Brazil in 2013. Although extensive documentation accompanies the dataset, collecting the data myself would have given me a much better understanding of it and of how accurate it really was, and I could have ensured that any accelerometers, sensors or CPUs used were precise and error-free. Future work in this field could include newer and more extensive classes in the data, with different classifiers’ performances then evaluated. With the inclusion of new classes, a classifier that qualitatively recognizes activities could be created, so that activities like exercising, swimming and weight lifting could be recognized with this level of accuracy, precision and efficiency. Although wearable devices with the ability to recognize these types of activities already exist, several improvements need to be made to create a classifier that is highly accurate, precise and efficient.

References:

[1] “Wearable Devices Market Global Industry Demand, Size, Growth Research Report By 2022.” MarketWatch, MarketWatch, 12 Apr. 2018, www.marketwatch.com/press-release/wearable-devices-market-global-industry-demand-size-growth-research-report-by-2022-2018-04-12.

[2] Chen, Pang, and G. Laguna. “Final Report for LDRD Project Learning Efficient Hypermedia Navigation.” 1997, doi:10.2172/532659.

[3] Wallace Ugulino, et al. “Wearable Computing: Accelerometers’ Data Classification of Body Postures and Movements.” Brazilian Symposium on Artificial Intelligence. , vol. 21, 10 June 2012.

[4] “Wearable Sensors Are The Future Of Personalized Medicine.” Orthogonal, 5 Sept. 2017, orthogonal.io/medical-softtware/wearable-sensors-are-the-future-of-personalized-medicine-html/.

[5] Xie, Junqing et al. “Evaluating the Validity of Current Mainstream Wearable Devices in Fitness Tracking Under Various Physical Activities: Comparative Study” JMIR mHealth and uHealth vol. 6,4 e94. 12 Apr. 2018, doi:10.2196/mhealth.9754

[6] UCI Machine Learning Repository: Data Set, archive.ics.uci.edu/ml/datasets/Wearable+Computing%3A+Classification+of+Body+Postures+and+Movements+(PUC-Rio).

[7] “Weka: Decision Trees - J48.” stp.lingfil.uu.se/~santinim/ml/2016/Lect_03/Lab02_DecisionTrees.pdf.

[8] Joshi, Parikshit. “When to Use Linear Regression, Clustering, or Decision Trees - DZone AI.” Dzone.com, 4 Oct. 2017, dzone.com/articles/decision-trees-vs-clustering-algorithms-vs-linear.

[9] Son, Sung Min. “Influence of Obesity on Postural Stability in Young Adults.” Osong Public Health and Research Perspectives, vol. 7, no. 6 (2016): 378-381.

[10] “How to Calculate Distance, Azimuth and Dip from Two XYZ Coordinates.” Geographic Information Systems Stack Exchange, gis.stackexchange.com/questions/108547/how-to-calculate-distance-azimuth-and-dip-from-two-xyz-coordinates.





















