Machine Learning with SAP Data
Python Notebook in IBM Data Science Experience

Data scientists aim to turn raw data into insight. They use machine learning techniques to learn from experience: previously unknown data records are assigned to categories using classification, or output values are predicted for the data records using regression. This article explains how data scientists can take advantage of IBM Data Science Experience for machine learning with SAP data.

Overview

IBM Data Science Experience (DSX) is an award-winning Cloud-based platform that provides the typical tools used by data scientists, like Jupyter notebooks, and that can connect to various data sources in the Cloud or on premise. It supports collaborative work, so multiple data scientists can work on the same notebooks. DSX runs on top of IBM's Cloud platform, formerly known as IBM Bluemix, which implements Cloud Foundry, and lets you accomplish data science tasks from a web browser. It uses Apache Spark as its compute engine for large-scale parallel data processing. Python, Scala and R are the typical languages for working interactively in Jupyter notebooks, which contain live code and data visualization.

The data that is used for training and for classification can already exist in DSX or can originate in an external data service like Amazon Redshift, Cloudant, Hortonworks, Teradata or Db2. This blog explains how you can securely connect DSX to an on-premise Db2 for z/OS database via a secure gateway.

Using IBM Data Science Experience for Modeling and Scoring

In the following, SAP ERP data from Commodity Management is used as an example to explain the methodology. Before you create a new notebook in DSX, make sure to create a Spark service and an object storage in IBM Cloud. Then a notebook can be created, for example with Python as the programming language. You have access to the entire notebook that we discuss in this article.

As a data asset, the ERP data is provided as a CSV file. This file was populated with the relevant columns from the SAP table or view (view ILOGP in this case). We will also see how to directly access the source database using SQL.

Reading data into Spark

After opening the notebook, you need to prepare the Python environment and import the relevant Python packages. Execute the following commands exactly once to install the packages that are not yet available in your environment:

  • !pip install plotly --user 
  • !pip install cufflinks==0.8.2 --user

Then import the Spark and plotting packages. The repository packages will be needed to persist machine learning models into IBM Cloud. 
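The exact import list depends on which cells you run, but a typical import cell for the steps below might look like this sketch (the Watson ML repository imports are shown later in the persistence section):

    # Spark session, DataFrame schema types and ML pipeline building blocks
    from pyspark.sql import SparkSession
    from pyspark.sql.types import StructType, StructField, StringType, IntegerType
    from pyspark.ml import Pipeline
    from pyspark.ml.feature import StringIndexer, VectorAssembler
    from pyspark.ml.classification import RandomForestClassifier

    # Plotting packages -- offline mode renders the charts directly in the notebook
    import plotly.graph_objs as go
    from plotly.offline import init_notebook_mode, iplot
    import cufflinks as cf

    init_notebook_mode(connected=True)   # enable offline plotting in the notebook
    cf.go_offline()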

Now you are ready to import the SAP data into Spark. After building a Spark session, read the data directly from the original table or from the previously uploaded CSV file. For the CSV file you do not need to create the credentials information yourself: click the '1001' button and select 'Insert Credentials' from the 'Insert to code' pulldown. If you prefer to directly access the source tables via SQL SELECT, use the previously defined connection to that database (see this blog) and execute the SELECT statement. The fetched data rows are then passed into a Spark DataFrame. To understand the data model of the relevant SAP tables, you can take advantage of the SQL views and table UDFs that are created as part of SAP Core Data Services (CDS). Another way to read data from SAP is to expose it as an OData service in SAP Gateway; in Python, the Pyslet module provides an OData client.
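As a minimal sketch, reading the uploaded CSV file into a Spark DataFrame might look like this; the file path is a placeholder for the object-storage location that the generated credentials cell points to.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # 'path_to_csv' is a placeholder for the object-storage location of the
    # uploaded ILOGP extract (the 'Insert to code' helper generates the access code).
    df_ilogp = (spark.read
                .option("header", "true")       # first row contains the column names
                .option("inferSchema", "true")  # let Spark derive the column types
                .csv(path_to_csv))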

Note that the format of the column fields is also passed to the createDataFrame method. Using the printSchema() and count() methods of the DataFrame, you can retrieve its schema and the number of records it contains. To learn more about DataFrames in the Python Spark module, check out this site.
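If the rows were instead fetched with a SQL client over the secure gateway, they can be converted into a Spark DataFrame with an explicit schema. The column names below are illustrative placeholders, not the real ILOGP field list:

    from pyspark.sql.types import StructType, StructField, StringType, IntegerType

    # Illustrative schema -- the real ILOGP columns differ.
    schema = StructType([
        StructField("PORTFOLIO", StringType(),  True),
        StructField("QUANTITY",  IntegerType(), True),
        StructField("PRIORITY",  IntegerType(), True),
        StructField("STATUS",    StringType(),  True),   # target field of the classification
    ])

    # 'fetched_rows' stands for the list of tuples returned by the SQL SELECT.
    df_ilogp = spark.createDataFrame(fetched_rows, schema)

    df_ilogp.printSchema()                             # column names and types
    print("Number of records: %d" % df_ilogp.count())  # size of the DataFrame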

Build model using training data

Now that the data is in the Spark structure, you can use it for training to build your classification model. For this purpose the data records are split into disjoint sets of training, testing and prediction records. The number of training records needs to be sufficiently large.
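A sketch of the split using Spark's randomSplit; the 80/18/2 ratio is only an example, not necessarily the ratio used in the original notebook.

    # Split the records into disjoint training, test and prediction sets.
    train_data, test_data, predict_data = df_ilogp.randomSplit([0.8, 0.18, 0.02], seed=24)

    print("Training records:   %d" % train_data.count())
    print("Test records:       %d" % test_data.count())
    print("Prediction records: %d" % predict_data.count())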

The actual training is performed in a pipeline that executes several stages. These stages consist of transformers and estimators that first need to be prepared. StringIndexer stages are used to specify the target field of the classification (outputCol="label") and to map the values of the input fields to numeric values. All numeric fields are then combined into a vector named "features". Random forest classification, a decision tree technique, is used as the machine learning algorithm. The defined transformers and estimators are passed into the pipeline, which is executed once its fit method is called with the training data as input parameter. Other machine learning algorithms like linear regression or SVM and other machine learning libraries like TensorFlow can also be used. Note, however, that TensorFlow models cannot be persisted in the Machine Learning service (see later).
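Put together, the pipeline stages might look like the sketch below. The input column names follow the illustrative schema above; the real notebook uses the actual ILOGP fields.

    from pyspark.ml import Pipeline
    from pyspark.ml.feature import StringIndexer, VectorAssembler
    from pyspark.ml.classification import RandomForestClassifier

    # Map the target field and the categorical input field to numeric indices.
    label_indexer     = StringIndexer(inputCol="STATUS",    outputCol="label")
    portfolio_indexer = StringIndexer(inputCol="PORTFOLIO", outputCol="PORTFOLIO_IX")

    # Combine all numeric fields into a single vector named "features".
    assembler = VectorAssembler(inputCols=["PORTFOLIO_IX", "QUANTITY", "PRIORITY"],
                                outputCol="features")

    # Random forest classifier as the estimator stage of the pipeline.
    rf = RandomForestClassifier(labelCol="label", featuresCol="features", numTrees=20)

    pipeline = Pipeline(stages=[label_indexer, portfolio_indexer, assembler, rf])
    model_rf = pipeline.fit(train_data)   # the training itself happens here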

Evaluate model using test data

The variable model_rf contains the model that has been built from the training data. By calling the transform method with the test data, you can evaluate how well the model works. You should use cross-validation data for this purpose to avoid overfitting to the training data. If you are happy with the result, the model can be persisted in the Machine Learning service so that it can later be used for scoring.
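Evaluating the model on the test set could look like this sketch; MulticlassClassificationEvaluator is used here as one simple metric, while the original notebook may evaluate differently.

    from pyspark.ml.evaluation import MulticlassClassificationEvaluator

    # Apply the trained pipeline model to the held-out test records.
    predictions = model_rf.transform(test_data)
    predictions.select("label", "prediction", "probability").show(5)

    # One simple quality metric; CrossValidator can be added to tune against overfitting.
    evaluator = MulticlassClassificationEvaluator(labelCol="label", predictionCol="prediction",
                                                  metricName="accuracy")
    print("Accuracy on test data: %.3f" % evaluator.evaluate(predictions))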

Persist model in IBM Cloud

Classification models from Spark pipelines and from scikit-learn pipelines can be stored in the IBM Cloud Machine Learning service. The credentials of the service need to be provided to the machine learning repository client. The following figure shows the Machine Learning service and how its credentials are used in Python to persist the classification model.
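With the repository client that DSX notebooks shipped at the time, persisting the pipeline model might look roughly like the following; the module names and credential keys come from the legacy Watson Machine Learning repository client and may differ in newer versions of the service.

    # Legacy Watson ML repository client -- module names may differ in newer service versions.
    from repository.mlrepositoryclient import MLRepositoryClient
    from repository.mlrepositoryartifact import MLRepositoryArtifact

    # 'wml_credentials' holds the credentials of the Machine Learning service in IBM Cloud.
    ml_repository_client = MLRepositoryClient(wml_credentials['url'])
    ml_repository_client.authorize(wml_credentials['username'], wml_credentials['password'])

    # Wrap the trained Spark pipeline model and save it to the repository.
    model_artifact = MLRepositoryArtifact(model_rf, training_data=train_data,
                                          name="SAP Commodity Management classification")
    saved_model = ml_repository_client.models.save(model_artifact)
    print(saved_model.uid)   # identifier of the persisted model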

Visualize predictions

To visualize the predictions that the model produced for the test data, the Plotly Python library is used. It displays the following diagrams directly in the notebook.
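As one small Plotly sketch, the chart below counts how many test records fall into each predicted class; it reuses the 'predictions' DataFrame from the evaluation step.

    import plotly.graph_objs as go
    from plotly.offline import iplot

    # Count the records per predicted class and bring the result to the driver.
    pred_counts = predictions.groupBy("prediction").count().toPandas()

    figure = go.Figure(
        data=[go.Bar(x=pred_counts["prediction"], y=pred_counts["count"])],
        layout=go.Layout(title="Predicted classes for the test data",
                         xaxis=dict(title="predicted class"),
                         yaxis=dict(title="number of records")))
    iplot(figure)   # renders the bar chart directly in the notebook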

Create online scoring endpoint

As a last step, an online endpoint is created for the model. This REST API can then be used by the intended consumer of the machine learning service. An access token needs to be generated, which the consumer provides when invoking the API. The following figure shows how the REST endpoint is created.

For illustration purposes, the REST API is invoked with two data records as input: (MSY0, 2, 1) and (CSY0, 1, 1). The REST API returns the classification result based on the model that had been created and published.
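From Python, invoking the scoring endpoint could look roughly like the sketch below. The token URL and the payload layout follow the Watson Machine Learning REST API of that time and should be treated as assumptions, as should the field names, which are the illustrative ones used above; 'scoring_url' stands for the endpoint created for the deployment.

    import json
    import requests

    # Obtain a bearer token from the Watson ML service (URL of the legacy API -- an assumption here).
    token_response = requests.get('https://ibm-watson-ml.mybluemix.net/v3/identity/token',
                                  auth=(wml_credentials['username'], wml_credentials['password']))
    ml_token = 'Bearer ' + json.loads(token_response.text)['token']

    # Score the two example records; field names are illustrative placeholders.
    payload = {"fields": ["PORTFOLIO", "QUANTITY", "PRIORITY"],
               "values": [["MSY0", 2, 1], ["CSY0", 1, 1]]}

    response = requests.post(scoring_url, json=payload,
                             headers={'Authorization': ml_token})
    print(response.json())   # contains the predicted class for each input record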

SAP ABAP applications can invoke the scoring endpoint via a REST client service. For guidance on how to define a REST client on the SAP app server and how to invoke a REST API in the ABAP program via class CL_HTTP_CLIENT, please refer to this IBM developerWorks article. Note that the SSL certificate of the Watson machine learning URL https://ibm-watson-ml.mybluemix.net needs to be configured in SAP transaction STRUST.

