Bayesian Networks with Continuous Distributions - Regression model to describe wine quality
Bayesian networks (BNs), also known as belief networks (or Bayes nets for short), belong to the family of probabilistic graphical models (GMs). These graphical structures are used to represent knowledge about an uncertain domain. In particular, each node in the graph represents a random variable, while the edges between the nodes represent probabilistic dependencies among the corresponding random variables. These conditional dependencies in the graph are often estimated by using known statistical and computational methods. Hence, BNs combine principles from graph theory, probability theory, computer science, and statistics (Ben-Gal I., "Bayesian Networks", in Ruggeri F., Faltin F. & Kenett R. (eds.), Encyclopedia of Statistics in Quality & Reliability, Wiley & Sons, 2007).
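In practical terms, a BN over variables X1, …, Xn encodes the joint distribution as the product P(X1, …, Xn) = P(X1 | parents(X1)) × … × P(Xn | parents(Xn)), so each node only needs a conditional distribution given its parents in the DAG. In the Gaussian networks used below, each of these conditionals is a linear regression of the node on its parents, which is what makes the "regression model" of the title possible.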
Figure 1 - Example of a Bayesian network with belief propagation. Source: https://pr-owl.org/basics/bn.php
Data Set Information
The two datasets are related to red and white variants of the Portuguese "Vinho Verde" wine. For more details, consult: [Web Link] or the reference [Cortez et al., 2009]. Due to privacy and logistic issues, only physicochemical (inputs) and sensory (the output) variables are available (e.g. there is no data about grape types, wine brand, wine selling price, etc.).
These datasets can be viewed as classification or regression tasks. The classes are ordered and not balanced (e.g. there are many more normal wines than excellent or poor ones). Outlier detection algorithms could be used to detect the few excellent or poor wines. Also, we are not sure if all input variables are relevant, so it could be interesting to test feature selection methods. Source: https://archive.ics.uci.edu/ml/datasets/Wine+Quality
For more information, read [Cortez et al., 2009].
Input variables (based on physicochemical tests):
1 - fixed acidity
2 - volatile acidity
3 - citric acid
4 - residual sugar
5 - chlorides
6 - free sulfur dioxide
7 - total sulfur dioxide
8 - density
9 - pH
10 - sulphates
11 - alcohol
Output variable (based on sensory data):
12 - quality (score between 0 and 10)
Import packages
Import the required packages: bnlearn for structure and parameter learning, bnviewer for interactive network visualization, and StatMeasures for the MAPE accuracy metric.
library("bnlearn")
library("bnviewer")
library("StatMeasures")
Read Wine Data Set
Read the wine dataset from the CSV file and force the quality variable to be numeric, since bnlearn's Gaussian networks expect continuous (numeric) variables rather than integers.
df_wine = read.csv("winequality-red.csv", sep = ";")
df_wine$quality = as.numeric(df_wine$quality)
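Before learning a structure, it is worth confirming that every column really is numeric, which is why quality was coerced above. A minimal sanity check:
# All columns must be numeric for Gaussian structure learning;
# integer or factor columns would be rejected
stopifnot(all(sapply(df_wine, is.numeric)))
str(df_wine)  # expect 12 numeric variables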
Structure Learning for Bayesian Networks
The task of structure learning for Bayesian networks refers to learning the structure of the directed acyclic graph (DAG) from data. There are two major approaches: score-based methods, which search the space of DAGs for the structure that maximizes a network score such as BIC, and constraint-based methods, which build the graph from conditional independence tests, as sketched below.
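As a quick illustration of the two families, the sketch below learns one structure with each approach and scores both against the data (pc.stable is just one representative constraint-based algorithm in bnlearn; any of gs, iamb, etc. would do):
# Score-based: greedy hill climbing maximizing the Gaussian BIC
dag.score <- bnlearn::hc(df_wine)
# Constraint-based: PC-stable, built from conditional independence tests
dag.constraint <- bnlearn::pc.stable(df_wine)
# Compare fits; cextend() picks a consistent DAG extension in case
# pc.stable leaves some arcs undirected
bnlearn::score(dag.score, data = df_wine, type = "bic-g")
bnlearn::score(bnlearn::cextend(dag.constraint), data = df_wine, type = "bic-g")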
Automatic Structure Learning - Hill Climbing
# Learn the DAG by greedy hill climbing (score-based search)
bn.learn.wine = bnlearn::hc(df_wine)
# Drop the learned quality -> sulphates arc, so that quality
# remains a sink node (an outcome with no children)
bn.learn.wine = drop.arc(bn.learn.wine, from = "quality", to = "sulphates")
bn.learn.wine
Output - Hill Climbing
Bayesian network learned via Score-based methods
model:
[fixed.acidity][chlorides|fixed.acidity][alcohol|chlorides]
[free.sulfur.dioxide|fixed.acidity:alcohol]
[total.sulfur.dioxide|free.sulfur.dioxide:alcohol]
[citric.acid|fixed.acidity:chlorides:free.sulfur.dioxide:total.sulfur.dioxide:alcohol]
[residual.sugar|fixed.acidity:free.sulfur.dioxide:total.sulfur.dioxide:alcohol]
[pH|fixed.acidity:citric.acid:chlorides:free.sulfur.dioxide:total.sulfur.dioxide:alcohol]
[density|fixed.acidity:residual.sugar:chlorides:pH:alcohol]
[volatile.acidity|fixed.acidity:citric.acid:chlorides:free.sulfur.dioxide:total.sulfur.dioxide:density:pH]
[sulphates|fixed.acidity:volatile.acidity:residual.sugar:chlorides:total.sulfur.dioxide:density:pH:alcohol]
[quality|volatile.acidity:total.sulfur.dioxide:pH:alcohol]
nodes: 12
arcs: 45
undirected arcs: 0
directed arcs: 45
average markov blanket size: 9.33
average neighbourhood size: 7.50
average branching factor: 3.75
learning algorithm: Hill-Climbing
score: BIC (Gauss.)
penalization coefficient: 3.688567
tests used in the learning procedure: 572
optimized: TRUE
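To see how much each learned arc contributes to the fit, and to sanity-check manual prunings such as dropping quality -> sulphates, bnlearn can measure arc strengths; a short sketch with the Gaussian BIC criterion:
# Strength = change in the network score if the arc is removed;
# more negative values mean the score depends on the arc more strongly
strength <- bnlearn::arc.strength(bn.learn.wine, data = df_wine, criterion = "bic-g")
head(strength[order(strength$strength), ])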
Visualization - Bayesian Network
Visualization of the structure learned from the data with the Hill-Climbing algorithm, using the bnviewer package for interactive visualization of Bayesian networks.
viewer(bn.learn.wine,
bayesianNetwork.width = "100%",
bayesianNetwork.height = "100vh",
bayesianNetwork.layout = "layout_on_grid",
bayesianNetwork.title="<br><span style='font-size:18px;
font-family:Arial;
color:black;
text-align:center;'>
Bayesian Networks
with Continuous Distributions -
Wine Dataset</span>",
bayesianNetwork.subtitle = "<span style='font-size:15px;
font-family:Arial;
color:black;
text-align:center;'>Automatic
Structure Learning -
HC (Hill Climbing)</span>",
node.colors = list(background = "white",
border = "black",
highlight = list(background = "#e91eba",
border = "black")),
node.font = list(color = "black", face="Arial"),
clusters.legend.title = list(text = "Legend",
style = "font-size:18px;
font-family:Arial;
color:black;
text-align:center;"),
clusters.legend.options = list(
list(label = "Quality",
shape = "icon",
icon = list(code = "f1ce", size = 50, color = "#e91e63")),
list(label = "Acid",
shape = "icon",
icon = list(code = "f140", size = 50, color = "#03a9f4")),
list(label = "Sugar",
shape = "icon",
icon = list(code = "f192", size = 50, color = "#4caf50")),
list(label = "Sulfur Dioxide",
shape = "icon",
icon = list(code = "f10c", size = 50, color = "#ffc107")),
list(label = "Alcohol",
shape = "icon",
icon = list(code = "f043", size = 50, color = "#03a9f4"))
),
clusters = list(
list(label = "Quality",
shape = "icon",
icon = list(code = "f1ce", color = "#e91e63"),
nodes = list("quality")),
list(label = "Acid",
shape = "icon",
icon = list(code = "f140", color = "#03a9f4"),
nodes = list("fixed.acidity","citric.acid","volatile.acidity","pH")),
list(label = "Sugar",
shape = "icon",
icon = list(code = "f192", color = "#4caf50"),
nodes = list("residual.sugar")),
list(label = "Sulfur Dioxide",
shape = "icon",
icon = list(code = "f10c", color = "#ffc107"),
nodes = list("total.sulfur.dioxide","free.sulfur.dioxide")),
list(label = "Alcohol",
shape = "icon",
icon = list(code = "f043", color = "#03a9f4"),
nodes = list("alcohol"))
)
)
Output Visualization
Manual Structure Learning
model = "quality <- (residual.sugar, chlorides, free.sulfur.dioxide, citric.acid, sulphates, alcohol); free.sulfur.dioxide <- (total.sulfur.dioxide); citric.acid <- (pH); citric.acid <- (fixed.acidity); fixed.acidity <- (volatile.acidity); fixed.acidity <- (density);" bn.manual.wine = model.to.structure(model) bn.manual.wine
Output - Manual Structure Learning
Random/Generated Bayesian network
model:
[residual.sugar][chlorides][sulphates][alcohol][total.sulfur.dioxide][pH]
[volatile.acidity][density][free.sulfur.dioxide|total.sulfur.dioxide]
[fixed.acidity|volatile.acidity:density][citric.acid|pH:fixed.acidity]
[quality|residual.sugar:chlorides:free.sulfur.dioxide:citric.acid:sulphates:alcohol]
nodes: 12
arcs: 11
undirected arcs: 0
directed arcs: 11
average markov blanket size: 4.67
average neighbourhood size: 1.83
average branching factor: 0.92
generation algorithm: Empty
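Since the manual structure and the Hill-Climbing structure are Gaussian DAGs over the same variables, their network scores are directly comparable; a quick sketch:
# Higher (less negative) BIC indicates a better fit to the data
bnlearn::score(bn.learn.wine, data = df_wine, type = "bic-g")
bnlearn::score(bn.manual.wine, data = df_wine, type = "bic-g")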
Visualization - Bayesian Network
Manual structure visualization using the bnviewer package for interactive viewing of Bayesian networks.
viewer(bn.manual.wine,
bayesianNetwork.width = "100%",
bayesianNetwork.height = "100vh",
bayesianNetwork.layout = "layout_hierarchical_direction_LR",
bayesianNetwork.title="<br><span style='font-size:18px;
font-family:Arial;
color:black;
text-align:center;'>
Bayesian Networks
with Continuous Distributions -
Wine Dataset</span>",
bayesianNetwork.subtitle = "<span style='font-size:15px;
font-family:Arial;
color:black;
text-align:center;'>Manual
Structure Learning</span>",
edges.smooth = FALSE,
node.colors = list(background = "white",
border = "black",
highlight = list(background = "#e91eba",
border = "black")),
node.font = list(color = "black", face="Arial"),
clusters.legend.title = list(text = "Legend",
style = "font-size:18px;
font-family:Arial;
color:black;
text-align:center;"),
clusters.legend.options = list(
list(label = "Quality",
shape = "icon",
icon = list(code = "f1ce", size = 50, color = "#e91e63")),
list(label = "Acid",
shape = "icon",
icon = list(code = "f140", size = 50, color = "#03a9f4")),
list(label = "Sugar",
shape = "icon",
icon = list(code = "f192", size = 50, color = "#4caf50")),
list(label = "Sulfur Dioxide",
shape = "icon",
icon = list(code = "f10c", size = 50, color = "#ffc107")),
list(label = "Alcohol",
shape = "icon",
icon = list(code = "f043", size = 50, color = "#03a9f4"))
),
clusters = list(
list(label = "Quality",
shape = "icon",
icon = list(code = "f1ce", color = "#e91e63"),
nodes = list("quality")),
list(label = "Acid",
shape = "icon",
icon = list(code = "f140", color = "#03a9f4"),
nodes = list("fixed.acidity","citric.acid","volatile.acidity","pH")),
list(label = "Sugar",
shape = "icon",
icon = list(code = "f192", color = "#4caf50"),
nodes = list("residual.sugar")),
list(label = "Sulfur Dioxide",
shape = "icon",
icon = list(code = "f10c", color = "#ffc107"),
nodes = list("total.sulfur.dioxide","free.sulfur.dioxide")),
list(label = "Alcohol",
shape = "icon",
icon = list(code = "f043", color = "#03a9f4"),
nodes = list("alcohol"))
)
)
Output Visualization
Split DataSet - Training and Test
Split the dataset into a training set with the first 1,000 samples and a test set with the following 500 samples.
training.set <- df_wine[1:1000, ]  # first 1,000 rows for training
test.set <- df_wine[1001:1500, ]   # next 500 rows for testing
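Note that the rows of the UCI file are not guaranteed to be in random order, so a sequential split can bias the evaluation. A minimal alternative using a random 1,000/500 split (the seed value is arbitrary):
set.seed(42)                               # arbitrary seed, for reproducibility
idx <- sample(nrow(df_wine), 1500)         # draw 1,500 distinct row indices
training.set <- df_wine[idx[1:1000], ]     # 1,000 rows for training
test.set <- df_wine[idx[1001:1500], ]      # 500 rows for testing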
Fit Bayesian Network
In statistics, maximum likelihood estimation (MLE) is a method of estimating the parameters of a probability distribution by maximizing a likelihood function, so that under the assumed statistical model the observed data is most probable. The point in the parameter space that maximizes the likelihood function is called the maximum likelihood estimate (Rossi, Richard J. (2018). Mathematical Statistics: An Introduction to Likelihood Based Inference).
The logic of maximum likelihood is both intuitive and flexible, and as such the method has become a dominant means of statistical inference (Ward, Michael Don; Ahlquist, John S. (2018). Maximum Likelihood for Social Science: Strategies for Analysis. New York: Cambridge University Press). For a Gaussian Bayesian network, MLE reduces to fitting an ordinary least-squares regression of each node on its parents.
bayesian.fit <- bn.fit(bn.learn.wine,
data = training.set,
method="mle")
Prediction Wine Quality
Prediction of wine quality on the test set, using the fitted Bayesian model.
bayesian.predict <- predict(bayesian.fit,
"quality",
test.set)
real <- test.set[, "quality"]  # observed quality scores
previsto <- bayesian.predict   # predicted scores ("previsto" is Portuguese for "predicted")
Model Accuracy
Accuracy assessment of the developed model.
mape <- mape(y = real, yhat = previsto)  # mean absolute percentage error (StatMeasures)
accuracy <- 100 - mape * 100             # accuracy as 100% minus the MAPE
accuracy
By this measure, the Bayesian network learned automatically with Hill Climbing reached an accuracy of 90.47% (100% minus the MAPE) on the test set.
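MAPE-based accuracy is only one view of regression quality; two complementary checks on the same vectors, using base R only:
rmse <- sqrt(mean((real - previsto)^2))  # error in quality points (0-10 scale)
corr <- cor(real, previsto)              # agreement between observed and predicted
c(RMSE = rmse, correlation = corr)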
Until next time...
I hope this approach can contribute to those who are starting out in Data Science, whether Statisticians, Mathematicians, Computer Scientists, or students with an interest in the subject.