Bayesian Networks with Continuous Distributions - Regression model to describe wine quality
Bayesian networks (BNs), also known as belief networks (or Bayes nets for short), belong to the family of probabilistic graphical models (GMs). These graphical structures are used to represent knowledge about an uncertain domain. In particular, each node in the graph represents a random variable, while the edges between the nodes represent probabilistic dependencies among the corresponding random variables. These conditional dependencies in the graph are often estimated by using known statistical and computational methods. Hence, BNs combine principles from graph theory, probability theory, computer science, and statistics (Ben-Gal I., "Bayesian Networks", in Ruggeri F., Faltin F. & Kenett R. (eds.), Encyclopedia of Statistics in Quality & Reliability, Wiley & Sons, 2007).
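In practical terms, a BN over variables X1, …, Xn encodes the joint distribution as the product P(X1, …, Xn) = P(X1 | parents(X1)) × … × P(Xn | parents(Xn)), so each node only needs a conditional distribution given its parents in the DAG. In the Gaussian networks used below, each of these conditionals is a linear regression of the node on its parents, which is what makes the "regression model" of the title possible.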
Figure 1 - Example of a Bayesian network with belief propagation. Source: https://pr-owl.org/basics/bn.php
Data Set Information
The two datasets are related to red and white variants of the Portuguese "Vinho Verde" wine. For more details, consult: [Web Link] or the reference [Cortez et al., 2009]. Due to privacy and logistic issues, only physicochemical (inputs) and sensory (the output) variables are available (e.g. there is no data about grape types, wine brand, wine selling price, etc.).
These datasets can be viewed as classification or regression tasks. The classes are ordered and not balanced (e.g. there are many more normal wines than excellent or poor ones). Outlier detection algorithms could be used to detect the few excellent or poor wines. Also, we are not sure if all input variables are relevant, so it could be interesting to test feature selection methods. Source: https://archive.ics.uci.edu/ml/datasets/Wine+Quality
For more information, read [Cortez et al., 2009].
Input variables (based on physicochemical tests):
1 - fixed acidity
2 - volatile acidity
3 - citric acid
4 - residual sugar
5 - chlorides
6 - free sulfur dioxide
7 - total sulfur dioxide
8 - density
9 - pH
10 - sulphates
11 - alcohol
Output variable (based on sensory data):
12 - quality (score between 0 and 10)
Import packages
Import the required packages: bnlearn for structure and parameter learning, bnviewer for interactive network visualization, and StatMeasures for the MAPE accuracy metric.
library("bnlearn")
library("bnviewer")
library("StatMeasures")
Read Wine Data Set
Read the wine dataset from the CSV file and force the quality variable to be numeric, since bnlearn's Gaussian networks expect continuous (numeric) variables rather than integers.
df_wine = read.csv("winequality-red.csv", sep = ";")
df_wine$quality = as.numeric(df_wine$quality)
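Before learning a structure, it is worth confirming that every column really is numeric, which is why quality was coerced above. A minimal sanity check:
# All columns must be numeric for Gaussian structure learning;
# integer or factor columns would be rejected
stopifnot(all(sapply(df_wine, is.numeric)))
str(df_wine)  # expect 12 numeric variables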
Structure Learning for Bayesian Networks
The task of structure learning for Bayesian networks refers to learning the structure of the directed acyclic graph (DAG) from data. There are two major approaches: score-based methods, which search the space of DAGs for the structure that maximizes a network score such as BIC, and constraint-based methods, which build the graph from conditional independence tests, as sketched below.
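As a quick illustration of the two families, the sketch below learns one structure with each approach and scores both against the data (pc.stable is just one representative constraint-based algorithm in bnlearn; any of gs, iamb, etc. would do):
# Score-based: greedy hill climbing maximizing the Gaussian BIC
dag.score <- bnlearn::hc(df_wine)
# Constraint-based: PC-stable, built from conditional independence tests
dag.constraint <- bnlearn::pc.stable(df_wine)
# Compare fits; cextend() picks a consistent DAG extension in case
# pc.stable leaves some arcs undirected
bnlearn::score(dag.score, data = df_wine, type = "bic-g")
bnlearn::score(bnlearn::cextend(dag.constraint), data = df_wine, type = "bic-g")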
Automatic Structure Learning - Hill Climbing
# Learn the DAG by greedy hill climbing (score-based search)
bn.learn.wine = bnlearn::hc(df_wine)
# Drop the learned quality -> sulphates arc, so that quality
# remains a sink node (an outcome with no children)
bn.learn.wine = drop.arc(bn.learn.wine, from = "quality", to = "sulphates")
bn.learn.wine
Output - Hill Climbing
Bayesian network learned via Score-based methods
model:
[fixed.acidity][chlorides|fixed.acidity][alcohol|chlorides]
[free.sulfur.dioxide|fixed.acidity:alcohol]
[total.sulfur.dioxide|free.sulfur.dioxide:alcohol]
[citric.acid|fixed.acidity:chlorides:free.sulfur.dioxide:total.sulfur.dioxide:alcohol]
[residual.sugar|fixed.acidity:free.sulfur.dioxide:total.sulfur.dioxide:alcohol]
[pH|fixed.acidity:citric.acid:chlorides:free.sulfur.dioxide:total.sulfur.dioxide:alcohol]
[density|fixed.acidity:residual.sugar:chlorides:pH:alcohol]
[volatile.acidity|fixed.acidity:citric.acid:chlorides:free.sulfur.dioxide:total.sulfur.dioxide:density:pH]
[sulphates|fixed.acidity:volatile.acidity:residual.sugar:chlorides:total.sulfur.dioxide:density:pH:alcohol]
[quality|volatile.acidity:total.sulfur.dioxide:pH:alcohol]
nodes: 12
arcs: 45
undirected arcs: 0
directed arcs: 45
average markov blanket size: 9.33
average neighbourhood size: 7.50
average branching factor: 3.75
learning algorithm: Hill-Climbing
score: BIC (Gauss.)
penalization coefficient: 3.688567
tests used in the learning procedure: 572
optimized: TRUE
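To see how much each learned arc contributes to the fit, and to sanity-check manual prunings such as dropping quality -> sulphates, bnlearn can measure arc strengths; a short sketch with the Gaussian BIC criterion:
# Strength = change in the network score if the arc is removed;
# more negative values mean the score depends on the arc more strongly
strength <- bnlearn::arc.strength(bn.learn.wine, data = df_wine, criterion = "bic-g")
head(strength[order(strength$strength), ])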
Visualization - Bayesian Network
Visualization of the structure learned from the data with the Hill-Climbing algorithm, using the bnviewer package for interactive visualization of Bayesian networks.
viewer(bn.learn.wine,
bayesianNetwork.width = "100%",
bayesianNetwork.height = "100vh",
bayesianNetwork.layout = "layout_on_grid",
bayesianNetwork.title="<br><span style='font-size:18px;
font-family:Arial;
color:black;
text-align:center;'>
Bayesian Networks
with Continuous Distributions -
Wine Dataset</span>",
bayesianNetwork.subtitle = "<span style='font-size:15px;
font-family:Arial;
color:black;
text-align:center;'>Automatic
Structure Learning -
HC (Hill Climbing)</span>",
node.colors = list(background = "white",
border = "black",
highlight = list(background = "#e91eba",
border = "black")),
node.font = list(color = "black", face="Arial"),
clusters.legend.title = list(text = "Legend",
style = "font-size:18px;
font-family:Arial;
color:black;
text-align:center;"),
clusters.legend.options = list(
list(label = "Quality",
shape = "icon",
icon = list(code = "f1ce", size = 50, color = "#e91e63")),
list(label = "Acid",
shape = "icon",
icon = list(code = "f140", size = 50, color = "#03a9f4")),
list(label = "Sugar",
shape = "icon",
icon = list(code = "f192", size = 50, color = "#4caf50")),
list(label = "Sulfur Dioxide",
shape = "icon",
icon = list(code = "f10c", size = 50, color = "#ffc107")),
list(label = "Alcohol",
shape = "icon",
icon = list(code = "f043", size = 50, color = "#03a9f4"))
),
clusters = list(
list(label = "Quality",
shape = "icon",
icon = list(code = "f1ce", color = "#e91e63"),
nodes = list("quality")),
list(label = "Acid",
shape = "icon",
icon = list(code = "f140", color = "#03a9f4"),
nodes = list("fixed.acidity","citric.acid","volatile.acidity","pH")),
list(label = "Sugar",
shape = "icon",
icon = list(code = "f192", color = "#4caf50"),
nodes = list("residual.sugar")),
list(label = "Sulfur Dioxide",
shape = "icon",
icon = list(code = "f10c", color = "#ffc107"),
nodes = list("total.sulfur.dioxide","free.sulfur.dioxide")),
list(label = "Alcohol",
shape = "icon",
icon = list(code = "f043", color = "#03a9f4"),
nodes = list("alcohol"))
)
)
Output Visualization
Manual Structure Learning
model = "quality <- (residual.sugar, chlorides, free.sulfur.dioxide, citric.acid, sulphates, alcohol); free.sulfur.dioxide <- (total.sulfur.dioxide); citric.acid <- (pH); citric.acid <- (fixed.acidity); fixed.acidity <- (volatile.acidity); fixed.acidity <- (density);" bn.manual.wine = model.to.structure(model) bn.manual.wine
Output - Manual Structure Learning
Random/Generated Bayesian network
model:
[residual.sugar][chlorides][sulphates][alcohol][total.sulfur.dioxide][pH]
[volatile.acidity][density][free.sulfur.dioxide|total.sulfur.dioxide]
[fixed.acidity|volatile.acidity:density][citric.acid|pH:fixed.acidity]
[quality|residual.sugar:chlorides:free.sulfur.dioxide:citric.acid:sulphates:alcohol]
nodes: 12
arcs: 11
undirected arcs: 0
directed arcs: 11
average markov blanket size: 4.67
average neighbourhood size: 1.83
average branching factor: 0.92
generation algorithm: Empty
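Since the manual structure and the Hill-Climbing structure are Gaussian DAGs over the same variables, their network scores are directly comparable; a quick sketch:
# Higher (less negative) BIC indicates a better fit to the data
bnlearn::score(bn.learn.wine, data = df_wine, type = "bic-g")
bnlearn::score(bn.manual.wine, data = df_wine, type = "bic-g")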
Visualization - Bayesian Network
Manual structure visualization using the bnviewer package for interactive viewing of Bayesian networks.
viewer(bn.manual.wine,
bayesianNetwork.width = "100%",
bayesianNetwork.height = "100vh",
bayesianNetwork.layout = "layout_hierarchical_direction_LR",
bayesianNetwork.title="<br><span style='font-size:18px;
font-family:Arial;
color:black;
text-align:center;'>
Bayesian Networks
with Continuous Distributions -
Wine Dataset</span>",
bayesianNetwork.subtitle = "<span style='font-size:15px;
font-family:Arial;
color:black;
text-align:center;'>Manual
Structure Learning</span>",
edges.smooth = FALSE,
node.colors = list(background = "white",
border = "black",
highlight = list(background = "#e91eba",
border = "black")),
node.font = list(color = "black", face="Arial"),
clusters.legend.title = list(text = "Legend",
style = "font-size:18px;
font-family:Arial;
color:black;
text-align:center;"),
clusters.legend.options = list(
list(label = "Quality",
shape = "icon",
icon = list(code = "f1ce", size = 50, color = "#e91e63")),
list(label = "Acid",
shape = "icon",
icon = list(code = "f140", size = 50, color = "#03a9f4")),
list(label = "Sugar",
shape = "icon",
icon = list(code = "f192", size = 50, color = "#4caf50")),
list(label = "Sulfur Dioxide",
shape = "icon",
icon = list(code = "f10c", size = 50, color = "#ffc107")),
list(label = "Alcohol",
shape = "icon",
icon = list(code = "f043", size = 50, color = "#03a9f4"))
),
clusters = list(
list(label = "Quality",
shape = "icon",
icon = list(code = "f1ce", color = "#e91e63"),
nodes = list("quality")),
list(label = "Acid",
shape = "icon",
icon = list(code = "f140", color = "#03a9f4"),
nodes = list("fixed.acidity","citric.acid","volatile.acidity","pH")),
list(label = "Sugar",
shape = "icon",
icon = list(code = "f192", color = "#4caf50"),
nodes = list("residual.sugar")),
list(label = "Sulfur Dioxide",
shape = "icon",
icon = list(code = "f10c", color = "#ffc107"),
nodes = list("total.sulfur.dioxide","free.sulfur.dioxide")),
list(label = "Alcohol",
shape = "icon",
icon = list(code = "f043", color = "#03a9f4"),
nodes = list("alcohol"))
)
)
Output Visualization
Split DataSet - Training and Test
Split the dataset into a training set with the first 1,000 samples and a test set with the following 500 samples.
training.set <- df_wine[1:1000, ]  # first 1,000 rows for training
test.set <- df_wine[1001:1500, ]   # next 500 rows for testing
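Note that the rows of the UCI file are not guaranteed to be in random order, so a sequential split can bias the evaluation. A minimal alternative using a random 1,000/500 split (the seed value is arbitrary):
set.seed(42)                               # arbitrary seed, for reproducibility
idx <- sample(nrow(df_wine), 1500)         # draw 1,500 distinct row indices
training.set <- df_wine[idx[1:1000], ]     # 1,000 rows for training
test.set <- df_wine[idx[1001:1500], ]      # 500 rows for testing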
Fit Bayesian Network
In statistics, maximum likelihood estimation (MLE) is a method of estimating the parameters of a probability distribution by maximizing a likelihood function, so that under the assumed statistical model the observed data is most probable. The point in the parameter space that maximizes the likelihood function is called the maximum likelihood estimate (Rossi, Richard J. (2018). Mathematical Statistics: An Introduction to Likelihood Based Inference).
The logic of maximum likelihood is both intuitive and flexible, and as such the method has become a dominant means of statistical inference (Ward, Michael Don; Ahlquist, John S. (2018). Maximum Likelihood for Social Science: Strategies for Analysis. New York: Cambridge University Press). For a Gaussian Bayesian network, MLE reduces to fitting an ordinary least-squares regression of each node on its parents.
bayesian.fit <- bn.fit(bn.learn.wine,
data = training.set,
method="mle")
Prediction Wine Quality
Prediction of wine quality on the test set, using the fitted Bayesian model.
bayesian.predict <- predict(bayesian.fit,
"quality",
test.set)
real <- test.set[, "quality"]  # observed quality scores
previsto <- bayesian.predict   # predicted scores ("previsto" is Portuguese for "predicted")
Model Accuracy
Accuracy assessment of the developed model.
mape <- mape(y = real, yhat = previsto)  # mean absolute percentage error (StatMeasures)
accuracy <- 100 - mape * 100             # accuracy as 100% minus the MAPE
accuracy
By this measure, the Bayesian network learned automatically with Hill Climbing reached an accuracy of 90.47% (100% minus the MAPE) on the test set.
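MAPE-based accuracy is only one view of regression quality; two complementary checks on the same vectors, using base R only:
rmse <- sqrt(mean((real - previsto)^2))  # error in quality points (0-10 scale)
corr <- cor(real, previsto)              # agreement between observed and predicted
c(RMSE = rmse, correlation = corr)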
Until next time...
I hope this approach can contribute to those who are starting out in Data Science, whether Statisticians, Mathematicians, Computer Scientists, or students with an interest in the subject.