Machine Learning on Apache logs
This post is a small attempt to predict the response status of Apache logs (accuracy close to 65%).
To visualize and predict on application logs I am using a category model; my label category is the response status.
Below is a sample log, with columns restricted for readability:
172.17.0.1 , [03/Apr/2022:18:15:37 +0000] , GET / HTTP/1.1 , 200
172.17.0.1 , [03/Apr/2022:18:15:37 +0000] , GET / HTTP/1.1 , 200
172.17.0.1 , [03/Apr/2022:18:15:37 +0000] , GET / HTTP/1.1 , 200
172.17.0.1 , [03/Apr/2022:18:15:37 +0000] , GET / HTTP/1.1 , 200
172.17.0.1 , [03/Apr/2022:18:15:37 +0000] , GET / HTTP/1.1 , 200
As you all know, for the best accuracy we should convert strings to numerical format. In my case I have a limited dataset (8999 rows x 4 columns), and most columns have few distinct values, so I filled them with dummies (one-hot encoding).
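As a quick illustration of what "filled with dummies" means, pandas' get_dummies turns one categorical column into one 0/1 column per distinct value (the toy column below is my own, for illustration only):

```python
import pandas as pd

# toy frame with a hypothetical request-line column
df = pd.DataFrame({"protocol": ["GET / HTTP/1.1", "POST /login HTTP/1.1", "GET / HTTP/1.1"]})
dummies = pd.get_dummies(df["protocol"])
print(dummies)  # one indicator column per distinct request line
```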
I am able to get predictions close to 65% accuracy. To improve accuracy we might need to add more columns with numerical values.
For conversion of the raw log to CSV I have created the Go script below (the intention of using Go is to make the script compatible with fetching data from Kafka in batches in the next post).
package main

import (
	"fmt"
	"log"
	"os"
	"regexp"
	"strings"
)

func main() {
	s, err := os.ReadFile("access.log")
	if err != nil {
		log.Fatal(err)
	}
	// write to a separate CSV file, not back over access.log
	f, err := os.Create("access.csv")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()
	// compile the pattern once, outside the loop
	r := regexp.MustCompile(`(?P<ip>(\d{1,}\.){3}\d{1,}) - - (?P<sep>(\[.*?\]))\s"(?P<call>(\w{3}.*HTTP/1.1))"\s(?P<response>(\d+))`)
	list := strings.Split(string(s), "\n")
	for i := 0; i < len(list); i++ {
		k := r.FindStringSubmatch(list[i])
		if k == nil {
			// skip lines that do not match, e.g. the trailing empty line
			continue
		}
		d := k[1] + "," + k[3] + "," + k[5] + "," + k[7] + "\n"
		if _, err := f.WriteString(d); err != nil {
			log.Fatal(err)
		}
	}
	fmt.Println("wrote access.csv")
}
go run access.go -> the CSV will be saved as access.csv
----------------------------------------------------------
For preprocessing I am using pandas and numpy. Out of the ~9K records, 80% are used for training and 20% for testing, which you will see in the y_pred results. Below I tried linear regression.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

df = pd.read_csv("access.csv")
df.columns = ["ipaddress", "access_time", "protocol", "response"]

# quick visual overview (pairplot only plots the numeric "response" column)
sns.pairplot(df[["ipaddress", "access_time", "protocol", "response"]])

# one-hot encode the categorical columns
df["ipaddress"] = df["ipaddress"].astype(str)
ipaddress = pd.get_dummies(df["ipaddress"])
states = pd.get_dummies(df["protocol"])
df1 = pd.concat([states, ipaddress, df], axis=1)
df1.drop(["protocol", "ipaddress", "access_time"], axis=1, inplace=True)
# print(df1.head())

# features are the dummy columns, the label is the response status
x = df1.drop("response", axis=1)
y = df1["response"]
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)

lr = LinearRegression()
lr.fit(x_train, y_train)

y_pred = lr.predict(x_test)
y_test_1 = y_test.to_list()
for i in range(len(y_pred)):
    print(y_pred[i], y_test_1[i])
print(df.count())
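The snippet above only prints prediction/label pairs; one way to turn those pairs into the ~65% figure (my own addition, not from the original) is to round the regression output to the nearest status code and compare:

```python
import numpy as np

# toy values standing in for lr.predict(x_test) and y_test
y_pred = np.array([199.6, 200.2, 404.9, 210.0])
y_test = np.array([200, 200, 404, 500])

accuracy = np.mean(np.round(y_pred) == y_test)
print(accuracy)  # 0.5 on this toy data: the first two round to 200
```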
For model deployment I am using pickle, and in a coming post I will show CI/CD for model retraining and deployment.
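A minimal sketch of the pickle step, using a tiny stand-in model instead of the fitted lr above (the file name model.pkl is my own choice):

```python
import pickle

import numpy as np
from sklearn.linear_model import LinearRegression

# tiny stand-in model; in the post this would be the lr fitted on the log data
X = np.array([[0.0], [1.0], [2.0]])
y = np.array([0.0, 1.0, 2.0])
lr = LinearRegression().fit(X, y)

# serialize to disk ...
with open("model.pkl", "wb") as fh:
    pickle.dump(lr, fh)

# ... and restore it later, e.g. inside a serving container
with open("model.pkl", "rb") as fh:
    restored = pickle.load(fh)

pred = restored.predict(np.array([[3.0]]))[0]
```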