Machine Learning on Apache logs


This post is a small attempt to predict the response status of Apache logs (accuracy close to 65%).

To visualize and predict on application logs I use a categorical model, with the response status as the label.

Below is a sample log, restricted to a few columns for easier understanding:

172.17.0.1 , [03/Apr/2022:18:15:37 +0000] , GET / HTTP/1.1 , 200
172.17.0.1 , [03/Apr/2022:18:15:37 +0000] , GET / HTTP/1.1 , 200
172.17.0.1 , [03/Apr/2022:18:15:37 +0000] , GET / HTTP/1.1 , 200
172.17.0.1 , [03/Apr/2022:18:15:37 +0000] , GET / HTTP/1.1 , 200
172.17.0.1 , [03/Apr/2022:18:15:37 +0000] , GET / HTTP/1.1 , 200

As you all know, for the best accuracy we should convert strings to a numerical format. In my case the dataset is small (8999 rows x 4 columns) and most columns have few distinct values, so I filled them with dummy variables.
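As a tiny illustration of the dummy-variable conversion (the request strings below are made up for the example; real values come from the parsed log):

```python
import pandas as pd

# Hypothetical request strings; real values come from the parsed access log.
df = pd.DataFrame({"protocol": ["GET / HTTP/1.1", "POST /login HTTP/1.1", "GET / HTTP/1.1"]})
dummies = pd.get_dummies(df["protocol"])
print(dummies)  # one 0/1 column per distinct request string
```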

I am able to get predictive results close to 65% accuracy. To improve accuracy we might need to add more columns with numerical values.

For conversion of the raw log to CSV I created the Go script below (the intention behind using Go is to make the script compatible with fetching data from Kafka in batches, which I will cover in the next post):

package main

import (
    "log"
    "os"
    "regexp"
    "strings"
)

func main() {
    s, e := os.ReadFile("access.log")
    if e != nil {
        log.Fatal(e)
    }

    // Write the parsed fields to a separate CSV file so the
    // source log is not overwritten.
    f, err := os.Create("access.csv")
    if err != nil {
        log.Fatal(err)
    }
    defer f.Close()

    list := strings.Split(string(s), "\n")

    // Compile the pattern once, outside the loop.
    r := regexp.MustCompile(`(?P<ip>(\d{1,}\.){3}\d{1,}) - - (?P<sep>(\[.*?\]))\s"(?P<call>(\w{3}.*HTTP/1.1))"\s(?P<response>(\d+))`)

    for i := 0; i < len(list); i++ {
        k := r.FindStringSubmatch(list[i])
        if k == nil {
            // Skip lines that do not match (e.g. the trailing empty line).
            continue
        }
        d := k[1] + "," + k[3] + "," + k[5] + "," + k[7] + "\n"
        if _, err2 := f.WriteString(d); err2 != nil {
            log.Fatal(err2)
        }
    }
}


go run access.go -> the CSV will be saved as access.csv
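The same regular expression can be sanity-checked from Python against a single raw log line (the line below is a hedged example in the common Apache log format, not taken from my actual log):

```python
import re

# Hedged example line in the common Apache log format (not from my actual log).
line = '172.17.0.1 - - [03/Apr/2022:18:15:37 +0000] "GET / HTTP/1.1" 200 612'
pattern = r'(?P<ip>(\d{1,}\.){3}\d{1,}) - - (?P<sep>(\[.*?\]))\s"(?P<call>(\w{3}.*HTTP/1.1))"\s(?P<response>(\d+))'
m = re.search(pattern, line)
# The four named groups become the four CSV columns.
print(m.group("ip"), m.group("sep"), m.group("call"), m.group("response"))
```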

----------------------------------------------------------

For preprocessing I am using pandas and numpy. Out of the ~8K records, 80% are used for training and 20% for prediction (test_size=0.2), which you will see in the y_pred results. Below I tried linear regression:

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

df = pd.read_csv("access.csv")
df.columns = ["ipaddress", "access_time", "protocol", "response"]
sns.pairplot(df[["ipaddress", "access_time", "protocol", "response"]])
df["ipaddress"] = df["ipaddress"].astype(str)

## filling the string columns with numerical dummy variables
ipaddress = pd.get_dummies(df["ipaddress"], prefix="ip")
protocol = pd.get_dummies(df["protocol"], prefix="proto")
df1 = pd.concat([ipaddress, protocol, df["response"]], axis=1)
#print(df1.head())

x = df1.iloc[:, :-1]
y = df1.iloc[:, -1]
# 80% of the rows train the model, 20% are held out for prediction
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)
lr = LinearRegression()
lr.fit(x_train, y_train)

y_pred = lr.predict(x_test)
y_test_1 = y_test.to_list()
#print(y_pred[0:5], y_test_1[0:5])
for i in range(len(y_pred)):
    print(y_pred[i], y_test_1[i])

print(df.count())
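Since linear regression outputs continuous values, one way to turn them into an accuracy figure like the ~65% above is to snap each prediction to the nearest known status code. A minimal sketch, where status_accuracy and the codes list are my own illustrative names, not part of the pipeline above:

```python
import numpy as np

def status_accuracy(y_pred, y_true, codes=(200, 304, 404, 500)):
    # Snap each continuous prediction to the nearest status code in `codes`
    # (an assumed list; in practice use sorted(df["response"].unique())),
    # then report the fraction matching the true labels.
    codes = np.asarray(codes)
    preds = np.asarray(y_pred)
    snapped = codes[np.abs(codes[:, None] - preds).argmin(axis=0)]
    return float((snapped == np.asarray(y_true)).mean())

print(status_accuracy([210.4, 399.0, 180.0], [200, 404, 200]))  # → 1.0
```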

For model deployment I am using pickle, and in an upcoming post I will show CI/CD for model retraining and redeployment.
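A minimal sketch of the pickle save/load round trip (the model.pkl filename and the toy training data are assumptions for illustration):

```python
import pickle
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy model for illustration only; in practice this is the lr fitted above.
lr = LinearRegression().fit(np.array([[0.0], [1.0]]), np.array([200.0, 404.0]))

# Serialize the fitted model to disk...
with open("model.pkl", "wb") as f:
    pickle.dump(lr, f)

# ...and load it back at serving time.
with open("model.pkl", "rb") as f:
    loaded = pickle.load(f)

print(loaded.predict(np.array([[1.0]])))
```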



