Use multi-level hash in Perl for complex data queries
[Review other tutorials from my table of contents.]
I think this one goes in the "advanced Perl" pile. Not for the faint of heart ... it gets a little scary. Ready? Ok, let's go!
I once encountered a 500G+ log file that I needed to convert into a summary report. Where to begin? If this is your first encounter with this particular problem, go back and read the prequel on how I staged the data.
Multiple iterations
I like to use Perl for large-scale text processing. It's literally in the name: Perl is an acronym for Practical Extraction and Report Language.
I want to review the log file for the stats reports. What shape do these reports take when they're written to file? Let's investigate, using head or tail to sample the very large file. Our next step is to build a regular expression to describe them.
The log entries with the necessary info are marked with a common string: ": stats: ". I can describe this in a regular expression like so:
\:\sstats\:\s # colon, whitespace, literal "stats", colon, whitespace
This will help narrow the log file down to only the lines containing stats. I used it in my previous article as a filter for grep to pre-filter the data before passing it into the Perl script that I describe below.
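As a sketch of that pre-filter step done in Perl rather than grep; the two sample lines below are assembled from the log excerpts in this article:

```perl
use strict;
use warnings;

# The same ": stats: " filter, pre-compiled with qr// and the /x modifier
my $stats_re = qr/\:\sstats\:\s   # colon, whitespace, literal "stats", colon, whitespace
                 /x;

my @lines = (
    'Aug  1 16:50:06 server1 pdns_recursor[6661]: stats: 22 qps (average over 1816 seconds)',
    'Aug  1 16:50:06 server1 pdns_recursor[6661] starting up',
);
my @kept = grep { $_ =~ $stats_re } @lines;
print scalar(@kept), " of ", scalar(@lines), " lines kept\n";   # 1 of 2 lines kept
```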
One of the first things I notice is that not all stats are reported on the same log entry. Consequently, I will need to stitch together multiple log entries into a single row on my eventual report. File this thought in the back of your mind for now ... How do I associate the multiple entries into the same row?
Aug 1 16:50:06 server1 pdns_recursor[6661]: stats: 1094112151 questions, 1000188 cache entries, 29786 negative entries, 16% cache hits
Aug 1 16:50:06 server1 pdns_recursor[6661]: stats: throttle map: 14, ns speeds: 4548
Aug 1 16:50:06 server1 pdns_recursor[6661]: stats: outpacket/query ratio 51%, 1% throttled, 0 no-delegation drops
Aug 1 16:50:06 server1 pdns_recursor[6661]: stats: 243931 outgoing tcp connections, 4 queries running, 10105788 outgoing timeouts
Aug 1 16:50:06 server1 pdns_recursor[6661]: stats: 425354 packet cache entries, 57% packet cache hits
Aug 1 16:50:06 server1 pdns_recursor[6661]: stats: 22 qps (average over 1816 seconds)
That is six rows, each with a partial window into the full report. The first line tells me how many questions, cache entries, negative entries, and percent of cache hits. The data nuggets on the next line are throttle map and ns speeds. And so on and so forth, not to bore you to tears - let's move on ...
One last note ... to create a named capture group in a Perl regular expression, put a name right after the opening parenthesis, like so: (?<capture_name>...). I also take advantage of the regular expression compiler by creating pre-compiled regexes with the qr// operator.
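A tiny standalone example of a named capture and the %+ hash it feeds (using the qps field from the log above):

```perl
use strict;
use warnings;

# Named capture: (?<name>...) stores the matched text in %+ under "name"
my $qps_re = qr/(?<qps>\d+)\s+qps/;

if ('stats: 22 qps (average over 1816 seconds)' =~ $qps_re) {
    print "qps = $+{qps}\n";   # qps = 22
}
```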
Here are the regular expressions and captures that I came up with:
# $1 outgoing tcp connections
# $2 queries running
# $3 outgoing timeouts
my $line1 = qr/(?<outgoing_tcp_conn>\d+)\s+outgoing\stcp\sconnections,\s
(?<queries_running>\d+)\squeries\srunning,\s
(?<outgoing_timeouts>\d+)\soutgoing\stimeouts/x;
# $1 packet cache entries
# $2 packet cache hits pct
my $line2 = qr/(?<packet_cache_entries>\d+)\spacket\scache\sentries,\s
(?<packet_cache_hits_pct>\d+)\%\spacket\scache\shits/x;
# $1 qps
# $2 window size
my $line3 = qr/(?<qps>\d+)\sqps\s\(average\sover\s
(?<qps_avg_window>\d+)\sseconds\)/x;
# $1 questions
# $2 cache entries
# $3 negative entries
# $4 cache hit pct
my $line4 = qr/(?<questions>\d+)\squestions,\s
(?<cache_entries>\d+)\scache\sentries,\s
(?<negative_entries>\d+)\snegative\sentries,\s
(?<cache_hits_pct>\d+)\%\scache\shits/x;
# $1 throttle map
# $2 ns speeds
my $line5 = qr/throttle\smap\:\s(?<throttle_map>\d+),\s
ns\sspeeds\:\s(?<ns_speeds>\d+)/x;
# $1 outpkt qry ratio
# $2 pct throttled
# $3 no-delegation drops
my $line6 = qr/outpacket\/query\sratio\s(?<outpacket_query_ratio>\d+)\%,\s
(?<pct_throttled>\d+)\%\sthrottled,\s
(?<no_delegation_drops>\d+)\sno\-delegation\sdrops/x;
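Before wiring all six into the script, it's worth sanity-checking one against a sample line from the excerpt above ($line3 is repeated here so the snippet runs standalone):

```perl
use strict;
use warnings;

# $1 qps
# $2 window size
my $line3 = qr/(?<qps>\d+)\sqps\s\(average\sover\s
(?<qps_avg_window>\d+)\sseconds\)/x;

my $sample = 'Aug  1 16:50:06 server1 pdns_recursor[6661]: stats: 22 qps (average over 1816 seconds)';
if ($sample =~ $line3) {
    print "qps=$+{qps} window=$+{qps_avg_window}\n";   # qps=22 window=1816
}
```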
One thing to keep in mind about Perl hashes is that the keys come back in no particular order. If you want them in a certain order, you get to apply that ordering yourself:
do { ... } foreach (sort keys %hash);   # do stuff with $hash{$_} here
Perl has some tricky conventions that can be difficult to master. The "implicit variable" $_ takes its value from the surrounding construct, which in the above snippet means it holds each element produced by sort keys %hash in turn.
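Here's a concrete, runnable version of that idiom, with a toy hash standing in for real data:

```perl
use strict;
use warnings;

my %hash = (banana => 2, apple => 1, cherry => 3);

# $_ holds each sorted key in turn inside the do block
my $out = '';
do { $out .= "$_=$hash{$_};" } foreach (sort keys %hash);
print $out, "\n";   # apple=1;banana=2;cherry=3;
```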
Another thought: there are ways to impose your own ordering logic on sort within Perl. Maybe that's a post for another day. I do a bit of a trick to coerce the dates into a more easily sorted format - I convert them to Unix epoch, the number of seconds since midnight 1970/01/01, GMT. If you're curious what this exact moment's Unix epoch is, jump on your CLI and ask the date utility:
date +%s
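The same value is available inside Perl via the built-in time():

```perl
use strict;
use warnings;

# time() returns seconds since 1970-01-01 00:00:00 UTC, same as `date +%s`
my $now = time();
print "epoch: $now\n";
print "local: ", scalar localtime($now), "\n";
```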
When I pull the date from the syslog entry, I capture the elements and send them to this conversion sub. I make use of Time::Local to convert to Unix epoch.
use Time::Local;

sub logts2epoch {
    my ($mon,$mday,$hh,$mm,$ss) = @_;
    my $c = 0;
    my %mo = map { $_ => $c++ } qw/Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec/;
    return undef unless defined($mo{$mon});
    # syslog timestamps omit the year, so it is hardcoded here
    return Time::Local::timelocal($ss,$mm,$hh,$mday,$mo{$mon},2024);
}
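A quick round-trip check of the conversion (the sub is repeated here so the snippet runs standalone); timelocal and localtime both use the local time zone, so they invert each other:

```perl
use strict;
use warnings;
use Time::Local;

sub logts2epoch {
    my ($mon,$mday,$hh,$mm,$ss) = @_;
    my $c = 0;
    my %mo = map { $_ => $c++ } qw/Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec/;
    return undef unless defined($mo{$mon});
    # syslog timestamps omit the year, so it is hardcoded here
    return Time::Local::timelocal($ss,$mm,$hh,$mday,$mo{$mon},2024);
}

# Round-trip: convert the sample timestamp, then unpack it with localtime
my $epoch = logts2epoch('Aug', 1, 16, 50, 6);
my ($ss,$mm,$hh,$mday,$mon) = (localtime($epoch))[0..4];
printf "%02d/%02d %02d:%02d:%02d\n", $mon + 1, $mday, $hh, $mm, $ss;   # 08/01 16:50:06
```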
Here's the rest of the script.
use POSIX qw(strftime);   # strftime() formats the epoch back into a date string

my $flds = [
q{outgoing_tcp_conn},
q{queries_running},
q{outgoing_timeouts},
q{packet_cache_entries},
q{packet_cache_hits_pct},
q{qps},
q{qps_avg_window},
q{questions},
q{cache_entries},
q{negative_entries},
q{cache_hits_pct},
q{throttle_map},
q{ns_speeds},
q{outpacket_query_ratio},
q{pct_throttled},
q{no_delegation_drops},
];
my $str;
my $file = shift || die "usage: $0 logfile\n";
{
    local $/;   # slurp mode: read the whole file in one go
    open(my $fh, '<', $file) or die "can't open $file: $!";
    $str = <$fh>;
}
my %db;
for my $line (split /[\r\n]+/, $str) {
# Aug 1 10:48:24
next unless
($line =~ m/^(?<mon>Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)\s+(?<mday>\d+)\s+(?<hh>\d+)\:(?<mm>\d+)\:(?<ss>\d+)/);
my ($mon,$mday,$hh,$mm,$ss)=@+{qw/mon mday hh mm ss/};
my $ts = &logts2epoch($mon,$mday,$hh,$mm,$ss);
my %row;
if (($line =~ $line1) || ($line =~ $line2) ||
($line =~ $line3) || ($line =~ $line4) ||
($line =~ $line5) || ($line =~ $line6)) {
do { $row{$_} = $+{$_} } foreach (keys %+);
}
# a hash slice transfers %row into %db in one shot
@{$db{$ts}}{keys %row} = values %row if %row;
}
print join(",","Date/Time",@$flds)."\n";
for my $ts (sort keys %db) {
print join(",",strftime("%Y/%m/%d %H:%M:%S",localtime($ts)),@{$db{$ts}}{@$flds})."\n";
}
Why the array of capture labels? In addition to referring to the array for my title row, I also use this array to enforce a consistent ordering when I query the keys on the main hash variable.
Recall that I mentioned using Perl's qr// operator to pre-compile the regexes. I read one $line at a time from the log file and compare it against all six regexes, chained together with double pipes (||) - a short-circuiting OR.
Once a line matches, yet another built-in Perl trick comes along - the capture hash, %+. Index into it as you would any other hash, using the $ sigil and the capture name.
The context of %db is established outside the loop, so it survives across loop iterations. By contrast, %row is defined within the context of each loop, so no entries remain from previous loop iterations.
Remember that question we're holding on to?
How do I associate the multiple entries into the same row?
I use a hash of hashes and pivot on the common timestamp for each of the 6 partial log entries.
Are you familiar with how to de-reference a hash reference?
First, a non-reference example: if my hash %db has a key "Jeff was here", then that's easy: $db{"Jeff was here"} grabs the value for that key.
But what if $db holds a reference to a hash? Then it's slightly clunkier: $db->{"Jeff was here"}. Simple enough?
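A minimal sketch of both forms side by side, with a made-up timestamp key:

```perl
use strict;
use warnings;

my %db;
$db{1690901406}{qps} = 22;   # multi-level hash; the inner hash is autovivified

my $db_ref = \%db;           # a reference to the same hash
print $db{1690901406}{qps}, "\n";         # plain hash access: 22
print $db_ref->{1690901406}{qps}, "\n";   # via the reference:  22
```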
I learned about slicing hashes a while back and I'm stuck on it.
# how do you slice a hash? just like this:
@{$db{$ts}}{@$flds}
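Here's that same slice on a toy %db with one timestamp and two of the capture labels:

```perl
use strict;
use warnings;

# Toy %db with one timestamp key, to show the nested hash slice
my %db = ( 1690901406 => { questions => 1094112151, qps => 22 } );
my $flds = [ qw/questions qps/ ];

# Slice the inner hash, returning values in @$flds order
my @vals = @{$db{1690901406}}{@$flds};
print join(",", @vals), "\n";   # 1094112151,22
```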
I'm sure Gábor Szabó explains it better than I can, so rather than duplicate his fine effort, I'll just ask you to click on that link up there ^^.
Alrighty sports fans, I know that this has been a lot of fugly Perl to look at.
We began with a ginormous syslog file. We staged and pre-filtered and Perl'd that file. In the end, we are left with a CSV that contains one row per timestamp (recall that, in the original syslog, each timestamp has 6 different reports - we have successfully consolidated all 6 into 1!) Import the resulting CSV into Excel, and look for patterns by graphing an interesting column.
Here's what one graph looks like, using just the QPS column:
Questions? Comments? Critiques? The comment section is waiting for you to speak up!
Happy hunting!