Saturday, April 9, 2022

[SOLVED] Parsing complicated debug logs with regex/grep/awk/sed

Issue

I have debug logs that are GB in size and contain lots of extraneous data. Single log entries can be 1,000,000+ lines long, some parts with indents, some without, there is very little consistency except for the beginning timestamp at the start of each entry. Each new entry starts with a timestamp ^202[0-9]/[0-9]{2}/[0-9]{2} blah blah so it is easily identifiable but can have many many lines after it that belong to it. I've been using python to locate strings of text then move up find the parent entry they belong to and down to the end of the entry where the next instance of ^202[0-9]/[0-9]{2}/[0-9]{2} blah blah is located which is unfortunately not nearly performant enough to make this a painless process. I'm now trying to get grep to do the same with regex since grep seems to be in a different universe in terms of speed. Also I run into the issue of python version differences on machines I'm working on (2vs3) it's just a pain.

This is what I have so far for grep and it works in small test cases but not on large files, there are obviously some issues with it performance wise, how can I resolve this? Perhaps there's a good way to do this with awk?

grep -E "(?i)^20[0-9]{2}\/[0-9]{2}\/[0-9]{2}[\s\S]+00:00:00:fc:77:00[\s\S]+?(?=^20[0-9]{2}\/[0-9]{2}\/[0-9]{2}|\Z)" the key string I'm looking for is 00:00:00:fc:77:00

sample

2022/01/28 17:58:45.408 {Engine-Worker-08} <radiusItem.request-dump> Request packet dump:
  Type=1, Ident=160, Len=54, Auth=7D 12 89 48 19 85 00 00 00 00 00 00 12 0C CC 22 
... 
hundreds of thousands of lines of nonsense that might have my search string in it, with little to no consistency 
...
2022/01/28 17:58:45.408 {Engine-Worker-16} <radiusItem.request-dump> Request packet dump:
... 
hundreds of thousands of lines of nonsense that might have my search string in it, with little to no consistency 
...
2022/01/28 17:58:46.127 {TcpEngine-3} <tcp-service> Accept failure: Invalid Radius/TLS client 1.1.1.1, connection closed
2022/01/28 17:58:48.604 {Engine-Worker-60} [acct:callAcctBkgFlow] <engine.item.setup> Call method ==> acct:readAcctPropFile
... 
hundreds of thousands of lines of nonsense that might have my search string in it, with little to no consistency 
...

if any of these have my search string in them I want the whole piece between to the time stamps, all the many thousands of lines.


Solution

Assumptions:

  • only timestamp lines start with ^YYYY/MM/DD
  • a timestamp line starts with a string of the format YYYY/MM/DD HH:MM:SS.sss {<thread_name>} and this string is unique within the file
  • search strings do not contain embedded new lines
  • search strings are guaranteed not to be broken by newlines in the log file
  • a single log entry (OP: can be 1,000,000+ lines long) may be too large to fit into memory
  • need to search for multiple patterns

Setup:

$ cat search_strings
00:00:00:fc:77:00
and this line of text

$ cat log.txt
2022/01/28 17:58:45.408 {Engine-Worker-08} <radiusItem.request-dump> Request packet dump:
  Type=1, Ident=160, Len=54, Auth=7D 12 89 48 19 85 00 00 00 00 00 00 12 0C CC 22
MATCH  should match on this: 00:00:00:fc:77:00
...
hundreds of thousands of lines of nonsense that might have my search string in it, with little to no consistency
...
2022/01/28 17:58:45.408 {Engine-Worker-16} <radiusItem.request-dump> Request packet dump:
...
hundreds of thousands of lines of nonsense that might have my search string in it, with little to no consistency
...
2022/01/28 17:58:46.127 {TcpEngine-3} <tcp-service> Accept failure: Invalid Radius/TLS client 1.1.1.1, connection closed
MATCH  should match on this: 00:00:00:fc:77:00
2022/01/28 17:58:48.604 {Engine-Worker-60} [acct:callAcctBkgFlow] <engine.item.setup> Call method ==> acct:readAcctPropFile
...
MATCH  should also match on this: and this line of text :if we've got the logic right
hundreds of thousands of lines of nonsense that might have my search string in it, with little to no consistency
...

NOTE: log.txt lines of interest start with ^MATCH

One awk idea that requires two passes through the log file:

awk '
FNR==NR        { strings[$0]; next }

FNR==1         { pass++ }

/^20[0-9][0-9]\/[0-1][0-9]\/[0-3][0-9] / { dtt = $1 FS $2 FS $3
                                           print_header=1 
                                         }

pass==1        { for (string in strings)
                     if (match($0,string)) {           # search for our strings in current line and if found ...
                        dttlist[dtt]                   # save current date/time/thread
                        next
                     }
               }

pass==2 && 
(dtt in dttlist) { if (print_header) {
                      print "################# matching block:"
                      print_header=0
                   }
                   print
                 }
' search_strings log.txt log.txt

Assuming memory usage is not an issue, another awk idea requiring a single pass through the log file:

awk '

function print_lines() {

    if (lineno > 0)
       for (i=0; i<=lineno; i++)
           print lines[i]

    delete lines
    lineno=0
}

FNR==NR         { strings[$0]; next }

/^20[0-9][0-9]\/[0-1][0-9]\/[0-3][0-9] / { dtt = $1 FS $2 FS $3 }

dtt != prev_dtt { delete lines
                  lineno=0
                  lines[0]="############# matching block:"

                  printme=0                     # disable printing of lines to stdout;
                                                # instead they will be saved to the lines[] array
                  prev_dtt=dtt
                }

! printme      { for (string in strings)
                     if (match($0,string)) { 
                        print_lines()          # flush any lines in the lines[] array and ...
                        printme=1              # set flag to print new lines to stdout
                        break
                     }
                 if (! printme)  {             # if not printing lines to stdout then ...
                    lines[++lineno]=$0         # save the current line in the lines[] array
                 }
               }
printme                                        # printme==1 => print current line to stdout
' search_strings log.txt

Both of these generate:

################# matching block:
2022/01/28 17:58:45.408 {Engine-Worker-08} <radiusItem.request-dump> Request packet dump:
  Type=1, Ident=160, Len=54, Auth=7D 12 89 48 19 85 00 00 00 00 00 00 12 0C CC 22
MATCH  should match on this: 00:00:00:fc:77:00
...
hundreds of thousands of lines of nonsense that might have my search string in it, with little to no consistency
...
################# matching block:
2022/01/28 17:58:46.127 {TcpEngine-3} <tcp-service> Accept failure: Invalid Radius/TLS client 1.1.1.1, connection closed
MATCH  should match on this: 00:00:00:fc:77:00
################# matching block:
2022/01/28 17:58:48.604 {Engine-Worker-60} [acct:callAcctBkgFlow] <engine.item.setup> Call method ==> acct:readAcctPropFile
...
MATCH  should also match on this: and this line of text :if we've got the logic right
hundreds of thousands of lines of nonsense that might have my search string in it, with little to no consistency
...


Answered By - markp-fuso
Answer Checked By - Cary Denson (WPSolving Admin)