Tuesday, May 24, 2022

[SOLVED] (sed/awk) extract values text file and write to csv (no pattern)

awk, bash, csv, replace, sed

Issue

I have (several) large text files from which I want to extract some values to create a csv file with all of these values.

My current solution is to have a few different calls to sed from which I save the values and then have a python script in which I combine the data in different files to a single csv file. However, this is quite slow and I want to speed it up.

The file, let's call it my_file_1.txt, has a structure that looks something like this:

lines I don't need
start value 123
lines I don't need
epoch 1
...
lines I don't need
some epoch 18 words
stop value 234
lines I don't need
words start value 345 more words
lines I don't need
epoch 1
...
lines I don't need
epoch 72
stop value 456
...

and I would like to construct something like

file,start,stop,epoch,run
my_file_1.txt,123,234,18,1
my_file_1.txt,345,456,72,2
...

How can I get the results I want? It doesn't have to be Sed or Awk as long as I don't need to install something new and it is reasonably fast.

I don't really have any experience with awk. With sed my best guess would be

filename=$1
echo 'file,start,stop,epoch,run' > my_data.csv
sed -n '
  s/.*start value \([0-9]\+\).*/'"$filename"',\1,/
  h
  $!N
  /.*epoch \([0-9]\+\).*\n.*stop value\([0-9]\+\)/{s/\2,\1/}
  D
  T
  G
  P
' $filename | sed -z 's/,\n/,/' >> my_data.csv

and then deal with not getting the run number. Furthermore, this is not quite correct, as the N will gobble up some "start value" lines, leading to a wrong result. It feels like it could be done more easily with awk.

It is similar to 8992158 but I can't use that pattern and I know too little awk to rewrite it.

Solution (Edit)

I was not general enough in my description of the problem, so I changed it up a bit and fixed some inconsistencies.

Awk (Rusty Lemur's answer)

Here I generalised from assuming that the numbers were at the end of the line to using gensub. For this I should have specified my version of awk, since gensub is not available in all versions (it is a GNU awk extension).

BEGIN {
  counter = 1 
  OFS = ","   # This is the output field separator used by the print statement
  print "file", "start", "stop", "epoch", "run"  # Print the header line
}

/start value/ {
  startValue = gensub(/.*start value ([0-9]+).*/, "\\1", 1, $0) 
}

/epoch/ {
  epoch = gensub(/.*epoch ([0-9]+).*/, "\\1", 1, $0) 
}

/stop value/ {
  stopValue = gensub(/.*stop value ([0-9]+).*/, "\\1", 1, $0) 
  
  # we have everything to print our line
  print FILENAME, startValue, stopValue, epoch, counter
  counter = counter + 1 
  startValue = "" # clear variables so they aren't maintained through the next iteration
  epoch = ""
}

I accepted this answer because it is the most understandable.
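For an awk without gensub, the same extraction can probably be sketched with the POSIX functions match and substr. This is only a sketch: number_after is a hypothetical helper name that does not come from either answer, and if a label appeared more than once on a line it would pick the leftmost occurrence, whereas the gensub version, as far as I can tell, picks the last one.

```shell
# Sample line taken from the question's input
line='words start value 345 more words'

run_value=$(printf '%s\n' "$line" | awk '
# Hypothetical helper: return the digits that follow label, or "" if there are none.
function number_after(label, text) {
  if (!match(text, label " [0-9]+")) return ""
  # RSTART and RLENGTH describe the match; skip the label and the single space after it
  return substr(text, RSTART + length(label) + 1, RLENGTH - length(label) - 1)
}
/start value/ { print number_after("start value", $0) }
')
echo "$run_value"
```

The epoch and stop value blocks could call the same helper with their own labels.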

Sed (potong's answer)

sed -nE '1{x;s/^/file,start,stop,epoch,run/p;s/.*/0/;x}
        /^.*start value/{:a;N;/\n.*stop value/!ba;x
        s/.*/expr & + 1/e;x;G;F
        s/^.*start value (\S+).*\n.*epoch (\S+)\n.*stop value (\S+).*\n(\S+)/,\1,\3,\2,\4/p}' my_file_1.txt | sed '1!N;s/\n//'

Solution

awk's basic structure is:

read a record from the input (by default a record is a line)
evaluate conditions
apply actions
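
The read/evaluate/apply cycle above can be seen in a one-rule sketch. The three input lines are made up in the style of the question's file, and the variable name matches is only for illustration.

```shell
matches=$(printf 'lines I do not need\nepoch 1\nsome epoch 18 words\n' | awk '
# condition: the record matches /epoch/; action: print it with its record number NR
/epoch/ { print NR ": " $0 }
')
echo "$matches"
```

Records that do not satisfy the condition are read and then skipped without any output.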

The record is split into fields (by default using whitespace as the separator). The fields are referenced by their position, starting at 1: $1 is the first field, $2 is the second. The variable NF holds the number of fields in the current record, so $NF is the last field, $(NF-1) is the second-to-last field, etc.
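
As a quick illustration of the field variables, here is a sketch that runs awk on one line from the question's sample input. The shell variable names line and fields are just for this example.

```shell
# Sample line taken from the question's input
line='stop value 234'

# NF is the field count, $1 the first field, $(NF-1) the second-to-last, $NF the last
fields=$(printf '%s\n' "$line" | awk '{ print NF, $1, $(NF-1), $NF }')
echo "$fields"
```

Note that $NF only yields the number when it is the last word on the line.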

A "BEGIN" section is executed before any input is read, and it can be used to initialize variables. Variables that are never assigned start out empty, which counts as 0 in numeric context, so counter is set to 1 explicitly below.
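
A minimal sketch of that ordering and of the implicit initialization, using three dummy input lines. The variable n and the END block are additions for illustration and are not part of the script below.

```shell
summary=$(printf 'a\nb\nc\n' | awk '
BEGIN { print "before input" }   # runs once, before the first record is read
{ n++ }                          # n was never assigned, so it starts out as 0 here
END { print "records:", n }      # runs once, after the last record
')
echo "$summary"
```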

BEGIN {
  counter = 1
  OFS = ","   # This is the output field separator used by the print statement
  print "file", "start", "stop", "epoch", "run"  # Print the header line
}

/start value/ {
  startValue = $NF  # when a line contains "start value" store the last field as startValue 
}

/epoch/ {
  epoch = $NF
}

/stop value/ {
  stopValue = $NF

  # we have everything to print our line
  print FILENAME, startValue, stopValue, epoch, counter
  counter = counter + 1
  startValue = "" # clear variables so they aren't maintained through the next iteration
  epoch = ""
}

Save that as processor.awk and invoke as:

awk -f processor.awk my_file_1.txt my_file_2.txt my_file_3.txt > output.csv
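
One thing to be aware of with several input files: counter is only set in BEGIN, so the run number keeps counting across files. If the run number should restart for each file (an assumption, the question does not say), a sketch using FNR could look like this. The two small input files and their values are made up for the example.

```shell
tmpdir=$(mktemp -d)
printf 'start value 1\nepoch 5\nstop value 2\nstart value 3\nepoch 7\nstop value 4\n' > "$tmpdir/a.txt"
printf 'start value 10\nepoch 9\nstop value 20\n' > "$tmpdir/b.txt"

runs=$(cd "$tmpdir" && awk '
BEGIN { OFS = "," }
FNR == 1      { counter = 1 }      # FNR restarts at 1 for every input file
/start value/ { startValue = $NF }
/epoch/       { epoch = $NF }
/stop value/  { print FILENAME, startValue, $NF, epoch, counter++ }
' a.txt b.txt)
echo "$runs"
rm -rf "$tmpdir"
```

The header line is left out here to keep the sketch short.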

Answered By - Rusty Lemur

Answer Checked By - Candace Johnson (WPSolving Volunteer)

This Answer collected from stackoverflow and tested by PythonFixing community admins, is licensed under cc by-sa 2.5 , cc by-sa 3.0 and cc by-sa 4.0