Wednesday, December 1, 2021

[SOLVED] awk FS vs FPAT puzzle and counting words but not blank fields


Issue

Suppose I have the file:

$ cat file
This, that;
this-that or this.

(Punctuation at the end of a line is not always present.)

Now I want to count words, where a word is a run of one or more ASCII letters, compared case-insensitively. With typical POSIX tools you could do:

sed -E 's/[^[:alpha:]]+/ /g; s/^ //; s/ $//; /^$/d' file | tr ' ' '\n' | tr '[:upper:]' '[:lower:]' | sort | uniq -c
   1 or
   2 that
   3 this

With grep you can shorten that a bit to only match what you define as a word:

grep -oE '[[:alpha:]]+' file | tr '[:upper:]' '[:lower:]' | sort | uniq -c
# same output

With GNU awk, you can use FPAT to describe what a field is rather than what separates fields (ignore the output order, since for (e in seen) iterates in an unspecified order):

gawk -v FPAT="[[:alpha:]]+" '
{for (i=1;i<=NF;i++) {seen[tolower($i)]++}}
END {for (e in seen) printf "%4s %s\n", seen[e], e}' file
   3 this
   1 or
   2 that

To replicate this in POSIX awk, I tried:

awk 'BEGIN{FS="[^[:alpha:]]+"}
{ for (i=1;i<=NF;i++) seen[tolower($i)]++ }
END {for (e in seen) printf "%4s %s\n", seen[e], e}' file
   2 
   3 this
   1 or
   2 that

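Note the count of 2 with an empty word at the top. It comes from the empty fields created by the ; at the end of line 1 and the . at the end of line 2: when a record ends with a separator, awk sees an empty field after it. If you delete the punctuation at the ends of the lines, the issue goes away.

You can partially fix it (for every line but the last) by setting RS="" in the awk script, so the whole file is read as one record, but you still get one empty field from the punctuation that ends the last line.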
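To see where the empty field comes from, here is a small sketch that prints NF and every field in brackets for a single line. The helper name show_fields is made up for this illustration; only the -F pattern is taken from the attempt above.

```shell
# show_fields is a hypothetical helper name used only for this illustration.
# It prints the field count, then each field wrapped in brackets.
show_fields() {
    awk -F '[^[:alpha:]]+' '{
        printf "NF=%d:", NF
        for (i = 1; i <= NF; i++) printf " [%s]", $i
        print ""
    }'
}

printf 'This, that;\n' | show_fields   # trailing ";" leaves an empty last field
printf 'This, that\n'  | show_fields   # no trailing punctuation, no empty field
```

The first call prints NF=3: [This] [that] [] and the second prints NF=2: [This] [that]. A line that starts with punctuation would produce an empty first field in the same way.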

I can also fix it this way:

awk 'BEGIN{FS="[^[:alpha:]]+"}
{ for (i=1;i<=NF;i++) if ($i) seen[tolower($i)]++ }
END {for (e in seen) printf "%4s %s\n", seen[e], e}' file

This seems a little less than straightforward.

Is there an idiomatic fix I am missing to make POSIX awk act similarly to GNU awk's FPAT solution here?

Solution

This should work in POSIX/BSD or any version of awk:

awk -F '[^[:alpha:]]+' '
{for (i=1; i<=NF; ++i) ($i != "") && ++count[tolower($i)]}
END {for (e in count) printf "%4s %s\n", count[e], e}' file

   1 or
   3 this
   2 that

By using -F '[^[:alpha:]]+' we split fields on any run of non-alphabetic characters.
The ($i != "") condition makes sure that only non-empty fields are counted in count.
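If you would rather describe the words themselves, as FPAT does, one possible POSIX-only alternative is to loop over each line with match(). This is a sketch, not part of the answer above. The function name count_words is hypothetical, and the trailing sort is only there to make the output order predictable.

```shell
# count_words is a hypothetical helper name; it reads stdin or the files given.
count_words() {
    awk '
    {
        line = $0
        # Repeatedly take the leftmost run of letters, then continue after it.
        while (match(line, /[[:alpha:]]+/)) {
            count[tolower(substr(line, RSTART, RLENGTH))]++
            line = substr(line, RSTART + RLENGTH)
        }
    }
    END { for (w in count) printf "%4s %s\n", count[w], w }' "$@"
}

# Sort by the word column so the order does not depend on the awk in use.
printf 'This, that;\nthis-that or this.\n' | count_words | sort -k2
```

Because no field splitting on FS is involved, empty fields never arise, so no ($i != "") test is needed. The cost is a somewhat longer script than the -F version.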

Answered By - anubhava

This answer, collected from Stack Overflow and tested by PythonFixing community admins, is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
