Issue
I have a text file that contains duplicate words. It is easy to sort it and extract the unique words, but some of the words have a slash followed by a tag. In that case, I need to remove the word without the slash and keep the longer, tagged word in the file.
For example, if the file looks like this:
test/e
this/x
word/p
and
some
more/q
test
this
and
some
Using sort:
sort -u t1.txt
and
more/q
some
test
test/e
this
this/x
word/p
But the expected result is:
and
more/q
some
test/e
this/x
word/p
Update: I used an over-simplified example. There are cases where a word may have multiple tags. In that case I need to keep every tagged form of the word.
# cat t1.txt
test/e
this/x
word/p
and
some
more/q
test
this
and
some
more/n
word/n
# awk -F/ '!seen[$1]++{rows[$0]} END {for (i in rows) print i}' t1.txt
some
test/e
more/q
and
word/p
this/x
In this case, more/n and word/n should be included in the output:
some
test/e
more/q
and
word/p
this/x
more/n
word/n
Solution
Please try the following, which reverse-sorts the file so that each tagged word comes immediately before its untagged form:
awk '{
    sub("/.*$", "", prev)       # remove "tag" from the variable prev
    if ($0 != prev) print       # print $0 if it differs from prev
    prev = $0                   # update prev
}' < <(sort -r t1.txt)          # feed the reverse-sorted file
Output for the 2nd t1.txt:
word/p
word/n
this/x
test/e
some
more/q
more/n
and
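The script above relies on reverse sorting, so the output comes out in reverse-sorted order. As an alternative sketch (not the method above), the same rule can be applied without sorting by reading the file twice, which keeps the lines in their original order. The file name sample.txt, its contents, and the function name dedup_tagged are illustrative assumptions.

```shell
# Illustrative input; sample.txt is a hypothetical file name.
printf '%s\n' test/e this/x word/p and some more/q \
              test this and some more/n word/n > sample.txt

# Pass 1 (NR == FNR): remember every word that appears with a tag.
# Pass 2: print tagged lines, plus bare words that never had a tag,
# skipping exact duplicate lines. Input order is preserved.
dedup_tagged() {
  awk -F/ '
    NR == FNR { if (NF > 1) tagged[$1] = 1; next }
    (NF > 1 || !($1 in tagged)) && !seen[$0]++
  ' "$1" "$1"
}

dedup_tagged sample.txt
```

Because the file is named twice on the command line, this variant needs a regular file rather than a pipe.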
[Edit]
If you want to extract the words that have more than one tag (word and more) from the output above, feed that output to:
awk -F/ 'seen[$1]++ {print $1}'
Result:
word
more
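Note that the condition seen[$1]++ is true for every occurrence of a word after the first, so a word with three tags would be printed twice. A minimal variant, assuming each such word should be listed only once, compares the counter against 1 instead. The function name multi_tag_words and the input lines are made up for illustration.

```shell
# Print each word that has more than one tagged form, exactly once.
# seen[$1]++ == 1 is true only on the second occurrence of a word.
multi_tag_words() {
  awk -F/ 'seen[$1]++ == 1 { print $1 }'
}

# word has three tags here, but is still reported a single time.
printf '%s\n' word/p word/n word/x this/x more/q more/n and | multi_tag_words
```

Like the original one-liner, this expects input that is already de-duplicated, such as the output of the first script.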
[Edit2]
If a word in the input file can carry multiple tags, please try the following:
awk -F/ '                                       # split records on "/"
!seen[$1] {                                     # the word is new (not seen)
    seen[$1]++                                  # remember the word as seen
    if ($2 != "") tags[$1] = $2                 # store the tag if not empty
    next                                        # skip to the next input
}
index(tags[$1], $2) == 0 && $2 != "" {          # if a new tag is input
    split(tags[$1], a, /,/)                     # split the string into an array
    tags[$1] = $2                               # redefine the string of tags
    for (i in a) {                              # loop over the stored tags
        if (index($2, a[i]) == 0)               # skip redundant tags
            tags[$1] = tags[$1] "," a[i]        # reconstruct the string of tags
    }
}
END {                                           # final output
    for (i in seen) {                           # loop over the words
        if (tags[i] == "") print i              # if the tag is empty, print just the word
        else {                                  # else print the word with each tag
            split(tags[i], a, /,/)
            for (j in a) {
                print i "/" a[j]
            }
        }
    }
}
' t1.txt
t1.txt:
test/e
this/x
word/p
and
some
more/qn
test
this
and
some
more/n
word/n
Output:
some
more/qn
this/x
and
word/n
word/p
test/e
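In the output above, more/n is absent while word/n is kept. The reason is the index() test: n is found inside the stored tag qn, so it is treated as already covered, whereas n is not found inside p. The lines are also in no particular order, because awk does not define the traversal order of for (i in seen). A small check of that substring test in isolation might look like the following; the function name tag_is_covered and the "stored new" input pairs are assumptions for illustration.

```shell
# Report whether a new tag is a substring of the stored tag string,
# which is the test the script above uses to skip redundant tags.
# Each input line is a "stored new" pair chosen for illustration.
tag_is_covered() {
  awk '{ print $1, $2, (index($1, $2) > 0 ? "covered" : "new") }'
}

printf '%s\n' 'qn n' 'p n' 'n,p p' | tag_is_covered
```

If the tags in the real data are meant to be compared as whole strings rather than as substrings, this test would need to be replaced with an exact comparison.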
Answered By - tshiono
Answer Checked By - Senaida (WPSolving Volunteer)