Thursday, July 7, 2022

[SOLVED] Remove duplicate words and keep only the longer entries

July 07, 2022 awk, grep, sed

Issue

I have a text file that contains duplicate words. It is easy to sort it and extract the unique words, but some of the words have a slash followed by a tag. In that case, I need to remove the word without the slash and keep the longer entry in the file.

For example, if the file looks like this:

test/e
this/x
word/p
and
some
more/q
test
this
and
some

Using sort:

sort -u  t1.txt
and
more/q
some
test
test/e
this
this/x
word/p

But the expected result is:

and
more/q
some
test/e
this/x
word/p

Update: I used an over-simplified example. There are cases where a word has multiple tags. In that case I need to keep all of the tagged entries.

# cat t1.txt
test/e
this/x
word/p
and
some
more/q
test
this
and
some
more/n
word/n

# awk -F/ '!seen[$1]++{rows[$0]} END {for (i in rows) print i}' t1.txt
some
test/e
more/q
and
word/p
this/x

In this case, more/n and word/n should be included in the output:

some
test/e
more/q
and
word/p
this/x
more/n
word/n

Solution

Would you try the following:

awk '{
    sub("/.*$", "", prev)       # remove "tag" from the variable prev
    if ($0 != prev) print       # print $0 if it differs from prev
    prev = $0                   # update prev
}' < <(sort -r t1.txt)          # feed the reverse-sorted file

Output for the 2nd t1.txt:

word/p
word/n
this/x
test/e
some
more/q
more/n
and
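
The command above relies on the reverse sort placing each tagged entry directly before its bare word. That may not hold for words containing characters that sort before "/" (such as "-" in the C locale), and exact duplicates of a tagged line are printed more than once. As a sketch of an alternative that does not depend on sort order, a single awk can read the file twice: the first pass records which words have a tagged form, and the second pass drops the bare words that do. The function name dedupe_tagged is hypothetical, and this variant keeps the input order rather than the reverse-sorted order.

```shell
# Hypothetical helper: dedupe_tagged FILE
# Reads FILE twice, so it needs a regular file rather than a pipe.
dedupe_tagged() {
    awk -F/ '
    NR == FNR {                          # first pass over the file
        if (NF > 1) tagged[$1] = 1       # remember words that appear with a tag
        next
    }
    NF == 1 && ($1 in tagged) { next }   # second pass: drop a bare word that has a tagged form
    !printed[$0]++                       # print every remaining line once
    ' "$1" "$1"
}
```

For the 2nd t1.txt, dedupe_tagged t1.txt prints test/e, this/x, word/p, and, some, more/q, more/n, word/n, one per line.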

[Edit]
If you want to extract the words that have more than one tag (word and more) from the output above, please feed the output to:

awk -F/ 'seen[$1]++ {print $1}'

Result:

word
more
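
Note that this filter prints a word once for every additional tag, so a word with three tags would be listed twice. A minimal variation, assuming each such word should be listed only once, compares the counter against 1. The function name multi_tag_words is hypothetical.

```shell
# Hypothetical helper: list each word that occurs with more than one tag, once.
# Reads the named files, or standard input when no file is given.
multi_tag_words() {
    awk -F/ 'seen[$1]++ == 1 { print $1 }' "$@"   # true only on the second occurrence
}
```

It can be fed in the same way as the command above, for example by piping the previous output into multi_tag_words.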

[Edit2]
If a single entry can carry multiple tags (for example more/qn), please try the following:

awk -F/ '                                       # split records on "/"
!seen[$1] {                                     # the word is new (not seen)
    seen[$1]++                                  # remember the word as seen
    if ($2 != "") tags[$1] = $2                 # store the tag if not empty
    next                                        # skip to the next input
}
index(tags[$1], $2) == 0 && $2 != "" {          # if a new tag is input
    split(tags[$1], a, /,/)                     # split the string into an array
    tags[$1] = $2                               # redefine the string of tags
    for (i in a) {                              # loop over the stored tags
        if (index($2, a[i]) == 0)               # skip redundant tags
            tags[$1] = tags[$1] "," a[i]        # reconstruct the string of tags
    }
}
END {                                           # final output
    for (i in seen) {                           # loop over the words
        if (tags[i] == "") print i              # if the tag is empty, print just the word
        else {                                  # else print the word with each tag
            split(tags[i], a, /,/)
            for (j in a) {
                print i "/" a[j]
            }
        }
    }
}
' t1.txt

t1.txt:

test/e
this/x
word/p
and
some
more/qn
test
this
and
some
more/n
word/n

Output:

some
more/qn
this/x
and
word/n
word/p
test/e
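
The reason more/n disappears while word/n is kept is the index() test in the script: a new tag is treated as redundant when it occurs as a substring of the tags already stored for that word. The small sketch below only illustrates that test; the helper name tag_position is hypothetical and is not part of the answer.

```shell
# Hypothetical helper: print the position of TAG inside STORED (0 means not found)
tag_position() {
    awk -v stored="$1" -v tag="$2" 'BEGIN { print index(stored, tag) }'
}

tag_position "qn" "n"   # prints 2: "n" is already covered by "qn", so more/n is dropped
tag_position "p" "n"    # prints 0: "n" is new for "word", so word/n is kept
```

Also note that for (i in seen) visits the words in an unspecified order, so the order of the output lines may differ between awk implementations. Piping the result through sort gives a stable order.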

Answered By - tshiono

Answer Checked By - Senaida (WPSolving Volunteer)

This answer was collected from Stack Overflow and tested by PythonFixing community admins. It is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
