Issue
Greetings,
I have the following headers in a file with multiple DNA sequences:
>10 AC_000167.1
>11 AC_000168.1
>12 AC_000169.1
>MT NC_006853.1
>X AC_000187.1
>GPS_000341582.1 NW_003097887.1
>GPS_000341583.1 NW_003097888.1
>GPS_000341584.1 NW_003097889.1
>GPS_000341585.1 NW_003097890.1
>GPS_000341586.1 NW_003097891.1
I am using the following sed command to remove everything after the first whitespace:
sed -i 's/[^(>\d+?MT?X?GPS_\d+\.\d+)]\S..\d+\.\d+//g' newHeader.txt
The output should look like this:
>10
>11
>12
>MT
>X
>GPS_000341582.1
>GPS_000341583.1
>GPS_000341584.1
>GPS_000341585.1
However, the command does not seem to work and does not give any error. How can I fix this?
Solution
The original pattern cannot match because [^...] is a bracket expression: it matches a single character that is not in the listed set, so the parentheses, ? and + inside it are taken as literal characters rather than groups or quantifiers. sed also does not treat \d as a digit class the way Perl does. A substitution that matches nothing leaves each line unchanged, which is why no error is reported.
With sed:
$ sed -i -E 's/^([^ ]+) .*/\1/' file
The regular expression matches as follows:
Node | Explanation |
---|---|
^ | the beginning of the string anchor |
( | group and capture to \1: |
[^ ]+ | any character except: space (1 or more times (matching the most amount possible)) |
) | end of \1 |
' ' | space |
.* | any character except \n (0 or more times (matching the most amount possible)) |
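To try the substitution safely before using -i, it can be run on a small sample. The sketch below is only an illustration: the temporary file and the sequence lines (ACGTACGT, TTGACA) are made up, since the question shows headers only. It suggests that lines without a space, such as sequence lines, pass through unchanged.

```shell
# Hypothetical sample: two FASTA records with made-up sequence lines.
sample=$(mktemp)
printf '%s\n' \
  '>10 AC_000167.1' \
  'ACGTACGT' \
  '>GPS_000341582.1 NW_003097887.1' \
  'TTGACA' > "$sample"

# Without -i, sed prints the result to stdout and leaves the file untouched.
sed_out=$(sed -E 's/^([^ ]+) .*/\1/' "$sample")
printf '%s\n' "$sed_out"
# >10
# ACGTACGT
# >GPS_000341582.1
# TTGACA

rm -f "$sample"
```

Once the output looks right, adding -i applies the same change to the file itself.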
With grep:
grep -oP '^>\S+' file
>10
>11
>12
>MT
>X
>GPS_000341582.1
>GPS_000341583.1
>GPS_000341584.1
>GPS_000341585.1
>GPS_000341586.1
The regular expression matches as follows:
Node | Explanation |
---|---|
^ | the beginning of the string anchor |
> | > |
\S+ | non-whitespace (all but \n, \r, \t, \f, and " ") (1 or more times (matching the most amount possible)) |
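One point worth checking before choosing grep: -o prints only the matched part of matching lines, so any line that does not start with > is dropped. The sketch below illustrates this on a made-up sample (the temporary file and the sequence lines are assumptions, not taken from the question), and it assumes a grep built with -P support, such as GNU grep.

```shell
# Hypothetical sample: two FASTA records with made-up sequence lines.
sample=$(mktemp)
printf '%s\n' \
  '>10 AC_000167.1' \
  'ACGTACGT' \
  '>X AC_000187.1' \
  'TTGACA' > "$sample"

# Only the matched header prefixes are printed; sequence lines are not.
grep_out=$(grep -oP '^>\S+' "$sample")
printf '%s\n' "$grep_out"
# >10
# >X

rm -f "$sample"
```

This is fine for a headers-only file such as newHeader.txt, but for a full FASTA file the sed command is probably the safer choice because it keeps the sequence lines.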
If you want to edit in place (sponge is part of the moreutils package):
grep -oP '^>\S+' file | sponge file
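If sponge is not installed, a temporary file can serve the same purpose. Redirecting straight back with `> file` would not work, because the shell truncates the file before grep reads it. The following is a minimal sketch with a made-up sample file, not part of the original answer.

```shell
# Hypothetical sample file, created only for this demonstration.
file=$(mktemp)
printf '%s\n' '>10 AC_000167.1' '>MT NC_006853.1' > "$file"

# Write the result to a temporary file first, then replace the original
# only if grep succeeded.
tmp=$(mktemp)
grep -oP '^>\S+' "$file" > "$tmp" && mv "$tmp" "$file"

result=$(cat "$file")
printf '%s\n' "$result"
# >10
# >MT
```

The `&&` keeps the original file intact when grep finds no match or fails, since mv only runs after a successful grep.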
Answered By - Gilles Quénot
Answer Checked By - Katrina (WPSolving Volunteer)