Issue
File1 is a fixed-format PDB file containing protein coordinates:
ATOM 1 N MET A 1 -37.809 27.446 34.618 1.00 43.34 N
ATOM 2 CA MET A 1 -37.480 26.307 33.746 1.00 43.34 C
ATOM 3 C MET A 1 -36.495 25.493 34.556 1.00 43.34 C
ATOM 4 CB MET A 1 -36.919 26.801 32.394 1.00 43.34 C
ATOM 5 O MET A 1 -35.346 25.898 34.661 1.00 43.34 O
ATOM 6 CG MET A 1 -36.980 25.729 31.301 1.00 43.34 C
ATOM 7 SD MET A 1 -35.977 26.080 29.826 1.00 43.34 S
ATOM 8 CE MET A 1 -36.833 27.479 29.055 1.00 43.34 C
ATOM 9 N GLU A 2 -36.991 24.516 35.314 1.00 37.48 N
ATOM 10 CA GLU A 2 -36.090 23.617 36.039 1.00 37.48 C
ATOM 11 C GLU A 2 -35.250 22.852 35.010 1.00 37.48 C
ATOM 12 CB GLU A 2 -36.860 22.659 36.957 1.00 37.48 C
ATOM 13 O GLU A 2 -35.776 22.534 33.938 1.00 37.48 O
ATOM 14 CG GLU A 2 -37.467 23.407 38.153 1.00 37.48 C
..............................................................................
..............................................................................
..............................................................................
ATOM 981 N CYS A 123 -15.659 -7.164 13.998 1.00 90.53 N
ATOM 982 CA CYS A 123 -16.801 -7.332 13.106 1.00 90.53 C
ATOM 983 C CYS A 123 -17.894 -8.234 13.699 1.00 90.53 C
ATOM 984 CB CYS A 123 -16.321 -7.886 11.757 1.00 90.53 C
ATOM 985 O CYS A 123 -18.918 -8.425 13.046 1.00 90.53 O
ATOM 986 SG CYS A 123 -15.266 -6.683 10.904 1.00 90.53 S
ATOM 987 N GLY A 124 -17.679 -8.840 14.874 1.00 90.37 N
ATOM 988 CA GLY A 124 -18.641 -9.764 15.474 1.00 90.37 C
ATOM 989 C GLY A 124 -18.851 -11.029 14.637 1.00 90.37 C
ATOM 990 O GLY A 124 -19.970 -11.514 14.513 1.00 90.37 O
ATOM 991 N SER A 125 -17.793 -11.536 13.996 1.00 92.09 N
ATOM 992 CA SER A 125 -17.837 -12.749 13.159 1.00 92.09 C
ATOM 993 C SER A 125 -17.220 -13.976 13.833 1.00 92.09 C
ATOM 994 CB SER A 125 -17.117 -12.481 11.840 1.00 92.09 C
ATOM 995 O SER A 125 -17.538 -15.108 13.459 1.00 92.09 O
ATOM 996 OG SER A 125 -17.831 -11.523 11.084 1.00 92.09 O
....................... plus many more lines .................................
File2 is a list of representative entries built from fields 4, 5, and 6 of the above PDB file. To keep things simple, let's consider just two lines:
GLU A 2
GLY A 124
The desired output is:
ATOM 1 N MET A 1 -37.809 27.446 34.618 1.00 43.34 N
ATOM 2 CA MET A 1 -37.480 26.307 33.746 1.00 43.34 C
ATOM 3 C MET A 1 -36.495 25.493 34.556 1.00 43.34 C
ATOM 4 CB MET A 1 -36.919 26.801 32.394 1.00 43.34 C
ATOM 5 O MET A 1 -35.346 25.898 34.661 1.00 43.34 O
ATOM 6 CG MET A 1 -36.980 25.729 31.301 1.00 43.34 C
ATOM 7 SD MET A 1 -35.977 26.080 29.826 1.00 43.34 S
ATOM 8 CE MET A 1 -36.833 27.479 29.055 1.00 43.34 C
ATOM 9 N GLU A 2 -36.991 24.516 35.314 1.00 00.00 N
ATOM 10 CA GLU A 2 -36.090 23.617 36.039 1.00 00.00 C
ATOM 11 C GLU A 2 -35.250 22.852 35.010 1.00 00.00 C
ATOM 12 CB GLU A 2 -36.860 22.659 36.957 1.00 00.00 C
ATOM 13 O GLU A 2 -35.776 22.534 33.938 1.00 00.00 O
ATOM 14 CG GLU A 2 -37.467 23.407 38.153 1.00 00.00 C
..............................................................................
..............................................................................
..............................................................................
ATOM 981 N CYS A 123 -15.659 -7.164 13.998 1.00 90.53 N
ATOM 982 CA CYS A 123 -16.801 -7.332 13.106 1.00 90.53 C
ATOM 983 C CYS A 123 -17.894 -8.234 13.699 1.00 90.53 C
ATOM 984 CB CYS A 123 -16.321 -7.886 11.757 1.00 90.53 C
ATOM 985 O CYS A 123 -18.918 -8.425 13.046 1.00 90.53 O
ATOM 986 SG CYS A 123 -15.266 -6.683 10.904 1.00 90.53 S
ATOM 987 N GLY A 124 -17.679 -8.840 14.874 1.00 00.00 N
ATOM 988 CA GLY A 124 -18.641 -9.764 15.474 1.00 00.00 C
ATOM 989 C GLY A 124 -18.851 -11.029 14.637 1.00 00.00 C
ATOM 990 O GLY A 124 -19.970 -11.514 14.513 1.00 00.00 O
ATOM 991 N SER A 125 -17.793 -11.536 13.996 1.00 92.09 N
ATOM 992 CA SER A 125 -17.837 -12.749 13.159 1.00 92.09 C
ATOM 993 C SER A 125 -17.220 -13.976 13.833 1.00 92.09 C
ATOM 994 CB SER A 125 -17.117 -12.481 11.840 1.00 92.09 C
ATOM 995 O SER A 125 -17.538 -15.108 13.459 1.00 92.09 O
ATOM 996 OG SER A 125 -17.831 -11.523 11.084 1.00 92.09 O
i.e. a modified PDB file with 00.00 in the 11th field of every File1 line that contains a File2 entry.
I already know how to do this with a Bash while-read loop and awk, but those tools change the column layout, so the output has to be reformatted or the output format specified explicitly. With hundreds of files to process, that is not practical. To avoid these problems I looked for a solution based on sed. I have a working solution when I give a single search pattern explicitly, i.e. the following code works:
digits=00.00
sed "/GLU A 2/s/\(.\{61\}\)\(.\{5\}\)/\1$digits/" File1.pdb > out.pdb
but the following does not (the File1 lines are unchanged) and I did not manage to figure out why:
digits=00.00
while read pattern; do
sed "/$pattern/s/\(.\{61\}\)\(.\{5\}\)/\1$digits/" File1.pdb > out.pdb ;
done < File2.txt
Sorry for the lengthy message. Thanks in advance for any help.
@anubhava:
Using my real data, this is what happens at the first substitution site:
ATOM 293 CE1 HIS A 38 -18.278 19.735 13.486 1.00 67.94 C
ATOM 294 NE2 HIS A 38 -18.518 18.594 14.144 1.00 67.94 N
ATOM 295 N GLY A 39 -13.836 00.00 9.206 1.00 71.50 N
ATOM 296 CA GLY A 39 -12.628 00.00 8.447 1.00 71.50 C
ATOM 297 C GLY A 39 -11.358 00.00 9.286 1.00 71.50 C
ATOM 298 O GLY A 39 -11.411 18.636 10.344 1.00 00.00 O
ATOM 299 N PRO A 40 -10.180 17.577 8.797 1.00 71.93 N
ATOM 300 CA PRO A 40 -8.908 17.719 9.520 1.00 71.93 C
ATOM 301 C PRO A 40 -8.580 19.169 9.912 1.00 71.93 C
In this case the site is /GLY A 39/. As you can see, some lines are shifted and there are unwanted substitutions in the 8th field. Strangely enough, these problems occur only for the first replacement, i.e. the remaining output is just perfect. Thanks.
Solution
Using sed in a while loop that reads file2 line by line, you can target only the lines that match an entry in file2 and carry out the substitution on those lines, where the search part of the substitution is:
s/\(.*\)[0-9]\{2\}\.[0-9]\{2\}\([[:space:]]\+.*$\)/
- Group everything up to the last run of digits that matches the NN.NN pattern and retain it, to be returned with back-reference \1
- Exclude the number matched by the pattern, then group everything after it, from the whitespace to the end of the line, and return it with back-reference \2
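To see what the two groups capture, the substitution can be tried on a single line first. This is only a small sketch: the variable names line and result are made up for the illustration, and \+ inside a basic regular expression is a GNU sed extension, so GNU sed is assumed.

```shell
# One sample line taken from file1 (whitespace as shown in this post).
line='ATOM 9 N GLU A 2 -36.991 24.516 35.314 1.00 37.48 N'

# \(.*\) is greedy, so it swallows as much as it can; the NN.NN that gets
# replaced is therefore the LAST one that is followed by whitespace.
# Assumption: GNU sed, because \+ is not part of POSIX basic regex syntax.
result=$(printf '%s\n' "$line" |
  sed 's/\(.*\)[0-9]\{2\}\.[0-9]\{2\}\([[:space:]]\+.*$\)/\100.00\2/')

printf '%s\n' "$result"
```

Because the match is defined relative to the end of the line rather than by a character count, it does not depend on how many columns precede the value.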
$ cat file1
ATOM 1 N MET A 1 -37.809 27.446 34.618 1.00 43.34 N
ATOM 2 CA MET A 1 -37.480 26.307 33.746 1.00 43.34 C
ATOM 3 C MET A 1 -36.495 25.493 34.556 1.00 43.34 C
ATOM 4 CB MET A 1 -36.919 26.801 32.394 1.00 43.34 C
ATOM 5 O MET A 1 -35.346 25.898 34.661 1.00 43.34 O
ATOM 6 CG MET A 1 -36.980 25.729 31.301 1.00 43.34 C
ATOM 7 SD MET A 1 -35.977 26.080 29.826 1.00 43.34 S
ATOM 8 CE MET A 1 -36.833 27.479 29.055 1.00 43.34 C
ATOM 9 N GLU A 2 -36.991 24.516 35.314 1.00 37.48 N
ATOM 10 CA GLU A 2 -36.090 23.617 36.039 1.00 37.48 C
ATOM 11 C GLU A 2 -35.250 22.852 35.010 1.00 37.48 C
ATOM 12 CB GLU A 2 -36.860 22.659 36.957 1.00 37.48 C
ATOM 13 O GLU A 2 -35.776 22.534 33.938 1.00 37.48 O
ATOM 14 CG GLU A 2 -37.467 23.407 38.153 1.00 37.48 C
ATOM 981 N CYS A 123 -15.659 -7.164 13.998 1.00 90.53 N
ATOM 982 CA CYS A 123 -16.801 -7.332 13.106 1.00 90.53 C
ATOM 983 C CYS A 123 -17.894 -8.234 13.699 1.00 90.53 C
ATOM 984 CB CYS A 123 -16.321 -7.886 11.757 1.00 90.53 C
ATOM 985 O CYS A 123 -18.918 -8.425 13.046 1.00 90.53 O
ATOM 986 SG CYS A 123 -15.266 -6.683 10.904 1.00 90.53 S
ATOM 987 N GLY A 124 -17.679 -8.840 14.874 1.00 90.37 N
ATOM 988 CA GLY A 124 -18.641 -9.764 15.474 1.00 90.37 C
ATOM 989 C GLY A 124 -18.851 -11.029 14.637 1.00 90.37 C
ATOM 990 O GLY A 124 -19.970 -11.514 14.513 1.00 90.37 O
ATOM 991 N SER A 125 -17.793 -11.536 13.996 1.00 92.09 N
ATOM 992 CA SER A 125 -17.837 -12.749 13.159 1.00 92.09 C
ATOM 993 C SER A 125 -17.220 -13.976 13.833 1.00 92.09 C
ATOM 994 CB SER A 125 -17.117 -12.481 11.840 1.00 92.09 C
ATOM 995 O SER A 125 -17.538 -15.108 13.459 1.00 92.09 O
ATOM 996 OG SER A 125 -17.831 -11.523 11.084 1.00 92.09 O
$ while read -r line; do sed -i.bak "/$line/s/\(.*\)[0-9]\{2\}\.[0-9]\{2\}\([[:space:]]\+.*$\)/\100.00\2/" file1; done < file2
$ cat file1
ATOM 1 N MET A 1 -37.809 27.446 34.618 1.00 43.34 N
ATOM 2 CA MET A 1 -37.480 26.307 33.746 1.00 43.34 C
ATOM 3 C MET A 1 -36.495 25.493 34.556 1.00 43.34 C
ATOM 4 CB MET A 1 -36.919 26.801 32.394 1.00 43.34 C
ATOM 5 O MET A 1 -35.346 25.898 34.661 1.00 43.34 O
ATOM 6 CG MET A 1 -36.980 25.729 31.301 1.00 43.34 C
ATOM 7 SD MET A 1 -35.977 26.080 29.826 1.00 43.34 S
ATOM 8 CE MET A 1 -36.833 27.479 29.055 1.00 43.34 C
ATOM 9 N GLU A 2 -36.991 24.516 35.314 1.00 00.00 N
ATOM 10 CA GLU A 2 -36.090 23.617 36.039 1.00 00.00 C
ATOM 11 C GLU A 2 -35.250 22.852 35.010 1.00 00.00 C
ATOM 12 CB GLU A 2 -36.860 22.659 36.957 1.00 00.00 C
ATOM 13 O GLU A 2 -35.776 22.534 33.938 1.00 00.00 O
ATOM 14 CG GLU A 2 -37.467 23.407 38.153 1.00 00.00 C
ATOM 981 N CYS A 123 -15.659 -7.164 13.998 1.00 90.53 N
ATOM 982 CA CYS A 123 -16.801 -7.332 13.106 1.00 90.53 C
ATOM 983 C CYS A 123 -17.894 -8.234 13.699 1.00 90.53 C
ATOM 984 CB CYS A 123 -16.321 -7.886 11.757 1.00 90.53 C
ATOM 985 O CYS A 123 -18.918 -8.425 13.046 1.00 90.53 O
ATOM 986 SG CYS A 123 -15.266 -6.683 10.904 1.00 90.53 S
ATOM 987 N GLY A 124 -17.679 -8.840 14.874 1.00 00.00 N
ATOM 988 CA GLY A 124 -18.641 -9.764 15.474 1.00 00.00 C
ATOM 989 C GLY A 124 -18.851 -11.029 14.637 1.00 00.00 C
ATOM 990 O GLY A 124 -19.970 -11.514 14.513 1.00 00.00 O
ATOM 991 N SER A 125 -17.793 -11.536 13.996 1.00 92.09 N
ATOM 992 CA SER A 125 -17.837 -12.749 13.159 1.00 92.09 C
ATOM 993 C SER A 125 -17.220 -13.976 13.833 1.00 92.09 C
ATOM 994 CB SER A 125 -17.117 -12.481 11.840 1.00 92.09 C
ATOM 995 O SER A 125 -17.538 -15.108 13.459 1.00 92.09 O
ATOM 996 OG SER A 125 -17.831 -11.523 11.084 1.00 92.09 O
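As a possible variation (not part of the answer above), the patterns in file2 can be turned into one sed script so that file1 is read only once and is not edited in place. The sketch below rests on several assumptions: the file names under $tmp and edit.sed are invented for the example, GNU sed is assumed because of \+, the entries in file2 must match the spacing used in file1 exactly, and a trailing space is appended to each pattern so that "GLU A 2" does not also match "GLU A 21".

```shell
# Hypothetical sample data in a scratch directory.
tmp=$(mktemp -d)
cat > "$tmp/file1" <<'EOF'
ATOM 8 CE MET A 1 -36.833 27.479 29.055 1.00 43.34 C
ATOM 9 N GLU A 2 -36.991 24.516 35.314 1.00 37.48 N
ATOM 986 SG CYS A 123 -15.266 -6.683 10.904 1.00 90.53 S
ATOM 987 N GLY A 124 -17.679 -8.840 14.874 1.00 90.37 N
EOF
printf 'GLU A 2\nGLY A 124\n' > "$tmp/file2"

# Write one "/pattern /s/.../" command per file2 entry into a sed script.
# The doubled backslashes are for printf; the script receives single ones.
while IFS= read -r pat; do
  [ -n "$pat" ] || continue   # skip empty lines
  printf '/%s /s/\\(.*\\)[0-9]\\{2\\}\\.[0-9]\\{2\\}\\([[:space:]]\\+.*$\\)/\\100.00\\2/\n' "$pat"
done < "$tmp/file2" > "$tmp/edit.sed"

# Apply every command in a single pass; file1 itself is left untouched.
sed -f "$tmp/edit.sed" "$tmp/file1" > "$tmp/out"
cat "$tmp/out"
```

With the sample data, only the GLU A 2 and GLY A 124 lines end up with 00.00 in the 11th field. Entries containing characters that are special to sed (such as / or .) would need escaping before being written into the script.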
Answered By - HatLess
Answer Checked By - Clifford M. (WPSolving Volunteer)