Wednesday, April 6, 2022

[SOLVED] Multiple pattern matching guided string replacements with sed

Issue

File1 is a hard-formatted (fixed-column) PDB file containing protein coordinates:

ATOM      1  N   MET A   1     -37.809  27.446  34.618  1.00 43.34           N  
ATOM      2  CA  MET A   1     -37.480  26.307  33.746  1.00 43.34           C  
ATOM      3  C   MET A   1     -36.495  25.493  34.556  1.00 43.34           C  
ATOM      4  CB  MET A   1     -36.919  26.801  32.394  1.00 43.34           C  
ATOM      5  O   MET A   1     -35.346  25.898  34.661  1.00 43.34           O  
ATOM      6  CG  MET A   1     -36.980  25.729  31.301  1.00 43.34           C  
ATOM      7  SD  MET A   1     -35.977  26.080  29.826  1.00 43.34           S  
ATOM      8  CE  MET A   1     -36.833  27.479  29.055  1.00 43.34           C  
ATOM      9  N   GLU A   2     -36.991  24.516  35.314  1.00 37.48           N  
ATOM     10  CA  GLU A   2     -36.090  23.617  36.039  1.00 37.48           C  
ATOM     11  C   GLU A   2     -35.250  22.852  35.010  1.00 37.48           C  
ATOM     12  CB  GLU A   2     -36.860  22.659  36.957  1.00 37.48           C  
ATOM     13  O   GLU A   2     -35.776  22.534  33.938  1.00 37.48           O  
ATOM     14  CG  GLU A   2     -37.467  23.407  38.153  1.00 37.48           C 
..............................................................................
..............................................................................
..............................................................................
ATOM    981  N   CYS A 123     -15.659  -7.164  13.998  1.00 90.53           N  
ATOM    982  CA  CYS A 123     -16.801  -7.332  13.106  1.00 90.53           C  
ATOM    983  C   CYS A 123     -17.894  -8.234  13.699  1.00 90.53           C  
ATOM    984  CB  CYS A 123     -16.321  -7.886  11.757  1.00 90.53           C  
ATOM    985  O   CYS A 123     -18.918  -8.425  13.046  1.00 90.53           O  
ATOM    986  SG  CYS A 123     -15.266  -6.683  10.904  1.00 90.53           S  
ATOM    987  N   GLY A 124     -17.679  -8.840  14.874  1.00 90.37           N  
ATOM    988  CA  GLY A 124     -18.641  -9.764  15.474  1.00 90.37           C  
ATOM    989  C   GLY A 124     -18.851 -11.029  14.637  1.00 90.37           C  
ATOM    990  O   GLY A 124     -19.970 -11.514  14.513  1.00 90.37           O  
ATOM    991  N   SER A 125     -17.793 -11.536  13.996  1.00 92.09           N  
ATOM    992  CA  SER A 125     -17.837 -12.749  13.159  1.00 92.09           C  
ATOM    993  C   SER A 125     -17.220 -13.976  13.833  1.00 92.09           C  
ATOM    994  CB  SER A 125     -17.117 -12.481  11.840  1.00 92.09           C  
ATOM    995  O   SER A 125     -17.538 -15.108  13.459  1.00 92.09           O  
ATOM    996  OG  SER A 125     -17.831 -11.523  11.084  1.00 92.09           O 
....................... plus many more lines ................................. 

File2 is a list of representative lines obtained from fields 4, 5, and 6 (residue name, chain, and residue number) of the above PDB file. To keep things simple, let's consider just two lines:

GLU A   2
GLY A 124

The desired output is:

ATOM      1  N   MET A   1     -37.809  27.446  34.618  1.00 43.34           N  
ATOM      2  CA  MET A   1     -37.480  26.307  33.746  1.00 43.34           C  
ATOM      3  C   MET A   1     -36.495  25.493  34.556  1.00 43.34           C  
ATOM      4  CB  MET A   1     -36.919  26.801  32.394  1.00 43.34           C  
ATOM      5  O   MET A   1     -35.346  25.898  34.661  1.00 43.34           O  
ATOM      6  CG  MET A   1     -36.980  25.729  31.301  1.00 43.34           C  
ATOM      7  SD  MET A   1     -35.977  26.080  29.826  1.00 43.34           S  
ATOM      8  CE  MET A   1     -36.833  27.479  29.055  1.00 43.34           C  
ATOM      9  N   GLU A   2     -36.991  24.516  35.314  1.00 00.00           N  
ATOM     10  CA  GLU A   2     -36.090  23.617  36.039  1.00 00.00           C  
ATOM     11  C   GLU A   2     -35.250  22.852  35.010  1.00 00.00           C  
ATOM     12  CB  GLU A   2     -36.860  22.659  36.957  1.00 00.00           C  
ATOM     13  O   GLU A   2     -35.776  22.534  33.938  1.00 00.00           O  
ATOM     14  CG  GLU A   2     -37.467  23.407  38.153  1.00 00.00           C 
..............................................................................
..............................................................................
..............................................................................
ATOM    981  N   CYS A 123     -15.659  -7.164  13.998  1.00 90.53           N  
ATOM    982  CA  CYS A 123     -16.801  -7.332  13.106  1.00 90.53           C  
ATOM    983  C   CYS A 123     -17.894  -8.234  13.699  1.00 90.53           C  
ATOM    984  CB  CYS A 123     -16.321  -7.886  11.757  1.00 90.53           C  
ATOM    985  O   CYS A 123     -18.918  -8.425  13.046  1.00 90.53           O  
ATOM    986  SG  CYS A 123     -15.266  -6.683  10.904  1.00 90.53           S  
ATOM    987  N   GLY A 124     -17.679  -8.840  14.874  1.00 00.00           N  
ATOM    988  CA  GLY A 124     -18.641  -9.764  15.474  1.00 00.00           C  
ATOM    989  C   GLY A 124     -18.851 -11.029  14.637  1.00 00.00           C  
ATOM    990  O   GLY A 124     -19.970 -11.514  14.513  1.00 00.00           O  
ATOM    991  N   SER A 125     -17.793 -11.536  13.996  1.00 92.09           N  
ATOM    992  CA  SER A 125     -17.837 -12.749  13.159  1.00 92.09           C  
ATOM    993  C   SER A 125     -17.220 -13.976  13.833  1.00 92.09           C  
ATOM    994  CB  SER A 125     -17.117 -12.481  11.840  1.00 92.09           C  
ATOM    995  O   SER A 125     -17.538 -15.108  13.459  1.00 92.09           O  
ATOM    996  OG  SER A 125     -17.831 -11.523  11.084  1.00 92.09           O 

i.e. a modified PDB file with 00.00 in the 11th field wherever a line of File1 contains an entry from File2.

I already know how to do this with a Bash while-read loop and awk, but those tools change the column layout, so the output has to be reformatted or its format specified explicitly. With hundreds of files to process, that is not practical here. To avoid these problems I looked for a solution based on sed. I have a working solution when I give a single search pattern explicitly, i.e. the following code works:

digits=00.00
sed "/GLU A   2/s/\(.\{61\}\)\(.\{5\}\)/\1$digits/" File1.pdb  > out.pdb

but the following does not (the File1 lines are unchanged), and I could not figure out why:

digits=00.00
while read pattern; do 
    sed "/$pattern/s/\(.\{61\}\)\(.\{5\}\)/\1$digits/" File1.pdb > out.pdb ;
done < File2.txt

Sorry for the lengthy message. Thanks in advance for any help.

@anubhava:

Using my real data, this is what happens at the first substitution site:

ATOM    293  CE1 HIS A  38     -18.278  19.735  13.486  1.00 67.94           C  
ATOM    294  NE2 HIS A  38     -18.518  18.594  14.144  1.00 67.94           N  
ATOM    295  N   GLY A  39     -13.836  00.00   9.206  1.00 71.50           N  
ATOM    296  CA  GLY A  39     -12.628  00.00   8.447  1.00 71.50           C  
ATOM    297  C   GLY A  39     -11.358  00.00   9.286  1.00 71.50           C  
ATOM    298  O   GLY A  39     -11.411  18.636  10.344  1.00 00.00           O  
ATOM    299  N   PRO A  40     -10.180  17.577   8.797  1.00 71.93           N  
ATOM    300  CA  PRO A  40      -8.908  17.719   9.520  1.00 71.93           C  
ATOM    301  C   PRO A  40      -8.580  19.169   9.912  1.00 71.93           C  

In this case the site is /GLY A 39/. As you can see, some lines are shifted and there are unwanted substitutions in the 8th field. Strangely enough, these problems occur only for the first replacement, i.e. the remaining output is just perfect. Thanks.


Solution

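Each pass of your loop reads the original File1.pdb again and overwrites out.pdb, so only the substitutions for the last pattern read can survive. Instead, run sed in place (-i) inside a while loop that reads file2 line by line, so that the edits accumulate. Each pass targets only the lines that match the current line of file2 and carries out the substitution on those lines, where: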

s/\(.*\)[0-9]\{2\}\.[0-9]\{2\}\([[:space:]]\+.*$\)/ - The greedy \(.*\) groups everything up to the last number of the form dd.dd that is followed by whitespace, and is returned with back-reference \1. The matched number itself is in neither group, so it is dropped. The second group captures everything from the following whitespace to the end of the line and is returned with back-reference \2. The replacement \100.00\2 therefore reads as \1, the literal 00.00, then \2.
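
To see the substitution in isolation, this small sketch feeds a single ATOM record from the question through it. It writes the GNU-specific \+ as the equivalent POSIX interval \{1,\}; the variable names line and result are only for illustration.

```shell
# One ATOM record copied from the question.
line='ATOM      9  N   GLU A   2     -36.991  24.516  35.314  1.00 37.48           N'

# Apply the substitution only if the line contains the residue label.
# \(.*\) is greedy, so the LAST dd.dd followed by whitespace is the one replaced.
result=$(printf '%s\n' "$line" |
  sed '/GLU A   2/s/\(.*\)[0-9]\{2\}\.[0-9]\{2\}\([[:space:]]\{1,\}.*$\)/\100.00\2/')

printf '%s\n' "$result"
```

Because five characters (dd.dd) are replaced by five characters (00.00), the record keeps its length and column alignment.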

$ cat file1
ATOM      1  N   MET A   1     -37.809  27.446  34.618  1.00 43.34           N
ATOM      2  CA  MET A   1     -37.480  26.307  33.746  1.00 43.34           C
ATOM      3  C   MET A   1     -36.495  25.493  34.556  1.00 43.34           C
ATOM      4  CB  MET A   1     -36.919  26.801  32.394  1.00 43.34           C
ATOM      5  O   MET A   1     -35.346  25.898  34.661  1.00 43.34           O
ATOM      6  CG  MET A   1     -36.980  25.729  31.301  1.00 43.34           C
ATOM      7  SD  MET A   1     -35.977  26.080  29.826  1.00 43.34           S
ATOM      8  CE  MET A   1     -36.833  27.479  29.055  1.00 43.34           C
ATOM      9  N   GLU A   2     -36.991  24.516  35.314  1.00 37.48           N
ATOM     10  CA  GLU A   2     -36.090  23.617  36.039  1.00 37.48           C
ATOM     11  C   GLU A   2     -35.250  22.852  35.010  1.00 37.48           C
ATOM     12  CB  GLU A   2     -36.860  22.659  36.957  1.00 37.48           C
ATOM     13  O   GLU A   2     -35.776  22.534  33.938  1.00 37.48           O
ATOM     14  CG  GLU A   2     -37.467  23.407  38.153  1.00 37.48           C
ATOM    981  N   CYS A 123     -15.659  -7.164  13.998  1.00 90.53           N
ATOM    982  CA  CYS A 123     -16.801  -7.332  13.106  1.00 90.53           C
ATOM    983  C   CYS A 123     -17.894  -8.234  13.699  1.00 90.53           C
ATOM    984  CB  CYS A 123     -16.321  -7.886  11.757  1.00 90.53           C
ATOM    985  O   CYS A 123     -18.918  -8.425  13.046  1.00 90.53           O
ATOM    986  SG  CYS A 123     -15.266  -6.683  10.904  1.00 90.53           S
ATOM    987  N   GLY A 124     -17.679  -8.840  14.874  1.00 90.37           N
ATOM    988  CA  GLY A 124     -18.641  -9.764  15.474  1.00 90.37           C
ATOM    989  C   GLY A 124     -18.851 -11.029  14.637  1.00 90.37           C
ATOM    990  O   GLY A 124     -19.970 -11.514  14.513  1.00 90.37           O
ATOM    991  N   SER A 125     -17.793 -11.536  13.996  1.00 92.09           N
ATOM    992  CA  SER A 125     -17.837 -12.749  13.159  1.00 92.09           C
ATOM    993  C   SER A 125     -17.220 -13.976  13.833  1.00 92.09           C
ATOM    994  CB  SER A 125     -17.117 -12.481  11.840  1.00 92.09           C
ATOM    995  O   SER A 125     -17.538 -15.108  13.459  1.00 92.09           O
ATOM    996  OG  SER A 125     -17.831 -11.523  11.084  1.00 92.09           O
$ while read -r line; do sed -i.bak "/$line/s/\(.*\)[0-9]\{2\}\.[0-9]\{2\}\([[:space:]]\+.*$\)/\100.00\2/" file1; done < file2
$ cat file1
ATOM      1  N   MET A   1     -37.809  27.446  34.618  1.00 43.34           N
ATOM      2  CA  MET A   1     -37.480  26.307  33.746  1.00 43.34           C
ATOM      3  C   MET A   1     -36.495  25.493  34.556  1.00 43.34           C
ATOM      4  CB  MET A   1     -36.919  26.801  32.394  1.00 43.34           C
ATOM      5  O   MET A   1     -35.346  25.898  34.661  1.00 43.34           O
ATOM      6  CG  MET A   1     -36.980  25.729  31.301  1.00 43.34           C
ATOM      7  SD  MET A   1     -35.977  26.080  29.826  1.00 43.34           S
ATOM      8  CE  MET A   1     -36.833  27.479  29.055  1.00 43.34           C
ATOM      9  N   GLU A   2     -36.991  24.516  35.314  1.00 00.00           N
ATOM     10  CA  GLU A   2     -36.090  23.617  36.039  1.00 00.00           C
ATOM     11  C   GLU A   2     -35.250  22.852  35.010  1.00 00.00           C
ATOM     12  CB  GLU A   2     -36.860  22.659  36.957  1.00 00.00           C
ATOM     13  O   GLU A   2     -35.776  22.534  33.938  1.00 00.00           O
ATOM     14  CG  GLU A   2     -37.467  23.407  38.153  1.00 00.00           C
ATOM    981  N   CYS A 123     -15.659  -7.164  13.998  1.00 90.53           N
ATOM    982  CA  CYS A 123     -16.801  -7.332  13.106  1.00 90.53           C
ATOM    983  C   CYS A 123     -17.894  -8.234  13.699  1.00 90.53           C
ATOM    984  CB  CYS A 123     -16.321  -7.886  11.757  1.00 90.53           C
ATOM    985  O   CYS A 123     -18.918  -8.425  13.046  1.00 90.53           O
ATOM    986  SG  CYS A 123     -15.266  -6.683  10.904  1.00 90.53           S
ATOM    987  N   GLY A 124     -17.679  -8.840  14.874  1.00 00.00           N
ATOM    988  CA  GLY A 124     -18.641  -9.764  15.474  1.00 00.00           C
ATOM    989  C   GLY A 124     -18.851 -11.029  14.637  1.00 00.00           C
ATOM    990  O   GLY A 124     -19.970 -11.514  14.513  1.00 00.00           O
ATOM    991  N   SER A 125     -17.793 -11.536  13.996  1.00 92.09           N
ATOM    992  CA  SER A 125     -17.837 -12.749  13.159  1.00 92.09           C
ATOM    993  C   SER A 125     -17.220 -13.976  13.833  1.00 92.09           C
ATOM    994  CB  SER A 125     -17.117 -12.481  11.840  1.00 92.09           C
ATOM    995  O   SER A 125     -17.538 -15.108  13.459  1.00 92.09           O
ATOM    996  OG  SER A 125     -17.831 -11.523  11.084  1.00 92.09           O
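
As a possible alternative to running sed once per pattern, file2 can first be turned into a sed script that is then applied to file1 in a single pass. This is only a sketch and not part of the original answer. It assumes the value to overwrite always occupies characters 62-66 of the line (the same 61/5 column count used in the question), that the entries in file2 contain no slashes or regex metacharacters, and the names workdir, zero.sed and out.pdb are made up for the example.

```shell
# Scratch directory so that no real data is touched.
workdir=$(mktemp -d)

# Small sample in the same fixed-column layout as file1 in the question.
cat > "$workdir/file1" <<'EOF'
ATOM      8  CE  MET A   1     -36.833  27.479  29.055  1.00 43.34           C
ATOM      9  N   GLU A   2     -36.991  24.516  35.314  1.00 37.48           N
ATOM    986  SG  CYS A 123     -15.266  -6.683  10.904  1.00 90.53           S
ATOM    987  N   GLY A 124     -17.679  -8.840  14.874  1.00 90.37           N
ATOM    991  N   SER A 125     -17.793 -11.536  13.996  1.00 92.09           N
EOF

cat > "$workdir/file2" <<'EOF'
GLU A   2
GLY A 124
EOF

# Turn every non-empty line of file2 into one sed command of the form
#   /GLU A   2/s/^\(.\{61\}\).\{5\}/\100.00/
# i.e. keep the first 61 characters and overwrite the next 5 with 00.00.
sed -n '/./s|.*|/&/s/^\\(.\\{61\\}\\).\\{5\\}/\\100.00/|p' \
  "$workdir/file2" > "$workdir/zero.sed"

# Apply all generated commands in a single pass over file1.
sed -f "$workdir/zero.sed" "$workdir/file1" > "$workdir/out.pdb"

cat "$workdir/out.pdb"
```

Anchoring the replacement to a fixed column avoids touching any other dd.dd number on the line, and a single pass means file1 is read once regardless of how many entries file2 has.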


Answered By - HatLess
Answer Checked By - Clifford M. (WPSolving Volunteer)