Friday, October 29, 2021

[SOLVED] Concatenate the content of a file with an increment of columns that contain only 1

Issue

Following on from my previous post (how to concatenate the content of a file with an increment of the last number of the column), I need help with a slightly different issue.

Now I would like to increment (from 1 up to 5) every numeric column (the 2nd, 3rd, ... nth, each of which will certainly start and end with "1" only) except the first one (which may start from 1 but can end with any number).

input file:

TCTA    3   TCTG    1   TCTA    1
TCTA    4   TCTG    1   TCTA    1
TCTA    5   TCTG    1   TCTA    1
TCTA    6   TCTG    1   TCTA    1
TCTA    7   TCTG    1   TCTA    1
TCTA    8   TCTG    1   TCTA    1
TCTA    9   TCTG    1   TCTA    1
TCTA    10  TCTG    1   TCTA    1
TCTA    11  TCTG    1   TCTA    1
TCTA    12  TCTG    1   TCTA    1
TCTA    13  TCTG    1   TCTA    1
TCTA    14  TCTG    1   TCTA    1
TCTA    15  TCTG    1   TCTA    1

output required:

TCTA    3   TCTG    1   TCTA    1
TCTA    4   TCTG    1   TCTA    1
TCTA    5   TCTG    1   TCTA    1
TCTA    6   TCTG    1   TCTA    1
TCTA    7   TCTG    1   TCTA    1
TCTA    8   TCTG    1   TCTA    1
TCTA    9   TCTG    1   TCTA    1
TCTA    10  TCTG    1   TCTA    1
TCTA    11  TCTG    1   TCTA    1
TCTA    12  TCTG    1   TCTA    1
TCTA    13  TCTG    1   TCTA    1
TCTA    14  TCTG    1   TCTA    1
TCTA    15  TCTG    1   TCTA    1
TCTA    3   TCTG    2   TCTA    2
TCTA    4   TCTG    2   TCTA    2
TCTA    5   TCTG    2   TCTA    2
TCTA    6   TCTG    2   TCTA    2
TCTA    7   TCTG    2   TCTA    2
TCTA    8   TCTG    2   TCTA    2
TCTA    9   TCTG    2   TCTA    2
TCTA    10  TCTG    2   TCTA    2
TCTA    11  TCTG    2   TCTA    2
TCTA    12  TCTG    2   TCTA    2
TCTA    13  TCTG    2   TCTA    2
TCTA    14  TCTG    2   TCTA    2
TCTA    15  TCTG    2   TCTA    2
TCTA    3   TCTG    3   TCTA    3
TCTA    4   TCTG    3   TCTA    3
TCTA    5   TCTG    3   TCTA    3
TCTA    6   TCTG    3   TCTA    3
TCTA    7   TCTG    3   TCTA    3
TCTA    8   TCTG    3   TCTA    3
TCTA    9   TCTG    3   TCTA    3
TCTA    10  TCTG    3   TCTA    3
TCTA    11  TCTG    3   TCTA    3
TCTA    12  TCTG    3   TCTA    3
TCTA    13  TCTG    3   TCTA    3
TCTA    14  TCTG    3   TCTA    3
TCTA    15  TCTG    3   TCTA    3
TCTA    3   TCTG    4   TCTA    4
TCTA    4   TCTG    4   TCTA    4
TCTA    5   TCTG    4   TCTA    4
TCTA    6   TCTG    4   TCTA    4
TCTA    7   TCTG    4   TCTA    4
TCTA    8   TCTG    4   TCTA    4
TCTA    9   TCTG    4   TCTA    4
TCTA    10  TCTG    4   TCTA    4
TCTA    11  TCTG    4   TCTA    4
TCTA    12  TCTG    4   TCTA    4
TCTA    13  TCTG    4   TCTA    4
TCTA    14  TCTG    4   TCTA    4
TCTA    15  TCTG    4   TCTA    4
TCTA    3   TCTG    5   TCTA    5
TCTA    4   TCTG    5   TCTA    5
TCTA    5   TCTG    5   TCTA    5
TCTA    6   TCTG    5   TCTA    5
TCTA    7   TCTG    5   TCTA    5
TCTA    8   TCTG    5   TCTA    5
TCTA    9   TCTG    5   TCTA    5
TCTA    10  TCTG    5   TCTA    5
TCTA    11  TCTG    5   TCTA    5
TCTA    12  TCTG    5   TCTA    5
TCTA    13  TCTG    5   TCTA    5
TCTA    14  TCTG    5   TCTA    5
TCTA    15  TCTG    5   TCTA    5

I tried to adapt the code from the previous post, but without success so far:

awk -v n=3 '
{
   rec = rec $0 RS
}
1
END {
   for (i=2; i<=n; ++i)
      printf "%s", gensub(/[0-9]+(\n|$)/, i "\\1", "g", rec)
}' file

The issue here is that it only handles the last column; however, I need to handle every numeric column but the first.

Please help.

Thanks


Solution

Assumptions:

  • first numeric column does not include the single digit 1
  • for all other columns to be incremented the source/input column's value is 1 (ie, we're only going to increment columns that contain a single 1)
  • net result: replace all occurrences of the single digit 1 with incremented values
  • the number of columns is not known beforehand
  • the number of columns could vary from row to row
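
The first assumption can be checked before running anything else. The following is only a sketch: check_col2 is a hypothetical helper name, and it assumes the first numeric column is field 2, as in the sample data.

```shell
# Hypothetical pre-flight check (not part of the original answer).
# Assumes the first numeric column is field 2, as in the sample input.
check_col2() {
    awk '$2 == "1" { printf "line %d: column 2 is a bare 1\n", NR; bad = 1 }
         END { exit bad + 0 }' "$@"
}

# Reads stdin here; a file name can be passed as an argument instead.
printf 'TCTA 3 TCTG 1\nTCTA 1 TCTG 1\n' | check_col2 || echo "assumption violated"
```

If the check reports a line, a blanket replacement of every standalone 1 would also rewrite that first numeric column, so the approach below would need adjusting.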

Modifying OP's current awk code (which relies on gensub(), so GNU awk is required) to replace all standalone 1's with an incremented value:

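awk -v n=5 '
{
   rec = rec $0 RS        # collect the whole input, newlines included
}
1                         # print each input line unchanged
END {
   for (i=2; i<=n; ++i) {
       # 1st pass; can skip a 1 that directly follows another replaced 1
       x=gensub(/([^[:digit:]])1([^[:digit:]])/, "\\1" i "\\2", "g", rec)
       # 2nd pass picks up any 1 skipped by the 1st pass
       printf "%s", gensub(/([^[:digit:]])1([^[:digit:]])/, "\\1" i "\\2", "g", x)
   }
}' file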

Where:

  • ([^[:digit:]])1([^[:digit:]]) matches a non-digit character (capture group #1) + a single 1 + a non-digit character (capture group #2)
  • "\\1" i "\\2" - replacement is capture group #1 + the current increment value (i=2..5 in this instance) + capture group #2
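  • we perform 2x gensub() calls to address the issue where there are consecutive numeric columns containing a single 1: each match consumes the non-digit separator that follows the 1, so an immediately following 1 has no leading non-digit left to match in the same pass (NOTE: there may be a way to do this with a single function call but I'm drawing a blank at the moment ... open to suggestions from the awkers in the community)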
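
One possible way to avoid the second pass is to drop the regex replacement and walk each line token by token instead. This is a different technique from the gensub() approach above, sketched here under the same assumption that the first numeric column never holds a bare 1. The names bump_ones and bump are hypothetical, and the script sticks to POSIX awk features (no gensub()).

```shell
# Hypothetical alternative (not from the original answer): scan tokens
# rather than regex-replace, so adjacent 1s cannot overlap.
bump_ones() {
    awk -v n="$1" '
    function bump(line, val,    out, tok) {
        out = ""
        # Peel off one run of whitespace or one run of non-whitespace at a time.
        while (match(line, /^[ \t]+|^[^ \t]+/)) {
            tok  = substr(line, 1, RLENGTH)
            line = substr(line, RLENGTH + 1)
            out  = out (tok == "1" ? val : tok)   # only a bare 1 is replaced
        }
        return out
    }
    { lines[NR] = $0; print }             # first copy: input unchanged
    END {
        for (i = 2; i <= n; ++i)          # copies 2..n: bare 1s become i
            for (r = 1; r <= NR; ++r)
                print bump(lines[r], i)
    }'
}

printf 'TCTA    14  TCTG    1   TCTA    1 1 1\n' | bump_ones 3
# TCTA    14  TCTG    1   TCTA    1 1 1
# TCTA    14  TCTG    2   TCTA    2 2 2
# TCTA    14  TCTG    3   TCTA    3 3 3
```

Because whitespace runs are copied through untouched, the original column spacing is preserved, and a token is only replaced when it is exactly 1.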

Using a modified input file to demonstrate the consecutive numeric column issue:

$ cat file
TCTA    3   TCTG    1   TCTA    1
TCTA    4   TCTG    1   TCTA    1
TCTA    5   TCTG    1   TCTA    1
TCTA    6   TCTG    1   TCTA    1
TCTA    7   TCTG    1   TCTA    1
TCTA    8   TCTG    1   TCTA    1
TCTA    9   TCTG    1   TCTA    1
TCTA    10  TCTG    1   TCTA    1
TCTA    11  TCTG    1   TCTA    1
TCTA    12  TCTG    1   TCTA    1
TCTA    13  TCTG    1   TCTA    1
TCTA    14  TCTG    1   TCTA    1 1 1
TCTA    15  TCTG    1   TCTA    1 1 1 1

Running the awk script above (n=5) against this file generates:

TCTA    3   TCTG    1   TCTA    1
TCTA    4   TCTG    1   TCTA    1
TCTA    5   TCTG    1   TCTA    1
TCTA    6   TCTG    1   TCTA    1
TCTA    7   TCTG    1   TCTA    1
TCTA    8   TCTG    1   TCTA    1
TCTA    9   TCTG    1   TCTA    1
TCTA    10  TCTG    1   TCTA    1
TCTA    11  TCTG    1   TCTA    1
TCTA    12  TCTG    1   TCTA    1
TCTA    13  TCTG    1   TCTA    1
TCTA    14  TCTG    1   TCTA    1 1 1
TCTA    15  TCTG    1   TCTA    1 1 1 1
TCTA    3   TCTG    2   TCTA    2
TCTA    4   TCTG    2   TCTA    2
TCTA    5   TCTG    2   TCTA    2
TCTA    6   TCTG    2   TCTA    2
TCTA    7   TCTG    2   TCTA    2
TCTA    8   TCTG    2   TCTA    2
TCTA    9   TCTG    2   TCTA    2
TCTA    10  TCTG    2   TCTA    2
TCTA    11  TCTG    2   TCTA    2
TCTA    12  TCTG    2   TCTA    2
TCTA    13  TCTG    2   TCTA    2
TCTA    14  TCTG    2   TCTA    2 2 2
TCTA    15  TCTG    2   TCTA    2 2 2 2
TCTA    3   TCTG    3   TCTA    3
TCTA    4   TCTG    3   TCTA    3
TCTA    5   TCTG    3   TCTA    3
TCTA    6   TCTG    3   TCTA    3
TCTA    7   TCTG    3   TCTA    3
TCTA    8   TCTG    3   TCTA    3
TCTA    9   TCTG    3   TCTA    3
TCTA    10  TCTG    3   TCTA    3
TCTA    11  TCTG    3   TCTA    3
TCTA    12  TCTG    3   TCTA    3
TCTA    13  TCTG    3   TCTA    3
TCTA    14  TCTG    3   TCTA    3 3 3
TCTA    15  TCTG    3   TCTA    3 3 3 3
TCTA    3   TCTG    4   TCTA    4
TCTA    4   TCTG    4   TCTA    4
TCTA    5   TCTG    4   TCTA    4
TCTA    6   TCTG    4   TCTA    4
TCTA    7   TCTG    4   TCTA    4
TCTA    8   TCTG    4   TCTA    4
TCTA    9   TCTG    4   TCTA    4
TCTA    10  TCTG    4   TCTA    4
TCTA    11  TCTG    4   TCTA    4
TCTA    12  TCTG    4   TCTA    4
TCTA    13  TCTG    4   TCTA    4
TCTA    14  TCTG    4   TCTA    4 4 4
TCTA    15  TCTG    4   TCTA    4 4 4 4
TCTA    3   TCTG    5   TCTA    5
TCTA    4   TCTG    5   TCTA    5
TCTA    5   TCTG    5   TCTA    5
TCTA    6   TCTG    5   TCTA    5
TCTA    7   TCTG    5   TCTA    5
TCTA    8   TCTG    5   TCTA    5
TCTA    9   TCTG    5   TCTA    5
TCTA    10  TCTG    5   TCTA    5
TCTA    11  TCTG    5   TCTA    5
TCTA    12  TCTG    5   TCTA    5
TCTA    13  TCTG    5   TCTA    5
TCTA    14  TCTG    5   TCTA    5 5 5
TCTA    15  TCTG    5   TCTA    5 5 5 5


Answered By - markp-fuso