Issue
So I have a file with many occurrences of the same string spread across thousands of lines. For simplicity's sake, my demo file reads as follows:
123 Fragment-cnum0001 Energy
Fragment-cnum0001
XXX
sss
123 Fragment-cnum0001 Energy
Fragment-cnum0001
XXX
sss
123 Fragment-cnum0001 Energy
Fragment-cnum0001
XXX
sss
123 Fragment-cnum0001 Energy
Fragment-cnum0001
XXX
sss
123 Fragment-cnum0001 Energy
Fragment-cnum0001
XXX
sss
I would like to replace each occurrence so that the file reads:
123 Fragment-cnum0001 Energy
alpha-cnum0001
XXX
sss
123 Fragment-cnum0001 Energy
alpha-cnum0002
XXX
sss
123 Fragment-cnum0001 Energy
alpha-cnum0003
XXX
sss
123 Fragment-cnum0001 Energy
alpha-cnum0004
XXX
sss
123 Fragment-cnum0001 Energy
alpha-cnum0005
XXX
sss
I know I could do specific replacements of each line with
sed 's/0001/0002/2' file
But I was hoping that a loop would do the job instead. Though probably very slow, my original thought was:
for k in *.txt; do
    x=0                                  #reset the number of occurrences to zero
    tconf=$(grep -c "cnum0001\n" $k)     #find the total number of occurrences
    while $(( (($x + 0)) <= (($tconf + 0)) )); do   #while the number of occurrences is less than the total number
        x=$(($x + 1))                    #add one to the number of occurrences
        cnc=$(( printf %04d $x ))        #set it so that $cnc includes the necessary number of leading zeroes before $x, so if x=1, cnc=0001.
        cn=${prefixCFGi}-cnum${cnc}
        sed -i 's/Fragment-cnum0001/$cn/$x' $k   #This is the command I need help with. I want it to find the xth occurrence of Fragment-cnum0001 and replace it with $cn
    done   #loop through the txt file until $x=$tconf
done       #loop through all txt files
However, when I tried:
x=2;sed "s/0001/0002/$x" file
the output was exactly the same as the input. In this simple case it should have changed just the second occurrence of 0001 to 0002, but it did not. To me, this means that sed isn't understanding that x=2 and substituting it into the command accordingly.
I am writing this as a part of a much larger zsh script, but I am currently working in the terminal.
Notes that I have added because the answers I was getting were not fully addressing my question:
- I cannot use the line number as a counter (so code that says "do this every 4 lines" will not work). The number of lines between each occurrence is variable, as is the text between them. My actual file has over a hundred lines between each occurrence.
- I need to be able to specify that the found string must be on its own line, as I have occurrences that sit in the middle of larger strings on other lines that I do not want counted or replaced.
I am open to other commands, sed is just the one whose arrangement I am most familiar with.
Solution
While it would certainly be possible to use a bash loop to update the file, the repeated sed -i calls are going to be excessive (ie, the entire file has to be rewritten on each pass through the loop). Better performance is going to come from using a tool (eg, awk, perl, python) that's capable of making the (multiple) changes in a single pass through the file.
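For reference only (this is not part of the original answer), the single-pass idea could be sketched with a perl one-liner; the new prefix "alpha" is hard-coded purely for illustration, and the ^...$ anchors restrict the match to lines containing nothing but the fragment id:

# assumed single-pass, in-place sketch: every line consisting solely of
# "Fragment-cnum0001" is rewritten with an incrementing, zero-padded counter
perl -i -pe 's/^Fragment-cnum0001$/sprintf("alpha-cnum%04d", ++$n)/e' file1.txt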
Setup:
$ cat file1.txt
123 Fragment-cnum0001 Energy
Fragment-cnum0001
XXX
sss
123 Fragment-cnum0001 Energy
Fragment-cnum0001
XXX
sss
123 Fragment-cnum0001 Energy
Fragment-cnum0001
XXX
sss
$ cat file2.txt
456 Fragment-cnum0001 Mining
Fragment-cnum0001
XXX
sss
456 Fragment-cnum0001 Mining
Fragment-cnum0001
XXX
sss
456 Fragment-cnum0001 Mining
Fragment-cnum0001
XXX
sss
One awk idea to replace OP's current while loop:
newpfx="alpha"
for k in *.txt
do
printf "\n############## $k\n"
awk -v pfx="Fragment,${newpfx}" ' # define old/new prefix strings
BEGIN { split(pfx,a,",") # a[1]==old prefix / a[2]==new prefix
oldid=a[1] "-cnum0001" # assumes always looking for string ending in "cnum0001"
newid=a[2] "-cnum"
}
$1==oldid { $1 = newid sprintf("%04d", ++sfx) } # if 1st field matches "oldid" then redefine 1st field; assumes no other fields on this line
1 # print current line
' "$k"
done
This generates:
############## file1.txt
123 Fragment-cnum0001 Energy
alpha-cnum0001
XXX
sss
123 Fragment-cnum0001 Energy
alpha-cnum0002
XXX
sss
123 Fragment-cnum0001 Energy
alpha-cnum0003
XXX
sss
############## file2.txt
456 Fragment-cnum0001 Mining
alpha-cnum0001
XXX
sss
456 Fragment-cnum0001 Mining
alpha-cnum0002
XXX
sss
456 Fragment-cnum0001 Mining
alpha-cnum0003
XXX
sss
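OP noted that the fragment id must be matched only when it sits on its own line. The $1==oldid test above fires whenever the id is the first field, so if any lines start with the id but carry trailing text that should be left alone, the awk call inside the loop could compare the whole line instead; a variant sketch (assumed, not part of the original answer):

awk -v pfx="Fragment,${newpfx}" '
BEGIN { split(pfx,a,",")
        oldid=a[1] "-cnum0001"
        newid=a[2] "-cnum"
      }
$0==oldid { $0 = newid sprintf("%04d", ++sfx) }      # replace only when the id is the entire line
1
' "$k"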
If using GNU awk (for -i inplace support) we can directly update the files, eg:
newpfx="alpha"
for k in *.txt
do
awk -i inplace -v pfx="Fragment,${newpfx}" '
BEGIN { split(pfx,a,",")
oldid=a[1] "-cnum0001"
newid=a[2] "-cnum"
}
$1==oldid { $1 = newid sprintf("%04d", ++sfx) }
1
' "$k"
done
This generates:
$ cat file1.txt
123 Fragment-cnum0001 Energy
alpha-cnum0001
XXX
sss
123 Fragment-cnum0001 Energy
alpha-cnum0002
XXX
sss
123 Fragment-cnum0001 Energy
alpha-cnum0003
XXX
sss
$ cat file2.txt
456 Fragment-cnum0001 Mining
alpha-cnum0001
XXX
sss
456 Fragment-cnum0001 Mining
alpha-cnum0002
XXX
sss
456 Fragment-cnum0001 Mining
alpha-cnum0003
XXX
sss
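Since -i inplace overwrites the original files, it may be worth keeping backups while testing. One simple approach (assumed here, not part of the original answer) is to snapshot the files before letting awk rewrite them; gawk's inplace extension also provides a backup-suffix variable (inplace::suffix in recent releases, INPLACE_SUFFIX in older ones), so check the local gawk documentation for the exact name.

# assumed safety net: keep an untouched .bak copy of each file before the in-place edit
for k in *.txt; do cp -- "$k" "$k.bak"; done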
We could go further and pull the for k in *.txt loop into our single awk call, eg:
awk -i inplace -v pfx="Fragment,${newpfx}" 'BEGIN ....' *.txt
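Expanded, that single call might look like the sketch below (assumed, building on the script above); note the added FNR==1 rule, which restarts the counter at the top of each input file so the per-file numbering matches the loop version:

awk -i inplace -v pfx="Fragment,${newpfx}" '
BEGIN     { split(pfx,a,",")
            oldid=a[1] "-cnum0001"
            newid=a[2] "-cnum"
          }
FNR==1    { sfx=0 }                                  # new file: restart the counter
$1==oldid { $1 = newid sprintf("%04d", ++sfx) }
1
' *.txt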
OP will need to decide if this will work in the real script.
OP has mentioned this code is nested within a couple of other loops; if those additional loops consist of making further modifications to these same files, then it may be possible to pull those other loops into the same awk script, which in turn would improve the overall performance of the main script.