Issue
I'm in real trouble with some Unix text processing. Any guidance here will be much appreciated.
I want to identify the duplicate records in the file below based on their id, assign a unique random number to each id in a separate column, and sum their value fields. My input file:
name,location,id,state,website,status,color,field1,value,field3,field4,field5
joe,US,23A,CA,g,oog,le,10,blue,0,10,0,0,0
jack,UK,89A,LN,yah,oo,11,red,0,20,0,0,0
joe,US,23A,CA,g,mail,10,blue,0,120,0,0,0
rose,EU,AV45,UN,new,mail,45,black,0,110,0,0,0
Karl,US,2345,NY,microsoft,99,green,0,34,0,0,0
jonas,IN,AW3455,ND,facebook,37,brown,0,48,0,0,0
Karl,US,2345,NY,microsoft,99,purple,0,87,0,0,0
alin,IN,3T45,CA,re,edit,78,white,0,22,0,0,0
alin,IN,3T45,CA,ora,cle,11,orange,0,35,0,0,0
I expect my output file to be:
RandomUniqID,ID,Value
2202,23A,130
3029,89A,20
3066,AV45,110
5077,2345,121
1055,AW3455,48
3099,3T45,57
Here I want to generate a unique random ID for each record, and for the records that repeat (i.e. share the same id) I want their value fields summed up into a separate column. The trickiest part is the 5th column, website, which is very dynamic: its values can contain commas anywhere, so a plain comma split shifts the remaining columns around. That is what is getting me in trouble.
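For illustration (assuming the sample rows above are saved in a file, called YourFile below to match the answer), a quick field count shows why a fixed column position fails:

# count comma-separated fields on every data row
awk -F ',' 'NR > 1 { print NF, $3 }' YourFile

On this sample the count varies between 12 and 14 depending on how many commas the website value contains, so the value column cannot be taken as a fixed $9. It is, however, always the 4th field from the end, because the last four columns (value,field3,field4,field5) never contain commas.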
Solution
Try this:
awk -F ',' '
BEGIN { srand() }    # seed the RNG; without this every run yields the same numbers
NR > 1 {
    if ( ! ($3 in UID) ) {
        # draw random numbers until we find one not already handed out
        do {
            Rnd = int(1000000 * rand())
        } while ( Rnd in Used )
        Used[Rnd] = 1
        UID[$3] = Rnd
    }
    # the value column is the 9th from the left, but commas inside the
    # website field shift it; counted from the right it is always $(NF-3)
    S[$3] += $(NF - 3)
}
END {
    print "RandomUniqID,ID,Value"
    for ( id in UID ) printf( "%s,%s,%s\n", UID[id], id, S[id] )
}
' YourFile
This assumes there are far fewer distinct ids than 1,000,000, so the collision loop above terminates quickly.
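If you need reproducible output while testing, one option (a sketch, not part of the original answer; the variable name seed is just illustrative) is to pass a fixed seed in from the shell. The idea in isolation:

# with the same seed, awk reproduces the same sequence of "random" numbers
awk -v seed=42 'BEGIN { srand(seed); for (i = 0; i < 3; i++) print int(1000000 * rand()) }'

The same idea drops into the program above: replace BEGIN { srand() } with BEGIN { if (seed != "") srand(seed); else srand() } and pass -v seed=42 (or any number) when you want the generated ids to repeat between runs; omit -v seed to fall back to seeding from the time of day.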
Answered By - NeronLeVelu