Issue
I'm trying to reformat the family IDs in a .fam file whose sample and family IDs are the same and are coded in the following way:
Continent_Breed_Ind-ID
The idea is to transform column 1 into something that only contains continent+breed, while keeping the other columns unchanged.
Mock dataset:
Continent1_Breed1_Ind-ID1 Continent1_Breed1_Ind-ID1 0 0 0 -9
Continent1_Breed2_Ind-ID2 Continent1_Breed2_Ind-ID1 0 0 0 -0
Continent2_Breed1_Ind-ID1 Continent2_Breed1_Ind-ID1 0 0 0 -9
Desired outcome:
Continent1_Breed1 Continent1_Breed1_Ind-ID1 0 0 0 -9
Continent1_Breed2 Continent1_Breed2_Ind-ID1 0 0 0 -0
Continent2_Breed1 Continent2_Breed1_Ind-ID1 0 0 0 -9
I have tried using sed as follows:
sed -r 's/_[^_]*//2g' file.fam
But that only gives me the first column.
Any ideas?
Solution
You may use this simple sed command:
sed 's/_[^_]* / /' file
Continent1_Breed1 Continent1_Breed1_Ind-ID1 0 0 0 -9
Continent1_Breed2 Continent1_Breed2_Ind-ID1 0 0 0 -0
Continent2_Breed1 Continent2_Breed1_Ind-ID1 0 0 0 -9
Here:
- "_[^_]* " (note the trailing space) matches _, followed by 0 or more non-_ characters, followed by a space.
- We replace this match with a single space to restore the separator between the first and second columns.
PS: Note that there is no global flag used here, so only the first match on each line (the one at the end of column 1) is replaced.
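As a quick sanity check, the sketch below runs the same sed substitution on the mock rows from the question and compares it with an awk variant that edits field 1 explicitly. The awk version is not part of the original answer; it is an alternative under the assumption that columns are whitespace-separated, and the variable names (fam_rows, sed_out, awk_out) are made up for illustration.

```shell
# Mock .fam rows copied from the question (fam_rows is a hypothetical variable name).
fam_rows='Continent1_Breed1_Ind-ID1 Continent1_Breed1_Ind-ID1 0 0 0 -9
Continent1_Breed2_Ind-ID2 Continent1_Breed2_Ind-ID1 0 0 0 -0
Continent2_Breed1_Ind-ID1 Continent2_Breed1_Ind-ID1 0 0 0 -9'

# The sed substitution from the answer: drop the last _segment before the first
# space that directly follows one, i.e. the individual ID at the end of column 1.
sed_out=$(printf '%s\n' "$fam_rows" | sed 's/_[^_]* / /')

# Alternative sketch with awk: strip the trailing _segment from field 1 only.
# Assigning to $1 makes awk rebuild the line with single spaces between fields.
awk_out=$(printf '%s\n' "$fam_rows" | awk '{ sub(/_[^_]*$/, "", $1); print }')

printf '%s\n' "$sed_out"
```

One possible reason to prefer the awk form: the sed pattern relies on a literal space after column 1, so it would not match a tab-delimited file, whereas awk splits on any run of blanks (at the cost of rewriting separators as single spaces).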
Answered By - anubhava Answer Checked By - Pedro (WPSolving Volunteer)