Issue
I have a text file like this:
Viruses/GCF_000820355.1_ViralMultiSegProj14361_genomic.fna.gz
Sclerophthora macrospora virus A
Viruses/GCF_000820495.2_ViralMultiSegProj14656_genomic.fna.gz
Influenza B virus RNA
Viruses/GCF_000837105.1_ViralMultiSegProj14079_genomic.fna.gz
Tomato mottle virus
And I need to get a CSV file like this:
Viruses/GCF_000820355.1_ViralMultiSegProj14361_genomic.fna.gz,Sclerophthora macrospora virus A
Viruses/GCF_000820495.2_ViralMultiSegProj14656_genomic.fna.gz,Influenza B virus RNA
Viruses/GCF_000837105.1_ViralMultiSegProj14079_genomic.fna.gz,Tomato mottle virus
Later I want to use each row like a tuple to find the compressed file, read it, and write a final file with a name like:
Viruses/GCF_000837105.1/Tomato mottle virus.fna
I just need to learn how to do the first part of the problem. It could be with:
- sed
- awk
- R
- Python
Any help would be very appreciated. This is hard for me to accomplish because the original filenames are very messed up.
Thank you all for your time.
Paulo
PS- I have tried this:
sed -z 's/\n/,/g;s/,$/\n/' multi_headers
However, it put a comma in place of every \n.
Solution
Using any awk in any shell on every Unix box and only storing 1 line at a time in memory so it'll work no matter how large your input file is:
$ awk '{ORS=(NR%2 ? "," : RS)} 1' file
Viruses/GCF_000820355.1_ViralMultiSegProj14361_genomic.fna.gz,Sclerophthora macrospora virus A
Viruses/GCF_000820495.2_ViralMultiSegProj14656_genomic.fna.gz,Influenza B virus RNA
Viruses/GCF_000837105.1_ViralMultiSegProj14079_genomic.fna.gz,Tomato mottle virus
There's a lot happening in a small amount of code above so here's an explanation:
- ORS is the builtin variable containing the string to be printed at the end of each output record (record = line in this case), a newline by default.
- RS is the builtin variable containing the string (or regexp) that separates each input record, a newline by default.
- NR is the builtin variable containing the current record/line number, so NR%2 is 1 for odd numbered records and 0 for even numbered.
- NR%2 ? "," : RS is a ternary expression resulting in , for odd numbered lines, \n (or whatever else you have set RS to, e.g. \r\n) for even numbered.
- 1 is a true condition which causes the default action of printing the current record to be executed.
So the above script says "if the current line number is odd print it with a , at the end, otherwise print it with a newline at the end", hence it's joining every pair of lines with a , between.
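Since the question also lists Python as an option, here is a minimal sketch of the same pairing idea. It is not part of the original answer. The function name pair_lines is made up for illustration, the sample data is the three pairs from the question, and the in-memory buffers stand in for the real input file (multi_headers in the question) and an output file whose name, headers.csv, is only an assumption.

```python
import csv
import io


def pair_lines(lines):
    """Return an iterator of (path, name) tuples from alternating lines."""
    stripped = (line.rstrip("\r\n") for line in lines)
    # Passing the same iterator to zip() twice consumes two consecutive
    # lines per step. A trailing unpaired line is silently dropped.
    return zip(stripped, stripped)


# Stand-in for open("multi_headers"); a real file object works the same way.
sample = io.StringIO(
    "Viruses/GCF_000820355.1_ViralMultiSegProj14361_genomic.fna.gz\n"
    "Sclerophthora macrospora virus A\n"
    "Viruses/GCF_000820495.2_ViralMultiSegProj14656_genomic.fna.gz\n"
    "Influenza B virus RNA\n"
    "Viruses/GCF_000837105.1_ViralMultiSegProj14079_genomic.fna.gz\n"
    "Tomato mottle virus\n"
)

pairs = list(pair_lines(sample))

# Stand-in for open("headers.csv", "w", newline=""); that name is hypothetical.
out = io.StringIO()
csv.writer(out, lineterminator="\n").writerows(pairs)
print(out.getvalue(), end="")
```

Like the awk script, this reads one line at a time, except that list() is used here to keep the tuples around for the later step. Two differences are worth knowing: the csv module quotes any field that itself contains a comma, which the awk one-liner does not, and an odd final line is dropped here rather than printed with a trailing comma.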
Answered By - Ed Morton Answer Checked By - Mildred Charles (WPSolving Admin)