Issue
I'm working with GWAS data. Need help.
IID,rs098083,kgp794789,rs09848309,kgp8300747,.....
63,CC,AG,GA,AA,.....
54,AT,CT,TT,AG,.....
12,TT,GA,AG,AA,.....
.
.
.
As above I have a total of 512 rows and 2 Million columns.
Desired output:
SNP,Genotyping
rs098083,{
"CC" : [ 1, 63, 6, 18, 33, ...],
"CT" : [ 2, 54, 6, 7, 8, ...],
"TT" : [ 4, 9, 12, 13, ...],
"AA" : [86, 124, 4, 19, ...],
"AT" : [8, 98, 34, 74, ....],
.
.
.
}
kgp794789,{
"CC" : [ 1, 63, 6, 18, 33, ...],
"CT" : [ 2, 5, 6, 7, 8, ...],
"TT" : [ 4, 9, 12, 13, ...],
"AA" : [86, 124, 4, 19, ...],
"AT" : [8, 98, 34, 74, ....],
.
.
.
}
rs09848309,{
"CC" : [ 1, 63, 6, 18, 3, ...],
"CT" : [ 2, 5, 6, 7, 8, ...],
"TT" : [ 4, 9, 24 13, ...],
"AA" : [86, 134, 4, 19, ...],
"AT" : [8, 48, 34, 44, ....],
.
.
.
As above after pivoting, I should have a JSON file of 2 million rows & 2 Columns. The SNP
column of the row contains the ID of the SNP. The genotyping
column will contain a JSON BLOB. This BLOB will be a set of key-value pairs. The key is a particular genotype (e.g., CC, CT, TT, ....) and the value is a list of the IIDs with a genotype matching the key.
Output Format would be " a CSV with embedded JSON"
Solution
update
Output Format would be "a CSV with embedded JSON"
Given your simple CSV input, using jq
is possible; but for a more robust solution I would use Miller (available here as a stand-alone binary for various OSs):
mlr --csv filter '
begin { @records = {}; }
for (k,v in mapexcept($*,"IID")) {
if (! haskey(@records,k)) { @records[k] = {}; }
if (! haskey(@records[k],v)) { @records[k][v] = []; }
@records[k][v] = append(@records[k][v],$*["IID"]);
}
false;
end {
for(k,v in @records) {
emit1 {"SNP": k, "Genotyping": v}
}
}
' then json-stringify -f 'Genotyping' gwas.csv
SNP,Genotyping
rs098083,"{""CC"": [63], ""AT"": [54], ""TT"": [12]}"
kgp794789,"{""AG"": [63], ""CT"": [54], ""GA"": [12]}"
rs09848309,"{""GA"": [63], ""TT"": [54], ""AG"": [12]}"
kgp8300747,"{""AA"": [63, 12], ""AG"": [54]}"
In the future, I need to query the results using SQL
I would say that your original data is already in a suitable form for storing it in a DB.
- With the original form, you would query it like this:
SELECT IID, kgp8300747 FROM mytable;
And you'll get:
IID | kgp8300747 |
---|---|
63 | AA |
54 | AG |
12 | AA |
- If you store the JSON in your DB then you would query it like this:
SELECT Genotyping FROM mytable WHERE SNP = 'kgp8300747';
For getting:
Genotyping |
---|
{"AA": [63, 12], "AG": [54]} |
Both approaches will need post processing, I don't know which one is more convenient for your use-case.
old stuff
This is a request for clarification
What's the desired output?
I ask this because the one that you provided is an invalid CSV containing invalid JSON, so it's difficult to believe that you really want something like this.
Problem with the JSON:
- this is invalid:
{ CC : [1,2] }
- The keys need double-quotes:
{ "CC" : [1,2] }
Problem with the CSV:
- this is invalid:
rs0993,{
"CC": [1,2],
"CT": [3]
}
- The second column contains newlines and/or commas and/or double-quotes so it shall be escaped with the CSV rules:
rs0993,"{
""CC"": [1,2],
""CT"": [3]
}"
Plausible desired outputs:
- A CSV with embedded JSON:
SNP,Genotyping
rs098083,"{""CC"": [63], ""AT"": [54], ""TT"": [12]}"
kgp794789,"{""AG"": [63], ""CT"": [54], ""GA"": [12]}"
- A JSON array of arrays:
[
["SNP", "Genotyping"],
["rs098083", {"CC": [63], "AT": [54], "TT": [12]}],
["kgp794789", {"AG": [63], "CT": [54], "GA": [12]}]
]
- A JSON array of objects:
[
{"SNP": "rs098083", "Genotyping": {"CC": [63], "AT": [54], "TT": [12]}},
{"SNP": "kgp794789", "Genotyping": {"AG": [63], "CT": [54], "GA": [12]}}
]
With that in mind, make sure that the input and output that you provided are really what you have/want; please edit your question if that's not the case.
Answered By - Fravadona Answer Checked By - David Marino (WPSolving Volunteer)