Issue
I have a 70 GB CSV file (data.csv) with the structure Timestamp,Price,Volume:
1649289600174,100,100
1649289600176,105,100
1649289600178,109,100
1649289600179,104,100
1649289600181,101,100
1649289600182,100,100
1649289600188,90,100
I can read it line by line in Python like this:

with open("data.csv") as infile:
    for line in infile:
        row = line.split(",")
        current_price = row[1]
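Note that split() yields strings, so the price must be converted to a number before any +/- 10 comparison can work. A minimal sketch (int is assumed from the sample data; use float if prices can have decimals):

```python
line = "1649289600174,100,100"

row = line.rstrip("\n").split(",")
timestamp = row[0]      # keep as a string; not used in arithmetic
price = int(row[1])     # convert so that price comparisons are numeric
volume = int(row[2])    # convert so that volumes can be summed
```

Without the conversion, `row[1] - open_price` would raise a TypeError and string comparisons like `"90" < "109"` would give wrong answers.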
If the price moves +10 or -10, I would like to take a snapshot of that chunk and calculate open, close, high, and low values for those prices, plus volume_sum for the volume. In the above sample, the price moves from 100 to 90.
- open is the first price found in the chunk; in the above sample it is 100
- close is the last price found in the chunk; in the above sample it is 90
- high is the max(price) found in the chunk; in the above sample it is 109
- low is the min(price) found in the chunk; in the above sample it is 90
- volume_sum is the sum(volume) found in the chunk; in the above sample it is 700
These open, close, high, low, and volume_sum values become a new record. The next chunk is measured from the close price, which is 90; i.e. if the price goes to 100 or down to 80, that triggers a new chunk.
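The rules above, applied to the sample chunk, can be written out in plain Python (the names here are illustrative, not from any library):

```python
# Prices and volumes from the sample chunk (all rows up to and
# including the row where the price has moved -10 from the open)
prices = [100, 105, 109, 104, 101, 100, 90]
volumes = [100, 100, 100, 100, 100, 100, 100]

record = {
    "open": prices[0],           # first price in the chunk
    "close": prices[-1],         # last price in the chunk
    "high": max(prices),
    "low": min(prices),
    "volume_sum": sum(volumes),
}
# record == {"open": 100, "close": 90, "high": 109, "low": 90, "volume_sum": 700}
```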
I'm on Ubuntu. Since I have a very large file, I'm looking for a more efficient way to do this. [I don't want to use pandas for this; it is too slow for my use case.]
Can someone guide me?
Thanks
Solution
Assumptions:
- first chunk starts with the price found in the first row
- when switching chunks the volume of the corresponding 'switch' row is not counted in the new chunk
- though not mentioned in the question/description, we'll also provide the open/close times for each chunk
- output format:
openTime,closeTime,openPrc,closePrc,highPrc,lowPrc,volume
- when we reach the end of the file, print out the last chunk regardless of whether or not we crossed the +/- 10 threshold (assuming the last chunk has at least 2 data rows)
Adding a few more lines to the input:
$ cat data.csv
1649289600174,100,100
1649289600176,105,100
1649289600178,109,100
1649289600179,104,100
1649289600181,101,100
1649289600182,100,100
1649289600188,90,100
1649289600190,83,90
1649289600288,95,60
1649289600388,79,35
1649289600488,83,100
One awk idea:
awk -v mvDN=-10 -v mvUP=10 '
function print_stats() {
    # print the finished chunk (skipped if it has fewer than 2 rows),
    # then start a new chunk from the current ("switch") row
    if (openTime && openTime != closeTime)
        print openTime,closeTime,openPrc,closePrc,highPrc,lowPrc,volume
    openTime=closeTime
    openPrc=highPrc=lowPrc=closePrc
    volume=( NR==1 ? currVol : 0 )   # switch-row volume not counted in new chunk
}
BEGIN { FS=OFS="," }
{   closeTime=$1
    currPrc=$2
    currVol=$3
    closePrc=currPrc
    volume+=currVol
    highPrc=(currPrc > highPrc ? currPrc : highPrc)
    lowPrc= (currPrc < lowPrc ? currPrc : lowPrc)
    if ( (currPrc - openPrc <= mvDN) || (currPrc - openPrc >= mvUP) )
        print_stats()                # threshold crossed: flush this chunk
}
END { print_stats() }                # flush the final (possibly partial) chunk
' data.csv
This generates:
1649289600174,1649289600188,100,90,109,90,700
1649289600188,1649289600388,90,79,95,79,185
1649289600388,1649289600488,79,83,83,79,100
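For comparison, the same streaming logic can also be sketched in plain Python, since the question asked about a line-by-line approach. This is a sketch under the same assumptions as the awk script; `chunk_ohlcv` is an illustrative name, not a library function:

```python
def chunk_ohlcv(lines, threshold=10):
    """Stream CSV lines (Timestamp,Price,Volume) and yield one
    (openTime, closeTime, open, close, high, low, volume_sum) record
    each time the price moves +/- threshold from the chunk's open."""
    open_time = open_p = high = low = close_time = close_p = None
    volume = 0
    for line in lines:
        ts, prc, vol = line.strip().split(",")
        prc, vol = int(prc), int(vol)
        if open_time is None:
            # first row of the file starts the first chunk
            open_time, open_p = ts, prc
            high = low = prc
            volume = vol
        else:
            volume += vol
            high = max(high, prc)
            low = min(low, prc)
        close_time, close_p = ts, prc
        if abs(prc - open_p) >= threshold:
            yield (open_time, close_time, open_p, close_p, high, low, volume)
            # new chunk starts from the close; switch-row volume not counted
            open_time, open_p = ts, prc
            high = low = prc
            volume = 0
    # flush the final chunk if it has at least 2 rows
    if open_time is not None and open_time != close_time:
        yield (open_time, close_time, open_p, close_p, high, low, volume)
```

On the 11-row sample this yields the same three records as the awk script. Iterating over the file object keeps memory use constant regardless of file size, though for a 70 GB file the awk version will likely still be faster than interpreted Python.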
Answered By - markp-fuso
Answer Checked By - Timothy Miller (WPSolving Admin)