Issue
I have a 70 GB CSV file (data.csv) with the structure Timestamp,Price,Volume:
1649289600174,100,100
1649289600176,105,100
1649289600178,109,100
1649289600179,104,100
1649289600181,101,100
1649289600182,100,100
1649289600188,90,100
I can read it line by line in Python like this:

with open("data.csv") as infile:
    for line in infile:
        row = line.split(",")
        current_price = row[1]
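Note that split() yields strings, so the price must be converted to a number before any +/- 10 comparison can work. A minimal sketch (int is assumed from the sample data; use float if prices can have decimals):

```python
line = "1649289600174,100,100"

row = line.rstrip("\n").split(",")
timestamp = row[0]      # keep as a string; not used in arithmetic
price = int(row[1])     # convert so that price comparisons are numeric
volume = int(row[2])    # convert so that volumes can be summed
```

Without the conversion, `row[1] - open_price` would raise a TypeError and string comparisons like `"90" < "109"` would give wrong answers.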
If the price moves +10 or -10, I would like to take a snapshot of that chunk and calculate open, close, high, and low values for those prices, plus volume_sum for the volume. In the above sample, the price moves from 100 to 90.
- open is the first price found in the chunk; in the above sample it is 100
- close is the last price found in the chunk; in the above sample it is 90
- high is the max(price) found in the chunk; in the above sample it is 109
- low is the min(price) found in the chunk; in the above sample it is 90
- volume_sum is the sum(volume) found in the chunk; in the above sample it is 700
These open, close, high, low, and volume_sum values become a new record. The next chunk is measured from the close price, which is 90; i.e. if the price goes to 100 or down to 80, that triggers a new chunk.
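The rules above, applied to the sample chunk, can be written out in plain Python (the names here are illustrative, not from any library):

```python
# Prices and volumes from the sample chunk (all rows up to and
# including the row where the price has moved -10 from the open)
prices = [100, 105, 109, 104, 101, 100, 90]
volumes = [100, 100, 100, 100, 100, 100, 100]

record = {
    "open": prices[0],           # first price in the chunk
    "close": prices[-1],         # last price in the chunk
    "high": max(prices),
    "low": min(prices),
    "volume_sum": sum(volumes),
}
# record == {"open": 100, "close": 90, "high": 109, "low": 90, "volume_sum": 700}
```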
I'm on Ubuntu. Since I have a very large file, I'm looking for a more efficient way to do this. [I don't want to use pandas for this; it is too slow for my use case.]
Can someone guide me?
Thanks
Solution
Assumptions:
- first chunk starts with the price found in the first row
- when switching chunks the volume of the corresponding 'switch' row is not counted in the new chunk
- though not mentioned in the question/description, we'll also provide the open/close times for each chunk
- output format:
openTime,closeTime,openPrc,closePrc,highPrc,lowPrc,volume
- when we reach the end of the file, print out the last chunk regardless of whether or not we crossed the +/- 10 threshold (assuming the last chunk has at least 2 data rows)
Adding a few more lines to the input:
$ cat data.csv
1649289600174,100,100
1649289600176,105,100
1649289600178,109,100
1649289600179,104,100
1649289600181,101,100
1649289600182,100,100
1649289600188,90,100
1649289600190,83,90
1649289600288,95,60
1649289600388,79,35
1649289600488,83,100
One awk idea:
awk -v mvDN=-10 -v mvUP=10 '
function print_stats() {
    # print the finished chunk (skipped if it has fewer than 2 rows),
    # then start a new chunk from the current ("switch") row
    if (openTime && openTime != closeTime)
        print openTime,closeTime,openPrc,closePrc,highPrc,lowPrc,volume
    openTime=closeTime
    openPrc=highPrc=lowPrc=closePrc
    volume=( NR==1 ? currVol : 0 )   # switch-row volume not counted in new chunk
}
BEGIN { FS=OFS="," }
{   closeTime=$1
    currPrc=$2
    currVol=$3
    closePrc=currPrc
    volume+=currVol
    highPrc=(currPrc > highPrc ? currPrc : highPrc)
    lowPrc= (currPrc < lowPrc ? currPrc : lowPrc)
    if ( (currPrc - openPrc <= mvDN) || (currPrc - openPrc >= mvUP) )
        print_stats()                # threshold crossed: flush this chunk
}
END { print_stats() }                # flush the final (possibly partial) chunk
' data.csv
This generates:
1649289600174,1649289600188,100,90,109,90,700
1649289600188,1649289600388,90,79,95,79,185
1649289600388,1649289600488,79,83,83,79,100
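For comparison, the same streaming logic can also be sketched in plain Python, since the question asked about a line-by-line approach. This is a sketch under the same assumptions as the awk script; `chunk_ohlcv` is an illustrative name, not a library function:

```python
def chunk_ohlcv(lines, threshold=10):
    """Stream CSV lines (Timestamp,Price,Volume) and yield one
    (openTime, closeTime, open, close, high, low, volume_sum) record
    each time the price moves +/- threshold from the chunk's open."""
    open_time = open_p = high = low = close_time = close_p = None
    volume = 0
    for line in lines:
        ts, prc, vol = line.strip().split(",")
        prc, vol = int(prc), int(vol)
        if open_time is None:
            # first row of the file starts the first chunk
            open_time, open_p = ts, prc
            high = low = prc
            volume = vol
        else:
            volume += vol
            high = max(high, prc)
            low = min(low, prc)
        close_time, close_p = ts, prc
        if abs(prc - open_p) >= threshold:
            yield (open_time, close_time, open_p, close_p, high, low, volume)
            # new chunk starts from the close; switch-row volume not counted
            open_time, open_p = ts, prc
            high = low = prc
            volume = 0
    # flush the final chunk if it has at least 2 rows
    if open_time is not None and open_time != close_time:
        yield (open_time, close_time, open_p, close_p, high, low, volume)
```

On the 11-row sample this yields the same three records as the awk script. Iterating over the file object keeps memory use constant regardless of file size, though for a 70 GB file the awk version will likely still be faster than interpreted Python.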
Answered By - markp-fuso
Answer Checked By - Timothy Miller (WPSolving Admin)