Issue
I have multiple large CSV files, each 1 GB to 7 GB in size.
None of the files has a header row; the columns are ID,col1,col2,col3. File 1 looks like this:
1,23,22,27
2,67,29,22
3,34,34,23
File 2 has this structure:
4,23,22,27
5,67,29,22
6,34,34,23
i.e., the ID is unique across all files.
I would like to merge these files into a single CSV file and then sort the rows by the ID column. The resulting file will be around 75 GB and should start with ID 1:
1,23,22,27
2,67,29,22
3,34,34,23
4,23,22,27
5,67,29,22
6,34,34,23
At the moment I'm doing it like this:
import pandas as pd

CHUNK_SIZE = 10_000_000  # number of rows per chunk
output_file = 'combined.csv'

for csv_file_name in sorted_fnames:
    # header=None because the files have no header row
    chunk_container = pd.read_csv(csv_file_name, header=None, chunksize=CHUNK_SIZE)
    print(csv_file_name)
    for chunk in chunk_container:
        # append without writing a header line before each chunk
        chunk.to_csv(output_file, mode="a", header=False, index=False)
And then I'm sorting the file like this:
sort --parallel=2 -t, -k1,1 -n combined.csv > combined_sorted.csv
However, the merging process is incredibly slow. It takes more than an hour to merge the files.
Note: I have only 16 GB RAM. That's why I'm using the chunking option.
Is there any faster solution available?
Thanks
Solution
On second thought, you can use HDF5, a format that handles big data really well:
import pandas as pd

hdf_path = '_combined.h5'

# Append each CSV to a single compressed table in the HDF5 store.
with pd.HDFStore(hdf_path, mode='w', complevel=5, complib='blosc') as store:
    for csv_file_name in sorted_fnames:
        # header=None because the CSV files have no header row
        store.append('data', pd.read_csv(csv_file_name, header=None), index=False)
You can eventually save it back to CSV if you want, but working with HDF5 would be more efficient.
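If you do need a CSV at the end, here is a minimal sketch of streaming the store back out in chunks so memory stays bounded. It assumes the 'data' table was written in table format as above; combined_from_hdf.csv is just a placeholder output name.

import pandas as pd

hdf_path = '_combined.h5'
csv_out = 'combined_from_hdf.csv'  # placeholder output name

# Iterate over the table in fixed-size row chunks instead of loading ~75 GB at once.
with open(csv_out, 'w') as f:
    for chunk in pd.read_hdf(hdf_path, 'data', chunksize=1_000_000):
        chunk.to_csv(f, header=False, index=False)

The chunks come back in the order they were appended, so if you still need the output ordered by ID you would run it through the sort command shown in the question.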
Answered By - eshirvana
Answer Checked By - Dawn Plyler (WPSolving Volunteer)