Issue
I'm trying to search through HDFS for parquet files and list them out. I'm using the command below, which works great. It looks through all of the subdirectories in /sources/works_dbo and gives me all the parquet files:
hdfs dfs -ls -R /sources/works_dbo | grep ".*\.parquet$"
However, I just want to return the first file it encounters per subdirectory, so that each subdirectory only appears on a single line in my output. Say I had this:
sources/works_dbo/test1/file1.parquet
sources/works_dbo/test1/file2.parquet
sources/works_dbo/test2/file3.parquet
When I run my command I expect the output to look like this:
sources/works_dbo/test1/file1.parquet
sources/works_dbo/test2/file3.parquet
Solution
You can use sort -u (unique) with / as the delimiter and the first three fields as the key. The -s option ("stable") makes sure that the file retained is the first one encountered for each subdirectory.
For this input
sources/works_dbo/test1/file1.parquet
sources/works_dbo/test1/file2.parquet
sources/works_dbo/test2/file3.parquet
the result is
$ sort -s -t '/' -k 1,3 -u infile
sources/works_dbo/test1/file1.parquet
sources/works_dbo/test2/file3.parquet
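The real input in the question comes from hdfs dfs -ls -R, which prints permission, replication, owner, group, size and date columns before each path, and the paths are absolute (they start with /). Below is a minimal sketch of how the listing and the sort could be combined. It assumes GNU sort, paths without spaces, and files sitting exactly one level below /sources/works_dbo. The helper name first_parquet_per_dir is hypothetical, and the sample listing only stands in for real HDFS output:

```shell
# Hypothetical helper: reads `hdfs dfs -ls -R`-style lines on stdin and
# prints one .parquet path per subdirectory (the first one encountered).
first_parquet_per_dir() {
  awk '{ print $NF }' |               # keep only the last column, the path (assumes no spaces in paths)
    grep '\.parquet$' |               # keep only parquet files
    LC_ALL=C sort -s -t '/' -k 1,4 -u # absolute paths: field 1 is empty, so the key spans fields 1-4
}

# Made-up sample listing standing in for `hdfs dfs -ls -R /sources/works_dbo`.
printf '%s\n' \
  'drwxr-xr-x   - user group          0 2020-01-01 10:00 /sources/works_dbo/test1' \
  '-rw-r--r--   3 user group       1024 2020-01-01 10:00 /sources/works_dbo/test1/file1.parquet' \
  '-rw-r--r--   3 user group       2048 2020-01-01 10:00 /sources/works_dbo/test1/file2.parquet' \
  'drwxr-xr-x   - user group          0 2020-01-01 10:00 /sources/works_dbo/test2' \
  '-rw-r--r--   3 user group       4096 2020-01-01 10:00 /sources/works_dbo/test2/file3.parquet' \
  | first_parquet_per_dir
```

Against a real cluster the call would presumably be hdfs dfs -ls -R /sources/works_dbo | first_parquet_per_dir. Because the leading / makes the first field empty, the key here runs through field 4 rather than field 3; for relative paths like the ones in the example above, -k 1,3 is the right key.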
Answered By - Benjamin W. Answer Checked By - Pedro (WPSolving Volunteer)