Friday, May 27, 2022

[SOLVED] Recursive search grep

Issue

I'm trying to search through HDFS for parquet files and list them. I'm using the command below, which works great: it looks through all of the subdirectories in /sources/works_dbo and gives me all the parquet files:

 hdfs dfs -ls -R /sources/works_dbo | grep ".*\.parquet$"

However, I only want to return the first file it encounters in each subdirectory, so that each subdirectory appears on a single line of my output. Say I had this:

sources/works_dbo/test1/file1.parquet
sources/works_dbo/test1/file2.parquet
sources/works_dbo/test2/file3.parquet

When I run my command I expect the output to look like this:

sources/works_dbo/test1/file1.parquet
sources/works_dbo/test2/file3.parquet

Solution

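You can use sort -u (unique) with / as the field delimiter (-t '/') and the first three fields as the key (-k 1,3), so that lines sharing the same subdirectory compare as equal and only one of them is kept. The -s option ("stable") makes sure that the line retained is the first one encountered for each subdirectory. Fields are counted from the start of the line, so if the paths begin with a leading / (field 1 is then empty), the key has to be widened to -k 1,4.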

For this input

sources/works_dbo/test1/file1.parquet
sources/works_dbo/test1/file2.parquet
sources/works_dbo/test2/file3.parquet

the result is

$ sort -s -t '/' -k 1,3 -u infile
sources/works_dbo/test1/file1.parquet
sources/works_dbo/test2/file3.parquet
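
If the subdirectories sit at different depths, a fixed field count no longer fits. As an alternative sketch (not part of the answer above), awk can key on the parent directory of each path instead. The function name first_per_dir is made up for illustration, and the sketch assumes the path is the last whitespace-separated field on each line and contains no spaces:

```shell
# Hypothetical helper: print the first path seen for each parent directory.
# Assumes the path is the last field of the line and has no whitespace in it.
first_per_dir() {
  awk '{
    path = $NF                    # last field: the file path
    dir = path
    sub("/[^/]*$", "", dir)       # drop the trailing "/filename" part
    if (!(dir in seen)) {         # first time this directory shows up
      seen[dir] = 1
      print path
    }
  }'
}

# Sample input taken from the question
printf '%s\n' \
  'sources/works_dbo/test1/file1.parquet' \
  'sources/works_dbo/test1/file2.parquet' \
  'sources/works_dbo/test2/file3.parquet' | first_per_dir
```

Unlike sort, this keeps the lines in their original input order, and it could be appended to the grep pipeline from the question in place of the sort step.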


Answered By - Benjamin W.
Answer Checked By - Pedro (WPSolving Volunteer)