Thursday, February 8, 2024

[SOLVED] How to use cat and grep with gsutil and filter by subdirectory name?

Issue

Currently, to search for a string (here: 123456789) in the files in my buckets, I do the following:

gsutil cat -h gs://AAAA/** | grep '123456789' > 20221109.txt

When there is a match I get the path of the file, so it works. However, done this way it searches every directory, and since I have a thousand directories and a thousand files, it takes a very long time. I want to filter by date using the name of the subdirectory, like:

gsutil cat -h gs://AAAA/*2022-11-09*/** | grep '123456789' > 20221109.txt

But it didn't work, and I have no clue how to solve my problem. I read a lot of answers on SO, but I couldn't find one that covers this.

PS: I can't use find with gsutil, so I'm trying to do it with cat and grep in a single gsutil command line.

Thanks in advance for your help.


Solution

Finally, I managed to get what I wanted, but the result is hard to read. I think it's possible to do better, and I'm open to any improvement. As a reminder, this solution avoids reading every directory in the bucket.

Step 1: I list the paths of all the files whose subdirectory matches my pattern (a date here):

gsutil ls gs://directory1/*2022-11-09*/** > gs_path_files_2022_11_09.txt

Step 2: I grep each listed file and write out the name of the file together with the line where the match occurs (again in the terminal):

while read -r line; do
  # Prefix each matching line with the path of the object it came from.
  gsutil cat "$line" | awk -v path="$line" '/the_string_i_want_to_match_in_my_file/{print path ":" $0}' >> results.txt
done < gs_path_files_2022_11_09.txt

After that, results.txt contains the path of each file followed by the line where the match was found.
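For readability, the loop from the second step can also be wrapped in a small function. This is only a sketch under a few assumptions: search_listed_objects and FETCH are names made up for this example and are not part of gsutil, the search string is treated as a literal string rather than a regular expression, and neither the paths nor the search string contain backslashes.

```shell
#!/usr/bin/env bash
# Sketch: search every object listed in a file for a literal string and
# prefix each matching line with the object path.
# search_listed_objects and FETCH are hypothetical names for this example.

# Command used to print one object. Override it (for example FETCH=cat)
# to try the function on local files without touching a bucket.
FETCH=${FETCH:-gsutil cat}

search_listed_objects() {
  local needle=$1 list=$2 path
  # Read the list on file descriptor 3 so the fetch command can never
  # consume the remaining paths from standard input.
  while IFS= read -r path <&3; do
    [ -n "$path" ] || continue   # skip empty lines in the list
    # index() does a plain substring test, so the needle is not a regex.
    # awk -v interprets backslash escapes, hence the no-backslash assumption.
    $FETCH "$path" | awk -v p="$path" -v s="$needle" 'index($0, s) { print p ":" $0 }'
  done 3< "$list"
}

# Possible usage, reusing the listing from step 1 (not executed here):
#   gsutil ls 'gs://directory1/*2022-11-09*/**' > gs_path_files_2022_11_09.txt
#   search_listed_objects 123456789 gs_path_files_2022_11_09.txt > 20221109.txt
```

Keeping the listing and the search separate means the date filter is applied once by gsutil ls, and only the matching objects are downloaded by the loop.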

Best regards



Answered By - Cass
Answer Checked By - Katrina (WPSolving Volunteer)