Issue
I want to extract lines from File1 which are not present in File2
File1
a
b
c
File2
a
c
b
One possible command in bash is:
comm -23 <(sort File1) <(sort File2) > File
This works perfectly well in bash, but I don't know how to implement it correctly in Python.
I've tried with
import os
os.system("comm -23 <(sort File1) <(sort File2) > File")
But it does not work. Any hints?
Solution
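os.system runs its command through /bin/sh, which is often not bash and so may not understand the <(...) process substitution syntax. If you must use a shell, invoke bash explicitly and do it safely: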
import subprocess

subprocess.call(['bash', '-c',
                 'comm -23 <(sort "$1") <(sort "$2") >"$3"', '_',
                 infile1_name, infile2_name, outfile_name])
That is to say: Instead of passing the filenames in as part of your code, pass them as out-of-band variables such that their names can't be interpreted by the shell.
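To illustrate why the out-of-band form is safe, here is a minimal sketch. The filename is a hypothetical hostile value made up for this example, and plain sh is used instead of bash because positional parameters work the same way in both. The script text stays constant, and the name arrives as "$1" without ever being parsed as shell code:

```python
import subprocess

# Hypothetical filename containing shell metacharacters (illustration only).
tricky_name = 'my file; echo INJECTED'

# The script is a fixed string. '_' fills $0, and tricky_name becomes "$1".
result = subprocess.run(
    ['sh', '-c', 'printf "%s\\n" "$1"', '_', tricky_name],
    capture_output=True, text=True, check=True,
)

# The shell printed the name verbatim instead of running "echo INJECTED".
echoed = result.stdout.rstrip('\n')
```

Had the name been pasted into the command string instead, the shell would have treated everything after the semicolon as a second command.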
Continuing to use sort and comm, but doing so without a shell, it looks more like:
from subprocess import Popen, PIPE

p_sort1 = Popen(['sort'], stdin=open(infile1_name, 'r'), stdout=PIPE)
p_sort2 = Popen(['sort'], stdin=open(infile2_name, 'r'), stdout=PIPE)
p_comm = Popen(['comm', '-23',
                f'/dev/fd/{p_sort1.stdout.fileno()}',
                f'/dev/fd/{p_sort2.stdout.fileno()}',
               ],
               pass_fds=[p_sort1.stdout.fileno(), p_sort2.stdout.fileno()],
               stdout=open(outfile_name, 'w'),  # the variable, not the literal string 'outfile_name'
              )
# Close file descriptors on the Python end so only the spawned processes have
# open copies
p_sort1.stdout.close()
p_sort2.stdout.close()
# Let the pipeline run, with comm reading output from the two sort processes
p_comm.wait()
All that said, doing this in pure Python (as in MaxNoe's answer) is ideal; the only major advantage one gets from the above is sort's ability to work with contents larger than RAM (by splitting into temporary files and doing a merge-sort to combine them).
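For reference, a pure-Python version might look something like the sketch below. This is not necessarily what the referenced answer does; the function names are made up for illustration, and it assumes both files fit comfortably in memory. It also differs from comm in two ways: duplicate lines are collapsed because sets are used, and the output is ordered by Python's string comparison rather than by the locale-aware ordering of sort.

```python
def lines_only_in_first(path1, path2):
    """Return the sorted, de-duplicated lines of path1 that are absent from path2."""
    with open(path2) as f2:
        in_second = {line.rstrip("\n") for line in f2}
    with open(path1) as f1:
        in_first = {line.rstrip("\n") for line in f1}
    # Set difference keeps only the lines unique to the first file.
    return sorted(in_first - in_second)


def write_difference(path1, path2, out_path):
    """Write the result one line per row, roughly like `comm -23 ... > out_path`."""
    with open(out_path, "w") as out:
        for line in lines_only_in_first(path1, path2):
            out.write(line + "\n")
```

Calling write_difference(infile1_name, infile2_name, outfile_name) would then stand in for the whole pipeline above.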
Answered By - Charles Duffy Answer Checked By - Dawn Plyler (WPSolving Volunteer)