Issue
I'm using sed to correct a conversion issue with a two-column PDF (converted with pdftotext 3.03). The converter will at times work properly (left column text prints first, then right column text prints second). However, sometimes it will break at a hyphen in the text and switch to printing text in a ^left column + right column$ format.
I thought to solve this by printing all the left column text to stdout and building a multiline hold space with the right column text, which I would later swap into the pattern space and print to stdout. However, I found several conversion correction cases where the fix itself needs the hold space, so it is not free to act as a buffer.
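A minimal sketch of that hold-space plan, for the simple case where nothing else needs the hold space. The " | " column separator and the two sample lines are made up for illustration (real pdftotext output has no such marker), and the \n escapes rely on GNU sed:

```shell
#!/bin/sh
# Sketch only: hypothetical " | " separator and sample input.
holdout=$(printf 'a1 | b1\na2 | b2\n' | sed -n '
/ | / {
    # Split the line into "left\nright".
    s/ | /\n/
    # Print the left half right away.
    P
    # Keep only the right half and append it to the hold space.
    s/[^\n]*\n//
    H
}
$ {
    # Swap the collected right column into the pattern space,
    # drop the leading newline that H added, and print it.
    x
    s/^\n//
    p
}
')
printf '%s\n' "$holdout"
```

The left column comes out as each line is read, and the buffered right column is only emitted on the last line.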
Instead, I tried dumping the right column text into an external file (rtcol.txt) temporarily so that I can retrieve it later. But I am having problems getting the contents of the file to print to the screen. I suspect that sed still has the file open and has yet to write the cached text into it. Please see the following partial pseudo code example:
# Example of complicated case when hyphen causes a multiline issue
/regex to 1st line/ { h; s/ignored ltcol.*\(capture rtcol txt.*\).*- \(more rtcol.*$\)/\1\2/w rtcol.txt
x; s/\(capture ltcol.*\)ignore rtcol.*\(more ltcol.*\)-.*/\1\2/;
N; s/\(.*\)\n\(Get more ltcol from next line.*\)/\1\2/p; d;
}
# Section where converter at least separated the lines (ltcol \n rtcol)
/regex to next fix/ { n; p; n; w rtcol.txt
# Section where converter simply added the two lines (ltcol + rtcol)
n; s/^\(ltcol.*\) \(rtcol.*\)$/\1\n\2/; P; s/[^\n]*\n//w rtcol.txt
}
# After final correction, try to print out temp file contents
/regex to last fix/ { n; p; n; w rtcol.txt
r rtcol.txt
}
At first, I thought this was a sed limitation on the number of UTF-8 characters it can process. But currently I'm only using ~1900 bytes in the temp file, and when I tried to save the temp file into another file it worked (even though that file was not dynamically created as I desired).
So, my question is this: can you write to a file and pull its contents within the same script? Is there a way to force sed to sync the cached text to the file and then read from it? Or should I look to use a different type of program (e.g. awk) or approach for this conversion correction process?
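For comparison, the awk route mentioned above might look like the sketch below, which buffers the right column in an array instead of a temp file. The " | " separator and the sample input are hypothetical stand-ins for the real converter output:

```shell
#!/bin/sh
# Sketch only: hypothetical " | " separator and sample input.
awkout=$(printf 'a1 | b1\na2 | b2\n' | awk -F ' [|] ' '
# Two fields: print the left column now, remember the right column.
NF == 2 { print $1; right[++n] = $2; next }
# Anything else passes through unchanged.
{ print }
# After the last input line, emit the saved right column in order.
END { for (i = 1; i <= n; i++) print right[i] }
')
printf '%s\n' "$awkout"
```

Because the array lives in memory, no file has to be flushed or reopened before the saved text is printed.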
sed --version reports sed (GNU sed) 4.7 on Debian 11.
Solution
I was able to find the answer to this question.
At first, I thought it was a limitation of sed where the pattern space could only hold so many characters (per the link, only 4000 bytes). But in my application, using UTF-8 encoding (1 byte per character for the ASCII text involved), I only got to around 2000 bytes max.
I decided to do some source code diving, and found that GNU sed does indeed have a flush output function. I did a quick search on sed and flushing output and found this Stack Overflow post by @Naab (give credit where due). And sure enough, I found the -u option in the source code as well.
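So for future reference, if you need to use external files as temp storage for file parsing, make sure to add the -u (--unbuffered) flag as below. The 4000-byte figure is a line-length limit known from some other sed implementations; GNU sed documents no built-in limit on line length, so it should not be the constraint here: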
pdftotext -y 130 -W 700 -H 560 <PDF File> - | sed -n -u -f script.sed
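The write-then-read pattern can be tried end to end with a small self-contained sketch. The " | " separator, the sample input, and the demo.sed file name are assumptions for illustration only; the w flag, the r command, and -u are used as in the script above:

```shell
#!/bin/sh
# Sketch only: hypothetical separator, sample input, and file names.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

cat > demo.sed <<'EOF'
# Lines of the form "left | right": print the left half now and
# stash the right half in rtcol.txt for later.
/ | / {
    h
    s/ | .*//p
    g
    s/.* | //w rtcol.txt
}
# On the last line, queue rtcol.txt to be appended to the output.
$ r rtcol.txt
EOF

# -n suppresses autoprint; -u flushes output so rtcol.txt is complete
# by the time the r command reads it back.
syncout=$(printf 'a1 | b1\na2 | b2\n' | sed -n -u -f demo.sed)
printf '%s\n' "$syncout"

# Clean up the scratch directory.
cd / && rm -rf "$workdir"
```

Note that text queued by r is printed at the end of the cycle even when -n is given, so no extra p is needed for the temp file contents.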
Answered By - Tyrone Mosley Answer Checked By - David Goodson (WPSolving Volunteer)