Monday, November 27, 2023

[SOLVED] How to write / read a temporary file within the same script using sed

November 27, 2023 linux, sed, shell, text-parsing, unix

Issue

I'm using sed to correct a two column PDF (using pdftotext 3.03) conversion issue. The converter will at times work properly (left column text prints first, then right column text prints second). However, sometimes it will break at a hyphen in the text and switch into printing text in a ^left column + right column$ format.

My first idea was to print all the left column text to stdout while building up the right column text in a multiline hold space, which I would swap into the pattern space and print later. However, several of the conversion corrections need the hold space for the fix itself, so it is not free to use as a buffer.
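For illustration, a minimal sketch of that hold-space idea, assuming GNU sed and a made-up input where each line is simply "left right" separated by one space (the sample text and the `held` variable are hypothetical, not from the real PDF output):

```shell
#!/bin/sh
# Hypothetical input: every line is "<left> <right>".
# Print the left part immediately, collect the right part in the hold space,
# and emit the collected text once the last line has been processed.
held=$(printf 'left1 right1\nleft2 right2\n' |
  sed -n -e 's/ /\n/; P; s/[^\n]*\n//; H' \
         -e '$!d; x; s/^\n//; p')

printf '%s\n' "$held"
# Prints left1, left2, right1, right2 (one per line).
```

This only works because nothing else in the script touches the hold space; as soon as a fix needs `h` or `x` for its own purposes, the buffered right column text gets clobbered.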

Instead, I tried to dump the right column text into an external file (rtcol.txt) temporarily so that I could retrieve it later. But I'm having trouble getting the contents of that file to print to the screen. I suspect sed still has the file open and has not yet written the cached text into it. Please see the following partial pseudo code example:

                      # Example of complicated case when hyphen causes a multiline issue
/regex to 1st line/ { h; s/ignored ltcol.*\(capture rtcol txt.*\).*- \(more rtcol.*$\)/\1\2/w rtcol.txt
                      x; s/\(capture ltcol.*\)ignore rtcol.*\(more ltcol.*\)-.*/\1\2/;
                      N; s/\(.*\)\n\(Get more ltcol from next line.*\)/\1\2/p; d; 
                    }

                      # Section where converter at least separated the lines (ltcol \n rtcol)
/regex to next fix/ { n; p; n; w rtcol.txt
                      
                      # Section where converter simply added the two lines (ltcol + rtcol)
                      n; s/^\(ltcol.*\) \(rtcol.*\)$/\1\n\2/; P; s/[^\n]*\n//w rtcol.txt
                    }

                      # After final correction, try to print out temp file contents
/regex to last fix/ { n; p; n; w rtcol.txt
                      r rtcol.txt
                    }

At first, I thought this was a sed limitation on the number of UTF-8 characters it can process. But the temp file currently holds only ~1900 bytes, and when I tried saving the temp file's contents into another file, that worked (even though that file was not dynamically created as I wanted).

So, my question is this: can you write to a file and pull its contents within the same script? Is there a way to force sed to sync the cached text to the file and then read from it? Or should I look to use a different type of program (e.g. awk) or approach for this conversion correction process?

sed --version is sed (GNU sed) 4.7 in Debian 11

Solution

Was able to find the answer to this question.

At first, I thought it was a limitation of sed where the pattern space could only hold so many characters (per the link, only 4000 bytes). But in my application the text is UTF-8 encoded and mostly ASCII (1 byte per character), and I only got to around 2000 bytes max.

I decided to do some source code diving and found that GNU sed does indeed have a flush output function. A quick search on sed and flushing output turned up this Stack Overflow post by @Naab (credit where due). Sure enough, I found the -u option in the source code as well.

So, for future reference: if you need to use external files as temporary storage while parsing a file, make sure to add the -u flag as below (not exceeding 4000 bytes):

pdftotext -y 130 -W 700 -H 560 <PDF File> - | sed -n -u -f script.sed
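As a self-contained illustration of the same write-then-read pattern, here is a minimal sketch assuming GNU sed. The sample input, the END marker line and the `result` variable are made up for the demo; only rtcol.txt and the -n -u flags come from the question. Whether -u is strictly required may depend on the sed version; the sketch just uses it as recommended above.

```shell
#!/bin/sh
# Work in a scratch directory so rtcol.txt does not clobber anything.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical input: "<left> <right>" lines followed by an END marker.
# For each data line: print the left part, write the right part to rtcol.txt.
# On the END line: queue rtcol.txt so its contents are emitted at end of cycle.
result=$(printf 'left1 right1\nleft2 right2\nEND\n' |
  sed -n -u \
    -e '/END/!{ h; s/ .*//p; x; s/^[^ ]* //w rtcol.txt' \
    -e '}' \
    -e '/END/r rtcol.txt')

printf '%s\n' "$result"
# Prints left1, left2, right1, right2 (one per line).
```

Note that the `w rtcol.txt` flag has to end its -e fragment (or its line in a script file), because sed treats everything up to the end of the line as the file name.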

Answered By - Tyrone Mosley

Answer Checked By - David Goodson (WPSolving Volunteer)

This Answer collected from stackoverflow and tested by PythonFixing community admins, is licensed under cc by-sa 2.5 , cc by-sa 3.0 and cc by-sa 4.0
