Saturday, October 29, 2022

[SOLVED] Concatenating XML files with shell not working as expected

Issue

I have been trying to modify a script that concatenates XML files from a directory and merges them into a single XML file. The script was originally used for concatenating text files.

I have the following script:

#!/usr/bin/sh
ORIGIN_PATH="/backup/data/export/imatchISO"
HISTORY_PATH="/backup/data/batch/hist"
SEND_PATH="/backup/data/batch/output"
DATE=`date +%y%m%d`
LOG="/backup/data/batch/log/concatIMatch_"$DATE

cd $ORIGIN_PATH

ls -lrt >> $LOG

cat $ORIGIN_PATH/SWIFTCAMT053_* >> $SEND_PATH/SWIFTCAMT053.XML_$DATE 2>> $LOG

mv $ORIGIN_PATH/SWIFTCAMT053_* $HISTORY_PATH >> $LOG 2>> $LOG


if [[ $(ls -A $SEND_PATH/SWIFTCAMT053.XML_$DATE) ]]; then
    echo $(date "+%Y-%m-%d %H:%M:%S")" - Ficheros 053 concatenados"  >> $LOG
        mv $SEND_PATH/SWIFTCAMT053.XML_$DATE $SEND_PATH/SWIFTCAMT053.XML 2>> $LOG
        exit 0
else
    echo $(date "+%Y-%m-%d %H:%M:%S")" - ¡ERROR CON LOS FICHEROS 053 AL CONCATENAR!"  >> $LOG
        exit 1
fi

The directory contains several XML files that all share the same format:

<?xml version="1.0" ?>
<DataPDU xmlns:ns2="urn:swift:saa:xsd:saa.2.0">
    <ns2:Revision>2.0.13</ns2:Revision>
        <ns2:Header>
        ...
        </ns2:Header>
        
        <ns2:Body>
        ...
        </ns2:Body>

        <ns2:Header>
        ...
        </ns2:Header>
        
        <ns2:Body>
        ...
        </ns2:Body>

</DataPDU>

The problem is that plain concatenation simply appends each file to the end of the previous one, which is not the expected result: the XML declaration, the opening <DataPDU> tag and the closing </DataPDU> tag are duplicated for every file.

What I need is a single XML file with the following structure:

<?xml version="1.0" ?>
<DataPDU xmlns:ns2="urn:swift:saa:xsd:saa.2.0">
    <ns2:Revision>2.0.13</ns2:Revision>
        <ns2:Header>
        ...
        </ns2:Header>
        
        <ns2:Body>
        ...
        </ns2:Body>

        <ns2:Header>
        ...
        </ns2:Header>
        
        <ns2:Body>
        ...
        </ns2:Body>

        <ns2:Header>
        ...
        </ns2:Header>
        
        <ns2:Body>
        ...
        </ns2:Body>

        <ns2:Header>
        ...
        </ns2:Header>
        
        <ns2:Body>
        ...
        </ns2:Body>

        <ns2:Header>
        ...
        </ns2:Header>
        
        <ns2:Body>
        ...
        </ns2:Body>

        <ns2:Header>
        ...
        </ns2:Header>
        
        <ns2:Body>
        ...
        </ns2:Body>

</DataPDU>

So, technically, I want the first 3 lines and the last line to occur only once.

I have received a tip that I could do something with:

$ awk 'NR<3 {print} FNR>3 {print last} {last=$0} END{print}' *.xml

But I don't understand how to modify my script for this.


Solution

Use xmllint to process the XML files properly, excluding the Revision element from the second body:

body1=$(xmllint --xpath '/DataPDU/*' tmp.xml | sed -ze 's/\n/\&#xA;/g')
body2=$(xmllint --xpath '/DataPDU/*[not(local-name()="Revision")]' tmp.xml | sed -ze 's/\n/\&#xA;/g')

printf "%s\n" "cd /DataPDU" "set ${body1}${body2}" "save" "bye" | xmllint --shell tmp.xml

The code uses the same file twice, so change the second file name accordingly. Plain newlines (\n) are replaced by the equivalent &#xA; entity to avoid errors in the xmllint shell.
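As a small illustration of the newline replacement on its own, the sed step can be wrapped in a helper. This is only a sketch: the function name encode_newlines is hypothetical, and the -z option assumes GNU sed, as the commands above already do.

```shell
# encode_newlines (hypothetical name): read stdin and replace every newline
# with the &#xA; character reference, as the sed step above does.
# Assumes GNU sed: -z makes sed treat the whole input as one record,
# so \n can be matched inside it.
encode_newlines() {
    sed -z 's/\n/\&#xA;/g'
}
```

For example, printf '<a/>\n<b/>\n' | encode_newlines should print <a/>&#xA;<b/>&#xA; on a single line.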

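awk can be used too, but it requires that the XML format does not change between files.
The body can be extracted by setting the record separator RS to a regular expression matching either
xmlns:ns2="urn:swift:saa:xsd:saa.2.0"> or </DataPDU>
(a regular expression as RS is an extension, which is why gawk is used). Record #2 then contains the inner elements.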

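# The header is the same in every file, so write it once
echo -e '<?xml version="1.0" ?>\n\t<DataPDU xmlns:ns2="urn:swift:saa:xsd:saa.2.0">' > output.xml

# Concatenate the bodies from all files in a variable.
# Record 2 is everything between the opening and closing DataPDU tags,
# so it also includes the <ns2:Revision> element of each file.
body=
for f in *.xml; do
    body+=$(gawk 'BEGIN{ RS="xmlns:ns2=\"urn:swift:saa:xsd:saa.2.0\">|<[/]DataPDU>" } { if(NR == 2) { print $0 }}' "$f")
done

echo "$body" >> output.xml
# Add the closing tag
echo "</DataPDU>" >> output.xml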
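To connect this back to the script in the question, here is a minimal line-based sketch that could stand in for the cat call. It is not part of the original answer. The function name merge_pdu is hypothetical, and it relies on the layout described in the question: the first 3 lines of every file are the declaration, the <DataPDU> opening tag and the Revision element, and the last line is </DataPDU>. Unlike the record-separator loop above, it keeps the Revision element only once and needs only standard awk features.

```shell
# merge_pdu FILE... (hypothetical helper): write one merged document to stdout.
# Assumption, taken from the question: in every input file, lines 1-3 are the
# XML declaration, the <DataPDU ...> opening tag and <ns2:Revision>, and the
# last line is </DataPDU>.
merge_pdu() {
    awk '
        FNR == 1 { have = 0 }                    # new file: forget the held closing tag
        FNR <= 3 { if (NR == FNR) print; next }  # header lines: first file only
        have     { print held }                  # emit the line held from the previous step
                 { held = $0; have = 1 }         # hold the current line back by one
        END      { if (have) print held }        # closing tag of the last file
    ' "$@"
}
```

With the function defined near the top of the script, the cat line could then become something like: merge_pdu "$ORIGIN_PATH"/SWIFTCAMT053_* > "$SEND_PATH/SWIFTCAMT053.XML_$DATE" 2>> "$LOG". If any file deviates from the assumed layout (for example, trailing blank lines after </DataPDU>), the xmllint approach is the safer choice.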


Answered By - LMC
Answer Checked By - Willingham (WPSolving Volunteer)