Friday, October 7, 2022

[SOLVED] Unicode char representations in BASH / shell: printf vs od

Issue

I have two related questions. Both are ‘why’ questions, not ‘how to’ questions:

Question 1

printf (https://www.gnu.org/software/coreutils/manual/html_node/printf-invocation.html) and od produce the same decimal, octal, and hex representations for ASCII characters:

ascii_char=A

printf "%d" "'$ascii_char"
65
echo -n $ascii_char | od -A n -t d1
   65
echo -n $ascii_char | od -A n -t u1
  65
printf "%o" "'$ascii_char"
101
echo -n $ascii_char | od -A n -t o1
 101
printf "%x" "'$ascii_char"
41
echo -n $ascii_char | od -A n -t x1
 41

So why do they not produce the same representations for a Unicode char?

unicode_char=🐕

printf "%d" "'$unicode_char"
128021
echo -n $unicode_char | od -A n -t d1
  -16  -97 -112 -107
echo -n $unicode_char | od -A n -t d
 -1785683984
echo -n $unicode_char | od -A n -t u1
 240 159 144 149
echo -n $unicode_char | od -A n -t u
 2509283312
printf "%o" "'$unicode_char"
372025
echo -n $unicode_char | od -A n -t o1
 360 237 220 225
echo -n $unicode_char | od -A n -t o
 22544117760
printf "%x" "'$unicode_char"
1f415
echo -n $unicode_char | od -A n -t x1
 f0 9f 90 95
echo -n $unicode_char | od -A n -t x
 95909ff0

Question 2

Although the od results for a Unicode char differ from those of printf, how come printf still knows how to convert the od results back to a character, while it cannot convert back its own results?

printf "%o" "'$unicode_char"
372025    # printf cannot convert back its own result
echo -n $unicode_char | od -A n -t o1
 360 237 220 225    # looks different, but printf can convert it back correctly
printf %b '\360\237\220\225'
🐕    # success

printf "%x" "'$unicode_char"
1f415    # printf can convert back this result
printf "\U$(printf %08x 0x1f415)"
🐕    # success
echo -n $unicode_char | od -A n -t x1
 f0 9f 90 95    # looks different, but printf can convert it back correctly
printf %b '\xf0\x9f\x90\x95'
🐕    # success

Solution

As pointed out in the comments, the difference you are seeing is the difference between a Unicode codepoint and its UTF-8 encoding.

printf prints codepoints, see POSIX documentation for printf ... "'🐕":

If the leading character is a single-quote or double-quote, the value shall be the numeric value in the underlying codeset of the character following the single-quote or double-quote.

This number is always the same, no matter whether you choose UTF-8, UTF-16, UTF-32, and so on. od, on the other hand, has no knowledge of the character codeset. od only prints bytes or words (groups of bytes whose size is set by -t), and those are always encoded, even if the encoding happens to be the same number as the codepoint (e.g. for ASCII characters in ASCII encoding, or ASCII characters in UTF-8 encoding).

🐕 has the codepoint 128021. UTF-8 tries to encode codepoints in single bytes (hence UTF-8, because 1 byte = 8 bits), but 128021 > 2⁸ = 256 does not fit into a single byte, so the number is split across multiple bytes, which are marked as special to prevent confusion. These special markers on each byte lead to the different output from od.
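To make the marker bits concrete, here is a minimal sketch in shell arithmetic of how the codepoint 128021 (0x1F415) is spread over four UTF-8 bytes. The variable names are illustrative only, and the sketch covers just the four-byte case (codepoints from 0x10000 upward). It is not a description of how printf or od work internally.

```shell
cp=128021                                  # codepoint of the dog emoji (0x1F415)

# Four-byte UTF-8 layout: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
b1=$(( 0xF0 | (cp >> 18) ))                # marker 11110 + top 3 bits
b2=$(( 0x80 | ((cp >> 12) & 0x3F) ))       # marker 10 + next 6 bits
b3=$(( 0x80 | ((cp >> 6) & 0x3F) ))        # marker 10 + next 6 bits
b4=$(( 0x80 | (cp & 0x3F) ))               # marker 10 + lowest 6 bits

utf8_hex=$(printf '%02x %02x %02x %02x' "$b1" "$b2" "$b3" "$b4")
echo "$utf8_hex"
```

This prints f0 9f 90 95, the same bytes that od -A n -t x1 reports in the question.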

If you convert to UTF-32, every codepoint will fit into a single word, allowing you to use od to display codepoints:

# Assuming little endian system. For big endian systems use UTF-32BE.
echo -n 🐕 | iconv -t UTF-32LE | od -An -tu4
128021
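If iconv is not at hand, the codepoint can also be recovered from the single bytes that od -t u1 printed, by stripping the marker bits and joining what remains. This is only a sketch for the four-byte case, with made-up variable names:

```shell
# Bytes as printed by: od -A n -t u1
b1=240; b2=159; b3=144; b4=149

# Drop the markers (11110 on the first byte, 10 on the others)
# and concatenate the remaining payload bits.
cp=$(( ((b1 & 0x07) << 18) | ((b2 & 0x3F) << 12) | ((b3 & 0x3F) << 6) | (b4 & 0x3F) ))
echo "$cp"
```

The result is 128021, the same value that printf "%d" "'🐕" gives.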

Question 2

With printf %b '\360\237\220\225' you (manually) reverse the oct-dump from od, so that the original UTF-8 encoding of 🐕 is printed to the terminal. Here, printf does not care about the character set or encodings at all; the terminal is the one interpreting the encoding.

The result of printf %o \'🐕 (372025) cannot be reversed that easily because ...

  1. Octal numbers do not align as nicely with bytes as e.g. hexadecimal. For a single byte (2⁸=256) two octal digits are insufficient (8²=64) and three octal digits are too many (8³=512). Hence, if you print 4 bytes as a single octal number (printf %o), some digits contain information from two bytes. Therefore you cannot split the octal number into 4 octal numbers (one for each byte) simply by grouping the existing digits. Instead, you have to convert into base 256 and then convert each base-256 digit into base 8 again, just like you would if you had one big decimal number. That's what od does ;)
    You could say, this part is an advanced form of "Why can printf read decimal numbers, but not base-5 numbers?".
  2. The resulting bytes would still have to be encoded in UTF-8, so that your terminal recognizes them. With printf %b '\360\237\220\225' this part was not necessary, because the UTF-8 encoded 🐕 was never decoded into its codepoint to begin with.
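The base-256 step from point 1 can be sketched with shell arithmetic. The example takes the 4-byte word that od -A n -t o printed in the question (22544117760) and splits it into per-byte octal numbers. It assumes a shell with 64-bit arithmetic (as in bash) and the little-endian byte order of the original output; the variable names are illustrative.

```shell
word=$(( 022544117760 ))    # leading 0 = octal; the word printed by od -A n -t o

# Extract the base-256 digits (bytes), lowest byte first, and print each in octal.
bytes_octal=$(printf '%o %o %o %o' \
  $(( word & 0xFF )) \
  $(( (word >> 8) & 0xFF )) \
  $(( (word >> 16) & 0xFF )) \
  $(( (word >> 24) & 0xFF )))
echo "$bytes_octal"
```

This prints 360 237 220 225, matching od -A n -t o1. Cutting the digits of 22544117760 into groups of three does not produce these numbers, because a byte boundary falls in the middle of an octal digit.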

Nevertheless, you can convert the octal representation of 🐕's codepoint back:

printf "\\U$(printf %08x 0372025)"  # leading 0 = octal number
🐕


Answered By - Socowi
Answer Checked By - Cary Denson (WPSolving Admin)