UTF-8 display width problem with Chinese characters

When I use Perl or C to printf some data, I use the format string to control the width of each column, for example

printf("%-30s", str);

But when str contains Chinese characters, the columns do not align as expected. See the attached picture.

My Ubuntu locale is zh_CN.utf8. As far as I know, a UTF-8 character is encoded in 1 to 4 bytes, and a Chinese character takes 3 bytes. In my test, I found that the printf format counts a Chinese character as 3, but it actually displays as wide as 2 ASCII characters.

Thus, the actual screen width is not constant as expected, but a variable that depends on the number of Chinese characters, i.e.

Sw(x) = 1 * (w - 3x) + 2 * x = w - x

w is the width limit, x is the number of Chinese characters, and Sw(x) is the actual width on screen.

Thus, the more Chinese characters str contains, the shorter the column is displayed.
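The arithmetic above can be sketched in C. This is only an illustration under the question's own assumptions: str holds nothing but ASCII (1 byte, 1 column) and Chinese characters (3 bytes, 2 columns), and printf pads by bytes. The function names are hypothetical, not part of any library.

```c
#include <string.h>

/* Number of three-byte UTF-8 sequences in s, found by their lead byte
   (1110xxxx). Assumption: in this data each one is a Chinese character. */
int count_three_byte_chars(const char *s) {
    int x = 0;
    for (; *s != '\0'; s++)
        if (((unsigned char)*s & 0xF0) == 0xE0)
            x++;
    return x;
}

/* Columns occupied on screen after printf("%-*s", w, s), assuming s
   contains only ASCII and three-byte, two-column characters. */
int screen_columns(const char *s, int w) {
    int bytes = (int)strlen(s);
    int x = count_three_byte_chars(s);
    int text_cols = (bytes - 3 * x) + 2 * x;  /* columns of the text itself */
    int pad = bytes < w ? w - bytes : 0;      /* printf pads by bytes */
    return text_cols + pad;                   /* equals w - x when bytes <= w */
}

/* Field width to hand to printf so the output spans w columns:
   add back the one column lost per Chinese character. */
int compensated_width(const char *s, int w) {
    return w + count_three_byte_chars(s);
}
```

Under those assumptions, printf("%-*s", compensated_width(str, 30), str) fills 30 columns. It breaks as soon as str contains anything else, such as accented Latin letters (2 bytes, 1 column).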

How can I get what I want? Count Chinese characters before printf?

As far as I know, all Chinese characters, and I think all wide characters, are displayed 2 columns wide, so why does printf count each of them as 3? The UTF-8 encoding has nothing to do with display width.

1 answer

Yes, this is a problem with all versions of printf that I know of. I briefly discuss the issue in this answer and also in this one.

For C, I don’t know of a library that will do this for you, but if anybody has one, it would be ICU.
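Without a library, one crude option in C is to count columns by hand and pad manually. The sketch below is an approximation under stated assumptions, not ICU and not the full Unicode East Asian Width data: approx_width hard-codes a handful of ranges (combining diacritics as 0 columns; kana, CJK ideographs, Hangul syllables and fullwidth forms as 2), and every function name here is hypothetical.

```c
#include <assert.h>
#include <string.h>

/* Decode one UTF-8 sequence at s into *cp; return the bytes consumed.
   Malformed input yields U+FFFD and consumes one byte. Overlong forms
   and surrogates are not rejected in this sketch. */
int utf8_decode(const unsigned char *s, unsigned long *cp) {
    if (s[0] < 0x80) { *cp = s[0]; return 1; }
    if ((s[0] & 0xE0) == 0xC0 && (s[1] & 0xC0) == 0x80) {
        *cp = ((unsigned long)(s[0] & 0x1F) << 6) | (s[1] & 0x3F);
        return 2;
    }
    if ((s[0] & 0xF0) == 0xE0 && (s[1] & 0xC0) == 0x80 &&
        (s[2] & 0xC0) == 0x80) {
        *cp = ((unsigned long)(s[0] & 0x0F) << 12) |
              ((unsigned long)(s[1] & 0x3F) << 6) | (s[2] & 0x3F);
        return 3;
    }
    if ((s[0] & 0xF8) == 0xF0 && (s[1] & 0xC0) == 0x80 &&
        (s[2] & 0xC0) == 0x80 && (s[3] & 0xC0) == 0x80) {
        *cp = ((unsigned long)(s[0] & 0x07) << 18) |
              ((unsigned long)(s[1] & 0x3F) << 12) |
              ((unsigned long)(s[2] & 0x3F) << 6) | (s[3] & 0x3F);
        return 4;
    }
    *cp = 0xFFFD;
    return 1;
}

/* Approximate column width of one code point (illustrative ranges only). */
int approx_width(unsigned long cp) {
    if (cp >= 0x0300 && cp <= 0x036F) return 0;   /* combining diacritics */
    if ((cp >= 0x3040 && cp <= 0x30FF) ||         /* Hiragana, Katakana */
        (cp >= 0x4E00 && cp <= 0x9FFF) ||         /* CJK Unified Ideographs */
        (cp >= 0xAC00 && cp <= 0xD7A3) ||         /* Hangul syllables */
        (cp >= 0xFF01 && cp <= 0xFF60))           /* fullwidth forms */
        return 2;
    return 1;
}

/* Sum of the approximate widths of all code points in a UTF-8 string. */
int display_columns(const char *s) {
    const unsigned char *p = (const unsigned char *)s;
    int cols = 0;
    while (*p != '\0') {
        unsigned long cp;
        p += utf8_decode(p, &cp);
        cols += approx_width(cp);
    }
    return cols;
}

/* Copy s into out and append padchar until `width` columns are filled.
   Returns 0 on success, -1 if out is too small. */
int pad_columns(const char *s, int width, char padchar,
                char *out, size_t outsize) {
    assert(s != NULL && out != NULL);
    size_t len = strlen(s);
    int cols = display_columns(s);
    size_t fill = cols < width ? (size_t)(width - cols) : 0;
    if (len + fill + 1 > outsize) return -1;
    memcpy(out, s, len);
    memset(out + len, padchar, fill);
    out[len + fill] = '\0';
    return 0;
}
```

The padded buffer can then be printed with a plain %s. Real terminals follow much larger width tables, so treat the ranges above as placeholders for proper width data.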

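For Perl, use the Unicode::GCString module from CPAN to calculate the number of print columns a Unicode string takes up. It takes into account Unicode Standard Annex #11: East Asian Width.

For example, some code points take up 1 column and others take up 2. Some take up no columns at all, such as combining characters. The class has a columns method that returns how many columns the string occupies.

I have an example of using it to align Unicode text vertically. It sorts a set of Unicode strings, including some with "wide" Asian characters (CJK), and lines them up in columns.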

(Image: sample terminal output)

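The code for the umenu demo program, which prints that aligned output, is included below.

You may also be interested in the more ambitious Unicode::LineBreak module, of which Unicode::GCString is a component. It takes into account Unicode Standard Annex #14: Unicode Line Breaking Algorithm.

Here is the code for the umenu demo, tested on Perl v5.14: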

 #!/usr/bin/env perl
 # umenu - demo sorting and printing of Unicode food
 #
 # (obligatory and increasingly long preamble)
 #
 use utf8;
 use v5.14;                       # for locale sorting
 use strict;
 use warnings;
 use warnings  qw(FATAL utf8);    # fatalize encoding faults
 use open      qw(:std :utf8);    # undeclared streams in UTF-8
 use charnames qw(:full :short);  # unneeded in v5.16

 # std modules
 use Unicode::Normalize;          # std perl distro as of v5.8
 use List::Util qw(max);          # std perl distro as of v5.10
 use Unicode::Collate::Locale;    # std perl distro as of v5.14

 # cpan modules
 use Unicode::GCString;           # from CPAN

 # forward defs
 sub pad($$$);
 sub colwidth(_);
 sub entitle(_);

 my %price = (
     "γύρος"             => 6.50, # gyros, Greek
     "pears"             => 2.00, # like um, pears
     "linguiça"          => 7.00, # spicy sausage, Portuguese
     "xoriço"            => 3.00, # chorizo sausage, Catalan
     "hamburger"         => 6.00, # burgermeister meisterburger
     "éclair"            => 1.60, # dessert, French
     "smørbrød"          => 5.75, # sandwiches, Norwegian
     "spätzle"           => 5.50, # Bayerisch noodles, little sparrows
     "包子"              => 7.50, # bao1 zi5, steamed pork buns, Mandarin
     "jamón serrano"     => 4.45, # country ham, Spanish
     "pêches"            => 2.25, # peaches, French
     "シュークリーム"    => 1.85, # cream-filled pastry like éclair, Japanese
     "막걸리"            => 4.00, # makgeolli, Korean rice wine
     "寿司"              => 9.99, # sushi, Japanese
     "おもち"            => 2.65, # omochi, rice cakes, Japanese
     "crème brûlée"      => 2.00, # tasty broiled cream, French
     "fideuà"            => 4.20, # more noodles, Valencian (Catalan=fideuada)
     "pâté"              => 4.15, # gooseliver paste, French
     "お好み焼き"        => 8.00, # okonomiyaki, Japanese
 );

 my $width = 5 + max map { colwidth } keys %price;

 # So the Asian stuff comes out in an order that someone
 # who reads those scripts won't freak out over; the
 # CJK stuff will be in JIS X 0208 order that way.
 my $coll  = Unicode::Collate::Locale->new(locale => "ja");

 for my $item ($coll->sort(keys %price)) {
     print pad(entitle($item), $width, ".");
     printf " €%.2f\n", $price{$item};
 }

 sub pad($$$) {
     my($str, $width, $padchar) = @_;
     return $str . ($padchar x ($width - colwidth($str)));
 }

 sub colwidth(_) {
     my($str) = @_;
     return Unicode::GCString->new($str)->columns;
 }

 sub entitle(_) {
     my($str) = @_;
     $str =~ s{ (?=\pL)(\S)     (\S*) }
              { ucfirst($1) . lc($2)  }xge;
     return $str;
 }

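As you can see, the key to making it work in this program is the following line, which calls the functions defined above and relies on the module discussed earlier: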

print pad(entitle($item), $width, ".");

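It pads the item out to the given width, using dots as the fill character.

Yes, this is a lot less convenient than printf, but at least it is possible.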
