Unicode Search - Question

Question

Is this code OK? I don't actually know which normalization form I need (the only thing I noticed is that with NFD I get the wrong output).

#!/usr/local/bin/perl
use warnings;
use 5.014;
use utf8;
binmode STDOUT, ':encoding(utf-8)';

use Unicode::Normalize;
use Unicode::Collate::Locale;
use Unicode::GCString;

my $text = "my taxt täxt";
my %hash;

while ( $text =~ m/(\p{Alphabetic}+(?:'\p{Alphabetic}+)?)/g ) { #'
    my $word = $1;
    my $NFC_word = NFC( $word );
    $hash{$NFC_word}++;
}

my $collator = Unicode::Collate::Locale->new( locale => 'DE' ); 

for my $word ( $collator->sort( keys %hash ) ) {
    my $gcword = Unicode::GCString->new( $word );
    printf "%-10.10s : %5d\n", $gcword, $hash{$word};
}

perl unicode word unicode-normalization collate

sid_com Jul 13 '11 at 13:01

1 answer

tchrist · Accepted Answer · 2011-08-16T01:51:41+0000

Wow! I can't believe nobody answered this. It's a great question, and you very nearly had it right, too. I like that you are using Unicode::Collate::Locale and Unicode::GCString. Good for you!

The reason you get the "wrong" output is that you are not using the Unicode::GCString class's columns method to determine the print width of what you are printing.

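printf counts code points, not display columns, so you have to do the padding yourself and take the GCS column count into account. To do it by hand, instead of writing this:

 printf "%-10.10s", $gcstring;

you have to write something like this: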

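 my $colwidth = $gcstring->columns();
 if ($colwidth > 10) {
     # Too wide: keep the first 10 grapheme clusters.
     print $gcstring->substr(0, 10);
 } else {
     # "%-10s" left-justifies, so the padding goes after the string.
     print $gcstring;
     print " " x (10 - $colwidth);
 }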

See how that works?
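The manual padding above can be wrapped in a small helper. This is only a sketch: `pad_columns` is a hypothetical name, not part of Unicode::GCString, and it assumes a left-justified field, as in the `%-10.10s` format from the question. Unlike a plain `substr(0, 10)`, it trims by columns, so wide characters should not overflow the field.

```perl
use strict;
use warnings;
use utf8;
use Unicode::GCString;

# Hypothetical helper: left-justify $text in a field that is exactly
# $width display columns wide, truncating on grapheme-cluster boundaries.
sub pad_columns {
    my ($text, $width) = @_;
    my $gcs = Unicode::GCString->new($text);

    # Drop trailing grapheme clusters until the string fits the field.
    my $n = $gcs->length;    # length is counted in grapheme clusters
    $n-- while $n > 0 && $gcs->substr(0, $n)->columns > $width;

    my $fit = $gcs->substr(0, $n);
    return $fit->as_string . ' ' x ($width - $fit->columns);
}

binmode STDOUT, ':encoding(utf-8)';

# "ta\x{308}xt" is the decomposed (NFD) spelling of "täxt".
for my $word ("taxt", "ta\x{308}xt", "täxt") {
    print pad_columns($word, 10), " : ", length($word), " code points\n";
}
```

In the question's loop, this would stand in for the `%-10.10s` part of the printf, with the count still formatted by `%5d`.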

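As for normalization: it does not matter here. The UCA is designed so that the normalization form of the input does not affect the sort order, and the collator normalizes its input for you. You have to go out of your way to break that, for example by passing normalization => undef to the constructor, which you would do if you wanted to use the gmatch method and its relatives.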
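A quick way to check this is to compare the NFC and NFD spellings of the same word. The sketch below uses plain Unicode::Collate rather than Unicode::Collate::Locale to stay within the core modules; the variable names are only illustrative.

```perl
use strict;
use warnings;
use utf8;
use Unicode::Normalize qw(NFC NFD);
use Unicode::Collate;

my $nfc = NFC("täxt");    # precomposed: 4 code points
my $nfd = NFD("täxt");    # decomposed:  5 code points

# With no normalization argument the collator normalizes to NFD itself.
my $collator = Unicode::Collate->new;

printf "code points: NFC=%d NFD=%d\n", length $nfc, length $nfd;
printf "plain eq:    %s\n", $nfc eq $nfd ? "equal" : "different";
printf "collator eq: %s\n", $collator->eq($nfc, $nfd) ? "equal" : "different";
```

The two strings differ as code point sequences, which is what trips up printf, but the collator treats them as equal, so the choice of normalization form does not change the sort order.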
