Perl Byte Based Substr

Question

I am using SimpleDB for my application. Everything works well except for the restriction that one attribute is limited to 1024 bytes. So for a long string, I have to cut the string into pieces and save them.

My problem is that sometimes my string contains Unicode characters (Chinese, Japanese, Greek), and substr() operates on characters, not bytes.

I tried using use bytes for byte semantics, and later substr(encode_utf8($str), $start, $length), but that doesn't help at all.

Any help would be appreciated.

perl amazon-simpledb utf-8

Minh le Apr 24 '12 at 16:56

2 answers

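substr works on 1-byte characters unless the string is flagged as UTF-8. So this gives you the first 1024 bytes of the encoded string: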

substr encode_utf8($str), 0, 1024;

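However, this may cut the string in the middle of a character. To discard an incomplete character at the end, you can use: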

$str = decode_utf8($str, Encode::FB_QUIET);
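The two steps above can be combined into a loop that walks the whole string. This is only a minimal sketch, not the answerer's code: split_by_bytes is a hypothetical helper name, and it assumes the limit is at least 4 bytes so that every pass can consume at least one whole character. It relies on the documented Encode behavior that, with FB_QUIET, the decoder returns the valid prefix and leaves the unprocessed bytes in the source variable.

```perl
use strict;
use warnings;
use Encode qw( encode_utf8 decode_utf8 );

# Hypothetical helper (not from the answer): split $text into decoded
# pieces whose UTF-8 encoding is at most $max_bytes long.
# Assumes $max_bytes >= 4, the longest UTF-8 sequence for a Unicode character.
sub split_by_bytes {
    my ($text, $max_bytes) = @_;
    my $bytes = encode_utf8($text);
    my @pieces;
    while (length $bytes) {
        # Take up to $max_bytes bytes; this may cut a character in half.
        my $piece = substr($bytes, 0, $max_bytes);
        my $taken = length $piece;
        # FB_QUIET returns the valid prefix and leaves the incomplete
        # trailing bytes (if any) in $piece.
        my $decoded = decode_utf8($piece, Encode::FB_QUIET);
        $taken -= length $piece;
        die "Cannot make progress; input is not valid UTF-8" unless $taken > 0;
        push @pieces, $decoded;
        # Remove only the bytes that were actually decoded.
        substr($bytes, 0, $taken, '');
    }
    return @pieces;
}

# Print the encoded size of each piece for a string of 3-byte characters.
my @pieces = split_by_bytes("\x{2660}" x 342, 1024);
print length(encode_utf8($_)), "\n" for @pieces;
```

The pieces come back as decoded text, so they would need to be encoded again if the saving code expects bytes.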

Eugene Yarmash Apr 24 '12 at 17:24

ikegami · Accepted Answer · 2012-04-24T17:42:52+0000

UTF-8 was designed so that character boundaries are easy to detect. To break a string into chunks of valid UTF-8, you can simply use the following:

my $utf8 = encode_utf8($text);
my @utf8_chunks = $utf8 =~ /\G(.{1,1024})(?![\x80-\xBF])/sg;

# The saving code expects bytes.
store($_) for @utf8_chunks;

# The saving code expects decoded text.
store(decode_utf8($_)) for @utf8_chunks;

Test:

$ perl -e'
    use Encode qw( encode_utf8 );

    # This character encodes to three bytes using UTF-8.
    my $text = "\N{U+2660}" x 342;

    my $utf8 = encode_utf8($text);
    my @utf8_chunks = $utf8 =~ /\G(.{1,1024})(?![\x80-\xBF])/sg;

    CORE::say(length($_)) for @utf8_chunks;
'
1023
3
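The lookahead works because every continuation byte of a multi-byte UTF-8 sequence falls in the range 0x80-0xBF, so a chunk is only allowed to end where the next byte starts a new character. As a further check, here is a hedged sketch that wraps the same regex in a sub and verifies a round trip with characters of mixed width. The name utf8_chunks and the $max_bytes parameter are illustrative assumptions; the limit should be at least 4 and small enough to be a legal regex quantifier.

```perl
use strict;
use warnings;
use Encode qw( encode_utf8 decode_utf8 );

# Hypothetical wrapper around the regex above: returns byte strings of
# at most $max_bytes bytes, each ending on a character boundary.
sub utf8_chunks {
    my ($text, $max_bytes) = @_;
    my $utf8 = encode_utf8($text);
    # Greedily take up to $max_bytes bytes, then back off while the next
    # byte is a continuation byte (0x80-0xBF).
    return $utf8 =~ /\G(.{1,$max_bytes})(?![\x80-\xBF])/sg;
}

# Mix of 1-, 2-, 3- and 4-byte characters (1621 bytes once encoded).
my $text = "a" . ("\x{3B1}" x 10) . ("\x{4E2D}" x 400) . ("\x{1F600}" x 100);
my @chunks = utf8_chunks($text, 1024);

for my $chunk (@chunks) {
    die "chunk too long" if length($chunk) > 1024;
    # Croaks if a chunk is not valid UTF-8 on its own;
    # LEAVE_SRC keeps decode_utf8 from emptying $chunk.
    decode_utf8($chunk, Encode::FB_CROAK | Encode::LEAVE_SRC);
}

my $rebuilt = decode_utf8(join('', @chunks));
print $rebuilt eq $text ? "round trip ok\n" : "round trip FAILED\n";
```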
