Perl Byte Based Substr

I am using SimpleDB for my application. Everything works well except that a single attribute is limited to 1024 bytes. So for a long string, I have to cut the string into pieces and save them.

My problem is that sometimes my string contains Unicode characters (Chinese, Japanese, Greek), and substr() works on characters, not bytes.

I tried use bytes for byte semantics, and later substr(encode_utf8($str), $start, $length), but that doesn't help at all.

Any help would be appreciated.

+3
2 answers

UTF-8 was designed so that character boundaries are easy to detect. To break a string into chunks of valid UTF-8, you can simply use the following:

use Encode qw( encode_utf8 decode_utf8 );

my $utf8 = encode_utf8($text);
my @utf8_chunks = $utf8 =~ /\G(.{1,1024})(?![\x80-\xBF])/sg;

# Either: the saving code expects bytes.
store($_) for @utf8_chunks;

# Or: the saving code expects decoded text.
store(decode_utf8($_)) for @utf8_chunks;

Test:

$ perl -e'
    use Encode qw( encode_utf8 );

    # This character encodes to three bytes using UTF-8.
    my $text = "\N{U+2660}" x 342;

    my $utf8 = encode_utf8($text);
    my @utf8_chunks = $utf8 =~ /\G(.{1,1024})(?![\x80-\xBF])/sg;

    CORE::say(length($_)) for @utf8_chunks;
'
1023
3
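
To read a value back, the chunks would have to be concatenated as bytes and decoded once. A minimal sketch of that round trip, assuming the saving code stores bytes; split_utf8 and join_utf8 are hypothetical helper names used only for illustration, not part of Encode:

```perl
use strict;
use warnings;
use Encode qw( encode_utf8 decode_utf8 );

# Hypothetical helper: split text into UTF-8 byte chunks of at most
# $max bytes each, never cutting inside a multi-byte character.
sub split_utf8 {
    my ($text, $max) = @_;
    my $utf8 = encode_utf8($text);
    # The lookahead rejects a cut that is followed by a continuation byte.
    return $utf8 =~ /\G(.{1,$max})(?![\x80-\xBF])/sg;
}

# Hypothetical helper: concatenate the byte chunks, then decode once.
sub join_utf8 {
    return decode_utf8(join('', @_));
}

# Sample text mixing 3-byte and 1-byte characters.
my $text   = "\x{4E2D}\x{6587}abc" x 300;
my @chunks = split_utf8($text, 1024);
my $back   = join_utf8(@chunks);

print "chunks: ", scalar(@chunks), "\n";
print "round trip ok\n" if $back eq $text;
```

Joining before decoding keeps the read path independent of where the cuts fell.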
+5

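substr works on 1-byte characters unless the string has the UTF-8 flag on. So this gives you the first 1024 bytes of the encoded string: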

substr encode_utf8($str), 0, 1024;

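However, this does not necessarily cut the string on a character boundary. To discard a partial character at the end, you can use: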

$str = decode_utf8($str, Encode::FB_QUIET);
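
The two steps above handle only the first piece. One way to cover the whole string is to repeat them in a loop, advancing by the number of bytes that actually decoded. This is a sketch under that assumption; chunk_by_bytes is a hypothetical name, and it relies on the documented Encode behaviour that FB_QUIET leaves the unprocessed bytes in the source variable:

```perl
use strict;
use warnings;
use Encode qw( encode_utf8 decode_utf8 );

# Hypothetical helper: cut $text into pieces whose UTF-8 encoding is at
# most $max bytes each, using substr on the bytes plus FB_QUIET.
sub chunk_by_bytes {
    my ($text, $max) = @_;
    my $bytes = encode_utf8($text);
    my @pieces;
    while (length $bytes) {
        # Take up to $max bytes; the cut may land inside a character.
        my $head  = substr($bytes, 0, $max);
        my $taken = length $head;
        # FB_QUIET returns the valid prefix and leaves the undecoded
        # tail (a partial trailing character, if any) in $head.
        my $piece = decode_utf8($head, Encode::FB_QUIET);
        $taken -= length $head;
        die "cannot make progress\n" unless $taken;
        push @pieces, $piece;
        # Drop the bytes that were decoded; the tail starts the next piece.
        substr($bytes, 0, $taken, '');
    }
    return @pieces;
}

# Each of these characters encodes to three bytes in UTF-8.
my $text   = "\x{2660}" x 342;
my @pieces = chunk_by_bytes($text, 1024);
print length(encode_utf8($_)), "\n" for @pieces;
```

The pieces here are decoded text; pass them through encode_utf8 again if the saving code expects bytes.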
+1
