How do I use Unicode code points above U+FFFF in Rebol 3 strings, like in Rebol 2?

I know you can't use caret-style escapes in strings for code points greater than ^(FF) in Rebol 2, because it knows nothing about Unicode. So this doesn't produce anything good; the output looks garbled:

print {Q: What does a Zen master {Cow} Say?  A: "^(03BC)"!}

However, the code works in Rebol 3 and prints:

Q: What does a Zen master {Cow} Say?  A: "μ"!

This is great, but R3 apparently maxes out its ability to hold a character in a string at U+FFFF:

>> type? "^(FFFF)"
== string!

>> type? "^(010000)"
** Syntax error: invalid "string" -- {"^^(010000)"}
** Near: (line 1) type? "^(010000)"

The situation is much better than the random behavior of Rebol 2 when it met code points it did not know about. However, Rebol used to have a workaround for storing such strings, if you knew how to do your own UTF-8 encoding (or got your strings by loading source code from disk). You could simply assemble them from individual characters.

So the UTF-8 encoding of U+010000 is #F0908080, and previously you could write:

workaround: rejoin [#"^(F0)" #"^(90)" #"^(80)" #"^(80)"]

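And you'd get a string with that single code point encoded using UTF-8, which you could save to disk and read back in again. Is there a similar trick in R3?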
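As a cross-check of the byte sequence quoted in the question, here is a minimal Python sketch (Python rather than Rebol, purely for illustration) that encodes a code point as UTF-8 by hand. The helper name `utf8_bytes` is hypothetical and not part of any Rebol or Python API.

```python
def utf8_bytes(code_point):
    """Encode one Unicode code point as UTF-8 by hand (illustrative helper)."""
    if code_point < 0 or code_point > 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("not a Unicode scalar value")
    if code_point < 0x80:
        # 1 byte: 0xxxxxxx
        return bytes([code_point])
    if code_point < 0x800:
        # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x10000:
        # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (code_point >> 18),
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])


print(utf8_bytes(0x10000).hex().upper())  # F0908080
```

The four bytes match the four characters joined in the workaround, which is why that string only "works" in an environment that treats each character as one raw byte.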


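Yes, there is a trick... and it is the trick you should have been using in R2 as well. Don't use a string! Use a binary! if you have to do this sort of thing: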

good-workaround: #{F0908080}

It would have worked in Rebol2, and it works in Rebol3. You can save it and load it without any tricks.

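In fact, if you care about Unicode at all... stop doing string processing that uses code points higher than ^(7F) if you are stuck in Rebol 2 and not 3. To see why, look at that terrible workaround: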

terrible-workaround: rejoin [#"^(F0)" #"^(90)" #"^(80)" #"^(80)"]

..."And you'd get a string with that single code point encoded using UTF-8"...

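The only thing you should get is a string with four individual character code points, with 4 = length? terrible-workaround. Rebol2 is simply broken here, because string! is basically no different from binary! under the hood. In fact, in Rebol2 you could alias the two types back and forth without making a copy; look up AS-BINARY and AS-STRING. (This is not possible in Rebol3, because the two types are fundamentally different, so don't get attached to the feature!)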

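It is somewhat deceptive to see these strings report a length of 4, and there is false comfort in each character producing the same value when converted with to integer!. If you ever write them out to a file or port, where they need to be encoded, you will get bitten. Note this in Rebol2: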

>> to integer! #"^(80)"
== 128

>> to binary! #"^(80)"
== #{80}

But in R3, you get a UTF-8 encoding when a binary conversion is needed:

>> to integer! #"^(80)"
== 128

>> to binary! #"^(80)"
== #{C280}
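The two transcripts correspond to two different encodings of the same code point. A small Python sketch (used here only as a neutral cross-check; the assumption is that R2's result matches a one-byte-per-character encoding such as Latin-1) reproduces both byte sequences:

```python
ch = "\u0080"  # the character written as #"^(80)" in Rebol

code = ord(ch)                  # integer value is the same either way
one_byte = ch.encode("latin-1") # one byte per character, like the R2 result
utf8 = ch.encode("utf-8")       # two bytes, like the R3 result

print(code)            # 128
print(one_byte.hex())  # 80
print(utf8.hex())      # c280
```

The integer value agrees in both cases; only the serialized bytes differ, which is exactly where the two Rebol versions diverge.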

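So you will be in for a surprise when your seemingly working code does something different later and ends up serializing differently. In fact, if you want to know how "messed up" R2 is in this regard, look at why you got a weird symbol for your "mu". In R2: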

>> to binary! #"^(03BC)"
== #{BC}

It just threw the "03" away. :-/

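So if for some reason you need to work with Unicode strings and cannot switch to R3, try something like this for the cow example: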

mu-utf8: #{CEBC}  ; UTF-8 bytes for U+03BC
utf8: rejoin [#{} {Q: What does a Zen master {Cow} Say?  A: "} mu-utf8 {"!}]

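That gets you a binary. Only convert it to a string for debug output, and be ready to see gibberish. But it is the right thing to do if you are stuck in Rebol2.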
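To make the byte-level picture concrete, here is a hedged Python sketch of the same idea: keep the ASCII text as bytes and splice in the UTF-8 bytes of the one non-ASCII code point. Note that the UTF-8 bytes for U+03BC are CE BC, not the bare code point value 03 BC. The variable names are illustrative only.

```python
MU = 0x03BC

mu_utf8 = chr(MU).encode("utf-8")  # the bytes a UTF-8 consumer expects
low_byte_only = bytes([MU & 0xFF]) # what is left if the "03" is dropped

# Assemble the message as bytes: ASCII text plus the encoded code point.
message = b'Q: What does a Zen master {Cow} Say?  A: "' + mu_utf8 + b'"!'

print(mu_utf8.hex())            # cebc
print(low_byte_only.hex())      # bc
print(message.decode("utf-8"))  # Q: What does a Zen master {Cow} Say?  A: "μ"!
```

Decoding happens only at the very end, for display, which mirrors the advice to hold the result in a binary series and convert to a string only for debug output.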

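And to reiterate the answer: it is also what to do if, for some odd reason, you are stuck needing those higher code points in Rebol3: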

; #{F0908080} is the UTF-8 encoding of U+10000
utf8: rejoin [#{} {Q: What did the Mycenaean {Cow} Say?  A: "} #{F0908080} {"!}]

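I'm sure that would be a very funny joke if I knew what LINEAR B SYLLABLE B008 A was. Which leads me to say that, most likely, if you are doing something this esoteric you only have a few such code points, cited as examples. You can hold most of your data as strings until you need to slot those code points in, and hold the result in a binary series.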


UPDATE: If you hit this problem, here is a utility function that can help work around it temporarily:

safe-r2-char: charset [#"^(00)" - #"^(7F)"]
unsafe-r2-char: charset [#"^(80)" - #"^(FF)"]
hex-digit: charset [#"0" - #"9" #"A" - #"F" #"a" - #"f"]

r2-string-to-binary: func [
    str [string!] /string /unescape /unsafe
    /local result s e escape-rule unsafe-rule safe-rule rule
] [
    result: copy either string [{}] [#{}]
    escape-rule: [
        "^^(" s: 2 hex-digit e: ")" (
            append result debase/base copy/part s e 16
        )
    ]
    unsafe-rule: [
        s: unsafe-r2-char (
            append result to integer! first s
        )
    ]
    safe-rule: [
        s: safe-r2-char (append result first s)
    ]
    rule: compose/deep [
        any [
            (either unescape [[escape-rule |]] [])
            safe-rule
            (either unsafe [[| unsafe-rule]] [])
        ]
    ]
    unless parse/all str rule [
        print "Unsafe codepoints found in string! by r2-string-to-binary"
        print "See http://stackoverflow.com/questions/15077974/"
        print mold str
        throw "Bad codepoint found by r2-string-to-binary"
    ]
    result
]

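If you use this instead of a to binary! conversion, you get consistent behavior in both Rebol2 and Rebol3. (It effectively implements a solution for terrible-workaround style strings.)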
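For readers who do not have Rebol at hand, the core idea of the utility can be sketched in Python. This is a simplified analogue under stated assumptions, not a port: it covers only the default mode and the /unsafe refinement, and it leaves out /string and /unescape. The function name `r2_string_to_bytes` and the `allow_unsafe` parameter are hypothetical.

```python
def r2_string_to_bytes(text, allow_unsafe=False):
    """Map each character to exactly one byte, refusing anything ambiguous.

    Code points 0x00-0x7F are always accepted. Code points 0x80-0xFF are
    accepted only when allow_unsafe is true. Anything above 0xFF is rejected,
    because it cannot be stored in a single byte.
    """
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp <= 0x7F or (allow_unsafe and cp <= 0xFF):
            out.append(cp)
        else:
            raise ValueError("Bad codepoint U+%04X" % cp)
    return bytes(out)


# A "terrible-workaround" style string: four characters, one per UTF-8 byte.
raw = "\u00f0\u0090\u0080\u0080"
print(r2_string_to_bytes(raw, allow_unsafe=True).hex())  # f0908080
```

Because every accepted character becomes exactly one byte, the result does not depend on whether the host would otherwise apply UTF-8 during conversion.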


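There is a workaround using the string! datatype as well. You cannot use UTF-8 in that case, but you can use UTF-16, as follows: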

utf-16: "^(d800)^(dc00)"

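which encodes the ^(10000) code point as a UTF-16 surrogate pair. In general, the following function can do the encoding: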

utf-16: func [
    code [integer!]
    /local low high
] [
    case [
        code < 0 [do make error! "invalid code"]
        code < 65536 [append copy "" to char! code]
        code < 1114112 [
            code: code - 65536
            low: code and 1023
            high: code - low / 1024
            append append copy "" to char! high + 55296 to char! low + 56320
        ]
        'else [do make error! "invalid code"]
    ]
]
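The arithmetic in the function above can be checked independently. The following Python sketch mirrors the same steps (subtract 65536, split into a high and a low 10-bit half, add 55296 and 56320) and returns the UTF-16 code units as integers. The name `utf16_units` is hypothetical; like the Rebol version, it does not reject code points that fall inside the surrogate range itself.

```python
def utf16_units(code):
    """Return the UTF-16 code units for one code point, as a list of ints."""
    if code < 0 or code > 0x10FFFF:
        raise ValueError("invalid code")
    if code < 0x10000:
        # Basic Multilingual Plane: a single code unit.
        return [code]
    offset = code - 0x10000   # 20-bit value
    low = offset & 0x3FF      # bottom 10 bits
    high = offset >> 10       # top 10 bits
    return [0xD800 + high, 0xDC00 + low]


print([hex(u) for u in utf16_units(0x10000)])  # ['0xd800', '0xdc00']
```

For U+10000 this yields D800 DC00, the same pair written as "^(d800)^(dc00)" in the string literal earlier in this answer.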