How is this byte stream interpreted as UTF-8-encoded Hebrew?

The following byte stream is identified as UTF-8; it contains the Hebrew sentence דירות לשותפים בתל אביב - הומלס. I am trying to understand the encoding.

ubuntu@ip-10-126-21-104:~$ od -t x1 homeless-title-fromwireshark_followed_by_hexdump.txt
0000000 0a 09 d7 93 d7 99 d7 a8 d7 95 d7 aa 20 d7 9c d7
0000020 a9 d7 95 d7 aa d7 a4 d7 99 d7 9d 20 20 d7 91 d7
0000040 aa d7 9c 20 d7 90 d7 91 d7 99 d7 91 20 2d 20 d7
0000060 94 d7 95 d7 9e d7 9c d7 a1 0a
0000072
ubuntu@ip-10-126-21-104:~$ file -i homeless-title-fromwireshark_followed_by_hexdump.txt
homeless-title-fromwireshark_followed_by_hexdump.txt: text/plain; charset=utf-8

To check that it is a UTF-8 file, I opened Notepad (Windows 7), typed the Hebrew letter ד, and saved the file. The result is the following:

ubuntu@ip-10-126-21-104:~$ od -t x1 test_from_notepad_utf8_daled.txt
0000000 ef bb bf d7 93
0000005
ubuntu@ip-10-126-21-104:~$ file -i test_from_notepad_utf8_daled.txt
test_from_notepad_utf8_daled.txt: text/plain; charset=utf-8

Here ef bb bf is the byte order mark (BOM) encoded in UTF-8, and d7 93 is exactly the byte sequence that appears in the original stream after 0a 09 (newline, tab in ASCII).

The problem is that, according to the Unicode code charts, ד should be encoded as 05 D3, so why and how does the UTF-8 encoding produce d7 93?

d7 93 in binary is 11010111 10010011, whereas 05 D3 in binary is 00000101 11010011.

(The Unicode name of the character ד is "HEBREW LETTER DALET".)

+3

Unicode code points U+0000..U+007F are encoded in UTF-8 as the single bytes 0x00..0x7F.

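Unicode code points U+0080..U+07FF (including HEBREW LETTER DALET, U+05D3) are encoded in UTF-8 as two bytes. Such a code point fits in 11 bits, which are split into the high 5 bits and the low 6 bits: xxxxxyyyyyy. The first UTF-8 byte is 110xxxxx; the second is 10yyyyyy.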

0x05D3 = 0000 0101 1101 0011 

The low 6 bits of 0x05D3 are 010011; prefixed with 10, that gives 1001 0011, or 0x93. The next 5 bits are 10111; prefixed with 110, that gives 1101 0111, or 0xD7.

Hence, the UTF-8 encoding of U+05D3 is 0xD7 0x93.
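The bit splitting described above can be sketched in Python. This is only an illustration: the function name encode_two_byte is made up for this example, and it handles nothing outside the two-byte range U+0080..U+07FF.

```python
def encode_two_byte(code_point: int) -> bytes:
    """Encode a code point in U+0080..U+07FF as two UTF-8 bytes."""
    if not 0x80 <= code_point <= 0x7FF:
        raise ValueError("code point outside the two-byte UTF-8 range")
    high5 = code_point >> 6        # the xxxxx bits
    low6 = code_point & 0b111111   # the yyyyyy bits
    # Lead byte is 110xxxxx, continuation byte is 10yyyyyy.
    return bytes([0b11000000 | high5, 0b10000000 | low6])


dalet = encode_two_byte(0x05D3)
print(dalet.hex(" "))                       # d7 93
print(dalet == "\u05d3".encode("utf-8"))    # True: matches Python's own encoder
```

Comparing against the built-in str.encode is a quick sanity check that the hand-rolled bit layout agrees with a real UTF-8 encoder.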

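For code points from U+0800 upward, you need 3 or 4 bytes for the (valid) UTF-8 representation. The continuation bytes always have the pattern 10yyyyyy. The first byte has the pattern 1110xxxx (3 bytes) or 11110xxx (4 bytes). Some byte values can never appear in valid UTF-8: 0xC0, 0xC1, and 0xF5..0xFF.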
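As a rough sketch of how the first byte determines the sequence length, the following Python function classifies a lead byte. The name sequence_length is hypothetical, and returning 0 for bytes that cannot start a sequence is a choice made for this example only.

```python
def sequence_length(lead_byte: int) -> int:
    """Return the UTF-8 sequence length implied by a lead byte.

    Returns 0 for bytes that cannot start a sequence: continuation
    bytes (0x80..0xBF) and the never-valid values 0xC0, 0xC1, 0xF5..0xFF.
    """
    if lead_byte <= 0x7F:
        return 1                      # 0xxxxxxx
    if 0xC2 <= lead_byte <= 0xDF:
        return 2                      # 110xxxxx (0xC0 and 0xC1 excluded)
    if 0xE0 <= lead_byte <= 0xEF:
        return 3                      # 1110xxxx
    if 0xF0 <= lead_byte <= 0xF4:
        return 4                      # 11110xxx (0xF5..0xF7 excluded)
    return 0


for b in (0x0A, 0xD7, 0xE2, 0xF0, 0x93, 0xC0, 0xF5):
    print(f"0x{b:02X} -> {sequence_length(b)}")

# Python's own decoder also refuses a sequence that starts with 0xC0.
try:
    bytes([0xC0, 0x80]).decode("utf-8")
except UnicodeDecodeError as exc:
    print("rejected:", exc.reason)
```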

+4

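Unicode assigns (among other things) a numeric "code point" to each character. HEBREW LETTER DALET is U+05D3, i.e. the number 0x05D3. How that number is represented as bytes (i.e. how it is encoded) for storage or transmission is a separate question. UTF-8 (like UTF-16 and UTF-32) is one such encoding.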
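The difference between a code point and its encoded bytes can be seen in a short Python session. This is a minimal sketch; the byte string is the first word of the stream from the question, and the variable names are arbitrary.

```python
import unicodedata

ch = "\u05d3"
print(hex(ord(ch)))              # 0x5d3, the code point as a number
print(unicodedata.name(ch))      # HEBREW LETTER DALET
print(ch.encode("utf-8").hex(" "))   # d7 93, the UTF-8 bytes

# Decoding the first word of the stream recovers the code points.
first_word = bytes.fromhex("d7 93 d7 99 d7 a8 d7 95 d7 aa")
code_points = [f"U+{ord(c):04X}" for c in first_word.decode("utf-8")]
print(code_points)   # ['U+05D3', 'U+05D9', 'U+05E8', 'U+05D5', 'U+05EA']
```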

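UTF-8 is a variable-length encoding (it is covered in other questions on SO). In UTF-8, HEBREW LETTER DALET is encoded as the bytes 0xD7 0x93. In UTF-32 or UCS-4, by contrast, each code point is stored as a (fixed-width) 32-bit value, so the stored value is the same number as the Unicode code point.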

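For an introduction to Unicode, see Joel Spolsky's article "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)".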

+6

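Unicode by itself is not a byte encoding.

In Unicode, each character is assigned a code point. The code point of ד is U+05D3.

How a code point is turned into bytes depends on the encoding that is used.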

UTF-8 is a variable-width encoding that uses 1 to 4 bytes per code point. It is defined in RFC 3629.
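To see the variable width in practice, this small Python sketch encodes one sample character from each length class. The sample characters are arbitrary choices for illustration.

```python
# One sample per UTF-8 length class: ASCII, Hebrew, euro sign, an emoji.
samples = ("A", "\u05d3", "\u20ac", "\U0001F600")

for ch in samples:
    encoded = ch.encode("utf-8")
    print(f"U+{ord(ch):04X} -> {len(encoded)} byte(s): {encoded.hex(' ')}")
```

The lengths grow from 1 to 4 bytes as the code point value increases, with ד (U+05D3) falling in the two-byte class.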

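There is also UTF-16, which uses 2-byte code units, and UTF-32, which uses 4 bytes per code point. So U+05D3 is encoded as 00 00 05 D3 or D3 05 00 00 in UTF-32, depending on the byte order (big-endian or little-endian).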
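The two byte orders can be checked with Python's built-in codecs. This is a small sketch using the explicit big-endian and little-endian codec names so that no BOM is prepended.

```python
ch = "\u05d3"

# UTF-32: the code point itself, padded to 4 bytes, in either byte order.
print(ch.encode("utf-32-be").hex(" "))   # 00 00 05 d3
print(ch.encode("utf-32-le").hex(" "))   # d3 05 00 00

# UTF-16: a single 2-byte code unit for this character.
print(ch.encode("utf-16-be").hex(" "))   # 05 d3
print(ch.encode("utf-16-le").hex(" "))   # d3 05
```

The big-endian UTF-16 form, 05 d3, is the byte pair the question expected to find in the file.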

There is also UTF-7, but it is rarely used.

+2
