This is a variation on questions that have been asked before, but I still can't find an answer, so I'm trying to boil it down to the essence of the problem in the hope that there is a solution.
I have a database in which, for historical reasons, some text records are not UTF-8. Most of them are, including every record from the last 3 years, but some older entries are not.
It's important to find non-UTF-8 characters so that I can either escape them or convert them to UTF-8 for some XML I'm trying to create.
I'm using server-side JavaScript, and it has a ByteBuffer type, so I can treat text in any character set as individual bytes and check them as needed. That means I don't have to use the String type, which, as I understand it, is problematic in this situation.
Is there any test I can run on the bytes of a text to determine whether it is UTF-8 or not?
I have searched for a couple of months (;_;) and still haven't found an answer. There must be a way to do this, though, since XML validators (for example, those in the major browsers) can report "encoding errors" when they encounter non-UTF-8 characters.
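To illustrate the kind of strict check I mean: on runtimes that provide the WHATWG TextDecoder (Node.js and browsers do; I am only assuming it here, and my own environment may not have it), a decoder created with `fatal: true` throws on malformed input instead of silently substituting characters. The helper name `isValidUtf8` is my own, and this is only a sketch. It answers yes or no for a whole buffer; it does not tell me which bytes are bad.

```javascript
// Sketch: strict UTF-8 check, assuming a global WHATWG TextDecoder exists.
// `bytes` is assumed to be an array-like of byte values; masking with 0xFF
// also accepts signed bytes (-128..127) such as those read from a Java-style ByteBuffer.
function isValidUtf8(bytes) {
  try {
    const data = Uint8Array.from(bytes, (b) => b & 0xFF);
    // With fatal: true, decode() throws a TypeError on malformed UTF-8.
    new TextDecoder('utf-8', { fatal: true }).decode(data);
    return true;
  } catch (e) {
    return false;
  }
}
```

For example, `isValidUtf8([0xC3, 0xA9])` (UTF-8 "é") should be accepted, while `isValidUtf8([0xE9])` (ISO-8859-1 "é") should be rejected.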
I would just like to know an algorithm for doing this, so that I can try to implement the same test in JavaScript. Once I know which characters are bad, I can convert them from ISO-8859-1 (for example) to UTF-8; I have methods for that.
I just don't know how to determine which characters are not UTF-8. Again, I understand that using a scripting language like JavaScript is problematic in this situation, but I do have the ByteBuffer type as an alternative, which lets me handle the characters byte by byte.
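To make the question concrete, this is roughly the byte-by-byte test I imagine, based on my reading of the well-formed byte ranges for UTF-8 (lead byte, then one to three continuation bytes in 0x80..0xBF, with tighter ranges after 0xE0, 0xED, 0xF0 and 0xF4). It is only a sketch: the function name `findInvalidUtf8` is made up, I assume the bytes can be read by index as numbers, and I am not sure it is complete, which is exactly what I would like confirmed.

```javascript
// Sketch: return the offsets of bytes that cannot be part of well-formed UTF-8.
// `bytes` is assumed to be an array-like of byte values (e.g. copied out of a
// ByteBuffer); masking with 0xFF also accepts signed bytes (-128..127).
function findInvalidUtf8(bytes) {
  const bad = [];
  let i = 0;
  while (i < bytes.length) {
    const b0 = bytes[i] & 0xFF;
    let need;       // number of continuation bytes expected after the lead byte
    let lo = 0x80;  // allowed range for the FIRST continuation byte
    let hi = 0xBF;
    if (b0 <= 0x7F) { i += 1; continue; }                 // plain ASCII
    else if (b0 >= 0xC2 && b0 <= 0xDF) { need = 1; }
    else if (b0 === 0xE0) { need = 2; lo = 0xA0; }        // excludes overlong forms
    else if (b0 === 0xED) { need = 2; hi = 0x9F; }        // excludes surrogates
    else if (b0 >= 0xE1 && b0 <= 0xEF) { need = 2; }
    else if (b0 === 0xF0) { need = 3; lo = 0x90; }        // excludes overlong forms
    else if (b0 === 0xF4) { need = 3; hi = 0x8F; }        // stays within U+10FFFF
    else if (b0 >= 0xF1 && b0 <= 0xF3) { need = 3; }
    else { bad.push(i); i += 1; continue; }               // 0x80..0xC1 or 0xF5..0xFF

    // All continuation bytes must exist and fall in their allowed ranges.
    let ok = i + need < bytes.length;
    for (let k = 1; ok && k <= need; k++) {
      const b = bytes[i + k] & 0xFF;
      const min = k === 1 ? lo : 0x80;
      const max = k === 1 ? hi : 0xBF;
      if (b < min || b > max) ok = false;
    }
    if (ok) { i += need + 1; }
    else { bad.push(i); i += 1; }  // flag the lead byte, then resync one byte later
  }
  return bad;
}
```

The idea would be to call this on each record: an empty result means the record is already valid UTF-8, and otherwise the returned offsets are the bytes I would convert from ISO-8859-1 (or escape) before writing the XML.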
Thanks for any specific tests people can suggest.
Arc