VB6 - The case for UTF-8

Some people have been critical of the fact that my clsCNG.cls does not preserve Unicode. So with this post, I have attempted to correct that situation. I am by no means any kind of expert on Unicode, and until recently I have only cursed its existence. The Unicode standards are very loose (much like SMTP), but at least there is a fair amount of information out there if you are willing to dig for it.

clsCNG.cls is a general purpose class designed to perform encryption services on anything that is passed to it. With one small change to the StrToByte routine, it now detects double-wide (UTF-16) characters and passes the entire string instead of just the low order bytes. But with that flexibility comes a new "gotcha". In the image below, you will see that the Russian Unicode string does not produce the correct Hash. That is because it is a mixed string, consisting of both ASCII and Russian Unicode characters, so in double-wide form every ASCII character carries an extra NULL byte. This is not uncommon in HTML code, and this particular string was intercepted from http://www.humancomp.org/unichtm/unichtm.htm using a packet sniffer.

There are a couple of ways around that issue. One way is to remove the NULL bytes associated with the ASCII characters. The other way is to encode the string using UTF-8. This is the preferred method and is demonstrated using the "Hash UTF-8" button. I should mention at this point that I am using the TextBox provided by the Microsoft Forms 2.0 Object Library to display the Unicode characters. The regular VB6 TextBox only accepts ANSI text.
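To make the mixed-string problem concrete, here is a minimal sketch in Python rather than VB6, purely to show the bytes involved. The sample string and the choice of SHA-256 are assumptions for illustration; they are not taken from clsCNG.cls or from the intercepted page.

```python
import hashlib

# Hypothetical sample: ASCII markup wrapped around a Russian word.
text = "<b>Привет</b>"

utf16 = text.encode("utf-16-le")  # double-wide form, two bytes per character here
utf8 = text.encode("utf-8")       # one byte per ASCII char, two per Cyrillic char

# In the double-wide form, each ASCII character is followed by a NULL byte.
print(utf16[:6])                  # b'<\x00b\x00>\x00'

# The two encodings are different byte strings, so their hashes differ.
print(hashlib.sha256(utf16).hexdigest() == hashlib.sha256(utf8).hexdigest())  # False
```

Whichever form is chosen, both ends have to hash the same byte sequence, which is why settling on UTF-8 is attractive.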

The change to the StrToByte routine allowed the implementation of 2 new routines called "ByteToStrShort" and "HexToStrShort". These routines create a string without the intermediate NULL bytes and shorten the process time.

Using UTF-8 introduces another "gotcha". UTF-8 is byte-for-byte identical to ASCII only for characters less than 128 (&H80). ANSI characters above &H7F are single bytes whose meaning depends on the code page, and on their own they are not valid UTF-8. If there is any chance that your application could pass ANSI characters above &H7F, you should provide a detection routine to avoid passing them to "clsCNG.cls". DO NOT use "StrConv", as it will cause problems, especially if you are using a non-Latin System Locale.
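A detection routine of the kind suggested above can be very small. The following Python sketch is only an illustration (the function name is hypothetical and Windows-1252 is assumed as the ANSI code page); a VB6 version would loop over a Byte array in the same way.

```python
def has_high_bytes(data: bytes) -> bool:
    """Return True if any byte is &H80 or above, i.e. not 7-bit ASCII."""
    return any(b >= 0x80 for b in data)

ansi = "café".encode("cp1252")  # é is the single byte E9 in Windows-1252
utf8 = "café".encode("utf-8")   # é is the two bytes C3 A9 in UTF-8

print(has_high_bytes(b"cafe"))  # False
print(has_high_bytes(ansi))     # True
print(ansi, utf8)               # b'caf\xe9' b'caf\xc3\xa9'
```

The same character produces different bytes in the two encodings, so ANSI data with high bytes cannot simply be treated as UTF-8.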

That's the easy part. Recognizing an incoming byte string as Double-wide Unicode or UTF-8 is difficult to say the least. There is no standard methodology to deal with it. HTTP and XML will announce their intention to use UTF-8. For the Russian page below, the line:
<meta http-equiv="content-type" content="text/html; charset=UTF-8">
was provided. There is another mechanism called the BOM which is sometimes employed. It stands for Byte Order Mark, and is used to specify "Big Endian" or "Little Endian" order for encoded strings. Since UTF-8 uses bytes instead of words, Endian has little meaning, and it is often referred to as a "UTF-8 Signature" (EF BB BF). HTML5 requires an application to respect it, and it takes precedence over the declared charset. The Unicode standards neither require nor forbid its use, but if you are building an application where you control both ends, it would make sense to use it. In either case, your application should be prepared to recognize and remove it before display.
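Recognizing and removing the signature amounts to a three-byte prefix check. A minimal Python sketch, with a hypothetical function name:

```python
import codecs

def strip_utf8_signature(data: bytes):
    """Return (payload, had_signature) for a byte string that may start with EF BB BF."""
    if data.startswith(codecs.BOM_UTF8):  # codecs.BOM_UTF8 == b'\xef\xbb\xbf'
        return data[len(codecs.BOM_UTF8):], True
    return data, False

print(strip_utf8_signature(b"\xef\xbb\xbfabc"))  # (b'abc', True)
print(strip_utf8_signature(b"abc"))              # (b'abc', False)
```

The returned flag lets the caller treat a signature as strong evidence of UTF-8 while still handling data that arrives without one.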

Mozilla (and I assume other browsers as well) will use the information provided to determine the type of encoding used on incoming data, and if that fails or is not provided, it then uses a heuristic approach. So I set out to provide my own routine to detect UTF-8. My first reaction was to question the need to scan the data twice. If you are going to convert the string when UTF-8 is detected, why not just attempt to convert the string and respond to any errors? Unfortunately, MultiByteToWideChar, when called without the MB_ERR_INVALID_CHARS flag, does not return encoding errors; it just does the best that it can. So the scan is necessary to detect if the incoming string is indeed UTF-8. The IsUTF8 routine is my interpretation of a C++ routine that I found on the net. It has not been tested extensively, and it could probably be executed more efficiently. Determining if an incoming string is double-wide Unicode or not is a different story, and I have not found a reliable way to do that. I tested the Microsoft "IsTextUnicode" function, but as most of the literature indicated, it is virtually useless.
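For readers who want to see what such a scan involves, here is a minimal Python sketch of a structural UTF-8 check. It is not the IsUTF8 routine from the attached project, just one way to apply the well-formedness rules (lead byte ranges, continuation bytes, no overlong forms or surrogates).

```python
def is_utf8(data: bytes) -> bool:
    """Return True if data is a well-formed UTF-8 byte sequence."""
    i, n = 0, len(data)
    while i < n:
        b = data[i]
        if b < 0x80:                        # plain ASCII byte
            i += 1
            continue
        # need = number of continuation bytes; lo..hi bounds the first of them
        if 0xC2 <= b <= 0xDF:
            need, lo, hi = 1, 0x80, 0xBF
        elif b == 0xE0:
            need, lo, hi = 2, 0xA0, 0xBF    # excludes overlong 3-byte forms
        elif b == 0xED:
            need, lo, hi = 2, 0x80, 0x9F    # excludes UTF-16 surrogates
        elif 0xE1 <= b <= 0xEF:
            need, lo, hi = 2, 0x80, 0xBF
        elif b == 0xF0:
            need, lo, hi = 3, 0x90, 0xBF    # excludes overlong 4-byte forms
        elif 0xF1 <= b <= 0xF3:
            need, lo, hi = 3, 0x80, 0xBF
        elif b == 0xF4:
            need, lo, hi = 3, 0x80, 0x8F    # nothing above U+10FFFF
        else:
            return False                    # stray continuation or invalid lead byte
        if i + need >= n:                   # sequence is cut off
            return False
        if not lo <= data[i + 1] <= hi:
            return False
        for k in range(2, need + 1):
            if not 0x80 <= data[i + k] <= 0xBF:
                return False
        i += need + 1
    return True

print(is_utf8("Привет".encode("utf-8")))  # True
print(is_utf8(b"caf\xe9"))                # False (ANSI byte E9, sequence cut off)
print(is_utf8(b"plain ASCII"))            # True
```

A True result only means the bytes are well-formed UTF-8. Pure 7-bit data passes trivially, so the check cannot by itself tell UTF-8 apart from ASCII or from double-wide text whose bytes all happen to be below &H80.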

I discovered another "gotcha" with MultiByteToWideChar. It will return NULL characters at the end of the string, depending on the length. That is not a problem with C++, as a NULL character signifies the end of the string. But with VB, that is a problem because a VB string stores its length explicitly, so the NULL characters count as part of the string. So the FromUTF8 routine was modified to remove any NULL characters.
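The clean-up step can be sketched as follows in Python. The padded value is a hypothetical stand-in for a conversion result that came back with trailing NULL characters; it is not output captured from MultiByteToWideChar.

```python
def trim_trailing_nulls(s: str) -> str:
    """Drop NULL characters from the end of a length-counted string."""
    return s.rstrip("\x00")

padded = "Привет" + "\x00" * 3  # hypothetical padded conversion result
print(len(padded), len(trim_trailing_nulls(padded)))  # 9 6
```

This variant trims only at the end of the string, which leaves any NULL characters inside the data untouched.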

If you convert the Russian sample, you will notice that the UTF-8 string is shorter than the original (due to ASCII NULL removal), but the Chinese sample converts to a longer string. That is because the Chinese sample converts 2 byte characters to mostly 3 byte characters. Even considering the downsides, UTF-8 appears to be the most logical solution.
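The size difference is easy to check. In this Python sketch the two sample strings are assumptions chosen for illustration, not the samples shipped with the project.

```python
def sizes(text: str):
    """Return (double-wide byte count, UTF-8 byte count) for a string."""
    return len(text.encode("utf-16-le")), len(text.encode("utf-8"))

print(sizes("<b>Привет</b>"))  # (26, 19): ASCII shrinks to 1 byte, Cyrillic stays at 2
print(sizes("中文字符"))        # (8, 12): each of these characters grows from 2 bytes to 3
```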

J.A. Coutts
Attached Images
 
Attached Files
  • UTF8.zip (zip, 18.1 KB)
