Beginners are often attracted to the simple XOR cipher. Besides its serious limitations as ciphers go, they usually approach it while ignoring several important topics.
The first one is that most ciphers don't really apply to text at all. Sure, there are some that do but these are mostly obsolete and of little but historical interest. So when using most ciphers today, including simple Xor, you want to combine the plaintext and the key as arrays of bytes:
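    Private Function XorCipher(ByRef Bytes() As Byte, ByRef PassBytes() As Byte) As Byte()
        'Xors each byte of Bytes with the matching byte of PassBytes,
        'repeating PassBytes as often as needed. Calling it again on the
        'result with the same PassBytes restores the original bytes.
        Dim Temp() As Byte
        Dim Length As Long
        Dim I As Long

        ReDim Temp(UBound(Bytes))
        'Assumes LBound() = 0 for both arrays passed to us:
        Length = UBound(PassBytes) + 1
        For I = 0 To UBound(Bytes)
            Temp(I) = Bytes(I) Xor PassBytes(I Mod Length)
        Next I
        XorCipher = Temp
    End Function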
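To see that the same function both encrypts and decrypts, here is a small round trip over a few raw bytes. It is only a sketch: it assumes XorCipher() above lives in the same module, and the procedure name DemoRoundTrip and the sample values are made up for illustration.

```vb
Private Sub DemoRoundTrip()
    Dim Plain(0 To 4) As Byte
    Dim Pass(0 To 1) As Byte
    Dim Cipher() As Byte
    Dim Back() As Byte
    Dim I As Long

    'ASCII codes for "Hello".
    Plain(0) = 72: Plain(1) = 101: Plain(2) = 108: Plain(3) = 108: Plain(4) = 111
    'A two-byte key, repeated across the five plaintext bytes.
    Pass(0) = 1: Pass(1) = 2

    Cipher = XorCipher(Plain, Pass)   'Cipher holds 73, 103, 109, 110, 110
    Back = XorCipher(Cipher, Pass)    'Back holds 72, 101, 108, 108, 111 again

    For I = 0 To UBound(Back)
        Debug.Assert Back(I) = Plain(I)
    Next I
End Sub
```

Note that the last two ciphertext bytes are equal even though the plaintext bytes differ, because they were combined with different key bytes.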
But how do you get those bytes? Typically people start off with String values for both "inputs." To get them into Byte arrays they either just slop them in, which copies the raw UTF-16LE data and yields lots of zero bytes, or, much worse, they convert the String to ANSI.
Converting to ANSI is OK as long as you enforce the use of the 7-bit ASCII subset. But if you don't do that you can run into serious portability problems. "ANSI" is not one encoding but a family of encodings (codepages) that differ from one language's alphabet to another. You can also run into Double Byte Character Set (DBCS) issues with many Asian languages.
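The difference between the two approaches is easy to see in the Immediate window. This is a minimal sketch using only intrinsic VB6 functions; the procedure name DemoStringToBytes is made up, and the last result depends on the ANSI codepage of the machine you run it on.

```vb
Private Sub DemoStringToBytes()
    Dim Raw() As Byte
    Dim Ansi() As Byte
    Dim S As String

    Raw = "AB"                            'Raw copy of the UTF-16LE data.
    Ansi = StrConv("AB", vbFromUnicode)   'Converted to the current ANSI codepage.

    Debug.Print UBound(Raw) + 1           '4 bytes: 41 00 42 00 (hex)
    Debug.Print UBound(Ansi) + 1          '2 bytes: 41 42 (hex)

    'A CJK character survives the trip through ANSI only if the current
    'codepage happens to contain it, so this prints True on some machines
    'and False on others.
    S = ChrW$(&H4E2D)
    Debug.Print (StrConv(StrConv(S, vbFromUnicode), vbUnicode) = S)
End Sub
```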
See the classic:
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
Or perhaps the newer take on the subject:
What every programmer absolutely, positively needs to know about encodings and character sets to work with text
Text in VB6
First you need some grasp of text in VB6. String values in VB6 are normally sequences of 16-bit UTF-16LE code units, which is what Windows calls "Unicode." Yes, a String variable can hold other things such as ASCII, ANSI, DBCS, even UTF-8 and UTF-7. However this is rarely done and you'd want to have a good handle on the subject before you attempt it.
So one can almost say VB6 Strings are always Unicode (UTF-16LE).
This can get confusing because to help bridge the gap from MS-DOS and earlier versions of VB, VB6 "helps" you. Most text I/O operations and many controls do implicit conversion to and from ANSI using the current session codepage. This means that in many ways VB6 is "Unicode on the inside and ANSI on the outside."
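You can watch the "ANSI on the outside" behavior with nothing but the native file statements. The sketch below is illustrative only: the scratch file name is made up, and what it prints depends on your codepage.

```vb
Private Sub DemoAnsiOnTheOutside()
    Dim F As Integer
    Dim Path As String
    Dim Original As String
    Dim ReadBack As String

    Path = Environ$("TEMP") & "\AnsiDemo.txt"   'Hypothetical scratch file.
    Original = "abc" & ChrW$(&H4E2D)            'Three ASCII characters plus one CJK character.

    F = FreeFile
    Open Path For Output As #F
    Print #F, Original          'Implicitly converted to ANSI on the way out.
    Close #F

    F = FreeFile
    Open Path For Input As #F
    Line Input #F, ReadBack     'Implicitly converted back to UTF-16LE.
    Close #F

    Debug.Print FileLen(Path)             'Far less than 2 bytes per character.
    Debug.Print (ReadBack = Original)     'False wherever the codepage lacks the character.
    Kill Path
End Sub
```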
People also get confused because 7-bit US-ASCII is a proper subset of every ANSI encoding as well. Then the somewhat misleading notion of "extended 8-bit ASCII" came along in the MS-DOS era.
Once again see the links above for some conceptual help.
Demo
So here is a demo that attempts to address some of these things.
It uses the Text Object Model (TOM) interface of the Win32 RichEdit control which VB6's RichTextBox control wraps to get us some Unicode capabilities. This should work on Windows 2000 or later.
A small helper class RtbTom.cls wraps an API call to grab ITextDocument references for our RichTextBox controls.
It converts Unicode text to UTF-8 in Byte arrays so it can operate on them. UTF-8 is a multibyte encoding though, so be careful. The 7-bit ASCII subset uses one byte per character, so people often get misled into thinking things are that simple. Other characters are encoded as 2 to 4 bytes each. Still, at least with UTF-8 you can squeeze out most of the zero bytes you'd have using "Unicode" (UTF-16LE).
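For reference, one common way to get UTF-8 bytes out of a VB6 String is the Win32 WideCharToMultiByte call. The following is a sketch under that assumption, not the code from the demo's helper class; the function name ToUtf8 is made up for illustration.

```vb
Private Const CP_UTF8 As Long = 65001

Private Declare Function WideCharToMultiByte Lib "kernel32" ( _
    ByVal CodePage As Long, ByVal dwFlags As Long, _
    ByVal lpWideCharStr As Long, ByVal cchWideChar As Long, _
    ByVal lpMultiByteStr As Long, ByVal cbMultiByte As Long, _
    ByVal lpDefaultChar As Long, ByVal lpUsedDefaultChar As Long) As Long

'Hypothetical helper: returns Text encoded as UTF-8 (no BOM, no terminator).
Public Function ToUtf8(ByVal Text As String) As Byte()
    Dim Bytes() As Byte
    Dim Size As Long

    If Len(Text) = 0 Then Exit Function   'Returns an unallocated array.

    'First call: ask how many bytes the conversion needs.
    Size = WideCharToMultiByte(CP_UTF8, 0, StrPtr(Text), Len(Text), 0, 0, 0, 0)
    If Size = 0 Then Err.Raise 5          'Conversion failed.

    'Second call: convert into a buffer of exactly that size.
    ReDim Bytes(0 To Size - 1)
    WideCharToMultiByte CP_UTF8, 0, StrPtr(Text), Len(Text), _
                        VarPtr(Bytes(0)), Size, 0, 0
    ToUtf8 = Bytes
End Function
```

For example, ToUtf8("abc" & ChrW$(&HE9)) should return the 5 bytes 61 62 63 C3 A9 (hex), where the same String occupies 8 bytes as UTF-16LE.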
After applying the XorCipher() function above, it encodes the ciphertext bytes in Base64. Base64 is relatively compact compared to hex, etc. and it uses only "safe" characters - safe for ASCII, ANSI, almost anything.
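Base64 turns every group of 3 bytes into 4 characters from a 64-character alphabet and pads the last group with "=". To make that concrete, here is a small encoder in plain VB. It is only a sketch and not the API-based code in the demo; the name ToBase64 is made up, and it assumes a non-empty, zero-based array.

```vb
Private Const B64_CHARS As String = _
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"

'Hypothetical helper: assumes Bytes is allocated with LBound() = 0.
Public Function ToBase64(ByRef Bytes() As Byte) As String
    Dim N As Long
    Dim I As Long
    Dim Pos As Long
    Dim Triple As Long
    Dim Result As String

    N = UBound(Bytes) + 1
    'Pre-fill with "=" so any padding is already in place.
    Result = String$(4 * ((N + 2) \ 3), "=")
    Pos = 1
    For I = 0 To N - 1 Step 3
        'Pack up to three bytes into one 24-bit value.
        Triple = Bytes(I) * &H10000
        If I + 1 < N Then Triple = Triple + Bytes(I + 1) * &H100&
        If I + 2 < N Then Triple = Triple + Bytes(I + 2)

        'Emit it as four 6-bit indexes into the alphabet.
        Mid$(Result, Pos, 1) = Mid$(B64_CHARS, (Triple \ &H40000) + 1, 1)
        Mid$(Result, Pos + 1, 1) = Mid$(B64_CHARS, ((Triple \ &H1000&) And 63) + 1, 1)
        If I + 1 < N Then
            Mid$(Result, Pos + 2, 1) = Mid$(B64_CHARS, ((Triple \ &H40&) And 63) + 1, 1)
        End If
        If I + 2 < N Then
            Mid$(Result, Pos + 3, 1) = Mid$(B64_CHARS, (Triple And 63) + 1, 1)
        End If
        Pos = Pos + 4
    Next I
    ToBase64 = Result
End Function
```

The three bytes of "Man" (77, 97, 110) come out as "TWFu". In general N bytes take 4 * ((N + 2) \ 3) Base64 characters versus 2 * N hex digits, so 300 bytes become 400 characters instead of 600.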
We don't have much character encoding support in VB6, so the helper class TextCodec.cls wraps a few API calls to accomplish this.
Then the cipher is reversed by converting the Base64 back to bytes, applying XorCipher() once more using the same passphrase, and finally converting the resulting UTF-8 back to "Unicode."
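The last step, UTF-8 bytes back to a VB6 String, can be done with the Win32 MultiByteToWideChar call. Again this is a sketch under that assumption rather than the demo's own code, and FromUtf8 is a made-up name.

```vb
Private Const CP_UTF8 As Long = 65001

Private Declare Function MultiByteToWideChar Lib "kernel32" ( _
    ByVal CodePage As Long, ByVal dwFlags As Long, _
    ByVal lpMultiByteStr As Long, ByVal cbMultiByte As Long, _
    ByVal lpWideCharStr As Long, ByVal cchWideChar As Long) As Long

'Hypothetical helper: assumes Bytes is allocated and holds UTF-8 data.
Public Function FromUtf8(ByRef Bytes() As Byte) As String
    Dim Count As Long
    Dim Chars As Long
    Dim Result As String

    Count = UBound(Bytes) - LBound(Bytes) + 1

    'First call: ask how many UTF-16 code units are needed.
    Chars = MultiByteToWideChar(CP_UTF8, 0, VarPtr(Bytes(LBound(Bytes))), Count, 0, 0)
    If Chars = 0 Then Err.Raise 5         'Conversion failed.

    'Second call: convert directly into the String's buffer.
    Result = String$(Chars, 0)
    MultiByteToWideChar CP_UTF8, 0, VarPtr(Bytes(LBound(Bytes))), Count, _
                        StrPtr(Result), Chars
    FromUtf8 = Result
End Function
```

With the flags left at 0 the call does not reject malformed input, so decrypting with the wrong passphrase will generally give you garbage text rather than an error.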
When the program starts it reads the data from PlainTextIn.txt into rtbPlainText1 (a RichTextBox). The file is Unicode, so we bring in lots of characters from various alphabets. You can type over this or leave it as-is.
When you type a passphrase into rtbPass it enables cmdEncrypt allowing you to click on it.
Clicking that does the encryption, displays the results in rtbCipherText, and also saves them to disk as a 7-bit ASCII file Encrypted.txt.
Then it enables cmdDecrypt.
Clicking on that reverses the encryption process. Results are written into rtbPlainText2 and also saved to disk as a Unicode file PlainTextOut.txt.
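If you ever need to write a Unicode file without the RichEdit control's help, you have to bypass VB6's implicit ANSI conversion by writing a Byte array in Binary mode. Here is a minimal sketch, assuming "Unicode file" means UTF-16LE with a byte order mark; the name SaveUnicodeText is made up and this is not how the demo itself saves its files.

```vb
'Hypothetical helper: writes Text to Path as UTF-16LE with a BOM.
Public Sub SaveUnicodeText(ByVal Path As String, ByVal Text As String)
    Dim F As Integer
    Dim BOM(0 To 1) As Byte
    Dim Data() As Byte

    BOM(0) = &HFF: BOM(1) = &HFE             'UTF-16LE byte order mark.

    'Binary mode does not truncate an existing file, so remove it first.
    If Len(Dir$(Path)) > 0 Then Kill Path

    F = FreeFile
    Open Path For Binary Access Write As #F
    Put #F, , BOM
    If Len(Text) > 0 Then
        Data = Text                          'Raw UTF-16LE bytes, no ANSI conversion.
        Put #F, , Data
    End If
    Close #F
End Sub
```

Passing the String itself to Put would convert it to ANSI on the way out, which is why the text is copied into a Byte array first.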
Requirements
VB6, Windows 2000 or later.
Summary
The Xor cipher is simple, though limited, but using it effectively can be trickier than new programmers expect.
In order to avoid corrupting your data in the process you need to be aware of character encodings and make some choices about which ones to use and how to use them.
We could have skipped encoding as UTF-8 but we'd produce a far larger "encrypted" ciphertext in most cases, with a lot of "dead air" zero bytes. Don't rely on ANSI conversion though; it can be a trap.
The ciphertext itself is not "text" as such, but binary data. If you need to represent this as actual text you should encode it using Base64, hex, or something similar. Web and email do tons of this very thing under the covers and most people aren't aware of it.
There is a lot here that can be intimidating and confusing, but these are good concepts to understand.