I need to get my understanding of character sets and encoding right. Can someone point me to a good write-up on handling different character sets in C#?
Here's one of the problems I'm facing:
using (StreamReader reader = new StreamReader("input.txt"))
using (StreamWriter writer = new StreamWriter("output.txt"))
{
    while (!reader.EndOfStream)
    {
        writer.WriteLine(reader.ReadLine());
    }
}
This simple code snippet does not always preserve the encoding. For example, Aukéna in the input is turned into Auk�na in the output.
You just have an encoding problem. You have to remember that all you're really reading is a stream of bits. You have to tell your program how to properly interpret those bits.
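To make the point concrete, here is a minimal sketch of the same six bytes read under two different encodings. The class and method names are made up for illustration, and it assumes the input was saved as ISO-8859-1 (Latin-1); the question does not say which encoding the file actually uses.

```csharp
using System;
using System.Text;

// Hypothetical helper, for illustration only.
public static class DecodeDemo
{
    // Encode text as ISO-8859-1, where 'é' is the single byte 0xE9.
    public static byte[] Latin1Bytes(string text) =>
        Encoding.GetEncoding("iso-8859-1").GetBytes(text);

    // Interpret the bytes as ISO-8859-1: every byte maps to one character.
    public static string DecodeAsLatin1(byte[] bytes) =>
        Encoding.GetEncoding("iso-8859-1").GetString(bytes);

    // Interpret the same bytes as UTF-8: a lone 0xE9 is not a valid
    // UTF-8 sequence, so the decoder substitutes U+FFFD for it.
    public static string DecodeAsUtf8(byte[] bytes) =>
        Encoding.UTF8.GetString(bytes);
}
```

Calling DecodeDemo.DecodeAsUtf8(DecodeDemo.Latin1Bytes("Auk\u00E9na")) gives back the text with U+FFFD in place of the é, which is the same symptom as in the question.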
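To fix your problem, use the constructors that take an encoding as well, and set it to whatever encoding your text uses. Without that argument, StreamReader and StreamWriter assume UTF-8 (StreamReader also checks for a byte order mark), so bytes written in another encoding are misread.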
http://msdn.microsoft.com/en-us/library/ms143456.aspx
http://msdn.microsoft.com/en-us/library/3aadshsx.aspx
When reading a file, you need to know which encoding it uses. Otherwise you can easily fail to read it correctly.
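If you only have a guess, one way to test it is to decode strictly and see whether the bytes are rejected. The sketch below checks a guess of UTF-8; the class and method names are hypothetical. It can only rule UTF-8 out: plain ASCII and many other inputs also pass, so a true result is not proof.

```csharp
using System.Text;

// Hypothetical helper, for illustration only.
public static class EncodingGuess
{
    // Returns true if the bytes decode as UTF-8 without any invalid sequence.
    public static bool IsValidUtf8(byte[] bytes)
    {
        // throwOnInvalidBytes: true makes the decoder throw instead of
        // silently substituting U+FFFD for bytes it cannot interpret.
        var strictUtf8 = new UTF8Encoding(
            encoderShouldEmitUTF8Identifier: false,
            throwOnInvalidBytes: true);
        try
        {
            strictUtf8.GetString(bytes);
            return true;
        }
        catch (DecoderFallbackException)
        {
            return false;
        }
    }
}
```

For a file, you could pass the result of File.ReadAllBytes("input.txt") and fall back to a legacy code page when the check fails.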
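When you know the encoding of a file, you can do the following:
// 1251 is the Windows Cyrillic code page; substitute the one your file actually uses.
// On .NET Core / .NET 5+, call Encoding.RegisterProvider(CodePagesEncodingProvider.Instance) first.
using (StreamReader reader = new StreamReader("input.txt", Encoding.GetEncoding(1251)))
using (StreamWriter writer = new StreamWriter("output.txt", false, Encoding.GetEncoding(1251)))
{
    while (!reader.EndOfStream)
    {
        writer.WriteLine(reader.ReadLine());
    }
}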
A different question arises if you want to change the file's encoding: read it with its original encoding and write it with the new one.
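A minimal sketch of such a conversion, assuming you already know the source encoding. The Transcoder class and its Convert method are hypothetical names. It copies characters in blocks rather than line by line, so the original line endings are kept.

```csharp
using System.IO;
using System.Text;

// Hypothetical helper, for illustration only.
public static class Transcoder
{
    // Reads inputPath as `source` and rewrites it to outputPath as `target`.
    public static void Convert(string inputPath, Encoding source,
                               string outputPath, Encoding target)
    {
        using (var reader = new StreamReader(inputPath, source))
        using (var writer = new StreamWriter(outputPath, false, target))
        {
            char[] buffer = new char[4096];
            int read;
            // Read decoded characters and write them re-encoded.
            while ((read = reader.Read(buffer, 0, buffer.Length)) > 0)
            {
                writer.Write(buffer, 0, read);
            }
        }
    }
}
```

For example, Transcoder.Convert("input.txt", Encoding.GetEncoding("iso-8859-1"), "output.txt", new UTF8Encoding(false)) would rewrite a Latin-1 file as UTF-8 without a byte order mark.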
The following article gives a good grounding in what encodings are: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
And this MSDN article is a good place to start: Encoding Class