Visual Studio and File Encodings

We recently had a bug report in Clover.NET that it was not handling UTF-8 encoded files. We had to smile because an earlier bug report had complained that Clover.NET only handled UTF-8 files and could barf on files that were written in the local encoding. Nevertheless, the bug was certainly valid.

It turns out that there is more to this than we thought – there’s a little trickery being performed by Visual Studio and the compiler, csc. The /codepage description in MSDN gives a clue:

If the source code files were created with the same codepage that is in effect on your computer or if the source code files were created with UNICODE or UTF-8, you need not use /codepage.

So you can use either your default encoding or you can use UTF-8. Hmmmm. It turns out that this applies even to UTF-8 files which do not have the initial UTF-8 signature. How does the compiler decide which encoding to use? Certainly with my default encoding, CP-1252, there are byte sequences which are valid in both 1252 and UTF-8 but decode to different text in each.

Well, of course, I don’t know what goes on in csc’s internals but the effect seems to be that it assumes UTF-8 and falls back to the default encoding if there is a problem.
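To make that guess concrete, here is a minimal sketch of a "try UTF-8, then fall back" decoder. It is only my reading of the observed behaviour, not what csc actually does internally. The names EncodingGuess, DecodeWithFallback and CodePoints are made up for illustration, and ISO-8859-1 stands in for codepage 1252 because it is available on every .NET runtime without extra setup (the two agree on the bytes used here).

```csharp
using System;
using System.Text;

public static class EncodingGuess
{
    // Decode as strict UTF-8; if the bytes are not valid UTF-8,
    // decode them with the supplied fallback encoding instead.
    public static string DecodeWithFallback(byte[] bytes, Encoding fallback)
    {
        try
        {
            // Second argument (throwOnInvalidBytes) = true: malformed input throws.
            return new UTF8Encoding(false, true).GetString(bytes);
        }
        catch (DecoderFallbackException)
        {
            return fallback.GetString(bytes);
        }
    }

    // Show a string as code points so the output does not depend
    // on the console's own encoding.
    public static string CodePoints(string s)
    {
        var sb = new StringBuilder();
        foreach (char c in s)
        {
            if (sb.Length > 0) sb.Append(' ');
            sb.Append("U+").Append(((int)c).ToString("X4"));
        }
        return sb.ToString();
    }

    public static void Main()
    {
        // Assumption: ISO-8859-1 as a stand-in for the machine's default codepage.
        Encoding fallback = Encoding.GetEncoding("iso-8859-1");

        // C2 A7 is valid UTF-8, so the UTF-8 reading wins: one character.
        Console.WriteLine(CodePoints(DecodeWithFallback(new byte[] { 0xC2, 0xA7 }, fallback))); // U+00A7

        // A lone A7 is not valid UTF-8, so the fallback encoding is used.
        Console.WriteLine(CodePoints(DecodeWithFallback(new byte[] { 0x41, 0xA7 }, fallback))); // U+0041 U+00A7
    }
}
```

The first case is the troublesome one: the same two bytes would have been two characters in the default encoding, but a decoder like this never gets as far as asking.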

It’s not too critical with csc, where you can explicitly designate the codepage for the compiler to use. But what about Visual Studio, where you can’t control the codepage? If your default encoding is 1252, try this little example. Create a new file and copy the following code into it. (I’m hoping all this encoding works between your browser, my browser and my server.)

using System;

namespace Sample
{
    public class SampleClass
    {
        private const string testData = "Â§";
        
        public static void Main() 
        {
            Console.WriteLine("Hello - " + testData);
        }
    }
}

Now save the file and close it. Then reopen it. For me the result is:

using System;

namespace Sample
{
    public class SampleClass
    {
        private const string testData = "§";
        
        public static void Main() 
        {
            Console.WriteLine("Hello - " + testData);
        }
    }
}

The initializer for testData has changed. What has happened is that the bytes of the string that initializes testData are valid in both codepage 1252 and UTF-8, and on reopening the file Visual Studio has read them as UTF-8. Hey, IDE, don’t change my code. I really don’t like that. If we change the code back to what was originally entered, the change will now stick. I guess that Visual Studio realizes the encoding is now UTF-8 and saves the file with UTF-8 encoding. If, however, you open the file assuming encoding 1252 (try WordPad), the string will have become "ç".
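The byte-level story can be reproduced outside the IDE. The sketch below assumes the initializer as typed was the two characters U+00C2 U+00A7, and again uses ISO-8859-1 as a stand-in for codepage 1252 (they encode these two characters identically). ReopenDemo and SaveThenReopenAsUtf8 are hypothetical names; this simulates the save and reopen steps, it does not call into Visual Studio.

```csharp
using System;
using System.Text;

public static class ReopenDemo
{
    // Encode the text with the "save" encoding, then decode the bytes as strict UTF-8,
    // which is what a UTF-8-first reader would do on reopening the file.
    public static string SaveThenReopenAsUtf8(string typed, Encoding saveEncoding)
    {
        byte[] saved = saveEncoding.GetBytes(typed);
        return new UTF8Encoding(false, true).GetString(saved);
    }

    public static void Main()
    {
        // Assumption: ISO-8859-1 stands in for the default codepage 1252.
        Encoding latin1 = Encoding.GetEncoding("iso-8859-1");
        string typed = "\u00C2\u00A7"; // assumed original initializer: two characters

        // Saved in the default encoding, the string is two bytes.
        Console.WriteLine(BitConverter.ToString(latin1.GetBytes(typed))); // C2-A7

        // Read back as UTF-8, those two bytes are a single character.
        string reopened = SaveThenReopenAsUtf8(typed, latin1);
        Console.WriteLine(reopened.Length);                       // 1
        Console.WriteLine(((int)reopened[0]).ToString("X4"));     // 00A7

        // Retyped and saved as UTF-8, the same two characters take four bytes.
        Console.WriteLine(BitConverter.ToString(Encoding.UTF8.GetBytes(typed))); // C3-82-C2-A7
    }
}
```

Those last four bytes are why a tool that assumes a single-byte codepage then shows more characters than were typed.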

Maybe I’ve got it wrong but it seems it’s all too clever by half. Maybe nobody will be using these combinations of odd characters and the convenience of guessing the encoding is worth the risk. Yet, having the IDE assume an encoding which may not be correct and even changing the file encoding is not cool in my book.

If you are using some odd characters in your file and you don’t intend to live wholly within the IDE, be careful about how your files are encoded.
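One way to be careful is to stop relying on guesses: write source files with the UTF-8 signature and check for it before handing a file to another tool. This is a small sketch of that idea rather than anything Clover.NET or Visual Studio does; BomCheck, HasUtf8Signature and the temp file name are all invented for the example.

```csharp
using System;
using System.IO;
using System.Text;

public static class BomCheck
{
    // True if the bytes start with the UTF-8 signature EF BB BF.
    public static bool HasUtf8Signature(byte[] bytes)
    {
        return bytes.Length >= 3
            && bytes[0] == 0xEF && bytes[1] == 0xBB && bytes[2] == 0xBF;
    }

    public static void Main()
    {
        // Hypothetical scratch file in the temp directory.
        string path = Path.Combine(Path.GetTempPath(), "EncodingSample.cs");
        string source = "// \u00A7\n";

        // UTF8Encoding(true) emits the signature, so readers need not guess.
        File.WriteAllText(path, source, new UTF8Encoding(true));
        Console.WriteLine(HasUtf8Signature(File.ReadAllBytes(path))); // True

        // UTF8Encoding(false) omits it, leaving the encoding ambiguous.
        File.WriteAllText(path, source, new UTF8Encoding(false));
        Console.WriteLine(HasUtf8Signature(File.ReadAllBytes(path))); // False

        File.Delete(path);
    }
}
```

A file without the signature is not necessarily wrong, but it is the case where each tool in the chain gets to make its own guess.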