- File System Issues in Monotone
- The facts
- The speculations
- What did other people do?
- And the angry comment
- The terminal/console
- About UTF-8 normalization
- What to do at the moment?
About: Encoding, platform independence and case of filenames (codepages and Unicode)
Last updated with monotone version 0.32
I encountered some bugs and thought, as is my character: "I'm going to fix that quickly!" By now I know better. This text consists of a few facts and a lot of speculation. Formulations are often not precise due to my lack of knowledge, which this page should help to remedy. I can tell you, things are really hairy. There is a lot of misinformation out there, some of which I have probably copied into this text. There are exceptions to exceptions to rules. There are terms used in the wrong context. While writing this I tried to fix such errors, and every time I learned better... so in its structure this text pretty much represents the chaos that is out there.
Directory name and filename can be used synonymously; I hope I always used the word filename.
- Monotone's copy of libidn's stringprep does not work on MinGW (win32). Calling "mtn löl" produces the following output:
  ?: error: failed to convert string from ASCII to UTF-8: 'l?l'
  - There is a quick fix for that: call "chcp" to find your current codepage and set it, for example: "set CHARSET=CP850".
  - I traced this down to stringprep_local_charset_slow in toutf8.c being disabled on MinGW.
  - There is a newer version of libidn which claims to work on MinGW (not tested).
- Recursively adding files ("mtn add -R") whose names contain umlauts will fail on systems whose file system codepage is not UTF-8.
  - file_io needs to be refactored, according to njs.
  - The directory-walking code does not do any conversion at the moment.
- On Mac OS X you can add the same file (with umlauts in its filename) twice:
  - Adding the file by recursion leads monotone to add a decomposed (NFD) UTF-8 filename.
  - Adding the file by its full path leads monotone to add a precomposed (NFC) UTF-8 filename.
  - To drop this file you have to drop it twice: once typing the umlauts on your keyboard (precomposed, i.e. löl) and once using bash's completion: lo-<tab> -> löl (decomposed). See "About UTF-8 normalization" below.
- Case-insensitive file systems lead to the same problem as in 3. and sometimes to even more problems: CaseInsensitiveFilesystems
Now for the speculation...
This is a special problem which probably only cross-platform SCM tools have. Even for tools like rsync it is not such a big problem: the synced filename may be garbage, but at least the contents are copied, and copying back will probably restore the filename. SCM tools, however, have to track these filenames across different platforms and file systems. Inconsistency will lead to errors in deltas, in merging, and so on.
Most POSIX file systems are transparent: they simply accept whatever encoding the user has set.
Assuming I'm on linux:
If I set the terminal to UTF-8 and call "touch löl" I create a file with UTF-8(NFC) encoded filename. When I call "ls" I will get the correct filename. If I set the terminal to ASCII and call "ls" I will get something like "lÃ¶l".
This means on most POSIX systems filenames with different encoding (ie. UTF-8 and LATIN-1) can coexist in the same directory. This is what is going to make it hard to find the correct solution for monotone to handle filenames.
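This coexistence of encodings is easy to reproduce; here is a small Python illustration of the "lÃ¶l" effect described above (Python is used only for demonstration, monotone itself is C++):

```python
# A filename created by a UTF-8 terminal is stored as UTF-8 bytes on disk.
name_bytes = "löl".encode("utf-8")      # b'l\xc3\xb6l' -- two bytes for ö

# A terminal (or tool) interpreting those same bytes as LATIN-1
# sees each byte as its own character:
mojibake = name_bytes.decode("latin-1")
print(mojibake)                          # -> lÃ¶l
```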
On OS X (UFS and HFS+) things are different. The VFS file system layer of OS X forces filenames to always be UTF-8 (NFD): you can create a file using UTF-8 (NFC), but reading the filename back (using readdir()) gives you UTF-8 (NFD). You can also open or find a file using either NFC or NFD, which explains why I was able to add the same file twice. Additionally, HFS+ can be case-insensitive.
NTFS enforces UTF-16 encoding (or is it UCS-2?). FAT cannot be used with UTF-8: the win32 file system layer will prevent that. There are Linux implementations of FAT which state that the file system becomes case-sensitive when UTF-8 is used with FAT. The win32 file system layer converts to the current codepage when the ANSI versions of its functions are used (at least my tests told me that).
Microsoft recommends using the wide versions of all functions that access the file system layer.
The () page states: "The ANSI code pages can be different on different computers, or can be changed for a single computer, leading to data corruption. For the most consistent results, applications should use Unicode UTF-8 (code page 65001) or UTF-16 when possible."
But it does not state how to achieve that, since there is no (). So I think using the wide versions of the functions should be safe.
Addition: wilx found the setlocale() function, which changes the output of the ANSI versions of the C standard library functions: http://msdn2.microsoft.com/en-us/library/x99tb11d(VS.80).aspx
So you probably can't use cygwin/mingw anymore:
-> Case insensitive, too!
Well, if you want different codepages to at least sort of work, you probably have to guess the best way to handle some cases, using the information available: which platform we are on and which file system we are on. Some cases:
1. Two different files fall together on a case-insensitive file system.
2. A file is written to the file system but can't be found again (the file system does some conversion).
3. A file can't be written to the local file system because the file system won't accept some character encoded in the filename:
   - for example, a Chinese character encoded in UTF-8 while the local locale is LATIN-1 -> iconv will give you an error,
   - and different file systems have different reserved characters, like /.
4. We have double UTF-8 encoding.
5. People change the locale while a workspace still exists.
6. We convert from/to the wrong codepage to/from UTF-8 (which can lead to wrong characters in filenames).
7. Filenames look ugly because the local locale can't represent characters in the name.
   - This could happen if we decided to just make monotone 8-bit aware without doing any conversions.
8. The NFC vs. NFD problem.
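Two of these failure modes can be demonstrated in a few lines of Python (for illustration only; the characters chosen are arbitrary examples):

```python
# A character the local locale cannot represent: converting a Chinese
# character to LATIN-1 fails outright -- this is the detectable case.
try:
    "中".encode("latin-1")
    encodable = True
except UnicodeEncodeError:
    encodable = False
print("representable in LATIN-1:", encodable)   # -> False

# Converting with the *wrong* codepage, by contrast, raises no error
# at all -- it silently produces wrong but valid characters.
utf8_bytes = "löl".encode("utf-8")
print(utf8_bytes.decode("cp1252"))              # -> lÃ¶l
```

The second case is the dangerous one: there is no error to catch, so it cannot be detected mechanically.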
- Internally enforce lower case, plus nice error messages on adding existing files.
- Find the file by content; tell the user what happened and ask whether he would like to rename the file.
- Tell the user what happened and ask whether he would like to rename the file.
- Can we detect what encoding a filename is in?
- Save the current locale in the workspace, and reject commands on that workspace if the locale has changed.
- Can we detect what encoding a filename is in?
- Prompt the user to rename the file.
- Internally enforce NF? or respect "Canonical Equivalence" in the string routines. (http://www.unicode.org/unicode/reports/tr15/)
If monotone takes care of conversion, the locales are set correctly, and the file system or the user ensures that there are no filenames in codepages other than the one the locale defines (which should be common practice), cases 2. / 4. / 6. / 7. can be avoided. The remaining cases seem solvable, too.
We can probably come up with a solution that could be called correct, except that detecting misuse won't work in all cases; we have to trust in proper conditions. Cases of misuse that can be detected:
- If conversion from UTF-8 fails: which is case 3. (actually not misuse, hmm)
- If conversion to UTF-8 fails: which probably is case 6. or 5. We have wrong or mixed codepages in filenames.
- If the workspace locale does not match the local locale: case 5.
What cannot be detected:
- Conversion from the wrong locale leading to wrong but valid characters.
Let monotone only accept basic ASCII ([a-z], [A-Z], [0-9], -, .), always convert to lower case for comparing, and store filenames with their case information. Write a nice error on non-basic-ASCII characters, and if people try to add files which already exist, write a message that the problem might be that the names differ only in case.
This solution would even work if the file system converts characters to UPPER case, since we always convert them back when comparing. I have the feeling that this is the only solution which can possibly reach 100% correctness, also for future changes and new platforms.
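A minimal sketch of that policy (Python for illustration; the function name, error messages and data layout are mine, not monotone's):

```python
import re

# Only basic ASCII letters, digits, '-' and '.' are accepted.
VALID = re.compile(r"^[A-Za-z0-9.\-]+$")

def check_add(new_name, existing_names):
    """Validate a filename against the restrictive policy.

    Comparison happens on a uniform (lower) case, but the name is
    stored with its original case information intact.
    """
    if not VALID.match(new_name):
        raise ValueError(
            f"'{new_name}': only [a-z], [A-Z], [0-9], '-' and '.' are allowed")
    for name in existing_names:
        if name.lower() == new_name.lower():
            raise ValueError(
                f"'{new_name}' and '{name}' differ only in case; this may be "
                f"caused by a case-insensitive file system")
    return new_name   # stored with original case
```

Because comparison always lower-cases both sides, the check still works if the file system hands back upper-cased names.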
Make monotone 8-bit aware: all strings are saved as 8-bit, but we don't do any conversion. This is the same as most POSIX file systems do. Remaining possible cases: 1. / 2. / 3. / 7. / 8.
Only accept Unicode: UTF-8 on Unix systems and UTF-16 (or using setlocale()) on Windows. Always write filenames as UTF-8 on Unix and UTF-16 on Windows, regardless of the codepage/locale the user has set. Remaining possible cases: 1. / 2. / 3. / 7. / 8.
PLEASE INSERT HERE!
Ok, I'll try to start listing what a sane solution should do (don't tell me I'm crazy :-)):
Locale in this context means the codepage or encoding defined by the current locale.
Read == A filename is read from local file system layer into monotone. (readdir())
Write == A filename is written to local file system layer from monotone. (open(), chmod(), chown(), stat(), mkdir())
1. Do conversions on read (from locale to UTF-8) and on write (from UTF-8 to locale) of filenames on POSIX, except Darwin.
2. Do conversions on read (from UTF-8 (NFD) to UTF-8 (NFC)) and none on write of filenames on Darwin.
3. One of the following solutions on Windows:
   - Do conversions on read (from locale to UTF-8) and write (from UTF-8 to locale), and use the new libidn or () to find out what the local locale is.
   - Use the wide versions of the file-system-accessing functions, which return UTF-16. Then do conversions on read (from UTF-16 to UTF-8) and write (from UTF-8 to UTF-16).
   - Use setlocale() to set UTF-8 and do no conversion. This differs from POSIX because win32's file system layer will do conversions. Tests on NTFS and FAT need to be done before adopting this solution.
4. Convert all filenames to lower case for internal use, but save the filenames with their case information in the database; comparison should happen on uniform case. On an add command, write a nice error message stating that the error could be caused by monotone not supporting case-sensitive filenames, because case-insensitive file systems exist. Offer to rename the file.
5. If a filename has characters not supported by the local locale (conversion from UTF-8 to locale fails), write a nice error message explaining the situation. Offer to rename the file.
6. If not the conversion but the write (open() etc.) fails, there should also be a sane message. This probably means the filename violated the reserved-character restrictions of the local file system. Offer to rename the file.
7. Save the current locale in the workspace on checkout. Refuse further commands if the locale changes, with a nice error message. Offer to update the locale stored in the workspace or to re-checkout the workspace.
8. If conversion from locale to UTF-8 fails, write a nice error message saying that this might be caused by a filename which is not in the encoding defined by the current locale. Offer to rename the file.
   - The local renaming of course has to happen without conversion, but the destination name must be convertible to UTF-8. If the file is added immediately after renaming, the destination name should be converted according to the rules above.
9. Good practice would be to check all possible conversion, read and write actions on update, checkout and add commands, write the corresponding error messages and offer a way to fix the problem (mostly renaming the file), so the user is informed about problems early.
10. Write unit tests for 1.-9.
11. I added these "offer to rename" statements because there are errors you can get on a checkout/update. If you have no workspace (no update), you can't rename; and if you can't rename, you can't get a workspace (no update). Therefore we need to offer some means of renaming while checking out or updating.
12. There must be a migration function if we really are going to do 4.: files which differ only in case must be listed, and a means of renaming them must be provided, when updating to the monotone version which implements 4.
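The POSIX read/write conversions proposed above can be sketched as follows (Python for illustration; monotone itself is C++ and would use iconv/libidn, and the function names here are mine). The encoding is passed in explicitly; real code would query it once from the locale, e.g. via nl_langinfo(CODESET):

```python
def read_filename(raw, local_enc):
    """bytes from readdir() -> monotone-internal UTF-8 bytes.

    A UnicodeDecodeError here is the "conversion from locale to UTF-8
    fails" situation: the filename is probably not in the encoding the
    current locale defines, so report it and offer to rename the file.
    """
    return raw.decode(local_enc).encode("utf-8")

def write_filename(internal, local_enc):
    """monotone-internal UTF-8 bytes -> bytes for open()/stat()/mkdir().

    A UnicodeEncodeError here means the local locale cannot represent
    some character of the filename: report it and offer to rename.
    """
    return internal.decode("utf-8").encode(local_enc)
```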
Alternative solutions for case-insensitive file systems are in CaseInsensitiveFilesystems, but I don't think they are conservative enough. On the other hand, if we really are going to support only the common subset, we need to implement "The very restrictive solution", which isn't nice either. I'm glad I don't have to decide, but can just point out ideas.
Important: this list is not meant as ALL OR NOTHING. Only the solutions that make sense to someone who knows the internals of monotone better than I do should be implemented. And there are certainly variants of these solutions, some of which might make more sense.
A variant to 12.: on checkout/update, give new filenames that can't be converted to the local locale or can't be written an automatically chosen filename. Write a message to the user with the original and the automatic filename, telling him to use "mtn rename" to set a useful name. This way checkout/update will not fail.
There are several dozen cross-platform SCM systems. How did they solve this problem?
- Here is an rfc about ftp and unicode: http://tools.ietf.org/html/rfc2640
- And the IETF policy on charsets: http://tools.ietf.org/html/rfc2277
- The UTF8 rfc itself: http://tools.ietf.org/html/rfc3629
This is interesting (rfc2277): Negotiating a charset may be regarded as an interim mechanism that is to be supported until support for interchange of UTF-8 is prevalent; however, the timeframe of "interim" may be at least 50 years, so there is every reason to think of it as permanent in practice.
Boost should actually solve this, but it does not. I compared it with Qt, and that doesn't do much more.
encodeName(): By default, this function converts fileName to the local 8-bit encoding determined by the user's locale. This is sufficient for file names that the user chooses. File names hard-coded into the application should only use 7-bit ASCII filename characters.
*sigh* How could it fix something that seems unfixable?
Well, here most problems are solved by locales and libidn, except that it doesn't work on Windows. That can hopefully be solved by updating libidn, or by patching the copy of libidn using ().
There are two commonly used normalizations of UTF-8:
- NFC (mostly precomposed), used by practically the whole world except Apple.
- NFD (decomposed), used by Apple (the OS X VFS layer stores filenames in NFD, see above).
Apple states: You can find a lot more information about Unicode on the Unicode consortium web site. Specifically of interest is the Unicode Standard Annex #15 Unicode Normalization Forms. As used in this Q&A, the terms decomposed and precomposed correspond to Unicode Normal Forms D (NFD) and C (NFC), respectively.
Which is NOT totally true: there are characters that don't have a precomposed form, so these exist in decomposed form even in NFC. (http://www.unicode.org/unicode/reports/tr15/)
- Precomposed means one code point per character (for LATIN-1 characters it is U+00 followed by the LATIN-1 code) -> Á is U+00C1.
- Decomposed means a base ASCII character followed by a combining accent character -> Á is U+0041 U+0301.
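The difference is easy to verify with Python's standard unicodedata module (illustration only):

```python
import unicodedata

precomposed = "\u00c1"     # Á as a single code point (NFC form)
decomposed = "A\u0301"     # A followed by a combining acute accent (NFD form)

# The two strings render identically but compare unequal byte-for-byte,
# which is exactly the add-the-same-file-twice problem on OS X.
print(precomposed == decomposed)                                # -> False
print(unicodedata.normalize("NFD", precomposed) == decomposed)  # -> True
print(unicodedata.normalize("NFC", decomposed) == precomposed)  # -> True
```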
I use this script (http://fangorn.ch/n/blog/2007/01/20/isnotasciipl/) to check that all filenames are ASCII before I do a "mtn add -R".
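A rough Python equivalent of such a check (hypothetical; the linked script is Perl, and this function name is mine) could be:

```python
import os

def non_ascii_names(root):
    """List paths under root whose file or directory names are not pure ASCII."""
    bad = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            if not name.isascii():
                bad.append(os.path.join(dirpath, name))
    return bad
```

Running it over the workspace before "mtn add -R" lists every filename that would run into the conversion problems described above.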
Initial version by Ganwell