Repair File Names Invalid Characters
Aug 16, 2016 Remove Invalid Characters from File Names This script strips a potential file name of characters that are invalid in Windows file names, i.e.
I know that / is illegal in Linux, and the following are illegal in Windows (I think): * . ' / [ ] : ; = ,
What else am I missing?
I need a comprehensive guide, however, and one that takes into account double-byte characters. Linking to outside resources is fine with me.
I need to first create a directory on the filesystem using a name that may contain forbidden characters, so I plan to replace those characters with underscores. I then need to write this directory and its contents to a zip file (using Java), so any additional advice concerning the names of zip directories would be appreciated.
Jeff
12 Answers
A “comprehensive guide” of forbidden filename characters is not going to work on Windows because it reserves filenames as well as characters. Yes, characters like *, " and ? (and others) are forbidden, but there are an infinite number of names composed only of valid characters that are forbidden. For example, spaces and dots are valid filename characters, but names composed only of those characters are forbidden.
Windows does not distinguish between upper-case and lower-case characters, so you cannot create a folder named A if one named a already exists. Worse, seemingly-allowed names like PRN and CON, and many others, are reserved and not allowed. Windows also has several length restrictions; a filename valid in one folder may become invalid if moved to another folder. The rules for naming files and folders are on MSDN.
You cannot, in general, use user-generated text to create Windows directory names. If you want to allow users to name anything they want, you have to create safe names like A, AB, A2 et al., store user-generated names and their path equivalents in an application data file, and perform path mapping in your application.
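The mapping approach described above could look roughly like the following Python sketch. The class name `NameMap`, the JSON store, and the `d000001` naming scheme are all assumptions made for illustration; the answer does not prescribe any of them.

```python
import json
from pathlib import Path


class NameMap:
    """Maps arbitrary user-supplied names to generated, filesystem-safe names."""

    def __init__(self, store: Path):
        # Hypothetical application data file holding {user_name: safe_name}.
        self.store = store
        if store.exists():
            self.names = json.loads(store.read_text(encoding="utf-8"))
        else:
            self.names = {}

    def safe_name(self, user_name: str) -> str:
        """Return the safe name for user_name, allocating one on first use."""
        if user_name not in self.names:
            # Generated names use only a letter and digits, e.g. "d000001".
            self.names[user_name] = f"d{len(self.names) + 1:06d}"
            self.store.write_text(json.dumps(self.names), encoding="utf-8")
        return self.names[user_name]

    def user_name(self, safe: str) -> str:
        """Reverse lookup: the user-visible name recorded for a safe name."""
        for user, generated in self.names.items():
            if generated == safe:
                return user
        raise KeyError(safe)
```

The application would then create directories using only the value returned by `safe_name()` and show the user the original text. The counter assumes entries are never deleted from the map.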
If you absolutely must allow user-generated folder names, the only way to tell if they are invalid is to catch exceptions and assume the name is invalid. Even that is fraught with peril, as the exceptions thrown for denied access, offline drives, and out of drive space overlap with those that can be thrown for invalid names. You are opening up one huge can of hurt.
Dour High Arch
Let's keep it simple and answer the question, first.
The forbidden printable ASCII characters are:
Linux/Unix:
/ (forward slash)
Windows:
< > : " / \ | ? *
Non-printable characters
If your data comes from a source that would permit non-printable characters then there is more to check for.
Linux/Unix:
0 (NUL byte)
Windows:
0-31 (ASCII control characters)
Note: While it is legal under Linux/Unix file systems to create files with control characters in the filename, it might be a nightmare for the users to deal with such files.
Reserved file names
The following filenames are reserved:
Windows:
CON, PRN, AUX, NUL, COM1 through COM9, and LPT1 through LPT9 (both on their own and with arbitrary file extensions, e.g. LPT1.txt).
Other rules
Windows:
Filenames cannot end in a space or dot.
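The Windows rules in this answer can be collected into a single check. This is a minimal Python sketch, not a complete validator: the character set and reserved-name list are assumptions based on the commonly documented Windows naming rules, the function name is made up for illustration, and length limits are not checked at all.

```python
# Assumed rule set, following the commonly documented Windows naming rules.
FORBIDDEN = set('<>:"/\\|?*')
RESERVED = {"CON", "PRN", "AUX", "NUL"} | {
    f"{device}{n}" for device in ("COM", "LPT") for n in range(1, 10)
}


def is_valid_windows_name(name: str) -> bool:
    """Check a single file or directory name (not a full path)."""
    if not name:
        return False
    # Forbidden printable characters and control characters 0-31.
    if any(ch in FORBIDDEN or ord(ch) < 32 for ch in name):
        return False
    # Names cannot end in a space or dot.
    if name[-1] in " .":
        return False
    # Reserved device names, alone or with any extension (e.g. LPT1.txt).
    stem = name.split(".", 1)[0]
    if stem.upper() in RESERVED:
        return False
    return True
```

For example, `is_valid_windows_name("LPT1.txt")` is rejected because of the reserved stem, while `console.txt` passes because only the exact device names are reserved.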
Under Linux and other Unix-related systems, there are only two characters that cannot appear in the name of a file or directory, and those are NUL '\0' and slash '/'. The slash, of course, can appear in a path name, separating directory components.
Rumour[1] has it that Steven Bourne (of 'shell' fame) had a directory containing 254 files, one for every single letter (character code) that can appear in a file name (excluding / and '\0'; the name . was the current directory, of course). It was used to test the Bourne shell and routinely wrought havoc on unwary programs such as backup programs.
Other people have covered the Windows rules.
Note that MacOS X has a case-insensitive file system.
[1] It was Kernighan & Pike in The Practice of Programming who said as much in Chapter 6, Testing, §6.5 Stress Tests:
When Steve Bourne was writing his Unix shell (which came to be known as the Bourne shell), he made a directory of 254 files with one-character names, one for each byte value except '\0' and slash, the two characters that cannot appear in Unix file names. He used that directory for all manner of tests of pattern-matching and tokenization. (The test directory was of course created by a program.) For years afterwards, that directory was the bane of file-tree-walking programs; it tested them to destruction.
Note that the directory must have contained entries . and .., so it was arguably 253 files (and 2 directories), or 255 name entries, rather than 254 files. This doesn't affect the effectiveness of the anecdote, or the careful testing it describes.
Instead of creating a blacklist of characters, you could use a whitelist. All things considered, the range of characters that make sense in a file or directory name context is quite short, and unless you have some very specific naming requirements your users will not hold it against your application if they cannot use the whole ASCII table.
It does not solve the problem of reserved names in the target file system, but with a whitelist it is easier to mitigate the risks at the source.
In that spirit, this is a range of characters that can be considered safe:
- Letters (a-z A-Z) - Unicode characters as well, if needed
- Digits (0-9)
- Underscore (_)
- Hyphen (-)
- Space
- Dot (.)
And any additional safe characters you wish to allow. Beyond this, you just have to enforce some additional rules regarding spaces and dots. This is usually sufficient:
- Name must contain at least one letter or number (to avoid only dots/spaces)
- Name must start with a letter or number (to avoid leading dots/spaces)
- Name may not end with a dot or space (simply trim those if present, like Explorer does)
This already allows quite complex and nonsensical names. For example, these names would be possible with these rules, and be valid file names in Windows/Linux:
A...........ext
B -.- .ext
In essence, even with so few whitelisted characters you should still decide what actually makes sense, and validate/adjust the name accordingly. In one of my applications, I used the same rules as above but stripped any duplicate dots and spaces.
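The whitelist and the extra rules above can be sketched in Python as follows. This is one possible reading of the answer rather than the author's code: replacing rejected characters with an underscore, the `fallback` value, and the function name are assumptions, and only ASCII letters are whitelisted here. Reserved names such as CON are not handled, as the answer itself notes.

```python
import re


def sanitize(name: str, fallback: str = "unnamed") -> str:
    """Reduce a user-supplied name to a conservative whitelist."""
    # Replace everything outside the whitelist with an underscore.
    cleaned = re.sub(r"[^A-Za-z0-9_\- .]", "_", name)
    # Collapse runs of the same dot or space into a single character.
    cleaned = re.sub(r"([ .])\1+", r"\1", cleaned)
    # Drop leading characters until the name starts with a letter or digit.
    cleaned = re.sub(r"^[^A-Za-z0-9]+", "", cleaned)
    # Trim trailing dots and spaces.
    cleaned = cleaned.rstrip(" .")
    # Nothing left means the input had no letter or digit to start from.
    return cleaned or fallback
```

With these rules, `A...........ext` becomes `A.ext` and `a/b:c` becomes `a_b_c`. Because the name is cut down to start at its first letter or digit, the "at least one letter or number" rule is satisfied whenever the result is non-empty.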
Well, if only for research purposes, then your best bet is to look at this Wikipedia entry on Filenames.
If you want to write a portable function to validate user input and create filenames based on that, the short answer is don't. Take a look at a portable module like Perl's File::Spec to get a glimpse of all the hoops needed to accomplish such a 'simple' task.
Leonardo Herrera
The easy way to get Windows to tell you the answer is to attempt to rename a file via Explorer and type in / for the new name. Windows will pop up a message box telling you the list of illegal characters.
raimue
For Windows you can check it using PowerShell
To display UTF-8 codes you can convert
As of 18/04/2017, no simple black or white list of characters and filenames is evident among the answers to this topic - and there are many replies.
The best suggestion I could come up with was to let the user name the file however he likes. Using an error handler when the application tries to save the file, catch any exceptions, assume the filename is to blame (obviously after making sure the save path was ok as well), and prompt the user for a new file name. For best results, place this checking procedure within a loop that continues until either the user gets it right or gives up. Worked best for me (at least in VBA).
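The same try-and-retry idea, sketched in Python instead of VBA. The function name and the `candidates` iterable, which stands in for repeatedly prompting the user, are assumptions for illustration. As other answers point out, the error caught here may have causes unrelated to the name, so a rejection is only a guess that the name was at fault.

```python
from pathlib import Path
from typing import Iterable, Optional


def create_first_valid(directory: Path, candidates: Iterable[str]) -> Optional[Path]:
    """Create an empty file using the first candidate name the OS accepts."""
    for name in candidates:
        path = directory / name
        try:
            # Mode "x" fails if the file already exists instead of overwriting it.
            with open(path, "x"):
                pass
        except (OSError, ValueError):
            # ValueError covers embedded NUL bytes; OSError covers whatever the
            # operating system rejects, including unrelated problems such as
            # missing permissions or a full disk.
            continue
        return path
    # Every candidate was rejected: the caller gives up or asks again.
    return None
```

In an interactive program the loop would prompt for a new name after each failure rather than walk a fixed list.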
Although the only illegal Unix characters might be / and NUL, some consideration for command line interpretation should be included.
For example, while it might be legal to name a file 1>&2 or 2>&1 in Unix, file names such as these might be misinterpreted when used on a command line.
Similarly, it might be possible to name a file $PATH, but when trying to access it from the command line, the shell will translate $PATH to its variable value.
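If such names have to be passed through a POSIX shell anyway, quoting them avoids the misinterpretation. A small Python sketch using the standard library's `shlex.quote`; the helper name `quote_all` is made up for illustration.

```python
import shlex
from typing import List


def quote_all(names: List[str]) -> List[str]:
    """Quote each file name so a POSIX shell treats it as one literal word."""
    return [shlex.quote(name) for name in names]


# Names containing shell syntax come back wrapped in single quotes,
# so the shell performs no redirection or variable expansion on them.
for quoted in quote_all(["1>&2", "$PATH", "plain.txt"]):
    print(quoted)
```

Names made up only of characters the shell does not interpret, such as `plain.txt`, are returned unchanged.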
When creating internet shortcuts in Windows, to create the file name, it skips illegal characters, except for forward slash, which is converted to minus.
Matthias Ronge
In Unix shells, you can quote almost every character in single quotes '. The exception is the single quote itself, and you can't express control characters, because \ is not expanded. Accessing the single quote itself from within a quoted string is possible, because you can concatenate strings with single and double quotes, like 'I'"'"'m', which can be used to access a file called I'm (double quotes are also possible here).
So you should avoid all control characters, because they are too difficult to enter in the shell. The rest still is funny, especially files starting with a dash, because most commands read those as options unless you have two dashes -- before, or you specify them with ./, which also hides the starting -.
If you want to be nice, don't use any of the characters the shell and typical commands use as syntactical elements, sometimes position dependent, so e.g. you can still use -, but not as first character; same with ., you can use it as first character only when you mean it ('hidden file'). When you are mean, your file names are VT100 escape sequences ;-), so that an ls garbles the output.
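Two of the points above, rejecting control characters and hiding a leading dash behind ./, can be sketched in Python like this. The function name is hypothetical, and the sketch assumes the name refers to a file in the current directory.

```python
def shell_friendly_arg(name: str) -> str:
    """Prepare a file name in the current directory for use as a command argument."""
    # Control characters (including ESC, which starts VT100 sequences) are
    # hard to type and can garble terminal output, so refuse them outright.
    if any(ord(ch) < 32 or ord(ch) == 127 for ch in name):
        raise ValueError("control character in file name")
    # "./-rf" names the same file as "-rf" but cannot be taken for an option.
    if name.startswith("-"):
        return "./" + name
    return name
```

The alternative mentioned in the answer, putting -- before the file arguments, depends on the command supporting that convention, which is why the ./ prefix is used here.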
I had the same need and was looking for recommendation or standard references and came across this thread. My current blacklist of characters that should be avoided in file and directory names are:
Meng Lu
This question already has an answer here:
- How to remove illegal characters from path and filenames? 26 answers
I'm working on a program that reads files and saves pieces of them according to their column's title. Some of those titles have illegal characters for file names, so I've written this piece of code to handle those issues.
Is there a nicer way of doing this where I don't have 4 .Replace() calls? Or is there some sort of built-in illegal character remover I don't know of?
Thanks!
EDIT: It does not need to replace the characters with anything specific. A blank space is sufficient.
4 Answers
Regular expressions are generally a good way to do that, but not when you're replacing every character with something different. You might consider replacing them all with the same thing, and just using System.IO.Path.GetInvalidFileNameChars().
System.IO.Path.GetInvalidFileNameChars() has all the invalid characters.
Here's a sample method:
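In Python, which has no direct equivalent of GetInvalidFileNameChars(), the same idea might be sketched as follows. `INVALID_CHARS` is a hand-written assumption that approximates what the .NET call reports on Windows (the printable characters Windows rejects plus control characters 0-31), and `replace_invalid` is an illustrative name, not part of any library.

```python
# Hand-written stand-in for the set .NET reports on Windows: the printable
# characters Windows rejects plus the control characters 0-31.
INVALID_CHARS = set('<>:"/\\|?*') | {chr(i) for i in range(32)}


def replace_invalid(name: str, replacement: str = "_") -> str:
    """Replace every invalid character with the same replacement string."""
    return "".join(replacement if ch in INVALID_CHARS else ch for ch in name)
```

Passing a single space as `replacement` gives the blank-space behaviour asked for in the question, and passing an empty string removes the characters instead of replacing them.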
canon
Have a look at Regex.Replace here, it will do everything you desire when it comes to stripping out characters individually. Selective replacement of other strings may be trickier.
If you just want to remove illegal chars, rather than replacing them with something else you can use this.
Servy