I am trying to reproduce an edge case that we sometimes see in production. We have a business model where clients generate text files and then FTP them to our servers. We ingest these files and process them on our Java backend (running on CentOS machines). Most (95%+) of our clients know to generate these files in UTF-8, which is what we want. However, we have a few stubborn clients (but large accounts) that generate these files on Windows machines using the CP1252 character set. That is not a problem in itself: we have configured our third-party libraries (which do most of the “processing” for us) to handle input in any character set through some kind of magic voodoo.
Occasionally, we receive a file that has illegal UTF-8 characters (CP1252) in its name. When our software tries to read such a file from the FTP server, the usual way of reading files chokes and throws a FileNotFoundException:
File f = getFileFromFTPServer();
// FileReader itself has no readLine(), so it is wrapped in a BufferedReader
BufferedReader reader = new BufferedReader(new FileReader(f));
String line = reader.readLine();
The exceptions look something like this:
java.io.FileNotFoundException: /path/to/file/some-text-blah?blah.xml (No such file or directory)
    at java.io.FileInputStream.open(Native Method)
    at java.io.FileInputStream.<init>(FileInputStream.java:120)
    at java.io.FileReader.<init>(FileReader.java:55)
    at com.myorg.backend.app.InputFileProcessor.run(InputFileProcessor.java:60)
    at java.lang.Thread.run(Thread.java:662)
So I believe what is happening is that, because the file name itself contains illegal characters, we never even get to read the file in the first place. If we could, then regardless of the contents of the file, our software should be able to process it correctly. So this is really a problem with reading file names that contain illegal UTF-8 characters.
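As a test case, I wrote a very simple Java app to deploy on one of our servers (source code is below). I then created a test file on a Windows machine and named it test£.txt. Note the character after "test" in the file name: it is Alt-0163. I FTPed the file to our server, and when I ran ls -ltr on its parent directory, I was surprised to see it listed as test?.txt.
Before going any further, here is the Java app I wrote to test/reproduce the problem: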
package com; // implied by the "java -cp . com/Driver" invocation

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.Charset;

public class Driver {
    public static void main(String[] args) {
        Driver d = new Driver();
        d.run(args[0]);
    }

    private void run(String fileName) {
        InputStreamReader isr = null;
        BufferedReader buffReader = null;
        FileInputStream fis = null;
        String firstLineOfFile = "default";
        System.out.println("Processing " + fileName);
        try {
            // First attempt: read the file contents as UTF-8
            System.out.println("Attempting UTF-8...");
            fis = new FileInputStream(fileName);
            isr = new InputStreamReader(fis, Charset.forName("UTF-8"));
            buffReader = new BufferedReader(isr);
            firstLineOfFile = buffReader.readLine();
            System.out.println("UTF-8 worked and first line of file is : " + firstLineOfFile);
        }
        catch(IOException io1) {
            try {
                // Second attempt: read the file contents as Windows-1252
                System.out.println("UTF-8 failed. Attempting Windows-1252...(" + io1.getMessage() + ")");
                fis = new FileInputStream(fileName);
                isr = new InputStreamReader(fis, Charset.forName("windows-1252"));
                buffReader = new BufferedReader(isr);
                firstLineOfFile = buffReader.readLine();
                System.out.println("Windows-1252 worked and first line of file is : " + firstLineOfFile);
            }
            catch(IOException io2) {
                System.out.println("Both UTF-8 and Windows-1252 failed. Could not read file. (" + io2.getMessage() + ")");
            }
        }
    }
}
When I run this from the terminal (java -cp . com/Driver t*), I get the following output:
Processing test�.txt
Attempting UTF-8...
UTF-8 failed. Attempting Windows-1252...(test�.txt (No such file or directory))
Both UTF-8 and Windows-1252 failed. Could not read file.(test�.txt (No such file or directory))
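test�.txt?!?! I did some research and found that "�" is the Unicode replacement character \uFFFD. So my guess is that the FTP server on CentOS does not know how to handle Alt-0163 (£) and replaces it with \uFFFD (�). But then I do not understand why ls -ltr shows a file called test?.txt...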
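To look at the \uFFFD theory in isolation, here is a minimal sketch (not part of our code base; the class and method names are made up for illustration). It assumes two things that I have not confirmed on our servers: that the name on disk still contains the single CP1252 byte A3, and that Java decodes file names as UTF-8. Under those assumptions, decoding turns the lone byte into U+FFFD, and encoding the resulting String again produces different bytes from the ones on disk.

```java
import java.nio.charset.Charset;

// Hypothetical demo class; the names are illustrative only.
class FileNameRoundTrip {
    private static final Charset UTF8 = Charset.forName("UTF-8");
    private static final Charset CP1252 = Charset.forName("windows-1252");

    // Decode raw file-name bytes the way a UTF-8 configured JVM is assumed to.
    static String decodeAsUtf8(byte[] rawName) {
        // new String(byte[], Charset) replaces malformed input instead of throwing
        return new String(rawName, UTF8);
    }

    // Render bytes as space-separated upper-case hex pairs.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X ", b & 0xFF));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        // In CP1252 the pound sign is the single byte A3
        byte[] onDisk = "test\u00A3.txt".getBytes(CP1252);
        String seenByJava = decodeAsUtf8(onDisk);
        byte[] askedFor = seenByJava.getBytes(UTF8);

        System.out.println("on disk   : " + toHex(onDisk));
        System.out.println("asked for : " + toHex(askedFor));
        System.out.println("contains U+FFFD: " + (seenByJava.indexOf('\uFFFD') >= 0));
    }
}
```

If those assumptions hold, the two hex lines differ only in the middle: A3 on disk versus EF BF BD in the name Java asks the file system for, which would be consistent with the No such file or directory message.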
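In any case, it seems the solution is to add some logic that looks for this character in the file name and, if it is found, renames the file to something the system can read and process (for example, a String-wise replaceAll("\uFFFD", "_") or something like that).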
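A minimal sketch of that renaming logic, assuming Java can actually find the file (which is exactly what does not work yet). The class and method names are hypothetical, and String.replace is used instead of replaceAll because no regular expression is needed.

```java
import java.io.File;

// Hypothetical helper; the names are illustrative only.
class FileNameSanitizer {
    // Replace every U+FFFD in a file name with an underscore.
    static String sanitize(String name) {
        return name.replace('\uFFFD', '_');
    }

    // Rename the file if its name contains U+FFFD.
    // Returns the file that should be processed afterwards.
    static File renameIfNeeded(File f) {
        String clean = sanitize(f.getName());
        if (clean.equals(f.getName())) {
            return f; // name is already fine, nothing to do
        }
        File target = new File(f.getParentFile(), clean);
        // renameTo returns false on failure, e.g. when the source cannot be found
        return f.renameTo(target) ? target : f;
    }

    public static void main(String[] args) {
        System.out.println(sanitize("test\uFFFD.txt"));
    }
}
```

Because renameTo only reports failure through its return value, the caller still has to check whether the returned file is the renamed one.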
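The problem is that Java does not even see the file on the file system. CentOS knows the file is there (test?.txt), but when that file is passed to Java, Java interprets the name as test�.txt and for some reason reports No such file or directory...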
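How can I get Java to see this file so that I can call File::renameTo(String) on it? Sorry for the long backstory, but I feel it is relevant, since every detail counts in this scenario. Thanks in advance!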
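One direction that might be worth testing is sketched below. It rests on assumptions I have not verified for this setup: that Java 7 or later is available (java.nio.file), and that on Linux the Path objects returned by Files.newDirectoryStream keep the original name bytes, so they can be moved even when their String form shows \uFFFD. The class name, method name, and parameters are hypothetical.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; requires Java 7+ for java.nio.file.
class BadNameRenamer {
    // Rename every entry of dir whose displayed name contains 'bad',
    // replacing that character with 'replacement'.
    // Returns the number of entries renamed.
    static int renameAll(Path dir, char bad, char replacement) throws IOException {
        // Collect candidates first so the directory is not modified while it is listed
        List<Path> candidates = new ArrayList<Path>();
        try (DirectoryStream<Path> entries = Files.newDirectoryStream(dir)) {
            for (Path entry : entries) {
                if (entry.getFileName().toString().indexOf(bad) >= 0) {
                    candidates.add(entry);
                }
            }
        }
        for (Path entry : candidates) {
            String shown = entry.getFileName().toString();
            // The source is the listed Path itself, never a Path rebuilt from the String
            Files.move(entry, entry.resolveSibling(shown.replace(bad, replacement)));
        }
        return candidates.size();
    }

    public static void main(String[] args) throws IOException {
        int renamed = renameAll(Paths.get(args[0]), '\uFFFD', '_');
        System.out.println("Renamed " + renamed + " file(s)");
    }
}
```

The point of the design is that the source of each move comes from the directory listing rather than from a String, so the name never has to survive a decode and re-encode round trip. Files.move throws if the target name already exists, so two bad names that sanitize to the same result would need extra handling.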