Filter log files (_SUCCESS and _logs) in FileSystem.listStatus

Hi, I am using the FileSystem.listStatus method and want to filter out the log files, so that only files that are not log files are listed. How should I do it? Thanks

+2
3 answers

If you look at the source of FileInputFormat (line 62), it has a private static PathFilter that ignores files whose names start with an underscore or a period. Because it is private, you will have to make your own copy of that code. Alternatively, a filter that only accepts names starting with "part" is enough if your input files always start with part (i.e. you did not use MultipleOutputs).
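
A minimal sketch of the underscore-or-period rule described above, written as plain Java so it can be tried without Hadoop on the classpath. The class name HiddenFileNames and the method isVisible are hypothetical names chosen for illustration; this is not the code from FileInputFormat itself, only the same check as stated in this answer.

```java
/** Hypothetical helper (not part of Hadoop): decides which file names count as data files. */
final class HiddenFileNames {
    private HiddenFileNames() {}

    /**
     * Returns true unless the name starts with '_' or '.',
     * so names such as _SUCCESS, _logs and .part-00000.crc are rejected.
     */
    static boolean isVisible(String name) {
        return !name.startsWith("_") && !name.startsWith(".");
    }
}

// Assumed usage with Hadoop available (fs is a FileSystem, dir is a Path):
//   PathFilter visible = p -> HiddenFileNames.isVisible(p.getName());
//   FileStatus[] statuses = fs.listStatus(dir, visible);
```

Unlike a check for the "part" prefix, this version also keeps output files with other names, which may matter if MultipleOutputs was used.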

+2

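This is how I got rid of the _SUCCESS files:

    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.fs.PathFilter;

    // Accept only files whose names start with "part" (e.g. part-00000);
    // this skips _SUCCESS and anything else without that prefix.
    PathFilter clusterFileFilter = new PathFilter() {
        @Override
        public boolean accept(Path path) {
            return path.getName().startsWith("part");
        }
    };

    // fs is an existing FileSystem; path is the directory to list.
    FileStatus[] fileStatusArray = fs.listStatus(path, clusterFileFilter);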
+1

I do not use FileSystem.listStatus(filePath), but I still run into this exception when opening MapFile.Writer(config, filePath, opts).

The exception is below:

java.io.EOFException
    at java.io.DataInputStream.readFully(DataInputStream.java:197)
    at java.io.DataInputStream.readFully(DataInputStream.java:169)
    at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1915)
    at org.apache.hadoop.io.SequenceFile$Reader.initialize(SequenceFile.java:1880)
    at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1829)
    at org.apache.hadoop.io.MapFile$Reader.createDataFileReader(MapFile.java:456)
    at org.apache.hadoop.io.MapFile$Reader.open(MapFile.java:429)
    at org.apache.hadoop.io.MapFile$Reader.<init>(MapFile.java:399)
0