Spark / Scala: Opening Zipped CSV Files

I am new to Spark and Scala. We have event log files in CSV format that are compressed using pkzip. I have seen many examples of how to decompress files using Java, but how would I do this using Scala for Spark? Ultimately, we want to pick up each incoming file, extract it, and load its data into an HBase destination table. Maybe this can be done with HadoopRDD? After that, we will introduce Spark Streaming to watch for these files.

Thanks, Ben

2 answers

In Spark, if your files have the correct file name suffix (e.g. .gz for gzipped) and the format is supported by org.apache.hadoop.io.compress.CompressionCodecFactory, then you can just use

sc.textFile(path)

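UPDATE: At the time of writing there is a bug in the Hadoop bzip2 library, so reading bzip2 files with Spark can fail with odd exceptions, usually ArrayIndexOutOfBounds.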
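For the gzip case, the sc.textFile call above might be used roughly as follows. This is only a sketch: the path /data/events/*.csv.gz, the application name, the local master, and the naive parseLine helper are all assumptions for illustration, and the split does not handle quoted fields.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object GzipCsvExample {
  // Naive CSV split (hypothetical helper): keeps trailing empty fields,
  // does NOT handle quoted fields or embedded commas.
  def parseLine(line: String): Array[String] = line.split(",", -1)

  def main(args: Array[String]): Unit = {
    // Local master and app name are placeholders for this sketch.
    val conf = new SparkConf().setAppName("gzip-csv").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      // The .gz suffix lets Hadoop pick the gzip codec; no explicit unzip step.
      val rows = sc.textFile("/data/events/*.csv.gz").map(parseLine)
      rows.take(5).foreach(r => println(r.mkString(" | ")))
    } finally sc.stop()
  }
}
```

Note that gzip is not a splittable format, so each .gz file is read by a single task regardless of its size.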


@samthebest's answer is correct if you are using a compression format that Spark (Hadoop) supports by default. These include:

  • bzip2
  • GZIP
  • LZ4

More details: fooobar.com/questions/557391/...
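To see which file names Hadoop would match to a codec, one option is to ask CompressionCodecFactory directly. The sketch below assumes hadoop-common is on the classpath and a default Configuration; the CodecCheck object and the file names are made up for illustration.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.compress.CompressionCodecFactory

object CodecCheck {
  // Factory built from a default configuration (an assumption of this sketch).
  private val factory = new CompressionCodecFactory(new Configuration())

  // getCodec matches on the file name suffix and returns null when nothing matches.
  def codecFor(fileName: String): Option[String] =
    Option(factory.getCodec(new Path(fileName))).map(_.getClass.getSimpleName)
}
```

With a default setup, a name such as events.csv.gz should resolve to the gzip codec, while events.zip should resolve to nothing, which is why zip archives need separate handling.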

zip

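With zip, however, the case is different: it is not one of the formats listed above, so sc.textFile cannot decompress it for you.

See also: fooobar.com/questions/557381/...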

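In short, you can read the archives with sc.binaryFiles, which gives you a PortableDataStream for each file, and unzip the contents yourself: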

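import java.io.{BufferedReader, InputStreamReader}
import java.util.zip.ZipInputStream
import org.apache.spark.input.PortableDataStream

// One (fileName, stream) pair per archive; each archive is read by a single task.
sc.binaryFiles(path, minPartitions)
  .flatMap { case (name: String, content: PortableDataStream) =>
    val zis = new ZipInputStream(content.open)
    // Walk the archive entry by entry until getNextEntry returns null.
    Stream.continually(zis.getNextEntry)
          .takeWhile(_ != null)
          .flatMap { _ =>
              // The reader only sees the current entry; readLine returns null at its end.
              val br = new BufferedReader(new InputStreamReader(zis))
              Stream.continually(br.readLine()).takeWhile(_ != null)
          }
  }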
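The unzip logic above can also be pulled out into a plain helper that runs without a cluster, which makes it easier to try out. The following is a minimal sketch, not the answer's own code: the ZipLines object, its read and sampleZip methods, and the UTF-8 charset are assumptions, and read is deliberately eager, so a whole archive is held in memory.

```scala
import java.io.{BufferedReader, ByteArrayOutputStream, InputStream, InputStreamReader}
import java.nio.charset.StandardCharsets.UTF_8
import java.util.zip.{ZipEntry, ZipInputStream, ZipOutputStream}

object ZipLines {
  // Read every line of every entry in a zip stream, paired with the entry name.
  // Eager on purpose: everything is collected before the stream is closed.
  def read(in: InputStream): Vector[(String, String)] = {
    val zis = new ZipInputStream(in)
    try {
      Iterator.continually(zis.getNextEntry)
        .takeWhile(_ != null)
        .flatMap { entry =>
          // readLine returns null at the end of the current entry, not of the archive.
          val br = new BufferedReader(new InputStreamReader(zis, UTF_8))
          Iterator.continually(br.readLine())
            .takeWhile(_ != null)
            .map(line => (entry.getName, line))
        }
        .toVector
    } finally zis.close()
  }

  // Build a small in-memory archive so the helper can be tried without any files.
  def sampleZip(files: Seq[(String, String)]): Array[Byte] = {
    val bytes = new ByteArrayOutputStream()
    val zos = new ZipOutputStream(bytes)
    files.foreach { case (name, body) =>
      zos.putNextEntry(new ZipEntry(name))
      zos.write(body.getBytes(UTF_8))
      zos.closeEntry()
    }
    zos.close()
    bytes.toByteArray
  }
}
```

Inside Spark, the helper would presumably be called as sc.binaryFiles(path).flatMap { case (_, content) => ZipLines.read(content.open) }. Keeping the entry name alongside each line is a design choice that helps when one archive holds several CSV files.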