Spark / Scala: Opening Zipped CSV Files

Question

I am new to Spark and Scala. Our event log files are in CSV format and compressed with pkzip. I have seen many examples of how to decompress zipped files using Java, but how would I do this using Scala for Spark? Ultimately, we want to get, extract, and load the data from each incoming file into an HBase destination table. Maybe this can be done with HadoopRDD? After that, we will introduce Spark Streaming to watch for these files.

Thanks, Ben

+3

scala apache-spark

Ben Feb 18 '14 at 10:06

2 answers

samthebest · Answer 1 · 2014-03-23T12:39:45+0000

In Spark, if your files have the correct file name suffix (e.g. .gz for gzipped files) and the format is supported by org.apache.hadoop.io.compress.CompressionCodecFactory, then you can just use

sc.textFile(path)

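UPDATE: At the time of writing there is a bug in the Hadoop bzip2 library, so reading bzip2 files with Spark can fail with odd exceptions, usually ArrayIndexOutOfBounds.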
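As an illustration of the suffix-based approach, here is a minimal sketch that writes a small gzipped CSV file and reads it back with sc.textFile. It assumes Spark is on the classpath and runs in local mode. The object name GzipCsvExample, the helpers writeGzip and readFields, the file name events.csv.gz, and the sample rows are made up for the example and are not part of the original answer.

```scala
import java.io.{FileOutputStream, OutputStreamWriter}
import java.nio.charset.StandardCharsets
import java.nio.file.Files
import java.util.zip.GZIPOutputStream

import org.apache.spark.{SparkConf, SparkContext}

object GzipCsvExample {
  // Hypothetical helper: writes `text` as a gzip-compressed UTF-8 file.
  def writeGzip(path: String, text: String): Unit = {
    val writer = new OutputStreamWriter(
      new GZIPOutputStream(new FileOutputStream(path)), StandardCharsets.UTF_8)
    try writer.write(text) finally writer.close()
  }

  // The .gz suffix lets Hadoop pick the gzip codec, so textFile returns plain lines.
  // The naive split below assumes the CSV has no quoted commas.
  def readFields(sc: SparkContext, path: String): Array[Array[String]] =
    sc.textFile(path).map(_.split(",", -1)).collect()

  def main(args: Array[String]): Unit = {
    val file = Files.createTempDirectory("gzip-csv").resolve("events.csv.gz").toString
    writeGzip(file, "1,click\n2,view\n")

    val sc = new SparkContext(
      new SparkConf().setMaster("local[1]").setAppName("gzip-csv"))
    try readFields(sc, file).foreach(row => println(row.mkString(" | ")))
    finally sc.stop()
  }
}
```

Note that a gzip file is not splittable, so each .gz file ends up in a single partition.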

Atais · Answer 2 · 2017-08-30T10:58:11+0000

@samthebest's answer is correct if you are using a compression format that is available by default in Spark (Hadoop). These are:

bzip2
GZIP
LZ4

I have covered this in more detail in another answer: fooobar.com/questions/557391/...

zip

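However, zip is not one of these formats, so sc.textFile will not decompress it. A zip file is an archive that can hold several entries, so it has to be unpacked explicitly.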

One possible solution is described in this answer: fooobar.com/questions/557381/...

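In short, you can read the archives with sc.binaryFiles, which gives you a PortableDataStream for each file, and unpack the entries yourself: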

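import java.io.{BufferedReader, InputStreamReader}
import java.util.zip.ZipInputStream
import org.apache.spark.input.PortableDataStream

sc.binaryFiles(path, minPartitions)
  .flatMap { case (name: String, content: PortableDataStream) =>
    val zis = new ZipInputStream(content.open)
    // Step through the archive entries until getNextEntry returns null
    Stream.continually(zis.getNextEntry)
          .takeWhile(_ != null)
          .flatMap { _ =>
              // Read the current entry line by line
              val br = new BufferedReader(new InputStreamReader(zis))
              Stream.continually(br.readLine()).takeWhile(_ != null)
          }
  }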
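The per-archive logic can also be tried without a cluster. Below is a minimal sketch of the same idea using only java.util.zip, so it can be run and tested on its own. The names ZipLines, unzipLines, and makeZip are hypothetical helpers for this example. It also assumes UTF-8 text and reads each archive eagerly, which differs from the lazy Stream used in the answer.

```scala
import java.io.{BufferedReader, ByteArrayInputStream, ByteArrayOutputStream, InputStream, InputStreamReader}
import java.nio.charset.StandardCharsets
import java.util.zip.{ZipEntry, ZipInputStream, ZipOutputStream}

object ZipLines {
  // Reads every line of every entry in a zip stream, then closes the stream.
  // Eager on purpose: all lines are collected before the stream is closed.
  def unzipLines(in: InputStream): Vector[String] = {
    val zis = new ZipInputStream(in)
    try {
      Iterator.continually(zis.getNextEntry)
        .takeWhile(_ != null)
        .flatMap { _ =>
          // ZipInputStream reports end-of-stream at the end of the current entry,
          // so a fresh reader per entry never reads into the next one.
          val br = new BufferedReader(new InputStreamReader(zis, StandardCharsets.UTF_8))
          Iterator.continually(br.readLine()).takeWhile(_ != null)
        }
        .toVector
    } finally zis.close()
  }

  // Demonstration helper: builds a small zip archive in memory.
  def makeZip(entries: Seq[(String, String)]): Array[Byte] = {
    val bytes = new ByteArrayOutputStream()
    val zos = new ZipOutputStream(bytes)
    entries.foreach { case (name, text) =>
      zos.putNextEntry(new ZipEntry(name))
      zos.write(text.getBytes(StandardCharsets.UTF_8))
      zos.closeEntry()
    }
    zos.close()
    bytes.toByteArray
  }

  def main(args: Array[String]): Unit = {
    val zip = makeZip(Seq("a.csv" -> "1,click\n2,view\n", "b.csv" -> "3,buy\n"))
    unzipLines(new ByteArrayInputStream(zip)).foreach(println)
  }
}
```

With Spark, a helper like this could be used as sc.binaryFiles(path).flatMap { case (_, content) => ZipLines.unzipLines(content.open()) }. The trade-off is that all lines of one archive are held in memory at once, which is probably fine for small log files but not for very large archives.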
