This post could also be called Reading .gz.tmp files with Spark. At Socialmetrix we have several pipelines writing logs to AWS S3. Sometimes Apache Flume fails in the last phase, when it renames the final archive from .gz.tmp to .gz, so those files cannot be read through the SparkContext.textFile API. This post presents our workaround to process those files.
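
As background, and not necessarily the approach taken in this post: Hadoop's text input chooses a decompression codec from the file extension, so a gzip archive still named .gz.tmp is not decompressed by textFile. A minimal PySpark sketch that sidesteps this by reading whole files as bytes and gunzipping them by hand might look like the following. The helper names gunzip_lines and read_gz_tmp and the bucket path are hypothetical, and the sketch assumes the archives are complete gzip files that only missed the rename.

```python
import gzip
import io
from typing import List


def gunzip_lines(payload: bytes, encoding: str = "utf-8") -> List[str]:
    """Decompress a gzip payload and split it into lines, ignoring the file name."""
    with gzip.GzipFile(fileobj=io.BytesIO(payload)) as fh:
        return fh.read().decode(encoding).splitlines()


def read_gz_tmp(sc, path_glob: str):
    """Return an RDD of text lines from gzip files whose names do not end in .gz.

    `sc` is an existing SparkContext. `path_glob` is a hypothetical location
    such as "s3a://my-bucket/logs/*.gz.tmp" (assumption, not from the post).
    """
    # binaryFiles yields (path, bytes) pairs, one per whole file, so the
    # extension-based codec lookup used for text input is not involved.
    return sc.binaryFiles(path_glob).flatMap(lambda kv: gunzip_lines(kv[1]))
```

Because binaryFiles loads each file fully into memory in a single task, this suits many small or medium archives better than a few very large ones.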