Spark How to Specify Number of Resulting Files for DataFrame While/After Writing
I saw several Q&As about writing a single file into HDFS; it seems that using coalesce(1) is sufficient.
E.g.:
df.coalesce(1).write.mode("overwrite").format(format).save(location)
But how can I specify the exact number of files that will be written after the save operation?
So my questions are:
If I have a dataframe which consists of 100 partitions, will a write operation produce 100 files?
If I have a dataframe which consists of 100 partitions, will a write operation after calling repartition(50)/coalesce(50) produce 50 files?
Is there a way in Spark that allows specifying the resulting number of files when writing a dataframe into HDFS?
Thanks
1 Answer
The number of output files is in general equal to the number of writing tasks (partitions). Under normal conditions it cannot be smaller (each writer writes its own part, and multiple tasks cannot write to the same file), but it can be larger if the format has non-standard behavior or partitionBy is used.
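As an illustration of the partitionBy case, here is a minimal sketch (the "country" column, the toy data, and the output path are assumptions for illustration only, not from the original post):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("partitionBy-demo").getOrCreate()
import spark.implicits._

// Toy dataframe; the "country" column is a hypothetical example.
val df = Seq(("us", 1), ("us", 2), ("de", 3), ("fr", 4)).toDF("country", "value")

// Each writing task emits one file per distinct "country" value it
// holds, so the total number of output files can exceed the number
// of partitions.
df.repartition(4)
  .write
  .mode("overwrite")
  .partitionBy("country")
  .parquet("hdfs:///tmp/partitionBy_demo")  // assumed output path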
Normally:
If I have a dataframe which consists of 100 partitions, will a write operation produce 100 files?
Yes
If I have a dataframe which consists of 100 partitions, will a write operation after calling repartition(50)/coalesce(50) produce 50 files?
And yes.
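For that second case, a minimal sketch (assuming an existing SparkSession and dataframe df; the output path is hypothetical):

// repartition(50) forces exactly 50 writing tasks, so the save below
// produces 50 part files (plus bookkeeping files such as _SUCCESS).
val df50 = df.repartition(50)
println(df50.rdd.getNumPartitions)  // prints 50

df50.write
  .mode("overwrite")
  .parquet("hdfs:///tmp/repartition_demo")  // assumed output path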
Is there a way in Spark that allows specifying the resulting number of files when writing a dataframe into HDFS?
No.
Note that we can set maxRecordsPerFile since Spark 2.2. With this option the number of files could differ from the number of partitions (gatorsmile.io/…) – Raphael Roth, Jun 29 at 10:40
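For completeness, a sketch of that option in its per-write form (the record cap and output path are arbitrary assumptions). It only caps file size from above by splitting a task's output into multiple files, so it still cannot pin the output to an exact file count:

// Spark 2.2+: cap each output file at 10,000 records. A task whose
// partition holds more rows splits its output across several files,
// so the file count can exceed the partition count.
df.repartition(10)
  .write
  .mode("overwrite")
  .option("maxRecordsPerFile", 10000)
  .parquet("hdfs:///tmp/maxRecords_demo")  // assumed output path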