Spark How to Specify Number of Resulting Files for DataFrame While/After Writing
I saw several Q&As about writing a single file into HDFS; it seems that using coalesce(1) is sufficient.
E.g.:
df.coalesce(1).write.mode("overwrite").format(format).save(location)
But how can I specify the exact number of files that will be written after the save operation?
So my questions are:
If I have a dataframe which consists of 100 partitions, will a write operation produce 100 files?
If I have a dataframe which consists of 100 partitions, will a write operation after calling repartition(50)/coalesce(50) produce 50 files?
Is there a way in Spark that allows specifying the resulting number of files when writing a dataframe into HDFS?
Thanks
1 Answer
The number of output files is in general equal to the number of writing tasks (partitions). Under normal conditions it cannot be smaller (each writer writes its own part, and multiple tasks cannot write to the same file), but it can be larger if the format has non-standard behavior or partitionBy is used.
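As an illustration of the partitionBy case, here is a minimal sketch (the "country" column, the toy data, and the output path are assumptions for illustration only, not from the original post):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("partitionBy-demo").getOrCreate()
import spark.implicits._

// Toy dataframe; the "country" column is a hypothetical example.
val df = Seq(("us", 1), ("us", 2), ("de", 3), ("fr", 4)).toDF("country", "value")

// Each writing task emits one file per distinct "country" value it
// holds, so the total number of output files can exceed the number
// of partitions.
df.repartition(4)
  .write
  .mode("overwrite")
  .partitionBy("country")
  .parquet("hdfs:///tmp/partitionBy_demo")  // assumed output path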
Normally:
If I have a dataframe which consists of 100 partitions, will a write operation produce 100 files?
Yes
If I have a dataframe which consists of 100 partitions, will a write operation after calling repartition(50)/coalesce(50) produce 50 files?
And yes.
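For that second case, a minimal sketch (assuming an existing SparkSession and dataframe df; the output path is hypothetical):

// repartition(50) forces exactly 50 writing tasks, so the save below
// produces 50 part files (plus bookkeeping files such as _SUCCESS).
val df50 = df.repartition(50)
println(df50.rdd.getNumPartitions)  // prints 50

df50.write
  .mode("overwrite")
  .parquet("hdfs:///tmp/repartition_demo")  // assumed output path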
Is there a way in Spark that allows specifying the resulting number of files when writing a dataframe into HDFS?
No.
Note that we can set maxRecordsPerFile since Spark 2.2. With this option the number of files could differ from the number of partitions (gatorsmile.io/…) – Raphael Roth, Jun 29 at 10:40
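For completeness, a sketch of that option in its per-write form (the record cap and output path are arbitrary assumptions). It only caps file size from above by splitting a task's output into multiple files, so it still cannot pin the output to an exact file count:

// Spark 2.2+: cap each output file at 10,000 records. A task whose
// partition holds more rows splits its output across several files,
// so the file count can exceed the partition count.
df.repartition(10)
  .write
  .mode("overwrite")
  .option("maxRecordsPerFile", 10000)
  .parquet("hdfs:///tmp/maxRecords_demo")  // assumed output path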