Best strategy for repartitionBy with a few big partitions





I have to repartition geo data by quadkey. Most of the data is fairly well balanced, but a few partitions are about 500x bigger than the others. This makes the repartition stage very unbalanced: roughly 20-30 of 3500 tasks run about 98% slower than the rest. Is there a good strategy for this case?



I need to do the following:


stage.repartition(partitionColumns.map(new org.apache.spark.sql.Column(_)): _*)
  .write.partitionBy(partitionColumns: _*)
  .format("parquet")
  .option("compression", "gzip")
  .mode(SaveMode.Append)
  .save(destUrl)
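For context, the imbalance follows from how repartition on a column works: rows are hash-partitioned by the column value, so every row sharing a hot quadkey is routed to a single task. A plain-Scala sketch of that behavior (no Spark required; the quadkey values here are made up for illustration):

```scala
// Simulate hash partitioning: each key maps to exactly one partition,
// the same way repartition(col) routes rows by the column value's hash.
val numPartitions = 8
def partitionFor(key: String): Int =
  math.floorMod(key.hashCode, numPartitions)

// A hypothetical hot quadkey with 500x the rows of the others.
val rows = Seq.fill(500)("hot") ++ Seq("0320", "0321", "0322", "0323")
val sizes = rows.groupBy(partitionFor).map { case (p, ks) => p -> ks.size }

// All 500 "hot" rows land in a single partition; the rest stay tiny.
println(sizes.values.max) // prints 500
```

One task per hot key is exactly the 500x-slow straggler pattern described above.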




1 Answer



The .repartition is unnecessary and is probably causing the issue.





If you leave that out and just use .write.partitionBy(...), you will still get the same directory structure; you will simply have multiple files within each directory.


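Applied to the code in the question, that suggestion would look like this (a sketch, assuming the same stage, partitionColumns, and destUrl as above):

```scala
// No .repartition: Spark keeps the existing, balanced task distribution.
// .partitionBy still writes one directory per quadkey value -- each
// directory just contains several files instead of one.
stage.write
  .partitionBy(partitionColumns: _*)
  .format("parquet")
  .option("compression", "gzip")
  .mode(SaveMode.Append)
  .save(destUrl)
```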






