Best strategy for repartitionBy with a few big partitions





I have to repartition geo data by quadkey. Most of the data is fairly well balanced, but a few partitions are about 500x bigger than the others. This makes the repartition stage very unbalanced: roughly 20-30 of 3500 tasks run about 98% slower than the rest. Is there a good strategy for this case?



I need to do the following:


stage.repartition(partitionColumns.map(new org.apache.spark.sql.Column(_)): _*)
  .write.partitionBy(partitionColumns: _*)
  .format("parquet")
  .option("compression", "gzip")
  .mode(SaveMode.Append)
  .save(destUrl)
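For context, the imbalance follows from how repartition on a column works: rows are hash-partitioned by the column value, so every row sharing a hot quadkey is routed to a single task. A plain-Scala sketch of that behavior (no Spark required; the quadkey values here are made up for illustration):

```scala
// Simulate hash partitioning: each key maps to exactly one partition,
// the same way repartition(col) routes rows by the column value's hash.
val numPartitions = 8
def partitionFor(key: String): Int =
  math.floorMod(key.hashCode, numPartitions)

// A hypothetical hot quadkey with 500x the rows of the others.
val rows = Seq.fill(500)("hot") ++ Seq("0320", "0321", "0322", "0323")
val sizes = rows.groupBy(partitionFor).map { case (p, ks) => p -> ks.size }

// All 500 "hot" rows land in a single partition; the rest stay tiny.
println(sizes.values.max) // prints 500
```

One task per hot key is exactly the 500x-slow straggler pattern described above.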




1 Answer



The .repartition is unnecessary and is probably causing the issue.





If you leave that out and just use .write.partitionBy(...), you will still get the same directory structure; you will simply have multiple files within each directory.


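Applied to the code in the question, that suggestion would look like this (a sketch, assuming the same stage, partitionColumns, and destUrl as above):

```scala
// No .repartition: Spark keeps the existing, balanced task distribution.
// .partitionBy still writes one directory per quadkey value -- each
// directory just contains several files instead of one.
stage.write
  .partitionBy(partitionColumns: _*)
  .format("parquet")
  .option("compression", "gzip")
  .mode(SaveMode.Append)
  .save(destUrl)
```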






