Talend Spark error: java.lang.NegativeArraySizeException Serialization trace: XXXXX

java.lang.NegativeArraySizeException
Serialization trace:
SETTLED_TIME (_dw_0400_ld_bonus_XXX_fact_0_1.XXXXStruct)
otherElements (org.apache.spark.util.collection.CompactBuffer)

Solution

The job performs a combineByKey (so there is probably a join somewhere), and the aggregated data is too large to hold in memory, so Spark spills it to disk. Spilling requires serialization, and some of the serialized blocks exceed 2 GB — larger than the maximum size of a Java byte array, which is indexed by a signed int.
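To see why a block over 2 GB surfaces as this particular exception, here is a hedged, illustrative Java sketch (not Talend or Spark code): a byte count above Integer.MAX_VALUE wraps negative when cast to int, and allocating an array with a negative size throws java.lang.NegativeArraySizeException.

```java
// Illustrative only: why a > 2 GB serialized block triggers
// java.lang.NegativeArraySizeException. Java arrays are indexed by int,
// so a length above Integer.MAX_VALUE overflows to a negative value.
public class NegativeSizeDemo {
    public static void main(String[] args) {
        long needed = 3L * 1024 * 1024 * 1024; // 3 GB of serialized data
        int size = (int) needed;               // overflows to a negative int
        System.out.println("Requested size as int: " + size);
        try {
            byte[] buffer = new byte[size];    // what a serializer would attempt
        } catch (NegativeArraySizeException e) {
            System.out.println("Caught: " + e);
        }
    }
}
```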

Starting from a 2 GB dataset, a join can easily expand the intermediate data to several TB.

Increase the parallelism, make sure your combineByKey has enough distinct keys that no single key's buffer grows past 2 GB, and see whether the error disappears.
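As a starting point, the parallelism and serializer buffers can be raised through standard Spark configuration properties (in Talend, via the Spark advanced properties of the job). The values below are illustrative, not tuned recommendations — adjust them to your cluster:

```
spark.default.parallelism        400
spark.sql.shuffle.partitions     400
spark.serializer                 org.apache.spark.serializer.KryoSerializer
spark.kryoserializer.buffer.max  512m
```

More partitions mean smaller blocks per partition, which keeps each serialized spill well under the 2 GB limit.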


