
Common Spark Errors and Solutions

1. Container killed by YARN due to insufficient memory:

ExecutorLostFailure (executor 374 exited caused by one of the running tasks) Reason: Container killed by YARN for exceeding memory limits:

Solution:

  1. Increase spark.yarn.executor.memoryOverhead.
  2. Increase parallelism, as shown in the sketch after this list: raise spark.sql.shuffle.partitions (default 200); with adaptive execution enabled (spark.sql.adaptive.enabled=true), the maximum number of post-shuffle tasks is controlled by spark.sql.adaptive.maxNumPostShufflePartitions instead.
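A minimal Scala sketch of both knobs at session build time; the values 2048 (MiB) and 800 are illustrative assumptions, not recommendations:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("memory-overhead-example")
  // Extra non-heap memory YARN grants each executor container, in MiB (illustrative value).
  .config("spark.yarn.executor.memoryOverhead", "2048")
  // More shuffle partitions -> less data per task -> lower per-task memory pressure.
  .config("spark.sql.shuffle.partitions", "800")
  // With adaptive execution on, the post-shuffle partition cap applies instead.
  .config("spark.sql.adaptive.enabled", "true")
  .config("spark.sql.adaptive.maxNumPostShufflePartitions", "800")
  .getOrCreate()
```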

2. Shuffle Fetch Failed:

Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: ShuffleMapStage 11 (run at ThreadPoolExecutor.java:1142) has failed the maximum allowable number of times: 4. Most recent failure reason: org.apache.spark.shuffle.FetchFailedException: Connection from n20-215-213.byted.org/10.20.215.213:7337 closed at org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:569)

Solution:

Enable HDFS-based shuffle:

spark.shuffle.hdfs.enabled=true

spark.shuffle.io.maxRetries=1

spark.shuffle.io.retryWait=0s

spark.network.timeout=120s
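The same settings as a Scala sketch for completeness. Shuffle and network settings are read when executors start, so in practice they are usually passed at launch (e.g. via --conf or spark-defaults.conf); also note spark.shuffle.hdfs.enabled is a platform-specific config taken from this page, not part of vanilla Spark:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("hdfs-shuffle-example")
  .config("spark.shuffle.hdfs.enabled", "true") // platform-specific, per this page
  .config("spark.shuffle.io.maxRetries", "1")   // fail over quickly instead of retrying the dead node
  .config("spark.shuffle.io.retryWait", "0s")
  .config("spark.network.timeout", "120s")
  .getOrCreate()
```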

3. Fetching too many Hive partitions

Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:Fetch to many partitions 20939 max: 12000)

Solution:

  1. Check that the SQL is correct and whether it really needs to read that many partitions.

  2. Set the following Spark parameters (a sketch follows them):

spark.sql.hive.convertMetastoreParquet=true;

spark.sql.hive.caseSensitiveInferenceMode=NEVER_INFER;
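A minimal Scala sketch combining both steps; the table dw.events and its partition column date are hypothetical names for illustration:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("partition-pruning-example")
  // Read Hive Parquet tables with Spark's native reader instead of the Hive SerDe.
  .config("spark.sql.hive.convertMetastoreParquet", "true")
  // Skip case-sensitive schema inference, which also touches partition files.
  .config("spark.sql.hive.caseSensitiveInferenceMode", "NEVER_INFER")
  .enableHiveSupport()
  .getOrCreate()

// With a filter on the partition column, only the matching partitions are
// requested from the metastore; without it, every partition would be fetched.
spark.sql("SELECT count(*) FROM dw.events WHERE date = '20240101'").show()
```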

4. Insufficient direct (off-heap) memory

Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: ShuffleMapStage 4 (run at ThreadPoolExecutor.java:1142) has failed the maximum allowable number of times: 4. Most recent failure reason: org.apache.spark.shuffle.FetchFailedException: failed to allocate 16777216 byte(s) of direct memory (used: 4294967296, max: 4294967296)

The error means the executor's direct-memory cap (here 4294967296 bytes = 4 GiB) is already exhausted when Netty tries to allocate another 16777216-byte (16 MiB) buffer.

Solution:

  1. Raise the direct-memory cap: spark.executor.extraJavaOptions=-XX:MaxDirectMemorySize=2560m
  2. Adjust parallelism (see item 2 under error 1).
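A minimal Scala sketch under the same caveat as above: executor JVM options only take effect at launch, so in practice they belong in --conf or spark-defaults.conf rather than on a running session.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("direct-memory-example")
  // Raise the cap on direct (off-heap) buffers that Netty uses for shuffle fetch.
  .config("spark.executor.extraJavaOptions", "-XX:MaxDirectMemorySize=2560m")
  // Smaller per-task shuffle blocks also relieve direct-buffer pressure (see error 1);
  // 800 is an illustrative value.
  .config("spark.sql.shuffle.partitions", "800")
  .getOrCreate()
```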