
Common Spark Errors and Solutions

1. Container killed by YARN due to insufficient memory:

ExecutorLostFailure (executor 374 exited caused by one of the running tasks) Reason: Container killed by YARN for exceeding memory limits:

Solution:

  1. Increase spark.yarn.executor.memoryOverhead.
  2. Increase parallelism, as shown in the sketch after this list: raise spark.sql.shuffle.partitions (default 200); with adaptive execution enabled (spark.sql.adaptive.enabled=true), the maximum number of post-shuffle tasks is controlled by spark.sql.adaptive.maxNumPostShufflePartitions instead.
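A minimal Scala sketch of both knobs at session build time; the values 2048 (MiB) and 800 are illustrative assumptions, not recommendations:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("memory-overhead-example")
  // Extra non-heap memory YARN grants each executor container, in MiB (illustrative value).
  .config("spark.yarn.executor.memoryOverhead", "2048")
  // More shuffle partitions -> less data per task -> lower per-task memory pressure.
  .config("spark.sql.shuffle.partitions", "800")
  // With adaptive execution on, the post-shuffle partition cap applies instead.
  .config("spark.sql.adaptive.enabled", "true")
  .config("spark.sql.adaptive.maxNumPostShufflePartitions", "800")
  .getOrCreate()
```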

2. Shuffle Fetch Failed:

Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: ShuffleMapStage 11 (run at ThreadPoolExecutor.java:1142) has failed the maximum allowable number of times: 4. Most recent failure reason: org.apache.spark.shuffle.FetchFailedException: Connection from n20-215-213.byted.org/10.20.215.213:7337 closed at org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:569)

Solution:

Enable HDFS-based shuffle:

spark.shuffle.hdfs.enabled=true

spark.shuffle.io.maxRetries=1

spark.shuffle.io.retryWait=0s

spark.network.timeout=120s
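The same settings as a Scala sketch for completeness. Shuffle and network settings are read when executors start, so in practice they are usually passed at launch (e.g. via --conf or spark-defaults.conf); also note spark.shuffle.hdfs.enabled is a platform-specific config taken from this page, not part of vanilla Spark:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("hdfs-shuffle-example")
  .config("spark.shuffle.hdfs.enabled", "true") // platform-specific, per this page
  .config("spark.shuffle.io.maxRetries", "1")   // fail over quickly instead of retrying the dead node
  .config("spark.shuffle.io.retryWait", "0s")
  .config("spark.network.timeout", "120s")
  .getOrCreate()
```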

3. Fetching too many Hive partitions

Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:Fetch to many partitions 20939 max: 12000)

Solution:

  1. Check that the SQL is correct and whether it really needs to read that many partitions.

  2. Set the following Spark parameters (a sketch follows them):

spark.sql.hive.convertMetastoreParquet=true;

spark.sql.hive.caseSensitiveInferenceMode=NEVER_INFER;
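A minimal Scala sketch combining both steps; the table dw.events and its partition column date are hypothetical names for illustration:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("partition-pruning-example")
  // Read Hive Parquet tables with Spark's native reader instead of the Hive SerDe.
  .config("spark.sql.hive.convertMetastoreParquet", "true")
  // Skip case-sensitive schema inference, which also touches partition files.
  .config("spark.sql.hive.caseSensitiveInferenceMode", "NEVER_INFER")
  .enableHiveSupport()
  .getOrCreate()

// With a filter on the partition column, only the matching partitions are
// requested from the metastore; without it, every partition would be fetched.
spark.sql("SELECT count(*) FROM dw.events WHERE date = '20240101'").show()
```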

4. Insufficient direct (off-heap) memory

Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: ShuffleMapStage 4 (run at ThreadPoolExecutor.java:1142) has failed the maximum allowable number of times: 4. Most recent failure reason: org.apache.spark.shuffle.FetchFailedException: failed to allocate 16777216 byte(s) of direct memory (used: 4294967296, max: 4294967296)

The error means the executor's direct-memory cap (here 4294967296 bytes = 4 GiB) is already exhausted when Netty tries to allocate another 16777216-byte (16 MiB) buffer.

Solution:

  1. Raise the direct-memory cap: spark.executor.extraJavaOptions=-XX:MaxDirectMemorySize=2560m
  2. Adjust parallelism (see item 2 under error 1).
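A minimal Scala sketch under the same caveat as above: executor JVM options only take effect at launch, so in practice they belong in --conf or spark-defaults.conf rather than on a running session.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("direct-memory-example")
  // Raise the cap on direct (off-heap) buffers that Netty uses for shuffle fetch.
  .config("spark.executor.extraJavaOptions", "-XX:MaxDirectMemorySize=2560m")
  // Smaller per-task shuffle blocks also relieve direct-buffer pressure (see error 1);
  // 800 is an illustrative value.
  .config("spark.sql.shuffle.partitions", "800")
  .getOrCreate()
```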