Common Spark errors and how to fix them
1. Container killed by YARN for exceeding memory limits:
ExecutorLostFailure (executor 374 exited caused by one of the running tasks) Reason: Container killed by YARN for exceeding memory limits:
Fix (see the sketch after this list):
- Increase spark.yarn.executor.memoryOverhead.
- Increase shuffle parallelism:
  - spark.sql.shuffle.partitions (default 200);
  - with adaptive execution (AE) enabled (spark.sql.adaptive.enabled=true), the maximum number of shuffle tasks is controlled by spark.sql.adaptive.maxNumPostShufflePartitions.
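A minimal sketch of one way to wire these options up, assuming a Scala job running on YARN. The property names come from the list above; the concrete values (4096 MB overhead, 800/2000 partitions) are illustrative only and should be sized to the workload.

```scala
import org.apache.spark.sql.SparkSession

// Illustrative values only; tune them to the job. memoryOverhead is read when
// executor containers are requested, so in practice it is usually passed on
// spark-submit (--conf spark.yarn.executor.memoryOverhead=4096) rather than in code.
val spark = SparkSession.builder()
  .appName("memory-overhead-sketch")
  .config("spark.yarn.executor.memoryOverhead", "4096")              // MB of off-heap headroom per executor
  .config("spark.sql.shuffle.partitions", "800")                     // more, smaller shuffle tasks
  .config("spark.sql.adaptive.enabled", "true")                      // turn on adaptive execution (AE)
  .config("spark.sql.adaptive.maxNumPostShufflePartitions", "2000")  // AE upper bound on shuffle tasks
  .getOrCreate()
```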
2. Shuffle Fetch Failed:
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: ShuffleMapStage 11 (run at ThreadPoolExecutor.java:1142) has failed the maximum allowable number of times: 4. Most recent failure reason: org.apache.spark.shuffle.FetchFailedException: Connection from n20-215-213.byted.org/10.20.215.213:7337 closed at org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:569)
Fix:
Enable HDFS-based shuffle (a sketch of applying these settings follows):
spark.shuffle.hdfs.enabled=true
spark.shuffle.io.maxRetries=1
spark.shuffle.io.retryWait=0s
spark.network.timeout=120s
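These shuffle and network properties are static, so they cannot be changed with spark.conf.set once the job is running. A sketch of setting them before the context is created, with the values copied from the list above; note that spark.shuffle.hdfs.enabled appears to be a vendor-specific extension rather than an open-source Spark option.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

// Static shuffle/network properties: they only take effect if set before the
// SparkContext starts (or via spark-submit --conf / spark-defaults.conf).
val conf = new SparkConf()
  .set("spark.shuffle.hdfs.enabled", "true")   // HDFS-based shuffle (vendor extension, as in the text)
  .set("spark.shuffle.io.maxRetries", "1")
  .set("spark.shuffle.io.retryWait", "0s")
  .set("spark.network.timeout", "120s")

val spark = SparkSession.builder().config(conf).getOrCreate()
```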
3. Fetching too many Hive partitions
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:Fetch to many partitions 20939 max: 12000)
Fix:
- Check that the SQL is correct and whether it really needs to read that many partitions (see the sketch after this list).
- Set the following Spark parameters:
spark.sql.hive.convertMetastoreParquet=true;
spark.sql.hive.caseSensitiveInferenceMode=NEVER_INFER;
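A sketch of the first point, using a hypothetical table dw.events partitioned by a dt column (both names are made up for illustration): filtering on the partition column lets Spark prune partitions, so far fewer than the metastore limit are fetched. The two SQL options from the list are set alongside, as both can be set at runtime.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().enableHiveSupport().getOrCreate()

// The two SQL options from the list above.
spark.conf.set("spark.sql.hive.convertMetastoreParquet", "true")
spark.conf.set("spark.sql.hive.caseSensitiveInferenceMode", "NEVER_INFER")

// Hypothetical table dw.events partitioned by `dt`: a predicate on the
// partition column prunes partitions, so far fewer than the 12000-partition
// metastore limit are ever requested.
val events = spark.sql(
  """SELECT user_id, event_type
    |FROM dw.events
    |WHERE dt BETWEEN '2023-01-01' AND '2023-01-07'
    |""".stripMargin)

events.show(10)
```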
4. Insufficient off-heap (direct) memory
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: ShuffleMapStage 4 (run at ThreadPoolExecutor.java:1142) has failed the maximum allowable number of times: 4. Most recent failure reason: org.apache.spark.shuffle.FetchFailedException: failed to allocate 16777216 byte(s) of direct memory (used: 4294967296, max: 4294967296)
Fix (see the sketch below):
- spark.executor.extraJavaOptions=-XX:MaxDirectMemorySize=2560m
- Adjust parallelism (see fix 2 under item 1).
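A short sketch of passing the JVM flag, assuming it is set at submit time; like the overhead setting in item 1, it is applied when the executor JVM launches, so setting it after the context has started has no effect. The 2560m value simply mirrors the line above and should be sized to the workload.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

// The flag is applied when executor JVMs launch; equivalently pass it on
// spark-submit: --conf "spark.executor.extraJavaOptions=-XX:MaxDirectMemorySize=2560m"
val conf = new SparkConf()
  .set("spark.executor.extraJavaOptions", "-XX:MaxDirectMemorySize=2560m")

val spark = SparkSession.builder().config(conf).getOrCreate()
```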