华为云用户手册

  • 解决办法 配置自定义参数“allow.everyone.if.no.acl.found”参数为“true”,重启Kafka服务。 采用具有权限用户登录。 例如: kinit test_user 或者赋予用户相关权限。 需要使用Kafka管理员用户(属于kafkaadmin组)操作。 例如: kafka-acls.sh --authorizer-properties zookeeper.connect=10.5.144.2:2181/kafka --topic topic_acl --consumer --add --allow-principal User:test --group test [root@10-10-144-2 client]# kafka-acls.sh --authorizer-properties zookeeper.connect=8.5.144.2:2181/kafka --list --topic topic_acl Current ACLs for resource `Topic:topic_acl`: User:test_user has Allow permission for operations: Describe from hosts: * User:test_user has Allow permission for operations: Write from hosts: * User:test has Allow permission for operations: Describe from hosts: * User:test has Allow permission for operations: Write from hosts: * User:test has Allow permission for operations: Read from hosts: * 用户加入Kafka组或者Kafkaadmin组。
  • 原因分析 修改“$Flink_HOME/conf”目录下的“log4j.properties”文件,控制的是JobManager和TaskManager的算子内的日志输出,输出的日志会打印到对应的yarn contain中,可以在Yarn WebUI查看对应日志。 MRS 3.1.0及之后版本的Flink 1.12.0版本开始默认的日志框架是log4j2,配置的方式跟之前log4j的方式有区别,使用如log4j日志规则不会生效。
  • 问题现象 客户侧HBase用户认证失败,报错信息如下: 2019-05-13 10:53:09,975 ERROR [localhost-startStop-1] xxxConfig.LoginUtil: login failed with hbaseuser and /usr/local/linoseyc/hbase-tomcat/webapps/bigdata_hbase/WEB-INF/classes/user.keytab. 2019-05-13 10:53:09,975 ERROR [localhost-startStop-1] xxxConfig.LoginUtil: perhaps cause 1 is (wrong password) keytab file and user not match, you can kinit -k -t keytab user in client server to check. 2019-05-13 10:53:09,975 ERROR [localhost-startStop-1] xxxConfig.LoginUtil: perhaps cause 2 is (clock skew) time of local server and remote server not match, please check ntp to remote server. 2019-05-13 10:53:09,975 ERROR [localhost-startStop-1] xxxConfig.LoginUtil: perhaps cause 3 is (aes256 not support) aes256 not support by default jdk/jre, need copy local_policy.jar and US_export_policy.jar from remote server in path ${BIGDATA_HOME}/jdk/jre/lib/security.
  • 处理步骤 以root用户登录集群Master1节点。 执行如下命令,查看MRS服务认证的jar包。 ll /opt/share/local_policy/local_policy.jar ll /opt/Bigdata/jdk{version}/jre/lib/security/local_policy.jar 将步骤2中的jar包下载到本地。 将下载的jar包替换到本地JDK目录/opt/Bigdata/jdk/jre/lib/security。 执行cd 客户端安装目录/HBase/hbase/bin命令,进入到HBase的bin目录。 执行sh start-hbase.sh命令,重启HBase组件。
  • 用户问题 用户在Beeline命令行执行insert into命令报错: INFO : Concurrency mode is disabled, not creating a lock manager Error: Error while processing statement: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask. org.apache.hadoop.yarn.exceptions.YarnException: Failed to submit application_xxx to YARN : User xxx cannot submit application application_xxx to queue root.default. ACL check failed. (state=08S01,code=1)
  • 原因分析 HDFS客户端开始写Block。 例如:HDFS客户端是在2015-05-27 18:50:24,232开始写/20150527/10/6_20150527105000_20150527105500_SR5S14_1432723806338_128_11.pkg.tmp1432723806338的。其中分配的块是blk_1099105501_25370893。 2015-05-27 18:50:24,232 | INFO | IPC Server handler 30 on 25000 | BLOCK* allocateBlock: /20150527/10/6_20150527105000_20150527105500_SR5S14_1432723806338_128_11.pkg.tmp1432723806338. BP-1803470917-192.168.57.33-1428597734132 blk_1099105501_25370893{blockUCState=UNDER_CONSTRUCTION, primaryNodeIndex=-1, replicas=[ReplicaUnderConstruction[[DISK]DS-b2d7b7d0-f410-4958-8eba-6deecbca2f87:NORMAL|RBW], ReplicaUnderConstruction[[DISK]DS-76bd80e7-ad58-49c6-bf2c-03f91caf750f:NORMAL|RBW]]} | org.apache.hadoop.hdfs.server.namenode.FSNamesystem.saveAllocatedBlock(FSNamesystem.java:3166) 写完之后HDFS客户端调用了fsync。 2015-05-27 19:00:22,717 | INFO | IPC Server handler 22 on 25000 | BLOCK* fsync: 20150527/10/6_20150527105000_20150527105500_SR5S14_1432723806338_128_11.pkg.tmp1432723806338 for DFSClient_NONMAPREDUCE_-120525246_15 | org.apache.hadoop.hdfs.server.namenode.FSNamesystem.fsync(FSNamesystem.java:3805) HDFS客户端调用close关闭文件,NameNode收到客户端的close请求之后就会检查最后一个块的完成状态,只有当有足够的DataNode上报了块完成才可用关闭文件,检查块完成的状态是通过checkFileProgress函数检查的,打印如下: 2015-05-27 19:00:27,603 | INFO | IPC Server handler 44 on 25000 | BLOCK* checkFileProgress: blk_1099105501_25370893{blockUCState=COMMITTED, primaryNodeIndex=-1, replicas=[ReplicaUnderConstruction[[DISK]DS-ef5fd3c9-5088-4813-ae9a-34a0714ec3a3:NORMAL|RBW], ReplicaUnderConstruction[[DISK]DS-f863e30f-ce5b-48cc-9cca-72f64c558adc:NORMAL|RBW]]} has not reached minimal replication 1 | org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkFileProgress(FSNamesystem.java:3197) 2015-05-27 19:00:28,005 | INFO | IPC Server handler 45 on 25000 | BLOCK* checkFileProgress: blk_1099105501_25370893{blockUCState=COMMITTED, primaryNodeIndex=-1, replicas=[ReplicaUnderConstruction[[DISK]DS-ef5fd3c9-5088-4813-ae9a-34a0714ec3a3:NORMAL|RBW], ReplicaUnderConstruction[[DISK]DS-f863e30f-ce5b-48cc-9cca-72f64c558adc:NORMAL|RBW]]} has not reached minimal replication 1 | org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkFileProgress(FSNamesystem.java:3197) 2015-05-27 19:00:28,806 | INFO | IPC Server handler 63 on 25000 | BLOCK* checkFileProgress: blk_1099105501_25370893{blockUCState=COMMITTED, primaryNodeIndex=-1, replicas=[ReplicaUnderConstruction[[DISK]DS-ef5fd3c9-5088-4813-ae9a-34a0714ec3a3:NORMAL|RBW], ReplicaUnderConstruction[[DISK]DS-f863e30f-ce5b-48cc-9cca-72f64c558adc:NORMAL|RBW]]} has not reached minimal replication 1 | org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkFileProgress(FSNamesystem.java:3197) 2015-05-27 19:00:30,408 | INFO | IPC Server handler 43 on 25000 | BLOCK* checkFileProgress: blk_1099105501_25370893{blockUCState=COMMITTED, primaryNodeIndex=-1, replicas=[ReplicaUnderConstruction[[DISK]DS-ef5fd3c9-5088-4813-ae9a-34a0714ec3a3:NORMAL|RBW], ReplicaUnderConstruction[[DISK]DS-f863e30f-ce5b-48cc-9cca-72f64c558adc:NORMAL|RBW]]} has not reached minimal replication 1 | org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkFileProgress(FSNamesystem.java:3197) 2015-05-27 19:00:33,610 | INFO | IPC Server handler 37 on 25000 | BLOCK* checkFileProgress: blk_1099105501_25370893{blockUCState=COMMITTED, primaryNodeIndex=-1, replicas=[ReplicaUnderConstruction[[DISK]DS-ef5fd3c9-5088-4813-ae9a-34a0714ec3a3:NORMAL|RBW], ReplicaUnderConstruction[[DISK]DS-f863e30f-ce5b-48cc-9cca-72f64c558adc:NORMAL|RBW]]} has not reached minimal replication 1 | org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkFileProgress(FSNamesystem.java:3197) 2015-05-27 19:00:40,011 | INFO | IPC Server handler 37 on 25000 | BLOCK* checkFileProgress: blk_1099105501_25370893{blockUCState=COMMITTED, primaryNodeIndex=-1, replicas=[ReplicaUnderConstruction[[DISK]DS-ef5fd3c9-5088-4813-ae9a-34a0714ec3a3:NORMAL|RBW], ReplicaUnderConstruction[[DISK]DS-f863e30f-ce5b-48cc-9cca-72f64c558adc:NORMAL|RBW]]} has not reached minimal replication 1 | org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkFileProgress(FSNamesystem.java:3197) NameNode打印了多次checkFileProgress是由于HDFS客户端多次尝试close文件,但是由于当前状态不满足要求,导致close失败, HDFS客户端retry的次数是由参数dfs.client.block.write.locateFollowingBlock.retries决定的,该参数默认是5,所以在NameNode的日志中看到了6次checkFileProgress打印。 但是再过0.5s之后,DataNode就上报块已经成功写入。 2015-05-27 19:00:40,608 | INFO | IPC Server handler 60 on 25000 | BLOCK* addStoredBlock: blockMap updated: 192.168.10.21:25009 is added to blk_1099105501_25370893{blockUCState=COMMITTED, primaryNodeIndex=-1, replicas=[ReplicaUnderConstruction[[DISK]DS-ef5fd3c9-5088-4813-ae9a-34a0714ec3a3:NORMAL|RBW], ReplicaUnderConstruction[[DISK]DS-f863e30f-ce5b-48cc-9cca-72f64c558adc:NORMAL|RBW]]} size 11837530 | org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.logAddStoredBlock(BlockManager.java:2393) 2015-05-27 19:00:48,297 | INFO | IPC Server handler 37 on 25000 | BLOCK* addStoredBlock: blockMap updated: 192.168.10.10:25009 is added to blk_1099105501_25370893 size 11837530 | org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.logAddStoredBlock(BlockManager.java:2393) DataNode上报块写成功通知延迟的原因可能有:网络瓶颈导致、CPU瓶颈导致。 如果此时再次调用close或者close的retry的次数增多,那么close都将返回成功。建议适当增大参数dfs.client.block.write.locateFollowingBlock.retries的值,默认值为5次,尝试的时间间隔为400ms、800ms、1600ms、3200ms、6400ms,12800ms,那么close函数最多需要25.2秒才能返回。
  • 问题背景与现象 HDFS客户端写文件close失败,客户端提示数据块没有足够副本数。 客户端日志: 2015-05-27 19:00:52.811 [pool-2-thread-3] ERROR: /tsp/nedata/collect/UGW/ugwufdr/20150527/10/6_20150527105000_20150527105500_SR5S14_1432723806338_128_11.pkg.tmp1432723806338 close hdfs sequence file fail (SequenceFileInfoChannel.java:444) java.io.IOException: Unable to close file because the last block does not have enough number of replicas. at org.apache.hadoop.hdfs.DFSOutputStream.completeFile(DFSOutputStream.java:2160) at org.apache.hadoop.hdfs.DFSOutputStream.close(DFSOutputStream.java:2128) at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:70) at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:103) at com.huawei.pai.collect2.stream.SequenceFileInfoChannel.close(SequenceFileInfoChannel.java:433) at com.huawei.pai.collect2.stream.SequenceFileWriterToolChannel$FileCloseTask.call(SequenceFileWriterToolChannel.java:804) at com.huawei.pai.collect2.stream.SequenceFileWriterToolChannel$FileCloseTask.call(SequenceFileWriterToolChannel.java:792) at java.util.concurrent.FutureTask.run(FutureTask.java:262) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745)
  • 原因分析 进入Yarn原生页面查看MapReduce任务的日志看到报错是无法识别到压缩方式导致错误,看文件后缀是gzip压缩,堆栈却报出是zlib方式。 因此怀疑此语句查询的表对应的HDFS上的文件有问题,Map日志中打印出了解析的对应的文件名,将其从HDFS上下载到本地,看到是gz结尾的文件,使用tar命令解压报错,格式不正确无法解压。使用file命令查看文件属性发现此文件来自于FAT系统的压缩而非UNIX。
  • 处理步骤1 执行以下命令: source /opt/Bigdata/MRS_XXX/install/dbservice/.dbservice_profile gsql -h DBservice浮动IP地址 -p 20051 -d hivemeta -U hive -W hive用户密码 如果不能正确进入交互界面,说明数据库初始化失败。如果报如下错误说明在DBservice所在的节点的配置文件可能丢失了hivemeta的配置。 org.postgresql.util.PSQLException: FATAL: no pg_hba.conf entry for host "192.168.0.146", database "HIVEMETA"。 编辑“/srv/BigData/dbdata_service/data/pg_hba.conf”,在文件最后面追加host hivemeta hive 0.0.0.0/0 sha256配置。 执行source /opt/Bigdata/MRS_XXX/install/dbservice/.dbservice_profile命令配置环境变量。 执行gs_ctl -D $GAUSSDATA reload #命令使修改后的配置生效。
  • 问题背景与现象 执行drop partitions操作,执行异常: MetaStoreClient lost connection. Attempting to reconnect. | org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.invoke(RetryingMetaStoreClient.java:187) org.apache.thrift.transport.TTransportException at org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:132) at org.apache.thrift.transport.TTransport.xxx(TTransport.java:86) at org.apache.thrift.transport.TSaslTransport.readLength(TSaslTransport.java:376) at org.apache.thrift.transport.TSaslTransport.readFrame(TSaslTransport.java:453) at org.apache.thrift.transport.TSaslTransport.read(TSaslTransport.java:435) ... 查看对应MetaStore日志,有StackOverFlow异常: 2017-04-22 01:00:58,834 | ERROR | pool-6-thread-208 | java.lang.StackOverflowError at org.datanucleus.store.rdbms.sql.SQLText.toSQL(SQLText.java:330) at org.datanucleus.store.rdbms.sql.SQLText.toSQL(SQLText.java:339) at org.datanucleus.store.rdbms.sql.SQLText.toSQL(SQLText.java:339) at org.datanucleus.store.rdbms.sql.SQLText.toSQL(SQLText.java:339) at org.datanucleus.store.rdbms.sql.SQLText.toSQL(SQLText.java:339)
  • 问题现象 使用MRS 1.8集群的Hive 1.2.1通过Hive的JDBC接口连接MRS集群成功,但是使用MRS 1.9.0集群的Hive 2.3.2,通过Hive的JDBC接口连接MRS集群进行计算任务报错。 报错信息如下: Caused by: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hiveserver2
  • 解决办法 在重启前,主动执行异常checkpoint合并主NameNode的元数据。 停止业务。 获取主NameNode的主机名。 在客户端执行如下命令: source /opt/client/bigdata_env kinit 组件用户 说明:“/opt/client”需要换为实际客户端的安装路径。 执行如下命令,让主NameNode进入安全模式,其中linux22换为主NameNode的主机名。 hdfs dfsadmin -fs linux22:25000 -safemode enter 执行如下命令,在主NameNode,合并editlog。 hdfs dfsadmin -fs linux22:25000 -saveNamespace 执行如下命令,让主NameNode离开安全模式。 hdfs dfsadmin -fs linux22:25000 -safemode leave 检查是否真的合并完成。 cd /srv/BigData/namenode/current 检查先产生的fsimage是否是当前时间的,若是则表示已经合并完成
  • 原因分析 备NameNode会周期性做合并editlog,生成fsimage文件的过程叫做checkpoint。备NameNode在新生成fsimage后,会将fsimage传递到主NameNode。 由于“备NameNode会周期性做合并editlog”,因此当备NameNode异常时,无法合并editlog,因此主NameNode在下次启动的时候,需要加载较多editlog,需要大量内存,并且耗时较长。 合并元数据的周期由以下参数确定,即如果NameNode运行30分钟或者HDFS操作100万次,均会执行checkpoint。 dfs.namenode.checkpoint.period:checkpoint周期,默认1800s。 dfs.namenode.checkpoint.txns:执行指定操作次数后执行checkpoint,默认1000000。
  • 原因分析 在NameNode运行日志(/var/log/Bigdata/hdfs/nn/hadoop-omm-namendoe-XXX.log)中搜索“WARN”,可以看到有大量时间在垃圾回收,如下例中耗时较长63s。 2017-01-22 14:52:32,641 | WARN | org.apache.hadoop.util.JvmPauseMonitor$Monitor@1b39fd82 | Detected pause in JVM or host machine (eg GC): pause of approximately 63750ms GC pool 'ParNew' had collection(s): count=1 time=0ms GC pool 'ConcurrentMarkSweep' had collection(s): count=1 time=63924ms | JvmPauseMonitor.java:189 分析NameNode日志“/var/log/Bigdata/hdfs/nn/hadoop-omm-namendoe-XXX.log”,可以看到NameNode在等待块上报,且总的Block个数过多,如下例中是3629万。 2017-01-22 14:52:32,641 | INFO | IPC Server handler 8 on 25000 | STATE* Safe mode ON. The reported blocks 29715437 needs additional 6542184 blocks to reach the threshold 0.9990 of total blocks 36293915. 打开Manager页面,查看NameNode的GC_OPTS参数配置如下: 图1 查看NameNode的GC_OPTS参数配置 NameNode内存配置和数据量对应关系参考表1。 表1 NameNode内存配置和数据量对应关系 文件对象数量 参考值 10,000,000 “-Xms6G -Xmx6G -XX:NewSize=512M -XX:MaxNewSize=512M” 20,000,000 “-Xms12G -Xmx12G -XX:NewSize=1G -XX:MaxNewSize=1G” 50,000,000 “-Xms32G -Xmx32G -XX:NewSize=2G -XX:MaxNewSize=3G” 100,000,000 “-Xms64G -Xmx64G -XX:NewSize=4G -XX:MaxNewSize=6G” 200,000,000 “-Xms96G -Xmx96G -XX:NewSize=8G -XX:MaxNewSize=9G” 300,000,000 “-Xms164G -Xmx164G -XX:NewSize=12G -XX:MaxNewSize=12G”
  • 原因分析 HDFS开源3.0.0以下版本的默认端口为50070,3.0.0及以上的默认端口为9870。用户使用的端口和HDFS版本不匹配导致连接端口失败。 登录集群的主Master节点。 执行su - omm命令,切换到omm用户。 执行/opt/Bigdata/om-0.0.1/sbin/queryVersion.sh或者sh ${BIGDATA_HOME}/om-server/om/sbin/queryVersion.sh命令,查看集群中的HDFS版本号。 根据版本号确认开源组件的端口号,查询开源组件的端口号可参考开源组件端口列表,获取对应版本的HDFS端口号。 执行netstat -anp|grep ${port}命令,查看组件的默认端口号是否存在。 如果不存在,说明用户修改了默认的端口号。请修改为默认端口,再重新连接HDFS。 如果存在,请联系技术服务。 ${ port }:表示与组件版本相对应的组件默认端口号。 如果用户修改了默认端口号,请使用修改后的端口号连接HDFS。不建议修改默认端口号。
  • 解决办法 执行select count(*) from table_name;前确认需要查询的数据量大小,确认是否需要在beeline中显示如此数量级的数据。 如数量在一定范围内需要显示,请调整hive客户端的jvm参数, 在hive客户端目录/Hive下的component_env中添加export HIVE_OPTS=-Xmx1024M(具体数值请根据业务调整),并重新执行source 客户端目录/bigdata_env配置环境变量。
  • 解决方案 Hive服务的健康状态(也就是在Manager界面看到的健康状态)有Good,Bad,Partially Healthy,Unknown四种状态 ,四种状态除了取决于Hive本身服务的可用性(会用简单的SQL来检测Hive服务的可用性),还取决于Hive服务所依赖的其他组件的服务状态。 Hive实例分为Hiveserver和Metastore两种,健康状态有Good,Concerning ,Unknown三种状态,这三种状态是通过jmx通信来判定,与实例通信正常时为Good,通信异常时为Concerning,无法通信时为Unknown。
  • 原因分析 该问题多半为HDFS性能较慢,导致健康检查超时,从而导致监控告警。可通过以下方式判断: 首先查看HMaster日志(“/var/log/Bigdata/hbase/hm/hbase-omm-xxx.log”),确认HMaster日志中没有频繁打印“system pause”或“jvm”等GC相关信息。 然后可以通过下列三种方式确认原因为HDFS性能慢造成告警产生。 使用客户端验证,通过hbase shell进入hbase命令行后,执行list验证需要运行多久。 开启HDFS的debug日志,然后查看下层目录很多的路径(hadoop fs -ls /XXX/XXX),验证需要运行多久。 打印HMaster进程jstack: su - omm jps jstack pid 如下图所示,Jstack显示一直卡在DFSClient.listPaths。 图1 异常
  • 问题背景与现象 Hive任务报错,提示执行用户没有HDFS目录权限: 2019-04-09 17:49:19,845 | ERROR | HiveServer2-Background-Pool: Thread-3160445 | Job Submission failed with exception 'org.apache.hadoop.security.AccessControlException(Permission denied: user=hive_quanxian, access=READ_EXECUTE, inode="/user/hive/warehouse/bigdata.db/gd_ga_wa_swryswjl":zhongao:hive:drwx------ at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkAccessAcl(FSPermissionChecker.java:426) at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:329) at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkSubAccess(FSPermissionChecker.java:300) at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:241) at com.xxx.hadoop.adapter.hdfs.plugin.HWAccessControlEnforce.checkPermission(HWAccessControlEnforce.java:69) at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:190) at org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkPermission(FSDirectory.java:1910) at org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkPermission(FSDirectory.java:1894) at org.apache.hadoop.hdfs.server.namenode.FSDirStatAndListingOp.getContentSummary(FSDirStatAndListingOp.java:135) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getContentSummary(FSNamesystem.java:3983) at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getContentSummary(NameNodeRpcServer.java:1342) at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getContentSummary(ClientNamenodeProtocolServerSideTranslatorPB.java:925) at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java) at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:973) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2260) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2256) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:422) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1781) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2254) )'
  • 问题现象 安装Flume客户端失败,提示JAVA_HOME is null或flume has been installed。 CST 2016-08-31 17:02:51 [flume-client install]: JAVA_HOME is null in current user,please install the JDK and set the JAVA_HOME CST 2016-08-31 17:02:51 [flume-client install]: check environment failed. CST 2016-08-31 17:02:51 [flume-client install]: check param failed. CST 2016-08-31 17:02:51 [flume-client install]: install flume client failed.
  • 原因分析 使用客户端命令,打印NoAuthException异常。 通过客户端命令klist查询当前认证用户: [root@10-10-144-2 client]# klist Ticket cache: FILE:/tmp/krb5cc_0 Default principal: test@HADOOP.COM Valid starting Expires Service principal 01/25/17 11:06:48 01/26/17 11:06:45 krbtgt/HADOOP.COM@HADOOP.COM 如上例中当前认证用户为test。 通过命令id查询用户组信息。 [root@10-10-144-2 client]# id test uid=20032(test) gid=10001(hadoop) groups=10001(hadoop),9998(ficommon),10003(kafka)
  • 问题背景与现象 在使用Kafka客户端命令设置Topic ACL权限时,发现Topic无法被设置。 kafka-acls.sh --authorizer-properties zookeeper.connect=10.5.144.2:2181/kafka --topic topic_acl --producer --add --allow-principal User:test_acl 提示错误NoAuthException: KeeperErrorCode = NoAuth for /kafka-acl-changes/acl_changes_0000000002。 具体如下: Error while executing ACL command: org.apache.zookeeper.KeeperException$NoAuthException: KeeperErrorCode = NoAuth for /kafka-acl-changes/acl_changes_0000000002 org.I0Itec.zkclient.exception.ZkException: org.apache.zookeeper.KeeperException$NoAuthException: KeeperErrorCode = NoAuth for /kafka-acl-changes/acl_changes_0000000002 at org.I0Itec.zkclient.exception.ZkException.create(ZkException.java:68) at org.I0Itec.zkclient.ZkClient.retryUntilConnected(ZkClient.java:995) at org.I0Itec.zkclient.ZkClient.delete(ZkClient.java:1038) at kafka.utils.ZkUtils.deletePath(ZkUtils.scala:499) at kafka.common.ZkNodeChangeNotificationListener$$anonfun$purgeObsoleteNotifications$1.apply(ZkNodeChangeNotificationListener.scala:118) at kafka.common.ZkNodeChangeNotificationListener$$anonfun$purgeObsoleteNotifications$1.apply(ZkNodeChangeNotificationListener.scala:112) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at kafka.common.ZkNodeChangeNotificationListener.purgeObsoleteNotifications(ZkNodeChangeNotificationListener.scala:112) at kafka.common.ZkNodeChangeNotificationListener.kafka$common$ZkNodeChangeNotificationListener$$processNotifications(ZkNodeChangeNotificationListener.scala:97) at kafka.common.ZkNodeChangeNotificationListener.processAllNotifications(ZkNodeChangeNotificationListener.scala:77) at kafka.common.ZkNodeChangeNotificationListener.init(ZkNodeChangeNotificationListener.scala:65) at kafka.security.auth.SimpleAclAuthorizer.configure(SimpleAclAuthorizer.scala:136) at kafka.admin.AclCommand$.withAuthorizer(AclCommand.scala:73) at kafka.admin.AclCommand$.addAcl(AclCommand.scala:80) at kafka.admin.AclCommand$.main(AclCommand.scala:48) at kafka.admin.AclCommand.main(AclCommand.scala) Caused by: org.apache.zookeeper.KeeperException$NoAuthException: KeeperErrorCode = NoAuth for /kafka-acl-changes/acl_changes_0000000002 at org.apache.zookeeper.KeeperException.create(KeeperException.java:117) at org.apache.zookeeper.KeeperException.create(KeeperException.java:51) at org.apache.zookeeper.ZooKeeper.delete(ZooKeeper.java:1416) at org.I0Itec.zkclient.ZkConnection.delete(ZkConnection.java:104) at org.I0Itec.zkclient.ZkClient$11.call(ZkClient.java:1042) at org.I0Itec.zkclient.ZkClient.retryUntilConnected(ZkClient.java:985)
  • 问题现象 普通集群在Core节点新建用户安装使用客户端报错如下: 2020-03-14 19:16:17,166 WARN shortcircuit.DomainSocketFactory: error creating DomainSocket java.net.ConnectException: connect(2) error: Permission denied when trying to connect to '/var/run/MRS-HDFS/dn_socket' at org.apache.hadoop.net.unix.DomainSocket.connect0(Native Method) at org.apache.hadoop.net.unix.DomainSocket.connect(DomainSocket.java:256) at org.apache.hadoop.hdfs.shortcircuit.DomainSocketFactory.createSocket(DomainSocketFactory.java:168) at org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.nextDomainPeer(BlockReaderFactory.java:799) ...
  • 处理步骤 当前Hive支持以下几种压缩格式: org.apache.hadoop.io.compress.BZip2Codec org.apache.hadoop.io.compress.Lz4Codec org.apache.hadoop.io.compress.DeflateCodec org.apache.hadoop.io.compress.SnappyCodec org.apache.hadoop.io.compress.GzipCodec 如需要全局设置,即对所有表都进行压缩,可以在Manager页面对Hive的服务配置参数进行如下全局配置: hive.exec.compress.output设置为true mapreduce.output.fileoutputformat.compress.codec设置为org.apache.hadoop.io.compress.BZip2Codec hive.exec.compress.output参数必须设置为true,才能使下边的参数选项生效。 如需在session级设置,只需要在执行命令前增加如下设置即可: set hive.exec.compress.output=true; set mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;
  • 原因分析 FileNotFoundException...No lease on...File does not exist,该日志说明文件在操作的过程中被删除了。 搜索HDFS的NameNode的审计日志(Active NameNode的/var/log/Bigdata/audit/hdfs/nn/hdfs-audit-namenode.log)搜索文件名,确认文件的创建时间。 搜索文件创建到出现异常时间范围的NameNode的审计日志,搜索该文件是否被删除或者移动到其他目录。 如果该文件没有被删除或者移动,可能是该文件的父目录,或者更上层目录被删除或者移动,需要继续搜索上层目录。如本样例中,是文件的父目录被删除。 2017-05-31 02:04:08,286 | INFO | IPC Server handler 30 on 25000 | allowed=true ugi=appUser@HADOOP.COM (auth:TOKEN) ip=/192.168.1.22 cmd=delete src=/user/sparkhive/warehouse/daas/dsp/output/_temporary dst=null perm=null proto=rpc | FSNamesystem.java:8189 如上日志说明:192.168.1.22 节点的appUser用户删除了/user/sparkhive/warehouse/daas/dsp/output/_temporary。 可以使用zgrep "文件名" *.zip命令搜索zip包的内容。
  • 问题背景与现象 有MapReduce任务所有map任务均成功,但reduce任务失败,查看日志发现报异常“FileNotFoundException...No lease on...File does not exist”。 Error: org.apache.hadoop.ipc.RemoteException(java.io.FileNotFoundException): No lease on /user/sparkhive/warehouse/daas/dsp/output/_temporary/1/_temporary/attempt_1479799053892_17075_r_000007_0/part-r-00007 (inode 6501287): File does not exist. Holder DFSClient_attempt_1479799053892_17075_r_000007_0_-1463597952_1 does not have any open files. at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:3350) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.completeFileInternal(FSNamesystem.java:3442) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.completeFile(FSNamesystem.java:3409) at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.complete(NameNodeRpcServer.java:789)
  • 解决办法 在初始化Kafka生产者实例时,设置此配置项“max.request.size ”的值。 例如,参考本例,可以将此配置项设置为“5252880”: // 协议类型:当前支持配置为SASL_PLAINTEXT或者PLAINTEXT props.put(securityProtocol, kafkaProc.getValues(securityProtocol, "SASL_PLAINTEXT")); // 服务名 props.put(saslKerberosServiceName, "kafka"); props.put("max.request.size", "5252880"); .......
  • 问题背景与现象 用户在开发一个Kafka应用,作为一个生产者调用新接口(org.apache.kafka.clients.producer.*)往Kafka写数据,单条记录大小为1100055,超过了kafka配置文件server.properties中message.max.bytes=1000012。用户修改了Kafka服务配置中message.max.bytes大小为5242880,同时也将replica.fetch.max.bytes大小修改为5242880后,仍然无法成功。 报异常如下:
  • 原因分析 HDFS未启动或故障。 查看Flume运行日志: 2019-02-26 11:16:33,564 | ERROR | [SinkRunner-PollingRunner-DefaultSinkProcessor] | opreation the hdfs file errors. | org.apache.flume.sink.hdfs.HDFSEventSink.process(HDFSEventSink.java:414) 2019-02-26 11:16:33,747 | WARN | [hdfs-CCCC-call-runner-4] | A failover has occurred since the start of call #32795 ClientNamenodeProtocolTranslatorPB.getFileInfo over 192-168-13-88/192.168.13.88:25000 | org.apache.hadoop.io.retry.RetryInvocationHandler$ProxyDescriptor.failover(RetryInvocationHandler.java:220) 2019-02-26 11:16:33,748 | ERROR | [hdfs-CCCC-call-runner-4] | execute hdfs error. {} | org.apache.flume.sink.hdfs.HDFSEventSink$3.call(HDFSEventSink.java:744) java.net.ConnectException: Call From 192-168-12-221/192.168.12.221 to 192-168-13-88:25000 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused HDFS Sink未启动。 查看Flume运行日志,发现“ flume current metrics”中并没有Sink信息: 2019-02-26 11:46:05,501 | INFO | [pool-22-thread-1] | flume current metrics:{"CHANNEL.BBBB":{"ChannelCapacity":"10000","ChannelFillPercentage":"0.0","Type":"CHANNEL","ChannelStoreSize":"0","EventProcessTimedelta":"0","EventTakeSuccessCount":"0","ChannelSize":"0","EventTakeAttemptCount":"0","StartTime":"1551152734999","EventPutAttemptCount":"0","EventPutSuccessCount":"0","StopTime":"0"},"SOURCE.AAAA":{"AppendBatchAcceptedCount":"0","EventAcceptedCount":"0","AppendReceivedCount":"0","MonTime":"0","StartTime":"1551152735503","AppendBatchReceivedCount":"0","EventReceivedCount":"0","Type":"SOURCE","TotalFilesCount":"1001","SizeAcceptedCount":"0","UpdateTime":"605410241202740","AppendAcceptedCount":"0","OpenConnectionCount":"0","MovedFilesCount":"1001","StopTime":"0"}} | org.apache.flume.node.Application.getRestartComps(Application.java:467)
  • 原因分析 Flume堆内存设置的值大于机器剩余内存,查看Flume启动日志: [CST 2019-02-26 13:31:43][INFO] [[checkMemoryValidity:124]] [GC_OPTS is invalid: Xmx(40960000MB) is bigger than the free memory(56118MB) in system.] [9928] Flume文件或文件夹权限异常,界面或后台会提示如下信息: [2019-02-26 13:38:02]RoleInstance prepare to start failure [{ScriptExecutionResult=ScriptExecutionResult [exitCode=126, output=, errMsg=sh: line 1: /opt/Bigdata/MRS_XXX/install/FusionInsight-Flume-1.9.0/flume/bin/flume-manage.sh: Permission denied JAVA_HOME配置错误,查看Flume agent启动日志: Info: Sourcing environment configuration script /opt/FlumeClient/fusioninsight-flume-1.9.0/conf/flume-env.sh + '[' -n '' ']' + exec /tmp/MRS-Client/MRS_Flume_ClientConfig/JDK/jdk-8u18/bin/java '-XX:OnOutOfMemoryError=bash /opt/FlumeClient/fusioninsight-flume-1.9.0/bin/out_memory_error.sh /opt/FlumeClient/fusioninsight-flume-1.9.0/conf %p' -Xms2G -Xmx4G -XX:CMSFullGCsBeforeCompaction=1 -XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled -XX:+UseCMSCompactAtFullCollection -Dkerberos.domain.name=hadoop.hadoop.com -verbose:gc -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=10 -XX:GCLogFileSize=1M -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:/var/log/Bigdata//flume-client-1/flume/flume-root-20190226134231-%p-gc.log -Dproc_org.apache.flume.node.Application -Dproc_name=client -Dproc_conf_file=/opt/FlumeClient/fusioninsight-flume-1.9.0/conf/properties.properties -Djava.security.krb5.conf=/opt/FlumeClient/fusioninsight-flume-1.9.0/conf//krb5.conf -Djava.security.auth.login.config=/opt/FlumeClient/fusioninsight-flume-1.9.0/conf//jaas.conf -Dzookeeper.server.principal=zookeeper/hadoop.hadoop.com -Dzookeeper.request.timeout=120000 -Dflume.instance.id=884174180 -Dflume.agent.name=clientName1 -Dflume.role=client -Dlog4j.configuration.watch=true -Dlog4j.configuration=log4j.properties -Dflume_log_dir=/var/log/Bigdata//flume-client-1/flume/ -Dflume.service.id=flume-client-1 -Dbeetle.application.home.path=/opt/FlumeClient/fusioninsight-flume-1.9.0/conf/service -Dflume.called.from.service -Dflume.conf.dir=/opt/FlumeClient/fusioninsight-flume-1.9.0/conf -Dflume.metric.conf.dir=/opt/FlumeClient/fusioninsight-flume-1.9.0/conf -Dflume.script.home=/opt/FlumeClient/fusioninsight-flume-1.9.0/bin -cp '/opt/FlumeClient/fusioninsight-flume-1.9.0/conf:/opt/FlumeClient/fusioninsight-flume-1.9.0/lib/*:/opt/FlumeClient/fusioninsight-flume-1.9.0/conf/service/' -Djava.library.path=/opt/FlumeClient/fusioninsight-flume-1.9.0/plugins.d/native/native org.apache.flume.node.Application --conf-file /opt/FlumeClient/fusioninsight-flume-1.9.0/conf/properties.properties --name client /opt/FlumeClient/fusioninsight-flume-1.9.0/bin/flume-ng: line 233: /tmp/FusionInsight-Client/Flume/FusionInsight_Flume_ClientConfig/JDK/jdk-8u18/bin/java: No such file or directory
共100000条