[jira] [Commented] (APEXCORE-770) Application is killed due to NPE in ApplicationMaster

JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/APEXCORE-770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16118730#comment-16118730 ]

Vlad Rozov commented on APEXCORE-770:
-------------------------------------

Reproducing the issue requires both of the following conditions:

- the application has had more than one application attempt (the application master was restarted)
- the current application master attempts to kill a container that was started by an already terminated application master

In this case, {{StreamingAppMasterService.sendContainerAskToRM()}} invokes {{NMClientAsync.stopContainerAsync()}} for a {{containerId}} that was started by an already terminated application master rather than by the current one. {{NMClientAsync}} then raises {{onStopContainerError}} (see {{NMClientAsyncImpl}}) because its {{containers}} map does not contain the requested {{containerId}}:
{noformat}
2017-07-25 11:24:51,681 WARN com.datatorrent.stram.StreamingAppMasterService: Failed to stop container container_e47_1499808956620_0716_01_000090
org.apache.hadoop.yarn.exceptions.YarnException: Container container_e47_1499808956620_0716_01_000090 is neither started nor scheduled to start
        at org.apache.hadoop.yarn.ipc.RPCUtil.getRemoteException(RPCUtil.java:45)
        at org.apache.hadoop.yarn.client.api.async.impl.NMClientAsyncImpl.stopContainerAsync(NMClientAsyncImpl.java:234)
        at com.datatorrent.stram.StreamingAppMasterService.sendContainerAskToRM(StreamingAppMasterService.java:1175)
        at com.datatorrent.stram.StreamingAppMasterService.execute(StreamingAppMasterService.java:865)
        at com.datatorrent.stram.StreamingAppMasterService.run(StreamingAppMasterService.java:671)
        at com.datatorrent.stram.StreamingAppMaster.main(StreamingAppMaster.java:106)
{noformat}
{{NMCallbackHandler.onStopContainerError()}} tries to recover the container: it removes {{containerId}} from {{allocatedContainers}} and sets the state of the corresponding PTContainer to {{PTContainer.State.KILLED}}. As a result, the heartbeat response to the container carries a shutdown request and the container terminates normally. At that point the RM, which is unaware that the container was requested to stop, reports that it terminated normally. Because {{containerId}} has already been removed from {{allocatedContainers}}, an NPE is raised when {{allocatedContainer}} is used.
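
The sequence above can be sketched in isolation. This is only an illustration and not the actual {{StreamingAppMasterService}} code or the proposed fix: the class {{CompletedContainerSketch}}, its nested {{AllocatedContainer}} type, and the methods {{onStopContainerError}} and {{onContainerCompleted}} are hypothetical stand-ins, and a plain {{HashMap}} plays the role of {{allocatedContainers}}. It shows that once the stop-error path has removed the entry, a later completion report finds nothing in the map, so the lookup result has to be checked for null before it is dereferenced.

```java
import java.util.HashMap;
import java.util.Map;

public class CompletedContainerSketch {

    // Hypothetical stand-in for the AM's bookkeeping entry of one container.
    static final class AllocatedContainer {
        final String containerId;
        boolean stopRequested;

        AllocatedContainer(String containerId) {
            this.containerId = containerId;
        }
    }

    // Hypothetical stand-in for the allocatedContainers map.
    static final Map<String, AllocatedContainer> allocatedContainers = new HashMap<>();

    // Mimics the stop-error path described above: the entry is dropped
    // while the container itself is still running.
    static void onStopContainerError(String containerId) {
        allocatedContainers.remove(containerId);
    }

    // Handles a "container completed" report. Returns true if the container
    // was still tracked, false if its entry had already been removed.
    static boolean onContainerCompleted(String containerId) {
        AllocatedContainer allocated = allocatedContainers.remove(containerId);
        if (allocated == null) {
            // Without this check, the field access below would throw an NPE.
            return false;
        }
        allocated.stopRequested = false;
        return true;
    }

    public static void main(String[] args) {
        String id = "container_from_previous_attempt";
        allocatedContainers.put(id, new AllocatedContainer(id));

        onStopContainerError(id);                  // entry removed here
        boolean tracked = onContainerCompleted(id); // report arrives afterwards

        System.out.println("still tracked: " + tracked);
    }
}
```

Whether skipping the unknown container is the right reaction, as opposed to keeping the entry until the RM reports completion, is a design decision for the actual fix; the sketch only demonstrates the ordering problem.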

> Application is killed due to NPE in ApplicationMaster
> -----------------------------------------------------
>
>                 Key: APEXCORE-770
>                 URL: https://issues.apache.org/jira/browse/APEXCORE-770
>             Project: Apache Apex Core
>          Issue Type: Bug
>            Reporter: Vinay Bangalore Srikanth
>            Assignee: Sandesh
>
> In my Apex application, I was trying to delete different containers (except the app master) randomly.
> The application got killed unexpectedly with the following exception:
> {noformat}
> 2017-07-25 11:24:51,681 WARN com.datatorrent.stram.StreamingAppMasterService: Failed to stop container container_e47_1499808956620_0716_01_000090
> org.apache.hadoop.yarn.exceptions.YarnException: Container container_e47_1499808956620_0716_01_000090 is neither started nor scheduled to start
> at org.apache.hadoop.yarn.ipc.RPCUtil.getRemoteException(RPCUtil.java:45)
> at org.apache.hadoop.yarn.client.api.async.impl.NMClientAsyncImpl.stopContainerAsync(NMClientAsyncImpl.java:234)
> at com.datatorrent.stram.StreamingAppMasterService.sendContainerAskToRM(StreamingAppMasterService.java:1175)
> at com.datatorrent.stram.StreamingAppMasterService.execute(StreamingAppMasterService.java:865)
> at com.datatorrent.stram.StreamingAppMasterService.run(StreamingAppMasterService.java:671)
> at com.datatorrent.stram.StreamingAppMaster.main(StreamingAppMaster.java:106)
> 2017-07-25 11:24:51,681 INFO com.datatorrent.stram.StreamingAppMasterService: Requested stop container container_e47_1499808956620_0716_01_000090
> 2017-07-25 11:24:51,681 INFO org.apache.hadoop.yarn.client.api.async.impl.NMClientAsyncImpl: Processing Event EventType: STOP_CONTAINER for Container container_e47_1499808956620_0716_01_000090
> 2017-07-25 11:24:51,681 INFO org.apache.hadoop.yarn.client.api.async.impl.NMClientAsyncImpl: Container container_e47_1499808956620_0716_01_000090 is already stopped or failed
> 2017-07-25 11:24:51,686 INFO com.datatorrent.stram.StreamingContainerManager: Initiating recovery for [hidden email]:8041
> 2017-07-25 11:24:51,686 INFO com.datatorrent.stram.StreamingContainerManager: Affected operators [PTOperator[id=38,name=passthrough,state=ACTIVE], PTOperator[id=105,name=passthrough.output#unifier,state=ACTIVE], PTOperator[id=97,name=console,state=ACTIVE], PTOperator[id=106,name=passthrough.output#unifier,state=ACTIVE], PTOperator[id=103,name=console,state=ACTIVE], PTOperator[id=107,name=passthrough.output#unifier,state=ACTIVE], PTOperator[id=100,name=console,state=ACTIVE], PTOperator[id=108,name=passthrough.output#unifier,state=ACTIVE], PTOperator[id=99,name=console,state=ACTIVE], PTOperator[id=109,name=passthrough.output#unifier,state=ACTIVE], PTOperator[id=101,name=console,state=ACTIVE], PTOperator[id=110,name=passthrough.output#unifier,state=ACTIVE], PTOperator[id=102,name=console,state=ACTIVE], PTOperator[id=111,name=passthrough.output#unifier,state=ACTIVE], PTOperator[id=98,name=console,state=ACTIVE], PTOperator[id=112,name=passthrough.output#unifier,state=ACTIVE], PTOperator[id=104,name=console,state=ACTIVE], PTOperator[id=68,name=randomGenerator.out#unifier,state=ACTIVE]]
> 2017-07-25 11:24:52,260 ERROR com.datatorrent.stram.StreamingContainerManager: Unknown container container_e47_1499808956620_0716_01_000090
> 2017-07-25 11:24:52,263 INFO com.datatorrent.stram.StreamingContainerParent: child msg: [container_e47_1499808956620_0716_01_000090] Exiting heartbeat loop.. context: PTContainer[id=38(container_e47_1499808956620_0716_01_000090),state=KILLED,operators=[PTOperator[id=38,name=passthrough,state=PENDING_DEPLOY], PTOperator[id=68,name=randomGenerator.out#unifier,state=PENDING_DEPLOY]]]
> 2017-07-25 11:24:52,697 INFO com.datatorrent.stram.ResourceRequestHandler: Strict anti-affinity = [] for container with operators PTOperator[id=38,name=passthrough,state=PENDING_DEPLOY],PTOperator[id=68,name=randomGenerator.out#unifier,state=PENDING_DEPLOY]
> 2017-07-25 11:24:52,698 INFO com.datatorrent.stram.ResourceRequestHandler: Found host null
> 2017-07-25 11:24:52,698 INFO com.datatorrent.stram.BlacklistBasedResourceRequestHandler: No node specific request
> 2017-07-25 11:24:53,710 INFO org.apache.hadoop.yarn.client.api.impl.AMRMClientImpl: Replacing token for : node18.morado.com:8041
> 2017-07-25 11:24:53,710 INFO com.datatorrent.stram.StreamingAppMasterService: Got new container., containerId=container_e47_1499808956620_0716_02_000034, containerNode=node18.morado.com:8041, containerNodeURI=node18.morado.com:8042, containerResourceMemory4096, priority32
> 2017-07-25 11:24:53,710 INFO com.datatorrent.stram.StreamingContainerManager: Removing container agent container_e47_1499808956620_0716_01_000090
> 2017-07-25 11:24:53,711 INFO com.datatorrent.stram.LaunchContainerRunnable: Setting up container launch context for containerid=container_e47_1499808956620_0716_02_000034
> 2017-07-25 11:24:53,711 INFO com.datatorrent.stram.LaunchContainerRunnable: CLASSPATH: ./*:$HADOOP_CONF_DIR:$HADOOP_COMMON_HOME/*:$HADOOP_COMMON_HOME/lib/*:$HADOOP_HDFS_HOME/*:$HADOOP_HDFS_HOME/lib/*:$HADOOP_YARN_HOME/*:$HADOOP_YARN_HOME/lib/*:.
> 2017-07-25 11:24:53,946 INFO com.datatorrent.common.util.BasicContainerOptConfigurator: property map for operator {Generic=null, -Xmx=1920m}
> 2017-07-25 11:24:53,947 INFO com.datatorrent.common.util.BasicContainerOptConfigurator: property map for operator {Generic=null, -Xmx=768m}
> 2017-07-25 11:24:53,947 INFO com.datatorrent.stram.LaunchContainerRunnable: Jvm opts  -Xmx3355443200  for container container_e47_1499808956620_0716_02_000034
> 2017-07-25 11:24:53,947 INFO com.datatorrent.stram.LaunchContainerRunnable: Launching on node: node18.morado.com:8041 command: $JAVA_HOME/bin/java  -Xmx3355443200  -Ddt.attr.APPLICATION_PATH=hdfs://node18.morado.com:8020/user/vinay/datatorrent/apps/application_1499808956620_0716 -Djava.io.tmpdir=$PWD/tmp -Ddt.cid=container_e47_1499808956620_0716_02_000034 -Dhadoop.root.logger=INFO,RFA -Dhadoop.log.dir=<LOG_DIR> -Dapex.application.name=$'SlowConsumerTimeoutWindowCountSet.apa' com.datatorrent.stram.engine.StreamingContainer 1><LOG_DIR>/stdout 2><LOG_DIR>/stderr  
> 2017-07-25 11:24:53,947 INFO org.apache.hadoop.yarn.client.api.async.impl.NMClientAsyncImpl: Processing Event EventType: START_CONTAINER for Container container_e47_1499808956620_0716_02_000034
> 2017-07-25 11:24:53,947 INFO com.datatorrent.stram.StreamingAppMasterService: Completed containerId=container_e47_1499808956620_0716_01_000090, state=COMPLETE, exitStatus=0, diagnostics=
> 2017-07-25 11:24:53,947 INFO com.datatorrent.stram.StreamingAppMasterService: Container completed successfully., containerId=container_e47_1499808956620_0716_01_000090
> 2017-07-25 11:24:53,947 INFO org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy: Opening proxy : node18.morado.com:8041
> 2017-07-25 11:24:53,948 ERROR com.datatorrent.stram.StreamingAppMaster: Exiting Application Master
> java.lang.NullPointerException
> at com.datatorrent.stram.StreamingAppMasterService$AllocatedContainer.access$1000(StreamingAppMasterService.java:1251)
> at com.datatorrent.stram.StreamingAppMasterService.execute(StreamingAppMasterService.java:1014)
> at com.datatorrent.stram.StreamingAppMasterService.run(StreamingAppMasterService.java:671)
> at com.datatorrent.stram.StreamingAppMaster.main(StreamingAppMaster.java:106)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)