Cutting corners is no good, so...
(Previous post: 再びSparkにハマる - なぜか数学者にはワイン好きが多い)
This time I'll do a proper install from source.
I want to keep using the existing Hadoop environment as much as possible, so rather than Spark's local, standalone, or Mesos modes, I'll build it for YARN mode. The cluster is running Hadoop CDH4.4.
The documentation shows how to specify the Hadoop/CDH version, so I'll follow it to the letter.
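For reference, my understanding of how the Hadoop version maps to the Maven profile. The profile names (`yarn-alpha`, `yarn`) come from the Spark 1.0 build docs; the helper function itself is purely illustrative, not part of Spark:

```python
# Hypothetical helper (not part of Spark): which Maven YARN profile to pass,
# based on my reading of the Spark 1.0 build documentation.
def yarn_profile(hadoop_version: str) -> str:
    # Hadoop 0.23.x and 2.0.x (including CDH4, which is 2.0.0-based)
    # ship the "alpha" YARN API; Hadoop 2.2+ uses the stable API.
    if hadoop_version.startswith(("0.23.", "2.0.")):
        return "-Pyarn-alpha"
    return "-Pyarn"

print(yarn_profile("2.0.0-cdh4.4.0"))  # -> -Pyarn-alpha
```

So for CDH4.4 the build needs `-Pyarn-alpha` together with `-Dhadoop.version=2.0.0-cdh4.4.0`.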
> wget http://d3kbcqa49mib13.cloudfront.net/spark-1.0.0.tgz
> tar xf spark-1.0.0.tgz
> cd spark-1.0.0
> export MAVEN_OPTS="-Xmx2g -XX:MaxPermSize=512M -XX:ReservedCodeCacheSize=512m"
> mvn -Pyarn-alpha -Dhadoop.version=2.0.0-cdh4.4.0 -DskipTests clean package
Downloading: http://repo.maven.apache.org/maven2/org/apache/hadoop/hadoop-yarn-common/2.0.0-cdh4.4.0/hadoop-yarn-common-2.0.0-cdh4.4.0.pom
(snip)
※ I didn't understand this warning, but the build kept going, so I ignored it:
[WARNING] Zinc server is not available at port 3030 - reverting to normal incremental compile
(snip)
[ERROR] Failed to execute goal org.apache.maven.plugins:maven-antrun-plugin:1.7:run (default) on project spark-core_2.10: An Ant BuildException has occured: Please set the SCALA_HOME (or SCALA_LIBRARY_PATH if scala is on the path) environment variables and retry.
[ERROR] around Ant part ...<fail message="Please set the SCALA_HOME (or SCALA_LIBRARY_PATH if scala is on the path) environment variables and retry.">... @ 6:126 in /home/foo/spark-1.0.0/core/target/antrun/build-main.xml
Ah, so I need to install Scala first.
> wget http://www.scala-lang.org/files/archive/scala-2.10.4.tgz
> su
# tar xf scala-2.10.4.tgz -C /usr/local
# ln -sv /usr/local/scala-2.10.4 /usr/local/scala
`/usr/local/scala' -> `/usr/local/scala-2.10.4'
# exit
Let's try again.
> export SCALA_HOME=/usr/local/scala
> mvn -Pyarn-alpha -Dhadoop.version=2.0.0-cdh4.4.0 -DskipTests package
(snip)
[ERROR] /home/foo/spark-1.0.0/yarn/alpha/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocationHandler.scala:36: object AMResponse is not a member of package org.apache.hadoop.yarn.api.records
[ERROR] import org.apache.hadoop.yarn.api.records.{AMResponse, ApplicationAttemptId}
[ERROR] ^
[ERROR] /home/foo/spark-1.0.0/yarn/alpha/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocationHandler.scala:110: value getAMResponse is not a member of org.apache.hadoop.yarn.api.protocolrecords.AllocateResponse
[ERROR] val amResp = allocateExecutorResources(executorsToRequest).getAMResponse
[ERROR] ^
[ERROR] two errors found
Errors again.
But this time I have the know-how from last time!
(Previous post: [Resolved for now] Stuck installing Spark [5] - なぜか数学者にはワイン好きが多い)
The method call that triggers the error in the first place —
[error] /usr/local/spark-0.9.1-bin-hadoop2/yarn/alpha/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocationHandler.scala:106: value getAMResponse is not a member of org.apache.hadoop.yarn.api.protocolrecords.AllocateResponse
[error] val amResp = allocateWorkerResources(workersToRequest).getAMResponse
— is apparently unnecessary. Now that I look at it again, that does seem to be the case.
So I edited
spark-1.0.0/yarn/alpha/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocationHandler.scala
as follows:
//import org.apache.hadoop.yarn.api.records.{AMResponse, ApplicationAttemptId}
import org.apache.hadoop.yarn.api.records.ApplicationAttemptId

    // Keep polling the Resource Manager for containers
    // val amResp = allocateExecutorResources(executorsToRequest).getAMResponse
    val amResp = allocateExecutorResources(executorsToRequest)
Trying the build again.
[INFO] ------------------------------------------------------------------------
[INFO] Reactor Summary:
[INFO]
[INFO] Spark Project Parent POM .......................... SUCCESS [3.493s]
[INFO] Spark Project Core ................................ SUCCESS [18.886s]
[INFO] Spark Project Bagel ............................... SUCCESS [2.343s]
[INFO] Spark Project GraphX .............................. SUCCESS [3.148s]
[INFO] Spark Project ML Library .......................... SUCCESS [2.899s]
[INFO] Spark Project Streaming ........................... SUCCESS [3.553s]
[INFO] Spark Project Tools ............................... SUCCESS [0.828s]
[INFO] Spark Project Catalyst ............................ SUCCESS [3.429s]
[INFO] Spark Project SQL ................................. SUCCESS [1.317s]
[INFO] Spark Project Hive ................................ SUCCESS [4.915s]
[INFO] Spark Project REPL ................................ SUCCESS [1.466s]
[INFO] Spark Project YARN Parent POM ..................... SUCCESS [1.480s]
[INFO] Spark Project YARN Alpha API ...................... SUCCESS [36.159s]
[INFO] Spark Project Assembly ............................ SUCCESS [39.842s]
[INFO] Spark Project External Twitter .................... SUCCESS [23.469s]
[INFO] Spark Project External Kafka ...................... SUCCESS [33.848s]
[INFO] Spark Project External Flume ...................... SUCCESS [27.625s]
[INFO] Spark Project External ZeroMQ ..................... SUCCESS [26.915s]
[INFO] Spark Project External MQTT ....................... SUCCESS [27.329s]
[INFO] Spark Project Examples ............................ SUCCESS [1:45.440s]
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 6:09.339s
[INFO] Finished at: Thu Jul 03 13:27:09 JST 2014
[INFO] Final Memory: 55M/968M
[INFO] ------------------------------------------------------------------------
OK.
First, I want to run the Pi example.
Let's start with YARN client mode.
HADOOP_CONF_DIR=/usr/local/hadoop/etc/hadoop ./bin/spark-submit --class org.apache.spark.examples.SparkPi --master yarn-client examples/target/spark-examples_2.10-1.0.0.jar 2
※ In client mode, the driver runs on the client machine and sets up its communication from there
14/07/03 13:36:00 INFO Remoting: Starting remoting
14/07/03 13:36:00 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://spark@myclientnode:58567]
14/07/03 13:36:00 INFO Remoting: Remoting now listens on addresses: [akka.tcp://spark@myclientnode:58567]
14/07/03 13:36:01 INFO service.AbstractService: Service:org.apache.hadoop.yarn.client.YarnClientImpl is inited.
14/07/03 13:36:01 INFO service.AbstractService: Service:org.apache.hadoop.yarn.client.YarnClientImpl is started.
※ Getting the number of Hadoop data nodes — or more precisely, NodeManagers
14/07/03 13:36:02 INFO yarn.Client: Got Cluster metric info from ASM, numNodeManagers = 11
14/07/03 13:36:02 INFO yarn.Client: Preparing Local resources
※ The Spark assembly jar is uploaded to HDFS so the data nodes can share it
14/07/03 13:36:03 INFO yarn.Client: Uploading file:/home/foo/spark-1.0.0/assembly/target/scala-2.10/spark-assembly-1.0.0-hadoop2.0.0-cdh4.4.0.jar to hdfs://mycluster/user/hadoop/.sparkStaging/application_1401264313901_22542/spark-assembly-1.0.0-hadoop2.0.0-cdh4.4.0.jar
※ The job is handed off to the ApplicationMaster
14/07/03 13:36:05 INFO yarn.Client: Submitting application to ASM
14/07/03 13:36:05 INFO client.YarnClientImpl: Submitted application application_1401264313901_22542 to ResourceManager at myresourcemanager/192.168.26.29:8040
14/07/03 13:36:05 INFO cluster.YarnClientSchedulerBackend: Application report from ASM:
appMasterRpcPort: 0
appStartTime: 1404362165270
yarnAppState: ACCEPTED
14/07/03 13:36:12 INFO cluster.YarnClientSchedulerBackend: Registered executor: Actor[akka.tcp://sparkExecutor@mydatanode01:40692/user/Executor#81931987] with ID 2
14/07/03 13:36:12 INFO scheduler.TaskSetManager: Starting task 0.0:0 as TID 0 on executor 2: mydatanode01 (PROCESS_LOCAL)
14/07/03 13:36:12 INFO scheduler.TaskSetManager: Serialized task 0.0:0 as 1404 bytes in 2 ms
14/07/03 13:36:13 INFO storage.BlockManagerInfo: Registering block manager mydatanode01:52317 with 589.2 MB RAM
14/07/03 13:36:13 INFO cluster.YarnClientSchedulerBackend: Registered executor: Actor[akka.tcp://sparkExecutor@mydatanode02:60460/user/Executor#-1060070018] with ID 1
14/07/03 13:36:13 INFO scheduler.TaskSetManager: Starting task 0.0:1 as TID 1 on executor 1: mydatanode02 (PROCESS_LOCAL)
14/07/03 13:36:13 INFO scheduler.TaskSetManager: Serialized task 0.0:1 as 1404 bytes in 0 ms
14/07/03 13:36:13 INFO storage.BlockManagerInfo: Registering block manager mydatanode02:38750 with 589.2 MB RAM
14/07/03 13:36:14 INFO scheduler.TaskSetManager: Finished TID 1 in 922 ms on mydatanode02 (progress: 2/2)
14/07/03 13:36:14 INFO scheduler.DAGScheduler: Stage 0 (reduce at SparkPi.scala:35) finished in 2.516 s
14/07/03 13:36:14 INFO cluster.YarnClientClusterScheduler: Removed TaskSet 0.0, whose tasks have all completed, from pool
14/07/03 13:36:14 INFO spark.SparkContext: Job finished: reduce at SparkPi.scala:35, took 2.653324513 s
※ The approximate value of pi
Pi is roughly 3.14024
Next, YARN cluster mode.
HADOOP_CONF_DIR=/usr/local/hadoop/etc/hadoop ./bin/spark-submit --class org.apache.spark.examples.SparkPi --master yarn-cluster examples/target/spark-examples_2.10-1.0.0.jar 2
※ Again, getting the number of data nodes running a NodeManager
14/07/03 13:48:12 INFO yarn.Client: Got Cluster metric info from ASM, numNodeManagers = 11
14/07/03 13:48:12 INFO yarn.Client: Queue info ... queueName = default, queueCurrentCapacity = 0.0, queueMaxCapacity = 1.0,
queueApplicationCount = 10000, queueChildQueueCount = 0
14/07/03 13:48:12 INFO yarn.Client: Preparing Local resources
※ The application jar itself is uploaded to HDFS so the data nodes can share it
14/07/03 13:48:13 INFO yarn.Client: Uploading file:/home/foo/spark-1.0.0/examples/target/spark-examples_2.10-1.0.0.jar to hdfs://mycluster/user/hadoop/.sparkStaging/application_1401264313901_22543/spark-examples_2.10-1.0.0.jar
※ The Spark assembly jar is uploaded to HDFS as well
14/07/03 13:48:13 INFO yarn.Client: Uploading file:/home/foo/spark-1.0.0/assembly/target/scala-2.10/spark-assembly-1.0.0-hadoop2.0.0-cdh4.4.0.jar to hdfs://mycluster/user/hadoop/.sparkStaging/application_1401264313901_22543/spark-assembly-1.0.0-hadoop2.0.0-cdh4.4.0.jar
※ The job is handed off to the data nodes; the client is now decoupled from the processing
14/07/03 13:48:15 INFO yarn.Client: Submitting application to ASM
14/07/03 13:48:15 INFO client.YarnClientImpl: Submitted application application_1401264313901_22543 to ResourceManager at myresourcemanager/192.168.26.29:8040
14/07/03 13:48:16 INFO yarn.Client: Application report from ASM:
application identifier: application_1401264313901_22543
appId: 22543
clientToken: null
appDiagnostics:
appMasterHost: N/A
appQueue: default
appMasterRpcPort: 0
appStartTime: 1404362895315
yarnAppState: ACCEPTED
distributedFinalState: UNDEFINED
appTrackingUrl: myresourcemanager:8088/proxy/application_1401264313901_22543/
appUser: hadoop
※ The client periodically polls the cluster for the application's status
14/07/03 13:48:20 INFO yarn.Client: Application report from ASM:
application identifier: application_1401264313901_22543
appId: 22543
clientToken: null
appDiagnostics:
appMasterHost: mydatanode03
appQueue: default
appMasterRpcPort: 0
appStartTime: 1404362895315
yarnAppState: RUNNING
distributedFinalState: UNDEFINED
appTrackingUrl: myresourcemanager:8088/proxy/application_1401264313901_22543/
appUser: hadoop
14/07/03 13:48:25 INFO yarn.Client: Application report from ASM:
application identifier: application_1401264313901_22543
appId: 22543
clientToken: null
appDiagnostics:
appMasterHost: mydatanode03
appQueue: default
appMasterRpcPort: 0
appStartTime: 1404362895315
yarnAppState: FINISHED
distributedFinalState: SUCCEEDED
appTrackingUrl:
appUser: hadoop
Moving over to mydatanode03, where the job actually ran...
> cat userlogs/application_1401264313901_22543/*/stdout
Pi is roughly 3.14018
No problems.
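Incidentally, SparkPi arrives at this number by Monte Carlo sampling: throw random points into a square and count how many land inside the inscribed circle. A minimal single-machine sketch of the same math (plain Python, not Spark code):

```python
import random

def estimate_pi(samples=100_000, seed=42):
    """Monte Carlo pi: sample points in [-1, 1]^2 and count how many
    land inside the unit circle. This is the calculation SparkPi
    parallelizes across executors; pi ~= 4 * inside / total."""
    rng = random.Random(seed)
    inside = 0
    for _ in range(samples):
        x = rng.uniform(-1.0, 1.0)
        y = rng.uniform(-1.0, 1.0)
        if x * x + y * y < 1.0:
            inside += 1
    return 4.0 * inside / samples

print(estimate_pi())  # roughly 3.14, like the job output above
```

More samples tighten the estimate — in the spark-submit commands above, the trailing `2` is the number of partitions the sampling is split across.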
Next time, I'll try running a program of my own rather than the bundled examples.