First, install Sqoop. In the transcripts below, a # prompt means root and a $ prompt means a regular user.
# tar xf sqoop-1.4.3-cdh4.5.0.tar.gz -C /usr/local/
# chown -R hadoop:hadoop /usr/local/sqoop-1.4.3-cdh4.5.0/
# ln -s /usr/local/sqoop-1.4.3-cdh4.5.0 /usr/local/sqoop
# emacs /etc/bashrc
export PATH=$PATH:/usr/local/sqoop/bin
#Oops. It complained:
$ sqoop help
Error: /usr/lib/hadoop does not exist!
Please set $HADOOP_COMMON_HOME to the root of your Hadoop installation.
# emacs /etc/bashrc
export HADOOP_COMMON_HOME=/usr/local/hadoop
$ sudo su hadoop
$ sqoop help
14/04/30 16:06:50 INFO sqoop.Sqoop: Running Sqoop version: 1.4.3-cdh4.5.0
usage: sqoop COMMAND [ARGS]
Available commands:
  codegen            Generate code to interact with database records
  create-hive-table  Import a table definition into Hive
  eval               Evaluate a SQL statement and display the results
  export             Export an HDFS directory to a database table
  help               List available commands
  import             Import a table from a database to HDFS
  import-all-tables  Import tables from a database to HDFS
  job                Work with saved jobs
  list-databases     List available databases on a server
  list-tables        List available tables in a database
  merge              Merge results of incremental imports
  metastore          Run a standalone Sqoop metastore
  version            Display version information
That worked.
Next, install Hive. There was one small change here; I'll write it up later if I remember.
# wget http://archive.cloudera.com/cdh5/cdh/5/hive-0.12.0-cdh5.0.0.tar.gz
# tar xf hive-0.12.0-cdh5.0.0.tar.gz -C /usr/local/
# chown -R hadoop:hadoop /usr/local/hive-0.12.0-cdh5.0.0/
# rm /usr/local/hive
# ln -s /usr/local/hive-0.12.0-cdh5.0.0/ /usr/local/hive
# emacs /etc/bashrc
export PATH=$PATH:/usr/local/sqoop/bin:/usr/local/hive/bin
# ls -l /usr/local/hive/bin/schematool
-rwxr-xr-x. 1 hadoop hadoop 884 Mar 28 15:14 /usr/local/hive/bin/schematool
$ cp -pv /usr/local/hive/conf/hive-default.xml.template /usr/local/hive/conf/hive-site.xml
`/usr/local/hive/conf/hive-default.xml.template' -> `/usr/local/hive/conf/hive-site.xml'
#These two are a matter of taste:
<property>
  <name>hive.cli.print.header</name>
  <!-- <value>false</value> -->
  <value>true</value>
  <description>Whether to print the names of the columns in query output.</description>
</property>
<property>
  <name>hive.cli.print.current.db</name>
  <!-- <value>false</value> -->
  <value>true</value>
  <description>Whether to include the current database in the hive prompt.</description>
</property>
#We are running in local mode, so leave this one alone:
<property>
  <name>hive.metastore.uris</name>
  <value></value>
  <description>Thrift uri for the remote metastore. Used by metastore client to connect to remote metastore.</description>
</property>
#Change this one: stand up a MySQL server for the metastore and put its address here:
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <!-- <value>jdbc:derby:;databaseName=metastore_db;create=true</value> -->
  <value>jdbc:mysql://metastore-server/metastore?createDatabaseIfNotExist=true</value>
  <description>JDBC connect string for a JDBC metastore</description>
</property>
#Change this one too; it just has to match the metastore's database type:
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <!-- <value>org.apache.derby.jdbc.EmbeddedDriver</value> -->
  <value>com.mysql.jdbc.Driver</value>
  <description>Driver class name for a JDBC metastore</description>
</property>
#These have to match what you configure in MySQL:
<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <!-- <value>APP</value> -->
  <value>hive_user</value>
  <description>username to use against metastore database</description>
</property>
<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <!-- <value>mine</value> -->
  <value>hive_password</value>
  <description>password to use against metastore database</description>
</property>
<property>
  <name>hive.exec.compress.output</name>
  <!-- <value>false</value> -->
  <value>true</value>
  <description> This controls whether the final outputs of a query (to a local/hdfs file or a hive table) is compressed. The compression codec and other options are determined from hadoop config variables mapred.output.compress* </description>
</property>
<property>
  <name>hive.exec.compress.intermediate</name>
  <!-- <value>false</value> -->
  <value>true</value>
  <description> This controls whether intermediate files produced by hive between multiple map-reduce jobs are compressed. The compression codec and other options are determined from hadoop config variables mapred.output.compress* </description>
</property>
<property>
  <name>hive.exec.parallel</name>
  <!-- <value>false</value> -->
  <value>true</value>
  <description>Whether to execute jobs in parallel</description>
</property>
#Allow dynamic partitions to be created freely:
<property>
  <name>hive.exec.dynamic.partition.mode</name>
  <!-- <value>strict</value> -->
  <value>nonstrict</value>
  <description>In strict mode, the user must specify at least one static partition in case the user accidentally overwrites all partitions.</description>
</property>
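As a quick sanity check on the edited file, you can parse it with nothing but Python's standard library and confirm the overridden values took. This is just a sketch, not a Hive tool; the property names and values below are the ones set above, and in real use you would read /usr/local/hive/conf/hive-site.xml instead of the embedded sample.

```python
# Parse a Hadoop-style XML config and collect its <property> name/value pairs.
import xml.etree.ElementTree as ET

HIVE_SITE = """<configuration>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://metastore-server/metastore?createDatabaseIfNotExist=true</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
  </property>
</configuration>"""

def read_props(xml_text):
    """Return {name: value} for every <property> element."""
    root = ET.fromstring(xml_text)
    return {p.findtext("name"): p.findtext("value") for p in root.findall("property")}

props = read_props(HIVE_SITE)
print(props["javax.jdo.option.ConnectionDriverName"])  # com.mysql.jdbc.Driver
```

Parsing also fails loudly on malformed XML, which catches a truncated edit before Hive does.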
Next, configure MySQL.
# wget http://dev.mysql.com/get/Downloads/Connector-J/mysql-connector-java-5.1.30.tar.gz
# tar xvf mysql-connector-java-5.1.30.tar.gz
mysql-connector-java-5.1.30/mysql-connector-java-5.1.30-bin.jar
# cp -pv mysql-connector-java-5.1.30/mysql-connector-java-5.1.30-bin.jar /usr/local/hive/lib/
`mysql-connector-java-5.1.30/mysql-connector-java-5.1.30-bin.jar' -> `/usr/local/hive/lib/mysql-connector-java-5.1.30-bin.jar'
# rm -fR mysql-connector-java-5.1.30
#Recent releases, e.g. CDH 5.0.0, recommend initializing the metastore with schematool instead of piping a schema SQL file into MySQL by hand, so let's follow that.
$ schematool -dbType mysql -info
Metastore connection URL: jdbc:mysql://metastore-server/metastore?createDatabaseIfNotExist=true
Metastore Connection Driver : com.mysql.jdbc.Driver
Metastore connection User: hive_user
org.apache.hadoop.hive.metastore.HiveMetaException: Failed to get schema version.
*** schemaTool failed ***
#It errors out...
$ mysql -u root --password=root_password -e "SELECT host, user FROM mysql.user WHERE user='hive_user'"
+-----------+-----------+
| host      | user      |
+-----------+-----------+
| %         | hive_user |
| localhost | hive_user |
+-----------+-----------+
$ schematool -verbose -dbType mysql -initSchema -dryRun
Metastore connection URL: jdbc:mysql://metastore-server/metastore?createDatabaseIfNotExist=true
Metastore Connection Driver : com.mysql.jdbc.Driver
Metastore connection User: hive_user
org.apache.hadoop.hive.metastore.HiveMetaException: Failed to get schema version.
org.apache.hadoop.hive.metastore.HiveMetaException: Failed to get schema version.
at org.apache.hive.beeline.HiveSchemaTool.getConnectionToMetastore(HiveSchemaTool.java:187)
at org.apache.hive.beeline.HiveSchemaTool.testConnectionToMetastore(HiveSchemaTool.java:150)
at org.apache.hive.beeline.HiveSchemaTool.doInit(HiveSchemaTool.java:279)
at org.apache.hive.beeline.HiveSchemaTool.doInit(HiveSchemaTool.java:265)
at org.apache.hive.beeline.HiveSchemaTool.main(HiveSchemaTool.java:512)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.RunJar.main(RunJar.java:212)
Caused by: java.sql.SQLException: Access denied for user 'hive_user'@'hive_server.example.com' (using password: YES)
at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:1084)
$ mysql -h metastore-server -u hive_user --password=hive_password
ERROR 1045 (28000): Access denied for user 'hive_user'@'hive_server' (using password: YES)
$ mysql -h localhost -u hive --password=hive_password
Welcome to the MySQL monitor. Commands end with ; or \g.
Your MySQL connection id is 30
Server version: 5.5.5-10.0.10-MariaDB Source distribution
Copyright (c) 2000, 2013, Oracle and/or its affiliates. All rights reserved.
Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
owners.
Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.
mysql> quit
#Ah. Being able to log in via localhost like that means we have the anonymous-user situation described in the MySQL documentation.
$ mysql -u root --password=root_password -e "SELECT host, user FROM mysql.user WHERE user=''"
+-------------------------+------+
| host                    | user |
+-------------------------+------+
| localhost               |      |
| hive_server.example.com |      |
+-------------------------+------+
#Delete all the password-less (anonymous) users
$ mysql -u root --password=root_password -e "DROP USER ''@'hive_server.example.com'"
$ mysql -u root --password=root_password -e "SELECT host, user FROM mysql.user WHERE user=''"
+-----------+------+
| host      | user |
+-----------+------+
| localhost |      |
+-----------+------+
#That did it
$ schematool -verbose -dbType mysql -initSchema -dryRun
Metastore connection URL: jdbc:mysql://metastore-server/metastore?createDatabaseIfNotExist=true
Metastore Connection Driver : com.mysql.jdbc.Driver
Metastore connection User: hive_user
Starting metastore schema initialization to 0.12.0
Initialization script hive-schema-0.12.0.mysql.sql
schemaTool completeted
#Initializing the metastore worked too
$ schematool -verbose -dbType mysql -initSchema
Starting metastore schema initialization to 0.12.0
Initialization script hive-schema-0.12.0.mysql.sql
Connecting to jdbc:mysql://metastore-server/metastore?createDatabaseIfNotExist=true
Connected to: MySQL (version 5.5.5-10.0.10-MariaDB)
Driver: MySQL Connector Java (version mysql-connector-java-5.1.30 ( Revision: alexander.soklakov@oracle.com-20140310090514-8xt1yoht5ksg2e7c ))
Transaction isolation: TRANSACTION_READ_COMMITTED
Beeline version 0.12.0-cdh5.0.0 by Apache Hive
0: jdbc:mysql://metastore-server/metastore> !autocommit on
Autocommit status: true
0: jdbc:mysql://metastore-server/metastore> /*!40101 SET @OLD_CHARACTER_SET_CLIENT=@@CHARACTER_SET_CLIENT */
No rows affected (0.002 seconds)
0: jdbc:mysql://metastore-server/metastore> !closeall
Closing: com.mysql.jdbc.JDBC4Connection
beeline> Initialization script completed
schemaTool completeted

That more or less worked, so let's create a database.
#Confirm that hive_user's directory does not exist in HDFS yet
$ hdfs dfs -ls /user
drwxr-xr-x - hadoop supergroup 0 2014-04-30 15:57 /user/hadoop
#When we create a database in Hive...
$ hive -e "CREATE DATABASE test"
OK
Time taken: 3.74 seconds
#...the directory gets created automatically
$ hdfs dfs -ls /user/hive_user/warehouse
drwxr-xr-x - hadoop supergroup 0 2014-04-30 18:19 /user/hive_user/warehouse/test.db
#Create a table too, laid out so a TSV file can be loaded straight into it.
$ hive --database=test -e "CREATE TABLE users (id INT, name STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'"
OK
Time taken: 2.933 seconds
OK
Time taken: 0.939 seconds
#Prepare the TSV file
$ echo -e "2\talice" > /tmp/users.tsv
$ echo -e "5\tbob" >> /tmp/users.tsv
$ cat /tmp/users.tsv
2 alice
5 bob
#Load from the TSV file
$ hive --database=test -e "LOAD DATA LOCAL INPATH '/tmp/users.tsv' INTO TABLE users"
OK
Time taken: 2.842 seconds
Copying data from file:/tmp/users.tsv
Copying file: file:/tmp/users.tsv
Loading data to table test.users
Table test.users stats: [num_partitions: 0, num_files: 1, num_rows: 0, total_size: 14, raw_data_size: 0]
OK
Time taken: 1.354 seconds
#Check: looks good
$ hive --database=test -e "SELECT * FROM users"
OK
Time taken: 3.113 seconds
OK
id name
2 alice
5 bob
Time taken: 1.175 seconds, Fetched: 2 row(s)
#Next, have Sqoop create the Hive table automatically from MySQL
$ mysql -h metastore-server -u hive_user --password=hive_password test -e "CREATE TABLE users (id INT, name CHAR(10))"
$ mysql -u hive_user --password=hive_password test -e "SELECT COUNT(*) FROM users"
+----------+
| COUNT(*) |
+----------+
|        0 |
+----------+
#Prepare some dummy data
$ mysql -u hive_user --password=hive_password test -e "INSERT INTO users SET id=2, name='alice'"
$ mysql -u hive_user --password=hive_password test -e "INSERT INTO users SET id=5, name='bob'"
$ mysql -u hive_user --password=hive_password test -e "SELECT COUNT(*) FROM users"
+----------+
| COUNT(*) |
+----------+
|        2 |
+----------+
#Check
$ mysql -u hive_user --password=hive_password test -e "SELECT * FROM users"
+------+-------+
| id   | name  |
+------+-------+
|    2 | alice |
|    5 | bob   |
+------+-------+
The rows we loaded earlier with LOAD DATA are still in Hive.
$ hive --database=test -e "SELECT * FROM users"
OK
Time taken: 2.974 seconds
OK
id name
2 alice
5 bob
Time taken: 1.479 seconds, Fetched: 2 row(s)
$ hdfs dfs -ls /user/hive_user/warehouse/test.db/users
Found 1 items
-rwxr-xr-x 3 hadoop supergroup 14 2014-04-30 18:27 /user/hive_user/warehouse/test.db/users/users.tsv
$ hdfs dfs -cat /user/hive_user/warehouse/test.db/users/users.tsv
2 alice
5 bob
#Delete the file itself.
$ hdfs dfs -rm /user/hive_user/warehouse/test.db/users/users.tsv
Moved: 'hdfs://mycluster/user/hive_user/warehouse/test.db/users/users.tsv' to trash at: hdfs://mycluster/user/hadoop/.Trash/Current
$ hive --database=test -e "SELECT * FROM users"
OK
Time taken: 3.181 seconds
OK
id name
Time taken: 1.06 seconds
#Gone.
$ cp -pv /usr/local/hive/lib/mysql-connector-java-5.1.30-bin.jar /usr/local/sqoop/lib/
#Sqoop it!
$ sqoop import --connect "jdbc:mysql://mysql-server:3306/test?zeroDateTimeBehavior=convertToNull&tinyInt1isBit=false" --username hive_user --password hive_password --query "SELECT * FROM users WHERE \$CONDITIONS" --split-by id --target-dir users --hive-table users --hive-database test --hive-import --hive-drop-import-delims --fields-terminated-by "\001"
14/05/02 12:47:44 INFO hive.HiveImport: OK
14/05/02 12:47:44 INFO hive.HiveImport: Time taken: 3.677 seconds
14/05/02 12:47:45 INFO hive.HiveImport: Loading data to table test.users
14/05/02 12:47:45 INFO hive.HiveImport: Table test.users stats: [num_partitions: 0, num_files: 5, num_rows: 0, total_size: 46, raw_data_size: 0]
14/05/02 12:47:45 INFO hive.HiveImport: OK
14/05/02 12:47:45 INFO hive.HiveImport: Time taken: 1.06 seconds
14/05/02 12:47:45 INFO hive.HiveImport: Hive import complete.
14/05/02 12:47:45 INFO hive.HiveImport: Export directory is empty, removing it.
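For reference, this is roughly what --split-by id does behind the scenes: Sqoop asks the database for MIN(id) and MAX(id), slices that range into one interval per mapper, and substitutes each interval's boundary predicate for $CONDITIONS in the query. The following is a simplified, integer-only Python sketch of that splitting, not Sqoop's actual splitter (which also handles text, dates, and other types).

```python
# Cut the [lo, hi] range of the split column into one WHERE clause per mapper,
# the way Sqoop fills in $CONDITIONS for parallel imports.
def integer_splits(lo, hi, num_mappers, col="id"):
    """Return one boundary condition string per mapper over [lo, hi]."""
    span = (hi - lo) / num_mappers
    conditions = []
    for i in range(num_mappers):
        lower = round(lo + i * span)
        upper = round(lo + (i + 1) * span)
        if i == num_mappers - 1:
            conditions.append(f"{col} >= {lower} AND {col} <= {hi}")
        else:
            conditions.append(f"{col} >= {lower} AND {col} < {upper}")
    return conditions

# With our two rows (id 2 and 5) and the default 4 mappers, the ranges are
# tiny and some match no rows at all -- which is why the import produced
# several near-empty part files.
for cond in integer_splits(2, 5, 4):
    print(cond)
```

This also explains the multiple part-m-0000N files in the warehouse directory later: one per mapper, empty ranges included.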
$ hive --database=test -e "SELECT * FROM users"
OK
Time taken: 3.005 seconds
OK
id name
NULL NULL
NULL NULL
Time taken: 1.176 seconds, Fetched: 2 row(s)
#Huh?
$ hdfs dfs -text /user/hive_user/warehouse/test.db/users/part-m-00000.deflate
2alice
#The values are in there.
#Ah, it's because of how I created the table. The delimiter:
$ hive --database=test -e "CREATE TABLE users (id INT, name STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'"
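To see why the mismatched delimiter produced NULLs rather than an error: Hive's delimited-text deserialization splits each line on the configured field delimiter, pads missing columns, and turns a failed cast into NULL instead of failing the query. A minimal Python sketch of that behavior for our (INT, STRING) row; this is an illustration, not Hive's actual SerDe code.

```python
# Deserialize one text row for a two-column (INT, STRING) table,
# mimicking Hive's "bad cast -> NULL, missing column -> NULL" behavior.
def parse_row(line, delim):
    fields = line.rstrip("\n").split(delim)
    fields += [None] * (2 - len(fields))   # pad missing columns with NULL
    raw_id, raw_name = fields[0], fields[1]
    try:
        id_val = int(raw_id)               # a non-numeric string becomes NULL
    except (TypeError, ValueError):
        id_val = None
    return (id_val, raw_name)

line = "2\x01alice"               # what Sqoop actually wrote
print(parse_row(line, "\t"))      # (None, None): no tab, so one big non-int field
print(parse_row(line, "\x01"))    # (2, 'alice'): matching delimiter parses fine
```

So "NULL NULL" was the whole \001-delimited line landing in the first column, failing the INT cast, with nothing left for the second column.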
#Since Sqoop is already using its default \001, change the table to match.
$ hive --database=test -e "ALTER TABLE users SET SERDEPROPERTIES ('field.delim' = '\001')"
OK
Time taken: 3.158 seconds
OK
Time taken: 0.391 seconds
Delete the files again...
$ hdfs dfs -rm /user/hive_user/warehouse/test.db/users/*
Moved: 'hdfs://mycluster/user/hive_user/warehouse/test.db/users/_SUCCESS' to trash at: hdfs://mycluster/user/hadoop/.Trash/Current
Moved: 'hdfs://mycluster/user/hive_user/warehouse/test.db/users/part-m-00000.deflate' to trash at: hdfs://mycluster/user/hadoop/.Trash/Current
Moved: 'hdfs://mycluster/user/hive_user/warehouse/test.db/users/part-m-00001.deflate' to trash at: hdfs://mycluster/user/hadoop/.Trash/Current
Moved: 'hdfs://mycluster/user/hive_user/warehouse/test.db/users/part-m-00002.deflate' to trash at: hdfs://mycluster/user/hadoop/.Trash/Current
Moved: 'hdfs://mycluster/user/hive_user/warehouse/test.db/users/part-m-00003.deflate' to trash at: hdfs://mycluster/user/hadoop/.Trash/Current
#Confirm they're gone...
$ hive --database=test -e "SELECT * FROM users"
OK
Time taken: 3.097 seconds
OK
id name
Time taken: 1.103 seconds
#Sqoop it again!
$ sqoop import --connect "jdbc:mysql://mysql-server:3306/test?zeroDateTimeBehavior=convertToNull&tinyInt1isBit=false" --username hive_user --password hive_password --query "SELECT * FROM users WHERE \$CONDITIONS" --split-by id --target-dir users --hive-table users --hive-database test --hive-import --hive-drop-import-delims --fields-terminated-by "\001"
14/05/02 13:08:37 INFO hive.HiveImport: OK
14/05/02 13:08:37 INFO hive.HiveImport: Time taken: 3.598 seconds
14/05/02 13:08:37 INFO hive.HiveImport: Loading data to table test.users
14/05/02 13:08:38 INFO hive.HiveImport: Table test.users stats: [num_partitions: 0, num_files: 5, num_rows: 0, total_size: 46, raw_data_size: 0]
14/05/02 13:08:38 INFO hive.HiveImport: OK
14/05/02 13:08:38 INFO hive.HiveImport: Time taken: 1.032 seconds
14/05/02 13:08:38 INFO hive.HiveImport: Hive import complete.
14/05/02 13:08:38 INFO hive.HiveImport: Export directory is empty, removing it.
#Check. Phew. The table was created automatically and the data is in.
$ hive --database=test -e "SELECT * FROM users"
OK
Time taken: 3.245 seconds
OK
id name
2 alice
5 bob
Time taken: 1.242 seconds, Fetched: 2 row(s)
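One flag in the import command is worth a note: --hive-drop-import-delims. Hive's delimited text format has no escaping, so a string value that happens to contain \n, \r, or \001 would silently corrupt rows or split them in half; this flag tells Sqoop to strip those characters from string fields at import time. A minimal Python sketch of that cleaning step, as an illustration rather than Sqoop's actual code.

```python
# Characters that Hive's text format would interpret as row/field boundaries;
# --hive-drop-import-delims removes them from imported string fields.
HIVE_DELIMS = ("\n", "\r", "\x01")

def drop_hive_delims(value):
    for ch in HIVE_DELIMS:
        value = value.replace(ch, "")
    return value

print(drop_hive_delims("ali\nce\x01"))  # alice
```

For our plain ASCII test rows it changes nothing, but on real data with embedded newlines it is the difference between 2 rows and a garbled table.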
The files live here.
$ hdfs dfs -ls /user/hive_user/warehouse/test.db/users/
Found 5 items
-rwxr-xr-x 3 hadoop supergroup 0 2014-05-02 13:08 /user/hive_user/warehouse/test.db/users/_SUCCESS
-rwxr-xr-x 3 hadoop supergroup 16 2014-05-02 13:08 /user/hive_user/warehouse/test.db/users/part-m-00000.deflate
-rwxr-xr-x 3 hadoop supergroup 8 2014-05-02 13:08 /user/hive_user/warehouse/test.db/users/part-m-00001.deflate
-rwxr-xr-x 3 hadoop supergroup 8 2014-05-02 13:08 /user/hive_user/warehouse/test.db/users/part-m-00002.deflate
-rwxr-xr-x 3 hadoop supergroup 14 2014-05-02 13:08 /user/hive_user/warehouse/test.db/users/part-m-00003.deflate
#If you see an error like this, make sure HIVE_HOME is set properly:
14/05/02 13:46:22 ERROR tool.ImportTool: Encountered IOException running import job: java.io.IOException: Cannot run program "hive": error=2, No such file or directory
at java.lang.ProcessBuilder.start(ProcessBuilder.java:1041)