Operation Manual for Task 1 of the New-Generation Big Data Technology Competition

Essential Linux Operations

Extract a *.tar.gz package

#tar -zxvf {path} -C {out path}
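For example, to extract a Hadoop tarball into /usr/local (the file name and paths here are illustrative, not fixed by the competition environment):

#tar -zxvf /usr/local/hadoop-2.7.7.tar.gz -C /usr/local/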

SSH setup (it may already be configured; if so, just look at the last two steps)

Install SSH

#yum install -y openssl openssh-server

Generate a key pair

#ssh-keygen -t rsa

Append the public key to authorized_keys

#cd /root/.ssh/
#cat id_rsa.pub >> authorized_keys
Change the permissions
#chmod 644 authorized_keys
Restart the ssh service
#systemctl restart sshd.service

Copy files or directories over SSH

Single file
# scp path ip:path
Directory
# scp -r path ip:path

Passwordless login between multiple servers

After every server has completed the first two steps above
#scp /root/.ssh/authorized_keys <server2-ip>:/root/.ssh/authorized_keys
#scp /root/.ssh/authorized_keys <server3-ip>:/root/.ssh/authorized_keys

Log in to / log out of another server

#ssh <server-ip>
#exit

Disable the firewall

Stop the firewall
#systemctl stop firewalld.service
Disable the firewall at boot
#systemctl disable firewalld.service

View the IP address

#ip a

vim operations

Edit a document or configuration file

#vi path

Search (needed when configuring Hive)

After opening a file with vi, type /str and press Enter to jump to the first occurrence of str
Press n to find the next occurrence
Press N to find the previous occurrence

Exit

Shift+ZZ or :wq    save and exit
:q!    exit without saving

System environment variables

Edit the environment variables

vi ~/.bashrc

Add an environment variable (an example)

export PATH=$PATH:/usr/local/jdk1.8.0_221/bin

Apply the environment variables

source ~/.bashrc
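A minimal sketch of what the JDK entries in ~/.bashrc might look like, assuming the JDK was extracted to /usr/local/jdk1.8.0_221 (adjust the path to your installation):

export JAVA_HOME=/usr/local/jdk1.8.0_221
export PATH=$PATH:$JAVA_HOME/bin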

Change the owner of a file or directory

# chown -R <username> path

Install ZooKeeper

Extract ZooKeeper to the target location

Enter the ZooKeeper conf directory

cd /usr/local/zookeeper/conf
Rename the sample configuration file
mv zoo_sample.cfg zoo.cfg

Edit the configuration file

vi zoo.cfg
Add the following configuration
dataDir=/usr/local/zookeeper/data
dataLogDir=/usr/local/zookeeper/logs
server.1=<server1-ip>:2888:3888
server.2=<server2-ip>:2888:3888
server.3=<server3-ip>:2888:3888

Create the data and logs directories

cd /usr/local/zookeeper
mkdir data
mkdir logs

Add the machine ID

cd /usr/local/zookeeper/data
vi myid
Write this machine's number, e.g. 1, 2, or 3
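Equivalently, the file can be written in a single command (a sketch; the number must match this machine's server.N entry in zoo.cfg):

echo 1 > /usr/local/zookeeper/data/myid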

Distribute the ZooKeeper directory to the other servers

scp -r /usr/local/zookeeper <server-ip>:/usr/local/
After copying, set the machine ID on each server as in the previous step

Configure the environment variables and distribute them to each server
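A minimal sketch, assuming ZooKeeper lives in /usr/local/zookeeper and ~/.bashrc holds the environment variables (the scp targets are placeholders):

export ZOOKEEPER_HOME=/usr/local/zookeeper
export PATH=$PATH:$ZOOKEEPER_HOME/bin
scp ~/.bashrc <server2-ip>:~/.bashrc
scp ~/.bashrc <server3-ip>:~/.bashrc

Remember to run source ~/.bashrc on every server afterwards.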

Start ZooKeeper on every server

zkServer.sh start
zkServer.sh status
If this reports an error, the firewall may not have been disabled
zkServer.sh stop    stops ZooKeeper

ZooKeeper must be started on every server. It automatically elects one node as the leader; the others become followers.

Install and configure Hadoop

Extract Hadoop

Modify the configuration files

In hadoop-env.sh, mapred-env.sh, and yarn-env.sh,

set your JDK path (this property may be commented out; remember to uncomment it by removing the leading #).
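For example, in hadoop-env.sh (a sketch assuming the JDK path used earlier in this manual; adjust to yours):

export JAVA_HOME=/usr/local/jdk1.8.0_221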

Modify the HDFS configuration files

Configure core-site.xml

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. See accompanying LICENSE file.
-->
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
<name>fs.defaultFS</name>
<!-- Combine the addresses of the two NameNodes into a single cluster named cluster1 -->
<value>hdfs://cluster1</value>
</property>
<!-- This value is the default HDFS path; the NameNode starts on whichever machine this is configured -->
<property>
<name>hadoop.tmp.dir</name>
<value>/usr/hadoop/data/tmp</value>
</property>
<!-- Hadoop's temporary directory; multiple directories must be comma-separated, and the data directory has to be created manually -->
<!-- Specify automatic ZKFC failover -->
<property>
<name>ha.zookeeper.quorum</name>
<value>master:2181,salve1:2181,salve2:2181</value>
</property>
<!-- Configure ZooKeeper to manage HDFS -->

</configuration>

Configure hdfs-site.xml

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. See accompanying LICENSE file.
-->

<!-- Put site-specific property overrides in this file. -->

<configuration>
<property>
<!-- Number of replicas; the default is already 3, so this can be left unconfigured -->
<name>dfs.replication</name>
<value>3</value>
</property>
<property>
<!-- Disable HDFS permission checking (set to false) -->
<name>dfs.permissions.enabled</name>
<value>false</value>
</property>
<property>
<!-- Nameservice ID; its value must match fs.defaultFS. With NameNode HA there are two NameNodes, and cluster1 is the unified entry point exposed to clients -->
<name>dfs.nameservices</name>
<value>cluster1</value>
</property>
<property>
<!-- The NameNodes that belong to nameservice cluster1; these are logical names that can be chosen freely as long as they do not repeat -->
<name>dfs.ha.namenodes.cluster1</name>
<value>master,salve1</value>
</property>
<property>
<!-- RPC address of master, i.e. the host where master runs -->
<name>dfs.namenode.rpc-address.cluster1.master</name>
<value>master:9000</value>
</property>
<property>
<!-- HTTP address of master, used for external access -->
<name>dfs.namenode.http-address.cluster1.master</name>
<value>master:50070</value>
</property>
<property>
<!-- RPC address of salve1, i.e. the host where salve1 runs -->
<name>dfs.namenode.rpc-address.cluster1.salve1</name>
<value>salve1:9000</value>
</property>
<property>
<!-- HTTP address of salve1, used for external access -->
<name>dfs.namenode.http-address.cluster1.salve1</name>
<value>salve1:50070</value>
</property>
<property>
<!-- Where the NameNode metadata is stored on the JournalNodes (usually co-located with ZooKeeper) -->
<name>dfs.namenode.shared.edits.dir</name>
<value>qjournal://master:8485;salve1:8485;salve2:8485/cluster1</value>
</property>
<property>
<!-- Where each JournalNode stores the shared NameNode edits on its own local disk -->
<name>dfs.journalnode.edits.dir</name>
<value>/usr/local/hadoop/journaldata</value>
</property>
<property>
<!-- The Java class HDFS clients use, via a proxy, to reach the NameNode and determine which one is Active; it is responsible for failover when cluster1 has a failure -->
<name>dfs.client.failover.proxy.provider.cluster1</name>
<value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
</property>
<property>
<!-- The fencing method used during automatic failover. Several values are possible (see the official docs): shell(/bin/true), a script that does nothing and returns 0, or sshfence, which logs in remotely and kills the process -->
<name>dfs.ha.fencing.methods</name>
<value>shell(/bin/true)</value>
</property>
<property>
<!-- Only needed when the sshfence method is used; it requires passwordless SSH -->
<name>dfs.ha.fencing.ssh.private-key-files</name>
<value>/home/hadoop/.ssh/id_rsa</value>
</property>
<property>
<!-- Timeout for the sshfence method; like the property above, it can be omitted when the shell method is used -->
<name>dfs.ha.fencing.ssh.connect-timeout</name>
<value>10000</value>
</property>
<property>
<!-- Enable automatic failover -->
<name>dfs.ha.automatic-failover.enabled</name>
<value>true</value>
</property>
</configuration>

Configure slaves

Which nodes run DataNode and NodeManager is determined by the slaves file. (It contains localhost by default; just delete it.)

Whichever hosts should run DataNode are the ones listed in the slaves file: when HDFS starts, it reads the configuration, scans the slaves file, and starts a DataNode on every host listed there. Likewise, when YARN starts, it scans the slaves file and starts a NodeManager on every host listed there.

vi slaves
Write the hostnames of the nodes that should run DataNode and NodeManager, one per line (see the example below).
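For example, assuming all three machines used in this manual should run DataNode and NodeManager (adjust to your own cluster plan), the slaves file would simply list the hostnames:

master
salve1
salve2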

Modify the YARN configuration files

Configure mapred-site.xml

<configuration>
<property>
<!-- Run MapReduce on YARN -->
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>

Configure yarn-site.xml

<?xml version="1.0"?>
<!--
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. See accompanying LICENSE file.
-->
<configuration>

<!-- Site specific YARN configuration properties -->
<property>
<!-- How reducers fetch data -->
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<!-- Enable ResourceManager HA -->
<name>yarn.resourcemanager.ha.enabled</name>
<value>true</value>
</property>
<property>
<!-- Declare the two ResourceManagers (cluster id) -->
<name>yarn.resourcemanager.cluster-id</name>
<value>rmCluster</value>
</property>
<property>
<name>yarn.resourcemanager.ha.rm-ids</name>
<value>rm1,rm2</value>
</property>
<property>
<name>yarn.resourcemanager.hostname.rm1</name>
<value>master</value>
</property>
<property>
<name>yarn.resourcemanager.hostname.rm2</name>
<value>salve1</value>
</property>
<property>
<!-- The address of the ZooKeeper cluster -->
<name>yarn.resourcemanager.zk-address</name>
<value>master:2181,salve1:2181,salve2:2181</value>
</property>
<property>
<!-- Enable automatic recovery -->
<name>yarn.resourcemanager.recovery.enabled</name>
<value>true</value>
</property>
<property>
<!-- Store the ResourceManager state in the ZooKeeper cluster -->
<name>yarn.resourcemanager.store.class</name>
<value>org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore</value>
</property>
</configuration>

Distribute the Hadoop directory to the other servers

Modify the environment variables
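A minimal sketch of the Hadoop entries in ~/.bashrc, assuming Hadoop was extracted to /usr/local/hadoop; both bin and sbin are needed on PATH so that commands such as start-dfs.sh can be run directly:

export HADOOP_HOME=/usr/local/hadoop
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin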

Start the JournalNode service

Starting the JournalNodes creates the directory where the NameNode metadata will be stored
[hadoop@hadoop1 ~]$ hadoop-daemon.sh start journalnode
[hadoop@hadoop2 ~]$ hadoop-daemon.sh start journalnode
[hadoop@hadoop3 ~]$ hadoop-daemon.sh start journalnode
Confirm:
jps shows a JournalNode process on all three nodes
The directory has been created under the primary NameNode's data directory

Format the primary NameNode

Run the following on the primary NameNode node

hdfs namenode -format

Check whether the tmp directory has been generated

Start the primary NameNode

hadoop-daemon.sh start namenode

Synchronize the primary NameNode's (hadoop1) metadata to the standby NameNode (hadoop2)

Run on the standby node
hdfs namenode -bootstrapStandby
Confirm that the directory exists

Format ZKFC only on the primary NameNode to create the znode

hdfs zkfc -formatZK

Stop the primary NameNode and all JournalNode processes

[hadoop@hadoop1 ~]$ hadoop-daemon.sh stop namenode
[hadoop@hadoop1 ~]$ hadoop-daemon.sh stop journalnode
[hadoop@hadoop2 ~]$ hadoop-daemon.sh stop journalnode
[hadoop@hadoop3 ~]$ hadoop-daemon.sh stop journalnode
jps should now show only the ZooKeeper process

Start HDFS and check the processes with jps

start-dfs.sh

There may be no DataNode process

This happens when the DataNode's clusterID does not match the NameNode's clusterID.
Check the clusterID on the NameNode node:
there is a VERSION file in each current directory; overwrite the clusterID in the VERSION file under the data directory with the clusterID from the VERSION file under the name directory.
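A sketch of the fix, assuming the default layout under the hadoop.tmp.dir configured above (/usr/hadoop/data/tmp); the exact paths depend on your configuration:

cat /usr/hadoop/data/tmp/dfs/name/current/VERSION    # note the clusterID on the NameNode
vi /usr/hadoop/data/tmp/dfs/data/current/VERSION     # replace its clusterID with the NameNode's value

Then restart HDFS.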

Start the YARN cluster

Run on the primary node
start-yarn.sh

Manually start the ResourceManager on the standby node

yarn-daemon.sh start resourcemanager

Web UIs

HDFS UI
Enter ip:50070 in a browser
YARN UI
Enter ip:8088 in a browser

After all configuration is complete, Hadoop can be started and stopped with

start-all.sh
stop-all.sh

Hive + MySQL configuration

Install MySQL

Enter the download directory
cd /usr/local

Install MySQL
#wget http://repo.mysql.com/mysql57-community-release-el7-8.noarch.rpm
#rpm -ivh mysql57-community-release-el7-8.noarch.rpm
#yum -y install mysql-server

Restart MySQL

#service mysqld restart

Get the initial random password

#grep "password" /var/log/mysqld.log

Log in to MySQL

#mysql -u root -p
Enter the random password obtained above

Change the password

mysql> ALTER USER 'root'@'localhost' IDENTIFIED BY '12Zhang.';
Query OK, 0 rows affected (0.00 sec)
# Edit /etc/my.cnf and add the following line to disable the password policy
# validate_password=off
# Restart the MySQL service
[root@master opt]$ systemctl restart mysqld
# Log in to MySQL again and change the password
[root@master ~]$ mysql -uroot -p
mysql> ALTER USER 'root'@'localhost' IDENTIFIED BY 'root';

Allow remote login for the root user

By default the root account can only log in locally. To connect to MySQL from another machine, you must either allow remote connections for root or add an account that allows remote connections.

# Log in to MySQL
[root@master ~]$ mysql -uroot -proot
# Grant root remote access
# 'root' before @ is the username, % means any host may connect, and 'root' after IDENTIFIED BY is the password used for remote access
mysql> GRANT ALL PRIVILEGES ON *.* TO 'root'@'%' IDENTIFIED BY 'root' WITH GRANT OPTION;
# FLUSH PRIVILEGES reloads MySQL's privilege tables so the change takes effect immediately; otherwise restart the server
mysql> FLUSH PRIVILEGES;
# Exit
mysql> exit;

Install Hive

Extract the package

tar -zxvf apache-hive-2.3.3-bin.tar.gz
mv apache-hive-2.3.3-bin hive-2.3.3

Configure the environment variables (not repeated here)

Configure Hive

cd hive-2.3.3/conf/
cp hive-env.sh.template hive-env.sh
cp hive-default.xml.template hive-site.xml
cp hive-log4j2.properties.template hive-log4j2.properties
cp hive-exec-log4j2.properties.template hive-exec-log4j2.properties

None of Hive's configuration files need to be written from scratch; the package already ships templates for all of them, so just rename the templates and edit them directly.

Modify hive-env.sh

export JAVA_HOME=/home/dc2-user/java/jdk1.8.0_191 ## Java path
export HADOOP_HOME=/home/dc2-user/hadoop-2.7.7 ## Hadoop installation path
export HIVE_HOME=/home/dc2-user/apache-hive-2.3.4-bin ## Hive installation path
export HIVE_CONF_DIR=$HIVE_HOME/conf ## Hive configuration path

Modify hive-site.xml

This file contains a great many configuration items, but only a few property values need to be changed.
Use vim's search feature (see the section at the top of this document) to locate them.

vi hive-site.xml 
<property>
<name>hive.exec.scratchdir</name>
<value>/tmp/hive-${user.name}</value>
<description>HDFS root scratch dir for Hive jobs which gets created with write all (733) permission. For each connecting user, an HDFS scratch dir: ${hive.exec.scratchdir}/<username> is created, with ${hive.scratch.dir.permission}.</description>
</property>
<property>
<name>hive.exec.local.scratchdir</name>
<value>/tmp/${user.name}</value>
<description>Local scratch space for Hive jobs</description>
</property>
<property>
<name>hive.downloaded.resources.dir</name>
<value>/tmp/hive/resources</value>
<description>Temporary local directory for added resources in the remote file system.</description>
</property>
<property>
<name>hive.querylog.location</name>
<value>/tmp/${user.name}</value>
<description>Location of Hive run time structured log file</description>
</property>
<property>
<name>hive.server2.logging.operation.log.location</name>
<value>/tmp/${user.name}/operation_logs</value>
<description>Top level directory where operation logs are stored if logging functionality is enabled</description>
</property>

Configure the Hive Metastore

The Hive Metastore holds the metadata for Hive tables and partitions; in this setup the MySQL instance installed above is used to store it.
Put mysql-connector-java-5.1.40-bin.jar into $HIVE_HOME/lib and configure the MySQL connection information in hive-site.xml.
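A sketch of copying the connector (the jar version and source path depend on what you downloaded):

cp /usr/local/mysql-connector-java-5.1.40-bin.jar $HIVE_HOME/lib/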

<property> 
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:mysql://master:3306/hive?createDatabaseIfNotExist=true</value>
</property>
<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>com.mysql.jdbc.Driver</value>
</property>
<property>
<name>javax.jdo.option.ConnectionUserName</name>
<value>root</value>
</property>
<property>
<name>javax.jdo.option.ConnectionPassword</name>
<value>root</value>
</property>
<!-- Enable remote metastore connections -->
<property>
<name>hive.metastore.uris</name>
<value>thrift://master:9083</value>
</property>
<property>
<name>datanucleus.schema.autoCreateAll</name>
<value>true</value>
</property>

Create HDFS directories for Hive

start-all.sh    # can be omitted if Hadoop was already started while installing and configuring it
hdfs dfs -mkdir /tmp
hdfs dfs -mkdir -p /usr/hive/warehouse
hdfs dfs -chmod g+w /tmp
hdfs dfs -chmod g+w /usr/hive/warehouse

Start Hive

[root@master lib]# cd $HIVE_HOME/bin
[root@master bin]# schematool -initSchema -dbType mysql
#The following output indicates that initialization succeeded
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/SoftWare/Hive/hive-2.3.2/lib/log4j-slf4j-impl-2.6.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/SoftWare/Hadoop/hadoop-2.7.5/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
Metastore connection URL: jdbc:mysql://master:3306/metastore?createDatabaseIfNotExist=true
Metastore Connection Driver : com.mysql.jdbc.Driver
Metastore connection User: root
Sun Mar 04 15:30:33 CST 2018 WARN: Establishing SSL connection without server's identity verification is not recommended. According to MySQL 5.5.45+, 5.6.26+ and 5.7.6+ requirements SSL connection must be established by default if explicit option isn't set. For compliance with existing applications not using SSL the verifyServerCertificate property is set to 'false'. You need either to explicitly disable SSL by setting useSSL=false, or set useSSL=true and provide truststore for server certificate verification.
Starting metastore schema initialization to 2.3.0
Initialization script hive-schema-2.3.0.mysql.sql
Sun Mar 04 15:30:34 CST 2018 WARN: Establishing SSL connection without server's identity verification is not recommended. According to MySQL 5.5.45+, 5.6.26+ and 5.7.6+ requirements SSL connection must be established by default if explicit option isn't set. For compliance with existing applications not using SSL the verifyServerCertificate property is set to 'false'. You need either to explicitly disable SSL by setting useSSL=false, or set useSSL=true and provide truststore for server certificate verification.
Initialization script completed
Sun Mar 04 15:30:36 CST 2018 WARN: Establishing SSL connection without server's identity verification is not recommended. According to MySQL 5.5.45+, 5.6.26+ and 5.7.6+ requirements SSL connection must be established by default if explicit option isn't set. For compliance with existing applications not using SSL the verifyServerCertificate property is set to 'false'. You need either to explicitly disable SSL by setting useSSL=false, or set useSSL=true and provide truststore for server certificate verification.
schemaTool completed
#Start the Hive server (metastore)
[root@master bin]# hive --service metastore
#Start the Hive client
[root@master bin]# hive
#Type show tables; if the following output appears, Hive has started
hive> show tables;
OK
Time taken: 1.594 seconds

Version error

If a schema version error is reported, change the hive.metastore.schema.verification property in the configuration file to false
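For example, in hive-site.xml (locate the property with vim's search as described at the top of this document):

<property>
<name>hive.metastore.schema.verification</name>
<value>false</value>
</property>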

Deploy the Sqoop component

Modify the configuration file

In the sqoop-env.sh file under conf,
configure the Hadoop, Hive, and HBase environment variables (see the sketch below).
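A minimal sketch of sqoop-env.sh, assuming the installation paths used elsewhere in this manual (adjust to your layout; the HBase line can be omitted if HBase is not installed):

export HADOOP_COMMON_HOME=/usr/local/hadoop
export HADOOP_MAPRED_HOME=/usr/local/hadoop
export HIVE_HOME=/usr/local/hive-2.3.3
export HBASE_HOME=/usr/local/hbase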

Copy mysql-connector-java-* to $SQOOP_HOME/lib

cp mysql-connector-java-5.1.44-bin.jar $SQOOP_HOME/lib

Modify the environment variables

Test Sqoop by connecting to the MySQL database

sqoop list-databases --connect jdbc:mysql://master:3306/ --username root --password root

If, after a stream of warnings and prompts, the command prints the list of databases, the configuration is complete.


References

Thanks to all the authors below for sharing.
https://blog.csdn.net/baidu_36414377/article/details/85554824
https://blog.csdn.net/l1028386804/article/details/88014099
https://blog.csdn.net/qq_32808045/article/details/77152496
https://www.cnblogs.com/liuhouhou/p/8975812.html
https://www.cnblogs.com/zlslch/p/9190906.html
https://blog.csdn.net/gywtzh0889/article/details/52818429
https://blog.csdn.net/qq_39142369/article/details/90442686
https://blog.csdn.net/weixin_41804049/article/details/81637750
https://blog.csdn.net/hhj724/article/details/79094138
https://www.cnblogs.com/wormgod/p/10761933.html
https://blog.csdn.net/cndmss/article/details/80149952
https://blog.csdn.net/adamlinsfz/article/details/84108536
https://blog.csdn.net/adamlinsfz/article/details/84333389
https://www.cnblogs.com/Jomini/p/10749657.html
https://blog.csdn.net/lxyuuuuu/article/details/83109287
https://blog.csdn.net/yumingzhu1/article/details/80678525
https://blog.csdn.net/dwld_3090271/article/details/50747639
https://blog.csdn.net/tao_wei162/article/details/84763035
