Overview
An HDFS URL setting item appears when you configure a Transwarp Inceptor data connection or a Hadoop Hive data connection, as shown in the following figure.
The HDFS URL is described as follows.
Enter the address of an active node in the Hadoop HDFS file system.
Fill in the address in the format hdfs://IP address:Port number. For example, hdfs://192.168.101.119:8020.
This document describes how to determine the IP address and the port number in the HDFS address.
Procedure
Determining the Port Number
Execute the following SQL statement in the database: desc formatted Database name.Table name. Then check the Location row in the query result.
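For example, for a table named sales in a database named mydb (hypothetical names used for illustration), you would execute:
desc formatted mydb.sales;
The Location row in the result contains an HDFS path of the form hdfs://Hostname:Port number/warehouse path, from which you can read the hostname and the port number.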
Case One
You can get the port number and the hostname from the query result, as shown in the following figure, where the port number is 9000 and the hostname is hive1.
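For reference, a Location row of this form reads as follows (the warehouse path is a hypothetical example):
Location: hdfs://hive1:9000/user/hive/warehouse/mydb.db/sales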
Case Two
In some scenarios, the Location row in the query result does not contain a port number, as shown in the following figure, where the hostname is HDFS-HA. In this case, you can use the default port number 8020.
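A Location row without a port number looks similar to the following (the warehouse path is a hypothetical example):
Location: hdfs://HDFS-HA/user/hive/warehouse/mydb.db/sales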
Case Three
Sometimes, you cannot confirm the port number from the Location row.
This occurs in a high availability (HA) HDFS cluster. An HA HDFS cluster contains two NameNodes of equal status: at any given time, one is in the Active state and the other is in the Standby state. The Active NameNode serves all client requests, while the Standby NameNode maintains enough state to provide a fast failover if necessary. You can confirm the HDFS node address through the hdfs-site.xml file. A typical configuration looks as follows, and you can obtain the NameNode's RPC port from it.
hdfs-site.xml
<!-- Configure the cluster identifier for the HA setup (mycluster). -->
<property>
  <name>dfs.nameservices</name>
  <value>mycluster</value>
</property>
<!-- NameNode identifiers -->
<property>
  <name>dfs.ha.namenodes.mycluster</name>
  <value>nn1,nn2</value>
</property>
<!-- RPC communication port number for NameNode1 -->
<property>
  <name>dfs.namenode.rpc-address.mycluster.nn1</name>
  <value>node00:8020</value>
</property>
<!-- RPC communication port number for NameNode2 -->
<property>
  <name>dfs.namenode.rpc-address.mycluster.nn2</name>
  <value>node01:8020</value>
</property>
<!-- HTTP communication port number for NameNode1 -->
<property>
  <name>dfs.namenode.http-address.mycluster.nn1</name>
  <value>node00:50070</value>
</property>
<!-- HTTP communication port number for NameNode2 -->
<property>
  <name>dfs.namenode.http-address.mycluster.nn2</name>
  <value>node01:50070</value>
</property>
<!-- Configure the shared edits storage for the JournalNodes (JN). -->
<property>
  <name>dfs.namenode.shared.edits.dir</name>
  <value>qjournal://node00:8485;node01:8485;node02:8485/mycluster</value>
</property>
<!-- Configure the failover proxy provider class. -->
<property>
  <name>dfs.client.failover.proxy.provider.mycluster</name>
  <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
</property>
<!-- Configure a fencing method to kill the previous Active NameNode during failover. -->
<property>
  <name>dfs.ha.fencing.methods</name>
  <value>sshfence</value>
</property>
<!-- Configure the SSH private key for fencing. -->
<property>
  <name>dfs.ha.fencing.ssh.private-key-files</name>
  <value>/root/.ssh/id_rsa</value>
</property>
<!-- Directory for JN metadata storage -->
<property>
  <name>dfs.journalnode.edits.dir</name>
  <value>/opt/software/hadoop/hdfs/journalnode/data</value>
</property>
<!-- Enable automatic failover when the Active NameNode fails. -->
<property>
  <name>dfs.ha.automatic-failover.enabled</name>
  <value>true</value>
</property>
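If you have command-line access to a cluster node, you can also read these keys with the hdfs getconf tool instead of opening the file by hand (a sketch based on the mycluster configuration above):
# List the NameNode hosts of the cluster.
hdfs getconf -namenodes
# Read the RPC address (hostname:port) of a specific NameNode.
hdfs getconf -confKey dfs.namenode.rpc-address.mycluster.nn1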
Obtaining the IP Address
You can use the ping command on the database server to check connectivity to the corresponding hostname and resolve it to an IP address.
As the hostname in Case One is hive1, you can perform the ping test from the server where the Hive database is located, as shown in the following figure.
As shown in the above figure, the IP address of HDFS is 192.168.101.243.
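For reference, the ping output typically takes a form like this (the reply lines are illustrative):
$ ping hive1
PING hive1 (192.168.101.243) 56(84) bytes of data.
64 bytes from hive1 (192.168.101.243): icmp_seq=1 ttl=64 time=0.21 ms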
For Case Two, you can query the active IP address on the CDH platform, which is 192.168.9.188, as shown in the following figure.
In Case Three, an HA HDFS cluster contains one Active NameNode and one Standby NameNode, and only the Active NameNode can be connected to. You need to use the HA connection method to ensure that your connection is always directed to the Active NameNode, even if a failover occurs. The configuration steps are described below.
Note: High availability configuration is not supported in FineDataLink versions earlier than 4.1.13.2. In these versions, you need to confirm the address of the Active NameNode.
Use the following command to query the Active and Standby NameNode status.
Commonly used services provide their own command-line tools (such as the hdfs CLI), through which you can determine whether a NameNode is Active or Standby by its node ID.
HDFS NameNode status is queried with the hdfs haadmin tool:
hdfs haadmin
You can check the HDFS parameters on the CDH platform. The default path of the HDFS configuration file is /etc/hadoop/conf.cloudera.hdfs/hdfs-site.xml, in which you can find the high-availability-related configuration.
High availability configuration:
<property>
  <name>dfs.ha.namenodes.sdg</name>
  <value>namenode25,namenode30</value>
</property>
In the above example, there are two NameNodes: one is the Active NameNode, and the other is the Standby NameNode.
You can use the hdfs haadmin -getServiceState command to check the node status.
$ hdfs haadmin -getServiceState namenode30
active
$ hdfs haadmin -getServiceState namenode25
standby
You can then confirm the IP address of the Active NameNode (namenode30 in this example).
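A sketch of how to resolve the Active NameNode to an IP address, assuming the sdg nameservice above (the key name follows the standard dfs.namenode.rpc-address.Nameservice.Node ID pattern; verify it against your hdfs-site.xml):
# Read the RPC address (hostname:port) of the Active NameNode namenode30.
hdfs getconf -confKey dfs.namenode.rpc-address.sdg.namenode30
# Resolve the returned hostname to an IP address, for example with ping.
ping <hostname returned by the previous command>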
Note: FineDataLink of 4.1.13.2 and later versions supports high availability configuration.
| Configuration Item | Description |
| --- | --- |
| HDFS Address | FineDataLink of 4.1.13.2 and later versions supports the configuration of multiple HDFS addresses. Separate the addresses with commas (,). For example, hdfs://IP address 1:Port number 1,hdfs://IP address 2:Port number 2,hdfs://IP address 3:Port number 3. The system uses the multiple HDFS addresses you configured to construct the configuration file for connecting to HDFS. |
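For example, an HA cluster with two NameNodes could be configured as follows (hypothetical IP addresses, both nodes on the default RPC port 8020):
hdfs://192.168.9.25:8020,hdfs://192.168.9.30:8020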