Confirming the IP Address and the Port Number in HDFS URL

  • Last update: March 11, 2025
  • Overview

    An HDFS URL setting item appears when you configure a Transwarp Inceptor data connection or a Hadoop Hive data connection, as shown in the following figure.

[Figure: HDFS URL setting item in the data connection configuration]

    The HDFS URL is described as follows.

    • Enter the address of an active node in the Hadoop HDFS file system.

• Fill in the address in the format hdfs://IP address:Port number, for example, hdfs://192.168.101.119:8020.

This document describes how to determine the IP address and the port number in the HDFS URL.

    Procedure

    Determining the Port Number

Execute the SQL statement desc formatted Database name.Table name in the database, and check the Location row in the query result.
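For example, with a hypothetical database sales_db and table orders, the statement would be:

-- Database and table names here are placeholders; substitute your own.
desc formatted sales_db.orders;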

    Case One

    You can get the port number and the hostname from the query result, as shown in the following figure, where the port number is 9000 and the hostname is hive1.

[Figure: query result whose Location row shows port 9000 and hostname hive1]
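In such a result, the Location row looks roughly as follows (the host and port match the figure; the warehouse path is illustrative):

Location: hdfs://hive1:9000/user/hive/warehouse/sales_db.db/orders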

    Case Two

    In some scenarios, the Location row in the query result does not contain a port number, as shown in the following figure, where the hostname is HDFS-HA. In this case, you can use the default port number 8020.

[Figure: query result whose Location row shows hostname HDFS-HA and no port number]
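Here the Location row carries only the hostname, roughly as follows (the path is illustrative):

Location: hdfs://HDFS-HA/user/hive/warehouse/sales_db.db/orders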

    Case Three

    Sometimes, you cannot confirm the port number from the Location row.

[Figure: query result where the port number cannot be confirmed from the Location row]

This occurs in a high availability (HA) HDFS cluster. An HA HDFS cluster contains two NameNodes with equal status: at any given time, one is in the Active state and the other is in the Standby state. The Active NameNode serves all client requests, while the Standby NameNode maintains enough state to provide a fast failover if necessary. You can confirm the HDFS node address through the hdfs-site.xml file. The configuration in the file typically looks as follows, and you can obtain the NameNode's RPC port from it.

hdfs-site.xml 
<!-- Configure the cluster identifier for the HA setup (mycluster). -->
<property>
  <name>dfs.nameservices</name>
  <value>mycluster</value>
</property>
<!-- NameNode identifiers -->
<property>
  <name>dfs.ha.namenodes.mycluster</name>
  <value>nn1,nn2</value>
</property>
<!-- RPC communication port number for NameNode1 -->
<property>
  <name>dfs.namenode.rpc-address.mycluster.nn1</name>
  <value>node00:8020</value>
</property>
<!-- RPC communication port number for NameNode2 -->
<property>
  <name>dfs.namenode.rpc-address.mycluster.nn2</name>
  <value>node01:8020</value>
</property>
<!-- HTTP communication port number for NameNode1 -->
<property>
  <name>dfs.namenode.http-address.mycluster.nn1</name>
  <value>node00:50070</value>
</property>
<!-- HTTP communication port number for NameNode2 -->
<property>
  <name>dfs.namenode.http-address.mycluster.nn2</name>
  <value>node01:50070</value>
</property>
<!-- Configure shared storage for the JournalNodes (JN). -->
<property>
  <name>dfs.namenode.shared.edits.dir</name>
  <value>qjournal://node00:8485;node01:8485;node02:8485/mycluster</value>
</property>
<!-- Configure the failover proxy provider class. -->
<property>
  <name>dfs.client.failover.proxy.provider.mycluster</name>
  <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
</property>
<!-- Configure a fencing method to kill the previous Active NameNode during failover. -->
<property>
  <name>dfs.ha.fencing.methods</name>
  <value>sshfence</value>
</property>
<!-- Configure the SSH private key for fencing. -->
<property>
  <name>dfs.ha.fencing.ssh.private-key-files</name>
  <value>/root/.ssh/id_rsa</value>
</property>
<!-- Directory for JN metadata storage -->
<property>
  <name>dfs.journalnode.edits.dir</name>
  <value>/opt/software/hadoop/hdfs/journalnode/data</value>
</property>
<!-- Enable automatic failover when a NameNode fails. -->
<property>
  <name>dfs.ha.automatic-failover.enabled</name>
  <value>true</value>
</property>
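If the cluster's client configuration is deployed on your machine, you can also read these values from the command line with the hdfs getconf tool. A minimal sketch, assuming the mycluster nameservice and NameNode IDs from the file above:

# Print the RPC address (host:port) of each NameNode in the mycluster nameservice.
hdfs getconf -confKey dfs.namenode.rpc-address.mycluster.nn1
# Expected output for the configuration above: node00:8020
hdfs getconf -confKey dfs.namenode.rpc-address.mycluster.nn2
# Expected output for the configuration above: node01:8020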

    Obtaining the IP Address

You can use the ping command on the database server to check connectivity to the corresponding hostname.

    As the hostname in Case One is hive1, you can perform the ping test from the server where the Hive database is located, as shown in the following figure.

[Figure: ping test result for hostname hive1]

As shown in the above figure, the IP address of the HDFS node is 192.168.101.243.
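On a Linux database server, the test looks roughly like this; the resolved IP address appears in parentheses in the first output line:

# Send four echo requests to the hostname found in the Location row.
ping -c 4 hive1
# PING hive1 (192.168.101.243) 56(84) bytes of data.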

For Case Two, you can query the IP address of the active node on the CDH platform, which is 192.168.9.188, as shown in the following figure.

[Figure: IP address of the active node shown on the CDH platform]

In Case Three, an HA HDFS cluster contains one Active NameNode and one Standby NameNode, and only the Active NameNode can be connected. You need to use the HA connection method to ensure that your connection is always directed to the Active NameNode, even if a failover occurs. The following describes the configuration steps.

Note: High availability configuration is not supported in FineDataLink versions earlier than 4.1.13.2. In those versions, you need to confirm the active node address yourself.

Use the following command to query the Active and Standby NameNode status.

Commonly used services provide their own command-line tools (such as the hdfs CLI), and you can determine the Active and Standby NameNode status through specific commands and the NameNode IDs.

The HDFS administration tool used to query NameNode status is:

hdfs haadmin

You can check HDFS parameters on the CDH platform. The default path of the HDFS configuration file is /etc/hadoop/conf.cloudera.hdfs/hdfs-site.xml. In the hdfs-site.xml file, you can find the high-availability-related configuration.

    High availability configuration: 

    <property> 
      <name>dfs.ha.namenodes.sdg</name> 
      <value>namenode25,namenode30</value> 
    </property>

In the above example, there are two NameNodes, namenode25 and namenode30. One is the Active NameNode, and the other is the Standby NameNode. If you prefer the command line, you can list the NameNode IDs with the sketch below.
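A minimal sketch, assuming the CDH default configuration path mentioned above:

# Print the dfs.ha.namenodes.* entry and the line that follows it (the NameNode IDs).
grep -A 1 'dfs.ha.namenodes' /etc/hadoop/conf.cloudera.hdfs/hdfs-site.xml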

    You can use the hdfs haadmin -getServiceState command to check the node status.

    $ hdfs haadmin -getServiceState namenode30
    active
    $ hdfs haadmin -getServiceState namenode25
    standby

You can then confirm the IP address of the active node, for example by pinging its hostname as described above.

    Note: FineDataLink of 4.1.13.2 and later versions supports high availability configuration.

Configuration Item: HDFS Address

Description: FineDataLink 4.1.13.2 and later versions support configuring multiple HDFS addresses, separated by commas (,), for example, hdfs://IP address 1:Port number 1,hdfs://IP address 2:Port number 2,hdfs://IP address 3:Port number 3. The system uses the HDFS addresses you configure to construct the configuration file for connecting to HDFS.
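For instance, for an HA cluster with two NameNodes, the field could read as follows (the first address reuses the active node IP 192.168.9.188 from Case Two; the second is a hypothetical standby peer):

hdfs://192.168.9.188:8020,hdfs://192.168.9.187:8020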

Topic: Data Source Configuration