CentOS Hadoop Deployment Guide
1. Environment Preparation
Before deploying Hadoop, ensure the following prerequisites are met:
- Operating System: CentOS 7 or higher (64-bit recommended).
- Hardware Requirements: At least 2GB RAM (4GB+ for production), 40GB+ storage per node, dual-core CPU.
- Network Configuration: All nodes must be in the same LAN with static IP addresses and hostname resolution (configure /etc/hosts to avoid DNS dependencies).
- User Setup: Create a non-root user (e.g., hadoop) with sudo privileges for security, as shown in the sketch after this list.
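The following is a minimal sketch of these two steps; the hostnames and IP addresses are placeholders, so substitute your own:
# Example /etc/hosts entries (hypothetical IPs and hostnames; adjust to your network)
192.168.1.101 node1
192.168.1.102 node2
192.168.1.103 node3
# Create a dedicated hadoop user and grant sudo via the wheel group (enabled by default on CentOS 7)
sudo useradd -m hadoop
sudo passwd hadoop
sudo usermod -aG wheel hadoop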
2. Install Java Environment
Hadoop requires Java 8 (OpenJDK or Oracle JDK). Run the following commands on all nodes:
# Install OpenJDK 8
sudo yum install -y java-1.8.0-openjdk-devel
# Verify installation
java -version # Should show Java 1.8.x
# Set JAVA_HOME (replace path if using Oracle JDK)
echo "export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk" >
>
~/.bashrc
echo "export PATH=\$PATH:\$JAVA_HOME/bin" >
>
~/.bashrc
source ~/.bashrc
This ensures Hadoop can locate the Java runtime.
3. Download and Extract Hadoop
Download the latest stable Hadoop release from the Apache website. Extract it to a dedicated directory (e.g., /usr/local):
# Create a hadoop user-owned directory
sudo mkdir -p /usr/local/hadoop
sudo chown -R hadoop:hadoop /usr/local/hadoop
# Download and extract (replace version as needed)
wget https://downloads.apache.org/hadoop/common/hadoop-3.3.4/hadoop-3.3.4.tar.gz
tar -xzvf hadoop-3.3.4.tar.gz -C /usr/local/hadoop --strip-components=1
This installs Hadoop in /usr/local/hadoop with proper ownership.
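To guard against a corrupted download, you can also compare the archive's SHA-512 digest with the checksum file Apache publishes alongside each release:
# Compute the local digest and compare it with the published .sha512 file for this release
sha512sum hadoop-3.3.4.tar.gz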
4. Configure Hadoop Environment Variables
Set up environment variables to make Hadoop commands accessible globally. Edit ~/.bashrc (or /etc/profile for system-wide access):
echo "export HADOOP_HOME=/usr/local/hadoop" >
>
~/.bashrc
echo "export PATH=\$PATH:\$HADOOP_HOME/bin:\$HADOOP_HOME/sbin" >
>
~/.bashrc
source ~/.bashrc
Verify with hadoop version—it should display the installed version.
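Because the Hadoop daemons are launched over SSH, they read $HADOOP_HOME/etc/hadoop/hadoop-env.sh rather than your shell profile, so it is common practice to set JAVA_HOME there as well:
# Make JAVA_HOME visible to Hadoop's startup scripts
echo "export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk" >> /usr/local/hadoop/etc/hadoop/hadoop-env.sh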
5. Configure SSH Passwordless Login
Hadoop's start scripts use SSH to launch daemons on remote nodes, so the node running the NameNode needs passwordless SSH access to itself and to every DataNode. On the NameNode (e.g., node1):
# Generate SSH key pair
ssh-keygen -t rsa -P "" -f ~/.ssh/id_rsa
# Copy public key to local machine (for testing)
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
ssh localhost # Test login (should not prompt for password)
# Copy key to all DataNodes (replace node2, node3 with actual IPs/hostnames)
ssh-copy-id hadoop@node2
ssh-copy-id hadoop@node3
Repeat for all nodes to enable seamless communication.
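If the test login still prompts for a password, the usual culprit is file permissions: sshd ignores keys whose directory or authorized_keys file is group- or world-writable. Tightening them is a safe default:
# Restrict permissions so sshd will accept the key
chmod 700 ~/.ssh
chmod 600 ~/.ssh/authorized_keys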
6. Configure Hadoop Core Files
Edit Hadoop’s configuration files in $HADOOP_HOME/etc/hadoop to define cluster behavior:
core-site.xml
Specifies the default file system (HDFS) and NameNode address:
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://node1:9000</value>
    <!-- Replace with your NameNode's hostname/IP -->
  </property>
</configuration>
hdfs-site.xml
Configures HDFS replication (set to 3 for production, 1 for testing) and data directories:
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
    <!-- Change to 3 for multi-node clusters -->
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>/usr/local/hadoop/data/namenode</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>/usr/local/hadoop/data/datanode</value>
  </property>
</configuration>
mapred-site.xml
Enables YARN as the MapReduce framework (Hadoop 3.x ships this file directly; in older 2.x releases you must first copy it from mapred-site.xml.template):
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>
yarn-site.xml
Configures YARN’s ResourceManager and shuffle service:
<configuration>
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>node1</value>
    <!-- Replace with your ResourceManager's hostname/IP -->
  </property>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
</configuration>
These configurations define the cluster’s core structure.
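For a multi-node cluster, the start scripts also need a list of worker hosts. In Hadoop 3.x this goes in the workers file in the same directory (named slaves in Hadoop 2.x); the hostnames below are examples matching the nodes used earlier:
# $HADOOP_HOME/etc/hadoop/workers -- one worker hostname per line
node2
node3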
7. Format HDFS NameNode
The NameNode must be formatted once before first use to initialize its metadata storage. Run this command on the NameNode:
hdfs namenode -format
This creates the HDFS directories specified in hdfs-site.xml and initializes the file system. Do not rerun it on a cluster that already holds data, because formatting wipes the NameNode's metadata.
8. Start Hadoop Cluster
Start HDFS and YARN services using the following commands on the NameNode:
# Start HDFS (NameNode and DataNodes)
start-dfs.sh
# Start YARN (ResourceManager and NodeManagers)
start-yarn.sh
To verify services are running, use jps on each node:
- NameNode host: should show NameNode and SecondaryNameNode.
- DataNodes: should show DataNode.
- ResourceManager host: should show ResourceManager.
- NodeManagers: should show NodeManager.
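On the NameNode host of this layout (where the ResourceManager also runs), a jps check might look roughly like the following; the process IDs will differ on your system:
$ jps
12001 NameNode
12245 SecondaryNameNode
12530 ResourceManager
12888 Jps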
9. Verify Cluster Status
Check the health of your cluster using these commands:
# View HDFS status
hdfs dfsadmin -report # Shows DataNodes and storage usage
# Access Web UIs
# HDFS NameNode: http://node1:9870 (default port)
# YARN ResourceManager: http://node1:8088 (default port)
The Web UIs provide real-time insights into cluster metrics (e.g., node status, storage usage).
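As a final sanity check, a quick HDFS round trip and a sample MapReduce job confirm that storage and computation work end to end. The paths below are arbitrary examples, and the examples jar path assumes the 3.3.4 tarball layout:
# Write a file into HDFS and list it back
hdfs dfs -mkdir -p /tmp/smoke-test
hdfs dfs -put /etc/hosts /tmp/smoke-test/
hdfs dfs -ls /tmp/smoke-test
# Run the bundled pi estimator on YARN (2 map tasks, 10 samples each)
yarn jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.3.4.jar pi 2 10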
10. Optional: Deploy High Availability (HA)
For production environments, configure Hadoop HA to eliminate single points of failure (SPOFs). This involves:
- Setting up two NameNodes (active/passive).
- Using ZooKeeper for leader election.
- Configuring JournalNodes for shared storage.
Refer to the Hadoop HA documentation for detailed steps.
By following these steps, you can successfully deploy a Hadoop cluster on CentOS, enabling distributed storage and processing of large datasets.
