This project explores the core concepts of distributed data processing using the MapReduce programming model, implemented in Python via Hadoop Streaming and deployed on a multi-node Google Cloud Dataproc cluster.
The goal of this project was to gain hands-on experience with large-scale data processing in a distributed cloud environment using industry-standard tools. Rather than using Java for MapReduce (the traditional approach), this project leverages Hadoop Streaming to execute custom Python scripts for the map and reduce phases, making it more accessible to data engineers and Python practitioners.
- ✅ MapReduce-based Word Count implementation
- ✅ Executed on a multi-node Dataproc cluster with production-like configuration
- ✅ Uses HDFS for input/output data storage
- ✅ Mapper and Reducer implemented in Python
- ✅ Demonstrates full Map → Shuffle & Sort → Reduce pipeline
- ✅ Integrates Python + Hadoop via Streaming API
- ✅ Cloud-native setup using Google Cloud Platform (GCP)
wordcount_project/
├── data/
│   └── input.txt
├── mapper.py
└── reducer.py
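The two scripts are not reproduced in full here; below is a minimal sketch of what they could look like for this word count (the actual mapper.py and reducer.py in the repository may differ in details). The mapper reads raw text from stdin and emits one tab-separated word/1 pair per word:

```python
#!/usr/bin/env python3
"""Word-count mapper: read lines from stdin, emit tab-separated (word, 1) pairs."""
import sys

for line in sys.stdin:
    for word in line.strip().split():
        # Hadoop Streaming treats everything before the first tab as the key.
        print(f"{word}\t1")
```

The reducer relies on Hadoop's shuffle & sort phase, which guarantees that all pairs with the same key arrive on consecutive lines, so a single pass with a running total is enough:

```python
#!/usr/bin/env python3
"""Word-count reducer: sum the counts for each word (input arrives sorted by key)."""
import sys

current_word = None
current_count = 0

for line in sys.stdin:
    word, _, count = line.strip().partition("\t")
    if not word:
        continue  # skip blank lines
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word = word
        current_count = int(count)

# Flush the final key.
if current_word is not None:
    print(f"{current_word}\t{current_count}")
```

Both scripts need the shebang line and executable permissions (see the chmod step in the prerequisites below), because Hadoop Streaming invokes them directly as commands.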
- A working Hadoop environment, either:
- Locally set up on your system (with HDFS and MapReduce configured), or
- Cloud-based, e.g. a Google Cloud Dataproc cluster with Hadoop installed
- Python 3 installed on all nodes (local machine or cluster)
- Hadoop Streaming JAR available, commonly found at:
/usr/lib/hadoop/hadoop-streaming-3.3.6.jar
- Executable permissions for both scripts:
chmod +x wordcount_project/mapper.py wordcount_project/reducer.py
- Cluster Name:
cluster-name
- Nodes: 1 master, 2 worker nodes
- Hadoop Version: 3.3.x (Streaming enabled)
- OS Environment: Debian/Ubuntu-based Dataproc VM
You can SSH into the Dataproc master node using either of the following methods:
Option 1: Via Google Cloud Console (Web UI)
- Go to the Dataproc Clusters page
- Click on your cluster:
cluster-name
- Select "VM Instances" and click "SSH" next to the master node to open a terminal in the browser
Option 2: Via gcloud CLI
gcloud compute ssh <your-username>@<cluster-name>-m --zone=<your-zone>
- Upload Input File to HDFS
hdfs dfs -mkdir -p /user/morevarun4004/wordcount_input
hdfs dfs -put wordcount_project/data/input.txt /user/morevarun4004/wordcount_input
- Ensure Mapper and Reducer Scripts Are Present Locally
Make sure mapper.py and reducer.py are in your current working directory, or provide the correct relative paths.
- Run Hadoop Streaming Job
hadoop jar /usr/lib/hadoop/hadoop-streaming-3.3.6.jar \
-input /user/morevarun4004/wordcount_input \
-output /user/morevarun4004/wordcount_output \
-mapper mapper.py \
-reducer reducer.py \
-file wordcount_project/mapper.py \
-file wordcount_project/reducer.py
Note: the -file options ship both scripts into each task's working directory on the worker nodes, which is why -mapper and -reducer refer to them by file name only.
- View the Output
hdfs dfs -cat /user/morevarun4004/wordcount_output/part-0000*
For the input:
hello world
hello hadoop
map reduce world
Expected output:
hadoop 1
hello 2
map 1
reduce 1
world 2
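As a quick sanity check before (or instead of) running the full cluster job, the same result can be reproduced locally by simulating the shuffle & sort step. The snippet below is a sketch in plain Python, assuming the mapper/reducer logic outlined earlier, and does not require Hadoop:

```python
"""Local, single-process approximation of the Map -> Shuffle & Sort -> Reduce pipeline."""
from collections import Counter

sample = """hello world
hello hadoop
map reduce world"""

# Map: emit (word, 1) pairs.
pairs = [(word, 1) for line in sample.splitlines() for word in line.split()]

# Shuffle & Sort: order pairs by key, as Hadoop does between the map and reduce phases.
pairs.sort(key=lambda kv: kv[0])

# Reduce: sum the counts for each word.
counts = Counter()
for word, one in pairs:
    counts[word] += one

for word in sorted(counts):
    print(f"{word}\t{counts[word]}")
```

The printed counts should match the expected output above; the cluster job performs the same steps, just distributed across worker nodes and backed by HDFS.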
- Understood the MapReduce lifecycle (Map → Shuffle & Sort → Reduce)
- Learned how to use Hadoop Streaming to run Python scripts
- Gained experience with Google Cloud Dataproc and cluster-based HDFS storage
- Built confidence in deploying distributed data jobs in a cloud environment