Hadoop & MapReduce Project

Project Name: Hadoop & MapReduce Project
Format: Hands-on using Amazon Web Services (AWS)
Length: Two weeks
Due: By start of Thanksgiving break (on or before Wed Nov 25th @ 11:20am)
Teams: 2 students per project team, teams assigned, must be different from previous project(s)

Summary

In this project, we'll use Amazon's "Elastic MapReduce" to read a large text file, perform a count function, and feed the results into SimpleDB. MapReduce is a two-part process: input data is first "mapped" (broken into key/value pairs) and then "reduced" (the pairs are grouped by key and combined). The best way to understand these concepts is to walk through a flow.
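
To make the map/reduce split concrete before we touch AWS, here is a minimal word-count sketch in Python (an illustration only, not part of the project code). The map step emits a (word, 1) pair for every word it sees, and the reduce step groups the pairs by word and sums the counts:

import sys
from itertools import groupby

def map_step(lines):
    # map: emit a (word, 1) pair for every word we see
    for line in lines:
        for word in line.split():
            yield (word, 1)

def reduce_step(pairs):
    # reduce: group the pairs by word, then sum the counts for each word
    for word, group in groupby(sorted(pairs), key=lambda pair: pair[0]):
        yield (word, sum(count for _, count in group))

if __name__ == "__main__":
    # e.g.: python wordcount_sketch.py < somefile.csv
    for word, count in reduce_step(map_step(sys.stdin)):
        print("%s\t%d" % (word, count))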

Note: MapReduce itself is Google's proprietary implementation of this technology; Google was the first large company to use it at scale. The open-source implementation of MapReduce is called Hadoop. It grew out of work at Yahoo! and is now maintained by Apache as an Open Source project. More information is available here: hadoop

Activity One

  • You'll start by signing up for Amazon's "Simple Storage Service" (S3). Go to the s3 page on Amazon Web Services to sign up. We'll be using S3 to store and manage files that we'll use for MapReduce.
  • To view and manipulate files created on S3, you'll need to install an add-on for Firefox. Go to the add-on page at [https://addons.mozilla.org/en-US/firefox/addon/3247] and click "add to Firefox"
  • Now we'll use the add-on. On the Firefox menu bar, click "tools" and then "s3 organizer".
  • You'll need to set up your account with your security credentials to access your S3 files. Click "manage accounts". Enter an account name (anything you want), your access key, and your secret key. (Your security credentials are available at aws.amazon.com under "Your Account".)
  • Next, we'll create a "bucket" in S3. A "bucket" is roughly the same as a top-level folder on Unix or Windows. Use your last name (the last name of one team member) for your bucket. For the remainder of this project, substitute your last name for "lastname" in the paths.
  • Now we'll put the Cruise data in this bucket. At this point, you're all very familiar with this data. Create 6 separate CSV (comma-separated) files if you don't already have them, one for each table.
  • Drag and drop all 6 CSVs to your newly created bucket.
  • We'll now run our first MapReduce job flow. We're going to find all unique words (text) in these csv files and the count of the number of occurrences of each unique word. To start, go to the AWS Management Console (where you launched instances). Choose "Elastic MapReduce" and sign in.

Click "Create new job flow"
Give it a name (you can use any name)
Select "Run a sample application"
From the pull down list, choose the sample called "word count"
-> continue
On the next screen, you'll provide information about where to find everything for the job flow.
Input location /lastname/
Output location /lastname/output/cruise-count/ (can't exist, will be created)
Leave the names of the programs used for the Mapper and Reducer as they are
-> continue
Number of instances (use 4)
Click "show advanced options"
Log path /lastname/log/cruise/ (can't exist, will be created)
No key pair
-> continue
-> click "Create job flow"

Congratulations! You are running your first MapReduce job flow. You can monitor progress as your job runs. The longest part of the process will be initializing the 4 instances. The mapper and reducer jobs will run very quickly since they are manipulating a small amount of data.

You'll be charged a whopping 1.5 cents per instance-hour. This job flow will complete in minutes, but you'll still be charged for 4 instance-hours (6 cents) since you started 4 instances for the job. As you can see, you can run a lot of Elastic MapReduce job flows with the $100 credit available.
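
To sanity-check the cost arithmetic, here is a tiny sketch using the numbers quoted above (the rate comes from this handout, not from current AWS pricing):

# 1.5 cents per instance-hour, billed per started instance-hour
rate_per_instance_hour = 0.015   # dollars
instances = 4
billed_hours_per_instance = 1    # a few minutes still counts as a full hour

cost = instances * billed_hours_per_instance * rate_per_instance_hour
print("Estimated cost: $%.2f" % cost)   # $0.06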

When the job is done (the state will change to COMPLETED), use the S3 Organizer in Firefox to review the results.

  • Navigate to /lastname/output/cruise-count/
  • You'll see multiple files named "part-nnnnn" (e.g. part-00000, part-00001, etc.)
  • To view these files, you need to download them by dragging and dropping them onto your local computer
  • Open part-00000 in notepad to see part of the results from your MapReduce run
  • Now look at the log file of the job
  • Navigate to /lastname/log/cruise/
  • You'll see a job number like "j-O8CJGX3SZRPH"
  • Click on that directory. You'll see two directories inside: "steps" and "daemons".
  • We are interested in "steps", so click on that directory name.
  • Our job only has one step, so click on the directory named "1"
  • You'll see three important files: syslog, stdout, and stderr. Download these to your local machine and open them in notepad.

IMPORTANT Take some time at this juncture to try to read these files. If you have errors when you are running jobs, this is the only place you can come to try and debug your code. Not too friendly, huh?

Activity Two

Now we'll use a much larger dataset and we'll use something other than the sample applications provided by Amazon. We'll walk through an example that maps data from a website called Freebase. Freebase is an open database of the world’s information covering millions of topics in hundreds of categories. Drawing from large open data sets like Wikipedia, MusicBrainz, and the SEC archives, it contains structured information on many popular topics, including movies, music, people and locations – all reconciled and freely available. This information is supplemented by the efforts of a passionate global community of users who are working together to add structured information on everything from philosophy to European railway stations to the chemical properties of common food ingredients.

The data we'll use for this activity is a subset of all the facts and assertions in the Freebase database. The Freebase data is organized into a large number of TSV (tab separated) files. Freebase data has only three columns: names, ids and references. Each "object" in Freebase is given a unique id that looks like "/guid/9202a8c04000641f8000000000003e2d". Each "object" may have references.

Consider the example below - a small part of the American Football TSV. This file lists facts and assertions about the sport of football. The first 2 lines of this file are shown below. Each coaching position (e.g., Assistant Coach, Head Coach) in American football is given a unique id – these are the first 2 columns in the data. The 3rd column indicates who has held these positions. Somewhere in the Freebase database, the ids in this 3rd column are resolved to a name.

name id coaches holding this position
Assistant Coach /guid/9202a8c04000641f8000000000003e2d /guid/9202a8c04000641f800000000000417c
Head Coach /guid/9202a8c04000641f8000000000004fc2 /guid/9202a8c04000641f8000000000005e7e, /guid/9202a8c04000641f800000000000682f, /guid/9202a8c04000641f800000000000683c, /guid/9202a8c04000641f8000000000006f17, /guid/9202a8c04000641f80000000000073b5, /guid/9202a8c04000641f8000000000007616, /guid/9202a8c04000641f8000000000004270, /guid/9202a8c04000641f8000000000004b3a, /guid/9202a8c04000641f8000000000004d40, /guid/9202a8c04000641f8000000000004f01, /guid/9202a8c04000641f800000000000417c
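
To make the structure of these rows concrete, here is a small Python sketch (an illustration only, not part of any job flow) that splits one tab-separated line into its name, id, and list of references:

def parse_freebase_line(line):
    # columns are tab-separated: name, id, references
    # the references column can hold several comma-separated /guid/ values
    fields = line.rstrip("\n").split("\t")
    name = fields[0]
    node_id = fields[1]
    references = []
    if len(fields) > 2 and fields[2]:
        references = [ref.strip() for ref in fields[2].split(",")]
    return name, node_id, references

# example using the "Assistant Coach" row above
row = "Assistant Coach\t/guid/9202a8c04000641f8000000000003e2d\t/guid/9202a8c04000641f800000000000417c"
print(parse_freebase_line(row))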

In this activity, we'll process a series of job flows and we will eventually filter a set of Freebase data and store it into Amazon Web Services SimpleDB data store. Our first job flow will simply look for the most popular Freebase entities. (One of the fundamental concepts in Amazon Elastic MapReduce is the notion of a job flow. A job takes input and generates output; the output of one job can be used as the input to another job. This is why jobs are referred to as job flows.)

Freebase makes their data publicly available for download. You can see the downloads available here: Freebase Data Dumps. For this activity, I've already downloaded the data and selected a subset of the TSV files. The whole dataset is 787MB which is too much to work with for this activity. Our dataset has 52MB of data across 176 TSV files.

(Google, Yahoo!, and others have data similar to this where the primary focus is a URL. This is a bit different than traditional business data where the focus is an entity and its attributes. Here we are looking at something more similar to a log file where every time a URL is served up in Search, the activity is tracked. The challenge is to process these huge log files to pull out some intelligence.)

Download a zip file with the Freebase data: fbzip.zip
Create a directory on your desktop to temporarily hold this data
Use WinZip or similar (should be installed on the lab computers already) to extract all the files into the directory you created
On S3, create a new folder inside your /lastname/ bucket to hold this data. Call it "freebase-data" (so the input path used below is /lastname/freebase-data/)
Now drag & drop all the TSV files from your local computer to the freebase-data s3 bucket

Now we are ready to run a MapReduce job flow against this data.

For the first job, we are going to run a Python program against the data. Open the program on your local machine in WordPad to view the code. The file is here.

Amazon Elastic MapReduce will iterate over each file of input that we give it. As you can see from the first few lines of mapper.py, we read each line of the file we are given. We tokenize each word in that line and look for the Freebase nodes (i.e. the ones starting with /guid/). Every time we find a node, we output "LongValueSum:<node id>\t1". Basically, we are searching through the input and pulling out each /guid/. Using the data from above, the Mapper step would give us:

/guid/9202a8c04000641f8000000000003e2d
/guid/9202a8c04000641f800000000000417c
/guid/9202a8c04000641f8000000000004fc2
/guid/9202a8c04000641f8000000000005e7e
/guid/9202a8c04000641f800000000000682f
/guid/9202a8c04000641f800000000000683c
/guid/9202a8c04000641f8000000000006f17
/guid/9202a8c04000641f80000000000073b5
/guid/9202a8c04000641f8000000000007616
/guid/9202a8c04000641f8000000000004270
/guid/9202a8c04000641f8000000000004b3a
/guid/9202a8c04000641f8000000000004d40
/guid/9202a8c04000641f8000000000004f01
/guid/9202a8c04000641f800000000000417c

Notice that the only value repeated is the last one. This means the same person held the position of "Assistant Coach" and "Head Coach" (at different times, of course).

Now the reducer will read this data. For this job flow, we haven't written any code for the reducer. Instead, we use a "built-in" reducer called "aggregate". You will recall the aggregate functions in SQL (SUM, COUNT, MIN, MAX). The aggregate reducer will sort and count the data sent to it from the Mapper. In the football example, every node occurs only once except the last one, which occurs twice. What we get is each node that we saw in our input with a count next to it. The count is the number of times that node was encountered in the Freebase database. So the output from "aggregate" will be:

/guid/9202a8c04000641f8000000000003e2d 1
/guid/9202a8c04000641f800000000000417c 2
/guid/9202a8c04000641f8000000000004fc2 1
/guid/9202a8c04000641f8000000000005e7e 1
/guid/9202a8c04000641f800000000000682f 1
/guid/9202a8c04000641f800000000000683c 1
/guid/9202a8c04000641f8000000000006f17 1
/guid/9202a8c04000641f80000000000073b5 1
/guid/9202a8c04000641f8000000000007616 1
/guid/9202a8c04000641f8000000000004270 1
/guid/9202a8c04000641f8000000000004b3a 1
/guid/9202a8c04000641f8000000000004d40 1
/guid/9202a8c04000641f8000000000004f01 1
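
For intuition only, here is a rough Python sketch of what an aggregate-style reducer does with the mapper's actual "LongValueSum:<node id>\t1" records. This is not Amazon's built-in reducer, just an illustration of the sort-and-sum idea:

import sys
from collections import defaultdict

def aggregate(lines):
    # sum the values for each LongValueSum key
    totals = defaultdict(int)
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) != 2:
            continue
        key, value = parts
        if key.startswith("LongValueSum:"):
            totals[key[len("LongValueSum:"):]] += int(value)
    return totals

if __name__ == "__main__":
    # e.g.: python aggregate_sketch.py < mapper_output.txt
    for node, count in sorted(aggregate(sys.stdin).items()):
        print("%s\t%d" % (node, count))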

Let's go ahead and run this on Elastic MapReduce:

First, let's put the code for our Mapper up on S3 so we can use it
Create a new bucket on S3 called /lastname/code/ (navigate to /lastname/ and then create a new bucket or subdirectory called "code")
Drag/drop mapper.py to /lastname/code/
Now go back into the "AWS Management Console" and create a new job flow in Elastic MapReduce.
Select "Run your own application"
In "Choose Job Type" select "Streaming"
-> continue
Enter the details of your job flow

Input /lastname/freebase-data/
Output /lastname/output/freebase-output
Mapper /lastname/code/mapper.py
Leave Reducer as aggregate
-> continue
Number of instances (use 4)
Click "show advanced options"
Log path /lastname/log/freebase/ (can't exist, will be created)
No key pair
-> continue
-> click "Create job flow"

Once again, you can monitor progress as your job runs. With 4 instances, your job flow should take about 15 minutes to complete. You can start the next activity while you are waiting. When the job is done, use S3 browser once again to review the results.

  • Navigate to /lastname/output/freebase-output/
  • Open part-00000 in notepad to see part of the results from your MapReduce run
  • Now look at the log file of the job
  • Navigate to /lastname/log/freebase/jobname/1/
  • Once again, you'll see: syslog, stdout, and stderr. Download these to your local machine and open them in notepad.

IMPORTANT Take some time at this juncture to try to read these files. If you have errors when you are running jobs, this is the only place you can come to try and debug your code.

Activity Three

Our next job flow will take the output from the previous activity and store it in Amazon SimpleDB. Once there, we can access it using "ScratchPad" as we did in the first project or directly with API calls.

Let's first look at the code we'll be using for our Mapper and Reducer. The files are below. Notice that they are written in Ruby instead of Python. Download the files and open them in WordPad.

top_sdb_mapper.rb
top_sdb_reducer.rb

The Mapper is very simple; it basically just echoes its input (the output of the previous job flow) through to the Reducer.

The Reducer is where all the work is happening. We are taking the output of our mapper and writing the values into SimpleDB. We are populating three pieces of information in SimpleDB: the URL (/guid/xxx), the name, and the count. For now, the name attribute will be empty. We could run another job flow to populate it but we won't for this activity.

In reviewing the Reducer code, you'll see that we use a ‘back-off’ algorithm for writing attributes to SimpleDB. We do this since it is possible to receive intermittent connection errors whenever you use a web service in a highly concurrent environment. By using a ‘back-off’ algorithm, we mitigate this issue. If we receive an error, we will simply wait for a short period of time and then try again.
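
The real logic is the Ruby in top_sdb_reducer.rb; the Python sketch below just restates the idea, using a hypothetical put_attributes callable as a stand-in for whatever call actually writes an item to SimpleDB, and one common variant of back-off that also doubles the wait after each failure:

import time
import random

def put_with_backoff(put_attributes, item_name, attributes, max_retries=6):
    # put_attributes: a hypothetical callable that writes one item to SimpleDB
    # and raises an exception on an intermittent connection error
    delay = 0.5
    for attempt in range(max_retries):
        try:
            return put_attributes(item_name, attributes)
        except Exception:
            if attempt == max_retries - 1:
                raise   # give up after the last retry
            # wait a short (and growing) period of time, then try again
            time.sleep(delay + random.uniform(0, delay))
            delay *= 2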

Let's go ahead and run this MapReduce job flow:

First, drag/drop the two Ruby files (.rb) to /lastname/code/
Now go back into the "AWS Management Console" and create a new job flow in Elastic MapReduce.
Select "Run your own application"
In "Choose Job Type" select "Streaming"
-> continue
Enter the details of your job flow
Input Location /lastname/output/freebase-output/
Output Location /lastname/output/simpledb-output/
Mapper /lastname/code/top_sdb_mapper.rb
Reducer /lastname/code/top_sdb_reducer.rb
This job flow requires additional arguments. Enter these in the box.
-cacheFile s3n://elasticmapreduce/samples/freebase/code/base64.rb#base64.rb
-cacheFile s3n://elasticmapreduce/samples/freebase/code/aws_sdb.rb#aws_sdb.rb
-> continue
Number of instances (use 10 as this is a big job with a lot of data to process)
Click "Show advanced options"
Log path /lastname/log/freebase-simpledb/ (can't exist, will be created)
No key pair
-> continue
-> click "Create job flow"
We can check out the results directly in SimpleDB. We'll use ScratchPad again to do so. If you don't remember the details, go back to the first project (Cloud API Project) to review the steps to use ScratchPad to see your data in SimpleDB.

The MapReduce job put data into a domain called Freebase_Names_Guide
You can do a simple query to see all the data in Freebase_Names_Guide.
Choose "Select" from the "Explore API" pulldown.
In the "Select Expression" box write "select * from Freebase_Names_Guide"
If everything worked, you'll see:

<SelectResponse>
  <SelectResult>
    <Item>
      <Name>/guid/9202a8c04000641f8000000000003c45</Name>
      <Attribute>
        <Name>Name</Name>
        <Value>empty</Value>
      </Attribute>
      <Attribute>
        <Name>Count</Name>
        <Value>1</Value>
      </Attribute>
    </Item>
    <Item>
      <Name>/guid/9202a8c04000641f8000000000003cfc</Name>
      <Attribute>
        <Name>Name</Name>
        <Value>empty</Value>
      </Attribute>
      <Attribute>
        <Name>Count</Name>
        <Value>1</Value>
      </Attribute>
    </Item>

and so on
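
If you prefer to check the results programmatically instead of through ScratchPad, a short sketch using the boto Python library would look roughly like this (boto is an extra assumption on my part; it is not required for the project):

import boto

# fill in the same credentials you used for ScratchPad
sdb = boto.connect_sdb("YOUR_ACCESS_KEY", "YOUR_SECRET_KEY")
domain = sdb.get_domain("Freebase_Names_Guide")

# same query as above; each item behaves like a dict of attributes
for item in domain.select("select * from Freebase_Names_Guide"):
    print("%s  Name=%s  Count=%s" % (item.name, item.get("Name"), item.get("Count")))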

Now you are ready for the assignment!

IMPORTANT: You will be turning in the API results from above as evidence of completing the activities for this project.

Hadoop & MapReduce Assignment

Part One: Speed Up and Scale Up

Run the job above to look at scale-up and speed-up.
Remember that scale up looks at the impact of adding another instance to process a job. If you have linear scale up, your job will run 2x as fast with 2x as many instances. In other words, if your job takes, say, 10 minutes with 2 instances, it should take 5 minutes with 4 instances. For this assignment, we'll explore scale up with the job flow from Activity Two.

Run the job flow with 1 instance, 2 instances, 4 instances, and 8 instances. Create a chart showing your results. Use a chart in Excel that best demonstrates the linear (or non-linear) nature of your scale up.
Put the chart into a Word document and add a paragraph that describes in English the results of your scale up test. Include in the paragraph your thoughts about why you did or did not achieve linear scale up.
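
When you analyze the chart, it can help to compute the ratios explicitly first. A small Python sketch, with placeholder runtimes that you should replace with your own measured job flow times:

# placeholder runtimes in minutes; replace with your own measurements
runtimes = {1: 20.0, 2: 11.0, 4: 6.5, 8: 4.0}

baseline_time = runtimes[1]
for instances in sorted(runtimes):
    ratio = baseline_time / runtimes[instances]
    # perfectly linear behaviour would give ratio == instances (e.g. 4 instances -> 4.0x)
    print("%d instance(s): %.1f min, %.2fx faster than 1 instance"
          % (instances, runtimes[instances], ratio))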


Now we'll do the same for speed-up.
Remember that speed up considers the issues around processing 2x data using a set number of instances. If you double the data from the input, you would hope that it doesn't take double the time to complete the job. Linear speed up means that you can process 2x the data with 2x instances in the same time as 1x the data with 1x instances. We are looking for better than linear speed up.
Again, use the data and job flow from Activity Two.

Use whatever method you prefer to double the data input for the job. Your new dataset should have 100MB+ data. It's okay to exactly duplicate the data so the first 52MB and the second 52MB are identical.
Run the job flow with 2 instances, 4 instances, and 8 instances.
Compare each run with the corresponding result from the scale up test (e.g. 4 instances with 52MB of data vs 4 instances with 104MB of data). Again, use a chart (or a series of charts) to show speed-up.
Put the chart(s) into your Word document and add a paragraph that describes in English the results of your speed up test. Include in the paragraph your thoughts about the throughput (performance) of the system.

Part Two: Lastfm MapReduce Job


For this part, you'll create a new job flow using different data. You'll walk through many of the steps from above with a different data source.

The input data for the mapper for this job comes from a public data source on Amazon. We are not going to download the data. Instead, we'll just reference the public data set as "elasticmapreduce/samples/similarity/lastfm/input". If we could look at the data source, we would find a file with three columns per row: userid (a person), itemid (an artist), and the number of times that person listened to that artist. Each time a userid listens to an itemid, a row is generated in the file. Our reducer is just "aggregate". As we already know, it will do a count. In this case, we are counting the total number of items (artists) listened to by each userid (person).
The data is formatted as follows:

3 columns: userid, artistid, playcount

userid    artistid    playcount
2         4164        23
3         4164        4
2         1002471     1
3         1002471     643
4         1000324     1


The relevant parts of the Python code for the new job flow are below. You will modify the mapper.py program from Activity Two for this part. The mapper.py file from above is here, with the text of the file repeated below:

#!/usr/bin/python

import sys

# This is our first basic mapper; just aggregate based on node id. This output will be
# used by top_sdb_mapper/reducer to save this data into SimpleDB

def main(argv):

    # read each line
    line = sys.stdin.readline()

    try:
        while line:
            # strip out trailing characters
            line = line.rstrip()

            # look at each word
            words = line.split()

            for word in words:
                # our ids might be separated by commas as well, so do 1 more level of parsing
                ids = word.split(",")

                for id in ids:
                    # we only care about the ones that start with /guid/
                    if id.startswith("/guid/"):
                        # send out our sum
                        print "LongValueSum:" + id + "\t" + "1"

            # get our next line
            line = sys.stdin.readline()
    except EOFError:
        return None

if __name__ == "__main__":
    main(sys.argv)


In this code, Amazon Elastic MapReduce will iterate over each file of input that we give it. As you can see from the first few lines of mapper.py, we read each line of the file we are given. We tokenize each word in that line and look for the Freebase nodes (i.e. the ones starting with /guid/). Every time we find a node, we output "LongValueSum:<node id>\t1". Basically, we are searching through the input and pulling out each /guid/.

Your task is to modify this file. The new file will be called "user_count_mapper.py". You'll need to change two lines in mapper.py for your new mapper.

(user, item, rating) = line.strip().split()

print "LongValueSum:%s\t1" % user

There are lots of great resources for python on the internet. One website that might be helpful is:
http://wiki.python.org/moin/SimplePrograms

For this part of the project, create a new job flow to use the mapper above against the public data set.

Create and run the job flow
When it is done, download the files generated by the job (part-00000, etc.)
Import all the data (every file created by the job) into Excel. You'll have about 160k rows when you are done. All rows should be in one worksheet, not separate worksheets.
Perform some basic statistical analysis to determine the maximum number of artists listened to by one user. In your project submission, fill in "n" for the following statements:

25% of users listened to fewer than n artists
50% of users listened to fewer than n artists
75% of users listened to fewer than n artists

Hint: Use an Excel Statistical Function for the above


the maximum number of artists listened to by one user was n
the minimum number of artists listened to by one user was n
the average number of artists listened to by one user was n

To turn in your project, turn in the following:

  1. API output from ScratchPad for Activity Three (completed during lab time)
  2. Word document for Part One
  3. Excel spreadsheet for Part Two

VERY IMPORTANT Be sure to remove everything from your S3 storage area or you will continue to be charged.

Project Grading

The project will be graded based on the following criteria:

  • Completeness (all requirements)
  • Correctness (functionally accurate)
  • Creativity (where applicable, creative approach)
Bibliography
1. Source for Activity Two and Activity Three: Getting Started with Elastic MapReduce