Homework 1
Due: February 26, 2019 @ 11:59pm
Instructions
The first part of this page explains how homework assignments will be handled and evaluated, since they are completed in groups. The questions for Homework 1 start further down the page, under the Questions heading.
Overview
As a group, solve the homework problems and write your answers in the R Markdown file homework_1.Rmd. Grades for the group submissions will be based on correctness as well as document formatting, visualization quality, writing quality, and code style. This means that your group submission must be written in the style of an exploratory data report:
Each exercise must be written up using full sentences such that it is clear what question is being answered.
There needs to be plain text above each code block explaining what you are doing, and the code blocks should be organized.
The R Markdown file must knit without error and generate a PDF file, and the final PDF output must look clean and be easy to read.
Participation
Credit for group participation will be determined using the following sources:
A CONTRIBUTIONS.md file distributed with your group repository
Commit history on GitHub
Discussion history in your group’s private Slack channel
Each group will need to fill out the CONTRIBUTIONS.md file as part of their submission. This file is where each group member lists what he or she contributed to the final submission. Read the section Fill out the CONTRIBUTIONS.md file for more details on how this works.
Google Docs
If your group used an external document to coordinate and organize your work, such as a Google Doc, then that can also count as evidence of participation, provided that there is a visible writing history and it is possible to identify which student is responsible for each edit. This will require you to share the relevant file with the instructor with full privileges on the document so that it’s possible to review the document’s editing history. Please note that anonymous edits to Google Docs documents cannot be used as participation evidence, since there is no way to verify the account responsible for the added content. Also, for similar reasons, offline documents traded back and forth via email cannot be accepted as evidence of participation.
How to answer the questions as a group
The following is a checklist you may follow to help you get started with answering the questions as a group:
Read through all the problems individually. Then, as a group, discuss what will be needed to fully answer each question.
As a group, decide how you will split up writing responsibilities. A typical way to do this is to have each group member be responsible for writing out the full answer to a certain number of questions.
(Important) Before you start, make a copy of homework_1.Rmd and rename the copied file to include your last name. For example, if your last name is Smith, then your file copy should be renamed to homework_1_smith.Rmd.
Commit and push your copied file to GitHub.
Draft your contributions in your file. For example, if my last name were Smith and I agreed to write up the answers to questions 4, 5, and 6, then I would open homework_1_smith.Rmd and put my answers there. When I’m done, I would save my file, then commit and push my work to GitHub.
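If you prefer working in a terminal, the copy-and-rename step in the checklist above can be sketched as a small shell function. This is only an illustration: copy_template is a hypothetical helper name that is not part of the course materials, and lowercasing the last name is an assumption based on the homework_1_smith.Rmd example. Run it from the folder that contains homework_1.Rmd.

```shell
# Hypothetical helper: copy the template to a personal draft file.
# Usage: copy_template Smith    (creates homework_1_smith.Rmd)
copy_template() {
  # Assumption: last names are lowercased, as in homework_1_smith.Rmd.
  last_name=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
  target="homework_1_${last_name}.Rmd"
  # Refuse to overwrite a draft that already exists.
  if [ -e "$target" ]; then
    echo "$target already exists; not overwriting" >&2
    return 1
  fi
  cp homework_1.Rmd "$target" && echo "Created $target"
}
```

The function refuses to overwrite an existing draft so that running it twice cannot erase answers you have already written.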
How to edit and merge your answers into the group submission
While you will be writing your answers in separate files, eventually the group will need to merge everyone’s answers into the main homework_1.Rmd document. The following checklist may help with this:
Select an editor to be in charge of merging everyone’s answers into the final document homework_1.Rmd. Because the editor needs to prepare the document for submission, it is reasonable if he or she contributes slightly less in terms of answering the questions (for example, if everyone else answers three questions, it would be okay if the editor answers two).
The editor should ensure that everyone has committed and pushed their answers to GitHub so they can copy and paste in each contribution.
The editor needs to make sure that the final submission reads like a coherent document and that the writing style and code style are uniform across all the answers. In other words, it should read like a single person answered all the questions, not a group of four people.
The editor will be in charge of committing and pushing the final R Markdown file to GitHub, knitting to PDF, and uploading the PDF file on Blackboard.
Fill out the CONTRIBUTIONS.md file
After everything is written up and ready for submission, the last thing the group will need to do is fill out the CONTRIBUTIONS.md file. By default, the file looks like this:
# Contributions to group submission
## Editor: FirstName LastName Member 1
* Questions answered:
## FirstName LastName Member 2
* Questions answered:
## FirstName LastName Member 3
* Questions answered:
## FirstName LastName Member 4
* Questions answered:
At a minimum, you must remove the FirstName LastName Member entries in the template and fill in the names of the people in your group, indicate which group member served as the editor, and state which questions were written up by each member.
Additional information beyond this should be supplied, such as indicating when a group member helped another group member edit an answer or gave helpful comments in a Slack discussion. For example, one group member’s contribution list may read as follows:
## Jane Smith
* Questions answered: 4, 5, 6
* Helped with editing on answers 8 and 9
* Collaborated with group member Jack Williams on answering question 10
* Pointed out spelling errors and suggested fixes to the document layout in the merged group document
Working with a GitHub repository as a group
You will likely encounter some issues while working in a group-based GitHub repository. In particular, you might find that clicking “Push” in the Git tab of RStudio doesn’t seem to work and produces an error message instead. This happens when another member of your group has pushed work to GitHub since you last pulled. While this can be irritating to deal with, it is actually a good thing, as it is GitHub’s way of protecting your files from accidental overwrites and deletions.
So what should you do to keep things running smoothly? First, always work in your own file, never in another person’s file. If you are not the editor, then you should not edit homework_1.Rmd either! Also, do not remove or rename any files that are not your own. Finally, when you are getting ready to work, following the procedure below should help keep the error messages to a minimum:
When you start working, you should begin by going to the Git tab and clicking “Pull” (notice this is not the same as “Push”). This will synchronize any new changes that your group may have made into your files.
Work on your file as normal. When you have completed your work, save your file.
Now you are ready to commit. But first, go to the Git tab and click “Pull” one more time to check for any other changes. Then, still in the Git tab, click the checkmark next to your updated file, type a message in the message box, and click the Commit button.
If the updated file is no longer in the list of files in the Git tab, then your commit was successful.
Click “Push” to upload your changed file.
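For reference, the Pull, Commit, and Push steps above correspond roughly to the Git commands below. This is a sketch, assuming you run it in a terminal inside your group repository; sync_and_push is a hypothetical helper name, and the file name and commit message in the usage line are placeholders.

```shell
# Hypothetical helper: pull, stage one file, commit, and push.
# Usage: sync_and_push homework_1_smith.Rmd "Draft answers to questions 4-6"
sync_and_push() {
  file="$1"
  message="$2"
  # Pull first so work your group has pushed is merged in before you commit.
  git pull || return 1
  # Stage only your own file (the checkmark in the Git tab).
  git add -- "$file" || return 1
  # Record the change with a message (the message box and Commit button).
  git commit -m "$message" || return 1
  # Upload the new commit to GitHub (the Push button).
  git push
}
```

Each step stops the function if it fails, so a problem during the pull never leads to a push. If git pull reports a conflict, it is safer to stop there than to keep retrying.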
If the above advice doesn’t work…
If, even after following the advice above, you still encounter error messages when pulling from and pushing to GitHub, contact the course instructor for help.
How to submit
The editor should follow the steps below to submit the homework for his/her group.
Make sure that everyone has committed and pushed their R Markdown files so that everything is synchronized to GitHub. If you do this right, then you will be able to view all the completed files on the GitHub website.
Knit your group’s R Markdown document to the PDF format, export (download) the PDF file from RStudio Server, and then upload it to the Homework 1 posting on Blackboard.
The rail trail dataset
For this homework assignment, you will be working through a set of visualization problems based on the rail_trail dataset. The rail_trail dataset was collected by the Pioneer Valley Planning Commission (PVPC) and counts the number of people who passed a sensor on a rail trail on each of ninety days. A rail trail is a retired or abandoned railway that was converted into a walking trail. The data was collected between April 5, 2005 and November 15, 2005 using a laser sensor placed at a location north of Chestnut Street in Florence, MA.
The dataset contains the following variables:
Variable | Description |
---|---|
hightemp | daily high temperature (in degrees Fahrenheit) |
lowtemp | daily low temperature (in degrees Fahrenheit) |
avgtemp | average of daily low and daily high temperature (in degrees Fahrenheit) |
season | indicates whether the season was Spring, Summer, or Fall |
cloudcover | measure of cloud cover (in oktas) |
precip | measure of precipitation (in inches) |
volume | estimated number of trail users that day (number of breaks recorded) |
weekday | indicator of whether the day was a non-holiday weekday |
How to describe your visualizations
When describing the contents of a visualization, follow the ideas discussed in these resources:
Questions
1. In the rail_trail dataset, how many rows are there? How many columns? Which variables in the dataset are continuous/numerical and which are categorical?
2. Create a histogram of the variable volume using the following code:
   ggplot(data = rail_trail) + geom_histogram(mapping = aes(x = volume))
   Describe the shape and center of the distribution. Afterward, try adjusting the size of the histogram bins by adding the binwidth input. To start with, use binwidth = 21. If you need help with where to place binwidth, read the documentation by running ?geom_histogram in your Console window. Then, find a binwidth that’s too narrow and another one that’s too wide to produce a meaningful histogram.
3. Choosing a proper bin width for a histogram can be tricky, and for that reason it’s preferable to use visualizations that avoid using bin widths when possible. An easy-to-use alternative to the histogram is geom_density, which creates a density plot. Use geom_density to create a density plot of the variable volume.
4. Create a density plot for each of the remaining numerical variables, and describe the shape and center of each distribution. Are there any distributions that are similar in shape to each other?
5. Use geom_point() to create a scatterplot that plots weekday versus season. Why is this plot not useful?
6. Create a geom_count() plot (an alternative to a mosaic plot) using the same variables you considered in question 5:
   ggplot(data = rail_trail) + geom_count(mapping = aes(x = season, y = weekday))
   Which circle in the plot takes up the most area? Explain the meaning of the different size circles in the plot and what information it contains that is missing in the previous scatter plot.
7. Run ?geom_bar in the Console window and read the documentation for geom_bar(), and then look at the entry for it on the ggplot2 cheatsheet. Use geom_bar() to reproduce the following bar chart:
   After reproducing the plot, explain what the height of each bar means.
8. Starting from the code snippet you deduced in question 7, create two more bar charts:
   a. Create a bar chart by supplying the input position = "dodge" to geom_bar().
   b. Create a bar chart by supplying the input position = "fill" to geom_bar().
   After creating the visualizations, describe the feature that position controls.
9. Create a bar chart that maps its aesthetic aes() to precip > 0. Interpret what this bar chart means.
10. Create a scatter plot of volume versus hightemp using geom_point(). Describe any trends that you see.
11. Take the code snippet you wrote for question 10 and map the weekday variable to color. Then create a second plot where, instead of mapping weekday to color, you facet over weekday using either facet_wrap() or facet_grid(). Discuss the advantages and disadvantages to faceting instead of mapping to the color aesthetic. How might the balance change if you had a larger dataset?
12. Take the code snippet that you wrote down in question 11 that faceted over weekday and create a model for each facet panel using geom_smooth(). Discuss the trends in the number of rail trail users that geom_smooth() picks up.
13. Copy the code snippet you deduced in question 12 and use the input se = FALSE for geom_smooth(). What does the se input option for geom_smooth() control?
Cheatsheets
You are encouraged to review and keep the following cheatsheets handy while working on this homework assignment: