Additional Resources:
Run the code below to load the congress
dataframe we will use in
this quiz
load(url('https://dssoc.github.io/datasets/congress.RData'))
1. In your own words, describe what statistical modeling means. When is it used? What does it allow data scientists to do?
2. Create three new variables related to our congress dataset: (a) the
age of the member, (b) the number of committees they are on, and (c) the
percentage of instances where they hold a title in the committees they
belong to (i.e. when the title
entry in the committee membership
dataframe is not empty). You will want to save these new variables for
future problems. Then use the summary
function to create summary
statistics for these new variables.
**3. Create a linear model predicting the number of committees that
members belong to from age
, then create a scatter plot with a linear
trendline. Describe the relationship. What do each of these (the model
summary and the plot) show that you cannot see from the other? **
Note: usually we see the dependent variable (number of committees in this case) on the y-axis and the independent variable on the x-axis.
4. Create a bar graph showing the average number of committees that congress members belong to by gender (i.e. a bar for M and a bar for F) with error bars. What can you see from this visualization? Does there appear to be a significant difference?
Hint: you may want to see geom_errorbar
.
Hint: error bars are usually calculated by taking the average plus and minus the standard deviations.
5. Construct a model predicting the percentage of time that a member holds a title in the committees they belong to from age, gender, and political party. Which variables might be related to holding a title? Try removing and adding different variables. Does changing any of the used variables change your original interpretation?
Note: you may want to save the full model for the next question.
6. Use the model from the previous question to make a scatter plot that includes prediction lines for BOTH F and M Democrats. That is, your plot should include two prediction lines - one for M and one for F, and the visualization (not the model) should only include democrats. This is important because our original model included information about all the variables, but we mainly want to visualize a single relationship, and how it might differ by gender. How do you interpret this plot?
Hint: you may need to follow the examples in the R4DS modeling section
for this (see required readings) - see the modelr
package.
Thanks to Devin Cornell and Taylor Brown for helping to create these quizzes!