Summer Institute in Computational Social Science

Additional Resources:

R for Data Science: Modeling (Chapters 23-25)

Run the code below to load the congress dataframe we will use in this quiz:

load(url('https://dssoc.github.io/datasets/congress.RData'))

Questions

1. In your own words, describe what statistical modeling means. When is it used? What does it allow data scientists to do?

2. Create three new variables related to our congress dataset: (a) the age of the member, (b) the number of committees they are on, and (c) the percentage of instances where they hold a title in the committees they belong to (i.e. when the title entry in the committee membership dataframe is not empty). You will want to save these new variables for future problems. Then use the summary function to create summary statistics for these new variables.
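As a rough illustration of the three derived variables, here is a minimal sketch on made-up data. The data frames `members` and `memberships`, their columns (`id`, `birthdate`, `title`), and the reference date are all hypothetical stand-ins; the real congress dataset may use different names, and missing titles there may be `NA` rather than empty strings.

```r
library(dplyr)

# Hypothetical toy data, NOT the real congress dataset.
members <- data.frame(
  id = c("A", "B", "C"),
  birthdate = as.Date(c("1950-06-01", "1975-01-15", "1982-11-30"))
)
memberships <- data.frame(
  id = c("A", "A", "B", "B", "B", "C"),
  title = c("Chair", "", "", "", "Ranking Member", "")
)

# (a) Age in whole years as of an assumed, fixed reference date.
ref_date <- as.Date("2021-01-01")
days_alive <- as.numeric(difftime(ref_date, members$birthdate, units = "days"))
members$age <- floor(days_alive / 365.25)

# (b) Number of committees and (c) percentage of memberships with a title.
committee_stats <- memberships %>%
  group_by(id) %>%
  summarise(
    n_committees = n(),
    pct_title = 100 * mean(title != ""),
    .groups = "drop"
  )

# Keep the new variables alongside the member-level data for later use.
members <- left_join(members, committee_stats, by = "id")

summary(members[, c("age", "n_committees", "pct_title")])
```

Note that with a left join, a member with no rows in the membership table would end up with `NA` for both committee variables, so you may need to decide how to treat those cases.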

3. Create a linear model predicting the number of committees that members belong to from age, then create a scatter plot with a linear trendline. Describe the relationship. What does each of these (the model summary and the plot) show that you cannot see from the other?

Note: usually we see the dependent variable (number of committees in this case) on the y-axis and the independent variable on the x-axis.
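The convention in the note above can be sketched on simulated data as follows. The data frame `toy` and its columns `age` and `n_committees` are invented for illustration and are not the real quiz variables.

```r
library(ggplot2)

# Simulated stand-in data (hypothetical values, not the congress dataset).
set.seed(1)
toy <- data.frame(age = round(runif(50, 30, 85)))
toy$n_committees <- rpois(50, lambda = 2 + 0.02 * toy$age)

# Dependent variable on the left of ~, independent variable on the right.
fit <- lm(n_committees ~ age, data = toy)
summary(fit)  # coefficients, standard errors, p-values, R-squared

# Scatter plot with the same linear fit drawn as a trendline:
# independent variable on the x-axis, dependent variable on the y-axis.
p <- ggplot(toy, aes(x = age, y = n_committees)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = TRUE)
print(p)
```

The model summary gives numeric estimates and their uncertainty, while the plot shows the spread and shape of the individual points around the line.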

4. Create a bar graph showing the average number of committees that congress members belong to by gender (i.e. a bar for M and a bar for F) with error bars. What can you see from this visualization? Does there appear to be a significant difference?

Hint: you may want to see geom_errorbar.

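Hint: error bars are usually drawn at the average plus and minus a measure of spread, such as the standard deviation. The standard error is a common alternative when the goal is to compare group means.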
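A minimal sketch of the hints above, using a tiny made-up table: compute the group mean and standard deviation first, then pass mean minus and plus the standard deviation to `geom_errorbar`. The data frame `toy` and its columns are hypothetical; the standard error column is included only as an alternative you could swap in.

```r
library(dplyr)
library(ggplot2)

# Hypothetical toy data, NOT the real congress dataset.
toy <- data.frame(
  gender = rep(c("F", "M"), each = 5),
  n_committees = c(2, 3, 4, 3, 3, 1, 2, 2, 3, 2)
)

# One row per group: mean, standard deviation, and standard error.
group_stats <- toy %>%
  group_by(gender) %>%
  summarise(
    avg = mean(n_committees),
    sd_val = sd(n_committees),
    se_val = sd_val / sqrt(n()),
    .groups = "drop"
  )

# Bars show the group means; error bars span mean - sd to mean + sd.
p <- ggplot(group_stats, aes(x = gender, y = avg)) +
  geom_col() +
  geom_errorbar(aes(ymin = avg - sd_val, ymax = avg + sd_val), width = 0.2)
print(p)
```

Replacing `sd_val` with `se_val` in the `geom_errorbar` call gives narrower bars that reflect uncertainty in the mean rather than spread in the data.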

5. Construct a model predicting the percentage of time that a member holds a title in the committees they belong to from age, gender, and political party. Which variables might be related to holding a title? Try removing and adding different variables. Does changing any of the used variables change your original interpretation?

Note: you may want to save the full model for the next question.

6. Use the model from the previous question to make a scatter plot that includes prediction lines for BOTH F and M Democrats. That is, your plot should include two prediction lines, one for M and one for F, and the visualization (not the model) should include only Democrats. This is important because our original model included information about all the variables, but we mainly want to visualize a single relationship and how it might differ by gender. How do you interpret this plot?

Hint: you may need to follow the examples in the R4DS modeling section for this (see required readings) - see the modelr package.
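One way the modelr approach from the hint might look, sketched on simulated data. Everything here is an assumption made for illustration: the data frame `toy`, its columns (`age`, `gender`, `party`, `pct_title`), and the party labels are invented and may not match the congress dataset.

```r
library(ggplot2)
library(modelr)

# Simulated stand-in data (hypothetical, not the congress dataset).
set.seed(2)
n <- 120
toy <- data.frame(
  age = round(runif(n, 30, 85)),
  gender = sample(c("F", "M"), n, replace = TRUE),
  party = sample(c("Democrat", "Republican"), n, replace = TRUE)
)
toy$pct_title <- pmin(100, pmax(0, 10 + 0.5 * toy$age + rnorm(n, sd = 10)))

# The full model is fit on ALL rows, using all three predictors.
full_model <- lm(pct_title ~ age + gender + party, data = toy)

# Only the visualization is restricted to one party.
dems <- subset(toy, party == "Democrat")

# Grid of evenly spaced ages crossed with each gender present in `dems`.
grid <- data_grid(dems, age = seq_range(age, 20), gender, party)
grid <- add_predictions(grid, full_model, var = "pred")

# Points are the observed data; lines are the model's predictions.
p <- ggplot(dems, aes(x = age, y = pct_title, colour = gender)) +
  geom_point(alpha = 0.5) +
  geom_line(data = grid, aes(y = pred))
print(p)
```

Because this sketch's model has no interaction term, the two prediction lines are parallel and differ only by the gender coefficient.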

Thanks to Devin Cornell and Taylor Brown for helping to create these quizzes!