**Additional Resources:**

- R for Data Science: Modeling (Chapters 23-25)

**Run the code below to load the congress dataframe we will use in
this quiz**

```
load(url('https://dssoc.github.io/datasets/congress.RData'))
```

**1. In your own words, describe what statistical modeling means. When
is it used? What does it allow data scientists to do?**

**2. Create three new variables related to our congress dataset: (a) the
age of the member, (b) the number of committees they are on, and (c) the
percentage of instances where they hold a title in the committees they
belong to (i.e. when the title entry in the committee membership
dataframe is not empty). You will want to save these new variables for
future problems. Then use the summary function to create summary
statistics for these new variables.**

**3. Create a linear model predicting the number of committees that
members belong to from `age`

, then create a scatter plot with a linear
trendline. Describe the relationship. What do each of these (the model
summary and the plot) show that you cannot see from the other? **

Note: usually we see the dependent variable (number of committees in this case) on the y-axis and the independent variable on the x-axis.

**4. Create a bar graph showing the average number of committees that
congress members belong to by gender (i.e. a bar for M and a bar for F)
with error bars. What can you see from this visualization? Does there
appear to be a significant difference?**

Hint: you may want to see `geom_errorbar`

.

Hint: error bars are usually calculated by taking the average plus and minus the standard deviations.

**5. Construct a model predicting the percentage of time that a member
holds a title in the committees they belong to from age, gender, and
political party. Which variables might be related to holding a title?
Try removing and adding different variables. Does changing any of the
used variables change your original interpretation?**

Note: you may want to save the full model for the next question.

**6. Use the model from the previous question to make a scatter plot
that includes prediction lines for BOTH F and M Democrats. That is, your
plot should include two prediction lines - one for M and one for F, and
the visualization (not the model) should only include democrats. This is
important because our original model included information about all the
variables, but we mainly want to visualize a single relationship, and
how it might differ by gender. How do you interpret this plot?**

Hint: you may need to follow the examples in the R4DS modeling section
for this (see required readings) - see the `modelr`

package.

Thanks to Devin Cornell and Taylor Brown for helping to create these quizzes!