Final project tutorial: ANES 2020
Prerequisite: install required packages
You only need to install these packages once. For your actual final projects, skip this step.
install.packages(c("devtools", "labelled"))
devtools::install_github("jamesmartherus/anesr")
install.packages("gtsummary") # optional for step 12
Step 1: load required packages
Remember: you need to load the tidyverse package every time you start a new R project. Other applicable packages depend on the project. For your final projects, load the following libraries:
library(tidyverse)
library(labelled)
library(anesr)
library(gtsummary) # optional for step 12
Step 2: import the dataset
Run the line below. The dataset should then appear in the Environment tab.
data(timeseries_2020)
Step 3: identify your variables
To find variables that you will be using in your final project, you need to know their names or labels. You have several ways to get that information.
- Option 1 (not recommended): search directly within the anesr package in R. Run the code below; in the tab that opens, use the search box in the upper right corner and type any keyword of interest. You can click on any variable to see its categories. Check whether the variable has any missing or negative values. Once you’ve identified all of your variables, take note of their names (e.g. v1408a).
data(timeseries_cum_doc)
view(timeseries_cum_doc)
- Option 2 (recommended): search in R using the labelled package. In the following code, replace "abortion" with a keyword of interest. Check whether the variable has any missing or negative values. Once you’ve identified all of your variables, take note of their names (e.g. v1408a).
look_for(timeseries_2020, "abortion")
pos variable label                   col_type missing values
317 V201336  PRE: STD Abortion: sel~ dbl+lbl  0       [-9] -9. Refused
                                                      [-8] -8. Don't know
                                                      [1] 1. By law, abortio~
                                                      [2] 2. The law should ~
                                                      [3] 3. The law should ~
                                                      [4] 4. By law, a woman~
                                                      [5] 5. Other {SPECIFY}
318 V201336z PRE: STD Abortion: sel~ dbl+lbl  0       [-2] -2. Data will be ~
319 V201337  PRE: Importance of abo~ dbl+lbl  0       [-9] -9. Refused
                                                      [-8] -8. Don't know
                                                      [1] 1. Not at all impo~
                                                      [2] 2. Not too importa~
                                                      [3] 3. Somewhat import~
                                                      [4] 4. Very important
                                                      [5] 5. Extremely impor~
320 V201338  PRE: STD Abortion: Dem~ dbl+lbl  0       [-9] -9. Refused
                                                      [-8] -8. Don't know
                                                      [1] 1. By law, abortio~
                                                      [2] 2. The law should ~
                                                      [3] 3. The law should ~
                                                      [4] 4. By law, a woman~
321 V201339  PRE: STD Abortion: Rep~ dbl+lbl  0       [-9] -9. Refused
                                                      [-8] -8. Don't know
                                                      [1] 1. By law, abortio~
                                                      [2] 2. The law should ~
                                                      [3] 3. The law should ~
                                                      [4] 4. By law, a woman~
322 V201340  PRE: Abortion rights S~ dbl+lbl  0       [-9] -9. Refused
                                                      [-8] -8. Don't know
                                                      [1] 1. Pleased
                                                      [2] 2. Upset
                                                      [3] 3. Neither pleased~
323 V201341  PRE: Abortion rights S~ dbl+lbl  0       [-9] -9. Refused
                                                      [-1] -1. Inapplicable
                                                      [1] 1. Extremely
                                                      [2] 2. Moderately
                                                      [3] 3. A little
324 V201342x PRE: SUMMARY: Abortion~ dbl+lbl  0       [-2] -2. DK/RF in V201~
                                                      [1] 1. Extremely pleas~
                                                      [2] 2. Moderately plea~
                                                      [3] 3. A little pleased
                                                      [4] 4. Neither pleased~
                                                      [5] 5. A little upset
                                                      [6] 6. Moderately upset
                                                      [7] 7. Extremely upset
- Option 3 (recommended): use the ANES Search tool, a fantastic resource for viewing variables and their distributions. Adjust the toggle to limit your search to 2020 only. Check whether the variable has any missing or negative values. If you need the exact wording of any survey question, find it in the ANES pdf. Once you’ve identified all of your variables, take note of their names (e.g. v1408a).
For example, I’ve identified the following three variables that I will use in my analysis. Most likely, your final projects will include more than 3 variables.
V201380 Has corruption in government increased, decreased, or stayed the same since Donald Trump became president? (DV)
V201511x Respondent 5 Category level of education (IV)
V201600 What is your (R) sex? (CV)
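Once you have a candidate variable name, it can help to print that one variable's label and value codes before moving on. The sketch below does this on a small made-up labelled vector (the values and labels are illustrative assumptions, not ANES data), using `var_label()` and `val_labels()` from the labelled package. With the real data you would pass a column such as `timeseries_2020$V201380` instead of `toy_var`.

```r
library(labelled)

# Toy stand-in for one ANES column (made-up values, illustrative labels)
toy_var <- labelled(
  c(1, 2, 3, -9, 1),
  labels = c("Refused" = -9, "Increased" = 1,
             "Decreased" = 2, "Stayed the same" = 3),
  label = "Toy corruption item"
)

# Question-level label attached to the variable
var_label(toy_var)

# Named vector of value codes and their labels
codes <- val_labels(toy_var)
codes

# Codes below zero are the ones that will need to become NA later
negative_codes <- codes[codes < 0]
negative_codes
```
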
Step 4: create a smaller, clean dataset
This dataset will contain only your variables of interest from Step 3. You will work with this dataset from now on, not the main complete dataset. Give your new dataset an intuitive simple name (mine is named test_project; yours will be different).
You will also need to rename your variables to something intuitive. In the example below, original variable names and the new names are for demonstration only. Yours will be different.
Exception: the line of code weight = V200010a must be included exactly as written. It is responsible for weighting the dataset.
test_project <- timeseries_2020 |>
  select(V201380, V201511x, V201600, V200010a) |>
  rename(
    corruption = V201380,
    education = V201511x,
    sex = V201600,
    weight = V200010a
  )
Step 5: check for missing and abnormal values
Before you move on to analysis, you need to check for abnormal values. In ANES, invalid answers are coded as negative values (e.g. -9 for refused and -8 for don’t know). It is very important to convert them into missing values; otherwise, R will treat those negative codes as real numbers in every calculation, which makes your analysis statistically invalid.
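To see why this matters, here is a minimal sketch with made-up answers (not ANES data) on a 1 to 3 scale, where -9 and -8 stand in for invalid responses. It uses only base R and compares the mean with and without the negative codes.

```r
# Toy answers on a 1-3 scale; -9 and -8 mimic "Refused" and "Don't know"
answers <- c(1, 3, 2, -9, 3, -8, 1)

# Left as is, the negative codes are averaged in like real answers
mean_with_codes <- mean(answers)

# After converting negative codes to NA, only valid answers are used
cleaned <- ifelse(answers < 0, NA, answers)
mean_valid <- mean(cleaned, na.rm = TRUE)

mean_with_codes  # -1
mean_valid       # 2
```

The first mean is not even a possible answer on the scale, which is the kind of distortion the steps in this tutorial are meant to prevent.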
Replace corruption in the code below with the name of your variable. Repeat the same step for each of your variables, one by one. Take note of which abnormal values you see.
For categorical variables:
test_project |>
  count(corruption)
# A tibble: 5 × 2
  corruption                  n
  <dbl+lbl>               <int>
1 -9 [-9. Refused]           67
2 -8 [-8. Don't know]        11
3  1 [1. Increased]        4669
4  2 [2. Decreased]        1162
5  3 [3. Stayed the same]  2371
For continuous variables (two options):
summary(test_project$corruption)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  -9.00    1.00    1.00    1.62    3.00    3.00
ggplot(test_project, aes(x = corruption)) +
  geom_histogram()
Step 6: get rid of abnormal values
Once you’ve identified abnormal values, turn them into NAs (missing values).
Notice that I assigned the result back to the same dataset name. This quietly overwrites your dataset with a version that has NAs in place of the abnormal values, which keeps your workspace clean. If you feel uneasy about overwriting, give the result a new name instead; it will appear in the environment as a separate dataset. In that case, work with the new dataset from now on and be careful not to confuse the two names.
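The difference between overwriting and saving under a new name can be sketched on a toy data frame. The column values and the names `toy` and `toy_clean` are made up for illustration, and the `if_else(.x < 0, NA, .x)` form assumes a reasonably recent dplyr (1.1.0 or later), where a plain `NA` is accepted.

```r
library(dplyr)

# Toy data with made-up values; negative numbers mimic ANES invalid codes
toy <- tibble(
  corruption = c(1, -9, 3, 2),
  education  = c(5, 2, -8, 4)
)

# Saving under a new name leaves the original object untouched
toy_clean <- toy |>
  mutate(across(everything(), ~ if_else(.x < 0, NA, .x)))

toy        # still contains -9 and -8
toy_clean  # negative codes are now NA
```

Assigning the result back to `toy` instead of `toy_clean` would give the overwriting behavior described above.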
test_project <- test_project |>
  mutate(across(everything(), ~if_else(.x < 0, NA, .x)))
Step 7: verify that abnormal values are gone
If you’ve done everything correctly, abnormal values should be replaced with NAs. Those NAs will be excluded from analysis automatically and won’t skew your results.
test_project |>
  count(education)
# A tibble: 6 × 2
  education                                                n
  <dbl+lbl>                                            <int>
1  1 [1. Less than high school credential]               376
2  2 [2. High school credential]                        1336
3  3 [3. Some post-high school, no bachelor's degree]   2790
4  4 [4. Bachelor's degree]                             2055
5  5 [5. Graduate degree]                               1592
6 NA                                                     131
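Instead of counting each variable separately, you could check all columns at once. The following is a minimal sketch on a made-up cleaned data frame (`toy_clean` and its values are hypothetical); it counts the NAs per column and confirms that no negative values remain.

```r
library(dplyr)

# Toy cleaned data (made-up values) standing in for your own dataset
toy_clean <- tibble(
  corruption = c(1, NA, 3, 2, NA),
  education  = c(5, 2, NA, 4, 1),
  weight     = c(0.8, 1.2, 1.0, 0.9, 1.1)
)

# Number of missing values in each column
na_counts <- toy_clean |>
  summarise(across(everything(), ~ sum(is.na(.x))))
na_counts

# Smallest remaining value in each column; none should be negative
min_values <- toy_clean |>
  summarise(across(everything(), ~ min(.x, na.rm = TRUE)))
min_values
```
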
Step 8: recode the variables if needed
In ANES, most variables appear as numerical (double). In most cases, it’s not a big deal. But in some cases, it is necessary to convert them into categorical variables (called factors in R). In other cases, the order of a variable’s categories is incorrect, and you will need to change that.
To check the coding of the original variables, your best bet is to examine the original variables via the ANES Search tool or ANES pdf.
- Example 1: convert the sex variable into a factor. It helps with interpretation of the regression results.
test_project <- test_project |>
  mutate(sex = as_factor(sex))
- Example 2: recode the corruption variable’s categories. The original variable’s categories come as:
1. Increased
2. Decreased
3. Stayed the same
This is an ordinal variable whose categories reflect a ranking, but the original codes are not in rank order. We want to reorder the categories so that higher values unambiguously indicate more corruption. In other words, we are moving from less corruption to more corruption.
Note that I am saving the recoded variable under the same name. This quietly overwrites the old variable.
test_project <- test_project |>
  mutate(
    corruption = case_match(corruption,
      2 ~ 1, # Old Decreased (2) becomes 1
      3 ~ 2, # Old Stayed same (3) becomes 2
      1 ~ 3, # Old Increased (1) becomes 3
      .default = NA)
  ) |>
  set_value_labels(corruption = c("Decreased" = 1,
                                  "Stayed the same" = 2,
                                  "Increased" = 3))
Step 9: check that recoding was successful
Note: in Quarto, the labels aren’t printed. When typing in script, you can see the labels in the console.
test_project |>
  count(corruption)
# A tibble: 4 × 2
  corruption               n
  <dbl+lbl>            <int>
1  1 [Decreased]        1162
2  2 [Stayed the same]  2371
3  3 [Increased]        4669
4 NA                      78
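A stricter check is possible if you keep the old codes next to the new ones and cross-tabulate them. The sketch below does that on made-up data; the column names `corruption_old` and `corruption_new` are hypothetical, and unlike the tutorial code it writes the recode to a new column so both versions can be compared.

```r
library(dplyr)

# Toy data with the original coding: 1 = Increased, 2 = Decreased, 3 = Stayed the same
toy <- tibble(corruption_old = c(1, 2, 3, 1, NA, 2))

# Same mapping as in the recoding step, written to a new column for comparison
toy <- toy |>
  mutate(
    corruption_new = case_match(corruption_old,
      2 ~ 1,
      3 ~ 2,
      1 ~ 3
    )
  )

# Each old code should pair with exactly one new code
check <- toy |>
  count(corruption_old, corruption_new)
check
```

If any old code appears with more than one new code, or an unexpected pairing shows up, the recode needs another look.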
Step 10: create a linear model
Replace the example variables below with the names of your variables. Note that DV comes first. Always include the weight variable as is.
Give your model an intuitive name (I am using generic model_test_project; yours will be different).
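To build intuition for what the `weights` argument does, consider the simplest possible model: one with an intercept only. In the base R sketch below (made-up numbers, not ANES data), the unweighted intercept equals the ordinary mean, and the weighted intercept equals the weighted mean.

```r
# Toy outcome and made-up survey weights
y <- c(1, 2, 3, 3, 2)
w <- c(0.5, 1.5, 1.0, 2.0, 1.0)

# Intercept-only models: the intercept is the (weighted) mean of y
unweighted <- coef(lm(y ~ 1))[["(Intercept)"]]
weighted   <- coef(lm(y ~ 1, weights = w))[["(Intercept)"]]

all.equal(unweighted, mean(y))            # TRUE
all.equal(weighted, weighted.mean(y, w))  # TRUE
```

Respondents with larger weights pull the estimate toward their answers, which is how the weight variable adjusts the sample in the full model.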
model_test_project <- lm(corruption ~ education + sex,
                         data = test_project,
                         weights = weight)
Step 11: examine the results of the linear model
summary(model_test_project)

Call:
lm(formula = corruption ~ education + sex, data = test_project,
    weights = weight)

Weighted Residuals:
    Min      1Q  Median      3Q     Max
-3.4684 -0.4057  0.2624  0.4858  1.7556

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)   2.118688   0.024405  86.814  < 2e-16 ***
education     0.085978   0.007038  12.215  < 2e-16 ***
sex2. Female  0.041719   0.016176   2.579  0.00992 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.7257 on 8034 degrees of freedom
  (243 observations deleted due to missingness)
Multiple R-squared:  0.01928, Adjusted R-squared:  0.01904
F-statistic: 78.98 on 2 and 8034 DF,  p-value: < 2.2e-16
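If you need individual numbers from the output rather than the whole printout, they can be pulled out of the fitted model. This is a minimal base R sketch on a made-up data frame (`toy`, `toy_model`, and the variables `y`, `x`, `g` are hypothetical); with your own model you would substitute its name.

```r
# Made-up toy data: outcome y, numeric predictor x, two-level factor g
toy <- data.frame(
  y = c(1, 2, 2, 3, 3, 1, 2, 3),
  x = c(1, 2, 3, 4, 5, 1, 3, 5),
  g = factor(c("a", "b", "a", "b", "a", "b", "a", "b"))
)

toy_model <- lm(y ~ x + g, data = toy)

# Coefficient table as a matrix: estimates, standard errors, t values, p-values
coef_table <- coef(summary(toy_model))
coef_table

# A single estimate and its p-value, selected by row and column name
coef_table["x", "Estimate"]
coef_table["x", "Pr(>|t|)"]

# 95% confidence intervals for all coefficients
ci <- confint(toy_model)
ci
```
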
You can stop at this point. If you want to format this regression table in a word processor or Excel, just copy it and paste it into the program where you are writing your final project. If you want to create a polished table in R and save it as is, continue below; this is the optional step 12, which uses the gtsummary package.
model_test_project |>
  tbl_regression(
    exponentiate = FALSE, # Set to TRUE for logistic regression (odds ratios)
    label = list(
      education ~ "Education Level",
      sex ~ "Gender"
    )
  ) |>
  add_significance_stars(
    hide_ci = FALSE, # Keeps the confidence interval
    hide_p = FALSE   # Keeps the p-value column alongside the stars
  ) |>
  bold_labels() |>      # Makes variable names bold
  italicize_levels() |> # Makes the categories italic
  modify_caption("**Effect of Education and Sex on Corruption Perception**") |>
  modify_footnote(
    estimate ~ "\\*p<0.05; \\**p<0.01; \\***p<0.001"
  )

| Characteristic | Beta¹ | SE | 95% CI | p-value |
|---|---|---|---|---|
| Education Level | 0.09*** | 0.007 | 0.07, 0.10 | <0.001 |
| Gender | | | | |
| 1. Male | — | — | — | |
| 2. Female | 0.04** | 0.016 | 0.01, 0.07 | 0.010 |

¹ *p<0.05; **p<0.01; ***p<0.001
Abbreviations: CI = Confidence Interval, SE = Standard Error
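The tutorial does not show the saving step itself. One possible approach, sketched here under assumptions, is to convert the gtsummary table to a gt table with `as_gt()` and write it to a file with `gt::gtsave()`. The toy model, the object names, and the temporary HTML file are all illustrative; it also assumes the gt package and the packages gtsummary needs for regression tables are installed.

```r
library(gtsummary)

# Made-up toy data and model standing in for your own
toy <- data.frame(
  y = c(1, 2, 2, 3, 3, 1, 2, 3),
  x = c(1, 2, 3, 4, 5, 1, 3, 5)
)
toy_model <- lm(y ~ x, data = toy)

# Temporary file for the sketch; use a path in your project folder instead
out_file <- tempfile(fileext = ".html")

# Build the table, convert it to a gt object, and save it
toy_model |>
  tbl_regression() |>
  as_gt() |>
  gt::gtsave(filename = out_file)

file.exists(out_file)
```

The file extension determines the output format; the `gtsave()` help page lists which extensions your installed version supports.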