Final project tutorial: GSS 2024
Prerequisite: install required packages
You only need to install these packages once. For your actual final projects, skip this step.
install.packages("labelled")
install.packages("gtsummary") # optional for step 10
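If you are not sure whether a package is already on your computer, a small check can save you from reinstalling it. The following is only a sketch: the helper name missing_packages is made up for this example, and the install call is guarded by interactive() so that it runs only when you execute it yourself in the console.

```r
# Hypothetical helper: return the packages from `pkgs` that are not installed yet
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

needed <- c("tidyverse", "labelled", "gtsummary")
to_install <- missing_packages(needed)

# Install only what is missing, and only in an interactive session
if (interactive() && length(to_install) > 0) {
  install.packages(to_install)
}
```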
Step 1: load required packages
Remember: you need to load the tidyverse package every time you start a new R project. Other applicable packages depend on the project. For your final projects, load the following libraries:
library(tidyverse)
library(labelled)
library(gtsummary) # optional for step 10
Step 2: import the dataset
The code below loads the dataset straight from GitHub. Once it runs, the gss2024 object should appear in the Environment tab. If you work from a downloaded copy instead, set the proper path to the file on your computer.
load(url("https://raw.githubusercontent.com/valeriia-popova/r-survey-analysis/main/gss2024.RData"))
Step 3: identify your variables
To find variables that you will be using in your final project, you need to know their names or labels. You have several ways to get that information.
- Option 1: search in R using the labelled package. In the following code, replace “corrupt” with a keyword of your interest. Check whether the variable has any missing or negative values. Once you’ve identified all of your variables, take note of their names (e.g. educ).
look_for(gss2024, "corrupt")
 pos  variable label                  col_type missing values
 424  world4   world image:man is go~ dbl+lbl  3309    [1] people are good
                                                       [7] people are evil
1164  rotapple r agrees immoral pers~ dbl+lbl  3309    [1] agree strongly
                                                       [2] agree somewhat
                                                       [3] disagree somewhat
                                                       [4] disagree strongly
4492  corrupt1 opinion of corruption~ dbl+lbl  3309    [1] almost none
                                                       [2] a few
                                                       [3] some
                                                       [4] quite a lot
                                                       [5] almost all
4493  corrupt2 opinion of corruption~ dbl+lbl  3309    [1] almost none
                                                       [2] a few
                                                       [3] some
                                                       [4] quite a lot
                                                       [5] almost all
4741  corrupt  must be corrupt to ge~ dbl+lbl  3309    [1] strongly agree
                                                       [2] agree
                                                       [3] neither agree nor~
                                                       [4] disagree
                                                       [5] strongly disagree
5422  corruptn how widespread corrup~ dbl+lbl  1833    [1] hardly anyone is ~
                                                       [2] a small number of~
                                                       [3] a moderate number~
                                                       [4] a lot of people a~
                                                       [5] almost everyone i~
- Option 2: use the GSS Search tool. It is a fantastic resource for browsing variables and their distributions. Adjust the year to limit your search to 2024 only. Check whether the variable has any missing or negative values. Once you’ve identified all of your variables, take note of their names (e.g. educ).
For example, I’ve identified the following three variables that I will use in my analysis. Most likely, your final projects will include more than 3 variables.
corruptn How widespread corruption is in public service in America? (DV)
educ Highest year of school completed (IV)
sex Respondent’s sex (CV)
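Once you have a shortlist like this, the labelled package can also print a variable’s description and its value codes directly. Here is a minimal sketch on a toy data frame; the data and labels are made up for illustration, and with the real data you would pass columns of gss2024 instead.

```r
library(labelled)

# Toy data frame standing in for gss2024 (values and labels are made up)
toy <- data.frame(
  educ = c(12, 16, NA),
  sex  = c(1, 2, 2)
)
var_label(toy$educ) <- "Highest year of school completed"
val_labels(toy$sex) <- c(male = 1, female = 2)

var_label(toy$educ)   # the variable's description
val_labels(toy$sex)   # the codes and what they stand for
```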
If, during your work on your project, you need to refresh your memory of what your variable is about, type its name after ? in R - the Help window will open with all available information about the variable. It’s a great way to see the variable’s categories, wording of the questions, etc.
?educ
Step 4: create a smaller, clean dataset
This dataset will contain only your variables of interest from Step 3. You will work with this dataset from now on, not the main complete dataset. Give your new dataset an intuitive, simple name (mine is named test_project; yours will be different).
You will also need to rename your variables to something intuitive. In the example below, original variable names and the new names are for demonstration only. Yours will be different.
Exception: the line of code weight = wtssps must be included exactly as written. It is responsible for weighting the dataset.
test_project <- gss2024 |>
  select(corruptn, educ, sex, wtssps) |>
  rename(
    corruption = corruptn,
    education = educ,
    sex = sex,
    weight = wtssps
  )
Step 5: check for missing and abnormal values
Before you move on to analysis, you need to check for missing and abnormal values. If you forget to exclude missing values from analysis, your charts will appear with useless categories.
Replace corruption in the code below with the name of your variable, and repeat the step for each of your variables, one by one. Take note of any abnormal or missing values you see.
Note: sometimes in Quarto, the labels aren’t printed. When typing in script, you can see the labels in the console.
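Before going variable by variable, it can help to see the number of missing values in every column at once. This is a small sketch on a toy tibble (the values are invented); on your own data you would start the pipe from your dataset, such as test_project.

```r
library(dplyr)

# Toy data with a few missing values (made up for illustration)
toy <- tibble(
  corruption = c(1, 4, NA, 3, NA),
  education  = c(12, NA, 16, 14, 20)
)

# Count the NAs in every column in one step
na_counts <- toy |>
  summarise(across(everything(), ~ sum(is.na(.x))))

na_counts
```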
For categorical and ordinal variables:
test_project |>
  count(corruption)
# A tibble: 6 × 2
  corruption                                        n
  <dbl+lbl>                                     <int>
1  1 [hardly anyone is involved]                   35
2  2 [a small number of people are involved]      265
3  3 [a moderate number of people are involved]   444
4  4 [a lot of people are involved]               564
5  5 [almost everyone is involved]                168
6 NA                                             1833
barplot(table(test_project$corruption))
For continuous variables:
summary(test_project$education)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
   0.00   12.00   14.00   14.22   16.00   20.00      23
hist(test_project$education)
boxplot(test_project$education)
Step 6: recode the variables if needed
In GSS, most variables appear as numerical (double). If you know your variable is categorical (nominal or ordinal), it is necessary to convert it into a factor in R.
To check the coding of the original variable, type ? before the variable name.
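If Step 5 turned up abnormal codes, such as negative numbers that stand for “not applicable”, one way to handle them is to turn them into NA before modelling. The sketch below uses a plain numeric toy column; the codes -99 and -1 are invented for the example, and your variables may use different codes or none at all.

```r
library(dplyr)

# Toy column with two invented "abnormal" codes (-99 and -1)
toy <- tibble(education = c(12, 16, -99, 14, -1))

# Replace every negative value with NA and keep the rest unchanged
toy <- toy |>
  mutate(education = if_else(education < 0, NA_real_, education))

toy$education
```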
- Example: convert the sex variable into a factor. It helps with interpretation of the regression results.
- Note that I am “renaming” my new recoded variable using the same name. It will quietly overwrite the old variable.
test_project <- test_project |>
  mutate(sex = as_factor(sex))
Step 7: check that recoding was successful
test_project |>
  count(sex)
# A tibble: 3 × 2
  sex        n
  <fct>  <int>
1 male    1467
2 female  1823
3 <NA>      19
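The order of the factor levels matters later: lm() treats the first level as the reference category, which is why the regression output in Step 9 shows a coefficient called sexfemale (female compared with male). If you would rather compare against a different category, base R’s relevel() can change the reference. A minimal sketch on a toy factor:

```r
# Toy factor with the same two levels as the recoded sex variable
sex <- factor(c("male", "female", "female", "male"),
              levels = c("male", "female"))

levels(sex)[1]   # the current reference category: "male"

# Make "female" the reference category instead
sex_releveled <- relevel(sex, ref = "female")

levels(sex_releveled)
```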
Step 8: create a linear model
Replace the example variables below with the names of your variables. Note that DV comes first. Always include the weight variable as is.
Give your model an intuitive name (I am using generic model_test_project; yours will be different).
model_test_project <- lm(corruption ~ education + sex,
                         data = test_project,
                         weights = weight)
Step 9: examine the results of the linear model
summary(model_test_project)
Call:
lm(formula = corruption ~ education + sex, data = test_project,
weights = weight)
Weighted Residuals:
    Min      1Q  Median      3Q     Max
-5.4576 -0.4797 -0.1530  0.6073  3.8088
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  3.57913    0.13764  26.004   <2e-16 ***
education   -0.01665    0.00939  -1.773   0.0764 .
sexfemale    0.05388    0.05125   1.051   0.2933
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.9772 on 1461 degrees of freedom
(1845 observations deleted due to missingness)
Multiple R-squared: 0.002846, Adjusted R-squared: 0.001481
F-statistic: 2.085 on 2 and 1461 DF, p-value: 0.1247
You can stop at this point. If you want to format this regression table in a word processor or Excel, just copy it and paste it in the program where you are writing your final project. If you want to create a pretty table in R and save it as is, continue below.
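To see what the coefficients mean in practice, you can plug values into the fitted equation by hand. The sketch below copies the three estimates from the summary() output in Step 9; the two example respondents (a man with 12 years of education and a woman with 16) are chosen only for illustration.

```r
# Estimates copied from the summary() output in Step 9
intercept   <- 3.57913
b_education <- -0.01665
b_female    <- 0.05388

# Predicted score = intercept + b_education * years + b_female * (1 if female, else 0)
pred_man_12   <- intercept + b_education * 12 + b_female * 0
pred_woman_16 <- intercept + b_education * 16 + b_female * 1

round(pred_man_12, 3)    # 3.379
round(pred_woman_16, 3)  # 3.367
```

Both predictions fall between 3 (“a moderate number”) and 4 (“a lot of people”), and the gap between them is tiny, which is consistent with the small coefficients and their p-values above 0.05.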
Step 10 (optional): create a custom table
model_test_project |>
  tbl_regression(
    label = list(
      education ~ "Years of education",
      sex ~ "Sex"
    )
  ) |>
  add_significance_stars(hide_ci = FALSE, hide_p = FALSE) |>
  bold_labels() |>
  italicize_levels() |>
  modify_caption("**Effect of Education and Sex on Corruption Perception (GSS 2024)**") |>
  modify_footnote(
    estimate ~ "\\*p<0.05; \\**p<0.01; \\***p<0.001"
  )

| Characteristic | Beta¹ | SE | 95% CI | p-value |
|---|---|---|---|---|
| Years of education | -0.02 | 0.009 | -0.04, 0.00 | 0.076 |
| Sex | | | | |
| male | — | — | — | |
| female | 0.05 | 0.051 | -0.05, 0.15 | 0.3 |

¹ \*p<0.05; \*\*p<0.01; \*\*\*p<0.001
Abbreviations: CI = Confidence Interval, SE = Standard Error
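To save a table like this to a file, one possible route is to convert it to a gt table with as_gt() and write it out with gt::gtsave(). The sketch below is an assumption about your setup rather than part of the tutorial’s required steps: it needs the gt package installed, it uses a toy model on the built-in mtcars data in place of model_test_project, and it writes to a temporary folder with a made-up file name.

```r
library(gtsummary)

# Toy model on a built-in dataset, standing in for model_test_project
toy_model <- lm(mpg ~ wt, data = mtcars)
toy_table <- tbl_regression(toy_model)

# Convert to a gt table and write it to an HTML file
# (saved in a temporary folder here; choose your own folder and file name)
out_dir <- tempdir()
toy_table |>
  as_gt() |>
  gt::gtsave(filename = "regression_table.html", path = out_dir)

file.exists(file.path(out_dir, "regression_table.html"))
```

With your own project, you would end the Step 10 pipe with the same two calls, as_gt() followed by gt::gtsave(), and open the saved file in a browser.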