Final project tutorial: GSS 2024

Prerequisite: install required packages

You only need to install these packages once. For your actual final projects, skip this step.

install.packages("labelled")
install.packages("gtsummary") # optional for step 10

Step 1: load required packages

Remember: you need to load the tidyverse package every time you start a new R session. Other packages depend on the project. For your final projects, load the following libraries:

library(tidyverse)
library(labelled)
library(gtsummary) # optional for step 10

Step 2: import the dataset

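The code below loads the dataset directly from a URL, so you do not need to set a file path. If you saved the file to your computer instead, point load() at that file's path. Once loaded, the dataset should appear in the Environment tab.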

load(url("https://raw.githubusercontent.com/valeriia-popova/r-survey-analysis/main/gss2024.RData"))
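
load() restores objects under the names they were saved with, so it helps to confirm what actually arrived. The sketch below uses a small made-up data frame (toy_gss) saved to a temporary file rather than the real GSS file; only the checking pattern carries over.

```r
# Made-up stand-in for the real dataset, saved the same way an .RData file is.
toy_gss <- data.frame(educ = c(12, 16, NA), sex = c(1, 2, 2))
tmp <- tempfile(fileext = ".RData")
save(toy_gss, file = tmp)
rm(toy_gss)

# load() invisibly returns the names of the objects it restored.
loaded <- load(tmp)
print(loaded)

# Confirm the object exists and inspect its size (rows, columns).
stopifnot(exists("toy_gss"))
print(dim(toy_gss))
```

With the real file, the same check would be exists("gss2024") followed by dim(gss2024).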

Step 3: identify your variables

To find variables that you will be using in your final project, you need to know their names or labels. You have several ways to get that information.

  • Option 1: search in R using the labelled package. In the following code, replace "corrupt" with a keyword of your interest. Note whether the variable has any missing or negative values. Once you’ve identified all of your variables, take note of their names (e.g. educ).
look_for(gss2024, "corrupt")
 pos  variable label                  col_type missing values                
 424  world4   world image:man is go~ dbl+lbl  3309    [1] people are good   
                                                       [7] people are evil   
 1164 rotapple r agrees immoral pers~ dbl+lbl  3309    [1] agree strongly    
                                                       [2] agree somewhat    
                                                       [3] disagree somewhat 
                                                       [4] disagree strongly 
 4492 corrupt1 opinion of corruption~ dbl+lbl  3309    [1] almost none       
                                                       [2] a few             
                                                       [3] some              
                                                       [4] quite a lot       
                                                       [5] almost all        
 4493 corrupt2 opinion of corruption~ dbl+lbl  3309    [1] almost none       
                                                       [2] a few             
                                                       [3] some              
                                                       [4] quite a lot       
                                                       [5] almost all        
 4741 corrupt  must be corrupt to ge~ dbl+lbl  3309    [1] strongly agree    
                                                       [2] agree             
                                                       [3] neither agree nor~
                                                       [4] disagree          
                                                       [5] strongly disagree 
 5422 corruptn how widespread corrup~ dbl+lbl  1833    [1] hardly anyone is ~
                                                       [2] a small number of~
                                                       [3] a moderate number~
                                                       [4] a lot of people a~
                                                       [5] almost everyone i~
  • Option 2: use the GSS Search tool, a fantastic resource for viewing variables and their distributions. Adjust the year to limit your search to 2024 only. Note whether the variable has any missing or negative values. Once you’ve identified all of your variables, take note of their names (e.g. educ).
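
Conceptually, a keyword search like the one in Option 1 matches the keyword against variable names and labels. The base-R sketch below illustrates that idea on a made-up data frame (toy, with invented labels). It is not how look_for() is implemented, and the helper find_vars is hypothetical.

```r
# Made-up data with a "label" attribute on each column (labels are illustrative).
toy <- data.frame(corruptn = c(4, 5, NA), educ = c(12, 16, 14), sex = c(1, 2, 2))
attr(toy$corruptn, "label") <- "how widespread corruption is in public service"
attr(toy$educ, "label") <- "highest year of school completed"
attr(toy$sex, "label") <- "respondent's sex"

# Hypothetical helper: match a keyword against variable names and labels.
find_vars <- function(data, keyword) {
  labels <- vapply(data, function(x) {
    lab <- attr(x, "label")
    if (is.null(lab)) NA_character_ else lab
  }, character(1))
  hit <- grepl(keyword, names(data), ignore.case = TRUE) |
    (!is.na(labels) & grepl(keyword, labels, ignore.case = TRUE))
  data.frame(
    variable = names(data)[hit],
    label = unname(labels[hit]),
    missing = unname(colSums(is.na(data))[hit])
  )
}

result <- find_vars(toy, "corrupt")
print(result)
```

The missing column plays the same role as the one in the look_for() output: it tells you how many respondents have no answer for that variable.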

For example, I’ve identified the following three variables that I will use in my analysis. Most likely, your final projects will include more than three variables.

  • corruptn: How widespread corruption is in public service in America? (DV)

  • educ: Highest year of school completed (IV)

  • sex: Respondent’s sex (CV)

If, while working on your project, you need to refresh your memory of what a variable measures, type its name after ? in R. The Help window will open with all available information about the variable, including its categories and the wording of the question.

?educ

Step 4: create a smaller, clean dataset

This dataset will contain only your variables of interest from Step 3. You will work with this dataset from now on, not the main complete dataset. Give your new dataset an intuitive, simple name (mine is named test_project; yours will be different).

You will also need to rename your variables to something intuitive. In the example below, original variable names and the new names are for demonstration only. Yours will be different.

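Exception: the line of code weight = wtssps must be included exactly as written, and wtssps must appear in select(). It keeps the survey weight in your dataset under the name weight, which the model in Step 8 uses to weight the data.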

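test_project <- gss2024 |>
  # keep only the variables of interest plus the survey weight
  select(corruptn, educ, sex, wtssps) |>
  # pattern: new_name = old_name (sex keeps its name, so it is not listed)
  rename(
    corruption = corruptn,
    education = educ,
    weight = wtssps
  )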
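
A typo in a variable name makes select() stop with an error, so it can be worth checking the names first. This is a minimal sketch on a made-up stand-in for the full dataset (toy_gss); the misspelled name "eduu" is deliberate.

```r
# Made-up stand-in for gss2024 with a few of its column names.
toy_gss <- data.frame(
  corruptn = c(4, NA),
  educ = c(12, 16),
  sex = c(1, 2),
  wtssps = c(0.8, 1.2)
)

# Names you intend to select; "eduu" is a deliberate typo.
wanted <- c("corruptn", "educ", "sex", "wtssps", "eduu")

# Anything listed here does not exist in the dataset and needs fixing.
not_found <- setdiff(wanted, names(toy_gss))
print(not_found)
```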

Step 5: check for missing and abnormal values

Before you move on to analysis, you need to check for missing and abnormal values. If you forget to exclude missing values from the analysis, your charts can show categories that carry no information.

Replace corruption in the code below with the name of your variable. Repeat this step for each of your variables, one at a time. Take note of which abnormal and missing values you see.

Note: sometimes in Quarto, the labels aren’t printed. If you run the code from a script instead, you can see the labels in the console.

For categorical and ordinal variables:

test_project |> 
  count(corruption)
# A tibble: 6 × 2
  corruption                                        n
  <dbl+lbl>                                     <int>
1  1 [hardly anyone is involved]                   35
2  2 [a small number of people are involved]      265
3  3 [a moderate number of people are involved]   444
4  4 [a lot of people are involved]               564
5  5 [almost everyone is involved]                168
6 NA                                             1833
barplot(table(test_project$corruption)) 

For continuous variables:

summary(test_project$education)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
   0.00   12.00   14.00   14.22   16.00   20.00      23 
hist(test_project$education)

boxplot(test_project$education)
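
If you have several variables, a compact overview of missing and negative values can complement the one-by-one checks. The sketch below runs on made-up data (toy_project), where -1 stands in for an abnormal code; the values are invented for illustration.

```r
# Made-up data: NA marks a missing answer, -1 stands in for an abnormal code.
toy_project <- data.frame(
  corruption = c(4, 5, NA, 2, -1),
  education  = c(12, 16, 14, NA, 20)
)

# One row per variable: how many values are missing, how many are negative.
check <- data.frame(
  variable   = names(toy_project),
  n_missing  = unname(colSums(is.na(toy_project))),
  n_negative = unname(colSums(toy_project < 0, na.rm = TRUE))
)
print(check)
```

This only works as written when every column is numeric, so run it before converting anything to a factor.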

Step 6: recode the variables if needed

In GSS, most variables appear as numerical (double). If you know your variable is categorical (nominal or ordinal), it is necessary to convert it into a factor in R.

To check the coding of the original variable, type ? before the variable name.

  • Example: convert the sex variable into a factor. Doing so makes the regression results easier to interpret.
  • Note that I am giving the recoded variable the same name as the original. This silently overwrites the old variable.
test_project <- test_project |> 
  mutate(sex = as_factor(sex))
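
To see what the conversion does, here is a base-R sketch on a made-up vector of codes. It assumes the coding 1 = male and 2 = female for illustration, and it spells the labels out by hand instead of reading them from the dataset as as_factor() does.

```r
# Assumed coding for illustration: 1 = male, 2 = female.
sex_codes <- c(1, 2, 2, NA, 1)

# Turn the numeric codes into a factor with readable category names.
sex_factor <- factor(sex_codes, levels = c(1, 2), labels = c("male", "female"))

# The first level ("male") becomes the reference category in lm().
print(levels(sex_factor))

# Missing answers stay missing after the conversion.
print(table(sex_factor, useNA = "ifany"))
```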

Step 7: check that recoding was successful

test_project |> 
  count(sex)
# A tibble: 3 × 2
  sex        n
  <fct>  <int>
1 male    1467
2 female  1823
3 <NA>      19

Step 8: create a linear model

Replace the example variables below with the names of your variables. Note that the DV goes first, to the left of ~. Always include the weight variable as is.

Give your model an intuitive name (I am using the generic model_test_project; yours will be different).

model_test_project <- lm(corruption ~ education + sex, 
                         data = test_project, 
                         weights = weight)
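
If you want to try the model syntax without the GSS data, the sketch below fits the same kind of weighted model on simulated data. Everything in toy_project is randomly generated, so the estimates mean nothing; the point is how the formula maps to coefficient names.

```r
set.seed(1)
n <- 200

# Simulated stand-in for the cleaned dataset (not real GSS values).
toy_project <- data.frame(
  education = sample(8:20, n, replace = TRUE),
  sex = factor(sample(c("male", "female"), n, replace = TRUE),
               levels = c("male", "female")),
  weight = runif(n, 0.5, 2)
)
toy_project$corruption <- 4 - 0.05 * toy_project$education + rnorm(n)

# DV on the left of ~, predictors on the right, weights passed separately.
toy_model <- lm(corruption ~ education + sex,
                data = toy_project,
                weights = weight)

# A factor predictor gets one coefficient per non-reference level.
print(names(coef(toy_model)))
```

Because male is the first level of the factor, the model reports a single coefficient named sexfemale, which compares women with men.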

Step 9: examine the results of the linear model

summary(model_test_project)

Call:
lm(formula = corruption ~ education + sex, data = test_project, 
    weights = weight)

Weighted Residuals:
    Min      1Q  Median      3Q     Max 
-5.4576 -0.4797 -0.1530  0.6073  3.8088 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  3.57913    0.13764  26.004   <2e-16 ***
education   -0.01665    0.00939  -1.773   0.0764 .  
sexfemale    0.05388    0.05125   1.051   0.2933    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.9772 on 1461 degrees of freedom
  (1845 observations deleted due to missingness)
Multiple R-squared:  0.002846,  Adjusted R-squared:  0.001481 
F-statistic: 2.085 on 2 and 1461 DF,  p-value: 0.1247
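
To read the coefficients, it can help to plug values into the fitted equation. The sketch below copies the estimates from the output above and computes the predicted corruption score for a hypothetical female respondent with 16 years of education; the respondent profile is an assumption chosen for illustration.

```r
# Estimates copied from the summary() output.
b_intercept <- 3.57913
b_education <- -0.01665
b_female    <- 0.05388

# Hypothetical respondent: 16 years of education, female (sexfemale = 1).
predicted <- b_intercept + b_education * 16 + b_female * 1
print(round(predicted, 2))
```

The result is about 3.37 on the 1 to 5 scale, where 1 means hardly anyone is involved and 5 means almost everyone is involved.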

You can stop at this point. If you want to format this regression table in a word processor or Excel, just copy it and paste it into the program where you are writing your final project. If you want to create a pretty table in R and save it as is, continue below.
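
If copying from the console scrambles the columns, one alternative is to write the coefficients to a CSV file that Excel can open. This is a sketch on a simulated model (toy_model), not the tutorial's model; the column names in toy_table and the temporary file location are arbitrary choices.

```r
set.seed(2)

# Simulated data and model, used only to have something to export.
toy <- data.frame(x = rnorm(50))
toy$y <- 1 + 0.5 * toy$x + rnorm(50)
toy_model <- lm(y ~ x, data = toy)

# Pull estimates, standard errors, p-values, and 95% confidence intervals.
est <- summary(toy_model)$coefficients
ci <- confint(toy_model)

toy_table <- data.frame(
  term    = rownames(est),
  beta    = round(est[, "Estimate"], 2),
  se      = round(est[, "Std. Error"], 3),
  ci_low  = round(ci[, 1], 2),
  ci_high = round(ci[, 2], 2),
  p_value = signif(est[, "Pr(>|t|)"], 2),
  row.names = NULL
)

# Write to a temporary file; use a path of your own to keep the file.
out_file <- tempfile(fileext = ".csv")
write.csv(toy_table, out_file, row.names = FALSE)
print(toy_table)
```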

Step 10 (optional): create a custom table

model_test_project |> 
  tbl_regression(
    label = list(
      education ~ "Years of education",
      sex ~ "Sex"
    )
  ) |> 
  add_significance_stars(hide_ci = FALSE, hide_p = FALSE) |> 
  bold_labels() |> 
  italicize_levels() |> 
  modify_caption("**Effect of Education and Sex on Corruption Perception (GSS 2024)**") |>
  modify_footnote(
    estimate ~ "\\*p<0.05; \\**p<0.01; \\***p<0.001"
  )
Effect of Education and Sex on Corruption Perception (GSS 2024)

Characteristic        Beta¹    SE       95% CI         p-value
Years of education    -0.02    0.009    -0.04, 0.00    0.076
Sex
    male
    female            0.05     0.051    -0.05, 0.15    0.3

¹ *p<0.05; **p<0.01; ***p<0.001
Abbreviations: CI = Confidence Interval, SE = Standard Error