Central Limit Theorem

Author

Jeremy Rappel, Carl F. Falk, Mira Saad, Jens Kreitewolf

Published

March 16, 2026

#| '!! shinylive warning !!': |
#|   shinylive does not work in self-contained HTML documents.
#|   Please set `embed-resources: false` in your metadata.
#| standalone: true
#| viewerHeight: 1200

######################################################################
## Interactive Statistics Project - Module 1: Central Limit Theorem
##
## We thank the Association for Psychological Science for a small
## teaching grant (Fund for Teaching and Public Understanding of
## Psychological Science) to help in the development of this software.
##
## Copyright 2025-2026 Jeremy Rappel, Carl F. Falk, Mira Saad &
##  Jens Kreitewolf
##
## This program is free software: you can redistribute it and/or
## modify it under the terms of the GNU Affero General Public License as
## published by the Free Software Foundation, either version 3 of
## the License, or (at your option) any later version.
##
## This program is distributed in the hope that it will be useful,
## but WITHOUT ANY WARRANTY; without even the implied warranty of
## MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
## GNU Affero General Public License for more details.
## <http://www.gnu.org/licenses/>
##


library(tibble) # CFF: this one also apparently required for shinylive to work
library(shiny)
library(munsell) # was included due to bug in getting ggplot2 to work with shinylive; still required for some reason
library(bslib) # For bootstrap UIs
library(bsicons) # for tooltip icons
library(ggplot2) #include.only = c("select", "mutate"))


infopop <- tooltip(
  bs_icon("info-circle"),
  "This is the distribution of all scores in the population.",
  placement="right"
)


infosamp <- tooltip(
  bs_icon("info-circle"),
  "These are the scores from a single sample from the population. Every dot corresponds to one observation.",
  placement="right"
)

infosampdist <- tooltip(
  bs_icon("info-circle"),
  "Every time we draw a sample, the mean of that sample counts as one observation in the sampling distribution.",
  placement="right"
)

# CFF population distribution options
popopts <- popover(
  bs_icon("gear"),  # Gear icon
  selectInput(
    "distType", "Distribution Type", c("Uniform","Normal","Chi-Squared"),
    selected = "Normal"
  ),
  conditionalPanel(
    condition = "input.distType == 'Normal'", #This is javascript, thus the . notation
    sliderInput("mean", "μ", min = 1, max = 10, value = 5, step=1),
    sliderInput("sd", "SD", min = 1, max = 10, value = 1, step=1)
  ), #end conditionalPanel()
  
  conditionalPanel(
    condition = "input.distType == 'Uniform'",
    sliderInput("min_limit", "Minimum", min = 1, max = 10, value = 1, step=1), 
    sliderInput("max_limit", "Maximum", min = 1, max = 10, value = 5, step=1)
  ), #end conditionalPanel()
  conditionalPanel(
    condition = "input.distType == 'Chi-Squared'",
    sliderInput("df", "Degrees of Freedom", min = 1, max = 10, value = 5, step=1)
  ), #end conditionalPanel()
  checkboxInput(
    "yaxis", "Toggle Y-Axis"
  ),
  placement="right"
)

#Extra settings (skewness, kurtosis) for sampling distribution pllot
popopts_samplingdist <- popover(
  bs_icon("gear"),  # Gear icon
  checkboxInput("overlay", "Toggle Normal Overlay"),  
  checkboxInput(
    "skew", "Skewness & Kurtosis"
  ),  
  sliderInput("bins",
              "Number of bins for histogram:",
              min = 30,
              max = 200,
              value = 75),
  placement="right"
)

popopts_samplingplot<- popover(
  bs_icon("gear"),  # Gear icon
  checkboxInput(
    "se", "Estimated Standard Error"
  ),
  actionButton("prevSamp","Previous Sample"), actionButton("nextSamp","Next Sample"),
  placement="right"
)

#instead of a bloated ui object, here we're creating a list of each card to inject later
cards <- list(
  navset_card_underline(
    title = card_header("Population Distribution", infopop, popopts),
    full_screen=TRUE,    
    height=850,
    nav_panel("Graph",
      plotOutput("popPlot")#,
      #min_height=250
    )
  ),
  navset_card_underline(
    title= card_header("Sample Values",infosamp, popopts_samplingplot), 
    full_screen=T,
    height=850, #JR: Fixing this to minimize the plots looking squished; may need a more dynamic solution
    nav_panel("Graph", plotOutput("sampleValues")),
    nav_panel("Table", tableOutput("sampleTable")),
    selected = "Graph"
  ),
  navset_card_underline(
    title= card_header("Sampling Distribution",infosampdist,popopts_samplingdist), 
    full_screen=T,
    height=850, #JR: Fixing this to minimize the plots looking squished; may need a more dynamic solution
    nav_panel("Graph", plotOutput("sampPlot")),
    nav_panel("Table", tableOutput("sampleValuesTable")),
    selected = "Graph"
  )
)

# Define UI for application that draws a histogram
ui <- page_sidebar(
  #Creates a sidebar containing option toggles
  title = div(h5('Sampling and Sampling Distributions', style="margin: 0;"), h6('The Central Limit Theorem', style="margin: 0;")),
  theme = bs_theme(version=5, preset="lumen"),
  sidebar = sidebar(
    fluidRow(actionButton("plus1", "+1 sample"),
             actionButton("plus5", "+5 samples"),
             actionButton("plus1000", "+ 1000 samples")
             ),
    fluidRow(actionButton("reset","Reset"),
             sliderInput("sampSize", "Observations per Sample:", min=1, max=100, value=30, step=1)), #JR: Added button to reset the app
  ),
  layout_columns(
    col_widths = c(-1, 7, -1),
    cards[[1]]#, #JR: The population distribution
  ),
  layout_columns(
    col_widths = c(-1, 7, -1),
    cards[[2]]), #JR: The sample distribution
  layout_columns(
    col_widths = c(-1, 7, -1),
    cards[[3]]) #JR: The sampling distribution
  #layout_columns(col_widths = c(-2, 8),)) #JR: Combined sample table and graph
)

# Define server logic required to draw a histogram
server <- function(input, output, session) { # JR: Added "session" to get the reset button to work; also necessary if we add introJS functionality
  observeEvent(input$reset,
               {session$reload()}) #JR: Restarts the session if user presses the "Reset" button

    #################################
   # Population Distribution Values#
  #################################
  mu <- reactive({as.numeric(input$mean)}) #Remember, these values are stored as functions
  stdev <- reactive({input$sd})
  min <- reactive({as.numeric(input$min_limit)})
  max <- reactive({as.numeric(input$max_limit)})
  distType <- reactive({input$distType})
  df <- reactive({as.numeric(input$df)}) #JR: Updated to make sure it's a numeric value
  
  
    #############################
   # Sample Distribution Values#
  #############################
  N <- 10000 # CFF hardcoded total number of possible samples
  obs <- reactive({input$sampSize})
  se <- reactive({stdev()/sqrt(obs())}) # CFF - do we use this ever?
  
  whichsamp <- reactiveValues(i = 0)
  index <- reactiveValues(i=0)

  #JR: Moved these from the Population Plot object so we can access the same values for the sample plot
  #FIXME: Looks kinda weird with for the Chi-Squared distribution
  lower <- reactive({switch(distType(),
                  "Normal" =mu()-3*stdev(),
                  "Chi-Squared" =0,
                  "Uniform" = min())})
  upper <- reactive({switch(distType(),
                  "Normal" =mu()+3*stdev(),
                  "Chi-Squared" =(df()*2)+5, #JR: Stand-in, should be updated to something more dynamic/sensible
                  "Uniform" = max())})#The lower bound of the chart; if H1 > H0, then -3 is the floor, otherwise the floor is 3SD below the H1 distribution
  
    ##########################
   # Generating Sample Data #
  ##########################
  datasets <- reactive({
    whichsamp$i <- 0
    index$i <- 0
    dat <- matrix(nrow=N,ncol=obs())
    nobs <- obs()
    for(i in 1:N){
        if(distType()=="Normal"){
          dat[i,] <- rnorm(obs(), mean=mu(),sd=stdev()) 
        } else if (distType()=="Uniform"){
          dat[i,] <- runif(obs(), min=min(), max = max())
        } else if (distType()=="Chi-Squared"){
          dat[i,] <- rchisq(obs(), df=df())      
        }
    }
    return(list(dat = dat, Nsamp = N, nobs=nobs))
  })
  
  observeEvent(input$plus1, { #Updates the sample size value when the relevant button is clicked
    if ((whichsamp$i + 1)<10000) { #JR: Unsure if these are the most efficient way to avoid out-of-bounds erros
    whichsamp$i <- whichsamp$i + 1
    } else {whichsamp$i = 10000}
  })
  observeEvent(input$plus5, {
    if ((whichsamp$i+ 5) < 10000) {
      whichsamp$i <- whichsamp$i + 5
    } else {whichsamp$i = 10000}
  })
  observeEvent(input$plus1000, {
    if ((whichsamp$i + 1000) < 10000) {
      whichsamp$i <- whichsamp$i + 1000
    } else {whichsamp$i = 10000}
  })
  
 #JR: This section updates the display of the sample values when using the "Previous/Next Sample Buttons"
 #It works by updating the "index" reactive value, which is subtracted from the current sample to move things about
  observeEvent(input$nextSamp, {
    if ((whichsamp$i-index$i) < whichsamp$i) {
    index$i <- index$i-1
  }})
  
  observeEvent(input$prevSamp, {
    if (whichsamp$i - index$i > 1) {
    index$i <- index$i+1
    }})
  
  ## Population plot
  output$popPlot <- renderPlot({

    popplt <- ggplot(data=data.frame(x=c(lower(),upper())),aes(x)) +
      #The true population distribution
      switch(input$distType,
             "Uniform" = stat_function(fun=dunif,args=list(min=lower(),max=upper()),geom="line", linetype="dotted", color="red",xlim=c(lower(),upper())),
             "Normal" = stat_function(fun=dnorm,args=list(mean=mu(), sd=stdev()),geom="area", linetype="dotted",fill=NA, color="red",xlim=c(lower(),upper())),
             "Chi-Squared" = stat_function(fun=dchisq,args=list(df=df()),geom="area", linetype="dotted",fill=NA, color="red",xlim=c(lower(),upper()))) +
      #Clears the y-axis label
      ylab("")+
      #Sets the x axis ticks to cover the whole plot
      scale_x_continuous(limits=c(lower(),upper()),breaks= seq(round(lower()),round(upper()))) +
      #Clears the y-axis ticks
      #scale_y_continuous(breaks=NULL) +
      theme_classic() +
      annotate("text",x=Inf,y=Inf, label = 
        switch(
          input$distType,
          "Normal" = paste0("Population Mean = ",mu()),
          "Uniform" = paste0("Population Maximum = ",max()),
          "Chi-Squared" = paste0("Degrees of Freedom = ",df()) #JR: Added these annotations to reflect info relevant to a particular distribution; may require some re-thinking though to tie in w/ sampling distribution means
      ), vjust = 1, hjust = 1)+ # JR: End first annotate()
      annotate("text",x=Inf,y=Inf, label = 
          switch(input$distType,
                 "Normal" = paste0("Population SD = ",stdev()),
                 "Uniform" = paste0("Population Minimum = ",min()),
                 "Chi-Squared" = paste0("")
                        ), vjust = 3, hjust = 1) # JR: End second annotate()
    
    if (input$yaxis == FALSE){
      popplt <- popplt + scale_y_continuous(breaks=NULL)
    } 
    
    popplt
      
  }) #end output$popPlot
  
  output$sampPlot <- renderPlot({
    alldat <- datasets() # get all pre-generated samples
    if(whichsamp$i>0){
      dat <- alldat$dat[1:whichsamp$i,,drop=FALSE] # grab just ones necessary for *this* plot
    } else {
      dat <- alldat$dat[1:2,,drop=FALSE]
    }
    means <- rowMeans(dat)
    meanofmeans <- mean(means)
    sdofmeans <- if (length(means) > 1) {sd(means)} else {1} # JR: Perhaps inefficient? Using an if/else statement as otherwise we don"t have an SD when N=1
        # doing this necessitates changing sd() to stdev() throughout; is this how we usually compute SE? it's the SE of the simulated sampling distribution values...
    
    #FIXME: Skew and Kurtosis values are slighly off from what JASP gives
    #CFF: could have a citation for where the equations come from
    n <- whichsamp$i
    skewness <- sum(((means-mean(means))/sd(means))^3) * (n / ((n - 1) * (n - 2))) # Sums the cubed standardized values, then multiplies them by [something]
    skewnessSE <- sqrt((6 * n * (n - 1) / ((n - 2) * (n + 1) * (n + 3))))
    
    #This section for debugging skewness/kurtosis
    #print(means)
    #print(meanofmeans)
    #print(sum((means-meanofmeans)^2))
    #print(sum((means-meanofmeans)^4))
    #print(sum(((means-mean(means))/sd(means))^3))
    
    s2 <- sum((means-meanofmeans)^2)
    s4 <- sum((means-meanofmeans)^4)
    v <- s2 / (n-1)
    b <- s4 / (v^2)
    kurtosis <- ((n * (n + 1)) / ((n - 1) * (n - 2) * (n - 3))) * b + ((-3 * (n - 1)^2) / ((n - 2) * (n - 3)))
    kurtosisSE <- 2 * skewnessSE * sqrt((n^2 - 1) / ((n - 3) * (n + 5)))
    
    lower <- meanofmeans - 3*sdofmeans #JR: If we're changing the layout of the x-axis, won't need these anymore
    upper <- meanofmeans + 3*sdofmeans
    
    #binwidth= (upper-lower)/30
    #binwidth= (base::max(means)-base::min(means))/input$bins
    
    plt <- ggplot(data=as.data.frame(means), aes(x=means)) +
      #Clears the y-axis label
      ylab("Frequency")+
      #Sets the x axis ticks to cover the whole plot
      #scale_x_continuous(limits=c(lower,upper),breaks= seq(round(lower),round(upper),by=1)) +
      scale_x_continuous(limits=c(lower(),upper()),breaks= seq(round(lower()),round(upper()))) +
      theme_classic()
      
    if(whichsamp$i>0){
      #plt <- plt + geom_histogram(color="black", fill="white",binwidth = binwidth) +
        plt <- plt + geom_histogram(color="black", fill="white",bins = input$bins) +        
        annotate("text",x=Inf,y=Inf, label = paste0("Sampling Dist. Mean = ", round(meanofmeans,3)), vjust = 1, hjust = 1) +
        annotate("text",x=Inf,y=Inf, label = paste0("Total # of Samples = ", whichsamp$i), vjust = 3, hjust = 1) +
        annotate("text",x=Inf,y=Inf, label = paste0("Standard Deviation = ",round(sdofmeans,3)), vjust = 5, hjust = 1)
        
        if(input$overlay == TRUE) {
          ggplot_build(plt)$data[[1]] -> hist_layer
          bin_width <- hist_layer$x[2] - hist_layer$x[1]
          plt <- plt + stat_function(fun = function(x) dnorm(x, mean = meanofmeans, sd = sdofmeans) * whichsamp$i*bin_width)
          #TODO: geom_histogram is defaulting to 1/30th of the data range; approximating this here, could use some refinement
        } 
        
        if(input$skew == TRUE) {
          plt <- plt + annotate("text",x=Inf,y=Inf, label = paste0("Skewness = ", round(skewness,3)," (SE=",round(skewnessSE,3),")"), vjust = 7, hjust = 1) +
            annotate("text",x=Inf,y=Inf, label = paste0("Kurtosis = ", round(kurtosis,3)," (SE=",round(kurtosisSE,3),")"), vjust = 9, hjust = 1)
        } 
    
    }
  plt
    
  }) #end output$sampPlot
  
  output$sampleTable <- renderTable({
    if(whichsamp$i>0){
      alldat <- datasets() # get all pre-generated samples
      dat <- as.data.frame(alldat$dat[whichsamp$i-index$i,]) # grab just one of them for *this* sample plot
      colname <- paste0("Sample", whichsamp$i-index$i)
      colnames(dat) <- colname
      return(as.data.frame(dat))
    } else {
      return(data.frame())
    }
    }) #end output$sampleTable

  output$sampleValues <- renderPlot({
    
    alldat <- datasets() # get all pre-generated samples
    
    if(whichsamp$i>0){
      dat <- as.data.frame(alldat$dat[whichsamp$i-index$i,]) # grab just one of them for *this* sample plot
      colname <- paste0("Sample", whichsamp$i-index$i)
    } else {
      dat <- as.data.frame(alldat$dat[1,]) # grab just one of them for *this* sample plot
      colname <- "dat"
    }
    
    colnames(dat) <- colname
    sample_mean <- colMeans(dat)
    sample_sd <- sd(dat[,1]) #JR: Adding this in, may risk information overload though
    sample_se <- sample_sd/sqrt(obs()) #JR: Adding this in, may risk information overload though
    
    plt <- ggplot(data=as.matrix(dat), aes(x=!!sym(colname))) +
      xlab(NULL)
      if(whichsamp$i>0){
      plt <- plt + geom_dotplot(color="black",method="histodot", dotsize=.5) +
        scale_y_continuous(name = "", breaks = NULL) +
        annotate("text",x=Inf,y=Inf, label = paste0("Sample Mean = ",round(sample_mean,3)), vjust = 1, hjust = 1) +# CFF changed sample_mean()
        annotate("text",x=Inf,y=Inf, label = paste0("Sample SD = ",round(sample_sd,3)), vjust = 3, hjust = 1) +
        xlab(colname)
         #JR: Added this in, may be information overload though
      }
    plt <- plt +
      #Clears the y-axis label
      #ylab("Frequency")+
      #Sets the x axis ticks to cover the whole plot
      scale_x_continuous(limits=c(lower(),upper()),breaks= seq(round(lower()),round(upper()))) +
      #coord_cartesian(ratio=.5, expand=F) +
      theme_classic()
    
    if(input$se == TRUE) {
      plt + annotate("text",x=Inf,y=Inf, label = paste0("Estimated Standard Error = ",round(sample_se,3)), vjust = 5, hjust = 1)
    } else {
    plt }
  }) #
  
  output$sampleValuesTable <- renderTable({
    alldat <- datasets() # get all pre-generated samples
    if(whichsamp$i>0){
      dat <- alldat$dat[1:whichsamp$i,,drop=FALSE] # grab just ones necessary for *this* plot
      means <- rowMeans(dat)
      return(as.data.frame(means))
    } else {
      return(data.frame())
    }
    means <- rowMeans(dat)
    
    
  })
} #end server()

# Run the application 
shinyApp(ui = ui, server = server)


SECTION 1: INTRODUCTION

As context for this exercise, suppose we are studying social media use among undergraduate students enrolled in statistics courses.

As part of this research study, we first want to estimate average student social media usage for the class.Participants in our study will report the number of hours they spend on social media in a typical day.


Question 1.1: What is the population?

Reveal Answer

All undergraduate students in statistics courses


Question 1.2: What is the variable being measured?

Reveal Answer

Time spend on social media, reported in hours


SECTION 2: SAMPLING

Refer to the “population distribution” section of the app to see what the distribution of values for the variable(s) are in the population.

Recalling the rational of this experiment, answer the following questions:


Question 2.1: What is our value of interest?


Reveal AnswerUniversity students' average number of hours spent on social media

Question 2.2: If we were conducting a real study, would we know what the population mean is prior to doing the study?


Reveal AnswerNo, we would not. Even after conducting the study we would not know the population mean, we would only have an estimate of it.


Question 2.3: Given that we can’t measure the entire population of university students, how can we gain data to conduct our social media study?


Reveal AnswerWe instead can collect a sample of students


SECTION 3: SAMPLE & MEANS

Context: Suppose we decide to obtain a random sample of 5 students from this class and then ask each of them to report the number of hours they spend on social media on a typical day. To do this, ensure that the “sample size” is set to 5 and then click on “+1” to generate a single sample. Then, refer to the “sample values” section of the app. On one tab you should see a histogram (graph) of the distribution of these 5 values. On another tab, you should see the actual numerical values for these 5 participants.


Question 3.1: What is a sample?

Reveal Answer

A sample is a series of observations taken from a larger population, often with the intention of estimating a population parameter, such as the population mean


Question 3.2: What is this sample mean an estimate of?

Reveal Answer

The sample mean is an estimate of the population mean.

,


*For the following next 3 questions, calculate - by hand - the mean of the 5 sample values...*


Question 3.3: Where does the mean of these 5 values appear in the app?

Reveal Answer

You should see the mean displayed on the sample values plot.


Question 3.4: Is the sample mean identical to the corresponding population parameter?

Reveal Answer

It should not be identical, though by it could be close!


SECTION 4: SAMPLING DISTRIBUTIONS

Question 4.0

Refer to the “sampling distribution” section of the app. At this point, you should see a histogram with a single bar.

Compare the (approximate) value of that bar to the sample mean. Notice that they are identical. In the previous question, you calculated the mean number of social media hours for your first sample of 5 participants.

What does each value in the sampling distribution correspond to?

Reveal Answer

Every datapoint in the sampling distribution represents a mean value from a single sample


Question 4.1

Imagine we repeat our study. To do this, draw another sample by clicking “+1” again. You should see the sample values section update to have 5 new values. The sample values, along with this sample’s mean, are now based on 2 sets of 5 participant samples.

With every new collected sample of students, you would add the participants to the already collected sample of your students and recalculate the mean. Note how the sampling distribution histogram should also have another value added to it.

As your sample changes with every round of data collection, the mean of this growing sample will concurrently change. Compare this new mean to the sample mean of the first sample. Try this process several more times to see what happens to the sampling distribution’s mean.

What happens to the sampling distribution's mean as the samples increase?

Reveal Answer

So long as assumptions of the central limit theorem are met, the mean of the sampling distribution should get closer and closer to the population mean as additional samples are added.


Question 4.2

So far, you have examined 2 distributions: the population distribution (at the top of the application), and the sampling distribution (appearing as a histogram).

By now you should understand that the sampling distribution is not a distribution of individual scores, rather the scores of all your sample’s means. Note that a distribution of the individual scores is called a sample distribution. For example, a sampling distribution of the number of hours of student’s social media use would show 5 points (n=5) for each student’s response.

The sampling distribution, as discussed above, would illustrate only that group’s mean hours of social media use.

Can a scientist gain data about the whole population?

Reveal Answer

Not usually. It is often easier and more economical to obtain a random sample from a population.


Question 4.3

You have just received some funding and can finally begin your experiment. Imagine you have conducted all your data collection and are ready to start analyzing.

What would the initial steps of your analysis look like?

Reveal Answer

The statistics we went through above describes the very first steps you would take once you collected your data. First, you would want to see the distribution of hours of social media use in your sample: generate a sample distribution. This step is important as it is the first step in determining whether or not your sample breaks any assumptions required for certain analyses (a topic we will discuss in later modules). Then, you would calculate the mean of all your samples. This mean will be an integral part of your analyses. How this mean is used will be discussed in the next module. For now, using what we have discussed so far in this module, you have set up the basic components of your analyses.



SECTION 5: REPEATED SAMPLING

Next, generate many samples. You can click on +1000” to add 1000 samples to the sampling distribution. This is like repeating the experiment 1000 more times. Inspect what the sampling distribution looks like in relation to the population distribution.


Question 5.0

Toggle the “overlay normal” option to overlay what a normal distribution would look like.

Roughly what shape is the sampling distribution?

Reveal Answer

It should be symmetric, and kind of 'Normal' looking


Question 5.1:What is the mean of the sampling distribution, and how does it compare to the population mean?

Reveal Answer

The mean of the sampling distribution should be fairly close to the population mean. Theoretically, if a statistic is unbiased, the mean of the sampling distribution will be the same as the corresponding population parameter. Here, the mean of the sampling distribution is the same as the mean of the population.


Question 5.2: Is the standard deviation of the sampling distribution smaller or larger than the population standard deviation?

Reveal Answer

It should be smaller than the population standard deviation.


Question 5.3

Suppose instead of doing the experiment with a sample size of 5, we decided instead to use a sample size of 30. Make this change in the app in the section that specifies the sample size. If you would like to generate one sample at this sample size to inspect the sample values, please do so by clicking “+1”. Then, generate many samples by clicking “+1000” to inspect the sampling distribution.



Question 5.4: What do you notice about the shape of the sampling distribution?

Reveal Answer

The sampling distribution should be roughly symmetric and normal.


Question 5.5: Conceptually, do you think increasing the sample size will result in sample means more or less accurately estimating the population mean?

Reveal Answer

The mean of the sampling distribution should be approximately the same as that of the population. The standard deviation of the sampling distribution should be smaller than it was when n=5 was the sample size.


Question 5.6: How do you expect the variability of the sampling distribution to change if we were to decrease our sample size?

Reveal Answer

Variability of the sampling distribution should increase since when sample size is small, each sample is a less accurate estimate of the population mean.


SECTION 6: STANDARD ERROR

Context: The standard deviation of the sampling distribution (for the sample means) has a special name: the standard error (of the mean). Assuming a fixed sample size, there are several ways we can compute the actual standard error (of the mean): 1) If we know the population standard deviation, or 2)If we repeat our experiment many (e.g., thousands) of times, we can literally compute the standard deviation of the sample means. If we only have a single sample and don’t know the population standard deviation, it is possible to estimate the standard error of the mean


Question 6.0: Which of the following BEST describes standard error?

The average distance of individual data points from the population mean.
The variability of scores within a single sample.
The standard deviation of the sampling distribution of a test statistic.
The difference between the sample mean and the population mean
Reveal Answer

The standard deviation of the sampling distribution of a test statistic.


Question 6.1: While both the standard deviation and standard error give you information about data’s deviation from a mean, how are these different?

Reveal Answer

While the standard deviation tells you how much each data point deviates from the sample mean, your standard deviation tells you how much your sample means deviate from your population mean. In other words, your standard deviation tells you how much your sampled data are spread out from your sample mean. Your standard error tells you how much all your sample’s means deviate (from sample to sample, rather than from data point to data point) around the population mean.


SECTION 7: DISTRIBUTIONS

Background: The shape of the distribution of sample means tends to be normal. It is guaranteed to be normal if either the population from which the samples are obtained is normal; it also tends to be normal under a wide range of other population distributions if sample size is large (usually n ≥ 30 is a common rule of thumb).


Question 7.0: If the distribution of sample means is approximately normal, does this guarantee that the population distribution is normal?

Reveal Answer

Not necessarily. The distribution of sample means does not immeidately tell us anything about the distribution of scores in the population.


Question 7.1: If the population distribution is not normal, what happens to the shape of the sampling distribution of the mean as sample size increases?

Reveal Answer

Under many (though not all) nonnormal distributions, the shape of the distribution will become normal if the sample size is large enough.


Question 7.2: If the population distribution is not normal, what happens to the shape of the sampling distribution of the mean as sample size increases?

Reveal Answer

An increase in sample size can also help the shape of the sampling distribution approach a normal distribution..


Question 7.3: Imagine that the population distribution of student screen time is highly skewed to the right, because a small number of teens use their phones for extremely long hours. You take random samples of n = 50 teenagers many times and plots the sampling distribution of the sample mean?

Reveal Answer

Even though the population is skewed, the Central Limit Theorem tells us that with a large sample size (n = 50), the sampling distribution of the mean will be approximately normal.



SECTION 8: CENTRAL LIMIT THEOREM (CLT)

Background: The CLT states that: for a sufficiently large sample size, the distribution of sample means will approximate a normal (bell curve) distribution, regardless of the shape of the original population distribution (Xang et al., 2023). Note that: The CLT shows that the variability of sample means (the standard error) decreases as sample size increases.

In sum, there are 3 components to the CLT: (1) You take 1 sample with large n -> You take the sample’s mean (2) Then you take many samples (keeping the sample size constant) -> You take each sample’s mean (3) Then, you plot all the means of your samples (not of the individual data)


Question 8.0: Which of the following best illustrates the Central Limit Theorem?

The population becomes normal as sample size increases.
The sample mean always equals the population mean.
The sampling distribution of the mean approaches normality as sample size increases, regardless of population shape.
The sample distribution matches the population distribution exactly.
Reveal Answer

(3) The sampling distribution of the mean approaches normality as sample size increases, regardless of population shape.


Question 8.1

As discussed, you’ve taken the basic steps required to conduct an experiment. These steps are helpful because with your repeated sampling of your population, you are left with a normally distributed sampling distribution. Without a normal sampling distribution, easy methods used to make inferences about the population may not work.

Does the Central Limit Theorem require the population distribution to be normal?

Reveal Answer

No, the major advantage of the CLT is that the population distribution does not need to be normal. The CLT holds even for many cases when the population distribution is skewed or irregular, as long as the sample size is large enough. (It should be noted that there are exceptions, though such exceptions are not yet covered in this module).


Question 8.2: What is the standard deviation of the sampling distribution of the mean called?

Reveal Answer

The standard error of the mean.


Question 8.3: What is the minimum sample size commonly considered “large enough” for the CLT to apply in many practical situations?

Reveal Answer

A sample size of n ≥ 30 is often considered large enough for the CLT to give a good approximation. (This can vary depending on the amount and type of nonnormality in the population).


Question 8.4: Why is the Central Limit Theorem especially valuable when the population distribution is not normal?

Reveal Answer

Because it often allows researchers to use normal-theory inference (confidence intervals, hypothesis tests) on sample means even when the population is skewed or nonnormal, as long as the sample size is large.


Question 8.5: How would the sampling distribution change if the experimenter took more samples of the same size versus larger samples?

Reveal Answer

Larger samples make the distribution more normal and less variable; taking more samples does not change the shape, but only gives more estimates so that we have a better idea of the exact shape of the distribution.


Credits

We thank the Association for Psychological Science for a small teaching grant (Fund for Teaching and Public Understanding of Psychological Science) to help in the development of this software.