proPower: Prospective power analyses for frequentist regression models

Dr Stefano Coretta

University of Edinburgh

2025-03-26

Frequentist statistics

  • P-values: \(P(d|H)\), probability of the difference \(d\) (or a more extreme one) given the hypothesis \(H\).

  • Neyman-Pearson NHST:

    • \(H\) is the “null hypothesis”, to be rejected.

    • \(\alpha\) level: \(p \ge \alpha\) non-significant difference, \(p < \alpha\) significant difference.

    • \(\beta\) level: probability of a Type II error (missing a real difference). Statistical power is \(1 - \beta\): the probability of detecting a significant difference when there is one.

  • Null Ritual (Gigerenzer 2004, 2018; Gigerenzer, Krauss, and Vitouch 2004):

    • \(H\) is a “nil hypothesis”: \(d = 0\), to be rejected.

    • \(\alpha = 0.05\)

    • \(1 - \beta = 0.8\) (80% power, i.e. \(\beta = 0.2\))

Statistical power

  • Statistical power (\(1 - \beta\)) increases with sample size.

  • One must determine the sample size needed to achieve the desired statistical power. In Null Ritualistic statistics this is 0.8 (or 80% power).

  • Prospective power analysis.
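For a simple two-group comparison, the link between effect size, \(\alpha\), power, and sample size can be sketched with base R's power.t.test() (the 5 ms difference and 10 ms SD below are illustrative assumptions):

```r
# Per-group sample size for a two-sample t-test, assuming an
# (illustrative) difference of 5 ms, an SD of 10 ms, alpha = 0.05
# and a target power of 0.8.
power.t.test(delta = 5, sd = 10, sig.level = 0.05, power = 0.8)
```

This returns n of roughly 64 per group. The mixed-effects regression models in the rest of this talk have no such closed-form solution, which is why we turn to simulation-based tools.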

Two examples

  • Effect of consonant voicing on vowel duration in mono- vs disyllabic words.

  • Accuracy in a lexical decision task to investigate shallow morphological representation in L2 speakers.

Example 1: vowel duration

  • Data from Coretta (2019a, 2019b).

  • “Voicing effect” in Manchester English.

  • Mono- and di-syllabic nonce words in frame sentences.

  • Is the voicing effect the same or different in mono- vs disyllabic words?

  • Use “pilot” data to determine sample size.

Example 1: the data

library(tidyverse)

eng_s <- readRDS("data/eng_s.rds")
eng_s |> select(speaker, num_syl, word, voicing, v1_duration, speech_rate_c)
# A tibble: 720 × 6
   speaker num_syl word   voicing   v1_duration speech_rate_c
   <fct>   <fct>   <fct>  <fct>           <dbl>         <dbl>
 1 en02    di      teegus voiced          117.         -0.650
 2 en02    mono    teek   voiceless        95.0        -0.908
 3 en02    mono    teek   voiceless       106.         -0.900
 4 en02    di      tergus voiced          190.         -0.750
 5 en02    di      targus voiced          195.         -0.383
 6 en02    di      teepus voiceless       113.         -0.655
 7 en02    mono    terb   voiced          203.         -0.823
 8 en02    mono    tarp   voiceless       217.         -0.534
 9 en02    di      teegus voiced          120.         -0.571
10 en02    di      terpus voiceless       147.         -0.665
# ℹ 710 more rows

Effect of voicing on vowel duration

Prospective power analysis: overview

  1. Determine Smallest Meaningful Effect Size (SMES).

  2. Run regression model on pilot or simulated data.

  3. Extend model with simr::extend() and set coefficient to SMES.

  4. Run power calculation with simr::powerCurve().
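When no pilot data are available (step 2), a model can be built entirely from assumed parameters instead. A minimal sketch with simr::makeLmer(); every value here (design, fixed effects, variances) is an illustrative assumption:

```r
library(simr)  # also loads lme4

# Hypothetical design: a continuous predictor x, 3 grouping levels
sim_data <- expand.grid(x = 1:10, g = c("a", "b", "c"))

# Build an lmer model from assumed parameters rather than pilot data
sim_lm <- makeLmer(
  y ~ x + (1 | g),
  fixef = c(2, -0.1), # assumed intercept and slope
  VarCorr = 0.5,      # assumed random-intercept variance
  sigma = 2,          # assumed residual SD
  data = sim_data
)

# Simulation-based power for the x coefficient
# (nsim kept small here purely for speed)
powerSim(sim_lm, fixed("x"), nsim = 10)
```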

Smallest Meaningful Effect Size

  • The minimum standard deviation for measurements of vowel duration is between 2 and 5 ms (Allen 1978).

  • Nowak (2006) finds an effect of voicing of 4.5 ms.

  • Let’s use an SMES of 5 ms. What does this correspond to as a difference ratio?

Smallest Meaningful Effect Size (ratios)

x <- seq(50, 300, by = 10)
y <- x + 5

ggplot() +
  aes(x, y/x) +
  geom_point(size = 2) +
  labs(
    x = "Vowel duration",
    y = "Difference ratio"
  )
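The same conversion can be done numerically. Since the model below is fit on log(v1_duration), a 5 ms difference corresponds to a log-scale coefficient that shrinks as vowel duration grows:

```r
# A 5 ms SMES expressed on the log scale, at a few
# representative vowel durations (in ms)
dur <- c(100, 150, 200)
round(log((dur + 5) / dur), 3)
# 0.049 0.033 0.025
```

Around 150 ms the SMES is about 0.03 on the log scale, which is the value used below when setting the coefficient of the extended model.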

Run regression model

library(lme4)

dur_lm <- lmer(
  log(v1_duration) ~
    voicing + num_syl + voicing:num_syl + speech_rate_c +
    (voicing | speaker),
  data = eng_s
)

summary(dur_lm)
Linear mixed model fit by REML ['lmerMod']
Formula: 
log(v1_duration) ~ voicing + num_syl + voicing:num_syl + speech_rate_c +  
    (voicing | speaker)
   Data: eng_s

REML criterion at convergence: -230

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-3.3800 -0.5941  0.1648  0.6624  3.1393 

Random effects:
 Groups   Name          Variance  Std.Dev. Corr
 speaker  (Intercept)   0.0181967 0.13490      
          voicingvoiced 0.0004701 0.02168  0.37
 Residual               0.0397878 0.19947      
Number of obs: 720, groups:  speaker, 6

Fixed effects:
                          Estimate Std. Error t value
(Intercept)                4.66722    0.05705  81.816
voicingvoiced              0.17276    0.02281   7.573
num_sylmono                0.08260    0.02308   3.579
speech_rate_c             -0.22902    0.02985  -7.673
voicingvoiced:num_sylmono  0.05501    0.02974   1.850

Correlation of Fixed Effects:
            (Intr) vcngvc nm_syl spch__
voicingvocd -0.029                     
num_sylmono -0.163  0.417              
speech_rt_c  0.011 -0.006  0.412       
vcngvcd:nm_  0.130 -0.652 -0.647 -0.006

Extend model and set SMES

dur_lm_ext <- extend(dur_lm, along = "speaker", n = 100)

fixef(dur_lm_ext)["voicingvoiced:num_sylmono"] <- 0.03
fixef(dur_lm_ext)
              (Intercept)             voicingvoiced               num_sylmono 
               4.66722417                0.17276388                0.08260109 
            speech_rate_c voicingvoiced:num_sylmono 
              -0.22902490                0.03000000 

Run power analysis

library(simr)

dur_lm_pow <- powerCurve(
  dur_lm_ext,
  fixed("voicing:num_syl"),
  along = "speaker",
  breaks = c(50, 70, 100)
)

print(dur_lm_pow)
Power for predictor 'voicing:num_syl', (95% confidence interval),
by number of levels in speaker:
     50: 83.30% (80.84, 85.56) - 6000 rows
     70: 92.90% (91.13, 94.41) - 8400 rows
    100: 98.90% (98.04, 99.45) - 12000 rows

Time elapsed: 1 h 12 m 47 s
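The percentages above are the proportion of simulations (1000 by default) in which the effect came out significant, so the confidence intervals are binomial. A quick sanity check with base R, assuming an exact binomial interval, which is close to what simr reports:

```r
# 833 significant results out of 1000 simulations, as in the
# 50-speaker row above
binom.test(833, 1000)$conf.int
```

This gives roughly (0.81, 0.86), matching the interval printed for 83.30% power.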

Power curve plot

plot(dur_lm_pow)

Example 2: accuracy

  • Data from Song et al. (2020).

  • Shallow morphological representation in L2 speakers.

  • Lexical decision task: Is the word a real English word or not?

  • Priming paradigm:

    • Prime: prolong (unrelated), unkind (constituent), kindness (non-constituent).

    • Target: unkindness ([un-kind]-ness, not un-[kind-ness]).

  • Constituent vs non-constituent accuracy in L2 speakers.

Example 2: the data

shallow_s <- readRDS("data/shallow_s.rds")
shallow_s
# A tibble: 180 × 11
   ID    Group List  Target         ACC    RT logRT Critical_Filler Word_Nonword
   <chr> <chr> <chr> <chr>        <dbl> <dbl> <dbl> <chr>           <chr>       
 1 L2_36 L2    D     unholiness       1   878  6.78 Critical        Word        
 2 L2_36 L2    D     rehydratable     1  1263  7.14 Critical        Word        
 3 L2_36 L2    D     unclearness      1  1296  7.17 Critical        Word        
 4 L2_36 L2    D     resellable       1  1344  7.20 Critical        Word        
 5 L2_36 L2    D     unsharpness      0  1192  7.08 Critical        Word        
 6 L2_36 L2    D     reattachable     1   969  6.88 Critical        Word        
 7 L2_36 L2    D     uncleverness     0   835  6.73 Critical        Word        
 8 L2_36 L2    D     unwariness       1  1028  6.94 Critical        Word        
 9 L2_36 L2    D     reclosable       1  1028  6.94 Critical        Word        
10 L2_36 L2    D     reusable         1   814  6.70 Critical        Word        
# ℹ 170 more rows
# ℹ 2 more variables: Relation_type <chr>, Branching <chr>

Accuracy by relation type

Smallest Meaningful Effect Size (log-odds)

  • I have no idea: there is no established benchmark for accuracy effects in log-odds…

  • Say -0.2 (log-odds).

Smallest Meaningful Effect Size (probability)
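What -0.2 log-odds means on the probability scale depends on the baseline. A sketch with base R's plogis() (the inverse logit); the 0.8 log-odds baseline is illustrative, close to the intercept of the model fitted below:

```r
# Convert a -0.2 log-odds SMES to the probability scale
baseline <- 0.8        # illustrative baseline accuracy in log-odds
plogis(baseline)       # ~0.69: baseline probability of a correct response
plogis(baseline - 0.2) # ~0.65: probability after the SMES shift
```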

Run regression model

sha_lm <- glmer(
  ACC ~ Relation_type + (Relation_type | ID),
  family = binomial,
  data = shallow_s
)

summary(sha_lm)
Generalized linear mixed model fit by maximum likelihood (Laplace
  Approximation) [glmerMod]
 Family: binomial  ( logit )
Formula: ACC ~ Relation_type + (Relation_type | ID)
   Data: shallow_s

     AIC      BIC   logLik deviance df.resid 
   243.9    272.7   -113.0    225.9      171 

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-1.8220 -1.1732  0.5994  0.7376  0.9667 

Random effects:
 Groups Name                        Variance Std.Dev. Corr       
 ID     (Intercept)                 0.5606   0.7487              
        Relation_typeNonConstituent 0.1617   0.4022   -1.00      
        Relation_typeUnrelated      0.2396   0.4895   -1.00  1.00
Number of obs: 180, groups:  ID, 10

Fixed effects:
                              Estimate Std. Error z value Pr(>|z|)  
(Intercept)                  0.8699404  0.3904283   2.228   0.0259 *
Relation_typeNonConstituent  0.0008542  0.4401554   0.002   0.9985  
Relation_typeUnrelated      -0.3867280  0.4379963  -0.883   0.3773  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Correlation of Fixed Effects:
            (Intr) Rlt_NC
Rltn_typNnC -0.730       
Rltn_typUnr -0.777  0.596
optimizer (Nelder_Mead) convergence code: 0 (OK)
Model failed to converge with max|grad| = 0.00637234 (tol = 0.002, component 1)

Run regression model II

sha_lm_2 <- glmer(
  ACC ~ Relation_type + (1 | ID),
  family = binomial,
  data = shallow_s
)

summary(sha_lm_2)
Generalized linear mixed model fit by maximum likelihood (Laplace
  Approximation) [glmerMod]
 Family: binomial  ( logit )
Formula: ACC ~ Relation_type + (1 | ID)
   Data: shallow_s

     AIC      BIC   logLik deviance df.resid 
   235.1    247.8   -113.5    227.1      176 

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-1.7702 -1.2415  0.6161  0.7087  0.9373 

Random effects:
 Groups Name        Variance Std.Dev.
 ID     (Intercept) 0.14     0.3742  
Number of obs: 180, groups:  ID, 10

Fixed effects:
                            Estimate Std. Error z value Pr(>|z|)   
(Intercept)                  0.79468    0.30687   2.590  0.00961 **
Relation_typeNonConstituent  0.08048    0.40084   0.201  0.84086   
Relation_typeUnrelated      -0.30295    0.38990  -0.777  0.43716   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Correlation of Fixed Effects:
            (Intr) Rlt_NC
Rltn_typNnC -0.643       
Rltn_typUnr -0.665  0.506

Run power analysis

sha_lm_ext <- extend(sha_lm_2, along = "ID", n = 1000)
fixef(sha_lm_ext)["Relation_typeNonConstituent"] <- -0.2

sha_lm_pow <- powerCurve(sha_lm_ext, fixed("Relation_typeNonConstituent"), along = "ID")
print(sha_lm_pow)
Power for predictor 'Relation_typeNonConstituent', (95% confidence interval),
by largest value of ID:
      3:  5.00% ( 3.73,  6.54) - 54 rows
    114: 37.50% (34.49, 40.58) - 2052 rows
    225: 67.60% (64.60, 70.50) - 4050 rows
    335: 83.30% (80.84, 85.56) - 6030 rows
    446: 92.40% (90.58, 93.97) - 8028 rows
    557: 95.70% (94.25, 96.87) - 10026 rows
    668: 98.30% (97.29, 99.01) - 12024 rows
    778: 99.20% (98.43, 99.65) - 14004 rows
    889: 99.70% (99.13, 99.94) - 16002 rows
   1000: 100.0% (99.63, 100.0) - 18000 rows

Time elapsed: 1 h 29 m 48 s

Power curve plot

plot(sha_lm_pow)

Summary

  1. Determine Smallest Meaningful Effect Size (SMES).

  2. Run regression model on pilot or simulated data.

  3. Extend model with simr::extend() and set coefficient to SMES.

  4. Run power calculation with simr::powerCurve().

Questions?

References

Allen, George D. 1978. “Vowel Duration Measurement: A Reliability Study.” The Journal of the Acoustical Society of America 63 (4): 1176–85. https://doi.org/10.1121/1.381826.
Coretta, Stefano. 2019a. “Compensatory Aspects of the Effect of Voicing on Vowel Duration in English [Data V1.0.0].” https://doi.org/10.17605/OSF.IO/EP8WB.
———. 2019b. “Temporal (in)stability in English Monosyllabic and Disyllabic Words: Insights on the Effect of Voicing on Vowel Duration.” https://doi.org/10.31219/osf.io/jvwa8.
Gigerenzer, Gerd. 2004. “Mindless Statistics.” The Journal of Socio-Economics 33 (5): 587–606. https://doi.org/10.1016/j.socec.2004.09.033.
———. 2018. “Statistical Rituals: The Replication Delusion and How We Got There.” Advances in Methods and Practices in Psychological Science 1 (2): 198–218. https://doi.org/10.1177/2515245918771329.
Gigerenzer, Gerd, Stefan Krauss, and Oliver Vitouch. 2004. “The Null Ritual. What You Always Wanted to Know about Significance Testing but Were Afraid to Ask.” In, 391–408.
Nowak, Pawel. 2006. “Vowel Reduction in Polish.” PhD thesis.
Song, Yoonsang, Youngah Do, Arthur L. Thompson, Eileen R. Waegemaekers, and Jongbong Lee. 2020. “Second Language Users Exhibit Shallow Morphological Processing.” Studies in Second Language Acquisition 42 (5): 1121–36. https://doi.org/10.1017/s0272263120000170.