
title: A Factor Analysis For Dummies in R
date: 2022-01-29T09:00:00+01:00
categories: education, software
tags: phd, R, factor analysis

Suppose you want to develop a survey, like we recently did for creativity in higher computing education. You come up with a bunch of questions (Time seemed to fly while programming!, I adapted and merged ideas from others, etc), recruit victims who are willing to answer them using a scale (Totally agreed, Agreed, No way, etc), and then what? Chances are you picked certain questions from certain subthemes because you suspect a pattern will emerge. How to validate that?

That's where "factor analysis" comes in: it's a statistical method---meaning it can analyze quantitative data, like answers from the above scale---to detect variability and correlations among the questions. My research is very much qualitative (looking for patterns using interviews and discussions), so I had no idea how to handle that. This blog post serves as a summary for myself and can perhaps entertain others.

Step 1: Is the data up for it?

First things first: it's possible that your questions are so bad there's never going to emerge any pattern. In that case, the "factorability" of the survey is reduced to zero.

There's a statistic for that (that sentence comes in handy later!): the Kaiser-Meyer-Olkin (KMO) measure of sampling adequacy. Kaiser's original paper suggests a value of at least .60, but preferably higher than .80.

Let's use a well-known sample data set for our exercise: the psychological Big Five personality test developed in the eighties. Scientists devised the five big personality types using exactly this method: openness (O), conscientiousness (C), extraversion (E), agreeableness (A), neuroticism (N). The psych package contains a sample dataset of 2800 responses that can be loaded into the variable bfi with data('bfi').

I'll deliberately avoid using statistical formulas here and let R do the talking:

library('psych') #import fa functions
data('bfi')
KMO(bfi)

The result:

Kaiser-Meyer-Olkin factor adequacy
Call: KMO(r = bfi)
Overall MSA =  0.84
MSA for each item = ...

All right, so far so good. What's next? An overall item-total correlation (Rit) check to validate the reliability of the questions asked (the items). There are, of course, standardized rules for that, laid out by Ebel and Frisbie. Rit values below .30 are considered doubtful or even poor. How are we doing on that front? Let's try out the alpha(bfi) command and inspect the results:

Reliability analysis   
Call: alpha(x = bfi)

  raw_alpha std.alpha G6(smc) average_r S/N   ase mean   sd median_r
      0.21      0.51    0.66     0.036 1.1 0.019  4.6 0.53    0.017

 lower alpha upper     95% confidence boundaries
0.18 0.21 0.25 

 Reliability if an item is dropped:
          raw_alpha std.alpha G6(smc) average_r  S/N alpha se var.r med.r
A1             0.24      0.53    0.68     0.041 1.15    0.018 0.035 0.021
A2             0.19      0.49    0.64     0.034 0.95    0.019 0.034 0.017
A3             0.19      0.49    0.64     0.034 0.94    0.019 0.033 0.017
...

 Item statistics 
             n  raw.r  std.r  r.cor r.drop mean    sd
A1        2784 -0.028 0.0573 -0.057 -0.127  2.4  1.41
A2        2773  0.285 0.3736  0.356  0.210  4.8  1.17
A3        2774  0.267 0.3874  0.379  0.184  4.6  1.30
A4        2781  0.269 0.2938  0.239  0.174  4.7  1.48
A5        2784  0.265 0.3345  0.314  0.185  4.6  1.26
C1        2779  0.204 0.2723  0.215  0.125  4.5  1.24
C2        2776  0.229 0.3506  0.324  0.142  4.4  1.32
C3        2780  0.184 0.2487  0.181  0.099  4.3  1.29
C4        2774  0.040 0.1285  0.060 -0.055  2.6  1.38
C5        2784  0.083 0.1242  0.049 -0.027  3.3  1.63
E1        2777  0.032 0.0023 -0.107 -0.078  3.0  1.63
...

Lots of numbers, where to look? First, the raw_alpha is the total alpha value that should be as close to one as possible. A low alpha value (.21 is not good) indicates that some questions can be left out while still retaining the core and purpose of the survey; the "Reliability if an item is dropped" table shows the effect of doing exactly that.

The r.cor column is what we're interested in. Surprisingly, there are quite a few bad values here. For instance, A1 could be scrapped; C5 and E1 are also worthless. At this point, it's up to the researcher to decide whether or not to throw out questions. We did so for a few, after carefully inspecting the question itself: some participants might have misinterpreted it. Of course, just throwing out all the bad ones to up the statistical relevance is not the way to go.
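If you do decide to drop a few, that's a one-liner; a minimal sketch (the choice of A1, C5 and E1 is purely illustrative):

bfi_trimmed <- subset(bfi, select = -c(A1, C5, E1)) # throw out the doubtful items
alpha(bfi_trimmed)                                  # and re-run the reliability analysis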

You can re-match the questions to the codes in the R documentation: A1 is "I am indifferent to the feelings of others.", E1 is "I don't talk a lot." Since we already know the five factors, it's a bit difficult to judge whether or not those questions are relevant to the personality types. Anyway, let's move on and pretend we didn't see the low alpha value.

Step 2: Confirm your suspicions

There are two ways to do a factor analysis: confirmatory or exploratory. With the former, you suspect certain items belong together and hope that the statistics will confirm it. With the latter, you have no idea what will emerge and let the analysis tell you which ones should be grouped together.

Usually, researchers have a pretty good idea, especially if they designed the questions themselves. They're just looking for confirmation---and a way to publish their results. Let's pretend we think O1, C1, E1, A1, N1 and O2, C2, ... belong together, comprising four factors in total. We have to feed R our model ourselves and then pass it to the cfa function from the lavaan project:

library('lavaan')

model <- '
 # measurement model
  cool  =~ O1 + C1 + E1 + A1 + N1
  mad =~ O2 + C2 + E2 + A2 + N2
  sad =~ O3 + C3 + E3 + A3 + N3
  glad =~ O4 + C4 + E4 + A4 + N4
' 

fit <- cfa(model, bfi)
summary(fit, fit.measures = TRUE)

The summary function spews out about every imaginable statistic you can think of. We're looking for a few things here:

  • The Comparative Fit Index (CFI), which should ideally be >.80. Here, it's .433.
  • The Root Mean Square Error of Approximation (RMSEA) and its p-value; the RMSEA should be below .05. Here, it's .135.

All other stuff (variances, covariances, latent variables) is useless if the CFI is low, as is the case here. That means your model is wrong, and the confirmatory factor analysis confirmed that your hunch is... well... just a hunch. What's next, scrapping the data and having another go at collecting "better" responses? Not yet. It just means that our cool, mad, sad, glad factors are hogwash.
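By the way, you don't have to wade through the entire summary dump: lavaan's fitMeasures function can report just the indices we care about. Something like:

fitMeasures(fit, c("cfi", "rmsea", "rmsea.pvalue"))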

Step 3: Explore other possibilities

At this point, it's a good idea to trade confirmatory for exploratory, although fiddling with the measurement model yourself is always an option---in my case, it only resulted in frustration. R still asks the researcher for an educated guess when it comes to the number of factors, though.

If you have no clue where to begin, a parallel analysis is a good idea. It comes with a bunch of optional parameters, including which factoring method to use, that I won't go into here:

eigenvalues <- fa.parallel(bfi)

This will take a while, but you can limit the input to the first hundred rows with bfi[1:100, ].

The result is one line of output ("Parallel analysis suggests that the number of factors = 7 and the number of components = 7") and a graph that is called a "scree plot":

The x axis contains the number of potential factors, and the y axis the "eigenvalues", which indicate how much of the items' variance a factor explains. An eigenvalue of one (the threshold) means that a factor explains as much variance as a single variable; anything above one explains more. In other words, once the curve drops below one, introducing new factors won't be very beneficial, as the data of the questionnaire contributes little to those factors.

These graphs traditionally resemble a kind of elbow: after x factors, adding another one becomes futile. This is basically R telling us how many factors we should be aiming for. In this case, it could be seven, as suggested, but it could also be six or five. Remember: it's still an educated guess.
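The numbers behind the plot are also stored in the object we captured above. Assuming fa.parallel's usual return fields (worth double-checking with str(eigenvalues)):

eigenvalues$nfact      # the suggested number of factors (7 in this run)
eigenvalues$fa.values  # the eigenvalues of the factor solution, as drawn in the scree plot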

Next, let's try out that guess. Seven? Sure, why not.

factordata <- fa(bfi, 7, rotate="varimax")
print(factordata$loadings, cutoff=0.3)

The result:

Loadings:
          MR1    MR2    MR3    MR4    MR5    MR6    MR7   
A1                                    -0.538              
A2         0.391                       0.513              
A3         0.524                       0.434              
A4         0.302                       0.311              
A5         0.576                                          
C1                       0.544                            
C2                       0.670                            
C3                       0.542                            
C4                      -0.609                            
C5                      -0.537                            
E1        -0.504                              0.359       
E2        -0.597                              0.386       
E3         0.657                                          
E4         0.683                                          
E5         0.510         0.308                            
N1                0.796                                   
N2                0.794                                   
N3                0.723                                   
N4                0.551                       0.314       
N5                0.518                                   
O1         0.346               -0.393                     
O2                              0.517                     
O3         0.428               -0.498                     
O4                                            0.350       
O5                              0.591                     
gender                                 0.366              
education                                            0.389
age                                                  0.564

Whoops, what are those last few items? Demographic info can be removed, but I was too lazy to do that. What are we looking at here? The aim for these "loadings" is to (1) have every item (question) load on a single factor, not multiple, and (2) have the questions grouped into factors in a way that makes sense. MR1, MR2, etc. can afterwards be named. Since we've kind of spoiled the ending, this is a bit ridiculous now: obviously, we're looking to load all the A questions under one factor ("agreeableness"), all the Cs under another, etc.
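For completeness: dropping those demographic columns beforehand is easy enough. A sketch (bfi_items is my own name for the trimmed data frame):

bfi_items <- subset(bfi, select = -c(gender, education, age)) # keep only the 25 questions
fa(bfi_items, 7, rotate="varimax")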

Now is a good time to re-calculate that comparative fit index: .97. Wow, that's great! Of course it is: R basically suggested seven factors itself.
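How exactly? One way (assuming the fa result exposes the chi-square statistics and degrees of freedom of the model and of the null model as STATISTIC, dof, null.chisq and null.dof; check with str(factordata)) is to apply the textbook CFI formula by hand:

# CFI = 1 - (chi2_model - df_model) / (chi2_null - df_null)
cfi <- 1 - ((factordata$STATISTIC - factordata$dof) /
            (factordata$null.chisq - factordata$null.dof))
cfi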

What does rotate="varimax" mean? There are different kinds of matrix rotations available, grouped into orthogonal and oblique (varimax is an orthogonal rotation; oblimin, used below, is an oblique one). As the analysis looks for the correlations to determine the best fit, it can rotate around an axis to find a better fit. In short, if you assume your factors are going to be correlated (which is mostly the case in psychology!), choose an oblique rotation. If you think they're not, choose an orthogonal one.

In the above loading output, you can see that, for instance, E5 loads high on MR1 (.510) and MR3 (.308). That's annoying: is this question related to factor 1 or factor 3? Perhaps five factors is a better fit. Let's try again with five and apply an oblimin rotation, because we suspect the factors are correlated.
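The call is essentially the same as before; a sketch, assuming fa with five factors, an oblimin rotation, and the demographic columns dropped (which is why they no longer show up in the output):

factordata <- fa(bfi_items, 5, rotate="oblimin") # bfi_items: bfi without gender/education/age
print(factordata$loadings, cutoff=0.3)

The result: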

Loadings:
          MR2    MR1    MR3    MR5    MR4   
A1                             -0.514       
A2                              0.631       
A3                              0.581       
A4                              0.397       
A5                0.320         0.458       
C1                       0.547              
C2                       0.662              
C3                       0.563              
C4                      -0.623              
C5                      -0.560              
E1               -0.539                     
E2               -0.641                     
E3                0.490                     
E4                0.669                     
E5                0.407                     
N1         0.782                            
N2         0.762                            
N3         0.729                            
N4         0.502 -0.356                     
N5         0.531                            
O1                                     0.522
O2                                    -0.436
O3                                     0.616
O4                                     0.366
O5                                    -0.522

Bingo. Everything seems to be (almost) neatly grouped into five different factors---the ones we started with: openness (O), conscientiousness (C), extraversion (E), agreeableness (A), neuroticism (N). Of course, when scientists were originally analyzing this data, they didn't know that yet, and had to come up with those cool labels. What about our CFI? .89---still great.
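Since we went with an oblique rotation, the correlations between the factors themselves are also available; assuming the Phi field of the fa result:

factordata$Phi # factor intercorrelation matrix (only produced for oblique rotations)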

A few other functions that can be of use:

  • principal, calculating the aforementioned statistics using a principal components analysis instead
  • fa.diagram(factordata) visualizes the loaded factors

If you're still keen on grinding away in R, you can verify the EFA findings by creating a revised model and doing a second CFA. The results should more or less match, including the high CFI value.
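A sketch of what that revised model could look like, grouping the items the way the EFA suggested (the factor labels are mine):

model_revised <- '
  agreeableness     =~ A1 + A2 + A3 + A4 + A5
  conscientiousness =~ C1 + C2 + C3 + C4 + C5
  extraversion      =~ E1 + E2 + E3 + E4 + E5
  neuroticism       =~ N1 + N2 + N3 + N4 + N5
  openness          =~ O1 + O2 + O3 + O4 + O5
'

fit_revised <- cfa(model_revised, bfi)
fitMeasures(fit_revised, c("cfi", "rmsea"))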

Step 4: Interpret with caution

What's the point of all those statistics? The emerging factors are never 100% the expected ones. For instance, we expected to see the same themes as the ones we identified in a focus group, as the questions were based on interviewees' opinions, supplemented with academic literature. Instead, we found (1) fewer factors, (2) a unique group of sub-themes we didn't foresee, and (3) discrepancies in our questions.

That means coming up with questions, gathering data, and analyzing them using something like a factor analysis is an incremental process. It should---in theory---be done multiple times using revisions of your questionnaire. In practice, gathering test subjects just once already involves a lot of trouble. I've found a lot of psychometric tests that never received a published revision.

One more thing: to get your stuff really, really correct, you need a Step 0: Validate beforehand, or a "face validity" phase. There, you present your survey to a small group of experts/critics/potential candidates to iron out the biggest syntactic mistakes. That's the only qualitative part of a survey design.

Now you know the basics of setting up and validating your own questionnaire. I had no idea what I was doing, and I still have no clue on some of the statistical calculations behind these functions, but at least now I know how to use them and interpret their results. Good luck!