Factor Analysis in Red Wine Quality

using IBM SPSS Statistics 26

Photo by Terry Vlisidis on Unsplash

Introduction to Factor Analysis

Why Use Factor Analysis?

The general purpose of factor analysis (FA) is to simplify data by reducing a large number of variables to a smaller number of dimensions. The technique groups variables that share a large amount of common variance into a single factor. In this article, FA is used to reduce the variables in the red wine quality dataset (for example, free sulfur dioxide and total sulfur dioxide) by combining related variables into one dimension, which cuts the time and cost of subsequent calculation and analysis.

Why Were Non-Metric Independent Variables Excluded from FA?

FA uses the Pearson correlation to assess the similarity between each pair of variables. The Pearson correlation is a numerical value that can only be calculated for metric variables, so non-metric independent variables were not included in the analysis.
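To make the role of the Pearson correlation concrete, here is a minimal sketch in plain Python that builds a small correlation matrix. The column names echo the wine dataset, but the numbers are made-up illustrative values, and the helper names `pearson` and `correlation_matrix` are my own, not something SPSS exposes.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def correlation_matrix(columns):
    """Pairwise Pearson correlations for a dict of name -> list of numbers."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}

# Made-up illustrative values, NOT rows from the red wine dataset.
sample = {
    "free_sulfur_dioxide": [11.0, 25.0, 15.0, 17.0, 13.0],
    "total_sulfur_dioxide": [34.0, 67.0, 54.0, 60.0, 40.0],
    "alcohol": [9.4, 9.8, 9.8, 10.5, 9.2],
}
corr = correlation_matrix(sample)
for name, row in corr.items():
    print(name, [round(value, 3) for value in row.values()])
```

The calculation needs means and squared deviations, which is why a category label such as a wine type cannot be fed into it directly.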

Factor Analysis Outputs

Correlation Matrix (part 1)
Correlation Matrix (part 2)
KMO and Bartlett’s Test
Communalities
Anti-Image Matrices (part 1)
Anti-Image Matrices (part 2)
Total Variance Explained
Eigenvalues Scree Plot
Component Matrix
Rotated Component Matrix

Based on the rotated component matrix figure, the independent metric variables can be grouped into four factors, namely:

  • Factor 1: Fixed Acidity, Density, Citric Acid, pH
  • Factor 2: Quality, Alcohol, Volatile Acidity
  • Factor 3: Free Sulfur Dioxide, Total Sulfur Dioxide, Residual Sugar
  • Factor 4: Chlorides, Sulphates

Of the four factors above, factor 1 can be labeled “Acid and Density” based on the variables it contains. Factor 2 can be labeled “Alcohol and Quality”, factor 3 “Sulfur and Residual Sugar”, and factor 4 “Chemicals”.

Communality

Communality is the proportion of each variable’s variance that is accounted for by the factors extracted through FA. The communalities figure shows that if quality is predicted from the four factors obtained from FA using multiple linear regression (MLR), the R^2 value of the model is 0.627.
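The link between communality and R^2 can be checked numerically. The sketch below assumes principal component extraction with uncorrelated components and uses synthetic random data rather than the wine dataset; all variable names are illustrative. It computes each communality as the sum of squared loadings and compares it with the R^2 from regressing the standardized variable on the component scores.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (an assumption for illustration): 200 rows, 6 correlated variables.
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 6))
X = latent @ mixing + 0.5 * rng.normal(size=(200, 6))

# Standardize, then decompose the correlation matrix.
Z = (X - X.mean(axis=0)) / X.std(axis=0)
R = np.corrcoef(Z, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(R)
order = np.argsort(eigvals)[::-1]          # largest eigenvalue first
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

k = 2                                      # number of retained components
loadings = eigvecs[:, :k] * np.sqrt(eigvals[:k])
communalities = (loadings ** 2).sum(axis=1)

# Standardized component scores, then R^2 of each variable regressed on them.
scores = Z @ eigvecs[:, :k] / np.sqrt(eigvals[:k])
r_squared = []
for j in range(Z.shape[1]):
    coef, *_ = np.linalg.lstsq(scores, Z[:, j], rcond=None)
    residual = Z[:, j] - scores @ coef
    r_squared.append(1.0 - residual.var() / Z[:, j].var())
r_squared = np.array(r_squared)

print(np.round(communalities, 3))
print(np.round(r_squared, 3))
```

Under these assumptions the two printed rows agree, which is why a communality can be read as the R^2 of a regression on the retained components.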

Eigenvalue

The eigenvalue can be read as a quality score for each component: it measures how much of the total variance the component captures, so components with high eigenvalues are more likely to reflect a real underlying factor. As a general guideline, factors with an eigenvalue greater than 1 are retained. In this case, as shown in the eigenvalues scree plot and the component matrix figure, the dataset’s 12 variables measure four underlying components, so four factors are chosen.
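The eigenvalue guideline is easy to sketch in code. The example below uses a small hand-built correlation matrix (two pairs of variables, correlated 0.6 within a pair and 0 across pairs), not the wine data, and the function name `kaiser_retained` is hypothetical.

```python
import numpy as np

def kaiser_retained(corr):
    """Eigenvalues (descending), count of eigenvalues > 1, and variance share."""
    eigvals = np.sort(np.linalg.eigvalsh(np.asarray(corr, dtype=float)))[::-1]
    n_keep = int((eigvals > 1.0).sum())
    explained = eigvals / eigvals.sum()    # proportion of total variance
    return eigvals, n_keep, explained

# Hand-built example matrix, an assumption for illustration only.
corr = np.array([
    [1.0, 0.6, 0.0, 0.0],
    [0.6, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.6],
    [0.0, 0.0, 0.6, 1.0],
])
eigvals, n_keep, explained = kaiser_retained(corr)
print(np.round(eigvals, 3))                # eigenvalues are 1.6, 1.6, 0.4, 0.4
print(n_keep)                              # 2 components exceed 1
print(np.round(explained.cumsum(), 3))     # cumulative share of variance
```

The eigenvalues of a correlation matrix always sum to the number of variables, so an eigenvalue above 1 means the component explains more variance than a single original variable does.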

Evaluating Outputs

Increasing Factorability

The KMO-MSA (Kaiser-Meyer-Olkin measure of sampling adequacy), shown in the KMO and Bartlett’s test output, can be used to determine the dataset’s factorability. Because the MSA is less than 0.5, the dataset is not factorable. To make it factorable, variables whose Pearson correlations with the other variables are close to 0 need to be deleted (in this case, the residual sugar variable). After deleting the residual sugar variable, the following is the new FA output:

New Correlation Matrix (part 1)
New Correlation Matrix (part 2)
New KMO and Bartlett’s Test
New Communalities
New Anti-Image Matrices (part 1)
New Anti-Image Matrices (part 2)
New Total Variance Explained
New Eigenvalues Scree Plot
New Component Matrix
New Rotated Component Matrix
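For readers who want to see what sits behind the KMO-MSA number, here is a rough sketch of the usual formula: squared correlations divided by squared correlations plus squared partial correlations, taken over the off-diagonal entries. The function name `kmo` and the 3x3 example matrix are assumptions for illustration; the values are not taken from the SPSS output.

```python
import numpy as np

def kmo(corr):
    """Overall KMO and per-variable MSA from a correlation matrix."""
    corr = np.asarray(corr, dtype=float)
    inv = np.linalg.inv(corr)
    # Partial correlations between each pair, controlling for all other variables.
    scale = np.sqrt(np.outer(np.diag(inv), np.diag(inv)))
    partial = -inv / scale
    off_diagonal = ~np.eye(corr.shape[0], dtype=bool)
    r2 = np.where(off_diagonal, corr ** 2, 0.0)
    p2 = np.where(off_diagonal, partial ** 2, 0.0)
    per_variable = r2.sum(axis=0) / (r2.sum(axis=0) + p2.sum(axis=0))
    overall = r2.sum() / (r2.sum() + p2.sum())
    return overall, per_variable

# Illustrative matrix: three variables that all correlate 0.5 with each other.
example = np.array([
    [1.0, 0.5, 0.5],
    [0.5, 1.0, 0.5],
    [0.5, 0.5, 1.0],
])
overall, per_variable = kmo(example)
print(round(overall, 4))                   # 9/13, about 0.6923
print(np.round(per_variable, 4))
```

The per-variable values correspond to the diagonal of the anti-image correlation matrix, so a variable with a low value there is a natural candidate for removal before rerunning the analysis.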

Cross-Loading and How to Minimize It

Cross-Loading

None of the variables shows a problematic cross-loading, as shown in the new rotated component matrix table. Although the density variable has loadings on two separate components, its loading on factor 1 is considerably higher than its loading on factor 2, so it is assigned to factor 1.
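One way to make this check repeatable is a small helper that flags a variable only when its second-largest loading is both sizeable and close to its largest one. The thresholds (0.4 for a salient loading, 0.2 for the minimum gap), the function name and the loadings below are all assumptions for illustration; rules of thumb vary, and these are not the values from the SPSS table.

```python
def find_cross_loadings(loadings, salient=0.4, min_gap=0.2):
    """Return variables whose two largest absolute loadings are both
    salient and closer together than min_gap."""
    flagged = {}
    for name, row in loadings.items():
        ranked = sorted((abs(value) for value in row), reverse=True)
        top, second = ranked[0], ranked[1]
        if second >= salient and top - second < min_gap:
            flagged[name] = (top, second)
    return flagged

# Hypothetical rotated loadings on three factors.
example = {
    "variable_a": [0.85, 0.10, 0.05],   # loads cleanly on one factor
    "variable_b": [0.75, 0.45, 0.02],   # two visible loadings, clear primary factor
    "variable_c": [0.55, -0.50, 0.10],  # nearly equal loadings on two factors
}
flagged = find_cross_loadings(example)
print(flagged)
```

With these thresholds only `variable_c` is flagged, while `variable_b` mirrors the density case: a second loading is visible, but the primary factor is unambiguous.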

Minimizing Cross-Loading

If the dataset has a cross-loading issue, it can be reduced by applying a rotation technique. Here, Varimax rotation with Kaiser Normalization is used for that purpose. However, if the cross-loading persists after rotation, the variable should be removed and the FA recalculated.
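To give a feel for what the rotation does, here is a rough NumPy sketch of Varimax with Kaiser normalization, using a common SVD-based iteration. It is an illustration under my own assumptions, not SPSS's implementation: the function name, iteration limits and the unrotated loadings are all hypothetical.

```python
import numpy as np

def varimax_kaiser(loadings, max_iter=100, tol=1e-8):
    """Varimax rotation with Kaiser normalization (SVD-based iteration).
    Returns the rotated loadings and the orthogonal rotation matrix.
    Assumes every variable has a non-zero communality."""
    L = np.asarray(loadings, dtype=float)
    p, k = L.shape
    h = np.sqrt((L ** 2).sum(axis=1))      # square roots of the communalities
    A = L / h[:, None]                     # Kaiser normalization: unit-length rows
    T = np.eye(k)
    previous = 0.0
    for _ in range(max_iter):
        B = A @ T
        # Gradient of the varimax criterion with respect to the rotated loadings.
        gradient = A.T @ (B ** 3 - B @ np.diag((B ** 2).sum(axis=0)) / p)
        u, s, vt = np.linalg.svd(gradient)
        T = u @ vt                         # closest orthogonal matrix
        current = s.sum()
        if previous != 0.0 and current < previous * (1.0 + tol):
            break
        previous = current
    rotated = (A @ T) * h[:, None]         # undo the normalization
    return rotated, T

# Hypothetical unrotated loadings for four variables on two components.
unrotated = np.array([
    [0.7, 0.5],
    [0.6, 0.6],
    [0.8, -0.4],
    [0.7, -0.5],
])
rotated, T = varimax_kaiser(unrotated)
print(np.round(rotated, 3))
```

Because the rotation matrix is orthogonal, each variable's communality is unchanged; only the way that variance is split across the factors moves, which is what makes the loadings easier to read.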

References

The SPSS file and outputs are available on my GitHub here.

Mario Caesar