Structural Equation Model - a flexible tool for Predictive and Prescriptive analysis
SEM (Structural Equation Modeling) is a powerful and flexible multivariate analysis technique that has been used extensively in psychometrics, behavioural science, business and marketing research. Its application is not restricted to a single data source: it can be extended to web logs, social media data, transactional data, economic data and other alternative data sources.
SEM combines ideas from multilevel regression, ANOVA and factor analysis. It allows the user to study and validate complex relationships among many independent and dependent variables or features. Because it tests and evaluates proposed multivariate causal relationships, it is also called “Causal Modeling” or a “Confirmatory Technique”.
SEM requires knowledge of key terms such as exogenous, endogenous, latent and indicator variables. Variables that are not influenced by any other variable in the model are called exogenous, while those that are influenced by other variables are called endogenous. An attribute that is directly observed and measured is called an indicator variable. Features that are not directly measured, on the other hand, are called latent variables. Apart from these terms, one also needs a working knowledge of covariance, correlation, partial correlation, main effects, interaction effects, moderation, mediation, the Sobel test, and direct and indirect effects, as these are the building blocks of the technique.
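To make the mediation vocabulary concrete, here is a minimal Python sketch of the Sobel test for a single mediator (X affects M, which in turn affects Y). The function name sobel_test and all coefficients and standard errors are made-up illustrative values, not estimates from any real dataset.

```python
import math

def sobel_test(a, se_a, b, se_b):
    """Sobel z-test for the indirect effect a*b through one mediator.

    a, se_a: coefficient and standard error of the path X -> M
    b, se_b: coefficient and standard error of the path M -> Y (controlling for X)
    """
    indirect = a * b
    # First-order (delta method) standard error of the product a*b
    se_indirect = math.sqrt(b**2 * se_a**2 + a**2 * se_b**2)
    z = indirect / se_indirect
    # Two-sided p-value under the standard normal distribution
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return indirect, z, p_value

# Illustrative (made-up) estimates
a, se_a = 0.50, 0.10   # X -> M
b, se_b = 0.40, 0.08   # M -> Y
direct = 0.15          # X -> Y, controlling for M

indirect, z, p_value = sobel_test(a, se_a, b, se_b)
total = direct + indirect  # total effect = direct effect + indirect effect

print(f"indirect={indirect:.3f} total={total:.3f} z={z:.3f} p={p_value:.4f}")
```

The Sobel test assumes the product a*b is approximately normal, which tends to be a poor approximation in small samples, so bootstrapped confidence intervals are often preferred in practice.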
Like any other statistical test or modelling technique, SEM requires the specification of a hypothesized model, which is then validated against data. SEM can also be used to compare multiple hypothesized models at the same time.
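One common way to compare competing hypothesized models fitted to the same data is through information criteria, where lower values are preferred. The sketch below is only an illustration: the model names, log-likelihoods and parameter counts are invented, and in practice they would come from the SEM software used to fit each model.

```python
import math

def aic(log_likelihood, n_params):
    """Akaike information criterion: 2k - 2 ln L."""
    return 2 * n_params - 2 * log_likelihood

def bic(log_likelihood, n_params, n_obs):
    """Bayesian information criterion: k ln(n) - 2 ln L."""
    return n_params * math.log(n_obs) - 2 * log_likelihood

n_obs = 500  # assumed sample size

# Made-up fit results for two hypothetical candidate models
candidates = {
    "model_A": {"log_likelihood": -2310.4, "n_params": 12},
    "model_B": {"log_likelihood": -2305.9, "n_params": 15},
}

scores = {
    name: {
        "aic": aic(fit["log_likelihood"], fit["n_params"]),
        "bic": bic(fit["log_likelihood"], fit["n_params"], n_obs),
    }
    for name, fit in candidates.items()
}

best_by_aic = min(scores, key=lambda name: scores[name]["aic"])
best_by_bic = min(scores, key=lambda name: scores[name]["bic"])

for name, s in scores.items():
    print(f"{name}: AIC={s['aic']:.1f} BIC={s['bic']:.1f}")
print("preferred by AIC:", best_by_aic)
print("preferred by BIC:", best_by_bic)
```

With these particular numbers the two criteria disagree, because BIC penalizes the three extra parameters of the larger model more heavily than AIC does.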
Let’s take an example to understand how SEM can be applied in the lending industry to determine the factors affecting customer defaults and to evaluate associations between variables. Referring to the diagram above:
- Using internal data such as credit history, transaction details and other financial details, an underlying latent factor can be derived for the Payment Capability of the customer.
- Data such as social media, demographic, psychometric and other sources can help predict Intent and environmental factors. The mediation effect should also be taken into account, as environmental factors can affect customer default both directly and indirectly.
- The “Intent” and “Ability to Pay” factors will play a pivotal role in default predictions.
- The above model can also help in understanding the various customer segments present in the data.
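The lending example can be written down as an explicit model specification. The sketch below builds a description in lavaan-style syntax, where =~ reads "is measured by" and ~ reads "is regressed on". Every indicator name is a hypothetical placeholder, and the choice of Intent as the mediator between environment and default is an assumption made for illustration, since the diagram's details are not listed in the text.

```python
# Measurement part: latent factor -> observed indicators (names are hypothetical)
measurement = {
    "ability_to_pay": ["credit_history", "transaction_volume", "income"],
    "intent": ["social_media_score", "psychometric_score"],
    "environment": ["region_unemployment", "household_size"],
}

# Structural part: outcome -> predictors
structural = {
    # Assumed mediated path: environment -> intent -> default
    "intent": ["environment"],
    # Environment also has a direct path to default
    "default": ["ability_to_pay", "intent", "environment"],
}

def to_lavaan_syntax(measurement, structural):
    """Render the specification as lavaan-style model text."""
    lines = [f"{latent} =~ {' + '.join(inds)}" for latent, inds in measurement.items()]
    lines += [f"{outcome} ~ {' + '.join(preds)}" for outcome, preds in structural.items()]
    return "\n".join(lines)

def classify(structural):
    """Split structural variables into exogenous and endogenous sets."""
    endogenous = set(structural)  # anything with an incoming path
    predictors = {p for preds in structural.values() for p in preds}
    exogenous = predictors - endogenous  # predictors with no incoming path
    return sorted(exogenous), sorted(endogenous)

model_description = to_lavaan_syntax(measurement, structural)
exogenous, endogenous = classify(structural)

print(model_description)
print("exogenous:", exogenous)
print("endogenous:", endogenous)
```

Here environment and ability_to_pay come out as exogenous, while intent and default are endogenous because other variables in the model point to them.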
SEM can report the amount of variance explained in dependent variables (DVs), both indicator and latent. It can also estimate the reliability of each measured variable. SEM allows one to examine mediation and moderation, including indirect effects; hence, in certain scenarios, it can also be used for variable selection. Another advantage of SEM is its support for hierarchical modelling, with the inclusion of fixed and random effects.
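As a rough illustration of what "reliability of each measured variable" can mean, the sketch below treats the squared standardized loading of an indicator as its reliability (the share of its variance explained by the latent factor) and also computes a composite reliability for the factor. The loadings and indicator names are made up, and the formulas assume standardized loadings with uncorrelated errors.

```python
def indicator_reliability(loading):
    """Variance in an indicator explained by its factor (squared standardized loading)."""
    return loading ** 2

def composite_reliability(loadings):
    """Composite reliability of one factor, assuming standardized loadings
    and uncorrelated measurement errors."""
    sum_loadings_sq = sum(loadings) ** 2
    error_variance = sum(1 - l ** 2 for l in loadings)
    return sum_loadings_sq / (sum_loadings_sq + error_variance)

# Made-up standardized loadings for a hypothetical "ability to pay" factor
loadings = {"credit_history": 0.80, "transaction_volume": 0.70, "income": 0.60}

reliabilities = {name: indicator_reliability(l) for name, l in loadings.items()}
cr = composite_reliability(list(loadings.values()))

for name, r in reliabilities.items():
    print(f"{name}: reliability={r:.2f}")
print(f"composite reliability={cr:.3f}")
```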
When working with SEM, a few steps need to be followed: hypothesizing the constructs, verifying assumptions, estimating parameters, and evaluating and tuning the model.
- The initial step is to establish theoretically plausible models, i.e. to use prior knowledge of the positive or negative direct effects among variables.
- SEM shares the assumptions of other multivariate models, such as multivariate normality and the absence of multicollinearity; however, some estimation methods do not require the normality assumption. Like other models, it is also sensitive to missing values and sample size.
- SEM uses various estimation methods, such as maximum likelihood (ML), generalized least squares, weighted least squares and partial least squares.
- For model evaluation and modification, one needs to know model fit indices such as χ², CFI, RMSEA, TLI, GFI, NFI, SRMR, AIC and BIC. The residuals of the covariances should also be small and centred about zero. For non-normal residuals, the robustness of the model can be assessed with tests such as the Lagrange Multiplier test.
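Several of the fit indices named above can be derived from the χ² statistics of the hypothesized model and of a baseline (independence) model. The following is a minimal sketch using commonly cited formulas; the χ² values, degrees of freedom and sample size are invented, and software packages may differ in small details (for example, using N rather than N - 1 in RMSEA).

```python
import math

def rmsea(chi2, df, n_obs):
    """Root mean square error of approximation (N - 1 variant)."""
    return math.sqrt(max(chi2 - df, 0.0) / (df * (n_obs - 1)))

def cfi(chi2_model, df_model, chi2_base, df_base):
    """Comparative fit index relative to the baseline model."""
    d_model = max(chi2_model - df_model, 0.0)
    d_base = max(chi2_base - df_base, d_model)
    return 1.0 if d_base == 0 else 1.0 - d_model / d_base

def tli(chi2_model, df_model, chi2_base, df_base):
    """Tucker-Lewis index (non-normed fit index)."""
    ratio_base = chi2_base / df_base
    ratio_model = chi2_model / df_model
    return (ratio_base - ratio_model) / (ratio_base - 1.0)

# Made-up results for a hypothetical fitted model and its baseline
n_obs = 500
chi2_model, df_model = 85.3, 40
chi2_base, df_base = 1200.0, 55

fit = {
    "rmsea": rmsea(chi2_model, df_model, n_obs),
    "cfi": cfi(chi2_model, df_model, chi2_base, df_base),
    "tli": tli(chi2_model, df_model, chi2_base, df_base),
}

for name, value in fit.items():
    print(f"{name.upper()}: {value:.3f}")
```

Widely quoted rules of thumb treat an RMSEA below roughly 0.06 and CFI/TLI above roughly 0.95 as signs of good fit, but such cut-offs are conventions rather than strict tests and should be read alongside theory and the residuals.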
As with other statistical methods, unmet assumptions, overfitting and non-convergence are also encountered with SEM; however, with conceptual understanding and practice, one should be able to balance multiple error metrics and fit indices to arrive at the best-fitting model. Researchers have developed many variants of SEM, such as Partial Least Squares SEM (PLS-SEM), Hierarchical SEM and Bayesian SEM, which are worth exploring for specific use cases.
Do you have information on how to use this model in Excel?
Good read thanks