Principal component analysis applied to the statistical control of multivariate processes

Objective: To propose the analysis and monitoring of a chemical process based on the theoretical principles of a factorial method known as principal component analysis (PCA), whose ultimate aim is to represent the original variables of the system. Methodology: The data were analyzed in a more compact dimensional space, under the hypothesis of multivariate normality. A control chart based on the squared prediction error is then constructed to monitor the behavior of the successor variables once PCA has been applied. Results: The univariate control charts built from the t-scores of the individual components, which display the behavior of the variables that contribute to alarm signals, can be used as the basis for an auxiliary analysis. Conclusions: The results corroborate that the three retained components explain a majority fraction of the original variability of the data cloud, and that the control chart built from the sub-dimensions registers the existence of outliers or extreme values.


Introduction
The capacity of multivariate analysis to detect anomalies in processes whose variables move together and which generate large amounts of information [1] has been widely recognized in multiple domains, including industrial monitoring [2]. The techniques it comprises are designed to support monitoring frameworks that aim to stabilize the process, reduce its natural variability and detect background events that degrade the performance of production systems. Although such techniques are not yet widely adopted, one of their inherent advantages is that they reduce the effort needed to evaluate the state of multiple variables without losing a large fraction of information, and they thereby avoid the restrictions that affect control methods for high-dimensional processes. In this sense, multivariate methods for process control are generally concerned with the direction of observations in a multivariate space, in contrast to univariate methods, which monitor only the magnitude and variation of individual variables without examining simultaneous movements among quality characteristics.
The need to apply multivariate statistical tools to control two or more process variables simultaneously grows more important every day [3]. A very useful multivariate method for explaining the sources of variability of a process and reducing the dimensionality of the data is principal component analysis (PCA) [4]. PCA is one of the oldest and most popular multivariate statistical techniques in data analysis. Its main objectives are to extract the most important information from a multivariate data set, to compress the data set by keeping only the information considered important (that is, to reduce its dimensionality), to simplify the description of the data set, and to analyze the structure of the observations and variables [5].
The central idea of PCA is to simplify a set of quantitative data derived from a set of interrelated variables. This is achieved by obtaining, from linear combinations of the originally measured variables, a new set of the same number of variables, called principal components (PCs), which preserve the variability present in the original data and which, when ordered by decreasing variance, allow the phenomenon under study to be explained with the first few PCs [6]. In summary, PCA is made up of: 1) the vectors Wj, given by the dominant eigenvectors (those with the largest associated eigenvalues); and 2) the sample covariance matrix [7]. The continuous review of conventional techniques for assessing the overall capability and stability of processes has opened the way for multivariate statistical techniques such as PCA, a dimensionality reduction method for cases with a large number of quantitative variables. It yields a transformed coordinate system, the principal components, which are linear combinations of the original variables [8] and are capable of retaining the global variation exhibited by the original data [9].
These methods and their extensions not only allow data to be projected onto a more compact subspace, but also open the way to alternative methodologies for designing control charts, such as those built for the analysis of latent variables. Direct use of conventional multivariate tools for operations control is poorly suited to fairly common circumstances such as collinear variables, which make the covariance matrix nearly singular and difficult to invert, and it can also inflate false alarm rates. Control charts that integrate information-synthesis methodologies, such as the one shown in this paper, share the advantages highlighted above and could well be used routinely in manufacturing environments for the systematic characterization and evaluation of processes with many highly correlated quality characteristics, as long as the quantity and quality of the available data are at least acceptable.

Methodology
The present investigation is divided into three stages, which are explained in detail below. As a starting point, the behavior of a set of variables of the monitored production process is identified and characterized; this involves defining measurable properties that give a general picture of the overall stability of the process and of the correlations among the variables.
A brief exploratory analysis is then carried out on the data set obtained through random sampling [10].
Subsequently, principal component analysis is carried out on the original data. In general terms, PCA involves a data set with observations on $p$ numerical variables for each of $n$ entities or individuals. These values define $p$ $n$-dimensional vectors $x_1, \dots, x_p$, and the method seeks a linear combination given by $\sum_{j=1}^{p} a_j x_j = Xa$ [11], whose coefficients are obtained from the eigenvectors of the covariance matrix of the original data. For the sequential extraction of the principal components, the nonlinear iterative partial least squares (NIPALS) algorithm is used, and the optimal number of components is determined with Krzanowski's cross-validation method. The components are extracted from standardized variables, since the scales of the original variables differ significantly.
Once the principal component analysis is concluded, a multivariate control chart is constructed. This is the central phase of the present paper, since the chart detects anomalous events and excursions of the process away from the hyperplane defined by the reference model. With the same aim, an auxiliary analysis based on monitoring charts for each extracted component is carried out.
In both cases, the control limits are computed from the historical information of the process.

Results
To acquire the data for each of the selected variables, simple random sampling without replacement was applied: a probabilistic selection procedure in which every subject has the same probability of being chosen and cannot be selected more than once. Initially there is a raw volume of data in which mathematical relationships are difficult to identify, so the collected information must first be processed appropriately before it can be used in later statistical procedures.
Table 1

Source: Own elaboration
Since the scales of the observed variables differ significantly, estimating the principal components directly would be problematic: the variables are not comparable with one another, and those with high variance would inevitably dominate the first sub-dimensions. To avoid this, the original variables are standardized, so that all of them are on an equal footing at the start of the analysis. Note that, in addition to certain descriptive statistics and the corresponding scale factors, the specification limits of the characteristics of interest are also reported.
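The standardization step can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `standardize` and the small data matrix are hypothetical, and the sample standard deviation (divisor n-1) is assumed as the scale factor.

```python
import numpy as np

def standardize(X):
    """Center each column and scale it to unit sample standard deviation."""
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    scale = X.std(axis=0, ddof=1)  # assumed scale factor: sample std (n-1)
    return (X - mean) / scale, mean, scale

# Hypothetical data: two variables measured on very different scales
X = np.array([[101.2, 0.51],
              [99.8, 0.47],
              [100.5, 0.55],
              [98.9, 0.49]])
Z, mean, scale = standardize(X)
print(Z.std(axis=0, ddof=1))  # every column now has unit variance
```

After this transformation no variable can dominate the first components merely because of its units.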

Extraction of Principal Components
Intuitively, the essential purpose of this first stage is to find a new set of orthogonal directions that capture the maximum variability in terms of the variance-covariance structure of the original variables, in such a way that the information contained in the complete set of components is exactly equivalent to the information in the original data, since it is conjectured that there are redundant elements that only add dimensionality to the problem studied [12]. Through PCA, the original data are projected onto a much more compact and parsimonious representation [13], which enables the analysis and monitoring of multiple variables under a simplified approach.
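Let $X$ be an $n \times p$ multivariate data matrix whose columns contain the variables and whose rows contain the observations $x_i^{T}$. The mean vector is given by (1):
$$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i \qquad (1)$$
and the covariance matrix by (2):
$$S = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})(x_i - \bar{x})^{T} \qquad (2)$$
The $p$-dimensional vectors $x_i$ are transformed into vectors of "scores" denoted by (3):
$$t_i = P^{T} (x_i - \bar{x}) \qquad (3)$$
where $P$ is the loading matrix.
These scores can be found from the eigenvalue decomposition of the sample covariance matrix, $S = P \Lambda P^{T}$. The eigenvalues of $S$ are contained, in descending order, in the diagonal matrix $\Lambda = \mathrm{diag}(\lambda_1, \dots, \lambda_p)$ [14,15].
The sum of the variances of the variables, or total inertia of the point cloud, is equal to the sum of the variances of the principal components. In this sense, the percentage of inertia explained by the $i$-th component is (4):
$$\frac{\lambda_i}{\sum_{j=1}^{p} \lambda_j} \times 100 \qquad (4)$$
where $\sum_{j=1}^{p} \lambda_j = \mathrm{tr}(S)$ is a measure of the variability associated with the original variables [16].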
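The relationships in this section (mean, sample covariance, scores, and percentage of inertia explained) can be checked numerically with a short sketch. It assumes the covariance uses the n-1 divisor and that scores are computed from centered data; the function name `pca_eig` and the simulated data are hypothetical.

```python
import numpy as np

def pca_eig(X):
    """PCA by eigendecomposition of the sample covariance matrix."""
    X = np.asarray(X, dtype=float)
    x_bar = X.mean(axis=0)                    # mean vector
    S = np.cov(X, rowvar=False)               # sample covariance (divisor n-1)
    eigvals, eigvecs = np.linalg.eigh(S)      # eigh returns ascending order
    order = np.argsort(eigvals)[::-1]         # reorder to descending
    eigvals, P = eigvals[order], eigvecs[:, order]
    T = (X - x_bar) @ P                       # scores, one row per observation
    explained = 100.0 * eigvals / eigvals.sum()  # percent of total inertia
    return eigvals, P, T, explained

# Hypothetical data with two correlated variables
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
X[:, 1] += 0.8 * X[:, 0]
eigvals, P, T, explained = pca_eig(X)
print(np.round(explained, 1))
```

The variance of each score column equals the corresponding eigenvalue, which is why the eigenvalues sum to the total inertia.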

Component extraction and cross validation
For the sequential extraction of the principal components, the nonlinear iterative partial least squares (NIPALS) algorithm has been used. In each iteration, the columns of $X$ are regressed on a score vector $t$ to obtain a loading vector $p$; the rows of $X$ are then regressed on the loading vector to re-estimate the scores, and this is repeated until the predefined convergence criterion is met. The procedure can be summarized in the following steps [17]:
1. Select a column $x_j$ of the data matrix $X$ and set it equal to a vector $t$.
2. Use the vector $t$ to predict the matrix $X$ with the regression model (5): $X = t p^{T} + E$, where the least squares estimator is (6): $\hat{p}^{T} = (t^{T} t)^{-1} t^{T} X$.
3. Define the vector $p$ with equation (7): $p = X^{T} t / (t^{T} t)$.
4. Normalize the vector $p$ so that its length equals one.
5. Use the vector $p$ to predict the matrix $X$ from the model in (8): $X = t p^{T} + E$, where the least squares estimator is given by (9): $\hat{t} = X p / (p^{T} p)$. This gives the projection of the rows of $X$ onto the direction of $p$ in the space of the variables.
6. Define the new vector $t$ as this estimate.
7. Compute the squared norm of the difference between the vector obtained in the previous step and the vector used at the start of the iteration.
8. Compare the norm with the preset tolerance value. If the difference is below this level, the $i$-th component has been obtained; otherwise, return to step 2.
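The steps above can be sketched in Python as follows. This is an illustrative version, not the authors' implementation: the function name `nipals`, the tolerance, the iteration cap and the choice of the highest-variance column as the starting vector are assumptions. The deflation step (subtracting each extracted component before searching for the next) is the usual way NIPALS is applied repeatedly and is not spelled out in the text.

```python
import numpy as np

def nipals(X, n_components, tol=1e-10, max_iter=500):
    """Sequential extraction of principal components by NIPALS."""
    X = np.asarray(X, dtype=float)
    E = X - X.mean(axis=0)                 # work on the centered matrix
    n, p = E.shape
    T = np.zeros((n, n_components))        # scores
    P = np.zeros((p, n_components))        # loadings
    for a in range(n_components):
        # Step 1: start from one column (assumed: the one with largest variance)
        t = E[:, np.argmax(E.var(axis=0))].copy()
        for _ in range(max_iter):
            p_vec = E.T @ t / (t @ t)      # steps 2-3: regress columns on t
            p_vec /= np.linalg.norm(p_vec) # step 4: unit length
            t_new = E @ p_vec              # steps 5-6: regress rows on p (p'p = 1)
            converged = np.linalg.norm(t_new - t) < tol  # steps 7-8
            t = t_new
            if converged:
                break
        T[:, a], P[:, a] = t, p_vec
        E = E - np.outer(t, p_vec)         # deflation (assumed standard step)
    return T, P

# Hypothetical data with well-separated variances
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4)) * np.array([5.0, 3.0, 1.5, 0.5])
T, P = nipals(X, n_components=2)
print(np.round(T.var(axis=0, ddof=1), 3))  # approximates the two largest eigenvalues
```

Up to sign, the loadings returned here coincide with the leading eigenvectors of the sample covariance matrix, so the result can be cross-checked against a direct eigendecomposition.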
The procedure described above is applied repeatedly to obtain successive components. However, to retain an appropriate number of sub-dimensions and avoid overfitting the PCA model, a cross-validation method such as Krzanowski's is applied. This alternative scheme, based on the singular value decomposition, aims to determine an optimal number of components and helps to identify redundant information in the data matrix [18]. For this purpose, a test of statistical significance is applied to each extracted component in order to determine which of them merit inclusion in the model. The method assumes that the elements of the matrix $X$ are to be predicted through model (10):
$$x_{ij} = \sum_{t=1}^{m} u_{it} d_t v_{jt} + \varepsilon_{ij} \qquad (10)$$
In this sense, the value of $W$ used to select a number of components $m$ is:
$$W = \frac{[\mathrm{PRESS}(m-1) - \mathrm{PRESS}(m)] / D_m}{\mathrm{PRESS}(m) / D_r}$$
where $D_m$ is the degrees of freedom required to fit the $m$-th component, $D_r$ is the degrees of freedom remaining once $m$ components have been fitted, and $\mathrm{PRESS}(m)$ is the prediction error sum of squares of the model with $m$ components. $W$ represents the increase in predictive information supplied by each additional component. The standard cross-validation procedure consists of dividing the data into several groups, removing each group in turn, estimating the model parameters from the remaining data, and predicting the deleted values [19].
Under a conventional and somewhat restrictive criterion, the components with the greatest statistical significance should have values of $W$ greater than one. It would be inappropriate, however, to stop adding principal components as soon as $W$ first falls below one, because it is not a monotonically decreasing function [20]. This suggests that there are no clear formal criteria for determining what proportion of variability should be explained by the successor variables [21]. However, for monitoring purposes, this empirical rule will be followed. On the other hand, $Q^2$, which is similar to R²X except that it is less prone to inflation as the complexity of the model increases, allows the predictive capacity of the model to be evaluated; here it indicates that this capacity is acceptable, since it takes a value close to one.
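As a rough illustration of the selection rule, the sketch below computes the statistic W from a sequence of PRESS values (prediction error sums of squares for models with 0, 1, 2, ... components). It assumes the usual form of Krzanowski's statistic, with D_m = n + p - 2m degrees of freedom to fit the m-th component and D_r obtained by successive subtraction from (n - 1)p; these expressions, the function name `krzanowski_w` and the PRESS values are assumptions for illustration, not figures from this study.

```python
import numpy as np

def krzanowski_w(press, n, p):
    """W statistic for m = 1..M given press[m] = PRESS(m), m = 0..M."""
    press = np.asarray(press, dtype=float)
    n_comp = len(press) - 1
    w = np.empty(n_comp)
    d_r = (n - 1) * p                      # dof of the mean-centered matrix
    for m in range(1, n_comp + 1):
        d_m = n + p - 2 * m                # dof needed to fit component m
        d_r -= d_m                         # dof remaining after m components
        w[m - 1] = ((press[m - 1] - press[m]) / d_m) / (press[m] / d_r)
    return w

# Hypothetical PRESS values for 0, 1, 2 and 3 components
w = krzanowski_w([1.00, 0.50, 0.30, 0.28], n=50, p=5)
print(np.round(w, 3))
print(w > 1)  # components whose W exceeds one
```

Because W is not guaranteed to decrease monotonically, it is safer to inspect the whole sequence than to stop at the first value below one.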
The vector of eigenvalues, on the other hand, can be understood as the magnitude of the variance of the observations along the direction of the corresponding eigenvector. The first factor explains a majority fraction of the total dispersion of the cloud, while the successive ones explain much smaller portions of it, called residual variance.

Source: Own elaboration
Next, Table 3 lists the factor loadings (saturations) that determine the orientation of the new principal component axes with respect to the initial coordinate system. Analytically, each loading is the product of the weight of an original variable in an extracted component and the square root of the corresponding eigenvalue, that is, equation (13):
$$a_{jk} = p_{jk} \sqrt{\lambda_k} \qquad (13)$$
where $p_{jk}$ is the weight of variable $j$ in component $k$ and $\lambda_k$ is the $k$-th eigenvalue. Positive values close to 1 indicate a strong correlation between a component and a variable, as happens with the variables Pressure and Pentane Content in the first factor; conversely, negative loadings close to one in absolute value reveal a marked negative association between the two.
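The interpretation of the loadings as correlations can be verified with a small sketch. It assumes standardized variables, so that the covariance matrix is the correlation matrix, and uses simulated data; none of the values correspond to Table 3.

```python
import numpy as np

# Hypothetical standardized data set with some correlated variables
rng = np.random.default_rng(2)
X = rng.normal(size=(80, 4))
X[:, 1] += 0.9 * X[:, 0]
X[:, 3] -= 0.7 * X[:, 2]
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

R = np.corrcoef(Z, rowvar=False)             # correlation matrix
eigvals, eigvecs = np.linalg.eigh(R)
order = np.argsort(eigvals)[::-1]
eigvals, P = eigvals[order], eigvecs[:, order]

# Loadings: eigenvector weights scaled by the square root of each eigenvalue
loadings = P * np.sqrt(eigvals)

# Direct correlation between each variable j and each component score k
scores = Z @ P
n_vars = Z.shape[1]
corr = np.array([[np.corrcoef(Z[:, j], scores[:, k])[0, 1]
                  for k in range(n_vars)] for j in range(n_vars)])
print(np.round(loadings, 3))
```

For standardized variables the two matrices agree, and the squared loadings of each variable sum to one across all components.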

Design of the control chart
Multivariate control charts such as the one built in the present case are based on transforming a vector into a scalar through a quadratic function that detects outliers or extreme directions. When serial correlation in the observations has been ruled out and the process satisfies the condition of stationarity, control charts of this nature can be implemented to monitor high-dimensional processes. Unlike $T^2$ control charts, these are able to detect a shift in the process mean when the displacement is orthogonal to the first eigenvectors of the covariance matrix [22].
The control chart derived from the principal component model is based on the squared prediction error (SPE), given by expression (14):
$$SPE_i = \sum_{j=1}^{p} (x_{ij} - \hat{x}_{ij})^2 \qquad (14)$$
The SPE statistic represents the squared orthogonal distance of a new multivariate observation from the subspace of the principal components. In other words, it quantifies the lack of fit of the new sample with respect to the model that includes the retained components, by capturing the part of the data that the model does not represent. For a sample $x_i$, the residuals, denoted by $e_i$, are given by (15):
$$e_i = (I - P P^{T}) x_i \qquad (15)$$
while the magnitude of the residuals is equivalent [23] to the expression given by (16):
$$Q_i = e_i^{T} e_i = x_i^{T} (I - P P^{T}) x_i \qquad (16)$$
where $e_i$ is the $i$-th row of the residual matrix $E$, which represents the prediction errors of the model, $P$ is the matrix of retained loadings, and $I$ is the identity matrix.
Approximate control limits for SPE can be determined as long as the data can be described by a multivariate normal distribution and the eigenvalues of the covariance matrix are known. The threshold is then given by expression (17):
$$Q_\alpha = \theta_1 \left[ \frac{c_\alpha \sqrt{2 \theta_2 h_0^2}}{\theta_1} + 1 + \frac{\theta_2 h_0 (h_0 - 1)}{\theta_1^2} \right]^{1/h_0} \qquad (17)$$
where $Q_\alpha$ is the upper control limit for SPE at significance level $\alpha$ and $c_\alpha$ is the standard normal deviate corresponding to the upper $(1-\alpha)$ percentile. Meanwhile,
$$\theta_i = \sum_{j=k+1}^{p} \lambda_j^{\,i}, \quad i = 1, 2, 3 \qquad (18)$$
$$h_0 = 1 - \frac{2 \theta_1 \theta_3}{3 \theta_2^2} \qquad (19)$$
and $\lambda_j$ is the $j$-th eigenvalue of the covariance matrix, with $k$ the number of retained components. Errors between target and actual values can be measured to determine their statistical significance and to identify current operating conditions. If SPE exceeds the threshold, that is, $SPE > Q_\alpha$, there are indications of events that affect the covariance structure of $X$, as shown in Figure 1, and it is therefore argued that there are anomalous fluctuations in the process that are not explained by the model. Samples outside the limits of the control chart built from the component scores, or systematic trends and patterns in the score plot, indicate that the process is "out of control", i.e., that a break has occurred in the correlation structure of the estimated model [23].
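A minimal sketch of the SPE chart calculation is given below. It assumes the control limit takes the Jackson-Mudholkar form, computed from the eigenvalues of the discarded components; the function names `spe` and `spe_limit`, the simulated data, the number of retained components and the significance level are all hypothetical.

```python
import numpy as np
from statistics import NormalDist

def spe(X, P_k):
    """Squared prediction error of each row of centered X, given retained loadings P_k."""
    E = X - X @ P_k @ P_k.T                # residuals outside the PCA subspace
    return np.sum(E ** 2, axis=1)

def spe_limit(eigvals, k, alpha=0.01):
    """Approximate upper limit for SPE (Jackson-Mudholkar form, assumed here)."""
    lam = np.asarray(eigvals, dtype=float)[k:]       # discarded eigenvalues
    th1, th2, th3 = (np.sum(lam ** i) for i in (1, 2, 3))
    h0 = 1.0 - 2.0 * th1 * th3 / (3.0 * th2 ** 2)
    c = NormalDist().inv_cdf(1.0 - alpha)            # standard normal deviate
    core = c * np.sqrt(2.0 * th2 * h0 ** 2) / th1 + 1.0 + th2 * h0 * (h0 - 1.0) / th1 ** 2
    return th1 * core ** (1.0 / h0)

# Hypothetical process: 5 variables driven by 2 latent factors plus noise
rng = np.random.default_rng(1)
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 5))
X = latent @ mixing + 0.3 * rng.normal(size=(200, 5))
Xc = X - X.mean(axis=0)

eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
order = np.argsort(eigvals)[::-1]
eigvals, P = eigvals[order], eigvecs[:, order]

k = 2                                      # retained components (assumed)
q = spe(Xc, P[:, :k])
limit = spe_limit(eigvals, k, alpha=0.01)
out_of_control = np.flatnonzero(q > limit) # indices of samples above the limit
print(limit, out_of_control)
```

In practice the limit would be estimated from in-control historical data and then applied to new observations projected with the same mean, scaling and loadings.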
Meanwhile, the univariate control charts constructed from the t-scores of the individual components, which display the behavior of the variables contributing to alarm signals, can serve as the basis for an auxiliary analysis. This is shown in Figure 2.
Under the assumption of normality, the control limits for a new $t$-score in a time interval $k$ and at a significance level $\alpha$ are given by expression (20):
$$\pm\, t_{n-1,\,\alpha/2}\; s_k \sqrt{1 + \frac{1}{n}} \qquad (20)$$
where $s_k$ is the estimated standard deviation of the sample of $t$-scores in the time interval denoted by $k$. The expression $t_{n-1,\,\alpha/2}$ is the critical value of Student's t variable with $n-1$ degrees of freedom at significance level $\alpha/2$ [24].
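As a rough numerical illustration, assume the limits take the form commonly used for score charts, $\pm\, t_{n-1,\alpha/2}\, s \sqrt{1 + 1/n}$, and take purely hypothetical values $n = 30$, $s = 1.5$ and $\alpha = 0.05$, for which $t_{29,\,0.025} \approx 2.045$:

```latex
\pm\, t_{29,\,0.025}\; s \sqrt{1 + \tfrac{1}{n}}
  = \pm\, 2.045 \times 1.5 \times \sqrt{1 + \tfrac{1}{30}}
  \approx \pm\, 2.045 \times 1.5 \times 1.0165
  \approx \pm\, 3.118
```

A new score for that component falling outside roughly $\pm 3.12$ would therefore be flagged; the factor $\sqrt{1 + 1/n}$ widens the limits slightly to account for the score being a new observation rather than part of the reference sample.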

Source: Own elaboration
Figure 2 shows that out-of-control signals emerge in the monitoring charts associated with components 1, 2 and 3; consequently, the process is contaminated by outliers that temporarily break its stability. Although after a time interval the process returns to a state in which only common causes of variability operate, anomalous behavior persists, which leads to the conclusion that the comovements of the variables contained in the sub-dimensions exceed the natural limits of variation in all the cases evaluated.

Conclusions
Multivariate control charts, extensions of their univariate analogues, can become standard tools in statistical process control for identifying situations in which manufacturing systems with multiple variables deviate from their typical behavior.
In the present case study, a monitoring framework was proposed for industrial processes with multiple quality characteristics. It supports the view that control charts built from principal component analysis meet the objective of controlling multidimensional production systems in a simple and pragmatic way. Without entering into theoretical detail, it can be stated that finding these components yields a substantial reduction in dimensionality and reveals the interdependence structures among the observed variables.
The results obtained corroborate that the three retained components explain a majority fraction of the original variability of the data cloud and that the control chart constructed from the sub-dimensions registers the existence of outliers or extreme values. It follows that the process is not in a state of statistical control, a judgement confirmed by careful inspection of the behavior shown by the control charts associated with the individual components. For this reason, actions are needed to improve the consistency of the production process, so that the quality characteristics behave in a relatively homogeneous way and stay within the limits of natural variation. In a case such as this, the additional application of a univariate monitoring tool would be beneficial, since one limitation of the proposed methodology is its inability to identify which variables contribute substantially to the generation of out-of-control signals in the short term.