Data Reduction and Regression Using Principal Component Analysis in Qualitative Spatial Reasoning and Health Informatics

Sabharwal, Chaman Lal; Anjum, Bushra; Sabharwal, Chaman Lal; Anjum, Bushra

doi:10.17562/PB-53-3

Servicios Personalizados

Revista

Articulo

Indicadores

Citado por SciELO
Accesos

Links relacionados

Similares en SciELO

Otros
Otros

Permalink

Polibits

versión On-line ISSN 1870-9044

Polibits no.53 México ene./jun. 2016

https://doi.org/10.17562/PB-53-3

Articles

Data Reduction and Regression Using Principal Component Analysis in Qualitative Spatial Reasoning and Health Informatics

Chaman Lal Sabharwal¹

Bushra Anjum²

^¹ Missouri University of Science and Technology, Rolla, MO-63128, USA (e-mail: chaman@mst.edu).

^² Amazon Inc., 1194 Pacific St., San Luis Obispo, CA-93401, USA (e-mail: banjum@amazon.com).

Abstract

The central idea of principal component analysis (PCA) is to reduce the dimensionality of a dataset consisting of a large number of interrelated variables, while retaining as much as possible of the variation present in the dataset. In this paper, we use PCA based algorithms in two diverse genres, qualitative spatial reasoning (QSR) to achieve lossless data reduction and health informatics to achieve data reduction along with improved regression analysis respectively. In an adaptive hybrid approach, we have employed PCA to traditional regression algorithms to improve their performance and representation. This yields prediction models that have both a better fit and reduced number of attributes than those produced by using standard logistic regression alone. We present examples using both synthetic data and real health datasets from UCI Repository.

Key words: Principal component analysis; regression analysis; healthcare analytics; big data analytics; region connection calculus

I. Introduction

PCA is a way of identifying patterns in data, and expressing the data in such a way as to highlight their similarities and differences. Since patterns in data can be hard to find in data of high dimension, where the luxury of graphical representation is not available, PCA is a powerful tool for analyzing data. The other main advantage of PCA is that once we have found these patterns in the data, then we compress the data, i.e., by reducing the number of dimensions, without much loss of information.

PCA and Singular Value Decomposition (SVD) are interchangeably used for data reduction/compressions whereas statistical techniques such as regression analysis are used for approximation and analysis of data. Such applications include data mining, health informatics, oceanography, meteorology, natural language processing, machine learning, image analysis, geometry visualization.

A. Qualitative Spatial Reasoning

Reasoning about spatial data is a key task in many applications, including geographic information systems, meteorological and fluid flow analysis, computer-aided design, and protein structure databases. Such applications often require the identification and manipulation of qualitative spatial representations, for example, to detect whether one "object" will soon occlude another in a digital image, or to determine efficiently relationships between a proposed road and wetland regions in a geographic dataset. QSR provides representational primitives (a spatial "vocabulary") and inference mechanisms.

Much QSR work has studied purely topological descriptions of spatial regions and their relationships. One representative approach, the Region-Connection Calculus (RCC), provides predicates for expressing and reasoning about the relationships among topological regions (arbitrarily shaped chunks of space). RCC was originally designed for 2D ^[1, ^2]; later it was extended to 3D ^[3]. Herein we introduce PCA to reduce 9-Intersection model to 4-intersection model in both 2D and 3D. The performance of QSR can be improved by reducing the number of intersections, but PCA connection to QSR is non-existent in the literature. Herein we show how (1) PCA can be applied to intersection dimension reduction for QSR spatial data, and (2) the 9-Intersection can be reduced to 4-Intersection for all spatial as well as non-spatial objects.

For example, of item-attribute-concept in RCC, spatial objects are items, their intersections are attributes, and relations are concepts. There are five RCC5 and eight RCC8 concepts in RCC, see Figure 1, and ^[2, ^3].

Fig. 1 RCC5 and RCC8 relations in 2D.

B. Health Informatics

The Big Data revolution has begun for many industries. The healthcare industry has been playing catch up and has finally reached a consensus on the value of Big Data as a transformative tool. Statistical linear and logistic regression that have been the popular mining techniques, but their ability to deal with inter dependent factors is limited. The understanding of principal components, however, has been lacking in the past by non-academic clinicians. It is no surprise that keeping people healthy is costing more money. From the price of medications and the cost of hospital stays to doctors' fees and medical tests, health-care costs around the world are skyrocketing. Much of this is attributed to wasteful spending on such things as ineffective drugs, futile procedures and redundant paperwork, as well as missed disease-prevention opportunities. This calls for mechanism for efficient data reduction and diagnostic tools as pointed out by some of the examples in the literature cited in the next paragraph.

Analysis of this Big Data offers unlimited opportunities for healthcare researchers and it is estimated that developing and using prediction models in the health-care industry could save billions by using big-data health analytics to mine the treasure trove of information in electronic health records, insurance claims, prescription orders, clinical studies, government reports, and laboratory results. According to the Harvard School of Public Health publication entitled The Promise of Big Data, petabytes of raw information could provide clues for everything from preventing tuberculosis to shrinking health care costs-if we can figure out how to apply this data ^[4]. Improving the care of chronic diseases, uncovering the clinical effectiveness of treatments, and reducing readmissions are expected to be top priority use cases for Big Data in healthcare ^[5].

In this paper, we will give general guidelines to address various issues. We explore an adaptive hybrid approach (1) how PCA can be used to reduce data in the original space in addition to transformed space, (2) how PCA can be used to improve standard line regression and logistic regression algorithms, (3) how to use logistic regression in conjunction PCA to yield models which have both a better fit and reduced number of variables than those produced by using logistic regression alone.

The paper is organized as follows. Section II describes the background on linear regression, logistic regression, and principle of component analysis. Section III describes PCA in detail, along with our suggested representational improvements. Section IV presents the hybrid algorithms for regression using PCA. Section V discusses PCA's improved role in dimensionality reduction followed by additional experimental support in Section VI. Section VII concludes the paper.

II. Background

A. Mathematical Notation

In this section, we describe the mathematical notation for terms whose definitions will follow in the paper. A vector is a sequence of elements. All vectors are column vectors and are in lower case bold letters such as x. The n-tuple [x ₁,..., x_n] denotes a row vector with n elements in lowercase. A superscript T is used to denote the transpose of a vector x, so that x^T is a row vector whereas x = [x₁,..., x _n ]^T is a column vector. This notation is overloaded at some places where the ordered pair [x_l, x₂ ] may be used as a row vector, a point in the plane or a closed interval on the real line. The matrices are denoted with uppercase letters, e.g. A, B. For vectors x, y, the covariance is denoted by cov(x, y), whereas cov(x) is used for cov(x, x) as a shortcut ^[6].

If we have m vector values x _l;,...,x_m of an n-dimensional vector x = [x _l;,..., x_n ]^T, these m row vectors are collectively represented by an m x n data matrix A. The k^th row of A is the row vector xkT. Thus the (i, j) element of A becomes the j^th element of the i^th row/observation, xiT.

There are several ways to represent data so that implicit information becomes explicit. For linear representation of vector data, a vector space is equipped with a basis of linearly independent vectors. Usually in data mining, the data is represented as a matrix of row vectors or data instances. Two of the methods for efficient representation of data are regression and PCA.

B. Qualitative Spatial Reasoning

Much of the foundational research on QSR is related to RCC that describes two regions by their possible relations to each other. RCC5/RCC8 can be formalized by using first order logic ^[2] or by using the 9-intersection model ^[1]. Conceptually, for any two regions, there are three possibilities: (1) one object is outside the other; this results in the RCC5 relation DR (interiors disjoint) and RCC8 relation DC (disconnected) or EC (externally connected). (2) One object overlaps the other across boundaries; this corresponds to the RCC5/RCC8 relation PO (proper overlap). (3) One object is inside the other; this results in topological relation EQ (equal) or RCC5 relation PP (proper part). To make the relations jointly exhaustive and pairwise distinct (JEPD), there is a converse relation denoted by PPc (proper part converse), PPc(A,B) = PP(B,A). For a close examination, RCC8 decomposes RCC5 relation PP (proper part) into two relations: TPP (tangential proper part) and NTPP (non-tangential Proper part). Similarly for RCC5 relation PPc, RCC8 defines TPPc and NTPPc. The RCC5 and RCC8 relations are pictorially described in Figure 1.

Each of the RCC8 relations can be uniquely described by using the 9-Intersection framework. It is a comprehensive way to look at any relation between two regions. The 9-Intersection matrix for two regions A and B is given in Table 1, where Int represents the region's interior, Bnd denotes the boundary, and Ext represents the exterior. The predicate IntInt(A,B) is a binary relation that represents the intersection between the interiors of region A and region B; the value of this function is either true (non-empty) or false (empty). Similarly, there are other predicates for the intersection of A's interior, exterior, or boundary with those of B. In QSR, we are only concerned with the presence/absence of an intersection. The actual value of the intersection is not necessary.

Table 1 9-Intersection Matrix for Calculating RCC8 Relations.

For two non-empty bounded regions A and B, the intersection of their exteriors is always non-empty. This is represented in the last column of Table 2. Since it adds no new information, it has been proposed in the literature ^[2] to replace the 9-Intersection with the 8-Intersection model to define the spatial relations. The values of the 8-Intersection framework for the RCC8 framework are given in the first eight columns in Table 2.

Table II Boolean Values for 9-Intersections Required to Distinguish each RCC8 Relation.

In this paper, we show that PCA provides a better alternative to conventional methods of dimensionality reduction in QSR. The analysis is equally applicable to both non-spatial discrete web objects as well as conventional spatial objects such as cuboids and spheres.

In such applications, some threshold may be required to interpret the resulting dimensions. one can simply ignore variation below a particular threshold to reduce the data and still preserve the main concepts of original intent.

C. Health Informatics

There is a significant opportunity to improve the efficiencies in the healthcare industry by using an evidence-based learning model, which can in turn be powered by Big Data analytics ^[7]. A few examples are provided below. The company Asthmapolis has created a global positioning system (GPS) enabled tracker that monitors inhaler usage by patients, eventually leading to more effective treatment of asthma ^[8]. Center for Disease Control and Prevention (CDC) is using Big Data analytics to combat influenza. Every week, the CDC receives over 700,000 flu reports including the details on the sickness, what treatment was given, and whether not the treatment was successful.

The CDC has made this information available to the general public called FluView, an application that organizes and sifts through this extremely large amount of data to create a clearer picture for doctors of how the disease is spreading across the nation in near real-time ^[9]. GNS Healthcare, a Big Data analytics company, has come together with the health insurance company Aetna to help combat people at risk or already with metabolic syndromes. The company has developed a technology known as Reverse Engineering and Forward Simulation that will be put to work on the data of Aetna insurance subscribers. Essentially, the technology will search for the presence of five warning signs: large waist size, high blood pressure, high triglycerides, low High density Lipoprotein, and high blood sugar. A combination of any three of these lead to the conclusion that the patient is suffering from the condition ^[10].

Researchers at Allazo Health are creating systems designed to improve on medication adherence programs by using predictive analytics. For example, predict what interventions are mostly likely to work for that patient based on what interventions already worked for other patients with similar demographics, behavioral profiles, and medical history ^[11]. Another area of interest is the surveillance of adverse drug reactions (ADRs) which has been a leading cause of death in the United States ^[12]. It is estimated that approximately 2 million patients in USA are affected by ADRs and the researchers in ^[13], ^[14] and ^[15] propose an analytical framework for extracting patient-reported adverse drug events from online patient forums such as DailyStrength and PatientsLikeMe.

Simplistically speaking, in all the above examples the researchers are trying to model and predict a dependent phenomenon based on a number of predictors that have been observed. The dependent parameter can be discrete, nominal, or even binary / logical. There are two problems at hand: dimension reduction and prediction. First problem is data optimization. The optimization problem is data cleaning, and how we can reduce the set of predictors while still maintaining a high prediction accuracy for the dependent variable. The second problem is the prediction of the dependent variable from the reduced dataset. This problem is analyzing whether some event occurred or not given the success or failure, acceptance or rejection, presence or absence of observed simulators. This is where PCA comes into the picture.

III. Principle Component Analysis

The PCA is a well-known data reduction tool in academia for over 100 years. PCA creates a linear orthogonal transformation of correlated data in one frame (coordinates system) to uncorrelated data in another frame. The huge dimensional data can be transformed and approximated with a few dimensions. PCA finds the directions of maximum variance in high-dimensional data and projects it onto a smaller dimensional subspace while retaining most of the original information. If the data is noisy, PCA reduces noise implicitly while projecting data along the principal components. In this paper, we explore an adaptive hybrid approach to show that PCA can be used not only for data reduction but also for regression algorithm improvement. We will describe the hybrid model for both linear and logistic regression algorithms.

Before delving further, we would like to discuss the terms PCA and SVD further as they are used interchangeably in the literature. There is a clear distinction between them.

Definition 1. For a real square matrix A, if there is a real number λ and a non-zero vector x such that A x = λx, then λ is called an eigenvalue and x is called an eigenvector.

Definition 2. For a real matrix A (square or rectangular), if there a non-negative real number σ and a non-zero vectors x and y such that A^TX = σ y, and A y = σx, then σ is called a singular value and x and y represent a pair of singular vectors ^[16].

Note 1. λ can be negative or positive, but σ is always non-negative.

Note 2. σ² is an eigenvalue of covariance matrices AA^T and A^T.A This can be quickly seen

ATx=σy→AATx=σAy→AATx=σσx=σ2x

Therefore

AATx=σ2x

Similarly, we can see that

ATAy=σ2y

An eigenvector is a direction vector supporting the spread of data along the direction of the vector. An eigenvalue measures the spread of data in the direction of the eigenvector. Technically, a principal component can be defined as a linear combination of optimally weighted observed variables. The words "linear combination" refer to the fact that weights/coefficients in a component are created by the contribution of the observed variables being analyzed. "Optimally weighted" refers to the fact that the observed variables are weighted in such a way that the resulting components account for a maximal amount of variance in the dataset.

This will also be a good place to introduce Least Square Approximation (LSA). LSA and PCA are both linear transformations. However, they accomplish the same task differently. In a vector space, for any vector v and a unit vector u, we have v= v·u u + (v - v·u u). Finding the vector u that minimizes |( v - v·u u)| is the same as finding a vector u that maximizes |v· u|. LSA calculates the direction u that minimizes the variance of data from the direction u whereas PCA computes the direction u (principal component) that maximizes the variance of the data along the direction u. This concept is applied to all data instance vectors collectively resulting in covariance matrix AA^T of data matrix and u is the eigenvector of AA^T with largest eigenvalue.

If A is a real square symmetric matrix, then eigenvalues are real and eigenvectors are orthogonal ^[17]. PCA computes eigenvalues and eigenvectors of a data matrix to project data on a lower dimensional subspace. PCA decomposition for a square symmetric matrix A is A = UDU^T where U is the matrix of eigenvectors and D is diagonal matrix of eigenvalues of A. Since AU = UD, U is orthogonal, therefore A=UDU^T. Also PCA orders the eigenvalues in the descending order of magnitude. The columns of U and diagonal entries of D are arranged correspondingly. Since eigenvalues can be negative, the diagonal entries of D are ordered based on absolute values of eigenvalues.

The SVD decomposition is applicable to a matrix of any size (not necessarily square and symmetric). A = USV^T, where U is the matrix of eigenvectors of covariance matrix AA^T, V is the matrix of eigenvectors of covariance matrix A^TA and S is a diagonal matrix with eigenvalues as the main diagonal entries. Hence, PCA can use SVD to calculate eigenvalues and eigenvectors. Also SVD calculates U and V efficiently by recognizing that if v is an eigenvector of A^TA for non-zero eigenvalue λ, then A v is automatically an eigenvector of AA^T for the same eigenvalue λ where λ≥0. If A v is an eigenvector, say, u, then A v is a multiple of eigenvector u (since u and v are unit vectors) and it turns out that Av = √λ u , or u = σAv, where σ = 1/ √λ. By convention, SVD ranks the eigenvectors on descending order of eigenvalues. If U and V are matrices of eigenvectors of AA^T and A^TA and S is the matrix of square roots of eigenvalues on the main diagonal, then A can be expressed as A = USV^T ^[17]. The eigenvalues of a square symmetric matrix A are square roots of eigenvalues of AA^T We will show that it is sufficient to have S as diagonal matrix of only non-zero eigenvalues and U, V to have columns of only the corresponding eigenvectors.

In data mining, the m observations/data points are represented as an m x n matrix A where each observation is a vector with n components. PCA/SVD help in transforming physical world data / objects more clearly in terms of independent, uncorrelated, orthogonal parameters.

The first component extracted in principal component analysis accounts for a maximal amount of variance in the observed variables. The second component extracted will have two important characteristics. First, this component will account for a maximal amount of variance in the dataset that was not accounted for by the first component. The second characteristic of the second component is that it will be uncorrelated with the first component. The remaining components that are extracted in the analysis display the same two characteristics. Visualizing graphically, for the first component direction (eigenvector) e₁, the data spread is maximum m₁; for the next component direction e₂, the data spread is next maximum m₂ (m₂ < m_l the previous maximum) and the direction e₂ is orthogonal to previous direction e₁.

Note 3. For the eigenvectors u of A, any non-zero multiple of u is also an eigenvector of A. In U, the eigenvectors are normalized to unity. If u is a unit eigenvector, then -u is also a unit eigenvector. Thus the sign can be arbitrarily chosen. Some authors make the first nonzero element of the vector to be positive to make them unique. We do not follow this convention. Since the eigenvectors are ordered, we make the k^th element of vector u _k positive. If k^th element is zero, only then first non-zero element is made positive. This is a better representation of eigenvectors as it represents the data in a right-handed system as opposed to asymmetrical ordering, see Figure 2. Using Matlab svd(A) on a simple set of only two data points, the algorithm generates two orthogonal vectors v ₁, v₂ as given in Figure 2(a, b). our algorithm finds green vectors in Figure 2(c), which is more natural and of a right-handed orientation.

Fig. 2 Gray set of axes are standard xy-system. Blue dots represent a set of data points. Red dots are projections of data points on the principal components. (a) The axes v1, v2 in red are the eigenvectors that are computed by Matlab using the set of data points. (b) The axes v1, v2 in blue are the directions so that each eigenvector is unique making the first non-zero component positive, used in the literature. (c) The axes v1, v2 in green are generated when the sign in an eigenvector is chosen by our scheme.

A. Properties of PCA/SVD

In essence, PCA/SVD takes a high dimensional set of data points and reduces it to a lower dimensional space that exposes the knowledge that is hidden in the original space. Moreover, the transformation makes similar items more similar, and dissimilar items more dissimilar.

It is not the goal of SVD to get back the original matrix. However, the reduced dimensions do give insight about the original matrix, as we will see in the case of QSR and health informatics.

There are three important properties of SVD:

It can transform correlated variables into a set of uncorrelated ones. SVD computes a transformation T such covariance of TA is diagonal matrix D, e.g. TA(TA)^T is D. It can extract some relationships hidden in the original data. It can also reduce noise in the data ^[18].
Since the eigenvectors are ordered on most variation to least variation in descending order of eigenvalues, deleting the trailing eigenvectors of smaller variation ensures minimal error. It finds the best approximation of the original data points by projecting data on a fewer dimensional subspace; see Figure 3 of linear data.
The data can be reduced to any desired size. The accuracy of smaller size data depends on the reduction in the number of dimensions ^[19]. By deleting eigenvectors corresponding to least variation, we effectively eliminate noise in the representation of data vectors ^[20].

Fig. 3 Data is almost linear along v₁ direction in 3D, the noise is along the y and z-axis. Blue lines reflect data spread along v1, v2 and v3 directions.

B. Dimension Reduction using PCA

There is a multitude of situations in which data is represented as a matrix. In fact, most of the real world data is expressed in terms of vectors and matrices where each vector has a large number of attributes. Matrices are used to represent data elegantly, and efficiently. In a matrix, each column represents a conceptual attribute of all the items, and each row represents all attributes related to individual data item. For example, in health informatics, rows may represent patients and columns may represent disease diagnostic symptoms or the rows may represent medicines and columns may represent side effects or adverse reactions. Similarly, in spatial-temporal reasoning, rows represent pairs of objects and columns represent temporal intersection properties of objects.

The goal of PCA is to reduce the big data matrix A to a smaller matrix retaining approximately the same information as the original data matrix and make the knowledge explicit that was implicit in the original matrix.

Example: In this example, we have 20 three-dimensional points in 20 x 3 matrix A. Each row of A has three values for x-, y-, z- coordinates. Visually we can see that the data points have a linear trend, in 3D, but data points have noise components in the y, and z coordinates. PCA determines the direction and eliminates noise by eliminating the eigenvectors corresponding to smaller eigenvalues (see Figure 3). Black "+" symbols represent the original data points with noise. As can be seen visually, they are almost linear. They can be approximated along eigenvector v ₁ in one dimension of the v₁v₂v₃ -system. Red lines depict the data trend in the three dimensional space along v₁, v₂ and v₃ direction. Blue lines depict the data spread in the three dimensional space along v₁, v₂ and v₃ direction. Since the data spread or blue line on v₂ and v₃ is almost non-existent (closer to the origin), it is an indicator of noise in the linear data. We can see that the data can be represented satisfactorily using only one direction.

Table 3 enumerates the numeric values of the PCA decomposition. First row lists the eigenvalues, which represent the spread of points along the principal components. Second row shows the eigenvectors corresponding to the eigenvalues. The next three rows represent the error on using first, first two, first three eigenpairs. Let newA be the USV^T based on eigenpairs used. Then Error Original is the |A-newA |/|A| percentage error in the original space. Error Projection is the |AV-newAV|/|AV| percentage error in the projection space. Error Eigenvalues is the ∑p=k+1,3λp∕∑p=1,3λp percentage error in the eigenvalue space.

Table III Eigenvalues, eigenvectors and percent errors for one dimensional data in 3D

Table 3 reveals that v1, is the data dimension and v2, v3 correspond to the noise in this case. If we use two (or three) eigenpairs there is no error. Hence, the data can be represented in one dimension only instead of three dimensions with slight error of 2%. This error is attributed to the difference between the non-zero eigenvalues. The third and fifth rows in Table 3 indicate that the error metrics are equivalent in the original and projection space.

IV. Principal Component And Regression Analysis

There are several ways to model the prediction variables, e.g., linear regression analysis, logistic regression analysis, and PCA. Each has its own advantages. Though regression analysis has been well known as a statistic technique, the understanding of principles, however, has been lacking in the past by non-academic clinicians ^[21]. In this paper we explore an adaptive hybrid approach where PCA can be used in conjunction with regression to yield models which have both a better fit and reduced number of variables than those used by standalone regression. We will apply our findings to a medical dataset obtained from UCI Machine Learning Repository about liver patient ^[22]. We use the records for detecting the existence or non-existence of liver disease based on several factors such as age, gender, total bilirubin, etc.

A. PCA and Linear Regression

In linear regression, the distance between observed point (x_i,y_i) from the computed point (x_i,a+bx_i) on line y=a+bx, is minimized along the y direction. In PCA, the distance of the observed point (x_i,y_i) is minimized to a line which is orthogonal to the line y=a+bx, is minimized. The details of linear algebra concepts in this section are found in ^[17]. We assume that data is standardized to mean zero; and normalized appropriately by the number of objects.

For one independent and one dependent variable, the regression line is y=a+bx where the error between the observed value y_i and estimated value a+bx_i is minimum. For n points data, we compute a and b by using the method of least squares that minimizes:

This is a standard technique that gives regression coefficients a and b where a is the y-intercept and b is the slope of the line.

If the data is mean-centered, then a=0 because x- = ∑i=1,nxin=0 and y -=∑i=1,nYin=0. Alternatively, we can replace x_i with x_i - x and y_i with y_i - y. The direction of the line is always obtained from b as 11+b2,b1+b2.

For more than one independent variables, say m, we have

Then we compute b _k by minimizing:

Thus, it determines a hyper-plane which is a least square approximation of data points. If data is mean-centered, then b₀=0. It is advised to mean-center data to simplify the computations.

It is interesting to note that as a result of linear regression, the data points may not be at least distance from the regression line. Here we present an algorithm using PCA that results in a better least distance line. There are two ways in which regression analysis is improved: data reduction and hybrid algorithm. As a first step, PCA is used on the dataset for data reduction. For improved performance in the second step, we create a hybrid linear regression algorithm coupled with PCA.

Example: We have a dataset of randomly created 20 points. Matlab computes the regression line as the red line, see Figure 4. PCA computes the blue line. As can be visually seen, the blue line is the least distance line instead of the red regression line. For direction vectors and approximation error of data points from the line, see Table 4.

Fig. 4 Using the data points "+", traditional linear regression line is shown in red (LSA) and a principal component is shown in blue (PCA). Visually we can see that the points are much closer to blue line than to the red line.

Table IV Comparison of Linear Regression Methods.

Both linear regression and PCA compute vectors so that the variation of observed points from the computed vector is minimum. However, as shown in Figure 4, they give two different vectors, red via linear regression, and blue via PCA. The approximation error indicates the PCA adapted regression line is a better approach than by LSA method.

Data reduction is attributed to the non-zero eigenvalues of the data matrix A. Since m x n data matrix is decomposed into A = USV^T where U is m x m, S is m x n, and V is n x n. If there are only k non-zero eigenvalues where k ≤ min(m, n), the matrix A has lossless representation by using only k columns of U, k columns of V and kxk diagonal matrix S. If an eigenvalue is very small as compared to others, then ignoring it can lead to further data reduction while retaining most of the information.

B. PCA and Logistic Regression

Along with linear regression, logistic regression (log linear) has been a popular data mining technique. However, both when used stand alone, have limited ability to deal with inter dependent factors. Linear regression is suitable for data that exhibits linear relation, but as all data does not have linear trend, the Logistic models estimate the probability and is applicable to "S-shaped" data. This model is particularly suitable for population growth with limiting condition. As it was with linear regression, it is beneficial to use logistic regression when coupled with PCA.

Population growth is described by exponential function; population is controlled by the limiting condition. The liver disease model is a composition of these two functions as shown below. The mapping from linear to logistic function is described as follows ^[23].

Thus for logistic function P(x) ∈ (0, 1) instead of linear function P(x) = a + b_x, the function becomes:

To solve for a and b, we write:

We make use of PCA in designing better logistic regression algorithm, presented below. Here the hybrid algorithm is presented for two-dimensional data, however, it can be easily extended to higher dimensions.

Example: In this example, we have a training dataset of 20 students obtained from ^[24] who studied for the exam for given hours (horizontal axis) and passed or failed (vertical axis) the test. The curves are the trained logistic regression predictor for chances of passing the exam. Fail and pass are coded numerically as 0 and 1, see Figure 5.

Fig. 5 Using the data points "+", usual logistic regression curve is given in red and the regression curve generated by the proposed hybrid model is given in blue.

The approximation errors for are shown in Table 5. The example exhibits that the PCA (blue) curve is a better approximation predictor with 60% less approximation error.

Table V Comparison of logistic regression methods.

C. How do we measure the goodness of a model?

In data mining there are standard measures, called gold standard, for labeling and measuring the prediction accuracy. These measures are useful in comparing the results of classification. Here we will list three metrics, Precision (P), Recall (R) and F-1. In these metrics, actual value and predicted value are used to create a label for each instance outcome, where the instance outcome is either positive or negative. The labels used here are True Positive (TP), False Positive (FP) and False Negative (FN) to measure the accuracy (the reader is encouraged to consult ^[17] for further details). After labeling, we define the Precision (P), Recall (R) and F-metrics to measure the effectiveness of the prediction on (1) correct prediction on all positive instances and (2) TP prediction on a sample of instances (positive and negative) under investigation on the training data.

The measure F-1 is the weighted Harmonic average of P and R. It is the reciprocal of the weighted average of the reciprocals of P and R.

For 0 ≤ α ≤ 1, it simplifies to:

For α = 0, it turns out to be the recall measure R, for α = 1, it becomes precision measure P, and for α = 1/2, it further simplifies to traditional measure

This is the preferred measure when there are fewer misses of both positive and negative instances, (i.e. both FN and FP are small) ^[17].

Goodness of fit is an interesting analysis criterion. our goal is to show how the hybrid linear/logistic regression model is better than the more straightforward measures generated for linear/logistic measures.

V. Dimensionality reduction using PCA

The nature of data dictates how many dimensions can be reduced. The data is not just the attributes, but the dependency and redundancy among the attributes. If A=USV^T is full dimension SVD of A, where A is m x n, U is m x m, V is n x n, S is m x n, then the total size for decomposition representation of A is m x m + n x n + m x n, which is larger than m x n, the size of A. our goal is to find an integer k smaller than m and n, and use first k columns of U and first k columns of V and restrict S to first k eigenvalues to show the effect of dimensionality reduction. Since eigenpairs are sorted on descending order of variance, deleting the least variation components do not cause significant error in data ^[25].

In practice, it is not our intent to reconstruct the original matrix A from reduced USV^T but to view the data from a fresh perspective and to use the reduced representation to extract information hidden in the original representation. PCA is used to reduce dimensionality in the new space, not the original space. our purpose to see if we can leverage PCA to reduce dimensionality in the original space. This is an open question.

A. Dimension Reduction in Qualitative Spatial Reasoning

PCA has been used mainly with numerical data. If the data is categorical or logical, then data is first converted to numerical. We will see how PCA has the ability to resolve and isolate spatial-temporal patterns in the data presented in Table 2. We present a new robust PCA enabled method for QSR. Table 2 describes eight topological relations between pairs of spatial objects. The values of entries are true and false. In order to use PCA, we first covert the logical data to numerical data. We use 1 for false and 0 for true, see Table 6. our goal is row dimension reduction. As intersection is a complex operation and also computationally expensive, we want to reduce the number of intersections required. For example, for 1000 pairs of objects, there will be 9000 pairwise intersections. By eliminating one intersection, we can reduce 9000 to 8000 intersections, almost 11% improvement in execution. We will show that PCA gives insights, using which we can do better. In fact, we are able to reduce 9000 to 4000 intersections. This is more than 55% reduction in computation time! This means we can replace 9-Intersection model by 4-Intersection model which is now applicable to spatial as well as non-spatial objects, like web documents.

Table VI Numeric Values for 9-Intersections Required to Distinguish each RCC8 Relation.

For RCC8, the item-attribute-concept becomes object pair--9-Intersection-relation classification. In Table 6, row header represents a pair of spatial objects, column headers are the 9-Intersection attributes, and last column RCC8 is the classification of the relation based on the intersections. Table 2 and Table 6 show a sample of eight pairs of objects, one of each classification type.

We consider Table 6 is an 8 x 9 input matrix A. on using Matlab SVD on A^TA we get nine eigenvectors and nine eigenvalues of A^TA shown in Table 7. Since five eigenvalues are zero, the corresponding eigenvectors are useless. This tells us that nx9 data can be replaced with nx4 right away without any loss of information.

Table VII Eigenvectors for RCC8, rows are Eigenvectors; Last column is Eigenvalues.

In Table 8, first row enumerates the eigenvalues of A^TA The next rows represent the error on using first k eigenpairs (where k is the column number). newA is USV^T based on k eigenpairs used. Error Original is the |A-newA|/|A| percentage error in the original space. Error Projection is the |AV-newAV|/|AV| percentage error in the projection space. Error Eigenvalue is the ∑p=k+1,9λp/∑p=1,9λp percentage error in the eigenvalue space. Error Quantization uses quantization before error calculation. For example, now with quantization using 4 eigenvalues we see that there is no error between newA and A. This is what we expected as A has boolean elements.

Table VIII Eigenvalues, Eigenvectors, and Percent Errors for RCC8 Reduction using PCA.

Table 8 shows that the transformed space created using 4 eigenvalues retains perfect information. This means that for all object pairs, the relations can be described with 4 eigenvectors as an n x 4 matrix instead of nx9 matrix with zero error. So how can we deduce the original dimensions that retain most of the information? We explore that in the next section.

VI. Experiments and outcomes

Here we show the application of PCA for dimension reduction in qualitative spatial reasoning and liver disease data. In addition decision tree is used for spatial data classification whereas the improved logistic regression is applied to liver disease data classification.

A. Qualitative Spatial Reasoning

In Section IV, we determined that 4 attributes are sufficient to classify QSR relations in the transformed space. However, it does not tell anything about the attributes in the original space. Now we will see if we can translate this new found knowledge into the original space of Table 2. How can we find four intersection attributes that will lead to the 8 distinct topological relations?

From careful observation of Table 2 we see that the IntInt and BndBnd columns have the most useful information in the sense that they are sufficient to partition the RCC8 relations into eight jointly exhaustive and pairwise distinct (JEPD) classes, which can be further grouped into three classes: {DC, EC}, {NTTP, NTTPc}, and {EQ, TPP, TPPc, PO}.

We revisit Table 2 as Table 9 by shading some entries and analyze them. It shows that only 4-intersections are sufficient for classification of topological relations. The nature of data suggests that the remaining attributes are not necessary. This table can be interpreted and formulated in terms of rules for system integration. These rules are shaded and displayed for visualization in the form of Table 9.

Table IX RCC8 Relations Attributes for Classification, (Revisiting Table 2).

Thus, Table 9 reveals that the spatial relations can be specified by at most four intersection attributes. The shaded columns of Table 9 are transcribed into a decision tree for easy visualization of the rules to classify the RCC8 eight relations, see Figure 6.

Fig. 6 Classification tree for the topological relations, where T and F represent whether the objects intersect or not respectively.

This conclusion makes no assumptions about the objects being spatial or non-spatial as long as they are valid. In addition, this analysis is applicable to discrete and continuous objects alike.

B. Health Informatics

We will use public domain dataset from uCI Machine Learning Repository ^[22] to automate the simplicity, applicability and usability of our approach.

For application of our algorithm, we selected liver disease classification dataset. This dataset was selected particularly as most of its attribute were numeric and classification attribute is binary representing presence or absence of the disease. This dataset is compatible with logistic regression and is also well suited to PCA that processes numerical values only.

We obtained the dataset from Machine Learning Repository at the university of California, Irvine ^[22]. The dataset contains liver disease information about 583 patients out of which 416 are with liver disease and 167 are healthy. The dataset consists of 441 male and 142 female patients. The liver disease classification is based on 10 parameters: age of the patient, gender of the patient, total Bilirubin, direct Bilirubin, Alkaline Phosphotase, Alamine Aminotransferase, Aspartate Aminotransferase, total Proteins, Albumin, Albumin and Globulin ratio. There were two types of recommendations based on these experiments: patient has liver disease or patient does not have liver disease.

This is a fairly small size dataset for classification of 583 patients. Learning from this dataset can be used to predict possible disease for a new patient quickly without further analysis. The goal is not data mining per se, but to show the feasibility of improved algorithms over the existing algorithms and data reduction to classify liver disease. The reduction in one attribute reduces the data size by 9%. We applied PCA on the data to reduce 10 attributes to 3 or 4 attributes, which contribute the most to the eigenvector corresponding to the highest eigenvalue, while retaining approximately the same predictive power as the original data.

For experiment, we created two versions of the dataset: first dataset is raw, the second dataset is mean-centered with unit standard deviation. PCA determines that there is only one non-zero eigenvalue all other eigenvalues are insignificant. one non-zero eigenvalue is shown in Table 10.

Table X Eigenpairs and error in data reduction.

This indicates that only a single attribute in the transformed space is sufficient to diagnose the patients. But this does not tell us which original attributes contributed to reduction. The principal components on normalized data are more realistic in this case, as the normalized data attributes values are evenly distributed. For nominal attributes, mapping nominal to numerical can make a difference. However, covariance and correlation approaches are complementary.

The principal component corresponding to non-zero eigenvalue is a linear combination of original attributes. Each coefficient in it is a contribution of the original data attributes. How do we select the fractions of original attributes because the coefficients in this vector are real?

The only thing it means is that each coefficient is a fractional contribution of the original data attributes. It is clear that the three (Alkaline Phosphatase Alamine Aminotransfe-rase, Total Proteins, Aspartate Aminotransferase) of the ten coefficients are more dominant than the others, however normalized data analysis found three more slightly less dominant coefficients. In either case, the contribution of the original three attributes is more than 95%. Eliminating the other attributes, we compute the approximation error due to dimension reduction to three attributes.

PCA analysis shows that even after 60% reduction, using only 40% of data, the precision is almost the same whereas the gain in computation performance is significant, see Figure 7. For recall, the reduced data regression misses more negatives see Table 11. It is preferable to miss less positive than more negatives. Table 11 corresponds to traditional logistic regression Table 12 corresponds to hybrid logistic regression algorithm. It shows that hybrid algorithm consistently outperforms the traditional algorithms.

Fig. 7 Error in estimating the original data from eigenvectors where x-axis represents the number of eigenvectors used in data approximation (starting from the most significant to the least significant one) and y-axis represents the error percentage of the estimation.

Table XI Error comparison metrics Traditional Algorithm.

Table XII Error comparison metrics Hybrid Algorithm.

VII. Conclusion

Principal components analysis is a procedure for identifying a smaller number of uncorrelated variables, called "principal components", from a large set of data. The goal of principal components analysis is to explain the maximum amount of variance with the fewest number of principal components.

Principal components analysis is commonly used as one step in a series of analyses. We use principal components analysis to reduce the number of variables and avoid multi-collinearity, or when we have too many predictors relative to the number of observations. We have used PCA in two diverse genres, QSR and Health Informatics to improve traditional data reduction and regression algorithms.

QSR uses 9-Intersection model to determine topological relations between spatial objects. In general, PCA utilizes numerical data for analysis and as QSR data is logical bivalent, we mapped the logical data to numerical data. PCA determined that 4-attributes are adequate in the transformed space. In general, reduction in transformed space does not tell anything about reduction in base space. However, in this case study, we leveraged PCA to determine the possibility of reduction in the base space. We succeeded in achieving similar reduction the original space of RCC8 relations. This yields more than 55% efficiency in execution time.

We also presented hybrid algorithms that adaptively used PCA to improve the linear and logistic regression algorithms. With experiments, we have shown the effectiveness of the enhancements. All data mining applications that dwell on these two algorithms will benefit extensively from our enhanced algorithms, as they are more realistic than the traditional algorithms. The tables in the paper body vouch for this improvement. We applied our algorithms to the Liver Patient dataset to demonstrate the usability and applicability of our approach, especially in the area of health related data.

VIII. References

[1] M. J. Egenhofer, R. Franzosa, "Point-Set topological Relations", International Journal of Geographical Information Systems 5(2), pp. 161-174, 1991. [ Links ]

[2] D.A. Randell, Z. Cui, A.G. Cohn, "A Spatial Logic Based on Regions and Connection". KR92:165-176, 1992. [ Links ]

[3] C. Sabharwal, J. Leopold, and Nathan Eloe, "A More Expressive 3D Region Connection Calculus", Proceedings of the 2011 International Workshop on Visual Languages and Computing DMS'11, Florence, Italy, Aug. 18-20, 2011, pp. 307-311, 2011. [ Links ]

[4] The Promise of Big Data, Harward School of Public Health Magazine, pp. 15-43, 2012 [ Links ]

[5] A. Bernard: Healthcare Industry Sees Big Data As More Than a Bandage, CIO, 2013. [ Links ]

[6] C. M. Bishop, Pattern Recognition and Machine Learning, 2006 Springer. [ Links ]

[7] The Global Use of Medicines: Outlook Through 2016, IMS Institute for Healthcare Informatics, pp.1-36, 2012. [ Links ]

[8] S. Israel: Contextual Health vs The Elephant in the Hospital, Forbes, Tech, pp.1-10, 2013. [ Links ]

[9] J. Bort: How the CDC Is Using Big Data to Save You from the Flu, Business Insider. [ Links ]

[10] A. Parmar, "Want to know if you will develop high blood pressure next year? With big data magic you can", MedCity News, 2012. [ Links ]

[11] P. Groves, B. Kayyali, D. Knott, S. V. Kuiken, "The 'Big Data' Revolution in Healthcare", Center of US Health System Reform Business Technology Office, pp. 1-20, 2013. [ Links ]

[12] B.W. Chee, R. Berlin, B. Schatz, "Predicting Adverse Drug Events from Personal Health Messages". In: Annual Symposium Proceedings, pp. 217-226 (2011). [ Links ]

[13] X. Liu, and H. Chen, 2013. "Azdrugminer: An Information Extraction System for Mining Patient Reported Adverse Drug Events in Online Patient Forums," Smart Health. Springer, pp. 134-150. [ Links ]

[14] C. C. Yang, L. Jiang, H. Yang, M. Zhang, "Social Media Mining for Drug Safety Signal Detection" ACMSHB'12, October 29, 2012, Maui, Hawaii, USA. [ Links ]

[15] H. Yang and C. C. Yang, "Using Health-Consumer-Contributed Data to Detect Adverse Drug Reactions by Association Mining with Temporal Analysis", ACMTrans. Intell. Syst. Technol. 6, 4, Article 55 (July 2015). [ Links ]

[16] H. Hotelling, "Analysis of a complex of statistical variables into principal components" Journal of Educational Psychology, 24, 417-441, and 498-520, 1993. [ Links ]

[17] H. Jim: Linear Algebra 2014 http://joshua.smcvt.edu/linearalgebra [ Links ]

[18] J. Shlens A Tutorial on Principal Component Analysis, arXiv:1404.1100 [cs.LG], pp. 1-15, 2014. [ Links ]

[19] K. Baker, Singular Value Decomposition Tutorial, http://www.ling.ohio-state.edu/~kbaker/pubs/Singular_Value_Decomposition_Tutorial.pdf, January 2013. [ Links ]

[20] H. Y. Chen, R. Liegeois, J. R. de Bruyn, and A. Soddu, "Principal Component Analysis of Particle Motion", Phys. Rev. E 91, 042308, 15 April 2015. [ Links ]

[21] Ling Jiang, Christopher C. Yang, and Jiexun Li, "Discovering Consumer Health Expressionsfrom Consumer-Contributed Content" SBP 2013, LNCS 7812, pp. 164-174, 2013. [ Links ]

[22] UCI Machine Learning Repository, Liver Patient Dataset https://archive.ics.uci.edu/ml/datasets/ [ Links ]

[23] A. Goel, R. G Pinckney and B. Littenberg, APACHE II Predicts Long-term Survival in COPD Patients Admitted to a General Medical Ward, J Gen Intern Med. 2003 Oct; 18(10): 824-830. [ Links ]

[24] Wikipedia: https://en.wikipedia.org/wiki/F1_score [ Links ]

[25] J. Leskovec, A. Rajaraman, J. D. Ullman, Datamining of Massive Datasets, 2014. [ Links ]

Received: January 25, 2016; Accepted: May 20, 2016

This is an open-access article distributed under the terms of the Creative Commons Attribution License