Decomposition#
Principal Component Analysis (PCA)#
Principal component analysis is usually known as PCA. PCA is an unsupervised learning method, in which the training data we feed to the algorithm does not need the desired labels. The aim of PCA is to reduce the dimension of high-dimensional input data. For example, there are x1, x2, x3, x4 and y, up to four columns of data. However, not all kinds of data are essential for the label y. So the PCA could be useful to abandon less important x for regressing/classifying y and accelerate data analysis.
Note : This part would show the whole process of PCA, including data-processing and model-running.
Preparation#
First, after ensuring the Geochemistry Pi framework has been installed successfully (if not, please see docs ), we run the python framework in command line interface to process our program: If you do not input own data, you can run
geochemistrypi data-mining
If you prepare to input own data, you can run
geochemistrypi data-mining --data your_own_data_set.xlsx
The command line interface would show
-*-*- Built-in Data Option-*-*-
1 - Data For Regression
2 - Data For Classification
3 - Data For Clustering
4 - Data For Dimensional Reduction
(User) ➜ @Number: 4
You have to choose **Data For Dimensional Reduction** and press 4 on your own keyboard. The command line interface would show
Successfully loading the built-in data set 'Data_Decomposition.xlsx'.
--------------------
Index - Column Name
1 - CITATION
2 - SAMPLE NAME
3 - Label
4 - Notes
5 - LATITUDE
6 - LONGITUDE
7 - Unnamed: 6
8 - SIO2(WT%)
9 - TIO2(WT%)
10 - AL2O3(WT%)
11 - CR2O3(WT%)
12 - FEOT(WT%)
13 - CAO(WT%)
14 - MGO(WT%)
15 - MNO(WT%)
16 - NA2O(WT%)
17 - Unnamed: 16
18 - SC(PPM)
19 - TI(PPM)
20 - V(PPM)
21 - CR(PPM)
22 - NI(PPM)
23 - RB(PPM)
24 - SR(PPM)
25 - Y(PPM)
26 - ZR(PPM)
27 - NB(PPM)
28 - BA(PPM)
29 - LA(PPM)
30 - CE(PPM)
31 - PR(PPM)
32 - ND(PPM)
33 - SM(PPM)
34 - EU(PPM)
35 - GD(PPM)
36 - TB(PPM)
37 - DY(PPM)
38 - HO(PPM)
39 - ER(PPM)
40 - TM(PPM)
41 - YB(PPM)
42 - LU(PPM)
43 - HF(PPM)
44 - TA(PPM)
45 - PB(PPM)
46 - TH(PPM)
47 - U(PPM)
--------------------
(Press Enter key to move forward.
Here, we just need to press any keyboard to continue.
World Map Projection for A Specific Element Option:
1 - Yes
2 - No
(Plot) ➜ @Number::
We can choose map projection if we need a world map projection for a specific element option. Choose yes, we can choose an element to map. Choose no, skip to the next mode. More information of the map projection can be seen in map projection. In this tutorial, we skip it and gain output as:
-*-*- Data Selected -*-*-
--------------------
Index - Column Name
1 - CITATION
2 - SAMPLE NAME
3 - Label
4 - Notes
5 - LATITUDE
6 - LONGITUDE
7 - Unnamed: 6
8 - SIO2(WT%)
9 - TIO2(WT%)
10 - AL2O3(WT%)
11 - CR2O3(WT%)
12 - FEOT(WT%)
13 - CAO(WT%)
14 - MGO(WT%)
15 - MNO(WT%)
16 - NA2O(WT%)
17 - Unnamed: 16
18 - SC(PPM)
19 - TI(PPM)
20 - V(PPM)
21 - CR(PPM)
22 - NI(PPM)
23 - RB(PPM)
24 - SR(PPM)
25 - Y(PPM)
26 - ZR(PPM)
27 - NB(PPM)
28 - BA(PPM)
29 - LA(PPM)
30 - CE(PPM)
31 - PR(PPM)
32 - ND(PPM)
33 - SM(PPM)
34 - EU(PPM)
35 - GD(PPM)
36 - TB(PPM)
37 - DY(PPM)
38 - HO(PPM)
39 - ER(PPM)
40 - TM(PPM)
41 - YB(PPM)
42 - LU(PPM)
43 - HF(PPM)
44 - TA(PPM)
45 - PB(PPM)
46 - TH(PPM)
47 - U(PPM)
--------------------
Select the data range you want to process.
Input format:
Format 1: "[**, **]; **; [**, **]", such as "[1, 3]; 7; [10, 13]" --> you want to deal with the columns 1, 2, 3, 7, 10, 11, 12, 13
Format 2: "xx", such as "7" --> you want to deal with the columns 7)
Two options are offered. For PCA, the Format 1 method is more useful in multiple dimensional reduction. As a tutorial, we input **[10, 15]** as an example.
Note: [start_col_num, end_col_num]
The selected feature information would be given
The Selected Data Set:
AL2O3(WT%) CR2O3(WT%) FEOT(WT%) CAO(WT%) MGO(WT%) MNO(WT%)
0 3.936000 1.440 3.097000 18.546000 18.478000 0.083000
1 3.040000 0.578 3.200000 20.235000 17.277000 0.150000
2 7.016561 NaN 3.172049 20.092611 15.261175 0.102185
3 3.110977 NaN 2.413834 22.083843 17.349203 0.078300
4 6.971044 NaN 2.995074 20.530008 15.562149 0.096700
.. ... ... ... ... ... ...
104 2.740000 0.060 4.520000 23.530000 14.960000 0.060000
105 5.700000 0.690 2.750000 20.120000 16.470000 0.120000
106 0.230000 2.910 2.520000 19.700000 18.000000 0.130000
107 2.580000 0.750 2.300000 22.100000 16.690000 0.050000
108 6.490000 0.800 2.620000 20.560000 14.600000 0.070000
[109 rows x 6 columns]
After continuing with any key, basic information of selected data would be shown
Basic Statistical Information:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 109 entries, 0 to 108
Data columns (total 6 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 AL2O3(WT%) 109 non-null float64
1 CR2O3(WT%) 98 non-null float64
2 FEOT(WT%) 109 non-null float64
3 CAO(WT%) 109 non-null float64
4 MGO(WT%) 109 non-null float64
5 MNO(WT%) 109 non-null float64
dtypes: float64(6)
memory usage: 5.2 KB
None
Some basic statistic information of the designated data set:
AL2O3(WT%) CR2O3(WT%) FEOT(WT%) CAO(WT%) MGO(WT%) MNO(WT%)
count 109.000000 98.000000 109.000000 109.000000 109.000000 109.000000
mean 4.554212 0.956426 2.962310 21.115756 16.178044 0.092087
std 1.969756 0.553647 1.133967 1.964380 1.432886 0.054002
min 0.230000 0.000000 1.371100 13.170000 12.170000 0.000000
25% 3.110977 0.662500 2.350000 20.310000 15.300000 0.063075
50% 4.720000 0.925000 2.690000 21.223500 15.920000 0.090000
75% 6.233341 1.243656 3.330000 22.185450 16.816000 0.110000
max 8.110000 3.869550 8.145000 25.362000 23.528382 0.400000
Successfully calculate the pair-wise correlation coefficient among the selected columns.
Save figure 'Correlation Plot' in C:\Users\74086\output\images\statistic.
Successfully draw the distribution plot of the selected columns.
Save figure 'Distribution Histogram' in C:\Users\74086\output\images\statistic.
Successfully draw the distribution plot after log transformation of the selected columns.
Save figure 'Distribution Histogram After Log Transformation' in C:\Users\74086\output\images\statistic.
(Press Enter key to move forward.)
NAN value process#
Check the NAN values would be helpful for later analysis. In geochemistrypi frame, this option is finished automatically.
-*-*- Imputation -*-*-
Check which column has null values:
--------------------
AL2O3(WT%) False
CR2O3(WT%) True
FEOT(WT%) False
CAO(WT%) False
MGO(WT%) False
MNO(WT%) False
dtype: bool
--------------------
The ratio of the null values in each column:
--------------------
CR2O3(WT%) 0.100917
AL2O3(WT%) 0.000000
FEOT(WT%) 0.000000
CAO(WT%) 0.000000
MGO(WT%) 0.000000
MNO(WT%) 0.000000
dtype: float64
--------------------
Several strategies are offered for processing the missing values, including:
-*-*- Strategy for Missing Values -*-*-
1 - Mean Value
2 - Median Value
3 - Most Frequent Value
4 - Constant(Specified Value)
Which strategy do you want to apply?
We choose the mean Value in this example and the input data be processed automatically as:
-*-*- Hypothesis Testing on Imputation Method -*-*-
Null Hypothesis: The distributions of the data set before and after imputing remain the same.
Thoughts: Check which column rejects null hypothesis.
Statistics Test Method: kruskal Test
Significance Level: 0.05
The number of iterations of Monte Carlo simulation: 100
The size of the sample for each iteration (half of the whole data set): 54
Average p-value:
AL2O3(WT%) 1.0
CR2O3(WT%) 0.9327453056346102
FEOT(WT%) 1.0
CAO(WT%) 1.0
MGO(WT%) 1.0
MNO(WT%) 1.0
Note: 'p-value < 0.05' means imputation method doesn't apply to that column.
The columns which rejects null hypothesis: None
Successfully draw the respective probability plot (origin vs. impute) of the selected columns
Save figure 'Probability Plot' in C:\Users\86188\geopi_output\GeoPi - Rock Classification\Xgboost Algorithm - Test 1\artifacts\image\statistic.
Successfully store 'Probability Plot' in 'Probability Plot.xlsx' in C:\Users\86188\geopi_output\GeoPi - Rock Classification\Xgboost Algorithm - Test 1\artifacts\image\statistic.
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 109 entries, 0 to 108
Data columns (total 6 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 AL2O3(WT%) 109 non-null float64
1 CR2O3(WT%) 109 non-null float64
2 FEOT(WT%) 109 non-null float64
3 CAO(WT%) 109 non-null float64
4 MGO(WT%) 109 non-null float64
5 MNO(WT%) 109 non-null float64
dtypes: float64(6)
memory usage: 5.2 KB
None
Some basic statistic information of the designated data set:
AL2O3(WT%) CR2O3(WT%) FEOT(WT%) CAO(WT%) MGO(WT%) MNO(WT%)
count 109.000000 109.000000 109.000000 109.000000 109.000000 109.000000
mean 4.554212 0.956426 2.962310 21.115756 16.178044 0.092087
std 1.969756 0.524695 1.133967 1.964380 1.432886 0.054002
min 0.230000 0.000000 1.371100 13.170000 12.170000 0.000000
25% 3.110977 0.680000 2.350000 20.310000 15.300000 0.063075
50% 4.720000 0.956426 2.690000 21.223500 15.920000 0.090000
75% 6.233341 1.170000 3.330000 22.185450 16.816000 0.110000
max 8.110000 3.869550 8.145000 25.362000 23.528382 0.400000
Successfully store 'Data Selected Imputed' in 'Data Selected Imputed.xlsx' in C:\Users\86188\geopi_output\GeoPi - Rock Classification\Xgboost Algorithm - Test 1\artifacts\data.
(Press Enter key to move forward.)
Feature engineering#
The next step is the feature engineering options.
-*-*- Feature Engineering -*-*-
The Selected Data Set:
--------------------
Index - Column Name
1 - AL2O3(WT%)
2 - CR2O3(WT%)
3 - FEOT(WT%)
4 - CAO(WT%)
5 - MGO(WT%)
6 - MNO(WT%)
--------------------
Feature Engineering Option:
1 - Yes
2 - No
Feature engineering options are essential for data analysis. We choose Yes and gain
-*-*- Feature Engineering -*-*-
The Selected Data Set:
--------------------
Index - Column Name
1 - AL2O3(WT%)
2 - CR2O3(WT%)
3 - FEOT(WT%)
4 - CAO(WT%)
5 - MGO(WT%)
6 - MNO(WT%)
--------------------
Feature Engineering Option:
1 - Yes
2 - No
(Data) ➜ @Number: 1
Selected data set:
a - AL2O3(WT%)
b - CR2O3(WT%)
c - FEOT(WT%)
d - CAO(WT%)
e - MGO(WT%)
f - MNO(WT%)
Name the constructed feature (column name), like 'NEW-COMPOUND':
@input: new Feature
Build up new feature with the combination of basic arithmatic operators, including '+', '-', '*', '/', '()'.
Input example 1: a * b - c
--> Step 1: Multiply a column with b column;
--> Step 2: Subtract c from the result of Step 1;
Input example 2: (d + 5 * f) / g
--> Step 1: Multiply 5 with f;
--> Step 2: Plus d column with the result of Step 1;
--> Step 3: Divide the result of Step 1 by g;
Input example 3: pow(a, b) + c * d
--> Step 1: Raise the base a to the power of the exponent b;
--> Step 2: Multiply the value of c by the value of d;
--> Step 3: Add the result of Step 1 to the result of Step 2;
Input example 4: log(a)/b - c
--> Step 1: Take the logarithm of the value a;
--> Step 2: Divide the result of Step 1 by the value of b;
--> Step 3: Subtract the value of c from the result of Step 2;
You can use mean(x) to calculate the average value.
@input:
Considering actual need for constructing several new geochemical indexes. We can set up some new indexes. Here, we would set up a new index by AL2O3/CAO via keyboard options with a/d.
Do you want to continue to construct a new feature?
1 - Yes
2 - No
(Data) ➜ @Number: 2
Successfully store 'Data Before Splitting' in 'Data Before Splitting.xlsx' in C:\Users\74086\output\data.
Exit Feature Engineering Mode.
PCA#
Then we can start PCA by selecting Dimensional Reduction and Principal Component Analysis. The kept component number is a hyper-parameter needs to be decided and here we propose the number is 3. Some PCA information is shown on the window.
-*-*- Hyper-parameters Specification -*-*-
Decide the component numbers to keep:
(Model) ➜ @Number: 3
*-**-* PCA is running ... *-**-*
Expected Functionality:
+ Model Persistence
+ Principal Components
+ Explained Variance Ratio
+ Compositional Bi-plot
+ Compositional Tri-plot
-----* Principal Components *-----
Every column represents one principal component respectively.
Every row represents how much that row feature contributes to each principal component respectively.
The tabular data looks like in format: 'rows x columns = 'features x principal components'.
PC1 PC2 PC3
AL2O3(WT%) -0.742029 -0.439057 -0.085773
CR2O3(WT%) -0.007037 0.082531 -0.213232
FEOT(WT%) -0.173824 0.219858 0.937257
CAO(WT%) 0.624609 -0.620584 0.200722
MGO(WT%) 0.165265 0.605489 -0.168090
MNO(WT%) -0.003397 0.011160 0.012315
-0.040834 -0.014650 -0.005382
-----* Explained Variance Ratio *-----
[0.46679568 0.38306839 0.09102234]
-----* 2 Dimensions Data Selection *-----
The software is going to draw related 2d graphs.
Currently, the data dimension is beyond 2 dimensions.
Please choose 2 dimensions of the data below.
1 - PC1
2 - PC2
3 - PC3
Choose dimension - 1 data:
By inputting different component numbers, results of PCA are obtained automatically.