geochemistrypi.data_mining.data package#

Submodules#

geochemistrypi.data_mining.data.data_readiness module#

basic_info(data: DataFrame) → None[source]#

Show the basic information of the data set.

Parameters:: data (pd.DataFrame) – The data set to be shown.

bool_input(prefix: str | None = None) → bool[source]#

Get the boolean value of the desired option.

Parameters:: prefix (str, default=None) – It indicates which section the user currently is in on the UML, which is shown on the command-line console.
Returns:: A boolean value.
Return type:: bool

create_sub_data_set(data: DataFrame) → DataFrame[source]#

Create a sub data set.

Parameters:: data (pd.DataFrame) – The data set to be processed.
Returns:: The sub data set.
Return type:: pd.DataFrame

data_split(X: DataFrame, y: DataFrame | Series, test_size: float = 0.2) → Dict[source]#

Split arrays or matrices into random train and test subsets.

Parameters:

X (pd.DataFrame) – The data to be split.
y (pd.DataFrame or pd.Series) – The target variable to be split.
test_size (float, default=0.2) – Represents the proportion of the dataset to include in the test split.

Returns:

A dictionary containing the split data.

Return type:

dict
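
The summary above matches scikit-learn's train_test_split, so a split of this kind can be sketched as follows. This is a minimal illustration, not the package's implementation: the helper name split_to_dict, the dictionary keys, and the fixed random_state are all assumptions.

```python
import pandas as pd
from sklearn.model_selection import train_test_split


def split_to_dict(X: pd.DataFrame, y: pd.DataFrame, test_size: float = 0.2) -> dict:
    """Hypothetical stand-in for data_split; the key names are assumptions."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=42  # fixed seed only for repeatability
    )
    return {"X_train": X_train, "X_test": X_test, "y_train": y_train, "y_test": y_test}


X = pd.DataFrame({"SiO2": range(10), "MgO": range(10, 20)})
y = pd.DataFrame({"label": [0, 1] * 5})
parts = split_to_dict(X, y)

# Print the shape of each subset to confirm the 80/20 proportion.
print({name: part.shape for name, part in parts.items()})
```

Rows of X and y stay aligned because both are split with the same shuffled indices.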

float_input(default: float, prefix: str | None = None, slogan: str | None = '@Number: ') → float[source]#

Get the number of the desired option.

Parameters:

default (float) – If the user does not enter anything, it is assigned to option.
prefix (str, default=None) – It indicates which section the user currently is in on the UML, which is shown on the command-line console.
slogan (str, default="@Number: ") – It acts like the first argument of Python's input function: the prompt shown to the user.

Returns:

option – An option number.

Return type:

float or int
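
The fallback described for default can be sketched as below. The helper parse_float is hypothetical and takes the raw text as an argument so that it can be run without a console; in interactive use the text would come from input(slogan).

```python
def parse_float(raw: str, default: float) -> float:
    """Return `default` when the user typed nothing, else the parsed number."""
    text = raw.strip()
    if not text:
        # Empty input: fall back to the default value.
        return default
    return float(text)


empty_choice = parse_float("", default=0.2)
typed_choice = parse_float(" 0.35 ", default=0.2)
print(empty_choice, typed_choice)
```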

limit_num_input(option_list: List[str], prefix: str, input_func: num_input) → int[source]#

Limit the scope of the option.

Parameters:

option_list (List[str]) – All the options provided are stored in a list.
prefix (str) – It indicates which section the user currently is in on the UML, which is shown on the command-line console.
input_func (function) – The input function used to read the option number, e.g. num_input.

Returns:

option – An option number. Be careful that ‘option = real index + 1’

Return type:

int

np2pd(array: ndarray, columns_name: List[str]) → DataFrame[source]#

Convert numpy array to pandas dataframe.

Parameters:

array (np.ndarray) – The numpy array to be converted.
columns_name (List[str]) – The column names of the dataframe.

Returns:

The converted dataframe.

Return type:

pd.DataFrame
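
The conversion presumably amounts to a call to the pandas DataFrame constructor. The sketch below shows that call directly; the sample values and column names are made up.

```python
import numpy as np
import pandas as pd

array = np.array([[1.0, 2.0], [3.0, 4.0]])
columns_name = ["SiO2", "Al2O3"]

# One column name per array column; the row index defaults to 0..n-1.
frame = pd.DataFrame(array, columns=columns_name)
print(frame)
```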

num2option(items: List[str]) → None[source]#

List all the options serially.

Parameters:: items (list) – A series of items to be enumerated.

num_input(prefix: str | None = None, slogan: str | None = '@Number: ') → int[source]#

Get the number of the desired option.

Parameters:

prefix (str, default=None) – It indicates which section the user currently is in on the UML, which is shown on the command-line console.
slogan (str, default="@Number: ") – It acts like the first argument of Python's input function: the prompt shown to the user.

Returns:

option – An option number. Be careful that ‘option = real index + 1’

Return type:

int
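
The relation 'option = real index + 1' can be illustrated with a short sketch. Both helpers below are hypothetical, and the "number - item" display format is an assumption rather than the package's actual console output.

```python
from typing import List


def format_options(items: List[str]) -> List[str]:
    """Number the items from 1, the way a console menu would list them."""
    return [f"{number} - {item}" for number, item in enumerate(items, start=1)]


def option_to_index(option: int, option_list: List[str]) -> int:
    """Convert a 1-based option number into a 0-based list index."""
    if not 1 <= option <= len(option_list):
        raise ValueError(f"option must be between 1 and {len(option_list)}")
    return option - 1


methods = ["Regression", "Classification", "Clustering"]
menu = format_options(methods)
chosen = methods[option_to_index(2, methods)]
print(menu, chosen)
```

Rejecting values outside 1..len(option_list) is the kind of scope check that limit_num_input is described as performing.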

read_data(file_path: str | None = None, is_own_data: int = 2, prefix: str | None = None, slogan: str | None = '@File: ')[source]#

Read the data set.

Parameters:

file_path (str, optional) – The path of the data set, by default None
is_own_data (int, default=2) – 1: own data set; 2: built-in data set
prefix (str, optional) – The prefix of the data set, by default None
slogan (str, optional) – The slogan of the data set, by default "@File: "

Returns:

The data set read

Return type:

pd.DataFrame

select_columns(columns_range: str | None = None) → List[int][source]#

Select the columns of the data set.

Parameters:: columns_range (str, default=None) – The columns range of the data set.
Returns:: The columns selected.
Return type:: list

show_data_columns(columns_name: Index, columns_index: List | None = None) → None[source]#

Show the column names of the data set.

Parameters:

columns_name (pd.Index) – The column names of the data set.
columns_index (list, default=None) – The column index of the data set.

str_input(option_list: List[str], prefix: str | None = None) → str[source]#

Get the string of the desired option.

Parameters:

option_list (list) – All the options provided are stored in a list.
prefix (str, default=None) – It indicates which section the user currently is in on the UML, which is shown on the command-line console.

Returns:

option – A string of the desired option.

Return type:

str

tuple_input(default: Tuple[int], prefix: str | None = None, slogan: str | None = None) → Tuple[int][source]#

Get the tuple of the desired option.

Parameters:

default (Tuple[int]) – If the user does not enter anything, it is assigned to option.
prefix (str, default=None) – It indicates which section the user currently is in on the UML, which is shown on the command-line console.
slogan (str, default=None) – It acts like the first argument of Python's input function: the prompt shown to the user.

Returns:

option – A numeric tuple.

Return type:

tuple

geochemistrypi.data_mining.data.feature_engineering module#

class FeatureConstructor(data: DataFrame)[source]#

Bases: object

Construct a new feature based on the existing data set.

append_feature(new_feature_column: Series) → None[source]#: Append the new feature to the original data.

batch_build(feature_engineering_config: Dict) → None[source]#

build() → None[source]#: Build the new feature.

cal_words = ['pow', 'sin', 'cos', 'tan', 'pi', 'mean', 'std', 'var', 'log']#

index2name() → None[source]#: Show the index of columns in the data set. The display pattern is [letter : column name], e.g. a : 1st column name; b : 2nd column name.

input_expression() → None[source]#: Input the expression of the constructed feature.

input_feature_name() → None[source]#: Name the constructed feature (column name), like ‘NEW-COMPOUND’.

letter_map() → None[source]#: Map the letter to the column name.

oper = '+-*/^(),.'#
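
The letter-to-column mapping and the expression evaluation can be sketched in plain Python. Everything here is an assumption made for illustration: the rows are plain dictionaries instead of a DataFrame, "^" is translated to Python's "**", and only the scalar entries of cal_words are supported (mean, std and var work on whole columns and are left out).

```python
import math
import string
from typing import Dict, List

columns = ["SiO2", "Al2O3", "CaO"]

# a -> 1st column, b -> 2nd column, and so on.
letter_map: Dict[str, str] = dict(zip(string.ascii_lowercase, columns))

rows: List[Dict[str, float]] = [
    {"SiO2": 50.0, "Al2O3": 15.0, "CaO": 10.0},
    {"SiO2": 60.0, "Al2O3": 20.0, "CaO": 10.0},
]

# Scalar functions and constants the expression may use.
ALLOWED = {"pow": pow, "sin": math.sin, "cos": math.cos,
           "tan": math.tan, "pi": math.pi, "log": math.log}


def evaluate(expression: str, row: Dict[str, float]) -> float:
    """Evaluate an expression such as '(a + b) / c' for one row."""
    names = dict(ALLOWED)
    names.update({letter: row[column] for letter, column in letter_map.items()})
    # Assumption: '^' means exponentiation, which Python writes as '**'.
    python_expression = expression.replace("^", "**")
    # Builtins are disabled so only the names above can be referenced.
    return eval(python_expression, {"__builtins__": {}}, names)


new_feature = [evaluate("(a + b) / c", row) for row in rows]
squared = [evaluate("c ^ 2", row) for row in rows]
print(letter_map, new_feature, squared)
```

Calling eval on user input is shown only to keep the sketch short; a stricter parser would be safer outside a trusted console session.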

geochemistrypi.data_mining.data.imputation module#

imputer(data: DataFrame, method: str) → tuple[dict, ndarray][source]#

Apply imputation on missing values.

Parameters:

data (pd.DataFrame) – The dataset with missing values.
method (str) – The imputation method.

Returns:

imputation_config (dict) – The imputation configuration.
data_imputed (np.ndarray) – The dataset after imputing.
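
A call with this shape of input and output can be sketched with scikit-learn's SimpleImputer. This is an illustration under assumptions: the helper impute_sketch is hypothetical, the layout of the configuration dictionary is a guess, and method is taken to be a SimpleImputer strategy name, which may differ from the strings the package accepts.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer


def impute_sketch(data: pd.DataFrame, method: str):
    """Fill missing values column by column and report what was used."""
    imp = SimpleImputer(missing_values=np.nan, strategy=method)
    data_imputed = imp.fit_transform(data)  # returns a NumPy array
    imputation_config = {type(imp).__name__: {"strategy": method}}
    return imputation_config, data_imputed


data = pd.DataFrame({"SiO2": [50.0, np.nan, 54.0], "MgO": [8.0, 6.0, np.nan]})
config, imputed = impute_sketch(data, "mean")
print(config)
print(imputed)
```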

geochemistrypi.data_mining.data.inference module#

class PipelineConstrutor[source]#

Bases: object

Construct a sklearn pipeline from a dictionary of transformers.

chain(transformer_config: Dict) → object[source]#

Chain transformers together into a sklearn pipeline.

Parameters:: transformer_config (Dict) – A dictionary of transformers and their parameters.
Returns:: A sklearn pipeline.
Return type:: object

property transformer_dict: Dict#: A dictionary of transformers. It needs to be updated whenever a new transformer is added to the customized automated ML pipeline.
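
Chaining transformers from a configuration dictionary can be sketched as follows. The registry TRANSFORMERS plays the role of transformer_dict, and the {name: parameters} layout of the configuration is an assumption; the package's own format may differ.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical registry mapping a transformer name to its class.
TRANSFORMERS = {
    "SimpleImputer": SimpleImputer,
    "StandardScaler": StandardScaler,
    "MinMaxScaler": MinMaxScaler,
}


def chain_sketch(transformer_config: dict) -> Pipeline:
    """Instantiate each configured transformer and chain them in order."""
    steps = [
        (name, TRANSFORMERS[name](**params))
        for name, params in transformer_config.items()
    ]
    return Pipeline(steps)


pipeline = chain_sketch({"SimpleImputer": {"strategy": "mean"}, "MinMaxScaler": {}})
X = np.array([[1.0, 10.0], [np.nan, 20.0], [3.0, 30.0]])
X_out = pipeline.fit_transform(X)
print(X_out)
```

Because dictionaries preserve insertion order, the steps run in the order they were configured: imputation first, then scaling.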

build_transform_pipeline(imputation_config: Dict, feature_scaling_config: Dict, feature_selection_config: Dict, run: object, X_train: DataFrame, y_train: DataFrame) → Tuple[Dict, object][source]#

Build the transform pipeline.

Parameters:

imputation_config (Dict) – The imputation configuration.
feature_scaling_config (Dict) – The feature scaling configuration.
feature_selection_config (Dict) – The feature selection configuration.
run (object) – The model selection object.
X_train (pd.DataFrame) – The training data.
y_train (pd.DataFrame) – The training target values.

Returns:

The transform pipeline configuration and the transform pipeline object.

Return type:

Tuple[Dict, object]

model_inference(inference_data: DataFrame, is_inference: bool, feature_engineering_config: Dict, run: object, transformer_config: Dict, transform_pipeline: object | None = None)[source]#

Run the model inference.

Parameters:

inference_data (pd.DataFrame) – The inference data.
is_inference (bool) – Whether to run the model inference.
feature_engineering_config (Dict) – The feature engineering configuration.
run (object) – The model selection object.
transformer_config (Dict) – The transformer configuration.
transform_pipeline (Optional[object], optional) – The transform pipeline object. The default is None.

geochemistrypi.data_mining.data.preprocessing module#

feature_scaler(X: DataFrame, method: List[str], method_idx: int) → tuple[dict, ndarray][source]#

Apply feature scaling methods.

Parameters:

X (pd.DataFrame) – The dataset.
method (List[str]) – The feature scaling methods.
method_idx (int) – The index of the chosen method in method.

Returns:

feature_scaling_config (dict) – The feature scaling configuration.
X_scaled (np.ndarray) – The dataset after scaling.

feature_selector(X: DataFrame, y: DataFrame, feature_selection_task: int, method: List[str], method_idx: int) → tuple[dict, DataFrame][source]#

Apply feature selection methods.

Parameters:

X (pd.DataFrame) – The feature dataset.
y (pd.DataFrame) – The label dataset.
feature_selection_task (int) – Feature selection for regression or classification tasks.
method (List[str]) – The feature selection methods.
method_idx (int) – The index of the chosen method in method.

Returns:

feature_selection_config (dict) – The feature selection configuration.
X_selected (pd.DataFrame) – The feature dataset after selecting.

geochemistrypi.data_mining.data.statistic module#

monte_carlo_simulator(df_orig: DataFrame, df_impute: DataFrame, sample_size: int, iteration: int, test: str, confidence: float = 0.05) → None[source]#

Check which columns reject the null hypothesis (p-value < significance level) to find out whether the imputation changes the distribution of the original data set.

Parameters:

df_orig (pd.DataFrame (n_samples, n_components)) – The original dataset with missing values.
df_impute (pd.DataFrame (n_samples, n_components)) – The dataset after imputation.
test (str) – The statistics test method used.
sample_size (int) – The size of the sample for each iteration.
iteration (int) – The number of iterations of Monte Carlo simulation.
confidence (float, default=0.05) – The significance level that the p-values are compared against.

test_once(df_orig: DataFrame, df_impute: DataFrame, test: str) → ndarray[source]#

Run a non-parametric hypothesis test once on each pair of corresponding columns. Null hypothesis: the distributions of the data set before and after imputing remain the same.

Parameters:

df_orig (pd.DataFrame (n_samples, n_components)) – The original dataset with missing values.
df_impute (pd.DataFrame (n_samples, n_components)) – The dataset after imputation.
test (str) – The statistics test method used.

Returns:

pvals – A numpy array containing the p-values of the tests on each column in the column order

Return type:

np.ndarray
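
A column-by-column comparison of this kind can be sketched with SciPy. The choice of the Kruskal-Wallis test is an assumption made for illustration (the test names the package accepts are not listed here), and the helper column_pvalues is hypothetical.

```python
import numpy as np
import pandas as pd
from scipy import stats


def column_pvalues(df_orig: pd.DataFrame, df_impute: pd.DataFrame) -> np.ndarray:
    """Compare each original column (missing values dropped) with its imputed twin."""
    pvals = []
    for column in df_orig.columns:
        observed = df_orig[column].dropna()
        _, pval = stats.kruskal(observed, df_impute[column])
        pvals.append(pval)
    return np.array(pvals)


values = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
with_gap = values[:2] + [np.nan] + values[3:]

df_orig = pd.DataFrame({"a": with_gap, "b": with_gap})
df_impute = pd.DataFrame({
    "a": values,                        # plausible imputation
    "b": [v + 100.0 for v in values],   # deliberately distorted column
})

pvals = column_pvalues(df_orig, df_impute)
rejected = df_orig.columns[pvals < 0.05].tolist()
print(pvals, rejected)
```

A Monte Carlo variant would repeat this on random samples of sample_size rows for iteration rounds and report the columns that reject the null hypothesis.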
