geochemistrypi.data_mining.data package

Submodules

geochemistrypi.data_mining.data.data_readiness module

basic_info(data: DataFrame) -> None

Show the basic information of the data set.

Parameters:

data (pd.DataFrame) – The data set to be shown.

bool_input(prefix: str | None = None) -> bool

Get a boolean answer from the user.

Parameters:

prefix (str, default=None) – Indicates which section of the UML the user is currently in; it is shown on the command-line console.

Returns:

A boolean value.

Return type:

bool

create_sub_data_set(data: DataFrame) -> DataFrame

Create a sub data set.

Parameters:

data (pd.DataFrame) – The data set to be processed.

Returns:

The sub data set.

Return type:

pd.DataFrame

data_split(X: DataFrame, y: DataFrame | Series, test_size: float = 0.2) -> Dict

Split arrays or matrices into random train and test subsets.

Parameters:
  • X (pd.DataFrame) – The data to be split.

  • y (pd.DataFrame or pd.Series) – The target variable to be split.

  • test_size (float, default=0.2) – Represents the proportion of the dataset to include in the test split.

Returns:

A dictionary containing the split data.

Return type:

dict
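The description of data_split matches the behaviour of scikit-learn's train_test_split. The following is a minimal sketch of that idea, not the library's implementation: the helper name split_to_dict, the dictionary keys and the fixed random_state are all assumptions made for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split


def split_to_dict(X: pd.DataFrame, y: pd.Series, test_size: float = 0.2) -> dict:
    """Split X and y at random and collect the four pieces in a dictionary."""
    # random_state is fixed here only to make the sketch reproducible.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=42
    )
    # Hypothetical key names; data_split may use different ones.
    return {"X_train": X_train, "X_test": X_test, "y_train": y_train, "y_test": y_test}


X = pd.DataFrame({"SiO2": range(10), "Al2O3": range(10, 20)})
y = pd.Series(range(10), name="target")
parts = split_to_dict(X, y, test_size=0.2)
print({name: len(part) for name, part in parts.items()})
```

With ten samples and test_size=0.2, two rows go to the test split and eight to the training split.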

float_input(default: float, prefix: str | None = None, slogan: str | None = '@Number: ') -> float

Get the number of the desired option.

Parameters:
  • default (float) – The value assigned to option if the user does not enter anything.

  • prefix (str, default=None) – Indicates which section of the UML the user is currently in; it is shown on the command-line console.

  • slogan (str, default="@Number: ") – Acts like the first parameter of Python's input function, i.e. the prompt shown to the user.

Returns:

option – An option number.

Return type:

float or int

limit_num_input(option_list: List[str], prefix: str, input_func: num_input) -> int

Limit the option number to the range of the options provided.

Parameters:
  • option_list (List[str]) – The list of all available options.

  • prefix (str) – Indicates which section of the UML the user is currently in; it is shown on the command-line console.

  • input_func (function) – The input function used to read the option, e.g. num_input.

Returns:

option – An option number. Note that option = real index + 1.

Return type:

int
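The relation "option = real index + 1" is easy to get wrong, so here is a minimal sketch of a bounded, 1-based option prompt. The function read_option and its reader parameter are hypothetical and exist only to make the example testable without keyboard input; they are not part of this module.

```python
from typing import Callable, List


def read_option(option_list: List[str], reader: Callable[[str], str] = input) -> int:
    """Keep prompting until the answer is an integer in 1..len(option_list)."""
    while True:
        raw = reader("@Number: ").strip()
        if raw.isdigit() and 1 <= int(raw) <= len(option_list):
            return int(raw)  # 1-based: option = real index + 1
        print(f"Please enter a number between 1 and {len(option_list)}.")


options = ["Regression", "Classification", "Clustering"]
answers = iter(["0", "abc", "2"])  # scripted answers instead of real keyboard input
option = read_option(options, reader=lambda prompt: next(answers))
chosen = options[option - 1]  # subtract 1 to get the list index
print(option, chosen)
```

The first two scripted answers are rejected as out of range and non-numeric; the third is accepted, and subtracting 1 recovers the list index.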

np2pd(array: ndarray, columns_name: List[str]) -> DataFrame

Convert a NumPy array to a pandas DataFrame.

Parameters:
  • array (np.ndarray) – The numpy array to be converted.

  • columns_name (List[str]) – The column names of the dataframe.

Returns:

The converted dataframe.

Return type:

pd.DataFrame
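The conversion described here can be sketched with the pandas DataFrame constructor. This shows the general technique only; the column names are made-up sample values, and np2pd itself may do more than this.

```python
import numpy as np
import pandas as pd

array = np.array([[1.0, 2.0], [3.0, 4.0]])
columns_name = ["SiO2", "Al2O3"]  # sample column names for illustration

# One column name per array column; rows get a default integer index.
df = pd.DataFrame(array, columns=columns_name)
print(df)
```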

num2option(items: List[str]) -> None

List all the options in numbered order.

Parameters:

items (list) – A series of items to be enumerated.
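A numbered listing of this kind can be sketched with enumerate. The "number - item" layout and the helper name format_options are assumptions; the exact text printed by num2option may differ.

```python
from typing import List


def format_options(items: List[str]) -> List[str]:
    """Return one 'number - item' line per item, numbering from 1."""
    return [f"{i} - {item}" for i, item in enumerate(items, start=1)]


lines = format_options(["Regression", "Classification"])
print("\n".join(lines))
```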

num_input(prefix: str | None = None, slogan: str | None = '@Number: ') -> int

Get the number of the desired option.

Parameters:
  • prefix (str, default=None) – Indicates which section of the UML the user is currently in; it is shown on the command-line console.

  • slogan (str, default="@Number: ") – Acts like the first parameter of Python's input function, i.e. the prompt shown to the user.

Returns:

option – An option number. Note that option = real index + 1.

Return type:

int

read_data(file_path: str | None = None, is_own_data: int = 2, prefix: str | None = None, slogan: str | None = '@File: ')

Read the data set.

Parameters:
  • file_path (str, optional) – The path of the data set, by default None.

  • is_own_data (int, default=2) – 1: own data set; 2: built-in data set.

  • prefix (str, optional) – The prefix of the data set, by default None.

  • slogan (str, optional) – The slogan of the data set, by default "@File: ".

Returns:

The data set that was read.

Return type:

pd.DataFrame

select_columns(columns_range: str | None = None) -> List[int]

Select the columns of the data set.

Parameters:

columns_range (str, default=None) – The columns range of the data set.

Returns:

The columns selected.

Return type:

list
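The accepted syntax of columns_range is not specified here. As an illustration only, the sketch below assumes a hypothetical syntax in which entries are separated by semicolons, "[a, b]" denotes an inclusive range of 1-based column numbers, and a bare number denotes a single column. Returning zero-based indices is also an assumption.

```python
import re
from typing import List


def parse_columns_range(columns_range: str) -> List[int]:
    """Parse e.g. "[1, 3]; 5" into zero-based column indices."""
    selected = []
    for part in columns_range.split(";"):
        part = part.strip()
        if not part:
            continue
        match = re.fullmatch(r"\[\s*(\d+)\s*,\s*(\d+)\s*\]", part)
        if match:
            start, end = int(match.group(1)), int(match.group(2))
            selected.extend(range(start - 1, end))  # inclusive 1-based range
        elif part.isdigit():
            selected.append(int(part) - 1)  # single 1-based column number
        else:
            raise ValueError(f"Unrecognised column specifier: {part!r}")
    return sorted(set(selected))


print(parse_columns_range("[1, 3]; 5"))  # [0, 1, 2, 4]
```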

show_data_columns(columns_name: Index, columns_index: List | None = None) -> None

Show the column names of the data set.

Parameters:
  • columns_name (pd.Index) – The column names of the data set.

  • columns_index (list, default=None) – The column index of the data set.

str_input(option_list: List[str], prefix: str | None = None) -> str

Get the string of the desired option.

Parameters:
  • option_list (list) – The list of all available options.

  • prefix (str, default=None) – Indicates which section of the UML the user is currently in; it is shown on the command-line console.

Returns:

option – A string of the desired option.

Return type:

str

tuple_input(default: Tuple[int], prefix: str | None = None, slogan: str | None = None) -> Tuple[int]

Get the tuple of the desired option.

Parameters:
  • default (Tuple[int]) – The value assigned to option if the user does not enter anything.

  • prefix (str, default=None) – Indicates which section of the UML the user is currently in; it is shown on the command-line console.

  • slogan (str, default=None) – Acts like the first parameter of Python's input function, i.e. the prompt shown to the user.

Returns:

option – A numeric tuple.

Return type:

tuple

geochemistrypi.data_mining.data.feature_engineering module

class FeatureConstructor(data: DataFrame)

Bases: object

Construct a new feature based on the existing data set.

append_feature(new_feature_column: Series) -> None

Append the new feature to the original data.

batch_build(feature_engineering_config: Dict) -> None

build() -> None

Build the new feature.

cal_words = ['pow', 'sin', 'cos', 'tan', 'pi', 'mean', 'std', 'var', 'log']

index2name() -> None

Show the index of columns in the data set. The display pattern is [letter : column name], e.g. a : 1st column name; b : 2nd column name.

input_expression() -> None

Input the expression of the constructed feature.

input_feature_name() -> None

Name the constructed feature (column name), like 'NEW-COMPOUND'.

letter_map() -> None

Map the letter to the column name.

oper = '+-*/^(),.'
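The letter-to-column pattern used by FeatureConstructor can be sketched as follows. This is a simplified illustration under stated assumptions, not the class's actual logic: the sample data, the expression and the use of eval are all choices made for this example, and only two of the names listed in cal_words are wired up.

```python
import string

import numpy as np
import pandas as pd

data = pd.DataFrame({"SiO2": [50.0, 60.0], "MgO": [10.0, 5.0]})

# a -> first column, b -> second column, and so on.
letter_to_column = dict(zip(string.ascii_lowercase, data.columns))

# Names the expression may use: one Series per letter plus a few functions.
namespace = {letter: data[col] for letter, col in letter_to_column.items()}
namespace.update({"log": np.log, "pow": np.power})

expression = "a / (a + b)"
# eval keeps the sketch short; builtins are disabled, but this is still
# not safe for untrusted input.
new_feature = eval(expression, {"__builtins__": {}}, namespace)
new_feature.name = "NEW-COMPOUND"

# Append the constructed feature as a new column.
data = pd.concat([data, new_feature], axis=1)
print(data)
```

A production implementation would normally validate the expression against an allow-list of operators and names, such as oper and cal_words, before evaluating it.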

geochemistrypi.data_mining.data.imputation module

imputer(data: DataFrame, method: str) -> tuple[dict, ndarray]

Apply imputation on missing values.

Parameters:
  • data (pd.DataFrame) – The dataset with missing values.

  • method (str) – The imputation method.

Returns:

  • imputation_config (dict) – The imputation configuration.

  • data_imputed (np.ndarray) – The dataset after imputing.
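The two return values, a configuration dictionary and a NumPy array, can be illustrated with scikit-learn's SimpleImputer. This is a sketch only: the method name "mean" and the layout of imputation_config are assumptions, and the imputer function may support other methods and record its configuration differently.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

data = pd.DataFrame({"SiO2": [50.0, np.nan, 70.0], "MgO": [10.0, 5.0, np.nan]})

# Hypothetical method name and configuration layout.
method = "mean"
imputation_config = {"SimpleImputer": {"strategy": method}}

imp = SimpleImputer(missing_values=np.nan, strategy=method)
data_imputed = imp.fit_transform(data)  # a NumPy array, not a DataFrame

print(data_imputed)
```

Each missing value is replaced by the mean of the observed values in its column, here 60.0 for SiO2 and 7.5 for MgO.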

geochemistrypi.data_mining.data.inference module

class PipelineConstrutor

Bases: object

Construct a sklearn pipeline from a dictionary of transformers.

chain(transformer_config: Dict) -> object

Chain transformers together into a sklearn pipeline.

Parameters:

transformer_config (Dict) – A dictionary of transformers and their parameters.

Returns:

A sklearn pipeline.

Return type:

object

property transformer_dict: Dict

A dictionary of transformers. It needs to be updated whenever a new transformer is added to the customized automated ML pipeline.
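The idea behind chain and transformer_dict, looking up each configured transformer by name and joining the instances into a scikit-learn Pipeline, can be sketched as below. The registry contents, the shape of the configuration dictionary and the standalone chain function are assumptions for illustration; the class may organise this differently.

```python
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical registry mapping a transformer name to its class.
TRANSFORMERS = {
    "SimpleImputer": SimpleImputer,
    "StandardScaler": StandardScaler,
    "MinMaxScaler": MinMaxScaler,
}


def chain(transformer_config: dict) -> Pipeline:
    """Instantiate each named transformer with its parameters, in order."""
    steps = [
        (name, TRANSFORMERS[name](**params))
        for name, params in transformer_config.items()
    ]
    return Pipeline(steps)


config = {"SimpleImputer": {"strategy": "mean"}, "StandardScaler": {}}
pipeline = chain(config)
print([name for name, _ in pipeline.steps])
```

Because dictionaries preserve insertion order, the order of the keys in the configuration determines the order of the pipeline steps. A registry like this is also why the lookup table has to be extended whenever a new transformer is introduced.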

build_transform_pipeline(imputation_config: Dict, feature_scaling_config: Dict, feature_selection_config: Dict, run: object, X_train: DataFrame, y_train: DataFrame) -> Tuple[Dict, object]

Build the transform pipeline.

Parameters:
  • imputation_config (Dict) – The imputation configuration.

  • feature_scaling_config (Dict) – The feature scaling configuration.

  • feature_selection_config (Dict) – The feature selection configuration.

  • run (object) – The model selection object.

  • X_train (pd.DataFrame) – The training data.

  • y_train (pd.DataFrame) – The training target values.

Returns:

The transform pipeline configuration and the transform pipeline object.

Return type:

Tuple[Dict, object]

model_inference(inference_data: DataFrame, is_inference: bool, feature_engineering_config: Dict, run: object, transformer_config: Dict, transform_pipeline: object | None = None)

Run the model inference.

Parameters:
  • inference_data (pd.DataFrame) – The inference data.

  • is_inference (bool) – Whether to run the model inference.

  • feature_engineering_config (Dict) – The feature engineering configuration.

  • run (object) – The model selection object.

  • transformer_config (Dict) – The transformer configuration.

  • transform_pipeline (Optional[object], optional) – The transform pipeline object. The default is None.

geochemistrypi.data_mining.data.preprocessing module

feature_scaler(X: DataFrame, method: List[str], method_idx: int) -> tuple[dict, ndarray]

Apply feature scaling methods.

Parameters:
  • X (pd.DataFrame) – The dataset.

  • method (List[str]) – The feature scaling methods.

  • method_idx (int) – The index of the method to apply.

Returns:

  • feature_scaling_config (dict) – The feature scaling configuration.

  • X_scaled (np.ndarray) – The dataset after scaling.
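How a list of method names and an index can select one scaler is sketched below with scikit-learn. The method names and the lookup table are hypothetical; the names accepted by feature_scaler may differ.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = pd.DataFrame({"SiO2": [50.0, 60.0, 70.0], "MgO": [10.0, 5.0, 0.0]})

# Hypothetical method names; method_idx picks one of them.
method = ["Min-max Scaling", "Standardization"]
method_idx = 0

scalers = {"Min-max Scaling": MinMaxScaler, "Standardization": StandardScaler}
scaler = scalers[method[method_idx]]()
X_scaled = scaler.fit_transform(X)  # NumPy array with one column per feature
print(X_scaled)
```

With min-max scaling, every column is mapped linearly onto the interval [0, 1].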

feature_selector(X: DataFrame, y: DataFrame, feature_selection_task: int, method: List[str], method_idx: int) -> tuple[dict, DataFrame]

Apply feature selection methods.

Parameters:
  • X (pd.DataFrame) – The feature dataset.

  • y (pd.DataFrame) – The label dataset.

  • feature_selection_task (int) – Feature selection for regression or classification tasks.

  • method (List[str]) – The feature selection methods.

  • method_idx (int) – The index of methods.

Returns:

  • feature_selection_config (dict) – The feature selection configuration.

  • X_selected (pd.DataFrame) – The feature dataset after selecting.
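As an illustration of univariate feature selection for a regression task, the sketch below uses scikit-learn's SelectKBest with f_regression on synthetic data. The choice of selector, the score function and k=1 are assumptions; feature_selector may offer different methods, and a classification task would call for a different score function.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_regression

rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(100, 3)), columns=["SiO2", "MgO", "CaO"])
# Synthetic target that depends almost entirely on the first column.
y = 3.0 * X["SiO2"] + rng.normal(scale=0.1, size=100)

selector = SelectKBest(score_func=f_regression, k=1)
selector.fit(X, y)

# Keep the selected columns as a DataFrame so the column names survive.
X_selected = X.loc[:, selector.get_support()]
print(list(X_selected.columns))
```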

geochemistrypi.data_mining.data.statistic module

monte_carlo_simulator(df_orig: DataFrame, df_impute: DataFrame, sample_size: int, iteration: int, test: str, confidence: float = 0.05) -> None

Check which columns reject the null hypothesis (p-value < significance level) to find out whether the imputation changes the distribution of the original data set.

Parameters:
  • df_orig (pd.DataFrame (n_samples, n_components)) – The original dataset with missing values.

  • df_impute (pd.DataFrame (n_samples, n_components)) – The dataset after imputation.

  • sample_size (int) – The size of the sample for each iteration.

  • iteration (int) – The number of iterations of Monte Carlo simulation.

  • test (str) – The statistics test method used.

  • confidence (float, default=0.05) – The significance level that the p-values are compared against.

test_once(df_orig: DataFrame, df_impute: DataFrame, test: str) -> ndarray

Run a non-parametric hypothesis test once on each pair of corresponding columns. Null hypothesis: the distribution of the data set is the same before and after imputation.

Parameters:
  • df_orig (pd.DataFrame (n_samples, n_components)) – The original dataset with missing values.

  • df_impute (pd.DataFrame (n_samples, n_components)) – The dataset after imputation.

  • test (str) – The statistics test method used.

Returns:

pvals – A numpy array containing the p-values of the tests on each column in the column order

Return type:

np.ndarray
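The two functions in this module can be pictured as a per-column test wrapped in a resampling loop. The sketch below uses the two-sample Kolmogorov-Smirnov test from SciPy as the non-parametric test and the median p-value as the decision rule. Both are assumptions, as are the synthetic data and the helper name ks_pvalues; the tests the module supports and the way it aggregates p-values may differ.

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(0)
df_orig = pd.DataFrame(rng.normal(size=(200, 2)), columns=["SiO2", "MgO"])
df_orig.iloc[::10, 0] = np.nan  # knock out some values in the first column
df_impute = df_orig.fillna(df_orig.mean())  # mean imputation as a stand-in


def ks_pvalues(orig: pd.DataFrame, imputed: pd.DataFrame) -> np.ndarray:
    """One two-sample Kolmogorov-Smirnov p-value per column, in column order."""
    return np.array(
        [stats.ks_2samp(orig[c].dropna(), imputed[c]).pvalue for c in orig.columns]
    )


iteration, sample_size, alpha = 50, 100, 0.05
pvals = np.empty((iteration, df_orig.shape[1]))
for i in range(iteration):
    # Draw the same random rows from both data sets in each iteration.
    rows = rng.choice(len(df_orig), size=sample_size, replace=False)
    pvals[i] = ks_pvalues(df_orig.iloc[rows], df_impute.iloc[rows])

# Flag a column when its median p-value falls below the significance level.
rejected = df_orig.columns[np.median(pvals, axis=0) < alpha]
print(list(rejected))
```

A column that had no missing values is identical before and after imputation, so it is never flagged.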

Module contents