API
data_readiness
- basic_info(data: DataFrame) → None
Show the basic information of the data set.
- Parameters:
data (pd.DataFrame) – The data set to be shown.
- bool_input(prefix: str | None = None) → bool
Get a boolean response from the user.
- Parameters:
prefix (str, default=None) – Indicates which section of the UML the user is currently in; it is shown on the command-line console.
- Returns:
A boolean value.
- Return type:
bool
- create_sub_data_set(data: DataFrame) → DataFrame
Create a sub data set.
- Parameters:
data (pd.DataFrame) – The data set to be processed.
- Returns:
The sub data set.
- Return type:
pd.DataFrame
- data_split(X: DataFrame, y: DataFrame | Series, test_size: float = 0.2) → Dict
Split arrays or matrices into random train and test subsets.
- Parameters:
X (pd.DataFrame) – The data to be split.
y (pd.DataFrame or pd.Series) – The target variable to be split.
test_size (float, default=0.2) – Represents the proportion of the dataset to include in the test split.
- Returns:
A dictionary containing the split data.
- Return type:
dict
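The split described above can be sketched with scikit-learn's `train_test_split`. This is a minimal illustration rather than the library's own implementation: the function name `data_split_sketch`, the dictionary keys, the fixed `random_state`, and the column names are all assumptions made for the example.

```python
from typing import Dict, Union

import pandas as pd
from sklearn.model_selection import train_test_split


def data_split_sketch(
    X: pd.DataFrame,
    y: Union[pd.DataFrame, pd.Series],
    test_size: float = 0.2,
) -> Dict:
    """Split X and y into random train and test subsets and return them in a dict."""
    # random_state is fixed here only to make the example reproducible (assumption).
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=42
    )
    # The key names below are hypothetical; the documented function only
    # promises "a dictionary containing the split data".
    return {"X Train": X_train, "X Test": X_test, "Y Train": y_train, "Y Test": y_test}


# Hypothetical data set with 10 samples.
X = pd.DataFrame({"SiO2": range(10), "MgO": range(10, 20)})
y = pd.Series(range(10), name="label")
parts = data_split_sketch(X, y, test_size=0.2)
print({name: len(part) for name, part in parts.items()})
```

With 10 samples and `test_size=0.2`, two samples go to the test subsets and eight to the training subsets.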
- float_input(default: float, prefix: str | None = None, slogan: str | None = '@Number: ') → float
Get a number entered by the user.
- Parameters:
default (float) – The value assigned to option if the user does not enter anything.
prefix (str, default=None) – Indicates which section of the UML the user is currently in; it is shown on the command-line console.
slogan (str, default="@Number: ") – Acts like the first parameter of Python's input function: it is the prompt that is printed as a hint.
- Returns:
option – An option number.
- Return type:
float or int
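The default-on-empty behaviour described above could look roughly like the following. It is a sketch under assumptions: the `reader` parameter is a hypothetical hook added so the example can run without a console, and the prompt layout is invented for illustration.

```python
from typing import Callable, Optional


def float_input_sketch(
    default: float,
    prefix: Optional[str] = None,
    slogan: Optional[str] = "@Number: ",
    reader: Callable[[str], str] = input,  # hypothetical hook, not in the documented signature
) -> float:
    """Return the number typed by the user, or `default` when nothing is typed."""
    # The prompt layout is an assumption; only the roles of prefix and slogan are documented.
    prompt = f"({prefix}) {slogan or ''}" if prefix else (slogan or "")
    while True:
        raw = reader(prompt).strip()
        if not raw:
            return default  # empty input falls back to the default
        try:
            return float(raw)
        except ValueError:
            print("Please enter a number.")


# Simulated console replies: empty input, an invalid entry, then a valid number.
replies = iter(["", "abc", "0.35"])
first = float_input_sketch(0.2, reader=lambda _: next(replies))
second = float_input_sketch(0.2, prefix="Data Split", reader=lambda _: next(replies))
```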
- limit_num_input(option_list: List[str], prefix: str, input_func: num_input) → int
Limit the scope of the option.
- Parameters:
option_list (List[str]) – All the options provided are stored in a list.
prefix (str) – Indicates which section of the UML the user is currently in; it is shown on the command-line console.
input_func (function) – The input function used to read the option number, e.g. num_input.
- Returns:
option – An option number. Note that option = real index + 1.
- Return type:
int
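Because the returned option is 1-based, callers need to subtract one before indexing into the option list. The sketch below illustrates that range check and conversion. It simplifies the documented signature: `read_number` is a hypothetical stand-in for `input_func`, and the option names are invented.

```python
from typing import Callable, List


def limit_num_input_sketch(option_list: List[str], read_number: Callable[[], int]) -> int:
    """Keep asking until the number is a valid 1-based option."""
    while True:
        option = read_number()
        if 1 <= option <= len(option_list):
            return option
        print(f"Please choose a number between 1 and {len(option_list)}.")


methods = ["Mean", "Median", "Most Frequent"]  # hypothetical options
answers = iter([0, 7, 2])  # two out-of-range answers, then a valid one
option = limit_num_input_sketch(methods, lambda: next(answers))
chosen = methods[option - 1]  # convert the 1-based option back to a 0-based index
```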
- np2pd(array: ndarray, columns_name: List[str]) → DataFrame
Convert numpy array to pandas dataframe.
- Parameters:
array (np.ndarray) – The numpy array to be converted.
columns_name (List[str]) – The column names of the dataframe.
- Returns:
The converted dataframe.
- Return type:
pd.DataFrame
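A conversion of this kind is typically a one-liner around the `pd.DataFrame` constructor. The sketch below assumes exactly that; the function name `np2pd_sketch` and the column names are illustrative only.

```python
from typing import List

import numpy as np
import pandas as pd


def np2pd_sketch(array: np.ndarray, columns_name: List[str]) -> pd.DataFrame:
    """Wrap a 2-D numpy array in a DataFrame with the given column names."""
    return pd.DataFrame(array, columns=columns_name)


# Hypothetical 2 x 2 array and column names.
frame = np2pd_sketch(np.array([[1.0, 2.0], [3.0, 4.0]]), ["SiO2", "MgO"])
```

This is useful after steps such as imputation or scaling that return a bare `np.ndarray` and lose the original column labels.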
- num2option(items: List[str]) → None
List all the options serially.
- Parameters:
items (list) – The items to be enumerated.
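A numbered listing of this sort can be sketched with `enumerate` starting at 1, which matches the "option = real index + 1" convention used by the input helpers. The output format is an assumption, and unlike the documented function (which returns None) this sketch also returns the lines so they can be inspected.

```python
from typing import List


def num2option_sketch(items: List[str]) -> List[str]:
    """Print the items as a numbered list, starting from 1."""
    # The "number - item" layout is hypothetical.
    lines = [f"{number} - {item}" for number, item in enumerate(items, start=1)]
    for line in lines:
        print(line)
    return lines


listing = num2option_sketch(["Regression", "Classification"])
```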
- num_input(prefix: str | None = None, slogan: str | None = '@Number: ') → int
Get the number of the desired option.
- Parameters:
prefix (str, default=None) – Indicates which section of the UML the user is currently in; it is shown on the command-line console.
slogan (str, default="@Number: ") – Acts like the first parameter of Python's input function: it is the prompt that is printed as a hint.
- Returns:
option – An option number. Note that option = real index + 1.
- Return type:
int
- read_data(file_path: str | None = None, is_own_data: int = 2, prefix: str | None = None, slogan: str | None = '@File: ')
Read the data set.
- Parameters:
file_path (str, optional) – The path of the data set, by default None
is_own_data (int, default=2) – 1: own data set; 2: built-in data set
prefix (str, optional) – The prefix shown on the command-line console, by default None
slogan (str, optional) – The prompt shown on the command-line console, by default "@File: "
- Returns:
The data set read
- Return type:
pd.DataFrame
- select_columns(columns_range: str | None = None) → List[int]
Select the columns of the data set.
- Parameters:
columns_range (str, default=None) – The range of columns to select from the data set.
- Returns:
The columns selected.
- Return type:
list
- show_data_columns(columns_name: Index, columns_index: List | None = None) → None
Show the column names of the data set.
- Parameters:
columns_name (pd.Index) – The column names of the data set.
columns_index (list, default=None) – The column index of the data set.
- str_input(option_list: List[str], prefix: str | None = None) → str
Get the string of the desired option.
- Parameters:
option_list (list) – All the options provided are stored in a list.
prefix (str, default=None) – Indicates which section of the UML the user is currently in; it is shown on the command-line console.
- Returns:
option – A string of the desired option.
- Return type:
str
- tuple_input(default: Tuple[int], prefix: str | None = None, slogan: str | None = None) → Tuple[int]
Get the tuple of the desired option.
- Parameters:
default (Tuple[int]) – The value assigned to option if the user does not enter anything.
prefix (str, default=None) – Indicates which section of the UML the user is currently in; it is shown on the command-line console.
slogan (str, default=None) – Acts like the first parameter of Python's input function: it is the prompt that is printed as a hint.
- Returns:
option – A numeric tuple.
- Return type:
tuple
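One way to turn typed text into a numeric tuple is `ast.literal_eval`, which parses Python literals without executing arbitrary code. The sketch below assumes that approach; the `reader` hook and the accepted input formats are hypothetical and may differ from the real function.

```python
import ast
from typing import Callable, Optional, Tuple


def tuple_input_sketch(
    default: Tuple[int, ...],
    prefix: Optional[str] = None,
    slogan: Optional[str] = None,
    reader: Callable[[str], str] = input,  # hypothetical hook for testing
) -> Tuple[int, ...]:
    """Parse input such as "(50, 25)" into a tuple of ints, or return `default`."""
    raw = reader(slogan or "").strip()
    if not raw:
        return default  # empty input falls back to the default
    value = ast.literal_eval(raw)
    if type(value) is int:
        value = (value,)  # accept a single number as a one-element tuple
    if not (isinstance(value, tuple) and all(type(v) is int for v in value)):
        raise ValueError(f"Expected a tuple of integers, got {raw!r}")
    return value


kept_default = tuple_input_sketch((50,), reader=lambda _: "")
two_layers = tuple_input_sketch((50,), reader=lambda _: "(50, 25)")
single = tuple_input_sketch((50,), reader=lambda _: "100")
```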
feature_engineering
- class FeatureConstructor(data: DataFrame)
Bases: object
Construct new feature based on the existing data set.
- append_feature(new_feature_column: Series) → None
Append the new feature to the original data.
- cal_words = ['pow', 'sin', 'cos', 'tan', 'pi', 'mean', 'std', 'var', 'log']
- index2name() → None
Show the index of columns in the data set. The display pattern is [letter : column name], e.g. a : 1st column name; b : 2nd column name.
- oper = '+-*/^(),.'
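The [letter : column name] display pattern used by index2name can be illustrated as follows. This is only a sketch: it works on a plain list of column names rather than on the class's data, returns the mapping so it can be inspected (the documented method returns None), and caps the input at 26 columns, which is an assumption.

```python
import string
from typing import Dict, List


def index2name_sketch(columns: List[str]) -> Dict[str, str]:
    """Map a, b, c, ... to the column names in order and print each pair."""
    if len(columns) > len(string.ascii_lowercase):
        raise ValueError("This sketch only handles up to 26 columns.")
    mapping = dict(zip(string.ascii_lowercase, columns))
    for letter, name in mapping.items():
        print(f"{letter} : {name}")
    return mapping


# Hypothetical column names.
letters = index2name_sketch(["SiO2", "Al2O3", "MgO"])
```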
data.imputation
- imputer(data: DataFrame, method: str) → tuple[dict, ndarray]
Apply imputation on missing values.
- Parameters:
data (pd.DataFrame) – The dataset with missing values.
method (str) – The imputation method.
- Returns:
imputation_config (dict) – The imputation configuration.
data_imputed (np.ndarray) – The dataset after imputing.
data.preprocessing
- feature_scaler(X: DataFrame, method: List[str], method_idx: int) → tuple[dict, ndarray]
Apply feature scaling methods.
- Parameters:
X (pd.DataFrame) – The dataset.
method (List[str]) – The feature scaling methods.
method_idx (int) – The index of methods.
- Returns:
feature_scaling_config (dict) – The feature scaling configuration.
X_scaled (np.ndarray) – The dataset after scaling.
- feature_selector(X: DataFrame, y: DataFrame, feature_selection_task: int, method: List[str], method_idx: int) → tuple[dict, DataFrame]
Apply feature selection methods.
- Parameters:
X (pd.DataFrame) – The feature dataset.
y (pd.DataFrame) – The label dataset.
feature_selection_task (int) – Feature selection for regression or classification tasks.
method (List[str]) – The feature selection methods.
method_idx (int) – The index of methods.
- Returns:
feature_selection_config (dict) – The feature selection configuration.
X_selected (pd.DataFrame) – The feature dataset after selecting.
data.statistic
- monte_carlo_simulator(df_orig: DataFrame, df_impute: DataFrame, sample_size: int, iteration: int, test: str, confidence: float = 0.05) → None
Check which columns reject the null hypothesis (p-value < significance level) to determine whether the imputation has changed the distribution of the original data set.
- Parameters:
df_orig (pd.DataFrame (n_samples, n_components)) – The original dataset with missing values.
df_impute (pd.DataFrame (n_samples, n_components)) – The dataset after imputation.
test (str) – The statistics test method used.
sample_size (int) – The size of the sample for each iteration.
iteration (int) – The number of iterations of Monte Carlo simulation.
confidence (float, default=0.05) – The significance level against which the p-values are compared.
- test_once(df_orig: DataFrame, df_impute: DataFrame, test: str) → ndarray
Run a non-parametric hypothesis test once on each pair of corresponding columns. Null hypothesis: the distribution of the data set is the same before and after imputation.
- Parameters:
df_orig (pd.DataFrame (n_samples, n_components)) – The original dataset with missing values.
df_impute (pd.DataFrame (n_samples, n_components)) – The dataset after imputation.
test (str) – The statistics test method used.
- Returns:
pvals – A numpy array containing the p-values of the tests on each column, in column order.
- Return type:
np.ndarray
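A column-by-column comparison of this kind can be sketched with SciPy's non-parametric two-sample tests. The example below is an illustration under assumptions, not the library's code: the function name, the accepted `test` strings ("kruskal" and "ks"), the choice to drop missing values from the original column before testing, and the sample data are all hypothetical.

```python
import numpy as np
import pandas as pd
from scipy import stats


def pvalues_once_sketch(df_orig: pd.DataFrame, df_impute: pd.DataFrame, test: str = "kruskal") -> np.ndarray:
    """Compare each original column with its imputed counterpart and collect the p-values."""
    # Hypothetical mapping from test names to SciPy functions.
    available_tests = {"kruskal": stats.kruskal, "ks": stats.ks_2samp}
    test_func = available_tests[test]
    pvals = []
    for column in df_orig.columns:
        observed = df_orig[column].dropna()  # only the values that were actually measured
        result = test_func(observed, df_impute[column])
        pvals.append(result.pvalue)
    return np.array(pvals)


# Hypothetical data: two columns with one missing value each, imputed with the column mean.
original = pd.DataFrame({
    "SiO2": [1.0, 2.0, np.nan, 4.0, 5.0, 6.0],
    "MgO": [10.0, np.nan, 30.0, 40.0, 50.0, 60.0],
})
imputed = original.fillna(original.mean())
pvals = pvalues_once_sketch(original, imputed, test="kruskal")
rejected = pvals < 0.05  # columns whose distribution appears to have changed
```

A column is flagged when its p-value falls below the chosen significance level, which is the comparison that monte_carlo_simulator describes.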