mlm_insights.core.metrics.data_quality package¶

Submodules¶

mlm_insights.core.metrics.data_quality.cramers_v_correlation module¶

class mlm_insights.core.metrics.data_quality.cramers_v_correlation.CorrelationSummary(cramers_v_correlation: float, p_value: float = nan)¶

Bases: object

cramers_v_correlation: float¶

p_value: float = nan¶

class mlm_insights.core.metrics.data_quality.cramers_v_correlation.CramersVCorrelation(config: Dict[str, ConfigParameter] = <factory>, feature_pair_mapping: Dict[str, CramersVCorrelationState] = <factory>, feature_list: List[str] = <factory>)¶

Bases: DatasetMetricBase, Serializable

This metric computes the Cramer’s V correlation matrix and the corresponding p-value matrix for the user-provided input features.
It is a dataset-level metric that processes categorical data types.
This is an approximate multivariate metric.
Internally, it uses a sketch data structure with a default K value of 1024.
Cramer’s V is used as the measure of association between n categorical features.
The metric handles NaN values and can be used for feature importance.

NaN handling example

import numpy as np
import pandas as pd

a = np.array([1, 2, 8, np.nan, 9])
b = np.array([5, np.nan, 7, np.nan, 10])

# Keep only the positions where both columns hold a value
valid_corresponding_column_values = pd.notna(a) & pd.notna(b)
# valid_corresponding_column_values is [ True False  True False  True]

a = a[valid_corresponding_column_values]  # [1., 8., 9.]
b = b[valid_corresponding_column_values]  # [5., 7., 10.]

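Cramer’s V ranges from 0 to 1, where:

0 indicates no association between the two variables.
1 indicates a perfect association between the two variables.

Cramer’s V is computed as the square root of the chi-squared statistic divided by the product of the sample size and the minimum dimension of the contingency table minus 1: V = sqrt(chi2 / (n * (min(r, c) - 1))), where n is the sample size and r and c are the numbers of rows and columns of the contingency table.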
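The formula can be checked by hand with a small, exact computation. The sketch below is not the library's implementation (which is approximate and sketch-based); the helper name cramers_v is hypothetical, and the data is the transport/gender sample used in the example further down.

```python
import math
from collections import Counter


def cramers_v(xs, ys):
    """Exact Cramer's V for two equally long sequences of categories."""
    n = len(xs)
    pair_counts = Counter(zip(xs, ys))  # observed contingency table
    row_totals = Counter(xs)
    col_totals = Counter(ys)

    # Chi-squared statistic: sum of (observed - expected)^2 / expected
    chi2 = 0.0
    for x, row_total in row_totals.items():
        for y, col_total in col_totals.items():
            expected = row_total * col_total / n
            observed = pair_counts.get((x, y), 0)
            chi2 += (observed - expected) ** 2 / expected

    # Minimum table dimension minus 1 (zero if a column has one category)
    min_dim = min(len(row_totals), len(col_totals)) - 1
    return math.sqrt(chi2 / (n * min_dim))


transport = ['bus', 'bus', 'train', 'walk', 'walk', 'car', 'car']
gender = ['M', 'M', 'F', 'F', 'M', 'M', 'F']
print(round(cramers_v(transport, gender), 4))  # 0.6455
```

The helper raises ZeroDivisionError when either column has a single category, because the minimum dimension minus 1 is then zero.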

Configuration¶

lg_max_k: int, default=10

Maximum size of k, expressed as log2(k). The value must be between 7 and 21, inclusive. The default of 10 corresponds to k = 2^10 = 1024.

ignore_invalid_data_types: bool, default=True

Flag for ignoring invalid data types
If set to True, non-categorical features are ignored; otherwise the metric throws an error. For example, Cramer’s V only deals with categorical data types, so all non-categorical features are dropped.

feature_list: List[str]

List of feature names. The correlation is computed between each pair of the provided features. The number of features must be between 2 and 50, inclusive.

Returns¶

feature_list: List[str]

list of user provided feature inputs

matrix: numpy.typing.NDArray[np.float64]

correlation matrix

p_values: numpy.typing.NDArray[np.float64]

Matrix of p-values. The p-value is the probability of observing an association at least as strong as the one in the sample data when the null hypothesis (no association between the two features) is in fact true. A low p-value leads you to reject the null hypothesis. A typical threshold for rejecting the null hypothesis is a p-value of 0.05.

Limitations¶

Currently, at most MAX_FEATURE_THRESHOLD_DEFAULT = 50 categorical features are supported in a single computation.

Exceptions¶

InvalidParameterException - if a column name is not present in the provided dataset
MissingRequiredParameterException - if the number of features exceeds MAX_FEATURE_THRESHOLD_DEFAULT
ValueError - if the columns being compared have no corresponding values to compare (every pair contains a NaN)
TypeError - if feature_list is not passed as a list

Examples

from mlm_insights.builder.builder_component import MetricDetail, EngineDetail
from mlm_insights.builder.insights_builder import InsightsBuilder
from mlm_insights.constants.definitions import FEATURE_LIST
from mlm_insights.constants.types import FeatureType, DataType, VariableType, ColumnType
from mlm_insights.core.metrics.metric_metadata import MetricMetadata
from mlm_insights.core.metrics.data_quality.cramers_v_correlation import CramersVCorrelation
import pandas as pd

def main():
    input_schema = {
        'transport': FeatureType(data_type=DataType.STRING,
                                 variable_type=VariableType.NOMINAL,
                                 column_type=ColumnType.INPUT),
        'gender': FeatureType(data_type=DataType.STRING,
                              variable_type=VariableType.NOMINAL,
                              column_type=ColumnType.INPUT)
    }

    data_frame = pd.DataFrame({'transport': ['bus', 'bus', 'train', 'walk', 'walk', 'car', 'car'],
                               'gender': ['M', 'M', 'F', 'F', 'M', 'M', 'F']})
    feature1: str = 'transport'
    feature2: str = 'gender'
    correlation_metrics = [
        MetricMetadata(klass=CramersVCorrelation, config={FEATURE_LIST: [feature1, feature2]})
    ]

    metric_details = MetricDetail(univariate_metric={},
                                  dataset_metrics=correlation_metrics)

    runner = (InsightsBuilder()
              .with_input_schema(input_schema)
              .with_data_frame(data_frame=data_frame)
              .with_metrics(metrics=metric_details)
              .with_engine(engine=EngineDetail(engine_name="native"))
              .build())

    run_result = runner.run()
    profile = run_result.profile

    dataset_metrics = profile.get_dataset_metric(correlation_metrics[0])
    assert dataset_metrics is not None

    cramers_actual_value = dataset_metrics.get_result()['value']
    cramers_correlation_matrix = cramers_actual_value['matrix']
    p_value_matrix = cramers_actual_value['p_values']

    feature_map = {value: index for index, value in enumerate(cramers_actual_value[FEATURE_LIST])}

    cramers_v_value_for_feature1_feature2 = round(
        cramers_correlation_matrix[feature_map[feature1]][feature_map[feature2]], 4)
    p_value_for_feature1_feature2 = round(
        p_value_matrix[feature_map[feature1]][feature_map[feature2]], 4)

The metric result has the form:

{
    'value': {
        'matrix': array([[1.        , 0.64549722],
                         [0.64549722, 1.        ]]),
        'p_values': array([[0.00815097, 0.40465279],
                           [0.40465279, 0.01265042]]),
        'feature_list': ['transport', 'gender']
    }
}

compute(dataset: DataFrame, **kwargs: Any) → None¶: Update the state of the CramersVCorrelation using dataset

Parameters¶

dataset : pd.DataFrame DataFrame object for either the entire dataset or a partition on which the metric is being computed

classmethod create(config: Dict[str, ConfigParameter] | None = None, **kwargs: Any) → CramersVCorrelation¶

Create a CramersVCorrelation data quality metric using the configuration and kwargs

Parameters¶

config : Metric configuration

kwargs : Key-value pairs for dynamic arguments. The current kwargs contain:

features: List of input feature column names

classmethod deserialize(serialized_bytes: bytes, **kwargs: Any) → CramersVCorrelation¶

Create a new instance from serialized bytes.

Parameters¶

serialized_bytes : bytes Serialized bytes as input.

Returns¶

Serializable: New instance of Serializable

feature_list: List[str]¶

feature_pair_mapping: Dict[str, CramersVCorrelationState]¶

get_result(**kwargs: Any) → Dict[str, Any]¶: Returns CramersVCorrelation data quality metric

Returns¶

Json object: CramersVCorrelation of the data.

get_standard_metric_result(**kwargs: Any) → StandardMetricResult¶: Returns CramersVCorrelation Metric and P_values in Standard format.

Returns¶

StandardMetricResult: CramersVCorrelation Metric and P_values in standard format.

merge(other: CramersVCorrelation, **kwargs: Any) → CramersVCorrelation¶

Merge two CramersVCorrelation instances into one, without mutating either of them. The sketch is updated with the pair values of column1 and column2 from the other partition.

Parameters¶

other : CramersVCorrelation Other CramersVCorrelation that needs to be merged.

Returns¶

CramersVCorrelation: A new instance of CramersVCorrelation
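The idea behind merging pair-value counts across partitions can be illustrated with an exact counter. This is only a sketch under stated assumptions: collections.Counter stands in for the frequent_strings_sketch used by the metric (which is approximate), and the helper names pair_counts and merge_counts are hypothetical.

```python
from collections import Counter


def pair_counts(col1, col2):
    """Count co-occurrences of (col1, col2) values in one partition."""
    return Counter(zip(col1, col2))


def merge_counts(left, right):
    """Return a new Counter; neither input is mutated."""
    return left + right


# Two partitions of the same dataset
part_a = pair_counts(['bus', 'bus', 'train'], ['M', 'M', 'F'])
part_b = pair_counts(['walk', 'walk', 'car', 'car'], ['F', 'M', 'M', 'F'])
merged = merge_counts(part_a, part_b)

# Counting the whole dataset at once gives the same contingency table
full = pair_counts(['bus', 'bus', 'train', 'walk', 'walk', 'car', 'car'],
                   ['M', 'M', 'F', 'F', 'M', 'M', 'F'])
print(merged == full)  # True
```

Because the merged counts equal the counts of the full dataset, a statistic derived from them does not depend on how the data was partitioned.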

serialize(**kwargs: Any) → bytes¶: Serialize the class to bytes. The bytes output must return the instance of the same class when deserialized.

Returns¶

bytes: Byte representation of object

class mlm_insights.core.metrics.data_quality.cramers_v_correlation.CramersVCorrelationState(sketch: _datasketches.frequent_strings_sketch, total_count: int = 0, feature1: str = '', feature2: str = '')¶

Bases: object

feature1: str = ''¶

feature2: str = ''¶

sketch: frequent_strings_sketch¶

total_count: int = 0¶

mlm_insights.core.metrics.data_quality.pearson_correlation module¶

class mlm_insights.core.metrics.data_quality.pearson_correlation.PearsonCorrelation(config: Dict[str, ConfigParameter] = <factory>, feature_pair_mapping: Dict[str, PearsonCorrelationState] = <factory>, feature_list: List[str] = <factory>)¶

Bases: DatasetMetricBase, Serializable

This metric computes the Pearson’s Correlation Coefficient matrix for the user-provided input features.
Pearson’s Correlation Coefficient takes values between -1 and 1.
It is a dataset-level metric that processes numeric data types.
This is an exact multivariate metric.
The metric handles NaN values.

NaN handling example

import numpy as np
import pandas as pd

a = np.array([1, 2, 8, np.nan, 9])
b = np.array([5, np.nan, 7, np.nan, 10])

# Keep only the positions where both columns hold a value
valid_corresponding_column_values = pd.notna(a) & pd.notna(b)
# valid_corresponding_column_values is [ True False  True False  True]

a = a[valid_corresponding_column_values]  # [1., 8., 9.]
b = b[valid_corresponding_column_values]  # [5., 7., 10.]

The metric can be used for feature importance. The coefficient ranges from -1 to 1, where:

-1 indicates a perfect negative linear relationship between the variables

0 indicates no linear relationship between the variables

1 indicates a perfect positive linear relationship between the variables

Pearson’s coefficient is computed as the covariance of the two variables divided by the product of their standard deviations: r = cov(X, Y) / (std(X) * std(Y)).
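As a minimal illustration of the covariance and variance computation, the following sketch evaluates the coefficient directly in plain Python. The helper name pearson and the first sample are hypothetical; the second sample is the house_price/square_feet data from the example below. This is not the library's implementation.

```python
import math


def pearson(xs, ys):
    """Pearson's r from the covariance and variance numerators."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # The 1/n factors cancel between numerator and denominator,
    # so the plain sums of products are sufficient.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
print(round(pearson(xs, ys), 4))  # 0.7746

house_price = [1, 2, 3, 4, 5, 6, 7, 8, 5, 6, 7]
square_feet = [5, 6, 7, 8, 9, 10, 11, 12, 9, 10, 11]
print(round(pearson(house_price, square_feet), 4))  # 1.0
```

The second pair gives 1.0 because square_feet is house_price shifted by a constant, which is a perfect positive linear relationship.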

Configuration¶

ignore_invalid_data_types: bool, default=True

Flag for ignoring invalid data types

If set to True, non-numeric features are ignored; otherwise the metric throws an error

For example, Pearson only deals with numerical data types, so all non-numerical features are dropped

feature_list: List[str]

List of feature names. The correlation is computed between each pair of the provided features. The number of features must be between 2 and 50, inclusive.

Returns¶

feature_list: List[str]

list of user provided feature inputs

matrix: numpy.typing.NDArray[np.float64]

correlation matrix

Limitations¶

Currently, at most MAX_FEATURE_THRESHOLD_DEFAULT = 50 numerical features are supported in a single computation.

Exceptions¶

InvalidParameterException - if a column name is not present in the provided dataset
MissingRequiredParameterException - if the number of features exceeds MAX_FEATURE_THRESHOLD_DEFAULT
TypeError - if feature_list is not passed as a list

Examples

from mlm_insights.builder.builder_component import MetricDetail, EngineDetail
from mlm_insights.builder.insights_builder import InsightsBuilder
from mlm_insights.constants.definitions import FEATURE_LIST
from mlm_insights.constants.types import FeatureType, DataType, VariableType, ColumnType
from mlm_insights.core.metrics.metric_metadata import MetricMetadata
from mlm_insights.core.metrics.data_quality.pearson_correlation import PearsonCorrelation
import pandas as pd

def main():
    input_schema = {
        'square_feet': FeatureType(data_type=DataType.INTEGER,
                                   variable_type=VariableType.CONTINUOUS),
        'house_price': FeatureType(data_type=DataType.INTEGER,
                                   variable_type=VariableType.CONTINUOUS),
    }

    data_frame = pd.DataFrame({'house_price': [1, 2, 3, 4, 5, 6, 7, 8, 5, 6, 7],
                               'square_feet': [5, 6, 7, 8, 9, 10, 11, 12, 9, 10, 11]})
    feature1: str = 'house_price'
    feature2: str = 'square_feet'
    correlation_metrics = [
        MetricMetadata(klass=PearsonCorrelation, config={FEATURE_LIST: [feature1, feature2]})
    ]

    metric_details = MetricDetail(univariate_metric={},
                                  dataset_metrics=correlation_metrics)

    runner = (InsightsBuilder()
              .with_input_schema(input_schema)
              .with_data_frame(data_frame=data_frame)
              .with_metrics(metrics=metric_details)
              .with_engine(engine=EngineDetail(engine_name="native"))
              .build())

    run_result = runner.run()
    profile = run_result.profile

    dataset_metrics = profile.get_dataset_metric(correlation_metrics[0])
    assert dataset_metrics is not None

The metric result has the form shown below. The coefficient is 1 because square_feet equals house_price + 4 in every row:

{
    'value': {
        'matrix': array([[1., 1.],
                         [1., 1.]]),
        'feature_list': ['house_price', 'square_feet']
    }
}

compute(dataset: DataFrame, **kwargs: Any) → None¶: Update the state of the PearsonCorrelation metric using dataset

Parameters¶

dataset : pd.DataFrame DataFrame object for either the entire dataset or a partition on which the metric is being computed

classmethod create(config: Dict[str, ConfigParameter] | None = None, **kwargs: Any) → PearsonCorrelation¶

Create a PearsonCorrelation data quality metric using the configuration and kwargs

Parameters¶

config : Metric configuration

kwargs : Key-value pairs for dynamic arguments. The current kwargs contain:

features: List of input feature column names

classmethod deserialize(serialized_bytes: bytes, **kwargs: Any) → PearsonCorrelation¶

Create a new instance from serialized bytes.

Parameters¶

serialized_bytes : bytes Serialized bytes as input.

Returns¶

Serializable: New instance of Serializable

feature_list: List[str]¶

feature_pair_mapping: Dict[str, PearsonCorrelationState]¶

get_required_shareable_feature_components(**kwargs: Any) → Dict[str, List[SFCMetaData]]¶: Returns the Shareable Feature Components for 2 input features

get_result(**kwargs: Any) → Dict[str, Any]¶: Returns Pearson’s Correlation 2-D matrix for set of features, using the DescriptiveStatisticsSFC

Returns¶

Json object: Pearson’s Correlation 2-D matrix for n features

get_standard_metric_result(**kwargs: Any) → StandardMetricResult¶: Returns Pearson’s Correlation Metric and P_values in Standard format.

Returns¶

StandardMetricResult: Pearson’s Correlation Metric and P_values in standard format.

merge(other: PearsonCorrelation, **kwargs: Any) → PearsonCorrelation¶

Merge two PearsonCorrelation instances into one, without mutating either of them:

1. Calculate cumulative_col12_count
2. Calculate the combined mean for feature column1 and column2
3. Calculate the numerator of the covariance of column1 and column2

Parameters¶

other : PearsonCorrelation Other PearsonCorrelation that needs to be merged.

Returns¶

PearsonCorrelation: A new instance of PearsonCorrelation
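The three merge steps (combined count, combined means, covariance numerator) can be sketched with the standard pairwise update for co-moments. The PairState class and both helper functions are hypothetical stand-ins, not PearsonCorrelationState or the library's code; they only assume that each partition keeps a count, two means and the covariance numerator.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PairState:  # hypothetical stand-in, not PearsonCorrelationState
    count: int
    mean_x: float
    mean_y: float
    co_moment: float  # sum((x - mean_x) * (y - mean_y)), the covariance numerator


def from_partition(xs, ys):
    """Build the state for one partition of paired values."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    co_moment = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return PairState(n, mean_x, mean_y, co_moment)


def merge(a, b):
    """Combine two states into a new one without mutating either."""
    n = a.count + b.count                                   # step 1
    mean_x = (a.count * a.mean_x + b.count * b.mean_x) / n  # step 2
    mean_y = (a.count * a.mean_y + b.count * b.mean_y) / n
    co_moment = (a.co_moment + b.co_moment                  # step 3
                 + (a.mean_x - b.mean_x) * (a.mean_y - b.mean_y)
                 * a.count * b.count / n)
    return PairState(n, mean_x, mean_y, co_moment)


merged = merge(from_partition([1, 2], [2, 4]),
               from_partition([3, 4, 5], [5, 4, 5]))
full = from_partition([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
print(abs(merged.co_moment - full.co_moment) < 1e-9)  # True
```

The correction term in step 3 accounts for the shift between the two partition means, which is why the merged numerator matches the one computed over the full data.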

serialize(**kwargs: Any) → bytes¶: Serialize the class to bytes. The bytes output must return the instance of the same class when deserialized.

Returns¶

bytes: Byte representation of object

class mlm_insights.core.metrics.data_quality.pearson_correlation.PearsonCorrelationState(cumulative_partition_count: int = 0, column1_mean: float = nan, column2_mean: float = nan, covariance_col1_col2: float = nan, feature1: str = '', feature2: str = '')¶

Bases: object

column1_mean: float = nan¶

column2_mean: float = nan¶

covariance_col1_col2: float = nan¶

cumulative_partition_count: int = 0¶

feature1: str = ''¶

feature2: str = ''¶

mlm_insights.core.metrics.data_quality.correlation_ratio module¶

class mlm_insights.core.metrics.data_quality.correlation_ratio.CorrelationRatio(config: Dict[str, ConfigParameter] = <factory>, feature_pair_mapping: Dict[str, CorrelationRatioState] = <factory>, categorical_features: List[str] = <factory>, numerical_features: List[str] = <factory>)¶

Bases: DatasetMetricBase, Serializable

This dataset-level metric computes the correlation matrix for the user-provided categorical and numerical features.
This is an approximate multivariate metric.
The Correlation Ratio is used as the correlation measure between n categorical and m numerical features.
The metric handles NaN values.

NaN handling example

import numpy as np
import pandas as pd

a = np.array([1, 2, 8, np.nan, 9])
b = np.array([5, np.nan, 7, np.nan, 10])

# Keep only the positions where both columns hold a value
valid_corresponding_column_values = pd.notna(a) & pd.notna(b)
# valid_corresponding_column_values is [ True False  True False  True]

a = a[valid_corresponding_column_values]  # [1., 8., 9.]
b = b[valid_corresponding_column_values]  # [5., 7., 10.]

The Correlation Ratio ranges from 0 to 1, where:

0 indicates no dispersion among the means of the different categories
1 indicates no dispersion within the respective categories
NaN is returned when all data points of the complete population take the same value

Correlation ratio (η) is a measure of the relationship between the statistical dispersion within individual categories and the dispersion across the whole population or sample. It is computed as the square root of the ratio of the between-category sum of squares to the total sum of squares.
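A small exact computation makes the definition concrete. The sketch below is not the library's implementation; the helper name correlation_ratio is hypothetical, and the data is the Pclass/age sample used in the example further down.

```python
import math
from collections import defaultdict


def correlation_ratio(categories, values):
    """Exact correlation ratio (eta) of a numerical column by category."""
    groups = defaultdict(list)
    for category, value in zip(categories, values):
        groups[category].append(value)

    overall_mean = sum(values) / len(values)
    # Dispersion of the category means around the overall mean
    between = sum(len(g) * (sum(g) / len(g) - overall_mean) ** 2
                  for g in groups.values())
    # Dispersion of all values around the overall mean
    total = sum((v - overall_mean) ** 2 for v in values)
    if total == 0:
        return math.nan  # every value is identical, eta is undefined
    return math.sqrt(between / total)


pclass = [3, 3, 2, 3, 3, 3, 3, 2, 3, 3]
age = [34.5, 47, 62, 27, 22, 14, 30, 26, 18, 21]
print(round(correlation_ratio(pclass, age), 5))  # 0.50199
```

When every value falls exactly on its category mean, the between-category and total sums of squares coincide and the ratio is 1.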

Configuration¶

feature_list: List[str]

List of feature names. The correlation is computed between each pair of the provided features. The number of features must be between 2 and 50, inclusive.

Returns¶

matrix: numpy.typing.NDArray[np.float64]

correlation matrix

categorical_features: List[str]

list of user provided categorical feature inputs

numerical_features: List[str]

list of user provided numerical feature inputs

Limitations¶

Currently, at most MAX_FEATURE_THRESHOLD_DEFAULT = 50 features are supported, counting both categorical and numerical features.

Exceptions¶

InvalidParameterException - if a column name is not present in the provided dataset
MissingRequiredParameterException - if the number of features exceeds MAX_FEATURE_THRESHOLD_DEFAULT, or if at least 1 numerical and 1 categorical feature column name are not provided
ValueError - if the columns being compared have no corresponding values to compare (every pair contains a NaN)
TypeError - if feature_list is not passed as a list

Examples

import pandas as pd

from mlm_insights.builder.builder_component import MetricDetail, EngineDetail
from mlm_insights.builder.insights_builder import InsightsBuilder
from mlm_insights.constants.definitions import FEATURE_LIST, CATEGORICAL_FEATURES, NUMERICAL_FEATURES
from mlm_insights.constants.types import FeatureType, DataType, VariableType
from mlm_insights.core.metrics.data_quality.correlation_ratio import CorrelationRatio
from mlm_insights.core.metrics.metric_metadata import MetricMetadata


def main():
    input_schema = {
        "Pclass": FeatureType(data_type=DataType.STRING, variable_type=VariableType.NOMINAL),
        "age": FeatureType(data_type=DataType.FLOAT, variable_type=VariableType.CONTINUOUS)
    }

    data_frame = pd.DataFrame({'Pclass': [3, 3, 2, 3, 3, 3, 3, 2, 3, 3],
                               'age': [34.5, 47, 62, 27, 22, 14, 30, 26, 18, 21]})
    feature1: str = 'Pclass'
    feature2: str = 'age'
    correlation_metrics = [
        MetricMetadata(klass=CorrelationRatio, config={FEATURE_LIST: [feature1, feature2]})
    ]

    metric_details = MetricDetail(univariate_metric={},
                                  dataset_metrics=correlation_metrics)
    runner = (InsightsBuilder()
              .with_input_schema(input_schema)
              .with_data_frame(data_frame=data_frame)
              .with_metrics(metrics=metric_details)
              .with_engine(engine=EngineDetail(engine_name="native"))
              .build())
    run_result = runner.run()
    profile = run_result.profile

    dataset_metrics = profile.get_dataset_metric(correlation_metrics[0])
    assert dataset_metrics is not None

    sfc_registry = {}
    for feature in profile.features.values():
        sfc_registry[feature.get_name()] = feature.sfc_registry

    correlation_ratio_actual_value = dataset_metrics.get_result(sfc_registry=sfc_registry)['value']
    correlation_matrix = correlation_ratio_actual_value['matrix']
    assert correlation_matrix is not None

    categorical_feature_map = {value: index for index, value in
                               enumerate(correlation_ratio_actual_value[CATEGORICAL_FEATURES])}
    numerical_feature_map = {value: index for index, value in
                             enumerate(correlation_ratio_actual_value[NUMERICAL_FEATURES])}

    correlation_ratio_value = round(
        correlation_ratio_actual_value['matrix'][categorical_feature_map[feature1]][
            numerical_feature_map[feature2]], 4)

The metric result has the form:

{
    'value': {
        'matrix': array([[0.50199]]),
        'categorical_features': ['Pclass'],
        'numerical_features': ['age']
    }
}

categorical_features: List[str]¶

compute(dataset: DataFrame, **kwargs: Any) → None¶: Update the state of the CorrelationRatio using dataset

Parameters¶

dataset : pd.DataFrame DataFrame object for either the entire dataset or a partition on which the metric is being computed

classmethod create(config: Dict[str, ConfigParameter] | None = None, **kwargs: Any) → CorrelationRatio¶

Create a CorrelationRatio data quality metric using the configuration and kwargs

Parameters¶

config : Metric configuration

kwargs : Key-value pairs for dynamic arguments. The current kwargs contain:

features: List of input feature column names

classmethod deserialize(serialized_bytes: bytes, **kwargs: Any) → CorrelationRatio¶

Create a new instance from serialized bytes.

Parameters¶

serialized_bytes : bytes Serialized bytes as input.

Returns¶

Serializable: New instance of Serializable

feature_pair_mapping: Dict[str, CorrelationRatioState]¶

get_required_shareable_feature_components(**kwargs: Any) → Dict[str, List[SFCMetaData]]¶: Returns the Shareable Feature Components that a metric requires to compute its state and values. Metrics that do not require an SFC need not override this property.

Returns¶

Dict with feature_name as key and a list of SFCMetadata as value. Each SFCMetadata must contain the klass attribute, which points to the SFC class.

get_result(**kwargs: Any) → Dict[str, Any]¶: Returns CorrelationRatio data quality metric

Returns¶

Json object: CorrelationRatio of the data.

get_standard_metric_result(**kwargs: Any) → StandardMetricResult¶: Returns CorrelationRatio Metric in Standard format.

Returns¶

StandardMetricResult: CorrelationRatio Metric in standard format.

merge(other: CorrelationRatio, **kwargs: Any) → CorrelationRatio¶

Merge two CorrelationRatio instances into one, without mutating either of them.

Parameters¶

other : CorrelationRatio Other CorrelationRatio that needs to be merged.

Returns¶

CorrelationRatio: A new instance of CorrelationRatio
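One way to picture the merge is to add up per-category sums and counts, which is all that is needed to recover the category means afterwards. This is only a sketch under that assumption: the Details class mirrors the two fields of CorrelationRatioDetails but is a hypothetical stand-in, and merge_details is not a library function.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Details:  # hypothetical stand-in for CorrelationRatioDetails
    total_sum: float = 0.0
    total_count: int = 0


def merge_details(left, right):
    """Combine two {category: Details} dicts into a new dict."""
    merged = {}
    for category in left.keys() | right.keys():
        a = left.get(category, Details())
        b = right.get(category, Details())
        merged[category] = Details(a.total_sum + b.total_sum,
                                   a.total_count + b.total_count)
    return merged


# Pclass/age sample split into two partitions (first 3 rows, last 7 rows)
part_a = {3: Details(81.5, 2), 2: Details(62.0, 1)}
part_b = {3: Details(132.0, 6), 2: Details(26.0, 1)}
merged = merge_details(part_a, part_b)

print(merged[3].total_sum / merged[3].total_count)  # 26.6875
print(merged[2].total_sum / merged[2].total_count)  # 44.0
```

The merged means match the category means of the full sample, and the input dictionaries are left unchanged.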

numerical_features: List[str]¶

serialize(**kwargs: Any) → bytes¶: Serialize the class to bytes. The bytes output must return the instance of the same class when deserialized.

Returns¶

bytes: Byte representation of object

class mlm_insights.core.metrics.data_quality.correlation_ratio.CorrelationRatioDetails(total_sum: float = 0.0, total_count: int = 0)¶

Bases: object

total_count: int = 0¶

total_sum: float = 0.0¶

class mlm_insights.core.metrics.data_quality.correlation_ratio.CorrelationRatioState(category_details: Dict[int | str | float, CorrelationRatioDetails] = <factory>, categorical_feature: str = '', numerical_feature: str = '')¶

Bases: object

categorical_feature: str = ''¶

category_details: Dict[int | str | float, CorrelationRatioDetails]¶

numerical_feature: str = ''¶

Module contents¶