mlm_insights.core.metrics package

Subpackages

Submodules

mlm_insights.core.metrics.count module

class mlm_insights.core.metrics.count.Count(config: ~typing.Dict[str, ~mlm_insights.constants.definitions.ConfigParameter] = <factory>, missing_count: int = 0, total_count: int = 0)

Bases: MetricBase

Feature Metric to compute total row count, missing count and missing count percentage.
NaN values are taken into account while computing the total count.
It is an exact univariate metric which can process any column type and all data types.

Configuration

None

Returns

total_count: int
  • Number of records processed for the feature.

missing_count: int
  • Number of records which have missing values.

missing_count_percentage: float
  • The percentage of missing records in the data.

Examples

import pandas as pd

from mlm_insights.builder.builder_component import MetricDetail, EngineDetail
from mlm_insights.builder.insights_builder import InsightsBuilder
from mlm_insights.constants.types import FeatureType, DataType, VariableType, ColumnType
from mlm_insights.core.metrics.count import Count
from mlm_insights.core.metrics.metric_metadata import MetricMetadata

def main():
    input_schema = {
        'square_feet': FeatureType(
            data_type=DataType.FLOAT,
            variable_type=VariableType.CONTINUOUS,
            column_type=ColumnType.INPUT)
    }
    data_frame = pd.DataFrame({'square_feet': [11.23, 23.45, 11.23, 45.56, 11.23, None]})
    metric_details = MetricDetail(univariate_metric=
                                  {"square_feet": [MetricMetadata(klass=Count)]},
                                  dataset_metrics=[])

    runner = InsightsBuilder() \
        .with_input_schema(input_schema) \
        .with_data_frame(data_frame=data_frame) \
        .with_metrics(metrics=metric_details) \
        .with_engine(engine=EngineDetail(engine_name="native")) \
        .build()

    profile_json = runner.run().profile.to_json()
    feature_metrics = profile_json['feature_metrics']
    print(feature_metrics['square_feet']["Count"])
    # {'total_count': 6, 'missing_count': 1, 'missing_count_percentage': 16.666666666666664}


if __name__ == "__main__":
    main()


Returns the standard metric result as:
{
    'metric_name': 'Count',
    'metric_description': 'Feature metric that returns total count, missing count and missing count percentage',
    'variable_count': 3, 'variable_names': ['total_count', 'missing_count', 'missing_count_percentage'],
    'variable_types': [CONTINUOUS, CONTINUOUS, CONTINUOUS], 'variable_dtypes': [INTEGER, INTEGER, FLOAT],
    'variable_dimensions': [0, 0, 0],
    'metric_data': [0, 0, 0.0],
    'metadata': {},
    'error': None
}
compute(column: Series, **kwargs: Any) None

Computes the count of missing records for the passed in dataset, as well as the total number of processed records. In case of a partitioned dataset, computes the count of missing records for the specific partition

Parameters

column : pd.Series

Input column.

classmethod create(config: Dict[str, ConfigParameter] | None = None) Count

Factory Method to create an object. The configuration will be available in config.

Returns

Count

An Instance of Count.

get_result(**kwargs: Any) Dict[str, Any]

Returns the total count, count of missing data, and the percentage of missing data for the feature

Returns

total_count: int
  • total number of records processed in the data.

missing_count: int
  • number of records in the data having missing values.

missing_count_percentage: float
  • percentage of missing records (the number of missing records for the feature divided by the total number of records)

get_standard_metric_result(**kwargs: Any) StandardMetricResult

Returns Standard Metric for Count.

Returns

StandardMetricResult: Count Metric in standard format.

merge(other_metric: Count, **kwargs: Any) Count

Merge two Count metrics into one, without mutating the others.

Parameters

other_metric : Count

Other Count metric that needs to be merged.

Returns

Count

A new instance of Count containing missing_count and total_count after merging.
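
A minimal sketch of the compute/merge flow on a partitioned column, composed only from the methods documented above (the data values are illustrative):

import pandas as pd

from mlm_insights.core.metrics.count import Count

# Two partitions of the same feature column (illustrative data)
partition_1 = pd.Series([11.23, 23.45, None])
partition_2 = pd.Series([45.56, 11.23, 11.23])

count_1 = Count.create()
count_1.compute(partition_1)

count_2 = Count.create()
count_2.compute(partition_2)

# merge() returns a new Count instance; neither input is mutated
merged = count_1.merge(count_2)
print(merged.get_result())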

missing_count: int = 0
total_count: int = 0

mlm_insights.core.metrics.dataset_metric_registry module

class mlm_insights.core.metrics.dataset_metric_registry.DatasetMetricRegistry

Bases: object

add_metric(dataset_metric_metadata: MetricMetadata, **kwargs: Any) DatasetMetricRegistry
static create_from_metrics_map(dataset_metrics_map: Dict[str, DatasetMetricBase]) DatasetMetricRegistry

Factory method to create Dataset Metric Registry using Dataset Metric Map. Use this method to create metric registry directly from the dataset metric map.

Parameters

dataset_metrics_map : Dict[str, DatasetMetricBase]

Dictionary of dataset metrics, keyed by metric hash, with DatasetMetricBase instances as values.

classmethod deserialize(metric_registry_message: MetricRegistryMessage) DatasetMetricRegistry
get_dataset_metrics() Any
get_dataset_metrics_map() Dict[str, DatasetMetricBase]
get_metric(dataset_metric_metadata: MetricMetadata) DatasetMetricBase
serialize() MetricRegistryMessage

mlm_insights.core.metrics.dataset_summary module

class mlm_insights.core.metrics.dataset_summary.DatasetSummary

Bases: object

compute(input_schema: Dict[str, FeatureType]) None
classmethod deserialize(serialized: DatasetSummaryMessage) DatasetSummary
serialize() DatasetSummaryMessage

mlm_insights.core.metrics.distinct_count module

class mlm_insights.core.metrics.distinct_count.DistinctCount(config: ~typing.Dict[str, ~mlm_insights.constants.definitions.ConfigParameter] = <factory>)

Bases: MetricBase

Feature Metric to compute the distinct count of elements present in a column.
It is an approximate univariate metric which can process any column type and all data types.
Internally, it uses a sketch data structure with a default K value of 4096.
It supports all data types and does not consider NaN values in the computation.

Returns

distinct_count: int
  • the distinct count of the data.

Example

import pandas as pd

from mlm_insights.builder.builder_component import MetricDetail, EngineDetail
from mlm_insights.builder.insights_builder import InsightsBuilder
from mlm_insights.constants.types import FeatureType, DataType, VariableType
from mlm_insights.core.metrics.distinct_count import DistinctCount
from mlm_insights.core.metrics.metric_metadata import MetricMetadata

df = pd.DataFrame({"Age": [1, 4, 6, 1]})
metric_details = MetricDetail(univariate_metric={"Age": [MetricMetadata(klass=DistinctCount)]}, dataset_metrics=[])
input_schema = {"Age": FeatureType(DataType.INTEGER, VariableType.NOMINAL)}

runner = InsightsBuilder() \
    .with_input_schema(input_schema) \
    .with_data_frame(data_frame=df) \
    .with_metrics(metrics=metric_details) \
    .with_engine(engine=EngineDetail(engine_name="native")) \
    .build()

profile_json = runner.run().profile.to_json()
feature_metrics = profile_json['feature_metrics']
print(feature_metrics["Age"])

{
  "DistinctCount": {
    "metric_name": "DistinctCount",
    "metric_description": "Approximate Distinct Count",
    "variable_count": 1,
    "variable_names": ["distinct_count"],
    "variable_types": ["CONTINUOUS"],
    "variable_dtypes": ["FLOAT"],
    "variable_dimensions": [0],
    "metric_data": [3],
    "metadata": {}
  }
}
config: Dict[str, ConfigParameter]
classmethod create(config: Dict[str, ConfigParameter] | None = None) DistinctCount

Factory Method to create an object. The configuration will be available in config.

Returns

An Instance of DistinctCount.

get_required_shareable_feature_components() List[SFCMetaData]

Returns a list of Shareable Feature Components containing 1 SFC, namely the Distinct Count SFC.

Returns

List of SFCMetadata, containing only 1 SFC i.e. DistinctCountSFC

get_result(**kwargs: Any) Dict[str, Any]

Returns the distinct count of data.

Returns

int: the distinct count of the data.

get_standard_metric_result(**kwargs: Any) StandardMetricResult

Returns Standard Metric for Distinct Count.

Returns

StandardMetricResult: Distinct Count Metric in standard format.

mlm_insights.core.metrics.duplicate_count module

class mlm_insights.core.metrics.duplicate_count.DuplicateCount(config: ~typing.Dict[str, ~mlm_insights.constants.definitions.ConfigParameter] = <factory>)

Bases: MetricBase

Feature Metric to compute duplicate count and duplicate count percentage of elements present in a column.
It is an approximate univariate metric which can process any column type and all data types.
Internally, it uses a sketch data structure with a default K value of 1024.
It supports all data types and does not consider NaN values in the computation.

Configuration

None

Returns

count: int
  • Number of duplicate items in the feature data

percentage: float
  • The percentage of duplicate records in the data

Examples

import pandas as pd

from mlm_insights.builder.builder_component import MetricDetail, EngineDetail
from mlm_insights.builder.insights_builder import InsightsBuilder
from mlm_insights.constants.types import FeatureType, DataType, VariableType, ColumnType
from mlm_insights.core.metrics.duplicate_count import DuplicateCount
from mlm_insights.core.metrics.metric_metadata import MetricMetadata


def main():
    input_schema = {
        'square_feet': FeatureType(
            data_type=DataType.FLOAT,
            variable_type=VariableType.CONTINUOUS,
            column_type=ColumnType.INPUT)
    }
    data_frame = pd.DataFrame({'square_feet': [11.23, 23.45, 11.23, 45.56, 11.23]})
    metric_details = MetricDetail(univariate_metric=
                                  {"square_feet": [MetricMetadata(klass=DuplicateCount)]},
                                  dataset_metrics=[])

    runner = InsightsBuilder() \
        .with_input_schema(input_schema) \
        .with_data_frame(data_frame=data_frame) \
        .with_metrics(metrics=metric_details) \
        .with_engine(engine=EngineDetail(engine_name="native")) \
        .build()

    profile_json = runner.run().profile.to_json()
    feature_metrics = profile_json['feature_metrics']
    print(feature_metrics['square_feet']["DuplicateCount"])
    # {'count': 2, 'percentage': 40.0}
if __name__ == "__main__":
    main()


Returns the standard metric result as:
{
    'metric_name': 'DuplicateCount',
    'metric_description': 'Feature Metric to compute duplicate count and duplicate count percentage',
    'variable_count': 2, 'variable_names': ['count', 'percentage'],
    'variable_types': [CONTINUOUS, CONTINUOUS], 'variable_dtypes': [INTEGER, FLOAT],
    'variable_dimensions': [0, 0],
    'metric_data': [23, 15.5],
    'metadata': {},
    'error': None
}
config: Dict[str, ConfigParameter]
classmethod create(config: Dict[str, ConfigParameter] | None = None) DuplicateCount

Factory Method to create an object.

Returns

An Instance of DuplicateCount.

get_required_shareable_feature_components() List[SFCMetaData]

Returns a list of Shareable Feature Components containing 1 SFC that is Frequent Items SFC.

Returns

List of SFCMetadata, containing only 1 SFC i.e. Frequent Items SFC

get_result(**kwargs: Any) Dict[str, Any]

Returns the number of items that are duplicate of another item in the data and percentage of duplicate count out of total count.

Returns

Object: number of items that are duplicate of another item in the data and percentage of duplicate count out of total count.

get_standard_metric_result(**kwargs: Any) StandardMetricResult

This method returns metric output in standard format.

Returns

StandardMetricResult

mlm_insights.core.metrics.framework_metrics_enum module

class mlm_insights.core.metrics.framework_metrics_enum.FrameworkMetrics(value)

Bases: Enum

Defines all Insights-provided Metrics.

AccuracyScore = <class 'mlm_insights.core.metrics.classification_metrics.accuracy_score.AccuracyScore'>
ChiSquare = <class 'mlm_insights.core.metrics.drift_metrics.chi_square.ChiSquare'>
ClassImbalance = <class 'mlm_insights.core.metrics.bias_and_fairness.class_imbalance.ClassImbalance'>
ConflictLabel = <class 'mlm_insights.core.metrics.conflict_metrics.conflict_label.ConflictLabel'>
ConflictPrediction = <class 'mlm_insights.core.metrics.conflict_metrics.conflict_prediction.ConflictPrediction'>
ConfusionMatrix = <class 'mlm_insights.core.metrics.classification_metrics.confusion_matrix.ConfusionMatrix'>
CorrelationRatio = <class 'mlm_insights.core.metrics.data_quality.correlation_ratio.CorrelationRatio'>
Count = <class 'mlm_insights.core.metrics.count.Count'>
CramersVCorrelation = <class 'mlm_insights.core.metrics.data_quality.cramers_v_correlation.CramersVCorrelation'>
DateTimeDuration = <class 'mlm_insights.core.metrics.datetime_metrics.datetime_duration.DateTimeDuration'>
DateTimeMax = <class 'mlm_insights.core.metrics.datetime_metrics.datetime_max.DateTimeMax'>
DateTimeMin = <class 'mlm_insights.core.metrics.datetime_metrics.datetime_min.DateTimeMin'>
DistinctCount = <class 'mlm_insights.core.metrics.distinct_count.DistinctCount'>
DuplicateCount = <class 'mlm_insights.core.metrics.duplicate_count.DuplicateCount'>
FBetaScore = <class 'mlm_insights.core.metrics.classification_metrics.fbeta_score.FBetaScore'>
FalseNegativeRate = <class 'mlm_insights.core.metrics.classification_metrics.false_negative_rate.FalseNegativeRate'>
FalsePositiveRate = <class 'mlm_insights.core.metrics.classification_metrics.false_positive_rate.FalsePositiveRate'>
FrequencyDistribution = <class 'mlm_insights.core.metrics.frequency_distribution.FrequencyDistribution'>
IQR = <class 'mlm_insights.core.metrics.iqr.IQR'>
IsConstantFeature = <class 'mlm_insights.core.metrics.is_constant_feature.IsConstantFeature'>
IsNegative = <class 'mlm_insights.core.metrics.is_negative.IsNegative'>
IsNonZero = <class 'mlm_insights.core.metrics.is_non_zero.IsNonZero'>
IsPositive = <class 'mlm_insights.core.metrics.is_positive.IsPositive'>
IsQuasiConstantFeature = <class 'mlm_insights.core.metrics.is_quasi_constant_feature.IsQuasiConstantFeature'>
JensenShannon = <class 'mlm_insights.core.metrics.drift_metrics.jensen_shannon.JensenShannon'>
KolmogorovSmirnov = <class 'mlm_insights.core.metrics.drift_metrics.kolmogorov_smirnov.KolmogorovSmirnov'>
KullbackLeibler = <class 'mlm_insights.core.metrics.drift_metrics.kullback_leibler.KullbackLeibler'>
Kurtosis = <class 'mlm_insights.core.metrics.kurtosis.Kurtosis'>
LogLoss = <class 'mlm_insights.core.metrics.classification_metrics.log_loss.LogLoss'>
Max = <class 'mlm_insights.core.metrics.max.Max'>
MaxError = <class 'mlm_insights.core.metrics.regression_metrics.max_error.MaxError'>
Mean = <class 'mlm_insights.core.metrics.mean.Mean'>
MeanAbsoluteError = <class 'mlm_insights.core.metrics.regression_metrics.mean_absolute_error.MeanAbsoluteError'>
MeanAbsolutePercentageError = <class 'mlm_insights.core.metrics.regression_metrics.mean_absolute_percentage_error.MeanAbsolutePercentageError'>
MeanSquaredError = <class 'mlm_insights.core.metrics.regression_metrics.mean_squared_error.MeanSquaredError'>
MeanSquaredLogError = <class 'mlm_insights.core.metrics.regression_metrics.mean_squared_log_error.MeanSquaredLogError'>
Min = <class 'mlm_insights.core.metrics.min.Min'>
Mode = <class 'mlm_insights.core.metrics.mode.Mode'>
PearsonCorrelation = <class 'mlm_insights.core.metrics.data_quality.pearson_correlation.PearsonCorrelation'>
Percentiles = <class 'mlm_insights.core.metrics.percentiles.Percentiles'>
PopulationStabilityIndex = <class 'mlm_insights.core.metrics.drift_metrics.population_stability_index.PopulationStabilityIndex'>
PrecisionRecallAreaUnderCurve = <class 'mlm_insights.core.metrics.classification_metrics.precision_recall_auc.PrecisionRecallAreaUnderCurve'>
PrecisionRecallCurve = <class 'mlm_insights.core.metrics.classification_metrics.precision_recall_curve.PrecisionRecallCurve'>
PrecisionScore = <class 'mlm_insights.core.metrics.classification_metrics.precision_score.PrecisionScore'>
ProbabilityDistribution = <class 'mlm_insights.core.metrics.probablity_distribution.ProbabilityDistribution'>
Quartiles = <class 'mlm_insights.core.metrics.quartiles.Quartiles'>
R2Score = <class 'mlm_insights.core.metrics.regression_metrics.r2_score.R2Score'>
ROCAreaUnderCurve = <class 'mlm_insights.core.metrics.classification_metrics.roc_auc.ROCAreaUnderCurve'>
ROCCurve = <class 'mlm_insights.core.metrics.classification_metrics.roc.ROCCurve'>
Range = <class 'mlm_insights.core.metrics.range.Range'>
RecallScore = <class 'mlm_insights.core.metrics.classification_metrics.recall_score.RecallScore'>
RootMeanSquaredError = <class 'mlm_insights.core.metrics.regression_metrics.root_mean_squared_error.RootMeanSquaredError'>
RowCount = <class 'mlm_insights.core.metrics.rows_count.RowCount'>
Skewness = <class 'mlm_insights.core.metrics.skewness.Skewness'>
Specificity = <class 'mlm_insights.core.metrics.classification_metrics.specificity.Specificity'>
StandardDeviation = <class 'mlm_insights.core.metrics.standard_deviation.StandardDeviation'>
Sum = <class 'mlm_insights.core.metrics.sum.Sum'>
TopKFrequentElements = <class 'mlm_insights.core.metrics.top_k_frequent_elements.TopKFrequentElements'>
TypeMetric = <class 'mlm_insights.core.metrics.type_metric.TypeMetric'>
Variance = <class 'mlm_insights.core.metrics.variance.Variance'>
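
Since each enum member's value is the corresponding metric class, the enum can be used to look up metric classes by name; a small sketch relying only on standard Enum behaviour:

from mlm_insights.core.metrics.framework_metrics_enum import FrameworkMetrics

# Each member's value is the metric class itself
count_class = FrameworkMetrics.Count.value
print(count_class)                      # <class 'mlm_insights.core.metrics.count.Count'>
print(FrameworkMetrics["Mean"].value)   # look up a metric class by its member name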

mlm_insights.core.metrics.frequency_distribution module

class mlm_insights.core.metrics.frequency_distribution.FrequencyDistribution(config: ~typing.Dict[str, ~mlm_insights.constants.definitions.ConfigParameter] = <factory>, bins: str | int | ~typing.List[float] = 'sturges')

Bases: MetricBase

Frequency Distribution
This metric calculates the frequency distribution of a single data column.
This is a feature level metric which can process any column type and only numerical (int, float) data types.
This is an approximate metric.
Internally, it uses a sketch data structure with a default K value of 200.

Configuration

bins: Union[str, int, List[float]], default='sturges'
One of the following values:
- Number of bins (int)
- Binning algorithm name (str); the default is 'sturges'
- Explicit bin edges (List[float])
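
For illustration, the three accepted forms could be passed through MetricMetadata config as in the sketch below; the config key "bins" is assumed to match the attribute name documented further down:

from mlm_insights.builder.builder_component import MetricDetail
from mlm_insights.core.metrics.frequency_distribution import FrequencyDistribution
from mlm_insights.core.metrics.metric_metadata import MetricMetadata

# Assumed config key "bins"; each form mirrors one entry of the Configuration list above
by_count = MetricMetadata(klass=FrequencyDistribution, config={"bins": 10})                       # number of bins
by_algorithm = MetricMetadata(klass=FrequencyDistribution, config={"bins": "sturges"})            # binning algorithm
by_edges = MetricMetadata(klass=FrequencyDistribution, config={"bins": [0.0, 5.0, 10.0, 20.0]})   # explicit bin edges

metric_details = MetricDetail(univariate_metric={"square_feet": [by_edges]}, dataset_metrics=[])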

Returns

bins: List[float]

Bin edges of the data.

frequency: List[int]

Frequency counts for each bin.

Example

from mlm_insights.builder.builder_component import MetricDetail, EngineDetail
from mlm_insights.builder.insights_builder import InsightsBuilder
from mlm_insights.constants.types import FeatureType, DataType, VariableType, ColumnType
from mlm_insights.core.metrics.frequency_distribution import FrequencyDistribution
from mlm_insights.core.metrics.metric_metadata import MetricMetadata
import pandas as pd

def main():
    input_schema = {
        'square_feet': FeatureType(
            data_type=DataType.FLOAT,
            variable_type=VariableType.CONTINUOUS,
            column_type=ColumnType.INPUT)
    }
    data_frame = pd.DataFrame({'square_feet': [1, 1, 2, 3, 4, 5, 7, 10, 11, 20]})
    metric_details = MetricDetail(univariate_metric=
                                  {"square_feet": [MetricMetadata(klass=FrequencyDistribution)]},
                                  dataset_metrics=[])

    runner = InsightsBuilder() \
        .with_input_schema(input_schema) \
        .with_data_frame(data_frame=data_frame) \
        .with_metrics(metrics=metric_details) \
        .with_engine(engine=EngineDetail(engine_name="native")) \
        .build()

    profile_json = runner.run().profile.to_json()
    feature_metrics = profile_json['feature_metrics']
    print(feature_metrics['square_feet']["FrequencyDistribution"])
    # {'bins': [1.0, 4.8, 8.6, 12.399999999999999, 16.2, 20.0], 'frequency': [5, 2, 2, 0, 1]}

Returns the standard metric result as:
{
    "metric_name": "FrequencyDistribution",
    "metric_description": "Feature Metric to compute Frequency distribution",
    "variable_count": 2,
    "variable_names": ['bins', 'frequency'],
    "variable_types": ["CONTINUOUS", "CONTINUOUS"],
    "variable_dtypes": ["FLOAT", "FLOAT"],
    "variable_dimensions": [1, 1],
    "metric_data": [[1.0, 4.8, 8.6, 12.399999999999999, 16.2, 20.0], [5, 2, 2, 0, 1]],
    "metadata": {},
    "error": null
  }
bins: str | int | List[float] = 'sturges'
classmethod create(config: Dict[str, ConfigParameter] | None = None) FrequencyDistribution

Factory Method to create an object. The configuration will be available in config.

Returns

FrequencyDistribution

An Instance of FrequencyDistribution.

get_required_shareable_feature_components() List[SFCMetaData]

Returns a list of Shareable Feature Components containing 1 SFC that is Quantiles SFC.

Returns

List[SFCMetaData]

List of SFCMetadata, containing only 1 SFC i.e. QuantilesSFC

get_result(**kwargs: Any) Dict[str, Any]

Returns the FrequencyDistribution of data.

Returns

Dict

The frequency distribution of the data.

get_standard_metric_result(**kwargs: Any) StandardMetricResult

This method returns metric output in standard format.

Returns

StandardMetricResult

merge(other_metric: FrequencyDistribution, **kwargs: Any) FrequencyDistribution

Merge two FrequencyDistribution metrics into one, without mutating either.

Parameters

other_metric : FrequencyDistribution

Other FrequencyDistribution metric that needs to be merged.

Returns

FrequencyDistribution

A new instance of Frequency Distribution metric after merging.

mlm_insights.core.metrics.iqr module

class mlm_insights.core.metrics.iqr.IQR(config: ~typing.Dict[str, ~mlm_insights.constants.definitions.ConfigParameter] = <factory>)

Bases: MetricBase

Inter Quartile Range
This metric calculates the inter-quartile range of a single numerical data column, namely Q3 - Q1.
The k-th quartile represents the (k * (n + 1) / 4)-th term of the overall dataset.
This is a feature level metric which can process any column type and only numerical (int, float) data types.
This is an approximate metric.
Internally, it uses a sketch data structure with a default K value of 200.

Configuration

None

Returns

iqr: float

the IQR of the data (Q3 - Q1).

Examples

from mlm_insights.builder.builder_component import MetricDetail, EngineDetail
from mlm_insights.builder.insights_builder import InsightsBuilder
from mlm_insights.constants.types import FeatureType, DataType, VariableType, ColumnType
from mlm_insights.core.metrics.iqr import IQR
from mlm_insights.core.metrics.metric_metadata import MetricMetadata
import pandas as pd

def main():
    input_schema = {
        'square_feet': FeatureType(
            data_type=DataType.FLOAT,
            variable_type=VariableType.CONTINUOUS,
            column_type=ColumnType.INPUT)
    }
    data_frame = pd.DataFrame({'square_feet': [-1, -2, -3, -4]})
    metric_details = MetricDetail(univariate_metric=
                                  {"square_feet": [MetricMetadata(klass=IQR)]},
                                  dataset_metrics=[])

    runner = InsightsBuilder() \
        .with_input_schema(input_schema) \
        .with_data_frame(data_frame=data_frame) \
        .with_metrics(metrics=metric_details) \
        .with_engine(engine=EngineDetail(engine_name="native")) \
        .build()

    profile_json = runner.run().profile.to_json()
    feature_metrics = profile_json['feature_metrics']
    print(feature_metrics['square_feet']["IQR"])
    # {'value': 2}
if __name__ == "__main__":
    main()

Returns the standard metric result as:
{
    'metric_name': 'IQR',
    'metric_description': 'Feature Metric to compute IQR',
    'variable_count': 1,
    'variable_names': ['i_q_r'],
    'variable_types': [CONTINUOUS],
    'variable_dtypes': [FLOAT],
    'variable_dimensions': [0],
    'metric_data': [2],
    'metadata': {},
    'error': None
}
config: Dict[str, ConfigParameter]
classmethod create(config: Dict[str, ConfigParameter] | None = None) IQR

Factory Method to create an object. The configuration will be available in config.

Returns

An Instance of IQR.

get_required_shareable_feature_components() List[SFCMetaData]

Returns a list of Shareable Feature Components containing 1 SFC that is Quantiles SFC.

Returns

List of SFCMetadata, containing only 1 SFC i.e. QuantilesSFC

get_result(**kwargs: Any) Dict[str, Any]

Returns the IQR of data.

Returns

float: the IQR of the data.

get_standard_metric_result(**kwargs: Any) StandardMetricResult

Returns IQR Metric in Standard format.

Returns

StandardMetricResult: IQR Metric in Standard format.

mlm_insights.core.metrics.is_constant_feature module

class mlm_insights.core.metrics.is_constant_feature.IsConstantFeature(config: ~typing.Dict[str, ~mlm_insights.constants.definitions.ConfigParameter] = <factory>)

Bases: MetricBase

Constant Feature metric computes whether all the values are the same.
This metric returns is_constant as True when all the values within the feature are the same.
This is a Univariate, feature level metric which can process any column type and any data type.
This is an approximate metric.
Internally, it uses a sketch data structure with a default K value of 1024.

Configuration

None

Returns

is_constant: boolean
  • True if all values are the same

Example

import pandas as pd

from mlm_insights.builder.builder_component import MetricDetail, EngineDetail
from mlm_insights.builder.insights_builder import InsightsBuilder
from mlm_insights.constants.types import FeatureType, DataType, VariableType
from mlm_insights.core.metrics.is_constant_feature import IsConstantFeature
from mlm_insights.core.metrics.metric_metadata import MetricMetadata

df = pd.DataFrame({"Age": [1, 1, 1, 1]})
metric_details = MetricDetail(univariate_metric={"Age": [MetricMetadata(klass=IsConstantFeature)]}, dataset_metrics=[])
input_schema = {"Age": FeatureType(DataType.INTEGER, VariableType.NOMINAL)}

runner = InsightsBuilder() \
    .with_input_schema(input_schema) \
    .with_data_frame(data_frame=df) \
    .with_metrics(metrics=metric_details) \
    .with_engine(engine=EngineDetail(engine_name="native")) \
    .build()

profile_json = runner.run().profile.to_json()
feature_metrics = profile_json['feature_metrics']
print(feature_metrics["Age"])


{
  "IsConstantFeature": {
    "metric_name": "IsConstantFeature",
    "metric_description": "Feature Metric to compute if all values are same",
    "variable_count": 1,
    "variable_names": ["is_constant"],
    "variable_types": ["BINARY"],
    "variable_dtypes": ["BOOLEAN"],
    "variable_dimensions": [0],
    "metric_data": [true],
    "metadata": {},
    "error": null
  }
}
config: Dict[str, ConfigParameter]
classmethod create(config: Dict[str, ConfigParameter] | None = None) IsConstantFeature

Factory Method to create an object.

Returns

An Instance of IsConstantFeature Univariate Metric.

get_required_shareable_feature_components() List[SFCMetaData]

Returns list of SFCs required to compute IsConstantFeature Univariate Metric.

Returns

List: list of SFCs

get_result(**kwargs: Any) Dict[str, Any]

Returns IsConstantFeature Univariate Metric for the data using the FrequentItemsSFC.

Returns

boolean: IsConstantFeature Univariate Metric of the data.

get_standard_metric_result(**kwargs: Any) StandardMetricResult

Returns IsConstant Metric in Standard format.

Returns

StandardMetricResult: IsConstant Metric in Standard format.

mlm_insights.core.metrics.is_negative module

class mlm_insights.core.metrics.is_negative.IsNegative(config: ~typing.Dict[str, ~mlm_insights.constants.definitions.ConfigParameter] = <factory>)

Bases: MetricBase

This metric computes if the provided numerical feature has all negative values.
This is a Univariate, feature level metric which can process any column type and only numerical (int, float) data types.
This is an exact metric.

Configuration

None

Returns

is_negative: boolean
  • True if all values are negative (strictly less than zero)

  • False otherwise (including when all values are np.nan)

Examples

import pandas as pd

from mlm_insights.builder.builder_component import MetricDetail, EngineDetail
from mlm_insights.builder.insights_builder import InsightsBuilder
from mlm_insights.constants.types import FeatureType, DataType, VariableType
from mlm_insights.core.metrics.is_negative import IsNegative
from mlm_insights.core.metrics.metric_metadata import MetricMetadata

df = pd.DataFrame({"Age": [-1, -4, -6, -11]})
metric_details = MetricDetail(univariate_metric={"Age": [MetricMetadata(klass=IsNegative)]}, dataset_metrics=[])
input_schema = {"Age": FeatureType(DataType.INTEGER, VariableType.NOMINAL)}

runner = InsightsBuilder() \
    .with_input_schema(input_schema) \
    .with_data_frame(data_frame=df) \
    .with_metrics(metrics=metric_details) \
    .with_engine(engine=EngineDetail(engine_name="native")) \
    .build()

profile_json = runner.run().profile.to_json()
feature_metrics = profile_json['feature_metrics']
print(feature_metrics["Age"])

{
    "IsNegative": {
        "metric_name": "IsNegative",
        "metric_description": "Feature Metric to compute if all values are negative",
        "variable_count": 1,
        "variable_names": ["is_negative"],
        "variable_types": ["BINARY"],
        "variable_dtypes": ["BOOLEAN"],
        "variable_dimensions": [0],
        "metric_data": [true],
        "metadata": {},
        "error": null
    }
}
config: Dict[str, ConfigParameter]
classmethod create(config: Dict[str, ConfigParameter] | None = None) IsNegative

Create IsNegative metric

Returns

An Instance of IsNegative metric.

get_required_shareable_feature_components() List[SFCMetaData]

Returns a list of Shareable Feature Components containing 1 SFC that is Descriptive Statistics SFC.

Returns

List of SFCMetadata, containing only 1 SFC i.e. DescriptiveStatisticsSFC

get_result(**kwargs: Any) Dict[str, Any]

Returns result of IsNegative metric for the data.

Returns

boolean: result of IsNegative metric.

get_standard_metric_result(**kwargs: Any) StandardMetricResult

Returns IsNegative metric in Standard format.

Returns

StandardMetricResult: IsNegative metric in Standard format.

mlm_insights.core.metrics.is_non_zero module

class mlm_insights.core.metrics.is_non_zero.IsNonZero(config: ~typing.Dict[str, ~mlm_insights.constants.definitions.ConfigParameter] = <factory>, is_non_zero: bool = False, zero_count: int = 0)

Bases: MetricBase

This metric computes if the provided numerical feature has all non-zero values.
This is a Univariate, feature level metric which can process any column type and only numerical (int, float) data types.
This is an exact metric.

Configuration

None

Returns

is_non_zero: boolean
  • True if all values are non-zero (np.nan is treated as non-zero)

  • False otherwise

Examples

import pandas as pd

from mlm_insights.builder.builder_component import MetricDetail, EngineDetail
from mlm_insights.builder.insights_builder import InsightsBuilder
from mlm_insights.constants.types import FeatureType, DataType, VariableType
from mlm_insights.core.metrics.is_non_zero import IsNonZero
from mlm_insights.core.metrics.metric_metadata import MetricMetadata

df = pd.DataFrame({"Age": [-1, -4, -6, -11]})
metric_details = MetricDetail(univariate_metric={"Age": [MetricMetadata(klass=IsNonZero)]}, dataset_metrics=[])
input_schema = {"Age": FeatureType(DataType.INTEGER, VariableType.NOMINAL)}

runner = InsightsBuilder() \
    .with_input_schema(input_schema) \
    .with_data_frame(data_frame=df) \
    .with_metrics(metrics=metric_details) \
    .with_engine(engine=EngineDetail(engine_name="native")) \
    .build()

profile_json = runner.run().profile.to_json()
feature_metrics = profile_json['feature_metrics']
print(feature_metrics["Age"])

{
    "IsNonZero": {
        "metric_name": "IsNonZero",
        "metric_description": "Feature Metric to compute if all values are non-zero",
        "variable_count": 1,
        "variable_names": ["is_non_zero"],
        "variable_types": ["BINARY"],
        "variable_dtypes": ["BOOLEAN"],
        "variable_dimensions": [0],
        "metric_data": [true],
        "metadata": {},
        "error": null
    }
}
compute(column: Series, **kwargs: Any) None

Computes IsNonZero for the passed in dataset.

In case of a partitioned dataset, the value of IsNonZero for the specific partition is computed.

Parameters

column : pd.Series

Input column.

classmethod create(config: Dict[str, ConfigParameter] | None = None) IsNonZero

Create IsNonZero metric.

Returns

An Instance of IsNonZero metric.

get_result(**kwargs: Any) Dict[str, Any]

Returns result of IsNonZero metric for the data.

Returns

boolean: result of IsNonZero metric.

get_standard_metric_result(**kwargs: Any) StandardMetricResult

Returns IsNonZero metric in Standard format.

Returns

StandardMetricResult: IsNonZero metric in Standard format.

is_non_zero: bool = False
merge(other_metric: IsNonZero, **kwargs: Any) IsNonZero

Merge two IsNonZero metrics into one, without mutating the others.

Parameters

other_metric : IsNonZero

Other IsNonZero metric that needs to be merged.

Returns

IsNonZero

A new instance of IsNonZero

zero_count: int = 0

mlm_insights.core.metrics.is_positive module

class mlm_insights.core.metrics.is_positive.IsPositive(config: ~typing.Dict[str, ~mlm_insights.constants.definitions.ConfigParameter] = <factory>)

Bases: MetricBase

This metric computes if the provided numerical feature has all positive values.
This is a Univariate, feature level metric which can process any column type and only numerical (int, float) data types.
This is an exact metric.

Configuration

None

Returns

is_positive: boolean
  • True if all values are positive (strictly greater than zero)

  • False otherwise (including when all values are np.nan)

Examples

import pandas as pd

from mlm_insights.builder.builder_component import MetricDetail, EngineDetail
from mlm_insights.builder.insights_builder import InsightsBuilder
from mlm_insights.constants.types import FeatureType, DataType, VariableType
from mlm_insights.core.metrics.is_positive import IsPositive
from mlm_insights.core.metrics.metric_metadata import MetricMetadata

df = pd.DataFrame({"Age": [1, 4, 6, 1]})
metric_details = MetricDetail(univariate_metric={"Age": [MetricMetadata(klass=IsPositive)]}, dataset_metrics=[])
input_schema = {"Age": FeatureType(DataType.INTEGER, VariableType.NOMINAL)}

runner = InsightsBuilder() \
    .with_input_schema(input_schema) \
    .with_data_frame(data_frame=df) \
    .with_metrics(metrics=metric_details) \
    .with_engine(engine=EngineDetail(engine_name="native")) \
    .build()

profile_json = runner.run().profile.to_json()
feature_metrics = profile_json['feature_metrics']
print(feature_metrics["Age"])

{
    "IsPositive": {
        "metric_name": "IsPositive",
        "metric_description": "Feature Metric to compute if all values are positive",
        "variable_count": 1,
        "variable_names": ["is_positive"],
        "variable_types": ["BINARY"],
        "variable_dtypes": ["BOOLEAN"],
        "variable_dimensions": [0],
        "metric_data": [true],
        "metadata": {},
        "error": null
    }
}
config: Dict[str, ConfigParameter]
classmethod create(config: Dict[str, ConfigParameter] | None = None) IsPositive

Create IsPositive metric

Returns

An Instance of IsPositive metric.

get_required_shareable_feature_components() List[SFCMetaData]

Returns a list of Shareable Feature Components containing 1 SFC that is Descriptive Statistics SFC.

Returns

List of SFCMetadata, containing only 1 SFC i.e. DescriptiveStatisticsSFC

get_result(**kwargs: Any) Dict[str, Any]

Returns result of IsPositive metric for the data.

Returns

boolean: result of IsPositive metric.

get_standard_metric_result(**kwargs: Any) StandardMetricResult

Returns IsPositive metric in Standard format.

Returns

StandardMetricResult: IsPositive metric in Standard format.

mlm_insights.core.metrics.is_quasi_constant_feature module

class mlm_insights.core.metrics.is_quasi_constant_feature.IsQuasiConstantFeature(config: ~typing.Dict[str, ~mlm_insights.constants.definitions.ConfigParameter] = <factory>, quasi_constant_threshold: float = 0.99)

Bases: MetricBase

Quasi Constant Feature metric computes whether almost all the values are the same.
This metric returns is_quasi_constant as True when a single value occurs with a frequency at or above the Quasi Constant Threshold.
This is a Univariate, feature level metric which can process any column type and any data type.
This is an approximate metric.
Internally, it uses a sketch data structure with a default K value of 1024.

Configuration

quasi_constant_threshold: float, default=0.99
  • Quasi Constant Threshold value; if the most frequent value's count percentage is >= this threshold, the feature is considered a Quasi Constant Feature

Returns

is_quasi_constant: boolean
  • True if almost all values are the same

Example

from mlm_insights.builder.builder_component import MetricDetail, EngineDetail
from mlm_insights.builder.insights_builder import InsightsBuilder
from mlm_insights.constants.types import FeatureType, DataType, VariableType
from mlm_insights.core.metrics.is_quasi_constant_feature import IsQuasiConstantFeature
from mlm_insights.core.metrics.metric_metadata import MetricMetadata
import pandas as pd

df = pd.DataFrame({"Age": [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2]})
metric_details = MetricDetail(univariate_metric={"Age": [MetricMetadata(klass=IsQuasiConstantFeature,
                                                                        config={"quasi_constant_threshold":0.8})]},
                              dataset_metrics=[])
input_schema = {"Age": FeatureType(DataType.INTEGER, VariableType.NOMINAL)}

runner = InsightsBuilder() \
    .with_input_schema(input_schema) \
    .with_data_frame(data_frame=df) \
    .with_metrics(metrics=metric_details) \
    .with_engine(engine=EngineDetail(engine_name="native")) \
    .build()

profile_json = runner.run().profile.to_json()
feature_metrics = profile_json['feature_metrics']
print(feature_metrics["Age"])


{
  "IsQuasiConstantFeature": {
    "metric_name": "IsQuasiConstantFeature",
    "metric_description": "Feature Metric to compute if all values are almost same",
    "variable_count": 1,
    "variable_names": ["is_quasi_constant"],
    "variable_types": ["BINARY"],
    "variable_dtypes": ["BOOLEAN"],
    "variable_dimensions": [0],
    "metric_data": [true],
    "metadata": {},
    "error": null
  }
}
classmethod create(config: Dict[str, ConfigParameter] | None = None) IsQuasiConstantFeature

Factory Method to create an object.

Returns

An Instance of IsQuasiConstantFeature Univariate Metric. The configuration will be available in config.

get_required_shareable_feature_components() List[SFCMetaData]

Returns list of SFCs required to compute IsQuasiConstantFeature Univariate Metric.

Returns

List: list of SFCs

get_result(**kwargs: Any) Dict[str, Any]

Returns IsQuasiConstantFeature Univariate Metric for the data using the FrequentItemsSFC.

Returns

boolean: IsQuasiConstantFeature Univariate Metric of the data.

get_standard_metric_result(**kwargs: Any) StandardMetricResult

Returns IsQuasiConstantFeature Metric in Standard format.

Returns

StandardMetricResult: IsQuasiConstantFeature Metric in Standard format.

merge(other_metric: IsQuasiConstantFeature, **kwargs: Any) IsQuasiConstantFeature

Merge two IsQuasiConstantFeature metrics into one, without mutating either.

Parameters

other_metric : IsQuasiConstantFeature

Other IsQuasiConstantFeature metric that needs to be merged.

Returns

IsQuasiConstantFeature

A new instance of IsQuasiConstantFeature after merging.

quasi_constant_threshold: float = 0.99

mlm_insights.core.metrics.kurtosis module

class mlm_insights.core.metrics.kurtosis.Kurtosis(config: ~typing.Dict[str, ~mlm_insights.constants.definitions.ConfigParameter] = <factory>)

Bases: MetricBase

This metric calculates the Kurtosis of a single numerical data column.
This is a Univariate, feature level metric which can process any column type and only numerical (int, float) data types.
This is an exact metric.
Mathematically: central_moments[i] = sum((x - mean)^i) / N
Excess kurtosis is derived from the 4th central moment, normalised by the square of the 2nd central moment.

Configuration

None

Returns

kurtosis: float
  • Kurtosis of the data; returns None if no data is present

Examples

from mlm_insights.builder.builder_component import MetricDetail, EngineDetail
from mlm_insights.builder.insights_builder import InsightsBuilder
from mlm_insights.constants.types import FeatureType, DataType, VariableType, ColumnType
from mlm_insights.core.metrics.kurtosis import Kurtosis
from mlm_insights.core.metrics.metric_metadata import MetricMetadata
import pandas as pd

def main():
    input_schema = {
        'square_feet': FeatureType(
            data_type=DataType.FLOAT,
            variable_type=VariableType.CONTINUOUS,
            column_type=ColumnType.INPUT)
    }
    data_frame = pd.DataFrame({'square_feet': [11.23, 23.45, 11.23, 45.56, 11.23]})
    metric_details = MetricDetail(univariate_metric=
                                  {"square_feet": [MetricMetadata(klass=Kurtosis)]},
                                  dataset_metrics=[])

    runner = InsightsBuilder() \
        .with_input_schema(input_schema) \
        .with_data_frame(data_frame=data_frame) \
        .with_metrics(metrics=metric_details) \
        .with_engine(engine=EngineDetail(engine_name="native")) \
        .build()

    profile_json = runner.run().profile.to_json()
    feature_metrics = profile_json['feature_metrics']
    print(feature_metrics['square_feet']["Kurtosis"])
    # {'value': -0.4098628688922519}
if __name__ == "__main__":
    main()

Returns the standard metric result as:
{
    'metric_name': 'Kurtosis',
    'metric_description': 'Feature Metric to compute Kurtosis',
    'variable_count': 1,
    'variable_names': ['kurtosis'],
    'variable_types': [CONTINUOUS],
    'variable_dtypes': [FLOAT],
    'variable_dimensions': [0],
    'metric_data': [0.45],
    'metadata': {},
    'error': None
}
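
For reference, the value in the example above is consistent with the biased sample excess kurtosis, m4 / m2**2 - 3, built from the central moments defined above; a minimal NumPy sketch (NumPy is used here only for illustration and is not required by the metric):

import numpy as np

x = np.array([11.23, 23.45, 11.23, 45.56, 11.23])
mean = x.mean()
m2 = ((x - mean) ** 2).mean()    # 2nd central moment
m4 = ((x - mean) ** 4).mean()    # 4th central moment
print(m4 / m2 ** 2 - 3)          # ~ -0.4099, matching the Kurtosis example output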
config: Dict[str, ConfigParameter]
classmethod create(config: Dict[str, ConfigParameter] | None = None) Kurtosis

Factory Method to create an object. The configuration will be available in config.

Returns

An Instance of Kurtosis.

get_required_shareable_feature_components() List[SFCMetaData]

Returns a list of Shareable Feature Components containing 1 SFC that is Descriptive Statistics SFC.

Returns

List of SFCMetadata, containing only 1 SFC i.e. DescriptiveStatisticsSFC

get_result(**kwargs: Any) Dict[str, Any]

Returns Excess Kurtosis of data.

Returns

float: Excess Kurtosis of the data.

get_standard_metric_result(**kwargs: Any) StandardMetricResult

Returns Standard Metric for Kurtosis.

Returns

StandardMetricResult: Kurtosis Metric in standard format.

mlm_insights.core.metrics.max module

class mlm_insights.core.metrics.max.Max(config: ~typing.Dict[str, ~mlm_insights.constants.definitions.ConfigParameter] = <factory>)

Bases: MetricBase

This metric calculates Max of a single numerical data column
This is a Univariate, feature level metric which can process any column type and only numerical (int, float) data types.
This is an exact metric

Configuration

None

Returns

max: float
  • Maximum of the data; returns None if no data is present

Examples

import pandas as pd

from mlm_insights.builder.builder_component import MetricDetail, EngineDetail
from mlm_insights.builder.insights_builder import InsightsBuilder
from mlm_insights.constants.types import FeatureType, DataType, VariableType
from mlm_insights.core.metrics.max import Max
from mlm_insights.core.metrics.metric_metadata import MetricMetadata
df = pd.DataFrame({"Age": [1, 4, 6, 1]})
metric_details = MetricDetail(univariate_metric={"Age": [MetricMetadata(klass=Max)]}, dataset_metrics=[])
input_schema = {"Age": FeatureType(DataType.INTEGER, VariableType.NOMINAL)}

runner = InsightsBuilder() \
    .with_input_schema(input_schema) \
    .with_data_frame(data_frame=df) \
    .with_metrics(metrics=metric_details) \
    .with_engine(engine=EngineDetail(engine_name="native")) \
    .build()

profile_json = runner.run().profile.to_json()
feature_metrics = profile_json['feature_metrics']
print(feature_metrics["Age"])

{
    "Max": {
        "metric_name": "Max",
        "metric_description": "Feature Metric to compute maximum value",
        "variable_count": 1,
        "variable_names": ["maximum"],
        "variable_types": ["CONTINUOUS"],
        "variable_dtypes": ["FLOAT"],
        "variable_dimensions": [0],
        "metric_data": [6.0],
        "metadata": {},
        "error": null
    }
}
config: Dict[str, ConfigParameter]
classmethod create(config: Dict[str, ConfigParameter] | None = None) Max

Factory Method to create an object. The configuration will be available in config.

Returns

Max

An Instance of Max.

get_required_shareable_feature_components() List[SFCMetaData]

Returns a list of Shareable Feature Components containing 1 SFC that is Descriptive Statistics SFC.

Returns

List of SFCMetadata, containing only 1 SFC i.e. DescriptiveStatisticsSFC

get_result(**kwargs: Any) Dict[str, Any]

Returns maximum of input data.

Returns

float: maximum of the data.

get_standard_metric_result(**kwargs: Any) StandardMetricResult

Returns Maximum Metric in Standard format.

Returns

StandardMetricResult: Maximum Metric in Standard format.

mlm_insights.core.metrics.mean module

class mlm_insights.core.metrics.mean.Mean(config: ~typing.Dict[str, ~mlm_insights.constants.definitions.ConfigParameter] = <factory>)

Bases: MetricBase

This metric calculates Mean of a single numerical data column
This is a Univariate, feature level metric which can process any column type and only numerical (int, float) data types.
This is an exact metric

Configuration

None

Returns

mean: float
  • Mean of the data; returns None if no data is present

Examples

from mlm_insights.builder.builder_component import MetricDetail, EngineDetail
from mlm_insights.builder.insights_builder import InsightsBuilder
from mlm_insights.constants.types import FeatureType, DataType, VariableType, ColumnType
from mlm_insights.core.metrics.mean import Mean
from mlm_insights.core.metrics.metric_metadata import MetricMetadata
import pandas as pd

def main():
    input_schema = {
        'square_feet': FeatureType(
            data_type=DataType.FLOAT,
            variable_type=VariableType.CONTINUOUS,
            column_type=ColumnType.INPUT)
    }
    data_frame = pd.DataFrame({'square_feet': [11.23, 23.45, 11.23, 45.56, 11.23]})
    metric_details = MetricDetail(univariate_metric=
                                  {"square_feet": [MetricMetadata(klass=Mean)]},
                                  dataset_metrics=[])

    runner = InsightsBuilder() \
        .with_input_schema(input_schema) \
        .with_data_frame(data_frame=data_frame) \
        .with_metrics(metrics=metric_details) \
        .with_engine(engine=EngineDetail(engine_name="native")) \
        .build()

    profile_json = runner.run().profile.to_json()
    feature_metrics = profile_json['feature_metrics']
    print(feature_metrics['square_feet']["Mean"])
    # {'value': 20.54}
if __name__ == "__main__":
    main()

Returns the standard metric result as:
{
    'metric_name': 'Mean',
    'metric_description': 'Feature Metric to compute Mean',
    'variable_count': 1,
    'variable_names': ['mean'],
    'variable_types': [CONTINUOUS],
    'variable_dtypes': [FLOAT],
    'variable_dimensions': [0],
    'metric_data': [20.54],
    'metadata': {},
    'error': None
}
config: Dict[str, ConfigParameter]
classmethod create(config: Dict[str, ConfigParameter] | None = None) Mean

Factory Method to create an object. The configuration will be available in config.

Returns

An Instance of Mean.

get_required_shareable_feature_components() List[SFCMetaData]

Returns list of SFCs required to compute mean metric.

Returns

List: list of SFCs

get_result(**kwargs: Any) Dict[str, Any]

Returns mean metric for the data using the DescriptiveStatisticsSFC.

Returns

float: mean of the data.

get_standard_metric_result(**kwargs: Any) StandardMetricResult

Returns Standard Metric for Mean.

Returns

StandardMetricResult: Mean Metric in standard format.

mlm_insights.core.metrics.metric_metadata module

class mlm_insights.core.metrics.metric_metadata.MetricMetadata(klass: ~typing.Type[~typing.Any], config: ~typing.Dict[str, ~typing.Any] = <factory>)

Bases: object

Represents metric metadata used to define and configure a metric

config: Dict[str, Any]
get_key() str

Returns a key which uniquely identifies this metric. Since a metric can be added only once to a feature/profile, the key contains only the name, which uniquely identifies the metric

klass: Type[Any]
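
A small illustrative usage based only on the behaviour described above; the exact key string is not specified here beyond being the name that uniquely identifies the metric:

from mlm_insights.core.metrics.count import Count
from mlm_insights.core.metrics.metric_metadata import MetricMetadata

metadata = MetricMetadata(klass=Count, config={})
# Per the description above, the key contains the name that uniquely identifies the metric
print(metadata.get_key())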

mlm_insights.core.metrics.metric_registry module

class mlm_insights.core.metrics.metric_registry.MetricRegistry

Bases: object

add_metric(metric_metadata: MetricMetadata, **kwargs: Any) MetricRegistry
static create_from_metrics_map(metrics_map: Dict[str, MetricBase]) MetricRegistry

Factory method to create Metric Registry using Metric Map. Use this method to create metric registry directly from the metric map.

Parameters

metrics_map : Dict[str, MetricBase]

Dictionary of metrics, keyed by metric hash, with MetricBase instances as values.

classmethod deserialize(metric_registry_message: MetricRegistryMessage) MetricRegistry
get_metric(metric_metadata: MetricMetadata) MetricBase
get_metrics() Any
get_metrics_map() Dict[str, MetricBase]
serialize() MetricRegistryMessage

mlm_insights.core.metrics.metric_result module

class mlm_insights.core.metrics.metric_result.MetricResultJSONEncoder(*, skipkeys=False, ensure_ascii=True, check_circular=True, allow_nan=True, sort_keys=False, indent=None, separators=None, default=None)

Bases: JSONEncoder

default(obj: Any) Any

Implement this method in a subclass such that it returns a serializable object for o, or calls the base implementation (to raise a TypeError).

For example, to support arbitrary iterators, you could implement default like this:

def default(self, o):
    try:
        iterable = iter(o)
    except TypeError:
        pass
    else:
        return list(iterable)
    # Let the base class default method raise the TypeError
    return JSONEncoder.default(self, o)
class mlm_insights.core.metrics.metric_result.StandardMetricResult(metric_name: str = '', metric_description: str = '', variable_count: int = 0, variable_names: List[str] = <factory>, variable_types: List[mlm_insights.constants.types.VariableType] = <factory>, variable_dtypes: List[mlm_insights.constants.types.DataType] = <factory>, variable_dimensions: List[int] = <factory>, metric_data: List[Any] = <factory>, metadata: Dict[str, str] = <factory>, error: Union[str, NoneType] = None)

Bases: object

error: str | None = None
static get_metric_result_with_error(name: str, description: str, error: str) StandardMetricResult
has_error() bool
metadata: Dict[str, str]
metric_data: List[Any]
metric_description: str = ''
metric_name: str = ''
to_dict() Dict[str, Any]
to_json() str
variable_count: int = 0
variable_dimensions: List[int]
variable_dtypes: List[DataType]
variable_names: List[str]
variable_types: List[VariableType]
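
A hedged sketch of constructing a StandardMetricResult by hand and serialising it, using only the fields and defaults listed above (field values are illustrative):

from mlm_insights.constants.types import DataType, VariableType
from mlm_insights.core.metrics.metric_result import StandardMetricResult

result = StandardMetricResult(
    metric_name="Count",
    metric_description="Illustrative result",
    variable_count=1,
    variable_names=["total_count"],
    variable_types=[VariableType.CONTINUOUS],
    variable_dtypes=[DataType.INTEGER],
    variable_dimensions=[0],
    metric_data=[6],
    metadata={},
)
print(result.has_error())  # False, since no error was set
print(result.to_json())    # JSON string of the fields above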

mlm_insights.core.metrics.min module

class mlm_insights.core.metrics.min.Min(config: ~typing.Dict[str, ~mlm_insights.constants.definitions.ConfigParameter] = <factory>)

Bases: MetricBase

This metric calculates Min of a single numerical data column
This is a Univariate, feature level metric which can process any column type and only numerical (int, float) data types.
This is an exact metric

Configuration

None

Returns

min: float
  • Minimum of the data

Examples

import pandas as pd

from mlm_insights.builder.builder_component import MetricDetail, EngineDetail
from mlm_insights.builder.insights_builder import InsightsBuilder
from mlm_insights.constants.types import FeatureType, DataType, VariableType
from mlm_insights.core.metrics.metric_metadata import MetricMetadata
from mlm_insights.core.metrics.min import Min

df = pd.DataFrame({"Age": [1, 4, 6, 1]})
metric_details = MetricDetail(univariate_metric={"Age": [MetricMetadata(klass=Min)]}, dataset_metrics=[])
input_schema = {"Age": FeatureType(DataType.INTEGER, VariableType.NOMINAL)}

runner = InsightsBuilder() \
    .with_input_schema(input_schema) \
    .with_data_frame(data_frame=df) \
    .with_metrics(metrics=metric_details) \
    .with_engine(engine=EngineDetail(engine_name="native")) \
    .build()

profile_json = runner.run().profile.to_json()
feature_metrics = profile_json['feature_metrics']
print(feature_metrics["Age"])

{
  "Min": {
    "metric_name": "Min",
    "metric_description": "Feature Metric to compute minimum value",
    "variable_count": 1,
    "variable_names": ["minimum"],
    "variable_types": ["CONTINUOUS"],
    "variable_dtypes": ["FLOAT"],
    "variable_dimensions": [0],
    "metric_data": [1.0],
    "metadata": {},
    "error": null
  }
}
config: Dict[str, ConfigParameter]
classmethod create(config: Dict[str, ConfigParameter] | None = None) Min

Factory Method to create an object. The configuration will be available in config.

Returns

An Instance of Min.

get_required_shareable_feature_components() List[SFCMetaData]

Returns a list of Shareable Feature Components containing 1 SFC that is Descriptive Statistics SFC.

Returns

List of SFCMetadata, containing only 1 SFC i.e. DescriptiveStatisticsSFC

get_result(**kwargs: Any) Dict[str, Any]

Returns minimum of input data.

Returns

float: minimum of the data.

get_standard_metric_result(**kwargs: Any) StandardMetricResult

Returns Minimum Metric in Standard format.

Returns

StandardMetricResult: Minimum Metric in Standard format.

mlm_insights.core.metrics.mode module

class mlm_insights.core.metrics.mode.Mode(config: ~typing.Dict[str, ~mlm_insights.constants.definitions.ConfigParameter] = <factory>, _lg_max_k: int = 10)

Bases: MetricBase

This metric calculates the Mode for the given data column. It returns the most frequently occurring item as the mode. In bi-modal or multi-modal cases, two modes are returned.
This is a feature level metric which can process both numerical and categorical data types.
This is an approximate metric which uses a Frequent Items Sketch to calculate the most frequently occurring item(s).
This metric handles NaN values by dropping them from the given data column

The Frequent Items Sketch is initialized with a maxMapSize that specifies the maximum physical length of the internal hash map of the form (<T> item, long count). The maxMapSize must be a power of 2. If fewer than 0.75 * maxMapSize different items are inserted into the sketch the estimated frequencies returned by the sketch will be exact, hence exact mode will be returned. Otherwise, items are returned with their estimated frequencies and mode will be approximate.

NOTE: In case the metric result doesn’t contain any output, then the user will need to tweak the maxMapSize by providing a higher value for ‘lg_max_k’.

Please refer here for more details: https://datasketches.apache.org/docs/Frequency/FrequentItemsOverview.html

Configuration

lg_max_k: int, default=10
  • log of max map size (max map size = 2^lg_max_k). So, with default value of lg_max_k as 10, map size used by Frequent items sketch will be 2^10 = 1024

Returns

mode: List[String]
  • The mode of the given data column. In bi-modal or multi-modal cases, two modes are returned.

Exceptions

  • InvalidParameterException - in case lg_max_k < 7 or lg_max_k > 21

Examples

from mlm_insights.builder.builder_component import MetricDetail
from mlm_insights.core.metrics.metric_metadata import MetricMetadata
from mlm_insights.core.metrics.mode import Mode

# To declare the Mode metric, without any config
metric_details = MetricDetail(univariate_metric={"feature_name": [MetricMetadata(klass=Mode)]},
                              dataset_metrics=[])
# To declare the Mode metric, along with config options
metric_details = MetricDetail(univariate_metric={"feature_name": [MetricMetadata(klass=Mode, config={"lg_max_k": 12})]},
                              dataset_metrics=[])

Returns the standard metric result as:
{
    'metric_name': 'Mode',
    'metric_description': 'Mode',
    'variable_count': 1,
    'variable_names': ['mode'],
    'variable_types': [NOMINAL],
    'variable_dtypes': [STRING],
    'variable_dimensions': [1],
    'metric_data': [['1', '3']],
    'metadata': {},
    'error': None
}
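
For completeness, an end-to-end run following the same builder pattern used by the other metrics in this module might look like the sketch below (data values are illustrative; the printed result follows the standard format above):

import pandas as pd

from mlm_insights.builder.builder_component import MetricDetail, EngineDetail
from mlm_insights.builder.insights_builder import InsightsBuilder
from mlm_insights.constants.types import FeatureType, DataType, VariableType
from mlm_insights.core.metrics.metric_metadata import MetricMetadata
from mlm_insights.core.metrics.mode import Mode

df = pd.DataFrame({"Age": [1, 3, 3, 1, 2]})
metric_details = MetricDetail(univariate_metric={"Age": [MetricMetadata(klass=Mode)]}, dataset_metrics=[])
input_schema = {"Age": FeatureType(DataType.INTEGER, VariableType.NOMINAL)}

runner = InsightsBuilder() \
    .with_input_schema(input_schema) \
    .with_data_frame(data_frame=df) \
    .with_metrics(metrics=metric_details) \
    .with_engine(engine=EngineDetail(engine_name="native")) \
    .build()

profile_json = runner.run().profile.to_json()
print(profile_json['feature_metrics']["Age"]["Mode"])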
classmethod create(config: Dict[str, ConfigParameter] | None = None) Mode

Factory Method to create an object. The configuration will be available in config.

Returns

An Instance of Mode.

get_required_shareable_feature_components() List[SFCMetaData]

Returns list of SFCs required to compute Mode metric.

Returns

List: list of SFCs

get_result(**kwargs: Any) Dict[str, Any]

Returns Mode metric for the data using the FrequentItemsSFC.

Returns

List[str]: the most frequent item(s) in the data.

get_standard_metric_result(**kwargs: Any) StandardMetricResult

Returns Mode Metric in Standard format.

Returns

StandardMetricResult: Mode Metric in Standard format.

merge(other_metric: Mode, **kwargs: Any) Mode

Merge two Mode Metrics into one, without mutating the others.

Parameters

other_metric : Mode

Other Mode that needs to be merged.

Returns

Mode

A new instance of Mode after merging.

mlm_insights.core.metrics.percentiles module

class mlm_insights.core.metrics.percentiles.Percentiles(config: ~typing.Dict[str, ~mlm_insights.constants.definitions.ConfigParameter] = <factory>, percentile_values: ~typing.List[str] = <factory>)

Bases: MetricBase

This metric calculates the user-provided percentiles of the given data column.
This is a feature level metric which can process only numerical (int, float) data types.
This is an approximate metric. Internally, it uses a KLL sketch data structure with a default k value of 200
This metric handles NaN values by dropping them from the given data column

Configuration

percentile_values: List[str]

Each percentile to be computed is given in the form p<percentile>. For example, [p5, p95] computes the 5th and 95th percentile values

Returns

  • All user configured percentile values in Insights metrics standard format

Examples

import pandas as pd

from mlm_insights.builder.builder_component import MetricDetail, EngineDetail
from mlm_insights.builder.insights_builder import InsightsBuilder
from mlm_insights.constants.types import FeatureType, DataType, VariableType, ColumnType
from mlm_insights.core.metrics.percentiles import Percentiles
from mlm_insights.core.metrics.metric_metadata import MetricMetadata


def main():
    input_schema = {
        'square_feet': FeatureType(
            data_type=DataType.FLOAT,
            variable_type=VariableType.CONTINUOUS,
            column_type=ColumnType.INPUT)
    }
    data_frame = pd.DataFrame({'square_feet': [3, 5, 1, 7, 8, 4, 9]})
    # Configure which percentiles to compute; these match the standard result shown below
    metric_details = MetricDetail(univariate_metric=
                                  {"square_feet": [MetricMetadata(
                                      klass=Percentiles,
                                      config={"percentile_values": ["p5", "p35", "p95"]})]},
                                  dataset_metrics=[])

    runner = InsightsBuilder(). \
        with_input_schema(input_schema). \
        with_data_frame(data_frame=data_frame). \
        with_metrics(metrics=metric_details). \
        with_engine(engine=EngineDetail(engine_name="native")). \
        build()

    profile_json = runner.run().profile.to_json()
    feature_metrics = profile_json['feature_metrics']
    print(feature_metrics['square_feet']["Percentiles"])

    if __name__ == "__main__":
        main()

    Returns the standard metric result as:
    {
        metric_name: 'Percentiles',
        metric_description: 'Feature Metric to compute user-provided percentile values',
        variable_count: 3,
        variable_names: ['p5', 'p35', 'p95'],
        variable_types: [CONTINUOUS, CONTINUOUS, CONTINUOUS],
        variable_dtypes: [FLOAT, FLOAT, FLOAT],
        variable_dimensions: [0, 0, 0],
        metric_data=[3.0, 5.0, 8.0],
        metadata={},
        error=None
    }
classmethod create(config: Dict[str, ConfigParameter] | None = None) Percentiles

Factory Method to create an object. The configuration will be available in config.

Returns

An Instance of Percentiles.

get_percentile_value_from_sfc(percentile_rank: float, quantile_sfc: QuantilesSFC) float | None
get_required_shareable_feature_components() List[SFCMetaData]

Returns list of SFCs required to compute percentile metric.

Returns

List: list of SFCs

get_result(**kwargs: Any) Dict[str, Any]

Returns Percentiles metric for the data using the QuantilesSFC.

Returns

Json object: percentiles of the data.

get_standard_metric_result(**kwargs: Any) StandardMetricResult

This method returns metric output in standard format.

Returns

StandardMetricResult

percentile_values: List[str]

mlm_insights.core.metrics.performance_metrics_utils module

class mlm_insights.core.metrics.performance_metrics_utils.PerformanceMetricMeta(target_column_name: str = 'y_true', prediction_column_name: str = 'y_predict', target_series: pandas.core.series.Series = None, prediction_series: pandas.core.series.Series = None, prediction_score_column_name: str = 'y_score', prediction_score_series: pandas.core.series.Series = None)

Bases: object

prediction_column_name: str = 'y_predict'
prediction_score_column_name: str = 'y_score'
prediction_score_series: Series = None
prediction_series: Series = None
target_column_name: str = 'y_true'
target_series: Series = None
class mlm_insights.core.metrics.performance_metrics_utils.PredictionTargetColumnMapping(target_column: str = 'y_true', prediction_column: str = 'y_predict', prediction_score_column: str = 'y_score')

Bases: object

prediction_column: str = 'y_predict'
prediction_score_column: str = 'y_score'
target_column: str = 'y_true'
mlm_insights.core.metrics.performance_metrics_utils.get_column(dataset: DataFrame, column_name: str) Series
mlm_insights.core.metrics.performance_metrics_utils.get_target_prediction_columns(features_metadata: Dict[str, FeatureMetadata]) PredictionTargetColumnMapping
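
PredictionTargetColumnMapping and PerformanceMetricMeta are plain data holders used to wire target and prediction columns into performance metrics. A minimal sketch of how the default column names line up is shown below; the overridden score-column name is a hypothetical value, and get_target_prediction_columns is omitted because it requires a FeatureMetadata mapping defined elsewhere.

from mlm_insights.core.metrics.performance_metrics_utils import PredictionTargetColumnMapping

# Defaults: target column 'y_true', prediction column 'y_predict', score column 'y_score'.
# Any of them can be overridden via keyword arguments ('churn_probability' is hypothetical).
mapping = PredictionTargetColumnMapping(prediction_score_column="churn_probability")
print(mapping.target_column, mapping.prediction_column, mapping.prediction_score_column)
# y_true y_predict churn_probability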

mlm_insights.core.metrics.probablity_distribution module

class mlm_insights.core.metrics.probablity_distribution.ProbabilityDistribution(config: ~typing.Dict[str, ~mlm_insights.constants.definitions.ConfigParameter] = <factory>, bins: str | int | ~typing.List[float] = 'sturges')

Bases: MetricBase

This metric calculates the Probability Distribution of a single data column. The Probability Distribution of a Random Variable (X) shows how the probabilities of events are distributed over the different values of the Random Variable.
This is a feature level metric which can process only numerical data types.
This is an approximate metric. Internally, it uses a KLL sketch data structure with a default k value of 200
This metric handles NaN values by dropping them from the given data column

Configuration

bins: Union[str, int, List[float]], default='sturges'
One of the following values:
  • Number of bins (int)

  • Binning algorithm name (str). 'sturges' is the default and currently the only supported algorithm; other algorithms will be supported in the future

  • Explicit list of bin edges (List[float])

Returns

bins: List[float]

bins of the data.

density: List[float]

Density/probabilities of occurrence for the respective bins

Example

# To declare ProbabilityDistribution metric, without any config
metric_details = MetricDetail(univariate_metric= {"feature_name": [MetricMetadata(klass=ProbabilityDistribution)
                              ]}, dataset_metrics=[])
# To declare ProbabilityDistribution metric, along with config options
metric_details = MetricDetail(univariate_metric= {"feature_name": [MetricMetadata(klass=ProbabilityDistribution,
                              config={"bins":10})]}, dataset_metrics=[])

Returns the standard metric result as:
{
    "metric_name": "ProbabilityDistribution",
    "metric_description": "Feature Metric to compute probability density",
    "variable_count": 2,
    "variable_names": ['bins', 'density'],
    "variable_types": [CONTINUOUS, CONTINUOUS],
    "variable_dtypes": [FLOAT, FLOAT],
    "variable_dimensions": [1, 1],
    "metric_data": [[1.0, 1.5, 2.0], [0.5, 0.5]],
    "metadata": {},
    "error": null
}
bins: str | int | List[float] = 'sturges'
classmethod create(config: Dict[str, ConfigParameter] | None = None) ProbabilityDistribution

Factory Method to create an object. The configuration will be available in config.

Returns

ProbabilityDistribution

An Instance of ProbabilityDistribution.

get_required_shareable_feature_components() List[SFCMetaData]

Returns list of SFCs required to compute PDF metric.

Returns

List[SFCMetaData]

list of SFCs

get_result(**kwargs: Any) Dict[str, Any]

Returns Probability Distribution for the data using the QuantilesSFC.

Returns

Dict

Probability Distribution for the data.

get_standard_metric_result(**kwargs: Any) StandardMetricResult

Returns Standard Metric for Probability Distribution.

Returns

StandardMetricResult: Probability Distribution Metric in standard format.

merge(other_metric: ProbabilityDistribution, **kwargs: Any) ProbabilityDistribution

Merge two ProbabilityDistribution metrics into one, without mutating either.

Parameters

other_metric : ProbabilityDistribution

Other ProbabilityDistribution that needs to be merged.

Returns

ProbabilityDistribution

A new instance of Probability Distribution after merging.

mlm_insights.core.metrics.quartiles module

class mlm_insights.core.metrics.quartiles.Quartiles(config: ~typing.Dict[str, ~mlm_insights.constants.definitions.ConfigParameter] = <factory>)

Bases: MetricBase

This metric calculates the Quartiles (Q1, Q2, Q3) of the given data column.
This is a feature level metric which can process only numerical (int, float) data types.
This is an approximate metric. Internally, it uses a KLL sketch data structure with a default k value of 200
This metric handles NaN values by dropping them from the given data column

Configuration

None

Returns

  • Q1: Lower quartile (25th percentile), the value halfway between the minimum and the median.

  • Q2: Second quartile (50th percentile, also known as the median), the middle value of the data.

  • Q3: Upper quartile (75th percentile), the value halfway between the median and the maximum.

Examples

import pandas as pd

from mlm_insights.builder.builder_component import MetricDetail, EngineDetail
from mlm_insights.builder.insights_builder import InsightsBuilder
from mlm_insights.constants.types import FeatureType, DataType, VariableType, ColumnType
from mlm_insights.core.metrics.quartiles import Quartiles
from mlm_insights.core.metrics.metric_metadata import MetricMetadata


def main():
    input_schema = {
        'square_feet': FeatureType(
            data_type=DataType.FLOAT,
            variable_type=VariableType.CONTINUOUS,
            column_type=ColumnType.INPUT)
    }
    data_frame = pd.DataFrame({'square_feet': [3, 5, 1, 7, 8, 4, 9]})
    metric_details = MetricDetail(univariate_metric=
                                    {"square_feet": [MetricMetadata(klass=Quartiles)]},
                                    dataset_metrics=[])

    runner = InsightsBuilder(). \
        with_input_schema(input_schema). \
        with_data_frame(data_frame=data_frame). \
        with_metrics(metrics=metric_details). \
        with_engine(engine=EngineDetail(engine_name="native")). \
        build()

    profile_json = runner.run().profile.to_json()
    feature_metrics = profile_json['feature_metrics']
    print(feature_metrics['square_feet']["Quartiles"])
    # {'q1': 3.0, 'q2': 5.0, 'q3': 8.0}

    if __name__ == "__main__":
        main()

    Returns the standard metric result as:
    {
        metric_name: 'Quartiles',
        metric_description: 'Feature Metric to compute Quartiles (Q1, Q2, Q3)',
        variable_count: 3,
        variable_names: ['q1', 'q2', 'q3'],
        variable_types: [CONTINUOUS, CONTINUOUS, CONTINUOUS],
        variable_dtypes: [FLOAT, FLOAT, FLOAT],
        variable_dimensions: [0, 0, 0],
        metric_data=[3.0, 5.0, 8.0],
        metadata={},
        error=None
    }
config: Dict[str, ConfigParameter]
classmethod create(config: Dict[str, ConfigParameter] | None = None) Quartiles

Factory Method to create an object. The configuration will be available in config.

Returns

An Instance of Quartiles.

get_required_shareable_feature_components() List[SFCMetaData]

Returns list of SFCs required to compute quartiles metric.

Returns

List: list of SFCs

get_result(**kwargs: Any) Dict[str, Any]

Returns Quartiles metric for the data using the QuantilesSFC.

Returns

Json object: quartiles of the data.

get_standard_metric_result(**kwargs: Any) StandardMetricResult

This method returns metric output in standard format.

Returns

StandardMetricResult

mlm_insights.core.metrics.range module

class mlm_insights.core.metrics.range.Range(config: ~typing.Dict[str, ~mlm_insights.constants.definitions.ConfigParameter] = <factory>)

Bases: MetricBase

This metric calculates the Range of a single numerical data column. Range is the difference between the largest and smallest values
This is a feature level metric which can process only numerical (int, float) data types.
This is an exact metric.
This metric handles NaN values by dropping them from the given data column

Returns

float: Range of the data.

Example

import pandas as pd

from mlm_insights.builder.builder_component import MetricDetail, EngineDetail
from mlm_insights.builder.insights_builder import InsightsBuilder
from mlm_insights.constants.types import FeatureType, DataType, VariableType
from mlm_insights.core.metrics.metric_metadata import MetricMetadata
from mlm_insights.core.metrics.range import Range

df = pd.DataFrame({"Age": [1, 4, 6, 1]})
metric_details = MetricDetail(univariate_metric={"Age": [MetricMetadata(klass=Range)]}, dataset_metrics=[])
input_schema = {"Age": FeatureType(DataType.INTEGER, VariableType.NOMINAL)}

runner = InsightsBuilder(). \
    with_input_schema(input_schema). \
    with_data_frame(data_frame=df). \
    with_metrics(metrics=metric_details). \
    with_engine(engine=EngineDetail(engine_name="native")). \
    build()

profile_json = runner.run().profile.to_json()
feature_metrics = profile_json['feature_metrics']
print(feature_metrics["Age"])

{
  "Range": {
    "metric_name": "Range",
    "metric_description": "Feature Metric to compute range value",
    "variable_count": 1,
    "variable_names": ["range"],
    "variable_types": ["CONTINUOUS"],
    "variable_dtypes": ["FLOAT"],
    "variable_dimensions": [0],
    "metric_data": [5.0],
    "metadata": {},
    "error": null
  }
}
config: Dict[str, ConfigParameter]
classmethod create(config: Dict[str, ConfigParameter] | None = None) Range

Factory Method to create an object. The configuration will be available in config.

Returns

MetricBase

An Instance of MetricBase.

get_required_shareable_feature_components() List[SFCMetaData]

Returns a list of Shareable Feature Components containing 1 SFC that is Descriptive Statistics SFC.

Returns

List of SFCMetadata, containing only 1 SFC i.e. DescriptiveStatisticsSFC

get_result(**kwargs: Any) Dict[str, Any]

Returns range of input data.

Returns

float: range of the data.

get_standard_metric_result(**kwargs: Any) StandardMetricResult

Returns Range Metric in Standard format.

Returns

StandardMetricResult: Range Metric in Standard format.

mlm_insights.core.metrics.rows_count module

class mlm_insights.core.metrics.rows_count.RowCount(config: ~typing.Dict[str, ~mlm_insights.constants.definitions.ConfigParameter] = <factory>, row_count: int = 0)

Bases: DatasetMetricBase

This metric calculates the total row count of the Dataset
This Dataset level metric is an exact metric.
This metric doesn’t handle NaN values. If certain rows have NaN values, there is no impact on the RowCount

Configuration

None

Returns

integer: total row count of the dataset

Examples

import pandas as pd

from mlm_insights.builder.builder_component import MetricDetail, EngineDetail
from mlm_insights.builder.insights_builder import InsightsBuilder
from mlm_insights.constants.types import FeatureType, DataType, VariableType, ColumnType
from mlm_insights.core.metrics.rows_count import RowCount
from mlm_insights.core.metrics.metric_metadata import MetricMetadata

def main():
    input_schema = {
        'square_feet': FeatureType(
            data_type=DataType.FLOAT,
            variable_type=VariableType.CONTINUOUS,
            column_type=ColumnType.INPUT)
    }
    data_frame = pd.DataFrame({'square_feet': [11.23, 23.45, 11.23, 45.56, 11.23]})
    metric_details = MetricDetail(univariate_metric={},
                                  dataset_metrics=[MetricMetadata(klass=RowCount)])

    runner = InsightsBuilder(). \
        with_input_schema(input_schema). \
        with_data_frame(data_frame=data_frame). \
        with_metrics(metrics=metric_details). \
        with_engine(engine=EngineDetail(engine_name="native")). \
        build()

    profile_json = runner.run().profile.to_json()
    dataset_metrics = profile_json['dataset_metrics']
    print(dataset_metrics["RowCount"])
    # {'value': 5}
if __name__ == "__main__":
    main()

Returns the standard metric result as:
{
    'metric_name': 'RowCount',
    'metric_description': 'Dataset-level Metric to compute the total row count of the dataset',
    'variable_count': 1,
    'variable_names': ['rows_count'],
    'variable_types': [DISCRETE],
    'variable_dtypes': [INTEGER],
    'variable_dimensions': [0],
    'metric_data': [5],
    'metadata': {},
    'error': None
}
compute(dataset: DataFrame, **kwargs: Any) None

Calculates the metric value(s) from the passed DataFrame and sets the internal state with the value(s). When a metric is being computed for a partitioned dataset, this method is invoked for each partition; write the logic required to derive the metric value for that specific partition in this method.

Parameters

dataset : pd.DataFrame

DataFrame object for either the entire dataset or a partition on which the metric is being computed

classmethod create(config: Dict[str, ConfigParameter] | None = None, **kwargs: Any) RowCount

Factory Method to create an object. The configuration will be available in config.

Returns

MetricBase

An Instance of MetricBase.

get_result(**kwargs: Any) Dict[str, Any]

Returns the computed value of the metric

Returns

Dict[str, Any]: Dictionary with key as string and value as any metric property.

get_standard_metric_result(**kwargs: Any) StandardMetricResult

This method returns metric output in standard format.

Returns

StandardMetricResult

merge(other_metric: RowCount, **kwargs: Any) RowCount

Merge the other metric with the current metric and return a new instance of the metric. Use this method to merge the states of the two metrics to produce a statistically correct state. Note: do not mutate the current metric; create a new instance instead.

Parameters

other_metric : DatasetMetricBase

The second metric which the current metric is being merged with

Returns

DatasetMetricBase: New, merged DatasetMetricBase instance

row_count: int = 0

mlm_insights.core.metrics.serializer module

mlm_insights.core.metrics.serializer.do_metric_deserialize(klass: Any, metric_message: MetricMessage) Any
mlm_insights.core.metrics.serializer.do_metric_serialize(metric: Any) MetricMessage
mlm_insights.core.metrics.serializer.get_metric_class(metric_name: str, klass: Any | None = None) Any

mlm_insights.core.metrics.skewness module

class mlm_insights.core.metrics.skewness.Skewness(config: ~typing.Dict[str, ~mlm_insights.constants.definitions.ConfigParameter] = <factory>)

Bases: MetricBase

This metric calculates Skewness of a single numerical data column. Skewness is a measure of the asymmetry of a data set
This is a feature level metric which can process only numerical (int, float) data types.
This is an exact metric.
This metric handles NaN values by dropping them from the given data column
Interpretation of the data distribution based on the skewness value:
  • Skewness = 0: the distribution is symmetric (as in a normal distribution).

  • Skewness > 0: the right tail of the distribution is longer

  • Skewness < 0: the left tail of the distribution is longer

Configuration

None

Returns

float: Skewness of the data

Examples

import pandas as pd

from mlm_insights.builder.builder_component import MetricDetail, EngineDetail
from mlm_insights.builder.insights_builder import InsightsBuilder
from mlm_insights.constants.types import FeatureType, DataType, VariableType, ColumnType
from mlm_insights.core.metrics.skewness import Skewness
from mlm_insights.core.metrics.metric_metadata import MetricMetadata

def main():
    input_schema = {
        'square_feet': FeatureType(
            data_type=DataType.FLOAT,
            variable_type=VariableType.CONTINUOUS,
            column_type=ColumnType.INPUT)
    }
    data_frame = pd.DataFrame({'square_feet': [11.23, 23.45, 11.23, 45.56, 11.23]})
    metric_details = MetricDetail(univariate_metric=
                                  {"square_feet": [MetricMetadata(klass=Skewness)]},
                                  dataset_metrics=[])

    runner = InsightsBuilder(). \
        with_input_schema(input_schema). \
        with_data_frame(data_frame=data_frame). \
        with_metrics(metrics=metric_details). \
        with_engine(engine=EngineDetail(engine_name="native")). \
        build()

    profile_json = runner.run().profile.to_json()
    feature_metrics = profile_json['feature_metrics']
    print(feature_metrics['square_feet']["Skewness"])
    # {'value': 1.1088349707251306}
if __name__ == "__main__":
    main()

Returns the standard metric result as:
{
    'metric_name': 'Skewness',
    'metric_description': 'Feature Metric to compute Skewness',
    'variable_count': 1,
    'variable_names': ['skewness'],
    'variable_types': [CONTINUOUS],
    'variable_dtypes': [FLOAT],
    'variable_dimensions': [0],
    'metric_data': [1.1088349707251306],
    'metadata': {},
    'error': None
}
config: Dict[str, ConfigParameter]
classmethod create(config: Dict[str, ConfigParameter] | None = None) Skewness

Factory Method to create an object. The configuration will be available in config.

Returns

An Instance of Skewness.

get_required_shareable_feature_components() List[SFCMetaData]

Returns a list of Shareable Feature Components containing 1 SFC that is Descriptive Statistics SFC.

Returns

List of SFCMetadata, containing only 1 SFC i.e. DescriptiveStatisticsSFC

get_result(**kwargs: Any) Dict[str, Any]

Returns skewness of input data.

Returns

float: skewness of the data.

get_standard_metric_result(**kwargs: Any) StandardMetricResult

This method returns metric output in standard format.

Returns

StandardMetricResult

mlm_insights.core.metrics.standard_deviation module

class mlm_insights.core.metrics.standard_deviation.StandardDeviation(config: ~typing.Dict[str, ~mlm_insights.constants.definitions.ConfigParameter] = <factory>)

Bases: MetricBase

This metric calculates Standard Deviation of a single numerical data column, a measure of the spread of a distribution.
This is a feature level metric which can process only numerical (int, float) data types.
This is an exact metric.
This metric handles NaN values by dropping them from the given data column

Configuration

None

Returns

float: Standard Deviation of the feature

Examples

import pandas as pd

from mlm_insights.builder.builder_component import MetricDetail, EngineDetail
from mlm_insights.builder.insights_builder import InsightsBuilder
from mlm_insights.constants.types import FeatureType, DataType, VariableType, ColumnType
from mlm_insights.core.metrics.standard_deviation import StandardDeviation
from mlm_insights.core.metrics.metric_metadata import MetricMetadata

def main():
    input_schema = {
        'square_feet': FeatureType(
            data_type=DataType.FLOAT,
            variable_type=VariableType.CONTINUOUS,
            column_type=ColumnType.INPUT)
    }
    data_frame = pd.DataFrame({'square_feet': [11.23, 23.45, 11.23, 45.56, 11.23]})
    metric_details = MetricDetail(univariate_metric=
                                  {"square_feet": [MetricMetadata(klass=StandardDeviation)]},
                                  dataset_metrics=[])

    runner = InsightsBuilder(). \
        with_input_schema(input_schema). \
        with_data_frame(data_frame=data_frame). \
        with_metrics(metrics=metric_details). \
        with_engine(engine=EngineDetail(engine_name="native")). \
        build()

    profile_json = runner.run().profile.to_json()
    feature_metrics = profile_json['feature_metrics']
    print(feature_metrics['square_feet']["StandardDeviation"])
    # {'value': 13.375326538070016}
if __name__ == "__main__":
    main()

Returns the standard metric result as:
{
    'metric_name': 'StandardDeviation',
    'metric_description': 'Feature Metric to compute Standard Deviation',
    'variable_count': 1,
    'variable_names': ['standard_deviation'],
    'variable_types': [CONTINUOUS],
    'variable_dtypes': [FLOAT],
    'variable_dimensions': [0],
    'metric_data': [13.375326538070016],
    'metadata': {},
    'error': None
}
config: Dict[str, ConfigParameter]
classmethod create(config: Dict[str, ConfigParameter] | None = None) StandardDeviation

Factory Method to create an object. The configuration will be available in config.

Returns

MetricBase

An Instance of MetricBase.

get_required_shareable_feature_components() List[SFCMetaData]

Returns list of SFCs required to compute Standard Deviation metric.

Returns

List: list of SFCs

get_result(**kwargs: Any) Dict[str, Any]

Returns Standard Deviation metric for the data using the DescriptiveStatisticsSFC.

Returns

float: Standard Deviation of the data.

get_standard_metric_result(**kwargs: Any) StandardMetricResult

This method returns metric output in standard format.

Returns

StandardMetricResult

mlm_insights.core.metrics.sum module

class mlm_insights.core.metrics.sum.Sum(config: ~typing.Dict[str, ~mlm_insights.constants.definitions.ConfigParameter] = <factory>, sum: float = 0.0)

Bases: MetricBase

This metric calculates Sum of a single numerical data column.
This is a feature level metric which can process only numerical (int, float) data types.
This is an exact metric.
This metric handles NaN values by dropping them from the given data column

Returns

float: Sum of the feature data

Examples

# To declare Sum metric:
metric_details = MetricDetail(univariate_metric= {"feature_name": [MetricMetadata(klass=Sum)]},
                              dataset_metrics=[])
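
For a complete run, the declaration above can be combined with the same builder flow used by the other metric examples in this module. The sketch below is illustrative; the sample values are assumptions, and the expected sum of 102.7 simply follows from them (NaN values are dropped before summing).

import pandas as pd

from mlm_insights.builder.builder_component import MetricDetail, EngineDetail
from mlm_insights.builder.insights_builder import InsightsBuilder
from mlm_insights.constants.types import FeatureType, DataType, VariableType, ColumnType
from mlm_insights.core.metrics.sum import Sum
from mlm_insights.core.metrics.metric_metadata import MetricMetadata


def main():
    input_schema = {
        'square_feet': FeatureType(
            data_type=DataType.FLOAT,
            variable_type=VariableType.CONTINUOUS,
            column_type=ColumnType.INPUT)
    }
    # The None value is dropped, so the expected sum is 102.7 (up to floating-point rounding)
    data_frame = pd.DataFrame({'square_feet': [11.2, 23.4, 11.2, 45.7, 11.2, None]})
    metric_details = MetricDetail(
        univariate_metric={"square_feet": [MetricMetadata(klass=Sum)]},
        dataset_metrics=[])

    runner = InsightsBuilder(). \
        with_input_schema(input_schema). \
        with_data_frame(data_frame=data_frame). \
        with_metrics(metrics=metric_details). \
        with_engine(engine=EngineDetail(engine_name="native")). \
        build()

    profile_json = runner.run().profile.to_json()
    print(profile_json['feature_metrics']['square_feet']["Sum"])


if __name__ == "__main__":
    main()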
compute(column: Series, **kwargs: Any) None

Computes the sum for the passed in dataset. In case of a partitioned dataset, computes the sum for the specific partition

Parameters

column : pd.Series

Input column.

classmethod create(config: Dict[str, ConfigParameter] | None = None) Sum

Factory Method to create an object. The configuration will be available in config.

Returns

MetricBase

An Instance of MetricBase.

get_result(**kwargs: Any) Dict[str, Any]

Returns sum of input data.

Returns

float: Sum of the data.

get_standard_metric_result(**kwargs: Any) StandardMetricResult

Returns Standard Metric for sum.

Returns

StandardMetricResult: Sum Metric in standard format.

classmethod get_supported_variable_types() List[VariableType]

Method to retrieve the list of Feature Variable type supported for the metric

Returns

List of Feature Variable type supported by the Sum metric

merge(other_metric: Sum, **kwargs: Any) Sum

Merge two Sum metrics into one, without mutating the others.

Parameters

other_metric : Sum

Other Sum metric that needs to be merged.

Returns

Sum

A new instance of Sum metric after merging.

sum: float = 0.0

mlm_insights.core.metrics.top_k_frequent_elements module

class mlm_insights.core.metrics.top_k_frequent_elements.TopKFrequentElements(config: ~typing.Dict[str, ~mlm_insights.constants.definitions.ConfigParameter] = <factory>, _lg_max_k: int = 10, k: int = 10)

Bases: MetricBase

This metric calculates the Top K frequent elements for the given data column. It returns the most frequent items (aka heavy hitters) along with the estimated frequency of occurrence of each item.
This is a feature level metric which can process both numerical and categorical data types.
This is an approximate metric which uses a Frequent Items Sketch to return estimated frequency of the items.
This metric handles NaN values by dropping them from the given data column

The Frequent Items Sketch is initialized with a maxMapSize that specifies the maximum physical length of the internal hash map of the form (<T> item, long count). The maxMapSize must be a power of 2. If fewer than 0.75 * maxMapSize different items are inserted into the sketch the estimated frequencies returned by the sketch will be exact. Otherwise, items are returned with their estimated frequencies.

NOTE: In case the metric result doesn’t contain any output, the user will need to tweak the maxMapSize by providing a higher value for ‘lg_max_k’. The results may also be returned with a large difference between the upper and lower bounds; in that case the results are approximate, and increasing the maxMapSize yields tighter estimates.

Please refer here for more details: https://datasketches.apache.org/docs/Frequency/FrequentItemsOverview.html

Configuration

k: int, default=10
  • k value for how many top items to be returned by the metric

lg_max_k: int, default=10
  • Log (base 2) of the max map size (max map size = 2^lg_max_k). With the default lg_max_k of 10, the map size used by the Frequent Items Sketch is 2^10 = 1024

Returns

categories: String
  • The different categories (item name)

estimate: int
  • The estimated frequency for the given category (item name)

lower_bound: int
  • The lower bound for frequency of the given category. True frequency is always guaranteed to lie between lower bound and upper bound

upper_bound: int
  • The upper bound for frequency of the given category.

Exceptions

  • InvalidParameterException - in case lg_max_k < 7 or lg_max_k > 21

Examples

# To declare TopKFrequentElements metric, without any config
metric_details = MetricDetail(univariate_metric= {"feature_name": [MetricMetadata(klass=TopKFrequentElements)]},
                              dataset_metrics=[])
# To declare TopKFrequentElements metric, along with config options
metric_details = MetricDetail(univariate_metric= {"feature_name": [MetricMetadata(klass=TopKFrequentElements,
                              config={"k":20, "lg_max_k": 12})]}, dataset_metrics=[])

Returns the standard metric result as:
{
    'metric_name': 'TopKFrequentElements',
    'metric_description': 'Top K Frequent Elements',
    'variable_count': 4,
    'variable_names': ['categories', 'estimate', 'lower_bound', 'upper_bound'],
    'variable_types': [NOMINAL, CONTINUOUS, CONTINUOUS,
                       CONTINUOUS],
    'variable_dtypes': [STRING, INTEGER, INTEGER, INTEGER],
    'variable_dimensions': [1,1,1,1],
    'metric_data': [['3', '1', '2'], [5, 4, 3], [5, 4, 3], [5, 4, 3]],
    'metadata': {},
    'error': None
}
classmethod create(config: Dict[str, ConfigParameter] | None = None) TopKFrequentElements

Factory Method to create an object. The configuration will be available in config.

Returns

An Instance of TopKFrequentElements.

get_required_shareable_feature_components() List[SFCMetaData]

Returns list of SFCs required to compute Top K Frequent Elements Metric.

Returns

List: list of SFCs

get_result(**kwargs: Any) Dict[str, Any]

Returns Top K Frequent Elements Metric for the data using the FrequentItemsSFC.

Returns

Dict: Top K Frequent Elements Metric of the data.

get_standard_metric_result(**kwargs: Any) StandardMetricResult

Returns Top K Frequent Elements Metric in Standard format.

Returns

StandardMetricResult: Top K Frequent Elements Metric in Standard format.

k: int = 10
merge(other_metric: TopKFrequentElements, **kwargs: Any) TopKFrequentElements

Merge two Top K Frequent Elements Metric into one, without mutating the others.

Parameters

other_metric : TopKFrequentElements

Other TopKFrequentElements that needs to be merged.

Returns

TopKFrequentElements

A new instance of TopKFrequentElements after merging.

mlm_insights.core.metrics.type_metric module

class mlm_insights.core.metrics.type_metric.TypeMetric(config: ~typing.Dict[str, ~mlm_insights.constants.definitions.ConfigParameter] = <factory>, string_type_count: int = 0, integral_type_count: int = 0, fractional_type_count: int = 0, boolean_type_count: int = 0)

Bases: MetricBase

This metric calculates the count of data types for feature values. For a given feature, it returns how many strings, integers, floats and booleans are present in the feature data.
This is a feature level metric which can process a feature having any data types.
This is an exact metric.
This metric handles NaN values by dropping them from the given data column

Configuration

None

Returns

  • string_type_count: Count of number of feature values of type string

  • integral_type_count: Count of number of feature values of type integer

  • fractional_type_count: Count of number of feature values of type float

  • boolean_type_count: Count of number of feature values of type boolean

Examples

import pandas as pd
import numpy as np

from mlm_insights.builder.builder_component import MetricDetail, EngineDetail
from mlm_insights.builder.insights_builder import InsightsBuilder
from mlm_insights.constants.types import FeatureType, DataType, VariableType, ColumnType
from mlm_insights.core.metrics.type_metric import TypeMetric
from mlm_insights.core.metrics.metric_metadata import MetricMetadata


def main():
    input_schema = {
        'square_feet': FeatureType(
            data_type=DataType.FLOAT,
            variable_type=VariableType.CONTINUOUS,
            column_type=ColumnType.INPUT)
    }
    data_frame = pd.DataFrame({'square_feet': [0, 1, 2.0, 3, 4.4, True, False, 5, np.nan, 6.0, 7, None]})
    metric_details = MetricDetail(univariate_metric=
                                    {"square_feet": [MetricMetadata(klass=TypeMetric)]},
                                    dataset_metrics=[])

    runner = InsightsBuilder(). \
        with_input_schema(input_schema). \
        with_data_frame(data_frame=data_frame). \
        with_metrics(metrics=metric_details). \
        with_engine(engine=EngineDetail(engine_name="native")). \
        build()

    profile_json = runner.run().profile.to_json()
    feature_metrics = profile_json['feature_metrics']
    print(feature_metrics['square_feet']["TypeMetric"])
    # {'string_type_count': 0, 'integral_type_count': 5, 'fractional_type_count': 3, 'boolean_type_count': 2}

    if __name__ == "__main__":
        main()

    Returns the standard metric result as:
    {
        metric_name: 'TypeMetric',
        metric_description: 'Feature Metric to compute count of data types for feature values',
        variable_count: 4,
        variable_names: ['string_type_count', 'integral_type_count', 'fractional_type_count', 'boolean_type_count],
        variable_types: [DISCRETE, DISCRETE, DISCRETE, DISCRETE],
        variable_dtypes: [INTEGER, INTEGER, INTEGER, INTEGER],
        variable_dimensions: [0, 0, 0, 0],
        metric_data=[0, 5, 3, 2],
        metadata={},
        error=None
    }
boolean_type_count: int = 0
compute(column: Series, **kwargs: Any) None

Computes TypeMetric for the passed in dataset.

In case of a partitioned dataset, the TypeMetric for the specific partition is computed.

Parameters

column : pd.Series

Input column.

classmethod create(config: Dict[str, ConfigParameter] | None = None) TypeMetric

Factory Method to create an object. The configuration will be available in config.

Returns

MetricBase

An Instance of MetricBase.

fractional_type_count: int = 0
get_result(**kwargs: Any) Dict[str, Any]

Returns Map containing string_type_count, integral_type_count, fractional_type_count, boolean_type_count of input data.

Returns

Map: Map containing string_type_count, integral_type_count, fractional_type_count, boolean_type_count of input data.

get_standard_metric_result(**kwargs: Any) StandardMetricResult

This method returns metric output in standard format.

Returns

StandardMetricResult

integral_type_count: int = 0
merge(other_metric: TypeMetric, **kwargs: Any) TypeMetric

Merge two TypeMetric into one, without mutating the others.

Parameters

other_metric : TypeMetric

Other TypeMetric that needs to be merged.

Returns

TypeMetric

A new instance of TypeMetric containing string_type_count, integral_type_count, fractional_type_count, boolean_type_count after merging.

string_type_count: int = 0

mlm_insights.core.metrics.variance module

class mlm_insights.core.metrics.variance.Variance(config: ~typing.Dict[str, ~mlm_insights.constants.definitions.ConfigParameter] = <factory>)

Bases: MetricBase

This metric calculates Variance of a single numerical data column, a measure of the spread of a distribution.
This is a feature level metric which can process only numerical (int, float) data types.
This is an exact metric.
This metric handles NaN values by dropping them from the given data column

Configuration

None

Returns

float: Variance of the feature

Examples

import pandas as pd

from mlm_insights.builder.builder_component import MetricDetail, EngineDetail
from mlm_insights.builder.insights_builder import InsightsBuilder
from mlm_insights.constants.types import FeatureType, DataType, VariableType, ColumnType
from mlm_insights.core.metrics.variance import Variance
from mlm_insights.core.metrics.metric_metadata import MetricMetadata

def main():
    input_schema = {
        'square_feet': FeatureType(
            data_type=DataType.FLOAT,
            variable_type=VariableType.CONTINUOUS,
            column_type=ColumnType.INPUT)
    }
    data_frame = pd.DataFrame({'square_feet': [11.23, 23.45, 11.23, 45.56, 11.23]})
    metric_details = MetricDetail(univariate_metric=
                                  {"square_feet": [MetricMetadata(klass=Variance)]},
                                  dataset_metrics=[])

    runner = InsightsBuilder(). \
        with_input_schema(input_schema). \
        with_data_frame(data_frame=data_frame). \
        with_metrics(metrics=metric_details). \
        with_engine(engine=EngineDetail(engine_name="native")). \
        build()

    profile_json = runner.run().profile.to_json()
    feature_metrics = profile_json['feature_metrics']
    print(feature_metrics['square_feet']["Variance"])
    # {'value': 178.89936000000003}
if __name__ == "__main__":
    main()

Returns the standard metric result as:
{
    'metric_name': 'Variance',
    'metric_description': 'Feature Metric to compute Variance',
    'variable_count': 1,
    'variable_names': ['variance'],
    'variable_types': [CONTINUOUS],
    'variable_dtypes': [FLOAT],
    'variable_dimensions': [0],
    'metric_data': [178.89936000000003],
    'metadata': {},
    'error': None
}
config: Dict[str, ConfigParameter]
classmethod create(config: Dict[str, ConfigParameter] | None = None) Variance

Factory Method to create an object. The configuration will be available in config.

Returns

MetricBase

An Instance of MetricBase.

get_required_shareable_feature_components() List[SFCMetaData]

Returns a list of Shareable Feature Components containing 1 SFC that is Descriptive Statistics SFC.

Returns

List of SFCMetadata, containing only 1 SFC i.e. DescriptiveStatisticsSFC

get_result(**kwargs: Any) Dict[str, Any]

Returns variance of input data.

Returns

float: variance of the data.

get_standard_metric_result(**kwargs: Any) StandardMetricResult

This method returns metric output in standard format.

Returns

StandardMetricResult

Module contents