mlm_insights.core.metrics package

Subpackages

Submodules

mlm_insights.core.metrics.count module

class mlm_insights.core.metrics.count.Count(config: ~typing.Dict[str, ~mlm_insights.constants.definitions.ConfigParameter] = <factory>, missing_count: int = 0, total_count: int = 0)

Bases: MetricBase

Feature Metric to compute total row count, missing count and missing count percentage.
NaN values are taken into account while computing the total count.
It is an exact univariate metric which can process any column type and all data types.

Configuration

None

Returns

total_count: int
  • Number of records processed for the feature.

missing_count: int
  • Number of records which have missing values.

missing_count_percentage: float
  • The percentage of missing records in the data.

Examples

import pandas as pd

from mlm_insights.builder.builder_component import MetricDetail, EngineDetail
from mlm_insights.builder.insights_builder import InsightsBuilder
from mlm_insights.constants.types import FeatureType, DataType, VariableType, ColumnType
from mlm_insights.core.metrics.count import Count
from mlm_insights.core.metrics.metric_metadata import MetricMetadata

def main():
    input_schema = {
        'square_feet': FeatureType(
            data_type=DataType.FLOAT,
            variable_type=VariableType.CONTINUOUS,
            column_type=ColumnType.INPUT)
    }
    data_frame = pd.DataFrame({'square_feet': [11.23, 23.45, 11.23, 45.56, 11.23, None]})
    metric_details = MetricDetail(univariate_metric=
                                  {"square_feet": [MetricMetadata(klass=Count)]},
                                  dataset_metrics=[])

    runner = InsightsBuilder() \
        .with_input_schema(input_schema) \
        .with_data_frame(data_frame=data_frame) \
        .with_metrics(metrics=metric_details) \
        .with_engine(engine=EngineDetail(engine_name="native")) \
        .build()

    profile_json = runner.run().profile.to_json()
    feature_metrics = profile_json['feature_metrics']
    print(feature_metrics['square_feet']["Count"])
    # {'total_count': 6, 'missing_count': 1, 'missing_count_percentage': 16.666666666666664}


if __name__ == "__main__":
    main()


Returns the standard metric result as:
{
    'metric_name': 'Count',
    'metric_description': 'Feature metric that returns total count, missing count and missing count percentage',
    'variable_count': 3, 'variable_names': ['total_count', 'missing_count', 'missing_count_percentage'],
    'variable_types': [CONTINUOUS, CONTINUOUS, CONTINUOUS], 'variable_dtypes': [INTEGER, INTEGER, FLOAT],
    'variable_dimensions': [0, 0, 0],
    'metric_data': [0, 0, 0.0],
    'metadata': {},
    'error': None
}
compute(column: Series, **kwargs: Any) None

Computes the count of missing records for the passed in dataset, as well as the total number of processed records. In case of a partitioned dataset, computes the count of missing records for the specific partition

Parameters

column : pd.Series

Input column.

classmethod create(config: Dict[str, ConfigParameter] | None = None) Count

Factory Method to create an object. The configuration will be available in config.

Returns

Count

An Instance of Count.

get_result(**kwargs: Any) Dict[str, Any]

Returns the total count, count of missing data, and the percentage of missing data for the feature

Returns

total_count: int
  • total number of records processed in the data.

missing_count: int
  • number of records in the data having missing values.

missing_count_percentage: float
  • percentage of missing records (the number of missing records for the feature divided by the total number of records)

get_standard_metric_result(**kwargs: Any) StandardMetricResult

Returns Standard Metric for Count.

Returns

StandardMetricResult: Count Metric in standard format.

merge(other_metric: Count, **kwargs: Any) Count

Merge two Count metrics into one, without mutating the others.

Parameters

other_metric : Count

Other Count metric that needs to be merged.

Returns

Count

A new instance of Count containing missing_count and total_count after merging.
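
A minimal sketch of the compute/merge flow on a partitioned column, composed only from the methods documented above (the data values are illustrative):

import pandas as pd

from mlm_insights.core.metrics.count import Count

# Two partitions of the same feature column (illustrative data)
partition_1 = pd.Series([11.23, 23.45, None])
partition_2 = pd.Series([45.56, 11.23, 11.23])

count_1 = Count.create()
count_1.compute(partition_1)

count_2 = Count.create()
count_2.compute(partition_2)

# merge() returns a new Count instance; neither input is mutated
merged = count_1.merge(count_2)
print(merged.get_result())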

missing_count: int = 0
total_count: int = 0

mlm_insights.core.metrics.dataset_metric_registry module

class mlm_insights.core.metrics.dataset_metric_registry.DatasetMetricRegistry

Bases: object

add_metric(dataset_metric_metadata: MetricMetadata, **kwargs: Any) DatasetMetricRegistry
static create_from_metrics_map(dataset_metrics_map: Dict[str, DatasetMetricBase]) DatasetMetricRegistry

Factory method to create Dataset Metric Registry using Dataset Metric Map. Use this method to create metric registry directly from the dataset metric map.

Parameters

dataset_metrics_map : Dict[str, DatasetMetricBase]

Dictionary of dataset metrics, keyed by metric hash, with DatasetMetricBase instances as values.

classmethod deserialize(metric_registry_message: MetricRegistryMessage) DatasetMetricRegistry
get_dataset_metrics() Any
get_dataset_metrics_map() Dict[str, DatasetMetricBase]
get_metric(dataset_metric_metadata: MetricMetadata) DatasetMetricBase
serialize() MetricRegistryMessage

mlm_insights.core.metrics.dataset_summary module

class mlm_insights.core.metrics.dataset_summary.DatasetSummary

Bases: object

compute(input_schema: Dict[str, FeatureType]) None
classmethod deserialize(serialized: DatasetSummaryMessage) DatasetSummary
serialize() DatasetSummaryMessage

mlm_insights.core.metrics.distinct_count module

class mlm_insights.core.metrics.distinct_count.DistinctCount(config: ~typing.Dict[str, ~mlm_insights.constants.definitions.ConfigParameter] = <factory>)

Bases: MetricBase

Feature Metric to compute the distinct count of elements present in a column.
It is an approximate univariate metric which can process any column type and all data types.
Internally, it uses a sketch data structure with a default K value of 4096.
It supports all data types and does not consider NaN values in the computation.

Returns

distinct_count: int
  • the distinct count of the data.

Example

import pandas as pd

from mlm_insights.builder.builder_component import MetricDetail, EngineDetail
from mlm_insights.builder.insights_builder import InsightsBuilder
from mlm_insights.constants.types import FeatureType, DataType, VariableType
from mlm_insights.core.metrics.distinct_count import DistinctCount
from mlm_insights.core.metrics.metric_metadata import MetricMetadata

df = pd.DataFrame({"Age": [1, 4, 6, 1]})
metric_details = MetricDetail(univariate_metric={"Age": [MetricMetadata(klass=DistinctCount)]}, dataset_metrics=[])
input_schema = {"Age": FeatureType(DataType.INTEGER, VariableType.NOMINAL)}

runner = InsightsBuilder() \
    .with_input_schema(input_schema) \
    .with_data_frame(data_frame=df) \
    .with_metrics(metrics=metric_details) \
    .with_engine(engine=EngineDetail(engine_name="native")) \
    .build()

profile_json = runner.run().profile.to_json()
feature_metrics = profile_json['feature_metrics']
print(feature_metrics["Age"])

{
  "DistinctCount": {
    "metric_name": "DistinctCount",
    "metric_description": "Approximate Distinct Count",
    "variable_count": 1,
    "variable_names": ["distinct_count"],
    "variable_types": ["CONTINUOUS"],
    "variable_dtypes": ["FLOAT"],
    "variable_dimensions": [0],
    "metric_data": [3],
    "metadata": {}
  }
}
config: Dict[str, ConfigParameter]
classmethod create(config: Dict[str, ConfigParameter] | None = None) DistinctCount

Factory Method to create an object. The configuration will be available in config.

Returns

An Instance of DistinctCount.

get_required_shareable_feature_components() List[SFCMetaData]

Returns a list of Shareable Feature Components containing 1 SFC, namely the Distinct Count SFC.

Returns

List of SFCMetadata, containing only 1 SFC i.e. DistinctCountSFC

get_result(**kwargs: Any) Dict[str, Any]

Returns the distinct count of data.

Returns

int: the distinct count of the data.

get_standard_metric_result(**kwargs: Any) StandardMetricResult

Returns Standard Metric for Distinct Count.

Returns

StandardMetricResult: Distinct Count Metric in standard format.

mlm_insights.core.metrics.duplicate_count module

class mlm_insights.core.metrics.duplicate_count.DuplicateCount(config: ~typing.Dict[str, ~mlm_insights.constants.definitions.ConfigParameter] = <factory>)

Bases: MetricBase

Feature Metric to compute duplicate count and duplicate count percentage of elements present in a column.
It is an approximate univariate metric which can process any column type and all data types.
Internally, it uses a sketch data structure with a default K value of 1024.
It supports all data types and does not consider NaN values in the computation.

Configuration

None

Returns

count: int
  • Number of duplicate items in the feature data

percentage: float
  • The percentage of duplicate records in the data

Examples

import pandas as pd

from mlm_insights.builder.builder_component import MetricDetail, EngineDetail
from mlm_insights.builder.insights_builder import InsightsBuilder
from mlm_insights.constants.types import FeatureType, DataType, VariableType, ColumnType
from mlm_insights.core.metrics.duplicate_count import DuplicateCount
from mlm_insights.core.metrics.metric_metadata import MetricMetadata


def main():
    input_schema = {
        'square_feet': FeatureType(
            data_type=DataType.FLOAT,
            variable_type=VariableType.CONTINUOUS,
            column_type=ColumnType.INPUT)
    }
    data_frame = pd.DataFrame({'square_feet': [11.23, 23.45, 11.23, 45.56, 11.23]})
    metric_details = MetricDetail(univariate_metric=
                                  {"square_feet": [MetricMetadata(klass=DuplicateCount)]},
                                  dataset_metrics=[])

    runner = InsightsBuilder() \
        .with_input_schema(input_schema) \
        .with_data_frame(data_frame=data_frame) \
        .with_metrics(metrics=metric_details) \
        .with_engine(engine=EngineDetail(engine_name="native")) \
        .build()

    profile_json = runner.run().profile.to_json()
    feature_metrics = profile_json['feature_metrics']
    print(feature_metrics['square_feet']["DuplicateCount"])
    # {'count': 2, 'percentage': 40.0}
if __name__ == "__main__":
    main()


Returns the standard metric result as:
{
    'metric_name': 'DuplicateCount',
    'metric_description': 'Feature Metric to compute duplicate count and duplicate count percentage',
    'variable_count': 2, 'variable_names': ['count', 'percentage'],
    'variable_types': [CONTINUOUS, CONTINUOUS], 'variable_dtypes': [INTEGER, FLOAT],
    'variable_dimensions': [0, 0],
    'metric_data': [23, 15.5],
    'metadata': {},
    'error': None
}
config: Dict[str, ConfigParameter]
classmethod create(config: Dict[str, ConfigParameter] | None = None) DuplicateCount

Factory Method to create an object.

Returns

An Instance of DuplicateCount.

get_required_shareable_feature_components() List[SFCMetaData]

Returns a list of Shareable Feature Components containing 1 SFC that is Frequent Items SFC.

Returns

List of SFCMetadata, containing only 1 SFC i.e. Frequent Items SFC

get_result(**kwargs: Any) Dict[str, Any]

Returns the number of items that are duplicate of another item in the data and percentage of duplicate count out of total count.

Returns

Object: number of items that are duplicate of another item in the data and percentage of duplicate count out of total count.

get_standard_metric_result(**kwargs: Any) StandardMetricResult

This method returns metric output in standard format.

Returns

StandardMetricResult

mlm_insights.core.metrics.framework_metrics_enum module

class mlm_insights.core.metrics.framework_metrics_enum.FrameworkMetrics(value)

Bases: Enum

Defines all Insights-provided Metrics.

AccuracyScore = <class 'mlm_insights.core.metrics.classification_metrics.accuracy_score.AccuracyScore'>
ChiSquare = <class 'mlm_insights.core.metrics.drift_metrics.chi_square.ChiSquare'>
ClassImbalance = <class 'mlm_insights.core.metrics.bias_and_fairness.class_imbalance.ClassImbalance'>
ConflictLabel = <class 'mlm_insights.core.metrics.conflict_metrics.conflict_label.ConflictLabel'>
ConflictPrediction = <class 'mlm_insights.core.metrics.conflict_metrics.conflict_prediction.ConflictPrediction'>
ConfusionMatrix = <class 'mlm_insights.core.metrics.classification_metrics.confusion_matrix.ConfusionMatrix'>
CorrelationRatio = <class 'mlm_insights.core.metrics.data_quality.correlation_ratio.CorrelationRatio'>
Count = <class 'mlm_insights.core.metrics.count.Count'>
CramersVCorrelation = <class 'mlm_insights.core.metrics.data_quality.cramers_v_correlation.CramersVCorrelation'>
DateTimeDuration = <class 'mlm_insights.core.metrics.datetime_metrics.datetime_duration.DateTimeDuration'>
DateTimeMax = <class 'mlm_insights.core.metrics.datetime_metrics.datetime_max.DateTimeMax'>
DateTimeMin = <class 'mlm_insights.core.metrics.datetime_metrics.datetime_min.DateTimeMin'>
DistinctCount = <class 'mlm_insights.core.metrics.distinct_count.DistinctCount'>
DuplicateCount = <class 'mlm_insights.core.metrics.duplicate_count.DuplicateCount'>
FBetaScore = <class 'mlm_insights.core.metrics.classification_metrics.fbeta_score.FBetaScore'>
FalseNegativeRate = <class 'mlm_insights.core.metrics.classification_metrics.false_negative_rate.FalseNegativeRate'>
FalsePositiveRate = <class 'mlm_insights.core.metrics.classification_metrics.false_positive_rate.FalsePositiveRate'>
FrequencyDistribution = <class 'mlm_insights.core.metrics.frequency_distribution.FrequencyDistribution'>
IQR = <class 'mlm_insights.core.metrics.iqr.IQR'>
IsConstantFeature = <class 'mlm_insights.core.metrics.is_constant_feature.IsConstantFeature'>
IsNegative = <class 'mlm_insights.core.metrics.is_negative.IsNegative'>
IsNonZero = <class 'mlm_insights.core.metrics.is_non_zero.IsNonZero'>
IsPositive = <class 'mlm_insights.core.metrics.is_positive.IsPositive'>
IsQuasiConstantFeature = <class 'mlm_insights.core.metrics.is_quasi_constant_feature.IsQuasiConstantFeature'>
JensenShannon = <class 'mlm_insights.core.metrics.drift_metrics.jensen_shannon.JensenShannon'>
KolmogorovSmirnov = <class 'mlm_insights.core.metrics.drift_metrics.kolmogorov_smirnov.KolmogorovSmirnov'>
KullbackLeibler = <class 'mlm_insights.core.metrics.drift_metrics.kullback_leibler.KullbackLeibler'>
Kurtosis = <class 'mlm_insights.core.metrics.kurtosis.Kurtosis'>
LogLoss = <class 'mlm_insights.core.metrics.classification_metrics.log_loss.LogLoss'>
Max = <class 'mlm_insights.core.metrics.max.Max'>
MaxError = <class 'mlm_insights.core.metrics.regression_metrics.max_error.MaxError'>
Mean = <class 'mlm_insights.core.metrics.mean.Mean'>
MeanAbsoluteError = <class 'mlm_insights.core.metrics.regression_metrics.mean_absolute_error.MeanAbsoluteError'>
MeanAbsolutePercentageError = <class 'mlm_insights.core.metrics.regression_metrics.mean_absolute_percentage_error.MeanAbsolutePercentageError'>
MeanSquaredError = <class 'mlm_insights.core.metrics.regression_metrics.mean_squared_error.MeanSquaredError'>
MeanSquaredLogError = <class 'mlm_insights.core.metrics.regression_metrics.mean_squared_log_error.MeanSquaredLogError'>
Min = <class 'mlm_insights.core.metrics.min.Min'>
Mode = <class 'mlm_insights.core.metrics.mode.Mode'>
PearsonCorrelation = <class 'mlm_insights.core.metrics.data_quality.pearson_correlation.PearsonCorrelation'>
Percentiles = <class 'mlm_insights.core.metrics.percentiles.Percentiles'>
PopulationStabilityIndex = <class 'mlm_insights.core.metrics.drift_metrics.population_stability_index.PopulationStabilityIndex'>
PrecisionRecallAreaUnderCurve = <class 'mlm_insights.core.metrics.classification_metrics.precision_recall_auc.PrecisionRecallAreaUnderCurve'>
PrecisionRecallCurve = <class 'mlm_insights.core.metrics.classification_metrics.precision_recall_curve.PrecisionRecallCurve'>
PrecisionScore = <class 'mlm_insights.core.metrics.classification_metrics.precision_score.PrecisionScore'>
ProbabilityDistribution = <class 'mlm_insights.core.metrics.probablity_distribution.ProbabilityDistribution'>
Quartiles = <class 'mlm_insights.core.metrics.quartiles.Quartiles'>
R2Score = <class 'mlm_insights.core.metrics.regression_metrics.r2_score.R2Score'>
ROCAreaUnderCurve = <class 'mlm_insights.core.metrics.classification_metrics.roc_auc.ROCAreaUnderCurve'>
ROCCurve = <class 'mlm_insights.core.metrics.classification_metrics.roc.ROCCurve'>
Range = <class 'mlm_insights.core.metrics.range.Range'>
RecallScore = <class 'mlm_insights.core.metrics.classification_metrics.recall_score.RecallScore'>
RootMeanSquaredError = <class 'mlm_insights.core.metrics.regression_metrics.root_mean_squared_error.RootMeanSquaredError'>
RowCount = <class 'mlm_insights.core.metrics.rows_count.RowCount'>
Skewness = <class 'mlm_insights.core.metrics.skewness.Skewness'>
Specificity = <class 'mlm_insights.core.metrics.classification_metrics.specificity.Specificity'>
StandardDeviation = <class 'mlm_insights.core.metrics.standard_deviation.StandardDeviation'>
Sum = <class 'mlm_insights.core.metrics.sum.Sum'>
TopKFrequentElements = <class 'mlm_insights.core.metrics.top_k_frequent_elements.TopKFrequentElements'>
TypeMetric = <class 'mlm_insights.core.metrics.type_metric.TypeMetric'>
Variance = <class 'mlm_insights.core.metrics.variance.Variance'>
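
Since each enum member's value is the corresponding metric class, the enum can be used to look up metric classes by name; a small sketch relying only on standard Enum behaviour:

from mlm_insights.core.metrics.framework_metrics_enum import FrameworkMetrics

# Each member's value is the metric class itself
count_class = FrameworkMetrics.Count.value
print(count_class)                      # <class 'mlm_insights.core.metrics.count.Count'>
print(FrameworkMetrics["Mean"].value)   # look up a metric class by its member name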

mlm_insights.core.metrics.frequency_distribution module

class mlm_insights.core.metrics.frequency_distribution.FrequencyDistribution(config: ~typing.Dict[str, ~mlm_insights.constants.definitions.ConfigParameter] = <factory>, bins: str | int | ~typing.List[float] = 'sturges')

Bases: MetricBase

Frequency Distribution
This metric calculates the frequency distribution of a single data column.
This is a feature level metric which can process any column type and only numerical (int, float) data types.
This is an approximate metric.
Internally, it uses a sketch data structure with a default K value of 200.

Configuration

bins: Union[str, int, List[float]], default='sturges'
One of the following values:
- Number of bins (int)
- Binning algorithm name (str); the default is 'sturges'
- Explicit bin edges (List[float])
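
For illustration, the three accepted forms could be passed through MetricMetadata config as in the sketch below; the config key "bins" is assumed to match the attribute name documented further down:

from mlm_insights.builder.builder_component import MetricDetail
from mlm_insights.core.metrics.frequency_distribution import FrequencyDistribution
from mlm_insights.core.metrics.metric_metadata import MetricMetadata

# Assumed config key "bins"; each form mirrors one entry of the Configuration list above
by_count = MetricMetadata(klass=FrequencyDistribution, config={"bins": 10})                       # number of bins
by_algorithm = MetricMetadata(klass=FrequencyDistribution, config={"bins": "sturges"})            # binning algorithm
by_edges = MetricMetadata(klass=FrequencyDistribution, config={"bins": [0.0, 5.0, 10.0, 20.0]})   # explicit bin edges

metric_details = MetricDetail(univariate_metric={"square_feet": [by_edges]}, dataset_metrics=[])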

Returns

bins: List[float]

Bin edges of the data.

frequency: List[int]

Frequency counts for each bin.

Example

from mlm_insights.builder.builder_component import MetricDetail, EngineDetail
from mlm_insights.builder.insights_builder import InsightsBuilder
from mlm_insights.constants.types import FeatureType, DataType, VariableType, ColumnType
from mlm_insights.core.metrics.frequency_distribution import FrequencyDistribution
from mlm_insights.core.metrics.metric_metadata import MetricMetadata
import pandas as pd

def main():
    input_schema = {
        'square_feet': FeatureType(
            data_type=DataType.FLOAT,
            variable_type=VariableType.CONTINUOUS,
            column_type=ColumnType.INPUT)
    }
    data_frame = pd.DataFrame({'square_feet': [1, 1, 2, 3, 4, 5, 7, 10, 11, 20]})
    metric_details = MetricDetail(univariate_metric=
                                  {"square_feet": [MetricMetadata(klass=FrequencyDistribution)]},
                                  dataset_metrics=[])

    runner = InsightsBuilder() \
        .with_input_schema(input_schema) \
        .with_data_frame(data_frame=data_frame) \
        .with_metrics(metrics=metric_details) \
        .with_engine(engine=EngineDetail(engine_name="native")) \
        .build()

    profile_json = runner.run().profile.to_json()
    feature_metrics = profile_json['feature_metrics']
    print(feature_metrics['square_feet']["FrequencyDistribution"])
    # {'bins': [1.0, 4.8, 8.6, 12.399999999999999, 16.2, 20.0], 'frequency': [5, 2, 2, 0, 1]}

Returns the standard metric result as:
{
    "metric_name": "FrequencyDistribution",
    "metric_description": "Feature Metric to compute Frequency distribution",
    "variable_count": 2,
    "variable_names": ['bins', 'frequency'],
    "variable_types": ["CONTINUOUS", "CONTINUOUS"],
    "variable_dtypes": ["FLOAT", "FLOAT"],
    "variable_dimensions": [1, 1],
    "metric_data": [[1.0, 4.8, 8.6, 12.399999999999999, 16.2, 20.0], [5, 2, 2, 0, 1]],
    "metadata": {},
    "error": null
  }
bins: str | int | List[float] = 'sturges'
classmethod create(config: Dict[str, ConfigParameter] | None = None) FrequencyDistribution

Factory Method to create an object. The configuration will be available in config.

Returns

FrequencyDistribution

An Instance of FrequencyDistribution.

get_required_shareable_feature_components() List[SFCMetaData]

Returns a list of Shareable Feature Components containing 1 SFC that is Quantiles SFC.

Returns

List[SFCMetaData]

List of SFCMetadata, containing only 1 SFC i.e. QuantilesSFC

get_result(**kwargs: Any) Dict[str, Any]

Returns the FrequencyDistribution of data.

Returns

Dict

The frequency distribution of the data.

get_standard_metric_result(**kwargs: Any) StandardMetricResult

This method returns metric output in standard format.

Returns

StandardMetricResult

merge(other_metric: FrequencyDistribution, **kwargs: Any) FrequencyDistribution

Merge two FrequencyDistribution metrics into one, without mutating either.

Parameters

other_metric : FrequencyDistribution

Other FrequencyDistribution metric that needs to be merged.

Returns

FrequencyDistribution

A new instance of Frequency Distribution metric after merging.

mlm_insights.core.metrics.iqr module

class mlm_insights.core.metrics.iqr.IQR(config: ~typing.Dict[str, ~mlm_insights.constants.definitions.ConfigParameter] = <factory>)

Bases: MetricBase

Inter Quartile Range
This metric calculates the inter-quartile range of a single numerical data column, namely Q3 - Q1.
The k-th quartile represents the (k * (n + 1) / 4)-th term of the overall dataset.
This is a feature level metric which can process any column type and only numerical (int, float) data types.
This is an approximate metric.
Internally, it uses a sketch data structure with a default K value of 200.

Configuration

None

Returns

iqr: float

the IQR of the data (Q3 - Q1).

Examples

from mlm_insights.builder.builder_component import MetricDetail, EngineDetail
from mlm_insights.builder.insights_builder import InsightsBuilder
from mlm_insights.constants.types import FeatureType, DataType, VariableType, ColumnType
from mlm_insights.core.metrics.iqr import IQR
from mlm_insights.core.metrics.metric_metadata import MetricMetadata
import pandas as pd

def main():
    input_schema = {
        'square_feet': FeatureType(
            data_type=DataType.FLOAT,
            variable_type=VariableType.CONTINUOUS,
            column_type=ColumnType.INPUT)
    }
    data_frame = pd.DataFrame({'square_feet': [-1, -2, -3, -4]})
    metric_details = MetricDetail(univariate_metric=
                                  {"square_feet": [MetricMetadata(klass=IQR)]},
                                  dataset_metrics=[])

    runner = InsightsBuilder() \
        .with_input_schema(input_schema) \
        .with_data_frame(data_frame=data_frame) \
        .with_metrics(metrics=metric_details) \
        .with_engine(engine=EngineDetail(engine_name="native")) \
        .build()

    profile_json = runner.run().profile.to_json()
    feature_metrics = profile_json['feature_metrics']
    print(feature_metrics['square_feet']["IQR"])
    # {'value': 2}
if __name__ == "__main__":
    main()

Returns the standard metric result as:
{
    'metric_name': 'IQR',
    'metric_description': 'Feature Metric to compute IQR',
    'variable_count': 1,
    'variable_names': ['i_q_r'],
    'variable_types': [CONTINUOUS],
    'variable_dtypes': [FLOAT],
    'variable_dimensions': [0],
    'metric_data': [2],
    'metadata': {},
    'error': None
}
config: Dict[str, ConfigParameter]
classmethod create(config: Dict[str, ConfigParameter] | None = None) IQR

Factory Method to create an object. The configuration will be available in config.

Returns

An Instance of IQR.

get_required_shareable_feature_components() List[SFCMetaData]

Returns a list of Shareable Feature Components containing 1 SFC that is Quantiles SFC.

Returns

List of SFCMetadata, containing only 1 SFC i.e. QuantilesSFC

get_result(**kwargs: Any) Dict[str, Any]

Returns the IQR of data.

Returns

float: the IQR of the data.

get_standard_metric_result(**kwargs: Any) StandardMetricResult

Returns IQR Metric in Standard format.

Returns

StandardMetricResult: IQR Metric in Standard format.

mlm_insights.core.metrics.is_constant_feature module

class mlm_insights.core.metrics.is_constant_feature.IsConstantFeature(config: ~typing.Dict[str, ~mlm_insights.constants.definitions.ConfigParameter] = <factory>)

Bases: MetricBase

Constant Feature metric computes whether all the values are the same.
This metric returns is_constant as True when all the values within the feature are the same.
This is a Univariate, feature level metric which can process any column type and any data type.
This is an approximate metric.
Internally, it uses a sketch data structure with a default K value of 1024.

Configuration

None

Returns

is_constant: boolean
  • True if all values are the same

Example

import pandas as pd

from mlm_insights.builder.builder_component import MetricDetail, EngineDetail
from mlm_insights.builder.insights_builder import InsightsBuilder
from mlm_insights.constants.types import FeatureType, DataType, VariableType
from mlm_insights.core.metrics.is_constant_feature import IsConstantFeature
from mlm_insights.core.metrics.metric_metadata import MetricMetadata

df = pd.DataFrame({"Age": [1, 1, 1, 1]})
metric_details = MetricDetail(univariate_metric={"Age": [MetricMetadata(klass=IsConstantFeature)]}, dataset_metrics=[])
input_schema = {"Age": FeatureType(DataType.INTEGER, VariableType.NOMINAL)}

runner = InsightsBuilder() \
    .with_input_schema(input_schema) \
    .with_data_frame(data_frame=df) \
    .with_metrics(metrics=metric_details) \
    .with_engine(engine=EngineDetail(engine_name="native")) \
    .build()

profile_json = runner.run().profile.to_json()
feature_metrics = profile_json['feature_metrics']
print(feature_metrics["Age"])


{
  "IsConstantFeature": {
    "metric_name": "IsConstantFeature",
    "metric_description": "Feature Metric to compute if all values are same",
    "variable_count": 1,
    "variable_names": ["is_constant"],
    "variable_types": ["BINARY"],
    "variable_dtypes": ["BOOLEAN"],
    "variable_dimensions": [0],
    "metric_data": [true],
    "metadata": {},
    "error": null
  }
}
config: Dict[str, ConfigParameter]
classmethod create(config: Dict[str, ConfigParameter] | None = None) IsConstantFeature

Factory Method to create an object.

Returns

An Instance of IsConstantFeature Univariate Metric.

get_required_shareable_feature_components() List[SFCMetaData]

Returns list of SFCs required to compute IsConstantFeature Univariate Metric.

Returns

List: list of SFCs

get_result(**kwargs: Any) Dict[str, Any]

Returns IsConstantFeature Univariate Metric for the data using the FrequentItemsSFC.

Returns

boolean: IsConstantFeature Univariate Metric of the data.

get_standard_metric_result(**kwargs: Any) StandardMetricResult

Returns IsConstant Metric in Standard format.

Returns

StandardMetricResult: IsConstant Metric in Standard format.

mlm_insights.core.metrics.is_negative module

class mlm_insights.core.metrics.is_negative.IsNegative(config: ~typing.Dict[str, ~mlm_insights.constants.definitions.ConfigParameter] = <factory>)

Bases: MetricBase

This metric computes if the provided numerical feature has all negative values.
This is a Univariate, feature level metric which can process any column type and only numerical (int, float) data types.
This is an exact metric.

Configuration

None

Returns

is_negative: boolean
  • True if all values are negative (strictly less than zero)

  • False otherwise (including when all values are np.nan)

Examples

import pandas as pd

from mlm_insights.builder.builder_component import MetricDetail, EngineDetail
from mlm_insights.builder.insights_builder import InsightsBuilder
from mlm_insights.constants.types import FeatureType, DataType, VariableType
from mlm_insights.core.metrics.is_negative import IsNegative
from mlm_insights.core.metrics.metric_metadata import MetricMetadata

df = pd.DataFrame({"Age": [-1, -4, -6, -11]})
metric_details = MetricDetail(univariate_metric={"Age": [MetricMetadata(klass=IsNegative)]}, dataset_metrics=[])
input_schema = {"Age": FeatureType(DataType.INTEGER, VariableType.NOMINAL)}

runner = InsightsBuilder() \
    .with_input_schema(input_schema) \
    .with_data_frame(data_frame=df) \
    .with_metrics(metrics=metric_details) \
    .with_engine(engine=EngineDetail(engine_name="native")) \
    .build()

profile_json = runner.run().profile.to_json()
feature_metrics = profile_json['feature_metrics']
print(feature_metrics["Age"])

{
    "IsNegative": {
        "metric_name": "IsNegative",
        "metric_description": "Feature Metric to compute if all values are negative",
        "variable_count": 1,
        "variable_names": ["is_negative"],
        "variable_types": ["BINARY"],
        "variable_dtypes": ["BOOLEAN"],
        "variable_dimensions": [0],
        "metric_data": [true],
        "metadata": {},
        "error": null
    }
}
config: Dict[str, ConfigParameter]
classmethod create(config: Dict[str, ConfigParameter] | None = None) IsNegative

Create IsNegative metric

Returns

An Instance of IsNegative metric.

get_required_shareable_feature_components() List[SFCMetaData]

Returns a list of Shareable Feature Components containing 1 SFC that is Descriptive Statistics SFC.

Returns

List of SFCMetadata, containing only 1 SFC i.e. DescriptiveStatisticsSFC

get_result(**kwargs: Any) Dict[str, Any]

Returns result of IsNegative metric for the data.

Returns

boolean: result of IsNegative metric.

get_standard_metric_result(**kwargs: Any) StandardMetricResult

Returns IsNegative metric in Standard format.

Returns

StandardMetricResult: IsNegative metric in Standard format.

mlm_insights.core.metrics.is_non_zero module

class mlm_insights.core.metrics.is_non_zero.IsNonZero(config: ~typing.Dict[str, ~mlm_insights.constants.definitions.ConfigParameter] = <factory>, is_non_zero: bool = False, zero_count: int = 0)

Bases: MetricBase

This metric computes if the provided numerical feature has all non-zero values.
This is a Univariate, feature level metric which can process any column type and only numerical (int, float) data types.
This is an exact metric.

Configuration

None

Returns

is_non_zero: boolean
  • True if all values are non-zero (np.nan is treated as non-zero)

  • False otherwise

Examples

import pandas as pd

from mlm_insights.builder.builder_component import MetricDetail, EngineDetail
from mlm_insights.builder.insights_builder import InsightsBuilder
from mlm_insights.constants.types import FeatureType, DataType, VariableType
from mlm_insights.core.metrics.is_non_zero import IsNonZero
from mlm_insights.core.metrics.metric_metadata import MetricMetadata

df = pd.DataFrame({"Age": [-1, -4, -6, -11]})
metric_details = MetricDetail(univariate_metric={"Age": [MetricMetadata(klass=IsNonZero)]}, dataset_metrics=[])
input_schema = {"Age": FeatureType(DataType.INTEGER, VariableType.NOMINAL)}

runner = InsightsBuilder() \
    .with_input_schema(input_schema) \
    .with_data_frame(data_frame=df) \
    .with_metrics(metrics=metric_details) \
    .with_engine(engine=EngineDetail(engine_name="native")) \
    .build()

profile_json = runner.run().profile.to_json()
feature_metrics = profile_json['feature_metrics']
print(feature_metrics["Age"])

{
    "IsNonZero": {
        "metric_name": "IsNonZero",
        "metric_description": "Feature Metric to compute if all values are non-zero",
        "variable_count": 1,
        "variable_names": ["is_non_zero"],
        "variable_types": ["BINARY"],
        "variable_dtypes": ["BOOLEAN"],
        "variable_dimensions": [0],
        "metric_data": [true],
        "metadata": {},
        "error": null
    }
}
compute(column: Series, **kwargs: Any) None

Computes IsNonZero for the passed in dataset.

In case of a partitioned dataset, the value of IsNonZero for the specific partition is computed.

Parameters

column : pd.Series

Input column.

classmethod create(config: Dict[str, ConfigParameter] | None = None) IsNonZero

Create IsNonZero metric.

Returns

An Instance of IsNonZero metric.

get_result(**kwargs: Any) Dict[str, Any]

Returns result of IsNonZero metric for the data.

Returns

boolean: result of IsNonZero metric.

get_standard_metric_result(**kwargs: Any) StandardMetricResult

Returns IsNonZero metric in Standard format.

Returns

StandardMetricResult: IsNonZero metric in Standard format.

is_non_zero: bool = False
merge(other_metric: IsNonZero, **kwargs: Any) IsNonZero

Merge two IsNonZero metrics into one, without mutating the others.

Parameters

other_metric : IsNonZero

Other IsNonZero metric that needs to be merged.

Returns

IsNonZero

A new instance of IsNonZero

zero_count: int = 0

mlm_insights.core.metrics.is_positive module

class mlm_insights.core.metrics.is_positive.IsPositive(config: ~typing.Dict[str, ~mlm_insights.constants.definitions.ConfigParameter] = <factory>)

Bases: MetricBase

This metric computes if the provided numerical feature has all positive values.
This is a Univariate, feature level metric which can process any column type and only numerical (int, float) data types.
This is an exact metric.

Configuration

None

Returns

is_positive: boolean
  • True if all values are positive (strictly greater than zero)

  • False otherwise (including when all values are np.nan)

Examples

import pandas as pd

from mlm_insights.builder.builder_component import MetricDetail, EngineDetail
from mlm_insights.builder.insights_builder import InsightsBuilder
from mlm_insights.constants.types import FeatureType, DataType, VariableType
from mlm_insights.core.metrics.is_positive import IsPositive
from mlm_insights.core.metrics.metric_metadata import MetricMetadata

df = pd.DataFrame({"Age": [1, 4, 6, 1]})
metric_details = MetricDetail(univariate_metric={"Age": [MetricMetadata(klass=IsPositive)]}, dataset_metrics=[])
input_schema = {"Age": FeatureType(DataType.INTEGER, VariableType.NOMINAL)}

runner = InsightsBuilder() \
    .with_input_schema(input_schema) \
    .with_data_frame(data_frame=df) \
    .with_metrics(metrics=metric_details) \
    .with_engine(engine=EngineDetail(engine_name="native")) \
    .build()

profile_json = runner.run().profile.to_json()
feature_metrics = profile_json['feature_metrics']
print(feature_metrics["Age"])

{
    "IsPositive": {
        "metric_name": "IsPositive",
        "metric_description": "Feature Metric to compute if all values are positive",
        "variable_count": 1,
        "variable_names": ["is_positive"],
        "variable_types": ["BINARY"],
        "variable_dtypes": ["BOOLEAN"],
        "variable_dimensions": [0],
        "metric_data": [true],
        "metadata": {},
        "error": null
    }
}
config: Dict[str, ConfigParameter]
classmethod create(config: Dict[str, ConfigParameter] | None = None) IsPositive

Create IsPositive metric

Returns

An Instance of IsPositive metric.

get_required_shareable_feature_components() List[SFCMetaData]

Returns a list of Shareable Feature Components containing 1 SFC that is Descriptive Statistics SFC.

Returns

List of SFCMetadata, containing only 1 SFC i.e. DescriptiveStatisticsSFC

get_result(**kwargs: Any) Dict[str, Any]

Returns result of IsPositive metric for the data.

Returns

boolean: result of IsPositive metric.

get_standard_metric_result(**kwargs: Any) StandardMetricResult

Returns IsPositive metric in Standard format.

Returns

StandardMetricResult: IsPositive metric in Standard format.

mlm_insights.core.metrics.is_quasi_constant_feature module

class mlm_insights.core.metrics.is_quasi_constant_feature.IsQuasiConstantFeature(config: ~typing.Dict[str, ~mlm_insights.constants.definitions.ConfigParameter] = <factory>, quasi_constant_threshold: float = 0.99)

Bases: MetricBase

Quasi Constant Feature metric computes whether almost all the values are the same.
This metric returns is_quasi_constant as True when a single value occurs with a frequency at or above the Quasi Constant Threshold.
This is a Univariate, feature level metric which can process any column type and any data type.
This is an approximate metric.
Internally, it uses a sketch data structure with a default K value of 1024.

Configuration

quasi_constant_threshold: float, default=0.99
  • Quasi Constant Threshold value; if the most frequent value's count percentage is >= this threshold, the feature is considered a Quasi Constant Feature

Returns

is_quasi_constant: boolean
  • True if almost all values are the same

Example

from mlm_insights.builder.builder_component import MetricDetail, EngineDetail
from mlm_insights.builder.insights_builder import InsightsBuilder
from mlm_insights.constants.types import FeatureType, DataType, VariableType
from mlm_insights.core.metrics.is_quasi_constant_feature import IsQuasiConstantFeature
from mlm_insights.core.metrics.metric_metadata import MetricMetadata
import pandas as pd

df = pd.DataFrame({"Age": [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2]})
metric_details = MetricDetail(univariate_metric={"Age": [MetricMetadata(klass=IsQuasiConstantFeature,
                                                                        config={"quasi_constant_threshold":0.8})]},
                              dataset_metrics=[])
input_schema = {"Age": FeatureType(DataType.INTEGER, VariableType.NOMINAL)}

runner = InsightsBuilder() \
    .with_input_schema(input_schema) \
    .with_data_frame(data_frame=df) \
    .with_metrics(metrics=metric_details) \
    .with_engine(engine=EngineDetail(engine_name="native")) \
    .build()

profile_json = runner.run().profile.to_json()
feature_metrics = profile_json['feature_metrics']
print(feature_metrics["Age"])


{
  "IsQuasiConstantFeature": {
    "metric_name": "IsQuasiConstantFeature",
    "metric_description": "Feature Metric to compute if all values are almost same",
    "variable_count": 1,
    "variable_names": ["is_quasi_constant"],
    "variable_types": ["BINARY"],
    "variable_dtypes": ["BOOLEAN"],
    "variable_dimensions": [0],
    "metric_data": [true],
    "metadata": {},
    "error": null
  }
}
classmethod create(config: Dict[str, ConfigParameter] | None = None) IsQuasiConstantFeature

Factory Method to create an object.

Returns

An Instance of IsQuasiConstantFeature Univariate Metric. The configuration will be available in config.

get_required_shareable_feature_components() List[SFCMetaData]

Returns list of SFCs required to compute IsQuasiConstantFeature Univariate Metric.

Returns

List: list of SFCs

get_result(**kwargs: Any) Dict[str, Any]

Returns IsQuasiConstantFeature Univariate Metric for the data using the FrequentItemsSFC.

Returns

boolean: IsQuasiConstantFeature Univariate Metric of the data.

get_standard_metric_result(**kwargs: Any) StandardMetricResult

Returns IsQuasiConstantFeature Metric in Standard format.

Returns

StandardMetricResult: IsQuasiConstantFeature Metric in Standard format.

merge(other_metric: IsQuasiConstantFeature, **kwargs: Any) IsQuasiConstantFeature

Merge two IsQuasiConstantFeature metrics into one, without mutating either.

Parameters

other_metric : IsQuasiConstantFeature

Other IsQuasiConstantFeature metric that needs to be merged.

Returns

IsQuasiConstantFeature

A new instance of IsQuasiConstantFeature after merging.

quasi_constant_threshold: float = 0.99

mlm_insights.core.metrics.kurtosis module

class mlm_insights.core.metrics.kurtosis.Kurtosis(config: ~typing.Dict[str, ~mlm_insights.constants.definitions.ConfigParameter] = <factory>)

Bases: MetricBase

This metric calculates the Kurtosis of a single numerical data column.
This is a Univariate, feature level metric which can process any column type and only numerical (int, float) data types.
This is an exact metric.
Mathematically: central_moments[i] = sum((x - mean)^i) / N
Excess kurtosis is derived from the 4th central moment, normalised by the square of the 2nd central moment.

Configuration

None

Returns

kurtosis: float
  • Kurtosis of the data; returns None if no data is present

Examples

from mlm_insights.builder.builder_component import MetricDetail, EngineDetail
from mlm_insights.builder.insights_builder import InsightsBuilder
from mlm_insights.constants.types import FeatureType, DataType, VariableType, ColumnType
from mlm_insights.core.metrics.kurtosis import Kurtosis
from mlm_insights.core.metrics.metric_metadata import MetricMetadata
import pandas as pd

def main():
    input_schema = {
        'square_feet': FeatureType(
            data_type=DataType.FLOAT,
            variable_type=VariableType.CONTINUOUS,
            column_type=ColumnType.INPUT)
    }
    data_frame = pd.DataFrame({'square_feet': [11.23, 23.45, 11.23, 45.56, 11.23]})
    metric_details = MetricDetail(univariate_metric=
                                  {"square_feet": [MetricMetadata(klass=Kurtosis)]},
                                  dataset_metrics=[])

    runner = InsightsBuilder() \
        .with_input_schema(input_schema) \
        .with_data_frame(data_frame=data_frame) \
        .with_metrics(metrics=metric_details) \
        .with_engine(engine=EngineDetail(engine_name="native")) \
        .build()

    profile_json = runner.run().profile.to_json()
    feature_metrics = profile_json['feature_metrics']
    print(feature_metrics['square_feet']["Kurtosis"])
    # {'value': -0.4098628688922519}
if __name__ == "__main__":
    main()

Returns the standard metric result as:
{
    'metric_name': 'Kurtosis',
    'metric_description': 'Feature Metric to compute Kurtosis',
    'variable_count': 1,
    'variable_names': ['kurtosis'],
    'variable_types': [CONTINUOUS],
    'variable_dtypes': [FLOAT],
    'variable_dimensions': [0],
    'metric_data': [0.45],
    'metadata': {},
    'error': None
}
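
For reference, the value in the example above is consistent with the biased sample excess kurtosis, m4 / m2**2 - 3, built from the central moments defined above; a minimal NumPy sketch (NumPy is used here only for illustration and is not required by the metric):

import numpy as np

x = np.array([11.23, 23.45, 11.23, 45.56, 11.23])
mean = x.mean()
m2 = ((x - mean) ** 2).mean()    # 2nd central moment
m4 = ((x - mean) ** 4).mean()    # 4th central moment
print(m4 / m2 ** 2 - 3)          # ~ -0.4099, matching the Kurtosis example output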
config: Dict[str, ConfigParameter]
classmethod create(config: Dict[str, ConfigParameter] | None = None) Kurtosis

Factory Method to create an object. The configuration will be available in config.

Returns

An Instance of Kurtosis.

get_required_shareable_feature_components() List[SFCMetaData]

Returns a list of Shareable Feature Components containing 1 SFC that is Descriptive Statistics SFC.

Returns

List of SFCMetadata, containing only 1 SFC i.e. DescriptiveStatisticsSFC

get_result(**kwargs: Any) Dict[str, Any]

Returns Excess Kurtosis of data.

Returns

float: Excess Kurtosis of the data.

get_standard_metric_result(**kwargs: Any) StandardMetricResult

Returns Standard Metric for Kurtosis.

Returns

StandardMetricResult: Kurtosis Metric in standard format.

mlm_insights.core.metrics.max module

class mlm_insights.core.metrics.max.Max(config: ~typing.Dict[str, ~mlm_insights.constants.definitions.ConfigParameter] = <factory>)

Bases: MetricBase

This metric calculates Max of a single numerical data column
This is a Univariate, feature level metric which can process any column type and only numerical (int, float) data types.
This is an exact metric

Configuration

None

Returns

max: float
  • Maximum of the data; returns None if no data is present

Examples

import pandas as pd

from mlm_insights.builder.builder_component import MetricDetail, EngineDetail
from mlm_insights.builder.insights_builder import InsightsBuilder
from mlm_insights.constants.types import FeatureType, DataType, VariableType
from mlm_insights.core.metrics.max import Max
from mlm_insights.core.metrics.metric_metadata import MetricMetadata
df = pd.DataFrame({"Age": [1, 4, 6, 1]})
metric_details = MetricDetail(univariate_metric={"Age": [MetricMetadata(klass=Max)]}, dataset_metrics=[])
input_schema = {"Age": FeatureType(DataType.INTEGER, VariableType.NOMINAL)}

runner = InsightsBuilder() \
    .with_input_schema(input_schema) \
    .with_data_frame(data_frame=df) \
    .with_metrics(metrics=metric_details) \
    .with_engine(engine=EngineDetail(engine_name="native")) \
    .build()

profile_json = runner.run().profile.to_json()
feature_metrics = profile_json['feature_metrics']
print(feature_metrics["Age"])

{
    "Max": {
        "metric_name": "Max",
        "metric_description": "Feature Metric to compute maximum value",
        "variable_count": 1,
        "variable_names": ["maximum"],
        "variable_types": ["CONTINUOUS"],
        "variable_dtypes": ["FLOAT"],
        "variable_dimensions": [0],
        "metric_data": [6.0],
        "metadata": {},
        "error": null
    }
}
config: Dict[str, ConfigParameter]
classmethod create(config: Dict[str, ConfigParameter] | None = None) Max

Factory Method to create an object. The configuration will be available in config.

Returns

Max

An Instance of Max.

get_required_shareable_feature_components() List[SFCMetaData]

Returns a list of Shareable Feature Components containing 1 SFC that is Descriptive Statistics SFC.

Returns

List of SFCMetadata, containing only 1 SFC i.e. DescriptiveStatisticsSFC

get_result(**kwargs: Any) Dict[str, Any]

Returns maximum of input data.

Returns

float: maximum of the data.

get_standard_metric_result(**kwargs: Any) StandardMetricResult

Returns Maximum Metric in Standard format.

Returns

StandardMetricResult: Maximum Metric in Standard format.

mlm_insights.core.metrics.mean module

class mlm_insights.core.metrics.mean.Mean(config: ~typing.Dict[str, ~mlm_insights.constants.definitions.ConfigParameter] = <factory>)

Bases: MetricBase

This metric calculates Mean of a single numerical data column
This is a Univariate, feature level metric which can process any column type and only numerical (int, float) data types.
This is an exact metric

Configuration

None

Returns

mean: float
  • Mean of the data; returns None if no data is present

Examples

from mlm_insights.builder.builder_component import MetricDetail, EngineDetail
from mlm_insights.builder.insights_builder import InsightsBuilder
from mlm_insights.constants.types import FeatureType, DataType, VariableType, ColumnType
from mlm_insights.core.metrics.mean import Mean
from mlm_insights.core.metrics.metric_metadata import MetricMetadata
import pandas as pd

def main():
    input_schema = {
        'square_feet': FeatureType(
            data_type=DataType.FLOAT,
            variable_type=VariableType.CONTINUOUS,
            column_type=ColumnType.INPUT)
    }
    data_frame = pd.DataFrame({'square_feet': [11.23, 23.45, 11.23, 45.56, 11.23]})
    metric_details = MetricDetail(univariate_metric=
                                  {"square_feet": [MetricMetadata(klass=Mean)]},
                                  dataset_metrics=[])

    runner = InsightsBuilder() \
        .with_input_schema(input_schema) \
        .with_data_frame(data_frame=data_frame) \
        .with_metrics(metrics=metric_details) \
        .with_engine(engine=EngineDetail(engine_name="native")) \
        .build()

    profile_json = runner.run().profile.to_json()
    feature_metrics = profile_json['feature_metrics']
    print(feature_metrics['square_feet']["Mean"])
    # {'value': 20.54}
if __name__ == "__main__":
    main()

Returns the standard metric result as:
{
    'metric_name': 'Mean',
    'metric_description': 'Feature Metric to compute Mean',
    'variable_count': 1,
    'variable_names': ['mean'],
    'variable_types': [CONTINUOUS],
    'variable_dtypes': [FLOAT],
    'variable_dimensions': [0],
    'metric_data': [20.54],
    'metadata': {},
    'error': None
}
config: Dict[str, ConfigParameter]
classmethod create(config: Dict[str, ConfigParameter] | None = None) Mean

Factory Method to create an object. The configuration will be available in config.

Returns

An Instance of Mean.

get_required_shareable_feature_components() List[SFCMetaData]

Returns list of SFCs required to compute mean metric.

Returns

List: list of SFCs

get_result(**kwargs: Any) Dict[str, Any]

Returns mean metric for the data using the DescriptiveStatisticsSFC.

Returns

float: mean of the data.

get_standard_metric_result(**kwargs: Any) StandardMetricResult

Returns Standard Metric for Mean.

Returns

StandardMetricResult: Mean Metric in standard format.

mlm_insights.core.metrics.metric_metadata module

class mlm_insights.core.metrics.metric_metadata.MetricMetadata(klass: ~typing.Type[~typing.Any], config: ~typing.Dict[str, ~typing.Any] = <factory>)

Bases: object

Represents metric metadata used to define and configure a metric

config: Dict[str, Any]
get_key() str

Returns a key which uniquely identifies this metric. Since a metric can be added only once to a feature/profile, the key contains only the name, which uniquely identifies the metric

klass: Type[Any]
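
A small illustrative usage based only on the behaviour described above; the exact key string is not specified here beyond being the name that uniquely identifies the metric:

from mlm_insights.core.metrics.count import Count
from mlm_insights.core.metrics.metric_metadata import MetricMetadata

metadata = MetricMetadata(klass=Count, config={})
# Per the description above, the key contains the name that uniquely identifies the metric
print(metadata.get_key())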

mlm_insights.core.metrics.metric_registry module

class mlm_insights.core.metrics.metric_registry.MetricRegistry

Bases: object

add_metric(metric_metadata: MetricMetadata, **kwargs: Any) MetricRegistry
static create_from_metrics_map(metrics_map: Dict[str, MetricBase]) MetricRegistry

Factory method to create Metric Registry using Metric Map. Use this method to create metric registry directly from the metric map.

Parameters

metrics_map : Dict[str, MetricBase]

Dictionary of metrics, keyed by metric hash, with MetricBase instances as values.

classmethod deserialize(metric_registry_message: MetricRegistryMessage) MetricRegistry
get_metric(metric_metadata: MetricMetadata) MetricBase
get_metrics() Any
get_metrics_map() Dict[str, MetricBase]
serialize() MetricRegistryMessage

mlm_insights.core.metrics.metric_result module

class mlm_insights.core.metrics.metric_result.MetricResultJSONEncoder(*, skipkeys=False, ensure_ascii=True, check_circular=True, allow_nan=True, sort_keys=False, indent=None, separators=None, default=None)

Bases: JSONEncoder

default(obj: Any) Any

Implement this method in a subclass such that it returns a serializable object for o, or calls the base implementation (to raise a TypeError).

For example, to support arbitrary iterators, you could implement default like this:

def default(self, o):
    try:
        iterable = iter(o)
    except TypeError:
        pass
    else:
        return list(iterable)
    # Let the base class default method raise the TypeError
    return JSONEncoder.default(self, o)
class mlm_insights.core.metrics.metric_result.StandardMetricResult(metric_name: str = '', metric_description: str = '', variable_count: int = 0, variable_names: List[str] = <factory>, variable_types: List[mlm_insights.constants.types.VariableType] = <factory>, variable_dtypes: List[mlm_insights.constants.types.DataType] = <factory>, variable_dimensions: List[int] = <factory>, metric_data: List[Any] = <factory>, metadata: Dict[str, str] = <factory>, error: Union[str, NoneType] = None)

Bases: object

error: str | None = None
static get_metric_result_with_error(name: str, description: str, error: str) StandardMetricResult
has_error() bool
metadata: Dict[str, str]
metric_data: List[Any]
metric_description: str = ''
metric_name: str = ''
to_dict() Dict[str, Any]
to_json() str
variable_count: int = 0
variable_dimensions: List[int]
variable_dtypes: List[DataType]
variable_names: List[str]
variable_types: List[VariableType]
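
A hedged sketch of constructing a StandardMetricResult by hand and serialising it, using only the fields and defaults listed above (field values are illustrative):

from mlm_insights.constants.types import DataType, VariableType
from mlm_insights.core.metrics.metric_result import StandardMetricResult

result = StandardMetricResult(
    metric_name="Count",
    metric_description="Illustrative result",
    variable_count=1,
    variable_names=["total_count"],
    variable_types=[VariableType.CONTINUOUS],
    variable_dtypes=[DataType.INTEGER],
    variable_dimensions=[0],
    metric_data=[6],
    metadata={},
)
print(result.has_error())  # False, since no error was set
print(result.to_json())    # JSON string of the fields above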

mlm_insights.core.metrics.min module

class mlm_insights.core.metrics.min.Min(config: ~typing.Dict[str, ~mlm_insights.constants.definitions.ConfigParameter] = <factory>)

Bases: MetricBase

This metric calculates Min of a single numerical data column
This is a Univariate, feature level metric which can process any column type and only numerical (int, float) data types.
This is an exact metric

Configuration

None

Returns

min: float
  • Minimum of the data

Examples

import pandas as pd

from mlm_insights.builder.builder_component import MetricDetail, EngineDetail
from mlm_insights.builder.insights_builder import InsightsBuilder
from mlm_insights.constants.types import FeatureType, DataType, VariableType
from mlm_insights.core.metrics.metric_metadata import MetricMetadata
from mlm_insights.core.metrics.min import Min

df = pd.DataFrame({"Age": [1, 4, 6, 1]})
metric_details = MetricDetail(univariate_metric={"Age": [MetricMetadata(klass=Min)]}, dataset_metrics=[])
input_schema = {"Age": FeatureType(DataType.INTEGER, VariableType.NOMINAL)}

runner = InsightsBuilder() \
    .with_input_schema(input_schema) \
    .with_data_frame(data_frame=df) \
    .with_metrics(metrics=metric_details) \
    .with_engine(engine=EngineDetail(engine_name="native")) \
    .build()

profile_json = runner.run().profile.to_json()
feature_metrics = profile_json['feature_metrics']
print(feature_metrics["Age"])

{
  "Min": {
    "metric_name": "Min",
    "metric_description": "Feature Metric to compute minimum value",
    "variable_count": 1,
    "variable_names": ["minimum"],
    "variable_types": ["CONTINUOUS"],
    "variable_dtypes": ["FLOAT"],
    "variable_dimensions": [0],
    "metric_data": [1.0],
    "metadata": {},
    "error": null
  }
}
config: Dict[str, ConfigParameter]
classmethod create(config: Dict[str, ConfigParameter] | None = None) Min

Factory Method to create an object. The configuration will be available in config.

Returns

An Instance of Min.

get_required_shareable_feature_components() List[SFCMetaData]

Returns a list of Shareable Feature Components containing 1 SFC that is Descriptive Statistics SFC.

Returns

List of SFCMetadata, containing only 1 SFC i.e. DescriptiveStatisticsSFC

get_result(**kwargs: Any) Dict[str, Any]

Returns minimum of input data.

Returns

float: minimum of the data.

get_standard_metric_result(**kwargs: Any) StandardMetricResult

Returns Minimum Metric in Standard format.

Returns

StandardMetricResult: Minimum Metric in Standard format.

mlm_insights.core.metrics.mode module

class mlm_insights.core.metrics.mode.Mode(config: ~typing.Dict[str, ~mlm_insights.constants.definitions.ConfigParameter] = <factory>, _lg_max_k: int = 10)

Bases: MetricBase

This metric calculates the Mode for the given data column. It returns the most frequently occurring item as the mode. In bi-modal or multi-modal cases, two modes are returned.
This is a feature level metric which can process both numerical and categorical data types.
This is an approximate metric which uses a Frequent Items Sketch to calculate the most frequently occurring item(s).
This metric handles NaN values by dropping them from the given data column

The Frequent Items Sketch is initialized with a maxMapSize that specifies the maximum physical length of the internal hash map of the form (<T> item, long count). The maxMapSize must be a power of 2. If fewer than 0.75 * maxMapSize different items are inserted into the sketch the estimated frequencies returned by the sketch will be exact, hence exact mode will be returned. Otherwise, items are returned with their estimated frequencies and mode will be approximate.

NOTE: In case the metric result doesn’t contain any output, then the user will need to tweak the maxMapSize by providing a higher value for ‘lg_max_k’.

Please refer here for more details: https://datasketches.apache.org/docs/Frequency/FrequentItemsOverview.html

Configuration

lg_max_k: int, default=10
  • log of max map size (max map size = 2^lg_max_k). So, with default value of lg_max_k as 10, map size used by Frequent items sketch will be 2^10 = 1024

Returns

mode: List[String]
  • The mode of the given data column. In bi-modal or multi-modal cases, two modes are returned.

Exceptions

  • InvalidParameterException - in case lg_max_k < 7 or lg_max_k > 21

Examples

from mlm_insights.builder.builder_component import MetricDetail
from mlm_insights.core.metrics.metric_metadata import MetricMetadata
from mlm_insights.core.metrics.mode import Mode

# To declare the Mode metric, without any config
metric_details = MetricDetail(univariate_metric={"feature_name": [MetricMetadata(klass=Mode)]},
                              dataset_metrics=[])
# To declare the Mode metric, along with config options
metric_details = MetricDetail(univariate_metric={"feature_name": [MetricMetadata(klass=Mode, config={"lg_max_k": 12})]},
                              dataset_metrics=[])

Returns the standard metric result as:
{
    'metric_name': 'Mode',
    'metric_description': 'Mode',
    'variable_count': 1,
    'variable_names': ['mode'],
    'variable_types': [NOMINAL],
    'variable_dtypes': [STRING],
    'variable_dimensions': [1],
    'metric_data': [['1', '3']],
    'metadata': {},
    'error': None
}
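
For completeness, an end-to-end run following the same builder pattern used by the other metrics in this module might look like the sketch below (data values are illustrative; the printed result follows the standard format above):

import pandas as pd

from mlm_insights.builder.builder_component import MetricDetail, EngineDetail
from mlm_insights.builder.insights_builder import InsightsBuilder
from mlm_insights.constants.types import FeatureType, DataType, VariableType
from mlm_insights.core.metrics.metric_metadata import MetricMetadata
from mlm_insights.core.metrics.mode import Mode

df = pd.DataFrame({"Age": [1, 3, 3, 1, 2]})
metric_details = MetricDetail(univariate_metric={"Age": [MetricMetadata(klass=Mode)]}, dataset_metrics=[])
input_schema = {"Age": FeatureType(DataType.INTEGER, VariableType.NOMINAL)}

runner = InsightsBuilder() \
    .with_input_schema(input_schema) \
    .with_data_frame(data_frame=df) \
    .with_metrics(metrics=metric_details) \
    .with_engine(engine=EngineDetail(engine_name="native")) \
    .build()

profile_json = runner.run().profile.to_json()
print(profile_json['feature_metrics']["Age"]["Mode"])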
classmethod create(config: Dict[str, ConfigParameter] | None = None) Mode

Factory Method to create an object. The configuration will be available in config.

Returns

An Instance of Mode.

get_required_shareable_feature_components() List[SFCMetaData]

Returns list of SFCs required to compute Mode metric.

Returns

List: list of SFCs

get_result(**kwargs: Any) Dict[str, Any]

Returns Mode metric for the data using the FrequentItemsSFC.

Returns

List[str]: the most frequent item(s) in the data.

get_standard_metric_result(**kwargs: Any) StandardMetricResult

Returns Mode Metric in Standard format.

Returns

StandardMetricResult: Mode Metric in Standard format.

merge(other_metric: Mode, **kwargs: Any) Mode

Merge two Mode Metrics into one, without mutating the others.

Parameters

other_metric : Mode

Other Mode that needs to be merged.

Returns

Mode

A new instance of Mode after merging.

mlm_insights.core.metrics.percentiles module

class mlm_insights.core.metrics.percentiles.Percentiles(config: ~typing.Dict[str, ~mlm_insights.constants.definitions.ConfigParameter] = <factory>, percentile_values: ~typing.List[str] = <factory>)

Bases: MetricBase

This metric calculates the user-provided percentiles of the given data column.
This is a feature level metric which can process only numerical (int, float) data types.
This is an approximate metric. Internally, it uses a KLL sketch data structure with a default k value of 200
This metric handles NaN values by dropping them from the given data column

Configuration

percentile_values: List[str]

Each percentile to be computed is given in the form p<percentile>. For example, [p5, p95] computes the 5th and 95th percentile values

Returns

  • All user configured percentile values in Insights metrics standard format

Examples

import pandas as pd

from mlm_insights.builder.builder_component import MetricDetail, EngineDetail
from mlm_insights.builder.insights_builder import InsightsBuilder
from mlm_insights.constants.types import FeatureType, DataType, VariableType, ColumnType
from mlm_insights.core.metrics.percentiles import Percentiles
from mlm_insights.core.metrics.metric_metadata import MetricMetadata


def main():
    input_schema = {
        'square_feet': FeatureType(
            data_type=DataType.FLOAT,
            variable_type=VariableType.CONTINUOUS,
            column_type=ColumnType.INPUT)
    }
    data_frame = pd.DataFrame({'square_feet': [3, 5, 1, 7, 8, 4, 9]})
    # Configure which percentiles to compute; these match the standard result shown below
    metric_details = MetricDetail(univariate_metric=
                                  {"square_feet": [MetricMetadata(
                                      klass=Percentiles,
                                      config={"percentile_values": ["p5", "p35", "p95"]})]},
                                  dataset_metrics=[])

    runner = InsightsBuilder(). \
        with_input_schema(input_schema). \
        with_data_frame(data_frame=data_frame). \
        with_metrics(metrics=metric_details). \
        with_engine(engine=EngineDetail(engine_name="native")). \
        build()

    profile_json = runner.run().profile.to_json()
    feature_metrics = profile_json['feature_metrics']
    print(feature_metrics['square_feet']["Percentiles"])

    if __name__ == "__main__":
        main()

    Returns the standard metric result as:
    {
        metric_name: 'Percentiles',
        metric_description: 'Feature Metric to compute user-provided percentile values',
        variable_count: 3,
        variable_names: ['p5', 'p35', 'p95'],
        variable_types: [CONTINUOUS, CONTINUOUS, CONTINUOUS],
        variable_dtypes: [FLOAT, FLOAT, FLOAT],
        variable_dimensions: [0, 0, 0],
        metric_data=[3.0, 5.0, 8.0],
        metadata={},
        error=None
    }
classmethod create(config: Dict[str, ConfigParameter] | None = None) Percentiles

Factory Method to create an object. The configuration will be available in config.

Returns

An Instance of Percentiles.

get_percentile_value_from_sfc(percentile_rank: float, quantile_sfc: QuantilesSFC) float | None
get_required_shareable_feature_components() List[SFCMetaData]

Returns list of SFCs required to compute percentile metric.

Returns

List: list of SFCs

get_result(**kwargs: Any) Dict[str, Any]

Returns Percentiles metric for the data using the QuantilesSFC.

Returns

Json object: percentiles of the data.

get_standard_metric_result(**kwargs: Any) StandardMetricResult

This method returns metric output in standard format.

Returns

StandardMetricResult

percentile_values: List[str]

mlm_insights.core.metrics.performance_metrics_utils module

class mlm_insights.core.metrics.performance_metrics_utils.PerformanceMetricMeta(target_column_name: str = 'y_true', prediction_column_name: str = 'y_predict', target_series: pandas.core.series.Series = None, prediction_series: pandas.core.series.Series = None, prediction_score_column_name: str = 'y_score', prediction_score_series: pandas.core.series.Series = None)

Bases: object

prediction_column_name: str = 'y_predict'
prediction_score_column_name: str = 'y_score'
prediction_score_series: Series = None
prediction_series: Series = None
target_column_name: str = 'y_true'
target_series: Series = None
class mlm_insights.core.metrics.performance_metrics_utils.PredictionTargetColumnMapping(target_column: str = 'y_true', prediction_column: str = 'y_predict', prediction_score_column: str = 'y_score')

Bases: object

prediction_column: str = 'y_predict'
prediction_score_column: str = 'y_score'
target_column: str = 'y_true'
mlm_insights.core.metrics.performance_metrics_utils.get_column(dataset: DataFrame, column_name: str) Series
mlm_insights.core.metrics.performance_metrics_utils.get_target_prediction_columns(features_metadata: Dict[str, FeatureMetadata]) PredictionTargetColumnMapping
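
PredictionTargetColumnMapping and PerformanceMetricMeta are plain data holders used to wire target and prediction columns into performance metrics. A minimal sketch of how the default column names line up is shown below; the overridden score-column name is a hypothetical value, and get_target_prediction_columns is omitted because it requires a FeatureMetadata mapping defined elsewhere.

from mlm_insights.core.metrics.performance_metrics_utils import PredictionTargetColumnMapping

# Defaults: target column 'y_true', prediction column 'y_predict', score column 'y_score'.
# Any of them can be overridden via keyword arguments ('churn_probability' is hypothetical).
mapping = PredictionTargetColumnMapping(prediction_score_column="churn_probability")
print(mapping.target_column, mapping.prediction_column, mapping.prediction_score_column)
# y_true y_predict churn_probability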

mlm_insights.core.metrics.probablity_distribution module

class mlm_insights.core.metrics.probablity_distribution.ProbabilityDistribution(config: ~typing.Dict[str, ~mlm_insights.constants.definitions.ConfigParameter] = <factory>, bins: str | int | ~typing.List[float] = 'sturges')

Bases: MetricBase

This metric calculates the Probability Distribution of a single data column. The Probability Distribution of a Random Variable (X) shows how the probabilities of events are distributed over the different values of the Random Variable.
This is a feature level metric which can process only numerical data types.
This is an approximate metric. Internally, it uses a KLL sketch data structure with a default k value of 200
This metric handles NaN values by dropping them from the given data column

Configuration

bins: Union[str, int, List[float]], default='sturges'
One of the following values:
  • Number of bins (int)

  • Binning algorithm name (str). 'sturges' is the default and currently the only supported algorithm; other algorithms will be supported in the future

  • Explicit list of bin edges (List[float])

Returns

bins: List[float]

bins of the data.

density: List[float]

Density/probabilities of occurrence for the respective bins

Example

# To declare ProbabilityDistribution metric, without any config
metric_details = MetricDetail(univariate_metric= {"feature_name": [MetricMetadata(klass=ProbabilityDistribution)
                              ]}, dataset_metrics=[])
# To declare ProbabilityDistribution metric, along with config options
metric_details = MetricDetail(univariate_metric= {"feature_name": [MetricMetadata(klass=ProbabilityDistribution,
                              config={"bins":10})]}, dataset_metrics=[])

Returns the standard metric result as:
{
    "metric_name": "ProbabilityDistribution",
    "metric_description": "Feature Metric to compute probability density",
    "variable_count": 2,
    "variable_names": ['bins', 'density'],
    "variable_types": [CONTINUOUS, CONTINUOUS],
    "variable_dtypes": [FLOAT, FLOAT],
    "variable_dimensions": [1, 1],
    "metric_data": [[1.0, 1.5, 2.0], [0.5, 0.5]],
    "metadata": {},
    "error": null
}
bins: str | int | List[float] = 'sturges'
classmethod create(config: Dict[str, ConfigParameter] | None = None) ProbabilityDistribution

Factory Method to create an object. The configuration will be available in config.

Returns

ProbabilityDistribution

An Instance of ProbabilityDistribution.

get_required_shareable_feature_components() List[SFCMetaData]

Returns list of SFCs required to compute PDF metric.

Returns

List[SFCMetaData]

list of SFCs

get_result(**kwargs: Any) Dict[str, Any]

Returns Probability Distribution for the data using the QuantilesSFC.

Returns

Dict

Probability Distribution for the data.

get_standard_metric_result(**kwargs: Any) StandardMetricResult

Returns Standard Metric for Probability Distribution.

Returns

StandardMetricResult: Probability Distribution Metric in standard format.

merge(other_metric: ProbabilityDistribution, **kwargs: Any) ProbabilityDistribution

Merge two ProbabilityDistribution metrics into one, without mutating either.

Parameters

other_metric : ProbabilityDistribution

Other ProbabilityDistribution that needs to be merged.

Returns

ProbabilityDistribution

A new instance of Probability Distribution after merging.

mlm_insights.core.metrics.quartiles module

class mlm_insights.core.metrics.quartiles.Quartiles(config: ~typing.Dict[str, ~mlm_insights.constants.definitions.ConfigParameter] = <factory>)

Bases: MetricBase

This metric calculates the Quartiles (Q1, Q2, Q3) of the given data column.
This is a feature level metric which can process only numerical (int, float) data types.
This is an approximate metric. Internally, it uses a KLL sketch data structure with a default k value of 200
This metric handles NaN values by dropping them from the given data column

Configuration

None

Returns

  • Q1: Lower quartile (25th percentile), the value halfway between the minimum and the median.

  • Q2: Second quartile (50th percentile, also known as the median), the middle value of the data.

  • Q3: Upper quartile (75th percentile), the value halfway between the median and the maximum.

Examples

import pandas as pd

from mlm_insights.builder.builder_component import MetricDetail, EngineDetail
from mlm_insights.builder.insights_builder import InsightsBuilder
from mlm_insights.constants.types import FeatureType, DataType, VariableType, ColumnType
from mlm_insights.core.metrics.quartiles import Quartiles
from mlm_insights.core.metrics.metric_metadata import MetricMetadata


def main():
    input_schema = {
        'square_feet': FeatureType(
            data_type=DataType.FLOAT,
            variable_type=VariableType.CONTINUOUS,
            column_type=ColumnType.INPUT)
    }
    data_frame = pd.DataFrame({'square_feet': [3, 5, 1, 7, 8, 4, 9]})
    metric_details = MetricDetail(univariate_metric=
                                    {"square_feet": [MetricMetadata(klass=Quartiles)]},
                                    dataset_metrics=[])

    runner = InsightsBuilder(). \
        with_input_schema(input_schema). \
        with_data_frame(data_frame=data_frame). \
        with_metrics(metrics=metric_details). \
        with_engine(engine=EngineDetail(engine_name="native")). \
        build()

    profile_json = runner.run().profile.to_json()
    feature_metrics = profile_json['feature_metrics']
    print(feature_metrics['square_feet']["Quartiles"])
    # {'q1': 3.0, 'q2': 5.0, 'q3': 8.0}

    if __name__ == "__main__":
        main()

    Returns the standard metric result as:
    {
        metric_name: 'Quartiles',
        metric_description: 'Feature Metric to compute Quartiles (Q1, Q2, Q3)',
        variable_count: 3,
        variable_names: ['q1', 'q2', 'q3'],
        variable_types: [CONTINUOUS, CONTINUOUS, CONTINUOUS],
        variable_dtypes: [FLOAT, FLOAT, FLOAT],
        variable_dimensions: [0, 0, 0],
        metric_data=[3.0, 5.0, 8.0],
        metadata={},
        error=None
    }
config: Dict[str, ConfigParameter]
classmethod create(config: Dict[str, ConfigParameter] | None = None) Quartiles

Factory Method to create an object. The configuration will be available in config.

Returns

An Instance of Quartiles.

get_required_shareable_feature_components() List[SFCMetaData]

Returns list of SFCs required to compute quartiles metric.

Returns

List: list of SFCs

get_result(**kwargs: Any) Dict[str, Any]

Returns Quartiles metric for the data using the QuantilesSFC.

Returns

Json object: quartiles of the data.

get_standard_metric_result(**kwargs: Any) StandardMetricResult

This method returns metric output in standard format.

Returns

StandardMetricResult

mlm_insights.core.metrics.range module

class mlm_insights.core.metrics.range.Range(config: ~typing.Dict[str, ~mlm_insights.constants.definitions.ConfigParameter] = <factory>)

Bases: MetricBase

This metric calculates the Range of a single numerical data column. Range is the difference between the largest and smallest values
This is a feature level metric which can process only numerical (int, float) data types.
This is an exact metric.
This metric handles NaN values by dropping them from the given data column

Returns

float: Range of the data.

Example

import pandas as pd

from mlm_insights.builder.builder_component import MetricDetail, EngineDetail
from mlm_insights.builder.insights_builder import InsightsBuilder
from mlm_insights.constants.types import FeatureType, DataType, VariableType
from mlm_insights.core.metrics.metric_metadata import MetricMetadata
from mlm_insights.core.metrics.range import Range

df = pd.DataFrame({"Age": [1, 4, 6, 1]})
metric_details = MetricDetail(univariate_metric={"Age": [MetricMetadata(klass=Range)]}, dataset_metrics=[])
input_schema = {"Age": FeatureType(DataType.INTEGER, VariableType.NOMINAL)}

runner = InsightsBuilder(). \
    with_input_schema(input_schema). \
    with_data_frame(data_frame=df). \
    with_metrics(metrics=metric_details). \
    with_engine(engine=EngineDetail(engine_name="native")). \
    build()

profile_json = runner.run().profile.to_json()
feature_metrics = profile_json['feature_metrics']
print(feature_metrics["Age"])

{
  "Range": {
    "metric_name": "Range",
    "metric_description": "Feature Metric to compute range value",
    "variable_count": 1,
    "variable_names": ["range"],
    "variable_types": ["CONTINUOUS"],
    "variable_dtypes": ["FLOAT"],
    "variable_dimensions": [0],
    "metric_data": [5.0],
    "metadata": {},
    "error": null
  }
}
config: Dict[str, ConfigParameter]
classmethod create(config: Dict[str, ConfigParameter] | None = None) Range

Factory Method to create an object. The configuration will be available in config.

Returns

MetricBase

An Instance of MetricBase.

get_required_shareable_feature_components() List[SFCMetaData]

Returns a list of Shareable Feature Components containing 1 SFC that is Descriptive Statistics SFC.

Returns

List of SFCMetadata, containing only 1 SFC i.e. DescriptiveStatisticsSFC

get_result(**kwargs: Any) Dict[str, Any]

Returns range of input data.

Returns

float: range of the data.

get_standard_metric_result(**kwargs: Any) StandardMetricResult

Returns Range Metric in Standard format.

Returns

StandardMetricResult: Range Metric in Standard format.

mlm_insights.core.metrics.rows_count module

class mlm_insights.core.metrics.rows_count.RowCount(config: ~typing.Dict[str, ~mlm_insights.constants.definitions.ConfigParameter] = <factory>, row_count: int = 0)

Bases: DatasetMetricBase

This metric calculates the total row count of the Dataset
This Dataset level metric is an exact metric.
This metric doesn’t handle NaN values. If certain rows have NaN values, there is no impact on the RowCount

Configuration

None

Returns

integer: total row count of the dataset

Examples

import pandas as pd

from mlm_insights.builder.builder_component import MetricDetail, EngineDetail
from mlm_insights.builder.insights_builder import InsightsBuilder
from mlm_insights.constants.types import FeatureType, DataType, VariableType, ColumnType
from mlm_insights.core.metrics.rows_count import RowCount
from mlm_insights.core.metrics.metric_metadata import MetricMetadata

def main():
    input_schema = {
        'square_feet': FeatureType(
            data_type=DataType.FLOAT,
            variable_type=VariableType.CONTINUOUS,
            column_type=ColumnType.INPUT)
    }
    data_frame = pd.DataFrame({'square_feet': [11.23, 23.45, 11.23, 45.56, 11.23]})
    metric_details = MetricDetail(univariate_metric={},
                                  dataset_metrics=[MetricMetadata(klass=RowCount)])

    runner = InsightsBuilder(). \
        with_input_schema(input_schema). \
        with_data_frame(data_frame=data_frame). \
        with_metrics(metrics=metric_details). \
        with_engine(engine=EngineDetail(engine_name="native")). \
        build()

    profile_json = runner.run().profile.to_json()
    dataset_metrics = profile_json['dataset_metrics']
    print(dataset_metrics["RowCount"])
    # {'value': 5}
if __name__ == "__main__":
    main()

Returns the standard metric result as:
{
    'metric_name': 'RowCount',
    'metric_description': 'Dataset-level Metric to compute the total row count of the dataset',
    'variable_count': 1,
    'variable_names': ['rows_count'],
    'variable_types': [DISCRETE],
    'variable_dtypes': [INTEGER],
    'variable_dimensions': [0],
    'metric_data': [5],
    'metadata': {},
    'error': None
}
compute(dataset: DataFrame, **kwargs: Any) None

Calculates the metric value(s) from the passed DataFrame and sets the internal state with the value(s). When a metric is being computed for a partitioned dataset, this method is invoked for each partition; write the logic required to derive the metric value for that specific partition in this method.

Parameters

dataset : pd.DataFrame

DataFrame object for either the entire dataset or a partition on which the metric is being computed

classmethod create(config: Dict[str, ConfigParameter] | None = None, **kwargs: Any) RowCount

Factory Method to create an object. The configuration will be available in config.

Returns

MetricBase

An Instance of MetricBase.

get_result(**kwargs: Any) Dict[str, Any]

Returns the computed value of the metric

Returns

Dict[str, Any]: Dictionary with key as string and value as any metric property.

get_standard_metric_result(**kwargs: Any) StandardMetricResult

This method returns metric output in standard format.

Returns

StandardMetricResult

merge(other_metric: RowCount, **kwargs: Any) RowCount

Merge the other metric with the current metric and return a new instance of the metric. Use this method to merge the states of the two metrics to produce a statistically correct state. Note: do not mutate the current metric; create a new instance instead.

Parameters

other_metric : DatasetMetricBase

The second metric which the current metric is being merged with

Returns

DatasetMetricBase: New, merged DatasetMetricBase instance

row_count: int = 0

mlm_insights.core.metrics.serializer module

mlm_insights.core.metrics.serializer.do_metric_deserialize(klass: Any, metric_message: MetricMessage) Any
mlm_insights.core.metrics.serializer.do_metric_serialize(metric: Any) MetricMessage
mlm_insights.core.metrics.serializer.get_metric_class(metric_name: str, klass: Any | None = None) Any

mlm_insights.core.metrics.skewness module

class mlm_insights.core.metrics.skewness.Skewness(config: ~typing.Dict[str, ~mlm_insights.constants.definitions.ConfigParameter] = <factory>)

Bases: MetricBase

This metric calculates Skewness of a single numerical data column. Skewness is a measure of the asymmetry of a data set
This is a feature level metric which can process only numerical (int, float) data types.
This is an exact metric.
This metric handles NaN values by dropping them from the given data column
Interpretation of the data distribution based on the skewness value:
  • Skewness = 0: the distribution is symmetric (as in a normal distribution).

  • Skewness > 0: the right tail of the distribution is longer

  • Skewness < 0: the left tail of the distribution is longer

Configuration

None

Returns

float: Skewness of the data

Examples

import pandas as pd

from mlm_insights.builder.builder_component import MetricDetail, EngineDetail
from mlm_insights.builder.insights_builder import InsightsBuilder
from mlm_insights.constants.types import FeatureType, DataType, VariableType, ColumnType
from mlm_insights.core.metrics.skewness import Skewness
from mlm_insights.core.metrics.metric_metadata import MetricMetadata

def main():
    input_schema = {
        'square_feet': FeatureType(
            data_type=DataType.FLOAT,
            variable_type=VariableType.CONTINUOUS,
            column_type=ColumnType.INPUT)
    }
    data_frame = pd.DataFrame({'square_feet': [11.23, 23.45, 11.23, 45.56, 11.23]})
    metric_details = MetricDetail(univariate_metric=
                                  {"square_feet": [MetricMetadata(klass=Skewness)]},
                                  dataset_metrics=[])

    runner = InsightsBuilder(). \
        with_input_schema(input_schema). \
        with_data_frame(data_frame=data_frame). \
        with_metrics(metrics=metric_details). \
        with_engine(engine=EngineDetail(engine_name="native")). \
        build()

    profile_json = runner.run().profile.to_json()
    feature_metrics = profile_json['feature_metrics']
    print(feature_metrics['square_feet']["Skewness"])
    # {'value': 1.1088349707251306}
if __name__ == "__main__":
    main()

Returns the standard metric result as:
{
    'metric_name': 'Skewness',
    'metric_description': 'Feature Metric to compute Skewness',
    'variable_count': 1,
    'variable_names': ['skewness'],
    'variable_types': [CONTINUOUS],
    'variable_dtypes': [FLOAT],
    'variable_dimensions': [0],
    'metric_data': [1.1088349707251306],
    'metadata': {},
    'error': None
}
config: Dict[str, ConfigParameter]
classmethod create(config: Dict[str, ConfigParameter] | None = None) Skewness

Factory Method to create an object. The configuration will be available in config.

Returns

An Instance of Skewness.

get_required_shareable_feature_components() List[SFCMetaData]

Returns a list of Shareable Feature Components containing 1 SFC that is Descriptive Statistics SFC.

Returns

List of SFCMetadata, containing only 1 SFC i.e. DescriptiveStatisticsSFC

get_result(**kwargs: Any) Dict[str, Any]

Returns skewness of input data.

Returns

float: skewness of the data.

get_standard_metric_result(**kwargs: Any) StandardMetricResult

This method returns metric output in standard format.

Returns

StandardMetricResult

mlm_insights.core.metrics.standard_deviation module

class mlm_insights.core.metrics.standard_deviation.StandardDeviation(config: ~typing.Dict[str, ~mlm_insights.constants.definitions.ConfigParameter] = <factory>)

Bases: MetricBase

This metric calculates Standard Deviation of a single numerical data column, a measure of the spread of a distribution.
This is a feature level metric which can process only numerical (int, float) data types.
This is an exact metric.
This metric handles NaN values by dropping them from the given data column

Configuration

None

Returns

float: Standard Deviation of the feature

Examples

import pandas as pd

from mlm_insights.builder.builder_component import MetricDetail, EngineDetail
from mlm_insights.builder.insights_builder import InsightsBuilder
from mlm_insights.constants.types import FeatureType, DataType, VariableType, ColumnType
from mlm_insights.core.metrics.standard_deviation import StandardDeviation
from mlm_insights.core.metrics.metric_metadata import MetricMetadata

def main():
    input_schema = {
        'square_feet': FeatureType(
            data_type=DataType.FLOAT,
            variable_type=VariableType.CONTINUOUS,
            column_type=ColumnType.INPUT)
    }
    data_frame = pd.DataFrame({'square_feet': [11.23, 23.45, 11.23, 45.56, 11.23]})
    metric_details = MetricDetail(univariate_metric=
                                  {"square_feet": [MetricMetadata(klass=StandardDeviation)]},
                                  dataset_metrics=[])

    runner = InsightsBuilder(). \
        with_input_schema(input_schema). \
        with_data_frame(data_frame=data_frame). \
        with_metrics(metrics=metric_details). \
        with_engine(engine=EngineDetail(engine_name="native")). \
        build()

    profile_json = runner.run().profile.to_json()
    feature_metrics = profile_json['feature_metrics']
    print(feature_metrics['square_feet']["StandardDeviation"])
    # {'value': 13.375326538070016}
if __name__ == "__main__":
    main()

Returns the standard metric result as:
{
    'metric_name': 'StandardDeviation',
    'metric_description': 'Feature Metric to compute Standard Deviation',
    'variable_count': 1,
    'variable_names': ['standard_deviation'],
    'variable_types': [CONTINUOUS],
    'variable_dtypes': [FLOAT],
    'variable_dimensions': [0],
    'metric_data': [13.375326538070016],
    'metadata': {},
    'error': None
}
config: Dict[str, ConfigParameter]
classmethod create(config: Dict[str, ConfigParameter] | None = None) StandardDeviation

Factory Method to create an object. The configuration will be available in config.

Returns

MetricBase

An Instance of MetricBase.

get_required_shareable_feature_components() List[SFCMetaData]

Returns list of SFCs required to compute Standard Deviation metric.

Returns

List: list of SFCs

get_result(**kwargs: Any) Dict[str, Any]

Returns Standard Deviation metric for the data using the DescriptiveStatisticsSFC.

Returns

float: Standard Deviation of the data.

get_standard_metric_result(**kwargs: Any) StandardMetricResult

This method returns metric output in standard format.

Returns

StandardMetricResult

mlm_insights.core.metrics.sum module

class mlm_insights.core.metrics.sum.Sum(config: ~typing.Dict[str, ~mlm_insights.constants.definitions.ConfigParameter] = <factory>, sum: float = 0.0)

Bases: MetricBase

This metric calculates Sum of a single numerical data column.
This is a feature level metric which can process only numerical (int, float) data types.
This is an exact metric.
This metric handles NaN values by dropping them from the given data column

Returns

float: Sum of the feature data

Examples

# To declare Sum metric:
metric_details = MetricDetail(univariate_metric= {"feature_name": [MetricMetadata(klass=Sum)]},
                              dataset_metrics=[])
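
For a complete run, the declaration above can be combined with the same builder flow used by the other metric examples in this module. The sketch below is illustrative; the sample values are assumptions, and the expected sum of 102.7 simply follows from them (NaN values are dropped before summing).

import pandas as pd

from mlm_insights.builder.builder_component import MetricDetail, EngineDetail
from mlm_insights.builder.insights_builder import InsightsBuilder
from mlm_insights.constants.types import FeatureType, DataType, VariableType, ColumnType
from mlm_insights.core.metrics.sum import Sum
from mlm_insights.core.metrics.metric_metadata import MetricMetadata


def main():
    input_schema = {
        'square_feet': FeatureType(
            data_type=DataType.FLOAT,
            variable_type=VariableType.CONTINUOUS,
            column_type=ColumnType.INPUT)
    }
    # The None value is dropped, so the expected sum is 102.7 (up to floating-point rounding)
    data_frame = pd.DataFrame({'square_feet': [11.2, 23.4, 11.2, 45.7, 11.2, None]})
    metric_details = MetricDetail(
        univariate_metric={"square_feet": [MetricMetadata(klass=Sum)]},
        dataset_metrics=[])

    runner = InsightsBuilder(). \
        with_input_schema(input_schema). \
        with_data_frame(data_frame=data_frame). \
        with_metrics(metrics=metric_details). \
        with_engine(engine=EngineDetail(engine_name="native")). \
        build()

    profile_json = runner.run().profile.to_json()
    print(profile_json['feature_metrics']['square_feet']["Sum"])


if __name__ == "__main__":
    main()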
compute(column: Series, **kwargs: Any) None

Computes the sum for the passed in dataset. In case of a partitioned dataset, computes the sum for the specific partition

Parameters

column : pd.Series

Input column.

classmethod create(config: Dict[str, ConfigParameter] | None = None) Sum

Factory Method to create an object. The configuration will be available in config.

Returns

MetricBase

An Instance of MetricBase.

get_result(**kwargs: Any) Dict[str, Any]

Returns sum of input data.

Returns

float: Sum of the data.

get_standard_metric_result(**kwargs: Any) StandardMetricResult

Returns Standard Metric for sum.

Returns

StandardMetricResult: Sum Metric in standard format.

classmethod get_supported_variable_types() List[VariableType]

Method to retrieve the list of Feature Variable type supported for the metric

Returns

List of Feature Variable type supported by the Sum metric

merge(other_metric: Sum, **kwargs: Any) Sum

Merge two Sum metrics into one, without mutating the others.

Parameters

other_metric : Sum

Other Sum metric that needs to be merged.

Returns

Sum

A new instance of Sum metric after merging.

sum: float = 0.0

mlm_insights.core.metrics.top_k_frequent_elements module

class mlm_insights.core.metrics.top_k_frequent_elements.TopKFrequentElements(config: ~typing.Dict[str, ~mlm_insights.constants.definitions.ConfigParameter] = <factory>, _lg_max_k: int = 10, k: int = 10)

Bases: MetricBase

This metric calculates the Top K frequent elements for the given data column. It returns the most frequent items (aka heavy hitters) along with the estimated frequency of occurrence of each item.
This is a feature level metric which can process both numerical and categorical data types.
This is an approximate metric which uses a Frequent Items Sketch to return estimated frequency of the items.
This metric handles NaN values by dropping them from the given data column

The Frequent Items Sketch is initialized with a maxMapSize that specifies the maximum physical length of the internal hash map of the form (<T> item, long count). The maxMapSize must be a power of 2. If fewer than 0.75 * maxMapSize different items are inserted into the sketch the estimated frequencies returned by the sketch will be exact. Otherwise, items are returned with their estimated frequencies.

NOTE: In case the metric result doesn’t contain any output, the user will need to tweak the maxMapSize by providing a higher value for ‘lg_max_k’. The results may also be returned with a large difference between the upper and lower bounds; in that case the results are approximate, and increasing the maxMapSize yields tighter estimates.

Please refer here for more details: https://datasketches.apache.org/docs/Frequency/FrequentItemsOverview.html

Configuration

k: int, default=10
  • k value for how many top items to be returned by the metric

lg_max_k: int, default=10
  • Log (base 2) of the max map size (max map size = 2^lg_max_k). With the default lg_max_k of 10, the map size used by the Frequent Items Sketch is 2^10 = 1024

Returns

categories: String
  • The different categories (item name)

estimate: int
  • The estimated frequency for the given category (item name)

lower_bound: int
  • The lower bound for frequency of the given category. True frequency is always guaranteed to lie between lower bound and upper bound

upper_bound: int
  • The upper bound for frequency of the given category.

Exceptions

  • InvalidParameterException - in case lg_max_k < 7 or lg_max_k > 21

Examples

# To declare TopKFrequentElements metric, without any config
metric_details = MetricDetail(univariate_metric= {"feature_name": [MetricMetadata(klass=TopKFrequentElements)]},
                              dataset_metrics=[])
# To declare TopKFrequentElements metric, along with config options
metric_details = MetricDetail(univariate_metric= {"feature_name": [MetricMetadata(klass=TopKFrequentElements,
                              config={"k":20, "lg_max_k": 12})]}, dataset_metrics=[])

Returns the standard metric result as:
{
    'metric_name': 'TopKFrequentElements',
    'metric_description': 'Top K Frequent Elements',
    'variable_count': 4,
    'variable_names': ['categories', 'estimate', 'lower_bound', 'upper_bound'],
    'variable_types': [NOMINAL, CONTINUOUS, CONTINUOUS,
                       CONTINUOUS],
    'variable_dtypes': [STRING, INTEGER, INTEGER, INTEGER],
    'variable_dimensions': [1,1,1,1],
    'metric_data': [['3', '1', '2'], [5, 4, 3], [5, 4, 3], [5, 4, 3]],
    'metadata': {},
    'error': None
}
classmethod create(config: Dict[str, ConfigParameter] | None = None) TopKFrequentElements

Factory Method to create an object. The configuration will be available in config.

Returns

An Instance of TopKFrequentElements.

get_required_shareable_feature_components() List[SFCMetaData]

Returns list of SFCs required to compute Top K Frequent Elements Metric.

Returns

List: list of SFCs

get_result(**kwargs: Any) Dict[str, Any]

Returns Top K Frequent Elements Metric for the data using the FrequentItemsSFC.

Returns

Dict: Top K Frequent Elements Metric of the data.

get_standard_metric_result(**kwargs: Any) StandardMetricResult

Returns Top K Frequent Elements Metric in Standard format.

Returns

StandardMetricResult: Top K Frequent Elements Metric in Standard format.

k: int = 10
merge(other_metric: TopKFrequentElements, **kwargs: Any) TopKFrequentElements

Merge two Top K Frequent Elements Metric into one, without mutating the others.

Parameters

other_metric : TopKFrequentElements

Other TopKFrequentElements that needs to be merged.

Returns

TopKFrequentElements

A new instance of TopKFrequentElements after merging.

mlm_insights.core.metrics.type_metric module

class mlm_insights.core.metrics.type_metric.TypeMetric(config: ~typing.Dict[str, ~mlm_insights.constants.definitions.ConfigParameter] = <factory>, string_type_count: int = 0, integral_type_count: int = 0, fractional_type_count: int = 0, boolean_type_count: int = 0)

Bases: MetricBase

This metric calculates the count of data types for feature values. For a given feature, it returns how many strings, integers, floats and booleans are present in the feature data.
This is a feature level metric which can process a feature having any data types.
This is an exact metric.
This metric handles NaN values by dropping them from the given data column

Configuration

None

Returns

  • string_type_count: Count of number of feature values of type string

  • integral_type_count: Count of number of feature values of type integer

  • fractional_type_count: Count of number of feature values of type float

  • boolean_type_count: Count of number of feature values of type boolean

Examples

import pandas as pd
import numpy as np

from mlm_insights.builder.builder_component import MetricDetail, EngineDetail
from mlm_insights.builder.insights_builder import InsightsBuilder
from mlm_insights.constants.types import FeatureType, DataType, VariableType, ColumnType
from mlm_insights.core.metrics.type_metric import TypeMetric
from mlm_insights.core.metrics.metric_metadata import MetricMetadata


def main():
    input_schema = {
        'square_feet': FeatureType(
            data_type=DataType.FLOAT,
            variable_type=VariableType.CONTINUOUS,
            column_type=ColumnType.INPUT)
    }
    data_frame = pd.DataFrame({'square_feet': [0, 1, 2.0, 3, 4.4, True, False, 5, np.nan, 6.0, 7, None]})
    metric_details = MetricDetail(univariate_metric=
                                    {"square_feet": [MetricMetadata(klass=TypeMetric)]},
                                    dataset_metrics=[])

    runner = InsightsBuilder(). \
        with_input_schema(input_schema). \
        with_data_frame(data_frame=data_frame). \
        with_metrics(metrics=metric_details). \
        with_engine(engine=EngineDetail(engine_name="native")). \
        build()

    profile_json = runner.run().profile.to_json()
    feature_metrics = profile_json['feature_metrics']
    print(feature_metrics['square_feet']["TypeMetric"])
    # {'string_type_count': 0, 'integral_type_count': 5, 'fractional_type_count': 3, 'boolean_type_count': 2}

    if __name__ == "__main__":
        main()

    Returns the standard metric result as:
    {
        metric_name: 'TypeMetric',
        metric_description: 'Feature Metric to compute count of data types for feature values',
        variable_count: 4,
        variable_names: ['string_type_count', 'integral_type_count', 'fractional_type_count', 'boolean_type_count],
        variable_types: [DISCRETE, DISCRETE, DISCRETE, DISCRETE],
        variable_dtypes: [INTEGER, INTEGER, INTEGER, INTEGER],
        variable_dimensions: [0, 0, 0, 0],
        metric_data=[0, 5, 3, 2],
        metadata={},
        error=None
    }
boolean_type_count: int = 0
compute(column: Series, **kwargs: Any) None

Computes TypeMetric for the passed in dataset.

In case of a partitioned dataset, the TypeMetric for the specific partition is computed.

Parameters

column : pd.Series

Input column.

classmethod create(config: Dict[str, ConfigParameter] | None = None) TypeMetric

Factory Method to create an object. The configuration will be available in config.

Returns

MetricBase

An Instance of MetricBase.

fractional_type_count: int = 0
get_result(**kwargs: Any) Dict[str, Any]

Returns Map containing string_type_count, integral_type_count, fractional_type_count, boolean_type_count of input data.

Returns

Map: Map containing string_type_count, integral_type_count, fractional_type_count, boolean_type_count of input data.

get_standard_metric_result(**kwargs: Any) StandardMetricResult

This method returns metric output in standard format.

Returns

StandardMetricResult

integral_type_count: int = 0
merge(other_metric: TypeMetric, **kwargs: Any) TypeMetric

Merge two TypeMetric into one, without mutating the others.

Parameters

other_metric : TypeMetric

Other TypeMetric that needs to be merged.

Returns

TypeMetric

A new instance of TypeMetric containing string_type_count, integral_type_count, fractional_type_count, boolean_type_count after merging.

string_type_count: int = 0

mlm_insights.core.metrics.variance module

class mlm_insights.core.metrics.variance.Variance(config: ~typing.Dict[str, ~mlm_insights.constants.definitions.ConfigParameter] = <factory>)

Bases: MetricBase

This metric calculates Variance of a single numerical data column, a measure of the spread of a distribution.
This is a feature level metric which can process only numerical (int, float) data types.
This is an exact metric.
This metric handles NaN values by dropping them from the given data column

Configuration

None

Returns

float: Variance of the feature

Examples

import pandas as pd

from mlm_insights.builder.builder_component import MetricDetail, EngineDetail
from mlm_insights.builder.insights_builder import InsightsBuilder
from mlm_insights.constants.types import FeatureType, DataType, VariableType, ColumnType
from mlm_insights.core.metrics.variance import Variance
from mlm_insights.core.metrics.metric_metadata import MetricMetadata

def main():
    input_schema = {
        'square_feet': FeatureType(
            data_type=DataType.FLOAT,
            variable_type=VariableType.CONTINUOUS,
            column_type=ColumnType.INPUT)
    }
    data_frame = pd.DataFrame({'square_feet': [11.23, 23.45, 11.23, 45.56, 11.23]})
    metric_details = MetricDetail(univariate_metric=
                                  {"square_feet": [MetricMetadata(klass=Variance)]},
                                  dataset_metrics=[])

    runner = InsightsBuilder(). \
        with_input_schema(input_schema). \
        with_data_frame(data_frame=data_frame). \
        with_metrics(metrics=metric_details). \
        with_engine(engine=EngineDetail(engine_name="native")). \
        build()

    profile_json = runner.run().profile.to_json()
    feature_metrics = profile_json['feature_metrics']
    print(feature_metrics['square_feet']["Variance"])
    # {'value': 178.89936000000003}
if __name__ == "__main__":
    main()

Returns the standard metric result as:
{
    'metric_name': 'Variance',
    'metric_description': 'Feature Metric to compute Variance',
    'variable_count': 1,
    'variable_names': ['variance'],
    'variable_types': [CONTINUOUS],
    'variable_dtypes': [FLOAT],
    'variable_dimensions': [0],
    'metric_data': [178.89936000000003],
    'metadata': {},
    'error': None
}
config: Dict[str, ConfigParameter]
classmethod create(config: Dict[str, ConfigParameter] | None = None) Variance

Factory Method to create an object. The configuration will be available in config.

Returns

MetricBase

An Instance of MetricBase.

get_required_shareable_feature_components() List[SFCMetaData]

Returns a list of Shareable Feature Components containing 1 SFC that is Descriptive Statistics SFC.

Returns

List of SFCMetadata, containing only 1 SFC i.e. DescriptiveStatisticsSFC

get_result(**kwargs: Any) Dict[str, Any]

Returns variance of input data.

Returns

float: variance of the data.

get_standard_metric_result(**kwargs: Any) StandardMetricResult

This method returns metric output in standard format.

Returns

StandardMetricResult

Module contents