Application Configuration¶

The ML Monitoring Application is set up and customized by authoring a JSON configuration. Save the configuration in an object store location and pass it in the CONFIG_FILE variable of RUNTIME_PARAMETER when starting a job run.

This document demonstrates how to define the application components and create an application configuration.

Sample Config File¶

ml-monitoring-config.json

{
  "monitor_id": "<monitor_id>",
  "storage_details": {
    "storage_type": "OciObjectStorage",
    "params": {
      "namespace": "<namespace>",
      "bucket_name": "<bucket_name>",
      "object_prefix": "<prefix>"
    }
  },
  "input_schema": {
    "Age": {
      "data_type": "integer",
      "variable_type": "continuous",
      "column_type": "input"
    },
    "EnvironmentSatisfaction": {
      "data_type": "integer",
      "variable_type": "continuous",
      "column_type": "input"
    }
  },
  "baseline_reader": {
    "type": "CSVDaskDataReader",
    "params": {
      "file_path": "oci://<path>"
    }
  },
  "prediction_reader": {
    "type": "CSVDaskDataReader",
    "params": {
      "data_source": {
        "type": "ObjectStorageFileSearchDataSource",
        "params": {
          "file_path": ["oci://<path>"],
          "filter_arg": [
            {
              "partition_based_date_range": {
                "start": "2023-06-26",
                "end": "2023-06-27",
                "data_format": ".d{4}-d{2}-d{2}."
              }
            }
          ]
        }
      }
    }
  },
  "dataset_metrics": [
    {
      "type": "RowCount"
    }
  ],
  "feature_metrics": {
    "Age": [
      {
        "type": "Min"
      },
      {
        "type": "Max"
      }
    ],
    "EnvironmentSatisfaction": [
      {
        "type": "Mode"
      },
      {
        "type": "Count"
      }
    ]
  },
  "transformers": [
    {
      "type": "ConditionalFeatureTransformer",
      "params": {
        "conditional_features": [
          {
            "feature_name": "Young",
            "data_type": "integer",
            "variable_type": "ordinal",
            "expression": "df.Age < 30"
          }
        ]
      }
    }
  ],
  "post_processors": [
    {
      "type": "SaveMetricOutputAsJsonPostProcessor",
      "params": {
        "file_name": "<file_name>",
        "test_results_file_name": "<test_result_file_name>",
        "file_location_expression": "<expression>",
        "date_range": {
          "start": "2023-08-01",
          "end": "2023-08-05"
        },
        "can_overwrite_profile_json": false,
        "can_overwrite_test_results_json": false,
        "namespace": "<namespace>",
        "bucket_name": "<bucket_name>"
      }
    },
    {
      "type": "OCIMonitoringApplicationPostProcessor",
      "params": {
        "compartment_id": "<COMPARTMENT_ID>",
        "namespace": "<NAMESPACE>",
        "date_range": {
          "start": "2023-08-01",
          "end": "2023-08-05"
        },
        "dimensions": {
          "key1": "value1",
          "key2": "value2"
        }
      }
    }
  ],
  "tags": {
    "tag": "value"
  },
  "test_config": {
    "tags": {
      "key_1": "these tags are sent in test results"
    },
    "feature_metric_tests": [
      {
        "feature_name": "Age",
        "tests": [
          {
            "test_name": "TestGreaterThan",
            "metric_key": "Min",
            "threshold_value": 17
          },
          {
            "test_name": "TestIsComplete"
          }
        ]
      }
    ],
    "dataset_metric_tests": [
      {
        "test_name": "TestGreaterThan",
        "metric_key": "RowCount",
        "threshold_value": 40,
        "tags": {
          "subtype": "falls-xgb"
        }
      }
    ]
  }
}
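
Before uploading a configuration to the object store, it can help to confirm locally that the file parses as JSON and contains the sections this page marks as required (monitor_id, storage_details, and input_schema). The following is a minimal sketch of such a check; the helper name missing_sections is hypothetical and is not part of the ML Monitoring Application.

```python
import json

# Top-level sections that this page describes as required for every run.
REQUIRED_SECTIONS = ("monitor_id", "storage_details", "input_schema")


def missing_sections(config: dict) -> list:
    """Return the required top-level sections that are absent from a parsed config."""
    return [name for name in REQUIRED_SECTIONS if name not in config]


# A trimmed-down configuration; the values are placeholders.
config_text = """
{
  "monitor_id": "speech_model_monitor",
  "storage_details": {
    "storage_type": "OciObjectStorage",
    "params": {"namespace": "<namespace>", "bucket_name": "<bucket_name>"}
  }
}
"""

# json.loads raises json.JSONDecodeError if the text is not well-formed JSON.
config = json.loads(config_text)
print(missing_sections(config))
```

A non-empty result lists the sections still to be added; the check does not validate the contents of each section.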

ML Monitoring Application Components¶

Monitor ID¶

This is a required component and must be defined in the config.

A user-provided ID that uniquely identifies a monitor configuration.

A monitor_id must follow these rules:

The length must be between 8 and 48 characters.
Valid characters are letters (uppercase or lowercase), numbers, hyphens, underscores, and periods.
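
The two rules can be expressed as a single regular expression. This is a sketch only: the function name is_valid_monitor_id is hypothetical, and it assumes that "letters" means ASCII letters.

```python
import re

# 8 to 48 characters drawn from ASCII letters, digits, periods, underscores, hyphens.
MONITOR_ID_PATTERN = re.compile(r"[A-Za-z0-9._-]{8,48}")


def is_valid_monitor_id(monitor_id: str) -> bool:
    """Check a candidate monitor_id against the length and character rules."""
    return MONITOR_ID_PATTERN.fullmatch(monitor_id) is not None


print(is_valid_monitor_id("speech_model_monitor"))
print(is_valid_monitor_id("short"))
```

The first call passes both rules; the second fails the minimum length of 8 characters.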

Description¶

Key	Value	Example
monitor_id	user-defined string	"monitor_id": "speech_model_monitor"

Example¶

{"monitor_id": "speech_model_monitor"}

Storage Details¶

This is a required component and must be defined in the config.

Details of the storage type and location used to retrieve the baseline profile (in the case of a prediction run) and to persist the internal state of a run.

Description¶

Field Name	Description	Example
storage_type	type of storage to be used for storing the internal state	"storage_type": "OciObjectStorage"
params	storage params (required)	"params": { "namespace": "<namespace>", "bucket_name": "<bucket_name>", "object_prefix": "<prefix>" }

Supported Storage Details¶

OciObjectStorage
Required Parameters
namespace - namespace of the bucket

bucket_name - bucket name
Optional Parameters
object_prefix - prefix for creating the directory for saving the internal state of the runs

Example¶

storage_details

"storage_details": {
    "storage_type": "OciObjectStorage",
    "params": {
        "namespace": "<namespace>",
        "bucket_name": "<bucket_name>",
        "object_prefix": "<object_prefix>"
    }
}

Input Schema¶

This is a required component and must be defined in the config.

Input schema is the map of features and their data types, variable types, and column type.

Description¶

Key	Value	Example
feature_name	object of key-value pairs for data_type, variable_type, and column_type	"Age": { "data_type": "integer", "variable_type": "continuous", "column_type": "input" }

Data Type (Required)
Data types can be provided for each feature of the input dataset which represent the type of the feature value.

Supported data_type - “integer”, “float”, “string”, “boolean”, “text”, “object”
Variable Type (Required)
Variable types can be provided for each feature of the input dataset which represent the type of a statistical random variable.

Supported variable_type - “continuous”, “discrete”, “nominal”, “ordinal”, “binary”, “text”, “object”
Column Type (Optional - Default value “input”)
Insights supports performance metrics for regression and classification models. In addition to these, Insights also supports multivariate metrics like Feature Importance. These metrics require the prediction columns or target columns (ground truth) to be in the input dataset. To make it easier to configure the metrics, Insights allows users to configure the prediction or target columns using the feature schema.

Supported column_type - “input”, “prediction”, “target”, “prediction_score”

Example¶

{
    "input_schema": {
        "sepal length (cm)": {
          "data_type": "float",
          "variable_type": "continuous",
          "column_type": "input"
        },
        "sepal width (cm)": {
          "data_type": "float",
          "variable_type": "continuous",
          "column_type": "input"
        }
    }
}
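
The supported values listed above can be checked mechanically before a run. The sketch below applies the documented default of "input" when column_type is omitted; the helper check_input_schema is hypothetical and only mirrors the lists on this page, not the application's own validation.

```python
SUPPORTED_DATA_TYPES = {"integer", "float", "string", "boolean", "text", "object"}
SUPPORTED_VARIABLE_TYPES = {
    "continuous", "discrete", "nominal", "ordinal", "binary", "text", "object",
}
SUPPORTED_COLUMN_TYPES = {"input", "prediction", "target", "prediction_score"}


def check_input_schema(input_schema: dict) -> list:
    """Return a list of problems found; an empty list means none were found."""
    problems = []
    for feature, spec in input_schema.items():
        # data_type and variable_type are required for every feature.
        if spec.get("data_type") not in SUPPORTED_DATA_TYPES:
            problems.append(f"{feature}: unsupported or missing data_type")
        if spec.get("variable_type") not in SUPPORTED_VARIABLE_TYPES:
            problems.append(f"{feature}: unsupported or missing variable_type")
        # column_type is optional and defaults to "input".
        if spec.get("column_type", "input") not in SUPPORTED_COLUMN_TYPES:
            problems.append(f"{feature}: unsupported column_type")
    return problems


schema = {
    "sepal length (cm)": {"data_type": "float", "variable_type": "continuous"},
    "sepal width (cm)": {"data_type": "double", "variable_type": "continuous"},
}
print(check_input_schema(schema))
```

Here the first feature passes because its missing column_type falls back to "input", while the second is reported because "double" is not in the supported data_type list.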

Baseline Reader¶

If the action type is RUN_BASELINE, this is a required component and must be defined in the config.

The baseline_reader allows for the ingestion of raw data into the framework for a baseline run.

Description¶

Field Name	Description	Example
type	type of reader to be used	"type": "JsonlDaskDataReader"
params	reader params; contains either a file_path or a data_source (optional)	"params": { "file_path": "oci://<path>.csv" } or "params": { "data_source": { "type": "ObjectStorageFileSearchDataSource", "params": { "file_path": [ "oci://<bucket_name>@<namespace>/<object_prefix>/dataset.csv" ], "filter_arg": [ { "partition_based_date_range": { "start": "2023-06-26", "end": "2023-06-27", "data_format": ".d{4}-d{2}-d{2}." } } ] } } }

Example using data source for determining the data location¶

"baseline_reader": {
    "type": "CSVDaskDataReader",
    "params": {
        "data_source": {
            "type": "ObjectStorageFileSearchDataSource",
            "params": {
                "file_path": [
                    "oci://<bucket_name>@<namespace>/<object_prefix>/dataset.csv"
                ],
                "filter_arg": [
                    {
                        "partition_based_date_range": {
                            "start": "2023-06-26",
                            "end": "2023-06-27",
                            "data_format": ".d{4}-d{2}-d{2}."
                        }
                    }
                ]
            }
        }
    }
}

Example without using data_source¶

{
    "baseline_reader": {
        "type": "CSVDaskDataReader",
        "params": {
          "file_path": "oci://<path>.csv"
        }
      }
}

Supported Readers

CSVDaskDataReader
JsonlDaskDataReader
NestedJsonDaskDataReader
ADWApplicationDataReader

The reader params can define the location of the files to be read directly, or the reader can specify a data source instead.

Data Source

The Data Source component is responsible for interacting with a specific data source and returning a list of locations to be read.

Supported Data Sources

OCIObjectStorageDataSource
OCIDatePrefixDataSource
ObjectStorageFileSearchDataSource

Prediction Reader¶

If the action type is RUN_PREDICTION, this is a required component and must be defined in the config.

The prediction_reader allows for the ingestion of raw data into the framework for a prediction run.
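
The rule that ties each reader to an action type (baseline_reader for RUN_BASELINE, prediction_reader for RUN_PREDICTION) can be checked before a job is started. The sketch below is illustrative: the function has_required_reader is hypothetical, and it assumes these two action types are the only ones of interest.

```python
# Reader section required for each action type, as stated on this page.
READER_FOR_ACTION = {
    "RUN_BASELINE": "baseline_reader",
    "RUN_PREDICTION": "prediction_reader",
}


def has_required_reader(config: dict, action_type: str) -> bool:
    """Return True if the config defines the reader needed for the action type."""
    if action_type not in READER_FOR_ACTION:
        raise ValueError(f"unknown action type: {action_type}")
    return READER_FOR_ACTION[action_type] in config


config = {
    "baseline_reader": {
        "type": "CSVDaskDataReader",
        "params": {"file_path": "oci://<path>.csv"},
    }
}
print(has_required_reader(config, "RUN_BASELINE"))
print(has_required_reader(config, "RUN_PREDICTION"))
```

With only a baseline_reader defined, the config is sufficient for a baseline run but not for a prediction run.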

Description¶

Field Name	Description	Example
type	type of reader to be used	"type": "JsonlDaskDataReader"
params	reader params (required); contains either a file_path or a data_source (optional)	"params": { "file_path": "oci://<path>.csv" } or "params": { "data_source": { "type": "ObjectStorageFileSearchDataSource", "params": { "file_path": [ "oci://<bucket_name>@<namespace>/<object_prefix>/dataset.csv" ], "filter_arg": [ { "partition_based_date_range": { "start": "2023-06-26", "end": "2023-06-27", "data_format": ".d{4}-d{2}-d{2}." } } ] } } }

Example using data source for determining the data location¶

"prediction_reader": {
    "type": "CSVDaskDataReader",
    "params": {
        "data_source": {
            "type": "ObjectStorageFileSearchDataSource",
            "params": {
                "file_path": [
                    "oci://<bucket_name>@<namespace>/<object_prefix>/dataset.csv"
                ],
                "filter_arg": [
                    {
                        "partition_based_date_range": {
                            "start": "2023-06-26",
                            "end": "2023-06-27",
                            "data_format": ".d{4}-d{2}-d{2}."
                        }
                    }
                ]
            }
        }
    }
}

Example without using data_source¶

{
    "prediction_reader": {
        "type": "CSVDaskDataReader",
        "params": {
          "file_path": "oci://<path>.csv"
        }
      }
}

Supported Readers

CSVDaskDataReader
JsonlDaskDataReader
NestedJsonDaskDataReader
ADWApplicationDataReader

The reader params can define the location of the files to be read directly, or the reader can specify a data source instead.

Data Source

The Data Source component is responsible for interacting with a specific data source and returning a list of locations to be read.

Supported Data Sources

OCIObjectStorageDataSource
OCIDatePrefixDataSource
ObjectStorageFileSearchDataSource

Feature Metrics¶

In this section, list the metrics to be computed for each feature.

Description¶

Key	Value
feature_name	metric list

Supported Feature Metrics

More Metric details

# Data quality metrics
Count
DistinctCount
DuplicateCount
FrequencyDistribution
Max
Mean
Min
Mode
ProbabilityDistribution
Range
Skewness
StandardDeviation
Sum
IQR
Kurtosis
TopKFrequentElements
TypeMetric
Variance
IsPositive
IsNegative
IsNonZero
Percentiles

# Data Integrity
IsConstantFeature
IsQuasiConstantFeature
Quartiles

# Drift Metrics
KullbackLeibler
KolmogorovSmirnov
ChiSquare
JensenShannon
PopulationStabilityIndex

# Bias and Fairness
ClassImbalance

# Date Time Metrics
DateTimeMin
DateTimeMax
DateTimeDuration

Example¶

"feature_metrics": {
    "sepal length (cm)" : [
    {"type": "Sum"},{"type": "Quartiles"}
    ],
    "sepal width (cm)": [
        {"type": "Min"},{"type": "DistinctCount"}
    ],
    "petal length (cm)": [
        {"type": "Count"},{"type": "Mean"}
    ],
    "petal width (cm)": [
        {"type": "IsQuasiConstantFeature"},{"type": "Kurtosis"}
    ]
}

Dataset Metrics¶

Description¶

List of metrics to be calculated on the data set.

Example¶

"dataset_metrics": [
    {
    "type": "RowCount"
    }
]

Supported Data Set Metrics

More Metric details

# Data Quality Metrics
CramersVCorrelation
PearsonCorrelation
CorrelationRatio

# Regression Metrics
RowCount
MeanAbsoluteError
MeanSquaredError
R2Score
RootMeanSquaredError
MeanSquaredLogError
MeanAbsolutePercentageError
MaxError

# Classification metrics
AccuracyScore
PrecisionScore
RecallScore
FBetaScore
FalsePositiveRate
FalseNegativeRate
Specificity
ConfusionMatrix
LogLoss
ROCCurve
ROCAreaUnderCurve
PrecisionRecallCurve
PrecisionRecallAreaUnderCurve

# Conflict Metrics
ConflictPrediction
ConflictLabel

Post Processor¶

Post processor components are responsible for running any action after the entire data set is processed and all the metrics are calculated.

Description¶

Field Name	Description	Example1	Example2
type	type of post processor	"type": "SaveMetricOutputAsJsonPostProcessor"	"type": "OCIMonitoringApplicationPostProcessor"
params	post processor params (required)	"params": { "file_name": "profile.json", "test_results_file_name": "test_result.json", "file_location_expression": "bug-bash/mlm/profile-$start_$end.json", "date_range": { "start": "2023-08-01", "end": "2023-08-05" }, "can_overwrite_profile_json": false, "can_overwrite_test_results_json": false, "namespace": "<namespace>", "bucket_name": "<bucket_name>" }	"params": { "compartment_id": "<COMPARTMENT_ID>", "namespace": "<NAMESPACE>", "date_range": { "start": "2023-08-01", "end": "2023-08-05" }, "dimensions": { "key1": "value1", "key2": "value2" } }

Example¶

  "post_processors": [
  {
    "type": "SaveMetricOutputAsJsonPostProcessor",
    "params": {
      "file_name": "profile.json",
      "test_results_file_name": "test_result.json",
      "file_location_expression": "bug-bash/mlm/profile-$start_$end.json",
      "date_range": {
        "start": "2023-08-01",
        "end": "2023-08-05"
      },
      "can_overwrite_profile_json": false,
      "can_overwrite_test_results_json": false,
      "namespace": "<namespace>",
      "bucket_name": "<bucket_name>"
    }
  }
]

Supported Post Processors¶

SaveMetricOutputAsJsonPostProcessor
Stores the metric results in JSON format at a user-provided Object Storage location.
Required Parameters
bucket_name - The name of the OCI Object Storage bucket.

namespace - The OCI Object Storage namespace.
Optional Parameters
file_location_expression - An expression for the object location within the bucket, which is resolved using the date_range argument.

If file_location_expression is not provided and no date_range is provided in the runtime parameter, the application generates the object location as '<location>/MLM/<monitorId>/<action_type>/file_name.json'.

If file_location_expression is not provided and a date_range is provided, the application generates the object location as '<location>/MLM/<monitorId>/<action_type>/$start-$end/'.

file_name - The file name of the profile object. The default value is 'profile.json'.

can_overwrite_profile_json - A boolean indicating whether an existing profile file should be overwritten. By default, the profile file is not overwritten.

test_results_file_name - The file name of the test results object. The default value is 'test_result.json'.

can_overwrite_test_results_json - A boolean indicating whether an existing test results file should be overwritten. By default, the test results file is not overwritten.

date_range - A dictionary containing an optional date range that is substituted into the file location. It can be overridden by passing START and END DATE in RUNTIME_PARAMETER.

Example

"post_processors": [
    {
        "type": "SaveMetricOutputAsJsonPostProcessor",
        "params": {
            "file_name": "profile.json",
            "test_results_file_name": "test_result.json",
            "file_location_expression": "/usecase/$start_$end",
            "date_range": {
                "start": "2023-08-01",
                "end": "2023-08-05"
            },
            "can_overwrite_profile_json": true,
            "can_overwrite_test_results_json": false,
            "namespace": "<namespace>",
            "bucket_name": "<bucket_name>"
        }
    }
]

In the above example, the JSON result is stored at /usecase/2023-08-01_2023-08-05/profile.json and the test results are stored at /usecase/2023-08-01_2023-08-05/test_result.json.
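
The way the stored locations in this example follow from file_location_expression, date_range, and the file names can be reproduced with plain string replacement. This is a sketch under assumptions: the function resolve_object_path is hypothetical, and the application's actual substitution logic is not documented here and may differ.

```python
def resolve_object_path(expression: str, start: str, end: str, file_name: str) -> str:
    """Substitute $start and $end into the expression and append the file name."""
    # Plain replacement is used on purpose: string.Template would read
    # "$start_" in "$start_$end" as one placeholder named "start_".
    folder = expression.replace("$start", start).replace("$end", end)
    return f"{folder.rstrip('/')}/{file_name}"


date_range = {"start": "2023-08-01", "end": "2023-08-05"}
profile_path = resolve_object_path(
    "/usecase/$start_$end", date_range["start"], date_range["end"], "profile.json"
)
test_path = resolve_object_path(
    "/usecase/$start_$end", date_range["start"], date_range["end"], "test_result.json"
)
print(profile_path)  # /usecase/2023-08-01_2023-08-05/profile.json
print(test_path)     # /usecase/2023-08-01_2023-08-05/test_result.json
```

The sketch covers only expressions shaped like the one in this example, where the expression names a folder and the file name is appended to it.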

OCIMonitoringApplicationPostProcessor
Pushes the ML Insights test suite results to the OCI Monitoring service in the user-provided compartment.
Required Parameters
compartment_id - The OCID of the compartment to use for metrics.
Optional Parameters
dimensions - Additional dimensions for the metrics (default is empty).

namespace - The namespace for OCI Monitoring (default is 'ml_monitoring').

date_range - A dictionary containing an optional date range. It can be overridden by passing START and END DATE in RUNTIME_PARAMETER.

Example

"post_processors": [
      {
        "type": "OCIMonitoringApplicationPostProcessor",
        "params": {
              "compartment_id": "<COMPARTMENT_ID>",
              "namespace": "<NAMESPACE>",
              "date_range": {
                "start": "2023-08-01",
                "end": "2023-08-05"
              },
              "dimensions": {
                "key1": "value1",
                "key2": "value2"
              }
        }
      }
]

In the above example, the ML Insights test suite results are pushed to the compartment identified by the user-provided compartment_id.

For details on writing monitoring results to OCI Autonomous Data Warehouse (ADW), please refer to: Writing Monitoring results to OCI Autonomous Data Warehouse(ADW)

Transformer¶

The transformer component provides an easy way to do simple in-memory transformations on the input data.

The transformers list specifies the transformers used to add a conditional feature or to transform the data before the Insights run.

Description¶

Field Name	Description	Example
type	type of transformer	"type": "ConditionalFeatureTransformer"
params	conditional_features - List of conditional features	"params": { "conditional_features": [ { "feature_name": "Young", "data_type": "integer", "variable_type": "ordinal", "expression": "df.Age < 30" } ] }

Conditional Features¶

Field Name	Value	Remarks
expression	A Python expression written with pandas Series-based functions. Only pandas Series-level functions and the symbol 'df' are supported in the expression.	The expression must return a valid output. For example: "expression": "df.Age < 30"
feature_name	Any name that suits your feature
data_type	The data type of the feature.
variable_type	The variable type of the feature.

Example¶

"transformers": [
{
  "type": "ConditionalFeatureTransformer",
  "params": {
    "conditional_features": [
      {
        "feature_name": "Young",
        "data_type": "integer",
        "variable_type": "ordinal",
        "expression": "df.Age < 30"
      }
    ]
  }
}
]

Tags¶

Note: this is an application-internal concept and should not be confused with OCI resource tags.

Tags are user-provided key-value pairs.
Users can provide tags to be associated with a profile. For example, when running a baseline or prediction run, we can store:
"tenancy": "tenancy-xyz"

Example¶

"tags": {
    "tenancy": "tenancy-xyz"
}

Tests Config¶

For detailed documentation, please refer to section: Test Config