Release Notes

1.2.0

Release notes: July 31, 2024

Features and Improvements

  • New Insights Post Processors: This release introduces the following new Insights Post Processors:
    • OCI Monitoring Post Processor (OCIMonitoringPostProcessor) to push Insights Test Results to OCI Monitoring. This forms the basis for the “Monitoring Notifications” feature, which lets data scientists continually monitor data and model health.

    • Test Results Post Processor to persist the Insights Test Results JSON to OCI Object Storage.

  • Insights Profile Reader: This release has the following changes:
    • Object Storage Profile Reader: This release introduces a new Profile Reader ObjectStorageProfileReader to read Insights Profile from OCI Object Storage.

    • Configure Reference Profile Reader in Insights JSON configuration: This release introduces a new Insights JSON Configuration element reference_profile_reader to let users configure a Profile Reader to load the reference profile when computing the Profile or Insights Tests.

    • Configure Reference Profile Reader in Insights Builder API: This release introduces a new Insights Builder API with_reference_profile to let users pass a Profile Reader to load the reference profile when computing the Profile or Insights Tests.

  • Unified Insights Builder API: In Oracle ML Insights v1.1.0, we introduced the Insights Tests API to run user-provided tests on an Insights Profile. Users had to use the Insights Builder API to compute the profile and a separate Insights Test Builder API to run the tests. This made it difficult to run post processors that depend on both the Profile and the Insights Test Results. This release introduces a Unified Builder API: users can now compute the Insights Profile and run Insights Tests through a single API, using either the Insights Builder API or the Insights Config API.

  • Custom Dask Client: Users can now provide a custom Dask client through the Insights Builder’s with_engine method. Customers who already run applications on a distributed Dask cluster can pass their distributed Dask client to Insights so that its computations run on the same cluster.

  • Enhancements to Insights Feature Schema: This release introduces the following additions to feature schema:
    • Data Type: Two new data types for date time features, DATETIME and TIMESTAMP. Use DATETIME when the feature consists of date time strings in the formats defined in the standard here. Use TIMESTAMP when the feature consists of float/int values representing the UNIX epoch.

    • Variable Type: New variable type DATETIME for date time features.

    • Users can now pass date time specific configuration for date time features. This removes duplicated configuration across date time metrics.
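
To make the difference between the two data types concrete, here is a minimal sketch that normalizes both kinds of raw values to timezone-aware datetime objects using only the Python standard library. It assumes, purely for illustration, that DATETIME strings are ISO 8601 and that naive strings are in UTC. The helper name to_datetime is hypothetical and is not part of the Insights API.

```python
from datetime import datetime, timezone


def to_datetime(value, data_type):
    """Normalize a raw feature value to a timezone-aware UTC datetime.

    `data_type` mirrors the two schema data types named in these notes.
    The helper itself is hypothetical and not part of the Insights SDK.
    """
    if data_type == "DATETIME":
        # Assumption: the string is ISO 8601, e.g. "2024-07-31T00:00:00".
        parsed = datetime.fromisoformat(value)
        # Assumption: naive strings are treated as UTC.
        if parsed.tzinfo is None:
            parsed = parsed.replace(tzinfo=timezone.utc)
        return parsed.astimezone(timezone.utc)
    if data_type == "TIMESTAMP":
        # int/float seconds since the UNIX epoch.
        return datetime.fromtimestamp(float(value), tz=timezone.utc)
    raise ValueError(f"unsupported data type: {data_type}")


# The same instant expressed as a date time string and as an epoch value.
from_string = to_datetime("2024-07-31T00:00:00", "DATETIME")
from_epoch = to_datetime(1722384000, "TIMESTAMP")
```

Once both representations are mapped to the same type, they can be compared directly, which is why a single set of date time metrics can serve both data types.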

  • Metrics: This release introduces the following new metrics:
    • CorrelationRatio: Computes a correlation matrix for a set of categorical and numerical features. With this addition, Insights provides three correlation metrics for different combinations of numerical and categorical features; the other two are PearsonCorrelation and CramersVCorrelation.

    • DateTimeMin: Computes the minimum date time value in a feature. Supports date time string and timestamp values.

    • DateTimeMax: Computes the maximum date time value in a feature. Supports date time string and timestamp values.

    • DateTimeDuration: Computes the longest duration for a date time feature, i.e. max date - min date. Supports date time string and timestamp values.
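
As a rough illustration of what these metrics measure, the sketch below computes the textbook correlation ratio (eta) for one categorical/numerical feature pair, and a duration as max minus min over ISO 8601 strings. Both function names are hypothetical. The notes do not say whether the library reports eta or eta squared, or how it parses date time strings, so treat this as a reading of the definitions rather than the SDK's implementation.

```python
from collections import defaultdict
from datetime import datetime
from math import sqrt


def correlation_ratio(categories, values):
    """Textbook correlation ratio (eta) between a categorical and a numerical feature."""
    if len(categories) != len(values):
        raise ValueError("inputs must have the same length")
    groups = defaultdict(list)
    for category, value in zip(categories, values):
        groups[category].append(value)
    overall_mean = sum(values) / len(values)
    # Between-group sum of squares, weighted by group size.
    between = sum(
        len(group) * (sum(group) / len(group) - overall_mean) ** 2
        for group in groups.values()
    )
    # Total sum of squares of the numerical feature.
    total = sum((value - overall_mean) ** 2 for value in values)
    # A constant numerical feature has no variance to explain.
    return sqrt(between / total) if total > 0 else 0.0


def datetime_duration(values):
    """Duration between the earliest and latest value (max date - min date).

    Assumption: values are ISO 8601 strings; epoch timestamps would be
    converted to datetime first.
    """
    parsed = [datetime.fromisoformat(value) for value in values]
    return max(parsed) - min(parsed)
```

A value of eta near 1 means the category almost fully determines the numerical value; a value near 0 means the group means are indistinguishable.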

  • Improved Insights Configuration Authoring Experience: This release introduces the following features to ease the developer experience of authoring Insights JSON Configuration:
    • Approximate feature schema detection using the sample dataset:
      • Generate an approximate feature schema by inferring the data_type and variable_type of all the features in the dataset.

    • Create Insights JSON from Insights Builder and persist to object storage:
      • InsightsConfigWriter can be used to generate a config JSON from InsightsBuilder for both mlm-insights SDK and ml-monitoring application and persist the config JSON into object storage.
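
The approximate schema detection described above can be pictured with a small sketch that guesses a data_type and variable_type per column from sample rows. The heuristic, the cardinality cut-off, and the type labels used here are all assumptions for illustration; the SDK's own detection rules and type names may differ.

```python
def infer_feature_schema(rows, max_nominal_cardinality=10):
    """Guess data_type and variable_type per column from a list of dict rows.

    Hypothetical heuristic for illustration only; the SDK's detection
    rules and type names may differ.
    """
    schema = {}
    columns = list(rows[0].keys()) if rows else []
    for column in columns:
        values = [row[column] for row in rows if row.get(column) is not None]
        # bool is a subclass of int in Python, so it is checked first.
        if values and all(isinstance(v, bool) for v in values):
            data_type = "BOOLEAN"
        elif values and all(type(v) is int for v in values):
            data_type = "INTEGER"
        elif values and all(type(v) in (int, float) for v in values):
            data_type = "FLOAT"
        else:
            data_type = "STRING"
        # Numerical columns with many distinct values are treated as continuous.
        is_numeric = data_type in ("INTEGER", "FLOAT")
        if is_numeric and len(set(values)) > max_nominal_cardinality:
            variable_type = "CONTINUOUS"
        else:
            variable_type = "NOMINAL"
        schema[column] = {"data_type": data_type, "variable_type": variable_type}
    return schema


# A tiny sample dataset with one integer, one string, and one float column.
sample = [
    {"age": 20 + i, "city": "A" if i % 2 else "B", "score": i / 2}
    for i in range(12)
]
detected = infer_feature_schema(sample)
```

Because the result is inferred from a sample, it is only approximate: a column that looks nominal in twelve rows may be continuous in the full dataset, so the generated schema should be reviewed before use.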

  • Upgraded dask[complete] dependency version from 2022.11.1 to 2022.12.1

Bug fixes

None

Breaking changes

No breaking changes

1.1.0

Release notes: April 20, 2024

Features and Improvements

  • Insights Test/Test Suites:
    Insights Tests and Test Suites enable comprehensive validation of customers’ machine learning models and data through tests and test suites for use cases such as:
    • Data Integrity

    • Data Quality

    • Model Performance (Classification, Regression)

    • Drift

    • Correlation, etc.

    They provide a structured, easier way to add thresholds on metrics. The test results can drive notifications and alerts for continuous model monitoring, allowing users to take remedial action.
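
The idea of attaching thresholds to metrics can be sketched in a few lines. Everything below (ThresholdTest, run_suite, the metric names, and the result layout) is hypothetical and only illustrates the concept; it is not the Insights Tests API.

```python
import operator
from dataclasses import dataclass

# Comparison operators a threshold test may use.
_OPERATORS = {
    "<": operator.lt,
    "<=": operator.le,
    ">": operator.gt,
    ">=": operator.ge,
    "==": operator.eq,
}


@dataclass
class ThresholdTest:
    """Hypothetical test: compare one metric value against a threshold."""

    name: str
    metric: str
    op: str
    threshold: float

    def run(self, profile):
        value = profile[self.metric]
        passed = _OPERATORS[self.op](value, self.threshold)
        return {
            "test": self.name,
            "value": value,
            "status": "PASSED" if passed else "FAILED",
        }


def run_suite(tests, profile):
    """Run every test against the profile and summarize the failures."""
    results = [test.run(profile) for test in tests]
    failed = sum(1 for result in results if result["status"] == "FAILED")
    return {"results": results, "failed": failed}


# Stand-in for computed metric values; a real profile is richer than a dict.
profile = {"accuracy": 0.91, "missing_ratio": 0.08}
suite = [
    ThresholdTest("accuracy is acceptable", "accuracy", ">=", 0.90),
    ThresholdTest("few missing values", "missing_ratio", "<=", 0.05),
]
summary = run_suite(suite, profile)
```

A non-zero failure count is the kind of signal a notification or alerting system can act on.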

  • Bias and Fairness: This release introduces a new Insights metric group for “Bias and Fairness Detection” with a new feature metric, “Class Imbalance”. The Class Imbalance metric measures under-representation of sensitive groups in a categorical feature.
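
One common formulation of class imbalance, used here as an assumption since the notes do not give the library's formula, is CI = (n_a - n_d) / (n_a + n_d), where n_d is the number of rows in the sensitive group and n_a is the number of all other rows. The function name below is hypothetical.

```python
def class_imbalance(values, sensitive_group):
    """Class imbalance in [-1, 1] for one sensitive group of a categorical feature.

    Assumed formulation: (n_a - n_d) / (n_a + n_d). Positive values mean
    the sensitive group is under-represented. The library's exact
    definition may differ.
    """
    if not values:
        raise ValueError("feature has no values")
    n_d = sum(1 for value in values if value == sensitive_group)
    n_a = len(values) - n_d
    return (n_a - n_d) / (n_a + n_d)


# 2 of 10 rows belong to the sensitive group.
feature = ["m"] * 8 + ["f"] * 2
imbalance = class_imbalance(feature, sensitive_group="f")
```

A result of 0 indicates a balanced feature, and a result close to 1 indicates that the sensitive group is nearly absent.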

  • Data Source: This release introduces a new Data Source, ObjectStorageFileSearchDataSource. It retrieves file locations from OCI Object Storage based on an OCI file path string (or a list of such strings) and filter arguments provided by the user. Filter options are available for the file path prefix, file path suffix, last modified date, date in the file path string, and folder names containing a given string.
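
The filtering behaviour described above can be sketched against an in-memory listing. The function filter_file_paths, its parameter names, and the sample paths are hypothetical and do not reflect the actual arguments of ObjectStorageFileSearchDataSource; the sketch only shows how several filters combine.

```python
from datetime import datetime


def filter_file_paths(files, prefix=None, suffix=None,
                      modified_after=None, folder_contains=None):
    """Select paths from (path, last_modified) pairs that pass every given filter.

    Hypothetical helper; a filter left as None is not applied.
    """
    selected = []
    for path, last_modified in files:
        folders = path.split("/")[:-1]
        if prefix is not None and not path.startswith(prefix):
            continue
        if suffix is not None and not path.endswith(suffix):
            continue
        if modified_after is not None and last_modified <= modified_after:
            continue
        if folder_contains is not None and not any(
            folder_contains in folder for folder in folders
        ):
            continue
        selected.append(path)
    return selected


# Stand-in for an object listing: (object path, last modified time).
files = [
    ("data/2024-04-01/part-0.csv", datetime(2024, 4, 1)),
    ("data/2024-04-02/part-0.csv", datetime(2024, 4, 2)),
    ("data/2024-04-02/part-0.json", datetime(2024, 4, 2)),
    ("logs/2024-04-02/run.csv", datetime(2024, 4, 2)),
]
matches = filter_file_paths(
    files, prefix="data/", suffix=".csv", modified_after=datetime(2024, 4, 1)
)
```

Filters are combined with AND semantics in this sketch, so a file must satisfy every filter that is set.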

  • Post-Processor Component: This release introduces a new is_critical argument that can be passed to a post processor. When set to True, the Insights run is marked as failed if the post processor execution fails. By default, the flag is set to False.
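
A minimal sketch of the is_critical semantics follows. The runner function and the tuple layout are hypothetical. The notes do not say whether the remaining post processors still execute after a critical failure; this sketch assumes they do.

```python
def run_post_processors(post_processors):
    """Run (name, action, is_critical) entries and derive the run status.

    Hypothetical runner: a failing post processor only fails the run
    when its is_critical flag is True.
    """
    status = "SUCCESS"
    errors = []
    for name, action, is_critical in post_processors:
        try:
            action()
        except Exception as exc:
            errors.append((name, str(exc)))
            if is_critical:
                status = "FAILED"
    return status, errors


def write_ok():
    """Stand-in for a post processor that succeeds."""


def write_broken():
    """Stand-in for a post processor that fails."""
    raise RuntimeError("bucket not reachable")


# Default behaviour (is_critical=False): the failure is recorded, the run succeeds.
tolerant_status, tolerant_errors = run_post_processors(
    [("writer", write_ok, False), ("exporter", write_broken, False)]
)
# Critical post processor: the same failure marks the run as failed.
strict_status, strict_errors = run_post_processors(
    [("writer", write_ok, False), ("exporter", write_broken, True)]
)
```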

  • Metrics: This release introduces the following new metrics:
    • IsPositive: Computes whether the numerical feature has all positive values

    • IsNegative: Computes whether the numerical feature has all negative values

    • IsNonZero: Computes whether the numerical feature has all non-zero values

    • Percentiles: Computes user-provided percentiles for a numerical feature
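
In plain Python, these four metrics could be read as follows. The function names are hypothetical, zero is treated as neither positive nor negative, and the percentile uses linear interpolation between closest ranks, which is one of several common conventions and may not match the library's method.

```python
def is_positive(values):
    """True when every value is strictly greater than zero."""
    return all(value > 0 for value in values)


def is_negative(values):
    """True when every value is strictly less than zero."""
    return all(value < 0 for value in values)


def is_non_zero(values):
    """True when no value equals zero."""
    return all(value != 0 for value in values)


def percentiles(values, requested):
    """Exact percentiles using linear interpolation between closest ranks."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("feature has no values")
    result = {}
    for p in requested:
        if not 0 <= p <= 100:
            raise ValueError(f"percentile out of range: {p}")
        # Fractional position of the percentile in the sorted values.
        rank = (len(ordered) - 1) * p / 100
        lower = int(rank)
        upper = min(lower + 1, len(ordered) - 1)
        fraction = rank - lower
        result[p] = ordered[lower] + (ordered[upper] - ordered[lower]) * fraction
    return result


feature = [1, 2, 3, 4, 5]
computed = percentiles(feature, [50, 90])
```

This version sorts the whole feature in memory, which is fine for a small example but not necessarily how percentiles are computed on partitioned data.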

  • Upgraded the pyarrow dependency to 14.0.1

Bug fixes
  • Improved the error message shown when the Dask installation or its dependencies have issues.

  • Fixed a bug where Classification metrics did not work for integer and float values in a feature of type TARGET or PREDICTION.

Breaking changes

No breaking changes

1.0.4

Release notes: January 15, 2024

Features and Improvements

  • Builder: The Builder object provides a core set of APIs with which users can set the behavior of their monitoring, for example, which reader to use, what metrics to calculate, and which post-processor to use.

  • Config Reader: The config reader component lets the user build the monitoring behavior using a config file provided by them. It creates a builder object using the config file so the user doesn’t need to create the builder object manually.

  • Runner: The runner object runs the internal workflow. It handles the life cycle of each component passed, which includes creation (if required), invoking interface functions, and destroying them.

  • Data Reader: The Data Reader is responsible for reading data in a specific format. Currently, all the Readers listed below can read from the local file system and OCI Object Storage. The Readers supported in this release are: CSV Native Data Reader, JSON Native Data Reader, Nested JSON Native Data Reader, CSV Dask Data Reader, JSONL Dask Data Reader, and Nested JSON Dask Data Reader.

  • Data Source: The data source component lets you specify the source from which the data reader reads the data. It supports attributes like File Type and File Path. The data sources supported are: Local Date Prefix Data Source, Local File Data Source, OCI Date Prefix Data Source, and OCI Object Storage Data Source.

  • Transformer: The transformer component provides an easy way to do simple in-memory transformations on the input data. The Conditional Feature Transformer is supported in this release.

  • Metrics: Metric components are responsible for calculating all statistical metrics and algorithms on the data. This release supports the following metric types:
    • Feature Metrics: (Count, Distinct Count, Duplicate Count, Frequency Distribution, Inter Quartile Range, Is Constant Feature, Is Quasi Constant Feature, Kurtosis, Max, Mean, Min, Mode, Probability Distribution, Quartiles, Range, Skewness, Standard Deviation, Sum, Top K Frequent Elements, Type Metric, Variance)

    • Model Performance Metrics: (Row Count, Mean Absolute Error, Mean Squared Error, R2 Score, Root Mean Squared Error, Mean Squared Log Error, Mean Absolute Percentage Error, Max Error, Conflict Prediction, Conflict Label)

    • Data Quality Metrics: (CramersVCorrelation, Pearson Correlation)

    • Classification Metrics: (Accuracy Score, Precision Score, Recall Score, FBeta Score, Log Loss, False Positive Rate, False Negative Rate, Specificity, Confusion Matrix, ROC Curve, ROC Area Under Curve, Precision Recall Curve, Precision Recall Area Under Curve)

    • Drift Metrics: (Jensen Shannon, KullbackLeibler, Population Stability Index, Kolmogorov Smirnov, ChiSquare)
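
To give a feel for what a drift metric computes, here is a sketch of the Population Stability Index over pre-binned counts, using the standard formula PSI = sum((a_i - e_i) * ln(a_i / e_i)) over bin proportions. The function name, the binning, and the epsilon used for empty bins are assumptions; the library's binning strategy and smoothing are not described in these notes.

```python
from math import log


def population_stability_index(expected_counts, actual_counts, eps=1e-6):
    """PSI between a reference (expected) and a current (actual) histogram.

    Both inputs are counts over the same bins. Empty bins are floored at
    `eps` (an assumption) so the logarithm stays defined.
    """
    if len(expected_counts) != len(actual_counts):
        raise ValueError("histograms must use the same bins")
    expected_total = sum(expected_counts)
    actual_total = sum(actual_counts)
    if expected_total == 0 or actual_total == 0:
        raise ValueError("histograms must not be empty")
    psi = 0.0
    for expected, actual in zip(expected_counts, actual_counts):
        e = max(expected / expected_total, eps)
        a = max(actual / actual_total, eps)
        psi += (a - e) * log(a / e)
    return psi


# Reference data is evenly split across two bins; current data has shifted.
no_drift = population_stability_index([50, 50], [500, 500])
drift = population_stability_index([50, 50], [80, 20])
```

Identical distributions give a PSI of 0, and the value grows as the current distribution moves away from the reference.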

  • Post-Processor Component: Post processor components are responsible for running any action after the entire data is processed and all the metrics are calculated. The output metrics are collectively referred to as a profile, and ML Insights supports a set of default Post-Processors to save the profile: Local Writer Post Processor and Object Storage Writer Post Processor.

  • Customization: The SDK allows you to customize its runs to your ML monitoring needs. You can write a config file defining the Data Reader to be used, the data location to be used, the metrics to be evaluated, and the post processor to be used.

  • Built for Scale: The ML Insights library is designed to scale with dataset size. It reads data in partitions, computes metrics on each partition, and merges the partition metrics at the end, so it does not need to load all the data in memory to calculate metrics.
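
The read, compute per partition, then merge pattern can be sketched for mean and variance. The sketch uses Welford's update within a partition and the pairwise merge formula of Chan et al. across partitions. The class name is hypothetical, and the notes do not state which algorithms the library actually uses.

```python
from dataclasses import dataclass
from functools import reduce


@dataclass
class MeanVarianceState:
    """Mergeable summary of a partition: count, mean, and sum of squared deviations."""

    count: int = 0
    mean: float = 0.0
    m2: float = 0.0

    @classmethod
    def from_partition(cls, values):
        state = cls()
        for x in values:
            # Welford's single-pass update.
            state.count += 1
            delta = x - state.mean
            state.mean += delta / state.count
            state.m2 += delta * (x - state.mean)
        return state

    def merge(self, other):
        """Combine two partition summaries without revisiting the raw data."""
        if other.count == 0:
            return self
        if self.count == 0:
            return other
        count = self.count + other.count
        delta = other.mean - self.mean
        mean = self.mean + delta * other.count / count
        m2 = self.m2 + other.m2 + delta * delta * self.count * other.count / count
        return MeanVarianceState(count, mean, m2)

    @property
    def variance(self):
        """Population variance of everything merged so far."""
        return self.m2 / self.count if self.count else 0.0


# Each partition is summarized independently, then the summaries are merged.
partitions = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
states = [MeanVarianceState.from_partition(p) for p in partitions]
merged = reduce(lambda left, right: left.merge(right), states)
```

Only the small per-partition summaries are held at merge time, which is what keeps memory use independent of the total dataset size.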

  • Compute technology choice: ML Insights supports Pandas (Native), Dask, and Spark based compute technologies for metric evaluation. You can choose the compute technology based on the scale of the data and the speed of metric evaluation needed.

  • Extensibility: The SDK provides multiple interfaces that let you add custom components for data reading, metric evaluation, or data writing. For example, you can write a data reader containing authentication logic to read data from an object storage location that requires the client to authenticate.
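
The authenticated reader example above might look roughly like the sketch below. The CustomDataReader base class, its read method, and the token handling are all hypothetical stand-ins; the SDK's real reader interface is not shown in these notes and will differ in names and signatures.

```python
from abc import ABC, abstractmethod


class CustomDataReader(ABC):
    """Hypothetical reader interface; the SDK's real base class may differ."""

    @abstractmethod
    def read(self, location):
        """Return the rows stored at `location`."""


class TokenAuthReader(CustomDataReader):
    """Reader that obtains a token before fetching the data."""

    def __init__(self, token_provider, fetch):
        self._token_provider = token_provider
        self._fetch = fetch

    def read(self, location):
        # Authenticate on every read so an expired token gets refreshed.
        token = self._token_provider()
        return self._fetch(location, token)


# In-memory stand-in for a protected object storage location.
_STORE = {"bucket/data.csv": [{"age": 31}, {"age": 45}]}


def fake_fetch(location, token):
    """Pretend storage client that rejects unknown tokens."""
    if token != "secret-token":
        raise PermissionError("authentication failed")
    return _STORE[location]


reader = TokenAuthReader(lambda: "secret-token", fake_fetch)
rows = reader.read("bucket/data.csv")
```

Keeping the authentication logic inside the reader means the rest of the pipeline (metrics, post processors) stays unaware of how the data was obtained.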