Query External Data with ORC, Parquet, or Avro Source Files

Autonomous Database makes it easy to access ORC, Parquet, or Avro data stored in object store using external tables. ORC, Parquet, and Avro sources have metadata embedded in them and the DBMS_CLOUD.CREATE_EXTERNAL_TABLE procedure can utilize this metadata to simplify the creation of external tables.

You don’t need to know the structure of the data, DBMS_CLOUD can examine the file and convert ORC, Parquet, or Avro contents into the equivalent Oracle columns and data types. You only need to know the location of the data in object store, specify its type, ORC, Parquet, or Avro, and have credentials to access the source file on your object store.

Note

The steps to use external tables are very similar for ORC, Parquet, and Avro. These steps show working with a Parquet format source file.

The source file in this example, sales_extended.parquet, contains Parquet format data. To query this file in Autonomous Database, do the following:

  1. Store your object store credentials, to access the object store, using the procedure DBMS_CLOUD.CREATE_CREDENTIAL:
    BEGIN
      DBMS_CLOUD.CREATE_CREDENTIAL(
        credential_name => 'DEF_CRED_NAME',
        username => 'adb_user@example.com',
        password => 'password' );
    END;
    /
    

    Creating a credential to access Oracle Cloud Infrastructure Object Store is not required if you enable resource principal credentials. See Use Resource Principal to Access Oracle Cloud Infrastructure Resources for more information.

    This operation stores the credentials in the database in an encrypted format. You can use any name for the credential name. Note that this step is required only once unless your object store credentials change. Once you store the credentials you can then use the same credential name for creating external tables.

    See CREATE_CREDENTIAL Procedure for information about the username and password parameters for different object storage services.

  2. Create an external table for ORC, Parquet, or Avro on top of your source files using the procedure DBMS_CLOUD.CREATE_EXTERNAL_TABLE.

    The procedure DBMS_CLOUD.CREATE_EXTERNAL_TABLE supports external files in the supported cloud object storage services, including: Oracle Cloud Infrastructure Object Storage, Azure Blob Storage or Azure Data Lake Storage, Amazon S3, and Amazon S3-Compatible, including: Oracle Cloud Infrastructure Object Storage, Google Cloud Storage, and Wasabi Hot Cloud Storage. The credential is a table level property; therefore, the external files must be on the same object store.

    By default, the columns created in the external table automatically map their data types to Oracle data types for the fields found in the source files and the external table column names match the source field names.

    BEGIN
       DBMS_CLOUD.CREATE_EXTERNAL_TABLE(
        table_name =>'sales_extended_ext',
        credential_name =>'DEF_CRED_NAME',
        file_uri_list =>'https://objectstorage.us-phoenix-1.oraclecloud.com/n/namespace-string/b/bucketname/o/sales_extended.parquet',
        format =>  '{"type":"parquet", "schema": "first"}'
     );
    END;
    /     
    

    The parameters are:

    • table_name: is the external table name.

    • credential_name: is the name of the credential created in the previous step. The credential_name parameter must conform to Oracle object naming conventions, which do not allow spaces or hyphens.

    • file_uri_list: is a comma delimited list of the source files you want to query. The URI format for Dedicated Endpoints is supported in commercial (OC1) realms. See Object Storage Dedicated Endpoints, and DBMS_CLOUD URI Formats for more information.

    • format: defines the options to describe the format of the source file. For a Parquet file, use the format parameter to specify the type parquet. For an Avro file use the format parameter to specify the type avro. For an ORC file use the format parameter to specify the type orc.

    In this example, namespace-string is the Oracle Cloud Infrastructure object storage namespace and bucketname is the bucket name. See Understanding Object Storage Namespaces for more information.

    By default the format schema parameter is set and the columns and data types are derived automatically and the fields in the source match the external table columns by name. Source data types are converted to the external table column Oracle data types according to the DBMS_CLOUD mapping for ORC, Parquet, or Avro data types. The valid schema parameter values are:

    • first: Analyze the schema of the first ORC, Parquet, or Avro file that DBMS_CLOUD finds in the specified file_uri_list (first is the default value for schema).

    • all: Analyze all the schemas for all the ORC, Parquet, or Avro files found in the file_uri_list. Because these are simply files captured in an object store, there is no guarantee that each file’s metadata is the same. For example, File1 may contain a field called “address”, while File2 may be missing that field. Examining each file to derive the columns is a bit more expensive but may be required if the first file does not contain all the required fields.

    Note

    If the column_list parameter is specified, then you provide the column names and data types for the external table and the schema value, if specified is ignored. Using column_list you can limit the columns in the external table. If column_list is not specified then the schema default value is first.
  3. You can now run queries on the external table you created in the previous step:
    DESC sales_extended_ext;
    Name           Null? Type           
    -------------- ----- -------------- 
    PROD_ID              NUMBER(10)     
    CUST_ID              NUMBER(10)     
    TIME_ID              VARCHAR2(32767) 
    CHANNEL_ID           NUMBER(10)     
    PROMO_ID             NUMBER(10)     
    QUANTITY_SOLD        NUMBER(10)     
    AMOUNT_SOLD          NUMBER(10,2)   
    GENDER               VARCHAR2(32767) 
    CITY                 VARCHAR2(32767) 
    STATE_PROVINCE       VARCHAR2(32767) 
    INCOME_LEVEL         VARCHAR2(32767)
    
    SELECT prod_id, quantity_sold, gender, city, income_level 
         FROM sales_extended_ext 
         WHERE ROWNUM < 8;
    
         PROD_ID     QUANTITY_SOLD     GENDER    CITY              INCOME_LEVEL
    1    13          1                 M         Adelaide          K: 250,000 – 299,999
    2    13          1                 M         Dolores           L: 300,000 and above
    3    13          1                 M         Cayuga            F: 110,000 – 129,999
    4    13          1                 F         Bergen op Zoom    C: 50,000 – 69,999
    5    13          1                 F         Neuss             J: 190,000 – 249,999
    6    13          1                 F         Darwin            F: 110,000 – 129,999
    7    13          1                 M         Sabadell          K:250,000 – 299,999
    

    This query shows values for rows in the external table. If you want to query this data frequently, after examining the data you can load it into a table with DBMS_CLOUD.COPY_DATA.

See CREATE_EXTERNAL_TABLE Procedure for Avro, ORC, or Parquet Files and COPY_DATA Procedure for Avro, ORC, or Parquet Files for more information.

See DBMS_CLOUD URI Formats for information on supported cloud object storage services.

Query External Data with ORC, Parquet, or Avro Source Files and Explicitly Set the Text Column Size

When you access ORC, Parquet, or Avro data stored in object store using external tables on Autonomous Database, you can either automatically or explicitly set the maximum size of text columns.

By default the text column size is set based on the value of MAX_STRING_SIZE.

The source file in this example, sales_extended.parquet, contains Parquet format data. To query this file in Autonomous Database and set the maximum text column size, do the following:

  1. Store your object store credentials, to access the object store, using the procedure DBMS_CLOUD.CREATE_CREDENTIAL:
    BEGIN
      DBMS_CLOUD.CREATE_CREDENTIAL(
        credential_name => 'DEF_CRED_NAME',
        username => 'adb_user@example.com',
        password => 'password' );
    END;
    /
    

    Creating a credential to access Oracle Cloud Infrastructure Object Store is not required if you enable resource principal credentials. See Use Resource Principal to Access Oracle Cloud Infrastructure Resources for more information.

    This operation stores the credentials in the database in an encrypted format. You can use any name for the credential name. Note that this step is required only once unless your object store credentials change. Once you store the credentials you can then use the same credential name for creating external tables.

    See CREATE_CREDENTIAL Procedure for information about the username and password parameters for different object storage services.

  2. Create an external table for ORC, Parquet, or Avro on top of your source files using the procedure DBMS_CLOUD.CREATE_EXTERNAL_TABLE and specify the maxvarchar format parameter.

    The procedure DBMS_CLOUD.CREATE_EXTERNAL_TABLE supports external files in the supported cloud object storage services, including: Oracle Cloud Infrastructure Object Storage, Azure Blob Storage or Azure Data Lake Storage, Amazon S3, and Amazon S3-Compatible, including: Oracle Cloud Infrastructure Object Storage, Google Cloud Storage, and Wasabi Hot Cloud Storage. The credential is a table level property; therefore, the external files must be on the same object store.

    By default, the columns created in the external table automatically map their data types to Oracle data types for the fields found in the source files and the external table column names match the source field names.

    BEGIN
       DBMS_CLOUD.CREATE_EXTERNAL_TABLE(
        table_name =>'sales_extended_ext',
        credential_name =>'DEF_CRED_NAME',
        file_uri_list =>'https://objectstorage.us-phoenix-1.oraclecloud.com/n/namespace-string/b/bucketname/o/sales_extended.parquet',
        format =>  '{"type":"parquet", "schema": "first", "maxvarchar":"extended"}'
     );
    END;
    /     
    

    The parameters are:

    • table_name: is the external table name.

    • credential_name: is the name of the credential created in the previous step. The credential_name parameter must conform to Oracle object naming conventions, which do not allow spaces or hyphens.

    • file_uri_list: is a comma delimited list of the source files you want to query. The URI format for Dedicated Endpoints is supported in commercial (OC1) realms. See Object Storage Dedicated Endpoints, and DBMS_CLOUD URI Formats for more information.

    • format: defines the options to describe the format of the source file. For a Parquet file, use the format parameter to specify the type parquet. For an Avro file use the format parameter to specify the type avro. For an ORC file use the format parameter to specify the type orc.

      The option maxvarchar with value extended specifies that text columns are created as varchar(32767) on an Autonomous Database instance with extended string size. The possible values are standard with varchar(4000), extended with varchar(32767), and auto. The default maxvarchar value is auto. With this value, the maximum text size is based on the value of MAX_STRING_SIZE.

    In this example, namespace-string is the Oracle Cloud Infrastructure object storage namespace and bucketname is the bucket name. See Understanding Object Storage Namespaces for more information.

    By default the format schema parameter is set and the columns and data types are derived automatically and the fields in the source match the external table columns by name. Source data types are converted to the external table column Oracle data types according to the DBMS_CLOUD mapping for ORC, Parquet, or Avro data types. The valid schema parameter values are:

    • first: Analyze the schema of the first ORC, Parquet, or Avro file that DBMS_CLOUD finds in the specified file_uri_list (first is the default value for schema).

    • all: Analyze all the schemas for all the ORC, Parquet, or Avro files found in the file_uri_list. Because these are simply files captured in an object store, there is no guarantee that each file’s metadata is the same. For example, File1 may contain a field called “address”, while File2 may be missing that field. Examining each file to derive the columns is a bit more expensive but may be required if the first file does not contain all the required fields.

    Note

    If the column_list parameter is specified, then you provide the column names and data types for the external table and the schema value, if specified is ignored. Using column_list you can limit the columns in the external table. If column_list is not specified then the schema default value is first.
  3. You can now run queries on the external table you created in the previous step:
    DESC sales_extended_ext;
    Name           Null? Type           
    -------------- ----- -------------- 
    PROD_ID              NUMBER(10)     
    CUST_ID              NUMBER(10)     
    TIME_ID              VARCHAR2(32767) 
    CHANNEL_ID           NUMBER(10)     
    PROMO_ID             NUMBER(10)     
    QUANTITY_SOLD        NUMBER(10)     
    AMOUNT_SOLD          NUMBER(10,2)   
    GENDER               VARCHAR2(32767) 
    CITY                 VARCHAR2(32767) 
    STATE_PROVINCE       VARCHAR2(32767) 
    INCOME_LEVEL         VARCHAR2(32767)
    
    SELECT prod_id, quantity_sold, gender, city, income_level 
         FROM sales_extended_ext 
         WHERE ROWNUM < 8;
    
         PROD_ID     QUANTITY_SOLD     GENDER    CITY              INCOME_LEVEL
    1    13          1                 M         Adelaide          K: 250,000 – 299,999
    2    13          1                 M         Dolores           L: 300,000 and above
    3    13          1                 M         Cayuga            F: 110,000 – 129,999
    4    13          1                 F         Bergen op Zoom    C: 50,000 – 69,999
    5    13          1                 F         Neuss             J: 190,000 – 249,999
    6    13          1                 F         Darwin            F: 110,000 – 129,999
    7    13          1                 M         Sabadell          K:250,000 – 299,999
    

    This query shows values for rows in the external table. If you want to query this data frequently, after examining the data you can load it into a table with DBMS_CLOUD.COPY_DATA.

    If you specify the format option maxvarchar as standard, the varchar2() text columns are created with size 4000. For example:

    BEGIN
       DBMS_CLOUD.CREATE_EXTERNAL_TABLE(
        table_name =>'sales_extended_ext',
        credential_name =>'DEF_CRED_NAME',
        file_uri_list =>'https://objectstorage.us-phoenix-1.oraclecloud.com/n/namespace-string/b/bucketname/o/sales_extended.parquet',
        format =>  '{"type":"parquet", "schema": "first", "maxvarchar":"standard"}'
     );
    END;
    /     
    

    The format parameter with the option maxvarchar set to the value standard specifies that text columns are created as varchar(4000).

    DESC sales_extended_ext;
    Name           Null? Type           
    -------------- ----- -------------- 
    PROD_ID              NUMBER(10)     
    CUST_ID              NUMBER(10)     
    TIME_ID              VARCHAR2(4000) 
    CHANNEL_ID           NUMBER(10)     
    PROMO_ID             NUMBER(10)     
    QUANTITY_SOLD        NUMBER(10)     
    AMOUNT_SOLD          NUMBER(10,2)   
    GENDER               VARCHAR2(4000) 
    CITY                 VARCHAR2(4000) 
    STATE_PROVINCE       VARCHAR2(4000) 
    INCOME_LEVEL         VARCHAR2(4000)
    

See CREATE_EXTERNAL_TABLE Procedure for Avro, ORC, or Parquet Files and COPY_DATA Procedure for Avro, ORC, or Parquet Files for more information.

See DBMS_CLOUD URI Formats for information on supported cloud object storage services.