Accessing Parquet files with Oracle BigData Type

In my Getting started with Oracle BigData Blog I shared how you can obtain an example parquet file and set-up a FlashBlade s3 bucket, if you want to follow this Blog and don’t have access to a parquet file you can visit my previous Blog to get started.

Bigdata Tools

The parquet file format is binary and therefore not human readable, fortunately we can use parquet-tools to inspect the data and schema information.

On my Mac I have installed both parquet-tools and avro-tools using Brew thus:

brew install parquet-tools
brew install avro-tools

parquet-tools help

rekins$ parquet-tools -h
parquet-tools cat:
Prints the content of a Parquet file. The output contains only the data, no
metadata is displayed
usage: parquet-tools cat [option...] <input>
where option is one of:
       --debug     Enable debug output
    -h,--help      Show this help string
    -j,--json      Show records in JSON format.
       --no-color  Disable color output even if supported
where <input> is the parquet file to print to stdout

parquet-tools head:
Prints the first n record of the Parquet file
usage: parquet-tools head [option...] <input>
where option is one of:
       --debug          Enable debug output
    -h,--help           Show this help string
    -n,--records <arg>  The number of records to show (default: 5)
       --no-color       Disable color output even if supported
where <input> is the parquet file to print to stdout

parquet-tools schema:
Prints the schema of Parquet file(s)
usage: parquet-tools schema [option...] <input>
where option is one of:
    -d,--detailed  Show detailed information about the schema.
       --debug     Enable debug output
    -h,--help      Show this help string
       --no-color  Disable color output even if supported
where <input> is the parquet file containing the schema to show

parquet-tools meta:
Prints the metadata of Parquet file(s)
usage: parquet-tools meta [option...] <input>
where option is one of:
       --debug     Enable debug output
    -h,--help      Show this help string
       --no-color  Disable color output even if supported
where <input> is the parquet file to print to stdout

parquet-tools dump:
Prints the content and metadata of a Parquet file
usage: parquet-tools dump [option...] <input>
where option is one of:
    -c,--column <arg>  Dump only the given column, can be specified more than
                       once
    -d,--disable-data  Do not dump column data
       --debug         Enable debug output
    -h,--help          Show this help string
    -m,--disable-meta  Do not dump row group and page metadata
    -n,--disable-crop  Do not crop the output based on console width
       --no-color      Disable color output even if supported
where <input> is the parquet file to print to stdout

parquet-tools merge:
Merges multiple Parquet files into one. The command doesn't merge row groups,
just places one after the other. When used to merge many small files, the
resulting file will still contain small row groups, which usually leads to bad
query performance.
usage: parquet-tools merge [option...] <input> [<input> ...] <output>
where option is one of:
       --debug     Enable debug output
    -h,--help      Show this help string
       --no-color  Disable color output even if supported
where <input> is the source parquet files/directory to be merged
   <output> is the destination parquet file

parquet-tools rowcount:
Prints the count of rows in Parquet file(s)
usage: parquet-tools rowcount [option...] <input>
where option is one of:
    -d,--detailed  Detailed rowcount of each matching file
       --debug     Enable debug output
    -h,--help      Show this help string
       --no-color  Disable color output even if supported
where <input> is the parquet file to count rows to stdout

parquet-tools size:
Prints the size of Parquet file(s)
usage: parquet-tools size [option...] <input>
where option is one of:
    -d,--detailed      Detailed size of each matching file
       --debug         Enable debug output
    -h,--help          Show this help string
       --no-color      Disable color output even if supported
    -p,--pretty        Pretty size
    -u,--uncompressed  Uncompressed size
where <input> is the parquet file to get size & human readable size to stdout

parquet-tools schema

rekins$ parquet-tools schema sales_extended.parquet
message hive_schema {
  optional int32 prod_id;
  optional int32 cust_id;
  optional binary time_id (UTF8);
  optional int32 channel_id;
  optional int32 promo_id;
  optional int32 quantity_sold;
  optional fixed_len_byte_array(5) amount_sold (DECIMAL(10,2));
  optional binary gender (UTF8);
  optional binary city (UTF8);
  optional binary state_province (UTF8);
  optional binary income_level (UTF8);
}

parquet-tools rowcount

rekins$ parquet-tools rowcount sales_extended.parquet
Total RowCount: 916039

parquet-tools size

rekins$ parquet-tools size sales_extended.parquet
Total Size: 9262640 bytes

parquet-tools head

rekins$ parquet-tools head -n 2 sales_extended.parquet
prod_id = 13
cust_id = 987
time_id = 1998-01-10
channel_id = 3
promo_id = 999
quantity_sold = 1
amount_sold = 1232.16
gender = M
city = Adelaide
state_province = South Australia
income_level = K: 250,000 - 299,999

prod_id = 13
cust_id = 1660
time_id = 1998-01-10
channel_id = 3
promo_id = 999
quantity_sold = 1
amount_sold = 1232.16
gender = M
city = Dolores
state_province = CO
income_level = L: 300,000 and above

Oracle Database 19.3

Unfortunately, I don’t have access to an Oracle BDA (Big Data Appliance) and the Oracle Big Data Lite Virtual Machine does not currently support the ORACLE_BIGDATA TYPE, so in this Blog I will be using my on-premise 19c database.

However, we can easily test accessing a local parquet file by creating an Oracle Directory and and then creating an External Table with TYPE ORACLE_BIGDATA.

Local parquet file

$ ls -l /home/oracle/bigdata/sales_extended.parquet 
 -rw-r--r--. 1 oracle oinstall 9263650 Jan  1 15:28 /home/oracle/bigdata/sales_extended.parquet

Create Oracle Directory

CREATE OR REPLACE DIRECTORY BIGDATA_DIR AS '/home/oracle/bigdata'

Create Oracle External Table

CREATE TABLE SALES_LOCAL (
   PROD_ID              NUMBER(10),
   CUST_ID              NUMBER(10),     
   TIME_ID              VARCHAR2(4000), 
   CHANNEL_ID           NUMBER(10),     
   PROMO_ID             NUMBER(10),     
   QUANTITY_SOLD        NUMBER(10),     
   AMOUNT_SOLD          NUMBER(10,2),   
   GENDER               VARCHAR2(4000), 
   CITY                 VARCHAR2(4000), 
   STATE_PROVINCE       VARCHAR2(4000), 
   INCOME_LEVEL         VARCHAR2(4000)
   )
   ORGANIZATION EXTERNAL
   (TYPE ORACLE_BIGDATA
    DEFAULT DIRECTORY BIGDATA_DIR  
    ACCESS PARAMETERS
    (
      com.oracle.bigdata.debug=TRUE
      com.oracle.bigdata.log.opt=normal
      com.oracle.bigdata.fileformat=parquet
    )
    location (BIGDATA_DIR:'sales_extended.parquet')
   )  REJECT LIMIT UNLIMITED
 /

Oracle Database 19.3

As we have now created an Oracle Directory, External Table, and have a local parquet file we can try accessing it from within Oracle SQL*Plus.

Describe table
Count number of rows
Get first row

Unfortunately, if you try and access an s3 bucket from an on-premises database you will hit an ‘PLS-00201’ error as the ‘DBMS_CLOUD‘ package is not available.

on-premises error message

In a subsequent Blog I will repeat on the Oracle Cloud Free tier.

[twitter-follow screen_name=’RonEkins’ show_count=’yes’]

Leave a Reply

Create a website or blog at WordPress.com

Up ↑

%d bloggers like this: