In my Getting Started with Oracle Big Data blog I shared how to obtain an example Parquet file and set up a FlashBlade S3 bucket. If you want to follow this blog and don’t have access to a Parquet file, visit my previous blog to get started.
Big Data Tools
The Parquet file format is binary and therefore not human-readable; fortunately, we can use parquet-tools to inspect the data and schema information.
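Even before installing any tools, you can sanity-check that a file really is Parquet: every valid Parquet file begins and ends with the four ASCII magic bytes PAR1. A minimal shell sketch (the `is_parquet` helper name is my own, not part of parquet-tools):

```shell
# is_parquet FILE — succeed if FILE starts and ends with the
# Parquet magic bytes "PAR1", as every valid Parquet file does.
is_parquet() {
  [ "$(head -c 4 "$1")" = "PAR1" ] && [ "$(tail -c 4 "$1")" = "PAR1" ]
}

# Usage (file name from this blog series):
#   is_parquet sales_extended.parquet && echo "looks like Parquet"
```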
On my Mac I have installed both parquet-tools and avro-tools using Homebrew:
brew install parquet-tools
brew install avro-tools
parquet-tools help
rekins$ parquet-tools -h
parquet-tools cat:
Prints the content of a Parquet file. The output contains only the data, no
metadata is displayed
usage: parquet-tools cat [option...] <input>
where option is one of:
--debug Enable debug output
-h,--help Show this help string
-j,--json Show records in JSON format.
--no-color Disable color output even if supported
where <input> is the parquet file to print to stdout
parquet-tools head:
Prints the first n record of the Parquet file
usage: parquet-tools head [option...] <input>
where option is one of:
--debug Enable debug output
-h,--help Show this help string
-n,--records <arg> The number of records to show (default: 5)
--no-color Disable color output even if supported
where <input> is the parquet file to print to stdout
parquet-tools schema:
Prints the schema of Parquet file(s)
usage: parquet-tools schema [option...] <input>
where option is one of:
-d,--detailed Show detailed information about the schema.
--debug Enable debug output
-h,--help Show this help string
--no-color Disable color output even if supported
where <input> is the parquet file containing the schema to show
parquet-tools meta:
Prints the metadata of Parquet file(s)
usage: parquet-tools meta [option...] <input>
where option is one of:
--debug Enable debug output
-h,--help Show this help string
--no-color Disable color output even if supported
where <input> is the parquet file to print to stdout
parquet-tools dump:
Prints the content and metadata of a Parquet file
usage: parquet-tools dump [option...] <input>
where option is one of:
-c,--column <arg> Dump only the given column, can be specified more than
once
-d,--disable-data Do not dump column data
--debug Enable debug output
-h,--help Show this help string
-m,--disable-meta Do not dump row group and page metadata
-n,--disable-crop Do not crop the output based on console width
--no-color Disable color output even if supported
where <input> is the parquet file to print to stdout
parquet-tools merge:
Merges multiple Parquet files into one. The command doesn't merge row groups,
just places one after the other. When used to merge many small files, the
resulting file will still contain small row groups, which usually leads to bad
query performance.
usage: parquet-tools merge [option...] <input> [<input> ...] <output>
where option is one of:
--debug Enable debug output
-h,--help Show this help string
--no-color Disable color output even if supported
where <input> is the source parquet files/directory to be merged
<output> is the destination parquet file
parquet-tools rowcount:
Prints the count of rows in Parquet file(s)
usage: parquet-tools rowcount [option...] <input>
where option is one of:
-d,--detailed Detailed rowcount of each matching file
--debug Enable debug output
-h,--help Show this help string
--no-color Disable color output even if supported
where <input> is the parquet file to count rows to stdout
parquet-tools size:
Prints the size of Parquet file(s)
usage: parquet-tools size [option...] <input>
where option is one of:
-d,--detailed Detailed size of each matching file
--debug Enable debug output
-h,--help Show this help string
--no-color Disable color output even if supported
-p,--pretty Pretty size
-u,--uncompressed Uncompressed size
where <input> is the parquet file to get size & human readable size to stdout
parquet-tools schema
rekins$ parquet-tools schema sales_extended.parquet
message hive_schema {
optional int32 prod_id;
optional int32 cust_id;
optional binary time_id (UTF8);
optional int32 channel_id;
optional int32 promo_id;
optional int32 quantity_sold;
optional fixed_len_byte_array(5) amount_sold (DECIMAL(10,2));
optional binary gender (UTF8);
optional binary city (UTF8);
optional binary state_province (UTF8);
optional binary income_level (UTF8);
}
parquet-tools rowcount
rekins$ parquet-tools rowcount sales_extended.parquet
Total RowCount: 916039
parquet-tools size
rekins$ parquet-tools size sales_extended.parquet
Total Size: 9262640 bytes
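Combining the two figures above gives a feel for how compact the columnar encoding is: 9,262,640 bytes across 916,039 rows works out to roughly 10 bytes per row on disk. A quick sketch of the arithmetic:

```shell
# Average on-disk bytes per row, from the parquet-tools
# rowcount and size figures reported above.
awk 'BEGIN { printf "%.2f bytes/row\n", 9262640 / 916039 }'
# prints: 10.11 bytes/row
```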
parquet-tools head
rekins$ parquet-tools head -n 2 sales_extended.parquet
prod_id = 13
cust_id = 987
time_id = 1998-01-10
channel_id = 3
promo_id = 999
quantity_sold = 1
amount_sold = 1232.16
gender = M
city = Adelaide
state_province = South Australia
income_level = K: 250,000 - 299,999
prod_id = 13
cust_id = 1660
time_id = 1998-01-10
channel_id = 3
promo_id = 999
quantity_sold = 1
amount_sold = 1232.16
gender = M
city = Dolores
state_province = CO
income_level = L: 300,000 and above
Oracle Database 19.3
Unfortunately, I don’t have access to an Oracle BDA (Big Data Appliance), and the Oracle Big Data Lite virtual machine does not currently support the ORACLE_BIGDATA type, so in this blog I will be using my on-premises 19c database.
However, we can easily test access to a local Parquet file by creating an Oracle directory and then creating an external table with TYPE ORACLE_BIGDATA.
Local parquet file
$ ls -l /home/oracle/bigdata/sales_extended.parquet
-rw-r--r--. 1 oracle oinstall 9263650 Jan 1 15:28 /home/oracle/bigdata/sales_extended.parquet
Create Oracle Directory
CREATE OR REPLACE DIRECTORY BIGDATA_DIR AS '/home/oracle/bigdata';
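One point worth noting: if the external table will be owned by a schema other than the one that created the directory, that schema also needs privileges on the directory object. A sketch (the SCOTT schema name is hypothetical; substitute your own):

```sql
-- Grant the table owner access to the directory object.
-- SCOTT is a hypothetical schema name; substitute your own.
GRANT READ, WRITE ON DIRECTORY BIGDATA_DIR TO SCOTT;
```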
Create Oracle External Table
CREATE TABLE SALES_LOCAL
(
  PROD_ID        NUMBER(10),
  CUST_ID        NUMBER(10),
  TIME_ID        VARCHAR2(4000),
  CHANNEL_ID     NUMBER(10),
  PROMO_ID       NUMBER(10),
  QUANTITY_SOLD  NUMBER(10),
  AMOUNT_SOLD    NUMBER(10,2),
  GENDER         VARCHAR2(4000),
  CITY           VARCHAR2(4000),
  STATE_PROVINCE VARCHAR2(4000),
  INCOME_LEVEL   VARCHAR2(4000)
)
ORGANIZATION EXTERNAL
(
  TYPE ORACLE_BIGDATA
  DEFAULT DIRECTORY BIGDATA_DIR
  ACCESS PARAMETERS
  (
    com.oracle.bigdata.debug=TRUE
    com.oracle.bigdata.log.opt=normal
    com.oracle.bigdata.fileformat=parquet
  )
  LOCATION (BIGDATA_DIR:'sales_extended.parquet')
)
REJECT LIMIT UNLIMITED
/
As we have now created an Oracle directory and an external table, and have a local Parquet file, we can try accessing it from within Oracle SQL*Plus.
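With the directory and external table in place, a query against SALES_LOCAL reads the Parquet file directly. A minimal sketch (the count should match the 916,039 rows parquet-tools reported earlier):

```sql
-- Query the Parquet-backed external table like any other table.
SELECT COUNT(*) FROM SALES_LOCAL;

-- Peek at a few rows and columns.
SELECT prod_id, cust_id, amount_sold, city
FROM   SALES_LOCAL
WHERE  ROWNUM <= 5;
```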



Unfortunately, if you try to access an S3 bucket from an on-premises database you will hit a ‘PLS-00201’ error, as the ‘DBMS_CLOUD’ package is not available.

In a subsequent blog I will repeat this exercise on the Oracle Cloud Free Tier.