Accessing Avro files with Oracle BigData Type

In my Getting started with Oracle BigData Blog I shared how you can obtain an example Avro file, if you don’t have access to a Avro file you can visit my previous Blog to get started.

Bigdata Tools

The Apache Avro file stores data in a row-based binary format and therefore not human readable, fortunately we can use avro-tools to inspect the data and schema information.

On my Mac I have installed both parquet-tools and avro-tools using Brew e.g.

brew install parquet-tools
brew install avro-tools

avro-tools help

$ avro-tools -h
Version 1.9.1
 of Apache Avro
Copyright 2010-2015 The Apache Software Foundation

This product includes software developed at
The Apache Software Foundation (https://www.apache.org/).
----------------
Available tools:
    canonical  Converts an Avro Schema to its canonical form
          cat  Extracts samples from files
      compile  Generates Java code for the given schema.
       concat  Concatenates avro files without re-compressing.
  fingerprint  Returns the fingerprint for the schemas.
   fragtojson  Renders a binary-encoded Avro datum as JSON.
     fromjson  Reads JSON records and writes an Avro data file.
     fromtext  Imports a text file into an avro data file.
      getmeta  Prints out the metadata of an Avro data file.
    getschema  Prints out schema of an Avro data file.
          idl  Generates a JSON schema from an Avro IDL file
 idl2schemata  Extract JSON schemata of the types from an Avro IDL file
       induce  Induce schema/protocol from Java class/interface via reflection.
   jsontofrag  Renders a JSON-encoded Avro datum as binary.
       random  Creates a file with randomly generated instances of a schema.
      recodec  Alters the codec of a data file.
       repair  Recovers data from a corrupt Avro Data file
  rpcprotocol  Output the protocol of a RPC service
   rpcreceive  Opens an RPC Server and listens for one message.
      rpcsend  Sends a single RPC message.
       tether  Run a tethered mapreduce job.
       tojson  Dumps an Avro data file as JSON, record per line or pretty.
       totext  Converts an Avro data file to a text file.
     totrevni  Converts an Avro data file to a Trevni file.
  trevni_meta  Dumps a Trevni file's metadata as JSON.
trevni_random  Create a Trevni file filled with random instances of a schema.
trevni_tojson  Dumps a Trevni file as JSON.

avro-tools getschema

The getschema option provides a pretty JSON formatted output of the schema.

rekins$ avro-tools getschema movie.avro 
 {
   "type" : "record",
   "name" : "Movie",
   "namespace" : "oracle.avro",
   "fields" : [ {
     "name" : "movie_id",
     "type" : "int",
     "default" : 0
   }, {
     "name" : "title",
     "type" : "string",
     "default" : ""
   }, {
     "name" : "year",
     "type" : "int",
     "default" : 0
   }, {
     "name" : "budget",
     "type" : "int",
     "default" : 0
   }, {
     "name" : "gross",
     "type" : "double",
     "default" : 0
   }, {
     "name" : "plot_summary",
     "type" : "string",
     "default" : ""
   } ]
 }

avro-tools getmeta

The getmeta option can also be used to provided a formatted output of the schema.

rekins$ avro-tools getmeta movie.avro 
 avro.schema {"type":"record","name":"Movie","namespace":"oracle.avro","fields":[{"name":"movie_id","type":"int","default":0},{"name":"title","type":"string","default":""},{"name":"year","type":"int","default":0},{"name":"budget","type":"int","default":0},{"name":"gross","type":"double","default":0},{"name":"plot_summary","type":"string","default":""}]}
 avro.codec null

avro-tools tojson

We can use the tojson –pretty option to format output on each line, and the –head option to return the first n rows.

rekins$ avro-tools tojson --pretty --head 1 movie.avro 
 {
   "movie_id" : 1054798,
   "title" : "Crank: High Voltage",
   "year" : 2009,
   "budget" : 20000000,
   "gross" : 3.4560577E7,
   "plot_summary" : "Crank: High Voltage (promoted as Crank 2: High Voltage in some regions and on DVD) is a 2009 American action film and sequel to the 2006 action film, Crank. The story of the film resumes shortly after the first film left off, retaining its real-time presentation and adding more special effects. Crank: High Voltage was written and directed by Mark Neveldine and Brian Taylor, who both wrote and directed the previous film. The film was released in the United Kingdom on April 15, 2009, two days prior to its North American release date."
 }

Oracle Database 19.3

I have used the same local directory for both the Avro and Parquet files as per my previous Blog. Doing this on-premises requires a little bit more effort than in the Oracle Cloud as can’t use the dbms_cloud.create_external_table procedure.

Please note currently the DBMS_CLOUD package is not available to on-premise Enterprise Edition releases.

Local Avro file

$ ls -lh /home/oracle/bigdata/movie.avro 
 -rw-r--r--. 1 oracle oinstall 324K Jan  3 16:20 /home/oracle/bigdata/movie.avro

Create Oracle Directory

CREATE OR REPLACE DIRECTORY BIGDATA_DIR AS '/home/oracle/bigdata'

Create Oracle External Table

Below I have used the details captured using avro-tools to create an external table.

CREATE TABLE MOVIE_LOCAL (
   MOVIE_ID           NUMBER(10),  
   TITLE              VARCHAR2(4000),
   YEAR               NUMBER(10),
   BUDGET             NUMBER(10),
   GROSS              BINARY_DOUBLE,
   PLOT_SUMMARY       VARCHAR2(4000)
   )
   ORGANIZATION EXTERNAL
   (TYPE ORACLE_BIGDATA
    DEFAULT DIRECTORY BIGDATA_DIR
    ACCESS PARAMETERS
    (
      com.oracle.bigdata.debug=TRUE
      com.oracle.bigdata.log.opt=normal
      com.oracle.bigdata.fileformat=avro
    )
    location (BIGDATA_DIR:'movie.avro')
   )  REJECT LIMIT UNLIMITED
 /

Accessing via SQL*Plus

As we have now created an Oracle Directory, External Table, and have a local avro file we can try accessing it from within Oracle SQL*Plus

describe table
Count number of Rows
set feed off
set pages 100
set linesize 150
 
column MOVIE_ID heading 'Movie Id'
column TITLE heading 'Title' format a20
COLUMN BUDGET FORMAT 999,999,999 Heading 'Budget'
COLUMN GROSS FORMAT 999,999,999 Heading 'Gross'
column PLOT_SUMMARY heading 'Plot Summary' format a60
 
select 
   MOVIE_ID, TITLE, YEAR, BUDGET, GROSS, PLOT_SUMMARY 
from movie_local 
where rownum=1;
Get First Row

My next step is now to repeat my on-premise tests using a Free Oracle Cloud account and the ‘DBMS_CLOUD’ package to compare experience.

[twitter-follow screen_name=’RonEkins’ show_count=’yes’]

Leave a Reply

Create a website or blog at WordPress.com

Up ↑

Discover more from Ron Ekins' - Oracle Technology, DevOps and Kubernetes Blog

Subscribe now to keep reading and get access to the full archive.

Continue reading