Before we take the Oracle BigData solution out for a test run, I will create an S3 bucket and put some test data into it. In this blog series I will be using Oracle BigData SQL and Pure Storage FlashBlade, which includes support for S3.
Make a Bucket
I have previously installed and configured the AWS CLI (awscli) and the s3cmd client on my Mac. If you don’t already have an S3 client installed, you can download s3cmd here.
S3cmd is a free command line tool and client for uploading, retrieving and managing data in Amazon S3, other cloud storage service providers and on-premises solutions such as Pure Storage FlashBlade that use the S3 protocol.
If you prefer, you can also use the Amazon AWS provided awscli which you can download from here.
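By default both clients talk to Amazon S3, so to use them against FlashBlade they need to be pointed at the array's S3 data endpoint. As a rough sketch (the IP address below is the endpoint used later in this post, and the access keys are placeholders), the relevant entries in the ~/.s3cfg file created by s3cmd --configure look something like this:

# ~/.s3cfg (excerpt) - placeholder credentials, FlashBlade data VIP as the endpoint
access_key = <YOUR_ACCESS_KEY>
secret_key = <YOUR_SECRET_KEY>
host_base = 10.225.112.70
host_bucket = 10.225.112.70
use_https = False

For the AWS CLI the endpoint is instead supplied on each call with --endpoint-url, as you will see in the examples below.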
Make a Bucket with s3cmd:

s3cmd mb s3://BUCKET

$ s3cmd mb s3://bigdata
Bucket 's3://bigdata/' created
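If you would rather use the AWS CLI, a roughly equivalent command (assuming the FlashBlade endpoint shown later in this post) would be:

$ aws s3api create-bucket --bucket bigdata --endpoint-url http://10.225.112.70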
List Buckets
List buckets with s3cmd:

s3cmd ls [s3://BUCKET[/PREFIX]]

$ s3cmd ls
2019-12-13 12:25  s3://bigdata

Or with aws s3api, where the table format can be used to provide a nicely formatted output:

$ aws s3api list-buckets --query "Buckets[].Name" --endpoint-url http://10.225.112.70
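For reference, the table format mentioned above can be requested explicitly with the CLI's global --output option, for example:

$ aws s3api list-buckets --query "Buckets[].Name" --output table --endpoint-url http://10.225.112.70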
BigData S3 sources
In this blog series I plan to use three popular big data file formats: Avro, Parquet and CSV.
CSV File
A comma-separated values (CSV) file allows data to be stored in a portable tabular format. Rather than create my own data, I will use data provided by the US Department of Transportation (DOT) at https://www.transtats.bts.gov.
A pre-zipped CSV file can be downloaded here.
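As a quick sanity check after unzipping the download (the file name below matches the one uploaded later in this post), you can peek at the header row and count the records, for example:

$ head -1 On_Time_Reporting_Carrier_On_Time_Performance_\(1987_present\)_2019_1.csv   # column names
$ wc -l On_Time_Reporting_Carrier_On_Time_Performance_\(1987_present\)_2019_1.csv     # row count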
Parquet File
The Parquet file format is a compressed, column-based format designed to be more efficient than the traditional, easily created CSV file format.
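If you have the Apache parquet-tools utility installed, you can inspect a Parquet file's embedded schema and row-group metadata before uploading it; for example, against the file used later in this post:

$ parquet-tools schema sales_extended.parquet   # column names and types
$ parquet-tools meta sales_extended.parquet     # row groups, encodings and compression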
Apache Avro File
The Apache Avro file format is row-based and includes the schema definition within the file (check out the Apache Avro documentation for more details).
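Similarly, if the Apache avro-tools utility is available (for example via Homebrew on a Mac), the embedded schema can be extracted from the Avro file used later in this post:

$ avro-tools getschema movie.avro   # prints the JSON schema stored in the file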
Put file(s) into Bucket
Put files into a bucket with s3cmd:

s3cmd put FILE [FILE...] s3://BUCKET[/PREFIX]

$ s3cmd put movie.avro s3://bigdata
upload: 'movie.avro' -> 's3://bigdata/movie.avro'  [1 of 1]
 331459 of 331459   100% in   11s    27.13 kB/s  done

$ s3cmd put sales_extended.parquet s3://bigdata
upload: 'sales_extended.parquet' -> 's3://bigdata/sales_extended.parquet'  [1 of 1]
 9263650 of 9263650   100% in  161s    56.16 kB/s  done

$ s3cmd put On_Time_Reporting_Carrier_On_Time_Performance_\(1987_present\)_2019_1.csv s3://bigdata
upload: 'On_Time_Reporting_Carrier_On_Time_Performance_(1987_present)_2019_1.csv' -> 's3://bigdata/On_Time_Reporting_Carrier_On_Time_Performance_(1987_present)_2019_1.csv'  [part 1 of 17, 15MB] [1 of 1]
...
upload: 'On_Time_Reporting_Carrier_On_Time_Performance_(1987_present)_2019_1.csv' -> 's3://bigdata/On_Time_Reporting_Carrier_On_Time_Performance_(1987_present)_2019_1.csv'  [part 17 of 17, 10MB] [1 of 1]
 11365512 of 11365512   100% in  209s    53.09 kB/s  done
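The same uploads can be performed with the AWS CLI, again supplying the FlashBlade endpoint explicitly; for example:

$ aws s3 cp movie.avro s3://bigdata/ --endpoint-url http://10.225.112.70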
List Objects
List objects or buckets with s3cmd:

s3cmd ls [s3://BUCKET[/PREFIX]]

$ s3cmd ls s3://bigdata
2019-12-13 14:59  263023752   s3://bigdata/On_Time_Reporting_Carrier_On_Time_Performance_(1987_present)_2019_1.csv
2019-12-13 12:58     331459   s3://bigdata/movie.avro
2019-12-13 13:03    9263650   s3://bigdata/sales_extended.parquet

Or with aws s3api:

$ aws s3api list-objects --bucket bigdata --query 'Contents[].{Key: Key, Size: Size}' --endpoint-url http://10.225.112.70
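s3cmd can also report how much space the uploaded objects consume, which is a quick way to confirm everything arrived:

$ s3cmd du s3://bigdata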
Next steps
In Part 2 we will explore how Oracle BigData SQL can be used to read data from S3 buckets.