Before we take the Oracle BigData solution out for a test run, I will create an S3 bucket and put some test data into it. In this blog series I will be using Oracle BigData SQL and Pure Storage FlashBlade, which includes support for S3.
Make a Bucket
I have previously installed and configured the AWS CLI (awscli) and the s3cmd client on my Mac. If you don’t already have an S3 client installed, you can download s3cmd here.
S3cmd is a free command line tool and client for uploading, retrieving and managing data in Amazon S3, other cloud storage service providers and on-premises solutions such as Pure Storage FlashBlade that use the S3 protocol.
If you prefer, you can also use the Amazon AWS provided awscli which you can download from here.
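By default both clients talk to Amazon S3, so to use them against FlashBlade they need to be pointed at the array's S3 data endpoint. As a rough sketch (the IP address below is the endpoint used later in this post, and the access keys are placeholders), the relevant entries in the ~/.s3cfg file created by s3cmd --configure look something like this:

# ~/.s3cfg (excerpt) - placeholder credentials, FlashBlade data VIP as the endpoint
access_key = <YOUR_ACCESS_KEY>
secret_key = <YOUR_SECRET_KEY>
host_base = 10.225.112.70
host_bucket = 10.225.112.70
use_https = False

For the AWS CLI the endpoint is instead supplied on each call with --endpoint-url, as you will see in the examples below.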
Make a Bucket with s3cmd:

s3cmd mb s3://BUCKET

$ s3cmd mb s3://bigdata
Bucket 's3://bigdata/' created
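If you would rather use the AWS CLI, a roughly equivalent command (assuming the FlashBlade endpoint shown later in this post) would be:

$ aws s3api create-bucket --bucket bigdata --endpoint-url http://10.225.112.70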
List Buckets
List buckets with s3cmd:

s3cmd ls [s3://BUCKET[/PREFIX]]

$ s3cmd ls
2019-12-13 12:25  s3://bigdata

Or with aws s3api, where the table format can be used to provide a nicely formatted output:

$ aws s3api list-buckets --query "Buckets[].Name" --endpoint-url http://10.225.112.70
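For reference, the table format mentioned above can be requested explicitly with the CLI's global --output option, for example:

$ aws s3api list-buckets --query "Buckets[].Name" --output table --endpoint-url http://10.225.112.70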
BigData S3 sources
In this blog series I plan to use three popular big data file formats: Avro, Parquet and CSV.
CSV File
A comma-separated values (CSV) file allows data to be stored in a portable tabular format. Rather than create my own data, I will use data provided by the US Department of Transportation (DOT) at https://www.transtats.bts.gov.
A pre-zipped CSV file can be downloaded here.
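As a quick sanity check after unzipping the download (the file name below matches the one uploaded later in this post), you can peek at the header row and count the records, for example:

$ head -1 On_Time_Reporting_Carrier_On_Time_Performance_\(1987_present\)_2019_1.csv   # column names
$ wc -l On_Time_Reporting_Carrier_On_Time_Performance_\(1987_present\)_2019_1.csv     # row count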
Parquet File
The Parquet file format is a compressed, column-based format designed to be more efficient than the traditional, easily created CSV file format.
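If you have the Apache parquet-tools utility installed, you can inspect a Parquet file's embedded schema and row-group metadata before uploading it; for example, against the file used later in this post:

$ parquet-tools schema sales_extended.parquet   # column names and types
$ parquet-tools meta sales_extended.parquet     # row groups, encodings and compression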
Apache Avro File
The Apache Avro file format is row-based and includes the schema definition within the file (check out the Apache Avro documentation for more details).
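Similarly, if the Apache avro-tools utility is available (for example via Homebrew on a Mac), the embedded schema can be extracted from the Avro file used later in this post:

$ avro-tools getschema movie.avro   # prints the JSON schema stored in the file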
Put file(s) into Bucket
Put files into a bucket with s3cmd:

s3cmd put FILE [FILE...] s3://BUCKET[/PREFIX]

$ s3cmd put movie.avro s3://bigdata
upload: 'movie.avro' -> 's3://bigdata/movie.avro'  [1 of 1]
 331459 of 331459   100% in   11s    27.13 kB/s  done

$ s3cmd put sales_extended.parquet s3://bigdata
upload: 'sales_extended.parquet' -> 's3://bigdata/sales_extended.parquet'  [1 of 1]
 9263650 of 9263650   100% in  161s    56.16 kB/s  done

$ s3cmd put On_Time_Reporting_Carrier_On_Time_Performance_\(1987_present\)_2019_1.csv s3://bigdata
upload: 'On_Time_Reporting_Carrier_On_Time_Performance_(1987_present)_2019_1.csv' -> 's3://bigdata/On_Time_Reporting_Carrier_On_Time_Performance_(1987_present)_2019_1.csv'  [part 1 of 17, 15MB] [1 of 1]
...
upload: 'On_Time_Reporting_Carrier_On_Time_Performance_(1987_present)_2019_1.csv' -> 's3://bigdata/On_Time_Reporting_Carrier_On_Time_Performance_(1987_present)_2019_1.csv'  [part 17 of 17, 10MB] [1 of 1]
 11365512 of 11365512   100% in  209s    53.09 kB/s  done
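The same uploads can be performed with the AWS CLI, again supplying the FlashBlade endpoint explicitly; for example:

$ aws s3 cp movie.avro s3://bigdata/ --endpoint-url http://10.225.112.70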
List Objects
List objects or buckets with s3cmd:

s3cmd ls [s3://BUCKET[/PREFIX]]

$ s3cmd ls s3://bigdata
2019-12-13 14:59  263023752   s3://bigdata/On_Time_Reporting_Carrier_On_Time_Performance_(1987_present)_2019_1.csv
2019-12-13 12:58     331459   s3://bigdata/movie.avro
2019-12-13 13:03    9263650   s3://bigdata/sales_extended.parquet

Or with aws s3api:

$ aws s3api list-objects --bucket bigdata --query 'Contents[].{Key: Key, Size: Size}' --endpoint-url http://10.225.112.70
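s3cmd can also report how much space the uploaded objects consume, which is a quick way to confirm everything arrived:

$ s3cmd du s3://bigdata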
Next steps
In Part 2 we will explore how Oracle BigData SQL can be used to read data from S3 buckets.