
I am using Amazon Elastic MapReduce (EMR) 4.7.1, Hadoop 2.7.2, Hive 1.0.0, and Spark 1.6.1.

Use case: I have a Spark cluster used for processing data. That data is stored in S3 as Parquet files. I want tools to be able to query the data using names registered in the Hive Metastore (e.g., looking up the foo table rather than the parquet.`s3://bucket/key/prefix/foo/parquet` style of reference). I also want that data to persist for the lifetime of the Hive Metastore (a separate RDS instance) even if I tear down the EMR cluster and spin up a new one connected to the same Metastore.

Problem: if I do something like df.write.saveAsTable("foo"), that will, by default, create a managed table in the Hive Metastore (see https://spark.apache.org/docs/latest/sql-programming-guide.html). Managed tables copy the data from S3 into HDFS on the EMR cluster, which means the metadata would be useless after tearing down the EMR cluster.
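
For concreteness, here is a minimal PySpark sketch of that default behaviour (the DataFrame contents and the foo table name are only illustrative):

from pyspark import SparkContext
from pyspark.sql import HiveContext

sc = SparkContext()
sqlContext = HiveContext(sc)
df = sqlContext.createDataFrame([(1, "a"), (2, "b")], ["key", "value"])

# With no explicit path, saveAsTable registers "foo" as a MANAGED table, so the
# Parquet files land in the warehouse directory (HDFS on the EMR cluster)
# instead of staying in S3.
df.write.format("parquet").saveAsTable("foo")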

Sam King

3 Answers


The solution was to register the S3 file as an external table.

sqlContext.createExternalTable("foo", "s3://bucket/key/prefix/foo/parquet")

I haven't figured out how to save a file to S3 and register it as an external table all in one shot, but createExternalTable doesn't add too much overhead.
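
For what it's worth, a sketch of what the one-shot version might look like (untested here; the idea is that an explicit "path" option should make saveAsTable register the table as unmanaged, backed by S3, rather than copying data into HDFS; the bucket and table names are placeholders):

from pyspark import SparkContext
from pyspark.sql import HiveContext

sc = SparkContext()
sqlContext = HiveContext(sc)
df = sqlContext.createDataFrame([(1, "a"), (2, "b")], ["key", "value"])

# Untested: an explicit "path" option should make saveAsTable create an
# unmanaged (external-style) table whose data stays at the S3 location
# instead of being copied into the cluster's HDFS warehouse directory.
df.write \
    .format("parquet") \
    .option("path", "s3://bucket/key/prefix/foo/parquet") \
    .saveAsTable("foo")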

Sam King

The way I solved this problem: first, create the Hive table from Spark:

from pyspark.sql.types import StructType, StructField, IntegerType, StringType

# Register "data1" in the Hive Metastore, backed by the Parquet data in S3.
schema = StructType([StructField("key", IntegerType(), True),
                     StructField("value", StringType(), True)])
df = spark.catalog.createTable("data1", "s3n://XXXX-Buket/data1", schema=schema)

Next, the table created from Spark above (data1 in this case) will appear in Hive.

In addition, from any other Hive engine you can link to this data in S3 by creating an external table with the same schema as the one created in Spark:

CREATE EXTERNAL TABLE data1 (key INT, value STRING) STORED AS PARQUET LOCATION 's3n://XXXX-Buket/data1'
Ho Thuan

You don't need EMR for this. Just fire up Athena and create a table that reads the Parquet data directly from S3. This is much less expensive than running an EMR cluster, and there is no cluster to maintain. You can also use JDBC to access this data via Athena in real time.
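
For example, a rough sketch of registering such a table through Athena's API with boto3 (the database name, table name, and S3 locations below are placeholders, and start_query_execution only submits the DDL asynchronously):

import boto3

athena = boto3.client("athena", region_name="us-east-1")

# Placeholder names: adjust the database, table, and S3 locations.
ddl = """
CREATE EXTERNAL TABLE IF NOT EXISTS foo (key INT, value STRING)
STORED AS PARQUET
LOCATION 's3://bucket/key/prefix/foo/parquet'
"""

athena.start_query_execution(
    QueryString=ddl,
    QueryExecutionContext={"Database": "default"},
    ResultConfiguration={"OutputLocation": "s3://bucket/athena-query-results/"},
)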

Sandip Sinha
  • Unfortunately, Athena isn't HIPAA compliant – Sam King Apr 27 '17 at 03:52
  • The entire AWS service is not HIPAA compliant. If he is using EMR and Hive on AWS then he might as well use Athena, which is basically a Presto engine working on Hive tables. – Sandip Sinha Apr 27 '17 at 17:16
  • That's not accurate. You can see https://aws.amazon.com/compliance/hipaa-compliance/ for their HIPAA documentation. – Sam King Apr 28 '17 at 02:28
  • I have copied a part of their FAQ under the same link that you provided.... Is AWS HIPAA-Certified? ---> There is no HIPAA certification for a cloud provider such as AWS. In order to meet the HIPAA requirements applicable to our operating model, AWS aligns our HIPAA risk management program with FedRAMP and NIST 800-53, a higher security standard that maps to the HIPAA security rule. NIST supports this alignment and has issued SP 800-66, "An Introductory Resource Guide for Implementing the HIPAA Security Rule," which documents how NIST 800-53 aligns to the HIPAA Security rule. – Sandip Sinha Apr 28 '17 at 05:33
  • Correct, AWS as a whole is not HIPAA certified; only certain services like EC2, S3, and EMR are. Thus the importance of using the HIPAA-certified services and avoiding services like Athena when dealing with health data. – Sam King Apr 29 '17 at 07:29