SAP HANA Hadoop Integration

PUBLIC
SAP HANA 2.0 SP04
Document Version: 1.0 – 2019-06-20

© 2019 SAP SE or an SAP affiliate company. All rights reserved.

Content

1 SAP HANA Hadoop Integration . . . . 3
1.1 SAP HANA Spark Controller . . . . 5
1.2 Using the Simba ODBC Driver to Connect to Hive . . . . 5
2 Hive ODBC Driver Setup . . . . 6
2.1 Downloading Supporting Libraries . . . . 7
2.2 Enable Remote Caching . . . . 8
3 Create a Virtual Function . . . . 9
3.1 Adding a MapReduce Program to SAP HANA . . . . 10
4 Create a Remote Source . . . . 12


1 SAP HANA Hadoop Integration

Regardless of how your data is structured, you can combine the in-memory processing power of SAP HANA with Hadoop’s ability to store and process huge amounts of data.

SAP HANA is designed for high-speed data and analytic scenarios, while Hadoop is designed for very large, unstructured data scenarios. Hadoop can scale to thousands of nodes and is built for large distributed clusters and big data workloads. Combining the two pairs Hadoop’s lower storage cost and flexibility in the types of data it can store with SAP HANA’s high-speed, in-memory processing of highly structured data.

SAP HANA Hadoop integration is designed for users who may want to start using SAP HANA with their Hadoop ecosystem. This document assumes you have a Hadoop cluster installed.

Integrating SAP HANA and Hadoop

There are several methods for setting up communication between SAP HANA and your Hadoop system. The documents listed under SAP HANA Hadoop Integration focus on data access in conjunction with Smart Data Access (SDA), which lets you access remote data as if it were stored in local SAP HANA tables, using one of the following:

● (Recommended) SAP HANA Spark controller
● Hive ODBC driver
● WebHDFS REST API interface for HDFS

SAP HANA Spark Controller and Hadoop Architecture


For more information, see the SAP HANA Spark Controller Installation Guide.

Hive ODBC and Hadoop Architecture

SAP Vora

SAP Vora provides in-memory processing engines that run on a Hadoop cluster and the Spark execution framework. Able to scale to thousands of nodes, SAP Vora is designed for use in large distributed clusters and for handling big data.

The SAP Vora solution is built on the Hadoop ecosystem, which provides a collection of components that support distributed processing of large data sets across a cluster of machines.

See SAP Vora and SAP Data Hub.

SAP Cloud Platform Big Data Services

You can use SAP HANA with an SAP Cloud Platform Big Data Services cluster to enable virtual or physical data movement between your SAP HANA system and Big Data Services.

Depending on the use case, SAP HANA Spark controller can run as a managed service of Big Data Services. As a managed service, you do not need to deploy Spark controller yourself.

For more information, see SAP Cloud Platform Big Data Services.

Related Information

SAP HANA Spark Controller [page 5]
Using the Simba ODBC Driver to Connect to Hive [page 5]


1.1 SAP HANA Spark Controller

Spark controller is an SAP HANA software component that integrates SAP HANA with your Hadoop cluster, allowing SAP HANA to access and process data stored in Hadoop.

For installation and configuration instructions, see the SAP HANA Spark Controller Installation Guide.

1.2 Using the Simba ODBC Driver to Connect to Hive

The Simba Hive ODBC Driver is a connector to Apache Hive, which provides a SQL-oriented query language and a quick and easy way to work with data stored in HDFS on a Hadoop cluster. Hive leverages Hadoop’s processing power and storage capacity.

Installing the ODBC driver allows you to federate data, combining cold data stored in Hadoop with warm data stored in SAP HANA. Using Hive, you can combine this data in a single SQL query.
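For illustration, a federated query might look like the following. Here hive_sales_archive is assumed to be a virtual table created over the Hive remote source and hana_sales_current a local SAP HANA table; all names and columns are hypothetical.

-- Combine warm data in SAP HANA with cold data archived in Hadoop (all names are examples)
SELECT region, SUM(amount) AS total_amount
FROM (
    SELECT region, amount FROM hana_sales_current   -- local SAP HANA table (warm data)
    UNION ALL
    SELECT region, amount FROM hive_sales_archive   -- virtual table over Hive (cold data)
) combined
GROUP BY region;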

Note: The Hive ODBC driver is supported on Intel-based hardware platforms only.

For installation and configuration instructions, see Hive ODBC Driver Setup [page 6].


2 Hive ODBC Driver Setup

Edit interface files and create environment variables to set up the Hive ODBC driver as your first step towards getting SAP HANA to communicate with a Hadoop system.

Prerequisites

Verify that you have the Hive driver installed on the machine hosting the SAP HANA instance by checking for /opt/simba/hiveodbc/lib/64/libsimbahiveodbc64.so.
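For example, from a shell on the SAP HANA host:

# Confirm that the Simba Hive ODBC shared library exists
ls -l /opt/simba/hiveodbc/lib/64/libsimbahiveodbc64.so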

Check whether the GNU_STACK program header is flagged RWE:

readelf -l /opt/simba/hive/lib/64/libhiveodbc_sb64.so | grep -A1 GNU_STACK

If the output shows RWE, set it to RW by executing:

execstack -c /opt/simba/hive/lib/64/libhiveodbc_sb64.so

Procedure

1. Log on as the <sid>adm user, your SAP HANA administration user.

2. Stop SAP HANA.
3. Copy odbc.ini to your home directory as .simba.hiveodbc.ini:

cp /opt/simba/hive/Setup/odbc.ini ~/.simba.hiveodbc.ini

4. Edit your newly copied .simba.hiveodbc.ini file with the following:

○ If there is a line with DriverManagerEncoding=UTF-32, change the value to UTF-16.
○ Make sure the line ErrorMessagePath=/opt/simba/hive/ErrorMessages exists.
○ Comment out the line: ODBCInstLib=libiodbcint.so.
○ Uncomment the line: ODBCInstLib=libodbcinst.so.
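After these edits, the relevant lines of ~/.simba.hiveodbc.ini should look roughly like the sketch below. The [Driver] section name is the one typically used by the Simba driver; verify it against your copied file, and leave all other settings unchanged.

[Driver]
DriverManagerEncoding=UTF-16
ErrorMessagePath=/opt/simba/hive/ErrorMessages
# ODBCInstLib=libiodbcint.so
ODBCInstLib=libodbcinst.so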

5. Open or create your .odbc.ini file in your home directory, and add the following:

[hive1]
Driver=/opt/simba/hive/lib/64/libhiveodbc_sb64.so
Host=<server.com>
Port=<portNumber>

Where:
○ Host is the machine running Hive.
○ Port is the port on which Hive is listening (the default is 10000).
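For illustration, a filled-in entry might look like this (hadoop-master.example.com is a placeholder host name, and 10000 is the default Hive port):

[hive1]
Driver=/opt/simba/hive/lib/64/libhiveodbc_sb64.so
Host=hadoop-master.example.com
Port=10000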


6. Edit the $HOME/.customer.sh file:

a. Set the path of the shared libraries (LD_LIBRARY_PATH) to the Hive ODBC library at /opt/simba/hive/lib/64. For example:

$> export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/opt/simba/hive/lib/64

b. Set the ODBCINI environment variable to your .odbc.ini file location. For example:

$> export ODBCINI=$HOME/.odbc.ini

7. Restart the Hadoop server and verify that you can connect to Hive through the ODBC driver using the data source name (DSN). For example:

$> isql -v hive1

8. Start SAP HANA.

Related Information

Downloading Supporting Libraries [page 7]
Enable Remote Caching [page 8]

2.1 Downloading Supporting Libraries

Use third-party libraries to process dates and times, and comma-separated values (CSV) files.

Procedure

1. If it does not already exist, create the following Hadoop Distributed File System (HDFS) directory in Hadoop:

/sap/hana/mapred/lib

2. If they do not already exist, add the following libraries:

○ joda-time-2.3.jar
Download from http://mvnrepository.com/artifact/joda-time/joda-time/2.3.
○ solr-commons-csv-3.5.0.jar
Download from http://mvnrepository.com/artifact/org.apache.solr/solr-commons-csv/3.5.0.
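As a sketch, assuming the standard hdfs command-line client is available on a cluster node, the directory can be created and the downloaded JAR files uploaded as follows:

# Create the library directory in HDFS (the -p flag avoids an error if it already exists)
hdfs dfs -mkdir -p /sap/hana/mapred/lib
# Upload the supporting libraries
hdfs dfs -put joda-time-2.3.jar /sap/hana/mapred/lib/
hdfs dfs -put solr-commons-csv-3.5.0.jar /sap/hana/mapred/lib/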


2.2 Enable Remote Caching

When using the Hive ODBC driver to connect SAP HANA and Hadoop, you can enable remote caching in Hive for queries on low-velocity data, which allows you to reuse materialized data for repeated executions of the same query.

When SAP HANA dispatches a virtual table query to Hive, it triggers a series of map and reduce job executions. Completing such a query can take hours, depending on the data size in Hadoop and the current cluster capacity. In most cases, the data in the Hadoop cluster is not frequently updated, so successive executions of the map and reduce jobs may compute the same results. Remote caching through the Hive interface lets you use a cached remote data set instead of waiting for the MapReduce jobs to run again. The first time you run a statement, you see no performance improvement because of the time it takes to run the job and materialize the data in the table. The next time you run the same query, you access the materialized data, reducing the execution time for the job.

Use this feature for Hive tables with low-velocity data (which are not frequently updated).

This behavior is controlled by using a hint to instruct the optimizer to use remote caching. For example:

Example

You have created a virtual table hive_activity_log and want to fetch all erroneous entries for plant 001:

select * from hive_activity_log where incident_type = 'ERROR' and plant = '001' with hint (USE_REMOTE_CACHE)

When you use the hint USE_REMOTE_CACHE, this result set is materialized in Hive and subsequent queries are served from the materialized view.
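If the virtual table does not exist yet, it can be created over the Hive remote source first. A minimal sketch, assuming a remote source named Hadoop_test and a Hive table activity_log in the default database (the intermediate identifiers depend on how the remote source exposes the Hive catalog):

-- Expose the Hive table as a virtual table in SAP HANA (names are examples)
CREATE VIRTUAL TABLE hive_activity_log
    AT "Hadoop_test"."HIVE"."default"."activity_log";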

Configuration Parameters

The following configuration parameters are used for remote caching and are stored in the indexserver.ini file in the smart_data_access section:

Parameter: enable_remote_cache ('true' | 'false')
Description: A global switch to enable or disable remote caching for federated queries. This parameter only supports Hive sources. The USE_REMOTE_CACHE hint is ignored when this parameter is disabled.

Parameter: remote_cache_validity = 3600 (seconds)
Description: Defines how long the remote cache remains valid. By default, the cache is retained for one hour.
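Both parameters can also be set from SQL. A sketch, setting them at the SYSTEM layer (the values shown are only examples):

-- Enable remote caching globally and keep cached results valid for one hour
ALTER SYSTEM ALTER CONFIGURATION ('indexserver.ini', 'SYSTEM')
    SET ('smart_data_access', 'enable_remote_cache')   = 'true',
        ('smart_data_access', 'remote_cache_validity') = '3600'
    WITH RECONFIGURE;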


3 Create a Virtual Function

Create virtual functions when using the WebHDFS REST API interface for HDFS, so that applications can use Hadoop MapReduce jobs and virtual user-defined functions.

Prerequisites

● A remote source exists.
● If you are creating a virtual function that uses a MapReduce package, also add it to SAP HANA before creating the virtual function.

Context

A virtual user-defined function (UDF) provides an abstraction for Hadoop MapReduce jobs. It has no function body and can be used in normal SQL statements.

Procedure

1. In SAP HANA studio, go to the SAP HANA Development perspective, then open the Project Explorer view.
2. In the Project Explorer view, right-click the shared project folder where you want to create the new virtual function and choose New > Other... > SAP HANA > Database Development > Hadoop Virtual Function in the context-sensitive menu. Select Next to select the parent folder and specify the file name for your Hadoop virtual function.

3. Edit and then save the following Hadoop virtual function creation template:

VIRTUAL FUNCTION <valid schema>.<function name>()
RETURNS TABLE <return table type>
PACKAGE <hadoop mrjobs archive schema>.<hadoop mrjobs archive name>
CONFIGURATION '<remote proc properties>'
AT <hadoop remote source name>

4. Activate the Hadoop virtual function by right-clicking the function name in the Project Explorer and selecting Team > Activate in the context-sensitive menu.

5. Invoke the virtual function by highlighting the corresponding SELECT statement in the SQL Console and executing it.

6. To see the job results, open the Result tab in the SQL Console. For example:

VIRTUAL FUNCTION "SYSTEM"."hadoop.mrjobs.demos::DEMO_VF"() RETURNS TABLE ( WORD NVARCHAR(60), COUNT integer) PACKAGE SYSTEM.WORD_COUNT

SAP HANA Hadoop IntegrationCreate a Virtual Function P U B L I C 9

CONFIGURATION 'enable_caching=true;mapred_jobchain=[{"mapred_input":"/apps/hive/warehouse/dflo.db/region/","mapred_mapper":"com.sap.hana.hadoop.samples.WordMapper","mapred_reducer":"com.sap.hana.hadoop.samples.WordReducer"}]' AT "hadoop.mrjobs.demos:: DEMO_SRC"
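Once activated, the virtual function can be queried like any table function; a minimal sketch using the example name above:

-- Run the MapReduce job and return its output as a table
SELECT * FROM "SYSTEM"."hadoop.mrjobs.demos::DEMO_VF"()
    ORDER BY "COUNT" DESC;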

Results

You have queried a Hadoop file from SAP HANA using a custom MapReduce job and simple SQL.

For more information, see the following commands in the SAP HANA SQL and System Views Reference:

● VIRTUAL_FUNCTIONS System View
● VIRTUAL_FUNCTIONS_PACKAGES System View
● CREATE_REMOTE_SOURCE Statement (Access Control)
● CREATE_VIRTUAL_FUNCTION Statement (Procedural)

Related Information

Adding a MapReduce Program to SAP HANA [page 10]

3.1 Adding a MapReduce Program to SAP HANA

You can push a Java MapReduce (MR) job to the SAP HANA repository as a .hdbmrjob repository file, which includes all JAR files for the created MapReduce job. This is a prerequisite for applications to make use of Hadoop MapReduce jobs and virtual user-defined functions.

Prerequisites

You have created a Java project in the SAP HANA Development perspective of the SAP HANA studio. For more information about creating a project, see Create a Project for SAP HANA XS in the SAP HANA Developer Guide.

Context

SAP HANA Hadoop controller facilitates MapReduce functionality. It is an adapter that can be installed as a delivery unit in the SAP HANA XS engine and can be pushed to Hadoop. You can download the controller from SAP Marketplace. To install it, assign the sap.hana.xs.lm.roles::Administrator role to your SAP HANA user, then start the SAP HANA Application Lifecycle Manager and import the controller as a delivery unit.


Procedure

1. In SAP HANA studio, go to the SAP HANA Development perspective and open the Project Explorer view.
2. In the Project Explorer view, right-click the shared project folder where you want to create the new virtual function, and choose New > Other... > SAP HANA > Database Development > Hadoop MR Jobs Archive in the context-sensitive menu.

3. Select Next to select the parent folder and specify the file name for your Hadoop MapReduce jobs archive. Then select Next.

4. Select the Java project folder holding the source of the Hadoop MapReduce jobs program. An archive (JAR file) containing all the dependencies needed to run the program is created and associated with the shared project folder.

5. Activate the Hadoop MapReduce jobs archive by right-clicking the archive in the Project Explorer and selecting Team > Activate in the context-sensitive menu.

Results

During activation of the design-time object, the corresponding runtime object .hdbmrjob is created and stored in the catalog.

Next Steps

You can submit your MapReduce job files to a Hadoop cluster from a SQL statement, and define a virtual user-defined function where you specify the Hadoop MapReduce job name stored as a catalog object.
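As a sketch, the SQL form mirrors the design-time template shown earlier. WORD_COUNT_VF is a hypothetical function name; the package and remote source are the example objects used in this guide:

-- Define a virtual UDF that runs the MapReduce job stored as a catalog object
CREATE VIRTUAL FUNCTION "SYSTEM"."WORD_COUNT_VF"()
    RETURNS TABLE (WORD NVARCHAR(60), "COUNT" INTEGER)
    PACKAGE SYSTEM.WORD_COUNT
    CONFIGURATION 'enable_caching=true;mapred_jobchain=[{"mapred_input":"/apps/hive/warehouse/dflo.db/region/","mapred_mapper":"com.sap.hana.hadoop.samples.WordMapper","mapred_reducer":"com.sap.hana.hadoop.samples.WordReducer"}]'
    AT "hadoop.mrjobs.demos::DEMO_SRC";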


4 Create a Remote Source

Create the remote data source by running a SQL statement.

Context

Use the fully qualified domain name of your host. You can find the host name by running the command hostname -f.

Procedure

Execute one of the following statements:

To create a remote source that connects to a Hive destination, use this syntax:

CREATE REMOTE SOURCE Hadoop_test ADAPTER "hiveodbc"
CONFIGURATION 'DSN=hwp'
WITH CREDENTIAL TYPE 'PASSWORD' USING 'user=<user_name>;password=<password>';

To create a remote source for running VIRTUAL FUNCTION MapReduce jobs, use this syntax:

CREATE REMOTE SOURCE "hadoop_demo" ADAPTER "hadoop" CONFIGURATION 'webhdfs_url=http://full_qualified_domain_name:50070;webhcat_url=http://full_qualified_domain_name:50111'WITH CREDENTIAL TYPE 'PASSWORD' USING 'user=hdfs;password=hdfs';

You can see the remote source under Provisioning > Remote Sources.
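You can also confirm the remote source from SQL by querying the REMOTE_SOURCES system view:

-- List all remote sources known to this SAP HANA system
SELECT * FROM SYS.REMOTE_SOURCES;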


Important Disclaimers and Legal Information

Hyperlinks

Some links are classified by an icon and/or a mouseover text. These links provide additional information.
About the icons:

● Links with the icon : You are entering a Web site that is not hosted by SAP. By using such links, you agree (unless expressly stated otherwise in your agreements with SAP) to this:

● The content of the linked-to site is not SAP documentation. You may not infer any product claims against SAP based on this information.
● SAP does not agree or disagree with the content on the linked-to site, nor does SAP warrant the availability and correctness. SAP shall not be liable for any damages caused by the use of such content unless damages have been caused by SAP's gross negligence or willful misconduct.

● Links with the icon : You are leaving the documentation for that particular SAP product or service and are entering a SAP-hosted Web site. By using such links, you agree that (unless expressly stated otherwise in your agreements with SAP) you may not infer any product claims against SAP based on this information.

Beta and Other Experimental Features

Experimental features are not part of the officially delivered scope that SAP guarantees for future releases. This means that experimental features may be changed by SAP at any time for any reason without notice. Experimental features are not for productive use. You may not demonstrate, test, examine, evaluate or otherwise use the experimental features in a live operating environment or with data that has not been sufficiently backed up.
The purpose of experimental features is to get feedback early on, allowing customers and partners to influence the future product accordingly. By providing your feedback (e.g. in the SAP Community), you accept that intellectual property rights of the contributions or derivative works shall remain the exclusive property of SAP.

Example Code

Any software coding and/or code snippets are examples. They are not for productive use. The example code is only intended to better explain and visualize the syntax and phrasing rules. SAP does not warrant the correctness and completeness of the example code. SAP shall not be liable for errors or damages caused by the use of example code unless damages have been caused by SAP's gross negligence or willful misconduct.

Gender-Related Language

We try not to use gender-specific word forms and formulations. As appropriate for context and readability, SAP may use masculine word forms to refer to all genders.

Videos Hosted on External Platforms

Some videos may point to third-party video hosting platforms. SAP cannot guarantee the future availability of videos stored on these platforms. Furthermore, any advertisements or other content hosted on these platforms (for example, suggested videos or by navigating to other videos hosted on the same site), are not within the control or responsibility of SAP.


www.sap.com/contactsap

© 2019 SAP SE or an SAP affiliate company. All rights reserved.

No part of this publication may be reproduced or transmitted in any form or for any purpose without the express permission of SAP SE or an SAP affiliate company. The information contained herein may be changed without prior notice.

Some software products marketed by SAP SE and its distributors contain proprietary software components of other software vendors. National product specifications may vary.

These materials are provided by SAP SE or an SAP affiliate company for informational purposes only, without representation or warranty of any kind, and SAP or its affiliated companies shall not be liable for errors or omissions with respect to the materials. The only warranties for SAP or SAP affiliate company products and services are those that are set forth in the express warranty statements accompanying such products and services, if any. Nothing herein should be construed as constituting an additional warranty.

SAP and other SAP products and services mentioned herein as well as their respective logos are trademarks or registered trademarks of SAP SE (or an SAP affiliate company) in Germany and other countries. All other product and service names mentioned are the trademarks of their respective companies.

Please see https://www.sap.com/about/legal/trademark.html for additional trademark information and notices.
