Apache Sqoop – Overview

Using Hadoop for analytics and data processing requires loading data into clusters and processing it in conjunction with other data that often resides in production databases across the enterprise. Loading bulk data into Hadoop from production systems, or accessing it from map reduce applications running on large clusters, can be a challenging task. Users must consider details such as ensuring data consistency, the consumption of production system resources, and the preparation of data for provisioning downstream pipelines. Transferring data using scripts is inefficient and time consuming. Directly accessing data residing on external systems from within map reduce applications complicates the applications and exposes the production system to the risk of excessive load originating from cluster nodes.

This is where Apache Sqoop fits in. Apache Sqoop is currently undergoing incubation at the Apache Software Foundation. More information on the project can be found at http://incubator.apache.org/sqoop.

Sqoop allows easy import and export of data from structured data stores such as relational databases, enterprise data warehouses, and NoSQL systems. Using Sqoop, you can provision data from external systems onto HDFS and populate tables in Hive and HBase. Sqoop integrates with Oozie, allowing you to schedule and automate import and export tasks. Sqoop uses a connector-based architecture that supports plugins providing connectivity to new external systems.

What happens under the covers when you run Sqoop is very straightforward. The dataset being transferred is sliced up into different partitions, and a map-only job is launched with individual mappers responsible for transferring a slice of the dataset. Each record of the data is handled in a type-safe manner, since Sqoop uses the database metadata to infer the data types.
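
How that slicing is parallelized can be tuned from the command line. As a minimal sketch (the column name ORDER_ID is only a placeholder), the standard --num-mappers and --split-by options control how many mappers are launched and which column is used to partition the dataset:

# Sketch only: run the import with 8 mappers, splitting the work on ORDER_ID.
$ sqoop import --connect jdbc:mysql://localhost/acmedb \
    --table ORDERS --username test --password **** \
    --num-mappers 8 --split-by ORDER_ID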

In the rest of this post we will walk through an example that shows the various ways you can use Sqoop. The goal of this post is to give an overview of Sqoop operation without going into much detail or advanced functionality.

Importing Data

The following command is used to import all data from a table called ORDERS in a MySQL database:

$ sqoop import --connect jdbc:mysql://localhost/acmedb \
    --table ORDERS --username test --password ****

The various options specified in this command are as follows:

import: This is the sub-command that instructs Sqoop to initiate an import.

--connect, --username, --password: These are the connection parameters used to connect to the database. They are no different from the connection parameters you would use when connecting to the database via JDBC.

--table: This parameter specifies the table to be imported.

The import is done in two steps, as depicted in Figure 1 below. In the first step, Sqoop introspects the database to gather the necessary metadata for the data being imported. The second step is a map-only Hadoop job that Sqoop submits to the cluster; this job performs the actual data transfer using the metadata captured in the previous step.

Figure 1: Sqoop Import Overview

The imported data is saved in an HDFS directory based on the table being imported. As with most aspects of Sqoop operation, the user can specify an alternative directory where the files should be placed.

By default these files contain comma-delimited fields, with new lines separating records. You can easily override the format in which the data is copied by explicitly specifying the field separator and record terminator characters.
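
As a small sketch (the directory and delimiters are only examples), the standard --target-dir, --fields-terminated-by and --lines-terminated-by options override the destination directory and the default delimiters:

# Sketch only: write tab-separated records to a custom HDFS directory.
$ sqoop import --connect jdbc:mysql://localhost/acmedb \
    --table ORDERS --username test --password **** \
    --target-dir /user/arvind/orders_tsv \
    --fields-terminated-by '\t' --lines-terminated-by '\n'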

Sqoop also supports importing data in different formats. For example, you can easily import data in the Avro format by simply specifying the --as-avrodatafile option with the import command.
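
A minimal sketch of that variant, which is just the earlier import command with the Avro option added:

# Sketch only: store the imported data as Avro data files.
$ sqoop import --connect jdbc:mysql://localhost/acmedb \
    --table ORDERS --username test --password **** --as-avrodatafile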

Sqoop provides many other options that can be used to further tune the import operation to suit your specific requirements.
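
For instance, assuming hypothetical column names, the standard --columns and --where options restrict an import to a subset of columns and rows:

# Sketch only: import two columns for orders placed on or after a given date.
$ sqoop import --connect jdbc:mysql://localhost/acmedb \
    --table ORDERS --username test --password **** \
    --columns "ORDER_ID,ORDER_DATE" \
    --where "ORDER_DATE >= '2011-01-01'"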

Importing Data into Hive

In most cases, importing data into Hive amounts to running the import task and then using Hive to create and load a particular table or partition. Doing this manually requires that you know the correct type mapping between the data and other details such as the serialization format and delimiters. Sqoop takes care of populating the Hive metastore with the appropriate metadata for the table and also invokes the necessary commands to load the table or partition, as the case may be. All of this is done by simply specifying the --hive-import option with the import command.

$ sqoop import --connect jdbc:mysql://localhost/acmedb \
    --table ORDERS --username test --password **** --hive-import

When you run a Hive import, Sqoop converts the data from the native datatypes of the external datastore into the corresponding Hive types, and automatically chooses the native delimiter set used by Hive. If the data being imported contains new lines or other Hive delimiter characters, Sqoop allows you to remove such characters so that the data is correctly populated for consumption in Hive.
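
One way to do that, shown here only as a sketch, is the standard --hive-drop-import-delims option, which strips new line and other Hive delimiter characters from string fields during the import:

# Sketch only: drop Hive delimiter characters from string fields while importing.
$ sqoop import --connect jdbc:mysql://localhost/acmedb \
    --table ORDERS --username test --password **** \
    --hive-import --hive-drop-import-delims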

Once the import is complete, you can see and operate on the table just like any other table in Hive.

Importing Data into HBase

You can use Sqoop to populate data in a particular column family within an HBase table. Much like the Hive import, this is done by specifying additional options that identify the HBase table and column family being populated. All data imported into HBase is converted to its string representation and inserted as UTF-8 bytes.

$ sqoop import --connect jdbc:mysql://localhost/acmedb \
    --table ORDERS --username test --password **** \
    --hbase-create-table --hbase-table ORDERS --column-family mysql

The various options specified in this command are as follows:

--hbase-create-table: This option instructs Sqoop to create the HBase table.

--hbase-table: This option specifies the name of the HBase table to use.

--column-family: This option specifies the name of the column family to use.

The rest of the options are the same as for a regular import operation.
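
One more option worth showing as a sketch is --hbase-row-key, which selects the input column used as the HBase row key; the ORDER_ID column name here is only a placeholder:

# Sketch only: use the ORDER_ID column as the HBase row key.
$ sqoop import --connect jdbc:mysql://localhost/acmedb \
    --table ORDERS --username test --password **** \
    --hbase-table ORDERS --column-family mysql --hbase-row-key ORDER_ID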

Exporting Data

In some cases, data processed by Hadoop pipelines may be needed in production systems to help run additional critical business functions. Sqoop can be used to export such data into external datastores as necessary. Continuing the example from above, if the data generated by the pipeline on Hadoop corresponds to the ORDERS table in a database somewhere, you can populate that table using the following command:

$ sqoop export --connect jdbc:mysql://localhost/acmedb \
    --table ORDERS --username test --password **** \
    --export-dir /user/arvind/ORDERS

The various options specified in this command are as follows:

export: This is the sub-command that instructs Sqoop to initiate an export.

--connect, --username, --password: These are the connection parameters used to connect to the database. They are no different from the connection parameters you would use when connecting to the database via JDBC.

--table: This parameter specifies the table to be populated.

--export-dir: This is the HDFS directory from which the data will be exported.

The export is done in two steps, as depicted in Figure 2. The first step is to introspect the database for metadata, followed by the second step of transferring the data. Sqoop divides the input dataset into splits and then uses individual map tasks to push the splits to the database. Each map task performs this transfer over many transactions in order to ensure optimal throughput and minimal resource utilization.

Figure 2: Sqoop Export Overview

Some connectors support staging tables, which help isolate the production tables from possible corruption in case of job failures for any reason. The staging table is first populated by the map tasks and then merged into the target table once all of the data has been delivered.
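
As a sketch of how staging is requested (the staging table name is only an example), the standard --staging-table and --clear-staging-table export options stage the data before the final merge:

# Sketch only: export through a staging table that is emptied before the job runs.
$ sqoop export --connect jdbc:mysql://localhost/acmedb \
    --table ORDERS --username test --password **** \
    --export-dir /user/arvind/ORDERS \
    --staging-table ORDERS_STAGING --clear-staging-table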

Sqoop Connectors

Using specialized connectors, Sqoop can connect to external systems that have optimized import and export facilities, or that do not support native JDBC. Connectors are plugin components based on Sqoop's extension framework and can be added to any existing Sqoop installation. Once a connector is installed, Sqoop can use it to efficiently transfer data between Hadoop and the external store supported by the connector.

By default, Sqoop includes connectors for various popular databases such as MySQL, PostgreSQL, Oracle, SQL Server, and DB2. It also includes fast-path connectors for MySQL and PostgreSQL. Fast-path connectors are specialized connectors that use database-specific batch tools to transfer data with high throughput. Sqoop also includes a generic JDBC connector that can be used to connect to any database accessible via JDBC.
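
As a hedged illustration, the MySQL fast path is selected with the standard --direct option, which delegates the transfer to the MySQL bulk tools instead of going through plain JDBC:

# Sketch only: run the import through the MySQL fast-path (direct) connector.
$ sqoop import --connect jdbc:mysql://localhost/acmedb \
    --table ORDERS --username test --password **** --direct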

Apart from the built-in connectors, many companies have developed their own connectors that can be plugged into Sqoop. These range from specialized connectors for enterprise data warehouse systems to NoSQL datastores.

Wrapping Up

In this post you saw how easy it is to transfer large datasets between Hadoop and external datastores such as relational databases. Beyond this, Sqoop offers many advanced features such as different data formats, compression, and working with queries instead of tables. We encourage you to try out Sqoop and give us your feedback.
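
As one last sketch, with a made-up query, the query-based import mentioned above uses the standard --query option; Sqoop requires the $CONDITIONS token in the WHERE clause, an explicit --target-dir, and a --split-by column when running with multiple mappers:

# Sketch only: import the result of a free-form query instead of a whole table.
$ sqoop import --connect jdbc:mysql://localhost/acmedb \
    --username test --password **** \
    --query 'SELECT ORDER_ID, ORDER_DATE FROM ORDERS WHERE $CONDITIONS' \
    --split-by ORDER_ID --target-dir /user/arvind/order_query --compress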

More information regarding Sqoop can be found at:

Project Website: http://incubator.apache.org/sqoop

Wiki: https://cwiki.apache.org/confluence/display/SQOOP

Project Status:  http://incubator.apache.org/projects/sqoop.html

Mailing Lists:
https://cwiki.apache.org/confluence/display/SQOOP/Mailing+Lists

下是原文


Apache Sqoop – Overview

Using Hadoop for analytics and data processing requires loading data
into clusters and processing it in conjunction with other data that
often resides in production databases across the enterprise. Loading
bulk data into Hadoop from production systems or accessing it from map
reduce applications running on large clusters can be a challenging task.
Users must consider details like ensuring consistency of data, the
consumption of production system resources, data preparation for
provisioning downstream pipeline. Transferring data using scripts is
inefficient and time consuming. Directly accessing data residing on
external systems from within the map reduce applications complicates
applications and exposes the production system to the risk of excessive
load originating from cluster nodes.

This is where Apache Sqoop fits in. Apache Sqoop is currently undergoing
incubation at Apache Software Foundation. More information on this
project can be found at http://incubator.apache.org/sqoop.

Sqoop allows easy import and export of data from structured data stores
such as relational databases, enterprise data warehouses, and NoSQL
systems. Using Sqoop, you can provision the data from external system on
to HDFS, and populate tables in Hive and HBase. Sqoop integrates with
Oozie, allowing you to schedule and automate import and export tasks.
Sqoop uses a connector based architecture which supports plugins that
provide connectivity to new external systems.

What happens underneath the covers when you run Sqoop is very
straightforward. The dataset being transferred is sliced up into
different partitions and a map-only job is launched with individual
mappers responsible for transferring a slice of this dataset. Each
record of the data is handled in a type safe manner since Sqoop uses the
database metadata to infer the data types.

In the rest of this post we will walk through an example that shows the
various ways you can use Sqoop. The goal of this post is to give an
overview of Sqoop operation without going into much detail or advanced
functionality.

Importing Data

The following command is used to import all data from a table called
ORDERS from a MySQL database:


$ sqoop import –connect jdbc:mysql://localhost/acmedb \

 –table ORDERS –username test –password ****


In this command the various options specified are as follows:

import: This is the sub-command that instructs Sqoop to initiate an
import.

–connect , –username , –password : These are connection parameters
that are used to connect with the database. This is no different from
the connection parameters that you use when connecting to the database
via a JDBC connection.

–table: This parameter specifies the table which will be imported.

The import is done in two steps as depicted in Figure 1 below. In the
first Step Sqoop introspects the database to gather the necessary
metadata for the data being imported. The second step is a map-only
Hadoop job that Sqoop submits to the cluster. It is this job that does
the actual data transfer using the metadata captured in the previous
step.

Figure 1: Sqoop Import Overview

The imported data is saved in a directory on HDFS based on the table
being imported. As is the case with most aspects of Sqoop operation, the
user can specify any alternative directory where the files should be
populated.

By default these files contain comma delimited fields, with new lines
separating different records. You can easily override the format in
which data is copied over by explicitly specifying the field separator
and record terminator characters.

Sqoop also supports different data formats for importing data. For
example, you can easily import data in Avro data format by simply
specifying the option –as-avrodatafile with the import command.

There are many other options that Sqoop provides which can be used to
further tune the import operation to suit your specific requirements.

Importing Data into Hive

In most cases, importing data into Hive is the same as running the
import task and then using Hive to create and load a certain table or
partition. Doing this manually requires that you know the correct type
mapping between the data and other details like the serialization format
and delimiters. Sqoop takes care of populating the Hive metastore with
the appropriate metadata for the table and also invokes the necessary
commands to load the table or partition as the case may be. All of this
is done by simply specifying the option –hive-import with the import
command.


$ sqoop import –connect jdbc:mysql://localhost/acmedb \

 –table ORDERS –username test –password **** –hive-import


When you run a Hive import, Sqoop converts the data from the native
datatypes within the external datastore into the corresponding types
within Hive. Sqoop automatically chooses the native delimiter set used
by Hive. If the data being imported has new line or other Hive delimiter
characters in it, Sqoop allows you to remove such characters and get the
data correctly populated for consumption in Hive.

Once the import is complete, you can see and operate on the table just
like any other table in Hive.

Importing Data into HBase

You can use Sqoop to populate data in a particular column family within
the HBase table. Much like the Hive import, this can be done by
specifying the additional options that relate to the HBase table and
column family being populated. All data imported into HBase is converted
to their string representation and inserted as UTF-8 bytes.


$ sqoop import –connect jdbc:mysql://localhost/acmedb \

–table ORDERS –username test –password **** \

–hbase-create-table –hbase-table ORDERS –column-family mysql


In this command the various options specified are as follows:

–hbase-create-table: This option instructs Sqoop to create the HBase
table.

–hbase-table: This option specifies the table name to use.

–column-family: This option specifies the column family name to use.

The rest of the options are the same as that for regular import
operation.

Exporting Data

In some cases data processed by Hadoop pipelines may be needed in
production systems to help run additional critical business functions.
Sqoop can be used to export such data into external datastores as
necessary. Continuing our example from above – if data generated by the
pipeline on Hadoop corresponded to the ORDERS table in a database
somewhere, you could populate it using the following command:


$ sqoop export –connect jdbc:mysql://localhost/acmedb \

–table ORDERS –username test –password **** \

–export-dir /user/arvind/ORDERS


In this command the various options specified are as follows:

export: This is the sub-command that instructs Sqoop to initiate an
export.

–connect , –username , –password : These are connection parameters
that are used to connect with the database. This is no different from
the connection parameters that you use when connecting to the database
via a JDBC connection.

–table: This parameter specifies the table which will be populated.

–export-dir : This is the directory from which data will be exported.

Export is done in two steps as depicted in Figure 2. The first step is
to introspect the database for metadata, followed by the second step of
transferring the data. Sqoop divides the input dataset into splits and
then uses individual map tasks to push the splits to the database. Each
map task performs this transfer over many transactions in order to
ensure optimal throughput and minimal resource utilization.

Figure 2: Sqoop Export Overview

Some connectors support staging tables that help isolate production
tables from possible corruption in case of job failures due to any
reason. Staging tables are first populated by the map tasks and then
merged into the target table once all of the data has been delivered it.

Sqoop Connectors

Using specialized connectors, Sqoop can connect with external systems
that have optimized import and export facilities, or do not support
native JDBC. Connectors are plugin components based on Sqoop’s extension
framework and can be added to any existing Sqoop installation. Once a
connector is installed, Sqoop can use it to efficiently transfer data
between Hadoop and the external store supported by the connector.

By default Sqoop includes connectors for various popular databases such
as MySQL, PostgreSQL, Oracle, SQL Server and DB2. It also includes
fast-path connectors for MySQL and PostgreSQL databases. Fast-path
connectors are specialized connectors that use database specific batch
tools to transfer data with high throughput. Sqoop also includes a
generic JDBC connector that can be used to connect to any database that
is accessible via JDBC.

Apart from the built-in connectors, many companies have developed their
own connectors that can be plugged into Sqoop. These range from
specialized connectors for enterprise data warehouse systems to NoSQL
datastores.

Wrapping Up

In this post you saw how easy it is to transfer large datasets between
Hadoop and external datastores such as relational databases. Beyond
this, Sqoop offers many advance features such as different data formats,
compression, working with queries instead of tables etc. We encourage
you to try out Sqoop and give us your feedback.

More information regarding Sqoop can be found at:

Project Website: http://incubator.apache.org/sqoop

Wiki: https://cwiki.apache.org/confluence/display/SQOOP

Project Status:  http://incubator.apache.org/projects/sqoop.html

Mailing Lists:
https://cwiki.apache.org/confluence/display/SQOOP/Mailing+Lists

相关文章