Migrate/Load data into Azure SQL DW

There are several ways to load data into Azure SQL Data Warehouse: Data Factory, the BCP tool, SSIS, Blob storage (PolyBase), and more (see Microsoft's documentation for the complete list).

We will compare Data Factory with the BCP (Bulk Copy Program) tool.

But first, how to install the BCP tools on Linux:

BCP Installation

Installing bcp is pretty straightforward. We use CentOS 7, so we installed bcp and sqlcmd via yum (documentation for other Linux distributions can be found on Microsoft's website).
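
For reference, on CentOS 7 the installation boils down to registering Microsoft's package repository and installing the tools (the repo URL below is the one documented for RHEL/CentOS 7; check Microsoft's docs if it has moved):

sudo curl -o /etc/yum.repos.d/msprod.repo https://packages.microsoft.com/config/rhel/7/prod.repo
sudo ACCEPT_EULA=Y yum install -y mssql-tools unixODBC-devel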

The tools will be installed in the /opt/mssql-tools/bin directory
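
Since /opt/mssql-tools/bin is usually not on the PATH, you may want to add it (assuming a bash shell):

echo 'export PATH="$PATH:/opt/mssql-tools/bin"' >> ~/.bashrc
source ~/.bashrc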

Our Use Case

We needed to migrate some of our data out of one data warehouse (HP Vertica) into Azure SQL Data Warehouse.

We exported the data to CSV: 161M rows in total, in a 30GB file.
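
As an illustrative sketch (host, table, and file names are hypothetical), a tab-separated export like this can be produced with Vertica's vsql client:

/opt/vertica/bin/vsql -h vertica_host -U vertica_user -A -t -F $'\t' -c 'SELECT * FROM schema_name.table_name' -o file_name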

We first loaded the exported CSV using BCP, and then repeated the load using Data Factory. We used the same connection string in Data Factory (Linked Services) and in BCP, which means the same login and resource class, and that has an impact on load performance.

bcp command:

/opt/mssql-tools/bin/bcp stg.table_name in file_name -S server -d database -U db_user -P db_password -t '\t' -q -c -m 10 -e error_bcp.log

Arguments (see the bcp documentation for the complete list):

  • -S  server name
  • -d  database name
  • -U  login ID
  • -P  password
  • -t  field terminator
  • -q  quoted identifiers
  • -c  character data type
  • -m  max errors
  • -e  error file

-q
Executes the SET QUOTED_IDENTIFIERS ON statement in the connection between the bcp utility and an instance of SQL Server. Use this option to specify a database, owner, table, or view name that contains a space or a single quotation mark. Enclose the entire three-part table or view name in quotation marks ("").

To specify a database name that contains a space or single quotation mark, you must use the -q option.

-q does not apply to values passed to -d.
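
For example (hypothetical object names), loading into a table whose name contains a space would look like:

/opt/mssql-tools/bin/bcp "database.schema.table name" in file_name -S server -U db_user -P db_password -q -c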

-t field_term
Specifies the field terminator. The default is \t (tab character).

-c
Performs the operation using a character data type. This option does not prompt for each field; it uses char as the storage type, without prefixes.

-m max_errors
Specifies the maximum number of syntax errors that can occur before the bcp operation is canceled. A syntax error implies a data conversion error to the target data type. The max_errors total excludes any errors that can be detected only at the server, such as constraint violations.

-e err_file
Specifies the full path of an error file used to store any rows that the bcp utility cannot transfer from the file to the database. Error messages from the bcp command go to the workstation of the user. If this option is not used, an error file is not created.
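
After a load, you can sanity-check the row count with sqlcmd (installed alongside bcp), using the same hypothetical connection details as above:

/opt/mssql-tools/bin/sqlcmd -S server -d database -U db_user -P db_password -Q 'SELECT COUNT(*) FROM stg.table_name'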

Testing with different DWU

DWU 100:

161,832,694 rows copied.
Network packet size (bytes): 4,096
Clock Time (ms.) Total : 4,025,661 Average : (40,200.3 rows per sec.)

67 minutes to load
DWU 1500:

161,832,694 rows copied.
Network packet size (bytes): 4,096
Clock Time (ms.) Total : 3,066,134 Average : (52780.7 rows per sec.)

51 minutes to load
Data Factory

Define the job:
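
Defining the job in the portal boils down to picking a source dataset, a sink dataset (the SQL DW table), and a Copy activity between them. As a minimal, hypothetical sketch, a Data Factory (v1) copy pipeline definition looks roughly like this, assuming the CSV was staged in Blob storage (dataset names, dates, and sink settings are illustrative, not our actual job):

{
  "name": "CopyCsvToSqlDw",
  "properties": {
    "description": "Bulk load the exported CSV into Azure SQL DW",
    "activities": [
      {
        "name": "BlobToSqlDw",
        "type": "Copy",
        "inputs": [ { "name": "CsvBlobDataset" } ],
        "outputs": [ { "name": "SqlDwTableDataset" } ],
        "typeProperties": {
          "source": { "type": "BlobSource" },
          "sink": { "type": "SqlDWSink", "writeBatchSize": 10000 }
        }
      }
    ],
    "start": "2017-01-01T00:00:00Z",
    "end": "2017-01-02T00:00:00Z"
  }
}

The two datasets point at the storage account and the SQL DW table through their linked services, which is where the connection string mentioned above is configured.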

DWU 100

[Screenshot: Data Factory run statistics at DWU 100]
DWU 1500

[Screenshot: Data Factory run statistics at DWU 1500]
Interesting to see that the difference between DWU 100 (costing 1.512 per hour) and DWU 1500 (22.682 per hour) is nearly negligible: 67 minutes compared to 51 minutes to load, despite a roughly 15x price difference.