Pentaho for ETL

By admin, August 13, 2018

Recently I was asked to install and run a bridge between an MS SQL database holding SAP datasets and a SugarCRM OnDemand (cloud) instance.

Thus, I created a development environment with the same data and connections. The steps for running Spoon (the GUI of Kettle, the ETL tool acquired by Pentaho) follow.

Applicable to Pentaho Versions: 7 – 8.1

1 – Download the application itself and unzip it into a suitable folder, e.g. d:\pentaho\data-integration

For version 8.1, the file is pdi-ce-8.1.0.0-365.zip and can be found at https://sourceforge.net/projects/pentaho/files/Pentaho%208.1/client-tools/

2 – Make sure you have 64-bit Java 8 installed on your computer.

   2.1 Go to Control Panel -> System -> Advanced -> Environment Variables

   2.2 Set the JAVA_HOME variable in Windows to the Java installation path.

   2.3 Add the Java bin folder to the system PATH variable.
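To verify the setup, open a new command prompt and run the two commands below; the second should report a 1.8.x 64-bit VM:

echo %JAVA_HOME%
java -version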

3 – Prepare to run Spoon.

For that, try to run it by double-clicking the Spoon.bat file.

In case it does not run, edit the startup script.

Open the Spoon.bat (or spoon.sh) file and find the line with PENTAHO_DI_JAVA_OPTIONS.

Change the original -Xms1024m to -Xms512m and the original -Xmx2048m to -Xmx1024m.
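The exact line varies by version; in 8.1 it should end up looking roughly like this (keep whatever extra options your copy already has):

if "%PENTAHO_DI_JAVA_OPTIONS%"=="" set PENTAHO_DI_JAVA_OPTIONS="-Xms512m" "-Xmx1024m" "-XX:MaxPermSize=256m"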

Try again. Spoon should now start.

Notice

The original values are the minimal production memory configuration; the reduced values are for development only. For big databases, the new 512m and 1024m may not be enough for some use cases.

4 – Download and unzip the most commonly used JDBC drivers, for MS SQL and MySQL:

Microsoft Native driver
https://www.microsoft.com/en-us/download/details.aspx?id=54670
For MySQL
https://dev.mysql.com/downloads/file/?id=479586

Take the jar files and put them inside data-integration\lib

e.g.

Save the file sqljdbc_4.1\enu\jre7\sqljdbc41.jar inside data-integration\lib

5 – Open MS SQL Server and run a few commands to confirm the server information. For this, use the MS SQL management tool (e.g. SQL Server Management Studio):

select host_name();
select SERVERPROPERTY('InstanceName');
select @@version;

6 – To enable the sa user, you may run the following commands, or use the Management folder of the instance:

ALTER LOGIN sa ENABLE ;
GO
ALTER LOGIN sa WITH PASSWORD = '<enterStrongPasswordHere>' ;
GO
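Note that the sa login only works when the server allows SQL Server authentication (Mixed Mode). A quick check:

-- returns 1 if the server accepts Windows authentication only,
-- 0 if Mixed Mode (SQL + Windows authentication) is enabled
SELECT SERVERPROPERTY('IsIntegratedSecurityOnly');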

  6.1 – In case you need to use Windows integrated authentication, add the sqljdbc_auth.dll file, found in the sqljdbc_4.1\enu\auth\x64 folder, to the data-integration\lib folder.

7 – Connecting to SQL Server Configuration Manager for enabling TCP connections

If working locally, you should make sure the database is reachable.

Notice

You may need to ask the customer to make sure the database is accessible and suggest the following steps.

Open port 49170 on the SQL Server. For that, open SQL Server Configuration Manager. For version 2016, the file is

C:\Windows\SysWOW64\SQLServerManager13.msc

For 2012, the file is

C:\Windows\SysWOW64\SQLServerManager11.msc

Using the tree on the left side, go to SQL Server Network Configuration -> Protocols for <Your instance>, e.g. SQLEXPRESS

– Go to TCP/IP and change its status to Enabled.

– Go to the 2nd tab, IP Addresses, and scroll to the LAST entry – IPAll
– Clear the TCP Dynamic Ports field (a 0 there means dynamic ports, which we want to avoid)
– Put 49170 in the TCP Port field

Save by clicking OK

Restart the SQL Server:
– Go to SQL Server Services, select the active instance, right-click and select Restart.
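Alternatively, restart from an elevated command prompt (the service name below assumes a named instance called SQLEXPRESS):

net stop "MSSQL$SQLEXPRESS"
net start "MSSQL$SQLEXPRESS"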

More in:

https://docs.microsoft.com/en-us/sql/relational-databases/sql-server-configuration-manager?view=sql-server-2017

8 – Make sure the firewall allows this port. In Windows 10 or SQL Server 2012/2016:

If using Windows Firewall:

Access the Control Panel: Windows key -> Right-click -> Run, type control <enter>
In the Search Control Panel box, type "Firewall"
Click on Windows Firewall. Click on Advanced settings on the left side.
Click Inbound Rules on the left side. Right-click New Rule on the right side. Next.
Choose the Port radio button. Next.
Choose TCP and type 49170 in the Specific local ports textbox. Next.
Choose the Allow the connection radio button. Next.
Check the 3 checkboxes. Next.
Give it the name 49170 SQL Server. Click Finish.
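If you prefer, the same rule can be created in one line from an elevated command prompt:

netsh advfirewall firewall add rule name="49170 SQL Server" dir=in action=allow protocol=TCP localport=49170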

The new rule should now appear in the Inbound Rules list.

9 – Starting Spoon.bat: use the native driver with the user sa

Create a desktop icon for Spoon.

Run it and create a new connection, following the steps 1 to 6.

In this case:

Host name: LAPTOP-DEV01
Database name: crm_78_stage
Instance name: SQLEXPRESS
User name: sa
Password: password
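Spoon builds the JDBC URL from these fields; for reference, the equivalent raw URL with the static port from step 7 would look like this (host and database are the example values above):

jdbc:sqlserver://LAPTOP-DEV01:49170;databaseName=crm_78_stage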

If you have questions about updating the firewall or the MS SQL Server TCP/IP settings, visit
https://www.youtube.com/watch?v=5UkHYNwUtCo

10 – Check the size of the job

For each database engine, there is a way to check the total number of rows in each table. We could check the size of each table as well.

In MS SQL Server, prefer a lightweight query that reads the row counts from the system metadata instead of scanning every table with COUNT(*).
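A sketch of such a query (it sums the partition row counts, which the engine maintains, so it returns quickly even for large tables):

SELECT t.name AS TableName, SUM(p.rows) AS TotalRows
FROM sys.tables t
JOIN sys.partitions p
  ON p.object_id = t.object_id
 AND p.index_id IN (0, 1) -- 0 = heap, 1 = clustered index
GROUP BY t.name
ORDER BY TotalRows DESC;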

For MySQL (note that for InnoDB tables, table_rows is only an estimate):

SELECT table_name AS Name, table_rows AS Rows
FROM INFORMATION_SCHEMA.TABLES
WHERE TABLE_SCHEMA = 'schema';

11 – Organizing the job

Inside the Pentaho folder, or on another suitable disk/folder, create two folders for saving the work done:

e.g.

d:\pentaho\pdi-rawdata

d:\pentaho\pdi-transformation

Also create the Kettle files folder

d:\pentaho\pdi-kettle

The last folder should be included in a Bitbucket repository, under DevOps projects

12 – Creating a new transformation

A transformation is needed to import the data from the databases.

File->New->Transformation

13 – A simple transformation process is composed of the following steps, named by convention:

13.1 – <module>_rawdata

Module, in this case, is the SugarCRM or SalesForce module where the data is going.

This transformation is there simply to get the data from the data source; it should also transform the field names from the original data source to the Sugar field or SalesForce mapping.

That means that the data source could be an XLSX table, a CSV file or a SQL dataset, from any DB engine.

This is the typical appearance of this step when connecting to a database.

13.2 – <module>

This step gets the data, prepares the JSON file, connects to SugarCRM or Salesforce to get the credentials, and creates the JSON object for injecting it into the CRM via API request.

This step starts from the CSV generated in the step <module>_rawdata.

13.3 – Other transformations

Other transformations may be required, and they can usually be provided in the same way. In the next example, we just transformed the input database into an XLSX file.

Their name should be <module>_<transformation>. In the following case, it would be articles_xlsx.

14 – Preparing the rawdata step

Now you should create the <module>_rawdata transformation. If data is coming from a database, connect the database already tested in the steps before.

Notice that we take the opportunity to also transform the names of the variables. This way, we save at least one step.

For testing, try to limit the query in the input component, as in the sketch below.
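A sketch of such a test query, renaming columns to Sugar field names at the same time (table and column names are made-up examples, not from the original project):

-- hypothetical SAP-side table, columns aliased to Sugar field names
SELECT TOP 100
    KUNNR AS sap_account_id,
    NAME1 AS name,
    ORT01 AS billing_address_city
FROM dbo.sap_customers;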

15 – Press preview. You should see a sample of the rows you want to send.

16 – The step "Rename the variables" will not be used in this case.

17 – In the Output CSV, make sure you point the file to a valid and writable path. Only the options Add filenames to result and Create parent folder should be checked.

17.1 – Use the following settings: Separator: comma; Enclosure: "; Force the enclosure around fields: true; Disable the enclosure fix: true; Header: true; Format: CR+LF terminated (Windows, DOS); Compression: None; Encoding: UTF-8.

17.2 – IMPORTANT: in the Fields tab, click Minimal width.

18 – Save and test the transformation

19 – Working with transformations

tbd

Some JSON sources and targets, when converted, may end up in JavaScript's internal UTF-16 encoding. To handle that, keep in mind using, in the Script step when necessary, the classic escape/encodeURIComponent round-trip:

// Encode to UTF-8 bytes before sending (on the client)
var encodedObject = unescape(encodeURIComponent(JSON.stringify(myObject)));
// Decode back from UTF-8 after receiving (on the server)
var objectReceived = JSON.parse(decodeURIComponent(escape(messageArgs[0])));

SugarCRM – Generic transformation

This is the look and feel of a typical SugarCRM transformation without any custom endpoint. If you need to keep track of the "Assigned to / Assigné à" field, you need to take a look at:

File: Accounts.ktr

There you will notice a second path where the system takes the assigned full name and does the necessary transformation through a custom endpoint, which can be found in

Bitbucket Repository: zitta.sugarondemand.com

File: rest_example.php

It should go under custom\clients\base\api\

The Kettle (Spoon) files expect parameters, read in the "Get Variables" step. The jobs use them to tell the transformations whether the data is for inserting new items (POST) or updating existing ones (PUT).

These are the objects you need to update:

1 – Each module will have a different CSV file. Therefore, it is necessary to update its name;

2 – Get Variables is automatically set by the JOB. If you want to test, the default value is already set as POST

3 – The Script has the endpoint that directs where you want to put the data.

4 – SugarCRM REST: it has the variables for the OAUTH and BULK URLs. You will want to update them according to the instance you are working with.
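For reference, on a stock Sugar instance those v10 REST endpoints typically look like this (the host name is an example):

https://yourinstance.sugarondemand.com/rest/v10/oauth2/token
https://yourinstance.sugarondemand.com/rest/v10/bulk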

Working with jobs – Automating process

Working with jobs is not different from working with transformations.

A typical view of a sequenced job is like the following.

The start content should be like this when activating the task via CRON / Scheduler:

Passing parameters to the existing transformations should be done as follows when using the transformations mentioned above.

Scheduling JOBS

For setting up our jobs, though, we do not use the scheduling built into the Start step. We call them from the Linux or Windows scheduler.

This allows us to run the jobs on a given schedule.

https://wiki.pentaho.com/display/EAI/Kitchen+User+Documentation#KitchenUserDocumentation-Scheduling

A batch file (or .sh when using Linux) should be created to trigger the command line. Here is an example. You need to run Kitchen from its working folder.

run_quotes.bat:

cd c:\data-integration
c:\data-integration\kitchen.bat /file:c:\SUGAR\QuotesJob.kjb /rep:"LocalRepo" /level:Basic > c:\SUGAR\QuotesJob_log.log
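A Linux equivalent, as a sketch (the paths are assumptions; adjust to your installation), plus a crontab line to run it daily at 02:00:

run_quotes.sh:

#!/bin/sh
cd /opt/data-integration
./kitchen.sh -file=/opt/sugar/QuotesJob.kjb -level=Basic > /opt/sugar/QuotesJob_log.log 2>&1

crontab entry: 0 2 * * * /opt/sugar/run_quotes.sh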

A typical usage can be viewed under the Windows Task Scheduler.

When creating a new task, try using Create Task instead of Create Basic Task.
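The task can also be registered from an elevated command prompt; a sketch (task name and schedule are examples):

schtasks /create /tn "Sugar Quotes Job" /tr "c:\SUGAR\run_quotes.bat" /sc daily /st 02:00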