Anusha Ruwanpathirana

Introduction to sophisticated data integration using Talend (bonus - recommended best practices)

Talend is one of the world’s leading providers of open source data integration and application integration solutions, backed by vendor support. In this article we explore the features Talend offers to data integration teams and why Talend is one of the more powerful data integration tools. As a bonus, we have included some recommended best practices that will help your team achieve data integration successfully.

Introduction to Talend

Talend is a cool product to explore. It is great to have a DIY (do it yourself) tool that does not require extensive coding skills and still delivers a rich user experience. Talend is one of the world’s leading providers of open source data integration and application integration solutions, with vendor support. The vendor was recognised by Gartner as a leader for both Data Integration Tools and Data Quality Tools in 2017.

Talend is a next-generation leader in cloud and big data integration software. Such technology helps companies move to data-driven growth models by giving them better access to data, improved data quality and the ability to mobilise data swiftly wherever it is needed for real-time decision making. Talend’s native, unified integration platform, Data Fabric, enables users to embrace new innovations and scale to meet the evolving data demands of the business.

Talend also invests in major ongoing research and development to maintain and improve its open source products, and provides professional user documentation and training materials.

 

Why Talend

There are many reasons to select Talend as a solution for integration requirements in the industry. The main ones are outlined below:

  • 7x faster ETL tool –

According to Talend CEO Mike Tuchen, “Talend’s solutions, which can be up to seven times faster on big data platforms and have lower total cost of ownership than traditional data integration approaches, have already attracted over 1,300 customers worldwide.” To support this claim, there is a performance comparison of the total execution times of Informatica workflows and Talend jobs for completing all loading, transformations and file generation, constructed using the Transaction Processing Performance Council TPC Benchmark H (TPC-H) Revision 2.17.1 Standard Specification [2].

Furthermore, to measure and validate performance, Talend puts every product release through a rigorous set of performance and scalability tests, including TPC-H, the performance benchmark developed by the Transaction Processing Performance Council. Across the 22 standard TPC-H tests, Talend ran up to 67 percent faster, with an average improvement of 45 percent, across generated MapReduce code [3].

  • Future proof –

Talend uses a ‘Future-proof Architecture’, in which Talend Studio generates inline Java snippets for Talend components. Once a job is built, it can be reused anywhere, whether in the cloud or on any other big data platform.

  • Cost effective –  

Talend’s open source solutions deliver substantial cost savings compared to either labour-intensive custom development or proprietary software. The savings associated with the no-charge Talend Open Studio for Data Integration are obvious, but even with the subscription-based Talend Data Integration, costs are markedly lower than those of proprietary technologies. This was evident in the recent data-driven enterprise use cases Mitra Innovation implemented for a large-scale organisation.

  • Unified Platform –

All Talend products are integrated into one enterprise platform that covers a wide range of needs. Based on their requirements, users can select a suitable set of products with the rich features and functionality that are common in commercial editions.

(Figure 1 – Talend products on offer)

  • Respond faster to business needs –

Talend has a gentler learning curve than most other integration software products. It also enables faster development, and customisation is easy because components can be created using Java.

 

Talend features
  • Native code support – Talend’s visual development environment generates high-performance, highly scalable Hadoop-native code. Jobs are designed in the user-friendly Talend Studio GUI with drag-and-drop components, while the back end generates the native code that is compiled and executed when the job runs. As a result, maximum performance and functionality are achievable.
  • Powerful, easy-to-use features – Talend Open Studio for Data Integration, which can be downloaded and used at no cost, provides all the functionality you need to design and execute a wide range of data integration processes such as data migration (including both ETL and ELT) and data synchronisation. Talend features an Eclipse-based graphical development environment alongside more than 900 components and built-in connectors, a unified metadata repository, automated Java code generation and robust ETL functionality. The subscription-based Talend Data Integration supplements Talend Open Studio for Data Integration with functionality specifically designed for enterprise-scale projects, such as team collaboration tools, industrial-scale deployments and real-time load balancing.
  • Better Collaboration – Talend works effectively with any big data component, connector or database, and supports the custom features of such components (e.g. it supports most of the unique features of Oracle when Talend database components are used).
  • Faster design – Thanks to a redefinable, highly optimised design, Talend quickly delivers an overall ETL design that meets industry requirements while minimising time and cost.
  • Efficient management – Talend offers data profiling capabilities before the ETL stage, with early cleansing and filtering. Monitoring logs and custom workflows are quick and easy ways to understand and investigate where jobs fail. This makes managing jobs and designs less cumbersome.
  • Scalability – Talend makes it easy to scale jobs, and to scale out by adding new servers.
  • Real time statistics – Real-time ETL statistics are displayed in the job, showing how much data has passed through the related components. For example, enterprise versions report job failure and job success in real time.
  • Active community – The community around Talend’s data integration and application integration solutions is highly active. Several community applications are available for sharing questions, advice, and code.
  • Cost effective – Talend is more cost-effective than either ad hoc custom scripts or proprietary software. Talend also allows for rapid adoption and ramp-up and is updated and improved more frequently than other proprietary products. The product suite is also surrounded by active user communities that are a rich source of practical advice and application extensions.
  • Vendor backed facilities – For complex, mission-critical applications such as data integration, open source technology components backed by an established commercial vendor are a particularly promising solution model. Vendor-backed open source middleware offers the usual open source benefits along with the benefits of a dedicated research and development unit, professional documentation, training materials, technical support, consulting and value-added product features.

 

A Talend Job and its components
  • A Job consists of one or more connected components and generates Java code. A component is the executable part of a Job. Talend provides hundreds of ready-to-use components to accomplish specific tasks. Components are connected with rows (links that carry data) or triggers (links that transfer control).
  • Connectors – Types of connections that enable communication between the components.
  • Rows – Row connections deal with the actual data flow. Examples are: Main, Lookup, Filter, Rejects, ErrorRejects, Output, Uniques/Duplicates, Multiple input/output.
  • Iterators – Iterator connections are used to perform a loop on files contained within a directory, on rows contained in a file, or on database entries.
  • Triggers – Trigger connections are used to create dependencies between Jobs or Subjobs, which can be triggered one after the other according to the trigger’s nature. There are two main types of Triggers: Subjob Triggers and Component Triggers. They are as follows:

(Table 1: Subjob trigger versus Component trigger)

 

  • Link – Link connections can only be used with ELT components.
  • Context Variables – Context variables are user-defined parameters used by Talend. They are passed into a job at runtime, and their values can change as the job moves from development to test and production environments.
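
As a rough illustration of how context variables behave, the Java sketch below resolves the same parameter names to different values depending on the environment. It is not Talend-generated code: the class `ContextSketch`, the environment names and the variable names (`db_host`, `batch_size`) are hypothetical and chosen only for this example.

```java
import java.util.Map;
import java.util.Properties;

// Hypothetical sketch: NOT Talend-generated code, only the idea behind context variables.
final class ContextSketch {

    // Assumed per-environment values; the names and hosts are made up for illustration.
    private static final Map<String, Map<String, String>> ENVIRONMENTS = Map.of(
        "dev",  Map.of("db_host", "localhost",       "batch_size", "100"),
        "test", Map.of("db_host", "test-db.example", "batch_size", "1000"),
        "prod", Map.of("db_host", "prod-db.example", "batch_size", "10000"));

    // Builds the context for one environment; the job logic itself never changes.
    static Properties loadContext(String environment) {
        Map<String, String> values = ENVIRONMENTS.get(environment);
        if (values == null) {
            throw new IllegalArgumentException("Unknown environment: " + environment);
        }
        Properties context = new Properties();
        context.putAll(values);
        return context;
    }

    // Stand-in for a job step that reads its settings from the context.
    static String describeJob(Properties context) {
        return "Loading from " + context.getProperty("db_host")
            + " in batches of " + context.getProperty("batch_size");
    }

    public static void main(String[] args) {
        // The environment is chosen at launch time, defaulting to "dev".
        String environment = args.length > 0 ? args[0] : "dev";
        System.out.println(describeJob(loadContext(environment)));
    }
}
```

Run without arguments, the sketch uses the `dev` values. Passing `prod` as the first argument switches every setting at once without touching the job logic, which is the property that makes context variables useful when promoting a job between environments.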

 

Optimising a Talend Job and Best Practices
  1. Activate batch mode for output components and activate fetch size for input components.
  2. Enable parallelisation – The Studio offers options to increase Job performance through parallel processing. These include enabling multithreading Job-wide and using the tParallelize component to orchestrate subJobs as well as components.
  3. Chunk data before processing large volumes – Performance is usually determined by careful job design. Therefore, don’t try to process all available data at once when large volumes are involved (e.g. an XML or delimited file with gigabytes of data). It is recommended to slice the data into manageable chunks and process them in iterations, which improves job performance. If no suitable Talend component is available, implement your own logic to split the data.
  4. Remove unnecessary columns and rows using tFilterColumns and tFilterRow before mapping when processing large volumes of data, to avoid carrying garbage data through the job flow.
  5. Avoid complex SQL joins and statements – Try to join data in a tMap within the job instead.
  6. Assign more memory to specific Jobs where optimisation is needed.
  7. Minimise custom code and use Talend’s built-in components – The XML and JSON components in Talend are powerful. Widely used components include tFileInputDelimited, tFileOutputDelimited and tMap.
  8. Index key columns in databases where read operations are frequent.
  9. Design reusable components using joblets or reference projects (which hold the common, predefined functionality of a job) to speed up development.
  10. Split the job into suitable subjobs by dividing functionality appropriately. This improves readability and supports proper error handling.
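
Practice 3 above (processing large inputs in chunks) can be sketched in plain Java as follows. This is a minimal illustration under stated assumptions, not a Talend component: the class `ChunkSketch`, its method names and the chunk size are hypothetical, and the handler here only records chunk sizes where a real job would transform and write each chunk.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical sketch of chunked processing; not part of any Talend API.
final class ChunkSketch {

    // Reads lines and passes them to the handler in groups of at most chunkSize,
    // so only one chunk is held in memory at a time. Returns the number of chunks.
    static int processInChunks(BufferedReader reader, int chunkSize,
                               Consumer<List<String>> handler) throws IOException {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        List<String> chunk = new ArrayList<>(chunkSize);
        int chunks = 0;
        String line;
        while ((line = reader.readLine()) != null) {
            chunk.add(line);
            if (chunk.size() == chunkSize) {
                handler.accept(List.copyOf(chunk)); // hand over a full chunk
                chunk.clear();
                chunks++;
            }
        }
        if (!chunk.isEmpty()) {
            handler.accept(List.copyOf(chunk));     // hand over the final partial chunk
            chunks++;
        }
        return chunks;
    }

    // Convenience wrapper for in-memory text: returns the size of each chunk.
    static List<Integer> chunkSizesFor(String text, int chunkSize) {
        List<Integer> sizes = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(new StringReader(text))) {
            processInChunks(reader, chunkSize, chunk -> sizes.add(chunk.size()));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sizes;
    }

    public static void main(String[] args) {
        // Five lines split into chunks of two: prints [2, 2, 1]
        System.out.println(chunkSizesFor("id,name\n1,a\n2,b\n3,c\n4,d\n", 2));
    }
}
```

For a real multi-gigabyte file, the same `processInChunks` method could be given a `BufferedReader` over the file, so memory use is bounded by the chunk size rather than by the file size.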

Anusha Ruwanpathirana

Associate Architect | Mitra Innovation