Child care deduplication use case

 

Subscription only: This content is available for Talend Academy subscription users only.

 

This training module is an introduction to data deduplication implemented using Talend Studio.

In this exercise, you create multiple Jobs. You leverage matching components in Big Data batch Jobs running on Spark.

The first Job cleans up the data and splits it into unique rows, exact duplicates, and suspect pairs. The second Job creates a matching model from a sample of the suspect pairs that has been manually labeled as matching or not matching. The third Job uses this matching model to identify duplicates among the suspect pairs found by the first Job. Finally, the last Job merges each group of duplicates based on survivorship rules.
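To make the first step more concrete, here is a minimal Python sketch of splitting records into unique rows, exact duplicates, and suspect pairs. It is not how Talend Studio or the tMatchPairing component works internally. The function name `split_records`, the `threshold` parameter, and the use of `difflib` string similarity are all assumptions chosen for illustration only.

```python
from difflib import SequenceMatcher
from itertools import combinations


def split_records(records, threshold=0.8):
    """Split records into unique rows, exact duplicates, and suspect pairs.

    Illustrative only: `threshold` and the similarity measure are assumptions.
    """
    # Clean-up step: lowercase and collapse repeated whitespace.
    cleaned = [" ".join(r.lower().split()) for r in records]

    # Rows identical to an earlier row after cleaning are exact duplicates.
    seen = set()
    distinct, exact_duplicates = [], []
    for row in cleaned:
        if row in seen:
            exact_duplicates.append(row)
        else:
            seen.add(row)
            distinct.append(row)

    # Distinct rows that are similar enough become suspect pairs.
    suspect_pairs = []
    paired = set()
    for a, b in combinations(distinct, 2):
        if SequenceMatcher(None, a, b).ratio() >= threshold:
            suspect_pairs.append((a, b))
            paired.update((a, b))

    # Rows that appear in no suspect pair are treated as unique.
    unique = [row for row in distinct if row not in paired]
    return unique, exact_duplicates, suspect_pairs


records = [
    "Sunny Day Care",
    "sunny  day care",
    "Sunny Day Cares",
    "Little Stars Nursery",
]
unique, exact_duplicates, suspect_pairs = split_records(records)
```

Comparing every pair of rows, as this sketch does, only works for small samples. On large data sets, a pairing step would normally group candidate rows first so that only rows within the same group are compared.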

 

Catalog

Big Data

Languages

EN

Format

Presentation, hands-on practices

Roles

Integration developer

Badge track

 

Learning plan

Talend Big Data Machine Learning

Hands-on tasks
  • Use the tMatchPairing component to identify suspect pairs of data 

  • Use the tMatchModel component to create the matching model based on labeled suspect pairs 

  • Use the tMatchPredict component to apply the matching model to a new set of suspect pairs 

  • Use the tRuleSurvivorship component to merge groups of suspect duplicates
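As a rough illustration of what merging by survivorship rules means, the Python sketch below collapses one group of duplicates into a single surviving record. It does not use the tRuleSurvivorship component. The field names (`name`, `phone`, `updated`) and the two rules (keep the longest name, keep the phone number from the most recently updated record that has one) are hypothetical.

```python
from datetime import date


def merge_group(group):
    """Merge one group of duplicate records into a single surviving record.

    Illustrative only: the fields and rules below are assumptions.
    """
    # Hypothetical rule 1: keep the longest non-empty name.
    name = max((r["name"] for r in group if r["name"]), key=len, default="")

    # Hypothetical rule 2: keep the phone number from the most recently
    # updated record that actually has one.
    with_phone = [r for r in group if r["phone"]]
    phone = max(with_phone, key=lambda r: r["updated"])["phone"] if with_phone else None

    # The surviving record carries the latest update date of the group.
    updated = max(r["updated"] for r in group)
    return {"name": name, "phone": phone, "updated": updated}


group = [
    {"name": "Sunny Day Care", "phone": "", "updated": date(2020, 1, 1)},
    {"name": "Sunny Day Care Center", "phone": "555-0100", "updated": date(2019, 5, 1)},
    {"name": "Sunny", "phone": "555-0199", "updated": date(2021, 3, 1)},
]
merged = merge_group(group)
```

Each rule selects a value per field independently, so the surviving record can combine values from different rows of the group.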