Child care deduplication use case

This training module is an introduction to data deduplication implemented using Talend Studio.

In this exercise, you create four Big Data Batch Jobs that run on Spark and use the matching components.

The first Job cleans up the data and splits it into unique rows, exact duplicates, and suspect pairs. The second Job creates a matching model from a sample of the suspect pairs that has been manually labeled as matching or not matching. The third Job applies the matching model to the suspect pairs identified in the first Job to find the duplicates among them. Finally, the last Job merges each group of duplicates into a single record based on survivorship rules.
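The idea behind the first and last Jobs can be sketched in plain Python. This is only an illustration under stated assumptions, not how the Talend components work internally: the function names `normalize`, `split_records`, and `survive` are hypothetical, a fixed string-similarity threshold stands in for the matching model built in the second Job, and "keep the longest value" is an invented survivorship rule.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations


def normalize(record):
    """Clean-up step: trim and lowercase every field of a record."""
    return tuple(field.strip().lower() for field in record)


def split_records(records, threshold=0.8):
    """Split records into unique rows, exact duplicates, and suspect pairs.

    Assumption: a similarity threshold replaces the trained matching model.
    """
    groups = defaultdict(list)
    for record in records:
        groups[normalize(record)].append(record)

    # Records that are identical after clean-up are exact duplicates.
    exact_duplicates = [rows for rows in groups.values() if len(rows) > 1]

    # Distinct cleaned records that look alike become suspect pairs.
    distinct = list(groups)
    suspect_pairs = []
    paired = set()
    for a, b in combinations(distinct, 2):
        score = SequenceMatcher(None, " ".join(a), " ".join(b)).ratio()
        if score >= threshold:
            suspect_pairs.append((a, b))
            paired.update((a, b))

    # Everything else occurs once and resembles no other record.
    unique_rows = [
        key for key in distinct
        if key not in paired and len(groups[key]) == 1
    ]
    return unique_rows, exact_duplicates, suspect_pairs


def survive(group):
    """Hypothetical survivorship rule: keep the longest value per field."""
    return tuple(max(values, key=len) for values in zip(*group))


records = [
    ("Anna Smith", "12 Oak Street"),
    ("anna smith ", "12 Oak Street"),
    ("Ana Smith", "12 Oak St"),
    ("Bob Jones", "4 Elm Road"),
]
unique_rows, exact_duplicates, suspect_pairs = split_records(records)
merged = [survive(pair) for pair in suspect_pairs]
```

In the training module, the decision whether a suspect pair is a true duplicate comes from the matching model rather than from a fixed threshold, and the survivorship rules are configured in the last Job.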

Catalog	Big Data
Languages	EN
Format	Presentation, hands-on practices
Roles	Integration developer
Badge track
Learning plan	Talend Big Data Machine Learning
Hands-on tasks	Use the tMatchPairing component to identify suspect pairs of data
	Use the tMatchModel component to create the matching model based on labeled suspect pairs
	Use the tMatchPredict component to apply the matching model to a new set of suspect pairs
	Use the tRuleSurvivorship component to merge groups of suspect duplicates