Child care deduplication use case
Subscription only: This content is available to Talend Academy subscription users only.
This training module is an introduction to data deduplication implemented using Talend Studio.
In this exercise, you create multiple Jobs. You leverage matching components in Big Data batch Jobs running on Spark.
The first Job cleans up the data and splits the rows into three sets: unique rows, exact duplicates, and suspect pairs. The second Job creates a matching model from a sample of the suspect pairs that has been manually labeled as matching or non-matching. The third Job uses that matching model to identify duplicates among the suspect pairs found by the first Job. Finally, the last Job merges each group of duplicates based on survivorship rules.
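To make the flow of the Jobs more concrete, here is a minimal sketch in plain Python. It is not Talend code and does not use the Talend matching components or Spark. The sample records, the function names (`normalize`, `split_records`, `survive`), the 0.8 similarity threshold, and the "keep the longest value" survivorship rule are all assumptions made for illustration. In particular, a fixed string-similarity threshold stands in for the matching model that the module trains from manually labeled pairs.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations


def normalize(record):
    # Cleaning step (assumed): trim whitespace and lowercase every field.
    return tuple(field.strip().lower() for field in record)


def split_records(records, threshold=0.8):
    """Split records into unique rows, exact duplicates, and suspect pairs."""
    # Rows that are identical after cleaning fall into the same group.
    groups = defaultdict(list)
    for record in records:
        groups[normalize(record)].append(record)

    exact_duplicates = [rows for rows in groups.values() if len(rows) > 1]
    distinct = list(groups)  # one cleaned row per group, in input order

    # Compare every pair of distinct cleaned rows. The threshold is a
    # stand-in for a trained matching model, not the module's method.
    suspect_pairs = []
    suspected = set()
    for a, b in combinations(distinct, 2):
        score = SequenceMatcher(None, " ".join(a), " ".join(b)).ratio()
        if score >= threshold:
            suspect_pairs.append((a, b, round(score, 2)))
            suspected.update((a, b))

    # A row is unique if it has no exact copy and no near match.
    unique_rows = [
        key for key in distinct
        if key not in suspected and len(groups[key]) == 1
    ]
    return unique_rows, exact_duplicates, suspect_pairs


def survive(group):
    # Hypothetical survivorship rule: keep the longest value of each field.
    return tuple(max(values, key=len) for values in zip(*group))


# Invented sample data: (provider name, address).
records = [
    ("Sunny Days Child Care", "12 Oak St"),
    ("sunny days child care ", "12 Oak St"),
    ("Sunny Day Childcare", "12 Oak Street"),
    ("Little Acorns Nursery", "4 Elm Rd"),
]

unique_rows, exact_duplicates, suspect_pairs = split_records(records)
first, second, _score = suspect_pairs[0]
merged = survive([first, second])
```

With this sample, the first two records become an exact-duplicate group after cleaning, the third record is paired with them as a suspect, and the fourth remains unique. A quadratic all-pairs comparison like this only suits small samples; larger data sets would usually need some form of blocking before comparison.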
Languages: EN
Format: Presentation, hands-on practices
Roles: Integration developer