In the Tableau Prep workbook "duplicates from union.tfl", the "January Orders" is the first report of orders in January. The "January Orders (corrections)" table is a report released later with some new orders, but also corrections on orders that already exist in "January Orders". When the two tables are unioned, all rows from both tables are kept, which creates duplicated measures.
The first flow in "duplicates from union.tfl" demonstrates this issue.
Option 1: Keep all Data from one Table and Non-Matching Data from the Other Table
For this example, we want all of the data from "January Orders (corrections)" and only the orders from "January Orders" that do not exist in "January Orders (corrections)".
- Create a new join between "January Orders" and "January Orders (corrections)"
- In the Join 1 step, do for the following:
- For Applied Join Clauses, create a join clause on [Order ID] = [Order ID]
- For Join Type, click the sections of the Venn diagram so that only the left most segment is filled. Tableau Prep will call this a "Left unmatched only" join.
- Drag the "January Orders (corrections)" over Join 1 and drop the table on New Union
Option 2: Keep Records with the Most Recent Date
For example, Ms. Mouse changed her order for to add another coat, but this change was accidentally recorded in "January Orders". The correct order is the one with the most recent date.
- Create a FULL OUTER join between "January Orders" and "January Orders (corrections)" on [Order ID] = [Order ID]
- Add a Step from Join 2
- Merge all matching fields, except the date fields and the measure values:
- Click on the field whose values should supersede the other in the Profile Pane
- Ctrl+click on the matching field
- Right-click either field and select Merge Fields
- When fields are merged, Tableau Prep does not keep both values (if there are two different values), but rather keeps the value from the field selected first if there is one.
- For example, Ms. Horse's name was accidently recorded as Ms. House in the "January Orders" table and her correct name appears in "January Orders (corrections). When merging [Customer] and [Customer-1] I want to select [Customer-1] first.
- You can verify which field will supersede the other from the name of the merged field. The name be formatted like [first field name & second field name]. The field name listed first will be the field that will supersede the other.
- If the fields were merged in the wrong order, the merge can be undone by right-clicking the merge icon above the name of the merged field in the Profile Pane and selecting Remove.
- Click the Create Calculated Field… button
- In the Calculated Field dialog box that opens, do the following, and then click OK:
- Name the calculated field. In this example, the calculated field is named "Most Recent Date"
- In the formula field, create a calculation similar to the following:
IF [Date] > [Date-1]
THEN [Date]
ELSE IFNULL( [Date-1], [Date] )
END
- The above calculation will return [Date] if it is "larger", or in the case of dates more recent, than [Date-1]. Otherwise the calculation will return [Date-1].
- If either field is NULL, then the first condition will always be false. The IFNULL() function replaces [Date-1] with [Date] when [Date-1] is NULL. This means the calculation will always return a non-NULL value if there is one.
- Remove [Date] and [Date-1]
- Create a calculated field with a name like "New Cost" with a calculation similar to the following:
IF [Date] > [Date-1]
THEN [Cost]
ELSE IFNULL( [Cost-1], [Cost] )
END
- Remove [Cost] and [Cost-1]
- Repeat steps 6-7 for every measure
Option 3: Every Duplicate Should be Treated Differently
- Create a FULL OUTER join between "January Orders" and "January Orders (corrections)" on [Order ID] = [Order ID]
- Add a Step from Join 3
- Merge all fields that either a) have the same values in both tables, or b) the values from one table should always supersede
- Click the Create Calculated Field… button
- In the Calculated Field dialog box that opens, do the following, and then click OK:
- Name the calculated field. In this example, the calculated field is named "New Cost"
- In the formula field, create a calculation similar to the following:
CASE [Order ID-1 & Order ID]
WHEN 5 THEN [Date]
WHEN 9 THEN [Date]
ELSE IFNULL([Date-1],[Date])
END
- The above calculation will read through every value of [Order ID-1 & Order ID] and return the specified date field. To make things simpler, I only specified which IDs should return [Date] and then everything else will return [Date-1], or [Date] if [Date-1] is NULL, due to the ELSE statement.
- The [Order ID-1 & Order ID] value is used in the CASE statement because this field will have a unique value for every row of data.
- This also means that any new IDs added will return [Date-1] when this flow is re-run in the future.
- Remove [Date] and [Date-1]
- Repeat steps 4-6 for all fields where duplicates need to be handled on an individual basis