KNOWLEDGE BASE

Removing Duplicate Data in Tableau Prep


Published: 11 May 2018
Last Modified Date: 31 Oct 2018

Question

How to remove duplicate data in Tableau Prep. 

Duplicate data can occur in data for different reasons, and the best way to resolve the duplicates will vary. Below are the scenarios covered in this article:
  • When joining two tables that have a 1:many or a many:many relationship, then the measures values of the output are duplicated.
  • When unioning two or more tables, some records may be repeated between tables and thus are duplicated in the output.

Environment

Tableau Prep

Answer

CLICK TO EXPAND STEPS
Scenario 1: Duplicated Measures from a Join
In the Tableau Prep workbook "duplicates from join.tfl", the "January Inventory" table lists each product once with its total inventory. The "January Orders" table may have many rows for a product if multiple orders were for the same product. After joining the "Sales" and "Inventory" table, the measure [Inventory] has been duplicated. The first flow demonstrates the issue.

Option 1: Use a Union instead of a Join

Note: None of the values of [Inventory] will be directly related with any of the data from "January Orders". This may cause issues when creating calculations without aggregations, or in views that show a higher level of granularity then the field(s) from the join clause, which would be [Product] in this example

Option 2: Create a Calculation to Evenly Distribute Measure Values

We can use an aggregate step to find the number of orders per product and then divide inventory by the number of orders. Then final result will be that an equal portion of the inventory will be represented for each order. For example, there are 2 orders for hats. [Inventory] will be 100 for each hat order. [New Inventory] will be 50 for each order.

Please note: this may cause inaccurate inventory information in a view if only some orders for a product are filtered out.
  1. Create an INNER join "January Inventory" and "January Orders" on [Product] = [Product]
  2. Add Aggregate from Join 2
  3. In the Aggregate 1 step, do the following:
    1. Drag either [Product] to Grouped Fields
    2. Drag [Order ID] to Aggregated Fields
    3. Change the aggregation for Order ID to Count Distinct
    4. Rename Order ID to something like Orders per Product
      • Instead of [Order ID], use the field from the other table that has a unique value for every row.
      • If there is not one unique identifier, then before creating the aggregate step, add a step to create a calculated field that concatenates the dimensions required to have a unique value for every row.
      • For example, STR([Order ID])+STR([Date])
  4. Drag Join 2 over Aggregate 1 and drop Join 2 onto New Join
  5. In the Join 3 step, create a LEFT join on [Product] = [Product]
  6. Add a Step from Join 3
  7. In the Clean 1 step, click the Create Calculated Field… button
  8. In the Calculated Field dialog box that opens, do the following, and then click OK:
    1. Name the calculated field. In this example, the calculated field is named "New Inventory"
    2. In the formula field, create a calculation similar to the following:
      [Inventory]/[Orders per Product]

Option 3: Aggregate data so that the tables have a 1:1 relationship

  1. Add an Aggregate step from January Orders 4
  2. In the Aggregate 2 step, do the following:
    1. Drag [Product] to the Grouped Fields pane
    2. Drag [Sales], [Quantity], [Order ID] and [Customer] to the Aggregated Fields pane
    3. Click CNT on the [Customer] field select Count Distinct
    4. Click SUM on the [Order ID] field select Count Distinct
  3. Drag Aggregate 2 over January Inventory 4 to create an INNER join on Product = Product
NOTE: The [Product] Field was dragged to the Grouped Fields pane because that is the only field used to join on. The final table will only have one row per product, which means details on each order will be lost.



Option 4: Use a FIXED expression in Tableau Desktop

For detailed directions, see Removing Duplicate Data with LOD Calculations.

The advantage of using a FIXED expression in Tableau Desktop is that level of detail (LOD) expressions have special aggregation rules. Therefore, a de-duplicated inventory field using FIXED could show the full, unduplicated inventory for a product whether there was only one order from that product in the view, or multiple.
 
CLICK TO EXPAND STEPS
Scenario 2: Duplicated Data after a Union of Data Sources with the Same Data
In the Tableau Prep workbook "duplicates from union.tfl", the "January Orders" is the first report of orders in January. The "January Orders (corrections)" table is a report released later with some new orders, but also corrections on orders that already exist in "January Orders". When the two tables are unioned, all rows from both tables are kept, which creates duplicated measures.

The first flow in "duplicates from union.tfl" demonstrates this issue.

Option 1: Keep all Data from one Table and Non-Matching Data from the Other Table

For this example, we want all of the data from "January Orders (corrections)" and only the orders from "January Orders" that do not exist in "January Orders (corrections)".
  1. Create a new join between "January Orders" and "January Orders (corrections)"
  2. In the Join 1 step, do for the following:
    1. For Applied Join Clauses, create a join clause on [Order ID] = [Order ID]
    2. For Join Type, click the sections of the Venn diagram so that only the left most segment is filled. Tableau Prep will call this a "Left unmatched only" join.
  3. Drag the "January Orders (corrections)" over Join 1 and drop the table on New Union

Option 2: Keep Records with the Most Recent Date

For example, Ms. Mouse changed her order for to add another coat, but this change was accidentally recorded in "January Orders". The correct order is the one with the most recent date.
  1. Create a FULL OUTER join between "January Orders" and "January Orders (corrections)" on [Order ID] = [Order ID]
  2. Add a Step from Join 2
  3. Merge all matching fields, except the date fields and the measure values:
    1. Click on the field whose values should supersede the other in the Profile Pane
    2. Ctrl+click on the matching field
    3. Right-click either field and select Merge Fields
      • When fields are merged, Tableau Prep does not keep both values (if there are two different values), but rather keeps the value from the field selected first if there is one.
      • For example, Ms. Horse's name was accidently recorded as Ms. House in the "January Orders" table and her correct name appears in "January Orders (corrections). When merging [Customer] and [Customer-1] I want to select [Customer-1] first.
      • You can verify which field will supersede the other from the name of the merged field. The name be formatted like [first field name & second field name]. The field name listed first will be the field that will supersede the other.
      • If the fields were merged in the wrong order, the merge can be undone by right-clicking the merge icon above the name of the merged field in the Profile Pane and selecting Remove.
    4. Click the Create Calculated Field… button
    5. In the Calculated Field dialog box that opens, do the following, and then click OK:
      1. Name the calculated field. In this example, the calculated field is named "Most Recent Date"
      2. In the formula field, create a calculation similar to the following:
        IF [Date] > [Date-1]
        THEN [Date]
        ELSE IFNULL( [Date-1], [Date] )
        END
      3. The above calculation will return [Date] if it is "larger", or in the case of dates more recent, than [Date-1]. Otherwise the calculation will return [Date-1].
      4. If either field is NULL, then the first condition will always be false. The IFNULL() function replaces [Date-1] with [Date] when [Date-1] is NULL. This means the calculation will always return a non-NULL value if there is one.
      5. Remove [Date] and [Date-1]
    6. Create a calculated field with a name like "New Cost" with a calculation similar to the following:
      IF [Date] > [Date-1]
      THEN [Cost]
      ELSE IFNULL( [Cost-1], [Cost] )
      END
      
    7. Remove [Cost] and [Cost-1]
    8. Repeat steps 6-7 for every measure

Option 3: Every Duplicate Should be Treated Differently

  1. Create a FULL OUTER join between "January Orders" and "January Orders (corrections)" on [Order ID] = [Order ID]
  2. Add a Step from Join 3
  3. Merge all fields that either a) have the same values in both tables, or b) the values from one table should always supersede
  4. Click the Create Calculated Field… button
  5. In the Calculated Field dialog box that opens, do the following, and then click OK:
    1. Name the calculated field. In this example, the calculated field is named "New Cost"
    2. In the formula field, create a calculation similar to the following:
      CASE [Order ID-1 & Order ID]
      WHEN 5 THEN [Date]
      WHEN 9 THEN [Date]
      ELSE IFNULL([Date-1],[Date])
      END
      
    • The above calculation will read through every value of [Order ID-1 & Order ID] and return the specified date field. To make things simpler, I only specified which IDs should return [Date] and then everything else will return [Date-1], or [Date] if [Date-1] is NULL, due to the ELSE statement.
    • The [Order ID-1 & Order ID] value is used in the CASE statement because this field will have a unique value for every row of data.
    • This also means that any new IDs added will return [Date-1] when this flow is re-run in the future.
  6. Remove [Date] and [Date-1]
  7. Repeat steps 4-6 for all fields where duplicates need to be handled on an individual basis
Did this article resolve the issue?