Merging Datasets

Merging Datasets

Previous Session Summary

In the previous lesson, we covered the unpivoting function which converts multiple columns of values into two columns of multiple rows, the same amount of rows as the original columns. One of these columns consists of the header titles of the original multiple columns while the second is of the original values. Pivoting of course does the reverse, taking a column of multiple values and a column of values and then converts it in multiple columns, a column for each of the unique values in the original column.

Lesson Synopsis

In this lesson we start looking at the extremely useful merge function. This is quite an extensive topic and we take a bit of time over this part of the course as it acts in place of a very useful Excel function – the VLOOKUP.
 
So what does merging do?
 
Merging takes two datasets and, using a column of data that is present in both datasets (a Key) joins the two datasets together. In doing so it attributes the correct rows for one dataset to the correct rows of the other dataset based on the values in the key column.
 
This is rather like the VLOOKUP, INDEXMATCH or XLOOKUP which are all available in Excel, but there are some significant advantages to using the Merge function, and one significant difference. Read on…

Resource Listing

The Video

Play Video

Lesson Notes

As noted above, merging is quite a long topic so buckle up…

What is merging?

Merging is the concept of bringing together two datasets to create a single dataset which has the information of both datasets in it. To bring the two datasets together and make sure that the right data is matched onto the correct row, we need to identify a ‘key’ column in each dataset. This Key column is a piece of information that is present in each dataset, perhaps a unique product identifier like a SKU number, or a store identifier. We nominate this ‘key’ column in each dataset. Then when we run the merge function,PowerQuery then does the following:

a)    goes to the key column in one dataset and finds the value in each row.

b)   Goes to the key column in the second dataset and looks for the value it found in the first dataset.

c)   When it finds that value in the second dataset, it grabs ALL the information in that same row and puts it into the first
dataset

In standard Excel we do this kind of thing quite a lot and we use lookup functions such as VLOOKUPS, INDEXMATCHES or XLOOKUPS. The concept is very similar in PowerQuery; you find a piece of data that is present in both
datasets and then use that as a search criteria in both datasets to match up data from each line.

However there are some key advantages to using the Merge function and one big difference. Read on for more details!

The advantages to using the Merge

Aside from the obvious advantage: that you only need to perform the Merge function once so that PowerQuery knows what to do, and remembers it for ever after, there are several other handy advantages;

a)   If you use a VLOOKUP or INDEXMATCH, you have to bring all the columns in your second dataset (your ‘Lookup’ dataset) across one by one. With the Merge function, you have the option to bring all columns in your second dataset across at once.

b)  If you use a VLOOKUP (although not an INDEXMATCH or XLOOKUP) then the column you’re wanting to return has to be to the right of the column you’re searching in (your Key column). This means that sometimes you need to shift your datasets about to get the columns in the right place. With the Merge function in PowerQuery, your Key column can be anywhere in your dataset.

c)  There are many ‘Join Types’ which dictate how the information is brought together (we’ll go into these later) . This means that you’re not prohibited to the ‘bring information from lookup dataset into original dataset’ model. Other join types allow you to:
     take information from your original dataset and put it into your lookup dataset,
–     
Identify only information which is not in both datasets
–      Identify only information that is in a specified dataset but not in the other dataset

The big difference to using the Merge

There’s also a VERY big difference. Let me explain.

When you use a VLOOKUP or an INDEXMATCH, Excel will only return from the Lookup dataset the piece of information from the first row it finds with a matching value in the Key column. It ignores any other rows with that value. Say the value the VLOOKUP was looking for was the number 1 and number 1 was in the Key column in the Lookup dataset 4 times. The VLOOKUP would find the first row with number 1 and then the necessary piece of information in that row. The second, third and fourth rows where number 1 was present in the Key column would be ignored.

The Merge function in PowerQuery doesn’t work like that. The Merge function finds EVERY row where there is a matching value in the key column in the lookup dataset and returns all of them. In order to make them all fit, PowerQuery creates duplicate rows in the original dataset. In our above VLOOKUP example, if the Merge function were looking in the lookup dataset for the value 1 in the Key column and the number 1 was present 4 times, then the Merge function would return all four rows with the number 1 to the original dataset. To fit in all four rows, the Merge function creates 3 duplicate rows
in the original dataset.

In some scenarios, this is a HUGE advantage and I have used it extensively to split a single row of cost between multiple cost centres. However if you’re not aware of it then it can cause problems.  Let’s say for example you have a Trial Breadown in your original dataset which includes values that you later want to SUM toreconcile back to your TB. By creating duplicate lines, you’ll create duplicate values and your report will no longer tie back to your TB.


How do we use the Merge?

Great so the Merge function is useful. How do we use it?

Well let us assume that we have two tables that we wish to match together.

  1. Identify your Key column in both datasets – the piece of information present in both datasets

Assuming you’re simply wanting to do a simple VLOOKUP job (grab information from lookup dataset and put it into the
original dataset), then the Key column in the Lookup dataset shouldn’t have duplicate values or you’ll create duplicate lines and won’t know which is the correct row. If you take our example in the video, there are lots of columns you could use that are present in both the Arrivals Data and the Rates Card datasets. These include the Carrier, Port Code and Container Type columns. However none of these columns holds unique values. For example, the Container Type column only has 4 values but hundreds of rows, so your lookup wouldn’t be very effective.

You can instead create a unique Key column creating a new column that concatenates together values from a number of different columns. We demonstrate this in the video.

There’s also a second way of matching together two datasets based on multiple columns and we go through that in section 3.

2. Bring up the Merge function:

Go to the ‘Home’ tab and then to the ‘Combine’ section of the ribbon. Select the Merge Queries drop down and choose either:

a)   ‘Merge Queries’ which gives you the option to merge a dataset into the currently selected query in the Queries Pane.

b)   ‘Merge Queries as New’ which creates a brand new query that joins two datasets together, but leaves those original queries
untouched

3. Select the queries and columns you want to use to merge the two datasets

PowerQuery brings up a Merge menu and then asks us to select the tables that we want to merge and the columns that we want to use as a ‘key’. Assuming again that we’re performing a simple ‘Lookup’ operation, grabbing the
information from a Lookup dataset and plonking it in an Original dataset then we proceed as follows:

a)   In the first dropdown, select the query that will act as the original dataset or primary dataset. This is the dataset that we would like to keep all the original information.

b)  Highlight the column that will act as a key. If there are multiple columns that, joined together, act as a key, then select all
of them holding down ‘Ctrl’ while selecting them. You’ll observe a little number in each of the selected columns’ headers indicating the order in which they were selected.

c)   In the second dropdown, select the query that will act as the Lookup dataset or Secondary dataset. This is the dataset from
which we would like to draw information and don’t need all of the original information.

d)   Highlight the column that will act as a key. If there are multiple columns that, joined together, act as a key, then select all
of them holding down ‘Ctrl’ while selecting them. You’ll observe a little number in each of the selected columns’ headers indicating the order in which they were selected. Ensure that the order is the same as in the Primary dataset.

4. Select your ‘Join Kind’

Your Join Kind is the manner in which you would like to join your datasets together. Here’s a description of the available options:

a)   Left Outer – the classic ‘Lookup’ operation, grabbing the information from a Lookup dataset and plonking it in an Original
dataset based on the information in the key column. This operation will keep all the rows in the original dataset and only the matching rows in the Lookup dataset.

b)  Right Outer – does the reverse of Left Outer and grabs matching information from an Original dataset and puts it in the Lookup dataset. This operation will keep all the rows in the Lookup dataset and only the matching rows in the Original dataset.

c)   Full Outer – Keeps all data from both datasets, matching them where appropriate. All rows from both datasets are maintained

d)  Inner – only retains rows from both datasets that match are retained

e)   Left Anti – Retains rows that are only present in the Primary or Original Dataset

f)  Right Anti – Retains rows that are only present
in the Secondary or Lookup dataset

5. Select OK

Self explanatory!

6. Select the columns for inclusion in the merge

Once you’ve selected Ok then PowerQuery will set about using the key column to match rows together. However, you’ll still need to select the columns that you want brought across. To do this:

a)  Scroll to the right of your Merged Query (this will be called ‘Merge1’ in your Query Pane)

b) Here there will be a new column right at the end of the dataset that has the word ‘Table’ in each row.

c)  Click on the button in the top right hand corner of the Header which has two arrows pointing in different directions.

d)  This will result in a list of columns in the secondary dataset that you can bring in. Simply select the columns that you’d
like to bring in.

Now you can simply delight in your ability to perform the Merge function and consign VLOOKUPS and INDEXMATCHES to the history books!

.... and a little more!

That was a long lesson – have a rest before moving onto the next lesson!