Superpower (aka “Refreshing Data”)

Superpower (AKA "Refreshing Data")

Previous Session Summary

Last lesson we covered how to create basic new columns in our dataset as well as converting units of time into different durations using the duration function. We also came to the end of the first stage of our container dataset transformation.

Lesson Synopsis

In this lesson we look for the first time at refreshing our query to bring new or changed information into our dataset. I love this lesson and frankly, knowing this technique in PowerQuery is worth it’s weight in gold. Knowing how to refresh files will make daily, weekly and monthly repetitive tasks so much faster. Prepare for your value in your Finance Team to rocket….

Now for the last 6 lessons we have been working on the transforming the same dataset. You may remember that back in lesson 8, we connected PowerQuery to a folder. In that folder were datasets showing container arrivals data, one for January and one for February. But what happens when we get the same container reports in for March arrivals, April arrivals and so on? Do we need to rewrite the query? Do we need to copy the query, point it towards the new datasets and then append the resulting query to our original query?

Definitely not. We click refresh and make a brew.

Resource Listing

The Video

Play Video

Lesson Notes

In the last lesson we got to the end of the first stage of our transformation of our dataset. However the dataset is only comprised of two months’ worth of container arrivals data; January and February. The other container arrivals data for the remainder of our year hasn’t yet arrived.

Now when the remainder of our dataset pitches up, what do we do with our query? Do we have to build it again with the new dataset? Do we have to copy the query then point it at the new files, then append the copied query to the original query? Thankfully no, the answer is much simpler.

How can I change the data in a queried dataset; working with folders of data.

Back in Lesson 8, we originally created our queried dataset through connecting PowerQuery to a folder of data. The January and February container arrivals data was saved to that folder. When we get a new month’s worth of data, all we need to do, is save that new file of information to the same folder. The next time we click  the ‘Refresh’ button in standard Excel, the new information will be brought into our queried dataset!

So here are the steps:

  1. You receive your new files. Save the new files to the same folder that the original query was pointed at.
  2. Make sure none of the files in the connected folder are open.
  3. Go back into standard Excel and into the ‘Data’ tab.
  4. Directly underneath the ‘Data’ tab header is the ‘Queries and Connections’ ribbon section. Select the ‘Refresh All’ button.

The Refresh command tells PowerQuery, to go back to the Source of the data that is queried (so in this case a specific folder), and then perform all the steps in that Applied Settings pane for the files in that folder. PowerQuery doesn’t distinguish between files that were there at the point of writing the query and those that are there at the point of refreshing, it just performs the steps for any files that are there.

In another simpler way of putting it, PowerQuery will automatically do all the cleaning and transformation that we set for the January and February months for any file that is in the originally queried folder.

How brilliant is that?!

How can I change the data in a queried dataset: working with single files of data.

We don’t look at this in the video, but sometimes we’re not interested in querying a folder of files and we have only connected PowerQuery to a single Excel file. This happens frequently when we want to understand data at a specific point in time rather than over a period of time or when all the information we need is in a single file rather than in many.

If the information in that individual file then needs to change for any reason, perhaps it’s been updated or there was an error in the original dataset, then how can we update the information in our query?

Simple! We just replace the original file with our new file. Keep the file location the same and the name of the file exactly the same and then once saved, go back into the file with the query and then click Refresh All in the Data tab.

PowerQuery goes back to the original source of the data (being the file that we just replaced) and performs the query anew, pulling in the new information!

The concept of the hierarchy of objects

PowerQuery (and standard Excel in fact) utilises the concept of ‘objects’. Objects are specific items that have a defined place on a drive and therefore can be located by PowerQuery. For the purpose of this course, here is the hierarchy of objects that PowerQuery can locate and from which it can extract:

  • A Folder
  • A File in a Folder
  • A Worksheet in a File in a Folder
  • A Table on a Worksheet on a File in a Folder

At this point in the course we’ve used PowerQuery to connect to all of these different objects. If you go into the ‘Source’ step in the Applied Settings pane in each of our queries so far then you’ll see the address of the object that PowerQuery has connected to. This means then that every time a query is refreshed, PowerQuery goes back to that object to extract the data from it. If that object no longer exists, or is called something slightly different, then PowerQuery won’t be able to find the dataset. The result? The refresh will fail.

This is another point to reiterate that PowerQuery is case sensitive.

Let’s think about the following code extract from PowerQuery.

Source = Excel.CurrentWorkbook(){[Name=”Table1″]}[Content]

This piece of code (gripping as it is) is telling PowerQuery to go into the same workbook as the current query, then find the object called “Table1”. Once it’s found it, then extract the content.

The object in this case is called Table1. If we changed the code to be “table1” or “Table2”, and then tried to refresh the query, then the code would fail as “Table1” would no longer be found.

Eagle eyed readers with the memories of an elephant may also recall that in the second lesson, we connected PowerQuery to a range of cells. A range of cells is a collection of data lumped together that are not formatted as a table i.e. it hasn’t been named. PowerQuery can’t connect to a range of cells, so instead it formats the range of cells automatically as a table and names the table for us. PowerQuery can then find the information.

 

.... and a little more!

Where can refreshing go wrong?

There’s several ways that refreshing can go wrong and cause an error, but these are the main ones:

  • Different naming of objects

Regardless of whether you’ve connected to a folder, to a file or to a table, PowerQuery notes down the exact address of the object that you’ve connected to i.e. where you can find the dataset on your system. The exact address is then baked into the code. You can find this exact address in the Source Step in the Applied Steps pane (just click on it and the address will come up in the formula bar).

Here’s some examples of what you’re likely to see in the Source step:

Connecting to a folder:

When you’ve connected to a folder, then you’ll see something like the following which is a Folder.Files function followed by the address of the folder that we’re connecting to. The folder address is in red below.

= Folder.Files(“C:\Users\gunnm\OneDrive\GunnMccrae\Excel lessons\Excel – PowerQuery\Lesson 8\Arrivals data“)

Connecting to a file:

When you’ve connected to a file, then you’ll see something like the following which is a File.Contents wrapped in an Excel.Workbook function. All of that is then followed by the address of the file that we’re connecting to. The file address is in red below.

Source  = Excel.Workbook(File.Contents(“C:\Users\gunnm\OneDrive\GunnMccrae\Excel lessons\Excel – PowerQuery\Lesson 7\February POs.xlsx“), null, true)

You probably see where I’m going with this. If you change the name of the folder or the file that the query is connected to before refreshing, then the query will break. The code will not then work again until you change the address of the folder or file to be exactly the same as the address in the code (or you change the address in PowerQuery). To be honest, you don’t see this error very frequently with folder connections as generally users only add in new files into folders. As PowerQuery connects to the folder not the files, the error doesn’t arise. However the error frequently happens when users save over the top of a single connected file and don’t keep the name the same.

  • The ‘First File’ issue

This issue only arises with folder connections and while we’ve discussed it already in Lesson 8, it is worth reiterating.

When you connect to a folder, you have the option to either use the first file in the folder as a sample file or select a particular file. Back in Lesson 8 we recommended that, as a beginner, you use a named file. The reason why is as follows.

Once you’ve extracted all the data from the files in the folder, the next thing that you’re likely to do is promote the headers; putting the first row of information into the title row. Normally this is fine but remember that when you connect PowerQuery via the folder connection, PowerQuery creates a column called Source.Name. This column shows the name of the file that each row in the dataset has come from. When you Promote Headers, the column title “Source.Name” will then be replaced by the name of a file in the folder. The file name will either be:

  1. the first file alphabetically in Windows Explorer if we opted to use the first file as a sample file
  2. the selected file we opted to use as a sample file.

The problem is then that there are certain transformations that name columns specifically. Some examples are “Remove Columns”, “Changed Type”, “Rename Columns” and so on. When we opt to use the first file alphabetically, chances are that the file that was first alphabetically when we wrote the query, won’t be first at a later date.

Let’s give an example here. In our video, when we wrote the query that connected PowerQuery to a folder, the first Excel file in our folder was “February Arrivals Data” with a January file following. If we were to opt to use the first file as a sample file, then this would be the file that PowerQuery would use.

When we combine the data from the files in the folder, the information from the February Arrivals Data file would appear first with January following. Therefore, when we promote our headers in the column Source.Name, the name “February Arrivals Data” will replace the “Source.Name” header. With me so far?

That’s not great though is it? We don’t want a column called February Arrivals Data. So we might opt to rename the column, perhaps to “File Name”, which would result in the following code:

Rename_Columns= Table.RenameColumns(#”Promoted Headers”, {“February Arrivals Data”, “File Name”})

I won’t get too deep into the code, but you can see that at the end of the code, there is the column name “February Arrivals Data” and the name that we’re changing it to; “File Name”.

This is our problem. In this lesson we brought in some new files into our folder and clicked refresh. One of those files was called “April Arrivals Data” which alphabetically comes before “February Arrivals Data”. If we opt to use the first file as our sample file, then the April Arrivals Data is now first which has the following consequences:

  1. When we combine the data, the data from the “April Arrivals Data” is now first, not “February Arrivals Data”
  2. When we promote headers, “April Arrivals Data” will replace the header Source.Name not “February Arrivals Data”
  3. When we rename the February Arrivals Data column (which used to be called Source.Name), the code will break. This is because PowerQuery will not be able to find the column February Arrivals Data which is explicitly referenced in the query as it is now called April Arrivals Data.

How do we solve this?

Well as suggested in Lesson 8, for the moment always opt to use a specific file as your sample file. Option 2 is to delete the Source.Name column as soon as you’ve combined your files and before you do any other transformations. There are other options that are slightly more complicated and therefore we don’t go into them at this stage.

  • The Changed Header Issue

When we bring in new files, sometimes the headers in our source files change (they shouldn’t if the files are system produced but it does happen sometimes!) This causes problems when the files are imported into PowerQuery as headers are frequently referenced specifically in code transformations. Like in the First File Issue, if PowerQuery can’t find Header titles that are specifically referenced in the PowerQuery code then the code will break.

How do we solve it?

There are much more complex methods to make your queries more robust but for the moment, the easiest thing to do is go into the Excel file and change the name of the Header to how it appears in the query. Make sure you have a good relationship with the people who produce your reports so you know in advance when column titles change!