We can save all the
files that we need to clean and transform to a single folder either on a
hard drive or OneDrive, and then connect to the folder. The data from all the files is then extracted in their ‘dirty’ format and appended underneath each other and can then be cleaned and transformed together. So brilliant and so simple and frankly one of the best things about PowerQuery!
In this lesson and in the next few lessons we’re going to imagine that we’re the logistics arm of an importer Global Imports, a company that imports goods on behalf of other companies. In our to day to day work, we need to keep track of all the containers that are coming into the country from abroad.
We receive a dataset every month from our Import manager who manages our freight carriers (the companies who bring our inventory into the country) and this shows us all the containers that are currently being imported into the country.
The problem is that the data needs lots of cleaning, transforming and joining together.
The Dataset
So what’s in the file that we’re working with?
Well most importantly, it shows all the containers which have arrived and a unique reference for each container.
The data shows the size of each container, the port of arrival (so where the container has arrived in the UK), the port of departure (the port where the container departed from all over the , the arrival date and the departure date.
There’s lots that we need to do with this dataset though before it’s ready for use:
a) the data is split by delimiter rather than automatically split into columns,
b) it’s got some columns that we don’t need,
c) it doesn’t have names of the carriers that are carrying the equipment
d) dates like arrival date and departure date are in a serial number format rather than a date format
e) we want to extract some of the month names from the dates as well.
So how do we get started? We open a new blank Excel workbook and use PowerQuery of course!
How do we connect to a folder of files?
To connect PowerQuery in our blank Excel workbook to a folder of files:
a) go to the Data tab.
b) Select Get Data
c) Then choose From File and then after that
d) Select From Folder
In the new pop up menu, Excel then asks for a folder name and pathway or we can browse for it through Windows Explorer. Choose the folder you want and then click OK.
After that, Excel comes up with a new menu showing the contents of the folder which we’ve just asked PowerQuery to connect to. To go into the PowerQuery Editor to extract the data, select Edit (later PowerQuery editions have ‘Transform Data’).
The PowerQuery Editor then shows us showing us something different to the extracted data that we’ve seen in the Editor in all the lessons so far. It’s not showing us the information contained in the files of the folder, but rather the ID and properties of the files in the folder; when it was accessed, the date that the file was modified, when it was created and so forth. It’s essentially the same information that you’d see in Windows Explorer.
Filtering out data rows
It might be that there’s lots of files in the folder that you want to connect to, but that there’s only certain files from which you want to extract the data.
That’s fine, just filter out the files that you don’t want! This can be done through selecting the drop down arrow in the Header and then selecting the information that you do or don’t want.
This creates a Filtered Row command in the Query Settings pane which as we can see from the cog, can be changed at a later date if we so wish.
Filtering rows in the PowerQuery screen will take any non selected rows out of any subsequent steps. In the example video, I filter out the Word File, so PowerQuery disregards it from any further queries.
Extract Data From the Files in the Folder
To extract the data from the remaining files;
a) look at the column Content, far to the left hand side of the screen.
b) In the top right hand corner of the Header, there is a box with two arrows pointing downwards. Double click on this to bring up a new screen called Combine Files.
This ‘Combine Files’ menu asks us to give a file that we want to clean or transform as an example as there are multiple files in the folder. We can select here, either the first file in the list, or any of files in the list.
I recommend that you always select a named file rather than use the first file. The reason for this is that the first file in the folder can change when you add subsequent files into the connected folder and depending on the steps you add subsequently to the query, this can break the query. For example, if you receive a file on a monthly basis, you might name the first file January. When you set up your query for the first time, then the first file will be January. When you save your February file, and refresh your query, then the first file in the folder will be February, as Windows Explorer arranges files alphabetically and ‘F’ comes before ‘J’ in the alphabet. We’ll look at this further when we look at refreshing folder queries.
Next we need to select a worksheet to which we’re going to connect. This is under the section named parameters. In here you’ll see the names of all the worksheets that you’ve selected as a sample file. Select the one that contains the information that you’d like to extract.
Now a very important point. If this technique is going to work, the data that we want to extract must be on a worksheet with exactly the same name in every file in the folder.
The data can’t be on ‘Sheet1’ in one file, ‘Sheet 1’ in another ‘Arrivals Data’ in a third file and so on. All the Sheet names must be the same, spelt exactly the same way with capitals and spaces in exactly the same place. Any file which has the data in a worksheet with a different name, will not have its data extracted.
Once we come out of the Combine Files menu then we’re into a query screen that looks a lot more familiar. On the left hand side of the screen, in our Queries Pane, we now hahve 5 queries where we used to have one. We have the initial query that we started with in a section called Other Queries and four other queries in a section called ‘Helper Queries’. In this course we stick with the initial query as the Helper Queries are a little more advanced. However, you can if you wish perform data transformations in the ‘Transform Sample File’ Query in the Helper Queries section. In this course though we perform the transformations in the initial query.
On the left hand side then of the initial query, in the left hand column, we have the name of the file from which the information has come. This is called Source.Name. The remaining columns are the extracted data from our files in the folder!!
To check this, you can select the drop down menu in the Source.Name column. Chances are that only one file name will come up so click the ‘Load More’ option which commands PowerQuery to go and bring more data into the PowerQuery Editor from the connected files. Once this information loads, you should see the names of all the files in the folder that you opted to connect to. So now, just by having the files in the same folder, any subsequent formatting we do, will apply to all files. How brilliant is that?!
Fixing the ‘first file’ pitfall – As you read above, in the Combine Files menu, it’s best to use a Named File rather than using the First File, to help avoid query problems later on.
However if you’ve already used the First File option and you’d like to change to use a Named File instead, then this is how to fix it.
a) Go to the Query Pane and to the Helper Queries section. In there, you’ll see a query called Sample File. Click on this.
b) In the Query Settings Pane in the Applied Steps section, there should be two steps called ‘Source’ and ‘Navigation’. Click on ‘Navigation’. In the formula bar (go to the View tab and click on the Formula Bar box if you haven’t already), you’ll see some code similar to this:
= Source{0}[Content]
This code says ‘In the table in the ‘Source’ step, go to the first row (that’s the {0} bit) and then to the Content column (that’s the [Content] section)’
Copy the ‘{0}[Content]’ bit and then delete the step by clicking on the little cross next to its name in the Applied Steps section.
c) Next click on the Source step and then in the ‘Name’ column in preview screen, select the drop down box and select the file that you’d like to use as the sample. This will create a ‘Filtered Rows’ step in the Applied Steps Field.
d) Next we want to give PowerQuery the name of the file that we want to use as the Navigation step. Next to the Formula Bar there’s a little button that looks like this: ‘fx’ . Click on it. This creates a new custom step and in the formula bar we should have written #”Filtered Rows”. In the formula bar, paste back in directly after #”Filtered Rows” the copied information: {0}[Content]. Then press return or the little tick next to the fx.
e) Finally we need to remove the new step that PowerQuery has added by itself in the Applied Steps field which is ‘Import File’. Simply click on the little cross next to this new step to delete it.