In this lesson we’re carrying straight on from the last esson where we demoted our headers into th e first row of our dataset. This was so that the data splitting that we’re about to do would apply to both datasets and header alike.
So what is Data Splitting?
What is data splitting in PowerQuery?
Data splitting is the act of splitting a column of values at a specific point or points (called delimiters) that we define. The Data splitting function in PowerQuery is much more versatile than in Excel because you can do all of the following:
So a pretty powerful tool then. How can we use it?
How can I use Data Splitting
To do some data splitting, the first thing we need to do is find the column that we want to split. After that of course, we need to identify the text or values at which we want to split our data. This can be quite straight forward as it might be a comma or a dash but don’t rule out something more extravagant like an @ or maybe even a delimiter made out of more than one character.
Once we have identified where we’re going to split the data we need to actually do the splitting:
– by delimiter or
– a set number of characters. In this situation we have some data that we always want to split after every 1st, 2nd, 3rd etc character, regardless of what that character might be or
– by position. We can split the text at specific character position (starting at 0) which means that unlike splitting data by a set number of characters, we can alternate the positions at which we’re splitting the data or
– By lower case to uppercase or upper case to lower case. The data can be split every time text moves from lowercase to upper case or vice versa or
– By digit to non digit and vice versa. The data can be split every time the characters move from digit to non digit and vice versa.
Splitting by delimiter
Once you have selected the Split by Delimiter option, then a new menu comes up asking you to tell PowerQuery what the delimiter is by which you’d like to split the data. PowerQuery gives you some options such as a comma, but you also have the ‘Custom’ option. When selected, the custom option allows you to enter some custom text of your choosing by which you can do your splitting. Unlike in the Text to Columns function in standard Excel, which only allows you to enter a single character, PowerQuery allows you to enter more than one.
Once you’ve selected your delimiter or defined your custom delimiter, you can then tell PowerQuery whether you’d like to just split the first occurrence of the delimiter, the last occurrence of the delimiter, or every occurrence of the delimiter that it finds.
There’s also an advanced selection of options, which allows you to split the columns such that the split data goes into new rows instead of new columns and we look at this in a later lesson.
Once you’re happy then press OK.
And there, we go! The text is split! And we have a new action in the Query Settings Pane – Split Column by Delimiter.
What are data formatting errors?
There’s two different kinds of errors in PowerQuery.
We have what I call Data Formatting errors and these just affect cells of data, but the query will continue to load all other information back into Excel.
We also have query breaking errors. This is where the code that has been written to create the query doesn’t match the data that has been connected.
In this lesson we’re looking at Data Formatting errors and how to fix them.
A Data Formatting error is where we have asked PowerQuery to convert a column of data into a particular data type such as ‘Whole Number, ‘Decimal Number’, ‘Text’ or ‘Date’. In the Applied Steps pane, this step is called ‘Changed Type’, and as revision we can give this command by pressing on the ‘Data Type’ button in the ‘Transform’ tab.
So why does the error occur?
Well let’s say we have a column of data that has both text and numbers in each cell in the column – like our Container Number column in the video. If we command PowerQuery to convert this column to a whole number format, then PowerQuery won’t be able to as text characters cannot be converted to a whole number format. For every cell that has text data in it then, there will be a data formatting error.
How can we find out if we have data formatting errors?
When we load our query back into standard Excel using the Close and Load option, in the Query and Connections pane, we see each query which is connected in the workbook and the amount of loaded rows. If any of the queries have data formatting errors then we will see the errors next to the amount of loaded rows.
How can we find out what the errors are?
In the Queries and Connections pane, click on the hyperlink showing the amount of errors there are in a query. This brings you back into the PowerQuery editor and creates a new query showing only the rows in the original query which have errors.
The first column in the preview screen shows the row number with the errors in the original query and then shows all the other columns in the original query. If you scroll across the screen you will see that some of the cells have the word ‘Error’ in them. These are the errors to be fixed.
Once you’ve identified the cells with the errors, the next thing to do is work out why the data in those cells is incongruous with the data type of the column. As a general rule though, most of these errors happen because we try to convert cells with text into Number format or Date formats. Sometimes you can work out the error by looking at the context of the other data in other columns and sometimes you have to go back to your original dataset.
How can we fix the errors?
There’s three different simple ways of fixing the errors. One of them remove the errors lines all together, while the third retains the row but changes the data format or the data itself.
Simple option number 1
Once you’ve identified the cause of the error, then you may decide that you don’t need the row with the error in it (as we don’t in the video). You can then remove the entire row by selecting any column with an error in it, right clicking the column header and selecting ‘Remove Errors’. Do make sure though that you’re happy to remove the ENTIRE row though!
Simple option number 2
Data formatting errors occur when cells have values that can’t be formatted in the way that we’ve defined to PowerQuery. A simple fix? Change the data format of the column to a format that is acceptable to PowerQuery. To do this,
Simple Option Number 3
Option number 3 is that you can use Replace Errors option. Using this option, you define to PowerQuery the value that you’d like to replace an error with. This can be a good, quick halfway house option as you don’t need to remove the entire row, nor go understand why the error has occurred which can be time consuming.
Click on the column that has the errors, then right click and select ‘Replace Errors’. From here you can then put in a value that you’d like instead of the error.
Query Breaking Errors – Earlier we talked about the different types of errors and we focused specifically on data formatting errors. Query breaking errors we look at later on in the course but they generally happen on refreshing queries when Header titles or columns have changed or been added to the dataset.