Splitting Data and Handling Errors

Splitting Data and Handling Errors

Previous Session Summary

In the previous lesson, we covered the many different ways of removing columns, as well as demoting headers into the first row of a dataset. That ‘Demoting Headers’ step was a really important step because we needed to bring the Headers of our dataset back into the dataset, so that it could be cleaned and transformed in the same way as the rest of the file. 

Lesson Synopsis

Ah splitting data. If you’ve been around Excel for a bit then you probably know the Text to Columns function in standard Excel. Using this handy little tool, we can split a single long text string into multiple columns (don’t worry if you don’t know it as we do a little intro on it in this lesson).
 
There are some serious drawbacks to the Text to Columns function though, chief among them; you can’t define your delimiter.
 
Never fear though, because PowerQuery has got the situation under control. Do you want to:
– Define multi character custom delimiter?
– Split when it changes from a capital to lower case?
– Split data when it changes from a number to text or vice versa?
– Split text into rows not columns?
– Skip delimiters?
 
PowerQuery’s splitting function can accomplish all of these and in this video we’re going to look at how you can split a column of data using a custom delimiter.

Resource Listing

The Video

Play Video

Lesson Notes

In this lesson we’re carrying straight on from the last esson where we demoted our headers into th e first row of our dataset. This was so that the data splitting that we’re about to do would apply to both datasets and header alike.

So what is Data Splitting?

What is data splitting in PowerQuery?

Data splitting is the act of splitting a column of values at a specific point or points (called delimiters) that we define. The Data splitting function in PowerQuery is much more versatile than in Excel because you can do all of the following:

  1. Define a multi character delimiter for example ‘*/*’
  2. Tell PowerQuery to only split data at the first or last delimiter
  3. Tell PowerQuery to split data when it changes from numbers to letters (and vice versa)
  4. Tell PowerQuery to split data when it changes from capitals to non capitals (and vice versa)
  5. Split data into rows rather than columns
  6. Define the number of characters after which you’d like to split the data

So a pretty powerful tool then. How can we use it?

How can I use Data Splitting

To do some data splitting, the first thing we need to do is find the column that we want to split. After that of course, we need to identify the text or values at which we want to split our data. This can be quite straight forward as it might be a comma or a dash but don’t rule out something more extravagant like an @ or maybe even a delimiter made out of more than one character.

Once we have identified where we’re going to split the data we need to actually do the splitting:

  1. Highlight the column that we want to work on by left clicking on it.
  2. Go to the Home tab
  3. Select the Split Columns drop down
  4. Up comes a new menu and we’ve then got an option to split the column, either by:

by delimiter or

– a set number of characters. In this situation we have some data that we always want to split after every 1st, 2nd, 3rd etc character, regardless of what that character might be or

– by position. We can split the text at specific character position (starting at 0) which means that unlike splitting data by a set number of characters, we can alternate the positions at which we’re splitting the data or

 – By lower case to uppercase or upper case to lower case. The data can be split every time text moves from lowercase to upper case or vice versa or

By digit to non digit and vice versa. The data can be split every time the characters move from digit to non digit and vice versa.

Splitting by delimiter

Once you have selected the Split by Delimiter option, then a new menu comes up asking you to tell PowerQuery what the delimiter is by which you’d like to split the data. PowerQuery gives you some options such as a comma, but you also have the ‘Custom’ option. When selected, the custom option allows you to enter some custom text of your choosing by which you can do your splitting. Unlike in the Text to Columns function in standard Excel, which only allows you to enter a single character, PowerQuery allows you to enter more than one.

Once you’ve selected your delimiter or defined your custom delimiter, you can then tell PowerQuery whether you’d like to just split the first occurrence of the delimiter, the last occurrence of the delimiter, or every occurrence of the delimiter that it finds.

There’s also an advanced selection of options, which allows you to split the columns such that the split data goes into new rows instead of new columns and we look at this in a later lesson.

Once you’re happy then press OK.

And there, we go! The text is split! And we have a new action in the Query Settings Pane – Split Column by Delimiter.

What are data formatting errors?

There’s two different kinds of errors in PowerQuery.

We have what I call Data Formatting errors and these just affect cells of data, but the query will continue to load all other information back into Excel.

We also have query breaking errors. This is where the code that has been written to create the query doesn’t match the data that has been connected.

In this lesson we’re looking at Data Formatting errors and how to fix them.

 A Data Formatting error is where we have asked PowerQuery to convert a column of data into a particular data type such as ‘Whole Number, ‘Decimal Number’, ‘Text’ or ‘Date’. In the Applied Steps pane, this step is called ‘Changed Type’, and as revision we can give this command by pressing on the ‘Data Type’ button in the ‘Transform’ tab.

So why does the error occur?

Well let’s say we have a column of data that has both text and numbers in each cell in the column – like our Container Number column in the video. If we command PowerQuery to convert this column to a whole number format, then PowerQuery won’t be able to as text characters cannot be converted to a whole number format. For every cell that has text data in it then, there will be a data formatting error.

How can we find out if we have data formatting errors?

When we load our query back into standard Excel using the Close and Load option, in the Query and Connections pane, we see each query which is connected in the workbook and the amount of loaded rows. If any of the queries have data formatting errors then we will see the errors next to the amount of loaded rows.

How can we find out what the errors are?

In the Queries and Connections pane, click on the hyperlink showing the amount of errors there are in a query. This brings you back into the PowerQuery editor and creates a new query showing only the rows in the original query which have errors.

The first column in the preview screen shows the row number with the errors in the original query and then shows all the other columns in the original query. If you scroll across the screen you will see that some of the cells have the word ‘Error’ in them. These are the errors to be fixed.

Once you’ve identified the cells with the errors, the next thing to do is work out why the data in those cells is incongruous with the data type of the column. As a general rule though, most of these errors happen because we try to convert cells with text into Number format or Date formats. Sometimes you can work out the error by looking at the context of the other data in other columns and sometimes you have to go back to your original dataset.

How can we fix the errors?

There’s three different simple ways of fixing the errors. One of them remove the errors lines all together, while the third retains the row but changes the data format or the data itself.

Simple option number 1

Once you’ve identified the cause of the error, then you may decide that you don’t need the row with the error in it (as we don’t in the video). You can then remove the entire row by selecting any column with an error in it, right clicking the column header and selecting ‘Remove Errors’. Do make sure though that you’re happy to remove the ENTIRE row though!

Simple option number 2

Data formatting errors occur when cells have values that can’t be formatted in the way that we’ve defined to PowerQuery. A simple fix? Change the data format of the column to a format that is acceptable to PowerQuery. To do this,

  1. Find the first Changed Type step where the data is changed to the format that causes the error. You can do this in a couple of different ways:
  2. In the Header of each column there’s a little icon showing the data type of a column. E.g. Text is ‘ABC’, ‘Whole Number’ is ‘123’, ‘Decimal Number’ is ‘1.2’ and ‘Date’ has a little calendar. Find the Changed Type step where the icon changes to the problematic data type.
  3. A bit more advanced, but the M code in the Formula Bar for each Changed Type Step shows the columns that have been changed and the data type that the column has been changed to.
  4. Once you’ve found the Changed Type step that’s causing the error, change the data type to a type that doesn’t cause the error in the problem column. You can do this by going to the Transform tab and selecting Data Type followed by the data type you want.

Simple Option Number 3

Option number 3 is that you can use Replace Errors option. Using this option, you define to PowerQuery the value that you’d like to replace an error with. This can be a good, quick halfway house option as you don’t need to remove the entire row, nor go understand why the error has occurred which can be time consuming.

Click on the column that has the errors, then right click and select ‘Replace Errors’. From here you can then put in a value that you’d like instead of the error.

.... and a little more!

Query Breaking Errors – Earlier we talked about the different types of errors and we focused specifically on data formatting errors. Query breaking errors we look at later on in the course but they generally happen on refreshing queries when Header titles or columns have changed or been added to the dataset.