Capitalisation and Data Extraction

Capitalisation and Data Extraction

Previous Session Summary

In the previous lesson, we were introduced to the powerful data splitting function in PowerQuery and practised splitting data using a custom delimiter. Of course, there are several other very useful data splitting functions including splitting data when data values change change from a digit to non digit, or a capital to a non capital.

Lesson Synopsis

In this lesson we start looking at some formatting tools, in particular, applying proper capitalisation to words.
We also look at how to extract sections of data from columns of information. This is very similar to splitting data which of course we looked at in the last lesson.
When we split data, we give PowerQuery a point ( a delimiter) at which we’d like it to split a column of information into multiple columns of data. With extraction, we give PowerQuery a splitting point, but we then tell it that we only want to keep the information before or after the delimiter or between two delimiters. We’re then again left with only one column.  You can complete the same task by splitting the dataset and then removing the columns you don’t need, but extracting is faster and with fewer applied steps, requires less computer resource.

Resource Listing

The Video

Play Video

Lesson Notes

In this video we look in depth at one particular element of formatting; capitalisation

In standard Excel, we can capitalise text in a certain way; either by elevating all text to capitals, lowering all the text to lower case or capitalising the first word of all words by using certain functions.

 The UPPER function capitalises all letters in a word, while the LOWER function puts all letters in lower case. The PROPER function capitalises the first letter of each word.

PowerQuery has the same capabilities but at the single click of a button.

In the video to demonstrate this capitalisation ability, we look at extracting some names from email addresses and then capitalising the names correctly. All the names are in slightly different formats. Some have some capitals, others don’t and still others are completely capitalised.

How Can We Capitalise Words Correctly in PowerQuery?

Firstly select the columns that has the data to which you’d like to apply your formatting.

  1. From there, go to the Transform tab,
  2. Then in the ‘Text Column’ section of the ribbon (about halfway along), select the Format drop down.
  3. There’s lots of different options from here so let’s go through them.
  • Lowercase – converts all text into lowercase so that there are no capitals
  • UPPERCASE – converts all text into UPPERCASE so that there are no capitals
  • Capitalize Each Word – Capitalizes each word in the dataset and then the remainder of the word is in lower case.
  • Trim – removes any leading or trailing spaces from a cell
  • Clean – Removes non printable characters like line breaks or carriage returns
  • Add Prefix – adds a custom prefix to the beginning of all rows in a column
  • Add Suffix – adds a custom suffix to the end of all rows in a column

    To Capitalize words correctly, of course we select the Capitalize Each Word option.

By now you should be used to the actions popping up in the Query pane and sure enough, there’s a new step added of Capitalized Each Word.

What is the ‘Extract’ option?

In the last lesson we looked at splitting data at certain points, at delimiters, in each cell in a column of data. The data in each cell would then be separated out into columns at each of those delimiters.

Extraction does a similar sort of thing. We give PowerQuery a delimiter and then we can ask PowerQuery to extract all the information before, after or between the delimiters. The rest of the information is then removed from the dataset. This is ideal when we just want a section of data from a cell, but don’t need all the information.

So how do we perform

How to use the Extract option

First of all, identify the column that has the data that we want to extract.

  1. Again in the ‘Transform’ tab go to the ‘Text Column’ section of the ribbon.
  2. Select the ‘Extract’ drop down. Again a number of options come up:
  • Length – Counts the number of characters in each row of the selected column. This is like the LEN function in standard Excel.
  • First characters – returns only a definable mount of characters from the start of the selected data. This is like the LEFT function in standard Excel.
  • Last characters – returns only a definable mount of characters from the end of the selected data. This is like the RIGHT function in standard Excel.
  • Range – returns a specified amount of characters from a specified midpoint in each row of the selected column. This is the equivalent of the MID function in standard Excel.
  • Text Before Delimiter – only retains the information before a delimiter in each row of the selected column. All other data in the selected column is discarded.
  • Text After Delimiter – only retains the information after a delimiter in each row of the selected column. All other data in the selected column is discarded.
  • Text Between Delimiters – only retains the information between two delimiters in each row of the selected column. All other data in the selected column is discarded.

    Choose the Extract option from the drop down menu that you need.

.... and a little more!

The Formula Bar – We don’t look at the Trim function in this lesson’s video but it’s a very useful function. Here’s why:

PowerQuery is very exact in it’s code. If you don’t have capitals in the right places or you have leading or trailing spaces in your dataset, this can cause problems and result in PowerQuery not doing what you expect. Take for example Merging which is a function we get to later in the course. Merging brings together two tables of information using a ‘key’, a piece of information that is common in both dataset. Using the ‘key’, PowerQuery can match the correct rows together, much like a VLOOKUP does. However if the ‘key’ column in one dataset has trailing spaces which aren’t in the key column of the matching dataset, then the Merge function won’t work. It’s quite difficult to pick out why a Merge function isn’t working properly, but the Trim function will eliminate any trailing spaces, which aren’t obvious from just looking at the dataset, which could be a principal reason.