How to get data from a PDF into Excel using PowerQuery

One of the questions I used to ask all the time as a finance analyst was ‘how can I get data from a PDF and put it into Excel?’

A lot of useful information that’s delivered to a business is in a PDF format – invoices particularly. In one of my earlier roles, one of the jobs of my team was to extract data from PDF invoices and put it in Excel  so that we could validate that we were paying for the right services. We received a lot of invoices every month and some suppliers wouldn’t give us Excel backup so we had to extract the data out of the PDF invoices instead.

I wish I had known PowerQuery then as well as I do now! Back then we used to literally copy and paste the data in or if that didn’t work, manually (eek!) write the information out into Excel.

With PowerQuery it is so easy to connect to a PDF file and then find the data that you want. Once you’ve found it, then it’s really easy to transform the data (if it needs transforming) and pop it back into Excel.

The thing is though that many people who know PowerQuery in Excel, don’t know that you can really easily connect to PDFs because there’s no obvious connector in the Get Data menu on the Data tab’s ribbon. That’s ok though, because we can just put in a blank query and then use a really easy bit of code that you can simply copy from the below video.

Of course, while I’m talking about PowerQuery in Excel, it’s even easier to connect to, and extract data from a PDF using Power BI. That’s because in Power BI, there is a connector. So check out the below video to see how it’s done, firstly in PowerQuery in Power BI and then in PowerQuery in Exce

If you’d like to follow along with the video, then download the PDF file in the link to the right, and connect Power BI or PowerQuery in Excel to the file:

A little bit about this connector...

So this code to enable the connector in Excel is a bit more complicated compared to some of the other codes that we’ve seen in the past.

           =Pdf.Tables(File.Contents(PDF location on OneDrive), [Implementation = “1.1”])

This code combines two functions: PDF.Tables and File.Contents that together combine to open up a PDF file.
 
When you’ve connected to a PDF file, then PowerQuery pulls the information out into two format types:
 
A ‘Page’, which more or less has all the data on the PDF, and a ‘Table’, which extracts the meat of the data. If the PDF is longer than one page, then you’ll get Page001, Page002 and so on. The same goes for Table as well – the more pages you have, the more tables you have.
 
You can also extract folders full of PDFs – just connect to the folder in the normal way. However, be careful with this one, as PDFs frequently don’t keep the same format from PDF file to PDF file. Frequently, data jumps between 2 or 3 different columns which makes transforming a bit more difficult as its harder to set logic.

Other Popular Finance Team Training posts on PowerQuery...

Other useful Info and links

 – If you’d like to visit the Microsoft M Code library then click here