VBA Code to convert data from a PDF to Excel & organize raw data

  • Hello - I recently received help in creating a macro that takes data from a worksheet with raw data and organizes it into another worksheet in specific rows & columns. I'm wondering if I can take this a step further by perhaps copying directly off the PDF and pasting it into the "Raw" worksheet and then run a macro/VBA code that would organize specific data into specific rows on another worksheet (SIRs) and placing it in the next empty cell?


    I'm attaching a copy of the PDF where I need the raw data from (some data has been redacted/changed for confidentiality).


    And I'm also attaching the spreadsheet of which I'm working off of. I've copied the data off the PDF into column A.


    I would need the following information from the PDF organized into the worksheet titled "SIRs":

    • Name of the Program
    • Event ID
    • First Name(s)
    • Last Name(s)
    • Date Reported to Care Provider
    • Time Reported to Care Provider
    • Description of Incident (Full description of Incident)
    • Gender
    • Child's Country of Birth
    • Age
    • LOS


    Any help would be greatly appreciated! :)

  • If you have Acrobat, not Adobe Reader, there are some methods one might use.


    Otherwise, I guess one would have to use a parsing method. The first step though would be to uncompress the pdf. For that, one 3rd party application could be pdftk. e.g.

    Code
    'https://www.pdflabs.com/tools/pdftk-the-pdf-toolkit/
    Sub ken()
      'pdftk doc.pdf output doc.unc.pdf uncompress
      ChDir "C:\Users\lenovo1\Dropbox (Personal)\_Excel\pdf\FDF\wp array method"
      Shell "pdftk ""UACSIR Sample.pdf"" output doc.unc.pdf uncompress", vbNormalFocus
    End Sub


    From there, copy and paste. To automate that, I would have to think on it some.


    Of course if this is not a standard need or standard format, I would not bother doing all that work.

  • Thanks, Carim! I'll check this out.


    Do you know if there's a way to organize the data in column A (see attachment) into the respective cells in worksheet "SIRs" ? The same cells you included in the macro you helped me with?

  • Hello,


    Not sure to understand what you do mean by " Organize the Data in Column A " ...


    To me, your Column A in Sheet ' Raw ' was a constraint ... which led to building the macro in order to correctly link fields with the Sheet ' SIRs ' which must be your working database ...

    If you feel like saying "Thank You" for the help received, do not hesitate to click the "Smiley" icon, below, in the bottom right corner :)

  • Notice how on the attached spreadsheet all of the raw data is in column A under the worksheet "Raw". For example, A2 reflects "First Name: John Daniel Status: ADMITTED" - but I would need "John Daniel" to be placed in next blank cell in column C under the "SIRs" worksheet. Another example, A9 in "Raw" reflects "Africa Current Program: Test Program 1" - but I would need "Africa" placed in the next blank cell in column L under the "SIRs" worksheet since that is the child's home county of origin.


    Since I copied and pasted directly off the PDF, it put all the data in column A.


    Thoughts?

  • I will look at the online conversion of pdf to Excel since I don't have Acrobat on this computer. Since you have Acrobat, you do not need that. I have code that can do that in Acrobat.


    Without Acrobat, other methods are needed to get the raw text. For Acrobat, getting the raw text is easier than copy and paste. You should be able to modify this to suit for one file. This one gets the text content from many files.

  • Work is in the eye of the beholder I guess. Your pdf file is not generic, nor does it have fields, so it needs some steps to solve. Post #8 was just to get you to step 1.


    You have to modify post #8 code to suit. The concept that I showed was how to get the raw data and put it into a text file using Acrobat. This is just the first step to your solution. You can copy and paste the function ReadAcrobatDocument() into a Module, add the Acrobat reference as commented, and then make a Sub to call it and pass the input values. You then get a string back which is the raw data string. In the example code in #8, I wrote that string to a text file. That step is not really needed for your problem since you have Acrobat. It is useful if you want others to help you parse the string stored in a text file. The column A string that you showed was cut and pasted so it may not be consistent and not useful for automation anyway.


    I used your pdf file from post #1 in an online pdf to excel site as I said in #8. The converted Excel file would not be that useful to your step 2 needs. It could be done but paring would be more involved than what I detail later. As such, I am not sure how much help posting that code would be.


    Making a short Sub to call the function in #8 is easy. If you still need help with that post back.


    Step 2 is to parse the raw data string to get the data parts that you want. This will be the more tedious step. It should be doable but depends on the raw data string. From step 1, you can attach a text file with the string if you want help with parsing here in step 2 since others likely do not have Acrobat.


    So that others can do step 1 that do not have Acrobat, I used a 3rd party application from xpdf called pdftotext.exe. I have attached it so that you can see that parsing it may not be too difficult.


    Before I work on step 2 parsing, what worksheet did you want the parsed data to be put, SIRs?


    Before I start on step 3, it would useful to know if this is a one file at a time need or multiple pdf file data imports at once need.

  • This is ok for one file's data import but if many, I would remove the formula in column K and write the row all at once using an array to speed it up. There were a few columns that I was not sure about.


    Here is the pdftotext method with parsing. The Acrobat method would be similar but the raw data string would be split to array "a" directly. Since I don't have that data, I don't know if the raw data would be in the same locations as what pdftotext.exe does.

  • Wow! I had a feeling this would be a lot more complex than I would have wished :\


    I appreciate all your help, Kenneth. It's evident that you know what you're talking about and very knowledgeable...I wish I could keep up Lol! Essentially, I need the data from the PDF to somehow organize into the worksheet titled "SIRs". The following data would need to be organized from the PDF into its respective rows/columns in SIRs:

    • Name of the Program (next blank row in Column A)
    • Event ID (next blank row in Column B)
    • A No. (next blank row in Column C)
    • First Name(s) (next blank row in Column D)
    • Last Name(s) (next blank row in Column E)
    • Date Reported to Care Provider (next blank row in Column F)
    • Time Reported to Care Provider (next blank row in Column G)
    • Description of Incident (Full description of Incident) (next blank row in Column I)
    • Gender (next blank row in Column L)
    • Child's Country of Birth (next blank row in Column M)
    • Age (next blank row in Column N)
    • LOS (next blank row in Column P)

    To answer your questions from post #10, yes the data will go into SIRs. And it would be 1 PDF file at a time. So once I'm done with one PDF, I would repeat the same process for a new PDF and so on.


    I'm attaching a copy of the PDF with the data that I need highlighted in yellow.

  • As I explained, I do not have Acrobat so I can only parse the text file that I created with pdftotext.exe. You can download that and easily test what I posted. Just change the things in red and play from that worksheet active. The only thing that needs changing maybe is the description.


    Once you understand the concept of parsing, it is easy. Open the text file in Notepad, view the line number(s) where that data is and then use the concepts that I demonstrated. In some forms, I fear that it will not follow a consistent pattern. From the layout, that might just be the description "field". One can sometimes be more exact by searching for key words in the array elements rather than using line/array numbers. e.g. Line 5 in notepad would be "line"/row number 4 in the array since it is 0 based.


    Get it to "work" first, then one can set it up to batch process all as post #8 shows or one at a time using a file dialog pick method or such. It is not difficult to manually add the parts in red from post #11.


    As I explained too, if you do want to use Acrobat to get the text, I can show you how so ask if needed. I would need it to help you parse as I did in post #11 I suspect. Acrobat may parse the file more different than pdftotext.exe.

Participate now!

Don’t have an account yet? Register yourself now and be a part of our community!