Extract data elements from a large HTML text file

  • I have a large text (html) file (500,000 rows) from which I have to extract certain elements. The file layout is fairly consistent and each element has a clear identifier. (sample attached) Unfortunately the source of the data is a webpage and the only option is to copy paste the data into excel (or Word, text file etc).

    I need to extract the date, time and account# appearing to the right of the following labels/tags.

    Example:
    Diarized for: Wed Aug 13th,2014 @ 12:00AM
    Last Viewed: Aug13th,2014@13:28
    Account# 452145871235

    The extracted data should appear in 3 columns:

    Account | Diarized Date | Viewed Date

    Occasionally a record will have a viewed date of ‘Never’ and no Diarized date at all. Ie. not even the “Diarized for: “ label will appear. Also, records with long or multiple addresses will spread across multiple rows so the number of rows per record can vary.
    Thank you for helping!

  • Re: Extract data elements from a large HTML text file


    Hi Cytop, thank you for replying. There is no file unfortunately. I have to copy/paste the data from the site into excel. I will re-title my post "Extract data elements from a large HTML text dump". I tried copying the data into a text file actually but it looked even less workable to me. I could upload that tomorrow though, the system is not available at the moment.

  • Re: Extract data elements from a large HTML text file


    I've added a screenshot of what the data looks like on the webpage as well as a copy of the same data in a text file, which is my source file for processing.


    Each record is numbered beginning at "1" followed by a dash then the customer account number, then a colon, then a 1 to 3 digit number. (not required)
    Ex 1-00123456:1
    2-45145558888:49
    The account # appears again later in the record prefaced by "Account # "
    Ex "Account # 5177590000225777"


    Generally speaking, each record begins the account number, prefaced by the record# as indicated above (1-00123456:1), and ends with the second instance of the account that looks like ""Account # 00123456:". Note: some records have 2 addresses, the second of which will appear after the second appearance of the account number, which can make the end of the record appear confusing. Thanks again for helping.

  • Re: Extract data elements from a large HTML text file


    Hi..


    Try this..


    More data to test on would have been good (to see if there were any anomalies not shown in the text file you gave)..


    Your sample text file only had 1 instance of "Diarized for:".. so i added a couple more at random places for better testing.. you will see them in the attached txt file..


    Not sure how long it will take to run through 500 000 lines though.. if it is too slow.. you could probably adapt my code to build an array and write it to the sheet at the end rather than adding a value to the sheet each time it gets a hit.. that would make it faster..


    Having said that.. someone else could well possibly come up with a completely different method that is faster too..


  • Re: Extract data elements from a large HTML text file


    Hi Apo, I just tested it with approx. 1,000 live records and it worked perfectly! If I run into any problems running the larger data sets I'll let you know. Thank you so much for your solution.

  • Re: Extract data elements from a large HTML text file


    Hi..


    As a follow up... try this one that builds an array and then writes it to the sheet at the end..


    It should be much faster on your larger data set..


    You can test out both methods in the attached workbook.


Participate now!

Don’t have an account yet? Register yourself now and be a part of our community!