I have a large text (html) file (500,000 rows) from which I have to extract certain elements. The file layout is fairly consistent and each element has a clear identifier. (sample attached) Unfortunately the source of the data is a webpage and the only option is to copy paste the data into excel (or Word, text file etc).
I need to extract the date, time and account# appearing to the right of the following labels/tags.
Example:
Diarized for: Wed Aug 13th,2014 @ 12:00AM
Last Viewed: Aug13th,2014@13:28
Account# 452145871235
The extracted data should appear in 3 columns:
Account | Diarized Date | Viewed Date
Occasionally a record will have a viewed date of ‘Never’ and no Diarized date at all. Ie. not even the “Diarized for: “ label will appear. Also, records with long or multiple addresses will spread across multiple rows so the number of rows per record can vary.
Thank you for helping!
Extract data elements from a large HTML text file
-
-
-
Re: Extract data elements from a large HTML text file
Post a sample of the actual file...
-
Re: Extract data elements from a large HTML text file
Hi Cytop, thank you for replying. There is no file unfortunately. I have to copy/paste the data from the site into excel. I will re-title my post "Extract data elements from a large HTML text dump". I tried copying the data into a text file actually but it looked even less workable to me. I could upload that tomorrow though, the system is not available at the moment.
-
Re: Extract data elements from a large HTML text dump into Excel
<editing title>
-
Re: Extract data elements from a large HTML text file
I've added a screenshot of what the data looks like on the webpage as well as a copy of the same data in a text file, which is my source file for processing.
Each record is numbered beginning at "1" followed by a dash then the customer account number, then a colon, then a 1 to 3 digit number. (not required)
Ex 1-00123456:1
2-45145558888:49
The account # appears again later in the record prefaced by "Account # "
Ex "Account # 5177590000225777"Generally speaking, each record begins the account number, prefaced by the record# as indicated above (1-00123456:1), and ends with the second instance of the account that looks like ""Account # 00123456:". Note: some records have 2 addresses, the second of which will appear after the second appearance of the account number, which can make the end of the record appear confusing. Thanks again for helping.
-
-
Re: Extract data elements from a large HTML text file
Hi..
Try this..
More data to test on would have been good (to see if there were any anomalies not shown in the text file you gave)..
Your sample text file only had 1 instance of "Diarized for:".. so i added a couple more at random places for better testing.. you will see them in the attached txt file..
Not sure how long it will take to run through 500 000 lines though.. if it is too slow.. you could probably adapt my code to build an array and write it to the sheet at the end rather than adding a value to the sheet each time it gets a hit.. that would make it faster..
Having said that.. someone else could well possibly come up with a completely different method that is faster too..
Code
Display MorePrivate Sub CommandButton1_Click() Dim i As Long, x, xx1, xx2, xx3, s As String, strFile As String Application.ScreenUpdating = False strFile = Application.GetOpenFilename("Text Files,*.txt") If strFile = "False" Then Exit Sub s = CreateObject("scripting.filesystemobject").getfile(strFile).openastextstream.readall x = Split(s, vbLf & vbCrLf) For i = LBound(x) To UBound(x) xx1 = Split(x(i), "Account #") xx2 = Split(x(i), "Diarized for:") xx3 = Split(x(i), "Last Viewed:") If UBound(xx1) > 0 Then If UBound(xx1) = 1 Then If UBound(Split(Trim(xx1(1)), " ")) <= 3 Then Sheets("Sheet1").Range("A" & Range("A" & Rows.Count).End(xlUp).Row).Offset(1).Value = Trim(xx1(1)) Else Sheets("Sheet1").Range("A" & Range("A" & Rows.Count).End(xlUp).Row).Offset(1).Value = Split(xx1(1), " ")(3) End If End If End If If UBound(xx2) > 0 Then Sheets("Sheet1").Range("B" & Range("A" & Rows.Count).End(xlUp).Row).Offset(1).Value = Left(xx2(1), 31) End If If UBound(xx3) > 0 Then Sheets("Sheet1").Range("C" & Range("C" & Rows.Count).End(xlUp).Row).Offset(1).Value = Split(Trim(xx3(1)), " ")(0) End If Next i Application.ScreenUpdating = True MsgBox "Data has been imported.", vbInformation Application.ScreenUpdating = True End Sub
-
Re: Extract data elements from a large HTML text file
Hi Apo, I just tested it with approx. 1,000 live records and it worked perfectly! If I run into any problems running the larger data sets I'll let you know. Thank you so much for your solution.
-
Re: Extract data elements from a large HTML text file
Good to hear.. thanks for the feedback.
-
Re: Extract data elements from a large HTML text file
Hi..
As a follow up... try this one that builds an array and then writes it to the sheet at the end..
It should be much faster on your larger data set..
You can test out both methods in the attached workbook.
Code
Display MorePrivate Sub CommandButton2_Click() Dim i As Long, x, xx1, xx2, xx3, s As String, strFile As String, y, z, a As Long, c As Long Application.ScreenUpdating = False strFile = Application.GetOpenFilename("Text Files,*.txt") If strFile = "False" Then Exit Sub s = CreateObject("scripting.filesystemobject").getfile(strFile).openastextstream.readall z = Split(s, "Account #") ReDim y(1 To UBound(z), 1 To 3) x = Split(s, vbLf & vbCrLf) a = 1: b = 1: c = 1 For i = LBound(x) To UBound(x) xx1 = Split(x(i), "Account #") xx2 = Split(x(i), "Diarized for:") xx3 = Split(x(i), "Last Viewed:") If UBound(xx1) > 0 Then If UBound(xx1) = 1 Then If UBound(Split(Trim(xx1(1)), " ")) <= 3 Then y(a, 1) = Trim(xx1(1)): a = a + 1 Else y(a, 1) = Split(xx1(1), " ")(3): a = a + 1 End If End If End If If UBound(xx2) > 0 Then y(a, 2) = Left(xx2(1), 31) End If If UBound(xx3) > 0 Then y(c, 3) = Split(Trim(xx3(1)), " ")(0): c = c + 1 End If Next i Sheets("Sheet2").Range("A2").Resize(UBound(y), 3).Value = y Sheets("Sheet2").Select Application.ScreenUpdating = True MsgBox "Data has been imported.", vbInformation Application.ScreenUpdating = True End Sub
Participate now!
Don’t have an account yet? Register yourself now and be a part of our community!