Wildcard search directory of Word Docs and output hits to Excel

  • Hi,


    My basic requirement is to search a directory of word documents to see what other documents they reference in their text. Basically, these are specific task procedures that refer back to standard operating procedures. As the SOPs are revised/superceded/obseleted, I'd like a way to pull the references out and understand which specific task procedures are affected.


    I found a similar requirement and helpful code from cytop in this thread:
    Search multiple word documents for specific content and paste into excel.


    My search requirement is different from that thread in that I don't know the individual document numbers beforehand, but the document numbers I'd like to search for are in a format of:
    abc-####-##
    abc-####-##-###
    xyz-####-##-###
    where abc and xyz are letters and # are numbers. I'd like to search for and record into Excel any references in the target documents. There are a limited number of alphabetic sequences, so I'd probably search for something like:
    abc-*
    xyz-*


    The code provided by cytop in that thread is great, but I'd like to extend it. I modified it slightly (see below) to use a wildcard search and count the number of hits in each document. I have not had luck in retrieving the "whole word" of the hit that was found. Once I have that I believe that I could easily shove it into a cell of the spreadsheet.



    Thanks,
    Jim

  • Re: Wildcard search directory of Word Docs and output hits to Excel


    Rereading my post, I'm wondering if I was explicit enough in the question I was asking. To clarify I'd like to:
    - Search a directory of Word Docs (done by original script)
    - For a set of strings with wildcards (done by modified script)
    - Output the matched pattern from each hit to Excel (need help here)


    Put in a different form, I'd like to grep the Word docs for a regular expression and get a list of the matches. I feel like I'm 90% of the way there, just need help in being able to grab the match pattern out of the Word doc. So if I search for "abc-????-??*", how do I grab the text at each hit that matches so that I get a list like this for each file:
    abc-1234-05
    abc-1234-06
    abc-2345-01
    abc-2345-02


    Thanks for reading,
    Jim

  • Re: Wildcard search directory of Word Docs and output hits to Excel


    You can do a 'regex' like search in Word, but it's limited in the number of terms it can have. I believe (using tablet so no Word) the max is about 7 'wildcards'. For example

    Code
    With Selection.Find
          .Text = "[A-z][A-z][A-z]-[0-9][0-9][0-9][0-9]"


    Which will get you a basic match - you just need to extend the selection (if found) to the next space character...

  • Re: Wildcard search directory of Word Docs and output hits to Excel


    cytop, Thanks for your reply (and original code).


    What property holds the matching text when the search is complete? Or does the selection change when the Find is executed? I tried changing from the Range find to a Selection find in the code, but I was not successful.

  • Re: Wildcard search directory of Word Docs and output hits to Excel


    Butchered a bit...


    Changes were done freehand and are not tested.


    This scans all Word files in the directory (Hard coded above - you need to change to suit your requirements) for a pattern XXX-9999 where X is any alpha character and 9 is any numeric character.


    At the moment, it only adds the matching (partial) string to Sheet1 of the workbook (Test this in a blank workboook) along with the filename - very rough.


    It needs a slight change to extract up to the next space - but as I said earlier, no copy of Word available and couldn't find any example code with a quick search... Simple example down this page a little - it's probably as good as it'll get.

  • Re: Wildcard search directory of Word Docs and output hits to Excel


    cytop, thanks again, you've gotten me pretty darn close. I'm trying a few things with the regular expression and getting closer. I'm thinking something like this to match abc- followed by any amount of numbers or dashes, followed by a whitespace.


    Code
    With oWRD.Selection.Find
                .Text = "[A-z][A-z][A-z]-[0-9-]*[\s]"


    Thought that would do it, but I am picking up some extraneous characters so I'll keep working until I understand Word's regular expression pecadillos.


    Thanks again.

  • Re: Wildcard search directory of Word Docs and output hits to Excel


    The simple example I linked where it extends the selection until it hits a space would be easier, admittedly slower, but all you have to do is trim() the selection...


    Minor changes, and that simplistic function included:


    (Tested - was winging it last time)


    One thing to watch out for - it'll go into an endless loop if the very last thing in the document is one of the patterns and there's no trailing space. Something you might want to look at :)

  • Re: Wildcard search directory of Word Docs and output hits to Excel


    Thanks for your help, cytop. Your extension function worked, though I modified it to look for a character that was not a number or dash, instead of looking for a space (since I can't guarantee the documents are followed by a space, but I want to capture all numbers and dashes). Then as I was looking up the Selection.MoveRight you used to tweak it, I found other Selection methods MoveEndUntil and MoveEndWhile. I think MoveEndUntil could be used to replace your ExtendSelection function, I used MoveEndWhile to replace my version. Here's my new search loop:



    Seems to be doing exactly what I wanted. Couldn't have gotten to this point without your help.


    Thanks,
    Jim

  • Re: Wildcard search directory of Word Docs and output hits to Excel


    I have one negative with my approach above. When using the MoveEndWhile method, if the document number is in a hyperlink and it selects the end of the hyperlink, the start of the selection moves to the start of the hyperlink and I end up copying the whole hyperlink, as I can't seem to then get the start to move. Any ideas or links would be appreciated.


    For instance, in the Word document is a link such as:


    http://docserver.mycompany.com…px?DocNum=abc-1234-05-123


    My search finds the hit correctly, but then as I move the selection end to the first non-number/non-dash as described in the previous post, the entire hyperlink gets selected. Then my results looks like this:


    abc-1234-05
    abc-1234-06
    abc-2345-01
    abc-2345-02
    http://docserver.mycompany.com…px?DocNum=abc-1234-05-123


    I can detect if the search hit is hyperlinked, but I haven't found the secret sauce to finding the last hyperlinked character yet.


    Thanks,
    Jim

  • Re: Wildcard search directory of Word Docs and output hits to Excel


    I think I solved my own issue this time. Since the document number is generally the last thing in the link, if I detect a hyperlink, I just moved the Selection.End to be equal to the Selection.Hyperlinks.Item(1).Range.End - 1. That selected just the document number and got me what I wanted. A little dirty, but seems to be working in my case.


    -Jim

Participate now!

Don’t have an account yet? Register yourself now and be a part of our community!