Word files corrupted - how to recover?

Discussion in 'other software & services' started by newbino, Apr 10, 2013.

Thread Status:
Not open for further replies.
  1. newbino

    newbino Registered Member

    Joined:
    Aug 13, 2007
    Posts:
    377
    Recently I have been digging for old Word (2007) files and I have found that some of them are corrupted. How this happened is a mystery to me, and if someone could comment I would appreciate it.

    However, what I am concerned with is recovering some of them. Some I can let go without much trouble, but others are very important. I would appreciate any help in recovering them, either with procedures I am not yet familiar with or with specific software. Free would be preferable, but if the best-in-class tool is paid and I need it, I'll buy it.

    Thanks for your opinions.
     
  2. Mrkvonic

    Mrkvonic Linux Systems Expert

    Joined:
    May 9, 2005
    Posts:
    8,695
    What format?
    Mrk
     
  3. Mrkvonic

    Mrkvonic Linux Systems Expert

    Joined:
    May 9, 2005
    Posts:
    8,695
    For docx files:

    Unzip the file - unzip file.docx
    Cd into word sub-directory.
    There's a file named document.xml.
    That's your main file.

    You also have the media subdirectory with images.
    And so forth.

    Now, you need something that can work with regex to parse all those <> out of there. I'm going to write a little regex and get back to you.

    For doc:

    I've written a parser that will do it for you - originally, I wrote it for parsing messages out of Outlook email, but it will work here and grab most of your stuff - you just need to compile - best if you use Linux. It strips out some of the special characters, but you will get the idea.

    #include <stdio.h>
    #include <stdlib.h>

    /* Print only the readable ASCII bytes (plus tab, LF, CR) of a file. */
    static void ReadFile(const char *name)
    {
        FILE *file;
        unsigned char *buffer;
        long fileLen;
        long i;

        // Open file in binary mode
        file = fopen(name, "rb");
        if (!file)
        {
            fprintf(stderr, "Unable to open file %s\n", name);
            return;
        }

        // Get file length
        fseek(file, 0, SEEK_END);
        fileLen = ftell(file);
        fseek(file, 0, SEEK_SET);
        if (fileLen < 0)
        {
            fprintf(stderr, "Unable to determine size of %s\n", name);
            fclose(file);
            return;
        }

        // Allocate memory
        buffer = malloc((size_t)fileLen + 1);
        if (!buffer)
        {
            fprintf(stderr, "Memory error!\n");
            fclose(file);
            return;
        }

        // Read file contents into buffer
        if (fread(buffer, 1, (size_t)fileLen, file) != (size_t)fileLen)
        {
            fprintf(stderr, "Read error!\n");
            fclose(file);
            free(buffer);
            return;
        }
        fclose(file);

        // Drop non-ASCII and control bytes; keep printable text, tab, LF and CR
        // (126 rather than 127 as the upper bound, because 127 is the DEL control code)
        for (i = 0; i < fileLen; ++i)
        {
            int charv = buffer[i];
            if (((charv >= 32) && (charv <= 126)) || (charv == 9) || (charv == 10) || (charv == 13))
                putchar(charv);
        }

        printf("\n");

        free(buffer);
    }

    int main(int argc, char *argv[])
    {
        // Expect exactly one argument: the file to read
        if (argc != 2)
        {
            fprintf(stderr, "Usage: %s <file>\n", argv[0]);
            return 1;
        }
        ReadFile(argv[1]);
        return 0;
    }



    There you go, for free!

    Cheers,
    Mrk
     
  4. newbino

    newbino Registered Member

    Joined:
    Aug 13, 2007
    Posts:
    377
    Mrkvonic, thanks for the fast reply and it looks like you put some work into this, I appreciate it!

    The affected files are both .doc and .docx.

    About what you wrote, I am afraid that my head is swimming - I simply don't get it.

    For example:
    The files are not zipped, how can I unzip them?
    I am not sure what it means

    I really don't have a clue how to proceed here. Either you detail a step-by-step procedure, though I would not presume to ask you for that, or I am afraid this is the classic "pearls before swine".

    Thanks a lot!
     
  5. Mrkvonic

    Mrkvonic Linux Systems Expert

    Joined:
    May 9, 2005
    Posts:
    8,695
    I will detail it all. The problem is, some of the tools I can think of need Linux, even a live CD one. Any chance you can grab one? Or use one?
    Mrk
     
  6. newbino

    newbino Registered Member

    Joined:
    Aug 13, 2007
    Posts:
    377
    Thanks Mrk, you are very kind. I don't see a problem with Linux. I installed Ubuntu in my VM some time ago just to have a look at it and play around, but it was a very basic experience, and I uninstalled it afterwards. I'll get a recent distro.
     
  7. Mrkvonic

    Mrkvonic Linux Systems Expert

    Joined:
    May 9, 2005
    Posts:
    8,695
    Excellent, so you will be able to accomplish all steps easily.

    In Ubuntu, for docx:

    unzip file.docx

    docx are actually zipped archives.
    Inside, you will find the sub-folder named word.
    And inside document.xml.
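
    If you only want document.xml and not the whole unpacked tree, the step above can be sketched as a small shell function. This is only an illustration: it assumes the Info-ZIP unzip tool is installed, and the function name docx_document_xml is made up for the example.

    ```shell
    # Hypothetical helper: print word/document.xml from a .docx without unpacking it.
    # Assumes the Info-ZIP "unzip" tool is available on the system.
    docx_document_xml() {
        # -p writes the named archive member to standard output
        unzip -p "$1" word/document.xml
    }

    # Example (file.docx is a placeholder name):
    # docx_document_xml file.docx > document.xml
    ```

    If the archive itself is badly damaged, unzip may refuse to read it, so treat this as a first attempt and not a guaranteed fix.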

    So I am working on the most elegant regex to parse that xml.

    For doc files, take my code and compile it - just paste it into a text file named code.c:

    gcc code.c -o parser

    Chmod to executable - chmod 755 parser

    Then run it against relevant files:

    ./parser file.doc

    And you will get lovely output :)

    Won't preserve images and styling, but you will get your text.
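
    If compiling is a hurdle, roughly the same filtering can be sketched with the standard tr tool. This is an alternative for illustration, not the parser itself: the function name printable_ascii is made up, and the byte ranges are written in octal (tab, LF, CR, then space through ~).

    ```shell
    # Hypothetical helper: keep tab, LF, CR and printable ASCII (space through ~),
    # and delete every other byte. LC_ALL=C makes tr treat the input as raw bytes.
    printable_ascii() {
        LC_ALL=C tr -cd '\11\12\15\40-\176'
    }

    # Example (file.doc and the output name are placeholders):
    # printable_ascii < file.doc > file-text.txt
    ```

    As with the compiled parser, the output is plain text mixed with some leftover junk, so expect to clean it up by hand.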

    And of course, there are also header fixes, but that means hacking in a hex editor.
    We'll do that later.
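
    Before any hex editing, it helps to see what the first bytes of a file actually are. A minimal sketch, assuming the usual signatures: a docx, being a zip archive, normally starts with the bytes 50 4B 03 04 ("PK"), and a binary doc (an OLE compound file) with D0 CF 11 E0 A1 B1 1A E1. The function name word_file_kind is hypothetical.

    ```shell
    # Hypothetical helper: report what the first bytes of a file look like.
    word_file_kind() {
        # Read the first 8 bytes as lowercase hex, with whitespace removed
        sig=$(od -An -tx1 -N8 "$1" | tr -d ' \t\n')
        case "$sig" in
            504b0304*)        echo "docx (zip archive)" ;;
            d0cf11e0a1b11ae1) echo "doc (OLE compound file)" ;;
            *)                echo "unknown signature: $sig" ;;
        esac
    }

    # Example (file.doc is a placeholder name):
    # word_file_kind file.doc
    ```

    An unknown signature suggests the start of the file is damaged; a known one suggests the damage is further inside.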

    Mrk
     
  8. newbino

    newbino Registered Member

    Joined:
    Aug 13, 2007
    Posts:
    377
    Super, I'll play around with this during the weekend and report back :)

    Just to be sure, as I repeat I am really not familiar with Linux, how do I send the command:
    or
    thanks!
     
  9. Mrkvonic

    Mrkvonic Linux Systems Expert

    Joined:
    May 9, 2005
    Posts:
    8,695
    You need to use the command line.
    Open a terminal window.

    Still working on the regex for xml, which is something that shouldn't really be done, but I'm trying to compose a nifty one-liner :)

    Mrk
     
  10. newbino

    newbino Registered Member

    Joined:
    Aug 13, 2007
    Posts:
    377
    To my eyes, what you are doing is magic ;)

    Any sufficiently advanced technology is indistinguishable from magic.
    Arthur C. Clarke
     
  11. Mrkvonic

    Mrkvonic Linux Systems Expert

    Joined:
    May 9, 2005
    Posts:
    8,695
    And here's an even simpler way for you to format xml.

    Install Notepad++.
    Open the document.xml file in it.

    Menu > TextFX > TextFX Convert > Strip HTML tags table tabs
    Menu > TextFX > TextFX Convert > Strip HTML tags table nontabs

    And you have your text. You will lose paragraphs.

    That said, the magical one liner will be there.

    So basically, you're ready to go!
    Try it!

    Mrk
     
  12. Mrkvonic

    Mrkvonic Linux Systems Expert

    Joined:
    May 9, 2005
    Posts:
    8,695
    And here's the regex, in case you want to do it in Linux (GNU sed; -E turns on the extended syntax that the groups and + need):
    cat document.xml | sed 's/<w/\n<w/g' | sed -E 's/<.*>(.[^>]*)<\/.*>/\1/g' | sed 's/<.*>//g' | sed 's/^$/ /g' | sed -E 's/[[:space:]]+/ /g' > text-file-results.txt

    Cheers,
    Mrk
     
  13. Mrkvonic

    Mrkvonic Linux Systems Expert

    Joined:
    May 9, 2005
    Posts:
    8,695
    And if you really want a fancy one-liner:
    cat document.xml | sed -E 's/<\/?[^>]*>/ /g' > text.txt
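
    Stripping every tag also throws away the paragraph breaks. A hedged sketch of a variant that keeps them, using awk instead of sed purely for illustration: it assumes each paragraph in document.xml ends with </w:p>, and the function name xml_to_text is made up.

    ```shell
    # Hypothetical helper: turn WordprocessingML into plain text, one paragraph per line.
    xml_to_text() {
        awk '{
            gsub(/<\/w:p>/, "\n")    # each closing paragraph tag becomes a line break
            gsub(/<[^>]*>/, "")      # drop every remaining tag
            gsub(/&lt;/, "<")        # decode the basic XML entities
            gsub(/&gt;/, ">")
            gsub(/&quot;/, "\"")
            gsub(/&apos;/, "\047")
            gsub(/&amp;/, "\\&")     # decoded last so that &amp;lt; ends up as &lt;
            printf "%s", $0
        }' "$@"
    }

    # Example (file names are placeholders):
    # xml_to_text document.xml > text.txt
    ```

    Tables, footnotes and styling are still lost; this only recovers the running text.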

    Mrk
     
    Last edited: Apr 13, 2013
  14. Mrkvonic

    Mrkvonic Linux Systems Expert

    Joined:
    May 9, 2005
    Posts:
    8,695
    How to recover corrupt Microsoft Word files

    Something quite painful, tricky, geeky, and difficult for today: a long, extensive tutorial about how to recover data from corrupt Microsoft Word files, including both DOCX and DOC examples, demonstrated both in Windows and Linux using free tools and methods, focusing on expectations, extracting XML files from DOCX archives, XML manipulation using TextFX in Notepad++, other XML tools, regular expressions, strings utility for retrieving text from binary DOC files, other tips and resources, and more. Have fun. Don't forget to thank me when you end up not needing to call special recovery services that cost a kidney. You're welcome.

    http://www.dedoimedo.com/computers/microsoft-word-recover-corrupt-files.html


    Cheers,
    Mrk
     