Word files corrupted - how to recover?

newbino · Apr 10, 2013

Recently I have been digging for old Word (2007) files and I have found that some of them are corrupted. How this happened is a mystery to me, and if someone could comment I would appreciate it.

However, what I am concerned with is recovering some of them. Whereas some I can let go without much issue, others are very important. I would appreciate any help in recovering them, either with procedures I am yet unfamiliar with, or using specific software. Free would be preferable, but if best-in-class is pay and I need it, I'll buy it.

Thanks for your opinions.

Mrkvonic · Apr 10, 2013

What format?
Mrk

Mrkvonic · Apr 10, 2013

For docx files:

Unzip the file - unzip file.docx
Cd into word sub-directory.
There's a file named document.xml.
That's your main file.

You also have the media subdirectory with images.
And so forth.

Now, you need something that can work with regex to parse all those <> out of there. I'm going to write a little regex and get back to you.

For doc:

I've written a parser that will do it for you - originally, I wrote it for parsing messages out of Outlook email, but it will work here and grab most of your stuff - you just need to compile - best if you use Linux. It strips out some of the special characters, but you will get the idea.

#include <stdio.h>
#include <stdlib.h>

void ReadFile(const char *name);

int main(int argc, char *argv[])
{
    // Expect exactly one argument: the file to read
    if (argc != 2)
    {
        fprintf(stderr, "Usage: %s <file>\n", argv[0]);
        return 1;
    }
    ReadFile(argv[1]);
    return 0;
}

void ReadFile(const char *name)
{
    FILE *file;
    char *buffer;
    long fileLen;
    size_t bytesRead;
    size_t i;

    // Open file
    file = fopen(name, "rb");
    if (!file)
    {
        fprintf(stderr, "Unable to open file %s\n", name);
        return;
    }

    // Get file length
    fseek(file, 0, SEEK_END);
    fileLen = ftell(file);
    fseek(file, 0, SEEK_SET);
    if (fileLen < 0)
    {
        fprintf(stderr, "Unable to get size of %s\n", name);
        fclose(file);
        return;
    }

    // Allocate memory
    buffer = (char *)malloc((size_t)fileLen + 1);
    if (!buffer)
    {
        fprintf(stderr, "Memory error!\n");
        fclose(file);
        return;
    }

    // Read file contents into buffer; bytesRead is what we actually got
    bytesRead = fread(buffer, 1, (size_t)fileLen, file);
    fclose(file);

    for (i = 0; i < bytesRead; ++i)
    {
        // Go through unsigned char so bytes above 127 do not turn negative
        int charv = (unsigned char)buffer[i];
        // Keep printable ASCII plus tab, line feed and carriage return; drop the rest
        if (((charv >= 32) && (charv <= 126)) || (charv == 9) || (charv == 10) || (charv == 13))
            printf("%c", charv);
    }

    printf("\n");

    free(buffer);
}

There you go, for free!

Cheers,
Mrk

newbino · Apr 10, 2013

Mrkvonic, thanks for the fast reply and it looks like you put some work into this, I appreciate it!

The affected files are both .doc and .docx.

About what you wrote, I am afraid that my head is swimming - I simply don't get it.

For example:

For docx files:
Unzip the file - unzip file.docx

The files are not zipped, how can I unzip them?

Cd into word sub-directory.

I am not sure what it means

For doc:
I've written a parser that will do it for you - originally, I wrote it for parsing messages out of Outlook email, but it will work here and grab most of your stuff - you just need to compile - best if you use Linux.

I really don't have a clue how to proceed here. Unless you detail a step-by-step procedure, which I would not presume to ask of you, I am afraid this is the classic "pearls before swine".

Thanks a lot!

Mrkvonic · Apr 10, 2013

I will detail it all. The problem is, some of the tools I can think of need Linux, even a live CD one. Any chance you can grab one? Or use one?
Mrk

newbino · Apr 10, 2013

Mrkvonic said:

I will detail it all. The problem is, some of the tools I can think of need Linux, even a live CD one. Any chance you can grab one? Or use one?
Mrk

Thanks Mrk, you are very kind. I don't see a problem with Linux. I installed Ubuntu in a VM some time ago just to have a look at it and play around, but it was a very basic experience and I uninstalled it. I'll get a recent distro.

Mrkvonic · Apr 10, 2013

Excellent, so you will be able to accomplish all steps easily.

In Ubuntu, for docx:

unzip file.docx

docx are actually zipped archives.
Inside, you will find the sub-folder named word.
And inside document.xml.
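
If you only want the main text file and not the whole unpacked tree, something along these lines should work. This is a minimal sketch, assuming the Info-ZIP unzip that Ubuntu ships; the function name extract_document_xml is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: copy word/document.xml out of a .docx into an output file.
# Usage: extract_document_xml file.docx [output.xml]
extract_document_xml() {
    docx=$1
    out=${2:-document.xml}
    # -p writes the named archive member to stdout instead of unpacking to disk
    unzip -p "$docx" word/document.xml > "$out"
}

# To see everything inside the archive first, run:
#   unzip -l file.docx
```

If unzip complains that the archive is damaged, Info-ZIP's zip -FF file.docx --out fixed.docx can sometimes rebuild enough of it to get document.xml out. Results vary with how bad the corruption is.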

So I am working on the most elegant regex to parse that xml.

For doc files, take my code and compile it - just paste it into a text file named code.c:

gcc code.c -o parser

Chmod to executable if needed (gcc normally marks its output executable already) - chmod 755 parser

Then run it against relevant files:

./parser file.doc

And you will get lovely output

Won't preserve images and styling, but you will get your text.
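
A similar result is possible without compiling anything, using the strings utility. This is a rough sketch, assuming GNU strings from binutils; the -e l option for 16-bit little-endian text is GNU-specific, and the function name doc_strings is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: dump readable text runs from a binary .doc file.
# Usage: doc_strings file.doc > recovered.txt
doc_strings() {
    # Runs of at least 4 printable single-byte characters
    strings -n 4 "$1"
    # Runs stored as 16-bit little-endian characters, which Word also uses
    strings -n 4 -e l "$1"
}
```

Expect a fair amount of junk (font names, metadata) mixed in with the actual text, so the output will need cleaning up by hand.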

And of course, there's also header fixes, but that's hacking in hex editor.
We do that later.
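
Before any hex editing, it helps to check whether the header is still intact. A healthy docx starts with the zip signature (bytes 50 4B 03 04) and a healthy doc starts with the OLE2 compound file signature (D0 CF 11 E0 A1 B1 1A E1). The sketch below only reads the first bytes and compares them; the function names magic_hex and guess_word_format are made up for illustration.

```shell
#!/bin/sh
# Print the first 8 bytes of a file as lowercase hex with no separators.
magic_hex() {
    od -An -tx1 -N8 "$1" | tr -d ' \t\n'
}

# Hypothetical helper: guess the container type from the leading bytes.
# Usage: guess_word_format file.doc
guess_word_format() {
    case "$(magic_hex "$1")" in
        504b0304*)         echo "zip container (docx)" ;;
        d0cf11e0a1b11ae1*) echo "OLE2 compound file (doc)" ;;
        *)                 echo "unknown header" ;;
    esac
}
```

An "unknown header" result on a file that used to open in Word suggests the damage is at the very start of the file, which is the case where a header fix is worth trying.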

Mrk

newbino · Apr 12, 2013

Super, I'll play around with this during the weekend and report back

Just to be sure, as I repeat I am really not familiar with Linux, how do I send the command:

unzip file.docx

or

./parser file.doc

thanks!

Mrkvonic · Apr 12, 2013

You need to use the command line.
Open a terminal window.

Still working on the regex for xml, which is something that shouldn't really be done, but I'm trying to compose a nifty one-liner.

Mrk

newbino · Apr 12, 2013

To my eyes, what you are doing is magic

Any sufficiently advanced technology is indistinguishable from magic.
Arthur C. Clarke

Mrkvonic · Apr 12, 2013

And here's an even simpler way for you to format xml.

Install notepad++.
Open the document.xml file in it.

Menu > TextFX > TextFX Convert > Strip HTML tags table tabs
Menu > TextFX > TextFX Convert > Strip HTML tags table nontabs

And you have your text. You will lose paragraphs.

That said, the magical one liner will be there.

So basically, you're ready to go!
Try it!

Mrk

Mrkvonic · Apr 13, 2013

And here's the regex, in case you want to do it in Linux (GNU sed):
cat document.xml | sed 's/<w/\n<w/g' | sed -E 's/<[^>]*>([^<]*)<\/[^>]*>/\1/g' | sed -E 's/<[^>]*>//g' | sed 's/^$/ /g' | sed -E 's/[[:space:]]+/ /g' > text-file-results.txt

Cheers,
Mrk

Mrkvonic · Apr 13, 2013

And if you really want a fancy one-liner:
cat document.xml | sed 's/<[^>]*>/ /g' > text.txt
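
If losing paragraphs is a problem, here is a variant that tries to keep them. It is a minimal sketch, assuming GNU sed (for \n in the replacement) and assuming that each paragraph in document.xml ends with a closing w:p tag; the function name docx_xml_to_text is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: print the text of a document.xml, one paragraph per line.
# Usage: docx_xml_to_text document.xml > text.txt
docx_xml_to_text() {
    # 1. Join the XML into a single line so tags split across lines are handled.
    # 2. Turn every paragraph end (</w:p>) into a newline.
    # 3. Drop all remaining tags.
    # 4. Decode the five predefined XML entities, &amp; last.
    tr -d '\r\n' < "$1" |
        sed 's|</w:p>|\n|g' |
        sed -e 's/<[^>]*>//g' \
            -e 's/&lt;/</g' -e 's/&gt;/>/g' \
            -e 's/&quot;/"/g' -e "s/&apos;/'/g" \
            -e 's/&amp;/\&/g'
}
```

Tables, footnotes and text boxes will still come out flattened, since this only looks at paragraph ends and ignores everything else about the structure.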

Mrk

Mrkvonic · Nov 11, 2013

How to recover corrupt Microsoft Word files

Something quite painful, tricky, geeky, and difficult for today: a long, extensive tutorial about how to recover data from corrupt Microsoft Word files, including both DOCX and DOC examples, demonstrated both in Windows and Linux using free tools and methods, focusing on expectations, extracting XML files from DOCX archives, XML manipulation using TextFX in Notepad++, other XML tools, regular expressions, strings utility for retrieving text from binary DOC files, other tips and resources, and more. Have fun. Don't forget to thank me when you end up not needing to call special recovery services that cost a kidney. You're welcome.

http://www.dedoimedo.com/computers/microsoft-word-recover-corrupt-files.html

Cheers,
Mrk

newbino Registered Member

Mrkvonic Linux Systems Expert
