Ext4 - heavy discussion about possible data loss

tlu · Mar 18, 2009

There is a bug report on launchpad.net (with 175 comments!) about possible data loss when using the Ext4 file system. A roundup can be found on heise online. Particularly interesting is a post by Ext4 developer Ted Ts'o in which he explains the background of the problem. The discussion continues on Ted's blog.

Here's the background (I'm giving credit to another German heise article that summarizes the issue):

If applications overwrite a file with new or changed data (e.g. a configuration file after the user has changed a setting), they often create a temporary file first and then rename it over the old one with the system call rename(). The logic: if something goes wrong during the write (such as a crash or a power failure), at least the old version of the file is kept.
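To make the pattern concrete, here is a minimal sketch in Python. The function name replace_via_rename is made up for illustration. On Linux, os.rename() ends up in the rename() system call. Note that nothing here forces the data to disk, which is exactly the situation discussed in the rest of this post.

```python
import os
import tempfile


def replace_via_rename(path, data):
    """Write data to a temporary file, then rename it over path.

    Hypothetical helper: it deliberately does NOT call fsync(), so it
    mirrors the pattern described above, not a crash-safe variant.
    """
    directory = os.path.dirname(os.path.abspath(path))
    # The temporary file must live in the same directory (same file
    # system) as the target, otherwise rename() cannot replace it.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    with os.fdopen(fd, "wb") as tmp:
        tmp.write(data)
    # The directory entry for path now points to the new inode.
    os.rename(tmp_path, path)
```

A reader of the file sees either the old or the new content while the system is running. What survives a crash is a separate question and depends on the file system.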

Two things then happen. On the one hand, metadata in the file system changes: an inode is created for the new file, the data is referenced, and a new directory entry is created that points to the new inode. During rename(), the directory entry of the old file is modified so that it points to the new inode. On the other hand, the data itself is written: the file system first has to allocate a sufficient number of data blocks on the disk and then write the data into these blocks.

Ext3 and Ext4 write all metadata changes to their journal first. Thus, even after the rename(), nothing has changed in the file system itself yet. If the power fails at this point, the new file doesn't exist in the file system: the directory entry still points to the old inode and hence to the old data, and the modified metadata in the journal is not yet valid. For it to become valid, a "commit" of the changes in the journal is necessary. Only after that commit is the modified metadata written to the file system, either some time later or at the next reboot after a crash.

And here's the crucial difference between Ext3 and Ext4. Ext3 (with the standard mount option "data=ordered") commits the metadata in the journal only once the data of the new file has actually been written to disk (which can take up to five seconds while the data is cached). This is meant to prevent a newly created file from showing stale data after a crash, which could happen if its allocated blocks previously belonged to a deleted file and had not yet been overwritten with the new data. As a result, after a system crash the file contains either the old or the new data, depending on whether the crash happened before or after the commit.

Ext4, however, introduces another mechanism called delayed block allocation. After a file is closed, it can take up to one minute until data blocks on the disk are actually allocated. Delayed allocation lets the file system optimize writes better, but at a price: until the allocation happens, the metadata of a newly created file shows a size of 0 bytes and no occupied data blocks. If the system crashes during this period, the rename() operation in the journal may already be committed although the new file contains no data yet. Result: after a crash the file is empty, and both the old and the new data are lost!

Ted Ts'o stresses in his response to the bug report that Ext4 behaves exactly as the POSIX standard specifies for file operations. Moreover, other file systems like XFS show the same behavior. He says that the "safer" behavior of Ext3 is merely an accidental byproduct. For Ts'o, the application developers are the culprits because they take the good-natured behavior of Ext3 for granted. His advice: if an application wants to be sure that data is actually written to disk, it must call fsync() before closing the file.
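Applied to the temp-file-and-rename pattern, that advice could look roughly like the following Python sketch. The function name replace_durably is hypothetical. The fsync() on the directory at the end is my own addition and not part of Ted's advice as summarized here; it is a commonly used extra step to make the rename itself durable.

```python
import os
import tempfile


def replace_durably(path, data):
    """Write data to a temp file, fsync() it, then rename it over path.

    Hypothetical helper, a sketch of the advice above rather than a
    reference implementation.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()             # move Python's buffer into the kernel
            os.fsync(tmp.fileno())  # ask the kernel to write the data to disk
        os.rename(tmp_path, path)   # only now replace the old file
    except BaseException:
        # Something failed before the rename completed: drop the temp file.
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise
    # Extra step (assumption, see above): flush the directory entry too.
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

The price is latency: fsync() blocks until the disk has the data, which is why applications that rewrite small files very often tend to skip it.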

Nevertheless, as a workaround he has written patches for Ext4 that recognize rename() situations (and a second variant that uses ftruncate()) and treat them as Ext3 would. Moreover, Ted has also written a genuine solution: the new Ext4 mount option "alloc_on_commit" is equivalent to "data=ordered" in Ext3, meaning that metadata in the journal is committed only after the blocks are allocated and the data is written. It will probably be available in kernel 2.6.30.

Mrkvonic · Mar 18, 2009

It will take time until ext4 is fully usable.
Mrk

Arup · Mar 18, 2009

Much as I would like to use ext4 when the new Ubuntu 9 comes out, I guess I will stick to the tried and tested ext3 until all the bugs are weeded out. File system bugs are not to be taken lightly under any circumstances.

tlu · Mar 18, 2009

Mrkvonic said:

It will take time until ext4 is fully usable.
Mrk

Well, Ext4 seems to be pretty stable according to what I've read so far. And I agree with Ted that it's not an Ext4-specific problem but rather a problem with the application developers, as other modern file systems also use delayed block allocation. But it's also an illusion to think that every developer will adapt his/her app in due time. That's why it's important that Ted wrote the mentioned patches to handle this problem. If they prove reliable, Ext4 will be a success.

NGRhodes · Mar 18, 2009

I thought EXT4 supported data=ordered as a mount option? Is it possible to have delayed allocation with ordered mode?

Interestingly, there has been some bad programming going on: coders ASSUMING the file has been written successfully, then performing a further action on the file regardless. That would be like trying to send data to a location without first verifying a connection.

As mentioned, the ONLY way to guarantee this is to issue an fsync(). ANY filesystem mounted without ordered or full data journalling is susceptible to this; even ZFS under FUSE suffers from this issue.

tlu · Mar 19, 2009

N1ckR said:

I thought EXT4 supported data=ordered as a mount option? Is it possible to have delayed allocation with ordered mode?

It seems it doesn't. Quote from http://article.gmane.org/gmane.comp.file-systems.ext4/12179 :
"Add an ext3 bug-for-bug compatible analogue for data=ordered mode. In
this mode, we force all delayed allocation blocks involved with the
to-be-commited transaction to be allocated, and then flushed out to
disk before the transaction is commited."

and from https://bugs.launchpad.net/ubuntu/+source/linux/+bug/317781/comments/118 :
"Secondly, you can turn off delayed allocation for ext4. If you mount the filesystem with the nodelalloc mount option, you will basically get the old ext3 behaviour with data=ordered."
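To check which Ext4 file systems on a machine are mounted with nodelalloc, something like the following Python sketch might help. It parses text in the format of /proc/mounts. The function name and the sample mount table are made up for illustration, and it assumes that the kernel lists nodelalloc among the options in /proc/mounts when the option is set.

```python
def ext4_delalloc_status(mount_table):
    """Map each ext4 mount point to True if delayed allocation looks enabled.

    mount_table is text in /proc/mounts format:
    device mount_point fs_type options dump pass
    Assumption: "nodelalloc" shows up in the options column when set.
    """
    status = {}
    for line in mount_table.splitlines():
        fields = line.split()
        if len(fields) < 4 or fields[2] != "ext4":
            continue  # skip malformed lines and other file systems
        mount_point = fields[1]
        options = fields[3].split(",")
        status[mount_point] = "nodelalloc" not in options
    return status


# Hypothetical sample; on a real system pass open("/proc/mounts").read().
sample = (
    "/dev/sda1 / ext4 rw,relatime 0 0\n"
    "/dev/sda2 /home ext4 rw,relatime,nodelalloc 0 0\n"
    "/dev/sdb1 /data ext3 rw,data=ordered 0 0\n"
)
print(ext4_delalloc_status(sample))
```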
