tlu
March 18th, 2009, 12:05 PM
There is a bug report (https://bugs.edge.launchpad.net/ubuntu/+source/linux/+bug/317781) on launchpad.net (with 175 comments!) about possible data loss when using the Ext4 file system. A roundup can be found on heise online (http://www.h-online.com/open/Possible-data-loss-in-Ext4--/news/112821). Particularly interesting is this (https://bugs.edge.launchpad.net/ubuntu/+source/linux/+bug/317781/comments/45) post by Ext4 developer Ted Ts'o, where he explains the background of the problem. The discussion continues on Ted's blog (http://thunk.org/tytso/blog/2009/03/15/dont-fear-the-fsync/).
Here's the background (I'm giving credit to another German heise article that summarizes the issue):
If applications overwrite a file with new or changed data (e.g. a configuration file after the user has changed a setting), they often create a temporary file first and then rename it over the original with the system call rename(). The logic: if something goes wrong during the write (a crash or a power failure), at least the old version of the file is kept.
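To make the pattern concrete, here is a minimal Python sketch of "write a temporary file, then rename it over the original". The function name replace_file_via_rename is made up for illustration, and the sketch assumes a POSIX system where os.rename() maps to rename() and replaces an existing target. It deliberately omits fsync(), because that is the pattern the bug report is about.

```python
import os
import tempfile

def replace_file_via_rename(path, data):
    """Write data to a temporary file, then rename it over path.

    Hypothetical helper for illustration only. There is no fsync() here,
    so nothing guarantees the data is on disk when the rename is committed.
    """
    directory = os.path.dirname(os.path.abspath(path))
    # The temporary file has to live on the same file system as the target,
    # otherwise rename() cannot swap the directory entry in one step.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
        # Assumes POSIX semantics: an existing target is replaced.
        os.rename(tmp_path, path)
    except BaseException:
        # Remove the leftover temporary file if the rename never happened.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

Calling replace_file_via_rename("settings.conf", b"volume=7\n") leaves either the old or the new settings.conf visible to other processes while the system keeps running; what survives a crash is a separate question.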
The following two things happen. On the one hand, metadata in the file system changes: for the new file an inode is created, the data is referenced, and a new directory entry is created which points to the new inode. During rename() the directory entry of the old file is modified so that it points to the new inode. On the other hand, the data itself is written: the file system has to allocate a sufficient number of data blocks on the disk first, and then write the data to these blocks.
Ext3 and Ext4 write all changes to the metadata to their journal first. Thus, even after the rename(), nothing has changed in the file system itself yet. If the power fails at this point, the new file doesn't exist in the file system: the directory entry of the old file still points to the old inode and hence to the old data, while the modified metadata in the journal is not yet valid. For it to become valid, a "commit" of the changes in the journal is necessary - only then is the modified metadata written to the file system after some time (or at the next reboot after a crash).
And here's a crucial difference between Ext3 and Ext4. Ext3 (with the standard mount option "data=ordered") commits the metadata in the journal only once the data of the new file has actually been written to disk (which can take up to five seconds while the data is cached). This is meant to prevent a newly created file from containing stale data after a crash, which could happen if its allocated data blocks previously belonged to a deleted file and had not yet been overwritten with the new data. Thus, after a system crash the file contains either the old or the new data - depending on whether the crash happened before or after the commit.
Ext4, however, introduces another mechanism called delayed block allocation. After a file is closed, it can take up to one minute until data blocks on the disk are actually allocated. With delayed block allocation the file system can optimize writes better - but at the price that the metadata of a newly created file shows a file size of 0 bytes and no occupied data blocks until the delayed allocation happens. If the system crashes during this period, the rename() operation in the journal may already have been committed although the new file contains no data yet. Result: after a crash the file is empty, and both the old and the new data are lost!
Ted Ts'o stresses in his response to the bug report that Ext4 behaves exactly as the POSIX standard specifies for file operations. Moreover, other file systems like XFS show the same behavior. He says that the "safer" behavior of Ext3 is merely an accidental byproduct. For Ts'o the application developers are the culprits, as they take the good-natured behavior of Ext3 for granted. His advice: if an application wants to be sure that data is actually written to disk, it must call fsync() before closing the file.
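Ts'o's advice can be sketched in Python roughly as follows. This is only an illustration under stated assumptions: the function name durable_replace is made up, and the sketch assumes a POSIX system where os.fsync() maps to fsync() and os.rename() replaces an existing target. The difference from the usual temp-file-and-rename pattern is the flush-and-fsync step before the file is closed.

```python
import os
import tempfile

def durable_replace(path, data):
    """Replace path with data, calling fsync() before close and rename.

    Hypothetical helper for illustration only.
    """
    directory = os.path.dirname(os.path.abspath(path))
    # Same file system as the target, so rename() can swap the entry.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            # Hand Python's user-space buffer to the kernel first ...
            tmp.flush()
            # ... then ask the kernel to write the file's data to disk.
            os.fsync(tmp.fileno())
        # Only now is the directory entry switched to the new file.
        os.rename(tmp_path, path)
    except BaseException:
        # Remove the leftover temporary file if the rename never happened.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

The price is that fsync() blocks until the data has been written, so an application that saves many small files this way will notice the extra latency.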
Nevertheless, as a workaround he has written patches for Ext4 which recognize the rename() situation (and a second pattern that uses ftruncate()) and treat them as Ext3 would. Moreover, Ted has also written a genuine solution (http://thread.gmane.org/gmane.comp.file-systems.ext4/12179): the new Ext4 mount option "alloc_on_commit" is equivalent to "data=ordered" in Ext3 - metadata in the journal will only be committed after the blocks are allocated and the data is written. It will probably be available in kernel 2.6.30.