Filesystems: Linux versus FreeBSD

Wed Sep 21 12:39:22 PDT 2005

Quoting Danny Howard (dannyman at toldme.com):

> One nice thing about FreeBSD is that the SoftUpdates feature bundles
> writes intelligently to keep the filesystem in a sufficiently consistent
> state, so that most of the time a fsck operation is not required to
> bring the server back up even after a hard reset.

SoftUpdates on FFS is a neat hack (as is background fsck).  Here's a
Usenix technical paper by McKusick and others comparing it against
typical journaling filesystems: 
http://www.usenix.org/publications/library/proceedings/usenix2000/general/full_papers/seltzer/seltzer_html/

I'm not sure whether it's still under patent encumbrance; my guess is
that it still is.

> UNFORTUNATELY FreeBSD's support for vendor-approved HBAs tends to be,
> shall we say, spotty ... there is a very good chance that we will want
> to replace the database OS with Fedora Core ... which leads me to ask,
> does Linux these days have an FS option that can offer similar
> advantages to SoftUpdates, such that I needn't fsck after a server
> crash?  Is it robust?

Well, you pretty much always have to go through a _minimal_ fsck even
with a journaled filesystem, but it's so quick you will barely see it.
The extent of journaling (metadata-only or metadata + data) can
generally be set via mount options -- with the exception that Reiser4
apparently cannot be set for metadata-only.

Of the filesystems commonly in use on Linux:

  ext2 

Very much like FFS with the "-o async" mount option (which
explains why it's prone to occasional lossage in crashes).  I still 
tend to use it for performance reasons on /tmp & similar, and on 
filesystems normally mounted read-only (e.g., /usr).

  ext3

Modified ext2 with integrated journaling ("logging" in Solaris
lingo).  Good performance generally, fast I/O speed for reads.

ext3 has by far the most mature, conservative[1], effective fsck and
mkfs utilities (maintained by Ted Ts'o), and general operation. It is
designed to be forgiving of the failure modes of commodity x86 hardware,
which can be dangerous to data.[2]

Reports say that ext3 with data=journal[3] is great for mail spools, and
directory indexing supposedly improves access for directories with many
small files significantly.

data=journal provides highest throughput for mail spools because the
small files are written, sent, then removed. When this is all done in
the journal, things supposedly go much faster. Otherwise, data=writeback
provides best overall performance AFAIK. (The ext3 default is
data=ordered.)
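
To make the mount-option mechanics concrete, here is a minimal Python
sketch that builds the mount(8) command line for a chosen ext3
journaling mode. The helper name, the device, and the mount point are
hypothetical placeholders of mine; only the three data= values come from
footnote [3]. It only constructs the command and does not run it.

```python
# The three ext3 journaling modes described in footnote [3].
EXT3_DATA_MODES = ("writeback", "ordered", "journal")


def ext3_mount_command(device, mount_point, data_mode):
    """Return the argv list for mounting an ext3 filesystem with data=<mode>.

    Hypothetical helper for illustration; it builds the command only.
    """
    if data_mode not in EXT3_DATA_MODES:
        raise ValueError("unknown ext3 data mode: %r" % (data_mode,))
    return ["mount", "-t", "ext3", "-o", "data=" + data_mode,
            device, mount_point]


if __name__ == "__main__":
    # Hypothetical device and mount point for a mail spool.
    print(" ".join(ext3_mount_command("/dev/sdb1", "/var/spool/mail", "journal")))
```

The same data= option can presumably go in the options column of
/etc/fstab, so that the mode persists across reboots.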

  ReiserFS

ReiserFS has gone through (at last count) four distinct, on-disk
formats, with at best rocky compatibility from one to the next. The
"fsck" (filesystem check) utilities for ReiserFS have earned a reputation
for often repairing filesystems by massive deletion of files. This
appears to happen primarily because of loss of metadata, as opposed to
damage to datafiles. Many observers have been leery of the design, for
those two reasons. Some would object that the characteristics of
versions before ReiserFS4 are no longer relevant; others hold the
inconsistent, changing design and the severe reliability problems of the
prior code against it.

ReiserFS enjoys fast file creation/deletion. It's best used for
filesystems housing large numbers of small, changeable files, e.g., a
machine running a Usenet news server. Reiser is space-efficient and does
not pre-allocate inodes: they are allocated on the fly.

ReiserFS defaults to writing metadata before data. ext3-like behaviour
can be forced, instead, by using a "data=ordered" mount parameter.

  XFS

SGI's XFS is generally the fastest of the journaled filesystems, having
exceptionally good performance for filesystems housing (individually)
very large files (gigabytes each) on very large partitions, e.g., for
video production: XFS was designed (on SGI IRIX) to be a full 64-bit
filesystem from the beginning, and thus natively supports files as large
as 2^63 bytes, or roughly 9 x 10^18 bytes = 9 exabytes (about nine
million terabytes), as implemented on 2.6 kernels, or 64 terabytes as
implemented on 2.4. XFS is much faster than the other filesystems at
deleting large files: an order of magnitude faster than ext3 and two
orders of magnitude faster than ReiserFS.
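
For anyone wanting to check the size arithmetic, here is a short Python
sketch. It assumes the 2^63 limit is counted in bytes, and it shows both
decimal (SI) and binary units, since the two differ noticeably at this
scale. The constant names are mine.

```python
# Maximum file size quoted for XFS on 2.6 kernels, assumed to be in bytes.
max_bytes = 2 ** 63

EXABYTE = 10 ** 18    # decimal (SI) exabyte
TERABYTE = 10 ** 12   # decimal (SI) terabyte
EXBIBYTE = 2 ** 60    # binary exbibyte (EiB)

print(max_bytes)               # 9223372036854775808
print(max_bytes / EXABYTE)     # a little over 9.2 exabytes
print(max_bytes // TERABYTE)   # 9223372 terabytes, i.e. about nine million
print(max_bytes // EXBIBYTE)   # 8 exbibytes
```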

When last I used it, XFS performance on small and medium-sized files
tended to be relatively a little slower than ReiserFS and a bit faster
than ext3, but it's possible that this may have changed.

XFS defaults to writing metadata before data, and this behaviour cannot
be overridden.

The biggest concern with XFS is the very extensive set of changes SGI
had to make to the Linux kernel's VFS layer in order to incorporate it.

Older versions had some problems with files getting populated with
binary nulls when written to immediately before a crash because of the
way XFS handled its cache and preallocation of space. This has been
fixed, though, and things are very stable now. Delete speeds are also
much faster than before, when they used to be noticeably slow because
they were done synchronously.

  JFS

Like most people I have no experience with IBM's JFS for Linux (which
IBM ported from OS/2, rather than from AIX). However, a friend who's
used it extensively on Linux sends the following report:

JFS is generally reliable, but lost/damaged files show up in lost+found
more often than they do with XFS. On the other hand, such files are more
likely to be intact: XFS tends to pad them with null sections, which you
must remove. JFS has a somewhat higher CPU cost than does XFS.

  Other considerations

Both ReiserFS and XFS impose significant additional CPU load, relative
to ext3 (except that XFS has very low CPU load, relatively speaking,
when handling very large files).

By the way, if you're not wedded to the idea of Fedora Core, please
consider one of the better RHEL rebuilds such as CentOS 4.1:
http://www.centos.org/  (Why?  Because Fedora Core tends to be scarily
beta.  I would very much avoid its use on any production system.)

[1] http://kerneltrap.org/node/10

[2] http://linuxmafia.com/faq/Filesystems/reiserfs.html

[3] Options are:

data=writeback: Metadata-only, and best performance.  Could allow
                recently modified files to become corrupted in the 
                event of an unexpected reboot.

data=ordered:   Officially journals only metadata, but it logically groups
                metadata and data blocks into a single unit called a
                transaction. When it's time to write the new metadata
                out to disk, the associated data blocks are written
                first. data=ordered effectively solves data=writeback's
                corruption risk (shared by other journaled FSes generally),
                without resorting to full journaling. data=ordered tends
                to perform slightly slower than data=writeback, but
                significantly faster than full journaling.

                When appending data to files, data=ordered provides
                all of full data journaling's integrity protection. 
                However, if part of a file is being overwritten when the 
                system crashes, it's possible the region being written 
                will receive a combination of original blocks interspersed 
                with updated blocks. This is because data=ordered provides 
                no guarantees as to which blocks are overwritten first, 
                so you can't assume that just because overwritten block x 
                was updated, that overwritten block x-1 was updated as well.
                Data=ordered leaves the write ordering up to the hard drive's
                write cache. In general, this doesn't bite people often;
                file appends are much more common than overwrites. 
                So, data=ordered is a good higher-performance replacement 
                for full journaling.

data=journal:   Full data and metadata journaling:  All new data are written
                to the journal first.  Oddly, in certain situations,
                data=journal can be blazingly fast, where data needs to
                be read from and written to disk at the same time
                (interactive I/O).
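
The overwrite caveat under data=ordered can be illustrated with a toy
Python model. This is only a sketch of the reasoning, not of ext3's
actual code: the function name is hypothetical, and a seeded shuffle
stands in for whatever order the drive's write cache happens to choose.

```python
import random


def overwrite_with_crash(num_blocks, writes_before_crash, seed=0):
    """Toy model: overwrite every block of a file in no guaranteed order,
    stopping ("crashing") after a given number of block writes."""
    blocks = ["old"] * num_blocks
    order = list(range(num_blocks))
    # Stand-in for the drive's write ordering, which is not guaranteed.
    random.Random(seed).shuffle(order)
    for index in order[:writes_before_crash]:
        blocks[index] = "new"
    return blocks


if __name__ == "__main__":
    # Crash halfway through overwriting an 8-block region.
    print(overwrite_with_crash(num_blocks=8, writes_before_crash=4))
```

The surviving region is generally a mix of "old" and "new" blocks with
no particular pattern, which is why you can't infer that block x-1 was
updated just because block x was.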

-- 
Cheers,                    Never criticize anybody until you have walked a mile 
Rick Moen                  in their shoes, because by that time you will be a 
rick at linuxmafia.com        mile away and have their shoes.   -- Brian Servis