Clustered Filesystems

From V5wiki




GFS: Shared storage FS. Integrated with RHCS.

OCFS2: Shared storage FS, very similar to GFS. Standalone. Fencing/failover has to be provided by something like RHCS or Heartbeat.

-- I should probably stress that the above are for SAN based operation, but logically this extends to operating on top of a mirrored network block device like DRBD. --

Lustre: Advanced network file system. Despite claims of great scalability, metadata storage is failover/redundant, but not load-shared.

GlusterFS: Replicated network FS with POSIX locking and support for file based striping and mirroring. Requires xattr support on the backing file system, but files are the same on the exported and underlying file systems, which makes data recovery very straightforward and sensible if anything goes wrong.

Depending on what you plan to use it for, you may also want to look into Coda: replicated FS, supports disconnected operation through caching. The permission system can take some getting used to because it is based on external ACLs rather than owner/group/other permissions as per the standard UNIX paradigm. Limited to 1000-4000 files per directory.

-- These three are client-server based FS-es (there is nothing stopping a client from also being a server). GlusterFS and Coda both store files raw on the underlying file system. Coda has its own internal metadata stores (files are stored with original content, but names are just numbers), whereas GlusterFS uses xattrs for metadata and file names are as original. --

Taken from [1]


Tough question, as it depends on what you need. I have messed around with 3 of those over the last 2 years; so far I am still just using 2 NFS servers, one for mail and one for web, for my 14 or so client machines until I figure out how to use glusterfs.

I tried gfs (redhat) and I don't remember if I ever even got it to actually run; I was trying it out on fedora distros. It seemed very overcomplicated and not very user friendly (just my experience).

OCFS2 seemed very clean and I was able to use it with iSCSI, but man, the load on my server was running at 7 and it was on the slow side. What I was trying to do with it was create a single drive to put my maildir data onto (millions of small mail files). The way it worked was you actually mounted the file system like it was a local file system on all machines that needed it, and the cluster part would handle the locking or whatnot. Cool concept, but overkill for what I needed.

Also I believe both GFS and OCFS2 are these "specialized" file systems. What happens if it breaks or goes down? How do you access your data? Well, if gfs or ocfs2 is broken you can't. With glusterfs, you have direct access to your underlying data. So you can have your big raid mounted on a server with an XFS file system, and glusterfs just sits on top of this, so if for some reason you break your glusterfs setup you *could* revert back to some other form of serving files (such as NFS). Obviously this totally depends on your situation and how you are using it.

I have never used lustre, it sounded cool, but over complicated.

Hence the reason that *so far* I am still using NFS. It comes with every linux installation, and it's fairly easy to set up by editing what, 4 lines or so. GlusterFS takes the same simple approach and if you do break it, you still have access to your data.

The learning curve for glusterfs is much better than the others from my experience so far. The biggest thing is just learning all of the different ways you can configure spec files.

Taken from the same place as previous


I can't speak to the rest, but you seem to be confused between a 'distributed storage engine' and a 'distributed file system'. They are not the same thing, they shouldn't be mistaken for the same thing, and they will never be the same thing. A filesystem is a way to keep track of where things are located on a hard drive. A storage engine like hadoop is a way to keep track of a chunk of data identified by a key. Conceptually, not much difference. The problem is that a filesystem is a dependency of a storage engine... after all, it needs a way to write to a block device, doesn't it?
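To make the dependency concrete, here is a minimal sketch of a key-addressed store layered on top of an ordinary filesystem. It is a toy, not how hadoop or any real storage engine works; the class name `TinyKeyValueStore` and the choice of hashing keys into file names are assumptions made purely for illustration.

```python
import hashlib
import tempfile
from pathlib import Path


class TinyKeyValueStore:
    """Toy storage engine (hypothetical): maps keys to values, one file per key."""

    def __init__(self, root: Path) -> None:
        self.root = root
        self.root.mkdir(parents=True, exist_ok=True)

    def _path_for(self, key: str) -> Path:
        # The engine only knows about keys. Where the bytes end up on the
        # block device is entirely the filesystem's business.
        digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
        return self.root / digest

    def put(self, key: str, value: bytes) -> None:
        self._path_for(key).write_bytes(value)

    def get(self, key: str) -> bytes:
        return self._path_for(key).read_bytes()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        store = TinyKeyValueStore(Path(tmp) / "data")
        store.put("user:42", b"hello")
        print(store.get("user:42"))
```

The store never touches a block device itself: every `put` and `get` goes through file operations, which is the sense in which the filesystem sits underneath the storage engine.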

All that aside, I can speak to the use of ocfs2 as a distributed filesystem in a production environment. If you don't want the gritty details, stop reading after this line: It's kinda cool, but it may mean more downtime than you think it does.

We've been running ocfs2 in a production environment for the past couple of years. It's OK, but it's not great for a lot of applications. You should really look at your requirements and figure out what they are -- you might find that you have a lot more latitude for faults than you thought you did.

As an example, ocfs2 has a journal for each machine in the cluster that's going to mount the partition. So let's say you've got four web machines, and when you make that partition using mkfs.ocfs2, you specify that there will be six machines total to give yourself some room to grow. Each of those journals takes up space, which reduces the amount of data you can store on the disks. Now, let's say you need to scale to seven machines. In that situation, you need to take down the entire cluster (i.e. unmount all of the ocfs2 partitions) and use the tunefs.ocfs2 utility to create an additional journal, provided that there's space available. Then and only then can you add the seventh machine to the cluster (which requires you to distribute a text file to the rest of the cluster unless you're using a utility), bring everything back up, and then mount the partition on all seven machines.
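The space trade-off described above is simple arithmetic, sketched below. The journal size used in the example (256 MB) and the volume size are illustrative assumptions, not values taken from ocfs2 defaults, and the helper names are hypothetical. Other filesystem metadata is ignored.

```python
def usable_space_mb(volume_mb: int, node_slots: int, journal_mb: int) -> int:
    """Space left for data after reserving one journal per node slot.

    Only the per-slot journals are counted; other metadata is ignored.
    """
    if node_slots < 1:
        raise ValueError("need at least one node slot")
    reserved = node_slots * journal_mb
    if reserved >= volume_mb:
        raise ValueError("journals would consume the whole volume")
    return volume_mb - reserved


def can_add_slot(free_mb: int, journal_mb: int) -> bool:
    """True if enough free space remains to create one more journal."""
    return free_mb >= journal_mb


if __name__ == "__main__":
    # Assumed numbers: a 1,000,000 MB volume and 256 MB per journal.
    print(usable_space_mb(1_000_000, 6, 256))  # six slots reserved at mkfs time
    print(can_add_slot(free_mb=100, journal_mb=256))  # a crowded volume
```

With these assumed numbers, six slots reserve 1536 MB, and a volume with only 100 MB free could not take a seventh journal at all, which is the "provided that there's space available" caveat.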

See what I mean? It's supposed to be high availability, which is supposed to mean 'always online', but right there you've got a bunch of downtime... and god forbid you're crowded for disk space. You DON'T want to see what happens when you crowd ocfs2.

Keep in mind that evms, which used to be the 'preferred' way to manage ocfs2 clusters, has gone the way of the dodo bird in favor of clvmd and lvm2. (And good riddance to evms.) Also, heartbeat is quickly going to turn into a zombie project in favor of the openais/pacemaker stack. (Aside: When doing the initial cluster configuration for ocfs2, you can specify 'pcmk' as the cluster engine as opposed to heartbeat. No, this isn't documented.)

For what it's worth, we've gone back to nfs managed by pacemaker, because the few seconds of downtime or a few dropped tcp packets as pacemaker migrates an nfs share to another machine is trivial compared to the amount of downtime we were seeing for basic shared storage operations like adding machines when using ocfs2.

Taken from [2]


DRBD® refers to block devices designed as a building block to form high availability (HA) clusters. This is done by mirroring a whole block device via an assigned network. DRBD can be understood as network based raid-1.
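The "network based raid-1" idea can be pictured with a small in-memory model. This is only a conceptual sketch: the class `MirroredBlockDevice` is hypothetical, both copies live in one process, and nothing here reflects DRBD's actual replication protocol or API.

```python
class MirroredBlockDevice:
    """Conceptual RAID-1 mirror (hypothetical): every write goes to two copies."""

    def __init__(self, block_count: int, block_size: int = 4096) -> None:
        self.block_size = block_size
        self.local = [bytes(block_size) for _ in range(block_count)]
        # In a real mirrored setup this copy would live on another machine.
        self.peer = [bytes(block_size) for _ in range(block_count)]

    def write(self, index: int, data: bytes) -> None:
        if len(data) != self.block_size:
            raise ValueError("data must be exactly one block long")
        self.local[index] = data
        # Stand-in for shipping the block over the replication network.
        self.peer[index] = data

    def read(self, index: int) -> bytes:
        return self.local[index]

    def read_from_peer(self, index: int) -> bytes:
        # What the surviving node would serve if the local copy were lost.
        return self.peer[index]


if __name__ == "__main__":
    dev = MirroredBlockDevice(block_count=8, block_size=4)
    dev.write(3, b"mail")
    print(dev.read(3) == dev.read_from_peer(3))
```

Because the whole block device is mirrored rather than individual files, whatever filesystem sits on top does not need to know about the replication.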

DRBD in Russian
