ZFS is the (relatively) new filesystem from Sun with some fascinating properties. Here are some headline facts just to get your attention: the maximum size of a single file on a ZFS filesystem is 16 exbibytes (roughly 18 billion gigabytes), it's possible to take a complete filesystem backup (snapshot) in a few seconds, and you'll never have to fsck your filesystem again to make sure it's not corrupt.
So, sounds pretty impressive, eh? What makes all this possible?
There are three main components of ZFS that enable a lot of the cool functionality. If you've used a NetApp OnTap-based filer before then these will sound familiar (hence NetApp's and Sun's lawyers putting in a bit of overtime).
The first component is the copy-on-write (COW) transactional model. This means that when a block of data on the filesystem changes, it is not overwritten in place; instead, a new block is written and the metadata for the changed file is updated to point to it.
This enables near-instant filesystem copies (snapshots) to be made. Only the metadata for the filesystem is copied, which is actually very little data and takes very little time. Also, immediately after the snapshot has been made it takes up virtually zero disk space. So you could make a snapshot of 10TB of data (so it looks like you've now got 20TB of data) but only use 10TB of disk space – neat.
The snapshot only grows in size as the data changes. So if you update 100GB worth of data (remember it's block-based, so what matters is not how big a file is but how many of its blocks change), then your snapshot will grow to 100GB. Snapshots can be deleted at any time, freeing up the space they hold, and you can have a (virtually) limitless number of them. So if you want to take snapshots every week, day, hour or minute – the choice is yours. You are only limited by the disk space available and how often your data changes.
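To see why a snapshot is near-instant and initially takes no space, here is a minimal toy sketch in Python (not real ZFS code, just the copy-on-write idea): a "filesystem" is metadata pointing at blocks, a snapshot copies only that metadata, and an update writes a new block rather than overwriting the old one.

```python
import copy

pool = {}      # block_id -> data (our pretend disk)
next_id = 0

def write_block(data):
    """Allocate a fresh block; existing blocks are never overwritten."""
    global next_id
    pool[next_id] = data
    next_id += 1
    return next_id - 1

# A "filesystem" is just metadata: filename -> list of block IDs.
live_fs = {"report.txt": [write_block(b"v1-part1"), write_block(b"v1-part2")]}

# Snapshot: copy the metadata only -- no data blocks are duplicated,
# so it is fast and consumes virtually no extra space.
snapshot = copy.deepcopy(live_fs)

# Copy-on-write update: the changed block goes to a NEW block; the old
# block stays on disk because the snapshot still references it.
live_fs["report.txt"][1] = write_block(b"v2-part2")

print(pool[snapshot["report.txt"][1]])  # snapshot still sees the old data
print(pool[live_fs["report.txt"][1]])   # live file sees the new data
print(len(pool))                        # only the changed block was added
```

The snapshot's "growth" is simply the old blocks it keeps alive: here, one extra block exists only because the snapshot still references it.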
The second component is checksums. Every block of data on a ZFS filesystem has a 256-bit checksum, which is verified every time the block is read. This means that if a block doesn't match its checksum (i.e. it's corrupt), ZFS can recreate that block from another copy (assuming you are running a redundant ZFS configuration such as a mirror or raidz).
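The verify-on-read, heal-from-a-good-copy idea can be sketched in a few lines of Python (a toy model, not ZFS internals – I'm using SHA-256 here simply as an example checksum, and a second stored copy to stand in for a mirrored pool):

```python
import hashlib

def store(data):
    """Keep a checksum alongside each block, like ZFS does."""
    return {"data": bytearray(data), "checksum": hashlib.sha256(data).digest()}

primary = store(b"important data")
mirror = store(b"important data")   # stand-in for the mirrored copy

# Simulate silent on-disk corruption of the primary copy.
primary["data"][0] ^= 0xFF

def read(block, alternate):
    """Verify on every read; self-heal from the good copy on mismatch."""
    if hashlib.sha256(bytes(block["data"])).digest() != block["checksum"]:
        block["data"] = bytearray(alternate["data"])  # repair from mirror
    return bytes(block["data"])

print(read(primary, mirror))  # corruption is detected and healed on read
```

The key point is that the application never sees the corrupt data: the bad block fails its checksum during the read and is silently replaced from the redundant copy.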
Another possibility checksums open up is block-level data deduplication. When a block is about to be written to disk, ZFS checks whether an identical block already exists. If it does, ZFS can just point the file's metadata at that block instead of writing a new one (it would also be possible to do this as a batch job, after the event). So, say you have a 100MB file, you make a small change to it and save it under a new filename: you might only use 101MB of disk space yet have two 100MB files. Or, say 1000 staff in your company save the same 1MB e-mail attachment to disk: it will only take up 1MB of disk space (not 1GB).
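A toy sketch of the attachment example (again, not how ZFS implements it – just the principle of indexing blocks by checksum and reusing a block when the checksum already exists):

```python
import hashlib
import os

BLOCK = 128 * 1024   # 128KB block size, chosen arbitrarily for the sketch

store = {}   # checksum -> block data (our pretend disk)

def write(data):
    """Write a file; return the checksums referencing its blocks."""
    refs = []
    for i in range(0, len(data), BLOCK):
        chunk = data[i:i + BLOCK]
        key = hashlib.sha256(chunk).hexdigest()
        store.setdefault(key, chunk)   # only stored if not already present
        refs.append(key)
    return refs

attachment = os.urandom(1_000_000)   # the same 1MB attachment...
for _ in range(1000):                # ...saved by 1000 members of staff
    write(attachment)

print(len(store))                          # unique blocks actually stored
print(sum(len(b) for b in store.values()))  # ~1MB on "disk", not ~1GB
```

All 1000 copies resolve to the same handful of blocks, so the total stored is about 1MB rather than 1GB – the duplicate writes cost nothing but metadata.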
Data deduplication is not currently implemented in ZFS but, surely, it can only be a matter of time.
The third component is dynamic striping. This allows you to add more drives to an existing pool, and ZFS will stripe data across the new drives. So, if you've got five 1TB drives in a raidz (similar to RAID-5) pool with 3TB of data on it, you can add another five 1TB drives and ZFS will spread writes across all ten drives (new writes are striped across the full set straight away; existing data gets rebalanced as it is rewritten). This makes adding drives for performance or capacity reasons very simple.
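The effect of widening the stripe can be sketched with a simple round-robin layout (an assumed toy allocator, nothing like ZFS's real one): the same set of blocks spread over five drives versus ten.

```python
def stripe(blocks, drives):
    """Round-robin blocks across drives; returns drive -> blocks placed."""
    layout = {d: [] for d in drives}
    for i, block in enumerate(blocks):
        layout[drives[i % len(drives)]].append(block)
    return layout

data = [f"block{i}" for i in range(10)]

five_wide = stripe(data, [f"d{i}" for i in range(1, 6)])
ten_wide = stripe(data, [f"d{i}" for i in range(1, 11)])

print(max(len(v) for v in five_wide.values()))  # blocks per drive, 5 drives
print(max(len(v) for v in ten_wide.values()))   # blocks per drive, 10 drives
```

With twice as many drives, each drive holds half as many blocks, so reads and writes for the same data are spread across twice as many spindles.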
ZFS is a massive leap forward for Unix filesystems, with so much great potential. Sun have licensed the technology to Apple for OS X Leopard and hopefully, one day, it will make it into the Linux kernel – although this is currently not possible due to licence restrictions (ZFS's CDDL licence is incompatible with the GPL).
You can try OpenSolaris (which includes ZFS) for free, so why not give it a go?
More information can be found here –