At 11:06 AM -0500, 01/31/2001, Sebastien Berube wrote:
> One way to fix this issue would be to use a hashing scheme to spread the full set of mailboxes across a subdirectory structure. You could get something like:
> johndoe@yourdomain.com would have his mailbox in
> /export/mailboxes/j/o/h/n/johndoe.mbox
> so in /export/mailboxes, in order to find the j directory, you only have about 36 directory entries or so.
> This scheme doesn't work so well, though, if you accept usernames of 3 or fewer characters.
It's not hard to right-pad any short usernames before hashing. For instance, the username "bo" might hash as "bo__" and thus would end up in the directory "/export/mailboxes/b/o/_/_/bo.mbox". If you allow non-alphanumerics you'll want to translate those to something innocuous as well, or a name such as "bo.lee" will cause problems.
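In C, that pad-and-sanitize step might look like this (a minimal sketch; the '_' pad character, the lowercase folding, and the fixed four-level layout are illustrative assumptions, not anything prescribed above):

#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* Build the mailbox path from a username: pad short names with '_'
 * and map non-alphanumerics to '_', so "bo" lands in b/o/_/_ and
 * "bo.lee" can't put a stray "." into the directory tree. */
static void mailbox_path(const char *user, char *out, size_t outlen)
{
    size_t len = strlen(user);
    char k[4];
    size_t i;

    for (i = 0; i < 4; i++) {
        unsigned char c = (i < len) ? (unsigned char)user[i] : '_';
        k[i] = isalnum(c) ? (char)tolower(c) : '_';
    }
    snprintf(out, outlen, "/export/mailboxes/%c/%c/%c/%c/%s.mbox",
             k[0], k[1], k[2], k[3], user);
}

int main(void)
{
    char path[256];

    mailbox_path("bo", path, sizeof(path));
    puts(path);    /* /export/mailboxes/b/o/_/_/bo.mbox */
    return 0;
}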
On Wed, 31 Jan 2001, Eric Sobocinski wrote:
>> One way to fix this issue would be to use a hashing scheme to spread the full set of mailboxes across a subdirectory structure. [...]
> It's not hard to right-pad any short usernames before hashing. [...] If you allow non-alphanumerics you'll want to translate those to something innocuous as well, or a name such as "bo.lee" will cause problems.
Well, hashing like that works well from the standpoint that it's very easy for the software to find the mailbox. It's going to make things like backups very costly, though, because of all the recursive directories. Also, you're going to end up with some directories very imbalanced, since there are more frequently occurring names.

If you're going to use NFS, you probably want to use something like maildir format, which is NFS-safe but becomes very costly as the number of messages increases. A lot of that has to do with the performance of the remote NFS server; the underlying filesystem's performance in reading large directories will make a BIG difference as far as that goes. NetApps have excellent large-directory performance, FWIW.

If you're looking for large scalability AND high performance, my preferred solution would be to have a relational database as the backend, but don't store any messages in it, simply pointers to their location on disk. Then store the messages without regard to intended username in a hashed directory structure. The POP3 server then gets the list of new messages from the database server, which could just be a list of filenames. Then the POP3 server simply has to open the message to return it; it doesn't have to do an opendir(). Also, if you use the filename as the UIDL returned, there's no need to even stat() the file, again saving you a whole NFS call. The obvious downside is that you can't do a:

rm -f /users/j/o/h/n/johndoe.mbx

But, with 200k mailboxes, you should have an automated way to do that anyway.

Thanks,
Matt

--
Matthew J. Zito
Systems Engineer
Register.com, Inc., 11th Floor, 575 8th Avenue, New York, NY 10018
Ph: 212-798-9205
PGP Key Fingerprint: 4E AC E1 0B BE DD 7D BC D2 06 B2 B0 BF 55 68 99
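A rough sketch in C of the retrieval side of the scheme Matt describes; db_list_messages() and the /export/store path are hypothetical stand-ins for the real database query and store location, but it shows how the POP3 server can serve RETR and UIDL without ever calling opendir() or stat():

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Hypothetical stand-in for the database query: fill 'files' with the
 * mailbox's message filenames (relative to the store) and return the
 * count. In a real system this is one SELECT against the pointer table. */
static int db_list_messages(const char *user, char files[][64], int max)
{
    (void)user; (void)files; (void)max;
    return 0;
}

/* RETR: open the path the database handed us. No opendir() is needed,
 * and since the stored filename doubles as the UIDL, no stat() either. */
static void pop3_retr(int client_fd, const char *file)
{
    char path[512], buf[8192];
    ssize_t n;
    int fd;

    snprintf(path, sizeof(path), "/export/store/%s", file);
    if ((fd = open(path, O_RDONLY)) < 0)
        return;                        /* file is gone; the DB row is stale */
    while ((n = read(fd, buf, sizeof(buf))) > 0)
        write(client_fd, buf, (size_t)n);
    close(fd);
}

int main(void)
{
    char files[16][64];
    int i, n = db_list_messages("johndoe", files, 16);

    for (i = 0; i < n; i++) {
        printf("%d %s\r\n", i + 1, files[i]);   /* UIDL: the filename itself */
        pop3_retr(STDOUT_FILENO, files[i]);     /* stdout stands in for the client */
    }
    return 0;
}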
On Wed, 31 Jan 2001, Matthew Zito wrote:
> Well, hashing like that works well from the standpoint that it's very easy for the software to find the mailbox. It's going to make things like backups very costly, though, because of all the recursive directories. Also, you're going to end up with some directories very imbalanced, since there are more frequently occurring names.
In order to remedy this rather easily, you can always run the username through a hashing function and use the first 'n' letters of the hash to determine which directory the mail(box|dir) is in. That also prevents problems with non-alphanumeric characters such as ".".
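For instance, a sketch in C (djb2 is just one convenient stable hash, and the two-level hex fan-out of 256 directories per level is an arbitrary choice):

#include <stdio.h>

/* djb2 string hash; any stable hash function works here. */
static unsigned long djb2(const char *s)
{
    unsigned long h = 5381;
    while (*s)
        h = h * 33 + (unsigned char)*s++;
    return h;
}

int main(void)
{
    const char *user = "bo.lee";
    unsigned long h = djb2(user);
    char path[256];

    /* Two hex digits of the hash pick each directory level, so the
     * username's own characters (including ".") never become path
     * components. */
    snprintf(path, sizeof(path), "/export/mailboxes/%02lx/%02lx/%s.mbox",
             (h >> 8) & 0xffUL, h & 0xffUL, user);
    puts(path);
    return 0;
}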
> If you're looking for large scalability AND high performance, my preferred solution would be to have a relational database as the backend, but don't store any messages in it, simply pointers to their location on disk. Then store the messages without regard to intended username in a hashed directory structure. [...] The obvious downside is that you can't do a:
> rm -f /users/j/o/h/n/johndoe.mbx
> But, with 200k mailboxes, you should have an automated way to do that anyway.
It also makes backups a nightmare. In that case, you'll have to shut down the entire mail system before you can back up, or you'll have a database image which won't represent the actual data you have on your NAS.
--
Sebastien Berube
Operation Center Systems Administrator
sberube@zeroknowledge.com
In Gary we trust.
> /export/mailboxes/j/o/h/n/johndoe.mbox
In the past I've actually found that reversing the letters gives much better randomness across the directory structure, so johndoe@clown.org would end up in e/o/d/n/johndoe, and you don't take much of a hit for this.
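A sketch of that reversal in C, assuming the same four-level layout and the '_' padding for short names discussed earlier in the thread:

#include <stdio.h>
#include <string.h>

/* "johndoe" -> /export/mailboxes/e/o/d/n/johndoe.mbox. The tails of
 * usernames vary more than the heads ("john", "johnb", "johnc" all
 * share j/o/h/n), so reversing spreads mailboxes more evenly. */
static void reversed_path(const char *user, char *out, size_t outlen)
{
    size_t len = strlen(user);
    char k[4];
    size_t i;

    for (i = 0; i < 4; i++)
        k[i] = (i < len) ? user[len - 1 - i] : '_';
    snprintf(out, outlen, "/export/mailboxes/%c/%c/%c/%c/%s.mbox",
             k[0], k[1], k[2], k[3], user);
}

int main(void)
{
    char path[256];

    reversed_path("johndoe", path, sizeof(path));
    puts(path);    /* /export/mailboxes/e/o/d/n/johndoe.mbox */
    return 0;
}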
>> It's going to make things like backups very costly, though, because of all the recursive directories. Also, you're going to end up with some directories very imbalanced, since there are more frequently occurring names.
> It also makes backups a nightmare. In that case, you'll have to shut down the entire mail system before you can back up, or you'll have a database image which won't represent the actual data you have on your NAS.
In a high-performance/availability system, typical tape/spool-based backups are problematic; with a NetApp you have a number of options to handle this (SnapMirror, etc.). It really depends on your turnover of data, which for mail is usually pretty high. (Oh, and IBM disks tend to make a huge difference. :-)) Of course, spool-type backups are fine for the OS and configurations.

Regards,
Neil.
On Wed, Jan 31, 2001 at 08:11:03PM +0000, Neil J. McRae wrote:
>> /export/mailboxes/j/o/h/n/johndoe.mbox
> In the past I've actually found that reversing the letters gives much better randomness across the directory structure, so johndoe@clown.org would end up in e/o/d/n/johndoe, and you don't take much of a hit for this.
i'm currently implementing a largish mail server, and have come up with what i think is a nice way to deal with scale and redundancy, etc, etc. what i have done is create a couple of DNS zones, which look like:

$ORIGIN mailbox.domain.com.
bob     IN CNAME popserver1
john    IN CNAME popserver1
bill    IN CNAME popserver2

$ORIGIN smtp.domain.com.
bob     IN A 10.1.1.1      ; ipaddr of mailserver1
john    IN A 10.1.1.1      ; ipaddr of mailserver1
bill    IN A 10.1.1.2      ; ipaddr of mailserver2

then, users are told to set their SMTP server to username.smtp.domain.com and to direct their POP/IMAP client at username.mailbox.domain.com. you might even be able to get away with a single map. incoming mail should direct username@domain.com to username@username.domain.com.

in any case, using this method, you can now arbitrarily store mailboxes on any of several machines, even possibly in several locations. if a server fails, you can quickly redirect the users to another server so that new mail piles up in their new mailbox, and you can restore the broken server at a more leisurely pace.

this can be extended to allow users to check their email with a web-based client at http://username.mailbox.domain.com, or even be shortened so that they can have a personal website at http://username.domain.com.

how to implement the actual DNS is left as an exercise for the student.

--
[ Jim Mercer    jim@pneumonoultramicroscopicsilicovolcanoconiosis.ca ]
[ Reptilian Research -- Longer Life through Colder Blood ]
[ aka jim@reptiles.org    +1 416 410-5633 ]
On Wed, 31 Jan 2001, Sebastien Berube wrote:
>> [...] But, with 200k mailboxes, you should have an automated way to do that anyway.
> It also makes backups a nightmare. In that case, you'll have to shut down the entire mail system before you can back up, or you'll have a database image which won't represent the actual data you have on your NAS.
No, no, don't do that. Given the scale of something like this, I'd expect you'd be running on something like Oracle that supports the concept of "hot backups". The tablespaces are put into a quiesced state, and all writes are done to memory and to recovery logs. Once the backup is finished, you take it out of hot backup and it then writes all the pending transactions to the database files. That way, the database files are stable, and you also back up the recovery logs to something with real-time access (like another NFS server).

In the event you have a catastrophic database failure, you recover from tape (or, if you have the space, from a copy of the dbf files kept elsewhere) and run all the transaction logs; it takes about 5 minutes per hour of transactions. Then your database is brought up to the point where it was when it died. The worst-case scenario is that there are a few transactions that don't get logged, which means that a few emails get dropped. If you had a stock SMTP server that died, you could be looking at the same situation.

As far as backing up the actual mailboxes, there's no way to get around the fact that it'll take long enough to finish that stuff will be inaccurate by the time it's finished. If you ever have to restore the mailboxes from tape without restoring the database, it'd be wise to have an application that builds a list of the messages that are on disk that the database doesn't know about.

Thanks,
Matt

--
Matthew J. Zito
Systems Engineer
Register.com, Inc., 11th Floor, 575 8th Avenue, New York, NY 10018
Ph: 212-798-9205
PGP Key Fingerprint: 4E AC E1 0B BE DD 7D BC D2 06 B2 B0 BF 55 68 99
On Wed, Jan 31, 2001, Matthew Zito wrote:
> No, no, don't do that. Given the scale of something like this, I'd expect you'd be running on something like Oracle that supports the concept of "hot backups". [...]
.. and then you have to make sure that you periodically garbage collect your local store, lest you end up with a whole bunch of files which are unreferenced and just take up space. :)
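A sketch of such a garbage-collection sweep using nftw(); db_references() is a hypothetical lookup against the pointer database, and the one-day grace period is an assumed guard so a message that has been written to disk but not yet committed to the database isn't collected:

#define _XOPEN_SOURCE 500   /* for nftw() */
#include <ftw.h>
#include <stdio.h>
#include <sys/stat.h>
#include <time.h>
#include <unistd.h>

#define GRACE_SECONDS (24 * 60 * 60)   /* leave files younger than a day alone */

/* Hypothetical: ask the pointer database whether any mailbox row still
 * references this file. Stubbed to "yes" so the sketch deletes nothing. */
static int db_references(const char *path)
{
    (void)path;
    return 1;
}

static int sweep(const char *path, const struct stat *st, int type,
                 struct FTW *ftw)
{
    (void)ftw;
    if (type != FTW_F)
        return 0;
    if (time(NULL) - st->st_mtime < GRACE_SECONDS)
        return 0;                  /* delivered but maybe not committed yet */
    if (!db_references(path)) {
        printf("orphan: %s\n", path);
        unlink(path);
    }
    return 0;
}

int main(void)
{
    /* Walk the whole hashed store; 64 caps the directory fds nftw holds open. */
    return nftw("/export/mailboxes", sweep, 64, FTW_PHYS);
}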
> As far as backing up the actual mailboxes, there's no way to get around the fact that it'll take long enough to finish that stuff will be inaccurate by the time it's finished. If you ever have to restore the mailboxes from tape without restoring the database, it'd be wise to have an application that builds a list of the messages that are on disk that the database doesn't know about.
At least two commercial filesystems support "snapshots": AdvFS and WAFL (NetApp). I don't remember if XFS supports snapshots. Oh, and FreeBSD's FFS has snapshot capabilities, but it's not yet useful in a "real world" scenario.

Adrian

--
Adrian Chadd                    "Sex Change: a simple job of outside
<adrian@creative.net.au>         to inside plumbing." - Some random movie
On Wed, Jan 31, 2001, Matthew Zito wrote:
> [...] The obvious downside is that you can't do a:
> rm -f /users/j/o/h/n/johndoe.mbx
> But, with 200k mailboxes, you should have an automated way to do that anyway.
Hah. Unlink the directory, and do a background fsck every few hours? :)

The trouble with the above format is that you're ignoring any locality that exists in the filesystem. For example, in Berkeley FFS, files in a given directory are allocated in the same cylinder group (or at least it is attempted). Which, under heavy, heavy load could actually give a slight performance boost on a non-filled FFS.

I believe there was a paper covering this locality for web caches. Ah, yes: "Reducing the Disk I/O of Web Proxy Server Caches", by Carlos Maltzahn and Kathy J. Richardson (Compaq Computer Corporation, Network Systems Laboratory) and Dirk Grunwald (University of Colorado). Some (not all) of the concepts included there are relevant here.

Other filesystems will have different allocation/layout policies, and additions such as "hinting" which can substantially speed up mail accesses. But this is off topic, and I digress. :-)

Adrian

--
Adrian Chadd                    "Sex Change: a simple job of outside
<adrian@creative.net.au>         to inside plumbing." - Some random movie
>> But, with 200k mailboxes, you should have an automated way to do that anyway.
> Hah. Unlink the directory, and do a background fsck every few hours? :)
I don't know why you'd want to do the above, but you could add code to the delivery agents: when inbound email hits the system, create the required directories and files if needed (this could be mail.local, deliver, or something similar). When a mailbox is emptied, get the delivery agent (could be pop3d or imapd) to delete any empty directories; then growing directories can be kept under control.
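Both halves are only a few lines of C; a sketch, reusing the four-level layout from earlier examples (in practice you would stop pruning at the store root rather than letting rmdir() walk all the way up):

#include <errno.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

/* Delivery side: create each missing path component, mkdir -p style.
 * EEXIST is fine; another delivery may have raced us to it. */
static int make_mailbox_dirs(char *path)   /* e.g. "/export/mailboxes/e/o/d/n" */
{
    char *p;

    for (p = path + 1; *p; p++) {
        if (*p != '/')
            continue;
        *p = '\0';
        if (mkdir(path, 0755) < 0 && errno != EEXIST)
            return -1;
        *p = '/';
    }
    if (mkdir(path, 0755) < 0 && errno != EEXIST)
        return -1;
    return 0;
}

/* Retrieval side: after a mailbox is removed, walk back up deleting
 * directories that are now empty. rmdir() refuses to remove a non-empty
 * directory, so the loop stops by itself as soon as a level is in use. */
static void prune_empty_dirs(char *path)
{
    char *slash;

    while (rmdir(path) == 0) {
        slash = strrchr(path, '/');
        if (slash == NULL || slash == path)
            break;
        *slash = '\0';
    }
}

int main(void)
{
    char dir[] = "/tmp/mailstore/e/o/d/n";

    if (make_mailbox_dirs(dir) == 0)
        prune_empty_dirs(dir);    /* nothing was delivered, so it all goes */
    return 0;
}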
> The trouble with the above format is that you're ignoring any locality that exists in the filesystem. For example, in Berkeley FFS, files in a given directory are allocated in the same cylinder group (or at least it is attempted).
> Which, under heavy, heavy load could actually give a slight performance boost on a non-filled FFS.
Agreed, but depending on the scale you'd most likely want logging filesystems, otherwise reboots could be painful.

Regards,
Neil.
On Wed, Jan 31, 2001, Neil J. McRae wrote:
>>> But, with 200k mailboxes, you should have an automated way to do that anyway.
>> Hah. Unlink the directory, and do a background fsck every few hours? :)
> I don't know why you'd want to do the above, but you could add code to the delivery agents: when inbound email hits the system, create the required directories and files if needed [...]
Oh, I was joking about the above. Yes, you're right.
>> The trouble with the above format is that you're ignoring any locality that exists in the filesystem. [...]
> Agreed, but depending on the scale you'd most likely want logging filesystems, otherwise reboots could be painful.
Uhm, I didn't think I was going to, but I guess it's time for a plug. I modified FFS to remove its namespace and put a flat inode-based namespace in its place. It's called IFS, and it can be found in FreeBSD-current. Directory operations are done with inode numbers. One of the things on my todo list is to pass "locality information" in with the create() (i.e., say "be close to inode <foo>"). fsck'ing an IFS partition is fast, because it doesn't need to check the pathname tree.

So, there's no reason you need to try to do tricks with the UNIX directory namespace. You'd be surprised how much RAM and how many disk ops are wasted in doing lookups and attempting to cache them.

(and that's my last public post on this topic.)

Adrian
--
Adrian Chadd                    "Sex Change: a simple job of outside
<adrian@creative.net.au>         to inside plumbing." - Some random movie
participants (6): Adrian Chadd, Eric Sobocinski, Jim Mercer, Matthew Zito, Neil J. McRae, Sebastien Berube