On Wed, 31 Jan 2001, Matthew Zito wrote:
On Wed, 31 Jan 2001, Eric Sobocinski wrote:
At 11:06 AM -0500, 01/31/2001, Sebastien Berube wrote:
One way to fix this issue would be to use a hashing scheme to spread the mailboxes across a subdirectory structure. You could get something like this:
johndoe@yourdomain.com would have his mailbox in
/export/mailboxes/j/o/h/n/johndoe.mbox
so in /export/mailboxes, in order to find the j directory, you only have to search about 36 directory entries or so.
This example doesn't work as-is, though, if you accept usernames with 3 or fewer characters.
It's not hard to right-pad any short usernames before hashing. For instance, the username "bo" might hash as "bo__" and thus would end up in the directory "/export/mailboxes/b/o/_/_/bo.mbox". If you allow non-alphanumerics you'll want to translate those to something innocuous as well, or a name such as "bo.lee" will cause problems.
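To make the padding idea concrete, here is a minimal sketch in Python. The function name, the choice of "_" as both pad and replacement character, and the lowercasing are assumptions for illustration, not anything prescribed above:

```python
import re


def mailbox_path(username, depth=4, root="/export/mailboxes", pad="_"):
    """Build a mailbox path from the first `depth` characters of the username."""
    # Assumed policy: anything outside a-z and 0-9 becomes the pad character,
    # so a "." in the name can never turn into a "." directory component.
    safe = re.sub(r"[^a-z0-9]", pad, username.lower())
    # Right-pad short names so there are always `depth` characters to use.
    key = safe.ljust(depth, pad)[:depth]
    # One directory level per character, then the mailbox file itself.
    return "/".join([root, *key, username + ".mbox"])


print(mailbox_path("johndoe"))  # /export/mailboxes/j/o/h/n/johndoe.mbox
print(mailbox_path("bo"))       # /export/mailboxes/b/o/_/_/bo.mbox
print(mailbox_path("bo.lee"))   # /export/mailboxes/b/o/_/l/bo.lee.mbox
```

Only the directory components are sanitized here; the file name keeps the original username.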
Well, hashing like that works well from the standpoint that it's very easy for the software to find the mailbox. It's going to make things like backups very costly, though, because of all the nested directories. Also, you're going to end up with some very imbalanced directories, since some names occur much more frequently than others.
In order to remedy this rather easily, you can always run the username through a hashing function and use the first 'n' letters of the hash to figure out which directory the mail(box|dir) is in. That also prevents problems with non-alphanumeric characters such as ".".
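Something along these lines would do it. This is only a sketch: MD5 is an assumed choice of hash (any function with evenly spread output would work), and the function name, the lowercasing, and the default of n=2 are made up for the example:

```python
import hashlib


def hashed_mailbox_path(username, n=2, root="/export/mailboxes"):
    """Place a mailbox under the first `n` hex digits of a hash of the username."""
    # Assumption: usernames are case-insensitive, so hash the lowercased form.
    digest = hashlib.md5(username.lower().encode("utf-8")).hexdigest()
    # Each hex digit becomes one directory level, so every level has at most
    # 16 entries and the tree stays balanced whatever the names look like.
    return "/".join([root, *digest[:n], username + ".mbox"])


print(hashed_mailbox_path("johndoe"))
print(hashed_mailbox_path("bo.lee", n=3))
```

Because the directory names come from hex digits only, characters like "." in the username never reach the directory part of the path.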
If you're going to use NFS, you probably want to use something like maildir format, which is NFS-safe but becomes very costly as the number of messages increases. A lot of that has to do with the performance of the remote NFS server - the underlying filesystem's performance in reading large directories will make a BIG difference as far as that goes. Netapps have excellent large-directory performance, fwiw.
If you're looking for large scalability AND high performance, my preferred solution would be to have a relational database as the backend, but don't store any messages in it - simply pointers to their location on disk. Then store the messages without regard to intended username in a hashed directory structure. The pop3 server then gets the list of new messages from the database server, which could just be a list of filenames. Then, the pop3 server simply has to open the message to return it - it doesn't have to do an opendir(). Also, if you use the filename as the UIDL returned, there's no need to even stat() the file, again saving you a whole NFS call. The obvious downside is that you can't just do a:
rm -f /users/j/o/h/n/johndoe.mbx
But, with 200k mailboxes, you should have an automated way to do that anyway.
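A rough sketch of the database-as-index idea follows. Everything specific in it is an assumption for illustration: SQLite stands in for whatever relational database you'd really use, and the table layout, function names, two-level hash, and UUID-based filenames are invented for the example:

```python
import hashlib
import os
import sqlite3
import uuid


def open_index(db_path=":memory:"):
    """Open the index database; it holds pointers only, never message bodies."""
    db = sqlite3.connect(db_path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS messages ("
        " mailbox TEXT NOT NULL,"
        " filename TEXT NOT NULL UNIQUE)"
    )
    db.execute("CREATE INDEX IF NOT EXISTS by_mailbox ON messages (mailbox)")
    return db


def message_path(spool, filename):
    """Locate a message from its filename alone; the username plays no part."""
    h = hashlib.md5(filename.encode("ascii")).hexdigest()
    return os.path.join(spool, h[0], h[1], filename)


def deliver(db, spool, mailbox, body):
    """Write the message to the hashed spool, then record a pointer to it."""
    filename = uuid.uuid4().hex  # doubles as the UIDL
    path = message_path(spool, filename)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "wb") as f:
        f.write(body)
    with db:  # commits the insert on success
        db.execute(
            "INSERT INTO messages (mailbox, filename) VALUES (?, ?)",
            (mailbox, filename),
        )
    return filename


def list_uidls(db, mailbox):
    """What the pop3 server asks the database for: a plain list of filenames."""
    rows = db.execute(
        "SELECT filename FROM messages WHERE mailbox = ? ORDER BY rowid",
        (mailbox,),
    )
    return [row[0] for row in rows]


def retrieve(spool, filename):
    """Open the file directly; no directory listing is needed."""
    with open(message_path(spool, filename), "rb") as f:
        return f.read()
```

Deleting a mailbox then becomes a query for its filenames followed by an unlink of each file and a DELETE on the table, which is the kind of automated cleanup you'd want at this scale anyway.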
It also makes backups a nightmare. In that case, you'll have to shut down the entire mail system before you can back up, or you'll have a database image that doesn't match the actual data you have on your NAS.
Thanks, Matt
-- Sebastien Berube Operation Center Systems Administrator sberube@zeroknowledge.com In Gary we trust.