[My first response was direct to Ross. This has been paraphrased slightly to make it useful (hopefully) to NANOG...]

-------- Original Message --------
Subject: RE: Email Complexes
Date: Wed, September 15, 2004 9:53 am

Hi Ross :)

Please don't get me wrong, I applaud your efforts, because you're right - email is huge, and most customers don't have a reasonable expectation of the service to be expected in terms of mail delivery between providers. ("What do you mean it's not there yet! I sent it 10 minutes ago!")

My point is that posting on NANOG saying 'give me an account please' for the purpose of keeping *your* customers happy strikes me as, well, interesting.

You have the resources to monitor your own mail systems by watching your outbound mail queue. Every daemon I know of has ways of monitoring the outbound queue, and verifying that you're definitely offloading mail to advertised MXs. I noticed an example for Sendmail was quoted on the list a short while ago. This is stuff you can influence - it's your systems. That's where I'd expect you to concentrate your efforts.

By extension of this, it's not unreasonable for this information to perhaps be scripted and monitored via a web interface - nagios? - and made available to your upper echelon support staff.

Hell, at one of the ISPs I worked for - as a Tier 1 and 2 support tech - I had shell access to one of the unix boxes and a command-line script which would tell me how much mail was in the queue. If this remained low, I could verify there wasn't a problem. If it spiked, then I escalated a query to the NOC to find out what the story was.

At the ISP I work for currently we don't even have that sort of information. If mail gets delayed we troubleshoot *without* that information. We're an ISP with 500,000 customers and have a team of ~15 technical specialists whose expertise approaches that of a junior NOC engineer.
They successfully deal with all manner of technical queries and they can call the NOC directly to find out if there's anything odd going on server-side. They also clearly explain to any Tier 1s (and any customers) they speak to that email is not a guaranteed service, and is delayed from time to time, and there's nothing we can do except make sure that *our systems* are working as well as possible.

Who's to say that your monitoring won't be thrown off by problems at $third_party? Parsing headers is a good way to identify total delivery times, but anything beyond your own MXs is outside of your control anyway, so outside of casual interest I see little value in actually knowing exactly what's broken at AOL and Gmail, etc. (Isn't this AOL's and Gmail's problem, not yours?)

Get queue monitoring. Script it to make the details available via the Web to your senior tech support staff. And remind your support guys that email is not guaranteed, and you'll do your level best to keep things running smoothly, but that once the mail leaves the network it's outside of your control. So once you've verified that it has in fact left your network, your job is done.

Mark.

(Disclaimer: Comments are mine and mine alone, and do not represent my employer or any previous employer for that matter.)
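
For illustration only, the kind of queue check Mark describes might look roughly like the Python sketch below. It is not the script from either ISP. The queue directory, the "qf" control-file prefix (a Sendmail-style layout is assumed) and the 500-message threshold are all assumptions to be replaced with whatever your own MTA and traffic levels call for.

```python
import os

def queue_depth(queue_dir, prefix="qf"):
    """Count queued messages by counting control files in queue_dir.

    Assumes a Sendmail-style spool where each queued message has one
    control file whose name starts with `prefix`.
    """
    return sum(1 for name in os.listdir(queue_dir) if name.startswith(prefix))

def classify(depth, warn_at=500):
    """Return "escalate" once the queue reaches warn_at, otherwise "ok".

    warn_at is a made-up threshold; pick one from your normal queue size.
    """
    return "escalate" if depth >= warn_at else "ok"

if __name__ == "__main__":
    # Hypothetical spool path; reading it usually needs elevated privileges.
    spool = "/var/spool/mqueue"
    if os.path.isdir(spool):
        depth = queue_depth(spool)
        print(depth, classify(depth))
```

The output of something like this could be published on an internal web page or fed to a monitoring system, so that support staff see "ok" or "escalate" without needing shell access.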
Let me see if I can explain your entire email.
Ensuring that email flows freely between our mail complex and other top mail provider complexes is a support issue, correct. Actually setting up the system to monitor it, and ensuring the support people get the data they need, is operations/engineering.
We like automating a lot of our procedures, as our mail complex isn't staffed 24/7. Right now we have a script that monitors incoming mail sent from probes across the US. It measures how long the email takes to first hit the IronPorts, then how long it takes to hit the Brightmail, then how long it takes to hit the MTAs. Our script uses POP3 to grab the email and parse the headers we send from the probes (or in this case from the complex to the POP accounts). Yes, I do realize some are webmail (AOL, MSN, Gmail), but a lot of the webmail providers do have POP3 servers.
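
As a rough sketch of the header-parsing idea (not our actual script), the Python below pulls messages over POP3 and works out the delay between consecutive Received hops. The function names are made up for the example, the host and credentials are placeholders, and it assumes every hop stamps a Received header of the usual "from X by Y ...; date" form with a timezone offset.

```python
import poplib
from email import message_from_string
from email.utils import parsedate_to_datetime

def fetch_messages(host, user, password):
    """Download every message in a POP3 mailbox as raw text.

    host, user and password are placeholders for a probe account.
    """
    box = poplib.POP3(host)
    try:
        box.user(user)
        box.pass_(password)
        count, _size = box.stat()
        return [b"\n".join(box.retr(i)[1]).decode("utf-8", "replace")
                for i in range(1, count + 1)]
    finally:
        box.quit()

def hop_times(raw_message):
    """Return [(receiving_host, timestamp), ...] in chronological order."""
    msg = message_from_string(raw_message)
    hops = []
    # Each relay prepends its Received header, so reverse for oldest first.
    for header in reversed(msg.get_all("Received", [])):
        route, _, stamp = header.rpartition(";")
        tokens = route.split()
        by = tokens[tokens.index("by") + 1] if "by" in tokens[:-1] else "unknown"
        try:
            hops.append((by, parsedate_to_datetime(stamp.strip())))
        except (TypeError, ValueError):
            continue  # skip headers without a parseable date
    return hops

def hop_delays(raw_message):
    """Return [(receiving_host, seconds_since_previous_hop), ...]."""
    hops = hop_times(raw_message)
    return [(later[0], (later[1] - earlier[1]).total_seconds())
            for earlier, later in zip(hops, hops[1:])]
```

Each tuple from hop_delays names the host that received the message and how many seconds it spent getting there from the previous hop, which is the per-stage figure (IronPort, Brightmail, MTA) described above. Clock skew between machines will distort these numbers, so the hosts need synchronized time for the delays to mean much.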
Our intent here is not only to verify that the email got there, but that it got there in a reasonable time (let's face it, email is becoming a more important part of life/business today).
As far as teaching the support guys to go look at the mail queue, would you honestly want them to be doing that? We have over 65 mail machines; should I trust them with checking them every 10 minutes? Since we are not staffed 24/7, what happens if we have all gone home? The way we have it set up, if the mail never reaches the complex, tier-1 gets a page; 15 minutes later, if the problem still isn't solved, tier-2 gets a page. I believe automating the system, rather than trusting a staff member to check it and praying that it doesn't break during the night, is a much better way of doing it.
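
The escalation rule is simple enough to sketch in a few lines of Python. This is only an illustration of the logic, not our paging code: the names are invented, and it assumes the "outage start" is the moment a probe message was declared missing.

```python
from datetime import datetime, timedelta

# Escalation ladder from the description above: tier-1 is paged right away,
# tier-2 is paged if the problem is still open 15 minutes later.
ESCALATION = [
    (timedelta(0), "tier-1"),
    (timedelta(minutes=15), "tier-2"),
]

def tiers_to_page(outage_started, now, already_paged=()):
    """Return the tiers that are due a page and have not been paged yet.

    outage_started is assumed to be when the probe was declared missing.
    """
    elapsed = now - outage_started
    return [tier for delay, tier in ESCALATION
            if elapsed >= delay and tier not in already_paged]

if __name__ == "__main__":
    start = datetime(2004, 9, 15, 2, 0)
    print(tiers_to_page(start, start))
    print(tiers_to_page(start, start + timedelta(minutes=15), ("tier-1",)))
```

A monitoring loop would call tiers_to_page on every pass while the outage is open, send the pages it returns, and remember which tiers it has already paged so nobody gets paged twice.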
*snip*