Internet email performance study
Hi,
(we previously posted this on the e2e mail list; apologies if you are reading it for the second time)
We're looking for operational-types lurking on the list with experience running large mail servers. In particular, we have collected a large amount of data as part of an Internet email performance study that we cannot entirely explain. If you can help us or are simply curious about our findings, we'd love to hear from you.
WHAT WE DID: Briefly, we used SMTP bounce-backs as the basis of an email active measurement survey. Using random addresses as unique identifiers, we measure latency, loss, paths, etc. to a large set of Internet MTAs. Approximately 1/3 of all servers we surveyed respond with bounce-backs. We've found some interesting results, for example latencies of days (30 days in one instance).
WHAT WE DON'T UNDERSTAND: Most servers behave as we expect, either always replying with bounce-backs or never replying. However, some exhibit odd and seemingly non-deterministic behavior. For example, a server will respond to all emails for weeks, and then reply to only a fraction (e.g., 25-75%) of the emails in a seemingly random pattern for some period of time (e.g., 4 hours). Further, we often see these patterns correlated within a domain (e.g., a subset of the MTAs will enter and exit this loss mode at the same time). We are fairly certain that the loss is an artifact of the MTA behavior or local administration. While we can guess reasons this might occur, we have yet to find an administrator who can explain this behavior with an architecture used in practice.
More details on the project including our exact methodology, plausible explanations for the loss and a FAQ are available on our web site: http://ana.lcs.mit.edu/emailtester
Thanks!
Rob Beverly / Mike Afergan
----- Original Message -----
From: "Robert Beverly" <rbeverly@rbeverly.net>
To: <nanog@merit.edu>
Cc: <afergan@mit.edu>
Sent: Thursday, April 28, 2005 22:21
Subject: Internet email performance study
Hi,
(we previously posted this on the e2e mail list; apologies if you are reading it for the second time)
We're looking for operational-types lurking on the list with experience running large mail servers. In particular, we have collected a large amount of data as part of an Internet email performance study that we cannot entirely explain. If you can help us or are simply curious about our findings, we'd love to hear from you.
WHAT WE DID: Briefly, we used SMTP bounce-backs as the basis of an email active measurement survey. Using random addresses as unique identifiers, we measure latency, loss, paths, etc. to a large set of Internet MTAs. Approximately 1/3 of all servers we surveyed respond with bounce-backs. We've found some interesting results, for example latencies of days (30 days in one instance).
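As a rough illustration of using a random address as a unique identifier, the sketch below generates a random local part for each probe and later matches a bounce-back to its probe to compute latency. The names `new_probe_address` and `ProbeLog` are hypothetical and the token length is an assumption; the study's actual tooling is described on the project web site, not here.

```python
import secrets


def new_probe_address(domain):
    """Return (token, address) where the random token is the local part."""
    token = secrets.token_hex(8)  # 16 hex characters, assumed long enough to be unique
    return token, f"{token}@{domain}"


class ProbeLog:
    """Remembers when each probe was sent so a bounce can be matched to it."""

    def __init__(self):
        self._sent = {}  # token -> send time in seconds

    def record_send(self, token, sent_at):
        self._sent[token] = sent_at

    def bounce_latency(self, bounced_recipient, received_at):
        """Latency in seconds, or None if the bounce matches no known probe."""
        token = bounced_recipient.split("@", 1)[0]
        sent_at = self._sent.get(token)
        if sent_at is None:
            return None
        return received_at - sent_at
```

Because the address never exists at the target, any bounce that names it can only have been triggered by that one probe.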
WHAT WE DON'T UNDERSTAND: Most servers behave as we expect, either always replying with bounce-backs or never replying. However, some exhibit odd and seemingly non-deterministic behavior. For example, a server will respond to all emails for weeks, and then reply to only a fraction (e.g., 25-75%) of the emails in a seemingly random pattern for some period of time (e.g., 4 hours). Further, we often see these patterns correlated within a domain (e.g., a subset of the MTAs will enter and exit this loss mode at the same time). We are fairly certain that the loss is an artifact of the MTA behavior or local administration. While we can guess reasons this might occur, we have yet to find an administrator who can explain this behavior with an architecture used in practice.
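One way to spot the partial-response episodes described above is to bucket a server's probes into fixed time windows and look for windows where only some probes drew a bounce-back. This is a minimal sketch under assumptions: the function names are hypothetical, and the 4-hour default window is borrowed from the example in the text rather than from the study's analysis.

```python
from collections import defaultdict


def response_fraction_by_window(probes, window_s=4 * 3600):
    """probes: iterable of (sent_at_seconds, got_bounce) for ONE server IP.

    Returns {window_start_seconds: fraction of probes that drew a bounce}.
    """
    counts = defaultdict(lambda: [0, 0])  # window start -> [total, replied]
    for sent_at, got_bounce in probes:
        start = int(sent_at // window_s) * window_s
        counts[start][0] += 1
        if got_bounce:
            counts[start][1] += 1
    return {start: replied / total
            for start, (total, replied) in sorted(counts.items())}


def partial_windows(fractions):
    """Windows where the server replied to some, but not all, probes."""
    return [start for start, frac in fractions.items() if 0.0 < frac < 1.0]
```

Comparing the partial windows of several IP addresses within one domain would then show whether they enter and exit the loss mode together.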
Well, there could be many reasons for that, depending on how you probe the SMTP servers. Some sysadmins block IP addresses that look like a spammer probing for addresses to send spam to; spammers always try to find a catch-all mailbox to flood with messages addressed to anything@thatdomain.com.
Another possibility is that the domains you are monitoring are on dynamic IP addresses that change all the time, and the gap when they become non-responsive could be due to a delay in updating the DNS roots with the new IP address. It could also be a non-dedicated mail server, meaning the server is also used for web and DNS and, when overloaded, tries to shed some load; usually the first service to be disabled is SMTP.
Or the domain has a lower-priority mail server which happens to be down for maintenance, but your DNS server is caching the data (IP address) of that mail server. That should not happen, as the sender has to retry the other MX record, but it remains a possibility.
I have not yet looked at the details at your URL, but there are a number of things to consider when doing such a survey.
1. Where is your monitoring server located in relation to the servers / domains being monitored? You need to establish a datum for how far away that server or domain is, using PING to see how long a packet takes on the round trip, just to rule out networking / routing issues that may interfere with the results you need for the responses of the MTAs.
2. Study the domain using dig to find its MX records and DNS servers, and whether there is a backup DNS server somewhere near your network.
3. Of course, as indicated above, you need to find out whether the IP of that domain is static or dynamic.
4. Also, you need to monitor the load on your own server and DNS responses.
What I'd suggest is to use MRTG to monitor the round-trip time using PING on the servers being monitored, so you have real live data that helps in establishing your final findings.
Also, don't forget that some MTA users set up their SMTP servers with a filter that rejects SMTP traffic that does not behave normally during the SMTP greeting.
If you need any further information or some logs, please send me an email at info@riyadhub.com
More details on the project including our exact methodology, plausible explanations for the loss and a FAQ are available on our web site: http://ana.lcs.mit.edu/emailtester
Thanks!
Rob Beverly / Mike Afergan
aljuhani
On Thu, Apr 28, 2005 at 11:21:07PM +0300, aljuhani wrote:
Another possibility is that the domains you are monitoring are on dynamic IP addresses that change all the time, and the gap when they become non-responsive could be due to a delay in updating the DNS roots with the new IP address. It could also be a non-dedicated mail server, meaning the server is also used for web and DNS and, when overloaded, tries to shed some load; usually the first service to be disabled is SMTP.
Or the domain has a lower-priority mail server which happens to be down for maintenance, but your DNS server is caching the data (IP address) of that mail server. That should not happen, as the sender has to retry the other MX record, but it remains a possibility.
Hi aljuhani,
We do not consider an email lost unless it is successfully delivered to a server (which is defined by its IP address; the IP address of a server is the atomic unit of testing, not the domain). By this I mean that during the SMTP exchange we monitor the response codes we get back, and we only count an email as lost if we don't receive a bounce-back after a complete series of positive response codes. If we can't connect to a server, as in the cases in your comments above, we never consider the email successfully delivered, and hence it can never be called lost.
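The loss definition above can be sketched as a small classifier. This is an illustration only: the `Probe` record, the `Outcome` names, and the rule that every reply code must be 2xx or 3xx are assumptions made for the example, not the study's actual code.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class Outcome(Enum):
    ERROR = "error"      # SMTP exchange did not complete with positive codes
    BOUNCED = "bounced"  # accepted, and a bounce-back was received
    LOST = "lost"        # accepted, but no bounce-back was ever received


@dataclass
class Probe:
    server_ip: str                     # the atomic unit of testing
    reply_codes: List[int]             # SMTP reply codes seen; empty if connect failed
    bounce_latency_s: Optional[float]  # None if no bounce-back arrived


def accepted(reply_codes):
    """True only for a complete series of positive (2xx/3xx) SMTP replies."""
    return bool(reply_codes) and all(200 <= code < 400 for code in reply_codes)


def classify(probe):
    if not accepted(probe.reply_codes):
        return Outcome.ERROR  # never counted as a loss
    if probe.bounce_latency_s is None:
        return Outcome.LOST
    return Outcome.BOUNCED
```

Under this rule a refused connection or a 5xx rejection lands in `ERROR`, so blocking at connect time cannot show up as loss.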
I have not yet looked at the details at your URL, but there are a number of things to consider when doing such a survey.
1. Where is your monitoring server located in relation to the servers / domains being monitored? You need to establish a datum for how far away that server or domain is, using PING to see how long a packet takes on the round trip, just to rule out networking / routing issues that may interfere with the results you need for the responses of the MTAs.
2. Study the domain using dig to find its MX records and DNS servers, and whether there is a backup DNS server somewhere near your network.
3. Of course, as indicated above, you need to find out whether the IP of that domain is static or dynamic.
Yes, our preprocessing step involves separating a domain into all of the constituent IP addresses of the MXes serving that domain. The paper has lots of details on this, but the major point again is that we are testing IP addresses rather than domains.
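A minimal sketch of that preprocessing idea, assuming the MX and address lookups have already been done and are passed in as plain dictionaries. The function name `targets_by_ip` and both input shapes are hypothetical; the paper describes the real procedure.

```python
def targets_by_ip(mx_records, host_addresses):
    """Expand domains into the IP addresses of their MX hosts.

    mx_records:     domain -> list of MX hostnames (already resolved)
    host_addresses: MX hostname -> list of IP addresses (already resolved)
    Returns {ip: set of domains served by that IP}, so each IP is tested once.
    """
    targets = {}
    for domain, mx_hosts in mx_records.items():
        for host in mx_hosts:
            for ip in host_addresses.get(host, []):
                targets.setdefault(ip, set()).add(domain)
    return targets
```

Keying the result by IP address means a server shared by several domains is probed as one unit, and no DNS lookup is needed at test time.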
4. Also, you need to monitor the load on your own server and DNS responses.
DNS is not an issue for testing; as I said above, we preprocess all of the domains to generate the IP addresses of MTAs, which are the atomic unit of testing. As for the load, we did rather extensive load testing on our own server before putting the system into production.
What I'd suggest is to use MRTG to monitor the round-trip time using PING on the servers being monitored, so you have real live data that helps in establishing your final findings.
I'm not entirely certain that MRTG records would provide any useful data, at least not at the granularity that would be needed to say anything definitive.
Also, don't forget that some MTA users set up their SMTP servers with a filter that rejects SMTP traffic that does not behave normally during the SMTP greeting.
Yes, our SMTP greetings are valid and up to spec. Again, it's the non-deterministic loss that we're most concerned about. If there were a problem with the SMTP exchange, we would see our emails always rejected (for instance). Our measurement study only includes emails that were successfully delivered (indicated by a complete series of successful status codes returned during SMTP exchange).
Many thanks,
rob
On Thu, Apr 28, 2005 at 23:42, "Robert Beverly" <rbeverly@rbeverly.net>
......snip
Yes, our SMTP greetings are valid and up to spec. Again, it's the non-deterministic loss that we're most concerned about. If there were a problem with the SMTP exchange, we would see our emails always rejected (for instance). Our measurement study only includes emails that were successfully delivered (indicated by a complete series of successful status codes returned during SMTP exchange).
Many thanks,
rob
Hi,
Perhaps this explains it.
http://www.albury.net.au/netstatus/derouted.html
BTW your subnet (18.0.0.0/8) is listed there as well.
Regards,
aljuhani
aljuhani wrote:
On Thu, Apr 28, 2005 at 23:42, "Robert Beverly" <rbeverly@rbeverly.net>
......snip
Yes, our SMTP greetings are valid and up to spec. Again, it's the non-deterministic loss that we're most concerned about. If there were a problem with the SMTP exchange, we would see our emails always rejected (for instance). Our measurement study only includes emails that were successfully delivered (indicated by a complete series of successful status codes returned during SMTP exchange).
Many thanks,
rob
Hi,
Perhaps this explains it.
No, it doesn't. Please read their paper. In the paper and as he stated again in the response above, their definition of a "loss" requires the message to be delivered successfully in the first place. The anti-spam measure described in the above URL causes the remote MTA to not accept mail at all from the blocked source. This would not be counted as a loss in their methodology, but possibly as an "error."
BTW your subnet (18.0.0.0/8) is listed there as well.
I don't see it there. And those are not censures of the entire /8 networks, but just a list of how many individual hosts in that network are currently blocked.
--
Crist J. Clark crist.clark@globalstar.com
Globalstar Communications (408) 933-4387
participants (3)
- aljuhani
- Crist Clark
- Robert Beverly