2006.06.06 Nick Feamster, Network-level spam behaviour
[slides are at: http://www.nanog.org/mtg-0606/pdf/nick-feamster.pdf]

Spam
  unsolicited commercial email
  Feb 2005: 90% of all email is spam
  common filtering techniques are content-based
  DNS blacklist (DNSBL) queries are a significant fraction of DNS traffic today
  Using IP-address-based spam blacklists isn't so useful.
  How spammers evade blacklists will be discussed as well.

Problems with content-based filters
  ...uh oh, some technical glitches...
  Content-based properties are malleable
    low cost of evasion
      altering content based on scripts is too easy
      customized emails are easy to generate
      content-based filters need fuzzy hashes over content, etc.
    high cost to filter maintainers
      as content changes, filters need to be updated
      constantly tweaking SpamAssassin rules is a pain
      false positives are always an issue
  Content-based filters are applied at the destination
    too little, too late: wasted network bandwidth, storage, etc.; many users receive and store the same spam content

Network-level spam filtering is robust (hypothesis)
  network-level properties are more fixed
    hosting or upstream ISP (AS number)
    botnet membership
    location in the network
    IP address block
    country?
  are there common ISPs that host the spammers, for example?
  Avoid receiving mail from machines that are part of botnets.
  Challenge: which properties are most useful for distinguishing spam traffic from legitimate email?
    very little, if anything, is known about these characteristics yet!
    Randy gave a lightning talk at the last NANOG about some of this.
  Some properties listed.

Spamming techniques
  mostly botnets, of course
  other techniques too
  we're trying to quantify this
    coordination characteristics
  how we're doing this
    correlations with Bobax victims
      from the Georgia Tech botnet sinkhole
    other possibilities: heuristics
      distance of client IP from the MX record
      coordinated, low-bandwidth sending
  looked at pcaps from a hijacked command-and-control station, capturing the bots trying to talk to it; these are spamming bots from the Bobax drone botnet, which is used exclusively to send spam.

Collection
  two domains instrumented with Mail Avenger (both on the same network)
  sinkhole domain #1
    continuous spam collection since Aug 2004
    no real email addresses--sink everything
    10 million+ pieces of spam
  sinkhole domain #2
    recently registered (Nov 2005)
    "clean control" domain, posted at a few places
    not much spam yet--perhaps being too conservative
    contact page with a random email contact; look at who crawls, and then who spams the unique email addresses
  Monitoring BGP route advertisements from the same network
  Also capturing traceroutes, DNSBL results, passive TCP host fingerprinting, simultaneous with spam arrival
    (results in this talk focus on BGP + spam only)
  Mail Avenger is not an MTA; it sits in front of the MTA and forks to sendmail or postfix. It does things like DNSBL lookups, adding headers, and passive OS fingerprinting as the spam is arriving.
  Also logged BGP routes from the same network that got the spam, to see connectivity to the spamming machine at the time.
  Picture of the collection setup at the MIT network.

Mail collection: Mail Avenger
  X-Avenger header: best guess at operating system (p0f), DNSBL lookups, traceroutes back to the mail relay at the time the mail was sent (used for debugging BGP)

Distribution across IP space
  plot of /24 prefix vs. how much spam is coming from it
  steeper lines mean more spam from that part of the IP space; you can see where spam is coming from
  a bunch comes from APNIC, cable modem space, etc.
  a few interesting things to note; still redoing the legitimate mail characteristics.
From the Georgia Tech mail machines, the data is legit mail plus spam; need to split that out better.
Between 90.* and 180.*, mainly legitimate mail.

Is IP-based blacklisting enough?
  Probably not: more than half of spamming client IPs appear less than twice.
  Roughly 50% of the IPs showed up less than twice; but that's a single sinkhole domain, and blacklisting would help more across multiple domains.
  Emphasizes the need to collaborate across multiple domains to build blacklists; any one domain won't see repeated patterns of IPs.

Distribution across ASes
  40% of spam coming from the US

BGP spectrum agility
  Log IP addresses of SMTP relays
  Join with BGP route advertisements seen at the network where the spam trap is co-located
  A small club of persistent players appears to be using this technique
    61.0.0.0/8  AS4678
    66.0.0.0/8  AS21562
    82.0.0.0/8  AS8717
  somewhere between 1-10% of all spam (some clearly intentional, others might be flapping)
  about 10 minute announcement time of the /8 while spam is flooded out
  Might be interesting to couple this with route-hijacking alerting, to sort out whether this is really a hijacking vs. a flapping legitimate route.
  A slightly different pattern: announce-spam-withdraw on a minute-by-minute basis. Really, really egregious!

Why such big prefixes?
  flexibility: client IPs can be scattered throughout dark space within a large /8
    same sender usually returns with different IP addresses
  visibility: route typically won't be filtered (nice and short prefix length)

Characteristics of IP-agile senders
  IP addresses are widely distributed across the /8 space
  IP addresses typically appear only once at the sinkhole
  Depending on which /8, 60-80% of these IP addresses were not reachable by traceroute when we spot-checked
  some IP addresses were in allocated, albeit unannounced, space
  Some AS paths associated with the routes contained reserved AS numbers
  Odd AS numbers injected, usually well-known ones, to make the path look more legitimate.
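
The join described under "BGP spectrum agility" (SMTP relay IPs matched against BGP announcements seen at the spam trap's network) could be sketched roughly as below. This is a minimal illustration under stated assumptions, not the analysis used in the talk: the record layouts, the function name spam_in_short_epochs, the 600-second threshold (chosen to mirror the "about 10 minute" announcements noted above), and the sample data (documentation-range prefixes) are all hypothetical.

```python
from ipaddress import ip_address, ip_network


def spam_in_short_epochs(spam_log, bgp_epochs, max_epoch_seconds=600):
    """Return spam arrivals covered by a short-lived BGP announcement.

    spam_log   -- iterable of (arrival_time, sender_ip) tuples (assumed layout)
    bgp_epochs -- iterable of (prefix, announced_at, withdrawn_at) tuples,
                  one per announce/withdraw pair (assumed layout)
    Times are plain seconds; any consistent clock works.
    """
    matches = []
    for arrival, sender in spam_log:
        addr = ip_address(sender)
        for prefix, announced, withdrawn in bgp_epochs:
            covered = addr in ip_network(prefix)
            live = announced <= arrival <= withdrawn
            short = (withdrawn - announced) <= max_epoch_seconds
            if covered and live and short:
                matches.append((arrival, sender, prefix))
                break  # one matching epoch is enough for this message
    return matches


# Invented sample data, for illustration only.
epochs = [
    ("198.51.100.0/24", 1000, 1600),   # announced for 10 minutes
    ("203.0.113.0/24", 0, 100000),     # long-lived route
]
spam = [
    (1200, "198.51.100.7"),   # arrives during the short announcement
    (1200, "203.0.113.9"),    # covered only by the long-lived route
    (5000, "198.51.100.8"),   # arrives after the withdrawal
]
short_lived = spam_in_short_epochs(spam, epochs)
```

A linear scan over epochs keeps the sketch short; with real BGP feeds an interval index or a prefix trie would likely be needed.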

Length of short-lived BGP epochs
  10% of spam coming from short-lived BGP events

Spam from botnets
  Example: Bobax
    approximate size: 100k bots
  one sinkhole domain--this is ONLY stuff that is verifiable as coming from bots: IPs seen at the hijacked command and control, intersected with the single sinkhole domain; so a much smaller data subset, but well correlated and verified.
  Proportionally less spam from bots in the 61-90 range; that tends to be where BGP route hijacks happen instead.
  Most bot IP addresses do not return
    65% of bots only send mail to a domain once over 18 months
    Some hang around for a *long* time; about 20% stick around for several months.
    collaborative spam filtering seems to be helping track bot IP addresses
  Most bots send low volumes of spam
    most bot IP addresses send very little spam, regardless of how long they have been spamming

Effectiveness of blacklisting
  only about half of the IPs spamming from short-lived BGP are listed in any blacklist
  spam from IP-agile senders tends to be listed in fewer blacklists
  Looked at 8 different spam blacklists, checked when the spam arrives at the sinkhole.
  Known Bobax drones are listed in more DNSBLs than the BGP-agile senders.
  About 90-95% of the Bobax drones are listed in one or more DNSBLs.
  Suggests spamming bots get listed more often than senders using other techniques--that is, bots are easier to identify than BGP-agile spammers or spammers using other techniques.

Harvesting
  tracking web-based harvesting
    register domain, set up MX record
    post, link to page with randomly generated email addresses
  Example phish: a flood of email for a phishing attack for paypal.com
    all To: addresses harvested in a single crawl on January 16th 2006
    emails received from IPs different from those that crawled
    X-Mailer headers totally different
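
The harvesting setup above (a page that hands each crawler randomly generated addresses, so later spam can be tied back to the crawl that collected them) might look something like the following. It is only a sketch of the idea: the HarvestTracker class, its method names, the example.org domain, and the sample IPs and timestamp are all hypothetical and not taken from the talk.

```python
import secrets


class HarvestTracker:
    """Hands out unique addresses per page fetch and maps spam back to the fetch."""

    def __init__(self, domain):
        self.domain = domain.lower()
        self.issued = {}  # address -> (crawler_ip, crawl_time)

    def issue_address(self, crawler_ip, crawl_time):
        # A random local part makes each address unguessable and unique,
        # so any mail sent to it must come from whoever harvested this page view.
        address = f"{secrets.token_hex(8)}@{self.domain}"
        self.issued[address] = (crawler_ip, crawl_time)
        return address

    def attribute(self, recipient, sender_ip):
        """Look up which crawl harvested the recipient address, if any."""
        record = self.issued.get(recipient.lower())
        if record is None:
            return None
        crawler_ip, crawl_time = record
        return {
            "crawler_ip": crawler_ip,
            "crawl_time": crawl_time,
            "sender_ip": sender_ip,
            # The notes report that senders and crawlers used different IPs.
            "same_host": crawler_ip == sender_ip,
        }


# Invented example: one crawl, then one spam message to the harvested address.
tracker = HarvestTracker("example.org")
bait = tracker.issue_address("192.0.2.10", 1137369600)
hit = tracker.attribute(bait, "198.51.100.5")
```

Keeping the crawl time alongside the crawler IP is what allows a flood of mail to be pinned to a single crawl, as in the phishing example.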

Lessons for better spam filters
  effective spam filtering requires a better notion of end-host identity
  distribution of spamming IP addresses is highly skewed
  detection based on network-wide, aggregate behaviour may be more fruitful than focusing on individual IPs
    large, emergent properties
  two critical pieces of the puzzle
    botnet detection
    securing the internet's routing infrastructure
  compare distributions of spam to legitimate mail; see if certain spaces are more likely to send spam than legitimate mail

Questions:
Q: Steve Bellovin, Columbia University: bots from strange ASes; is tunnelling taking place from bots to BGP speakers?
A: Not sure if there's evidence or not; some data from TORS?? but TORS latency may be too high.
Q: Fingerprinting to try to identify who is doing things; see how many hosts are actually doing this? Many addresses are being used; how many hosts does that actually represent?
A: Not sure, haven't checked that. Haven't checked on aliasing, since not much was seen from a single IP. NAT'ing? What about hosts hopping (same host using multiple IPs)? Not sure, they didn't do that correlation.
Q: Randy Bush, IIJ: they did do OS fingerprinting, so some of that is in the paper. Didn't do anything with the traceroutes, though.
Q: Matt asks what the difference between the two domains was; was one of them a recognizable word or name, or were they both random character strings?
A: They were both random character strings, but one of them had been used to host a real website for a while, which might explain why it gets such a huge volume of spam compared to the other.
Q: Matt points out that for some networks, receiving spam is actually a good thing, as it helps balance out traffic ratios, which helps during peering negotiations.
Q: Randy Bush, IIJ, responding to Matt about traffic ratios: only backbones that are on ADSL should care which way traffic goes.
:P
Curious to work with large networks, to see if filters could be installed to detect it, and possibly take action.