representativeness of flow data based on samples
Traffic measurement techniques such as NetFlow work by associating some characteristics of inbound packets on an interface with a flow, e.g. some tuple like (source addr, source port, dest addr, dest port, protocol). Counters per flow are incremented, and the numbers are exported periodically or when flows become inactive.

There are a few vendors who now provide traffic export from high-speed interfaces by sampling those interfaces at a particular rate, and using the sampled packets to populate the per-flow counters, rather than looking at every packet.

Does anybody here know of recent research with real internet traffic which compares different sample rates wrt the representativeness of the resulting flow data?

For example, if I am trying to rank the top traffic sinks for my network beyond an attached peer (i.e. an ordinal rather than cardinal measurement), will I get different answers if I use a sampling rate of 1:1000 compared to 1:50, given a statistically "long enough" measurement period?

Intuitively, it seems to me that the answers should be the same. However, it also seems to me that statistics are frequently non-intuitive.

Joe
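(The flow-accounting scheme Joe describes can be sketched in a few lines. This is a hypothetical illustration, not any vendor's implementation: packets are keyed by the 5-tuple, only roughly 1-in-`rate` packets are inspected, and sampled counts are scaled back up by `rate` to estimate the true per-flow totals. The function name `sampled_flow_counts` is mine.)

```python
import random
from collections import Counter

def sampled_flow_counts(packets, rate, seed=0):
    """Estimate per-flow packet counts, inspecting only ~1-in-`rate` packets.

    `packets` is an iterable of 5-tuples:
        (src_addr, src_port, dst_addr, dst_port, protocol).
    Each sampled packet adds `rate` to its flow's counter, so the
    counter is an unbiased estimate of the flow's true packet count.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    counts = Counter()
    for pkt in packets:
        if rng.randrange(rate) == 0:   # sample 1 in `rate` packets
            counts[pkt] += rate        # scale the estimate back up
    return counts
```

Ranking flows by these estimated counters, rather than by exact counters, is exactly where the 1:50-vs-1:1000 question arises: small flows may be missed entirely, and flows of similar size can swap ordinal positions.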
### On Wed, 30 Jan 2002 14:02:30 -0500, Joe Abley <jabley@automagic.org>
### casually decided to expound upon nanog@merit.edu the following thoughts
### about "representativeness of flow data based on samples":

JA> For example, if I am trying to rank the top traffic sinks for my
JA> network beyond an attached peer (i.e. an ordinal rather than cardinal
JA> measurement), will I get different answers if I use a sampling rate
JA> of 1:1000 compared to 1:50, given a statistically "long enough"
JA> measurement period?

I suspect that it will just determine the smoothness of your statistics over the long run, which I assume is what you're interested in. It will depend on the ballpark expected packet flow. One might ask the question of "how close do things seem/need to be?"

The sampling run time has to exceed the sampling interval by a sufficient margin, because the absolute sampling error grows as the square root of the number of samples (so the relative error shrinks as that square root). So what does a per-sample loss mean to you? And how much error can you tolerate? Figure that out and you can narrow in on an appropriate sampling period.

--
/*===================[ Jake Khuon <khuon@NEEBU.Net> ]======================+
 | Packet Plumber, Network Engineers     /| / [~ [~ |) | |  --------------- |
 | for Effective Bandwidth Utilisation  / |/ [_ [_ |) |_| N E T W O R K S   |
 +=========================================================================*/
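(Jake's square-root rule can be made concrete. With 1-in-`rate` sampling, a flow of N packets yields roughly n = N/rate samples, and the estimated count behaves approximately like a binomial variable, so its relative standard error is about 1/sqrt(n). A back-of-the-envelope sketch, using a hypothetical flow of 10 million packets over the measurement period:)

```python
import math

def relative_error(total_packets, rate):
    """Approximate relative standard error of a sampled flow count.

    With 1-in-`rate` sampling, a flow of `total_packets` packets yields
    about n = total_packets / rate samples; the relative error of the
    scaled-up estimate is roughly 1/sqrt(n).
    """
    n = total_packets / rate
    return 1.0 / math.sqrt(n)

for rate in (50, 1000):
    err = relative_error(10_000_000, rate)
    print(f"1:{rate}: ~{err:.1%} relative error")
# 1:50:   ~0.2% relative error
# 1:1000: ~1.0% relative error
```

On this rough model, both rates resolve a flow that big quite well; it is only when two flows' true sizes differ by less than these error bars that 1:1000 and 1:50 could plausibly return different ordinal rankings.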
Does anybody here know of recent research with real internet traffic which compares different sample rates wrt the representativeness of the resulting flow data?
You might find this related talk useful:

http://www.research.att.com/~duffield/pubs/usage-imw2001.pdf

There has been more detailed work done, but I'm not sure if it has been published or released yet.

-fred
participants (3)
-
Fred True
-
Jake Khuon
-
Joe Abley