On Thu, Mar 28, 2002 at 03:14:53PM -0800, Gironda, Andre wrote:
> Why do you say that? In the 10/100 range, yes, no problems. But at the
> gigabit range (say with two GbE cards or a single OC-48 card) on an x86
> box with IDE disks (or even SCSI RAID0), doesn't disk I/O become a
> severe problem? Scaling disk I/O seems relatively easy, with Veritas
> Foundation Suite under Solaris or GFS under Linux:
> http://www.sistina.com/products_gfs.htm
Capturing packets for realtime analysis is an attainable goal using cheap off the shelf hardware and a little bit of clue. Storing many Gbps of data on a hard drive is a much harder task. Even using 160 gig drives, 1Gbps fills one in about 20 minutes (10 if you're recording full duplex). Unless you're the FBI, I really don't think you want to store that much data for any reason. Be smart about what you write to disk, and how you write it.
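(Back of the envelope, in case anyone wants to check the 20 minutes: 1Gbps is 125 Mbytes/sec, and 160e9 bytes / 125e6 bytes/sec is 1280 seconds, a bit over 21 minutes. Record both directions of a full duplex link and you're down around 10.)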
> However, I don't think Linux or Solaris can handle the packet capture
> capabilities like FreeBSD and BPF can. I've heard things about the new
> LPF capabilities and turbopacket, but it's just hard to believe coming
> from such a joke/toy operating system.
The data capture mechanism of BPF is pretty simple (the filter language is what's complex); I doubt even Linux can get it too wrong. All you need is a buffer in the kernel (FreeBSD defaults to 4096 bytes; libpcap turns it up to 32768, I believe, but doesn't expose the value to the user -- you should probably turn that up a bit if you want to capture at high speed). Read data from the NIC, copy it into the buffer (or preferably have the NIC be responsible for transferring it into the buffer :P), and increment the offset. Then when someone comes along to read more data, copy the buffer out into the userland buffer, use the offset value to indicate the total length, and reset the offset to 0. If you need more than 20 lines to do that part, you're probably doing it wrong. :)

In normal use of BPF the data is copied 3 times: from the NIC to an mbuf, from the mbuf to the BPF buffer if there is a configured BPF reader, and then from the BPF buffer to the user-supplied buffer when the user does a read() on the BPF descriptor. Fortunately multiple packets are batched into a single copy in stages 2 and 3.

If you want to eliminate some of those copies, you have to make a dedicated reader mechanism. Malloc the memory in userland so you get a nice page-aligned chunk, allocate the counters in userland too, and pass it all in via a character device similar to BPF. You probably want to go with a ring structure: use 2 counters as a producer and a consumer index. The kernel updates the producer index, and you update the consumer index as you process data. When both values are equal, the ring is empty. When the producer index is 1 below the consumer index, the ring is full. With an intelligent card, you pass the address of your userland-allocated memory as the place you want the RX data DMA'd. The kernel updates the producer index, discarding any data the consumer can't keep up with. Then you just have your userland program constantly scanning the ring for new data; put a usleep(1); in there and you'll stay below 0.01% CPU.

Think there would be a benefit to writing this as an extension to BPF?

-- 
Richard A Steenbergen <ras@e-gerbil.net>       http://www.e-gerbil.net/ras
PGP Key ID: 0x138EA177 (67 29 D7 BC E8 18 3E DA B2 46 B3 D8 14 36 FE B6)
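P.S. Since libpcap won't turn the buffer up for you, here's a rough sketch of doing it by hand (untested as typed; the interface name is just an example, and the kernel will clamp the size to its own maximum and hand the granted value back in blen):

#include <sys/types.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <net/if.h>
#include <net/bpf.h>
#include <err.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int
main(void)
{
	u_int blen = 524288;	/* ask for 512k instead of the 32k default */
	struct ifreq ifr;
	char *buf;
	ssize_t n;
	int fd;

	if ((fd = open("/dev/bpf0", O_RDONLY)) < 0)
		err(1, "open");
	/* BIOCSBLEN has to happen before the interface is attached */
	if (ioctl(fd, BIOCSBLEN, &blen) < 0)
		err(1, "BIOCSBLEN");
	memset(&ifr, 0, sizeof(ifr));
	strncpy(ifr.ifr_name, "fxp0", sizeof(ifr.ifr_name)); /* example nic */
	if (ioctl(fd, BIOCSETIF, &ifr) < 0)
		err(1, "BIOCSETIF");
	if ((buf = malloc(blen)) == NULL)
		err(1, "malloc");
	/* reads on bpf must be exactly the buffer size; each read returns
	 * a batch of packets, each prefixed with a struct bpf_hdr */
	while ((n = read(fd, buf, blen)) > 0)
		printf("got %ld bytes of packets\n", (long)n);
	return (0);
}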
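And since I claimed 20 lines for the store/read part, here it is in pseudo-C (made-up names, not the actual bpf.c code; locking, wakeups, and the per-packet bpf_hdr framing are left out):

struct cap_buf {
	char	*store;		/* the kernel buffer */
	u_int	 size;		/* its size */
	u_int	 offset;	/* how much is currently used */
	u_int	 drops;
};

/* called per packet from the tap point */
void
cap_append(struct cap_buf *b, const u_char *pkt, u_int len)
{
	if (b->offset + len > b->size) {
		b->drops++;			/* full; drop and count it */
		return;
	}
	memcpy(b->store + b->offset, pkt, len);
	b->offset += len;
}

/* the read(2) side: one copyout of everything buffered so far */
int
cap_read(struct cap_buf *b, char *ubuf, u_int *ulen)
{
	int error;

	if ((error = copyout(b->store, ubuf, b->offset)) == 0) {
		*ulen = b->offset;		/* total length for the caller */
		b->offset = 0;			/* reset for the next batch */
	}
	return (error);
}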
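The userland side of the ring looks about like this (again just a sketch; the ring layout is hypothetical, handle_packet() is a placeholder for whatever you actually do with the data, and size has to be a power of 2 for the masking to work -- the kernel would mask its producer index the same way):

#include <stdint.h>
#include <unistd.h>

struct slot {
	uint32_t	len;
	unsigned char	data[2048];
};

struct ring {
	volatile uint32_t prod;		/* written only by the kernel */
	volatile uint32_t cons;		/* written only by us */
	uint32_t	  size;		/* number of slots, power of 2 */
	struct slot	  slots[];	/* the RX DMA target */
};

void handle_packet(const unsigned char *p, uint32_t len);	/* placeholder */

void
consume(struct ring *r)
{
	for (;;) {
		if (r->cons == r->prod) {	/* equal means empty */
			usleep(1);		/* keeps us under 0.01% cpu */
			continue;
		}
		/* full is ((prod + 1) & (size - 1)) == cons, i.e. the
		 * producer sitting 1 below the consumer; the kernel checks
		 * that and bumps a drop counter instead of overwriting */
		handle_packet(r->slots[r->cons].data, r->slots[r->cons].len);
		r->cons = (r->cons + 1) & (r->size - 1);
	}
}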