Apologies to all on handheld devices. If you're not into BSD or Linux TC operationally, skip this post.

Due to my usual rambling narrative style for "alternative" troubleshooting I was going to mail this direct to the OP, but I was persuaded AMBJ by a co-conspirator to post this to the list in full.

# @all with similar "traffic shaping" problems Googling in the future:

On Wed, 2009-12-09 at 12:07 +1100, Simon Horman wrote:
> but trying to use much more than 90% of the link capacity
......though not directly relevant in this case, for lower speed links and things like xDSL to the CPE that 90% must include protocol overheads (you are getting close to the bone in that last 10%) and _much_ more affective (<- that's A-ffective) things like actual modem "sync speed". It depends how the TC is calc'ed/applied of course. Just a general note for a more CPE-oriented occurrence of this. So kids, if you're struggling with your IPCOP in a SOHO shop with ADSL+PPPoE, this means you!

#### Meanwhile, back at our level.......

@all generally: do many of us use Linux TC at small-carrier level? I know of a lot of BSD boxen out there that handle huge complex flows but I suspect the Linux kernel is less popular for this - or am I assuming wrong? Personally I'd lean to BSD for big stuff and Linux for CPE; am I out of touch nowadays?

#### Fully back on topic from here on.......

@Chris - I've not used RED in any anger, sorry. Unless it's just a typo in the config for the affected queue (maybe an extra digit loose somewhere?), things are definitely going to get complicated. Is something exceeding a tc bucket mtu occasionally?

Chris <chris@ghostbusters.co.uk> wrote:
> My thoughts are that any dropped packets on the parent class is a bad thing:
yes, generally speaking, but.....
> qdisc htb 1: root r2q 10 default 265 direct_packets_stat 448 ver 3.17
>  Sent 4652558768 bytes 5125175 pkt (dropped 819, overlimits 10048800 requeues 0)
>  rate 0bit 0pps backlog 0b 28p requeues 0
... in the above example, that loss rate is extremely low at about 0.016% (819 / 5125175). It may not be a representative sample, but I just thought I'd check you hadn't dropped a few significant digits in a %loss calc along the way :) That level of loss is operationally insignificant of course, especially for TCP.

As you are I'm sure aware, perfect TC through any box is pretty specialist and usually unique to that placement. Without any graphical output, queues and the like are extremely difficult to visualize (mentally) under load (though for smaller boxes the RRD graphs in pfSense are nicely readable - see below). Because of this I usually try to eliminate ~everything~ else before I get into qdiscs and the nitty-gritty. As a natural control fr/geek I've wasted far too many hours stuck in the buckets, to no real improvement in many cases.

Chris <chris@ghostbusters.co.uk> wrote:
> I've isolated it to the egress HTB qdisc
good, though read on for a strange tale.

You MUST make a distinction between TC dropping the packets and the interface dropping the packets; I see in your later post a TC qdisc line showing that tc itself had dropped packets, BUT it ALWAYS pays to check at the same time (using ifconfig) that no packets are reported being dropped by the interfaces as well. I've had 2 or 3 occasions where `TC drops` were actually somehow linked to _interface_ drops and it really threw me; we never did work out why. The interaction confounded us totally.

IF the INTERFACES are ALSO dropping in ifconfig, THEN, and ONLY then, you are into the lowest layer.

So, with that in mind and the sheer complexity of possibilities, here's how I personally approach difficult BSD/Linux "TC problems". Note that I have zero experience of or inclination towards Cisco TC:

Kick the tyres!

A lot of people mentioned layer 2 link-config problems, but as far as I can see, no-one has suggested quickly yanking the cables and blowing the dust off the ends. Whenever I have to reach for a calculator or pen for a problem, I first swap out the interconnects to reduce the mental smoke ;)

Next, I check whether the NICs are unseated (if applicable), then CPU load (think: rogue process - use top), or even bus utilisation if you have only 32-bit PCI NICs in a busy box. Next, does the box do anything else like Snort/Squid/etc at the same time?
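For anyone who wants to script the "is it TC or is it the interface?" check described above, here is a minimal Python sketch. It is only an illustration under some assumptions: the function names are made up for this example, the tc text is assumed to look like the `tc -s qdisc` output quoted earlier, and the interface counters are read from Linux's /proc/net/dev (the same counters ifconfig reports) rather than by parsing ifconfig itself.

```python
import re


def tc_drop_rate(qdisc_stats: str) -> float:
    """Return dropped/sent packets as a percentage, from `tc -s qdisc` text.

    Assumes the counters appear as 'Sent N bytes N pkt (dropped N, ...'.
    """
    m = re.search(r"Sent \d+ bytes (\d+) pkt \(dropped (\d+)", qdisc_stats)
    if m is None:
        raise ValueError("no 'Sent ... pkt (dropped ...' counters found")
    sent, dropped = int(m.group(1)), int(m.group(2))
    return 100.0 * dropped / sent if sent else 0.0


def interface_drops(proc_net_dev: str) -> dict:
    """Map interface name -> (rx_dropped, tx_dropped) from /proc/net/dev text.

    After the 'name:' prefix each line carries 8 receive counters followed
    by 8 transmit counters; 'drop' is the 4th counter of each group.
    """
    drops = {}
    for line in proc_net_dev.splitlines():
        if ":" not in line:
            continue  # the two header lines contain no colon
        name, counters = line.split(":", 1)
        fields = counters.split()
        if len(fields) < 12:
            continue  # not a counter line we understand
        drops[name.strip()] = (int(fields[3]), int(fields[11]))
    return drops


# Counters from the qdisc line quoted earlier in this thread:
quoted = "Sent 4652558768 bytes 5125175 pkt (dropped 819, overlimits 10048800 requeues 0)"
print(f"tc drop rate: {tc_drop_rate(quoted):.4f}%")
```

On a live Linux box you could feed `interface_drops()` the contents of /proc/net/dev at the same moment you capture the tc counters; if tc reports drops while every interface still shows zero, you are probably still in qdisc territory rather than the lowest layer.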
To eliminate weirdness and speed up troubleshooting if TC is acting strange, I'd run tcpdump continually from the very start of my troubleshooting, dumping into small 10MB-ish files - use the special -C option ("split to filesize") and the -W option to set about 100 files in a ring buffer, so that you have a decent history to go back through if you need it without clogging the filesystem of the box with TBs of packet data :) (splitting them into 10MB files at the start leads to fast analysis in the shark, though you could carve up larger files manually I guess)

That way, if the TC hurts your brain, run the dumps through wireshark's "expert info" filter while you have a coffee. (Analyze > Expert Info, I think?) It's just in case something external or unusual is splattering the interfaces into confusion. It will only take a minute or less to run this analysis with an "affected" dump, as 10MB is very manageable and you can select the relevant dumpfile from its last access time. Don't waste any time viewing them manually, just a glance.

Remember to kill the tcpdumps when you find the problem though, scrubbing the files if needed for compliance etc. If you need to run tcpdump for a really long time I'd suggest setting it up with setuid because it usually needs to run as root. Personally I get nervous on important perimeter devices dumping during a coffeebreak ;)

When I'm trying to get my head around flows through "foreign" Linux boxen I tend to use "iftop" for a couple of minutes or so, just to get the feel of connections and throughputs actually travelling through it. I sometimes run it over SSH continually on another monitor when dealing with critical flow boxes that show problems. If you throw a config "switch" somewhere it's nice to see the effect visually, though be careful: it runs as root, so again don't leave it going 24/7, just while you are fiddling {cough} adjusting. Again, for longterm watching try to run it setuid.
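The tcpdump ring-buffer capture described above can be sketched roughly as follows. This is a hedged illustration, not a recommended production setup: the helper names, the interface "eth0" and the output prefix are all hypothetical. The tcpdump options used are -i (interface), -n (no name resolution), -C (start a new file after roughly that many million bytes), -W (limit the number of files, wrapping around to the first when used with -C) and -w (output file prefix).

```python
import glob
import os


def tcpdump_ring_argv(interface, out_prefix, file_mb=10, file_count=100):
    """Build a tcpdump command line for a ring buffer of small capture files.

    With the defaults this keeps about 100 files of about 10 MB each,
    overwriting the oldest once the ring is full.
    """
    return [
        "tcpdump",
        "-i", interface,       # capture on this interface
        "-n",                  # do not resolve addresses to names
        "-C", str(file_mb),    # rotate after ~file_mb million bytes
        "-W", str(file_count), # keep at most this many files (ring buffer)
        "-w", out_prefix,      # file name prefix; tcpdump appends a number
    ]


def dumps_by_mtime(out_prefix):
    """Return existing dump files for this prefix, most recently written first."""
    files = glob.glob(glob.escape(out_prefix) + "*")
    return sorted(files, key=os.path.getmtime, reverse=True)


# Hypothetical interface and path, shown only to illustrate the command:
print(" ".join(tcpdump_ring_argv("eth0", "/var/tmp/tc-ring")))
```

The printed command would normally be run as root (for example via subprocess or simply pasted at the shell), and `dumps_by_mtime()` gives a quick way to pick the file covering the moment the problem appeared before opening it in wireshark.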
I'd set up a few iperf flows to stimulate your TC setup, or use netcat, scp or similar to push some files through to /dev/null at the other end; use "trickle" to limit the flow rates to realistic or less operationally-damaging levels during testing. wfm. Adding 9 flows of about 10% of link capacity each should give tc some thinking work to do in an already active network. Script it all to run for only a few seconds at a time in pulses, rather than saturating the operational link for an hour on end, or the phone won't stop haha.

If your queues are all port-based (depends how you're feeding flows into the tc mechanism I suppose), set up "engineering test queues" on high ports and force iperf to use these high ports while you test inline. If the box isn't yet in service, this obviously isn't an issue.

Now, IF there are NO drops reported by ifconfig or kernel messages, just drops reported by the TC mechanism, it gets complicated. Only THEN do I reach for a calculator (and I also print out the relevant man pages!)

But! There is one more rapid technique available you shouldn't ignore:

Swapouts:
---------

TC is hard to get perfect using any vendor, so eliminate the hardware and configs in one swoop if you can!

If you feel like trying a swapout (not sure if `availability` will allow in this case), a modern mobo running pfSense will allow a quick "have I done something stupid on the Linux box?" comparison. I suggest pfSense because it has a reputation for fast throughput and {cringe} a "wizard" in the web GUI, so you can have it set up a set of TC rules rapidly. I'd run it direct from the liveCD for a quick comparison; give it min. 256MB RAM and a P4 or higher for G-eth speeds with shaping but no ipsec. (this is overkill, but you must eliminate doubt in this case) Allow 10 seconds to swap the interconnects once it's config-ed though; this could be more than you can allow for downtime?
Dunno.

Another `swapout` option, but actually a same-box alternative, is setting up simple TC yourself manually using "tc" at the shell or (better) a simple script, instead of %whatever-you-are-using-right-now% (possibly a flat scripted config file for tc? or maybe some fancy custom web-thingy?) Flattening the tc config this way for a couple of hours can give a comparison, though it all depends on the desired availability/quality and whether good shaping is essential 24/7 on a saturated link.

Luckily, you hint that your hardware is significantly better than i686 I think? If we knew more about the actual hardware and the flows through it, plus a little about the adjacent topology, we could all offer some hardware sizing comments in case you're pushing something over its limit.

Finally, I've seen more than a few examples of people using old P3-era hardware for heavy duty throughput. It can work well (especially with PCI-X) but NEVER assume that layer one ends at the RJ45. It goes inside the case, and a significant distance sometimes: all prone to heat/dust/fluff/broken solder/physical alignment problems. In years gone by, mis-seated AGP cards would take ethernet links down then up again on hot days. In these roles, your old leaky PSU and mobo capacitors can lead you on a merry dance for a l-o-n-g time.

Pained memories :)

regards,
Gord