"Pipeline" in the context of networking chips is not a terribly well-defined term. In some chips, you'll have a pipeline that is built from very rigid hardware logic blocks -- the first block does exactly one part of the packet forwarding, then hands the packet (or just the header and metadata) to the second block, which does another portion of the forwarding. You build the pipeline out of as many blocks as you need to solve your particular networking problem, and voila!
"Pipeline", in the context of networking chips, is not a terribly well-defined term! In some chips, you'll have an almost-literal pipeline that is built from very rigid hardware logic blocks. The first block does exactly one part of the packet forwarding, then hands the packet (or just the header and metadata) to the second block, which does another portion of the forwarding. You build the pipeline out of as many blocks as you need to solve your particular networking problem, and voila!
The advantage here is that you can make things very fast and power efficient, but the result isn't all that flexible, and deity help you if you ever need to do something in a different order than your pipeline!
You can also build a "pipeline" out of software functions - write up some Python code (because everyone loves Python, right?) where function A calls function B and so on. At some level, you've just built a pipeline out of different software functions. This is going to be a lot slower (C code will be faster but nowhere near as fast as dedicated hardware) but it's WAY more flexible. You can more or less dynamically build your "pipeline" on a packet-by-packet basis, depending on what features and packet data you're dealing with.
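To make the software version concrete, here is a minimal sketch of a per-packet "pipeline" built out of Python functions. The stage names, the packet fields, and the toy ACL rule are all made up for illustration; this is not how any particular product does it.

```python
# Hypothetical stages; each takes a packet (a dict here) and returns it.
def parse_ethernet(pkt):
    pkt["l2_parsed"] = True
    return pkt

def decrement_ttl(pkt):
    pkt["ttl"] -= 1
    return pkt

def apply_acl(pkt):
    # Toy rule (an assumption for the example): block telnet.
    pkt["permitted"] = pkt["dst_port"] != 23
    return pkt

def build_pipeline(pkt):
    """Choose the list of stages for this particular packet."""
    stages = [parse_ethernet]
    if pkt.get("is_ip"):
        stages.append(decrement_ttl)
    if pkt.get("acl_enabled"):
        stages.append(apply_acl)
    return stages

def forward(pkt):
    # Run the packet through whatever pipeline was built for it.
    for stage in build_pipeline(pkt):
        pkt = stage(pkt)
    return pkt

result = forward({"is_ip": True, "acl_enabled": True, "ttl": 64, "dst_port": 443})
print(result)
```

A non-IP packet with no ACL would only run through parse_ethernet, which is exactly the kind of per-packet flexibility that rigid hardware blocks can't give you.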
"Microcode" is really just a term we use for something like "really optimized and limited instruction sets for packet forwarding". Just like an x86 or an ARM has some finite set of instructions that it can execute, so do current networking chips. The larger that instruction space is and the more combinations of those instructions you can store, the more flexible your code is. Of course, you can't make that part of the chip bigger without making something else smaller, so there's another tradeoff.
MOST current chips are really a hybrid/combination of these two extremes. You have some set of fixed logic blocks that do exactly One Set Of Things, and you have some other logic blocks that can be reconfigured to do A Few Different Things. The degree to which the programmable stuff is programmable is a major input to how many different features you can do on the chip, and at what speeds. Sometimes you can use the same hardware block to do multiple things on a packet if you're willing to sacrifice some packet rate and/or bandwidth. The constant "law of physics" is that you can always do a given function in less power/space/cost if you're willing to optimize for that specific thing -- but you're sacrificing flexibility to do it. The more flexibility ("programmability") you want to add to a chip, the more logic and memory you need to add.
From a performance standpoint, on current "fast" chips, many (but certainly not all) of the "pipelines" are designed to forward one packet per clock cycle for "normal" use cases. (Of course we sneaky vendors get to decide what is normal and what's not, but that's a separate issue...) So if I have a chip that has one pipeline and it's clocked at 1.25 GHz, that means that it can forward 1.25 billion packets per second. Note that this does NOT mean that I can forward a packet in "a one-point-two-five-billionth of a second" -- but it does mean that every clock cycle I can start on a new packet and finish another one. The length of the pipeline impacts the latency of the chip, although this part of the latency is often a rounding error compared to the number of times I have to read and write the packet into different memories as it goes through the system.
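The throughput-versus-latency distinction is easy to see with a little arithmetic. In the sketch below the 1.25 GHz clock comes from the example above, but the stage count of 200 is purely an assumed number chosen to make the point, not a figure from any real chip.

```python
def packets_per_second(clock_hz, packets_per_clock=1):
    # Throughput: how many packets finish every second.
    return clock_hz * packets_per_clock

def pipeline_latency_ns(stages, clock_hz):
    # Latency: time for one packet to traverse every stage,
    # assuming one clock cycle per stage.
    return stages * 1e9 / clock_hz

clock_hz = 1_250_000_000   # 1.25 GHz, from the example above
stages = 200               # assumed pipeline depth, illustration only

throughput = packets_per_second(clock_hz)
latency_ns = pipeline_latency_ns(stages, clock_hz)
print(f"throughput: {throughput} packets/s")
print(f"latency through the pipeline: {latency_ns} ns")
```

With these numbers a new packet completes every 0.8 ns, yet each individual packet spends far longer than that inside the pipeline.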
So if this pipeline can do 1.25 billion PPS and I want to be able to forward 10 billion PPS, I can build a chip that has 8 of these pipelines and get my performance target that way. I could also build a "pipeline" that processes multiple packets per clock: if I have one that does 2 packets/clock then I only need 4 of said pipelines... and so on and so forth. The exact details of how the pipelines are constructed and how much parallelism I build INSIDE a pipeline as opposed to replicating pipelines are sort of Gooky Implementation Details, but they're a very very important part of doing the chip level architecture as those sorts of decisions drive lots of Other Important Decisions in the silicon design...
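The pipeline-count math above can be written down as a small helper. The function name and parameters are mine, for illustration; the numbers are the ones from the example.

```python
def pipelines_needed(target_pps, clock_hz, packets_per_clock=1):
    """Smallest number of identical pipelines that meets target_pps."""
    per_pipeline_pps = clock_hz * packets_per_clock
    # Ceiling division on integers: round up to a whole pipeline.
    return -(-target_pps // per_pipeline_pps)

target_pps = 10_000_000_000   # 10 billion PPS
clock_hz = 1_250_000_000      # 1.25 GHz

one_per_clock = pipelines_needed(target_pps, clock_hz)
two_per_clock = pipelines_needed(target_pps, clock_hz, packets_per_clock=2)
print(one_per_clock, two_per_clock)
```

This only captures the headline packet rate; it says nothing about the area, power, or memory bandwidth cost of going wide inside a pipeline versus replicating it.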
--lj