On Fri, Jul 16, 2010 at 09:34:53PM +0300, Kasper Adel wrote:
Thanks to all the people who replied off list, asking me to send them the responses I get. [snip] Which is useful, but I am looking for more from the best people who run the best NOCs in the world.
So I'm throwing this out again.
I am looking for pointers, suggestions, URLs, documents, donations on what a professional NOC would have on the below topics:
A lot, as others have said, depends on the business, staffing, goals, SLAs, contracts, etc.
1) Briefly, how they handle their own tickets with vendors or internal
Run a proper ticketing system over which you have control (RT and friends, rather than locking yourself into something where you have to pay for changes). Don't judge by ticket closure rate; judge by successfully resolved problems. Encourage folks to use the system for tracking projects and keeping notes on work in progress rather than keeping private datastores. Inculcate a culture of open exploration to solve problems rather than rote memorization. This gets you a large part of the way to #2.
2) How they create a learning environment for their people (Documenting Syslog, lessons learned from problems...etc)
Mentoring, shoulder surfing. Keep your senior people in the mix of triage & response so they don't get dull and so skills cross-pollinate. When someone is new, have them spend their entire probationary period shadowing the primary on-call. Your third shift [or whatever spans your maintenance windows] should be the folks who actually wind up executing well-specified maintenances (with guidance as needed), and should be the breeding ground for some of your better hands-on folks.
3) Shift to Shift hand over procedures
This will depend on your systems for tickets, logbooks, etc. Solve that first and this should become evident.
4) Manual tests they start their day with and what they automate (common stuff)
This will vary with the business and what's on-site; I can't advise you to always include the genset if you don't have one.
5) Change management best practices and working with operations/engineering when a change will be implemented
Standing maintenance windows (of varying severity if that matters to your business), and a clear definition of what needs to be done only during those and what can be done anytime [hint: policy tuning shouldn't be restricted to them, and you shouldn't make it so an urgent thing like a BGP leak can't be fixed]. Linear rather than parallel workflows for approval, and not too many approval stages, else your staff will spend their time trying to get things through the administrative stages instead of doing actual work. Very simply, have a standard for specifying what needs to be done, the minimal tests needed to verify success, and how you fall back if you fail the tests. If someone can't specify it and insists on frobbing around, they likely don't understand the problem or the needed work.

Cheers,

Joe

-- 
RSUC / GweepNet / Spunk / FnB / Usenix / SAGE
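P.S. For point 5, here is one way the "standard for specifying what needs to be done" could look in practice. This is only a rough sketch in Python: the class, field, and function names are all made up for illustration, and the rule that even an urgent fix still needs a written spec is my assumption rather than a universal policy. Adapt it to whatever your ticketing system can store.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ChangeRequest:
    """Hypothetical change record; all field names are illustrative."""
    summary: str
    steps: List[str] = field(default_factory=list)               # what needs to be done
    verification_tests: List[str] = field(default_factory=list)  # minimal tests for success
    fallback_steps: List[str] = field(default_factory=list)      # how to back out on failure
    window_only: bool = True   # must wait for a standing maintenance window
    urgent: bool = False       # e.g. fixing a BGP leak


def missing_sections(cr: ChangeRequest) -> List[str]:
    """Return the names of required sections that are still empty."""
    missing = []
    if not cr.summary.strip():
        missing.append("summary")
    if not cr.steps:
        missing.append("steps")
    if not cr.verification_tests:
        missing.append("verification_tests")
    if not cr.fallback_steps:
        missing.append("fallback_steps")
    return missing


def may_proceed(cr: ChangeRequest, in_window: bool) -> bool:
    """Decide whether the change may be executed right now.

    Assumption: an incomplete spec always blocks the change, while
    urgent or anytime changes are never blocked by the window.
    """
    if missing_sections(cr):
        return False
    if cr.urgent or not cr.window_only:
        return True
    return in_window
```

A request with only a summary comes back from missing_sections() with the three empty sections listed, which is usually the point where you ask the requester to go work out what they actually intend to do.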