Kinn "I'd like to know which mmd controls which device."

Kinn is the "day shift guy": he is there in the afternoons and when backups start, working Thursday, Friday and Saturday from noon until midnight to monitor and verify that the backups work. Kinn is happy in his role and enjoys working with the backup group; the work environment is very important to him.

Kinn started his career as an electronic assembly technician after immigrating from Laos 20 years ago and getting some technical training in hardware troubleshooting. He then began work at Microsoft as a hardware / system administrator. After a few years the hours were too much, and he wanted less stress and more time for family. His current hours are perfect for him.

Kinn takes pride in doing a good job. He has a special interest in hardware. Generally sits quietly at his desk all evening; doesn't play music or anything. Likes some social interaction. Not into learning new stuff at this point.


"I enjoy playing with hardware!"

"I prefer the messages pane in NW Admin rather than the NMC equivalent, because it tracks history for as long as the UI is up, and sometimes I leave it up for days."

Environment

Kinn's cube is fairly clean, but bits of hardware are scattered about and one of the disk drives is open as if he'd been repairing it. He uses two computers, a laptop and a desktop, both running Windows 2003, displayed on two 19-inch monitors running at 1280x1024. One screen always shows the Windows NW Admin GUI, and the other shows either Outlook or (claimed) NMC.

Other software: Outlook, Excel, Remedy (trouble-ticket system), MS Internet Explorer. Has PC Anywhere installed, but generally uses MS Terminal Services to connect to remote machines. Runs HP's “Insight Manager” for host and network monitoring of ~700 machines. Probably also runs the StorageTek library Java applet that other Safeco people run to monitor libraries, but didn't mention it.

Skills 

General Technology
Medium comfort, learned Windows just four years ago
NetWorker
Has worked with Legato NetWorker for 5 years. He didn't receive training when he started; he was eventually sent to one class (NetWorker 6.0) but didn't get much out of it, because by that time he knew what he needed to know. He estimates it took 3 months to figure out NetWorker, and he was probably mid-level then. He thinks he kept learning more through his first year with NW, and then reached a plateau: he said he doesn't know everything, but knows enough to do his job. Used to use the Legato web site for support, but doesn't use it much now, because he's learned what he needs to know. Occasionally looks at websites for EMC or other vendors.
Goals 
Take care of backups
Get done within work hours

Tasks 

- Monitor backups to make sure they work
- Check that there are enough tapes in the libraries, etc., to do the next backup
- Perform hardware upgrades and maintenance
- Service user requests, like restores
Daily Workflow
  1. Might talk to people for 30 minutes when he comes in if he wants to chat
  2. Spends 30-45 minutes on email, reading through all of the status messages from the last day. This lets him know if a problem mentioned in an early email has already been fixed. The group keeps everyone up-to-date on status by emailing a distribution list.
  3. Opens up the Windows NW Administration GUI to look at devices. Uses the devices pane on the monitoring tab to see if NetWorker has automatically disabled any drive.
  4. Troubleshooting: if NetWorker has automatically disabled a drive, he'll try to simply re-enable the drive and tell it to mount another tape. If the problem was a tape problem, that will fix it. If that doesn't work, he can email the group, use the StorageTek Java applet that manages libraries, and email a local operator, if one is still there.
  5. General troubleshooting style: tries to take care of what he can, but if it's hard he'll let the rest of the team deal with it.
  6. Looks at the group properties page to see if the groups are running. He's interested in the completion message (the first little bit) to see overall status, and he wants to look at the last start / last end time to make sure no backup was “missed” because things were still running from the day before.
  7. Looks at the daemon.log file to see if there's anything interesting in it.
  8. Looks at the “messages” pane in the monitoring page of the Windows NW Administration GUI. He likes this much better than the equivalent in NMC, because it tracks history for as long as he leaves the UI up---and he leaves it up for days.
  9. Uses HP Insight Manager to look at machines and network.
  10. As groups finish, he starts checking the group to see that it's OK: successful, last run is reasonable, etc.
  11. He moves failed clients into a “doghouse” group, which he'll use to re-run the failed clients.
  12. He checks volumes in the Volumes tab of the Windows NW Administration GUI. He looks at volumes that were written today, by sorting on expiration date. He knows that everything is kept for a month, so he just looks for next month's date. He checks overall status and that no volumes were “full” prematurely, and then he starts drilling down to examine each saveset by double-clicking on it to bring up a dialog with status. He wants to make sure none are incomplete or otherwise bad. There are on the order of a thousand savesets (maybe 500, maybe 1500) from groups and clones each day. He looks at each one to see if it's there, and then he clicks on it to see if it's incomplete or otherwise “bad.” Note that Jared has written a script to automate this, which most group members now use for this task, but Kinn is uncomfortable with the load Jared's script puts on the machines, and doesn't run it while any groups are still running. Hence the time-consuming manual process.
  13. He does similar work for clones. (It's not clear if he kicks off clones manually, or lets NetWorker do it, or there's a script.)
  14. He makes sure there are enough blank tapes for the next backup(s).
  15. He handles any recovers that are needed.
  16. He usually spends all 12 hours on this monitoring and troubleshooting: hardware upgrades are less frequent.
  17. Before he leaves, he always sends out an email to the group explaining the current status of the backups.
  18. He doesn't use NMC much. He didn't mention it, or demonstrate it, in the first 90 minutes of the interview when he walked us through the tasks he did in the Windows GUI. At the end of the interview he said he did use it, and kept it running on one of his screens. However, it wasn't on his screen, he couldn't find how to launch it for five long minutes, and then he took three tries to remember the username/password to use to get into it. I'd tend to say he's used it (he was pretty familiar with it) but not that often.
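
As an illustration of the checks in steps 6, 11 and 12, here is a minimal Python sketch. It is not Jared's script and does not call NetWorker: the `GroupRun` and `SaveSet` records, the status strings, and the 24-hour threshold are all hypothetical stand-ins for whatever the GUI or a report would actually provide.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class GroupRun:
    """Hypothetical record of a backup group's most recent run."""
    name: str
    last_start: datetime
    last_end: Optional[datetime]  # None means no end time has been recorded


@dataclass
class SaveSet:
    """Hypothetical record of one saveset found on a volume."""
    client: str
    name: str
    status: str  # assumed values: "complete", "incomplete", "aborted"


def group_problems(runs, now, max_age=timedelta(hours=24)):
    """Map group name to a reason it may have missed a backup (step 6)."""
    problems = {}
    for run in runs:
        # An end time missing or older than the start means the run is ongoing.
        if run.last_end is None or run.last_end < run.last_start:
            problems[run.name] = "still running"
        elif now - run.last_start > max_age:
            problems[run.name] = "did not start recently"
    return problems


def bad_savesets(savesets):
    """Return savesets that are incomplete or otherwise bad (step 12)."""
    return [s for s in savesets if s.status != "complete"]


def doghouse_clients(savesets):
    """Return the sorted clients that would go into the doghouse group (step 11)."""
    return sorted({s.client for s in bad_savesets(savesets)})
```

A check like this only reads records that have already been collected, so it would not by itself address Kinn's concern about load on the servers; that depends on how the records are gathered.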
Issues & Concerns
  • If one of his tape drives and/or the mmd(b?) controlling it gets stuck, he'll kill the mmd. But there's nothing in the product to tell him which mmd to kill, and so he gets a lot of collateral damage. He'd really like to see something that helped him know which mmd was controlling which device.
  • In the NMC equivalent of the Windows NW Administration GUI's monitoring tab / messages pane, NMC only shows a small number of messages (namely, the ones still in the server resource). But he monitors busy servers, and so he can only see about 5 minutes' worth of messages in that window: if his attention is elsewhere, or he goes to the restroom or anything, that information is GONE. The Windows GUI, which keeps showing messages for as long as it has been up (and he leaves it up for days), is much more useful.
  • He could really use a “status” column for savesets in the drilldown for volumes and in the report produced by the clone savesets screen.
  • If a machine dies and the group is restarted, then the backups are not too good: they sometimes don't work when you try to recover from them, even though the saveset is marked complete. (Editor's note: I personally don't understand how/why this could happen, but he was pretty sure that is what caused problems.)
  • If you delete a volume in the GUI, perhaps to reuse a volume that is beyond its 30-day limit, you currently must delete all of the index entries for the volume. But this is bad, given the way Safeco uses NetWorker, because that also eliminates the index entries for the clone that is being kept for 7 years. Please provide a way to remove the volume and its savesets from the media database, but keep the index records.
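
The message-history concern above amounts to keeping a long-lived log from a short-lived window. The Python sketch below shows one way that could work, assuming messages can be polled as `(timestamp, text)` pairs; the `MessageHistory` class and that message shape are hypothetical and are not part of NMC or NetWorker.

```python
class MessageHistory:
    """Accumulate messages from a short rolling window into a lasting history."""

    def __init__(self):
        self._seen = set()   # messages already recorded, for de-duplication
        self.history = []    # every distinct message, in arrival order

    def ingest(self, window):
        """Record unseen (timestamp, text) messages; return how many were new."""
        added = 0
        for message in window:
            if message not in self._seen:
                self._seen.add(message)
                self.history.append(message)
                added += 1
        return added


# Two overlapping polls of a window that only holds the latest messages.
log = MessageHistory()
log.ingest([(1, "drive 3 disabled"), (2, "group A started")])
log.ingest([(2, "group A started"), (3, "group A completed")])
print(len(log.history))  # 3
```

As long as the window is polled more often than messages fall out of it, nothing is lost while the operator's attention is elsewhere. One limitation of this sketch: two identical messages with the same timestamp are treated as a single entry.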