Julie
"I was asked to prove that 80 servers could all be recovered."
Julie cares about the quality of her work. A team player, she likes to take on new responsibilities, and does not want to be bored at her desk; she is not a web surfer. Her day starts at 5 AM from home, when she logs in to check on the nightly backups. She comes in to the office about 8 AM. Julie has a family, and only occasionally works more than 40 hours a week.
Julie started her career as a business analyst for a large insurance company. Four years ago she moved into IT, after volunteering to monitor a mailing list that included backup issues. When the backup group expanded, after a high-profile recovery failed, she moved into that group as a Systems Analyst. She is the senior troubleshooter for backup and also responsible for corporate compliance, for which Sarbanes-Oxley has been a major factor.
"NetWorker is generally easy to use, but hard to verify it worked. I worry about the integrity of the data."
Environment
Julie has a big cube, with lots of light. Her desk is fairly neat, with Mariner bric-a-brac and photos of children on the wall. Boxes of bad tapes are piled in a corner, and shipping boxes for tapes are on a small table in another corner. She generally spends her day at her desk, but about once a week walks down the hall to log on to a server because of an error that the remote application, Microsoft Terminal Services, won't let her correct.
Julie uses a laptop with Windows (2003?) and a 19” glass monitor at 1280x1024. She uses Outlook for email and Remedy for tracking trouble tickets. She decided not to use NMC because it didn't have cloning reports.
Skills
General Technology
Medium comfort, learned Windows just four years ago
NetWorker
Mid to high skill level. She started using NetWorker when it was first introduced four years ago, had some internal training, and took about 6 months to work with it confidently.
Goals
Maintain data integrity by making sure data is backed up correctly and tapes go offsite correctly, and make sure the next backup will work by checking that there are enough tapes, bad tapes are removed, and drive status is current.
Recover data as needed
Comply with audits, such as Sarbanes-Oxley
Tasks
Some of Julie's main tasks are reading and sending email for current status, checking drives, checking clones, making clones, finding and removing bad tapes, and checking libraries (in software) to make sure there are enough blank tapes. View detailed task analysis (xls).
Daily Workflow
Julie logs in from home at 5 AM, using her laptop and a cable modem. She checks email from the evening person to get the current status. She then checks groups, which takes under an hour. She brings up the Windows NetWorker Administration GUI for each server and looks in the monitoring pane for pending sessions and drive monitoring. She opens the configuration tab and goes to groups. But “green doesn’t tell you anything!” (see Issues & Concerns), so she ignores the icons.
For each group (on each server), she opens the group details modal dialog. She checks the last start time, and looks briefly at the first line or two of the completion message to see that it completed OK. She would then check the clones of savesets, which until recently* took about four hours a day. She used to have to bring up the “clone savesets” pane, where she ran a query on savesets, then check hundreds of savesets to verify that they appeared in pairs: one for the original backup, and one for the clone. This still wasn’t good enough, because that check didn’t look for suspect or incomplete savesets. So she'd click on each saveset (of which there are hundreds) to bring up details on it, and verify the saveset is OK. See Saveset Storyboard (pdf).
* She now runs Jared’s automated scripts to get the details on savesets. If the typical query setup finds aborted savesets, she then looks for suspect tapes. She wants to find “prematurely” full tapes. Their tapes hold 160 GB uncompressed and 320 GB compressed, so she wants to find any tapes marked full with less than about 300 GB written to them. These are (probably) “bad” tapes, and she wants to get them out of the system as soon as she can.
Next, she selects volumes for a particular server. She looks at the used column to find all the full volumes, then looks at the written column to see how much was written before each volume was marked full (she can sort on written). She’ll send email to the local operator to remove the tape(s).
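The pair check Julie used to do by hand (each saveset present once as an original and once as a clone, with neither copy suspect or incomplete) can be sketched roughly as below. This is only an illustration and not Jared’s actual script, which was not seen. The `SavesetCopy` record, its field names, and the status values are all hypothetical; they stand in for whatever a saveset query really returns.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class SavesetCopy:
    # All fields are hypothetical stand-ins for real saveset query output.
    saveset_id: str  # identifier shared by the original and its clone
    volume: str      # tape that holds this copy
    is_clone: bool   # False = original backup, True = clone
    status: str      # assumed values: "ok", "suspect", "incomplete", "aborted"


def find_clone_problems(copies):
    """Return {saveset_id: reason} for savesets lacking a healthy original/clone pair."""
    by_id = defaultdict(list)
    for copy in copies:
        by_id[copy.saveset_id].append(copy)

    problems = {}
    for saveset_id, group in by_id.items():
        bad = [c for c in group if c.status != "ok"]
        if bad:
            # A suspect or incomplete copy matters even when the pair exists.
            problems[saveset_id] = "bad copy on " + ", ".join(sorted(c.volume for c in bad))
        elif not any(c.is_clone for c in group):
            problems[saveset_id] = "no clone"
        elif all(c.is_clone for c in group):
            problems[saveset_id] = "no original"
    return problems
```

A report like this would replace hundreds of clicks with a short list of savesets that need attention, which is the outcome Julie cares about.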
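Julie’s manual scan of the used and written columns amounts to a simple filter, sketched here under stated assumptions. The `Volume` record and its field names are hypothetical, and the 300 GB cutoff is her own rule of thumb rather than a product setting.

```python
from dataclasses import dataclass

# Julie's rule of thumb: a tape marked full with less than about 300 GB is suspect.
FULL_THRESHOLD_GB = 300


@dataclass(frozen=True)
class Volume:
    # Hypothetical fields mirroring the "used" and "written" columns she reads.
    name: str
    used: str          # assumed to read "full" once the tape is marked full
    written_gb: float  # amount written before the tape was marked full


def prematurely_full(volumes, threshold_gb=FULL_THRESHOLD_GB):
    """Return full volumes below the threshold, least-written first."""
    suspects = [v for v in volumes if v.used == "full" and v.written_gb < threshold_gb]
    return sorted(suspects, key=lambda v: v.written_gb)
```

Sorting by amount written mirrors the way she sorts on the written column, so the most suspicious tapes come first in the list she emails to the operator.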
If Julie finds bad equipment in her monitoring, she’ll call it in for the operators or IT people.
In general, she does a bit of looking for problems at home, and then comes into work to do the serious troubleshooting. Much of her time is spent troubleshooting, kicking off groups again, etc. She can spend all day troubleshooting, and there’s no guarantee it will be fixed by tomorrow.
There is a script for cloning that she (or someone?) runs. If there’s a problem, she’ll look at the NW daemon log file for more details.
Julie does about one recover request per week, which can take anywhere from a few minutes to an hour.
Issues & Concerns
Group status is not useful: “Green doesn’t tell you anything!” She hates this, and it came up more than once. In NetWorker, a group is marked “green” if the last run worked. Beyond the well-known problems with open files, etc., there is the over-run problem. This is easiest to explain by example: Suppose there is a group that runs “Full on Friday” and at level 1 on other days. On Thursday evening the group starts, and backs up everything since last Friday, but it gets stuck. On Friday evening the group is kicked off again, for what should be a level full, and immediately aborts because the group is still running (from Thursday). Later Friday night the Thursday group finishes. Julie checks the GUI and sees that the last run was successful (it’s marked green), but in fact the Friday run was aborted, and she doesn’t have any full backups she can send offsite. To catch this she has to open the group details panel for each group, one at a time, to check when the group ran, and she dislikes this. But she seems to really loathe the green icon that is essentially lying to her.
Getting good clones is very frustrating. If the backups or clones are screwed up [corrupted?] it can take days to try to get good ones, because of backup window time constraints, and general difficulties of troubleshooting the environment (any specifics here?).
Verifying whether clones or savesets are OK is painful. She would spend four hours a day trying to verify in the GUI before Jared made the scripts.
Cryptic machine names make it easy to get lost. If you use the Windows recover GUI and search through a deep directory structure, it’s easy to click on the wrong directory with this naming convention. She would really like to have an “address” line / control [breadcrumb]? like you have in Windows Explorer. This would show her where she was, and if it was something she could use as a control, she could cut-n-paste directly from the user’s recover request.
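The over-run example can be made concrete with a small model. This is a sketch of the reasoning only, not of how NetWorker computes group status. It assumes that “last run” means the run that finished most recently, and the `GroupRun` record and both function names are invented for illustration.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class GroupRun:
    # Hypothetical record of one run of a backup group.
    level: str          # e.g. "full" or "1"
    started: datetime
    finished: datetime
    succeeded: bool


def icon_is_green(runs):
    """Assumed icon rule: green if the run that finished last succeeded."""
    return max(runs, key=lambda r: r.finished).succeeded


def green_is_misleading(runs):
    """True when the icon is green but the most recently started run failed."""
    latest_started = max(runs, key=lambda r: r.started)
    return icon_is_green(runs) and not latest_started.succeeded


# The scenario from the text; the dates are arbitrary (a Thursday and a Friday).
thursday = GroupRun("1", datetime(2024, 1, 4, 20, 0), datetime(2024, 1, 5, 23, 30), True)
friday = GroupRun("full", datetime(2024, 1, 5, 20, 0), datetime(2024, 1, 5, 20, 0), False)
print(green_is_misleading([thursday, friday]))  # True: green icon, but no Friday full
```

Comparing the last run to finish against the last run to start is the check Julie currently performs by opening each group’s details panel and reading the start time.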