As a programmer, we tend to take sysadmins for granted. The few times I've been without a good sysadmin have really made me appreciate what you guys do. When we're venturing into an environment without a sysadmin, what words of wisdom can you offer us?
I'd start with:
<insert big post disclaimer here>
Some of these have been said before, but it's worth repeating.
Documentation:
Document everything. If you don't have a documentation system, install an under-the-radar wiki, but make sure you back it up. Start off with collecting facts, and one day, a big picture will form.
Create diagrams for each logical chunk and keep them updated. I couldn't count the number of times an accurate network map or cluster diagram has saved me.
Keep build logs for each system, even if it's just copy and paste commands for how to build it.
When building your system, install and configure your apps, test that it all works, and perform your benchmarking. Now, wipe the disks. Seriously. 'dd' the first megabyte off the front of the disks or otherwise render the box unbootable. The clock is ticking: prove your documentation can rebuild it from scratch (or, even better, prove a colleague can with nothing more than your documentation). This will form half of your Disaster Recovery plan.
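The "render it unbootable" step can be sketched as below. The device path is an assumption; substitute your own, and be absolutely certain of it before running anything like this.

```shell
# DANGER: this destroys the partition table and boot loader on the target
# disk. A sketch only -- /dev/sda is an example name, not a recommendation.
wipe_first_mib() {
    disk="$1"
    # The first MiB holds the MBR/GPT, boot loader and partition table.
    dd if=/dev/zero of="$disk" bs=1M count=1 conv=notrunc
    sync
}

# Example (commented out for obvious reasons):
# wipe_first_mib /dev/sda
```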
Now that you have the first half of your Disaster Recovery plan, document the rest: how to get your application's state back (restore files from tape, reload databases from dumps), vendor/support details, network requirements, how and where to get replacement hardware -- anything you can think of that will help get your system back up.
Automation:
Monitoring:
Application instrumentation is pure gold. Being able to watch transactions passing through the system makes debugging and troubleshooting so much easier.
Create end-to-end tests that prove not only that the application is alive, but that it really does what it's supposed to. Extra points if they can be jacked into the monitoring system for alerting purposes. This serves double duty: aside from proving the app works, it makes system upgrades significantly easier (monitoring system reports green, upgrade worked, time to go home).
Benchmark, monitor and collect metrics on everything it is sane to do so on. Benchmarks tell you when to expect something will let out the magic smoke. Monitoring tells you when it has. Metrics and statistics make it easier to get new kit (with fresh magic smoke) through management.
If you don't have a monitoring system, implement one. Bonus points if you actually do jack the above end-to-end tests into it.
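As a sketch, such an end-to-end check can follow the Nagios plugin convention (exit 0 = OK, 2 = CRITICAL), which most monitoring systems understand. The URL and the expected string are placeholders for whatever proves your app really works:

```shell
# Hypothetical end-to-end check in the Nagios plugin style.
# Exit codes: 0 = OK, 2 = CRITICAL.
check_app() {
    url="$1"
    expected="$2"
    body=$(curl -fsS --max-time 10 "$url") || {
        echo "CRITICAL: $url unreachable"
        return 2
    }
    case "$body" in
        *"$expected"*)
            echo "OK: $url returned expected content"
            return 0 ;;
        *)
            echo "CRITICAL: $url is up but the content is wrong"
            return 2 ;;
    esac
}

# Usage (names are made up): check_app http://app.example.com/health "orders: OK"
```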
Security:
"chmod 777" (aka grant all access/privileges) is never the solution.
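A minimal alternative, sketched with invented names: grant write access to exactly one group on exactly one directory, and nothing more.

```shell
# Instead of chmod 777: give one group write access to one directory.
grant_group_write() {
    dir="$1"
    group="$2"
    chgrp "$group" "$dir"
    # 2770: rwx for owner and group, nothing for others; the setgid bit
    # makes new files inherit the directory's group.
    chmod 2770 "$dir"
}

# Usage (path and group are assumptions):
# grant_group_write /srv/app/uploads webapp
```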
Subscribe to the 'least bit' principle; if it's not installed, copied or otherwise living on the disk, it can't get compromised. "Kitchen sink" OS and software installs may make life easier during the build phase, but you end up paying for it down the track.
Know what every open port on a server is for. Audit them frequently to make sure no new ones appear.
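One way to make that audit mechanical is to diff the current listeners against a known-good baseline. The baseline path is made up, and ss(8) assumes a Linux box:

```shell
# Snapshot the listening TCP/UDP sockets in a stable, diffable form.
list_ports() {
    ss -lntu | awk 'NR > 1 { print $1, $5 }' | sort -u
}

# Compare against a baseline captured when the box was known-good,
# e.g. list_ports > /var/lib/port-baseline.txt (path is illustrative).
audit_ports() {
    baseline="$1"
    if list_ports | diff -u "$baseline" -; then
        echo "no unexpected ports"
    else
        echo "listener set changed -- investigate" >&2
        return 1
    fi
}
```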
Don't try cleaning a compromised server; it needs to be rebuilt from scratch. Rebuild onto a spare server with freshly downloaded media, restoring only the data from backups (the binaries may be compromised), or clone the compromised host to somewhere isolated for analysis so you can rebuild on the same kit. There's a whole legal nightmare around this, so err on the side of preservation in case you need to pursue legal avenues. (Note: IANAL.)
Hardware:
Never assume anything will do what it says on the box. Prove it does what you need, just in case it doesn't. You'll find yourself saying "it almost works" more frequently than you'd expect.
Do not skimp on remote hardware management. Serial consoles and lights out management should be considered mandatory. Bonus points for remotely-controlled power strips for those times when you're out of options.
(Aside: There are two ways to fix a problem at 3am, one involves being warm, working on a laptop over a VPN in your pyjamas, the other involves a thick jacket and a drive to the datacenter/office. I know which one I prefer.)
Project management:
Involve the people who will be maintaining the system from day one of the project lifecycle. The lead times on kit, and on people's time and attention, can and will surprise you, and there's no doubt they will (should?) have standards or requirements that will become project dependencies.
Documentation is part of the project. You'll never get time to write the whole thing up after the project has been closed and the system has moved to maintenance, so make sure it's included as effort on the schedule at the start.
Implement planned obsolescence into the project from day one, and start the refresh cycle six months before the switch off day you specified in the project documentation.
Servers have a defined lifetime when they are suitable for use in production. The end of this lifetime is usually defined as whenever the vendor starts to charge more in annual maintenance than it would cost to refresh the kit, or around three years, whichever is shorter. After this time, they're great for development / test environments, but you should not rely on them to run the business. Revisiting the environment at 2 1/2 years gives you plenty of time to jump through the necessary management and finance hoops for new kit to be ordered and to implement a smooth migration before you send the old kit to the big vendor in the sky.
Development:
Backups:
Data you're not backing up is data you don't want. This is an immutable law. Make sure your reality matches this.
Backups are harder than they look; some files will be open or locked, whereas others need to be quiesced to have any hope of recovery, and all of these issues need to be addressed. Some backup packages have agents or other methods to deal with open/locked files, other packages don't. Dumping databases to disk and backing those up counts as one form of "quiescing", but it's not the only method.
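For the database case, here is one hedged sketch of the "dump to disk" approach, MySQL/MariaDB flavoured (other engines have their own dump tools); the database and output names are placeholders:

```shell
# Dump a database to a compressed file on disk so the file-level backup
# picks up a consistent copy instead of half-written database files.
dump_database() {
    db="$1"
    out="$2"
    # --single-transaction takes a consistent snapshot of transactional
    # tables without locking the database (a mysqldump-specific flag).
    mysqldump --single-transaction "$db" | gzip > "$out"
}

# Usage: dump_database orders /backup/orders-$(date +%F).sql.gz
```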
Backups are worthless unless they're tested. Every few months, pull a random tape out of the archives, make sure it actually has data on it, and the data is consistent.
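The restore test can be automated too. A sketch, assuming you write a sha256 manifest at backup time (all file names here are invented):

```shell
# Restore a tar backup into a scratch directory and verify every file
# against a checksum manifest written when the backup was taken.
verify_backup() {
    archive="$1"
    manifest="$2"   # absolute path to a 'sha256sum' manifest
    scratch=$(mktemp -d)
    tar -xzf "$archive" -C "$scratch" || { rm -rf "$scratch"; return 1; }
    ( cd "$scratch" && sha256sum --check --quiet "$manifest" )
    rc=$?
    rm -rf "$scratch"
    return $rc
}
```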
And most importantly...
Pick your failure modes, or Murphy will... and Murphy doesn't work on your schedule.
Design for failure, document each system's designed weak points, what triggers them and how to recover. It'll make all the difference when something does go wrong.
Don't assume it's easy. I know many programmers who think that just because they can set up IIS or Apache on their dev box, they can run a web farm. Understand what the job involves, and do your research and planning; don't treat sysadmin work as the easy thing you can knock out in 10 minutes to get your app deployed.
Security is not an afterthought. While a hacked app can make the programmer look incompetent, it's (at least) a lost weekend spent verifying, cleaning, and/or restoring from backups for a sysadmin.
For that matter, don't treat backups as version control. They're for disaster recovery, and are not really designed to restore your code because you forgot what you changed.
And stop blindly blaming Windows Updates for your code being broken. I don't care that it worked before; tell me why it doesn't work now -- then we can see whose fault it is.
How to debug networking issues and watch your program run with sysadmin tools. As a programmer who got started in system administration, I'm amazed by how impotent many programmers become once networking "just stops."
Learn the tools for manually connecting to network services (try 'openssl s_client -connect target-host:port' sometime). Know how to troubleshoot problems.
It's very easy to pass the buck (e.g., your network is hosing my communication with the database). It may be the network's fault, but you should have application logs with errors that, using Google or SO, may reveal a problem in an app's configuration.
Everyone likes to blame the hardware, OS, or network, so if you practice a little more due diligence, you'll make the sysadmin a happy person. Because, if nothing else, you might be able to point them in a specific direction as to what might be wrong (as opposed to saying "your network sucks" or something equally helpful).
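Before declaring "the network is broken", a bottom-up triage like the following narrows things down; the host and port are placeholders for your own service:

```shell
# Work up the stack instead of guessing. Host/port are made-up examples.
triage() {
    host="$1"
    port="$2"
    dig +short "$host" || return 1          # 1. does the name resolve?
    ping -c 3 "$host" || echo "no ICMP reply (may just be filtered)"
    nc -z -w 5 "$host" "$port" || return 1  # 2. is the port reachable?
    # 3. then talk to the service itself, e.g. for TLS:
    #    openssl s_client -connect "$host:$port"
    echo "network path to $host:$port looks fine -- check the app"
}

# Usage: triage db.example.com 5432
```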
Document everything you can. I can't tell you how many times the last sysadmin thought it would be cute not to document something for 'job security', or someone just wanted to get in and get out. Just like a programmer should leave good comments, sysadmins should document. A diagram of the topology would be nice too.
Plan B.
Always have a disaster recovery plan in mind when designing and developing a solution. Recognize single points of failure that can lead to an outage.
Documentation: no need to go nuts, but how the application works, a diagram showing how the bits fit and ways to test each component when it all goes wrong. Sample data and output is nice.
Requirements: what modules does it rely on? Versions? OS?
Monitoring: ideally developers would include monitoring information and tests with the application.
Packaging: nothing is worse than a "deployment" which means checking out a new revision of a file from VCS and copying it to a bunch of servers. Too often programmers don't appreciate the complexity of deploying software: there are reasons why versioned, packaged software forms the backbone of most OSes.
If a developer came to me with an RPM which installed first time with concise, comprehensive documentation and some Nagios tests they'd be my new best friend.
I'm surprised that none of the 17 answers given here so far include anything about ensuring your application runs when logged on as a standard user.
Other than the installation process, the application should run fine when logged on with a standard user account.
Backup Backup Backup .... Test the backup .... Always be ready to roll back
OK this is slightly ranting but:
a) When coding, assume that underlying infrastructure could fail, and does not come from happy-happy always-on land. Or Google.
b) We probably don't have the resources to implement anything like the infrastructure you've read about, so take it easy on us when things go down. It's likely we know what needs to be done, but for whatever reason it just hasn't happened yet. We are your partners!
c) Like jhs said above, it would really help if you had a passing familiarity with tools to troubleshoot the infrastructure, such as ping, traceroute (or combining both - mtr), dig, etc. Massive bonus points for even knowing about Wireshark.
d) If you program a computer, you really should know how it connects to the network and the basics like being able to parse the output of ipconfig /all or ifconfig. You should be able to get your internet connection up and running with minimal help.
Otherwise I think Avery pretty much nailed it. Devs who do a little sysadmin are worth their weight in gold! But equally, sysadmins who understand how devs go about things (including versioning, etc.) are pretty much essential in this day and age.
This seems to be in the air at the moment, I've noticed more discussion about the dev/ops relationship in blogs - check out
Keeping Twitter Twittering [1]
[1] http://radar.oreilly.com/2009/05/velocity-preview---keeping-twi.html
This may apply only to beginning programmers, but I deal with a few things on every project with some programmers.
"It works on my machine" is never a valid statement. It is the programmer's responsibility to create an install program for use on the server, or at least to document every connection, DLL, and add-in that will be required on the server.
(I've heard this multiple times, so please don't laugh.) "I run the exe on the server from my machine and it works. But when I run it on the server (Citrix, Terminal Server, etc.) it doesn't work." Please understand DLLs, OCXs, and anything else your program requires: where and how they are registered, and how your program uses them.
These may seem simple, but I deal with it constantly.
Brian
That no one group or function is 'better' than another, and that none requires 'bigger brains' than the others either. I've seen both sides get all prima-donna-ish in the other's company. You're all trying to achieve the same goals -- focus on these similarities, and not on the fact that you use different tools.
Infrastructure architect turned programmer, might want to roll back that transaction in the future though :)
As someone that has been a sys admin for developers, and a developer myself, the advice given here is not only gold, but should be part of the hiring documentation for new developers for companies all over.
Something that I haven't seen explained (yet) is that developers really should know the products they'll use to create the programs they are paid for. The number of times I've had to explain and configure Apache servers, Eclipse and Visual Studio installs, and databases on developer machines is a bit worrisome.