ArsDigita Server Architecture
for reliably delivering various kinds of Web services, by Philip Greenspun for the Web Tools Review
The ArsDigita Server Architecture is a way of building and delivering Web services cheaply and reliably. Since I've written a whole book about building sites, the focus of this document is on maintaining Web services. However, since you can't maintain something unless you build it first, we'll start with a little bit of background.
Web services are usually built with loose specs, to tight deadlines, by people who aren't the world's best educated or most experienced programmers. Once online, Web applications are subject to severe stresses. Operations get aborted due to network failure. So if your transaction processing house isn't in order, your database will become corrupt. You might get dozens of simultaneous users in a second. So if your concurrency control house isn't in order, your database will become corrupt.
A workable solution?
Aren't we done then? This is proven stuff. Unix has barely changed since the late 1970s, Oracle since the late 1980s, and AOLserver (formerly NaviServer) since 1995. We've launched more than 100 public Web services using this infrastructure.
We are not done. What if a disk drive fills up? Are we notified in advance? Are we notified when it happens? How long does it take to restore service?
That's what this document is about.
Assuming a secure Unix machine that starts up and shuts down cleanly, you need to think about the file systems. So that a hard disk failure won't interrupt Web services, the ArsDigita Server Architecture mandates mirrored disk drives using Solstice DiskSuite on Solaris or Mirror/UX on HP. The Architecture mandates straight mirroring (RAID 1) in order to retain high performance for the RDBMS. These disk drives should be on separate SCSI chains.
Mirrored disks aren't all that useful when they are 100% full. Unix programs love to write logs. Mailers log every action (see below). Web servers log accesses, errors, and, the way we typically configure AOLserver, database queries. The problem with these logging operations is that they fill up disk drives but not so fast that sysadmins are absolutely forced to write cron jobs to deal with the problem.
For example, in December 1998, the Web services maintained by ArsDigita were generating about 250 MB of log files every day. Some of these were access logs that folks wanted to keep around and these generally get gzipped. Some were error/query logs that had to be separately rolled and removed. In any case, if left unattended, the logs will fill up a 9 GB disk drive sufficiently slowly that a human will delude himself into thinking "I'll get to that when the disk is becoming full." The reality is that every 6-12 months, the disks will fill up and the server will stop serving pages reliably. The monitors described later in this chapter will alert the sysadmins who will log in, rm and gzip a bunch of files, and restart the Web servers. The outage might only last 15 minutes but it is embarrassing when it happens for the fourth or fifth time in three years.
The naive approach to this problem is to build a db-backed Web monitoring service on each Unix machine. Why a Web service? You don't want to have to log into a machine to check it. The first thing the Web service has to do is show a nicely formatted and explained display of free disk space, CPU load, and network status. Tools such as HP GlancePlus, which require shell access and X Windows, are generally going to be better for looking at an instantaneous snapshot of the system. So if we just have a page that shows a summary of what you'd see by typing "df -k" and running top or GlancePlus, that's nothing to write home about.
Let's try to think of something that a Web service can do that a human running shell tools can't. Fundamentally, what computers are great at is remembering to do something every night and storing the results in a structured form. So we want our Unix monitor to keep track of disk space usage over time. With this information, the monitor can distinguish between a safe situation (/usr is 95% full) and an unsafe situation (/mirror8/weblogs/ is 85% full but it was only 50% full yesterday). Then it can know to highlight something in red or scream at the sysadmins via email.
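In AOLserver Tcl, the core of such a nightly snapshot might look something like the sketch below. The table name disk_space_log, the email addresses, and the 20-point jump threshold are illustrative assumptions, not part of any shipped tool.

proc monitor_disk_space {} {
    # record today's "df -k" numbers and compare them with yesterday's
    set db [ns_db gethandle]
    foreach line [lrange [split [exec df -k] "\n"] 1 end] {
        # Solaris df -k columns: filesystem kbytes used avail capacity mount-point
        if { [llength $line] < 6 } {
            continue
        }
        set capacity [string trimright [lindex $line 4] "%"]
        set mount [lindex $line 5]
        set yesterday [ns_db 0or1row $db "select capacity from disk_space_log
            where mount_point = '$mount' and snapshot_date = trunc(sysdate - 1)"]
        if { $yesterday != "" && $capacity - [ns_set get $yesterday capacity] > 20 } {
            ns_sendmail sysadmins@yourdomain.com monitor@yourdomain.com \
                "disk usage on $mount jumped to ${capacity}%" ""
        }
        ns_db dml $db "insert into disk_space_log (mount_point, snapshot_date, capacity)
            values ('$mount', trunc(sysdate), $capacity)"
    }
    ns_db releasehandle $db
}

# take the snapshot every night at 4:00 am
ns_schedule_daily -thread 4 0 monitor_disk_space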
Finishing up the naive design, what we need is an AOLserver that will prepare Web pages for us showing current and historic free disk space, current and historic CPU load, and current and historic RAM usage. In order to keep this history, it will rely on an RDBMS running on the same computer.
Why do I keep referring to this as "the naive design"? First, you might not be running an RDBMS on every computer that is important to your Web operation. You probably have at least one RDBMS somewhere but you might not want anyone connecting to it over the network. Second, you might have 10 computers. Or 100. Are you really diligent enough to check them all out periodically, even if it is as simple as visiting their respective Unix monitors?
Each particular machine needs to run an AOLserver-based monitor. However, the monitor needs to be configurable to either keep its history locally or not. To handle monitors that don't keep local history and to handle the case of the sysadmin who needs to watch 100 machines, the monitor can offer up its current statistics via an XML page. Or a monitor that is keeping history locally can be configured to periodically pull XML pages from monitors that aren't keeping local history. Thus, a sysadmin can look at one page and see a summary of dozens or even hundreds of Unix systems (critical problems on top, short summaries of each host below, drill-down detail pages linked from the summaries).
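A minimal sketch of the XML page, again in AOLserver Tcl; the URL /SYSTEM/monitor.xml and the element names are assumptions, and a real monitor would report CPU load and RAM as well as disk space.

ns_register_proc GET /SYSTEM/monitor.xml monitor_xml_page

proc monitor_xml_page {ignore} {
    # offer up the current "df -k" numbers for other monitors to pull
    set xml "<monitor host=\"[ns_info hostname]\" time=\"[ns_time]\">\n"
    foreach line [lrange [split [exec df -k] "\n"] 1 end] {
        if { [llength $line] < 6 } {
            continue
        }
        append xml "  <filesystem mount=\"[lindex $line 5]\" capacity=\"[lindex $line 4]\"/>\n"
    }
    append xml "</monitor>\n"
    ns_return 200 text/xml $xml
}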
We can't think of a good name for this so we call it ArsDigita Cassandrix ("Cassandra for Unix"). It is available from http://arsdigita.com/free-tools/cassandrix.html.
In theory, providing humans with many opportunities to tell Oracle what to do will yield higher performance. The human is the one who knows how data are to be accessed and how fast data sets will grow. In practice, humans are lazy, sloppy, and easily distracted by more interesting projects.
What happens is that Oracle becomes unable to update tables or insert new information because a tablespace is full. Each disk drive on which Oracle is installed might have 17 GB of free space but Oracle won't try to use that space unless explicitly given permission. Thus the canonical Oracle installation monitor watches available room in tablespaces. There are lots of things like this out there. There is even the meta-level monitor that will query out a bunch of Oracle databases and show you a summary of many servers (see Chapter 6 of Oracle8 DBA Handbook). Oracle Corporation itself has gradually addressed this need over the years with various incarnations of their Enterprise Manager product. Personally, I haven't found Enterprise Manager useful because you need to have a Windows machine and have SQL*Net enabled on your Oracle server (an expensive-to-manage security risk). Furthermore, once you get all of that stuff set up, Enterprise Manager won't actually answer many of the most important questions, e.g., those having to do with deadlocks or the actual queries being run by conflicting users. As with everything else in the RDBMS world, help is just around the corner, at least if you like to read press releases. Oracle is coming out with a new version of Enterprise Manager. It has an immensely complicated architecture. You can connect via "thin clients" (as far as I can tell, this is how Fortune 500 companies spell "Web browser"). The fact of the matter is that if Oracle or anyone else really understood the problem, there wouldn't be a need for new versions. Perhaps it is time for an open-source movement that will let us add the customizations we need.
What if, instead of installing packaged software and figuring out 6 months from now that it won't monitor what we need to know, we set up an AOLserver with select privileges on the DBA views? To better manage security, we run it with a recompiled version of our Oracle driver that won't send INSERT or UPDATE statements to the database being monitored. DBAs and customers can connect to the server via HTTPS from any browser anywhere on the Internet ("thin client"!). We start with the scripts that we can pull out of the various Oracle tuning, dba, and scripts books you can find at a bookstore. Then we add the capability to answer some of the questions that have tortured us personally (see my Oracle tips page). Then we make sure the whole thing is cleanly extendable and open-source so that we and the rest of the world will eventually have something much better than any commercial monitoring tool.
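Here is roughly what one of those pages might do, written as an AOLserver .tcl page that has only SELECT privilege on the DBA views. The 90% threshold and the page layout are arbitrary, and the sketch assumes the Oracle driver returns column names in lower case (as the ArsDigita driver does).

# tablespaces.tcl : list tablespaces that are nearly full
set db [ns_db gethandle]
set sql "select df.tablespace_name,
       round(100 * (1 - nvl(fs.free_bytes, 0) / df.total_bytes)) as pct_used
from (select tablespace_name, sum(bytes) as total_bytes
      from dba_data_files group by tablespace_name) df,
     (select tablespace_name, sum(bytes) as free_bytes
      from dba_free_space group by tablespace_name) fs
where df.tablespace_name = fs.tablespace_name (+)"
set html "<h3>Tablespaces more than 90% used</h3>\n<ul>\n"
set row [ns_db select $db $sql]
while { [ns_db getrow $db $row] } {
    set pct [ns_set get $row pct_used]
    if { $pct >= 90 } {
        append html "<li>[ns_set get $row tablespace_name] : ${pct}% used\n"
    }
}
ns_db releasehandle $db
ns_return 200 text/html "$html</ul>\n"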
We can't think of a good name for this monitor either but at this point in the discussion we have to give it a name. So let's call it ArsDigita Cassandracle ("Cassandra for Oracle").
Here's a minimal subset of questions that Cassandracle needs to be able to answer:
ArsDigita Cassandracle is available from http://arsdigita.com/free-tools/cassandracle.html.
The following line in /etc/inittab tells init, when it is in run state 3 or 4 (i.e., up and running), to run the NaviServer daemon (nsd) in interactive mode, with the config file /home/nsadmin/philg.ini:

nsp:34:respawn:/home/nsadmin/bin/nsd -i -c /home/nsadmin/philg.ini

The respawn keyword tells init to restart nsd whenever it dies.
There are a couple of problems with these common practices. First, users and the publisher might not be too thrilled about server errors piling up for 24 hours before anyone notices. Second, server errors that occur infrequently are likely to go unnoticed forever. The site is launched so the developers aren't regularly going through the error log. Users aren't used to perfection on their desktops or from their ISP. So they won't necessarily send complaint email to webmaster. And even if they do, such emails are likely to get lost amongst the thousands of untraceable complaints from Win 3.1 and Macintosh users.
The ArsDigita Server Architecture mandates running the ArsDigita Watchdog server on whichever computer is running the Web service (or another machine that has access to the public server's log files). Watchdog is a separate simple AOLserver that does not rely on the Oracle database (so that an Oracle failure won't also take down the Watchdog server).
Watchdog is a collection of AOLserver Tcl scripts that, every 15 minutes, checks the portion of the log file that it hasn't seen already. If there are errors, these are collected up and emailed to the relevant people. There are a few interesting features to note here. First, the monitored Web services need not be AOLserver-based; Watchdog can look at the error log for any Web server program. Second, if you're getting spammed with things that look like errors but in fact aren't serious, you can specify ignore patterns. Third, Watchdog lets you split up errors by directory and mail notifications to the right people (e.g., send /bboard errors to joe@yourdomain.com and /news errors to jane@yourdomain.com).
Watchdog is an open-source software application, available for download
from http://arsdigita.com/free-tools/watchdog.html.
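The heart of Watchdog is nothing exotic. A stripped-down sketch in AOLserver Tcl: the file names, the "Error:" pattern, and the ignore pattern are assumptions for illustration, and a real installation also has to cope with log files being rolled.

proc watchdog_scan_error_log {} {
    set log /home/nsadmin/log/foobar-error.log
    set state /home/nsadmin/watchdog-offset    ;# remembers how far we've already read
    set offset 0
    if { [file exists $state] } {
        set f [open $state r]
        set offset [gets $f]
        close $f
    }
    set f [open $log r]
    seek $f $offset
    set new_stuff [read $f]
    set offset [tell $f]
    close $f
    set f [open $state w]
    puts $f $offset
    close $f
    # collect lines that look like errors, minus anything matching an ignore pattern
    set errors ""
    foreach line [split $new_stuff "\n"] {
        if { [string match "*Error:*" $line] && ![string match "*harmless-warning*" $line] } {
            append errors "$line\n"
        }
    }
    if { $errors != "" } {
        ns_sendmail joe@yourdomain.com watchdog@yourdomain.com "errors from foobar" $errors
    }
}

# look at the new portion of the log every 15 minutes
ns_schedule_proc -thread 900 watchdog_scan_error_log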
It is impossible to test Internet connectivity from within your server cluster. It is impossible to test the InterNIC's root servers from a machine that uses your DNS servers as its name server. Thus, you really want to monitor end-to-end connectivity from a machine outside of your network that requests "http://www.yourdomain.com/SYSTEM/test.text" or something (i.e., the request is for a hostname, not an IP address).
The ArsDigita Server Architecture mandates monitoring from our own Uptime service, either the public installation or a private copy.
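The external check itself can be tiny. A sketch in AOLserver Tcl, to be run on a machine outside your network; it assumes the test page contains a known string such as "success", and the addresses are illustrative.

proc check_foobar_from_outside {} {
    # ns_httpget raises an error if the server is unreachable or times out
    if { [catch { ns_httpget "http://www.yourdomain.com/SYSTEM/test.text" 30 } page]
         || ![string match "*success*" $page] } {
        ns_sendmail sysadmins@yourdomain.com uptime@some-other-network.com \
            "www.yourdomain.com failed its end-to-end test" ""
    }
}

# test end-to-end connectivity every five minutes
ns_schedule_proc -thread 300 check_foobar_from_outside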
How to avoid outages of this nature in the first place? Pay your InterNIC bills in advance. We know countless domain owners, including some multi-$billion companies, who've been shut off by the InterNIC because they were a few days late in paying a bill. Sometimes InterNIC sends you a letter saying "if you don't pay by Date X, we'll shut you down" but then you find that they've shut you down on Date X-4.
In 1999, Microsoft failed to pay their Network Solutions renewal for a domain used by the Hotmail service that they acquired. On Christmas Eve the service became unreachable. Michael Chaney, a Linux programmer, debugged the problem and paid the $35 so that he could read his email again. Thus we find the odd situation of a Microsoft-owned service that runs without any Microsoft software (Hotmail runs on Solaris and FreeBSD Unix) and with financial backing from the Linux community.
You can avoid service loss due to power outage by buying a huge, heavy, and expensive uninterruptible power supply. Or you can simply co-locate your server with a company that runs batteries for everyone with a backup generator (above.net and Exodus do this).
Most Internet service providers (ISPs) are terribly sloppy. Like any fast-growth industry, the ISP biz is a magnet for greedy people with a dearth of technical experience. That's fine for them. They'll all go public and make $100 million. But it leaves you with an invisible server and a phone call to someone they hired last week. Moreover, even if your ISP were reliable, they are connected to the Internet by only one Tier 1 provider, e.g., Sprint. If the Sprint connection to your town gets cut, your server is unreachable from the entire Internet.
There is an alternative: co-locate at AboveNet or Exodus. These guys have peering arrangements with virtually all the Tier 1 providers. So if Sprint has a serious problem, their customers won't be able to see your server, but the other 99% of the Internet will be able to get to your pages just fine.
For public Internet services, the ArsDigita Server Architecture mandates co-location at AboveNet or Exodus. For Intranet services, the ArsDigita Server Architecture mandates location as close as possible to a company's communications hub.
The bottom line? Users get a "server busy" page.
How to defend against this situation? First, by building a robust MTA installation that doesn't get wedged and, if it does get wedged, making sure you find out about it quickly. A wedged MTA would generally result in errors being written to the error log and therefore you'd expect to get email from the ArsDigita Watchdog monitor... except that the MTA is wedged so the sysadmins wouldn't see the email until days later. So any monitoring of the MTA needs to be done by an external machine whose MTA is presumably functional.
The external machine needs to connect to the SMTP port (25) of the monitored server every five minutes. This is the kind of monitoring that the best ISPs do. However, I don't think it is adequate because it usually only tests one portion of the average MTA. The part of the MTA that listens on port 25 and accepts email doesn't have anything to do with the queue handler that delivers mail. Sadly, it is the queue handler that usually blows chunks. If you don't notice the problem quickly, you find that your listener has queued up 500 MB of mail and it is all waiting to be delivered but then your /var partition is full and restarting the queue handler won't help... (see beginning of this section).
What you really need to do is monitor SMTP throughput. You need a program that connects to the monitored server on port 25 and specifies some mail to be sent to itself. The monitor includes a script to receive the sent mail and keep track of how long the round-trip took. If mail takes more than a minute to receive, something is probably wrong and a sysadmin should be alerted.
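Here is a sketch of the sending half of such a probe, in Tcl. The hostnames and mailbox are illustrative, and a real monitor would also record the send time in the database so that the script receiving the message can compute and log the round trip.

proc probe_smtp_throughput { monitored_host } {
    # speak just enough SMTP to hand the monitored server a message addressed
    # back to the monitor's own mailbox
    set s [socket $monitored_host 25]
    fconfigure $s -buffering line -translation crlf
    gets $s greeting
    foreach cmd [list "HELO monitor.yourdomain.com" \
            "MAIL FROM:<mta-probe@monitor.yourdomain.com>" \
            "RCPT TO:<mta-probe@monitor.yourdomain.com>" \
            "DATA"] {
        puts $s $cmd
        gets $s reply
    }
    puts $s "Subject: mta-probe"
    puts $s ""
    puts $s "sent-at [ns_time]"
    puts $s "."
    gets $s reply
    puts $s "QUIT"
    close $s
}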
We are going to build a database-backed Web service to do this monitoring, mostly by adapting email handling scripts we needed for the Action Network. In a fit of imagination, we've called it ArsDigita MTA Monitor. It is available from http://arsdigita.com/free-tools/mmon.html.
Even with lots of fancy monitors, it is unwise and unnecessary to rely on any MTA working perfectly. Moreover, if an MTA is stuck, it is hard to know how to unstick it quickly without simply dropping everyone's accumulated mail. This might not be acceptable.
My preferred solution is to reprogram applications so that threads release database handles before trying to send email. In the case of the insert-into-bboard program above, the thread uses the database connection to accumulate email messages to send in an ns_set data structure. After releasing the database connections back to AOLserver, the thread proceeds to attempt to send the accumulated messages.
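A sketch of that pattern, in AOLserver Tcl; the table name bboard_email_alerts and the message text are made up for the example.

# while we hold a database handle, just accumulate the messages
set db [ns_db gethandle]
set alerts [ns_set create alerts]
set row [ns_db select $db "select email from bboard_email_alerts where topic = 'photo.net'"]
while { [ns_db getrow $db $row] } {
    # key = recipient, value = message body
    ns_set put $alerts [ns_set get $row email] "Someone has posted a new message in a forum you watch."
}
# give the handle back to AOLserver *before* talking to the possibly-wedged MTA
ns_db releasehandle $db
for { set i 0 } { $i < [ns_set size $alerts] } { incr i } {
    ns_sendmail [ns_set key $alerts $i] bboard-robot@yourdomain.com \
        "new bboard posting" [ns_set value $alerts $i]
}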
One minor note: you don't want someone using your server to relay spam (i.e., connect to your machine and send email with a bogus return address to half the Internet). Configuring your machine to deny relay is good Internet citizenship but it can also help keep your service up and running. If your MTA is working to send out 300,000 spam messages and process the 100,000 bounces that come back, it will have a hard time delivering email alerts that are part of your Web service. It might get wedged and put you back into some of the horror scenarios discussed at the beginning of this section.
For email, the ArsDigita Server Architecture mandates
We have built a tool called ArsDigita Traffic Jamme that is really an AOLserver repeatedly calling ns_httpget. It is a quick and dirty little tool that can generate about 10 requests per second from each load machine.
The software is available from http://www.arsdigita.com/free-tools/tj.html.
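The essential loop is only a few lines of Tcl; the URL and the thread and request counts below are whatever you feel like typing.

proc jam { url n } {
    # fetch url n times as fast as this thread can and log the elapsed time
    set start [ns_time]
    for { set i 0 } { $i < $n } { incr i } {
        catch { ns_httpget $url }
    }
    ns_log Notice "jam: $n requests to $url in [expr {[ns_time] - $start}] seconds"
}

# e.g., ten parallel client threads of 1000 requests each
for { set t 0 } { $t < 10 } { incr t } {
    ns_schedule_proc -thread -once 1 "jam http://www.yourdomain.com/ 1000"
}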
If you are ordering a computer and can't figure out how big a machine to get, our practical experience is that you can handle about 10 db-backed requests per second with each 400 MHz SPARC CPU (using AOLserver querying into indexed Oracle8 tables). Our benchmarks show that a 160 MHz HP-PA RISC CPU is just as fast (!). If you're serving static files with a threaded server program like AOLserver, you can saturate a 10 Mbit link with a ridiculously tiny computer. So don't worry much about the extra server overhead from delivering photos, illustrations, and graphics.
What if your one big computer fails the load test?
Some of the folks using this architecture have services that get close to 200 million hits per day. That's 40 million page loads, which is about 1000 page loads per second during peak afternoon hours. By the standard above, it would seem that in mid-1999 you could handle this with 40 of the fastest HP-PA CPUs or 100 SPARC CPUs ... Oops! The biggest Sun E10000 can only hold 64 CPUs.

Another way to run out of juice is if you're running a public Web collaboration service with private data. In this case, all the pages will be SSL-encrypted. This puts a tremendous CPU load on the Web server, especially if your service revolves around images or, even worse, PDF files with embedded print-resolution images.
The solution? Buy the big computer, but only use it to run Oracle. Buy a rack of small Unix machines and run AOLserver, including the SSL module, on these. Suppose that you end up with 21 computers total. Haven't you violated the fundamental tenets of the philosophy expounded here? What kind of idiot would build a service that depends on 21 computers all being up and running 24x7? Well, let's not go into that right now... but anyway, that's not what the ArsDigita Server Architecture proposes.
Generally we rely on only one source of server-side persistence: Oracle. As noted in the preceding section, we "keep user session state either in cookies on the user's browser or in the RDBMS on the server". Sometimes data are cached in AOLserver's virtual memory but the service doesn't fail if a request misses the cache, it is only slowed down by enough time to do an Oracle query. Given this fact, it doesn't really matter if Joe User talks to Server 3 on his first request and Server 17 on his second. So we can use load-balancing network hardware to give all the machines in a rack of Web servers the same IP address. If one of the servers dies, all future requests will be routed to the other servers. Joe User doesn't depend on 21 computers all being up. Joe User depends on the database server being up, at least 1 out of 20 of the Web servers being up, and the fancy network hardware being up. We've solved our scalability problem without dramatically reducing site reliability. (Sources of fancy network hardware: Alteon, Foundry Networks, and Cisco (Local Director).)
What if you're breaking some of the rules and relying on AOLserver to maintain session state? You have to make sure that if Joe User starts with Server 3 every subsequent request from Joe will also go to Server 3. You don't need fancy network hardware anymore. Give each server a unique IP address. Give each user a different story about what the IP address corresponding to www.yourdomain.com is. This is called "round-robin DNS". A disadvantage of this approach is that if a server dies, the users who've been told to visit that IP address will be denied service. You'll have to be quick on your toes to give that IP address to another machine. Also, you won't get the best possible load balancing among your Web servers. It is possible that the 10,000 people whom your round-robin DNS server directed to Server 3 are much more Web-interested than the 10,000 people whom your round-robin DNS server directed to Server 17. These aren't strong arguments against round-robin DNS, though. You need some kind of DNS server and DNS is inherently redundant. You've eliminated the fancy network hardware (a potential source of failure). You'll have pretty good load balancing. You'll have near state-of-the-art reliability.
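The setup is nothing more than multiple A records for the same name in your DNS zone file (the addresses below are hypothetical); BIND rotates the order of the answers it hands out, so successive users are pointed at different machines.

www.yourdomain.com.    IN    A    10.0.0.3
www.yourdomain.com.    IN    A    10.0.0.4
www.yourdomain.com.    IN    A    10.0.0.5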
[An astute reader will note that I didn't address the issue of what happens when the computer running Oracle runs out of power. This is partly because modern SMP Unix boxes are tremendously powerful. The largest individual HP and Sun machines are probably big enough for any current Web service. The other reason that I didn't address the issue of yoking many computers together to run Oracle is that it is tremendously difficult, complex, and perilous. The key required elements are software from Oracle (Parallel Server), a disk subsystem that can be addressed by multiple computers (typically an EMC disk array), and a few $million to pay for it all.]
How to administer a configuration like this? See "Web Server Cluster Management".
This might sound like an odd perspective coming from ArsDigita, a company whose reputation is based on design, engineering, and programming. But think about your own experience. Have you ever had a flash of inspiration? A moment where you had a great and clever idea that made a project much more elegant?
Congratulations.
Have you ever gone for five straight years without making any mistakes? Neither have we. Here's a letter that I sent a customer after we were all patting ourselves on the back for building an ecommerce system that successfully processed its first few thousand orders:
We are now running a fairly large-scale software and hardware system. The components include
1) software at factory
2) factory bridge
3) Web server software
4) CyberCash interface
5) data warehouse software
6) disaster recovery software and hardware
A lot of valuable knowledge is encapsulated inside of various folks' heads. This wouldn't be such a bad problem if
a) we all worked in the same building
b) we only needed the knowledge 9 to 5 when everyone was at work
But in practice we have people separated in time and space and therefore knowledge separated in time and space. What we therefore need is an on-line community! (Who would have guessed that I would have suggested this.)

Most Web services are necessarily maintained by people scattered in space and time. There is no practical way to assemble everyone with the required knowledge in one place and then keep them there 24x7 for five years.
Note that corporate IT departments face the same sorts of problems. They need to keep knowledge and passwords where more than one person can get to them. They need to make sure that backups and testing are happening without merely relying on an individual's word and memory. They need to make sure that bugs get identified, tracked, and fixed.
The corporate IT folks have it easy, though, in many ways. Typically their operation need not run 24x7. An insurance company probably has overnight analysis jobs but a database that is down at 2:00 am won't affect operations. By contrast, at an ecommerce site, downtime at 2:00 am Eastern time means that customers in Europe and Asia won't be able to order products.
Another way in which corporate IT folks have it easy is that there is a logical physical place for written system logs and password files. The mainframe is probably located in a building where most of the IT workers have offices. There are elaborate controls on who gets physical access to the mainframe facility and therefore logs may be safely stored in a place accessible to all IT workers.
What do we need our virtual community to do? First, it has to run on a separate cluster and network from the online service. If www.foobar.com is unreachable, any staff member ought to be able to visit the collaboration server and at least get the phone number for the sysadmins and ISP responsible for the site. Thus the only information that a distraught staffer needs to remember is his or her email address, a personal password, and the hostname of the collaboration server. We can standardize on a convention of staff as the hostname, e.g., http://staff.foobar.com.
Second, the community needs to distinguish among different levels of users. You want a brand-new staff person involved with the service to be able to report a bug ("open a ticket"). But you don't want this person to be able to get information that would appropriately be given only to those with the root password on the online server.
Third, the community needs to make the roles of different members explicit. Anyone should be able to ask "Who is responsible for backups?" or "Who are the Oracle dbas?"
Fourth, the community needs to keep track of what routine tasks have been done, by whom, and when. If tasks are becoming overdue, these need to be highlighted and brought to everyone's attention. For example, the community should have visible policies about how often backup tapes are transferred off-site, about how often backup tapes are verified, and about how often Oracle dumps are restored for testing. This is the function closest to the "virtual logbook" idea. It is essential that this be high quality software that is easy for people to use. Consider the corporate IT department where everyone can see the tapes, see people changing them, see people verifying some of the tapes, see people taking some of the tapes off-site. If some of these processes were being ignored, the staff would notice. However, with a Web service, the machine is probably at a co-location service such as AboveNet or Exodus. Most staffs will probably never even lay eyes on the server or the backup tapes. So a miscommunication among staffers or between staff and the ISP could lead to backup tapes never getting changed or verified.
Fifth, the community site needs to keep track of what important admin tasks are going on, who is doing them, and why. For example, suppose that some file systems have become corrupt and Joe Admin is restoring them from tape. Joe has commented out the normal nightly backup script from root's crontab because otherwise there is a danger that the cron job might write bad new data over the important backup tape (remember that Joe will not be physically present at the server; he will have to rely on the ISP to change tapes, set the write-protect switch on the tape, etc.). If Jane Admin does not know this, she might note the absence of the backup script in root's crontab with horror and rush to put it back in. Remember that Jane and Joe may each be working from homes in San Francisco and Boston, respectively. There needs to be a prominent place at staff.foobar.com where Jane will be alerted of Joe's activity.
Sixth, the community software needs to be able to track bug reports and new feature ideas. In order to do a reasonable job on these, the software needs to have a strong model of the software release cycle behind the site. For example, a programmer ought to be able to say, in a structured manner, "fixed in Release 3.2." Separately the community database keeps track of when Release 3.2 is due to go live. Feature request and bug tracking modules tie into the database's user groups and roles tables. If a bug ticket stays open too long, email alerts are automatically sent to managers.
Seventh, the community software needs to at least point to the major documentation of the on-line service and also serve as a focal point for teaching new customer service employees, programmers, db admins, and sysadmins. A question about the service ("how do I give a customer an arbitrary $5 refund?") or the software ("what cron job is responsible for the nightly Oracle exports?") should be archived with its answer.
How do we accomplish all this? With the ArsDigita Community System! Plus a few extra modules that are adapted to ticket tracking and have special columns for "backup tapes verified date" and so forth.
The naive approach:
A better way is to prepare by keeping a server rooted at /web/comebacklater/www/ with its own /home/nsadmin/comebacklater.ini file, ready to go at all times. This server is configured as follows:
It has a private Tcl library directory (so that it won't pick up ns_register_proc commands that might be invoked by the shared library, e.g., those that feed *.tcl URLs to the Tcl interpreter), e.g.,

[ns/server/comebacklater/tcl]
Library=/web/comebacklater/tcl
SharedLibrary=/web/comebacklater/tcl

It also has a Tcl file in that private library that registers a procedure to answer every GET or POST with a single static page:

ns_register_proc POST / comeback
ns_register_proc GET / comeback

proc comeback {ignore} {
    ns_returnfile 200 text/html "[ns_info pageroot]/index.html"
}
When it comes time to put the comebacklater server into service, comment out the production server's line in /etc/inittab, run init q (or kill -HUP 1) to instruct init to reread /etc/inittab, kill the running production server (ps -ef | grep 'foobar.ini' for the production server's PID, then kill PID), uncomment the comebacklater server's line in /etc/inittab, and run init q (or kill -HUP 1) again to instruct init to reread /etc/inittab and start the comebacklater server.

When the production server is ready to return, reverse the procedure: comment out the comebacklater line, run init q (or kill -HUP 1) to instruct init to reread /etc/inittab, kill the comebacklater server, uncomment the production server's line, and run init q (or kill -HUP 1) once more; init will reread /etc/inittab and respawn the production server.
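In other words, /etc/inittab carries both lines all the time, with the line for whichever server you are not running commented out (the "ncb" id here is an arbitrary choice):

nsp:34:respawn:/home/nsadmin/bin/nsd -i -c /home/nsadmin/philg.ini
#ncb:34:respawn:/home/nsadmin/bin/nsd -i -c /home/nsadmin/comebacklater.ini

Switching over is just a matter of moving the comment character from one line to the other, running init q, and killing the server you no longer want.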
Another way to look at this is in terms of the infrastructure layers that we introduced at the beginning of this document:
If you have to deal with a very large site - dozens to hundreds of hosts - you'll need more than what Philip suggests here. Here are a couple of suggestions: 1) Become a member of USENIX. The USENIX LISA (System Administration for Large Systems) conferences are an excellent resource, and if you're a USENIX member, you can get the papers on-line.
2) One resource that was talked up at LISA 98 was the cfengine tool. Go check http://www.iu.hioslo.no/cfengine/ for more details.
3) If you have lots of systems, having a model for how you go about managing them is more important than what tool you use. (Once you know what you're trying to get done, there are many tools that will let you accomplish your goal.) I was very keen on a paper at LISA 98 called "Bootstrapping an Infrastructure"; the authors have made their paper available at http://www.infrastructures.org.
-- Paul Holbrook, January 25, 1999