(Written in July 1999 for www.linuxworld.com; permanently installed now in Web Tools Review)
As noted in Part 1 of this article, AOLserver was designed from the ground up as a tool for building database-backed Web sites. What does that mean architecturally?
AOLserver's ns_db
abstraction layer means that Web scripts
don't have to contain the database username or password. This makes
your site a bit more secure than a CGI-backed site but, more
importantly, it makes it easier to develop toolkits and reuse code from
site-to-site. The abstraction layer also means that your Web scripts
won't depend on the particulars of an RDBMS vendor's C library. For
example, if you decide to move your service from Sybase to Oracle, you
won't have to change your Tcl code. Sadly, however, you probably
will
have to change some of your SQL code because Oracle and Sybase handle
dates differently. SQL is not a very standard standard.
The single log file idea sounds trivial but it is the key to debugging a database-driven site. If you're not happy with what is in the database, you can look through the AOLserver error log and see exactly what AOLserver told the database to do, in sequence.
Currently available database drivers include ones for Informix Universal
Server, Interbase, Oracle, PostgreSQL, Solid, and Sybase. The Solid C
library is basically the ODBC standard and therefore the Solid driver
might well work with any RDBMS vendor's ODBC interface. Nothing in the
AOLserver architecture demands that their ns_db
abstraction/pooling layer be used with a relational database
management system. We've written experimental drivers for the PLS text
database software, for example, and the ideas are the same but you
aren't sending SQL code through the ns_db
calls.
There are instructions on how to write your own database driver for
AOLserver in Appendix B of the AOLserver C Developer's
Guide, distributed from www.aolserver.com. It takes a good
C programmer two or three days to write a driver.
Why Oracle?
In an article about an open-source Web server, why give examples that
use a closed-source relational database management system? Why not use
MySQL (not free but open-source), for example? I never considered using
MySQL because it doesn't try to solve any of the problems that I'm
relying on Oracle to solve, e.g., the ACID transaction processing
guarantees:
An imaginative marketing person might call a computer program a relational database management system even if it did not provide the ACID guarantees. However, all commercial RDBMSes provide these guarantees and therefore asking someone to compare MySQL to Oracle is almost as absurd as asking for a comparison between Emacs and Oracle.
- Atomicity
- Results of a transaction's execution are either all committed or all rolled back. All changes take effect, or none do. Suppose that a user is editing a comment. A Web script tells the database to "copy the old comment value to an audit table and update the live table with the new text". If the hard drive fills up after the copy but before the update, the audit table insertion will be rolled back.
- Consistency
- The database is transformed from one valid state to another valid state. A transaction is legal only if it obeys user-defined integrity constraints. Illegal transactions aren't allowed and, if an integrity constraint can't be satisfied the transaction is rolled back. For example, suppose that you define a rule that postings in a discussion forum table must be tied to a valid user ID. Then you hire Joe Novice to write some admin pages. Joe writes a delete-user page that doesn't bother to check whether or not the deletion will result in an orphaned discussion forum posting. Oracle will check, though, and abort any transaction that would result in you having a discussion forum posting by a deleted user.
- Isolation
- The results of a transaction are invisible to other transactions until the transaction is complete. For example, suppose you have a page to show new users and their photographs. This page is coded in reliance on the publisher's directive that there will be a mugshot for every user and will present a broken image if there is not. Jane Newuser is registering at your site at the same time that Bill Olduser is viewing the new user page. The script processing Jane's registration does inserts into several tables:
users, mugshots, users_demographics
. This may take some time if Jane's mugshot is large. If Bill's query starts before Jane's transaction commits, Bill won't see Jane at all on his new-users page, even if Jane's insertion into some of the tables is complete.- Durability
- Once committed (completed), the results of a transaction are permanent and survive future system and media failures. Suppose your ecommerce system inserts an order from a customer into a database table and then instructs CyberCash to bill the customer $500. A millisecond later, before your server has heard back from CyberCash, someone trips over the machine's power cord. Oracle will not have forgotten about the new order. Furthermore, if a programmer spills coffee into a disk drive, it will be possible to install a new disk and recover the transactions up to the coffee spill, showing that you tried to bill someone for $500 and still aren't sure what happened over at CyberCash.
The open-source purist's only realistic choice for an RDBMS is PostgreSQL, available from www.postgresql.org. In some ways, PostgreSQL has more advanced features than any commercial RDBMS. Most important, the loosely organized unpaid developers of PostgreSQL were able to convert to an Oracle-style multi-version concurrency system (see below), leaving all the rest of the commercial competition deadlocked in the dust. If you've decided to accept John Ousterhout as your personal savoir, you'll be delighted to learn that you run Tcl procedures inside the database. If your business can't live without commercial support for your RDBMS, you can buy it from www.pgsql.com. As of November 1999, however, there were a few flies in the PostgreSQL ointment, most glaringly lack of OUTER JOIN and no enforcement of referential integrity constraints (i.e., PostgreSQL can't be said to do the "C" in ACID).
If you're not going to be an open-source purist, the natural choice is to use what everyone else uses: Oracle. What's good about Oracle?
Concurrency is Oracle's strongest suit relative to its commercial
competitors. In Oracle, readers never wait for writers and writers
never wait for readers. Suppose the publisher at a large site starts a
query at 12:00 PM summarizing usage by user. Oracle might have to spend
an hour sifting through 200 GB of tracking data. The disk drives grind
and one CPU is completely used up until 1:30 PM. Further suppose that
User #356712 comes in at 12:30 PM and changes his email address, thus
updating a row in the users
table. If the usage tracking
query arrives at this row at 12:45 PM, Oracle will notice that the row
was last modified after the query started. Under the "I" in ACID,
Oracle is required to isolate the publisher from the user's update.
Oracle does this by reaching into the rollback segment and producing
data from user row #356712 as it was at 12:00 PM when the query
started. Here's the same scenario in a table:
How would this play out in another RDBMS, for example Microsoft SQL Server? The majority of database management systems take the sophomore computer science student approach to concurrency control. When you're reading, you take read locks on the information that you're about to read. Nobody can write until you release them. When you're writing, you take write locks on the information that you're about to update. Nobody can read or write until you release the locks. In the preceding example, User #356712 would submit his request for the address change at 12:30 PM. The thread on the Web server would be blocked waiting for the read locks to clear. How long would it wait? A full hour with a spinning Netscape icon. Sound like any of the Web sites you've visited?
Time Publisher Public Web Service 12:00 PM Starts a 90-minute query summarizing usage for preceding year -- 12:30 PM usage summary continues to chug away User #356712 updates email address from "joe@foobar.com" to "joe@yahoo.com" 12:45 PM usage summary arrives at User #356712; Oracle reaches into rollback segment and pulls out "joe@foobar.com" for the report, since that's what the value was at 12:00 PM when the query started -- 1:30 PM usage summary report completes --
The reason that RDBMSes appear to be competitive is that in transaction processing benchmarks all the transactions are brief and can be handled in less than one second. Thus every RDBMS vendor can publish impressive reports of tens of thousands of transactions processed per minute. The bottom line in the real world is that Oracle can support vastly more users than competitors due to its unique multi-version approach to implementing concurrency.
Another important thing about Oracle is that you can run procedural programs inside the RDBMS server. For example, suppose that you want to look for all rows in a billion-row table that satisfy a particular regular expression. Working through one billion rows isn't going to be fast, no matter what. The problem takes on terrifying dimensions when you realize that there is no REGEXP operator in standard SQL. You're going to have to pull all billion rows across the network into AOLserver Tcl and then discard the ones that don't satisfy a Tcl REGEXP. You can relax if you're running Oracle 8i, actually. Because the RDBMS includes built-in Java support, you can write a little Java program to look through the database locally.
AOLserver Tcl is the best tool for Web programming 99% of the time. But occasionally you need a procedural language with rich data types. For example, if you have to parse XML or email headers, it is easy to load up a standard Java parser into Oracle and shuffle stuff around within Oracle. In addition to Java, Oracle includes support for the vaguely ADA-inspired language PL/SQL. In many ways, database programming in PL/SQL is more concise and convenient than in Java.
Life with Oracle is pretty sweet once you get through the horror of
installation and administration. But would it not be sweeter yet to run
an open-source RDBMS? Probably, but make sure that data integrity,
concurrency, memory leaks, security, and operating system issues are
solidly under control first. We should not be deceived by the
success of open-source operating systems into thinking that an
open-source RDBMS must necessarily be superior. Linux is great but keep
in mind that the OS problem is a very old one, adequately solved by the
late 1960s. Remember further that dozens of brilliant programmers have
labored for more than a decade to bring us Linux. The RDBMS problem
wasn't really solved well by commercial vendors until around 1990 and
few programmers have worked on open-source RDBMS. The success of Linux
should be inspiration for a future assault on the RDBMS problem, not a
means of deluding ourselves into thinking that open-source software is
always better (or even adequate).
The Basics
Here are some examples of db-backed Web site scripts in the AOLserver
Tcl API. For concreteness, we use the Oracle RDBMS but all of the
examples would work without significant modification in any other SQL
database.
At this point in the script, the values returned from the database are in an ns_set data structure, stored in the Tcl variable# ask AOLserver to give us a database connection from the default pool set db [ns_db gethandle] set sql_query "select first_names, last_name from users where user_id = $user_id" # actually send the query to the RDBMS; tell AOLserver we # expect exactly one row back set selection [ns_db 1row $db $sql_query]
$selection
. The AOLserver ns_set is a data structure for
storing key-value pairs so that they are accessible to both C and Tcl
procedures. Right now, $selection
will contain values for
two keys: first_names
and last_name
. Here's
how a Tcl program can get them out:
This program isn't robust to missing data in the RDBMS. Suppose someone bookmarks this page in June, with an embedded user_id of 37. User #37 is deleted from the database in September and the bookmarked page is reloaded in December. Oracle won't return any rows in response to the page query and AOLserver will throw an error rather than return fromset first_names [ns_set get $selection "first_names"] set last_name [ns_set get $selection "last_name"] ns_return 200 text/plain "User #$user_id is named $first_names $last_name"
ns_db 1row
. The fix? 0or1row
:
Among our circle of programmers, we got tired of some of this typing and programmed ourselves some shortcuts, which we distribute from http://www.photo.net/wtr/thebook/utilities.txt. With these, you can save yourself some repetitive typing and also produce terser code where it will be easier to spot errors. Here are some examples:# ask AOLserver to give us a database connection from the default pool set db [ns_db gethandle] set sql_query "select first_names, last_name from users where user_id = $user_id" # actually send the query to the RDBMS; tell AOLserver we # expect at most one rows back set selection [ns_db 0or1row $db $sql_query] if { $selection == "" } { ns_return 200 text/plain "User #$user_id not found" } else { # we found a row in the RDBMS set first_names [ns_set get $selection "first_names"] set last_name [ns_set get $selection "last_name"] ns_return 200 text/plain "User #$user_id is named $first_names $last_name" }
# with database_to_tcl_string we can fetch one value from the database # and stuff it immediately into a variable set sql_query "select first_names || ' ' || last_name from users where user_id = $user_id" set user_name [database_to_tcl_string $db $sql_query] # or we could have used set_variables_after_query to save a little # typing when setting two set sql_query "select first_names, last_name from users where user_id = $user_id" set selection [ns_db 1row $db $sql_query] # now we can set first_names and last_name in our local # environment merely by calling set_variables_after_query, # which comes up into its calling environment and looks # at the ns_set stored in $selection set_variables_after_query ns_return 200 text/plain "User #$user_id is named $first_names $last_name"
If you're familiar with relational database programming in another language, you'll notice that AOLserver Tcl has exactly the same semantics: open a cursor, fetch each row in succession. This similarity isn't an accident; all the database vendors' C libraries have the same semantics.# ask AOLserver to give us a database connection from the default pool set db [ns_db gethandle] # let's do an ns_db select to open a multi-row cursor set selection [ns_db select $db "select one_line from bboard where user_id = $user_id order by posting_time"] # loop around and fetch the rows while { [ns_db getrow $db $selection] } { set one_line [ns_set get $selection "one_line"] append bboard_contributions "<li> $one_line\n" }
What if you don't want every row from the database? Maybe the query user is extremely loquacious. The 75 most recent messages ought to suffice to give a fellow community member a good idea of what to expect:
# ask AOLserver to give us a database connection from the default pool set db [ns_db gethandle] # let's do an ns_db select to open a multi-row cursor # this time let's ask for the rows sorted by newest first set selection [ns_db select $db "select one_line from bboard where user_id = $user_id order by posting_time desc"] set counter 0 # loop around and fetch the rows while { [ns_db getrow $db $selection] } { incr counter if { $counter > 75 } { # we've already collected up 75 postings, let's close the cursor ns_db flush $db break } set one_line [ns_set get $selection "one_line"] append bboard_contributions "<li> $one_line\n" }
Note: If you're running Oracle 8.1 or later, you can simplify your application code and limit results to the first 75 with the following query: select * from (select posting_time, one_line from bboard where user_id = $user_id order by posting_time desc) where rownum <= 75
Suppose someone trips over the server power cord when Oracle is halfway through processing theset_the_usual_form_variables # classified_ad_id set db [ns_db gethandle] # copy the ad to an audit table ns_db dml $db "insert into classified_ads_audit (...) select ... from classified_ads where classified_ad_id = $classified_ad_id" # delete any bids from the auction system ns_db dml $db "delete from classified_auction_bids where classified_ad_id = $classified_ad_id" # now delete the ad itself ns_db dml $db "delete from classified_ads where classified_ad_id = $classified_ad_id"
delete from
classified_auction_bids
command in this script. Each SQL
statement issued from AOLserver automatically executes as a
transaction. If Oracle is unable to commit, the transaction will be
rolled back and none of the bids will have been touched. However,
we'll have an extra row in the classified_ads_audit
table
because the deletion of the ad never in fact happened. The solution?
Tell the Oracle driver to go back into "commit when I say" mode:
Note that you won't findset_the_usual_form_variables # classified_ad_id set db [ns_db gethandle] ns_db dml $db "begin transaction" # copy the ad to an audit table ns_db dml $db "insert into classified_ads_audit (...) select ... from classified_ads where classified_ad_id = $classified_ad_id" # delete any bids from the auction system ns_db dml $db "delete from classified_auction_bids where classified_ad_id = $classified_ad_id" # now delete the ad itself ns_db dml $db "delete from classified_ads where classified_ad_id = $classified_ad_id" ns_db dml $db "end transaction"
begin transaction
in any Oracle
manual. It is simply an instruction to the driver to turn off
autocommit mode. The end transaction
tells the driver
to commit the work done so far and go back into autocommit mode.
Need to abort a transaction due to an error discovered in Tcl? Use
ns_db dml $db "abort transaction"
The high-level message to take away from this example: If you want to do something for your own benefit, like log, don't do it on the user's time! Return all the bytes that you can to the user's browser, close the connection, and use the remaining life of the thread to accomplish the publisher's goals.set_the_usual_form_variables # adv_key set db [ns_db gethandle] # figure out where the banner is supposed to lead # (on a high-volume site this could be cached with Memoize) # note that instead of using $adv_key, we use $QQadv_key, a version # created by set_the_usual_form_variables where any apostrophes are # escaped so that the whole variable value will be legal as an SQL # string literal set target_url [database_to_tcl_string $db "select target_url from advs where adv_key = '$QQadv_key'"] # tell the user's browser to visit the advertiser's site ns_returnredirect $target_url # close the connection to the client ns_conn close # we've returned to the user but let's keep this thread alive to log # we keep one row per ad PER DAY to count clickthroughs set update_sql "update adv_log set click_count = click_count + 1 where adv_key = '$QQadv_key' and entry_date = trunc(sysdate)" ns_db dml $db $update_sql # ask the Oracle driver how many rows were actually affected by the # last update set n_rows [ns_ora resultrows $db] if { $n_rows == 0 } { # there wasn't already a row there; this is the first # clickthrough of the day; insert a row ns_db dml $db "insert into adv_log (adv_key, entry_date, click_count) values ('$QQadv_key', trunc(sysdate), 1)" } # see if the publisher has configured per-user logging # (useful if you want to ask "of the users who read at # least one article on stain-removal, how many clicked # on the Tide banner ad") if [ad_parameter DetailedPerUserLoggingP adserver 0] { set user_id [ad_get_user_id] if { $user_id != 0 } { # we know who this user is ns_db dml $db "insert into adv_user_map (user_id, adv_key, event_time, event_type) values ($user_id, '$QQadv_key', sysdate, 'c')" } }
Your situation will boil down to one of the following:
In the "one small box with one CPU" situation, you only have to tune your operating system kernel parameters to allow one process to keep many files open simultaneously. Remember that if one AOLserver process is delivering 1000 files simultaneously to 1000 users, it will need to keep 1000 files open (the default limit with many OSes is 64 or 256). | |
In the "one big box with many CPUs" situation, you first have to decide whether you want to let one AOLserver process range among the CPUs. For example, the 4-CPU machine that sits behind http://www.photo.net/ is running an Oracle database and more than 20 AOLservers, each of which supports a Web service. It might be a good thing that a single AOLserver can't consume more than 25% of the machine's resources. On the other hand, if the RDBMS is on a separate computer and the multi-CPU machine is dedicated to one site, it is important to get all the horsepower behind one AOLserver process. For this, you need an operating system with kernel threads, HP-UX 11.0, Solaris, or Windows NT for example. | |
In the "many small boxes" situation, you need some way of sharing the workload so that your service doesn't depend on them all being up and running simultaneously. You can get a load-balancing router that will give all of your servers the same IP address. When a request for a new TCP connection to your Web service comes in, the router hands it to whichever physical machine is least busy. Companies that make these load balancers include Alteon, Cisco, and Foundry Networks (http://www.alteon.com, http://www.cisco.com, and http://www.foundrynet.com). If you don't want to invest in learning about how to run these little boxes, another approach to spreading the load is to run a standard Unix machine as your load balancer. On this computer you run a Web server whose only job is to proxy requests back to a server farm. Because the load balancing server isn't doing anything except shuffling bytes from one Ethernet card to another, you ought to be able to handle about 400 requests per second per CPU via this method. AOLserver 3.0 has 95% of what you need to implement this, via its nsvhr virtual host redirection module. For a look at the feature list of a complete implementation, read the clustering support section of the Zeus Web server literature at http://www.zeus.co.uk. |
An astute reader will note that I didn't address the issue of what happens when the database server runs out of power. This is partly because modern SMP Unix boxes are tremendously powerful. The largest individual HP and Sun machines are probably big enough for any current Web service. The other reason that I didn't address the issue of yoking many computers together to run an RDBMS is that it is tremendously difficult, complex, and perilous. The key required elements are software from the database vendor (e.g., Oracle Parallel Server), a disk subsystem that can be addressed by multiple computers (typically an EMC disk array), an expert team of administrators, and a few $million to pay for it all.
To have a reliable Web service, you need to address every possible source of unreliability. First, admit to yourself that, as much fun as it might be to blame tools, it is your own code that is going to cause the user the most trouble. Accept the reality that Web services are usually built with loose specs, to tight deadlines, by people who aren't the world's best educated or most experienced programmers. Once online, Web applications are subject to severe stresses. For example, operations may get aborted due to network failure. The consequence? If your transaction processing house isn't in order, your database will become corrupt. You might get dozens of simultaneous users in a second. The consequence? If your concurrency control house isn't in order, your database will become corrupt.
A workable starting point is
This tells init that, when it is in run state 3 or 4 (i.e., up and running), to run the NaviServer daemon (nsd) in interactive mode, with the config filensp:34:respawn:/home/nsadmin/bin/nsd -i -c /home/nsadmin/philg.ini
/home/nsadmin/philg.ini
.
Note that this same set of techniques and open-source software would work with any other Web server program, especially easily with a threaded Web server that only requires one Unix process.How could an AOLserver get clogged with threads and need to be killed? Easily. Suppose Siegfried in Stuttgart is paying 10 cents/minute for local phone service to connect his 56K modem. He likes the way that your home page looks but he doesn't want to read it while paying the phone company's rates. He instructs his desktop Web robot to "grab all the pages pointed to by this page recursively, but only if they are on the same server." With the average corporate site, this would be 30 or 40 pages that could all be downloaded within a few minutes. On a site like http://www.photo.net/photo/, however, Siegfried's robot is embarking on a 10 GB download project (500 hours at 56K). It would be fine for AOLserver except that some of these robots seem to aim for breadth-first searching. They'll request as many files as possible at the same time. I've configured the http://www.photo.net/photo/ AOLserver to only spawn 120 threads. At various times I've found European robots requesting 120 large photos at the same time, all going to the same IP address, all stacking up to get through a 28.8 or 56K modem. In this situation Keepalive would restart the server and the administrator would get email. The administrator would probably figure out the situation and block the robot's IP address via the permissions system.
Another denial-of-service scenario occurs when all the database connections are behind tied up by users on slow Internet connections. Here's one way of architecting a Web script:
ns_write
headers and as many page body bytes as possible to the
client so that the user isn't staring at a blank screen
ns_write
bytes to the connection row-by-row as we get them from the
RDBMS
ns_write
the page footer
ns_write
is called the stream is
flushed. For a page displaying hundreds or thousands of rows, the
elapsed time until the entire page is loaded will probably be larger
than if the entire page were accumulated in a string and then
ns_return
'd.
Much more problematic than this performance issue is the user on a slow
connection who can't accept all the packets as fast as AOLserver can
send them. Once this user has been sent a certain number of
unacknowledged packets, ns_write
will wait for awhile.
During this interval, the script is holding onto a database connection.
If you've got eight database connections configured, enough for a
million-hit-per-day site, it only takes eight users on slow links viewing
a page like this to make every other user of the system wait! The
server will still be handling requests for static files quickly, but any
request that requires the database will leave the user staring at a
spinning browser icon.
The solution to this problem is to build pages with the following structure:
append whole_page $one_more_row_from_the_database
ns_db
releasehandle
ns_return
to send the entire page back to the user
What about bugs that manage to escape notice on both the dev and staging servers? After launching a new Web service, it is common practice to check the error logs every hour or so to make sure that users aren't encountering bugs in the scripts. It is also common practice to look at a server log analyzer's report showing which URLs generated server errors during the preceding day.
There are a couple of problems with these common practices. First, users and the publisher might not be too thrilled about server errors piling up for 24 hours before anyone notices. Second, server errors that occur infrequently are likely to go unnoticed forever.
What you really need to do is run a program every 15 minutes that checks the server error log. If it finds any new errors, it emails them to set of programmers and sysadmins. You probably can't rely on humans to be vigilant and responsive, but with a vigilant computer program and responsive humans the users will be reasonably well served. Here's an actual sequence of events from an ecommerce site that we built:
If you've built a Web site called "foobar.com", why not build a community for the folks behind it at "staff.foobar.com"? Run this on a separate computer in a separate network so that when the main server is unreachable, a person need only connect to the staff site to find out where the computer is, who the network administrators are, what the serial number of the computer running the site is, how to reach the system administrators, etc.
For a comprehensive look at all reliability and monitoring issues, with links to free open-source packages, see http://www.photo.net/wtr/arsdigita-server-architecture.html.
telnet webcenters.netscape.com 80 Trying... Connected to webcenters.netscape.com. Escape character is '^]'. HEAD / HTTP/1.0 HTTP/1.0 302 Found Location: http://home.netscape.com MIME-Version: 1.0 Date: Sat, 30 Oct 1999 03:23:06 GMT Server: NaviServer/2.0 AOLserver/2.3.3 Content-Type: text/html Content-Length: 319