New: 2-Dec-96
Klaus Weide found a bug in the patch. (See his message and my reply below.)
-John Heidemann

----------------------------------------------------------------------
X-url:
To: new-httpd@hyperreal.com
Subject: mmap patch for Apache-1.0.5
Date: Fri, 24 May 1996 11:01:41 -0700
From: John Heidemann

At ISI I'm looking at web server performance as part of the LSAM
project (http://www.isi.edu/div7/lsam/).  As part of this analysis we
found an optimization to Apache performance: by using memory-mapped
files (rather than stdio), CPU utilization can be reduced when sending
large files.

The attached patch implements this optimization in Apache-1.0.5.
Performance is examined in more detail in the long comment at the
beginning of the patch.

Although the patch is for Apache-1.0.5, the port to 1.1bX should be
fairly easy.  If people think that the patch is suitable for inclusion
in a future release of Apache (probably 1.2), then I will do the port.

Comments?

   -John Heidemann
    USC/ISI

----------------------------------------------------------------------
Date: Fri, 22 Nov 1996 11:11:58 -0600 (CST)
From: Klaus Weide
To: johnh@dash.isi.edu
Subject: bug in "mmap patch for Apache-1.0.5"?
Message-ID:
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII

(I am referring to the version found at as of today.)

Hello,

  I looked over the patch mentioned above, and it appears to me that
there is a flaw in the logic which determines the `segment_length'
for the (first) mmap() call.  This would only be relevant if

  (1) send_fd_mmap() is called on a FILE with the position indicator
      different from the start of the file, and
  (2) the initial `remaining_length' is smaller than a full
      MMAP_SEGMENT_SIZE.
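
The two conditions above can be checked with a little arithmetic. The following is a minimal sketch, assuming the patch's 8 MB segment size and a file that is mapped starting at the segment boundary; the helper names are hypothetical and are not part of the patch.

```c
#include <assert.h>

/* Segment size used by the patch (8 MB); must be a power of two. */
#define MMAP_SEGMENT_SIZE (8L * 1024 * 1024)

/* Hypothetical helper: the length the original patch passes to the
 * first mmap() call, i.e. min(remaining_length, MMAP_SEGMENT_SIZE). */
long patch_map_length(long start_ftell, long file_size)
{
    long remaining_length;

    assert(start_ftell >= 0 && start_ftell <= file_size);
    remaining_length = file_size - start_ftell;
    return remaining_length < MMAP_SEGMENT_SIZE ? remaining_length
                                                : MMAP_SEGMENT_SIZE;
}

/* Hypothetical helper: how many bytes the first mapping falls short of
 * what should be sent from it.  The mapping begins at segment_start,
 * so the first o bytes of it lie before start_ftell and are skipped. */
long first_segment_shortfall(long start_ftell, long file_size)
{
    long o = start_ftell & (MMAP_SEGMENT_SIZE - 1);   /* offset in segment */
    long remaining_length = file_size - start_ftell;
    long mapped = patch_map_length(start_ftell, file_size);
    long usable = mapped > o ? mapped - o : 0;        /* bytes past offset o */
    long wanted = MMAP_SEGMENT_SIZE - o;              /* rest of this segment */

    if (remaining_length < wanted)
        wanted = remaining_length;                    /* file ends first */
    return wanted - usable;
}
```

With start_ftell = 1000 and a 5000-byte file, the sketch gives a shortfall of 1000 bytes, which is exactly the in-segment offset. With start_ftell = 0, or with at least a full segment remaining, the shortfall is zero.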

The following is the relevant part of your patch:

+    /* set up for initial mapping */
+    segment_start = start_ftell & ~(MMAP_SEGMENT_SIZE-1);
+    o = start_ftell & (MMAP_SEGMENT_SIZE-1);
+    remaining_length = r->finfo.st_size - start_ftell;
+
+    while (!c->aborted && remaining_length) {
+        segment_length = MMAP_SEGMENT_SIZE;
+        if (segment_length > remaining_length)
+            segment_length = remaining_length;
+        if (segment_length == 0)
+            break;
+
+        map = mmap(NULL, (size_t)segment_length, [... , ... ,]
+                   r_fd, (off_t)segment_start);
  [ rest of while loop ]

For example, in the following situation:

   segment_start
   |
   |<------------- MMAP_SEGMENT_SIZE -------------->|
   v                                                |
   |------------------------------------------------|-----------------|
           ^               ^            ^
           |               |            |
           start_ftell     |            , r->finfo.st_size
                           |
                           (effective end of mmapped region, *TOO SHORT*)

the call to mmap would effectively happen as

   mmap(NULL, ( - start_ftell ) , ..., segment_start);

when it *should* be

   mmap(NULL, ( - segment_start ) , ..., segment_start);

NOTE: I have only looked at your patch and do not know the rest of the
apache code.  If I have overlooked something obvious, or misunderstand
mmap(), please let me know.  [But as far as I and the man pages I
consulted know, mmap() doesn't care about a current file position, it
always maps from the beginning of a file.]

   Klaus

----------------------------------------------------------------------
(Message week:6677)
X-url:
To: Klaus Weide
Subject: Re: bug in "mmap patch for Apache-1.0.5"?
In-reply-to:
Date: Mon, 02 Dec 1996 14:59:31 -0800
From: John Heidemann

On Fri, 22 Nov 1996 11:11:58 CST, Klaus Weide wrote:
>(I am referring to the version found at
>
>as of today.)
>
>...
>
>  I looked over the patch mentioned above, and it appears to me that
>there is a flaw in the logic which determines the `segment_length'
>for the (first) mmap() call.
>This would only be relevant if
>
>  (1) send_fd_mmap() is called on a FILE with the position indicator
>      different from the start of the file, and
>  (2) the initial `remaining_length' is smaller than a full
>      MMAP_SEGMENT_SIZE.
>
>...
>

There is an error as you outline---thanks for letting me know.
The correct code would have

        map = mmap(NULL, (size_t)(segment_length + o), [... , ... ,]
                   r_fd, (off_t)segment_start);

>The following is the relevant part of your patch:
>+    /* set up for initial mapping */
>+    segment_start = start_ftell & ~(MMAP_SEGMENT_SIZE-1);
>+    o = start_ftell & (MMAP_SEGMENT_SIZE-1);
>+    remaining_length = r->finfo.st_size - start_ftell;
>+
>+    while (!c->aborted && remaining_length) {
>+        segment_length = MMAP_SEGMENT_SIZE;
>+        if (segment_length > remaining_length)
>+            segment_length = remaining_length;
>+        if (segment_length == 0)
>+            break;
>+
>+        map = mmap(NULL, (size_t)segment_length, [... , ... ,]
>+                   r_fd, (off_t)segment_start);
>  [ rest of while loop ]

This bug was not discovered until now because:
(1) Apache 1.0 always has start_ftell == 0.
(2) mmap rounds mappings up to whole page granularities.

I've updated the patch on my web page.  Note that the patch is (as of
today) untested.

   -John

----------------------------------------------------------------------
Index: http_protocol.c
===================================================================
RCS file: /nfs/gost/CVSroot/external/apache/src/http_protocol.c,v
retrieving revision 1.1
retrieving revision 1.4
diff -u -u -r1.1 -r1.4
--- http_protocol.c 1996/04/04 17:56:44 1.1
+++ http_protocol.c 1996/05/23 22:58:13 1.4
@@ -532,13 +532,260 @@
     return fread (buffer, sizeof(char), bufsiz, r->connection->request_in);
 }
+#ifdef ISI_MMAP
+/***********************************************************************
+ *
+ * ISI_MMAP patch
+ * --------------
+ * John Heidemann,
+ *
+ *
+ * Apache 1.0.5 (and NCSA 1.5) use stdio to send out file data.
+ * Stdio is good for piecing together headers, but it's not
+ * the best choice for bulk-data transfer because it incurs
+ * several unnecessary data copies.
+ *
+ * With stdio you see the following copies to send out a file:
+ *     disk -> fs/vm-cache -> stdio buffer -> user buffer
+ *          -> stdio buffer -> mbufs -> network device
+ *     (6 copies)
+ *
+ * Instead of using stdio, we should memory map the file
+ * and then write that memory directly out to the network.
+ * Mmap/write eliminates the stdio buffer copies:
+ *     disk -> fs/vm-cache -> mbufs -> network device
+ *     (3 copies)
+ * With mmap, the data never hits user-space.
+ *
+ *
+ * What is the result of mmapping instead of stdio?
+ * ------------------------------------------------
+ *
+ * In cases where your web server is CPU-bound and mmapping is
+ * effective, you should see better performance with mmapping.  In
+ * cases where your web server is not CPU bound, you should see a
+ * lower CPU utilization.
+ *
+ * Mmapping is only effective for ``large'' files; for extremely small
+ * files the cost of setting up the mmap exceeds the cost of simply
+ * doing the extra data copies.  In this case ``large'' is an
+ * OS- and hardware-dependent value; for SunOS 4.1.3 on Sparc-10s
+ * the balance seems to be at about 10k.
+ *
+ * When are web servers CPU bound?  A Sparc-10 can saturate a 10Mb/sec
+ * Ethernet with CPU to spare.  With Myrinet (a 640Mb/sec network, see
+ * http://www.myri.com), CPU usage becomes an issue.  With Sparc-10s,
+ * we found that mmapping allows ~2Mb/sec better performance than stdio
+ * for files larger than 25KB (maximum throughput is 18Mb/sec for 10MB
+ * files).  For Sparc-20/71s we see about the same performance gain
+ * (maximum throughput is ~39Mb/sec for 10MB files).  (These
+ * measurements are between two unloaded machines with the same CPU
+ * type connected through a single Myrinet switch.
+ * A modified Apache server ran on one machine, and a single client
+ * ran on the other machine, requesting the same file 50 times in a
+ * row.  Files were stored in tmpfs on the server.)
+ *
+ * Servers are also CPU bound when there are many clients hitting a
+ * single server.  We ran WebStone (with the ``Silicon Surf''
+ * filelist) with and without the mmap patch on a Sparc-20/71 server
+ * with two Sparc-10 clients over Myrinet.  The mmap-enhanced
+ * server handles about 0.5-1.5 additional connections per second
+ * as the number of clients varies from 2 to 24.  (The total
+ * number of connections per second ranges from 14.9 to 35.4.)
+ *
+ *
+ * About the implementation
+ * ------------------------
+ *
+ * The implementation had several goals:
+ *     - minimal changes
+ *     - make the new code look like the old code
+ *     - check all errors
+ *     - fall back on stdio at the slightest problem, if we can
+ * In general, write something that people will run in a production
+ * web server.
+ *
+ * There is one possible resource leak: mmap segments must be released
+ * upon aborts.  I check all error returns, but it looks like
+ * timer-expirations lead to longjmps.  To get around this problem,
+ * we probably should add the mmap segment to the resource
+ * pool cleanups.
+ *
+ *
+ * How to use
+ * ----------
+ *
+ * To use this implementation, apply the patch to Apache 1.0.*,
+ * add -DISI_MMAP to AUX_CFLAGS in the Makefile or Configuration,
+ * and re-build.
+ *
+ *
+ * Disclaimer
+ * ----------
+ *
+ * DISCLAIMER OF WARRANTY.  THIS PATCH IS PROVIDED "AS IS".  The
+ * University of Southern California MAKES NO REPRESENTATIONS OR
+ * WARRANTIES, EXPRESS OR IMPLIED.
+ * By way of example, but not
+ * limitation, the University of Southern California MAKES NO
+ * REPRESENTATIONS OR WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY
+ * PARTICULAR PURPOSE OR THAT THE USE OF THE LICENSED SOFTWARE
+ * COMPONENTS OR DOCUMENTATION WILL NOT INFRINGE ANY PATENTS,
+ * COPYRIGHTS, TRADEMARKS OR OTHER RIGHTS.  The University of Southern
+ * California shall not be held liable for any liability nor for any
+ * direct, indirect, or consequential damages with respect to any
+ * claim by the user or distributor of this patch or any
+ * third party on account of or arising from this Agreement or the use
+ * or distribution of this patch.
+ *
+ */
+
+
+#include
+#include
+/* work around deficient system headers (ex. SunOS 4.1.3) */
+#ifndef MAP_FILE
+#define MAP_FILE 0
+#endif /* ! MAP_FILE */
+
+/*
+ * On SunOS 4.1.3, the performance tradeoff
+ * between mmap and stdio
+ * (as measured by bandwidth over Myrinet between Sparc-10 hosts)
+ * seems to strike at ~10000B.
+ * Your mileage may vary.
+ */
+#define MMAP_THRESHOLD (8*1024)
+#define MMAP_SEGMENT_SIZE (8*1024*1024)
+/*
+ * Currently we write data in 32KB chunks,
+ * 4x more than with fread/fwrite.
+ * Larger chunks => fewer system calls => lower CPU utilization.
+ * ...*but* we have a timer going and we don't want the timer
+ * to expire before we're through (or we'll be sorry).
+ */
+#define MMAP_WRITE_SIZE (32*1024)
+#define MMAP_AGAIN -2
+
+/*
+ * To avoid data copies,
+ * send_fd_mmap uses mmap/write instead of stdio.
+ *
+ * Another interface difference:
+ * send_fd doesn't necessarily leave either the file passed in (f),
+ * or r->connection->client in a usable state.
+ * See the comment at the end for details.
+ *
+ * - John Heidemann, , 960411
+ */
+long send_fd_mmap(FILE *f, request_rec *r)
+{
+    int r_fd, w_fd, start_ftell;
+    caddr_t map;
+    size_t remaining_length, segment_length;
+    off_t segment_start;
+    int total_bytes_sent = 0;
+    int w, n, o;
+    conn_rec *c = r->connection;
+
+    /* First, clean up file. */
+    fflush(f);
+    start_ftell = (off_t)ftell(f);
+    r_fd = fileno(f);
+
+    fflush(c->client);
+    w_fd = fileno(c->client);
+
+    /* set up for initial mapping */
+    segment_start = start_ftell & ~(MMAP_SEGMENT_SIZE-1);
+    o = start_ftell & (MMAP_SEGMENT_SIZE-1);
+    remaining_length = r->finfo.st_size - start_ftell;
+
+    while (!c->aborted && remaining_length) {
+        segment_length = MMAP_SEGMENT_SIZE;
+        if (segment_length > remaining_length)
+            segment_length = remaining_length;
+        if (segment_length == 0)
+            break;
+
+        map = mmap(NULL, (size_t)(segment_length + o), PROT_READ, MAP_SHARED|MAP_FILE,
+                   r_fd, (off_t)segment_start);
+        /*
+         * If mmap failed and we haven't done anything else yet,
+         * fall back on stdio by returning MMAP_AGAIN.
+         * send_fd recognizes this message and picks up.
+         */
+        if (map == (caddr_t) -1)
+            return total_bytes_sent ? total_bytes_sent : MMAP_AGAIN;
+        n = segment_length - o; /* bytes to send */
+
+        /*
+         * xxx: we write in larger chunks than send_fd,
+         * possibly therefore requiring larger timeout values.
+         */
+        while (n && !c->aborted) {
+            w = MMAP_WRITE_SIZE;
+            if (n < MMAP_WRITE_SIZE)
+                w = n;
+            w = write(w_fd, &map[o], w);
+            if (w == -1) {
+                munmap(map, segment_length);
+                return total_bytes_sent;
+            };
+            reset_timeout(r);
+            total_bytes_sent += w;
+            n -= w;
+            o += w;
+        };
+
+        (void) munmap(map, segment_length);
+        remaining_length -= segment_length;
+        o = 0; /* set up for next pass */
+    };
+
+    /*
+     * Upon return, whether f or c->client are usable
+     * is unspecified (and therefore OS dependent).
+     *
+     * In most OSes, it should be OK to go back and use them.
+     *
+     * In the worst case, they may have to be re-created with code like:
+     *     dup(w_fd);
+     *     fclose(c->client);          -- out with the old
+     *     c->client = fdopen(w_fd);   -- in with the new
+     *
+     * This problem can be fixed in Apache-1.1 which uses its own
+     * stdio-equivalent which will have known behavior.
+     */
+
+    return total_bytes_sent;
+}
+#endif /* ISI_MMAP */
+
 long send_fd(FILE *f, request_rec *r)
 {
     char buf[IOBUFSIZE];
     long total_bytes_sent;
     register int n,o,w;
     conn_rec *c = r->connection;
-
+
+#ifdef ISI_MMAP
+    /*
+     * Be very conservative about invoking mmap.
+     * The file stats must be valid, we must have a regular
+     * file, and we must have ``enough'' data to send that
+     * mmapping is worthwhile.  If so, try it out.
+     * If we try it and it doesn't work, fall back
+     * on stdio if we can.
+     */
+    if (r->finfo.st_mode &&
+        S_ISREG(r->finfo.st_mode) &&
+        r->finfo.st_size - ftell(f) > MMAP_THRESHOLD) {
+        total_bytes_sent = send_fd_mmap(f, r);
+        if (total_bytes_sent != MMAP_AGAIN)
+            return total_bytes_sent;
+        /* MMAP_AGAIN => fall through and do stdio anyway */
+    };
+#endif /* ISI_MMAP */
     total_bytes_sent = 0;
     while (!r->connection->aborted) {
         while ((n= fread(buf, sizeof(char), IOBUFSIZE, f)) < 1