The Big Picture
To make the user's Web browsing experience as painless as possible, every effort must be made to wring the last drop of performance from the server. There are many factors which affect Web site usability, but speed is one of the most important. This applies to any webserver, not just Apache, so it is very important that you understand it.
How do we measure the speed of a server? Since the user (and not the computer) is the one who interacts with the Web site, one good speed measurement is the time elapsed between the moment when she clicks on a link or presses a Submit button and the moment when the resulting page is fully rendered.
The requests and replies are broken into packets. A request may be made up of several packets; a reply may consist of many thousands. Each packet has to make its own way from one machine to another, perhaps passing through many interconnection nodes. We must measure the time starting from when the first packet of the request leaves our user's machine to when the last packet of the reply arrives back there.
A webserver is only one of the entities the packets see along their way. If we follow them from browser to server and back again, they may travel by different routes through many different entities. Before they are processed by your server the packets might have to go through proxy (accelerator) servers and if the request contains more than one packet they will all have to wait for the last one so that the full request message can be reassembled at the server. Then the whole process is repeated in reverse.
You could work hard to fine tune your webserver's performance, but a slow Network Interface Card (NIC) or a slow network connection from your server might defeat it all. That's why it's important to think about the Big Picture and to be aware of possible bottlenecks between the server and the Web. Of course there is little that you can do if the user has a slow connection.
You might tune your scripts and webserver to process incoming requests ultra quickly, so you will need only a small number of working servers, but you might find that the server processes are all busy waiting for slow clients to accept their responses. You will see examples of other issues explored in this chapter.
A Web service is like a car: if one of the parts or mechanisms is broken, the car may not run smoothly, and it can even stop dead if pushed too far without first fixing it.
System Analysis
Before we try to solve a problem we need to identify it. In our case we want to get the best performance we can with as little monetary and time investment as possible.
Software Requirements
Covered in the section "Choosing an Operating System".
Hardware Requirements
(META: Only partial analysis. Please submit more points. Many points are scattered around the document and should be gathered here, to represent the whole picture. It also should be merged with the above item!)
You need to analyze all of the problem's dimensions. There are several things that need to be considered:
How long does it take to process each request?
How many requests can you process simultaneously?
How many simultaneous requests are you planning to get?
At what rate are you expecting to receive requests?
The first one is probably the easiest to optimize. Following the performance optimization tips in this and other documents allows a professional perl (mod_perl) programmer to exercise their code and improve it.
The second one is a function of RAM. How much RAM is in each box, how many boxes do you have, and how much RAM does each mod_perl process use? Multiply the first two and divide by the third. Ask yourself whether it is better to switch to another, possibly just as inefficient language or whether that will actually cost more than throwing another powerful machine into the rack.
Also ask yourself whether switching to another language will even help. In some applications, for example to link Oracle runtime libraries, a huge chunk of memory is needed so you would save nothing even if you switched from Perl to C.
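The RAM arithmetic described above (multiply the RAM per box by the number of boxes, then divide by the size of one mod_perl process) can be sketched as follows. This is only a rough upper bound: the function name and all the figures are made up for illustration, and the calculation ignores shared memory and whatever the OS and other services need.

```perl
use strict;
use warnings;

# Rough upper bound on simultaneous mod_perl processes:
# (RAM per box * number of boxes) / RAM per process.
# All sizes are in megabytes; the figures below are hypothetical.
sub max_processes {
    my ($ram_per_box_mb, $boxes, $process_size_mb) = @_;
    die "process size must be positive\n" unless $process_size_mb > 0;
    return int($ram_per_box_mb * $boxes / $process_size_mb);
}

# Hypothetical setup: 2 boxes, 512 MB usable for httpd on each,
# 10 MB per mod_perl process.
my $capacity = max_processes(512, 2, 10);
print "Roughly $capacity processes fit in RAM\n";
```

Comparing this figure with the number of simultaneous requests you expect gives a first idea of whether more RAM or another machine is needed.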
The last one is important. You need a realistic estimate. Are you really expecting 8 million hits per day? What is the expected peak load, and what kind of response time do you need to guarantee? Remember that these numbers might change drastically when you apply code changes and your site becomes popular. Remember also that at a very high hit rate the resource requirements can grow much faster than linearly!
More coverage is provided in the section "Choosing Hardware".
Essential Tools
In order to improve performance we need measurement tools. The main tool categories are benchmarking and code profiling.
Benchmarking Applications
How much faster is mod_perl than mod_cgi (aka plain perl/CGI)? There are many ways to benchmark the two. I'll present a few examples and numbers below. Check out the benchmark directory of the mod_perl distribution for more examples.
If you are going to write your own benchmarking utility, use the Benchmark module for heavy scripts and the Time::HiRes module for very fast scripts (faster than 1 sec) where you will need better time precision.
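As a minimal sketch of using the Benchmark module to compare two implementations, the following times two made-up subroutines that build the same string. The subroutine names, the data and the iteration count are assumptions chosen only for illustration.

```perl
use strict;
use warnings;
use Benchmark qw(timethese cmpthese);

# Hypothetical test data: 1000 short strings.
my @words = map { "word$_" } 1 .. 1000;

# Two hypothetical ways of gluing the strings together.
sub with_join { my $s = join '', @words; return length $s }

sub with_concat {
    my $s = '';
    $s .= $_ for @words;
    return length $s;
}

# Run each sub 2000 times; the 'none' style suppresses the
# default report so that only the comparison table is printed.
my $results = timethese(2000, {
    join   => \&with_join,
    concat => \&with_concat,
}, 'none');

cmpthese($results);
```

The table printed by cmpthese() shows the rate of each variant and the percentage difference between them; the actual numbers depend on the machine.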
There is no need to write a special benchmark though. If you want to impress your boss or colleagues, just take some heavy CGI script you have (e.g. a script that crunches some data and prints the results to STDOUT), open 2 xterms and call the same script in mod_perl mode in one xterm and in mod_cgi mode in the other. You can use lwp-get from the LWP package to emulate the browser. The benchmark directory of the mod_perl distribution includes such an example.
See also two tools for benchmarking: ApacheBench and crashme test
Developers Talk
Perrin Harkins writes on benchmarks or comparisons, official or unofficial:
I have used some of the platforms you mentioned and researched others. What I can tell you for sure, is that no commercially available system offers the depth, power, and ease of use that mod_perl has. Either they don't let you access the web server internals, or they make you use less productive languages than Perl, sometimes forcing you into restrictive and confusing APIs and/or GUI development environments. None of them offers the level of support available from simply posting a message to [the mod-perl] list, at any price.
As for performance, beyond doing several important things (code-caching, pre-forking/threading, and persistent database connections) there isn't much these tools can do, and it's mostly in your hands as the developer to see that the things which really take the time (like database queries) are optimized.
The downside of all this is that most manager types seem to be unable to believe that web development software available for free could be better than the stuff that cost $25,000 per CPU. This appears to be the major reason most of the web tools companies are still in business. They send a bunch of suits to give PowerPoint presentations and hand out glossy literature to your boss, and you end up with an expensive disaster and an approaching deadline.
But I'm not bitter or anything...
Jonathan Peterson adds:
Most of the major solutions have something that they do better than the others, and each of them has faults. Microsoft's ASP has a very nice objects model, and has IMO the best data access object (better than DBI to use - but less portable). It has the worst scripting language. PHP has many of the advantages of Perl-based solutions, and is less complicated for developers. Netscape's Livewire has a good object model too, and provides good server-side Java integration - if you want to leverage Java skills, it's good. Also, it has a compiled scripting language - which is great if you aren't selling your clients the source code (and a pain otherwise).
mod_perl's advantage is that it is the most powerful. It offers the greatest degree of control with one of the more powerful languages. It also offers the greatest granularity. You can use an embedding module (eg eperl) from one place, a session module (Session) from another, and your data access module from yet another.
I think the Apache::ASP module looks very promising. It has very easy to use and adequately powerful state maintenance, a good embedding system, and a sensible object model (that emulates the Microsoft ASP one). It doesn't replicate MS's ADO for data access, but DBI is fine for that.
I have always found that the developers available make the greatest impact on the decision. If you have a team with no Perl experience, and a small or medium task, using something like PHP, or Microsoft ASP makes more sense than driving your staff into the vertical learning curve they'll need to use mod_perl.
For very large jobs, it may be worth finding the best technical solution, and then recruiting the team with the necessary skills.
Benchmarking a Graphic Hits Counter with Persistent DB Connections
Here are the numbers from Michael Parker's mod_perl presentation at the Perl Conference (August 1998). (Sorry, there used to be links here to the source, but they went dead one day, so I removed them.) The script is a standard hits counter, but it logs the counts into a MySQL relational database:
Benchmark: timing 100 iterations of cgi, perl... [rate 1:28]
cgi: 56 secs ( 0.33 usr 0.28 sys = 0.61 cpu)
perl: 2 secs ( 0.31 usr 0.27 sys = 0.58 cpu)
Benchmark: timing 1000 iterations of cgi,perl... [rate 1:21]
cgi: 567 secs ( 3.27 usr 2.83 sys = 6.10 cpu)
perl: 26 secs ( 3.11 usr 2.53 sys = 5.64 cpu)
Benchmark: timing 10000 iterations of cgi, perl [rate 1:21]
cgi: 6494 secs (34.87 usr 26.68 sys = 61.55 cpu)
perl: 299 secs (32.51 usr 23.98 sys = 56.49 cpu)
We don't know what server configurations were used for these tests, but I guess the numbers speak for themselves.
The source code of the script was available at http://www.realtime.net/~parkerm/perl/conf98/sld006.htm. It's now a dead link. If you know its new location, please let me know.
Benchmarking Scripts with Execution Times Below 1 Second
If you want to get the benchmark results in microseconds and not in the tens of milliseconds you get with the Benchmark module, you will have to use the Time::HiRes module. Its usage is similar to Benchmark's.
use Time::HiRes qw(gettimeofday tv_interval);
my $start_time = [ gettimeofday ];
sub_that_takes_a_teeny_bit_of_time();
my $end_time = [ gettimeofday ];
my $elapsed = tv_interval($start_time,$end_time);
print "The sub took $elapsed seconds.\n";
See also the crashme test.
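A single run of a very fast subroutine is dominated by measurement noise, so one common refinement of the Time::HiRes approach is to repeat the call many times and report the mean. The helper name mean_time() and the code being timed are assumptions for this sketch.

```perl
use strict;
use warnings;
use Time::HiRes qw(gettimeofday tv_interval);

# Run a code ref $count times and return the mean wallclock
# time per call, in seconds.
sub mean_time {
    my ($code, $count) = @_;
    my $start = [gettimeofday];
    $code->() for 1 .. $count;
    return tv_interval($start, [gettimeofday]) / $count;
}

# Hypothetical fast operation: summing a small list of numbers.
my $mean = mean_time(sub { my $t = 0; $t += $_ for 1 .. 100 }, 10_000);
printf "Mean time per call: %.9f seconds\n", $mean;
```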
Benchmarking PerlHandlers
The Apache::Timeit module does PerlHandler benchmarking. With the help of this module you can log the time taken to process the request, just like you'd use the Benchmark module to benchmark a regular Perl script. Of course you can extend this module to perform more advanced processing, like putting the results into a database for later processing. But all it takes is adding this configuration directive inside httpd.conf:
PerlFixupHandler Apache::Timeit
Since scripts running under Apache::Registry are running inside the PerlHandler, these are benchmarked as well.
An example of the lines which show up in the error_log file:
timing request for /perl/setupenvoff.pl:
0 wallclock secs ( 0.04 usr + 0.01 sys = 0.05 CPU)
timing request for /perl/setupenvoff.pl:
0 wallclock secs ( 0.03 usr + 0.00 sys = 0.03 CPU)
The Apache::Timeit package is a part of the Apache-Perl-contrib files collection available from CPAN.
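As one possible form of the later processing mentioned above, the timing lines can be collected from the error_log and summed per URI. This is a sketch that assumes the two-line log format shown in the example; the function name summarize_timings() is made up, and the module itself does not provide it.

```perl
use strict;
use warnings;

# Sum the CPU seconds reported for each URI in log text of the
# form shown above. Returns a hash ref: URI => [count, total CPU].
sub summarize_timings {
    my ($log_text) = @_;
    my (%stats, $uri);
    for my $line (split /\n/, $log_text) {
        if ($line =~ /timing request for (\S+):/) {
            $uri = $1;    # remember which request the next line describes
        }
        elsif (defined $uri && $line =~ /=\s*([\d.]+)\s*CPU/) {
            $stats{$uri}[0]++;
            $stats{$uri}[1] += $1;
            undef $uri;
        }
    }
    return \%stats;
}

# The sample lines from the error_log excerpt above.
my $log = <<'END_LOG';
timing request for /perl/setupenvoff.pl:
 0 wallclock secs ( 0.04 usr +  0.01 sys =  0.05 CPU)
timing request for /perl/setupenvoff.pl:
 0 wallclock secs ( 0.03 usr +  0.00 sys =  0.03 CPU)
END_LOG

my $stats = summarize_timings($log);
for my $uri (sort keys %$stats) {
    my ($count, $cpu) = @{ $stats->{$uri} };
    printf "%s: %d requests, %.2f CPU secs total\n", $uri, $count, $cpu;
}
```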
Code Profiling Techniques
The profiling process helps you to determine which subroutines or just snippets of code take the longest time to execute and which subroutines are called most often. Probably you will want to optimize those.
Let's write some code to mess with:
META: build a hash and sort it by value, key... then rewrite the comparison subroutine to use the Schwartzian transform.. and more
Think about some more web oriented examples...!
map {push @list, int rand(100)} (1..1000);
sub mysort {
map ...
}
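Until the example above is completed, here is a minimal sketch of the kind of code the note describes: a hash sorted by value and then by key, first with a plain sort and then with the Schwartzian transform. The subroutine names and the data are assumptions; either version could then be fed to a profiler.

```perl
use strict;
use warnings;

# Build a hash of 1000 keys with random integer values (0..99).
my %score = map { ( "key$_" => int rand 100 ) } 1 .. 1000;

# Plain sort: by value, then by key. The hash lookup is repeated
# on every comparison.
sub sort_plain {
    my ($h) = @_;
    return sort { $h->{$a} <=> $h->{$b} or $a cmp $b } keys %$h;
}

# Schwartzian transform: pair each key with its value once,
# sort the pairs, then strip the values off again.
sub sort_schwartzian {
    my ($h) = @_;
    return map  { $_->[0] }
           sort { $a->[1] <=> $b->[1] or $a->[0] cmp $b->[0] }
           map  { [ $_, $h->{$_} ] } keys %$h;
}

my @plain = sort_plain(\%score);
my @st    = sort_schwartzian(\%score);
print "Both orderings agree\n" if "@plain" eq "@st";
```

For a lookup as cheap as a hash access the transform may not pay off; it is usually worth it when computing the sort key is expensive.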
META: remove all the diagnostics section below it's irrelevant here. (just reuse the explanations)
In the diagnostics pragma section, I showed that leaving it in production code is a bad idea, as it significantly slows down the execution time. We verified that by using the Benchmark module. Now let's see how to use a profiler to find which subroutine diagnostics spends most of its time in. Once we know, it could be a good idea to optimize this part of the code. We won't optimize the code here as this is beyond the scope of this document - and since this is a core Perl module, the chances are that it's already fairly well optimized.
We can use Devel::DProf to help us. Let's use this code:
diagnostics.pl
--------------
use diagnostics;
test_code();
sub test_code{
for my $i (1..10) {
my $j = $i**2;
}
$a = "Hi";
$b = "Bye";
if ($a == $b) {
$c = $a;
}
}
Run it with the profiler enabled, and then create the profiling statistics with the help of dprofpp:
% perl -d:DProf diagnostics.pl
% dprofpp
Total Elapsed Time = 0.993458 Seconds
User+System Time = 0.933458 Seconds
Exclusive Times
%Time ExclSec CumulS #Calls sec/call Csec/c Name
81.5 0.761 0.932 1 0.7610 0.9319 main::BEGIN
12.8 0.120 0.101 3161 0.0000 0.0000 diagnostics::unescape
6.43 0.060 0.060 2 0.0300 0.0300 diagnostics::BEGIN
2.14 0.020 0.020 3 0.0067 0.0067 diagnostics::transmo
1.07 0.010 0.010 2 0.0050 0.0050 Config::FETCH
0.00 0.000 -0.000 2 0.0000 - Exporter::import
0.00 0.000 -0.000 2 0.0000 - Exporter::export
0.00 0.000 -0.000 1 0.0000 - Config::BEGIN
0.00 0.000 -0.000 1 0.0000 - diagnostics::import
0.00 0.000 0.020 3 0.0000 0.0066 diagnostics::warn_trap
0.00 0.000 0.020 3 0.0000 0.0066 diagnostics::splainthis
0.00 0.000 -0.000 1 0.0000 - Config::TIEHASH
0.00 0.000 -0.000 3 0.0000 - diagnostics::shorten
0.00 0.000 -0.000 3 0.0000 - diagnostics::autodescribe
0.00 0.000 0.010 1 0.0000 0.0099 main::test_code
It's not easy to see what is responsible for this enormous overhead, even if main::BEGIN seems to be running most of the time. To get the full picture we must see the subroutine call tree, which shows us who calls whom, so we run:
% dprofpp -T
and the output is:
main::BEGIN
diagnostics::BEGIN
Exporter::import
Exporter::export
diagnostics::BEGIN
Config::BEGIN
Config::TIEHASH
Exporter::import
Exporter::export
Config::FETCH
Config::FETCH
diagnostics::unescape
.....................
3159 times [diagnostics::unescape] snipped
.....................
diagnostics::unescape
diagnostics::import
diagnostics::warn_trap
diagnostics::splainthis
diagnostics::transmo
diagnostics::shorten
diagnostics::autodescribe
main::test_code
diagnostics::warn_trap
diagnostics::splainthis
diagnostics::transmo
diagnostics::shorten
diagnostics::autodescribe
diagnostics::warn_trap
diagnostics::splainthis
diagnostics::transmo
diagnostics::shorten
diagnostics::autodescribe
So we see that two executions of diagnostics::BEGIN and 3161 of diagnostics::unescape are responsible for most of the running overhead.
META: but we see that it might be run only once in mod_perl, so the numbers are better. Am I right? check it!
If we comment out the diagnostics module, we get:
Total Elapsed Time = 0.079974 Seconds
User+System Time = 0.059974 Seconds
Exclusive Times
%Time ExclSec CumulS #Calls sec/call Csec/c Name
0.00 0.000 -0.000 1 0.0000 - main::test_code
It is possible to profile code running under mod_perl with the Devel::DProf module, available on CPAN. However, you must have Apache version 1.3b3 or higher and the PerlChildExitHandler enabled during the httpd build process. When the server is started, Devel::DProf installs an END block to write the tmon.out file. This block will be called at server shutdown. Here is how to start and stop a server with the profiler enabled:
% setenv PERL5OPT -d:DProf
% httpd -X -d `pwd` &
... make some requests to the server here ...
% kill `cat logs/httpd.pid`
% unsetenv PERL5OPT
% dprofpp
The Devel::DProf package is a Perl code profiler. It will collect information on the execution time of a Perl script and of the subs in that script (remember that print() and map() are just like any other subroutines you write, but they come bundled with Perl!)
Another approach is to use Apache::DProf, which hooks Devel::DProf into mod_perl. The Apache::DProf module will run a Devel::DProf profiler inside each child server and write the tmon.out file in the directory $ServerRoot/logs/dprof/$$ when the child is shut down (where $$ is the process ID of the child). All it takes is to add to httpd.conf:
PerlModule Apache::DProf
Remember that any PerlHandler that was pulled in before Apache::DProf in the httpd.conf or startup.pl will not have its code debugging information inserted. To run dprofpp, chdir to $ServerRoot/logs/dprof/$$ and run:
% dprofpp
Measuring the Memory Usage of Subroutines
With the help of Apache::Status you can find out the size of each and every subroutine.
Build and install mod_perl as you always do, make sure it's version 1.22 or higher.
Configure /perl-status if you haven't already:
<Location /perl-status>
  SetHandler perl-script
  PerlHandler Apache::Status
  order deny,allow
  #deny from all
  #allow from ...
</Location>
Add to httpd.conf:
PerlSetVar StatusOptionsAll On
PerlSetVar StatusTerse On
PerlSetVar StatusTerseSize On
PerlSetVar StatusTerseSizeMainSummary On
PerlModule B::TerseSize
Start the server (best in httpd -X mode)
From your favorite browser fetch http://localhost/perl-status
Click on 'Loaded Modules' or 'Compiled Registry Scripts'
Click on the module or script of your choice (you might need to run some script/handler before you will see it here unless it was preloaded)
Click on 'Memory Usage' at the bottom
You should see all the subroutines and their respective sizes.
Now you can start to optimize your code, or test which of several implementations has the smallest size.
For example let's compare CGI.pm's OO vs procedural interfaces:
As you will see below, the first (OO) script uses about 2k bytes while the second script (procedural interface) uses about 5k.
Here are the code examples and the numbers:
cgi_oo.pl
---------
use CGI ();
my $q = CGI->new;
print $q->header;
print $q->b("Hello");

cgi_mtd.pl
---------
use CGI qw(header b);
print header();
print b("Hello");
After executing each script in single server mode (-X) the results are:
cgi_oo.pl:
Totals: 1966 bytes | 27 OPs
handler 1514 bytes | 27 OPs
exit 116 bytes | 0 OPs

cgi_mtd.pl:
Totals: 4710 bytes | 19 OPs
handler 1117 bytes | 19 OPs
basefont 120 bytes | 0 OPs
frameset 120 bytes | 0 OPs
caption 119 bytes | 0 OPs
applet 118 bytes | 0 OPs
script 118 bytes | 0 OPs
ilayer 118 bytes | 0 OPs
header 118 bytes | 0 OPs
strike 118 bytes | 0 OPs
layer 117 bytes | 0 OPs
table 117 bytes | 0 OPs
frame 117 bytes | 0 OPs
style 117 bytes | 0 OPs
Param 117 bytes | 0 OPs
small 117 bytes | 0 OPs
embed 117 bytes | 0 OPs
font 116 bytes | 0 OPs
span 116 bytes | 0 OPs
exit 116 bytes | 0 OPs
big 115 bytes | 0 OPs
div 115 bytes | 0 OPs
sup 115 bytes | 0 OPs
Sub 115 bytes | 0 OPs
TR 114 bytes | 0 OPs
td 114 bytes | 0 OPs
Tr 114 bytes | 0 OPs
th 114 bytes | 0 OPs
b 113 bytes | 0 OPs
Note that the above is correct only if you didn't precompile all of CGI.pm's methods at server startup. If you did, the procedural interface in the second test will take up to 18k and not 5k as we saw. That's because the whole of CGI.pm's namespace is inherited and it already has all its methods compiled, so it doesn't really matter whether you attempt to import only the symbols that you need. So if you have:
use CGI qw(-compile :all);
in the server startup script, then having:
use CGI qw(header);
or
use CGI qw(:all);
is essentially the same. All the symbols precompiled at startup will be imported even if you ask for only one symbol. It seems to me like a bug, but probably that's how CGI.pm works.
BTW, you can check the number of opcodes in the code by a simple command line run. For example comparing 'my %hash' vs 'my %hash = ()'.
% perl -MO=Terse -e 'my %hash' | wc -l
-e syntax OK
4
% perl -MO=Terse -e 'my %hash = ()' | wc -l
-e syntax OK
10
The first one has fewer opcodes.
Know Your Operating System
In order to get the best performance it helps to get intimately familiar with the Operating System (OS) the web server is running on. There are many OS specific things that you may be able to optimise which will improve your web server's speed, reliability and security.
The following sections will reveal some of the most important details you should know about your OS.
Sharing Memory
The sharing of memory is one very important factor. If your OS supports it (and most sane systems do), you might save memory by sharing it between child processes. This is only possible when you preload code at server startup. However, during a child process' life its memory pages tend to become unshared.
There is no way we can make Perl allocate memory so that (dynamic) variables land on different memory pages from constants, so the copy-on-write effect (we will explain this in a moment) will hit you almost at random.
If you are pre-loading many modules you might be able to trade off the memory that stays shared against the time for an occasional fork by tuning MaxRequestsPerChild. Each time a child reaches this upper limit and dies it should release its unshared pages. The new child which replaces it will share its fresh pages until it scribbles on them.
The ideal is a point where your processes usually restart before too much memory becomes unshared. You should take some measurements to see if it makes a real difference, and to find the range of reasonable values. If you have success with this tuning, the value of MaxRequestsPerChild will probably be peculiar to your situation and may change with changing circumstances.
It is very important to understand that your goal is not to have MaxRequestsPerChild set to 10000. Having a child serve 300 requests on precompiled code is already a huge overall speedup, so whether it is 100 or 10000 probably does not really matter if you can save RAM by using a lower value.
Do not forget that if you preload most of your code at server startup, the fork to spawn a new child will be very very fast, because it inherits most of the preloaded code and the perl interpreter from the parent process.
During the life of the child its memory pages (which aren't really its own to start with, it uses the parent's pages) gradually get `dirty' - variables which were originally inherited and shared are updated or modified -- and the copy-on-write happens. This reduces the number of shared memory pages, thus increasing the memory requirement. Killing the child and spawning a new one allows the new child to get back to the pristine shared memory of the parent process.
The recommendation is that MaxRequestsPerChild should not be too large, otherwise you lose some of the benefit of sharing memory.
See Choosing MaxRequestsPerChild for more about tuning the MaxRequestsPerChild parameter.
How Shared Is My Memory?
You've probably noticed that the word shared is repeated many times in relation to mod_perl. Indeed, shared memory might save you a lot of money, since with sharing in place you can run many more servers than without it. See the Formula and the numbers.
How much shared memory do you have? You can see it either by using the memory utility that comes with your system or by deploying the GTop module:
use GTop ();
my $gtop = GTop->new;
print "Shared memory of the current process: ",
$gtop->proc_mem($$)->share, "\n";
print "Total shared memory: ",
$gtop->mem->share, "\n";
When you watch the output of the top utility, don't confuse the RES (or RSS) columns with the SHARE column. RES is RESident memory, which is the size of pages currently swapped in.
Calculating Real Memory Usage
You have learned how to measure the size of the process' shared memory, but we still want to know what the real memory usage is. Obviously this cannot be calculated simply by adding up the memory size of each process because that wouldn't account for the shared memory.
On the other hand we cannot just subtract the shared memory size from the total size to get the real memory usage numbers, because in reality each process does a different task, therefore the shared memory is not the same for all processes.
So how do we measure the real memory size used by the server we run? It's probably too difficult to give the exact number, but I've found a way to get a fair approximation, which I verified in the following way. I calculated the real memory used, with the technique you will see in a moment, then stopped the Apache server and saw that the memory usage report indicated that the total used memory went down by almost the same number I had calculated. Note that some OSes do smart caching, so you may not see the memory usage decrease as soon as it actually happens.
This is a technique I've used:
For each process, sum up the difference between its total size and its shared memory. To calculate the difference for a single process use:
use GTop;
my $proc_mem = GTop->new->proc_mem($$);
my $diff = $proc_mem->size - $proc_mem->share;
print "Difference is $diff bytes\n";
Now if we add the shared memory size of the process with maximum shared memory, we will get all the memory that actually is being used by all httpd processes, except for the parent process.
Finally, add the size of the parent process.
Please note that this might be incorrect for your system, so you use this number at your own risk.
I've used this technique to display real memory usage in the module Apache::VMonitor, so instead of trying to manually calculate this number you can use this module to do it automatically.
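The steps of the technique can be put together as in the following sketch. To keep it independent of any particular system it works on plain size and share figures (in bytes), which in practice could come from GTop's proc_mem() as shown earlier. The function name real_memory_usage() and all the numbers are made up for illustration; this is not the code used by Apache::VMonitor.

```perl
use strict;
use warnings;
use List::Util qw(max sum);

# Estimate the memory really used by a set of httpd processes.
# $parent and each element of @$children are hash refs with
# 'size' and 'share' in bytes.
sub real_memory_usage {
    my ($parent, $children) = @_;
    return $parent->{size} unless @$children;

    # Step 1: the unshared part of every child.
    my $unshared = sum(map { $_->{size} - $_->{share} } @$children);
    # Step 2: count the shared pages once, using the largest share.
    my $shared = max(map { $_->{share} } @$children);
    # Step 3: add the size of the parent process.
    return $unshared + $shared + $parent->{size};
}

# Hypothetical figures for a parent and three children.
my $parent   = { size => 4_000_000, share => 3_000_000 };
my @children = (
    { size => 6_000_000, share => 3_500_000 },
    { size => 7_000_000, share => 3_200_000 },
    { size => 6_500_000, share => 3_400_000 },
);
printf "Estimated real usage: %d bytes\n",
    real_memory_usage($parent, \@children);
```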
Is my Code Shared?
How do you find out if the code you write is shared between the processes or not? The code should be shared (except where it is on a memory page with variables that change). Data can be shared too: some variables are read-only in usage and never change, for example variables that use a lot of memory and that you only ever read. As you know, a variable becomes unshared when the process modifies its value.
So imagine that you have this 10Mb in-memory database that resides in a single variable, you perform various operations on it and want to make sure that the variable is still shared. For example if you do some matching regex processing on this variable and want to use the pos() function, will it make the variable unshared or not?
The Apache::Peek module comes to the rescue. Let's write a module called MyShared.pm which we preload at server startup, so all the variables of this module are initially shared by all children.
MyShared.pm
---------
package MyShared;
use Apache::Peek;
my $readonly = "Chris";
sub match{ $readonly =~ /\w/g; }
sub print_pos{ print "pos: ", pos($readonly), "\n"; }
sub dump { Dump($readonly); }
1;
This module declares the package, loads the Apache::Peek module and defines the $readonly variable which is supposed to be a big variable, but we will use a small one to simplify this example.
The module also defines three subroutines: match() that does a simple character matching, print_pos() that prints the current position of the matching engine inside the string that was last matched and finally the dump() subroutine that calls the Apache::Peek module's Dump() function to dump a raw Perl data-type of the $readonly variable.
Now we write the script that prints the PID of the process and calls the three functions. The goal is to check whether pos() makes the variable dirty and therefore unshared.
share_test.pl
-------
use MyShared;
print "Content-type: text/plain\r\n\r\n";
print "PID: $$\n";
MyShared::match();
MyShared::print_pos();
MyShared::dump();
Before you restart the server in httpd.conf set:
MaxClients 2
for easier tracking. You need at least two servers to compare the printouts of the test program; having more than two can make the comparison process harder.
Now open two browser windows and issue the request for this script several times in both windows, so you get different process PIDs reported in the two windows and each process has been called a different number of times.
In the first window you will see something like this:
PID: 27040
pos: 1
SV = PVMG(0x853db20) at 0x8250e8c
REFCNT = 3
FLAGS = (PADBUSY,PADMY,SMG,POK,pPOK)
IV = 0
NV = 0
PV = 0x8271af0 "Chris"\0
CUR = 5
LEN = 6
MAGIC = 0x853dd80
MG_VIRTUAL = &vtbl_mglob
MG_TYPE = 'g'
MG_LEN = 1
And in the second window:
PID: 27041
pos: 2
SV = PVMG(0x853db20) at 0x8250e8c
REFCNT = 3
FLAGS = (PADBUSY,PADMY,SMG,POK,pPOK)
IV = 0
NV = 0
PV = 0x8271af0 "Chris"\0
CUR = 5
LEN = 6
MAGIC = 0x853dd80
MG_VIRTUAL = &vtbl_mglob
MG_TYPE = 'g'
MG_LEN = 2
We see that all the addresses are the same (0x8250e8c and 0x8271af0), therefore the variable data structure is almost completely shared. The only difference is in the SV.MAGIC.MG_LEN record, which is not shared.
So given that the $readonly variable is a big one, its value is still shared between the processes, while part of the variable data structure is non-shared. That part is almost insignificant because it takes very little memory.
Now if you need to compare more than one variable, doing it by hand can be quite time consuming and error-prone. Therefore it's better to change the testing script to dump the Perl data-types into files (e.g. /tmp/dump.$$, where $$ is the PID of the process) and then use the diff(1) utility to see whether there is some difference.
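If you prefer to do the comparison in Perl rather than with diff(1), a line-by-line comparison of two dumps is enough for this purpose. The following is a sketch only: the function name dump_differences() is made up, and the sample dumps are shortened versions of the ones shown above.

```perl
use strict;
use warnings;

# Return the lines that differ between two dumps, paired up by
# line number as [line_no, left, right].
sub dump_differences {
    my ($left, $right) = @_;
    my @l = split /\n/, $left;
    my @r = split /\n/, $right;
    my $last = @l > @r ? $#l : $#r;
    my @diff;
    for my $i (0 .. $last) {
        my ($x, $y) = ($l[$i] // '', $r[$i] // '');
        push @diff, [ $i + 1, $x, $y ] if $x ne $y;
    }
    return @diff;
}

# Shortened versions of the two dumps shown above.
my $dump_a = <<'END_A';
PV = 0x8271af0 "Chris"\0
CUR = 5
MG_LEN = 1
END_A

my $dump_b = <<'END_B';
PV = 0x8271af0 "Chris"\0
CUR = 5
MG_LEN = 2
END_B

for my $d (dump_differences($dump_a, $dump_b)) {
    printf "line %d: '%s' vs '%s'\n", @$d;
}
```

Only the lines that differ are reported, so identical addresses and values drop out and whatever remains is the part of the data structure that has become unshared.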
Another way of ensuring that a scalar is sharable (i.e. read-only) is to use either the constant pragma or the readonly pragma.
Preload Perl Modules at Server Startup
Use the PerlRequire and PerlModule directives to load commonly used modules such as CGI.pm, DBI, etc., when the server is started. On most systems, server children will be able to share the code space used by these modules. Just add the following directives into httpd.conf:
PerlModule CGI
PerlModule DBI
But an even better approach is to create a separate startup file (where you code in plain perl) and put there things like:
use DBI;
use Carp;
Then you require() this startup file in httpd.conf with the PerlRequire directive, placing it before the rest of the mod_perl configuration directives:
PerlRequire /path/to/start-up.pl
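A startup file can also guard against modules that fail to load, so that one missing module is reported clearly at server startup. The following is a minimal sketch: the helper name preload_modules() and the module list are assumptions, and My::Missing::Module is a made-up name used only to show the failure path.

```perl
use strict;
use warnings;

# Try to load each module once, in the parent process, and
# collect the ones that could not be loaded instead of dying.
# Returns the list of module names that failed.
sub preload_modules {
    my @modules = @_;
    my @failed;
    for my $module (@modules) {
        # Turn Some::Module into Some/Module.pm for require().
        (my $file = "$module.pm") =~ s{::}{/}g;
        eval { require $file; 1 } or push @failed, $module;
    }
    return @failed;
}

# Carp and POSIX ship with Perl; the third name does not exist.
my @failed = preload_modules(qw(Carp POSIX My::Missing::Module));
warn "Could not preload: @failed\n" if @failed;

1;    # a file loaded with require() must return a true value
```

In a real startup file you would list the modules your handlers and scripts actually use.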
CGI.pm is a special case. Ordinarily CGI.pm autoloads most of its functions on an as-needed basis. This speeds up the loading time by deferring the compilation phase. When you use mod_perl, FastCGI or another system that uses a persistent Perl interpreter, you will want to precompile the functions at initialization time. To accomplish this, call the package function compile() like this:
use CGI ();
CGI->compile(':all');
The arguments to compile() are a list of method names or sets, and are identical to those accepted by the use() and import() operators. Note that in most cases you will want to replace ':all' with the tag names that you actually use in your code, since generally you only use a subset of them.
You can also preload the Registry scripts. See Preload Registry Scripts.
Preload Perl modules - Real Numbers
(META: while the numbers and conclusions are mostly correct, need to rewrite the whole benchmark section using the GTop library to report the shared memory which is very important and will improve the benchmarks)
(META: Add the memory size tests when the server was compiled with EVERYTHING=1 and without it, does loading everything make a big change in the memory footprint? Probably the suggestion would be as follows: For a development server use EVERYTHING=1, while for production if your server is pretty busy and/or low on memory and every bit is on account, only the required parts should be built in. BTW, remember that apache comes with many modules that are built by default, and you might not need those!)
I have conducted a few tests to benchmark the memory usage when some modules are preloaded. The first set of tests checks the memory used with a Perl module preloaded (CGI.pm). The second set checks the compile method of CGI.pm. The third test checks the benefit of preloading a few Perl modules (we see more memory saved) and also the effect of precompiling the Registry modules with Apache::RegistryLoader.
Hardware and software: The server is Apache 1.3.2 with mod_perl 1.16 running on AIX 4.1.5 RS6000 with 1G RAM.
1. In the first test, I used the following script:
use strict;
use CGI ();
my $q = new CGI;
print $q->header;
print $q->start_html,$q->p("Hello");
Server restarted
Before preloading CGI.pm (no other modules preloaded):
USER PID %CPU %MEM SZ RSS TTY STAT STIME TIME COMMAND
root 87004 0.0 0.0 1060 1524 - A 16:51:14 0:00 httpd
httpd 240864 0.0 0.0 1304 1784 - A 16:51:13 0:00 httpd
After running a script which uses CGI's methods (no imports):
USER PID %CPU %MEM SZ RSS TTY STAT STIME TIME COMMAND
root 188068 0.0 0.0 1052 1524 - A 17:04:16 0:00 httpd
httpd 86952 0.0 1.0 2520 3052 - A 17:04:16 0:00 httpd
Observation: the child httpd has grown by 1268K
Server restarted
After preloading CGI.pm:
USER PID %CPU %MEM SZ RSS TTY STAT STIME TIME COMMAND
root 240796 0.0 0.0 1456 1552 - A 16:55:30 0:00 httpd
httpd 86944 0.0 0.0 1688 1800 - A 16:55:30 0:00 httpd
After running a script which uses CGI's methods (no imports):
USER PID %CPU %MEM SZ RSS TTY STAT STIME TIME COMMAND
root 86872 0.0 0.0 1448 1552 - A 17:02:56 0:00 httpd
httpd 187996 0.0 1.0 2808 2968 - A 17:02:56 0:00 httpd
Observation: the child httpd has grown by 1168K, 100K less than without the preload -- good!
Server restarted
After CGI.pm was preloaded and compiled with CGI->compile(':all'):
USER PID %CPU %MEM SZ RSS TTY STAT STIME TIME COMMAND
root 86980 0.0 0.0 2836 1524 - A 17:05:27 0:00 httpd
httpd 188104 0.0 0.0 3064 1768 - A 17:05:27 0:00 httpd
After running a script which uses CGI's methods (no imports):
USER PID %CPU %MEM SZ RSS TTY STAT STIME TIME COMMAND
root 86980 0.0 0.0 2828 1524 - A 17:05:27 0:00 httpd
httpd 188104 0.0 1.0 4188 2940 - A 17:05:27 0:00 httpd
Observation: the child httpd has grown by 1172K, practically the same as the 1168K we saw without compile(). So does CGI->compile(':all') help us? We probably do not use all the methods which CGI provides, so compiling all of them buys little in memory terms; what it does save is the compilation time on the first call of each method.
You might want to compile only the tags you are going to use; then you will see some benefit.
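The growth figures in these observations are the difference between the child's RSS column before and after the request. As a minimal sketch (the helper name rss_kb is hypothetical, and the column position assumes the ps layout shown above), the first observation can be reproduced like this:

```perl
use strict;
use warnings;

# Return the RSS column (in KB) from one line of the ps output shown above.
# Columns: USER PID %CPU %MEM SZ RSS TTY STAT STIME TIME COMMAND
sub rss_kb {
    my ($line) = @_;
    my @field = split ' ', $line;
    return $field[5];
}

# The child httpd lines from the first test, before and after the request
my $before = 'httpd 240864 0.0 0.0 1304 1784 - A 16:51:13 0:00 httpd';
my $after  = 'httpd 86952 0.0 1.0 2520 3052 - A 17:04:16 0:00 httpd';

my $growth = rss_kb($after) - rss_kb($before);
print "child grew by ${growth}K\n";   # prints "child grew by 1268K"
```

The same subtraction applied to the other pairs of ps lines gives the other growth figures.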
2. The second test attempts to check whether CGI's compile() method improves things. This is the code under test:
use strict;
use CGI qw(:all);
print header,start_html,p("Hello");
Server restarted
After CGI.pm was preloaded but NOT compiled with CGI->compile():
USER PID %CPU %MEM SZ RSS TTY STAT STIME TIME COMMAND
root 17268 0.0 0.0 1456 1552 - A 18:02:49 0:00 httpd
httpd 86904 0.0 0.0 1688 1800 - A 18:02:49 0:00 httpd
After running a script which imports ALL the symbols:
USER PID %CPU %MEM SZ RSS TTY STAT STIME TIME COMMAND
root 17268 0.0 0.0 1448 1552 - A 18:02:49 0:00 httpd
httpd 86904 0.0 1.0 2952 3112 - A 18:02:49 0:00 httpd
Observation: the child httpd has grown by 1264K (measured on the SZ column; the RSS column, used for the other observations, shows 1312K)
Server restarted
After CGI.pm was preloaded and compiled with CGI->compile(':all'):
USER PID %CPU %MEM SZ RSS TTY STAT STIME TIME COMMAND
root 86812 0.0 0.0 2836 1524 - A 17:59:52 0:00 httpd
httpd 99104 0.0 0.0 3064 1768 - A 17:59:52 0:00 httpd
After running a script which imports ALL the symbols:
USER PID %CPU %MEM SZ RSS TTY STAT STIME TIME COMMAND
root 86812 0.0 0.0 2832 1436 - A 17:59:52 0:00 httpd
httpd 99104 0.0 1.0 4884 3636 - A 17:59:52 0:00 httpd
Observation: the child httpd has grown by 1868K.
Why so much? In fact these results are misleading. If you look at the code you will see that we have called only three of CGI.pm's methods. The statement use CGI qw(:all) doesn't compile all the available methods, it just imports their names. This means that without compile() we use much less memory than when all the methods are compiled. Execute compile() only on the methods you intend to use and then you will see a reduction in your memory requirements.
3. The third script:
use strict;
use CGI;
use Data::Dumper;
use Storable;
[and many lines of code, lots of globals - so the code is huge!]
Server restarted
Nothing preloaded at startup:
USER PID %CPU %MEM SZ RSS TTY STAT STIME TIME COMMAND
root 90962 0.0 0.0 1060 1524 - A 17:16:45 0:00 httpd
httpd 86870 0.0 0.0 1304 1784 - A 17:16:45 0:00 httpd
Script using CGI (methods), Storable, Data::Dumper called:
USER PID %CPU %MEM SZ RSS TTY STAT STIME TIME COMMAND
root 90962 0.0 0.0 1064 1436 - A 17:16:45 0:00 httpd
httpd 86870 0.0 1.0 4024 4548 - A 17:16:45 0:00 httpd
Observation: the child httpd has grown by 2764K
Server restarted
Preloaded CGI (compiled), Storable, Data::Dumper at startup:
USER PID %CPU %MEM SZ RSS TTY STAT STIME TIME COMMAND
root 26792 0.0 0.0 3120 1528 - A 17:19:21 0:00 httpd
httpd 91052 0.0 0.0 3340 1764 - A 17:19:21 0:00 httpd
Script using CGI (methods), Storable, Data::Dumper called
USER PID %CPU %MEM SZ RSS TTY STAT STIME TIME COMMAND
root 26792 0.0 0.0 3124 1440 - A 17:19:21 0:00 httpd
httpd 91052 0.0 1.0 6568 5040 - A 17:19:21 0:00 httpd
Observation: the child httpd has grown by 3276K.
Ouch! 512K more!
The reason is that when you preload all of the methods at startup, they are all precompiled. There are many of them and they take up a big chunk of memory. If you don't use the compile() method, only the functions that are used will be compiled. Yes, it will slightly slow down the first response from each process, but the actual memory usage will be lower.
Server restarted
All the above modules plus the above script precompiled at startup with Apache::RegistryLoader:
USER PID %CPU %MEM SZ RSS TTY STAT STIME TIME COMMAND
root 43224 0.0 0.0 3256 1528 - A 17:23:12 0:00 httpd
httpd 26844 0.0 0.0 3488 1776 - A 17:23:12 0:00 httpd
Script using CGI (methods), Storable, Data::Dumper called:
USER PID %CPU %MEM SZ RSS TTY STAT STIME TIME COMMAND
root 43224 0.0 0.0 3252 1440 - A 17:23:12 0:00 httpd
httpd 26844 0.0 1.0 6748 5092 - A 17:23:12 0:00 httpd
Observation: the child httpd has grown even more: 3316K! This does not look good!
Summary:
1. Preloading Perl modules gave good results everywhere.
2. CGI.pm's compile() method seems to use even more memory. It's because we never use all of the methods that CGI provides. Only compile() the tags that you are going to use and you will save the overhead of the first call for each method which has not yet been called. You will also save some memory since the compiled code will be shared with the children.
3. Apache::RegistryLoader might make scripts load faster on the first request after the child has just started but the memory usage is worse.
Preload Registry Scripts
Apache::RegistryLoader compiles Apache::Registry scripts at server startup. It can be a good idea to preload the scripts you are going to use as well, so the code will be shared by the children.
Here is an example of the use of this technique. This code is included in a PerlRequire'd file, and walks the directory tree under which all registry scripts are installed. For each .pl file encountered, it calls the Apache::RegistryLoader::handler() method to preload the script in the parent server, before pre-forking the child processes:
use File::Find 'finddepth';
use Apache::RegistryLoader ();
{
    my $perl_dir = "perl/";
    my $rl = Apache::RegistryLoader->new;
    finddepth(sub {
        return unless /\.pl$/;
        my $url = "/$File::Find::dir/$_";
        print "pre-loading $url\n";
        my $status = $rl->handler($url);
        unless ($status == 200) {
            warn "pre-load of `$url' failed, status=$status\n";
        }
    }, $perl_dir);
}
Note that we didn't use the optional second argument to handler() (the filename), which the module's manpage describes. To make the loader smarter about the URI to filename translation, you might need to provide a trans() function to translate the URI to a filename. URI to filename translation normally doesn't happen until HTTP request time, so the module is forced to roll its own translation. If the filename is omitted and a trans() routine was not defined, the loader will try using the URI relative to ServerRoot.
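A translation routine of this kind could look roughly like the following sketch. The directory /home/httpd/perl, the /perl/ URI prefix and the name uri_to_filename are all assumptions made for illustration; adjust them to your own layout. The loader calls are shown only as comments, since they need a running mod_perl server.

```perl
use strict;
use warnings;

# Hypothetical layout: URIs under /perl/ map to files under this directory.
my $script_root = '/home/httpd/perl';

# Translate a URI such as /perl/test.pl into a filename on disk.
sub uri_to_filename {
    my ($uri) = @_;
    (my $path = $uri) =~ s{^/perl/}{};
    return "$script_root/$path";
}

print uri_to_filename('/perl/test.pl'), "\n";   # prints /home/httpd/perl/test.pl

# Under mod_perl the same routine could be supplied to the loader:
#   my $rl = Apache::RegistryLoader->new(trans => \&uri_to_filename);
#   $rl->handler('/perl/test.pl');
# or the filename could be passed explicitly as the second argument:
#   $rl->handler('/perl/test.pl', uri_to_filename('/perl/test.pl'));
```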
You should check whether this makes any improvement for you, though. I did some testing (see Preload Perl modules - Real Numbers), and it seems that it takes more memory than when the scripts are loaded by the child on first use. This is only a first impression and needs further investigation. If you aren't concerned about occasional script invocations taking a little time to respond while they load the code, you might not need it at all!
See also BEGIN blocks
Memory Swapping is Considered Bad
(META: check that you don't have a duplication somewhere in the text, probably the MaxClients tuning section)
When tuning the performance of your box, you must configure the software that you run in such a way that no memory swapping will occur, even during peak hours.
Swap memory is slow since it resides on the hard disc, which is much slower than the RAM. When your machine starts to swap, because it's unable to cope with the number of processes it has to run, it will become slower and slower until it grinds to a halt. When the CPU has to page memory pages in and out things slow down, causing processing demands to go up, which in turn slows down the system even more as more memory is required and this is provided by the kernel using the reserved swap space. This ever worsening spiral will lead the machine to halt, unless the resource demand suddenly drops and allows the processes to catch up with their tasks and go back to normal memory usage.
For swapping monitoring techniques see the section 'Apache::VMonitor -- Visual System and Apache Server Monitor'.
For the mod_perl specific swapping prevention guidelines see the section 'Choosing MaxClients'.
Increasing Shared Memory With mergemem
mergemem is an experimental utility for Linux, which looks very interesting for us mod_perl users:
http://www.ist.org/mergemem/
It looks like it could be run periodically on your server to find and merge duplicate pages. There are caveats: it would halt your httpds during the merge (it appears to be very fast, but still ...).
This software comes with a utility called memcmp to tell you how much you might save.
[ReaderMeta]: If you have tried this utility, please let us know what you think about it! Thanks
Forking and Executing Subprocesses from mod_perl
In general you should not fork from your mod_perl scripts, since when you do, you are forking the entire Apache Web server, lock, stock and barrel. Not only is your perl code being duplicated, but so is mod_ssl, mod_rewrite, mod_log, mod_proxy, mod_speling or whatever modules you have used in your server, all the core routines...
A much better approach would be to spawn a sub-process, hand it the information it needs to do the task, and have it detach (close STDIN, STDOUT and STDERR and execute setsid()). This is wise only if the parent which spawns this process immediately continues, and does not wait for the sub-process to complete. This approach is suitable for a situation when you want to use the Web interface to trigger a process which takes a long time, such as processing lots of data or sending email to thousands of users (no SPAM please!). Otherwise, you should convert the code into a module, and call its functions and methods from a CGI script.
Just making a system() call defeats the whole idea behind mod_perl. The Perl interpreter and modules would be loaded again for this external program to run if it's a Perl program. Remember that the backticks `program` variant of system() behaves in the same way.
If you really have to use a system() call then the approach to take is this:
use FreezeThaw ();
my $params = FreezeThaw::freeze(
    [all data to pass to the other process]
);
system("program.pl", $params);
Notice that you do a system() call with arguments separated by commas, rather than passing them all as a single argument. When you use commas, the shell won't try to parse (tokenize) the parameters, therefore you don't have to worry about escaping unsafe shell characters. If you want the shell to parse the variables make sure to run the escape function, for example the one from the String::ShellQuote package.
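and in program.pl:
use strict;
use POSIX qw(setsid);
use FreezeThaw ();
my @params = FreezeThaw::thaw(shift @ARGV);
# check that @params is ok
# fork, and let the original process exit, so that the caller's
# system() call returns right away instead of waiting for the work
defined(my $pid = fork) or die "Cannot fork: $!";
exit 0 if $pid;
close STDIN;
close STDOUT;
close STDERR;
# you might need to reopen STDERR, e.g.:
# open STDERR, ">/dev/null";
setsid(); # detach from the caller's session
At this point, program.pl is running in the "background" while the system() call returns and permits Apache to get on with things.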
This has obvious problems:
@params must not be bigger than whatever limit is imposed by your architecture. This could depend on your shell.
The communication is one way only.
However, you might be trying to do the "wrong thing". If what you really want is to send information to the browser and then do some post-processing, look into the PerlCleanupHandler directive.
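Returning to the comma-separated (list) form of system() described above, the following sketch shows that an argument full of shell metacharacters reaches the called program untouched. It uses $^X (the path of the running perl) as a stand-in for a real external program, and the argument string is made up for the demonstration.

```perl
use strict;
use warnings;

# An argument full of shell metacharacters (hypothetical test data)
my $arg = 'a b; echo $HOME *';

# List form: perl executes the program directly, without a shell, so
# $arg arrives as a single argument, exactly as written.
# The child exits with 0 only if it received the string unchanged.
my $status = system($^X, '-e',
    'exit($ARGV[0] eq q{a b; echo $HOME *} ? 0 : 1)', $arg);

print "list form passed the argument unchanged\n" if $status == 0;
```

Had the same string been interpolated into a single command string, the shell would have split it at the spaces and the semicolon, and expanded $HOME and *.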
If you are interested in more details, here is what actually happens when you fork() and make a system() call, as in this fragment:
system("echo Hi"),CORE::exit(0) unless fork();
Notice that I use CORE::exit() and not exit(), which would be automatically overridden by Apache::exit() if used in conjunction with Apache::Registry and friends.
The above code might be more familiar in this form:
if (fork) {
    # do nothing
} else {
    system("echo Hi");
    CORE::exit(0);
}
The fork() gives you two possible execution paths, one for the parent and the other for the child. The child gets some virtual memory, sharing a copy of the program text (read only) and a copy of the data space copy-on-write (remember why you pre-load modules in mod_perl?).
In this example the parent will immediately continue with the code that comes after the fork, while the forked (child) process will execute 'system("echo Hi")' and then terminate. The only work to be done before the child process goes on its separate way is setting up the page tables for the virtual memory.
Next, Perl will find /bin/echo along the search path, and invoke it directly. Perl's system() is not the system(3) call [C-library]. Only when the command has shell metacharacters (like * and ?) does Perl invoke a real shell (/bin/sh -c on Unix platforms). If there are no shell metacharacters in the argument, it is split into words and passed directly to execvp(), which is more efficient. That's a very nice optimization. In other words, only if you do:
system "sh -c 'echo *'"
will the operating system actually exec() a copy of /bin/sh to parse your command. But since one is almost certainly already running somewhere, the system will notice that (via the disk inode reference) and replace your virtual memory page table with one pointing to the existing program code plus your data space. Then the shell parses the passed command.
Since it is the echo utility, it will execute it as a built-in in the latter example or as /bin/echo in the former and be done, but this is only an example. You aren't calling system("echo Hi") in your mod_perl scripts, right?
Most real programs (heavy programs executed as a subprocess) would involve repeating the process to load the specified command or script. This might involve some actual demand paging from the program file if you execute new code.
The only place you will see real overhead from this scheme is when the parent process is huge (unfortunately like mod_perl...) and the page table becomes large as a side effect. The whole point of mod_perl is to avoid having to fork() or exec() something on every hit. Perl can do just about anything by itself.
Now let's get to the gory details of forking. Normally, every process has its parent. Many processes are children of the init process, whose PID is 1. When you fork a process you must wait() or waitpid() for it to finish. If you don't wait for it, it becomes a zombie.
A zombie is a process that has terminated but whose exit status has not yet been collected by its parent. When the child quits, it reports the termination to its parent. If the parent never wait()s to collect the exit status of the child, the child gets "confused" and becomes a ghost process, that can be seen, but not killed (it is already dead). It will be removed only when you stop the httpd process that spawned it!
Generally the ps utility displays these processes with the <defunct> tag, and you will see the zombies counter increment when running top. These zombie processes can take up system resources and are generally undesirable.
So the proper way to do a fork is:
print "Content-type: text/plain\r\n\r\n";
defined (my $kid = fork) or die "Cannot fork: $!";
if ($kid) {
    waitpid($kid,0);
    print "Parent has finished\n";
} else {
    # do something
    CORE::exit(0);
}
In most cases the only reason you would want to fork is when you need to spawn a process that will take a long time to complete. So if the server child that spawns this process has to wait for it to finish, you have gained nothing. The waitpid() call above is a blocking call: the process is blocked from doing anything else until the call completes. So you can neither wait for the child's completion, nor simply continue without waiting, because then you will get yet another zombie process.
The simplest solution is to ignore your dead children. This doesn't work everywhere, however.
META: do you know where? tell me!!! It works with linux!:)
$SIG{CHLD} = 'IGNORE';
When you set the CHLD signal handler to 'IGNORE', the system reaps the child processes automatically as soon as they exit, and they are therefore prevented from becoming zombies.
Note that you cannot localize this setting with local(). If you do, it won't have the desired effect.
META: Anyone like to explain why it doesn't work?
The other thing that you must do is to close all the pipes to the connection socket that were opened by the parent process (STDIN and STDOUT) and inherited by the child, so the parent will be able to complete the request and free itself for serving other requests. You may need to close and reopen the STDERR filehandle. It's opened to append to the error_log file as inherited from its parent, so chances are that you will want to leave it untouched.
Of course if your child needs any of the STDIN, STDOUT or STDERR streams you should reopen them. But you must untie the parent process, so you should close them first.
So now the code would look like this:
print "Content-type: text/plain\r\n\r\n";
$SIG{CHLD} = 'IGNORE';
defined (my $kid = fork) or die "Cannot fork: $!\n";
if ($kid) {
    print "Parent has finished\n";
} else {
    close STDIN;
    close STDOUT;
    close STDERR;
    # do something time-consuming
    CORE::exit(0);
}
Note that the waitpid() call has gone. The $SIG{CHLD} = 'IGNORE'; statement protects us from zombies, as explained above.
Another, more portable, but slightly more expensive solution is to use a double fork approach.
print "Content-type: text/plain\r\n\r\n";
defined (my $kid = fork) or die "Cannot fork: $!\n";
if ($kid) {
    waitpid($kid,0);
} else {
    defined (my $grandkid = fork) or die "Kid cannot fork: $!\n";
    if ($grandkid) {
        CORE::exit(0);
    } else {
        # code here
        close STDIN;
        close STDOUT;
        close STDERR;
        # do something long lasting
        CORE::exit(0);
    }
}
The grandkid becomes a "child of init", i.e. its parent process ID is 1.
Note that the previous two solutions do not allow you to know the exit status of the long-running process, but in our case we don't need it.
Another solution is to use a different SIGCHLD handler:
use POSIX 'WNOHANG';
$SIG{CHLD} = sub { while( waitpid(-1,WNOHANG)>0 ) {} };
This is useful when you fork() more than one process. The handler could call wait() as well, but for a variety of reasons involving the handling of stopped processes and the rare event in which two children exit at nearly the same moment, the best technique is to call waitpid() in a tight loop with a first argument of -1 and a second argument of WNOHANG. Together these arguments tell waitpid() to reap the next child that's available, and prevent the call from blocking if there happens to be no child ready for reaping. The handler will loop until waitpid() returns a negative number or zero, indicating that no more reapable children remain.
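Here is a small stand-alone sketch of this handler at work, meant to be run from the command line rather than under mod_perl (so it uses the plain exit()). The %status hash, the number of children and the ten second safety limit are arbitrary choices made for the demonstration.

```perl
use strict;
use warnings;
use POSIX qw(WNOHANG);

my %status;   # pid => exit code, filled in by the handler

# Reap every child that has already exited, without ever blocking.
$SIG{CHLD} = sub {
    local ($!, $?);   # don't clobber the main program's error state
    while ((my $pid = waitpid(-1, WNOHANG)) > 0) {
        $status{$pid} = $? >> 8;
    }
};

my $wanted = 3;
for my $n (1 .. $wanted) {
    defined(my $pid = fork) or die "Cannot fork: $!";
    next if $pid;     # parent: go on and fork the next child
    exit $n;          # child: pretend to do some work, then exit
}

# Parent: carry on with other work; here we just nap until the
# handler has collected everybody (or the safety limit expires).
my $deadline = time + 10;
while (keys(%status) < $wanted && time < $deadline) {
    select(undef, undef, undef, 0.1);
}
printf "reaped %d children\n", scalar keys %status;
```

Because the handler loops until waitpid() reports that nothing is left, children that exit at almost the same moment are all collected even if only one signal is delivered.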
You will probably want to open your own log file in the spawned process and log some information (at least while debugging your code) so you know what has happened.
Check also Apache::SubProcess for better system() and exec() implementations for mod_perl.
META: why is it a better thing to use?
Read the perlipc manpage for more information about signal handlers.
OS Specific Parameters for Proxying
Most mod_perl enabled servers use a proxy front-end server. This is done in order to avoid serving static objects from the mod_perl server, and also so that generated output being received by slow clients does not keep the heavy but very fast mod_perl servers waiting idly.
There are very important OS parameters that you might want to change in order to improve the server performance. This topic is discussed in the section: Setting the Buffering Limits on Various OSes
Performance Tuning by Tweaking Apache Configuration
Correct configuration of the MinSpareServers, MaxSpareServers, StartServers, MaxClients, and MaxRequestsPerChild parameters is very important. There are no universally correct values. If they are too low, you will under-use the system's capabilities. If they are too high, the chances are that the server will bring the machine to its knees.
All the above parameters should be specified on the basis of the resources you have. With a plain apache server, it's no big deal if you run many servers since the processes are about 1Mb and don't eat a lot of your RAM. Generally the numbers are even smaller with memory sharing. The situation is different with mod_perl. I have seen mod_perl processes of 20Mb and more. Now if you have MaxClients set to 50: 50x20Mb = 1Gb. Do you have 1Gb of RAM? Maybe not. So how do you tune the parameters? Generally by trying different combinations and benchmarking the server. Again mod_perl processes can be of much smaller size with memory sharing.
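The arithmetic above can be turned around to get a first rough upper bound for MaxClients. This is only a sketch: the function names are hypothetical, the 512Mb figure is an arbitrary example, and memory sharing between the children is deliberately ignored, so the real limit may be higher.

```perl
use strict;
use warnings;

# Worst-case memory (in Mb) if every child grows to the given size
sub memory_needed_mb {
    my ($max_clients, $child_mb) = @_;
    return $max_clients * $child_mb;
}

# Largest MaxClients that fits into the RAM set aside for httpd,
# assuming no memory is shared between the children
sub max_clients_for {
    my ($ram_mb, $child_mb) = @_;
    return int($ram_mb / $child_mb);
}

print memory_needed_mb(50, 20), "\n";   # prints 1000 (Mb, i.e. about 1Gb)
print max_clients_for(512, 20), "\n";   # prints 25
```

Treat the result as a starting point for benchmarking, not as a final setting.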
Before you start this task you should be armed with the proper weapon. You need a crashme-style utility, which will load your server with requests for the mod_perl scripts you possess. It must be able to emulate a multiuser environment, with multiple clients calling the mod_perl scripts on your server simultaneously. While there are commercial solutions, you can get away with free ones which do the same job. You can use the ApacheBench ab utility which comes with the Apache distribution, the crashme script which uses LWP::Parallel::UserAgent, or httperf (see the Download page).
It is important to make sure that you run the load generator (the client which generates the test requests) on a system that is more powerful than the system being tested. After all we are trying to simulate Internet users, where many users are trying to reach your service at once. Since the number of concurrent users can be quite large, your testing machine must be very powerful and capable of generating a heavy load. Of course you should not run the clients and the server on the same machine. If you do, your test results would be invalid. Clients will eat CPU and memory that should be dedicated to the server, and vice versa.
See also two tools for benchmarking: ApacheBench and crashme test
Tuning with ab - ApacheBench
ApacheBench (ab) is a tool for benchmarking your Apache HTTP server. It is designed to give you an idea of the performance that your current Apache installation can give. In particular, it shows you how many requests per second your Apache server is capable of serving. The ab tool comes bundled with the Apache source distribution, and it's free. :)
Let's try it. We will simulate 10 users concurrently requesting a very light script at www.example.com:81/test/test.pl. Each simulated user makes 10 requests.
% ./ab -n 100 -c 10 www.example.com:81/test/test.pl
The results are:
Concurrency Level: 10
Time taken for tests: 0.715 seconds
Complete requests: 100
Failed requests: 0
Non-2xx responses: 100
Total transferred: 60700 bytes
HTML transferred: 31900 bytes
Requests per second: 139.86
Transfer rate: 84.90 kb/s received
Connection Times (ms)
min avg max
Connect: 0 0 3
Processing: 13 67 71
Total: 13 67 74
The only numbers we really care about are:
Complete requests: 100
Failed requests: 0
Requests per second: 139.86
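When you run many benchmarks it is handy to pull these three numbers out of the report automatically. The following is a minimal sketch that works on the report format shown above; the function name ab_summary is hypothetical, and the report text is embedded in the script rather than read from a real ab run.

```perl
use strict;
use warnings;

# Extract the three figures we care about from an ab report.
sub ab_summary {
    my ($report) = @_;
    my %wanted;
    for my $label ('Complete requests', 'Failed requests',
                   'Requests per second') {
        ($wanted{$label}) = $report =~ /^\Q$label\E:\s+([\d.]+)/m;
    }
    return \%wanted;
}

# Part of the report from the first run above
my $report = <<'END';
Concurrency Level: 10
Time taken for tests: 0.715 seconds
Complete requests: 100
Failed requests: 0
Requests per second: 139.86
END

my $s = ab_summary($report);
printf "%s: %s\n", $_, $s->{$_} for sort keys %$s;
```

In real use the report text would come from the output of ab itself, for example captured with backticks.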
Let's raise the request load to 100 x 10 (10 users, each makes 100 requests):
% ./ab -n 1000 -c 10 www.example.com:81/perl/access/access.cgi
Concurrency Level: 10
Complete requests: 1000
Failed requests: 0
Requests per second: 139.76
As expected, nothing changes -- we have the same 10 concurrent users. Now let's raise the number of concurrent users to 50:
% ./ab -n 1000 -c 50 www.example.com:81/perl/access/access.cgi
Complete requests: 1000
Failed requests: 0
Requests per second: 133.01
We see that the server is capable of serving 50 concurrent users at an amazing 133 requests per second! Let's find the upper limit. Using -n 10000 -c 1000 failed to get results (Broken Pipe?). Using -n 10000 -c 500 resulted in 94.82 requests per second. The server's performance went down with the high load.
The above tests were performed with the following configuration:
MinSpareServers 8
MaxSpareServers 6
StartServers 10
MaxClients 50
MaxRequestsPerChild 1500
Now let's kill each child after it serves a single request. We will use the following configuration:
MinSpareServers 8
MaxSpareServers 6
StartServers 10
MaxClients 100
MaxRequestsPerChild 1
Simulate 50 users each generating a total of 20 requests:
% ./ab -n 1000 -c 50 www.example.com:81/perl/access/access.cgi
The benchmark timed out with the above configuration. I watched the output of ps as I ran it: the parent process just wasn't capable of respawning the killed children at that rate. When I raised MaxRequestsPerChild to 10, I got 8.34 requests per second. Very bad - about 16 times slower (133/8.34)! You can't benchmark the importance of MinSpareServers, MaxSpareServers and StartServers with this kind of test.
Now let's reset MaxRequestsPerChild to 1500, but reduce MaxClients to 10 and run the same test:
MinSpareServers 8
MaxSpareServers 6
StartServers 10
MaxClients 10
MaxRequestsPerChild 1500
I got 27.12 requests per second, which is better but still 4-5 times slower. (I got 133 with MaxClients set to 50.)
Summary: I have tested a few combinations of the server configuration variables (MinSpareServers, MaxSpareServers, StartServers, MaxClients and MaxRequestsPerChild). The results I got are as follows:
MinSpareServers, MaxSpareServers and StartServers are only important for user response times. Sometimes users will have to wait a bit.
The important parameters are MaxClients and MaxRequestsPerChild. MaxClients should be not too big, so it will not abuse your machine's memory resources, and not too small, for if it is your users will be forced to wait for the children to become free to serve them. MaxRequestsPerChild should be as large as possible, to get the full benefit of mod_perl, but watch your server at the beginning to make sure your scripts are not leaking memory, thereby causing your server (and your service) to die very fast.
Also it is important to understand that we didn't test the response times in the tests above, but the ability of the server to respond under a heavy load of requests. If the test script was heavier, the numbers would be different but the conclusions very similar.
The benchmarks were run with:
HW: RS6000, 1Gb RAM
SW: AIX 4.1.5 . mod_perl 1.16, apache 1.3.3
Machine running only mysql, httpd docs and mod_perl servers.
Machine was _completely_ unloaded during the benchmarking.
After each server restart when I changed the server's configuration, I made sure that the scripts were preloaded by fetching a script at least once for every child.
It is important to notice that none of the requests timed out, even if it was kept in the server's queue for more than a minute! That is the way ab works, which is OK for testing purposes but will be unacceptable in the real world - users will not wait for more than five to ten seconds for a request to complete, and the client (i.e. the browser) will time out in a few minutes.
Now let's take a look at some real code whose execution time is more than a few milliseconds. We will do some real testing and collect the data into tables for easier viewing.
I will use the following abbreviations:
NR = Total Number of Requests
NC = Concurrency
MC = MaxClients
MRPC = MaxRequestsPerChild
RPS = Requests per second
Running a mod_perl script with lots of mysql queries (the script under test is mysqld limited) (http://www.example.com:81/perl/access/access.cgi?do_sub=query_form), with the configuration:
MinSpareServers 8
MaxSpareServers 16
StartServers 10
MaxClients 50
MaxRequestsPerChild 5000
gives us:
NR    NC   RPS    comment
------------------------------------------------
  10   10   3.33  # not a reliable figure
 100   10   3.94
1000   10   4.62
1000   50   4.09
Conclusions: Here I wanted to show that when the application is slow (not due to perl loading, code compilation and execution, but limited by some external operation) it almost does not matter what load we place on the server. The RPS (Requests per second) is almost the same. Given that all the requests have been served, you have the ability to queue the clients, but be aware that anything that goes into the queue means a waiting client and a client (browser) that might time out!
Now we will benchmark the same script without using the mysql (code limited by perl only): (http://www.example.com:81/perl/access/access.cgi), it's the same script but it just returns the HTML form, without making SQL queries.
MinSpareServers 8
MaxSpareServers 16
StartServers 10
MaxClients 50
MaxRequestsPerChild 5000
NR      NC   RPS     comment
------------------------------------------------
    10   10  26.95   # not a reliable figure
   100   10  30.88
  1000   10  29.31
  1000   50  28.01
  1000  100  29.74
 10000  200  24.92
100000  400  24.95
Conclusions: This time the script we executed was pure perl (not limited by I/O or mysql), so we see that the server serves the requests much faster. You can see the number of requests per second is almost the same for any load, but goes lower when the number of concurrent clients goes beyond MaxClients. With 25 RPS, a load of 400 concurrent clients will be served in 16 seconds (400/25). To be more realistic, assuming a maximum of 100 concurrent clients and 30 requests per second, the clients will be served in about 3.3 seconds (100/30). Pretty good for a highly loaded server.
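The serving times quoted here are simply the number of waiting clients divided by the throughput. As a quick sketch (the function name seconds_to_serve is made up for this example):

```perl
use strict;
use warnings;

# Time needed to serve one request from each of $clients concurrent
# clients when the server sustains $rps requests per second
sub seconds_to_serve {
    my ($clients, $rps) = @_;
    return $clients / $rps;
}

printf "%.1f seconds\n", seconds_to_serve(400, 25);   # prints "16.0 seconds"
printf "%.1f seconds\n", seconds_to_serve(100, 30);   # prints "3.3 seconds"
```

This is the time until the last of the waiting clients is served; the first ones get their answers much sooner.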
Now we will use the server to its full capacity, by keeping all MaxClients clients alive all the time and having a big MaxRequestsPerChild, so that no child will be killed during the benchmarking.
MinSpareServers 50
MaxSpareServers 50
StartServers 50
MaxClients 50
MaxRequestsPerChild 5000
NR     NC   RPS     comment
------------------------------------------------
  100   10  32.05
 1000   10  33.14
 1000   50  33.17
 1000  100  31.72
10000  200  31.60
Conclusion: In this scenario there is no overhead involving the parent server loading new children, all the servers are available, and the only bottleneck is contention for the CPU.
Now we will change MaxClients and watch the results. Let's reduce MaxClients to 10.
MinSpareServers 8
MaxSpareServers 10
StartServers 10
MaxClients 10
MaxRequestsPerChild 5000
NR    NC   RPS     comment
------------------------------------------------
  10   10  23.87   # not a reliable figure
 100   10  32.64
1000   10  32.82
1000   50  30.43
1000  100  25.68
1000  500  26.95
2000  500  32.53
Conclusions: Very little difference! Ten servers were able to serve with almost the same throughput as 50 servers. Why? My guess is because of CPU throttling. It seems that 10 servers were serving requests 5 times faster than when we worked with 50 servers. With 50 servers, each child received its CPU time slice five times less frequently. So having a big value for MaxClients doesn't mean that the performance will be better. You have just seen the numbers!
Now we will start to reduce MaxRequestsPerChild drastically:
MinSpareServers 8
MaxSpareServers 16
StartServers 10
MaxClients 50
NR    NC   MRPC  RPS    comment
------------------------------------------------
 100   10   10   5.77
 100   10    5   3.32
1000   50   20   8.92
1000   50   10   5.47
1000   50    5   2.83
1000  100   10   6.51
Conclusions: When we drastically reduce MaxRequestsPerChild, the performance starts to become closer to plain mod_cgi.
Here are the numbers of this run with mod_cgi, for comparison:
MinSpareServers 8
MaxSpareServers 16
StartServers 10
MaxClients 50
NR      NC     RPS     comment
------------------------------------------------
100     10     1.12
1000    50     1.14
1000    100    1.13
Conclusion: mod_cgi is much slower. :) In the first test, when NR/NC was 100/10, mod_cgi was capable of 1.12 requests per second. In the same circumstances, mod_perl was capable of 32 requests per second, nearly 30 times faster! In the first test each client waited about 100 seconds to be served. In the second and third tests they waited 1000 seconds!
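The waiting times and the speedup quoted above follow directly from the tables. The short sketch below redoes that arithmetic; the subroutine name run_duration is only illustrative, and the result is the time needed to finish the whole run, which is an upper bound for what any single client waits.

```perl
use strict;
use warnings;

# Figures taken from the mod_cgi table above
# (NR = number of requests, NC = concurrency, RPS = requests per second).
my @runs = (
    { nr => 100,  nc => 10,  rps => 1.12 },
    { nr => 1000, nc => 50,  rps => 1.14 },
    { nr => 1000, nc => 100, rps => 1.13 },
);

# Time needed to complete a whole run at the measured throughput.
sub run_duration {
    my ($nr, $rps) = @_;
    return $nr / $rps;
}

for my $run (@runs) {
    printf "NR=%-5d NC=%-4d %.0f seconds to finish the run\n",
        $run->{nr}, $run->{nc}, run_duration($run->{nr}, $run->{rps});
}

# Speedup of mod_perl (32.05 RPS) over mod_cgi (1.12 RPS) for NR/NC = 100/10.
printf "speedup: %.1f times\n", 32.05 / 1.12;
```

The results (about 89 seconds for the first run and close to 900 seconds for the other two) are roughly in line with the rounded figures given in the text.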
Tuning with httperf
httperf is a utility written by David Mosberger. Just like ApacheBench, it measures the performance of the webserver.
A sample command line is shown below:
httperf --server hostname --port 80 --uri /test.html \
--rate 150 --num-conn 27000 --num-call 1 --timeout 5
This command causes httperf to use the web server on the host named hostname, running at port 80. The web page being retrieved is /test.html and, in this simple test, the same page is retrieved repeatedly. The rate at which requests are issued is 150 per second. The test involves initiating a total of 27,000 TCP connections and on each connection one HTTP call is performed. A call consists of sending a request and receiving a reply.
The timeout option defines the number of seconds that the client is willing to wait to hear back from the server. If this timeout expires, the tool considers the corresponding call to have failed. Note that with a total of 27,000 connections and a rate of 150 per second, the total test duration will be approximately 180 seconds (27,000/150), independently of what load the server can actually sustain. Here is a result that one might get:
Total: connections 27000 requests 26701 replies 26701 test-duration 179.996 s
Connection rate: 150.0 conn/s (6.7 ms/conn, <=47 concurrent connections)
Connection time [ms]: min 1.1 avg 5.0 max 315.0 median 2.5 stddev 13.0
Connection time [ms]: connect 0.3
Request rate: 148.3 req/s (6.7 ms/req)
Request size [B]: 72.0
Reply rate [replies/s]: min 139.8 avg 148.3 max 150.3 stddev 2.7 (36 samples)
Reply time [ms]: response 4.6 transfer 0.0
Reply size [B]: header 222.0 content 1024.0 footer 0.0 (total 1246.0)
Reply status: 1xx=0 2xx=26701 3xx=0 4xx=0 5xx=0
CPU time [s]: user 55.31 system 124.41 (user 30.7% system 69.1% total 99.8%)
Net I/O: 190.9 KB/s (1.6*10^6 bps)
Errors: total 299 client-timo 299 socket-timo 0 connrefused 0 connreset 0
Errors: fd-unavail 0 addrunavail 0 ftab-full 0 other 0
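A few figures in this report can be cross-checked by hand. The sketch below only redoes the arithmetic on the numbers printed above; the hash keys and subroutine names are made up for the illustration and are not part of httperf.

```perl
use strict;
use warnings;

# Values copied from the sample httperf run above.
my %run = (
    num_conn => 27_000,   # --num-conn
    rate     => 150,      # --rate (connections per second)
    replies  => 26_701,   # "replies" in the Total line
    errors   => 299,      # "Errors: total"
);

# The test duration depends only on the offered load, not on the server.
sub expected_duration {
    my ($num_conn, $rate) = @_;
    return $num_conn / $rate;
}

# Share of the calls that did not get a reply within --timeout.
sub error_percent {
    my ($errors, $num_conn) = @_;
    return 100 * $errors / $num_conn;
}

printf "expected duration: %.0f s\n", expected_duration(@run{qw(num_conn rate)});
printf "failed calls:      %.2f%%\n", error_percent(@run{qw(errors num_conn)});
printf "replies + errors:  %d\n",     $run{replies} + $run{errors};
```

The replies and the client timeouts add up to the 27,000 connections that were attempted, so about 1.1% of the calls timed out at this request rate.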
Tuning with the crashme Script
This is another crashme suite originally written by Michael Schilli and located at http://www.linux-magazin.de/ausgabe.1998.08/Pounder/pounder.html . I made a few modifications, mostly adding my() operators. I also allowed it to accept more than one URL to test, since sometimes you want to test more than one script.
The tool provides the same results as ab above but it also allows you to set the timeout value, so requests will fail if not served within the time out period. You also get values for Latency (seconds per request) and Throughput (requests per second). It can do a complete simulation of your favorite Netscape browser :) and give you a better picture.
While running these two benchmarking suites I noticed that ab reported results two and a half to three times better. Both suites were run on the same machine, with the same load and the same parameters, but the implementations were different.
Sample output:
URL(s): http://www.example.com:81/perl/access/access.cgi
Total Requests: 100
Parallel Agents: 10
Succeeded: 100 (100.00%)
Errors: NONE
Total Time: 9.39 secs
Throughput: 10.65 Requests/sec
Latency: 0.85 secs/Request
And the code:
#!/usr/apps/bin/perl -w
use LWP::Parallel::UserAgent;
use Time::HiRes qw(gettimeofday tv_interval);
use strict;
###
# Configuration
###
my $nof_parallel_connections = 10;
my $nof_requests_total = 100;
my $timeout = 10;
my @urls = (
'http://www.example.com:81/perl/faq_manager/faq_manager.pl',
'http://www.example.com:81/perl/access/access.cgi',
);
##################################################
# Derived Class for latency timing
##################################################
package MyParallelAgent;
@MyParallelAgent::ISA = qw(LWP::Parallel::UserAgent);
use strict;
###
# Is called when connection is opened
###
sub on_connect {
my ($self, $request, $response, $entry) = @_;
$self->{__start_times}->{$entry} = [Time::HiRes::gettimeofday];
}
###
# Are called when connection is closed
###
sub on_return {
my ($self, $request, $response, $entry) = @_;
my $start = $self->{__start_times}->{$entry};
$self->{__latency_total} += Time::HiRes::tv_interval($start);
}
sub on_failure {
on_return(@_); # Same procedure
}
###
# Access function for new instance var
###
sub get_latency_total {
return shift->{__latency_total};
}
##################################################
package main;
##################################################
###
# Init parallel user agent
###
my $ua = MyParallelAgent->new();
$ua->agent("pounder/1.0");
$ua->max_req($nof_parallel_connections);
$ua->redirect(0); # No redirects
###
# Register all requests
###
foreach (1..$nof_requests_total) {
foreach my $url (@urls) {
my $request = HTTP::Request->new('GET', $url);
$ua->register($request);
}
}
###
# Launch processes and check time
###
my $start_time = [gettimeofday];
my $results = $ua->wait($timeout);
my $total_time = tv_interval($start_time);
###
# Requests all done, check results
###
my $succeeded = 0;
my %errors = ();
foreach my $entry (values %$results) {
my $response = $entry->response();
if($response->is_success()) {
$succeeded++; # Another satisfied customer
} else {
# Error, save the message
$response->message("TIMEOUT") unless $response->code();
$errors{$response->message}++;
}
}
###
# Format errors if any from %errors
###
my $errors = join(',', map "$_ ($errors{$_})", keys %errors);
$errors = "NONE" unless $errors;
###
# Format results
###
#@urls = map {($_,".")} @urls;
# Every URL is requested $nof_requests_total times, so derive the real count
my $nof_requests_sent = $nof_requests_total * @urls;
my @P = (
"URL(s)" => join("\n\t\t ", @urls),
"Total Requests" => "$nof_requests_sent",
"Parallel Agents" => $nof_parallel_connections,
"Succeeded" => sprintf("$succeeded (%.2f%%)\n",
$succeeded * 100 / $nof_requests_sent),
"Errors" => $errors,
"Total Time" => sprintf("%.2f secs\n", $total_time),
"Throughput" => sprintf("%.2f Requests/sec\n",
$nof_requests_sent / $total_time),
"Latency" => sprintf("%.2f secs/Request",
($ua->get_latency_total() || 0) /
$nof_requests_sent),
);
my ($left, $right);
###
# Print out statistics
###
format STDOUT =
@<<<<<<<<<<<<<<< @*
"$left:", $right
.
while(($left, $right) = splice(@P, 0, 2)) {
write;
}
Choosing MaxClients
The MaxClients
directive sets the limit on the number of simultaneous requests that can be supported. No more than this number of child server processes will be created. To configure more than 256 clients, you must edit the HARD_SERVER_LIMIT
entry in httpd.h
and recompile. In our case we want this variable to be as small as possible, because in this way we can limit the resources used by the server children. Since we can restrict each child's process size (see Limiting the size of the processes), the calculation of MaxClients
is pretty straightforward:
             Total RAM Dedicated to the Webserver
MaxClients = ------------------------------------
                   MAX child's process size
So if I have 400Mb left for the webserver to run with, I can set MaxClients
to 40 if I know that each child is limited to 10Mb of memory (e.g. with Apache::SizeLimit
).
You will be wondering what will happen to your server if there are more concurrent users than MaxClients
at any time. This situation is signified by the following warning message in the error_log
:
[Sun Jan 24 12:05:32 1999] [error] server reached MaxClients setting,
consider raising the MaxClients setting
There is no problem -- any connection attempts over the MaxClients
limit will normally be queued, up to a number based on the ListenBacklog
directive. When a child process is freed at the end of a different request, the connection will be served.
It is an error because clients are being put in the queue rather than getting served immediately, despite the fact that they do not get an error response. The error can be allowed to persist to balance available system resources and response time, but sooner or later you will need to get more RAM so you can start more child processes. The best approach is to try not to have this condition reached at all, and if you reach it often you should start to worry about it.
It's important to understand how much real memory a child occupies. Your children can share memory between them when the OS supports that. You must take action to allow the sharing to happen; see Preload Perl modules at server startup. If you do this, the chances are that your MaxClients
can be even higher. But it seems that it's not so simple to calculate the absolute number. If you come up with a solution please let us know! If the shared memory were of the same size throughout the child's life, we could derive a much better formula:
             Total_RAM + Shared_RAM_per_Child * MaxClients
MaxClients = --------------------------------------------- - 1
                           Max_Process_Size
which is:
                  Total_RAM - Max_Process_Size
MaxClients = ---------------------------------------
             Max_Process_Size - Shared_RAM_per_Child
Let's roll some calculations:
Total_RAM            = 500Mb
Max_Process_Size     = 10Mb
Shared_RAM_per_Child = 4Mb
             500 - 10
MaxClients = --------- = 81
              10 - 4
With no sharing in place
                500
MaxClients = --------- = 50
                10
With sharing in place you can have 60% more servers without buying more RAM.
If you improve sharing and keep the sharing level, let's say:
Total_RAM            = 500Mb
Max_Process_Size     = 10Mb
Shared_RAM_per_Child = 8Mb
             500 - 10
MaxClients = --------- = 245
              10 - 8
390% more servers! Now you can feel the importance of having as much shared memory as possible.
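The calculations above are easy to script, which helps when you want to try several sharing levels. This is a minimal sketch of the two formulas; the subroutine names are invented for the example, all sizes are in Mb, and the results are rounded down to whole processes as in the text.

```perl
use strict;
use warnings;

# MaxClients when nothing is shared between the children.
sub max_clients_unshared {
    my ($total_ram, $max_process_size) = @_;
    return int($total_ram / $max_process_size);
}

# MaxClients when every child shares $shared Mb with the other children:
# (Total_RAM - Max_Process_Size) / (Max_Process_Size - Shared_RAM_per_Child)
sub max_clients_shared {
    my ($total_ram, $max_process_size, $shared) = @_;
    die "shared memory must be smaller than the process size\n"
        if $shared >= $max_process_size;
    return int(($total_ram - $max_process_size) / ($max_process_size - $shared));
}

printf "no sharing: %d\n", max_clients_unshared(500, 10);
printf "4Mb shared: %d\n", max_clients_shared(500, 10, 4);
printf "8Mb shared: %d\n", max_clients_shared(500, 10, 8);
```

Remember that the formula assumes the shared size stays constant over the child's life, which is usually not the case, so treat the result as an upper estimate.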
Choosing MaxRequestsPerChild
The MaxRequestsPerChild
directive sets the limit on the number of requests that an individual child server process will handle. After MaxRequestsPerChild
requests, the child process will die. If MaxRequestsPerChild
is 0, then the process will live forever.
Setting MaxRequestsPerChild
to a non-zero limit solves some memory leakage problems caused by sloppy programming practices, whereby a child process consumes more memory after each request.
If left unbounded, then after a certain number of requests the children will use up all the available memory and leave the server to die from memory starvation. Note that sometimes standard system libraries leak memory too, especially on OSes with bad memory management (e.g. Solaris 2.5 on x86 arch).
If this is your case you can set MaxRequestsPerChild
to a small number. This will allow the system to reclaim the memory that a greedy child process consumed, when it exits after MaxRequestsPerChild
requests.
But beware -- if you set this number too low, you will lose some of the speed bonus you get from mod_perl. Consider using Apache::PerlRun
if this is the case.
Another approach is to use the Apache::SizeLimit or the Apache::GTopLimit modules. By using either of these modules you should be able to discontinue using MaxRequestsPerChild
, although for some developers, using both in combination does the job. In addition the latter module allows you to kill any servers whose shared memory size drops below a specified limit.
See also Preload Perl modules at server startup and Sharing Memory.
Choosing MinSpareServers, MaxSpareServers and StartServers
With mod_perl enabled, it might take as much as 20 seconds from the time you start the server until it is ready to serve incoming requests. This delay depends on the OS, the number of preloaded modules and the process load of the machine. It's best to set StartServers
and MinSpareServers
to high numbers, so that if you get a high load just after the server has been restarted the fresh servers will be ready to serve requests immediately. With mod_perl, it's usually a good idea to raise all 3 variables higher than normal.
In order to maximize the benefits of mod_perl, you don't want to kill servers when they are idle; rather, you want them to stay up and available to handle new requests immediately. I think an ideal configuration is to set MinSpareServers
and MaxSpareServers
to similar values, maybe even the same. Having the MaxSpareServers
close to MaxClients
will completely use all of your resources (if MaxClients
has been chosen to take the full advantage of the resources), but it'll make sure that at any given moment your system will be capable of responding to requests with the maximum speed (assuming that number of concurrent requests is not higher than MaxClients
).
Let's try some numbers. For a heavily loaded web site and a dedicated machine I would think of (note 400Mb is just for example):
Available to webserver RAM: 400Mb
Child's memory size bounded: 10Mb
MaxClients: 400/10 = 40 (larger with mem sharing)
StartServers: 20
MinSpareServers: 20
MaxSpareServers: 35
However if I want to use the server for many other tasks, but make it capable of handling a high load, I'd think of:
Available to webserver RAM: 400Mb
Child's memory size bounded: 10Mb
MaxClients: 400/10 = 40
StartServers: 5
MinSpareServers: 5
MaxSpareServers: 10
These numbers are taken off the top of my head, and shouldn't be used as a rule, but rather as examples to show you some possible scenarios. Use this information with caution!
Summary of Benchmarking to tune all 5 parameters
OK, we've run various benchmarks -- let's summarize the conclusions:
MaxRequestsPerChild
If your scripts are clean and don't leak memory, set this variable to a number as large as possible (10000?). If you use Apache::SizeLimit, you can set this parameter to 0 (treated as infinity). You will want this parameter to be smaller if your code becomes unshared over the process' life. And Apache::GTopLimit comes into the picture with the shared memory limitation feature.
StartServers
If you keep a small number of servers active most of the time, keep this number low. Keep it low especially if MaxSpareServers is also low, as if there is no load Apache will kill its children before they have been utilized at all. If your service is heavily loaded, make this number close to MaxClients, and keep MaxSpareServers equal to MaxClients.
MinSpareServers
If your server performs other work besides web serving, make this low so the memory of unused children will be freed when the load is light. If your server's load varies (you get loads in bursts) and you want fast response for all clients at any time, you will want to make it high, so that new children will be respawned in advance and are waiting to handle bursts of requests.
MaxSpareServers
The logic is the same as for MinSpareServers - low if you need the machine for other tasks, high if it's a dedicated web host and you want a minimal delay between the request and the response.
MaxClients
Not too low, so you don't get into a situation where clients are waiting for the server to start serving them (they might wait, but not for very long). However, do not set it too high. With a high MaxClients, if you get a high load the server will try to serve all requests immediately. Your CPU will have a hard time keeping up, and if the child size * number of running children is larger than the total available RAM your server will start swapping. This will slow down everything, which in turn will make things even slower, until eventually your machine will die. It's important that you take pains to ensure that swapping does not normally happen. Swap space is an emergency pool, not a resource to be used routinely. If you are low on memory and you badly need it, buy it. Memory is cheap.
But based on the test I conducted above, even if you have plenty of memory like I have (1Gb), increasing MaxClients sometimes will give you no improvement in performance. The more clients are running, the more CPU time will be required, the less CPU time slices each process will receive. The response latency (the time to respond to a request) will grow, so you won't see the expected improvement. The best approach is to find the minimum requirement for your kind of service and the maximum capability of your machine. Then start at the minimum and test like I did, successively raising this parameter until you find the region on the curve of the graph of latency and/or throughput against MaxClients where the improvement starts to diminish. Stop there and use it. When you make the measurements on a production server you will have the ability to tune them more precisely, since you will see the real numbers.
Don't forget that if you add more scripts, or even just modify the existing ones, the processes will grow in size as you compile in more code. Probably the parameters will need to be recalculated.
KeepAlive
If your mod_perl server's httpd.conf includes the following directives:
KeepAlive On
MaxKeepAliveRequests 100
KeepAliveTimeout 15
you have a real performance penalty, since after completing the processing for each request, the process will wait for KeepAliveTimeout
seconds before closing the connection and will therefore not be serving other requests during this time. With this configuration you will need many more concurrent processes on a server with high traffic.
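To get a feeling for the size of this penalty, here is a back-of-the-envelope estimate. It is only a sketch under stated assumptions: it uses the rule that the number of busy processes equals the arrival rate multiplied by the time each connection holds a process, it assumes the worst case where the client never sends a second request on the connection, and the load figures (20 new connections per second, 0.1 seconds to generate a response) are hypothetical.

```perl
use strict;
use warnings;

# Rough estimate: busy processes = arrival rate * time each connection
# holds a process. In the worst case the client sends no further request,
# so the process idles for the whole KeepAliveTimeout.
sub busy_processes {
    my ($connections_per_sec, $service_time, $keepalive_timeout) = @_;
    return $connections_per_sec * ($service_time + $keepalive_timeout);
}

# Hypothetical load: 20 new connections per second, 0.1 s per response.
printf "KeepAlive Off:       %.0f processes\n", busy_processes(20, 0.1, 0);
printf "KeepAliveTimeout 15: %.0f processes\n", busy_processes(20, 0.1, 15);
```

Real clients often reuse or close the connection before the timeout expires, so the true figure lies somewhere between the two numbers.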
If you use some server status reporting tools, you will see the process in K status when it's in KeepAlive
status.
The chances are that you don't want this feature enabled. Set it Off with:
KeepAlive Off
the other two directives don't matter if KeepAlive
is Off
.
You might want to consider enabling this option if the client's browser needs to request more than one object from your server for a single HTML page. If this is the situation, then by setting KeepAlive
On
you save, for each page, the HTTP connection overhead for all requests but the first one.
For example if you have a page with 10 ad banners, which is not uncommon today, your server will work more effectively if a single process serves them all during a single connection. However, your client will see a slightly slower response, since banners will be brought one at a time and not concurrently as is the case if each IMG
tag opens a separate connection.
Since keepalive connections will not incur the additional three-way TCP handshake, turning it on will be kinder to the network.
SSL connections benefit the most from KeepAlive
if you haven't configured the server to cache session ids.
You have probably followed the advice to send all the requests for static objects to a plain Apache server. Since most pages include more than one unique static image, you should keep the default KeepAlive
setting of the non-mod_perl server, i.e. keep it On
. It will probably be a good idea also to reduce the timeout a little.
One option would be for the proxy/accelerator to keep the connection open to the client but make individual connections to the server, read the response, buffer it for sending to the client and close the server connection. Obviously you would make new connections to the server as required by the client's requests.
Also you should know that KeepAlive
requests only work with responses that contain a Content-Length
header. To send this header do:
$r->header_out('Content-Length', $length);
PerlSetupEnv Off
PerlSetupEnv Off
is another optimization you might consider.
mod_perl fiddles with the environment to make it appear as if the script were being called under the CGI protocol. For example, the $ENV{QUERY_STRING}
environment variable is initialized with the contents of Apache::args(), and the value returned by Apache::server_hostname() is put into $ENV{SERVER_NAME}
.
But %ENV
population is expensive. Those who have moved to the Perl Apache API no longer need this extra %ENV
population, and can gain by turning it Off.
By default it is On.
Note that you can still set environment variables. For example when you use the following configuration:
PerlModule Apache::RegistryNG
<Location /perl>
PerlSetupEnv Off
PerlSetEnv TEST hi
SetHandler perl-script
PerlHandler Apache::RegistryNG
Options +ExecCGI
</Location>
and you issue a request (for example http://localhost/perl/setupenvoff.pl) for this script:
setupenvoff.pl
--------------
use Data::Dumper;
my $r = Apache->request();
$r->send_http_header('text/plain');
print Dumper(\%ENV);
you should see something like this:
$VAR1 = {
'GATEWAY_INTERFACE' => 'CGI-Perl/1.1',
'MOD_PERL' => 'mod_perl/1.22',
'PATH' => '/usr/lib/perl5/5.00503:... snipped ...',
'TEST' => 'hi'
};
Notice that we have got the value of the environment variable TEST.
Reducing the Number of stat() Calls Made by Apache
If you watch the system calls that your server makes (using truss or strace) while processing a request, you will notice that a few stat() calls are made. For example when I fetch http://localhost/perl-status and I have my DocRoot set to /home/httpd/docs I see:
[snip]
stat("/home/httpd/docs/perl-status", 0xbffff8cc) = -1
ENOENT (No such file or directory)
stat("/home/httpd/docs", {st_mode=S_IFDIR|0755,
st_size=1024, ...}) = 0
[snip]
If you have some dynamic content and your virtual relative URI is something like /news/perl/mod_perl/summary (i.e., there is no such directory on the web server, the path components are only used for requesting a specific report), this will generate five(!) stat() calls, before the DocumentRoot
is found. You will see something like this:
stat("/home/httpd/docs/news/perl/mod_perl/summary", 0xbffff744) = -1
ENOENT (No such file or directory)
stat("/home/httpd/docs/news/perl/mod_perl", 0xbffff744) = -1
ENOENT (No such file or directory)
stat("/home/httpd/docs/news/perl", 0xbffff744) = -1
ENOENT (No such file or directory)
stat("/home/httpd/docs/news", 0xbffff744) = -1
ENOENT (No such file or directory)
stat("/home/httpd/docs",
{st_mode=S_IFDIR|0755, st_size=1024, ...}) = 0
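The pattern in this trace is easy to reproduce: the full path is tried first, then each parent directory in turn, until an existing directory (here the document root) is found. The sketch below only mimics the pattern visible in the trace so that you can predict the number of calls for your own URIs; it is not Apache's code, and the subroutine name stat_candidates is invented for the example.

```perl
use strict;
use warnings;

# List the paths that show up in the trace: the full translated path first,
# then each parent directory, ending with the document root.
sub stat_candidates {
    my ($doc_root, $uri) = @_;
    my @segments = grep { length } split m{/}, $uri;
    my @paths;
    while (@segments) {
        push @paths, join '/', $doc_root, @segments;
        pop @segments;
    }
    push @paths, $doc_root;
    return @paths;
}

my @paths = stat_candidates('/home/httpd/docs', '/news/perl/mod_perl/summary');
print "$_\n" for @paths;
printf "%d stat() calls\n", scalar @paths;
```

For a virtual URI with four path components this lists five paths, matching the five stat() calls in the trace.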
You can blame the default installed TransHandler
for this inefficiency. Of course you could supply your own, which will be smart enough not to look for this virtual path and immediately return OK
. But in cases where you have a virtual host that serves only dynamically generated documents, you can override the default PerlTransHandler
with this one:
<VirtualHost 10.10.10.10:80>
...
PerlTransHandler Apache::OK
...
</VirtualHost>
As you see it affects only this specific virtual host.
This has the effect of short circuiting the normal TransHandler
processing of trying to find a filesystem component that matches the given URI -- no more 'stat's!
Watching your server under strace/truss can often reveal more performance hits than trying to optimize the code itself!
For example unless configured correctly, Apache might look for the .htaccess file in many places even if you don't have one, adding many open() calls.
Let's start with this simple configuration and try to reduce the number of irrelevant system calls.
DocumentRoot "/home/httpd/docs"
<Location /foo/test>
SetHandler perl-script
PerlHandler Apache::Foo
</Location>
The above configuration allows us to make a request to /foo/test and the Perl handler() defined in Apache::Foo
will be executed. Notice that in the test setup there is no file to be executed (like in Apache::Registry
). There is no .htaccess file either.
This is a typical generated trace.
stat("/home/httpd/docs/foo/test", 0xbffff8fc) = -1 ENOENT
(No such file or directory)
stat("/home/httpd/docs/foo", 0xbffff8fc) = -1 ENOENT
(No such file or directory)
stat("/home/httpd/docs",
{st_mode=S_IFDIR|0755, st_size=1024, ...}) = 0
open("/.htaccess", O_RDONLY) = -1 ENOENT
(No such file or directory)
open("/home/.htaccess", O_RDONLY) = -1 ENOENT
(No such file or directory)
open("/home/httpd/.htaccess", O_RDONLY) = -1 ENOENT
(No such file or directory)
open("/home/httpd/docs/.htaccess", O_RDONLY) = -1 ENOENT
(No such file or directory)
stat("/home/httpd/docs/test", 0xbffff774) = -1 ENOENT
(No such file or directory)
stat("/home/httpd/docs",
{st_mode=S_IFDIR|0755, st_size=1024, ...}) = 0
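The open() calls in this trace follow the directory hierarchy from the root down to the document root, one attempt per directory level. The following sketch lists the files that are tried for a given directory; as before it only reproduces what the trace shows, and the subroutine name htaccess_candidates is hypothetical.

```perl
use strict;
use warnings;

# One .htaccess lookup per directory level, from / down to the given directory.
sub htaccess_candidates {
    my ($dir) = @_;
    my @files = ('/.htaccess');
    my $path  = '';
    for my $segment (grep { length } split m{/}, $dir) {
        $path .= "/$segment";
        push @files, "$path/.htaccess";
    }
    return @files;
}

print "$_\n" for htaccess_candidates('/home/httpd/docs');
```

The deeper the document root sits in the filesystem, the more open() calls are wasted on every request.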
Now we modify the <Directory>
entry and add AllowOverride None, which among other things disables .htaccess files, so Apache will not try to open them.
<Directory />
AllowOverride None
</Directory>
We see that the four open() calls for .htaccess have gone.
stat("/home/httpd/docs/foo/test", 0xbffff8fc) = -1 ENOENT
(No such file or directory)
stat("/home/httpd/docs/foo", 0xbffff8fc) = -1 ENOENT
(No such file or directory)
stat("/home/httpd/docs",
{st_mode=S_IFDIR|0755, st_size=1024, ...}) = 0
stat("/home/httpd/docs/test", 0xbffff774) = -1 ENOENT
(No such file or directory)
stat("/home/httpd/docs",
{st_mode=S_IFDIR|0755, st_size=1024, ...}) = 0
Let's try to shortcut the foo location with:
Alias /foo /
This makes Apache look for the file in the / directory and not under /home/httpd/docs/foo. Let's run it:
stat("//test", 0xbffff8fc) = -1 ENOENT (No such file or directory)
Wow, we've got only one stat call left!
Let's remove the last Alias
setting and use:
PerlTransHandler Apache::OK
as explained above. When we issue the request, we see no stat() calls. But this is possible only if you serve only dynamically generated documents, i.e. no CGI scripts. Otherwise you will have to write your own PerlTransHandler to handle requests as desired.
For example this PerlTransHandler will not look up the file on the filesystem if the URI starts with /foo, but will use the default PerlTransHandler otherwise:
PerlTransHandler 'sub { return shift->uri() =~ m|^/foo| \
? Apache::OK : Apache::DECLINED;}'
Let's see the same configuration using the <Perl>
section and a dedicated package:
<Perl>
package My::Trans;
use Apache::Constants qw(:common);
sub handler{
my $r = shift;
return OK if $r->uri() =~ m|^/foo|;
return DECLINED;
}
package Apache::ReadConfig;
$PerlTransHandler = "My::Trans";
</Perl>
As you see we have defined the My::Trans
package and implemented the handler() function. Then we have assigned this handler to the PerlTransHandler
.
Of course you can move the code in the module into an external file, (e.g. My/Trans.pm) and configure the PerlTransHandler
with
PerlTransHandler My::Trans
in the normal way (no <Perl>
section required).
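One detail worth checking before you move such a handler into production is the pattern itself: m|^/foo| also matches unrelated URIs such as /foobar. The sketch below tests a stricter pattern outside the server. The constants are stand-ins with the values that Apache::Constants exports under Apache 1.3 (under mod_perl you would import the real ones), and trans_status is a name made up for this example.

```perl
use strict;
use warnings;

# Stand-ins for the Apache::Constants values, so that the check can run
# outside the server. Under mod_perl use Apache::Constants qw(:common).
use constant OK       => 0;
use constant DECLINED => -1;

# Skip the filesystem lookup for /foo and everything below it, but not for
# URIs that merely start with the same characters (e.g. /foobar).
sub trans_status {
    my ($uri) = @_;
    return $uri =~ m{^/foo(?:/|\z)} ? OK : DECLINED;
}

for my $uri (qw(/foo /foo/test /foobar /bar)) {
    my $status = trans_status($uri) == OK ? 'OK' : 'DECLINED';
    printf "%-10s %s\n", $uri, $status;
}
```

Whether the looser or the stricter pattern is right depends on your URI space; the point is only that the decision is cheap to test before the handler goes into the server.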
TMTOWTDI: Convenience and Performance
TMTOWTDI, or "There Is More Than One Way To Do It" is the main motto of Perl. Unfortunately when you come to the point where performance is the goal, you might have to learn what's more efficient and what's not. Of course it might mean that you will have to use something that you don't really like, it might be less convenient or it might be just a matter of habit that one should change.
So this section is about performance trade-offs. For each comparison we will provide the theoretical difference and then run benchmarks to support the theory, since however good the theory, it's the numbers we get in practice that matter.
The following SW/HW is used for the benchmarking purposes:
HW: Dual Pentium II (Deschutes) 400MHz 512 KB cache 256MB
RAM (DIMM PC100)
SW: Linux (RH 6.1) Perl 5.005_03
Apache/1.3.12 mod_perl/1.22 mod_ssl/2.6.2 OpenSSL/0.9.5
The relevant Apache configuration:
MinSpareServers 10
MaxSpareServers 20
StartServers 10
MaxClients 20
MaxRequestsPerChild 10000
Apache::Registry versus pure PerlHandler
At some point you have to decide whether to use Apache::Registry
and similar handlers and stick to writing scripts for the content generation or to write pure Perl handlers.
Apache::Registry
maps a request to a file and generates a subroutine to run the code contained in that file. If you use a PerlHandler My::handler instead of Apache::Registry
, you have a direct mapping from request to subroutine, without the steps in between. These steps include:
- stat the $r->filename
- check that it exists and is executable
- generate a Perl package name based on $r->uri
- chdir basename $r->filename
- compare last modified time
- if modified or not compiled, compile the subroutine
- chdir $old_cwd
If you cut out those steps, you cut out some overhead, plain and simple. Do you need to cut out that overhead? We don't know; your requirements determine that.
You should take a look at the sister Apache::Registry
modules that don't perform all these steps, so you can still choose to stick to using scripts to generate the content.
On the other hand, if you go the pure Perl handler way you will have to add special configuration directives for each handler, something that you don't do when you go the "scripts" way.
Now let's run benchmarks and compare.
The Light (Empty) Code
First let's see the overhead that Apache::Registry
adds. In order to do that we will use an almost empty script, that only sends a basic header and one word as content.
The registry.pl script running under Apache::Registry
:
benchmarks/registry.pl
----------------------
use strict;
print "Content-type: text/plain\r\n\r\n";
print "Hello";
The Perl Content handler:
Benchmark/Handler.pm
--------------------
package Benchmark::Handler;
use Apache::Constants qw(:common);
sub handler{
my $r = shift;
$r->send_http_header('text/html');
$r->print("Hello");
return OK;
}
1;
with settings:
PerlModule Benchmark::Handler
<Location /benchmark_handler>
SetHandler perl-script
PerlHandler Benchmark::Handler
</Location>
so we get Benchmark::Handler
preloaded.
We will use the Apache::RegistryLoader
to preload the script as well, so the benchmark will be fair and only the processing time will be measured. In the startup.pl we add:
use Apache::RegistryLoader ();
Apache::RegistryLoader->new->handler(
"/perl/benchmarks/registry.pl",
"/home/httpd/perl/benchmarks/registry.pl");
And if we check the "Compiled Registry Scripts" section with the help of Apache::Status ( http://localhost/perl-status?rgysubs ), we see the script in the listing of the already compiled scripts:
Apache::ROOT::perl::benchmarks::registry_2epl
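The package name shown here is the result of the "generate a Perl package name based on $r->uri" step listed earlier. The sketch below approximates that mapping so you can see where a name like registry_2epl comes from; it is a simplified illustration and not a copy of the Apache::Registry source, and the subroutine name registry_package_name is invented.

```perl
use strict;
use warnings;

# Approximation of the URI to package name mapping.
sub registry_package_name {
    my ($uri) = @_;
    my $name = $uri;
    # Escape every character that is not valid in a Perl identifier
    # as an underscore followed by its hex code ("." becomes "_2e").
    $name =~ s/([^A-Za-z0-9_\/])/sprintf("_%2x", ord $1)/eg;
    # Turn path separators into package separators and escape a leading
    # digit, since a package name component cannot start with one.
    $name =~ s{(/+)(\d?)}{"::" . (length $2 ? sprintf("_%2x", ord $2) : "")}eg;
    return "Apache::ROOT$name";
}

print registry_package_name('/perl/benchmarks/registry.pl'), "\n";
```

For the benchmark script this yields the same name as the one listed by Apache::Status above.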
The Heavy Code
We will see that the overhead is insignificant when the code itself is significantly heavier and slower. Let's leave the above code examples unmodified but add some CPU intensive processing operation (it could also be an IO operation or a database query).
my $x = 100;
my $y = log ($x ** 100) for (0..10000);
Processing and Results
So now we can proceed with the benchmark. We will generate 5000 requests with a concurrency level of 10 (i.e. emulating 10 concurrent users):
% ab -n 5000 -c 10 http://localhost/perl/benchmarks/registry.pl
% ab -n 5000 -c 10 http://localhost/benchmark_handler
And the results:
Light code:
Type        RPS     Av.CTime
-------     ---     --------
Registry    561       16
Handler     707       13
Heavy code:
Type        RPS     Av.CTime
-------     ---     --------
Registry     68      146
Handler      70      141
Reports:
-----------------------------------------------
RPS       : Requests Per Second
Av. CTime : Average request processing time (msec) as seen by client
Conclusions
The Light Code
We can see that the average overhead added by
Apache::Registry
is about:
16 - 13 = 3 milli-seconds
per request.
Thus the difference in speed is about 19%.
The Heavy Code
If we look at the average processing time, we see that the time delta between the two handlers is almost the same: it has grown only from 3 msec to 5 msec. This means that the identical heavy code that has been added lengthened each request by about 130 msec (146-16). It doesn't mean that the added code itself needs 130 msec of CPU time. It means that it took 130 msec for this code to be completed in a multi-process environment where each process gets a time slice to use the CPU.
If we run this extra code under plain Benchmark:
benchmark.pl
------------
use Benchmark;
timethis (1_000,
sub {
my $x = 100;
my $y = log ($x ** 100) for (0..10000);
});
% perl benchmark.pl
timethis 1000: 25 wallclock secs (24.93 usr + 0.00 sys = 24.93 CPU)
We see that it takes about 25 CPU seconds to complete the 1000 iterations, i.e. about 25 msec of CPU time per iteration.
The interesting thing is that when the server under test runs on a slow machine the results are completely different. I'll present them here for comparison:
Light code:
Type        RPS     Av.CTime
-------     ---     --------
Registry     61      160
Handler     196       50
Heavy code:
Type        RPS     Av.CTime
-------     ---     --------
Registry     12      822
Handler      67      149
You can see that adding the same CPU-intensive code to the two handlers under test on the slow machine enlarges the delta of the average processing time between the two handlers. We'd expect to see the same delta (of 110 msec) in this case, but that's not what's happening: the delta grows to 673 msec (822 - 149).
The explanation lies in the fact that the difference between the machines isn't merely the processor speed. It's possible that many other things differ, for example the size of the processor cache. If one machine has a processor cache large enough to hold the whole handler and the other doesn't, this can be very significant, given that in our benchmark 99.9% of the CPU activity was dedicated to running the handler's code.
CGI.pm versus Apache::Request
CGI.pm is a pure Perl implementation of the most commonly used functions in CGI coding. It has two main parts: input processing and HTML generation.
Apache::Request's core is written in C, giving it a significant memory and performance benefit. It has all the functionality of CGI.pm except the HTML generation functions.
use CGI qw(-compile :all) adds about 1Mb to the size of the server. CGI.pm pulls lots of stunts under the covers to provide both a method and a function interface, etc. Apache::Request is a very thin XS layer on top of a C library and adds only a few kilobytes to the size of the server. This C code is much faster and lighter than the Perl equivalent used in CGI.pm or similar modules (e.g. CGI_Lite).
This difference might not matter much to you, depending on your requirements.
Let's write two registry scripts that use CGI.pm and Apache::Request to process a form's input and print it out. We will use the scripts to benchmark the modules.
benchmarks/cgi_pm.pl
--------------------
use strict;
use CGI;
my $q = new CGI;
print $q->header('text/plain');
print join "\n", map {"$_ => ".$q->param($_) } $q->param;
benchmarks/apache_request.pl
----------------------------
use strict;
use Apache::Request ();
my $r = Apache->request;
my $q = Apache::Request->new($r);
$r->send_http_header('text/plain');
print join "\n", map {"$_ => ".$q->param($_) } $q->param;
We preload both the modules that we are going to benchmark in the startup.pl:
use Apache::Request ();
use CGI qw(-compile :all);
We will preload both scripts as well:
use Apache::RegistryLoader ();
Apache::RegistryLoader->new->handler(
    "/perl/benchmarks/cgi_pm.pl",
    "/home/httpd/perl/benchmarks/cgi_pm.pl");
Apache::RegistryLoader->new->handler(
    "/perl/benchmarks/apache_request.pl",
    "/home/httpd/perl/benchmarks/apache_request.pl");
Now let's benchmark the two:
% ab -n 1000 -c 10 \
'http://localhost/perl/benchmarks/cgi_pm.pl?a=b&c=+k+d+d+f&d=asf&as=+1+2+3+4'
Time taken for tests: 23.950 seconds
Requests per second: 41.75
Connnection Times (ms)
min avg max
Connect: 0 0 45
Processing: 204 238 274
Total: 204 238 319
% ab -n 1000 -c 10 \
'http://localhost/perl/benchmarks/apache_request.pl?a=b&c=+k+d+d+f&d=asf&as=+1+2+3+4'
Time taken for tests: 18.406 seconds
Requests per second: 54.33
Connnection Times (ms)
min avg max
Connect: 0 0 32
Processing: 156 183 202
Total: 156 183 234
Apparently the latter script, which uses Apache::Request, is about 23% faster. The larger the input, the greater the percentage speedup.
In the above example we have benchmarked the CGI input processing. When the code is much heavier, the overhead of using CGI.pm for input parsing becomes insignificant.
"Bloatware" modules
Perl modules like IO:: are very convenient, but let's see what it costs us to use them. (perl5.6.0 over OpenBSD)
% wc `perl -MIO -e 'print join("\n", sort values %INC, "")'`
124 696 4166 /usr/local/lib/perl5/5.6.0/Carp.pm
580 2465 17661 /usr/local/lib/perl5/5.6.0/Class/Struct.pm
400 1495 10455 /usr/local/lib/perl5/5.6.0/Cwd.pm
313 1589 10377 /usr/local/lib/perl5/5.6.0/Exporter.pm
225 784 5651 /usr/local/lib/perl5/5.6.0/Exporter/Heavy.pm
92 339 2813 /usr/local/lib/perl5/5.6.0/File/Spec.pm
442 1574 10276 /usr/local/lib/perl5/5.6.0/File/Spec/Unix.pm
115 398 2806 /usr/local/lib/perl5/5.6.0/File/stat.pm
406 1350 10265 /usr/local/lib/perl5/5.6.0/IO/Socket/INET.pm
143 429 3075 /usr/local/lib/perl5/5.6.0/IO/Socket/UNIX.pm
7168 24137 178650 /usr/local/lib/perl5/5.6.0/OpenBSD.i386-openbsd/Config.pm
230 1052 5995 /usr/local/lib/perl5/5.6.0/OpenBSD.i386-openbsd/Errno.pm
222 725 5216 /usr/local/lib/perl5/5.6.0/OpenBSD.i386-openbsd/Fcntl.pm
47 101 669 /usr/local/lib/perl5/5.6.0/OpenBSD.i386-openbsd/IO.pm
239 769 5005 /usr/local/lib/perl5/5.6.0/OpenBSD.i386-openbsd/IO/Dir.pm
169 549 3956 /usr/local/lib/perl5/5.6.0/OpenBSD.i386-openbsd/IO/File.pm
594 2180 14772 /usr/local/lib/perl5/5.6.0/OpenBSD.i386-openbsd/IO/Handle.pm
252 755 5375 /usr/local/lib/perl5/5.6.0/OpenBSD.i386-openbsd/IO/Pipe.pm
77 235 1709 /usr/local/lib/perl5/5.6.0/OpenBSD.i386-openbsd/IO/Seekable.pm
428 1419 10219 /usr/local/lib/perl5/5.6.0/OpenBSD.i386-openbsd/IO/Socket.pm
452 1401 10554 /usr/local/lib/perl5/5.6.0/OpenBSD.i386-openbsd/Socket.pm
127 473 3554 /usr/local/lib/perl5/5.6.0/OpenBSD.i386-openbsd/XSLoader.pm
52 161 1050 /usr/local/lib/perl5/5.6.0/SelectSaver.pm
139 541 3754 /usr/local/lib/perl5/5.6.0/Symbol.pm
161 609 4081 /usr/local/lib/perl5/5.6.0/Tie/Hash.pm
109 390 2479 /usr/local/lib/perl5/5.6.0/strict.pm
79 370 2589 /usr/local/lib/perl5/5.6.0/vars.pm
318 1124 11975 /usr/local/lib/perl5/5.6.0/warnings.pm
30 85 722 /usr/local/lib/perl5/5.6.0/warnings/register.pm
13733 48195 349869 total
Incredible. But it's half the size on Linux:
% wc `perl -MIO -e 'print join("\n", sort values %INC, "")'`
[similar lines snipped]
6618 25068 176740 total
Moreover, that requires 116 happy trips through the kernel's namei(). It syscalls open() a remarkable 57 times, 17 of which failed but leaving 38 that were successful. It also syscalled read() a curiously identical 57 times, ingesting a total of 180,265 plump bytes. To top it off, this increases your resident set size by two megabytes! (1.5Mb on linux).
Happy mallocking...
It seems that CGI.pm suffers from the same disease:
% wc `perl -MCGI -le 'print for values %INC'`
1368 6920 43710 /usr/local/lib/perl5/5.6.0/overload.pm
6481 26122 200840 /usr/local/lib/perl5/5.6.0/CGI.pm
7849 33042 244550 total
You have 16 trips through namei, 7 successful opens, 2 unsuccessful ones, and 213k of data read in.
The following numbers show memory sizes (virtual and resident) for v5.6.0 of Perl on four different operating systems. The three rows are: without any modules, with just -MCGI, and with just -MIO (never with both):
            OpenBSD      FreeBSD      Redhat       Solaris
            vsz   rss    vsz   rss    vsz   rss    vsz   rss
Raw Perl    736   772    832  1208   2412   980   2928  2272
w/ CGI     1220  1464   1308  1828   2972  1768   3616  3232
w/ IO      2292  2580   2456  3016   4080  2868   5384  4976
Anybody who's thinking of choosing one of these might do well to stare at those numbers for a while.
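The wc measurements above can be repeated from inside a script. The following is only a sketch: the helper name loaded_source_stats() is made up here, IO::Handle is just an example of a module to examine, and the result is the size of the source files on disk, not the memory actually consumed.

```perl
use strict;
use warnings;

# Count the files recorded in %INC and sum their on-disk sizes.
# Entries can be undef (a failed require) or references (@INC hooks),
# so those are skipped.
sub loaded_source_stats {
    my ($files, $bytes) = (0, 0);
    for my $path (values %INC) {
        next unless defined $path and !ref $path and -f $path;
        $files++;
        $bytes += -s _;    # reuse the stat() already done by -f
    }
    return ($files, $bytes);
}

my ($files_before, $bytes_before) = loaded_source_stats();
require IO::Handle;        # the module under examination (an example only)
my ($files_after, $bytes_after) = loaded_source_stats();

printf "IO::Handle pulled in %d more files (%d bytes of source)\n",
    $files_after - $files_before, $bytes_after - $bytes_before;
```

The numbers will differ between Perl versions and platforms, as the OpenBSD and Linux figures above already suggest.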
Apache::args versus Apache::Request::param
Let's write two registry scripts that use Apache::args and Apache::Request::param to process the form's input and print it out. Notice that Apache::args is considered identical to Apache::Request::param only when every key has a single value. With multi-valued keys (e.g. when using checkbox groups) you will have to write some more code, since if you do a simple:
%params = $r->args;
only the last value will be stored and the rest will be lost, a problem that you can solve with Apache::Request::param as:
@values = $q->param('key');
In addition Apache::Request has many more functions that ease input processing, like handling file uploads.
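If you stay with args() and still need multi-valued keys, the "some more code" mentioned above could look like the following sketch. The helper name collect_multivalued() and the sample data are made up; the flat list stands in for the key/value list that $r->args returns in list context, so that the example runs outside mod_perl.

```perl
use strict;
use warnings;

# Collect a flat (key, value, key, value, ...) list into a hash of
# array references, so that repeated keys keep all of their values.
sub collect_multivalued {
    my @pairs = @_;
    my %params;
    while (@pairs) {
        my ($key, $value) = splice @pairs, 0, 2;
        push @{ $params{$key} }, $value;
    }
    return \%params;
}

# Stand-in for what $r->args returns in list context for a query string
# such as "color=red&color=blue&size=XL" (assumed sample input).
my @args = (color => 'red', color => 'blue', size => 'XL');

my $params = collect_multivalued(@args);
print "$_ => @{ $params->{$_} }\n" for sort keys %$params;
```

This prints "color => red blue" and "size => XL", whereas assigning the same list to a plain hash would keep only "blue" for the color key.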
Therefore, assuming that the only functionality that you need is the parsing of key-value pairs, and assuming that every key has a single value, we will compare a slightly modified script from the previous section (apache_request.pl) with a new one that uses args():
benchmarks/apache_request.pl
----------------------------
use strict;
use Apache::Request ();
my $r = Apache->request;
my $q = Apache::Request->new($r);
$r->send_http_header('text/plain');
print join "\n", $q->param;
benchmarks/apache_args.pl
-------------------------
use strict;
my $r = Apache->request;
$r->send_http_header('text/plain');
print join "\n", $r->args;
Now let's benchmark the two:
% ab -n 1000 -c 10 \
'http://localhost/perl/benchmarks/apache_request.pl?a=b&c=k&d=asf&as=1'
Time taken for tests: 16.961 seconds
Requests per second: 58.96
Connnection Times (ms)
min avg max
Connect: 0 0 20
Processing: 150 168 343
Total: 150 168 363
% ab -n 1000 -c 10 \
'http://localhost/perl/benchmarks/apache_args.pl?a=b&c=k&d=asf&as=1'
Time taken for tests: 17.154 seconds
Requests per second: 58.30
Connnection Times (ms)
min avg max
Connect: 0 2 136
Processing: 68 168 202
Total: 68 170 338
Apparently the two run at the same speed.
Using $|=1 Under mod_perl and Better print() Techniques
As you know, local $|=1; disables the buffering of the currently selected file handle (the default is STDOUT). If you set it, ap_rflush() is called after each print(), unbuffering Apache's IO.
If you are using multiple print() calls (bad style for generating output) or if you just have too many of them, then you will experience a degradation in performance. The severity depends on the number of print() calls that you make.
Many old CGI scripts were written like this:
print "<BODY BGCOLOR=\"black\" TEXT=\"white\">";
print "<H1>";
print "Hello";
print "</H1>";
print "<A HREF=\"foo.html\"> foo </A>";
print "</BODY>";
This example has multiple print() calls, which will cause performance degradation with $|=1. It also uses too many backslashes. This makes the code less readable, and it is also more difficult to format the HTML so that it is easily readable as the script's output. The code below solves the problems:
print qq{
<BODY BGCOLOR="black" TEXT="white">
<H1>
Hello
</H1>
<A HREF="foo.html"> foo </A>
</BODY>
};
I guess you see the difference. Be careful though, when printing an <HTML> tag. The correct way is:
print qq{<HTML>
<HEAD></HEAD>
<BODY>
};
If you try the following:
print qq{
<HTML>
<HEAD></HEAD>
<BODY>
};
Some older browsers expect the first characters after the headers and empty line to be <HTML> with no spaces before the opening left angle-bracket. If there are any other characters, they might not accept the output as HTML and print it as plain text. Even if it works with your browser, it might not work for others.
One other approach is to use `here' documents, e.g.:
print <<EOT;
<HTML>
<HEAD></HEAD>
<BODY>
EOT
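When the page is assembled from several pieces, one way to keep the number of print() calls low is to collect the pieces first and print them in one go. This is only a sketch; the helper name render_rows() and the sample rows are made up for the illustration.

```perl
use strict;
use warnings;

# Build the whole fragment in memory and return it as one string.
sub render_rows {
    my @rows = @_;
    my @out = ("<TABLE>\n");
    push @out, "<TR><TD>$_</TD></TR>\n" for @rows;
    push @out, "</TABLE>\n";
    return join '', @out;
}

# One print() call, no matter how many rows there are.
print render_rows(qw(foo bar baz));
```

The price is that the whole fragment is held in memory until it is printed, so this suits small and medium sized pieces of output rather than very large result sets.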
Now let's go back to the $|=1 topic. I still disable buffering, for two reasons:
First, I use relatively few print() calls. I achieve this by arranging for my print() statements to print multiline HTML, and not one line per print() statement.
Second, I want my users to see the output immediately. So if I am about to produce the results of a DB query which might take some time to complete, I want users to get some text while they are waiting. This improves the usability of my site. Ask yourself which you like better: getting the output a bit slower, but steadily from the moment you've pressed the Submit button, or having to watch the "falling stars" for a while and then get the whole output at once, even if it's a few milliseconds faster - assuming the browser didn't time out during the wait.
An even better solution is to keep buffering enabled, and use the Perl API rflush() call to flush the buffers when needed. This way you can place the first part of the page that you are going to send to the user in the buffer, and flush it a moment before you are going to do some lengthy operation, like a DB query. So you kill two birds with one stone: you show some of the data to the user immediately, so she will feel that something is actually happening, and you have no performance hit from disabled buffering.
use CGI ();
my $r = shift;
my $q = new CGI;
print $q->header('text/html');
print $q->start_html;
print $q->p("Searching...Please wait");
$r->rflush;
# imitate a lengthy operation
for (1..5) {
    sleep 1;
}
print $q->p("Done!");
Conclusion: Do not blindly follow suggestions, but think what is best for you in each case.
Performance Oriented Perl Coding
One of the components of the mod_perl server's performance is the Perl code that you use. If you write code that runs slowly, the overall performance is slower. This section is intended to give you some hints to make your code, whose main purpose is to generate webpages, run faster. Bear in mind that the performance considerations might be totally different when you use Perl for other tasks.
Global vs Fully Qualified Variables
It's always a good idea to avoid global variables where possible. Some variables must be either global, such as a module's @ISA or $VERSION variables, or else fully qualified (such as @MyModule::ISA), so that Perl can see them.
A combination of the strict and vars pragmas keeps modules clean and reduces a bit of noise. However, the vars pragma also creates aliases, as does Exporter, which eat up more memory. When possible, try to use fully qualified names instead of use vars.
For example write:
package MyPackage;
use strict;
@MyPackage::ISA = qw(...);
$MyPackage::VERSION = "1.00";
instead of:
package MyPackage;
use strict;
use vars qw(@ISA $VERSION);
@ISA = qw(...);
$VERSION = "1.00";
Also see Using Global Variables and Sharing Them Between Modules/Packages.
Avoid Importing Functions
When possible, avoid importing a module's functions into your name space. The aliases which are created can take up quite a bit of memory. Try to use function interfaces and fully qualified names like Package::function or $Package::variable instead. For benchmarks see Object Methods Calls Versus Function Calls.
Note: method interfaces are a little bit slower than function calls. You can use the Benchmark module to profile your specific code.
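As a small illustration of the advice above, the sketch below loads the core POSIX module with an empty import list and calls one of its functions by its fully qualified name. POSIX was picked only because it is a core module with a long default export list; any module can be used the same way.

```perl
use strict;
use warnings;

# The empty list stops POSIX from exporting anything into our package.
use POSIX ();

# Call the function by its fully qualified name instead.
my $rounded = POSIX::floor(3.7);
print "floor(3.7) = $rounded\n";

# Nothing was aliased into main::, so our symbol table did not grow.
print "main::floor is ",
    (defined &main::floor ? "defined" : "not defined"), "\n";
```

With a plain use POSIX; the second line of output would say "defined", because floor() would have been aliased into the main:: namespace along with the rest of the default exports.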
Object Methods Calls Versus Function Calls
Which subroutine calling form is more efficient: OOP methods or functions?
The Overhead with Light Subroutines
Let's do some benchmarking. We will start doing it using empty methods, which will allow us to measure the real difference in the overhead each kind of call introduces. We will use this code:
bench_call1.pl
--------------
package Foo;
use strict;
use Benchmark;
sub bar { };
timethese(50_000, {
    method   => sub { Foo->bar()       },
    function => sub { Foo::bar('Foo'); },
});
The two calls are equivalent, since both pass the class name as their first parameter; function does this explicitly, while method does this transparently.
The benchmarking result:
Benchmark: timing 50000 iterations of function, method...
function: 0 wallclock secs ( 0.80 usr + 0.05 sys = 0.85 CPU)
method: 1 wallclock secs ( 1.51 usr + 0.08 sys = 1.59 CPU)
We are interested in the 'total CPU times' and not the 'wallclock seconds'. It's possible that the load on the system was different for the two tests while benchmarking, so the wallclock times give us no useful information.
We see that the method call is almost twice as slow as the function call: 1.59 CPU seconds compared to 0.85. Why does this happen? Because the difference between functions and methods is the time taken to resolve the pointer from the object, to find the module it belongs to and then the actual method. The function form has one parameter less to pass, fewer stack operations, and less time to get to the guts of the subroutine.
perl5.6+ does better method caching: Foo->method() is a little bit faster (some constant folding magic), but not Foo->$method(). And the improvement does not address the @ISA lookup that still happens in either case.
The Overhead with Heavy Subroutines
But that doesn't mean that you shouldn't use methods. Generally your functions do something, and the more they do, the less significant the time to perform the call becomes, because the calling time is effectively fixed and is probably a very small overhead in comparison to the execution time of the method or function itself. Therefore, the longer the execution time of the function, the smaller the relative overhead of the method call. The next benchmark proves this point:
bench_call2.pl
--------------
package Foo;
use strict;
use Benchmark;
sub bar {
    my $class = shift;
    my ($x,$y) = (100,100);
    $y = log ($x ** 10) for (0..20);
};
timethese(50_000, {
    method   => sub { Foo->bar()       },
    function => sub { Foo::bar('Foo'); },
});
The results are very close:
function: 33 wallclock secs (15.81 usr + 1.12 sys = 16.93 CPU)
method: 32 wallclock secs (18.02 usr + 1.34 sys = 19.36 CPU)
Let's make the subroutine bar even slower:
sub bar {
    my $class = shift;
    my ($x,$y) = (100,100);
    $y = log ($x ** 10) for (0..40);
};
And the result is amazing: the method call form was faster than the function form:
function: 81 wallclock secs (25.63 usr + 1.84 sys = 27.47 CPU)
method: 61 wallclock secs (19.69 usr + 1.49 sys = 21.18 CPU)
In case your functions do very little, like the functions that generate HTML tags in CGI.pm, the overhead might become a significant one. If your goal is speed you might consider using the function form, but if you write a big and complicated application, it's much better to use the method form, as it will make your code easier to develop, maintain and debug, saving programmer time which, over the life of a project, may turn out to be the most significant cost factor.
Are All Methods Slower than Functions?
Some modules' APIs are misleading. For example, CGI.pm allows you to execute its subroutines as functions or as methods. As you will see in a moment, its function form of the calls is slower than the method form, because it does some voodoo work when the function form is used.
use CGI;
my $q = new CGI;
$q->param('x',5);
my $x = $q->param('x');
versus
use CGI qw(:standard);
param('x',5);
my $x = param('x');
As usual, let's benchmark some very light calls and compare. Ideally we would expect the methods to be slower than functions based on the previous benchmarks:
bench_call3.pl
---------------
use Benchmark;
use CGI qw(:standard);
$CGI::NO_DEBUG = 1;
my $q = new CGI;
my $x;
timethese(20000, {
    method   => sub { $q->param('x',5); $x = $q->param('x'); },
    function => sub {     param('x',5); $x =     param('x'); },
});
The benchmark is written in such a way that all the initializations are done at the beginning, so that we get as accurate performance figures as possible. Let's do it:
% ./bench_call3.pl
function: 51 wallclock secs (28.16 usr + 2.58 sys = 30.74 CPU)
method: 39 wallclock secs (21.88 usr + 1.74 sys = 23.62 CPU)
As we can see, methods are faster than functions, which seems to be wrong. The explanation lies in the way CGI.pm is implemented. CGI.pm uses some fancy tricks to make the same routine act both as a method and as a plain function. The overhead of checking whether the argument list looks like a method invocation or not masks the slight difference in time for the way the function was called.
If you are intrigued and want to investigate further by yourself, the subroutine you want to explore is called self_or_default. The first line of this function short-circuits if you are using the object methods, but the whole function is called if you are using the functional forms. Therefore, the functional form should be slightly slower than the object form.
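To get a feeling for the trick without reading the CGI.pm source, here is a deliberately simplified imitation. It is not CGI.pm's actual code: the package My::Query and everything in it are made up for this sketch, and only the idea (short-circuit for objects, extra work for the function form) is borrowed from the description above.

```perl
use strict;
use warnings;

package My::Query;

my $default;    # object used behind the scenes by the function form

sub new {
    my $class = shift;
    return bless { params => {} }, $class;
}

# Return the object to work on, followed by the remaining arguments.
sub self_or_default {
    # Method form: the first argument is already one of our objects,
    # so we can return at once.
    return @_ if ref($_[0]) eq __PACKAGE__;
    # Function form: create the default object on first use and
    # put it in front of the arguments.
    $default ||= __PACKAGE__->new;
    return ($default, @_);
}

sub param {
    my ($self, $name, $value) = self_or_default(@_);
    $self->{params}{$name} = $value if defined $value;
    return $self->{params}{$name};
}

package main;

my $q = My::Query->new;
$q->param(x => 5);                 # method form
My::Query::param(y => 7);          # function form, uses the default object

print "method:   ", $q->param('x'), "\n";
print "function: ", My::Query::param('y'), "\n";
```

Every call pays for the check in self_or_default(), and the function form pays a little more, which is the effect seen in the benchmark above.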
Imported Symbols and Memory Usage
There is a real memory hit when you import all of the functions into your process' memory. This can significantly enlarge memory requirements, particularly when there are many child processes.
In addition to polluting the namespace, when a process imports symbols from any module or any script it grows by the size of the space allocated for those symbols. The more you import (e.g. qw(:standard) vs qw(:all)), the more memory will be used. Let's say the overhead is of size X. Now take the number of scripts in which you deploy the function interface, and call that Y. Finally, let's say that you have a number of processes equal to Z.
You will need X*Y*Z of additional memory. Taking X=10k, Y=10, Z=30, we get 10k*10*30 = 3Mb! Now you understand the difference.
Let's benchmark CGI.pm using GTop.pm. First we will try it with no exporting at all.
use GTop ();
use CGI ();
print GTop->new->proc_mem($$)->size;
1,949,696
Now exporting a few dozen symbols:
use GTop ();
use CGI qw(:standard);
print GTop->new->proc_mem($$)->size;
1,966,080
And finally exporting all the symbols (about 130):
use GTop ();
use CGI qw(:all);
print GTop->new->proc_mem($$)->size;
1,970,176
Results:
import symbols   size(bytes)   delta(bytes) relative to ()
----------------------------------------------------------
()                   1949696              0
qw(:standard)        1966080          16384
qw(:all)             1970176          20480
So in my example above X=20k => 20K*10*30 = 6Mb. You will need 6Mb more when importing all of CGI.pm's symbols than when you import none at all.
Generally you use more than one script, run more than one process and probably import more symbols from the additional modules that you deploy. So the real numbers are much bigger.
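The X*Y*Z estimate used above is easy to replay with your own numbers. The function name import_overhead_kb() is made up for this sketch, and like the text it treats every copy of the imported symbols as unshared, so take the result as a rough upper estimate.

```perl
use strict;
use warnings;

# Extra memory = per-import overhead (X) * scripts (Y) * processes (Z).
sub import_overhead_kb {
    my ($kb_per_import, $scripts, $processes) = @_;
    return $kb_per_import * $scripts * $processes;
}

# The first estimate from the text: X=10k, Y=10, Z=30.
printf "X=10k: %d kB\n", import_overhead_kb(10, 10, 30);

# The same with the measured qw(:all) delta of about 20k.
printf "X=20k: %d kB\n", import_overhead_kb(20, 10, 30);
```

This prints 3000 kB and 6000 kB, i.e. the 3Mb and 6Mb quoted above.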
The function form is faster in the general case, because of the time overhead of resolving the pointer from the object.
If you are looking for performance improvements, you will have to face the fact that having to type the fully qualified My::Module::my_method might save you a good chunk of memory, provided that the subroutine does not have to be called with a reference to an object; even then, the object can be passed explicitly as an argument.
I strongly endorse Apache::Request (libapreq) - Generic Apache Request Library. Its core is written in C, giving it a significant memory and performance benefit. It has all the functionality of CGI.pm except the HTML generation functions.
Concatenation or List
When the strings are small, it almost doesn't matter whether concatenation or a list is used:
use Benchmark;
open my $fh, '>', '/dev/null';
my($one, $two, $three, $four) = ('a'..'d');
timethese(500_000, {
    concat => sub {
        print $fh "$one$two$three$four";
    },
    list => sub {
        print $fh $one, $two, $three, $four;
    },
});
Benchmark: timing 500000 iterations of concat, list...
concat: 8 wallclock secs ( 6.63 usr + 0.04 sys = 6.67 CPU)
list: 8 wallclock secs ( 6.49 usr + 0.01 sys = 6.50 CPU)
When the strings are big, lists are faster:
use Benchmark;
open my $fh, '>', '/dev/null';
my($one, $two, $three, $four) = map { $_ x 1000 } ('a'..'d');
timethese(100_000, {
    concat => sub {
        print $fh "$one$two$three$four";
    },
    list => sub {
        print $fh $one, $two, $three, $four;
    },
});
Benchmark: timing 100000 iterations of concat, list...
concat: 13 wallclock secs (11.88 usr + 0.51 sys = 12.39 CPU)
list: 11 wallclock secs (10.13 usr + 0.21 sys = 10.34 CPU)
A list is almost 17% faster than concatenation.
Also, when you use "string" you use interpolation (since "" is an operator in Perl), which turns into concatenation, which uses more memory and is slower than using a list. When you use 'string' there is no interpolation, therefore it's faster, and you have to use a list if you need to pass more than one argument.
There will be exceptions, like "string\n" where you cannot use single quotes. But if you do 'string',"\n" readability gets hurt. And we want our code to be readable and maintainable.
[ReaderMETA]: Please send more mod_perl relevant Perl performance hints
Cached stat() Calls by Perl
When you do a stat() (or one of its variations: -M for the last modification time, -A for the last access time, -C for the last inode-change time, etc.), the information is cached. If you need to make an additional check for the same file, use the _ magic filehandle and save the overhead of an unnecessary stat() call. For example, when testing for existence and read permissions you might use:
my $filename = "./test";
# three stat() calls
print "OK\n" if -e $filename and -r $filename;
my $mod_time = (-M $filename) * 24 * 60 * 60;
print "$filename was modified $mod_time seconds before startup\n";
or the more efficient:
my $filename = "./test";
# one stat() call
print "OK\n" if -e $filename and -r _;
my $mod_time = (-M _) * 24 * 60 * 60;
print "$filename was modified $mod_time seconds before startup\n";
Two stat() syscalls saved!
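The same idea can be wrapped in a small helper that collects several facts about a file while asking the operating system only once. This is a sketch: the helper name file_facts() is made up, and a temporary file created with the core File::Temp module stands in for a real file.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Gather several facts about a file with a single stat() call.
sub file_facts {
    my ($filename) = @_;
    return unless -e $filename;            # the only real stat()
    return {
        readable => ((-r _) ? 1 : 0),      # answered from the cached data
        size     => -s _,
        age_days => -M _,
    };
}

# Scratch file for the demonstration, removed again at exit.
my ($fh, $filename) = tempfile(UNLINK => 1);
print $fh "hello\n";
close $fh;

my $facts = file_facts($filename);
printf "readable=%d size=%d\n", @{$facts}{qw(readable size)};
```

For the six-byte scratch file this prints "readable=1 size=6"; for a file that does not exist the helper returns nothing at all.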
Apache::Registry and Derivatives Specific Notes
These are the sections that deal solely with Apache::Registry and derived modules, like Apache::PerlRun and Apache::RegistryBB. No Perl handlers code is discussed here, so if you don't use these modules, feel free to skip this section.
Be careful with symbolic links
As you know, Apache::Registry caches the scripts based on their URI. If you have the same script that can be reached by different URIs, which is possible if you have used symbolic links, you will get the same script cached twice!
For example:
% ln -s /home/httpd/perl/news/news.pl /home/httpd/perl/news.pl
Now the script can be reached through both URIs, /news/news.pl and /news.pl. It doesn't really matter until you advertise the two URIs and users reach the same script from both of them.
To detect this, use the /perl-status handler to see all the compiled scripts and their packages. In our example, when requesting: http://localhost/perl-status?rgysubs you would see:
Apache::ROOT::perl::news::news_2epl
Apache::ROOT::perl::news_2epl
after both the URIs have been requested from the same child process that happened to serve your request. To make debugging easier, run the server in single-process mode.
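You can also look for such aliases on the filesystem before they reach the server. The sketch below is not part of Apache::Registry; the helper name find_aliases() is made up, and the demonstration builds its own scratch directory that mirrors the news.pl example. It assumes a platform that supports symbolic links.

```perl
use strict;
use warnings;
use Cwd qw(abs_path);
use File::Temp qw(tempdir);

# Group paths by the real file they resolve to and return only the
# groups that contain more than one path.
sub find_aliases {
    my @paths = @_;
    my %by_target;
    for my $path (@paths) {
        my $real = abs_path($path);
        next unless defined $real;       # skip paths that cannot be resolved
        push @{ $by_target{$real} }, $path;
    }
    my %aliases;
    for my $real (keys %by_target) {
        $aliases{$real} = $by_target{$real} if @{ $by_target{$real} } > 1;
    }
    return \%aliases;
}

# Scratch directory that mirrors the example above.
my $root = tempdir(CLEANUP => 1);
mkdir "$root/news" or die "mkdir: $!";
open my $fh, '>', "$root/news/news.pl" or die "open: $!";
close $fh;
symlink "$root/news/news.pl", "$root/news.pl" or die "symlink: $!";

my $aliases = find_aliases("$root/news/news.pl", "$root/news.pl");
print "$_ is reachable as: @{ $aliases->{$_} }\n" for keys %$aliases;
```

In a real setup you would feed the function the list of script paths under your registry directories, for example collected with File::Find.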
Improving Performance by Prevention
There are two ways to improve performance: one is by tuning to squeeze the most out of your hardware and software; and the other is preventing certain bad things from happening, e.g. memory leaks, unshared memory, Denial of Service (DoS) attacks, etc.
Memory leakage
Scripts under mod_perl can very easily leak memory! Global variables stay around indefinitely, while lexically scoped variables (declared with my()) are destroyed when they go out of scope, provided there are no references to them from outside that scope.
Perl doesn't return the memory it acquired from the kernel. It does reuse it though!
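The different lifetimes of lexical and global variables can be made visible with a destructor that counts how often it runs. Everything in this sketch (the Tracker package, the handle_request() function) is made up for the illustration; handle_request() plays the role of a script that is executed once per request in a long-lived process.

```perl
use strict;
use warnings;

package Tracker;

my $destroyed = 0;
sub new       { my $class = shift; return bless {}, $class }
sub DESTROY   { $destroyed++ }
sub destroyed { return $destroyed }

package main;

our $global;                       # package global: survives the "request"

sub handle_request {
    my $lexical = Tracker->new;    # freed as soon as the sub returns
    $global     = Tracker->new;    # frees only the previously stored object
    return;
}

handle_request() for 1 .. 3;

# Three lexicals were destroyed, plus the two globals that were overwritten.
print "destroyed so far: ", Tracker->destroyed, "\n";
```

This prints "destroyed so far: 5": the object stored in $global by the last call is still alive, and under mod_perl it would stay alive until the next request served by that child overwrites it or the child exits.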
Reading In A Whole File
open IN, $file or die $!;
local $/ = undef; # will read the whole file in
$content = <IN>;
close IN;
If your file is 5Mb, the child which served that script will grow by exactly that size. Now if you have 20 children, and all of them will serve this CGI, they will consume 20*5M = 100M of RAM in total! If that's the case, try to use other approaches to processing the file, if possible. Try to process a line at a time and print it back to the file. If you need to modify the file itself, use a temporary file. When finished, overwrite the source file. Make sure you use a locking mechanism!
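The line-at-a-time approach described above could be sketched like this. The function name rewrite_file() and the sample transformation are made up; the lock is an advisory flock() on the source file, which helps only if every program that touches the file takes the same lock.

```perl
use strict;
use warnings;
use Fcntl qw(:flock);
use File::Basename qw(dirname);
use File::Temp qw(tempfile);

# Rewrite $file one line at a time, so only a single line is held in
# memory. $transform is a code ref that gets a line and returns the
# new line.
sub rewrite_file {
    my ($file, $transform) = @_;

    open my $in, '+<', $file or die "Cannot open $file: $!";
    flock($in, LOCK_EX)      or die "Cannot lock $file: $!";

    # Temporary file in the same directory, so that rename() does not
    # have to cross filesystems.
    my ($out, $tmp) = tempfile(DIR => dirname($file));

    while (my $line = <$in>) {
        print {$out} $transform->($line) or die "Cannot write $tmp: $!";
    }
    close $out or die "Cannot close $tmp: $!";

    # Overwrite the source file only when everything has been written.
    rename $tmp, $file or die "Cannot replace $file: $!";
    close $in;               # releases the lock
    return;
}

# Demonstration on a scratch file.
my ($fh, $demo) = tempfile(UNLINK => 1);
print $fh "foo 1\nfoo 2\n";
close $fh;

rewrite_file($demo, sub { my $line = shift; $line =~ s/foo/bar/; $line });
```

Note that the replaced file inherits the restrictive permissions of the temporary file, so a production version would probably want to copy the original mode over before the rename.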
Copying Variables Between Functions
Now let's talk about passing variables by value. Let's use the example above, assuming we have no choice but to read the whole file before any data processing takes place. Now you have some imaginary process() subroutine that processes the data and returns it. What happens if you pass $content by value? You have just copied another 5M and the child has grown in size by another 5M. Watch your swap space! Now multiply it again by a factor of 20 and you have 200M of wasted RAM, which will apparently be reused, but it's a waste! Whenever you think the variable can grow bigger than a few Kb, pass it by reference!
Once I wrote a script that passed the contents of a little flat file database to a function that processed it by value -- it worked and it was fast, but after a time the database became bigger, so passing it by value was expensive. I had to make the decision whether to buy more memory or to rewrite the code. It's obvious that adding more memory will be merely a temporary solution. So it's better to plan ahead and pass variables by reference, if a variable you are going to pass might eventually become bigger than you envisage at the time you code the program. There are a few approaches you can use to pass and use variables passed by reference. For example:
my $content = qq{foobarfoobar};
process(\$content);
sub process{
    my $r_var = shift;
    $$r_var =~ s/foo/bar/gs;
    # nothing returned - the variable $content outside has already
    # been modified
}
If you work with arrays or hashes it's:
@{$var_lr} dereferences an array
%{$var_hr} dereferences a hash
We can still access individual elements of arrays and hashes that we have a reference to without dereferencing them:
$var_lr->[$index] get $index'th element of an array via a ref
$var_hr->{$key} get $key'th element of a hash via a ref
For more information see perldoc perlref.
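Put together, a subroutine that receives an array and a hash by reference might look like the following sketch. The function name summarize() and the sample data are made up; the variable names follow the $var_lr and $var_hr convention used above.

```perl
use strict;
use warnings;

# Both arguments are references, so neither structure is copied.
sub summarize {
    my ($var_lr, $var_hr) = @_;
    my $total = 0;
    $total += $_ for @{$var_lr};              # dereference the whole array
    my @keys = sort keys %{$var_hr};          # dereference the whole hash
    return sprintf "first=%s total=%d keys=%s color=%s",
        $var_lr->[0],                         # single element via the ref
        $total,
        join(',', @keys),
        $var_hr->{color};                     # single value via the ref
}

my @sizes = (10, 20, 30);
my %attr  = (color => 'red', shape => 'round');
print summarize(\@sizes, \%attr), "\n";
```

This prints "first=10 total=60 keys=color,shape color=red".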
Another approach would be to use the @_ array directly. This has the effect of passing by reference:
process($content);
sub process{
    $_[0] =~ s/foo/bar/gs;
    # nothing returned - the variable $content outside has been
    # already modified
}
From perldoc perlsub:
The array @_ is a local array, but its elements are aliases for
the actual scalar parameters. In particular, if an element
$_[0] is updated, the corresponding argument is updated (or an
error occurs if it is not possible to update)...
Be careful when you write this kind of subroutine, since it can confuse a potential user. It's not obvious that a call like process($content); modifies the passed variable. Programmers (the users of your library in this case) are used to subroutines that either modify variables passed by reference or expressly return a result (e.g. $content=process($content);).
Work With Databases
If you do some DB processing, you will often encounter the need to read lots of records into your program, and then print them to the browser after they are formatted. I won't even mention the horrible case where programmers read in the whole DB and then use Perl to process it!!! Use a relational DB and let the SQL do the job, so you get only the records you need!
We will use DBI for this (assume that we are already connected to the DB; refer to perldoc DBI for a complete reference to the DBI module):
$sth->execute;
while (@row_ary = $sth->fetchrow_array) {
    # do DB accumulation into some variable
}
# print the output using the data returned from the DB
In the example above the httpd process will grow by the size of the variables that have been allocated for the records that matched the query. Again, remember to multiply it by the number of children your server runs!
A better approach is not to accumulate the records, but rather to print them as they are fetched from the DB. Moreover, we will use the bind_columns() and $sth->fetchrow_arrayref() (aliased to $sth->fetch()) methods to fetch the data in the fastest possible way. The example below prints an HTML table with the matched data. The only memory that is being used is a @cols array to hold temporary row values:
my @select_fields = qw(a b c);
# create a list of cols values
my @cols = ();
@cols[0..$#select_fields] = ();
$sth = $dbh->prepare($do_sql);
$sth->execute;
# Bind perl variables to columns.
$sth->bind_columns(undef,\(@cols));
print "<TABLE>";
while($sth->fetch) {
    print "<TR>",
        map("<TD>$_</TD>", @cols),
        "</TR>";
}
print "</TABLE>";
Note: the above method doesn't allow you to know how many records have been matched. The workaround is to run an identical query before the code above, using SELECT count(*) ... instead of SELECT * ..., to get the number of matched records. It should be much faster, since you can remove any ORDER BY and similar clauses.
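A counting query of this kind could be wrapped as in the sketch below. The function name count_matches() is made up, $dbh is assumed to be a connected DBI database handle, and the table name and WHERE clause are assumed to come from your own code rather than from user input. The only DBI call used is selectrow_array().

```perl
use strict;
use warnings;

# Return the number of rows in $table that match $where.
# $dbh:   a connected DBI database handle (assumed)
# $where: an optional WHERE clause with "?" placeholders
# @bind:  the values for those placeholders
sub count_matches {
    my ($dbh, $table, $where, @bind) = @_;
    my $sql = "SELECT COUNT(*) FROM $table";
    $sql .= " WHERE $where" if defined $where and length $where;
    my ($count) = $dbh->selectrow_array($sql, undef, @bind);
    return $count;
}

# Hypothetical usage, with made-up table and column names:
#   my $matched = count_matches($dbh, 'access_log', 'user = ?', 'stas');
```

Keep in mind that the count can be out of date by the time the main query runs if other clients modify the table in between.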
For those who think that $sth->rows will do the job, here is the quote from the DBI manpage:
rows();
$rv = $sth->rows;
Returns the number of rows affected by the last database altering
command, or -1 if not known or not available. Generally you can
only rely on a row count after a do or non-select execute (for some
specific operations like update and delete) or after fetching all
the rows of a select statement.
For select statements it is generally not possible to know how many
rows will be returned except by fetching them all. Some drivers
will return the number of rows the application has fetched so far
but others may return -1 until all rows have been fetched. So use of
the rows method with select statements is not recommended.
As a bonus, I wanted to write a single sub that flexibly processes any query. It would accept conditions, a call-back closure sub, select fields and restrictions.
# Usage:
# $o->dump(\%conditions,\&callback_closure,\@select_fields,@restrictions);
#
sub dump {
  my $self          = shift;
  my %param         = %{+shift};  # dereference the conditions hash
  my $rsub          = shift;      # callback closure
  my @select_fields = @{+shift};  # dereference the list of fields
  my @restrict      = @_;         # remaining args, e.g. DISTINCT

  # create a list of cols values
  my @cols = ();
  @cols[0..$#select_fields] = ();

  # make a @where list; use placeholders so that the driver quotes
  # the values instead of interpolating them into the SQL
  my (@where, @bind);
  for my $key (sort keys %param) {
    next unless defined $param{$key} and length $param{$key};
    push @where, "$key = ?";
    push @bind,  $param{$key};
  }

  # prepare the sql statement
  my $do_sql = "SELECT ";
  $do_sql .= join(" ", @restrict) . " " if @restrict; # restriction list
  $do_sql .= join(",", @select_fields);               # select list
  $do_sql .= " FROM $DBConfig{TABLE}";                # from table
  # we will not add the WHERE clause if @where is empty
  $do_sql .= " WHERE " . join(" AND ", @where) if @where;

  print "SQL: $do_sql \n" if $debug;

  $dbh->{RaiseError} = 1; # do this, or check every call for errors
  my $sth = $dbh->prepare($do_sql);
  $sth->execute(@bind);

  # Bind perl variables to columns.
  $sth->bind_columns(\(@cols));

  while ($sth->fetch) {
    $rsub->(@cols);
  }

  # print the tail or "no records found" message
  # according to the previous calls
  $rsub->();
} # end of sub dump
Now a callback closure sub can do lots of things. We need a closure in order to know what stage we are in: header, body or tail. For example, we want a callback closure for formatting the rows to print:
my $rsub = eval {
# make a copy of @fields list, since it might go
# out of scope when this closure is called
my @fields = @fields;
my @query_fields = qw(user dir tool act); # no date field!!!
my $header = 0;
my $tail = 0;
my $counter = 0;
my %cols = (); # columns name=> value hash
# Closure with the following behavior:
# 1. Header's code will be executed on the first call only and
# if @_ was set
# 2. Row's printing code will be executed on every call with @_ set
# 3. Tail's code will be executed only if Header's code was
# printed and @_ isn't set
# 4. "No record found" code will be executed if Header's code
# wasn't executed
sub {
# Header
if (@_ and !$header){
print "<TABLE>\n";
print $q->Tr(map{ $q->td($_) } @fields );
$header = 1;
}
# Body
if (@_) {
print $q->Tr(map{$q->td($_)} @_ );
$counter++;
return;
}
# Tail, will be printed only at the end
if ($header and !($tail or @_)){
print "</TABLE>\n $counter records found";
$tail = 1;
return;
}
# No record found
unless ($header){
print $q->p($q->center($q->b("No record was found!\n")));
}
} # end of sub {}
}; # end of my $rsub = eval {
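The same staging logic can be tried without CGI or a database. The sketch below is a simplification under stated assumptions: make_formatter is a hypothetical name, and the closure returns tab-separated text instead of printing HTML, so its behaviour is easy to check.

```perl
use strict;
use warnings;

# Hypothetical factory: returns a closure that remembers whether the
# header has been emitted and how many rows it has seen.
sub make_formatter {
    my @fields      = @_;   # private copy of the column names
    my $header_done = 0;
    my $counter     = 0;

    return sub {
        my @row = @_;
        if (@row) {
            my $out = '';
            unless ($header_done) {          # header: first row only
                $out .= join("\t", @fields) . "\n";
                $header_done = 1;
            }
            $counter++;                      # body: every row
            return $out . join("\t", @row) . "\n";
        }
        # called with no arguments: tail, or the "no records" message
        return $header_done ? "$counter records found\n"
                            : "No record was found!\n";
    };
}
```

Each call to make_formatter() yields an independent closure, so two queries formatted in the same request do not share their header flag or counter.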
You might also want to check the sections Limiting the Size of the Processes and Limiting the Resources Used by httpd Children.
Limiting the Size of the Processes
Apache::SizeLimit
allows you to kill off Apache httpd processes if they grow too large.
Configuration:
In your startup.pl:
use Apache::SizeLimit;
$Apache::SizeLimit::MAX_PROCESS_SIZE = 10000;
# in KB, so this is 10MB
In your httpd.conf:
PerlFixupHandler Apache::SizeLimit
See perldoc Apache::SizeLimit
for more details.
By using this module, you should be able to avoid using the Apache configuration directive MaxRequestsPerChild
, although for some folks, using both in combination does the job.
Keeping the Shared Memory Limit
Apache::GTopLimit
module allows you to kill off Apache httpd processes if they grow too large (just like Apache::SizeLimit
) or have too little shared memory left. See Apache::GTopLimit.
Limiting the Resources Used by httpd Children
Apache::Resource
uses the BSD::Resource
module, which in turn uses the C function setrlimit()
to set limits on system resources such as memory and cpu usage.
To configure:
PerlModule Apache::Resource
# set child memory limit in megabytes
# (default is 64 Meg)
PerlSetEnv PERL_RLIMIT_DATA 32:48
# set child CPU limit in seconds
# (default is 360 seconds)
PerlSetEnv PERL_RLIMIT_CPU 120
PerlChildInitHandler Apache::Resource
If you configure Apache::Status
, it will let you review the resources set in this way.
The following limit values are in megabytes: DATA
, RSS
, STACK
, FSIZE
, CORE
, MEMLOCK
; all others are treated as their natural unit. Prepend PERL_RLIMIT_
for each one you want to use. Refer to the setrlimit
man page on your OS for other possible resources.
A resource limit is specified as a soft limit and a hard limit. When a soft limit is exceeded a process may receive a signal (for example, if the CPU time or file size is exceeded), but it will be allowed to continue execution until it reaches the hard limit (or modifies its resource limit). The rlimit structure is used to specify the hard and soft limits on a resource. (See the manpage for setrlimit for your OS specific information.)
If the value of the variable is of the form S:H
, S
is treated as the soft limit, and H
is the hard limit. If it is just a single number, it is used for both soft and hard limits.
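The S:H rule can be illustrated with a few lines of Perl. This is only a sketch of the rule as described above, not the code Apache::Resource itself uses, and parse_rlimit is a hypothetical name.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of Apache::Resource: split a limit
# specification into (soft, hard) following the S:H rule.
sub parse_rlimit {
    my ($spec) = @_;
    my ($soft, $hard) = split /:/, $spec, 2;
    $hard = $soft unless defined $hard;   # single number: use it for both
    return ($soft, $hard);
}
```

So a value of 32:48 yields a soft limit of 32 and a hard limit of 48, while a plain 120 yields 120 for both.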
OS Specific notes
Note that under Linux malloc() uses mmap() instead of brk(). This is done to conserve virtual memory - that is, when you malloc a large block of memory, it isn't actually given to your program until you initialize it. The old-style brk() syscall obeyed resource limits on data segment size as set in setrlimit() - mmap() doesn't.
Apache::Resource
's defaults put caps on data size and stack size. Linux's current memory allocation scheme doesn't honor these limits, so if you just do
PerlSetEnv PERL_RLIMIT_DEFAULTS On
PerlModule Apache::Resource
PerlChildInitHandler Apache::Resource
then your Apache processes are still free to use as much memory as they like.
However, BSD::Resource
also has a limit called RLIMIT_AS
(Address Space) which limits the total number of bytes of virtual memory assigned to a process. Happily, Linux's memory manager does honor this limit.
Therefore, you can limit memory usage under Linux with Apache::Resource
-- simply add a line to httpd.conf:
PerlSetEnv PERL_RLIMIT_AS 67108864
This example sets both the hard and the soft limit to 64MB of total address space. (RLIMIT_AS is not one of the limits given in megabytes, so the value is in bytes.)
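Since the address space limit is given in bytes, the figure has to be worked out by hand. A tiny sketch of the arithmetic, taking one megabyte as 1024 * 1024 bytes (mb_to_bytes is a hypothetical helper name):

```perl
use strict;
use warnings;

# Convert a size in megabytes to bytes (1 MB taken as 1024 * 1024 bytes).
sub mb_to_bytes {
    my ($mb) = @_;
    return $mb * 1024 * 1024;
}
```

mb_to_bytes(64) gives 67108864, the value used in the PerlSetEnv line above.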
Debug
To debug add:
<Perl>
$Apache::Resource::Debug = 1;
require Apache::Resource;
</Perl>
PerlChildInitHandler Apache::Resource
and look in the error_log to see what it's doing.
Refer to perldoc Apache::Resource
and man 2 setrlimit
for more info.
Limiting the Number of Processes Serving the Same Resource
If you want to limit the number of Apache children that can simultaneously serve the (nearly) same resource, you should take a look at the mod_throttle_access
module.
It solves the problem of too many concurrent requests accessing the same URI when, for example, the resource is very CPU intensive. Suppose you have three base URIs:
/perl/news/
/perl/webmail/
/perl/morphing/
The first two URIs are response critical, as people want to read news and their email. The third URI is a very CPU and RAM intensive image morphing service, provided as a bonus to your users. Since you don't want users to abuse this service, you had better set some limits on the number of concurrent requests for this resource. If you don't, the other two critical resources can be hurt.
The following setting:
<Location "/perl/morphing">
<Limit PUT GET>
MaxConcurrentReqs 10
</Limit>
</Location>
will allow only 10 concurrent PUT or GET requests under the URI /perl/morphing to be processed at one time.
Limiting the Request Rate Speed (Robot Blocking)
A limitation of using pattern matching to identify robots is that it only catches the robots that you know about, and then only those that identify themselves by name. A few devious robots masquerade as users by using user agent strings that identify themselves as conventional browsers. To catch such robots, you'll have to be more sophisticated.
Apache::SpeedLimit
comes to your aid, see:
http://www.modperl.com/chapters/ch6.html#Blocking_Greedy_Clients
Perl Modules for Performance Improvement
These sections are about Perl modules that improve performance without requiring changes to your code. Mostly you just need to tweak the configuration file to plug these modules in.
Sending Plain HTML as Compressed Output
See Apache::GzipChain - compress HTML (or anything) in the OutputChain
Caching Components with HTML::Mason
META: complete the full description
HTML::Mason
is a system that makes use of components to build HTML pages.
If most of your output is generated dynamically, but each finished page can be separated into different components, HTML::Mason
can cache those components. This can really improve the performance of your service and reduce the load on the system.
Say for example that you have a page consisting of five components, each generated by a different SQL query. For four of the five components the queries are the same for every user, so you don't have to rerun them again and again. Only one component is generated by a unique query and will not use the cache.
META: HTML::Mason docs (v 8.0) said Mason was 2-3 times slower than pure mod_perl, implying that the power & convenience made up for this.
META: Should also mention Embperl (especially since its C + XS)
Efficient Work with Databases under mod_perl
Most mod_perl enabled servers work with database engines, so in this section we will learn about two things: how mod_perl makes working with databases faster, and a few tips for more efficient DBI coding in Perl. (DBI provides an identical Perl interface to many database implementations.)
Persistent DB Connections
Another popular use of mod_perl is to take advantage of its ability to maintain persistent open database connections. The basic approach is as follows:
# Apache::Registry script
-------------------------
use strict;
use vars qw($dbh);
$dbh ||= SomeDbPackage->connect(...);
Since $dbh
is a global variable for the child, once the child has opened the connection it will use it over and over again, unless you perform disconnect()
.
Be careful to use different names for the handles if you open connections to different databases!
Apache::DBI
allows you to make a persistent database connection. With this module enabled, every connect()
request to the plain DBI
module will be forwarded to the Apache::DBI
module. This looks to see whether a database handle from a previous connect()
request has already been opened, and if this handle is still valid using the ping method. If these two conditions are fulfilled it just returns the database handle. If there is no appropriate database handle or if the ping method fails, a new connection is established and the handle is stored for later re-use. There is no need to delete the disconnect()
statements from your code. They will not do anything, because the Apache::DBI
module overloads the disconnect()
method with a NOP. When a child exits there is no explicit disconnect, the child dies and so does the database connection. You may leave the use DBI;
statement inside the scripts as well.
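The caching idea described above can be sketched in plain Perl. This is not the code of Apache::DBI, only an illustration of the principle under stated assumptions: Demo::Handle is a hypothetical stand-in for a database handle, and Demo::ConnCache is a hypothetical per-process cache keyed on the connect arguments.

```perl
use strict;
use warnings;

package Demo::Handle;      # hypothetical stand-in for a database handle
sub new  { my ($class, %args) = @_; return bless { alive => 1, %args }, $class }
sub ping { return $_[0]{alive} }

package Demo::ConnCache;   # hypothetical per-process connection cache
my %cache;                 # one entry per distinct set of connect arguments
our $opened = 0;           # how many real connections were opened

sub connect {
    my ($class, $dsn, $user, $pass) = @_;
    my $key = join "\0", $dsn, $user, $pass;
    my $handle = $cache{$key};
    # Reuse the handle only if it exists and still answers the ping.
    return $handle if $handle && $handle->ping;
    $opened++;
    return $cache{$key} = Demo::Handle->new(dsn => $dsn);
}

package main;
```

Two calls with the same arguments return the same handle and open only one connection. A handle that stops answering ping() is replaced on the next call.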
The usage is simple -- add to httpd.conf
:
PerlModule Apache::DBI
It is important to load this module before any other DBI
, DBD::*
and ApacheDBI*
modules!
db.pl
------------
use DBI;
use strict;
my $dbh = DBI->connect( 'DBI:mysql:database', 'user', 'password',
{ AutoCommit => 0 }
) || die $DBI::errstr;
...rest of the program
Preopening Connections at the Child Process' Fork Time
If you use DBI
for DB connections, and you use Apache::DBI
to make them persistent, it also allows you to preopen connections to the DB for each child with the connect_on_init()
method, thus saving a connection overhead on the very first request of every child.
use Apache::DBI ();
Apache::DBI->connect_on_init("DBI:mysql:test",
"login",
"passwd",
{
RaiseError => 1,
PrintError => 0,
AutoCommit => 1,
}
);
This is a simple way to have Apache children establish connections on server startup. This call should be in a startup file require()d
by PerlRequire
or inside a <Perl> section. It will establish a connection when a child is started in that child process. See the Apache::DBI
manpage for the requirements for this method.
Caching prepare() Statements
You can also benefit from persistent connections by replacing prepare() with prepare_cached(). That way you will always be sure that you have a good statement handle and you will get some caching benefit. The downside is that you are going to pay for DBI to parse your SQL and do a cache lookup every time you call prepare_cached().
Be warned that some databases (e.g. PostgreSQL and Sybase) don't support caches of prepared plans. With Sybase you could open multiple connections to achieve the same result, although this is at the risk of getting deadlocks depending on what you are trying to do!
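A short sketch of prepare_cached() in use, assuming $dbh is a connected DBI handle that persists across requests. The sub name fetch_foo and the table and column names are hypothetical.

```perl
use strict;
use warnings;

# Sketch: $dbh is assumed to be a connected, persistent DBI handle;
# the table "bar" and its columns are made up for illustration.
sub fetch_foo {
    my ($dbh, $baz) = @_;
    # The SQL text is identical on every call, so DBI can hand back the
    # cached statement handle instead of preparing a new one.
    my $sth = $dbh->prepare_cached(q{SELECT foo FROM bar WHERE baz = ?});
    $sth->execute($baz);
    # Fetching all the rows leaves the cached handle inactive and
    # ready for the next call.
    my $rows = $sth->fetchall_arrayref;
    return $rows;
}
```

The cache lives in the database handle, so the benefit only appears when the same connection is reused from request to request.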
Handling Timeouts
META: this is duplicated in the databases.pod -- should be resolved!
Some databases disconnect the client after a certain period of inactivity. This problem is known as the morning bug. The ping()
method ensures that this will not happen. Some DBD
drivers don't have this method; check the Apache::DBI
manpage to see how to write a ping()
method.
Another approach is to change the client's connection timeout. For mysql users, starting from mysql-3.22.x you can set a wait_timeout
option at mysqld server startup to change the default value. Setting it to 36 hours will fix the timeout problem in most cases.
mod_perl Database Performance Improving
Analysis of the Problem
A common web application architecture is one or more application servers which handle requests from client browsers by consulting one or more database servers and performing a transform on the data. When an application must consult the database on every request, the interaction with the database server becomes the central performance issue. Spending a bit of time optimizing your database access can result in significant application performance improvements. In this analysis, a system using Apache, mod_perl, DBI
, and Oracle will be considered. The application server uses Apache and mod_perl to service client requests, and DBI
to communicate with a remote Oracle database.
In the course of servicing a typical client request, the application server must retrieve some data from the database and execute a stored procedure. There are several steps that need to be performed to complete the request:
1: Connect to the database server
2: Prepare a SQL SELECT statement
3: Execute the SELECT statement
4: Retrieve the results of the SELECT statement
5: Release the SELECT statement handle
6: Prepare a PL/SQL stored procedure call
7: Execute the stored procedure
8: Release the stored procedure statement handle
9: Commit or rollback
10: Disconnect from the database server
In this document, an application will be described which achieves maximum performance by eliminating some of the steps above and optimizing others.
Optimizing Database Connections
A naive implementation would perform steps 1 through 10 from above on every request. A portion of the source code might look like this:
# ...
my $dbh = DBI->connect('dbi:Oracle:host', 'user', 'pass')
|| die $DBI::errstr;
my $baz = $r->param('baz');
eval {
my $sth = $dbh->prepare(qq{
SELECT foo
FROM bar
WHERE baz = $baz
});
$sth->execute;
while (my @row = $sth->fetchrow_array) {
# do HTML stuff
}
$sth->finish;
my $sph = $dbh->prepare(qq{
BEGIN
my_procedure(
arg_in => $baz
);
END;
});
$sph->execute;
$sph->finish;
$dbh->commit;
};
if ($@) {
$dbh->rollback;
}
$dbh->disconnect;
# ...
In practice, such an implementation would have hideous performance problems. The majority of the execution time of this program would likely be spent connecting to the database. An examination shows that step 1 consists of many smaller steps:
1: Connect to the database server
1a: Build client-side data structures for an Oracle connection
1b: Look up the server's alias in a file
1c: Look up the server's hostname
1d: Build a socket to the server
1e: Build server-side data structures for this connection
The naive implementation waits for all of these steps to happen, and then throws away the database connection when it is done! This is obviously wasteful, and easily rectified. The best solution is to hoist the database connection step out of the per-request lifecycle so that more than one request can use the same database connection. This can be done by connecting to the database server once, and then not disconnecting until the Apache child process exits. The Apache::DBI
module does this transparently and automatically with little effort on the part of the programmer.
Apache::DBI
intercepts calls to DBI
's connect and disconnect methods and replaces them with its own. Apache::DBI
caches database connections when they are first opened, and it ignores disconnect commands. When an application tries to connect to the same database, Apache::DBI
returns a cached connection, thus saving the significant time penalty of repeatedly connecting to the database. You will find a full treatment of Apache::DBI
in the section Persistent DB Connections.
When Apache::DBI
is in use, none of the code in the example needs to change. The code is upgraded from naive to respectable with the use of a simple module! The first and biggest database performance problem is quickly dispensed with.
Utilizing the Database Server's Cache
Most database servers, including Oracle, utilize a cache to improve the performance of recently seen queries. The cache is keyed on the SQL statement. If a statement is identical to a previously seen statement, the execution plan for the previous statement is reused. This can be a considerable improvement over building a new statement execution plan.
Our respectable implementation from the last section is not making use of this caching ability. It is preparing the statement:
SELECT foo FROM bar WHERE baz = $baz
The problem is that $baz
is being read from an HTML form, and is therefore likely to change on every request. When the database server sees this statement, it is going to look like:
SELECT foo FROM bar WHERE baz = 1
and on the next request, the SQL will be:
SELECT foo FROM bar WHERE baz = 42
Since the statements are different, the database server will not be able to reuse its execution plan, and will proceed to make another one. This defeats the purpose of the SQL statement cache.
The application server needs to make sure that SQL statements which are the same look the same. The way to achieve this is to use placeholders and bound parameters. The placeholder is a blank in the SQL statement, which tells the database server that the value will be filled in later. The bound parameter is the value which is inserted into the blank before the statement is executed.
With placeholders, the SQL statement looks like:
SELECT foo FROM bar WHERE baz = :baz
Regardless of whether baz
is 1 or 42, the SQL always looks the same, and the database server can reuse its cached execution plan for this statement. This technique has eliminated the execution plan generation penalty from the per-request runtime. The potential performance improvement from this optimization could range from modest to very significant.
Here is the updated code fragment which employs this optimization:
# ...
my $dbh = DBI->connect('dbi:Oracle:host', 'user', 'pass')
|| die $DBI::errstr;
my $baz = $r->param('baz');
eval {
my $sth = $dbh->prepare(qq{
SELECT foo
FROM bar
WHERE baz = :baz
});
$sth->bind_param(':baz', $baz);
$sth->execute;
while (my @row = $sth->fetchrow_array) {
# do HTML stuff
}
$sth->finish;
my $sph = $dbh->prepare(qq{
BEGIN
my_procedure(
arg_in => :baz
);
END;
});
$sph->bind_param(':baz', $baz);
$sph->execute;
$sph->finish;
$dbh->commit;
};
if ($@) {
$dbh->rollback;
}
# ...
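The :baz style of named placeholder shown above is the form accepted by the Oracle driver. DBI's portable placeholder is a question mark, with the values passed to execute(). A minimal sketch of the same SELECT in that style, assuming $dbh is a connected DBI handle (select_foo is a hypothetical name):

```perl
use strict;
use warnings;

# Sketch: $dbh is assumed to be a connected DBI handle.
sub select_foo {
    my ($dbh, $baz) = @_;
    my $sth = $dbh->prepare(q{SELECT foo FROM bar WHERE baz = ?});
    $sth->execute($baz);   # the value travels separately from the SQL text
    my @foo;
    while (my @row = $sth->fetchrow_array) {
        push @foo, $row[0];
    }
    return @foo;
}
```

Because the value never becomes part of the SQL string, the statement text stays constant for the server's cache, and a value containing quotes cannot change the meaning of the query.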
Eliminating SQL Statement Parsing
The example program has certainly come a long way and the performance is now probably much better than that of the first revision. However, there is still more speed that can be wrung out of this server architecture. The last bottleneck is in SQL statement parsing. Every time DBI
's prepare() method is called, DBI
parses the SQL command looking for placeholder strings, and does some housekeeping work. Worse, a context has to be built on the client and server sides of the connection which the database will use to refer to the statement. These things take time, and by eliminating these steps the time can be saved.
To get rid of the statement handle construction and statement parsing penalties, we could use DBI
's prepare_cached() method. This method compares the SQL statement to others that have already been executed. If there is a match, the cached statement handle is returned. But the application server is still spending time calling an object method (very expensive in Perl), and doing a hash lookup. Both of these steps are unnecessary, since the SQL is very likely to be static and known at compile time. The smart programmer can take advantage of these two attributes to gain better database performance. In this example, the database statements will be prepared immediately after the connection to the database is made, and they will be cached in package scalars to eliminate the method call.
What is needed is a routine that will connect to the database and prepare the statements. Since the statements are dependent upon the connection, the integrity of the connection needs to be checked before using the statements, and a reconnection should be attempted if needed. Since the routine presented here does everything that Apache::DBI
does, it does not use Apache::DBI
and therefore has the added benefit of eliminating a cache lookup on the connection.
Here is an example of such a package:
package My::DB;
use strict;
use DBI;
sub connect {
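if (defined $My::DB::conn) {
# ping() returns false on a dead connection and may also die,
# so check its return value inside the eval
my $alive = eval { $My::DB::conn->ping };
return $My::DB::conn if $alive;
}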
$My::DB::conn = DBI->connect(
'dbi:Oracle:server', 'user', 'pass', {
PrintError => 1,
RaiseError => 1,
AutoCommit => 0
}
) || die $DBI::errstr; #Assume application handles this
$My::DB::select = $My::DB::conn->prepare(q{
SELECT foo
FROM bar
WHERE baz = :baz
});
$My::DB::procedure = $My::DB::conn->prepare(q{
BEGIN
my_procedure(
arg_in => :baz
);
END;
});
return $My::DB::conn;
}
1;
Now the example program needs to be modified to use this package.
# ...
my $dbh = My::DB->connect;
my $baz = $r->param('baz');
eval {
my $sth = $My::DB::select;
$sth->bind_param(':baz', $baz);
$sth->execute;
while (my @row = $sth->fetchrow_array) {
# do HTML stuff
}
my $sph = $My::DB::procedure;
$sph->bind_param(':baz', $baz);
$sph->execute;
$dbh->commit;
};
if ($@) {
$dbh->rollback;
}
# ...
Notice that several improvements have been made. Since the statement handles have a longer life than the request, there is no need for each request to prepare the statement, and no need to call the statement handle's finish method. Since Apache::DBI
and the prepare_cached() method are not used, no cache lookups are needed.
Conclusion
The number of steps needed to service the request in the example system has been reduced significantly. In addition, the hidden cost of building and tearing down statement handles and of creating query execution plans is removed. Compare the new sequence with the original:
1: Check connection to database
2: Bind parameter to SQL SELECT statement
3: Execute SELECT statement
4: Fetch rows
5: Bind parameters to PL/SQL stored procedure
6: Execute PL/SQL stored procedure
7: Commit or rollback
It is probably possible to optimize this example even further, but I have not tried. It is very likely that the time could be better spent improving your database indexing scheme or web server buffering and load balancing.
Using 3rd Party Applications
It's been said that no one can do everything well, but one can do something specific extremely well. This seems to be true for many software applications: when you don't try to do everything but instead concentrate on something specific, you can do it really well.
So, while the mod_perl server can do many things, there are other applications (or Apache server modules) that can perform some specific operations faster, or that can do the mod_perl server a great service by taking those operations off its hands.
Let's take a look at a few of these.
Proxying the mod_perl Server
A proxy gives you a great performance increase in most cases. It's discussed in the section Adding a Proxy Server in http Accelerator Mode.
Upload/Download of Big Files
You don't want to tie up your precious mod_perl backend server children doing something as long and dumb as transferring a file. The user won't really see any important performance benefits from mod_perl anyway, since the upload may take up to several minutes, and the overhead saved by mod_perl is typically under one second.
If some particular script's main functionality is the uploading or downloading of big files, you probably want it to be executed on a plain Apache server under mod_cgi.
This of course assumes that the script requires none of the functionality of the mod_perl server, such as custom authentication handlers.
Perl Build Options
The Perl interpreter lies at the heart of the mod_perl server, so if we can optimize perl into doing things faster under mod_perl we make the whole server faster. Generally, optimizing the Perl interpreter means enabling or disabling some build-time options. Let's see a few important ones.
-DTWO_POT_OPTIMIZE and -DPACK_MALLOC Perl Build Options
Newer Perl versions also have build time options to reduce runtime memory consumption. These options might shrink the size of your httpd by about 150k, which is quite a big number if you remember to multiply it by the number of children you use.
The -DTWO_POT_OPTIMIZE
macro improves allocations of data with size close to a power of two; but this works for big allocations (starting with 16K by default). Such allocations are typical for big hashes and special-purpose scripts, especially image processing.
Perl memory allocation is by bucket with sizes close to powers of two. Because of this the malloc() overhead may be big, especially for data of size exactly a power of two. If PACK_MALLOC
is defined, perl uses a slightly different algorithm for small allocations (up to 64 bytes long), which makes it possible to have overhead down to 1 byte for allocations which are powers of two (and appear quite often).
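To see why sizes that are exactly a power of two are the worst case for a bucket allocator, consider the simplified model below. It is an illustration only and not Perl's actual malloc: it assumes every request carries a fixed bookkeeping overhead and is then rounded up to the next power of two. The name bucket_size and the 8-byte overhead used in the note that follows are assumptions.

```perl
use strict;
use warnings;

# Simplified model, NOT Perl's actual malloc: each request carries
# $overhead bytes of bookkeeping and is rounded up to the next
# power of two.
sub bucket_size {
    my ($request, $overhead) = @_;
    my $need   = $request + $overhead;
    my $bucket = 1;
    $bucket *= 2 while $bucket < $need;
    return $bucket;
}
```

Under this model a 64-byte request with 8 bytes of overhead lands in a 128-byte bucket, while a 56-byte request fits in 64 bytes. Shrinking the overhead for small allocations is what keeps exact powers of two from spilling into the next bucket.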
Expected memory savings (with 8-byte alignment in alignbytes
) is about 20% for typical Perl usage. Expected slowdown due to additional malloc() overhead is in fractions of a percent and hard to measure, because of the effect of saved memory on speed.
You will find these and other memory improvement details in perl5004delta.pod
.
Important: both options are On by default in perl versions 5.005 and higher.
-Dusemymalloc Perl Build Option
You have a choice between the native malloc() implementation and Perl's own. The choice depends on your operating system. Unless you know which of the two is better on yours, you had better try both and compare the benchmarks.
To build without Perl's malloc(), you can use the Configure command:
% sh Configure -Uusemymalloc
Note that:
-U == undefine usemymalloc (use system malloc)
-D == define usemymalloc (use Perl's malloc)
It seems that Linux still defaults to system malloc so you might want to configure Perl with -Dusemymalloc. Perl's malloc is not much of a win under Linux, but makes a huge difference under Solaris.