NAME
    log_profile.pl - Analyze the log file of the NCSA or Apache webserver.

    Version 6.0b
    Programmed by Kenneth J. Lanfear, U.S. Geological Survey

DISCLAIMER
    Although this program has been used by the USGS, no warranty,
    expressed or implied, is made by the USGS or the United States
    Government as to the accuracy and functioning of the program and
    related program material, nor shall the fact of distribution
    constitute any such warranty, and no responsibility is assumed by the
    USGS in connection therewith.

SYNOPSIS
    log_profile.pl [-f rc_file] [-g page_file] [-p status_count]
        [-r referfile] [-s] [-t temp_name] [-u userfile] [-z robotlog]
        begintime endtime server_name out_file log_files

    where:
        -f rc_file      = File with run-time control information [$rc_file]
        -g page_file    = Save a file of pages hit.
        -p status_count = Output record count every 'status_count' input
                          lines [$rpt]
        -r referfile    = Save a file of referrers
        -s              = Run silently
        -t temp_name    = Pathname to temp file [$tf_name]
        -u userfile     = Save a file of users
        -z robotlog     = Save a file of robots that accessed the pages
        begintime       = Time to start analysis, in the format
                          DD/MMM/YYYY:HH:MM:SSZZZZZ
                          e.g., 01/Oct/1999:00:00:00:-0500
                          ZZZZZ is the offset from GMT used by the
                          server. Default is -0400, for U.S. Eastern
                          Daylight Time
        endtime         = Time to end analysis
        server_name     = Name of server to put into report
        out_file        = Name of output file
        log_files       = One or more names of log files for input. The
                          log files may be gzipped.

    This documentation is for version 6.0.

DESCRIPTION
    This program analyzes the log files of the NCSA webserver (version
    1.3 and higher) or the Apache webserver and prepares a report on
    usage. The report is oriented toward understanding who is using the
    server and what they are looking at. The program offers these
    capabilities:

    -- Users. Differentiate between "in-house" and "outside" users.
       Classify users as government, educational, commercial, major
       network (America On Line, Compuserve, Prodigy), foreign, and
       other (unidentified IP address).
       Save a list of users, with their category identified.

    -- Visits. Determine "visits," where a visit is a sequence of hits
       by the same address, each separated by no more than 30 minutes
       (default).

    -- Pages. Include or exclude specified pages. Rank and list hits on
       pages. Rank and list pages that are the "first contact" page on a
       visit.

    -- Multiple logs. Analyze multiple logs and, optionally,
       differentiate the pages in each log.

    -- Robots. Identify hits by robots (users who access the robots.txt
       file plus 20 other pages) and exclude them from analysis, analyze
       for robots only, and/or compile a separate report on robot usage.

    -- Referrers and agents. Identify, rank, and list referrers and
       agents.

    -- User habits. Determine "clicks per visit" and "images per click"
       for each class of user.

    The output file contains Hypertext Markup Language (HTML) tags for
    displaying the statistics on the World Wide Web.

    It is recommended that the server be configured so that the
    referer_log and agent_log files are combined with the access_log
    file. Otherwise, log_profile.pl cannot provide information about
    referrers and agents.

    A companion program, extract_monthly_log.pl, can be run prior to
    log_profile.pl to extract entries for a particular month.

OPTIONS
    e   Read the entire log. If this is not set, the program will assume
        the log records are in time order and will stop reading a log
        file (and go on to the next) when it encounters a record past
        the endtime. You should set this option if any log is not sorted
        by time.

        Note: some httpd programs do not record logs in exactly the same
        order as the time records. Usually the difference is only a few
        seconds. This could cause a few records at the end of a log to
        be ignored unless the e option is set. If the log already has
        been "clipped" to the desired time period, this option has no
        effect.

    f rc_file
        A run-time control file with information on additional options
        to be used when the program runs.
        See the RC_FILE section below for more information.

    g page_file
        Save a file of pages hit. Each line will contain:
            page number
            page name (or alias, if applicable)
            number of hits

    p status_count
        Output a progress report every 'status_count' records when
        processing the input logs. [1000]

    s   Run silently. Don't output progress information.

    t temp_name
        Use 'temp_name' as the first part of the path to the temporary
        files. As log_profile.pl reads the logs, it may save information
        in temporary files, which, by default, are placed in the same
        directory from which log_profile.pl is running. Use 'temp_name'
        to place the temporary files in another directory.
        [log_profile.tmp.$$]

    u userfile
        Save a file of user names. Each line of the file will contain
        the following tab-delimited information:
            IP_address
            classification
            number of days
            number of visits
            number of clicks
            number of pages
            number of hits on page 1
            ...
            number of hits on page userfile_page
        where classification is assigned as gov, edu, etc. by
        log_profile.pl. The value of userfile_page is explained below in
        the RC_FILE instructions.

    z robotlog
        Save a file of robots that accessed the pages. The file will
        report characteristics of each robot and will include a list of
        each robot's URLs called, in order.

ARGUMENTS
    begintime
        Day to start analysis: DDMMMYYYYZZZZZ
            DD    = day of month (01, 02, ... 31)
            MMM   = month (Jan Feb Mar ... Dec)
            YYYY  = year (e.g., 1997)
            ZZZZZ = the offset from GMT used by the server. If you don't
                    know this, you can find it in any log entry.
        01Oct1997-0400 will start analysis at 12 midnight Eastern
        Daylight Time, October 1, 1997.

    endtime
        Day to end analysis. See above. 31Oct1997-0500 will end the
        analysis at 11:59:59 PM Eastern Standard Time, October 31, 1997.

    server_name
        Name of server to put into report. For more than one word, use
        underscores, which will be removed. The_Best_Server will be
        printed as "The Best Server".

    out_file
        Name of output file. ".html" will be appended.
    log_files
        One or more names of log files for input. The log files may be
        gzipped. Order is not important.

RC_FILE USE AND FORMAT
    Additional run-time control information can be provided to
    log_profile.pl with an "rc" file. There is some overlap between
    optional items on the command line and values that can be set in the
    rc file. The logic that the program uses is:

    - If the "-f rc_file" command line argument is specified, use that
      rc file.
    - If there is overlap of a particular option between the command
      line and the rc file, the value on the command line takes
      precedence.

    Lines in the rc file fall into four categories:

    1. Lines that start with a "#" are comments and are ignored.
    2. Empty lines are ignored.
    3. Some lines set a single value, such as "sort size = 1000000".
    4. Some lines set a list of values. In this case, the value name and
       an equal sign appear on the first line, then there are multiple
       lines containing the values, and the list terminates with a line
       that has nothing but a period (.) in the first column. See the
       example below.

    The value keywords may be in either upper or lower case. Where it
    makes sense, such as in the include and exclude lists, it's
    legitimate for the list value to contain a regular expression.

    A domain is the last component of the address. A subdomain is all
    components except the last component and the host. In the address
    "ws01.wr.usgs.gov", "gov" is the domain and "wr.usgs" is the
    subdomain.

    Possible values in the rc file are:

    bgcolor = color
        Background color for the web page. Default is white, #ffffff.

    Header = [a list]
        Additional HTML code that will be placed at the top of the
        results page. (Program provides the standard tags.) Use this to
        include your organization's banner or similar content. Terminate
        the list with a line containing only a period.

    Footer = [a list]
        Additional HTML code that will be placed at the bottom of the
        results page. Terminate the list with a line containing only a
        period.
    ok_results = [a list]
        List all results codes that will be accepted. You must list all
        acceptable codes. Terminate the list with a line containing only
        a period. Default is to accept codes >= 200 and < 400.

    report freq = n
        Same as the -p command line option.

    visit length = n
        Visit length, in seconds. Any sequence of hits by the same IP,
        with each hit occurring less than 'visit length' seconds after
        the previous one, will be counted as one visit. Default is 1800
        seconds.

    temp file = file_name
        Same as the -t command line option.

    user file = file_name
        Same as the -u command line option.

    userfile_pages = n
        Show the number of hits on the first n pages in the user_file.
        This option has no effect unless a user_file is defined as above
        or by the -u option. The default value for userfile_pages is the
        number of page aliases assigned.

    refer file = file_name
        Same as the -r command line option. The file will have
        [referrals\treferrer].

    robot file = file_name
        Same as the -z command line option.

    sort size = n
        Change the default sorting size parameter. Default is 30000.
        Larger is faster, but increases the risk of filling memory.

    top count text pages = n
        Output the top n text pages referenced.

    top count image pages = n
        Output the top n image pages referenced.

    top count first contact = n
        Output the top n first contact pages.

    top count referrers = n
        Output the top n referrers.

    top count agents = n
        Output the top n agents.

    page rules = [a list]
        List the rules for including or excluding pages from analysis.
        The format is:
            action expression
        where
            action     = include | exclude
            expression = a regular expression in Perl. Do not enclose it
                         in slashes.
        Terminate the list with a period. The rules are evaluated in
        order. When a match is found, the indicated action is taken. If
        no match is found, the default is to include the page. Example:

            page rules =
            include \/watuse\/
            exclude \/images\/
            .

        The above example will include any page with /watuse/ in its
        pathname without further checks.
        Any page with /images/ (but not /watuse/) will be excluded. Note
        that a page with /watuse/images/ will be included because rule 1
        is checked first. The backslash indicates that the following
        character (the slash, in this example) is to be taken literally.
        If you want to change the default to exclude any page that does
        not pass a test, put this as the last test:
            exclude .

    page alias = [a list]
        List an alias for any page matching the test. The test works the
        same as for page rules. The format is:
            test alias
        Example:

            page alias =
            \/applicationA\/ Application_A
            \/another_application\/subsetB\/ Subproject_B
            .

        The default is to leave the page unchanged. If you want to
        assign all remaining pages to a single category, use this as the
        last test:
            . Other
        Page alias names are listed first among the pages in the page
        file. That is, if you supply n unique page alias names, the
        first n pages listed will be these aliases.

    page required = [a list]
        List pages that must be hit for a user to be counted. The format
        is the same as for page rules. The page alias rules will be
        applied before this test, so the match must be made against the
        alias name. A user that does not match a page on this list will
        be ignored. If no list is given, then all users are accepted.

    default domain = domain_name
        A domain name to be appended to users with no qualifiers. These
        users typically are from the webserver itself. If "usgs.gov" is
        the default domain name, then users in the "usgs.gov",
        "nbs.gov", and "nwrc.gov" domains will be considered in-house as
        well.

    domain rules = [a list]
        List the rules for including or excluding domains. This works
        the same as the page rules described above, except that the
        comparison will be case-insensitive. The default is to include.

    inhouse rules = [a list]
        List those domains that are considered "in house." This works
        the same as the domain rules described above, but the default is
        to exclude.
        If "usgs.gov" is the default domain name (see above), then users
        in the "usgs.gov", "nbs.gov", and "nwrc.gov" domains will be
        considered in-house as well. You do not need to specify inhouse
        rules if this is the case, and the program runs faster if you do
        not.

    referrer rules = [a list]
        List the rules for including or excluding pages based on the
        referrer. This works the same as page rules described above. The
        default is to include.

    referrer alias = [a list]
        List an alias for any referrer matching the test. The test works
        the same as for page alias described above, but is
        case-insensitive. Any alias you set here will be carried into a
        refer file, if one is written.

        A referrer alias can be useful for finding referrers that
        typically access your site from pages with slightly different
        names. This is very common with search engines. The alias serves
        to consolidate the references from a single source.

        The program automatically checks for certain common referrers:
            USGS
            EPA Surf_Your_Watershed
            Other EPA
            NOAA
            Yahoo
            InfoSeek
            AltaVista
            Lycos
            Excite
        There is no need to repeat this list in the referrer alias, and
        the program runs faster if you do not.

    options = myACDHPQRTUYZ

        m   Multiple log option. Use this if you have several logs and
            you want to differentiate them. Otherwise, all are simply
            concatenated. In analyzing pages under this option,
            log_profile.pl will prepend the log file name to the page
            name. The multiple log option is particularly useful if you
            are analyzing logs from several sites. It ensures that each
            site's index.html page, and other common page names, will be
            treated separately.

        y   Analyze for robot users only. To be considered a robot, a
            user must access the robots.txt file and have 20 or more
            hits.

        A   Skip output of agents information.

        C   Skip output of rc file rules.

        D   Skip output of definitional information.

        H   Do not include page hit information in the user file (if
            any).

        P   Skip output of page information.

        Q   Do not truncate URLs after a question mark.
            Truncating a URL strips the arguments from calls to scripts.
            If you do not truncate, then each call with different
            arguments will be treated as a unique URL. The same logic is
            applied to referrers. [Default is to truncate.]

        R   Skip output of referrers information.

        T   Skip output of access times information.

        U   Skip output of user information.

        Y   Exclude robot users from the analysis. See the definition of
            a robot in the "y" option.

        Z   Skip output of robot information.

    Sample rc file:
    ==================================
    Header =
    XYZ Corporation Banner
    .

    Footer =
    <address>
    Return to XYZ Corp. home page
    </address>
    .

    report freq = 1000
    sort size = 30000
    temp file = stuff
    default top count = 40
    top count tp = 40
    top count ip = 40
    top count fr = 40
    top count rf = 40
    top count agents = 40
    page rules =
    exclude \/testpages\/
    .
    default domain = xyzcorp.com
    domain rules =
    exclude foo\.bar\.com$
    .
    inhouse rules =
    exclude \.extranet\.xyzcorp\.com$
    include \.xyzcorp\.com$
    .
    referrer alias =
    \.usgs\.gov USGS
    \/www.epa.gov\/surf\/ EPA_Surf_Your_Watershed
    .
    options = mZ
    ==================================

EXAMPLES
    log_profile.pl 01Jun1995:00:00:00-0500 01Jul1995:00:00:00-0500
        Water_Webserver stats_Jun95 access_log access_log_old

        Analyzes access_log and access_log_old, compiles the statistics
        for any entry during June 1995, and writes the report to
        stats_Jun95.html.

    log_profile.pl -f Water.rc 01Jun1995:00:00:00-0500
        01Jul1995:00:00:00-0500 Water_Webserver stats_Jun95 access_log
        access_log_old

        Does the same as the above example, but uses the Water.rc file
        for additional controls.

DIAGNOSTICS
    Progress is reported for every 20,000 lines of log_files read.
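The rc file format described above (comment lines, blank lines, single "name = value" settings, and multi-line lists terminated by a lone period) is simple to parse. This is not the program's own Perl code; it is a minimal Python sketch of the stated rules, with keyword names folded to lower case for the case-insensitive keyword matching the documentation describes:

```python
def parse_rc(lines):
    """Parse rc-file lines into a dict. '#' comments and blank lines are
    skipped; 'name = value' sets a single string value; 'name =' with
    nothing after the equals sign starts a list of values that ends at a
    line containing only a period."""
    settings = {}
    list_key = None  # name of the list currently being collected, if any
    for raw in lines:
        line = raw.rstrip("\n")
        if list_key is not None:
            if line.strip() == ".":
                list_key = None          # period terminates the list
            else:
                settings[list_key].append(line)
            continue
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue                     # comments and empty lines ignored
        key, _, value = stripped.partition("=")
        key, value = key.strip().lower(), value.strip()
        if value:
            settings[key] = value        # single-value line
        else:
            settings[key] = []           # start of a list
            list_key = key
    return settings
```

For example, feeding it the lines `sort size = 1000000` and a `page rules =` list yields `{"sort size": "1000000", "page rules": [...]}`.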
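The "page rules" evaluation order described above (first matching rule wins, default is to include) can also be sketched. Again, this is an illustrative Python sketch, not the program's Perl source; the `page_action` name is invented for the example, and the rules mirror the `/watuse/` and `/images/` example from the documentation:

```python
import re

def page_action(page, rules, default="include"):
    """Evaluate ordered (action, pattern) rules against a page path.
    The first pattern that matches determines the action; a page that
    matches no rule gets the default action ('include')."""
    for action, pattern in rules:
        if re.search(pattern, page):
            return action
    return default

# Rules from the documentation's example, in rc-file regex notation.
rules = [
    ("include", r"\/watuse\/"),
    ("exclude", r"\/images\/"),
]
```

As the documentation notes, `/watuse/images/index.html` is included, because the include rule is checked first.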
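Finally, the visit definition used throughout (a sequence of hits by the same address, each no more than "visit length" seconds after the previous one, 1800 seconds by default) can be expressed as a short sketch. This Python code is not taken from log_profile.pl; the `count_visits` helper and the use of Unix timestamps are assumptions made for illustration:

```python
from collections import defaultdict

VISIT_GAP = 1800  # default "visit length" in seconds

def count_visits(hits):
    """Count visits per address. `hits` is a list of (address,
    unix_timestamp) pairs; a gap greater than VISIT_GAP between
    consecutive hits by the same address starts a new visit."""
    by_addr = defaultdict(list)
    for addr, t in hits:
        by_addr[addr].append(t)
    visits = {}
    for addr, times in by_addr.items():
        times.sort()
        n = 1  # the first hit always opens a visit
        for prev, cur in zip(times, times[1:]):
            if cur - prev > VISIT_GAP:
                n += 1
        visits[addr] = n
    return visits
```

So an address that hits at t=0, t=100, and t=5000 counts as two visits: the 4900-second gap exceeds the 1800-second threshold.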