Showing posts with label Apache. Show all posts

Sunday, April 20, 2014

awk and log parsing


1. Find the total number of unique visitors:

cat access.log | awk '{print $1}' | sort | uniq -c | wc -l

2. Find the number of unique visitors today:

cat access.log | grep "$(date '+%d/%b/%Y')" | awk '{print $1}' | sort | uniq -c | wc -l

3. Find the number of unique visitors this month:

cat access.log | grep "$(date '+%b/%Y')" | awk '{print $1}' | sort | uniq -c | wc -l

4. Find the number of unique visitors on an arbitrary date, for example March 22nd, 2007:

cat access.log | grep 22/Mar/2007 | awk '{print $1}' | sort | uniq -c | wc -l

5. (based on #3) Find the number of unique visitors for the month of March:

cat access.log | grep Mar/2007 | awk '{print $1}' | sort | uniq -c | wc -l

6. Show the number of requests per visitor IP address, sorted by count:

cat access.log | awk '{print "requests from " $1}' | sort | uniq -c | sort -n

7. Similarly, by adding a "grep date" as in the tips above, the same statistics will be produced for that date only:

cat access.log | grep 26/Mar/2007 | awk '{print "requests from " $1}' | sort | uniq -c | sort -n
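
The tips above all follow the same pattern: optionally filter by a date string, take the first field, and count distinct values. Here is a minimal sketch of that pattern as a shell function, run against a throwaway sample log. The function name unique_visitors and the log lines are made up for illustration; they are not from any real server.

```shell
#!/bin/sh
# Build a tiny sample log in Common Log Format (all values are invented).
log=$(mktemp)
cat > "$log" <<'EOF'
10.0.0.1 - - [22/Mar/2007:10:00:00 +0000] "GET / HTTP/1.1" 200 512
10.0.0.2 - - [22/Mar/2007:10:05:00 +0000] "GET /a.html HTTP/1.1" 200 1024
10.0.0.1 - - [22/Mar/2007:11:00:00 +0000] "GET /b.html HTTP/1.1" 404 0
10.0.0.3 - - [26/Mar/2007:09:00:00 +0000] "GET / HTTP/1.1" 200 512
10.0.0.1 - - [02/Apr/2007:09:00:00 +0000] "GET / HTTP/1.1" 200 512
EOF

# unique_visitors FILE [DATE_PATTERN]   (hypothetical helper)
# Counts distinct client addresses (field 1). If DATE_PATTERN is given,
# only lines containing that fixed string are counted,
# e.g. 22/Mar/2007 for a day or Mar/2007 for a month.
unique_visitors() {
    if [ -n "${2:-}" ]; then
        grep -F -- "$2" "$1"
    else
        cat "$1"
    fi | awk '{print $1}' | sort -u | wc -l | tr -d ' '
}

echo "all time:    $(unique_visitors "$log")"
echo "22 Mar 2007: $(unique_visitors "$log" 22/Mar/2007)"
echo "March 2007:  $(unique_visitors "$log" Mar/2007)"
```

For "today", the pattern would be the output of date '+%d/%b/%Y'. Apache writes English month abbreviations, so on a machine with a non-English locale it is probably safer to run date with LC_ALL=C.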

1 - Most Common 404s (Page Not Found)
cut -d'"' -f2,3 /var/log/apache/access.log | awk '$4==404{print $4" "$2}' | sort | uniq -c | sort -rg

2 - Count requests by HTTP code

cut -d'"' -f3 /var/log/apache/access.log | cut -d' ' -f2 | sort | uniq -c | sort -rg

3 - Largest Images
cut -d'"' -f2,3 /var/log/apache/access.log | grep -E '\.jpg|\.png|\.gif' | awk '{print $5" "$2}' | sort | uniq | sort -rg

4 - Filter Your IP's Requests
tail -f /var/log/apache/access.log | grep

5 - Top Referring URLs
cut -d'"' -f4 /var/log/apache/access.log | grep -v '^-$' | grep -v '^http://www.yoursite.com' | sort | uniq -c | sort -rg

6 - Watch Crawlers Live
For this we need an extra file which we'll call bots.txt. Here's the contents:


Bot
Crawl
ai_archiver
libwww-perl
spider
Mediapartners-Google
slurp
wget
httrack


This just helps us to filter out common user agents used by crawlers.
Here's the command:
tail -f /var/log/apache/access.log | grep -f bots.txt

7 - Top Crawlers
This command will show you all the spiders that crawled your site with a count of the number of requests.
cut -d'"' -f6 /var/log/apache/access.log | grep -f bots.txt | sort | uniq -c | sort -rg
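
The commands in this list rely on how cut -d'"' splits a combined-format log line: field 2 is the request line, field 3 holds the status code and size, field 4 the referrer, and field 6 the user agent. The sketch below checks that against a small invented log. The function names and sample lines are assumptions for illustration, and the crawler filter uses grep -i (my addition) so that a pattern like "Bot" also matches "Googlebot".

```shell
#!/bin/sh
# Sample log in Combined Log Format (all values are invented).
log=$(mktemp)
cat > "$log" <<'EOF'
10.0.0.1 - - [22/Mar/2007:10:00:00 +0000] "GET /index.html HTTP/1.1" 200 512 "-" "Mozilla/5.0"
10.0.0.2 - - [22/Mar/2007:10:01:00 +0000] "GET /missing.html HTTP/1.1" 404 209 "http://example.org/" "Mozilla/5.0"
10.0.0.3 - - [22/Mar/2007:10:02:00 +0000] "GET /missing.html HTTP/1.1" 404 209 "-" "Googlebot/2.1"
10.0.0.3 - - [22/Mar/2007:10:03:00 +0000] "GET /old.html HTTP/1.1" 404 209 "-" "Googlebot/2.1"
10.0.0.4 - - [22/Mar/2007:10:04:00 +0000] "GET /logo.png HTTP/1.1" 200 4096 "http://example.org/" "Wget/1.10"
EOF

# A short pattern file in the spirit of bots.txt (subset, for the demo only).
bots=$(mktemp)
printf '%s\n' Bot wget > "$bots"

# Requests per HTTP status code, most frequent first.
status_counts() {
    cut -d'"' -f3 "$1" | cut -d' ' -f2 | sort | uniq -c | sort -rn | sed 's/^ *//'
}

# Paths that returned 404, most frequent first.
# After cut -f2,3 the awk fields are: method, path, protocol, status, size.
top_404s() {
    cut -d'"' -f2,3 "$1" | awk '$4 == 404 {print $2}' |
        sort | uniq -c | sort -rn | sed 's/^ *//'
}

# User agents matching any pattern in the bots file (case-insensitive).
top_crawlers() {
    cut -d'"' -f6 "$1" | grep -i -f "$2" | sort | uniq -c | sort -rn | sed 's/^ *//'
}

status_counts "$log"
top_404s "$log"
top_crawlers "$log" "$bots"
```

The trailing sed only strips the left padding that uniq -c adds, which makes the output easier to compare or feed into other tools.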


How To Get A Top Ten
You can easily turn the commands above that aggregate (the ones using uniq) into a top ten by adding this to the end:
| head

That is, pipe the output to the head command.
Simple as that.

Zipped Log Files
If you want to run the above commands on a logrotated file, you can adjust easily by starting with a zcat on the file then piping to the first command (the one with the filename).

So this:
cut -d'"' -f3 /var/log/apache/access.log | cut -d' ' -f2 | sort | uniq -c | sort -rg
Would become this:
zcat /var/log/apache/access.log.1.gz | cut -d'"' -f3 | cut -d' ' -f2 | sort | uniq -c | sort -rg
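
If you want one pipeline that covers both the current log and the rotated, compressed ones, a small wrapper can help. This is a sketch that assumes a gzip whose -f option, combined with -c, copies input it does not recognize unchanged (GNU gzip documents this behaviour). The function name read_log and the sample files are invented for the demo.

```shell
#!/bin/sh
# read_log FILE...   (hypothetical helper)
# Prints every file to stdout: gzip files are decompressed, plain files
# are passed through as they are.
read_log() {
    gzip -cdf "$@"
}

# Demo on throwaway files (contents are invented).
dir=$(mktemp -d)
printf '%s\n' '10.0.0.1 - - [21/Mar/2007:10:00:00 +0000] "GET / HTTP/1.1" 200 512' > "$dir/access.log.1"
gzip "$dir/access.log.1"    # leaves access.log.1.gz in place of the original
printf '%s\n' '10.0.0.2 - - [22/Mar/2007:10:00:00 +0000] "GET /x HTTP/1.1" 404 0' > "$dir/access.log"

# Count requests by HTTP code across the rotated and the current log.
read_log "$dir/access.log.1.gz" "$dir/access.log" |
    cut -d'"' -f3 | cut -d' ' -f2 | sort | uniq -c | sort -rn
```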

Saturday, November 20, 2010

MOD_PHP or FASTCGI?

When we load PHP into Apache as a module (using mod_php), each Apache process we run also contains a PHP interpreter, which in turn loads all of its compiled-in libraries, and those are not exactly small.

This means that even if the Apache process that just started will only serve images, it will contain a PHP interpreter with all assigned libraries. That in turn means that said Apache process uses a lot of memory and takes some time to start up (because PHP and all the shared libraries it's linked to need to be loaded). That is wasted energy if the file that needs to be served is an image or a CSS file.

FastCGI in contrast loads the PHP interpreter into memory, keeps it there and Apache will only use these processes to serve the PHP requests.

That means that all the images, CSS, Flash files and whatever other static content we may have can be served by a much smaller Apache process that does not contain a scripting language interpreter and that does not link in a bunch of extra libraries (think libxml, libmysqlclient, and so on).

Even if we only serve pages parsed by PHP - maybe because we process our stylesheets with PHP and because we do something with the served images - we are theoretically still better off with FastCGI, as Apache will recycle its processes now and then (though that's configurable) while FastCGI processes stay around.

And if we go on and need to load-balance our application, FastCGI can still provide advantages: in the common load balancing scenario, we have a reverse proxy or a load balancer and a bunch of backend servers actually doing the work. In that case, if we use FastCGI, the backend servers will be running our PHP application and nothing else. No web server loading an interpreter loading our script. Just the interpreter and our script. So we save a whole lot of memory by not loading another web server in the backend (yes, FastCGI works over the network).

Monday, December 15, 2008

Apache Virtual Hosts

When Apache is started, it scans the configuration file (/etc/httpd/conf/httpd.conf) to determine its settings. It generates a table of the server's IP addresses with a hash (known as a vhost address set) containing the associated domain names. With the Apache daemon (httpd) running and listening at the appropriate ports (usually just 80), it's ready to receive requests from clients.

When a browser goes looking for a document that a user has requested, it first has a domain name server translate the domain name entered to an IP address. The browser then sends the user's request to the IP address. As of HTTP 1.1, the browser must also send to the web server the domain name that the user entered; it's no longer to be implied. This requirement makes virtual hosting possible. If Apache has no vhosts, it will use the main server's DocumentRoot directory (often set to /var/www/html). However, if Apache has been configured for vhosts, it will compare the client's request to the ServerName of each vhost with the same IP address and port that the request came in on. The accompanying vhost directives of the first ServerName that matches the client's request will be applied.

Within a vhost block--between <VirtualHost> and </VirtualHost> tags in httpd.conf--many directives may be given, but only two are typically required: the ServerName and the DocumentRoot directives. The ServerName directive provides the domain name. The DocumentRoot directive sets the root directory for the domain. If Apache finds a vhost with a ServerName that matches a client request, it will look in the root directory specified by the DocumentRoot directive for files. If it finds what was requested, it will send copies to the client.
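
To make the two directives concrete, here is a rough sketch that writes a two-vhost fragment to a scratch file and then looks up the DocumentRoot for a given name. The host names, paths, and the vhost_docroot helper are placeholders of mine, and the awk lookup is only a toy imitation of the ServerName matching described above, not how Apache itself parses its configuration.

```shell
#!/bin/sh
# Write a minimal name-based vhost fragment to a scratch file.
# Host names and directories are placeholders, not values from a real server.
conf=$(mktemp)
cat > "$conf" <<'EOF'
<VirtualHost *:80>
    ServerName www.example.com
    DocumentRoot /var/www/example-com
</VirtualHost>

<VirtualHost *:80>
    ServerName www.example.org
    DocumentRoot /var/www/example-org
</VirtualHost>
EOF

# vhost_docroot CONF NAME   (hypothetical helper)
# Prints the DocumentRoot of the first vhost block whose ServerName equals NAME.
vhost_docroot() {
    awk -v want="$2" '
        $1 == "<VirtualHost"   { name = ""; root = "" }   # a new block starts
        $1 == "ServerName"     { name = $2 }
        $1 == "DocumentRoot"   { root = $2 }
        $1 == "</VirtualHost>" { if (name == want) { print root; exit } }
    ' "$1"
}

vhost_docroot "$conf" www.example.org
```

Against a running server, the same matching can be exercised by sending a request to the server's IP address with an explicit Host header for each name.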


http://www.onlamp.com/pub/a/apache/2003/07/24/vhosts.html?page=2