Auditing a website with nodejs

It is not enough to put a site online and churn out content. You should audit it regularly to check its security, performance, and SEO compliance. But auditing a site is time-consuming and tedious, so you should automate as much of it as possible.

There are plenty of tools in the market for conducting a site audit. But what if you want a customized audit?

There are two fantastic nodejs packages that help automate a site audit: request and cheerio. In this article, we will use these two packages to retrieve a site and parse its content.
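
If you don't already have them, you can install both packages with npm:

npm install request cheerio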

First, use the request module to retrieve the site.

const Request = require("request");

Request('http://google.com', function (err, response, body) {
  if (err || response.statusCode !== 200) {
    // handle the error (network failure or a non-200 status)
  }
  else {
    // success: response carries the headers, body carries the HTML
  }
});

All the interesting bits are in response and body. We will parse the body using cheerio. Cheerio is a server-side implementation of core jQuery, so it offers a jQuery-like API for querying the HTML content.

Let us get the title of the webpage.

const Cheerio = require("cheerio");
let $ = Cheerio.load(body);
console.log("Title: ", $("title").text())

Similarly, we can get all the headings, images, and links from a webpage.

let bodyImages = $("img");
let h1 = $("h1");
let h3 = $("h3");
let links = $("a");

These selections are returned as array-like cheerio objects wrapping the matched DOM nodes. You can loop through their keys to get the attributes of the elements. Let's say we want to fetch details of all the images on the page; we would do this:

let htmlObject = $("img");
let elemArray = [];
let keys = Object.keys(htmlObject);
keys.forEach(key => {
    // numeric keys hold DOM nodes; cheerio's own properties have no attribs and are skipped
    if (htmlObject[key].attribs) {
        elemArray.push(htmlObject[key].attribs);
    }
});

This returns:

[ { height: '32',
    src: '/images/hpp/ic_wahlberg_product_core_48.png8.png',
    width: '32' },
  { height: '410',
    src: '/images/nav_logo242.png',
    width: '167',
    alt: 'Google',
    onload: 'google.aft&&google.aft(this)' } ]
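
If you prefer to stay within cheerio's API, the same list of attributes can be collected with its .map() and .get() methods. This is just an alternative sketch, reusing the $ loaded earlier:

// .map() passes (index, element); el is the underlying DOM node
let imageAttribs = $("img").map((i, el) => el.attribs).get();
console.log(imageAttribs);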

We can go on like this for all the interesting elements.

$("meta") will fetch the meta objects; $("body > script") will fetch the scripts within the body and so on.

The response object also has some interesting information. We can obtain the server headers from it.

// copy the response headers into a plain object for reporting
let headers = {};
for (let key in response.headers) {
    headers[key] = response.headers[key];
}

This will return something like this:

{ date: 'Sat, 05 Nov 2016 02:27:14 GMT',
  expires: '-1',
  'cache-control': 'private, max-age=0',
  'content-type': 'text/html; charset=UTF-8',
  p3p: 'CP="This is not a P3P policy! See https://www.google.com/support/accounts/answer/151657?hl=en for more info."',
  'content-encoding': 'gzip',
  server: 'gws',
  'x-xss-protection': '1; mode=block',
  'x-frame-options': 'SAMEORIGIN',
  'set-cookie': [ 'NID=90=UoNko0gZPR5r8DZE_aVSrku5LFthYp79ozw31bNrpWOnRgiza9XaFWBmWtHVzd_LBjaoj-Ekq--rpXDvDfN6siAfzVPxtlOpOF3H6I08YMUH-LAmp-vkFo-CRvs4AD1N; expires=Sun, 07-May-2017 02:27:14 GMT; path=/; domain=.google.co.in; HttpOnly' ],
  'alt-svc': 'quic=":443"; ma=2592000; v="36,35,34"',
  connection: 'close',
  'transfer-encoding': 'chunked' }

If you are interested in these headers, check out the Mozilla wiki for more information.
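
As a small taste of how these headers feed an audit, you could flag a few security headers when they are missing. The names below are only an illustrative subset, not a complete checklist, and the check runs inside the Request callback where response is available:

// flag a few common security headers when absent (illustrative subset only)
["strict-transport-security", "x-frame-options", "x-xss-protection"].forEach(name => {
  if (!response.headers[name]) {
    console.log("Missing header:", name);
  }
});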

This is all fine under ideal conditions. It is like going to a zoo to look at animals: animals in a zoo are isolated and well behaved. When you head into the wild, you need to anticipate many different behaviors and guard against them.

Request provides many options to handle different behaviors of such wild domains.

What if a domain takes a long time to respond? You don't want your program to get stuck, so request gives you a timeout option.
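
When the timeout fires, request reports it as an error whose code is ETIMEDOUT (could not connect in time) or ESOCKETTIMEDOUT (the response was too slow), so you could report slow domains separately from other failures. A minimal sketch:

Request({ url: 'http://google.com', timeout: 2000 }, function (err, response, body) {
  if (err && (err.code === 'ETIMEDOUT' || err.code === 'ESOCKETTIMEDOUT')) {
    console.log('the site timed out');
  }
});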

Some domains redirect. Some of these redirects are good and you want to follow them. Say you connect to an http site and it redirects to https; you want to follow it. Request automatically follows up to 10 redirects. You can restrict this to a lower number using maxRedirects.

Many domains implement gzip compression. If they do, you want to take advantage of it. It is good for you as well as for the domain.

So let us implement all these options to modify the original request. The full code looks like this:

const Cheerio = require("cheerio");
const Request = require("request");

let url = 'http://google.com';

let headers = {
  "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.11; rv:49.0) Gecko/20100101 Firefox/49.0",
  "Accept-Encoding": "gzip"
};
let options = {
  url: url,
  gzip: true,
  headers: headers,
  maxRedirects: 5,
  timeout: 2000
};
console.log("going to fetch ", url);
Request(options, function (err, response, body) {
  if (err || response.statusCode !== 200) {
    console.log('fetching ', url, ' returned error: ', err);
  }
  else {
    console.log('done fetching ', url);

    let $ = Cheerio.load(body);
    console.log("Title: ", $("title").text())
    let htmlObject = $("meta");
    let elemArray = [];
    let keys = Object.keys(htmlObject);
    keys.forEach(key => {
      if (htmlObject[key].attribs) {
        elemArray.push(htmlObject[key].attribs);
      }
    });

    console.log(elemArray);

    let headers = {};
    for (let key in response.headers) {
      headers[key] = response.headers[key];
    }
    console.log(headers)
  }
});

You should audit a site only for educational purposes or when you have the permission of the site owner. Auditing a site can be treated as crawling, and some forms of crawling are illegal.

This code is offered only for educational purposes. Use it at your own risk.



