Joseph Jude

Consult . Code . Coach

What I learnt about security and SEO by auditing 25000 sites


tech . seo . security . martech

Most write out of authority, authority in the field. I don't. I am a learner. I write for the unlearned about things in which I am unlearned myself. ― C.S. Lewis.

There is an easy way to learn about web-security and SEO. Read the authorities in those fields, like Troy Hunt for security and Brian Dean for SEO. When you have read them, you'll know everything there is to know about security & SEO.

Then there is a hard way. Audit the top 25k sites to know what is working in the field.

I took the hard way. I collected information from these sites using nodejs, stored it, and analyzed it.

Summary of my findings:

Now to the details.

Security

http vs https

Google announced in August 2014 that https would be a ranking signal. Despite this, only about 25% of the sites use https; 19639 sites (of the top 26k) are still on http. Even sites like bbc, bestbuy, and backlinko still run on http (the e-commerce site of bestbuy is served over https, though).

[Figure: http vs https]
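
The http/https split can be tallied from the URL each site ends up on after redirects. Here is a minimal Python sketch (not the nodejs code used for the audit); the function name `scheme_share` and the sample URLs are made up for illustration.

```python
from collections import Counter
from urllib.parse import urlsplit


def scheme_share(final_urls):
    """Count URL schemes and return (counts, https share in percent)."""
    counts = Counter(urlsplit(url).scheme.lower() for url in final_urls)
    total = sum(counts.values())
    https_share = 100.0 * counts["https"] / total if total else 0.0
    return counts, https_share


# Hypothetical sample: the URL each site lands on after redirects.
sample_urls = [
    "http://example.com/",
    "https://example.org/",
    "http://example.net/",
    "https://example.edu/",
]
counts, share = scheme_share(sample_urls)
print(counts["http"], "http,", counts["https"], "https,", f"{share:.0f}% https")
```

Counting the final URL instead of the requested one matters, because many sites answer on http only to redirect to https.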

Revenue from ads could be one reason popular sites are still on http. Until recently, Google ads weren't served over https. But I found a Google Adsense FAQ saying Google can now serve ads over https. Maybe this will increase migration to https.

Between Start SSL, SSL Mate, and Let’s Encrypt, you should be able to find a certificate authority for your site.

Masking Server Information

Every website emits "meta-information" about itself in the form of server headers. Two such headers reveal the technology behind a website: server and x-powered-by. A third piece of meta-information comes from the html meta tags: generator. Mass-sniffing programs use this data to build a database of websites and their servers. Hackers then look for exploitable bugs in those servers and launch attacks.

So security practitioners recommend masking server details. But even OWASP, a non-profit organization working to improve software security, emits this information (its web server is Apache, and its application server is mediawiki 1.23.15).

The most popular web-servers are:

Server    Count
nginx     11853
apache     6424
IIS        1940

Popular application-servers are:

Server               Count
php                   5586
asp.net               1863
phusion passenger      158
easyengine              48

Many of these servers emit even the version numbers!
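
To see what a single response gives away, you can look at the three signals described above: the server header, the x-powered-by header, and the generator meta tag. This is a minimal Python sketch, not the audit's nodejs code; `disclosed_info` is a made-up name, the sample response is invented, and the version check is only a rough "digits.digits" heuristic.

```python
import re
from html.parser import HTMLParser


class GeneratorFinder(HTMLParser):
    """Remember the content of the first <meta name="generator"> tag."""

    def __init__(self):
        super().__init__()
        self.generator = None

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if (tag == "meta" and self.generator is None
                and (attr.get("name") or "").lower() == "generator"):
            self.generator = attr.get("content")


def disclosed_info(headers, html):
    """Return the stack details a response reveals.

    `headers` is a plain dict of response headers; `html` is the page body.
    """
    lowered = {name.lower(): value for name, value in headers.items()}
    finder = GeneratorFinder()
    finder.feed(html)
    found = {
        "server": lowered.get("server"),
        "x-powered-by": lowered.get("x-powered-by"),
        "generator": finder.generator,
    }
    # Keep only the signals that are present and flag values that seem
    # to carry a version number (heuristic: digits, a dot, digits).
    return {
        key: {"value": value,
              "has_version": bool(re.search(r"\d+\.\d+", value))}
        for key, value in found.items() if value
    }


# Invented sample response, for illustration only.
sample_headers = {"Server": "nginx/1.10.1", "X-Powered-By": "PHP"}
sample_html = ('<html><head><meta name="generator" '
               'content="MediaWiki 1.23.15"></head></html>')
for key, detail in disclosed_info(sample_headers, sample_html).items():
    print(key, detail)
```

An empty result means the response masks all three signals.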

Gap Or Opportunity

I used to think that website owners were ignorant and lazy about web-security. As I analyzed this data, I realized site owners have to know and follow many things to keep their sites popular: creating engaging content, adhering to ever-changing Google algorithms, implementing the latest SEO and SERP techniques (schema, microdata, AMP), and so on.

Content marketing isn't easy. But experts in that industry have succeeded in breaking down complex concepts. Security pundits, however, have either blamed site owners or acted as elites. They have not connected with the masses to explain the concepts, benefits, and impact.

[Figure: Gap or Opportunity]

This is a great opportunity for content marketing experts and security pundits to work together to improve web-security. Think about it. If Neil Patel, Brian Dean, and Rand Fishkin talk about web-security in their blogs and workshops, web-security will rapidly improve. They also need to understand the importance of security. Only Moz is on https.

Performance

There are two key aspects to performance - keep your content size small and enable gzip compression.

Let us look at the content size. I measured only the html size, not css, images, and other assets. When I plotted the sizes, I was surprised that some popular sites serve more than 1000 KB of html.

[Figure: Scatter plot of html sizes of top sites]

If I split them into buckets, it looks like this:

[Figure: HTML size distribution in top sites]

These sites serve more than 2000 KB of html:

http://sarkarinaukriblog.com
http://seasonvar.ru
http://gogy.com
http://flyscoot.com
http://slickguns.com
http://eztravel.com.tw
http://analog.com
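
Splitting html sizes into buckets can be sketched in a few lines of Python (again, not the audit's nodejs code). The bucket edges below are assumptions, since the chart's actual edges are not stated, and the two sample pages are invented.

```python
from collections import Counter

# Assumed bucket edges in KB; the original chart's edges are not stated.
BUCKET_EDGES_KB = [100, 500, 1000, 2000]


def size_kb(html_bytes):
    """Size of the raw html payload in KB (1 KB = 1024 bytes)."""
    return len(html_bytes) / 1024


def bucket_label(kb):
    """Name the bucket a size falls into, e.g. '100-500 KB'."""
    lower = 0
    for upper in BUCKET_EDGES_KB:
        if kb <= upper:
            return f"{lower}-{upper} KB"
        lower = upper
    return f"> {lower} KB"


def size_distribution(pages):
    """Map {url: html_bytes} to a Counter of bucket labels."""
    return Counter(bucket_label(size_kb(body)) for body in pages.values())


# Invented sample pages: one tiny, one larger than 2000 KB.
sample_pages = {
    "http://small.example/": b"x" * 2048,
    "http://big.example/": b"x" * (2500 * 1024),
}
distribution = size_distribution(sample_pages)
print(distribution)
```

The sizes here are uncompressed body sizes, so the client has to decompress a gzip response before measuring it.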

The other factor in performance is gzip compression. This is the only factor that was universally implemented. If you are interested in knowing about gzip compression and how to enable it, read this fantastic guide by Kalid Azad.
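
Whether a site compresses its responses shows up in the Content-Encoding response header (the client has to send Accept-Encoding: gzip to ask for it). The Python sketch below checks that header and estimates how much gzip would save on a payload; the function names and the sample page are made up for illustration.

```python
import gzip


def is_gzip_enabled(response_headers):
    """True if the response declares gzip in its Content-Encoding header."""
    for name, value in response_headers.items():
        if name.lower() == "content-encoding":
            return "gzip" in value.lower()
    return False


def gzip_saving(html_bytes):
    """Percentage of bytes saved by gzip-compressing the payload."""
    if not html_bytes:
        return 0.0
    compressed = gzip.compress(html_bytes)
    return 100.0 * (1 - len(compressed) / len(html_bytes))


# Invented, highly repetitive page; real pages usually compress less well.
sample_page = b"<li class='item'>hello</li>\n" * 500
print(is_gzip_enabled({"Content-Encoding": "gzip"}))
print(f"{gzip_saving(sample_page):.0f}% saved")
```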

scripts

Scripts are necessary evils. They are needed for site analytics, Google Tag Manager, and so on. But having too many of them hurts performance. Despite the performance hit, many sites include many scripts (even in the head section).

Look at Gap. It has 89 scripts in the head section. Compare that with Amazon. It has only 10 scripts in the head section.

It is not so bad, though. Only 5750 sites (out of 25,000) have more than 10 scripts in their head section.
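
Counting scripts in the head section only needs a small parser that tracks whether it is currently inside head. This is a minimal Python sketch with the standard library's html.parser, not the audit's nodejs code; the class name and the sample page are made up.

```python
from html.parser import HTMLParser


class HeadScriptCounter(HTMLParser):
    """Count <script> tags, separately for the head section."""

    def __init__(self):
        super().__init__()
        self.in_head = False
        self.head_scripts = 0
        self.total_scripts = 0

    def handle_starttag(self, tag, attrs):
        if tag == "head":
            self.in_head = True
        elif tag == "script":
            self.total_scripts += 1
            if self.in_head:
                self.head_scripts += 1

    def handle_endtag(self, tag):
        if tag == "head":
            self.in_head = False


def count_scripts(html):
    """Return (scripts in head, scripts in the whole page)."""
    counter = HeadScriptCounter()
    counter.feed(html)
    return counter.head_scripts, counter.total_scripts


# Invented sample page with two scripts in head and one in body.
sample_html = """<html><head>
<script src="analytics.js"></script>
<script>var x = 1;</script>
</head><body><script src="app.js"></script></body></html>"""
print(count_scripts(sample_html))
```

It counts both inline scripts and scripts loaded with src, since both block or delay rendering when placed in head without async or defer.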

SEO

There is only one long-term strategy for a site to be popular — solve a problem for an audience.

But search engines (mainly Google) are the gate-keepers in this game, and they judge a site by its technical aspects.

There are many factors in the "technical aspects" of SEO. I have limited my audit to the content side. If you want comprehensive information on the technical side of SEO, check out the detailed post by Mattias.

meta tags

15 most used meta tags are:

What does this reveal?

  1. Even though Google says keywords don't have any impact on search results, they are widely used.
  2. Everyone uses Google Webmaster Tools.
  3. Facebook Open Graph is more popular than Twitter Cards.

You should implement at least these meta tags in your site (except generator, for the reasons I explained in the previous section).
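
A ranking of the most used meta tags can be built by collecting the name, property, or http-equiv key of every meta tag and counting each key once per site. The Python sketch below shows one way to do it, not the audit's nodejs code; the names `MetaNames` and `most_used_meta_tags` and the two sample pages are made up.

```python
from collections import Counter
from html.parser import HTMLParser


class MetaNames(HTMLParser):
    """Collect the name/property/http-equiv key of every <meta> tag."""

    def __init__(self):
        super().__init__()
        self.names = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        key = attr.get("name") or attr.get("property") or attr.get("http-equiv")
        if key:
            self.names.add(key.lower())


def most_used_meta_tags(pages, top=15):
    """Return the `top` meta keys with the number of pages using each."""
    usage = Counter()
    for html in pages:
        parser = MetaNames()
        parser.feed(html)
        usage.update(parser.names)  # a set, so each key counts once per page
    return usage.most_common(top)


# Invented sample homepages.
sample_pages = [
    '<head><meta charset="utf-8"><meta name="description" content="a">'
    '<meta property="og:title" content="b"></head>',
    '<head><meta name="Description" content="c">'
    '<meta name="keywords" content="d, e"></head>',
]
print(most_used_meta_tags(sample_pages))
```

Tags like `<meta charset>` carry no name and are skipped here; count them separately if you care about them.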

Which meta tags are important for Google? Google itself has answered that. Go read it.

Want to know about all the meta tags and how to implement them? Read them at Meta Tags. What an apt domain name!

Kevin Suttle has also organized all meta tags.

On a related note, Josh Buchea has a list of items that can go into head section.

title

Title is an important tag for SEO. SEO experts recommend keeping the title between 55 and 60 characters.

Google, Trello, and BBC have their names as title in their pages and so the length is less than 10.

On the other end, TollyWood, Ali Express, Chicago Tribune, and Diggo have more than 100 characters in their title.

Since the latter sites are popular, I'm assuming you don't have to obsess over title length.
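
Measuring title length comes down to extracting the text of the first title tag and collapsing whitespace before counting. Here is a minimal Python sketch under those assumptions (not the audit's nodejs code); the names and the sample page are made up, and the 55 and 60 defaults are the recommendation quoted above.

```python
from html.parser import HTMLParser


class TitleText(HTMLParser):
    """Collect the text of the first <title> element."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.seen = False
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "title" and not self.seen:
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self.in_title:
            self.in_title = False
            self.seen = True

    def handle_data(self, data):
        if self.in_title:
            self.parts.append(data)


def title_length(html):
    """Length of the title after collapsing runs of whitespace."""
    parser = TitleText()
    parser.feed(html)
    return len(" ".join("".join(parser.parts).split()))


def title_verdict(length, low=55, high=60):
    """Compare a length with the recommended 55 to 60 character range."""
    if length < low:
        return "short"
    if length > high:
        return "long"
    return "recommended"


# Invented sample page.
sample_html = "<html><head><title>  BBC - Home </title></head></html>"
length = title_length(sample_html)
print(length, title_verdict(length))
```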

h1

Again, the recommendation is to keep only one h1 tag on a page.

So I was surprised to find that Smart Passive Income has 19 h1s on its homepage! I read and listen to Pat often. I never looked at the number of h1s on his site.

The same is the case with Taiga, a popular agile project management tool. It also has 19 h1s on its homepage.

They are not alone. Almost half of the sites have more than one h1 on their homepage.
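
Counting h1 tags per homepage, and the share of homepages with more than one, can be sketched in Python like this (not the audit's nodejs code; the function names and sample pages are made up).

```python
from html.parser import HTMLParser


class H1Counter(HTMLParser):
    """Count opening <h1> tags."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        if tag == "h1":
            self.count += 1


def count_h1(html):
    """Number of h1 tags in one page."""
    counter = H1Counter()
    counter.feed(html)
    return counter.count


def share_with_multiple_h1(pages):
    """Percentage of pages that contain more than one h1."""
    if not pages:
        return 0.0
    multiple = sum(1 for html in pages if count_h1(html) > 1)
    return 100.0 * multiple / len(pages)


# Invented sample homepages: one h1, three h1s, no h1, two h1s.
sample_pages = [
    "<body><h1>Welcome</h1><h2>News</h2></body>",
    "<body><h1>A</h1><h1>B</h1><H1>C</H1></body>",
    "<body><p>No heading</p></body>",
    "<body><h1>One</h1><h1>Two</h1></body>",
]
print([count_h1(page) for page in sample_pages])
print(f"{share_with_multiple_h1(sample_pages):.0f}% have more than one h1")
```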

Web Standards

None of the top 1000 sites passed the W3C validator test. Yes, none.

If I can be popular, why comply with some stupid standard?

Funny thing. The W3C validator documentation describes how to get the validator output in json. So if you invoke the validator on this site, it returns a json object with errors. But if you invoke the same validator on qq, it returns an html document. Go check it by clicking on those two links.
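
Because the validator does not always answer in json, any script that calls it has to cope with both kinds of response. The Python sketch below assumes the W3C Nu HTML Checker endpoint with its `doc` and `out=json` parameters (the post does not say which validator endpoint was used), and the sample responses are invented. It only builds the request URL and reads a response body; fetching is left out.

```python
import json
from urllib.parse import urlencode

# Assumption: the W3C Nu HTML Checker; the audit may have used another endpoint.
VALIDATOR = "https://validator.w3.org/nu/"


def validator_url(page_url):
    """Build a validator request URL that asks for json output."""
    return VALIDATOR + "?" + urlencode({"doc": page_url, "out": "json"})


def count_errors(response_body):
    """Number of error messages, or None if the body is not a json report."""
    try:
        report = json.loads(response_body)
    except ValueError:
        return None  # e.g. the validator answered with an html document
    if not isinstance(report, dict):
        return None
    return sum(1 for message in report.get("messages", [])
               if message.get("type") == "error")


# Invented sample responses, for illustration only.
json_body = json.dumps({"messages": [
    {"type": "error", "message": "Stray end tag."},
    {"type": "info", "subType": "warning", "message": "Consider a lang attribute."},
    {"type": "error", "message": "Duplicate ID."},
]})
html_body = "<!DOCTYPE html><html><body>Validation results</body></html>"

print(validator_url("http://example.com/"))
print(count_errors(json_body), count_errors(html_body))
```

A page passes only when the count is 0; None means the response could not be read as a report at all.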

Trivia

Complete University Guide is the longest domain name in the top ranking sites.

Some of the other longer domain names I could recognize are:

Digital Photography School
Income Tax Filing
Content Marketing Institute

Do you have any other questions? Tweet or comment. I will find the answers.


Like the post? Retweet it. Got comments? Reply.

What I learnt about security and SEO by auditing 25000 sites by @jjude : https://t.co/pcXMaWpSA2 pic.twitter.com/8Se4ANCqSw

— Joseph Jude (@jjude) November 17, 2016