What I learned about security and SEO by auditing 25000 sites
Auditing 25k sites to know the state of security and SEO in the wild, wild web
Most write out of authority, authority in the field. I don't. I am a learner. I write for the unlearned about things in which I am unlearned myself. ― C.S. Lewis.
There is an easy way to learn about web-security and SEO. Read the authorities in those fields, like Troy Hunt for security and Brian Dean for SEO. When you have read them, you'll know everything there is to know about security & SEO.
Then there is a hard way. Audit the top 25k sites to know what is working in the field.
Summary of my findings:
- Nobody cares about security
- Nobody cares about web standards
- Everyone gzips their site
- There is no standard way for SEO
- Facebook open graph is popular
- Keywords are still used
- I pity the web-masters
Now to the details.
http vs https
Google announced in August 2014 that https would be a ranking signal. Despite this, only 25% of the sites use https. 19639 sites (of the top 26k) are still on http. Even sites like bbc, bestbuy, and backlinko are still running on http (though bestbuy's e-commerce site is served over https).
Ad revenue could be one of the reasons popular sites are still on http. Until recently, Google ads weren't served over https. But I found a Google AdSense FAQ saying Google can now serve ads over https. Maybe this will increase migration to https.
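One way such a check might be done is to request the plain http address and see where the redirects land. The sketch below is only an illustration under that assumption, not the method used for this audit; `final_url`, `uses_https`, and `https_share` are hypothetical helper names.

```python
from urllib.parse import urlsplit
from urllib.request import urlopen


def final_url(domain, timeout=10):
    """Request http://<domain> and return the URL reached after redirects.

    This is the only function here that touches the network.
    """
    with urlopen(f"http://{domain}", timeout=timeout) as response:
        return response.geturl()


def uses_https(url):
    """True when the landing URL's scheme is https."""
    return urlsplit(url).scheme == "https"


def https_share(landing_urls):
    """Fraction of landing URLs served over https (0.0 for an empty list)."""
    if not landing_urls:
        return 0.0
    return sum(uses_https(url) for url in landing_urls) / len(landing_urls)
```

A site that answers on http without redirecting counts as "still on http" here, even if it also happens to answer on https.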
Masking Server Information
Every website emits "meta-information" about itself in the form of server headers. Two such headers reveal the technology behind a website: server and x-powered-by. A third piece of meta-information is the generator tag, which is part of the HTML meta tags. Mass-sniffing programs use this data to build databases of websites and their servers. Hackers then look for exploitable bugs in those servers and launch attacks.
So security practitioners recommend masking server details. But even OWASP, a non-profit organization that works to improve software security, emits this information (its web server is Apache, and its application server is mediawiki 1.23.15).
The most popular web-servers are:
Popular application-servers are:
Many of these servers even emit their version numbers!
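To make the three leak points concrete, here is a minimal sketch that pulls server, x-powered-by, and the generator meta tag out of an already fetched response. It assumes you have the response headers as a dict and the HTML as a string. `exposed_server_info` and `leaks_version` are hypothetical names, and the version check is a rough heuristic (any dotted number), not a real fingerprinting rule.

```python
import re
from html.parser import HTMLParser


class GeneratorFinder(HTMLParser):
    """Remembers the content of the first <meta name="generator"> tag."""

    def __init__(self):
        super().__init__()
        self.generator = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.generator is not None:
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() == "generator":
            self.generator = attributes.get("content")


def exposed_server_info(headers, html):
    """Collect the three fields that reveal a site's technology."""
    lowered = {name.lower(): value for name, value in headers.items()}
    finder = GeneratorFinder()
    finder.feed(html)
    return {
        "server": lowered.get("server"),
        "x-powered-by": lowered.get("x-powered-by"),
        "generator": finder.generator,
    }


def leaks_version(value):
    """Heuristic: does the value contain something like 1.23 or 5.6.30?"""
    return bool(re.search(r"\d+(?:\.\d+)+", value or ""))
```

For example, headers of `{"Server": "Apache"}` plus a page containing `<meta name="generator" content="MediaWiki 1.23.15">` would report the server name without a version and a generator with one.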
Gap Or Opportunity
I used to think that website owners are ignorant and lazy about web-security. As I was analyzing this data, I realized site owners have to know and follow many things to keep their site popular: creating engaging content, adhering to ever-changing Google algorithms, implementing the latest SEO and SERP techniques (schema, microdata, AMP), and so on.
Content marketing isn't easy. But experts in that industry have succeeded in breaking down the complex concepts. Security pundits, however, have either blamed site owners or acted as elites. They have not connected with the masses to explain the concepts, benefits, and impact.
This is a great opportunity for the content marketing experts and security pundits to work together to improve web-security. Think about it. If Neil Patel, Brian Dean, and Rand Fishkin talk about web-security in their blogs and workshops, web-security will rapidly improve. They also need to understand the importance of security. Only Moz is on https.
There are two key aspects to performance - keep your content size small and enable gzip compression.
Let us look at the content size. I picked only the HTML size, not CSS, images, and other assets. When I plotted the sizes, I was surprised that some of the popular sites have more than 1000 KB of HTML.
If I split them into buckets, it looks like this:
These sites have more than 2000 KB of HTML:
http://sarkarinaukriblog.com http://seasonvar.ru http://gogy.com http://flyscoot.com http://slickguns.com http://eztravel.com.tw http://analog.com
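Splitting sizes into buckets can be sketched as below. The bucket edges (100, 500, 1000, and 2000 KB) are assumptions chosen for illustration and are not the buckets from the chart. `html_size_kb`, `bucket_for`, and `bucket_counts` are hypothetical names.

```python
from bisect import bisect_right
from collections import Counter

# Assumed bucket boundaries in KB; adjust to taste.
BUCKET_EDGES_KB = [100, 500, 1000, 2000]
BUCKET_LABELS = [
    "< 100 KB",
    "100-500 KB",
    "500-1000 KB",
    "1000-2000 KB",
    ">= 2000 KB",
]


def html_size_kb(html):
    """Size of the HTML document in KB, measured on its UTF-8 bytes."""
    return len(html.encode("utf-8")) / 1024


def bucket_for(size_kb):
    """Label of the bucket a size falls into (lower edge inclusive)."""
    return BUCKET_LABELS[bisect_right(BUCKET_EDGES_KB, size_kb)]


def bucket_counts(sizes_kb):
    """How many sizes land in each bucket."""
    return Counter(bucket_for(size) for size in sizes_kb)
```

Note that this measures the uncompressed HTML, which is what the browser has to parse, not the number of bytes sent over the wire.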
The other factor in performance is gzip compression. This is the only factor that was universally implemented. If you are interested in knowing about gzip compression and how to enable it, read this fantastic guide by Kalid Azad.
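As a rough illustration, a response is gzip-compressed when its Content-Encoding header says so (the client has to send `Accept-Encoding: gzip` in the request for a server to offer it). The sketch below checks that header and estimates how much gzip would save on a given page. `is_gzipped` and `gzip_savings` are hypothetical names, and the savings figure comes from Python's own gzip module at its default level, so a real server's numbers may differ.

```python
import gzip


def is_gzipped(headers):
    """True when the Content-Encoding response header includes gzip."""
    lowered = {name.lower(): value for name, value in headers.items()}
    encodings = lowered.get("content-encoding", "").lower().split(",")
    return "gzip" in (encoding.strip() for encoding in encodings)


def gzip_savings(html):
    """Fraction of bytes saved by gzipping the page (0.0 for empty input)."""
    raw = html.encode("utf-8")
    if not raw:
        return 0.0
    compressed = gzip.compress(raw)
    return 1 - len(compressed) / len(raw)
```

HTML is repetitive text, so the savings are usually large, which may be part of why this is the one practice everybody follows.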
Scripts are a necessary evil. They are needed for site analytics, Google Tag Manager, and so on. But having too many of them hurts performance. Despite the performance hit, many sites include a lot of scripts (even in the head section).
It is not so bad. Only 5750 sites (out of 25,000 sites) have more than 10 scripts in their head section.
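Counting scripts in the head section can be done with the standard library's HTML parser. This is a minimal sketch assuming the page is already downloaded as a string; `HeadScriptCounter` and `count_head_scripts` are hypothetical names, and both inline and external scripts are counted.

```python
from html.parser import HTMLParser


class HeadScriptCounter(HTMLParser):
    """Counts <script> tags that open while we are inside <head>."""

    def __init__(self):
        super().__init__()
        self.in_head = False
        self.count = 0

    def handle_starttag(self, tag, attrs):
        if tag == "head":
            self.in_head = True
        elif tag == "script" and self.in_head:
            self.count += 1

    def handle_endtag(self, tag):
        if tag == "head":
            self.in_head = False


def count_head_scripts(html):
    """Number of script tags inside the head section of a page."""
    counter = HeadScriptCounter()
    counter.feed(html)
    return counter.count
```

Pages that omit the `<head>` tag entirely (which HTML allows) would report zero here, so treat the count as a lower bound.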
There is only one long-term strategy for a site to be popular — solve a problem for an audience.
But search engines (mainly google) are the gate-keepers in this game, and they operate by technical aspects of a site.
15 most used meta tags are:
What does this reveal?
- Even though Google says keywords don't have any impact on search results, they are widely used.
- Everyone uses Google Webmaster Tools.
- Facebook Open Graph is more popular than Twitter Cards.
You should implement at least these meta tags in your site (except generator, for the reasons I explained in the previous section).
Which meta tags are important for Google? Google itself has answered it. Go read it.
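A ranking of the most used meta tags can be produced roughly like this. The sketch assumes a list of downloaded pages, counts each tag once per page, and treats the `property` attribute (used by Open Graph tags such as og:title) the same as `name`. `meta_names` and `most_used_meta_tags` are hypothetical names.

```python
from collections import Counter
from html.parser import HTMLParser


class MetaNameCollector(HTMLParser):
    """Collects the distinct name/property values of a page's meta tags."""

    def __init__(self):
        super().__init__()
        self.names = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        key = attributes.get("name") or attributes.get("property")
        if key:
            self.names.add(key.strip().lower())


def meta_names(html):
    """Set of meta tag names used on one page."""
    collector = MetaNameCollector()
    collector.feed(html)
    return collector.names


def most_used_meta_tags(pages, top=15):
    """(name, number of pages using it) pairs, most common first."""
    counts = Counter()
    for html in pages:
        counts.update(meta_names(html))
    return counts.most_common(top)
```

Counting per page rather than per tag keeps a single site with fifty og:image tags from skewing the ranking.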
Title is an important meta tag for SEO. SEO experts recommend keeping the title between 55 and 60 characters.
Since the latter sites are popular, I'm assuming you don't have to obsess over title length.
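Checking a page against the 55 to 60 character recommendation might look like the sketch below. It assumes the HTML is already in hand, takes the first title tag, and collapses runs of whitespace before measuring. `page_title` and `title_within_range` are hypothetical names.

```python
from html.parser import HTMLParser


class TitleExtractor(HTMLParser):
    """Captures the text of the first <title> element."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.done = False
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "title" and not self.done:
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self.in_title:
            self.in_title = False
            self.done = True

    def handle_data(self, data):
        if self.in_title:
            self.parts.append(data)


def page_title(html):
    """Title text with surrounding and repeated whitespace collapsed."""
    extractor = TitleExtractor()
    extractor.feed(html)
    return " ".join("".join(extractor.parts).split())


def title_within_range(title, low=55, high=60):
    """True when the title length falls inside the recommended range."""
    return low <= len(title) <= high
```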
Again, the recommendation on h1 is to keep only one h1 tag in a page.
So I was surprised to find Smart Passive Income has 19 h1s on its homepage! I read and listen to Pat often. I never looked at the number of h1s in his site.
Same is the case with Taiga, a popular agile project management tool. It also has 19 h1s on its homepage.
They are not alone. Almost half of the sites have more than one h1 on their homepage.
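Counting h1 tags is a small variation on the same parsing idea. A minimal sketch, assuming the homepages are already downloaded as strings; `count_h1` and `share_with_multiple_h1` are hypothetical names.

```python
from html.parser import HTMLParser


class H1Counter(HTMLParser):
    """Counts opening <h1> tags (the parser lowercases tag names)."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        if tag == "h1":
            self.count += 1


def count_h1(html):
    """Number of h1 tags on a page."""
    counter = H1Counter()
    counter.feed(html)
    return counter.count


def share_with_multiple_h1(pages):
    """Fraction of pages that have more than one h1 (0.0 for no pages)."""
    if not pages:
        return 0.0
    return sum(count_h1(html) > 1 for html in pages) / len(pages)
```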
None of the top 1000 sites passed W3C validator test. Yes, none.
If I can be popular, why comply with some stupid standard?
Funny thing. The W3C validator documentation describes how to get the validator output in JSON. So if you invoke the validator on this site, it returns a JSON object with errors. But if you invoke the same validator on qq, it returns an HTML document. Go check it by clicking on those two links.
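For reference, here is a sketch of asking for JSON output and counting the errors. It assumes the W3C Nu HTML Checker endpoint at validator.w3.org/nu/ with its `doc` and `out=json` query parameters, which may not be the exact endpoint used for this audit. `validator_url` and `count_errors` are hypothetical names. `count_errors` returns None when the body is not JSON, which covers the case of getting an HTML document back.

```python
import json
from urllib.parse import urlencode

# Assumed endpoint: the W3C Nu HTML Checker web service.
VALIDATOR_ENDPOINT = "https://validator.w3.org/nu/"


def validator_url(page_url):
    """URL that asks the checker to validate page_url and answer in JSON."""
    query = urlencode({"doc": page_url, "out": "json"})
    return f"{VALIDATOR_ENDPOINT}?{query}"


def count_errors(response_body):
    """Number of messages of type "error", or None if the body is not JSON."""
    try:
        report = json.loads(response_body)
    except ValueError:
        return None  # e.g. an HTML document came back instead of JSON
    if not isinstance(report, dict):
        return None
    messages = report.get("messages", [])
    return sum(1 for message in messages if message.get("type") == "error")
```

Passing the test, in this sketch, would mean `count_errors` returning 0 for a page.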
Complete University Guide is the longest domain name in the top ranking sites.
Some of the other longer domain names I could recognize are:
Do you have any other questions? Tweet or comment. I will find the answers.