
Sunday, February 03, 2019

SEO Crawling and Indexing basics

Our journey into SEO starts with the terms crawling and indexing, so let's point out the main differences between them.
For more information, you can take a look at my search engine optimization course.



 
In order for a specific URL to show up when searching in Google, it first has to be discovered, and that is the main purpose of the crawling process. When a particular website is submitted to Google for indexing, the crawler first fetches its main entry point, which is usually the index.html file, and then tries to discover as many internal pages as possible by following the links on each page. Next, for each discovered page, the crawler makes an HTTP request just like a browser does and parses the returned content so it can gather readable information.
The parsing process includes removing all the HTML tags, scripts, styles and the so-called "stop words". Stop words are commonly used words that are discarded because they only add noise to the information. After the cleanup, machine-learning algorithms try to understand the topic of the content, based on information they have learned from previously processed websites.
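The cleanup step can be sketched in a few lines of Python (a simplified illustration with a toy stop-word list and regex-based tag stripping, not how Google actually parses pages):

```python
import re

# A tiny illustrative stop-word list; real crawlers use much larger ones.
STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "is", "in"}

def parse_page(html):
    """Strip tags, scripts and styles, then drop stop words."""
    # Remove script and style blocks together with their contents
    text = re.sub(r"(?s)<(script|style).*?</\1>", " ", html)
    # Remove the remaining HTML tags
    text = re.sub(r"<[^>]+>", " ", text)
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [w for w in words if w not in STOP_WORDS]

print(parse_page("<h1>The cars</h1><script>var x=1;</script><p>Cars are fast.</p>"))
# → ['cars', 'cars', 'are', 'fast']
```

What remains after the cleanup is the raw word list that the topic-detection algorithms work on.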
You may ask yourself what does the Google index look like?
Although the real index, and all the factors Google takes into account before ranking a web page for a specific keyword, remain a secret, we can represent the index as a table-like distributed database with the following columns: a term (keyword) and the path to a document (from the website's documents) where the keyword is found. In other words, this table maps the relevance of a keyword to a particular web document, so that in the case of a search the engine can easily determine where (in which document) a specific word can be found. Now that we have a grasp of the index structure, let's take a look at the actual process of indexing:
The search engine performs two main tasks: the first is to find meaningful words in the parsed content related to its topic, and the second is to associate them, together with the path to the document, within its existing index.
You may ask: how does the engine know whether a particular keyword is relevant?
The local relevancy of a word or a combination of words for a particular document is calculated using techniques such as the inverted index, Latent Semantic Indexing (LSI), and others:
  • The Inverted index is a data structure where all the unique words found in a document are mapped to a list of documents where they also appear.
  • In LSI the mapping additionally considers relations between keywords and concepts contained in a collection of text documents.
When a meaningful keyword is found, it is linked to its source document (or multiple documents), forming a new data entry in the table structure. In the second step of the indexing process, the search engine tries to fit the data entry into the existing index. The comparison process takes into account more than 200 factors that feed machine-learning algorithms. Additionally, human evaluators are used to determine the relevancy of a particular keyword for a particular document.
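The table-like structure described above can be illustrated with a minimal inverted index in Python (a toy sketch, nothing like Google's actual implementation; the documents are made up):

```python
from collections import defaultdict

def build_inverted_index(documents):
    """Map each unique word to the list of documents it appears in."""
    index = defaultdict(list)
    for path, text in documents.items():
        for word in set(text.lower().split()):
            index[word].append(path)
    return index

docs = {
    "/cars.html": "fast cars and classic cars",
    "/bikes.html": "fast bikes",
}
index = build_inverted_index(docs)
print(sorted(index["fast"]))  # both documents contain "fast"
```

A search for "fast" then becomes a simple lookup in the table instead of a scan over every document.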
One more thing worth knowing about this process: cached copies of all the documents are archived and stored separately.

Crawling and indexing continued
There are ways to control the crawler's access to a website, as well as to choose which pages are taken into account for indexing. This can be done by placing a meta tag in the head section of a particular web page, or by creating a robots.txt file inside the website's root directory.
There are two properties, index and follow, related to the indexing and crawling processes, which we will discuss:
Meta: index says that the page should be added to the index database.
Meta: follow controls crawling of the inner links inside the web page.
If we would like to restrict the crawler's access to the page's links, we use the "nofollow" value.
If we would like the web page not to be part of the search engine index, and to be excluded from the search results, we use the "noindex" value.
The robots.txt file is primarily used to exclude pages from crawling. It is a text file describing which pages/domains/directories of a particular website should be included in or excluded from crawling. There is one per website, and it resides in the root directory.
Robots.txt example:
# Rule 1
User-agent: Googlebot
Disallow: /nogooglebot/

# Rule 2
User-agent: *
Allow: /
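You can verify how a crawler interprets such rules with Python's standard urllib.robotparser module:

```python
from urllib.robotparser import RobotFileParser

# The same rules as in the example above
rules = """\
# Rule 1
User-agent: Googlebot
Disallow: /nogooglebot/

# Rule 2
User-agent: *
Allow: /
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

print(parser.can_fetch("Googlebot", "/nogooglebot/page.html"))  # False
print(parser.can_fetch("Bingbot", "/nogooglebot/page.html"))    # True
```

Googlebot is denied the /nogooglebot/ directory by Rule 1, while every other user agent falls through to Rule 2 and may fetch anything.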


Now let's see the difference between using robots.txt and meta tags with an example of crawl blocking.
The next example shows how a website can end up with some of its pages displaying: "A description for this result is not available because of this site's robots.txt file."

First, let's discuss why this situation occurs. Apparently, this website has the meta robots "index" tag, so it has been indexed appropriately,
but the page is disallowed in the robots.txt file, which prevents crawling and therefore the display of the result's description. In this way, the web page is added to the index (it has been discovered) but has not been crawled.
Before proceeding with the next examples, let's first be clear on what link juice is.
It is actually the value passed from one page or site to another through hyperlinks. Search engines see it as a vote or a signal by other pages that the page they are linking to is valuable.
One page can pass link juice to another through the hyperlinks it contains.

We will now discuss four cases which prevent the flow of link juice:
  • If the source page returns 404 Not Found, its link juice will not be calculated and transferred to the other page.
  • If the page we are linking to cannot be found, the link juice likewise remains in the source page.
    For the next two cases, we have to recall what the robots.txt file does. It is a plain text file where we can describe directories and files to be allowed or disallowed from crawling when a particular search engine visits our website.
  • When a page is disallowed in the robots.txt file, its internal links will not pass link juice to the destination pages.
  • And finally, if a link carries the "nofollow" attribute, it will not pass link juice to the destination page.
Penguin update
This update's main goal is to prevent the exchange of bad linking practices between websites. Such schemes are used by SEO 'specialists' in order to increase a particular website's reputation, and also in the reverse direction: to negatively affect a specific website by pointing lots of low-quality links at it. Here is how to clean up such situations. Go to the Search Console and see who is linking to you, then check all the listed domains by hand for issues. Other free websites that are very helpful for finding back-linking sites are NeilPatel's website and backlinkwatch. After obtaining the list of spammy websites, create a plain text file, disavow.txt, where you list them in the format: domain:spammy.com. The last step is to upload the file to Google's disavow tool. You will have to wait some days before the penalty imposed by this kind of negative SEO is lifted.
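For illustration, a disavow.txt following that format might look like this (spammy1.com and spammy2.com are placeholder domains):

```
# links from these domains were identified as spammy
domain:spammy1.com
domain:spammy2.com
# a single bad URL can also be listed directly
http://spammy3.com/bad-links-page.html
```

Comment lines start with #; each domain: line disavows every link from that domain at once.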

Panda/topical/quality update
This penalty affects the entire website, even though a single webpage can trigger it.
Here are a few ways on how to remedy the situation if your website is being targeted by the Panda update:
If you have very similar articles, merge them, or add more relevant content to the shorter ones. Aim for about 1000 words of content per article.
In order to identify the source of the problem, especially if you have lots of pages, you can group your categories into subdomains. The Search Console will then allow you to inspect them by domain, so you can gain insight into which categories perform better and decide whether to correct the weak pages or just disallow the whole category. You can get even more granular by using sitemaps covering all of the site's pages. After submitting a sitemap in the Search Console, you will see which pages are fully indexed and which are having problems. The benefit of this technique is that it shows which categories are underperforming, down to the level of individual URLs.
If you have an article which spans multiple pages, you can add rel="next" and rel="prev" inside your HTML markup, so Google can treat those pages as a group.
More techniques:
First, identify the top 5 pages receiving the most impressions while having a very low CTR (clicks). The improvement action in such a case is simply to correct their meta description and title in order to make them more attractive to visitors.
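The selection step can be sketched in Python (the page names and impression/click numbers here are made up; in practice they come from the Search Console performance report):

```python
def low_ctr_pages(stats, top=5, ctr_threshold=0.01):
    """Return the highest-impression pages whose CTR falls below the threshold."""
    candidates = [
        (page, impressions, clicks / impressions)
        for page, impressions, clicks in stats
        if impressions and clicks / impressions < ctr_threshold
    ]
    # Most impressions first: fixing these titles has the biggest potential impact
    candidates.sort(key=lambda row: row[1], reverse=True)
    return [page for page, _, _ in candidates[:top]]

stats = [
    ("/popular-article", 50000, 100),  # CTR 0.2% -> needs a better title
    ("/healthy-page", 20000, 1500),    # CTR 7.5% -> fine
    ("/small-page", 300, 1),           # CTR ~0.3%, but few impressions
]
print(low_ctr_pages(stats))  # → ['/popular-article', '/small-page']
```

The pages at the top of this list are the ones whose titles and meta descriptions are worth rewriting first.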
Transfer power from higher- to lower-ranking pages, or analyze the well-ranking pages and how they differ from the lower-ranking ones. When done, you can either delete the weak pages or merge them into the powerful ones.
For comments: display them only after a user performs an action such as clicking a button. On the one hand this improves the UI, and on the other it prevents spam. In many cases webmasters choose to disallow commenting on pages altogether.
The most effective yet slowest technique for dealing with the Panda update is to allow only high-quality pages into the Google index. Start with a few that already have a good click-through rate and impressions, and then gradually allow more pages into the quality indexed list by rewriting, updating or adding new information.

Top-heavy update
The update targets websites that use too much advertisement inside the first screen a user sees. You might have great content, but if it is crowded out by advertisements, the penalty will be triggered.
The suggestion in such cases is to have only one ad above the fold (before reaching the vertical scroll height); it should be smaller than 300x250px and, on mobile devices, limited to one per page. The alternative is to use the automatic ad placement available from Google AdSense.

Monday, September 09, 2013

Remove duplicates from country specific Blogger urls

Here is how to clean up your blog from possible duplicates coming from country-specific Blogger URLs. First off, what are these URLs and how are they produced? For example, if your original blog URL is nevyan.blogspot.com and you happen to visit the website from India, you'll be served exactly the same content, but this time coming from the .in domain: nevyan.blogspot.in.
Left unchanged, this will surely create duplicate content in Google's index, because you have two domains serving exactly the same content. So here is how to check and fix your blog, avoiding such duplication issues.

0. Fix your canonical tags, and don't rely on fancy JavaScript redirects, which are not properly followed by search engines and lead to a second refresh of the webpage.
Just open up your Blogger template and place in the head section:
<link expr:href="data:blog.canonicalUrl" rel="canonical">
This way the search engine will know which is the legitimate version of the accessed content.
For example, if Google crawls nevyan.blogspot.in, the canonical tag will reveal its representative URL as nevyan.blogspot.com.

1. Check if you have already indexed content from domains outside yours,
by using the following search in Google:
"nevyan.blogspot" -com 
or by manually checking all the domains found here: http://dishingtech.blogspot.com/2012/06/list-of-blogger-country-domains-for.html:

2. For each domain found, extract the indexed links:
site:nevyan.blogspot.name_of_domain
For example, site:nevyan.blogspot.ca will list each of the indexed results coming from the Canadian domain.

3. Next, open Google Webmaster Tools, authenticate the domains as yours from 'Add domain', and go to Google Index -> Remove URLs.

4. Type the URLs and wait for the removal process to finish.
Note: in order for the removal to be successful, make sure that those URLs either have a meta noindex tag or are disallowed in a custom robots.txt file.

Cheers!

Monday, August 26, 2013

Recovering from the Google's Penguin and Panda updates

Google Panda

As you may know, there is an ongoing algorithmic update named "Panda" (or "farmer"), aimed at filtering thin, made-for-AdSense, or copied content out of Google's search index. The algorithm mainly pushes websites down in SERPs, aiming to replace them with author-owned content websites, which could be more useful to visitors.
The actual filter is an automatic semantic algorithm, triggered by various factors, which classifies content as useful or not. The triggering factors remain secret, but it is known that websites that serve content of mixed quality (both bad and good) to their visitors will have their SEO rankings impaired.


The algorithm uses additional information from:
- the Google Chrome browser addon allowing the user to block low-quality sites
- the site-blocking features provided in Google search
- social sharing: not the actual number of likes, but the links that a website gets as a result of a Facebook share or a retweet.

Here is a sample table showing the traffic decline, somewhere between 70-90%, after the Panda update was applied to popular affected websites. (Note that the company providing the results measures a particular set of keywords, not the whole long-tail traffic of the mentioned websites.)


Another triggering factor of the update is the lack of user satisfaction, measured by the amount of time spent on a page before the user clicks the back button, returning to the search results to find another website.

Here is what you can do in order to make your website less affected by the update:

1. Use the noindex meta tag on thin or low-quality content, for example second-level category pages containing only links and no actual content.
2. Delete all the copied content and make those pages return header status 404 Not Found.
3. Reduce the number of adverts or additional page elements appearing before the actual content (above the fold), or make sure that they blend within the content:
(Figure: a site layout that highlights content vs. one that pushes content below the fold)

4. Use Google Analytics to find the webpages with the lowest time on page, and use the same method as in 1) to restrict them.
5. By default, use meta noindex on all new pages until they have quality content.
6. Don't use white text on a black background.
7. Maximize the time users spend on a page.

Next follow tips on website recovery from the Penguin / Panda updates. They are not well known to the general audience, which is why I decided to share them here.


Redirects
Check with the curl command that all your website resources respond with 200 OK and are not using 301 or 302 redirects. The reason is that curl shows the real header status codes, unlike the browser or Firebug, which may show cached ones.
Usage from a command prompt: curl -IL http://your_website.com
Next, check your main page, category pages, inner content pages, JavaScript and CSS files. Just make sure that they return a 200 OK response.
The same procedure can be done using the Screaming Frog SEO spider.

DNS
Check the latest propagated DNS settings of your website. Most DNS-checking websites, as well as the nslookup command, provide cached DNS name server responses, even if you clear (flush) your system cache. That's why I recommend checking your website's name servers using the http://dns.squish.net/ website. Also make sure that the IP address of your DNS server (its A record) matches the NS entry of your domain. After making sure that there are no problems, go to Google Webmaster Tools and fetch your URL via Fetch as Google.

Removing duplicate content
When doing URL removals in Webmaster Tools, make sure that you type the URL manually from the keyboard or, if it is too long, use the dev tools console to inspect the URL and copy and paste it directly into the input box. If you just select & copy the URL with the mouse from the search results, there's a high chance that some invisible characters get appended to it, like %20%80%B3, etc. Your request will still look legitimate, but don't be surprised if you see the already-removed URL reappearing in the index. For this reason, a few days after the removal you should re-check Google's index to see if any occurrences of the URL are still present. Some not-so-obvious URLs that may be indexed are your .js, .swf or .css files.
In order to effectively remove content, it should either return a 404 Not Found result, or 1) have the noindex, nofollow meta attributes and 2) not be blocked in the robots.txt file.
Also, if you have access to the .htaccess file, here is how to add a canonical link header for .swf or .json files:
Header set Link "<http://domain.com/canonical_file.swf>; rel=\"canonical\""
This way any existing link power will accumulate on the canonical URL.

Robots.txt, noindex, nofollow meta tags
In order to reveal content that is indexed but hidden from Google SERPs, you should release everything from your robots.txt file (just leave it empty) and stop all the .htaccess redirects that you have.
This way you will reveal hidden parts of the website (like indexed 301-redirect pages) which are in the index, and you will be able to clear them up.
The other reason behind this change is that URLs listed in robots.txt won't be crawled and their link rank won't be recalculated as you'd like, even if they are marked with the noindex, nofollow meta tags. After the cleaning procedure, don't forget to restore the contents of your robots.txt file.

Google Penguin

This update, previously known as anti-spam or over-optimization, is now more dynamic and is recalculated much more often. If you are not sure exactly which Google quality update affected your website, it's very hard and slow to make random changes and wait for feedback, i.e. a traffic increase. So here I'm suggesting one convenient logical method to find out:
First, get your site visit logs from Google Analytics or any other statistical software; we are interested in the number of total unique visitors. Then open this document: http://moz.com/google-algorithm-change. Next, check whether on the mentioned dates, or a few days after them (starting from February 23, 2011), you have a drop in traffic. If so, your pages have been hit and it's time to take measures against that specific update. But first, let's check whether your content relies on any of the following techniques, which are (wrongly) recommended in forums, articles, video tutorials and paid SEO materials.

Over-optimization:
- keyword density above 5%; stuffed bold/emphasized (em) keywords in the meta keywords, description and title, as well as in h1...h6, img title and alt attributes
- leaving intact the link to the blog theme author / 'powered by'
- adding links with keyword variations in the anchor text to the left, top or bottom section of the website (which in reality receive almost no clicks)
- creating multiple low-quality pages targeting the same keywords, thus trying to be perceived as an authority on the subject
- creating duplicate content by using multiple redirects (301, 302, JavaScript, .htaccess, etc.)
- overcrowding the real content with ads so that it appears far down the page flow, requiring lots of scrolling to be viewed
- using keywords in the page URL and in the domain name

Link schemes:
- being a member of web-ring networks or modern link exchanges. Some sophisticated networks have complicated algorithms which choose and point proper links to your newly created content, making it appear solid
- having low perceived value / no branding or trust gained over the years, yet receiving (purchased) multiple links from better-ranking websites
- Penguin 1 targeted sites with unnatural / paid links, analyzing only the main page of a particular website. With time, 'clever' SEOs started buying links pointing not only to the main page but to inner pages or categories, leading to Penguin 2, where inner pages with spammy link profiles are also taken into account.
- more and more people are using comments to sneak in their links. The old/new spamming technique goes like this: they say something half-meaningful about your content, or something that sounds like praise, and then put their website link in the middle of that sentence. Please don't approve such comments; report them as spam.

Solutions:
- rewrite the questionable content, or consolidate it with a few other posts
- delete the low-quality page if it already has a 'not so clean' backlink profile and the back-linking site doesn't want to remove its links
- use Google's Disavow tool to clean up the website's back-linking profile

Don't forget that, as with every Panda update, Google makes its changes continuously, using multiple tests and corrections to the algorithms. So some fluctuations may be observed while examining the SERPs.

Good luck!

Sunday, April 08, 2012

Improving Adsense eCPM / RPM and CPC

There are some things that anyone can do to increase his/her income when using AdSense.
You can learn a bit more about the SEO topic from my online course.

1. Decrease your ad units count
Clicks on the second and third ad spots are not as profitable as those on the first one. Also, when you have 15 ad links on a page and the user clicks on only one, your eCPM will drop, because the ad impressions grow while the clicks stay the same.

2. Rearrange ad spots
Look at your AdSense performance tab: if your middle-positioned ad has a higher CPC (i.e. it is getting more clicks), place it first in your HTML code, then re-position it with CSS to the place where it's getting those clicks.

3. Add more useful content to the page
You can use Google Analytics or other web statistics software to see the average time spent on your pages. This way you can find the pages that need to be enriched or rewritten. And if you manage to increase visitors' stay time, your eCPM will surely grow!
 
Next, let's discuss how to increase your AdSense earnings by improving the eCPM (RPM) parameter.

4. Direct hits and RPM
By definition, RPM is 'the amount of revenue you can expect to earn from AdSense for every 1000 impressions shown on your site', which is something totally different from how much 1000 impressions on your site actually cost!
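Put as a formula, RPM = earnings / impressions × 1000. A quick Python illustration (the figures are made up):

```python
def rpm(earnings, impressions):
    """Revenue per 1000 impressions."""
    return earnings / impressions * 1000

# e.g. $12.50 earned over 5000 ad impressions
print(rpm(12.50, 5000))  # → 2.5
```

Note that raising raw impressions without raising clicks drives this number down, which is exactly the effect discussed below.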

And as you can see from this video:


in order to have high eCPM you'll have to ensure unique visitors are clicking on your ads.

But first, let's see how to recognize some of the actions that "repeated" users (our direct traffic hits) perform. They usually:
-  type the URL in the browser, open a new tab, or access the URL from bookmarks
-  come from links found in email/newsletters, Word or PDF files
-  come from redirects (301 header redirects, JavaScript or meta refresh tags)

As you might have seen in the video above, there's an inverse relationship between your eCPM value and the traffic you have. In other words, receiving more mixed (non-unique) traffic will in effect only lower your eCPM.

Filtering direct hits
So our first priority becomes not the obvious one, to get more traffic in order to display more ads, but to display relevant ads to our organic segment only (i.e. users coming from search engine queries). This way we'll be displaying fewer ads, but the eventual incoming clicks are going to be much more valuable.

Here is a simple PHP script that will filter out most of the direct hits and display the advertisement only to users coming from Google, assuming that your AdSense code is placed in the variable $ads:
<?php
// Show ads only when the visitor arrives from a Google search results page
if (isset($_SERVER['HTTP_REFERER']) && strstr($_SERVER['HTTP_REFERER'], "google")) {
    echo $ads;
}
?>

or a more generic one: display ads only on referral traffic by filtering out hits generated from your own domain:
<?php
// HTTP_HOST is already a bare host name, so only the referer needs parsing
$referer_host = isset($_SERVER['HTTP_REFERER'])
    ? parse_url($_SERVER['HTTP_REFERER'], PHP_URL_HOST)
    : '';
if ($referer_host && $referer_host !== $_SERVER['HTTP_HOST']) {
    echo $ads;
}
?>

5. Increasing bounce rate
I know this may sound very frustrating, as everyone out there is suggesting just the opposite, but after watching the video you may start thinking about it.

6. Update your ads with the asynchronous version of the code
This will improve your page loading times and display the ads faster to the end user. Just go to My ads, click on the preferred ad -> Get code, and from the select box choose the beta async version.

7. Don't mix ad channels
If you use one ad code on different websites placed at different positions, this will interfere with and alter the ad's basic parameters, and with time will limit your overall cost-per-click value. This is especially true if you are using targeting options.

That's it, don't forget to write unique content and I hope this post helps you actually increase your AdSense earnings!

Saturday, April 07, 2012

SEO thin content check

Thin content is not easy to explain, but because the term became more popular during the Panda update, here are some things you can do to present your website in a more favorable light to the search engines.
You can learn a bit more about the SEO topic from my online course.
Some examples and fixes of thin content follow:

1. Target: Duplicate content caused by session, referral or page order/filtering parameters appended to the end of the URL, like ?orderby=desc, which don't change the actual content of the page or just reorder the same content. The same applies if your website has AJAX back-button navigation, a login system with session IDs appended to the end of the URL, or frames with tracking IDs attached. Just look at the different URLs in the picture below, all representing the same content:
URL parameters like session IDs or tracking IDs cause duplicate content, because the same page becomes accessible through numerous URLs.
Solution (to session appended URLs):
After a long search, the following technique from webmasterworld's member JDmorgan succeeded in getting ~90% of my website content fully indexed. Here is how to implement this technique in practice using Apache .htaccess.
Just put the following lines in your .htaccess file and test:

1) Strip query strings from .html URLs, so that only clean .html pages get spidered
#redirect any .html request carrying a query string to the bare .html URL
RewriteCond %{QUERY_STRING} .
RewriteRule ^([^.]+)\.html$ http://your_website.com/$1.html? [R=301,L]
2) Remove the PHPSESSid parameter from the URL query string when a page is requested by bots
#remove URL sessionids
RewriteCond %{HTTP_USER_AGENT} Googlebot [OR]
RewriteCond %{HTTP_USER_AGENT} Slurp [OR]
RewriteCond %{HTTP_USER_AGENT} msnbot [OR]
RewriteCond %{HTTP_USER_AGENT} Teoma
RewriteCond %{QUERY_STRING} ^(([^&]+&)+)*PHPSESSid=[0-9a-f]*&(.*)$
RewriteRule ^$ http://your_web_site.com/?%1%3 [R=301,L]

2. Target: 301 header redirects chain
A chain of 301 redirects can cause a loss of PageRank, i.e. lead to thin content. So please check that your 301 redirects are final, i.e. they point to an end page and not to another redirect page. You can use the LiveHTTPHeaders extension to do this kind of check.

Solution: fix your redirects!
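Such chains can be spotted with a small Python sketch (the redirect map below is a made-up stand-in for the Location headers you would collect with curl -IL):

```python
def redirect_chain(url, redirects, limit=10):
    """Follow a URL through a mapping of 301 redirects and return all the hops."""
    chain = [url]
    while url in redirects and len(chain) <= limit:
        url = redirects[url]
        chain.append(url)
    return chain

redirects = {
    "/old-page": "/newer-page",
    "/newer-page": "/final-page",  # a chain: two hops instead of one
}
print(redirect_chain("/old-page", redirects))
# → ['/old-page', '/newer-page', '/final-page']
```

Any result longer than two entries is a chain: point the first URL directly at the final destination.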

3. Target: Because it is thin
Pages with fewer than 150 words of content, or with fewer than 10 visits during the whole year. You can check the latter in Google Analytics by looking at your content pages ordered by pageviews, with a time range of one year backward. Find and fix those URLs!

Solution: Either remove/nofollow the pages, block them with robots.txt, or rewrite/merge the content.
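A quick word-count pass can flag candidate pages (a sketch using the 150-word threshold mentioned above; the sample pages are invented):

```python
def thin_pages(pages, min_words=150):
    """Return the URLs whose text falls under the word-count threshold."""
    return [url for url, text in pages.items() if len(text.split()) < min_words]

pages = {
    "/long-article": "word " * 400,  # 400 words: fine
    "/stub": "word " * 40,           # 40 words: thin
}
print(thin_pages(pages))  # → ['/stub']
```

In practice you would feed this with the text extracted from your actual pages, then decide per URL whether to rewrite, merge or block.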

4. Target: Heavy internal linking:
By placing multiple links on a page to pages/tags/categories, you are reducing that page's power. Only a few pages, supported by lots of incoming internal links, are considered not thin by the Panda algorithm.

Solution: Clean up the mistaken links on the page by adding rel="nofollow" to the outgoing links, or better, remove (or rearrange to the bottom) the whole section (tag cloud, partner links, etc.) from your website.

5. Target: Percentage of URLs having thin content
Google maintains two indexes: primary and supplemental. Everything that looks thin or unworthy (i.e. doesn't have enough backlinks) goes into the supplemental one. One factor in determining thin content is the ratio of a website's primary-indexed pages (available via search) to its supplemental pages, so the more pages you keep in Google's primary index, the better. It is possible that your new (already fixed) and old (thin) content now fight for position in Google's search results. Remember that the old content already has Google's trust, with its earlier creation date and links pointing to it, but it is still thin!

Solution: Either redirect the old to the new URL via a 301 permanent redirect, or log in to Google's Webmaster Tools, then from Tools -> Remove URL type your old URLs and wait. But before this you'll have to manually add meta noindex, nofollow to them and remove all restrictions in your robots.txt file, in order to get Google to apply the noindex, nofollow attributes.

Q: How can you find thin content URLs more effectively?
Sometimes when you try to find indexed thin content via site:http://yourwebsite.com, you won't see the full list.

Solution:
  • use the "-" operator in your query:
    First do a site: search, and then consecutively remove the known, valid URLs from the results.
    "site:http://yourwebsite.com -article"
    will remove all URLs like article-5.html, article-100.html, etc. This way you'll see the thin content pages more quickly.
  • when you know the thin content page name, just do
    site:http://yourwebsite.com problematic_parameter
    (e.g. "site:http://yourwebsite.com mode" will show all of the indexed modes of your website, like mode=new_article, mode=read_later, mode=show_comment, etc. Find the wrong ones and file a removal request for them.)

Enjoy and be welcomed to share your experience!
---
P.S. If you don't have access to the .htaccess file, you can achieve the above functionality using the canonical tag - just take a look at these SEO penalty checklist series.
More information on the effect of dynamic URLs on search engines, as well as how to manage them using Yahoo's Site Explorer, can be found here: https://web.archive.org/web/20091004104302/http://help.yahoo.com/l/us/yahoo/search/siteexplorer/dynamic/dynamic-01.html

Monday, July 19, 2010

SEO iframes and redirects

Hidden redirects
Do you know the difference between these two custom not-found error pages? (Where to find them? Hint: look in your .htaccess file.)
ErrorDocument 404 http://your_website.com/error404.php

ErrorDocument 404 error404.php
It turns out that the first line returns a 302 Found header code and then redirects to your 404 page, which is a really bad thing from an SEO standpoint and gets penalized. The second line gives you a normal 404 page returning a proper 404 header code.

Too many 301 redirects
Can you recognize this code?
RewriteRule (.*) http://www.newdomain.com/$1 [R=301,L] 
You may think it is OK to redirect your old domain to a new one this way (for example when the old domain has a Panda penalty applied). But what happens if the old domain already has some kind of penalty? It automatically transfers to your new domain, because, as you may have noticed, 301 is a PERMANENT redirect and transfers all the weight from the previous domain. So go check and fix both cases, and be really careful!

Usage of iframes between subdomains
On one website (~500 pages) with over 300 pages indexed in Google, I used an iframe linking to another sub-domain in order to display relevant content. When I removed the iframe, almost immediately, in less than 24 hours, my indexed results grew from 300 to 360.
But why?
I started searching the forums, and it appeared that Google's penalty filter was triggered by such heavy usage of iframes (mistakenly taken as a poisoning attack). Here is a short explanation from Matt Cutts:
"Essentially, our search algorithm saw a large area on the blog that was due to an IFRAME included from another site and that looked spammy to our automatic classifier."
link: http://groups.google.com/group/Google_Webmaster_Help-Indexing/browse_thread/thread/68692a9aefae425f

Solution:
Remove all the iframes that you have or replace them with ajax calls or just static HTML content.
Wait a few days and run: site:http://yourwebsite.com to see the difference in the results!

Good luck!

Thursday, December 03, 2009

SEO keyword density, canonical, duplicate content

Above is a sample screenshot taken from Google's Webmaster Tools Keywords report. You may ask why we need it when we can use Firefox's integrated Show keyword density function.
Well, one benefit is that this report shows specific keyword significance across your pages. Let me explain what this means:

Suppose that you are optimizing content for the keyword 'cars'. It's a normal practice to repeat 'cars' 2-3 times, style it in bold, etc... Everything's good as long as you do it naturally. The moment you overstuff your page with this keyword it will get penalized and lose its current ranking in Google SERPS. So you have to be careful with such repetitions.
Moreover, in the report you can see the overall website keyword significance. And because Google likes thematic websites, it is really important for these keywords to reflect your website's purpose or theme. Otherwise, you're just targeting the wrong visitors and shouldn't be puzzled by a high abandonment rate.

But enough with the theory; now let's discuss how you can fix some things up:

Check every individual webpage's keyword density online via Webconfs and correct (reduce) words that are used over 2%. Again, this percentage depends on how competitive your keywords are, so tolerable levels can vary up and down.
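If you prefer to compute the density yourself rather than rely on an online tool, the calculation is simple. Here is a minimal sketch (the sample page text is made up):

```python
import re

def keyword_density(text, keyword):
    """Percentage of all words in `text` that are exactly `keyword`."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    if not words:
        return 0.0
    return 100.0 * words.count(keyword.lower()) / len(words)

page = ("Cars for sale: used cars, new cars and classic cars. "
        "Browse our listing of vehicles and find a great deal today.")
density = keyword_density(page, "cars")
print(round(density, 1))  # 19.0 - far above a 2% threshold
```

A result like this would flag the page as keyword-stuffed and worth rewriting.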

Add the 'canonical' tag to all your website pages:
<link rel="canonical" href="http://www.example.com/your_preferred_webpage_url.html" />
(and make sure to specify the URL that you really prefer!). This will reveal to the search engine what your legitimate webpage is. More info: http://googlewebmastercentral.blogspot.com/2009/02/specify-your-canonical.html



Blogger users can achieve adding canonical with the following code at the head section in the Template:
<b:if cond='data:blog.pageType == "item"'>
<link expr:href='data:blog.url' rel='canonical'/>
</b:if>
(it will remove the parameters appended at the end of the URL such as http://nevyan.blogspot.com/2016/12/test.html?showComment=1242753180000
and specify the original authority page: http://nevyan.blogspot.com/2016/12/test.html )
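What the canonical tag accomplishes here, collapsing parameterized duplicates onto one URL, can be illustrated with a small Python sketch:

```python
from urllib.parse import urlsplit, urlunsplit

def canonical_url(url):
    """Keep scheme, host and path; drop the query string and fragment."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))

duplicate = "http://nevyan.blogspot.com/2016/12/test.html?showComment=1242753180000"
print(canonical_url(duplicate))
# http://nevyan.blogspot.com/2016/12/test.html
```

Every parameterized variant of a post maps back to the single authority page.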

Next, to prevent duplicate references of your archive (i.e. .../2009_01_01_archive.html) and label pages (i.e. /search/label/...) from getting indexed, just add:
<b:if cond='data:blog.pageType == "archive"'>
<meta content='noindex,follow' name='robots'/>
</b:if>
<b:if cond='data:blog.pageType == "index"'>
<b:if cond='data:blog.url != data:blog.homepageUrl'>
<meta content='noindex,follow' name='robots'/>
</b:if>
</b:if>

To prevent indexing of mobile (duplicates) of the original pages:
    <b:if cond="data:blog.isMobile">
<meta content='noindex,nofollow' name='robots'/>
</b:if>

And a working solution blocking even the /search/ label pages from indexing, allowing only the homepage and posts to be indexed:
    <b:if cond='data:blog.pageType == "index" and data:blog.url != data:blog.homepageUrl'>
<meta content='noindex,follow' name='robots'/>
</b:if>

Sunday, April 15, 2007

Deoptimising - a new way of SEO

I'll start this post with the assumption that you could have experienced problems with the Google ranking algorithms.
Let's say you are writing new content, and for certain keywords, instead of ranking first, it appears immediately on the 3rd page or even at the end of the search results. A possible reason might be that your website has lost its trust-rank and has the so-called 950 penalty applied by Google. Here is what you can do in order to restore your rankings. In the meantime, you can learn a bit more about SEO from my online course.

Check whether you need to deoptimize:
If you want to see all the non-supplemental pages from your site, just type in Google:
site:www.yourwebsite.com/*

Then to see just the supplemental pages from your site type:
site:www.yourwebsite.com/ -site:www.yourwebsite.com/*

Keep the ratio below 50%

Also try the automated supplemental ratio tool: http://www.mapelli.info/tools/supplemental-index-ratio-calculator

What to do next:
1. Validate your website

2. If you use plenty of H1, H2, H3 tags, remove most of them or replace 'H1' with a stylized 'H2' or with 'span' and 'strong' tags.

3. Don't use the same data in the 'title', 'h1' and 'meta description' tags.

4. Check your website's internal navigation for repeated linking structure: using the same 'title' attributes in the menu on every page of your site is considered spam.

5. Your affiliate/referral links should differ in their anchor text. Check that the 'title' attributes are unique and avoid keyword stuffing there.
Pay special attention to 'title' and 'alt' attributes: if they are overstuffed, Googlebot will just place the first few lines of your page as the description in its search results, which turns out to be your repeating website heading information.
Solution: examine what your search results look like (site:http://www.yourwebsite.com), see what exactly Google indexes, and make the corresponding changes, i.e. reduce the 'title' attributes.

6. Remove any static content from the footer of your website, especially the outbound links, etc.

7. Check whether your affiliate links are thematic or not. Remove those that are not connected to your site's theme, or add rel="nofollow" to them. If your site displays RSS feeds, be sure to add rel="nofollow" there as well.
Update: Try this tool to find whether you are linking to bad neighborhood websites: http://www.bad-neighborhood.com/text-link-tool.htm

8. Lower your content keyword density
http://www.webconfs.com/keyword-density-checker.php
Keyword density is an important factor if you're serious about SEO. Once you have great content, the crucial part is being able to present it to the right public.
A keyword density above, for example, the threshold of 2-4% will mark your content as thin, and it won't compete with other websites in the SERPS.

- check out your navigation (posts archive) - too many links to posts with keywords in their titles increase the overall post keyword density so be careful.
- when using forms you may also look over your hidden field values: do not use keywords there - it's an easily misunderstood issue.

- also, don't forget to check your content for being detected as potential spam:
http://tool.motoricerca.info/spam-detector/

9. Limit the usage of 'title' attributes in <a href> tags as well as <b>/<strong> tags: they add weight to the content presented in the SERPS.

These might sound like drastic measures, but I've already managed to get 3 websites out using the above techniques. So experiment and see what happens. Wait, and hopefully soon you'll be out of Google's supplemental index too.

Load your navigation/advertising section using an AJAX request, not purely via JavaScript.

Check google's webmaster's tools and fix if there are any potential duplicate issues.

Check all of your sub-domains for supplemental results and fix them as soon as possible.
You know the benefits of organic SEO's long-lasting effect versus link-driven short-term success. Here are some simple steps for your website to ensure a long-term flow of quality traffic from happy visitors.

My websites, full of unique content and constantly expanding, had a problem: plenty of pages gradually went into the supplemental index (i.e. only 50 out of 500 results were in the main index). After lots of experimenting and reading, below are my guidelines on how to do organic on-page optimization, or how to easily get more of your content indexed:

Paginated results
Ensure a unique meta description wherever you can on your website, even on paginated results: if you have an article with lots of comments, then on the 2nd and subsequent comment pages strip the article text, leaving just the comments. This way you'll create a brand new unique content page, just like in forums.

Repetition and bold text
Pay special attention to em (italic) and bold tags: they add weight to the web page, and if repeated across pages they could trigger Google's penalty filter. Remove repeated word occurrences such as 'Reply to this comment', 'Vote', etc., replacing such text using unobtrusive JavaScript.
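One way to spot such repeated template text is to intersect the word n-grams of several pages: whatever survives on every page is boilerplate rather than content. A rough sketch, with made-up page snippets:

```python
import re

def repeated_phrases(pages, n=3):
    """Return the word n-grams that occur on every page."""
    def ngrams(text):
        words = re.findall(r"[a-z']+", text.lower())
        return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

    common = ngrams(pages[0])
    for page in pages[1:]:
        common &= ngrams(page)  # keep only phrases present everywhere
    return common

pages = [
    "Great article about engines. Reply to this comment below.",
    "My take on tyres and wheels. Reply to this comment below.",
]
print(sorted(repeated_phrases(pages)))
```

Phrases like "reply to this" survive the intersection, flagging them as template text worth hiding from crawlers.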

Unique heading and meta descriptions
Look especially at the headings like H1, H2, and make sure that they are unique and not repeating.

Loading time
Improve loading time: inspect page loading time with Yahoo's YSlow and Google's PageSpeed browser addons and try to make most of the suggested improvements. You can also press F12 to open up Developer Tools in Chrome or Firefox and inspect your content from the Network tab to detect slow-loading elements:


Let's recap the main speed improvements:
- make cached versions of your pages
- use asynchronous Google Analytics and social sharing buttons such as Facebook, Twitter, etc.
- place all your JavaScript at the bottom of your page - this way the page content will load first
- gzip your CSS and JS files
- beware of WordPress's wp-cron.php file: it hogs CPU resources and might get you banned from your hosting provider. Just rename it, or find where it is used and disable all calls (includes) to this file.

Blogger users
If you use Blogger's hosting, use this sitemap tool and provide the generated sitemap in Google's Search Console. The benefit is that this way you can submit all your posts for indexing.

Canonical urls
Check whether your website is listed in the SERPS via http://yourwebsite.com, http://www.yourwebsite.com or http://yourwebsite.com/index.php

If that's the case you'll have to:
1. Manually rewrite all your internal links to the already indexed/preferred (www or non-www) version.
2. Permanently redirect using .htaccess mod_rewrite your http://yourwebsite.com/index.html or http://yourwebsite.com/index.php page to the root domain of your indexed website URL (i.e http://www.yourwebsite.com/).
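A helper for step 1 might look like the sketch below: it normalizes any internal link to a single preferred form (here, arbitrarily, the www version, with index pages folded into the directory root):

```python
from urllib.parse import urlsplit, urlunsplit

def preferred_url(url):
    """Normalize a link to the www host and fold index pages into the root."""
    parts = urlsplit(url)
    host = parts.netloc
    if not host.startswith("www."):
        host = "www." + host
    path = parts.path
    if path.endswith(("/index.html", "/index.htm", "/index.php")):
        path = path.rsplit("/", 1)[0] + "/"  # /index.php -> /
    return urlunsplit((parts.scheme, host, path, parts.query, parts.fragment))

print(preferred_url("http://yourwebsite.com/index.php"))
# http://www.yourwebsite.com/
```

Running every internal link through one such function guarantees a single indexed location per page; if you prefer the non-www version, invert the host rule.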

Link weight

Use the Supplemental Results Detector http://www.seo4fun.com/php/pagerankbot.php to distribute link weight evenly between your pages.

Source Ordered Content
Order your HTML source so the main content comes first for search engines, positioning it visually via CSS. This is especially important for the new Panda update.

Log your results
Write in a text file the date on which you make changes, then check your statistics the next week to determine whether they are beneficial or not.

Friday, January 12, 2007

Duplicate website content and how to fix it

Here I'll present a practical way on how to avoid the duplicate content penalty.

When is this penalty applied?
This kind of penalty is applied by search engines such as Google when there is an indication of two identical versions of your site's content.

How can your website become a victim of such a penalty?
Modern content management systems (CMS) and community forums offer numerous possibilities for managing new content, but because of their deep structure their URLs are very long, so search engines are unable to fully spider the site.
The solution for webmasters was to rewrite the old URL, so the index.php?mode=look_article&article_id=12 URL now becomes just article-12.html. As a first step this serves its purpose, but if left like this, both URLs are going to be indexed. If we look through the eyes of a search engine, we'll see the same content in 2 instances, and of course the duplicate filter is triggered:
I-st instance: index.php?mode=look_article&article_id=12

II-nd instance: article-12.html
Easy solution
The solution is done via the PHP language and the Apache .htaccess file.
First off, we'll rewrite our URLs so they are search-friendly. Let's assume that we have to map index.php?mode=look_article&article_id=... to article-....html

Create an empty .htaccess file and place this code in it. First, edit the code and fill in your website address. If you don't have a subdomain, erase the subdomain part as well.
RewriteEngine on

RewriteRule article-([0-9]+)\.html    http://www.yourwebsite/subdomain/index.php?mode=look_article&article_id=$1&rw=on

RewriteCond %{the_request} ^[A-Z]{3,9}\ /subdomain/index\.php\ HTTP/
RewriteRule index\.php http://www.yourwebsite/subdomain/ [R=301,L]

RewriteCond %{HTTP_HOST} .
RewriteCond %{HTTP_HOST} ^www\.yourwebsite\.subdomain [nc]
RewriteRule ^(.*)$ http://yourwebsite/subdomain/$1 [R=301,L]

Explanation:
  • RewriteRule article-([0-9]+)\.html http://www.yourwebsite/subdomain/index.php?mode=look_article&article_id=$1&rw=on
    Those lines allow article-12.html to be loaded internally as index.php?mode=look_article&article_id=12
    The variable &rw=on is important for the later PHP code. So don't forget to include it.
  • RewriteCond %{the_request} ^[A-Z]{3,9}\ /subdomain/index\.php\ HTTP/
    RewriteRule index\.php http://www.yourwebsite/subdomain/ [R=301,L]
    These lines avoid index.php being considered a separate page (which would lower your website's PR) and will transfer all the PR from index.php to your domain.
  • RewriteCond %{HTTP_HOST} .
    RewriteCond %{HTTP_HOST} ^www\.yourwebsite\.subdomain [nc]
    RewriteRule ^(.*)$ http://yourwebsite/subdomain/$1 [R=301,L]
    This will avoid duplicate URLs such as www and non-www and transfer all the requests and PR to the non-www site.
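To make the capture group in the first rule concrete, here is the same mapping expressed in Python (only a sketch; Apache performs this internally):

```python
import re

def rewrite(path):
    r"""Mirror of: RewriteRule article-([0-9]+)\.html index.php?mode=look_article&article_id=$1&rw=on"""
    match = re.fullmatch(r"article-([0-9]+)\.html", path)
    if match:
        # $1 in .htaccess corresponds to the first capture group here
        return "index.php?mode=look_article&article_id=%s&rw=on" % match.group(1)
    return path  # anything else passes through untouched

print(rewrite("article-12.html"))
# index.php?mode=look_article&article_id=12&rw=on
```

The visitor and the search engine only ever see article-12.html, while the server serves the dynamic script.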

Then create a file header.php and include it in your website before all other files:

Put there:

<?php
$rw = isset($_GET['rw']) ? $_GET['rw'] : '';
if ($rw == "on") { echo "<meta content=\"index,follow\" name=\"robots\" />"; }
else { echo "<meta content=\"noindex,nofollow\" name=\"robots\" />"; }
?>

This will tell the search engine to index only the pages that have the rw flag set to on. These are the article-12.html style pages set up earlier.

Of course, if you have access to the robots.txt file in your root domain, then you can just block the look_article file there and you are done:
User-agent: *

Disallow: /look_article.php



Notes: For those using a CMS, check whether your pages are still accessible using different parameters in the URL.
Example: you've deleted an article with id=17, but the empty template would still be accessible, producing a 200 OK header status code. This will surely be recognized as thin content by Google.
Solution:
1. Find those empty pages and give them a 404 Not Found header status code:

header("HTTP/1.1 404 Not Found");


2. Create error404.html file explaining that the user is trying to access a non-existent page.

3. Then add in your .htaccess file the custom 404 error page:
ErrorDocument 404 /your_domain_name/error404.html

This way the search engine spider won't penalize your template for displaying empty information: it will now see those pages as 404 not-found documents.

The next step involves cleaning up an already indexed but duplicated website content in order to regain the search engine's trust.


Monday, January 02, 2006

How to get out of Google Sandbox

google logo
Ever wondered why a particular website might get fewer and fewer visits?
One reason for this might be that it is inside google's sandbox, so it gets no traffic from Google queries. In such situations, the following could be experienced:
1. Drop in the total number of website visitors coming from google.com.
2. A sudden drop in google's PageRank of all website pages.
3. When querying google on specific keywords - the website appears in the last 2-3 pages of the search results or is totally banned from google's listing.

How to check if the website is within Sandbox?

If you wish to check whether a sandbox has been applied to a particular website then try the following methods:

I Method
Use this web address to check the indexing of your website pages against a specific keyword:
http://www.searchenginegenie.com/sandbox-checker.htm
II Method
which is much more reliable: just run your web browser and go to http://www.google.com
then type in the search box:
www.yourwebsite.com -asdf -asdf -asdf -fdsa -sadf -fdas -asdf -fdas -fasd -asdf -asdf -asdf -fdsa -asdf -asdf -asdf -asdf -asdf -asdf -asdf
If your website appears in the search results and has good keyword ranking then your website is in google's sandbox.

III Method

Run your web browser, go to http://www.google.com and type:
site:www.yourwebsite.com
If there are no results found, then your website is out of Google's indexing database. The difference between non-indexed fresh websites and sandboxed ones is that on the sandboxed ones you'll not see: If the URL is valid, try visiting that web page by clicking on the following link: www.yourwebsite.com

IV Method
When running google query then add at the end of the URL:
&filter=0

This will show all the results from the primary and supplemental google index of your website. If your website has been penalized then its results will reside in the supplemental index.

How to get your website out of Google's Sandbox

Next follows a guide on how to get a website out of Google's sandbox by applying the following techniques:

* Have a website structure no deeper than the 3rd level (i.e. don't put content more than 3 links away from the homepage, because the crawler/spider might stop crawling it).

* rewrite the meta tags to state explicitly which pages should not be indexed. For this, put in the head section of a page:
<meta name="robots" content="index, follow"> - for webpages that should be indexed
<meta name="robots" content="noindex, nofollow"> - for webpages that you don't want indexed
* delay the crawling machines
This is important especially if your hosting server doesn't provide fast bandwidth :
In your robots.txt file put:
User-agent: *
Crawl-delay: 20

You can also adjust the Crawl delay time.
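You can verify how a crawler reads these directives with Python's built-in robots.txt parser (the disallowed path below is just an example):

```python
from urllib.robotparser import RobotFileParser

robots_txt = """\
User-agent: *
Crawl-delay: 20
Disallow: /look_article.php
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# A well-behaved bot should wait 20 seconds between requests...
print(parser.crawl_delay("Googlebot"))
# ...and skip the disallowed path entirely
print(parser.can_fetch("Googlebot", "http://example.com/look_article.php"))
```

This is a quick way to sanity-check robots.txt edits before deploying them.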
* remove the duplicate or invalid pages from your website that are still in google's index/cache: First prepare a list of all the invalid pages. Then use google's webpage about urgent URL removal requests:
https://www.google.com/webmasters/tools/url-removal
Ensure that those pages are no longer indexed by typing your full website address in Google with site:your_website.com. If there are no results, this means you've succeeded in getting the pages out of Google's index. It may sound strange, but this way you can reindex them again. When ready, remove all the restrictions that you might have in .htaccess and webpage headers (noindex, nofollow).
Next, go to http://www.google.com/addurl/?continue=/addurl , put your website in the field for inclusion and wait for the re-indexing process to start.
During the waiting process, you can start getting links from forums and article directories to your quality content, which should point not only to your top-level domain but also to specific webpages.
For example: not only <a href="www.website.com"> but also <a href="www.website.com/mywebpage1.html">


* remove javascript and meta-refresh redirects
Check whether you are using meta-refresh redirects on your website. For example:
<meta http-equiv="refresh" content="5; url=http://www.website.com/filename.php">
If so, remove them, because they are treated as spam by Google's bot.
How: You can check your whole website by using the software as Xenu Link Sleuth
http://home.snafu.de/tilman/xenulink.html
Download and start the program. The whole process is straightforward: just type your website address in the input box and start the check (click on File->Check URL; that brings up a form for you to fill in with your website's URL).
This tool will check every page on your website and produce a report. If you see 302 redirects in the report, beware and try to fix them too. Using Xenu you can also check your website for broken links, etc.

* disavow 302 redirects from other sites
Check whether websites linking to you give HTTP response code 200 OK.
In Google's search box type allinurl: http://www.yoursite.com, then check every website other than yours by entering it here:
http://www.webrankinfo.com/english/tools/server-header.php
and look for HTTP response code 200 OK.

If any of them give a 302 header response code, try to contact the administrator of the problematic website to fix the problem. If you think they are stealing your PageRank, report them via Google's spam report page:
http://www.google.com/contact/spamreport.html
with a checkmark on Deceptive Redirects. As a last resort, you can also place the URL in google's disavow tool to clean up your backlink profile: https://www.google.com/webmasters/tools/disavow-links-main

For the next steps you will need access to your web server .htaccess file and have mod_rewrite module enabled in your Apache configuration:

* Make static out of dynamic pages
Using mod_rewrite you could rewrite your dynamic page URLs to look like static ones. So if you've got a dynamic .php page with parameters you could rewrite the URL to look like a normal .html page:

look_item.php?item_id=14

for the web visitor will become:

item-14.html
HOW: You have to add the following lines to your .htaccess file (placed in the root directory of your web server):
RewriteEngine on
RewriteRule item-([0-9]+)\.html /sub_directory/look_item.php?item_id=$1 [L]
To transfer the previously accumulated PR and backlinks,
type in Google's search box:

site: http://your_website.com
This query will show all the website's indexed pages. Since you've been moving to search-engine-preferred (static .html) URLs, it would be good to transfer the PR and links accumulated by the dynamic .php URLs to the corresponding static .html URLs. Here is an example of transferring a .php URL request to a static page URL of the website.
HOW: Add the following line in your .htaccess file:

RewriteRule look_item\.php http://website.com/item.html [L,R=301]
where 301 means Moved Permanently, so the Search Engine Bot will map and use http://website.com/item.html instead of look_item.php as a legitimate source of information.

* Robots.txt check
To prevent Google from spidering both .html and .php pages and treating them as duplicate content (which is bad), place a canonical tag on the web page version that you prefer and empty your robots.txt file, so that Google will consolidate both the PHP and HTML pages into one preferred .html version:


Important note: if you already have .php files in the Google index and don't want to use the canonical tag, you can place the meta attributes noindex, nofollow in the .php versions of the files, which requires a little more effort.
* Redirect www to non-www URLs
Just check the indexing of your website with and without the preceding "www". If you find both versions indexed, you are surely losing PageRank and backlinks and promoting duplicate content to Google. This happens because some sites link to you with the www prefix, and some prefer the bare domain version. It's hard to control whether the sites linking to your website use "www" or "non-www". Here Apache Redirect and Rewrite rules come to the rescue to transfer the www URLs of your website to non-www URLs. Again, to avoid PR loss and duplicates, you will want your website to be accessible from only 1 location.
HOW: Place at the end of your .htaccess file the following lines:
RewriteCond %{HTTP_HOST} ^www\.your_website\.com [nc]
RewriteRule (.*) http://your_website.com/$1 [R=301,L]
When finished with the redirect from www to non-www and https to http versions of your website or vice-versa, specify your preferred version in google webmaster tools. 

* Redirect index.php to root website URL
There is one more step for achieving non-duplicated content. You must point your index.html, index.htm, index.asp or index.php to the ./ or root of your website.
HOW: Insert in your .htaccess file the following lines before the previously mentioned two lines:
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /subdomain/index\.php\ HTTP/
RewriteRule index\.php http://yourwebsite.com/subdomain/ [R=301,L]
Note: if your website is hosted under a subdomain, fill its name in the /subdomain part; if not, just delete /subdomain. You can replace index.php with index.html, index.asp or whatever suits you.

* Have custom error 404 page:
in your .htaccess file type:
ErrorDocument 404 /your_website_dir/error404.html
Then create a custom webpage named error404.html to instruct the user what to do when they come across a non-existent page. Finally, check that the 404 page actually returns a 404 Not Found, and not a 200 OK, header status code.

Congratulations: by following those steps your website will be re-indexed soon. In a few days it will be re-crawled and out of the sandbox, and hopefully the advice above will help you achieve better indexing for your website.

Cheers!
