Search Engines vs. SEO Spam: Statistical Methods
November 16, 2009 by IBI
High placement in a search engine is critical for the success of any online business. Pages that appear higher in the search results for queries relevant to a site’s business get more targeted traffic. To gain this competitive advantage, Internet companies employ various SEO techniques to optimize the factors search engines use to rank results.
In the best case, SEO specialists create relevant, well-structured, keyword-rich pages that not only please a search engine crawler but also have value for the human visitor. Unfortunately, this strategic approach takes months to produce tangible results, so many search engine optimizers resort to so-called “black-hat” SEO.
‘Black Hat’ SEO and Search Engine Spam
The oldest and simplest “black-hat” SEO strategy is adding a variety of popular keywords to web pages to make them rank high for popular queries. This behavior is easily detected, since such pages generally include unrelated keywords and lack topical focus. With the introduction of term vector analysis, search engines became immune to this sort of manipulation. However, “black-hat” SEO went one step further and created so-called “doorway” pages: tightly focused pages consisting of a bunch of keywords relevant to a single topic. In terms of keyword density such pages are able to rank high in search results, but human visitors never see them because they are redirected to the page intended to receive the traffic.
Another trend is abusing link-popularity-based ranking algorithms, such as PageRank, with the help of dynamically generated pages. Each such page receives the minimum guaranteed PageRank, and the small endorsements from thousands of these pages can add up to a sizeable PageRank for the target page. Search engines constantly improve their algorithms to minimize the effect of “black-hat” SEO techniques, but SEOs persistently respond with new, more sophisticated and technically advanced tricks, so the process resembles an arms race.
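To get a feel for how many tiny endorsements add up, here is a rough sketch of the textbook PageRank power iteration applied to a toy link farm. The damping factor of 0.85, the ring of ordinary "background" sites, and the names `pagerank` and `target_share` are all assumptions made for illustration; real search engines use far more elaborate variants.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Textbook PageRank by power iteration.

    links maps each page to the list of pages it links to.
    Pages without outgoing links spread their rank evenly over all pages.
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        dangling = sum(rank[p] for p in pages if not links.get(p))
        # Every page gets the same guaranteed minimum share.
        new_rank = {p: (1 - damping) / n + damping * dangling / n for p in pages}
        for page, targets in links.items():
            if targets:
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
        rank = new_rank
    return rank


def target_share(farm_size, background=200):
    """PageRank of a hypothetical 'target' page endorsed by farm_size generated pages."""
    # Ordinary sites linking to each other in a ring (an arbitrary toy web).
    links = {f"site{i}": [f"site{(i + 1) % background}"] for i in range(background)}
    # Generated pages whose only purpose is to link to the target.
    links.update({f"farm{i}": ["target"] for i in range(farm_size)})
    links["target"] = []
    return pagerank(links)["target"]


if __name__ == "__main__":
    for size in (0, 10, 1000):
        print(size, "farm pages ->", target_share(size))
```

In this toy setup the target's share of the total rank grows with the size of the farm, which is the effect the paragraph above describes.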
“Black-hat” SEO is responsible for the immense amount of search engine spam: pages and links created solely to mislead search engines and boost rankings for client web sites. To weed out web spam, search engines can use statistical methods that compute distributions for a variety of page properties. The outlier values in these distributions can be associated with web spam. The ability to identify web spam is extremely valuable to search engines, not just because it allows them to exclude spam pages from their indices but also because those pages can be used to train more sophisticated machine learning algorithms capable of battling web spam with higher precision.
Using Statistics to Detect Search Engine Spam
An example of applying statistical methods to detect web spam is presented in the paper “Spam, Damn Spam, and Statistics” by Dennis Fetterly, Mark Manasse and Marc Najork of Microsoft [1]. They used two sets of pages downloaded from the Internet. The first set was crawled repeatedly from November 2002 to February 2003 and consisted of 150 million URLs. For each page the researchers recorded the HTTP status, time of download, document length, number of non-markup words, and a vector indicating the changes in page content between downloads. A sample of this set (751 pages) was inspected manually and 61 spam pages were discovered, or 8.1% of the sample, with a confidence interval of 1.95% at 95% confidence.
Another set was crawled between July and September 2002 and comprises 429 million pages and 38 million HTTP redirects. For this set the following properties were recorded: URL, URLs of outgoing links; for the HTTP redirects – the source and the target URL. 535 pages were manually inspected and 37 of them were identified as spam (6.9%).
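The sample figures above can be checked with a few lines of arithmetic. The sketch below assumes the interval was computed with the usual normal approximation for a proportion (z = 1.96 for 95% confidence); the article does not say which method was used, and the function name `proportion_with_margin` is made up for this example.

```python
import math


def proportion_with_margin(hits, sample_size, z=1.96):
    """Return (proportion, margin of error) using the normal approximation."""
    p = hits / sample_size
    margin = z * math.sqrt(p * (1 - p) / sample_size)
    return p, margin


if __name__ == "__main__":
    # (label, spam pages found, pages inspected) as quoted in the text above
    for label, hits, n in [("Set 1", 61, 751), ("Set 2", 37, 535)]:
        p, margin = proportion_with_margin(hits, n)
        print(f"{label}: {p:.1%} spam, +/- {margin:.2%} at 95% confidence")
```

Under that assumption the Set 1 sample gives 8.1% with a margin of 1.95%, which matches the figure quoted for the first set.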
The research concentrates on studying the following properties of web pages:
– URL properties, including length and percentage of non-alphabetical characters (dashes, digits, dots etc.).
– Host name resolutions.
– Linkage properties.
– Content properties.
– Content evolution properties.
– Clustering properties.
URL Properties
Search engine optimizers often use numerous automatically generated pages to funnel their low PageRank to a single target page. Since the pages are machine generated, we can expect their URLs to look different from those created by humans. The assumption is that these URLs are longer and include more non-alphabetical characters such as dashes, slashes or digits. When searching for spam pages we should consider the host component only, not the entire URL down to the page name.
Manual inspection of the 100 longest host names revealed that 80 of them belong to adult sites and 11 refer to financial and credit-related sites. Therefore, in order to produce a spam identification rule, the length property has to be combined with the percentage of non-alphabetical characters. In the given set, 0.173% of URLs are at least 45 characters long and contain at least 6 dots, 5 dashes or 10 digits, and the vast majority of these pages appear to be spam. By changing the threshold values we can change the number of pages flagged as spam and the number of false positives.
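The rule quoted above translates almost directly into code. This is only a sketch: it applies the thresholds from the text to the host component, and the function name `looks_machine_generated` and its parameter names are invented for the example.

```python
from urllib.parse import urlsplit


def looks_machine_generated(url, min_length=45, min_dots=6, min_dashes=5, min_digits=10):
    """Flag a URL whose host name is long AND rich in dots, dashes or digits."""
    host = urlsplit(url).hostname or ""
    if len(host) < min_length:
        return False
    return (
        host.count(".") >= min_dots
        or host.count("-") >= min_dashes
        or sum(ch.isdigit() for ch in host) >= min_digits
    )


if __name__ == "__main__":
    samples = [
        "http://www.example.com/page.html",
        "http://cheap-loans-fast-cash-credit-cards-best-rates-online.example.com/",
    ]
    for url in samples:
        print(url, "->", looks_machine_generated(url))
```

Raising or lowering the keyword arguments is the code equivalent of moving the thresholds: looser values flag more pages at the cost of more false positives.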
Host Name Resolutions
One can notice that Google, given a query q, tends to rank a page higher if the host component of the page’s URL contains keywords from q. To exploit this, search engine optimizers stuff pages with URLs containing popular keywords and key phrases, and set up DNS servers to resolve these host names to a single IP. Generally SEOs generate a large number of host names to rank for a wide variety of popular queries.
This behavior can also be detected relatively easily by observing the number of host names that resolve to a single IP. In our set, 1,864,807 IP addresses are mapped to only one host name, and 599,632 IPs to 2 host names. There are also some extreme cases with hundreds of thousands of host names mapped to a single IP, and the record-breaking IP is referred to by 8,967,154 host names.
To flag pages as spam, a threshold of 10,000 name resolutions was chosen. About 3.46% of the pages in Set 2 are served from IP addresses referred to by 10,000 or more host names, and manual inspection of a sample of them proved that, with very few exceptions, they were spam. A lower threshold (1,000 name resolutions, or 7.08% of the pages in the set) produces an unacceptable number of false positives.
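Counting host names per IP address needs nothing more than a dictionary. The sketch below assumes the crawl data is available as (host name, IP) pairs; that input format and the name `ips_over_threshold` are assumptions, while the default threshold of 10,000 comes from the text.

```python
from collections import defaultdict


def ips_over_threshold(resolutions, threshold=10_000):
    """Return {ip: number of distinct host names} for IPs at or above the threshold.

    resolutions is an iterable of (host_name, ip_address) pairs.
    """
    hosts_by_ip = defaultdict(set)
    for host, ip in resolutions:
        hosts_by_ip[ip].add(host)
    return {
        ip: len(hosts)
        for ip, hosts in hosts_by_ip.items()
        if len(hosts) >= threshold
    }


if __name__ == "__main__":
    # Tiny made-up data set, so a tiny threshold is used for the demo.
    pairs = [(f"keyword{i}.example", "192.0.2.1") for i in range(5)]
    pairs.append(("www.example.org", "192.0.2.2"))
    print(ips_over_threshold(pairs, threshold=3))
```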
Linkage Properties
The Web, consisting of interlinked pages, has the structure of a graph. In graph terminology, the number of outgoing links of a page is its out-degree, while the in-degree equals the number of links pointing to the page. By analyzing out-degree and in-degree values it is also possible to detect spam pages, which show up as outliers in the corresponding distributions.
In our set, for example, there are 158,290 pages with out-degree 1301, while according to the overall trend only about 1,700 such pages are expected. Overall, 0.05% of pages in Set 2 have out-degrees at least three times more common than the Zipfian distribution would suggest, and according to manual inspection of a cross section, almost all of them are spam.
The distribution of in-degrees is calculated similarly. For example, 369,457 pages have an in-degree of 1001, while according to the trend only about 2,000 such pages are expected. Overall, 0.19% of pages in Set 2 have in-degrees at least three times more common than the Zipfian distribution would suggest, and the majority of them are spam.
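One simple way to find degree values that are "three times more common than expected" is to fit a straight line to the degree histogram on log-log axes and compare each observed count with the fitted one. The sketch below does exactly that with an ordinary least-squares fit; the fitting method and the function names are assumptions for illustration, and only the factor of three is taken from the text.

```python
import math
from collections import Counter


def fit_power_law(histogram):
    """Least-squares line through (log degree, log count); returns (slope, intercept)."""
    points = [(math.log(d), math.log(c)) for d, c in histogram.items() if d > 0 and c > 0]
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def overrepresented_degrees(degrees, factor=3.0):
    """Return {degree: (observed, expected)} where observed >= factor * expected."""
    histogram = Counter(degrees)
    slope, intercept = fit_power_law(histogram)
    flagged = {}
    for degree, observed in histogram.items():
        if degree <= 0:
            continue
        expected = math.exp(intercept + slope * math.log(degree))
        if observed >= factor * expected:
            flagged[degree] = (observed, expected)
    return flagged


if __name__ == "__main__":
    # Synthetic data: counts fall off as 1/d^2, plus an artificial spike at degree 40.
    degrees = [d for d in range(1, 51) for _ in range(round(10000 / d ** 2))]
    degrees += [40] * 500
    print(sorted(overrepresented_degrees(degrees)))
```

The same function works for in-degrees and out-degrees alike, since it only looks at the list of degree values.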
Content Properties
Despite the recent measures taken by search engines to diminish the effect of keyword stuffing, this technique is still used by some SEOs who generate pages filled with meaningless keywords to promote their AdSense pages. Quite often such pages are based on a single template and even have the same number of words which makes them especially easy to detect using statistical methods.
For Set 1 the number of non-markup words in each page was recorded, so we can plot the variance of the word count across pages downloaded from a given host name. The variance is plotted on the x-axis and the word count on the y-axis; both axes use a logarithmic scale. Points on the left side of the graph, marked in blue, represent cases where at least 10 pages from a given host have the same word count. There are 944 such hosts (0.21% of the pages in Set 1). A random sample of 200 of these pages was examined manually: 35% were spam, 3.5% contained no text and 41.5% were soft errors (pages with a message indicating that the resource is not currently available, despite the HTTP status code 200 “OK”).
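The "same word count on at least 10 pages" test is a zero-variance check per host. Here is a minimal sketch, assuming the crawl data is a list of (host name, word count) pairs; the input format and the function name are invented for the example.

```python
from collections import defaultdict
from statistics import pvariance


def uniform_word_count_hosts(pages, min_pages=10):
    """Return hosts with at least min_pages pages that all share one word count.

    pages is an iterable of (host_name, non_markup_word_count) pairs.
    """
    counts_by_host = defaultdict(list)
    for host, word_count in pages:
        counts_by_host[host].append(word_count)
    return sorted(
        host
        for host, counts in counts_by_host.items()
        if len(counts) >= min_pages and pvariance(counts) == 0
    )


if __name__ == "__main__":
    pages = [("template.example", 120)] * 12
    pages += [("blog.example", 100 + i) for i in range(12)]
    print(uniform_word_count_hosts(pages))
```

As the sample above shows, a flagged host is only a candidate: soft-error pages and empty pages also have identical word counts, so the result still needs a second filter or a manual look.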
Content Evolution
The natural evolution of content on the Web is slow. Over a period of a week, 65% of all pages will not change at all, while only 0.8% will change completely. In contrast, many spam web pages, generated in response to an HTTP request independently of the requested URL, will change completely on every download. Therefore, by looking into extreme cases of content mutation, search engines are able to detect web spam.
The outliers represent IPs serving pages that change completely every week. Set 1 contains 367 such servers with 1,409,353 pages. Manual examination of a sample of 106 pages showed that 103 (97.2%) were spam, 2 were soft errors and 1 was an adult page counted as a false positive.
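A rough sketch of this check follows. It assumes each page's change vector has already been reduced to a list of week-to-week change fractions (0.0 for unchanged, 1.0 for completely changed); that representation and the name `servers_with_total_churn` are hypothetical, since the article does not describe how the vector is stored.

```python
from collections import defaultdict


def servers_with_total_churn(observations, complete=1.0):
    """Return {ip: page count} for IPs where every page changed completely every week.

    observations is an iterable of (ip_address, weekly_change_fractions) pairs,
    one pair per page. Pages with no recorded changes never count as churning.
    """
    flags_by_ip = defaultdict(list)
    for ip, changes in observations:
        always_complete = bool(changes) and all(c >= complete for c in changes)
        flags_by_ip[ip].append(always_complete)
    return {ip: len(flags) for ip, flags in flags_by_ip.items() if all(flags)}


if __name__ == "__main__":
    observations = [
        ("192.0.2.1", [1.0, 1.0, 1.0]),  # rewritten on every download
        ("192.0.2.1", [1.0, 1.0]),
        ("192.0.2.2", [0.0, 0.1, 0.0]),  # ordinary, slowly changing page
    ]
    print(servers_with_total_churn(observations))
```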
Clustering Properties
Automatically generated spam pages tend to look very similar. In fact, as already said above, most of them are based on the same model and have only minor differences (like inserting varying keywords into a template). Pages with such properties can be detected by applying clustering analysis to our samples.
To form clusters of similar pages, the “shingling” algorithm described by Broder et al. [2] is used. Figure 7 shows the distribution of the cluster sizes on near duplicate pages in Set 1. The horizontal axis shows the size of the cluster (the number of pages in the near-equivalence class), and the vertical axis shows how many such clusters Set 1 contains.
The outliers can be put into two groups. The first group did not contain any spam pages; pages in this group are more related to the duplicated content issue. The second group, in contrast, is populated predominantly by spam documents: 15 of the 20 largest clusters were spam, containing 2,080,112 pages (1.38% of all pages in Set 1).
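The idea behind shingling can be shown on a handful of pages. The sketch below is a brute-force simplification, not the algorithm from [2] as published: it compares full sets of 4-word shingles pairwise with the Jaccard resemblance and merges pages above a threshold, whereas a web-scale version works on compact sketches of the shingle sets. The shingle width, the 0.5 threshold and all function names are assumptions.

```python
from itertools import combinations


def shingles(text, width=4):
    """Set of all runs of `width` consecutive words in the text."""
    words = text.lower().split()
    if len(words) < width:
        return {tuple(words)}
    return {tuple(words[i:i + width]) for i in range(len(words) - width + 1)}


def resemblance(a, b):
    """Jaccard resemblance of two shingle sets."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def cluster_near_duplicates(pages, threshold=0.5, width=4):
    """pages maps URL -> text. Returns clusters (sets of URLs), largest first."""
    parent = {url: url for url in pages}

    def find(url):
        # Union-find lookup with path halving.
        while parent[url] != url:
            parent[url] = parent[parent[url]]
            url = parent[url]
        return url

    sets = {url: shingles(text, width) for url, text in pages.items()}
    for a, b in combinations(pages, 2):
        if resemblance(sets[a], sets[b]) >= threshold:
            parent[find(a)] = find(b)

    clusters = {}
    for url in pages:
        clusters.setdefault(find(url), set()).add(url)
    return sorted(clusters.values(), key=len, reverse=True)


if __name__ == "__main__":
    template = ("buy cheap {} online today best prices guaranteed fast shipping "
                "worldwide order now and save money on every purchase")
    pages = {kw: template.format(kw) for kw in ("watches", "laptops", "sneakers")}
    pages["weather"] = ("the weather forecast for the coming week predicts light rain "
                        "in the north and sunshine along the southern coast")
    for cluster in cluster_near_duplicates(pages):
        print(sorted(cluster))
```

Pages stamped out from one template with a swapped keyword share most of their shingles and end up in the same cluster, which is why unusually large clusters are worth inspecting.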
To Sum Up
The methods described above are examples of a fairly simple statistical approach to spam detection. Real-life algorithms are much more sophisticated and are based on machine learning technologies, which allow search engines to detect and battle spam with relatively high efficiency at an acceptable rate of false positives. Applying spam detection techniques enables search engines to produce more relevant results and ensures fairer competition based on the quality of web resources and not on technical tricks.
References:
1. Dennis Fetterly, Mark Manasse, Marc Najork. “Spam, Damn Spam, and Statistics: Using statistical analysis to locate spam web pages” (2004). Microsoft Research.
2. A. Broder, S. Glassman, M. Manasse, and G. Zweig. “Syntactic Clustering of the Web”. In 6th International World Wide Web Conference, April 1997.
How to Differentiate Ethical SEO Services From Fake SEO Services
October 29, 2009 by IBI
As a small or medium business owner, you may want to advertise and promote your website as economically as possible. In recent years, Search Engine Optimization (SEO) has emerged as one of the surest and cheapest ways to promote a business over the long term.
A typical business owner is not expected to be knowledgeable about SEO and its benefits. Even an owner who knows SEO may not be sure how to choose an ethical SEO who can get the job done at the best rates and on schedule. Choosing an SEO consultant or company for your website can be a tiring and overwhelming experience. Many SEO specialists will boast about adopting the latest path-breaking SEO techniques, show testimonials from clients and guarantee #1 placements on Google. But how can you, as a non-expert, select a genuine SEO consultant? Listed below are a few of the main pointers for detecting fake SEO consultants or fake SEO companies.
#1 Google ranking for Website:
If any company guarantees you that they can place your website in the number one position on Google, then take my advice – RUN in the opposite direction. It’s obvious this company is desperate to get business and is willing to mislead you to get your hard-earned cash. No company can make such claims. Google’s algorithms are among the best-kept secrets. What SEO companies can do is optimize and promote the website according to widely known principles, intelligent guesswork, best practices and lots of hard work. At most a company can claim that it can achieve good rankings, such as top 10 results. But even then, verify its track record to see how it fared with other clients.
Very Low Prices:
Many companies charge ridiculously low amounts of money for a lot of services. As an astute business person, you can deduce that work requiring hours and hours of effort, research and analysis by a team of experts cannot come cheap. Most of the time, specialists from various fields, such as programmers, content writers, SEO analysts and link builders, pool their resources to make a site successful on Google. Therefore quality services cannot come as cheap as many companies charge.
Secret and Proprietary Technique That Cannot be Revealed:
I don’t think we even need to elaborate on this. As the owner of the website, you should be aware of what’s happening to your site and what steps are being taken to optimize and promote it.
Recommending Black Hat Techniques:
A lot of companies recommend shady tactics, known as black hat practices, to get your site up in the search engines. Not only do these practices have very short-term benefits, but they can also be very counterproductive, since the search engines can eventually catch websites that follow these practices. Many websites, big and small, have been heavily penalized for using such techniques. So play it safe and avoid these SEO companies like the plague.
Using Outdated Techniques:
Once upon a time, techniques like reciprocal link exchange and keyword stuffing were used to get good rankings. Google and other search engines have wised up and now rank a site for its worthiness and not just for backlinks and keyword stuffing. Choose a company that will recommend steps to make your site useful and relevant and will not take shortcuts.
Keep these tips in mind while choosing an SEO company for your website, and hopefully they will help you select one that knows its job and can handle your website promotion very well. Good luck!
Step-By-Step, Do-It-Yourself, SEO Guide
October 27, 2009 by IBI
SEO, or Search Engine Optimization, is an internet marketing strategy that any internet entrepreneur would want for their website. Sites that have already established a strong name on the web may be the exception, but anyone starting out in e-commerce really needs to learn about SEO. Let’s get to the point: SEO is a way to raise a website’s rank in search engines like Yahoo! and Google (any of the top 10 slots will do, but the number one spot is even better). A higher rank means more visitors, and more visitors mean more sales and profit. Long story short, SEO is all about increasing the visibility of a website on the World Wide Web through search engines. Now the question is, how do you do SEO?
SEO companies are now available for hire around the world, such as SEO Philippines companies, and some businesses outsource their SEO needs to them. But SEO can also be done by anyone who knows the basic procedures, even at home. So here is a step-by-step, do-it-yourself guide to SEO and to increasing visibility on the World Wide Web.
Step 1: Get Your Keywords
The first step that any SEO Philippines company would take is to research keywords that accurately describe the website. Free online keyword research tools like the Google AdWords Keyword Tool or SEO Book’s Keyword Tool are popular with many SEO specialists. Highly competitive keywords are the main ingredient in any successful SEO dish. But “highly competitive” doesn’t only mean single words such as “SEO”; it also covers keyword phrases that really describe the website, like “SEO service in the Philippines”.
Note: In keyword research, always make sure that the keywords are relevant to the website being optimized. Irrelevant keywords pointing to a website could earn it a one-way trip to “Bansville”, courtesy of the search engines.
Step 2: Know Your Enemies
Well, not exactly “enemies” per se, but competitors. After doing your keyword research, it’s time to look up your competitors using your chosen keywords. Look at what SEO tactics or strategies they’ve used. By viewing their source code (Ctrl+U in Firefox), you can see their keyword density and the meta tags, alt tags, and title tags they used (I’ll explain these later on). To learn the number of backlinks a website has (also explained later on), try SEOpen’s free backlink tool; this way you’ll know what you have to beat to take their spot. Using your competitors’ strengths against them is one way to succeed in an SEO campaign.
Step 3: Start On-Page SEO
So here we start the first phase of SEO, on-page optimization. SEO is divided into two parts, on-page and off-page (which I’ll discuss later on). On-page optimization involves making changes within the website to make it more “search engine friendly”. The strategies and tactics used here are as follows:
- Content Optimization – This strategy involves changing the content of the website, particularly its text elements. One way is to embed the researched keywords in text throughout the website, focusing mainly on the headers (those with the big bold letters). It is said that keywords should make up about 7% of the total number of words in a piece of content. Let’s say you have 400 words on a single page; 7% of that 400 (which is 28) should be keywords. Note: Make sure that the content is still readable after placing those keywords. Remember that the users are still your prime target, not the search engine spiders.
- Site Map – Site maps are an important part of websites optimized for search engines. With a site map, search engine spiders (or robots) can easily crawl through the whole website, and making crawling easier is one way to make a website “search engine friendly”.
- Meta Tags - Meta tags comprise the Title, Description, and Keywords of the website. Apart from the content itself, meta tags are also the perfect place for keywords. The difference is that they act like a “Free Buffet” sign for search engine spiders: they are what attracts the spiders to crawl your website. But remember that this still depends on how the meta tags are structured.
- Alt Tags - Alt tags are found in HTML image tags. Let’s say your keyword is “SEO Philippines”; your image tag should then look like this: <img src="images.png" alt="SEO Philippines"> Just like meta tags, alt tags are an appropriate place to put important keywords. Note: Place only one (1) keyword in each alt tag throughout the website to avoid keyword spamming (which is part of black-hat SEO).
- Title Tags – Like meta tags and alt tags, title tags are also an appropriate place to put some keywords. Taking “SEO Service” as the example keyword, the HTML hyperlink tag with a title should look like this: <a href="default.aspx" title="SEO Service Online"> Note: Alt tags are normally not seen by visitors (browsers show them only when an image fails to load), so it wouldn’t hurt to leave them with just the keyword. But title tags are seen by visitors, so make sure that the title tag is still readable after embedding a keyword or two.
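If you want to check the 7% figure from the Content Optimization item on your own pages, a few lines of code will do it. This is only a rough sketch: the way words are split and the function name `keyword_density` are my own assumptions, and different SEO tools count density in slightly different ways.

```python
import re


def keyword_density(text, keyword):
    """Share of the words in text that belong to occurrences of keyword.

    keyword may be a phrase; each occurrence counts all of its words.
    """
    words = re.findall(r"[a-z0-9']+", text.lower())
    key = keyword.lower().split()
    if not words or not key:
        return 0.0
    hits = sum(
        1
        for i in range(len(words) - len(key) + 1)
        if words[i:i + len(key)] == key
    )
    return hits * len(key) / len(words)


if __name__ == "__main__":
    page_text = "Our SEO service helps small shops. Ask about our SEO service today."
    print(f"{keyword_density(page_text, 'SEO service'):.1%}")
```

For the example in the list above, a 400-word page that uses a one-word keyword 28 times comes out at exactly 7%.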
Step 4: Start Off-Page SEO
Unlike on-page optimization, off-page optimization involves SEO strategies outside of the website. This part of SEO is divided into two categories: backlinks and public relations strategies. Backlinks are links anywhere on the World Wide Web that point back to your website. It has long been held that the more backlinks you have, the more popular your site becomes. A further factor is the weight or importance of the website the link comes from. Tools used to gauge the popularity of a website include Google’s PageRank and the Alexa traffic tool. Here are some of the strategies used by SEO specialists to get backlinks.
- Link Exchange - One of the earliest forms of backlinking involves exchanging links with other relevant websites.
- Article Marketing - This type of off-page SEO serves both backlinks and public relations. An article commonly generates an average of 3 links (depending on the kind of article directory it is submitted to). It also counts as public relations because these articles commonly promote the business being marketed.
- Blog and Forum Commenting – Commenting on forums and blogs is also a popular way to generate backlinks, by using signatures that link back to your website.
- Blogging - Other than commenting on other blogs, creating your own blog and embedding links is another popular way for generating backlinks.
PR, or public relations, strategies in off-page SEO involve generating traffic not only through search engines but also through other types of media. This type of off-page SEO mainly works by marketing a service through word of mouth. These strategies involve the use of:
- Social Media - Social media sites such as Friendster, Facebook, Twitter, and MySpace are the perfect place to start a buzz that benefits the website through the marketing strategy called “word of mouth”. Adding a link can also help in getting more backlinks.
- Social Bookmarking – By submitting your articles to bookmarking sites such as Technorati or Digg, people can easily learn about your services through your articles. This is also another way for the articles to gain importance on the web, increasing the chance that your website will be visited through them.
Another form of off-page SEO is submitting your website to search engines and directories. Make sure to submit your URL for free to Google, Yahoo!, Alexa, www.altavista.com, www.alltheweb.com, www.excite.com, www.lycos.com, www.webcrawler.com, www.Jayde.com, www.whatuseek.com, and more. DMOZ and Yahoo Directory listings are tremendously valuable. There are many more free and paid search engines and directories to submit to, but if you do not have time to search for them and submit, you can also use our search engine submission service. Visit http://www.myoptimind.com for more information.
Margarette Mcbride is a copywriter at Optimind Web Design and SEO, a web design and SEO company in the Philippines. Optimind specializes in building and promoting websites that are designed for conversion.

