Flash and SEO

I know, I know. Google's indexing Flash and Flash developers can rejoice now that their content is SEO-friendly. Sorry - I don't buy it for a second. Flash content is fundamentally different from HTML on webpage URLs, and being able to parse links in the Flash code and text snippets does not make Flash search-engine friendly. I think it's great that Google's digging deeper into Flash, but I don't believe web developers should be any less wary than they've been in the past about Flash-based websites or Flash-embedded content.

And guess what - I used to be a Flash developer (prior to founding SEOmoz). I still build graphics and wireframes and content (like this demo I quickly made for SEOmoz's PRO content) entirely in Flash. I'm not a hardcore Flash junkie or an ActionScript developer anymore, but I can say with confidence that Flash ≠ a smart SEO strategy.

Some reasons why include:

  1. Different Content is NOT on Different URLs
    This is the same problem you encounter with AJAX-based pages. You could have unique frames, movies within movies, etc. that appear to be completely unique portions of the Flash site, yet there's no way to link to these individual elements (unless the Flash developer is specifically building for this scenario - and even then, there's almost always some portions that are missed).

  2. The Breakdown of Text
    Google can index the output files in the SWF to see words and phrases, but in Flash, a lot of your text isn't in nice clean h1 or p tags; it's jumbled up into half phrases for graphical effects and will often be output in the incorrect order. Worse still are text effects that often require "breaking" words apart into individual letters to animate them - I'm guessing search engines aren't yet smart enough to play Scrabble with Flash output.


  3. Flash Gets Embedded
    A lot of Flash content is only linked-to by other Flash content wrapped inside shell Flash pages. This line of links, where no other internal or external URLs are referencing the interior content, means some very low PageRank/link juice documents. Even if they manage to stay in the main index, they probably won't rank for anything.

  4. Testing Crawlability with Hope
    That's what you're doing with Flash content for SEO - hoping. Google's Flash-crawling technology is proprietary, and while we all know and can test what search engines see from a content and link perspective in HTML, there's no "test my site's Flash file crawlability" feature that I'm aware of, leaving us very much in the dark about exactly how the engine's going to parse your material.

  5. Flash Doesn't Earn External Links Like HTML
    For whatever reason, etiquette on the web simply doesn't lend itself to Flash media earning link love. An all-Flash site might get a large number of links to the homepage, but interior pages almost always suffer. For embeddable Flash content, it's the HTML host page earning those links when they do come. As a simple example, imagine a blog post or news article in HTML - those who enjoy it might copy and paste a few quotes into their own pages and link over, yet this rarely ever happens with Flash text (which can be hard to copy and paste unless the designer builds it properly) and even still isn't common practice among the "linkerati."

  6. SEO Basics Are Often Missing
    Anchor text, headlines, bold/strong text, img alt tags, and even title tags are not simple elements to properly include in Flash, and 9 times out of 10, the designer won't build them in properly. Developing Flash with SEO in mind is not just more difficult than doing it in HTML, it's not part of the cultural lexicon of the Flash-development world.

  7. A Lot of Flash Isn't Even Crawlable
    Google said they don't execute external javascript calls (which many Flash-based sites use) or index the content from external files called by Flash (which, again, a lot of Flash sites rely on). These limitations could severely impact what a visitor can see vs. what Googlebot can index.

Of course, it is nice to see some Flash content ranking at Google (like for the query "break apart flash letters," which illustrates point #2 above quite nicely). Just don't let a Flash developer who just found out about Google's new ability to crawl their work talk you into doing anything rash.

I love Flash - I still work in it and I think that a lot of great sites and applications have used it well. But, trusting Flash content to get good SEO results is like trusting a Seattle summer wedding to be rain-free. It could happen, but no one would call it a wise bet. If you're seeking to make Flash as accessible and SEO-friendly as possible, that's a noble philosophy (and Jon Hochman's How to SEO Flash is quite a good start), but making it your primary content delivery system on the web is a recipe for disaster.


Cracking Google's 1,000 Page Barrier

One of the frustrations of doing SEO for large websites is the fact that Google makes it very difficult to see more than a small part of the search index. Even in Webmaster Tools, Google's index search is built on the same mechanics as its web search, which only lets you see the first 1,000 pages of any result. Whether you're trying to get pages discovered, struggling with duplicate content, confirming robots.txt changes, or doing advanced index sculpting, that 1,000-page barrier can be extremely limiting when you're dealing with a site with 10,000 or more indexed pages.

So, how can we dig deeper into the index and really see the big picture?

The Tools – Site: and Inurl:

First off, you're going to need a couple of tools. I'll assume that most of you are familiar with Google's "site:" command, which returns the indexed pages from any given domain or subdomain. Let's take our friends here at SEOmoz as an example. Type "site:seomoz.org" into Google's search box, and you'll see something like this:


The other command we'll be using is "inurl:", which, paired with other search terms, restricts the results to only those containing a specific keyword in the URL. Paired with the "site:" command, Google only reveals indexed pages which contain those URL keywords.

The Tactic – Index Deconstruction

Using our SEOmoz example, how can we find out which pages are included in the roughly 12,000-page index when we can only see those pages 1,000 at a time? Those last three words are the key: we can only see 1,000 pages at a time, but depending on how we construct our searches, they don't have to be the same 1,000 pages. By splitting up our index searches logically, we can break the full index up into manageable chunks. We'll do this by using "inurl:" to force the "site:" command to show us the index through smaller windows.

An Example – Deconstructing SEOmoz

This is one of those techniques that's much easier to illustrate with an example. Let's say that we needed to dig deeply into SEOmoz's 12,000 indexed pages. The first thing that we might do is to take a look at the main navigation to get an idea of the URL/folder structure of the site. Looking at the top-right navigation on SEOmoz, we see the following (I've added the numbers 1-6 - see below):

Other than "Home," the first link goes to the "/blog" folder. That looks promising, so let's try out our combination "site:" and "inurl:" search, "site:seomoz.org inurl:blog":


After clicking the "omitted results" link to see the full list, we get 2,430 pages of the index that contain the word "blog." That's a good start, so let's see what we can do with a few more of the major folders (numbered above):

1. inurl:blog - 2,430
2. inurl:ugc - 712
3. inurl:articles - 96
4. inurl:tools - 29
5. inurl:users - 5,880
6. inurl:marketplace - 787

Not bad: with just 6 subfolders, we've accounted for 9,934 pages or over 80% of the index. This, of course, assumes minimal overlap, and the accuracy of Google's numbers may be questionable (I'll discuss some issues with "inurl:" at the end of the post), but it's more than adequate to get the job done.
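The arithmetic above can be double-checked with a short sketch. The counts are the ones reported in this post and the 12,000 figure is the rough index size quoted earlier; the function and variable names are just illustrative, and nothing here talks to Google.

```python
# Counts Google reported for each "site:seomoz.org inurl:<keyword>" search above.
SITE = "seomoz.org"
APPROX_INDEX_SIZE = 12_000  # rough index size quoted in the post
VISIBLE_LIMIT = 1_000       # most results Google will show for one search

counts = {
    "blog": 2430,
    "ugc": 712,
    "articles": 96,
    "tools": 29,
    "users": 5880,
    "marketplace": 787,
}

def build_query(site, keyword):
    """Combine the site: and inurl: operators into one search string."""
    return f"site:{site} inurl:{keyword}"

total = sum(counts.values())
coverage = total / APPROX_INDEX_SIZE
# Chunks that are still too large to page through in a single search.
too_big = [keyword for keyword, n in counts.items() if n > VISIBLE_LIMIT]

for keyword, n in counts.items():
    print(f"{build_query(SITE, keyword):40} {n:>6}")
print(f"accounted for: {total} pages ({coverage:.0%} of roughly {APPROX_INDEX_SIZE})")
print("needs further splitting:", too_big)
```

Any chunk listed as needing further splitting is a candidate for a narrower search, which is where knowledge of the site's URL structure comes in.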

Now, we're left with a couple of groups, such as (5) that are still greater than 1,000 pages. At this point, you'll have to use some logic and your knowledge of the site in question. As a frequent Moz user, I know that the "users" folder contains all of the user profiles. Digging a little, I can easily find that those profiles all contain "users/view." A new search on "inurl:users/view" reveals 5,810 user profiles, making up almost all of the pages in the "users" folder and almost half of the total index.

An Example – Canonical URLs

Most of the time, we aren't going to be trying to deconstruct the entire Google index for a site, but just need to answer a specific question. Let's take my own company site/blog as an example. Recently, I realized that I had left some loose ends in the code that were revealing both canonical and non-canonical URLs. So, for example, the same blog post might have the following two URLs:

1. http://www.usereffect.com/topic/the-last-spam-youll-ever-need
2. http://www.usereffect.com/index.php?id=154

I've recently made some code changes to fix the problem, but how do I find out if my fix is working? I simply look for "id" in the URL with a search command like "site:usereffect.com inurl:id". As of this writing, that search only shows 1 result, suggesting that my changes are having the desired effect.
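If you copy the URLs out of a search like that, a few lines of code can sort the leftovers from the clean ones. This is only a sketch: it assumes the non-canonical form is always "index.php?id=N" as in the example above, and the function name is made up for illustration.

```python
from urllib.parse import urlparse, parse_qs

def is_non_canonical(url):
    """True when the URL uses the old index.php?id=N form (an assumed pattern)."""
    parsed = urlparse(url)
    return parsed.path.endswith("index.php") and "id" in parse_qs(parsed.query)

# The two example URLs from above, standing in for a list of indexed URLs.
urls = [
    "http://www.usereffect.com/topic/the-last-spam-youll-ever-need",
    "http://www.usereffect.com/index.php?id=154",
]

leftovers = [u for u in urls if is_non_canonical(u)]
print(leftovers)
```

Checking the query string rather than just looking for "id" anywhere in the URL avoids false alarms on pages that merely have "id" as a word in the path.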

Advanced Inurl Tips

I hope that I've demonstrated just how powerful two relatively simple search tools can be when effectively combined. Before you go out and put this to work, though, a couple of warnings about "inurl:", which has a tendency to misbehave.

First, "inurl:" seems to ignore punctuation, for the most part. A targeted search on the folder "inurl:/blog" returns the same results as "inurl:blog," which is to say that it returns every page that contains "blog" anywhere in the URL. In some cases, this won't be a problem, but you'll have to judge that on a case-by-case basis. Like standard Google search terms, "inurl:" only searches on whole words (but doesn't seem to allow word stems), and you can only use a single word at a time in any given "inurl:" statement.
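To make that behavior concrete, here is a rough model of it in code. This is my approximation of what's described above (punctuation ignored, whole words only), not Google's actual matching logic, and the example URLs are hypothetical.

```python
import re

def url_words(text):
    """Split a URL or search term into lowercase whole words, dropping punctuation."""
    return [w for w in re.split(r"[^a-z0-9]+", text.lower()) if w]

def inurl_matches(term, url):
    """Approximate inurl: every word in the term must appear as a whole word in the URL."""
    wanted = url_words(term)
    found = set(url_words(url))
    return bool(wanted) and all(w in found for w in wanted)

blog_post = "http://www.seomoz.org/blog/some-post"       # hypothetical URL
ugc_post = "http://www.seomoz.org/ugc/my-blog-tips"      # hypothetical URL
plural = "http://www.example.com/blogs/some-post"        # hypothetical URL

print(inurl_matches("/blog", blog_post))  # the slash makes no difference
print(inurl_matches("blog", ugc_post))    # "blog" outside the /blog folder still matches
print(inurl_matches("blog", plural))      # "blogs" is a different whole word
```

The second case is the one to watch for: a keyword can match pages outside the folder you meant, which is one source of overlap between chunks.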

You can use multiple "inurl:" statements (one for each word) in your search, which are automatically combined with a logical AND. You can also use "-inurl:" to exclude specific URL keywords from any given search. Finally, you can combine "site:", "inurl:" and stand-alone keywords to target indexed pages by URL and content keywords in one statement.
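If you find yourself typing a lot of these, a small helper can assemble them. This is a sketch that only builds the search string; the function name and the sample keywords are made up for illustration.

```python
def build_index_query(site, include=(), exclude=(), keywords=()):
    """Assemble a query from site:, inurl:, -inurl: and stand-alone keywords."""
    for word in list(include) + list(exclude):
        # Each inurl: statement takes one word, so reject anything with spaces.
        if len(word.split()) != 1:
            raise ValueError(f"use one word per inurl: statement, got {word!r}")
    parts = [f"site:{site}"]
    parts += [f"inurl:{w}" for w in include]    # combined with a logical AND
    parts += [f"-inurl:{w}" for w in exclude]   # excluded URL keywords
    parts += list(keywords)                     # matched against page content
    return " ".join(parts)

query = build_index_query(
    "seomoz.org",
    include=["blog"],
    exclude=["ugc"],
    keywords=["cloaking"],
)
print(query)
```

The result is a plain string you can paste into Google's search box.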

Adobe Teams Make Flash Searchable With Google, Yahoo


As Web content has grown increasingly dynamic, some of it has also become harder to find through conventional search engines. Adobe Systems is hoping to rectify that situation somewhat with new technology it is providing to search giants like Google and Yahoo.

The software manufacturer announced yesterday that it would begin providing optimized Flash technology to leading search engines that will make rich Internet applications (RIA) and dynamic content created using Flash more easily identifiable.

"Up 'til now, Flash content just hasn't been as thoroughly searchable as we'd like," said Justin Everett-Church, senior product manager for Adobe Flash Player. "The text was just being pulled out in strange ways and left it looking really incomplete, as though you were reading the index of a book rather than the book itself."

The optimized Flash technology will now allow search engines to identify text within Flash programs that would otherwise have escaped them without requiring any change in behavior on the developers' part. The result should be millions of newly searchable RIAs and dynamic experiences, including brand experiences on the Web.

Adobe is also hoping this will convince developers to use Flash in places where they otherwise would not have, as the search problems had long been a sticking point with the program.

Using Flash "has always been a bit of a tradeoff," said Everett-Church. "You get all the great graphics and experiences but you lose some search capability. Hopefully this will remove some of those barriers to entry."

At least one prominent digital agency executive applauded the move, but warned developers against using this as an excuse to go overboard with Flash.

"I think this makes Flash more attractive to use, but you also have to be careful not to fall back into the old times when Flash first came on the market and people went crazy with it," said Andreas Roell, CEO of Geary Interactive.

While a certain amount of Flash can enhance a user's experience, too much can make a site unwieldy, ultimately turning away the consumers you are looking to engage, he said.

"Flash should add to the experience, but never be 100 percent of it," he said. "We're defeating the purpose of what Flash is meant to be if we get too extreme. The old principles still apply."

That said, he applauded Adobe for addressing the biggest drawback to the popular program.

"I'm really actually very pleased that Adobe is tackling this," he said. "Obviously there is a big drive for getting additional market share with developers, but overall I think this checks off one major problem that allows us to focus more on [serving clients]."

Good White Hat Cloaking


A quote from Google's Guidelines on Cloaking:

Serving up different results based on user agent may cause your site to be perceived as deceptive and removed from the Google index.

There are two critical pieces in that sentence - "may" and "user agent." Now, it's true that if you cloak in the wrong ways, with the wrong intent, Google (and the other search engines) "may" remove you from their index and if you do it egregiously, they certainly will. But, in many cases, it's the right thing to do, both from a user experience perspective and from an engine's.

To start, let me list a number of web properties that currently cloak without penalty or retribution.

* Google - Search for "google toolbar" or "google translate" or "adwords" or any number of Google properties and note how the URL you see in the search results and the one you land on almost never match. What's more, on many of these pages, whether you're logged in or not, you might see some different content to what's in the cache.
* NYTimes.com - The interstitial ads, the request to login/create an account after 5 clicks and the archive inclusion are all showing different content to engines vs. humans.
* Forbes.com - Even the home page can't be reached without first viewing a full page interstitial ad, and Google's "cached text" of most pages is vastly different from the components that humans see.
* Wine.com - In addition to some redirection based on your path, there's the state overlay forcing you to select a shipping location prior to seeing any prices (or any pages). That's a form the engines don't have to fill out.
* WebmasterWorld.com - Pioneers of the now permissible and tolerated "first click free," Googlebot (and only GGbot from the right set of IP addresses) is allowed access to thousands of clicks without any registration.
* Yelp.com - Geotargeting through cookies based on location; a very, very popular form of local targeting that hundreds, if not thousands of sites use.
* Amazon.com - In addition to the cloaking issues that were brought up on the product pages at SMX Advanced, Amazon does lots of fun things with their buybox.amazon.com subdomain and with the navigation paths & suggested products if your browser accepts cookies.
* iPerceptions.com - The site itself doesn't cloak, but their pop-up overlay is only seen by cookied humans, and appears on hundreds of sites (not to mention it's a project of one of Google's staffers).
* InformationWeek.com - If you surf as Googlebot, you'll get a much more streamlined, less ad-intensive, interstitial free browsing experience.
* ComputerWorld.com - Interstitials, pop-ups and even some strange javascript await the non-bot surfers.
* ATT.com - Everyone who hits the URL gets a unique landing page with different links and content.
* Salon.com - No need for an ad sponsored "site pass" if you're Googlebot :)
* CareerBuilder.com - The URLs you and I see are entirely different than the ones the bots get.
* CNet.com - You can't even reach the homepage as a human without seeing the latest digital camera ad overlay.
* Scribd.com - The documents we see look pretty different (in format and accessibility) than the HTML text that's there for the search engines.
* Trulia.com - As was just documented this past week, they're doing some interesting re-directs on partner pages and their own site.
* Nike.com - The 1.5 million URLs you see in Google's index don't actually exist if you've got Flash enabled.
* Wall Street Journal - Simply switching your user-agent to Googlebot gets you past all those pesky "pay to access" breaks after the first paragraph of the article.

This list could go on for hundreds more results, but the message should be clear. Cloaking isn't always evil, it won't always get you banned, and you can do some pretty smart things with it, so long as you're either:

A) A big brand that Google's not going to get angry with for more than a day or two if you step over the line OR
B) Doing the cloaking in a completely white hat way with a positive intent for users and engines

Here's a visual interpretation of my personal cloaking scale:

[Image: Search Engine Cloaking Scale]

Let's run through some examples of each:

Pearly White - On SEOmoz, we have PRO content like our Q+A pages, link directory, PRO Guides, etc. These are available only to PRO members, so we show a snippet to search engines and non-PRO members, and the full version to folks who are logged into a PRO account. Technically, it's showing search engines and some users different things, but it's based on the cookie and it's done in exactly the type of way engines would want. Conceptually, we could participate in Google News's first-click free program and get all of that content into the engine, but haven't done so to date.
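A minimal sketch of the idea, assuming a hypothetical "pro_member" cookie and a 200-character snippet (neither is SEOmoz's actual setup, and a real site would verify a login session server-side rather than trust a raw cookie value):

```python
def render_pro_page(full_text, cookies, snippet_chars=200):
    """Return the full text to PRO members and a short snippet to everyone else."""
    # The decision rests on the cookie alone; the user agent is never consulted.
    is_pro = cookies.get("pro_member") == "1"  # hypothetical cookie name
    if is_pro or len(full_text) <= snippet_chars:
        return full_text
    return full_text[:snippet_chars].rstrip() + "..."

article = "PRO guide text. " * 50  # stand-in for a long PRO-only page

crawler_view = render_pro_page(article, cookies={})   # bots send no cookies
visitor_view = render_pro_page(article, cookies={})   # neither do logged-out humans
member_view = render_pro_page(article, cookies={"pro_member": "1"})

print(crawler_view == visitor_view)  # engines and non-members get the same snippet
print(member_view == article)        # members get the whole thing
```

Because the user agent never enters into it, a search engine sees exactly what any logged-out visitor sees, which is what keeps this at the white end of the scale.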

Near White - Craigslist.org does some automatic geo-targeting to help determine where a visitor is coming from and what city's page they'd want to see. Google reps have said publicly that they're OK with this so long as Craigslist treats search engine bots the same way. But, of course, they don't. Bots get redirected to a page that I can only see in Google's cache (or if I switch my user agent). It makes sense, though - the engines shouldn't be dropped onto a geo-targeted page; they should be treated like a user coming from everywhere (or nowhere, depending on your philosophical interpretation of Zen and the art of IP geo-location). Despite going against a guideline, it's so extremely close to white hat, particularly from an intention and functionality point-of-view, that there's almost no risk of problems.

Light Gray - I don't particularly want to "out" anyone who's doing this now, so let me instead offer an example of when and where light gray would happen (if you're really diligent, you can see a couple of the sites above engaging in this type of behavior). Imagine you've got a site with lots of paginated articles on it. The articles are long - thousands of words, and even from a user experience point-of-view, the breakup of the pages is valuable. But, each page is getting linked-to separately, there's a "view on one page" URL, a "print version" URL and an "email a friend" URL that are all getting indexed. Often, when an article's interesting, folks will pick it up on services like Reddit and link to the print-only version, or to an interior page of the article in the paginated version. The engines are dealing with duplicate content out the wazoo, so the site detects for engines and 301s all the different versions of the article back to the original, "view on one page" source, but drops visitors who click that SERP to the article homepage in the paginated version.

Once again, the site is technically violating guidelines (and a little more so than in the near-white example), but it's still well-intentioned, and it really, really helps engines like MSN & Ask.com, who don't do a terrific job with duplicate content detection and canonicalization (and, to be fair, even Yahoo! and Google get stuck on this quite a bit). So - good intentions + positive user experience that meets expectations + use of a proclaimed shady tactic = light gray. Most of your big brand sites can get away with this ad infinitum.

Dark Gray - Again, I'll give a hypothetical rather than call someone out. There are many folks who participate in affiliate programs, and the vast majority of these send their links through a redirect in Javascript, both to capture the click for their tracking purposes, and to stop link juice from passing. Some savvier site owners have realized how valuable that affiliate link juice can be and have set up their own affiliate systems that do pass link juice, often by collecting links to unique pages, then 301'ing those for bots, passing the benefit of the links on to pages on their domain where they need external links to rank. The more crafty ones even sell or divide a share of this link juice to their partners or the highest bidder. This doesn't necessarily affect visitors who come seeking what the affiliate's linked to, but it can create some artificial ranking boosts, as the engines don't want to count affiliate links in the first place, and certainly don't want them helping pages they never intended to receive their traffic.

Solid Black - Since I found some pure spam that does this, I thought I'd share. I recently performed a search at Google for inurl:sitemap.xml, hoping to get an estimate of how many sites use sitemaps. In the 9th position, I found the odd URL - www.acta-endo.ro/new/viagra/sitemap.xml.html, which redirects humans to a page on pharmaceuticals. Anytime a search result misleadingly takes you to content it not only doesn't show the engine, but isn't relevant to your search query, I consider it solid black.

Now for a bit of honesty - we've recommended pearly white, near white and yes, even light gray to our clients in the past and we'll continue to do so in the future when and where it makes sense. Search engine reps may decry it publicly, but the engines all permit some forms of cloaking (usually at least up to light gray) and even encourage it from brands/sites where it provides a better, more accessible experience.

The lesson here is - don't be scared off a tactic just because you hear it might be black hat or gray hat. Do your own research, form your own opinions, test on non-client sites and do what makes the most sense for your business and your client. The only thing we have to fear is fear itself (and overzealous banning, but that's pretty rare). :-)

p.s. The takeaway from this post should not be "cloak your site." I'm merely suggesting that inflexible, pure black-and-white positions on cloaking deserve potential re-thinking.