How to audit content and structured data for the mobile-first index

November 30, 2017 — 5 Min Read

With the mobile index coming up on us, I've recently had to do a fair bit of mobile auditing. Some of it can be quite a lot of work, so I'm sharing some of the processes I use to speed it up.

We will specifically be looking at auditing content and structured data.

What is usually in a mobile audit?

Broadly speaking I end up looking at:

What type of website is it, and are the mobile directives set up correctly?
- Dynamic serving: Vary header. Separate m. domain: switchboard tags, etc.
Directives
- Canonical, hreflang, noindex etc.
Mobile experience
- Is it easy to use? Does the website function on mobile? etc.
Content differences
- Is there the same content, structured data, videos etc.?

We're focusing on looking at differences in content and structured data

The first two are easily done with any crawler, and the third is a huge topic that I'm not going to cover here.

The last is what I want to talk about today. When Google have talked about it publicly, they've mentioned having content parity between mobile and desktop a number of times. (Here's the tweet I've seen go around the most on it.)

I couldn't find a current way to do this at the scale I needed, so I whipped something up myself. You might find it helpful.

Here's a methodology to look at content differences between mobile and desktop sites.

What will be the output of this process?

A list of all the main structured data entities which are different, between mobile and desktop URLs.
A list of all the main blocks of text which are different between mobile and desktop URLs.
Some useful numbers alongside them to help prioritise the output.

Here's a link to an example output sheet that this script would generate. (Note that, due to a hacky workaround, each page will have one row where the value is "nothing". This can be safely ignored.)

How will we do this audit?

With a Python notebook. Hoorah! (Don't worry if you haven't used Python before, I have a guide written here on how to get set up with Python.)

I don't yet have a guide on using notebooks (I will update this post when I do).

How does this Python notebook work?

First take a representative sample of your pages. I.e. one or two from each major template type on your site.

We don't want too many here, as the output becomes overwhelming quickly and web design decisions happen at a template level anyway.

Crawl those pages and extract the whole content of the page, along with the alternate tags if they exist.

Both Screaming Frog and Deep Crawl can do this, and any crawler with extraction will do. (You could quite happily pull down the pages with Python if that was your thing.)

The CSS selectors for extracting those two elements will be:

Body: body
Alternate tags: html > link[rel="alternate"]

(Note: you may have multiple alternate tags that will be picked up here, for example if you have app deep linking. That's fine, the script will choose one that includes m.)
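If you'd rather pull the alternate tags out with Python, here's a rough sketch using only the standard library. It is not the notebook's actual code: the names `AlternateLinkParser` and `pick_mobile_alternate` and the example.com URLs are made up for illustration, and "hostname starts with m." is an assumed way of choosing the mobile tag. It also picks up every link tag on the page rather than only direct children of html.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class AlternateLinkParser(HTMLParser):
    """Collects the href of every <link rel="alternate"> tag."""

    def __init__(self):
        super().__init__()
        self.alternates = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attrs = dict(attrs)
        rel = (attrs.get("rel") or "").lower().split()
        if "alternate" in rel and attrs.get("href"):
            self.alternates.append(attrs["href"])


def pick_mobile_alternate(html):
    """Return the first alternate URL on an m. hostname (assumed rule), else None."""
    parser = AlternateLinkParser()
    parser.feed(html)
    for href in parser.alternates:
        host = urlparse(href).hostname or ""
        if host.startswith("m."):
            return href
    return None


# Hypothetical page with an app deep link and a mobile alternate.
sample = """
<html><head>
<link rel="alternate" href="android-app://com.example/https/example.com/page">
<link rel="alternate" media="only screen and (max-width: 640px)"
      href="https://m.example.com/page">
</head><body>Hello</body></html>
"""
print(pick_mobile_alternate(sample))
```

The app deep link is skipped because its hostname doesn't start with m., so only the mobile URL is returned.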

The rest of the instructions are in the notebook at this point so open it up and get going. Or read on and find out how it works!

How does the notebook check for differences (and what are the caveats)?

Text comparison

The text comparison works on an element by element basis.

This means it takes every single HTML element on the page which contains text, compares it to every text element on the other version of the page, and returns the ones that differ. This is normally pretty thorough, but it does mean the following case will catch it out:

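If one version of the page has the text split across two elements, for example:

<p>This is the first sentence.</p><p>This is the second sentence.</p>

That won't match the same text inside a single element:

<p>This is the first sentence. This is the second sentence.</p>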

This isn't a common scenario and I haven't seen it yet (probably due to how content is typically inserted into a CMS), but there's no reason it couldn't happen.
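To make that caveat concrete, here's a minimal sketch of an element-by-element text comparison. It's a simplification rather than the notebook's code: it treats each run of text between tags as one block and looks for an exact match on the other page, and it makes no attempt to skip script or style content. The names `TextBlockParser`, `text_blocks` and `missing_blocks` are hypothetical.

```python
from html.parser import HTMLParser


class TextBlockParser(HTMLParser):
    """Collects each non-empty run of text found between tags."""

    def __init__(self):
        super().__init__()
        self.blocks = []

    def handle_data(self, data):
        text = " ".join(data.split())  # collapse runs of whitespace
        if text:
            self.blocks.append(text)


def text_blocks(html):
    """Return the list of text blocks on a page, in document order."""
    parser = TextBlockParser()
    parser.feed(html)
    return parser.blocks


def missing_blocks(desktop_html, mobile_html):
    """Text blocks on the desktop page with no exact match on the mobile page."""
    mobile = set(text_blocks(mobile_html))
    return [block for block in text_blocks(desktop_html) if block not in mobile]


# Same words, but split across two elements on desktop and one on mobile.
desktop = "<p>This is the first sentence.</p><p>This is the second sentence.</p>"
mobile = "<p>This is the first sentence. This is the second sentence.</p>"
print(missing_blocks(desktop, mobile))
```

Both desktop sentences get reported as missing here, even though the visible text is the same, because neither block matches the single combined block on the mobile page.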

Structured data comparison

This works by comparing the structure of the top-level elements in Google's Structured Data Testing Tool (SDTT).

Comparing top-level elements means that when you open something in the SDTT, it will compare everything at the first level without clicking in.

It also looks for structural differences, rather than value differences. For example, the following two pieces of structured data would be considered identical:

{
  "@context": "http://schema.org",
  "@type": "Product",
  "name": "Kenmore White 17\" Microwave",
  "image": "640x1080-kenmore-microwave-17in.jpg"
}

{
  "@context": "http://schema.org",
  "@type": "Product",
  "name": "Kenmore White 17\" Microwave",
  "image": "320x640-kenmore-microwave-17in.jpg"
}
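Here's a small sketch of what "structural rather than value" comparison could look like in Python. It is an illustration under assumptions, not the notebook's code: the hypothetical `structure_of` helper keeps the keys and the @type, throws away every other value, and (unlike the top-level comparison described above) recurses into nested items.

```python
def structure_of(value):
    """Reduce structured data to its shape: keep keys and @type, drop other values."""
    if isinstance(value, dict):
        return {
            key: (val if key == "@type" else structure_of(val))
            for key, val in value.items()
        }
    if isinstance(value, list):
        return [structure_of(item) for item in value]
    return None  # leaf values (names, image URLs, prices) are ignored


desktop_sd = {
    "@context": "http://schema.org",
    "@type": "Product",
    "name": 'Kenmore White 17" Microwave',
    "image": "640x1080-kenmore-microwave-17in.jpg",
}
mobile_sd = {
    "@context": "http://schema.org",
    "@type": "Product",
    "name": 'Kenmore White 17" Microwave',
    "image": "320x640-kenmore-microwave-17in.jpg",
}
# Hypothetical third case: the image property is missing entirely.
mobile_sd_no_image = {k: v for k, v in mobile_sd.items() if k != "image"}

print(structure_of(desktop_sd) == structure_of(mobile_sd))
print(structure_of(desktop_sd) == structure_of(mobile_sd_no_image))
```

The first comparison treats the two products as identical despite the different image files, while the second flags a difference because a property has gone missing.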

Why does it only compare structure?

As I've attempted to illustrate above, mobile pages may often have different values from desktop pages for good reasons, for example a mobile image link on an m. page and a desktop image link on a www. page.

It is possible to turn this off in the script (there is a note on the relevant point), but I wouldn't recommend it.

Notes on comparing top level elements

This is fine for most websites. For example here is a common output you'll see on the Google Structured Data Testing Tool.

If, however, you have a website with perfect hierarchical structured data, where everything has a parent and is incredibly thorough (maybe you hired Jarno Van Driel), then what you'll see will look like this. In that case, it's only going to tell you if the two top-level segments are identical.

Thankfully most websites don't have that, so you probably won't run into this one either.

Extra stats for comparison

The sheet also throws out two extra numbers:

Length in characters of the text/SD that is missing.
The total number of missing SD or text elements for each page.

Sorting by the first lets you find the biggest differences. Sorting by the second lets you see which page templates are the most different.
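As a rough sketch of how those two numbers could be calculated and sorted in Python: the rows, URLs and column names below are invented for illustration and won't match the real output sheet exactly.

```python
from collections import Counter

# Hypothetical rows: (page URL, kind of difference, missing text or structured data)
rows = [
    ("https://www.example.com/product", "text", "Free delivery on orders over 50."),
    ("https://www.example.com/product", "sd", '{"@type": "Review"}'),
    ("https://www.example.com/category", "text", "Sort by price"),
]

# Second number: how many missing elements each page has in total.
missing_per_page = Counter(url for url, _, _ in rows)

report = [
    {
        "url": url,
        "kind": kind,
        "missing": missing,
        "missing_length": len(missing),  # first number: length in characters
        "missing_on_page": missing_per_page[url],
    }
    for url, kind, missing in rows
]

# Biggest individual differences first.
biggest_first = sorted(report, key=lambda row: row["missing_length"], reverse=True)

# Pages (and so templates) with the most differences first.
most_different_pages = missing_per_page.most_common()

for row in biggest_first:
    print(row["missing_length"], row["missing_on_page"], row["url"], row["missing"])
```

In a spreadsheet you'd get the same result by sorting on each column in turn. The sketch is only there to show what the two numbers mean.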

That's all folks. Hopefully that's helpful for some of you. Questions and bugs in the comments, please!

By Dominic Woodman. This bio is mostly here because it looked good in mock-ups.