Setting up your React app for Google’s SEO crawlers
SEO (search engine optimization) is crucial for any site in today’s world, especially if you intend to be discovered organically through searches. I’m no expert at SEO, but luckily there are plenty of resources such as Google’s own SEO Starter Guide.
Using “Fetch as Google”
The first thing I wanted to see was how Google’s crawlers were viewing my site. There’s a tool called “Fetch as Google,” which is accessible through this dashboard. If you haven’t already, you’ll need to verify that you’re the owner of the site. Once you’ve verified, you can go into the property and find “Fetch as Google” on the left.
This tool lets you “Fetch” or “Fetch and render” various pages, which gives insight into how Google’s crawlers are viewing and indexing your site.
Issues with crawling
I went ahead and ran a “Fetch and render,” which lets you click in and see the response as well as a screenshot. However, I noticed the downloaded response was pretty minimal HTML, with the only content being the following text:
You need to enable JavaScript to run this app.
That happens to be the default text in the index.html for a React app until the JavaScript does its work and renders the DOM. I found this strange, because even though Google’s crawlers are just bots, they’re supposed to execute some JavaScript and render additional content before completing the inspection.
If this was the only content crawlers were picking up, there’d be no SEO value. Now I needed to figure out why my JavaScript wasn’t being run by the crawler.
The robots.txt file
A file commonly mentioned in articles was the robots.txt file. This file is used to guide crawlers and tell them which resources they shouldn’t access (even more about it here). It should be located at http://www.yoursite.com/robots.txt. I noticed there were a few GET requests to it in my server logs. However, with the way my React app was set up, that path just returned a page with a 404 message.
Hence, I decided to create a robots.txt. A basic one that allows everything looks like:
User-Agent: *
Disallow:
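For reference, if you did want to keep crawlers out of a particular path (a hypothetical /admin/ section, say), the same file is where you’d list it:
User-Agent: *
Disallow: /admin/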
Now, when the crawler looks up robots.txt, it will find an actual robots.txt rather than a “page not found.” The dashboard also has a tool called “robots.txt Tester,” which lets you type in paths to check whether they’re accessible to the crawler.
I went ahead and tested that the path to my main JS bundle was accessible. It was, so it wasn’t robots.txt preventing the JavaScript from being loaded. No luck yet.
Reproducing the crawler’s problem with PhantomJS
I encountered a number of interesting resources such as this article and this article. After trying a couple of the suggested changes with no progress, I concluded that deploying blindly was too slow. As with any debugging, one of the first things to do is reproduce the problem.
Now, I don’t have access to Google’s crawlers locally, but various sites seemed to suggest they may be using PhantomJS, a headless browser. This might be close enough to what Google is doing to accurately mimic the problem. After downloading PhantomJS, I used it to run the following:
var url = 'http://localhost:3000/';
var page = require("webpage").create();

// Once the page has finished loading, grab the rendered HTML and print it
function onPageReady() {
  var htmlContent = page.evaluate(function () {
    return document.documentElement.outerHTML;
  });
  console.log(htmlContent);
  phantom.exit();
}

page.open(url, function (status) {
  // Poll every 2 seconds until document.readyState reports "complete"
  function checkReadyState() {
    setTimeout(function () {
      var readyState = page.evaluate(function () {
        return document.readyState;
      });
      if ("complete" === readyState) {
        onPageReady();
      } else {
        checkReadyState();
      }
    }, 2000);
  }
  checkReadyState();
});
This opens the page (http://localhost:3000 in this case), gives it time to load the JavaScript, and then console.logs the output: in other words, the content the crawler would be looking at.
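To run it, save the script to a file (the name render-check.js below is just a placeholder) and point the PhantomJS binary at it:
phantomjs render-check.js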
Doing this revealed something interesting. It threw the following error:
ReferenceError: Can't find variable: Map
but it still loaded the rest of the page as the unrendered index.html, including the text “You need to enable JavaScript to run this app,” just like “Fetch as Google.” Now I was getting somewhere. I could reproduce and (hopefully) fix the issue locally!
Polyfills
A few additional searches for the error led me to the topic of polyfills, which had already come up in a number of posts while researching this issue. Browsers have always deviated from each other a bit (IE, Firefox, Chrome, etc.). Sometimes functionality that’s expected to exist natively in the browser is missing, and polyfills fill in that missing functionality. In this case our browser, PhantomJS, did not have Map, hence the error and the crashing JavaScript, which stopped the rest of the content from loading.
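As a rough sketch of the pattern (using a much simpler feature than Map, and skipping edge cases a real polyfill would handle), a polyfill boils down to feature detection plus a fallback implementation:
// Simplified illustration only: provide Array.prototype.includes if the
// environment doesn't already have it.
if (!Array.prototype.includes) {
  Array.prototype.includes = function (searchElement) {
    return this.indexOf(searchElement) !== -1;
  };
}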
Two simple solutions seemed common. The first was including a polyfill through a CDN in your index.html:
<script src="https://cdn.polyfill.io/v2/polyfill.min.js"></script>
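The one thing to watch is that this tag needs to load before your app’s bundle so the polyfilled features exist by the time your code runs. In a typical public/index.html (the exact file and layout here are an assumption about your setup), that means putting it in the <head>, ahead of any application scripts, roughly like:
<head>
  <!-- Polyfills load first, before the app bundle runs -->
  <script src="https://cdn.polyfill.io/v2/polyfill.min.js"></script>
</head>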
The second option was to use the babel-polyfill package:
npm install --save babel-polyfill
Then add require('babel-polyfill'); to your config/polyfills.js file, or wherever it’s needed.
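For example, in a Create React App-style entry point (src/index.js here is an assumption; the important part is just that the polyfill is loaded before anything that relies on it), it would look roughly like:
// src/index.js -- load the polyfill before the rest of the app
import 'babel-polyfill';
import React from 'react';
import ReactDOM from 'react-dom';
import App from './App';

ReactDOM.render(<App />, document.getElementById('root'));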
Doing either of these resolves the error and gives PhantomJS the Map functionality.
The final obstacle
Note: this is likely very specific to my project
I thought I might be in the clear after resolving the first polyfill issue. However, when I reran the PhantomJS script, I was pleasantly greeted by a new error:
Error: [mobx] MobX 5+ requires Proxy objects. If your environment doesn't support Proxy objects, please downgrade to MobX 4.
I’m using MobX rather than Redux for this particular project, specifically MobX version 5. Unfortunately, PhantomJS does not support Proxy objects either… and this time they can’t be polyfilled like Map. MobX version 5 is built on top of Proxies, so the solution for now was to downgrade to version 4, which luckily wasn’t too much work for me.
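The downgrade itself is mostly a matter of pinning the older major version, something along the lines of:
npm install --save mobx@4
(The exact 4.x release npm resolves will vary.)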
Upon downgrading, rerunning PhantomJS finally loaded all the JavaScript, CSS, and site content! Pushing the changes live and retesting with “Fetch as Google” yielded an actual site as well.
There’s still a lot more to do to improve SEO, but having a site that the crawlers can actually load is a start.