I’ve used the WordPress REST API to run batch content migrations, and I kept running into frustrations authenticating my API requests. By default, read-only operations are publicly accessible, but write operations require authentication. The official docs only show cookie-based authentication, so I will demonstrate two additional ways of authenticating with the REST API.
I have had to migrate content from static platforms (static content in code, older CMSes…) to WordPress on several occasions. Some of the annoying issues were:
Unsanitized HTML, and extracting the content from its surrounding shell
Adding template tags and breadcrumbs
Manual copy/paste labor
The Solution
Use JavaScript to sanitize the DOM
Write RegExp rules
Scrape the pages via Node.js and import them directly into WP using the JSON API
The Stack
Node 6+ for cleaner JS
JSDOM for headless scraping (npm module)
WordPress JSON API (WP plugin)
WordPress Client (npm module)
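To follow along, the Node-side dependencies can be installed in one go. The exact package names below are my assumption, based on the list above and the sketches later in this post (the WordPress JSON API itself is a plugin you activate on the WP site, not something you npm install):

npm install jsdom jquery fast-csv async wpapi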
Steps
List out the URLs of the existing pages you want to migrate. If there is significant variance in the DOM structure between these pages, group them by template so you can process them more easily. If you would like to add extra data (a page title override, custom fields…), include it as well.
Export the list to a CSV. You should have a source.csv that looks something like this:
url, title, description, template, category, meta1...
https://en.wikipedia.org/wiki/Baldwin_Street, Baldwin Street\, Dunedin, A short suburban road in Dunedin\, New Zealand\, reputedly the world's steepest street., Asia and Oceania
...
Parse the CSV into JS objects of the form {url: ..., title: ..., description: ...}
For each URL, scrape with JSDOM and extract custom fields
Extract the body and sanitize the HTML
Insert a new post/page via the WP REST API
The Code
Example
For this example, I will scrape the Bhagavad Gita from Wikisource and import each chapter as a WordPress page.
env.js — Environment Config
For the sake of simplicity, I am not using OAuth but Basic HTTP Authentication. This method transmits credentials in plain text over the wire, so use a temporary account and make sure HTTPS is enabled. DO NOT check this file into source control with actual WP credentials.
// These data shouldn't be checked in.
module.exports = {
  'WP_URL': 'https://domain.com',
  'WP_USERNAME': 'test',
  'WP_PASSWORD': 'test'
}
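As a quick sanity check that Basic Auth is actually accepted by the server (it usually requires the Basic Auth plugin on the WP side), you can hit the REST API directly with Node's https module. This is a hedged sketch using the env.js values above; the /wp-json/wp/v2/users/me route is part of the core REST API.

// Quick credential check against the REST API using Basic HTTP Authentication.
// Assumes env.js from above and that the Basic Auth plugin is active on the site.
const https = require('https');
const url = require('url');
const env = require('./env');

const auth = Buffer.from(env.WP_USERNAME + ':' + env.WP_PASSWORD).toString('base64');
const target = url.parse(env.WP_URL + '/wp-json/wp/v2/users/me?context=edit');

https.get({
  hostname: target.hostname,
  path: target.path,
  headers: { 'Authorization': 'Basic ' + auth }
}, (res) => {
  let body = '';
  res.on('data', (chunk) => { body += chunk; });
  // A 200 response with your user profile means authenticated writes should work too.
  res.on('end', () => console.log(res.statusCode, body));
}).on('error', console.error);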
data/source.csv — List of URLs
I will use a single-column CSV; you can pass metadata along by adding more columns.
Processing the file synchronously does not scale well (say, to 1,000,000 rows), so we’ll use streams, which are more robust. The fast-csv module has built-in support for Node streams, which makes a scalable reader straightforward.
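The reader module itself isn’t reproduced in this excerpt, so here is a minimal sketch of what a stream-based lib/read.js could look like. It assumes a recent fast-csv and guesses at the interface of the List class that index.js requires later; adapt it to your actual columns.

// lib/read.js -- stream the CSV instead of loading it all into memory (sketch).
const fs = require('fs');
const csv = require('fast-csv');

class List {
  constructor(path) {
    this.path = path;
  }
  // Calls fnRow for every parsed row ({url, title, ...}) and fnDone when finished.
  read(fnRow, fnDone) {
    fs.createReadStream(this.path)
      .pipe(csv.parse({ headers: true }))
      .on('data', fnRow)
      .on('error', fnDone)
      .on('end', () => fnDone(null));
  }
}

module.exports = { List };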
testAPI.js — Checkpoint, it should return your user details
const { API } = require('./lib/api');
let api = new API();
api.initialize().then(console.log, console.error);
Run testAPI.js; you should see a JSON object with your user details.
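The API wrapper that testAPI.js requires isn’t shown in this excerpt. Below is a hedged sketch of what lib/api.js might look like, assuming the node-wpapi (wpapi) REST client; the original may use a different WordPress client, and createPage is a hypothetical helper added only for illustration.

// lib/api.js -- thin wrapper around a WordPress REST client (sketch, assumes wpapi).
const WPAPI = require('wpapi');
const env = require('../env');

class API {
  constructor() {
    this.wp = new WPAPI({
      endpoint: env.WP_URL + '/wp-json',
      // Basic HTTP Authentication, matching env.js above.
      username: env.WP_USERNAME,
      password: env.WP_PASSWORD
    });
  }
  // Fetch the authenticated user's profile; a handy way to verify credentials.
  initialize() {
    return this.wp.users().me();
  }
  // Hypothetical helper: create a published page with a title and HTML body.
  createPage(title, content) {
    return this.wp.pages().create({ title, content, status: 'publish' });
  }
}

module.exports = { API };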
lib/Scrape.js — Headless Webpage Scraper
This module wraps JSDOM for extensibility; you can swap it out for other libraries (e.g. Cheerio, X-Ray, PhantomJS…). The fnProcess argument to the constructor expects a function that takes a window object as input and returns the parsed data. We include jQuery for convenience.
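The wrapper isn’t included in this excerpt either. As a rough sketch, assuming jsdom v10+ with JSDOM.fromURL and the jquery npm package (the original, written for the Node 6 era, likely used the older jsdom.env API):

// lib/scrape.js -- load a URL in jsdom, attach jQuery, run fnProcess on the window (sketch).
const { JSDOM } = require('jsdom');
const jqueryFactory = require('jquery');

class Scrape {
  constructor(url, fnProcess) {
    this.url = url;
    this.fnProcess = fnProcess;
  }
  // Resolves with whatever fnProcess(window) returns.
  scrape() {
    return JSDOM.fromURL(this.url).then((dom) => {
      const window = dom.window;
      window.jQuery = jqueryFactory(window);
      return this.fnProcess(window);
    });
  }
}

module.exports = { Scrape };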
testScrape.js — Checkpoint, it should scrape example.org
const { Scrape } = require('./lib/scrape');
let page = new Scrape('http://example.org/', function (window) {
  return { title: window.document.title, body: window.jQuery('p').text() };
});
page.scrape().then(console.log, console.error);
Run testScrape.js; you should see a JSON object with the page title and paragraph text from example.org.
index.js — The Glue
Now that we’ve tested these components individually, it is time to glue them together. Async is a popular library for managing control flow in Node applications. The code below implements the logic described above.
The Scrape Function
Scrape the fields we want from WikiSource:
// Scrape function to be executed in the DOM context
const fnScrape = function (window) {
  // From: The Bhagavad Gita (Arnold translation)/Chapter 1
  // To:   Chapter 1
  let $ = window.jQuery;
  let title = $('#header_section_text').text().replace(/["()]/g, "");
  let body = $('.poem').text();
  return { title, body };
};
I tested and fine-tuned this in the Chrome DevTools console. You should run this test against a sample of your source URLs to make sure you account for page variations. Remember that this function runs inside the emulated browser context.
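For instance, on a MediaWiki page (Wikisource exposes jQuery globally), you can paste something like this into the DevTools console to preview what the extraction will return; the selectors are the same ones used in fnScrape above:

// Run in the DevTools console on a Wikisource chapter page.
({
  title: jQuery('#header_section_text').text().replace(/["()]/g, ""),
  body: jQuery('.poem').text().slice(0, 200) // preview only the first 200 characters
})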
The entire file:
const async = require('async');
const { List } = require('./lib/read'),
  { Scrape } = require('./lib/scrape'),
  { API } = require('./lib/api');