18 Apr

Tutorial: Importing Content to WordPress

Featured image: migrating drones, from https://www.slashgear.com/researcher-demonstrate-a-flock-of-autonomous-flying-drones-28318932/

I have had to migrate content from static platforms (static content in code, older CMSes…) to WordPress on several occasions. Some of the annoying issues were:

  1. Unsanitized HTML, and extracting the content from the surrounding page shell
  2. Adding template tags and breadcrumbs
  3. Manual copy/paste labor

The Solution

  1. Use JavaScript to sanitize the DOM
  2. Write RegExp rules
  3. Scrape the pages via Node.js and import them directly into WP using the JSON API

The Stack

  • Node 6+ for cleaner JS
  • JSDOM for headless scraping (npm module)
  • WordPress JSON API (WP plugin)
  • WordPress Client (npm module)
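Assuming npm, the modules used below can be installed in one go:

npm install jsdom wpapi fast-csv async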

Steps

  1. List out the URLs of the existing pages you want to migrate. If there is significant variance in the DOM structure between these pages, group them by template so you can process them more easily. If you would like to add additional data (a page title override, custom fields…), include it as extra columns.

Export the list to a CSV. You should have a source.csv with something like:

url, title, description, template, category, meta1...
https://en.wikipedia.org/wiki/Baldwin_Street, "Baldwin Street, Dunedin", "A short suburban road in Dunedin, New Zealand, reputedly the world's steepest street.", Asia and Oceania
...

2. Get WP Ready

  1. On your WP installation, install and activate the [WP REST API](https://wordpress.org/plugins/rest-api/) plugin
  2. Upload and unzip the Basic Auth plugin; it is not in the plugin repo at the time of this writing: https://github.com/eventespresso/Basic-Auth
  3. Since we use Basic Auth, create a temporary user/pass that can be discarded after import.
  4. Test that the API works: navigate to {baseurl}/wp-json. You should see a JSON response with your site’s info
  5. Add the following to .htaccess to enable Basic Auth:
RewriteRule ^index\.php$ - [E=HTTP_AUTHORIZATION:%{HTTP:Authorization},L]
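For context, here is where the line sits in a stock WordPress .htaccess (a sketch; your rewrite block may differ):

# BEGIN WordPress
<IfModule mod_rewrite.c>
RewriteEngine On
RewriteBase /
RewriteRule ^index\.php$ - [E=HTTP_AUTHORIZATION:%{HTTP:Authorization},L]
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule . /index.php [L]
</IfModule>
# END WordPress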

Verify with cURL:

curl -u username:password -i -H 'Accept:application/json' {baseurl}/wp-json/wp/v2/users/me
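A successful response looks roughly like this (abridged; the exact fields vary by WP version):

{
  "id": 1,
  "name": "test",
  "url": "",
  "description": "",
  "link": "https://domain.com/author/test/",
  "slug": "test"
}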

If this doesn’t work, check the Apache global config for restrictions on overrides. And if that doesn’t work, there are other methods of authentication here: https://developer.wordpress.org/rest-api/using-the-rest-api/authentication/

The App Logic

  1. Connect with WP and Authenticate/Authorize
  2. Parse the CSV into JS objects of the form {url:..., title:..., description:...} (see the sketch after this list)
  3. For each URL, scrape with JSDOM and extract custom fields
  4. Extract the body, sanitize HTML
  5. Insert the new post/page via the WP REST API
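For step 2, here is a minimal sketch of the row-to-object mapping (rowsToObjects is a hypothetical helper for illustration; the full example below keeps rows as plain arrays):

// Hypothetical helper: map parsed CSV rows (a 2D array) to objects
// keyed by the header row, e.g. { url: ..., title: ..., description: ... }
const rowsToObjects = (rows) => {
  const [header, ...body] = rows;
  return body.map((row) =>
    header.reduce((obj, key, i) => {
      obj[key.trim()] = row[i];
      return obj;
    }, {})
  );
};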

The Code

Example

For this example, I will scrape the Bhagavad Gita from WikiSource and import each chapter as a WordPress page.

env.js — Environment Config

For the sake of simplicity, I am not using OAuth but Basic HTTP Authentication. This method transmits credentials effectively in plain text over the wire, so use a temporary account and ensure HTTPS is enabled. DO NOT check this file into source control with actual WP credentials.

// These data shouldn't be checked in.
module.exports = {
  'WP_URL': 'https://domain.com',
  'WP_USERNAME': 'test',
  'WP_PASSWORD': 'test'
}
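A simple safeguard is to keep env.js out of version control entirely (assuming git):

# .gitignore
env.js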

data/source.csv — List of URLs

I will use a single-column CSV; you can pass metadata by adding more columns.

url
https://en.wikisource.org/wiki/The_Bhagavad_Gita_(Arnold_translation)/Chapter_1
https://en.wikisource.org/wiki/The_Bhagavad_Gita_(Arnold_translation)/Chapter_2
https://en.wikisource.org/wiki/The_Bhagavad_Gita_(Arnold_translation)/Chapter_3
https://en.wikisource.org/wiki/The_Bhagavad_Gita_(Arnold_translation)/Chapter_4
https://en.wikisource.org/wiki/The_Bhagavad_Gita_(Arnold_translation)/Chapter_5
https://en.wikisource.org/wiki/The_Bhagavad_Gita_(Arnold_translation)/Chapter_6
https://en.wikisource.org/wiki/The_Bhagavad_Gita_(Arnold_translation)/Chapter_7
https://en.wikisource.org/wiki/The_Bhagavad_Gita_(Arnold_translation)/Chapter_8

lib/read.js — Reads CSV Files

For most use cases, this crude CSV parser would suffice:

const fs = require('fs');

try {
  let raw = fs.readFileSync('./data/source.csv', 'utf8');
  let parsed = raw
    .split("\n")              // Rows
    .map(r => r.split(",")    // Fields
      .map(f => f.trim()));   // Trailing whitespace/CR
  console.log(parsed);
} catch (e) {
  console.error(e);
}

But synchronous processing does not scale well (say, 1,000,000 rows), so we’ll use streams, which are more robust. The fast-csv module has built-in support for Node streams. The following code is a starter for a scalable solution:

const csv = require('fast-csv'),
      fs = require('fs');

class List {
  constructor(filePath, limit = 500) {
    this.filePath = filePath || null;
    this.limit = limit;
    this.data = [];
    this.stream = null;
  }
  read() {
    return new Promise((resolve, reject) => {
      if (!(this.filePath && fs.existsSync(this.filePath))) {
        return reject('File does not exist');
      }
      // TODO: implement scalable streaming.
      this.stream = fs.createReadStream(this.filePath);
      this.stream.pipe(csv()).on("data", (raw) => {
        if (this.data.length > this.limit) {
          console.log("Read", "Limit exceeded");
          // destroy() prevents "end" from firing, so resolve first
          resolve(this.data);
          return this.stream.destroy();
        }
        this.data.push(raw);
      }).on("end", () => {
        resolve(this.data);
      });
    });
  }
}

module.exports = {
  List
};

testRead.js — Checkpoint: Verify the CSV is read

const { List } = require ('./lib/read');
let file = new List('./data/source.csv');
file.read().then(console.log, console.error);

Run testRead.js; you should see a 2D array of your CSV.
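With the single-column CSV above, the output should look something like:

[ [ 'url' ],
  [ 'https://en.wikisource.org/wiki/The_Bhagavad_Gita_(Arnold_translation)/Chapter_1' ],
  [ 'https://en.wikisource.org/wiki/The_Bhagavad_Gita_(Arnold_translation)/Chapter_2' ],
  ... ]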

lib/api.js — WP API Wrapper

This file wraps the wpapi npm module to handle authentication and expose only the functions we need: new post and new page.

/*
 * Wrapper around WP-API
 */
const env = require('../env');
const WPAPI = require('wpapi');

class API {
  constructor() {
    this.wp = null;
    this.user = null;
  }

  addPost(title, content, category, meta, type = 'posts', status = 'draft') {
    return new Promise((resolve, reject) => {
      this.wp.posts().create({
        title,
        content,
        status
      }).then(function(response) {
        resolve(response.id);
      }, reject);
    });
  }

  addPage(title, content, category, meta, type = 'pages', status = 'draft') {
    return new Promise((resolve, reject) => {
      this.wp.pages().create({
        title,
        content,
        status
      }).then(function(response) {
        resolve(response.id);
      }, reject);
    });
  }

  initialize() {
    return new Promise((resolve, reject) => {
      if (!this.wp) {
        let config = {
          endpoint: `${env.WP_URL}/wp-json`,
          username: env.WP_USERNAME,
          password: env.WP_PASSWORD,
          auth: true
        };

        this.wp = new WPAPI(config);

        // Verify that it authenticated
        this.wp.users().me().then((user) => {
          this.user = user;
          console.log('API', 'authenticated as', user.name);
          resolve(user);
        }, (error) => reject(error));
      } else {
        reject("API already initialized");
      }
    });
  }
}

module.exports = { API };
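The category and meta arguments are accepted but not forwarded above. A hedged sketch of passing them through (the REST API takes category term IDs in a categories array, and custom meta must be registered with register_meta() on the WP side):

addPost(title, content, category, meta, status = 'draft') {
  const payload = { title, content, status };
  if (category) payload.categories = [category]; // a term ID, not a name
  if (meta) payload.meta = meta;                 // requires register_meta() server-side
  return this.wp.posts().create(payload).then((response) => response.id);
}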

testAPI.js — Checkpoint: Verify the WP connection

const { API } = require('./lib/api');

let api = new API();
api.initialize().then(console.log, console.error);

Run testAPI.js; you should see a JSON object with your user details.

lib/scrape.js — Headless Webpage Scraper

This module wraps JSDOM for extensibility; you can swap it for other libraries (e.g. Cheerio, X-Ray, Phantom…). The fnProcess argument to the constructor expects a function that takes a window object as input and returns parsed JSON. We include jQuery for convenience.

const jsdom = require('jsdom');

class Scrape {
  constructor(url, fnProcess = null, libs = []) {
    this.url = url || null;
    this.libs = ["http://code.jquery.com/jquery.js", ...libs];
    this.fnProcess = (typeof fnProcess === 'function') ? fnProcess : function(window) {
      return window.document.body.innerHTML;
    };
    this.output = null;
  }
  scrape() {
    return new Promise((resolve, reject) => {
      jsdom.env(this.url, this.libs, (err, window) => {
        if (err) {
          return reject(err);
        }
        this.output = this.fnProcess(window);
        resolve(this.output);
      });
    });
  }
}

module.exports = {
  Scrape
};
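One caveat: jsdom.env belongs to older jsdom releases and was removed in jsdom 10. On a newer jsdom, a rough drop-in for the jsdom.env call inside scrape() is the promise-based API below; note it does not inject jQuery, so fnProcess would have to use plain DOM methods:

const { JSDOM } = require('jsdom');

// Inside scrape(), replacing the jsdom.env call (jsdom 10+).
// No external scripts are injected, so use window.document APIs.
JSDOM.fromURL(this.url).then((dom) => {
  this.output = this.fnProcess(dom.window);
  resolve(this.output);
}, reject);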

testScrape.js — Checkpoint: it should scrape example.org

const { Scrape } = require('./lib/scrape');

let page = new Scrape('http://example.org/', function(window) {
  return {
    title: window.document.title,
    body: window.jQuery('p').text()
  };
});
page.scrape().then(console.log, console.error);

Run testScrape.js; you should see a JSON object with the page title and paragraph text.

index.js — The Glue

Now that we’ve tested these components individually, it is time to glue them together. Async is a popular library for managing control flow in Node applications. This is the code version of the logic outlined above.

The Scrape Function

Scrape the fields we want from WikiSource:

// Scrape function to be executed in the DOM
const fnScrape = function(window) {

  // From
  //   The Bhagavad Gita (Arnold translation)/Chapter 1
  // To
  //   Chapter 1

  let $ = window.jQuery;

  let title = $('#header_section_text').text().replace(/["()]/g, ""),
      body = $('.poem').text();
  return {
    title,
    body
  };
};

I tested and fine-tuned this in Chrome DevTools; you should run it against your source URLs to make sure you account for page variations. It runs inside jsdom's simulated browser context.
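For example, paste something like this into the DevTools console on one of the chapter pages (WikiSource ships jQuery, so $ is available there):

// Run in the DevTools console on a WikiSource chapter page
$('#header_section_text').text().replace(/["()]/g, "");
$('.poem').text().slice(0, 200); // preview the start of the extracted body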

The entire file:

const async = require('async');
const { List } = require('./lib/read'),
      { Scrape } = require('./lib/scrape'),
      { API } = require('./lib/api');

const csvFilePath = './data/source.csv',
      LIMIT_PARALLEL = 5;

// Step 1 - Init WP
let api = new API();

// Step 2 - Read CSV
const readTheFile = function() {
  let file = new List(csvFilePath);
  console.log('Reading file...');
  return file.read();
};

// Step 3 - Process multiple URLs
const processPages = function(data) {
  data.shift(); // CSV header
  console.log('Processing', data.length, 'pages');
  async.forEachLimit(data, LIMIT_PARALLEL, processSingle, (err) => {
    if (err) {
      return console.error(err);
    }
    console.log("Done!");
  });
};

// Step 4 - Get a JSON version of a URL
const scrapePage = function(url) {
  return new Promise((resolve, reject) => {
    if (url.indexOf('http') !== 0) {
      return reject('Invalid URL');
    }
    let page = new Scrape(url, fnScrape);
    page.scrape().then((data) => {
      console.log(">> >> Scraped data", data.body.length);
      resolve(data);
    }, reject);
  });
};

// Scrape function to be executed in the DOM
const fnScrape = function(window) {

  // From
  //   The Bhagavad Gita (Arnold translation)/Chapter 1
  // To
  //   Chapter 1

  let $ = window.jQuery;
  let title = $('#header_section_text').text().replace(/["()]/g, ""),
      body = $('.poem').text();
  return {
    title,
    body
  };
};

// Steps 4 & 5 - Scrape a single row, then insert it
const processSingle = function(data, cb) {
  let [url] = data;
  console.log(">> Processing ", url);
  scrapePage(url).then((data) => {
    // Step 5 - Add page to WordPress
    api.addPage(data.title, data.body).then((wpId) => {
      console.log(">> Processed ", wpId);
      cb();
    }, cb);
  }, cb);
};

// Kick start the process
api.initialize()
  .then(readTheFile, console.error)
  .then(processPages, console.error);
console.log('WP Auth...');
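Run it from the project root (assuming the file layout above):

node index.js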

Output

...
>> Processed 140
>> Processing https://en.wikisource.org/wiki/The_Bhagavad_Gita_(Arnold_translation)/Chapter_12
>> >> Scraped data 12634
>> Processed 141
>> Processing https://en.wikisource.org/wiki/The_Bhagavad_Gita_(Arnold_translation)/Chapter_13
>> Processed 142
>> Processing https://en.wikisource.org/wiki/The_Bhagavad_Gita_(Arnold_translation)/Chapter_14
>> Processed 143
>> Processing https://en.wikisource.org/wiki/The_Bhagavad_Gita_(Arnold_translation)/Chapter_15
>> >> Scraped data 3005
>> >> Scraped data 3706
>> >> Scraped data 5297
>> >> Scraped data 4039
>> Processed 144
>> Processing https://en.wikisource.org/wiki/The_Bhagavad_Gita_(Arnold_translation)/Chapter_16
>> >> Scraped data 3835
>> Processed 145
>> Processing https://en.wikisource.org/wiki/The_Bhagavad_Gita_(Arnold_translation)/Chapter_17
>> >> Scraped data 3781
>> Processed 146
>> Processing https://en.wikisource.org/wiki/The_Bhagavad_Gita_(Arnold_translation)/Chapter_18
>> >> Scraped data 11816
>> Processed 147
>> Processed 148
>> Processed 149
>> Processed 150
>> Processed 151
Done!

Check your WP for the new content.

GitHub: Scrape to WP
