23 Jul

Tutorial: IP Whitelisting for Docker Containers

You can use iptables to restrict network access to an individual container without altering the host's rules or introducing external firewalls.

Why?

Some potential use cases:

  1. I have an app that’s not ready for production but needs to be tested on a production server.
  2. I have multi-tenancy in my SaaS and the customers want to restrict usage of a container just to their IP range.
  3. I host services (databases, proxies, caches…) that rely on IP-based whitelist authorization.

How?

Install iptables inside the Docker container and use the docker exec command to alter its rules.

Here’s an example of IP-based restriction for an Nginx container.

Requirements

  1. A VPS (DigitalOcean or Vultr work) with Docker installed
  2. Two different public IP addresses to test from. You can toggle a VPN or SSH into another machine you have.

Getting Started

I have a staging container on a shared VPS that should be accessible only from a range of IPs (say, an office VPN).

We will refer to the machines and IPs as follows:

A) 222.100.100.100 — The VPS

B) 222.200.200.200 — Trusted source IP

Step 1. Prepare your VPS

Install Docker and confirm that the port you plan to expose is open in your firewall. Some providers, like AWS EC2, require that you manually adjust Security Groups to allow incoming external traffic.
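If Docker isn't installed yet, one common route on an Ubuntu/Debian VPS is the official convenience script (an assumption; follow your provider's preferred method if it differs):

# One common way to install Docker on a fresh Ubuntu/Debian VPS (adjust for your provider)
curl -fsSL https://get.docker.com | sh

# Confirm the daemon responds
docker info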

Launch a basic Nginx container that listens on all IP addresses (0.0.0.0) on port 8080

Using iptables within a container requires additional Linux network capabilities, so we add the NET_ADMIN and NET_RAW capabilities when launching it.

docker pull nginx
docker run --cap-add=NET_ADMIN --cap-add=NET_RAW --name app1 -d -p 0.0.0.0:8080:80 nginx

Test that it works

curl http://222.100.100.100:8080

should print the Nginx welcome screen on all machines.

Step 2. Install iptables

The image may not ship with iptables pre-installed, so install it inside the container via docker exec:

docker exec app1 apt update
docker exec app1 apt install iptables -y

Verify it works:

docker exec app1 iptables -S

It should output:

-P INPUT ACCEPT
-P FORWARD ACCEPT
-P OUTPUT ACCEPT

Step 3. Block All Traffic

First, we block all traffic to the port bound inside the container. Here, that is port 80, not 8080: Docker's port mapping translates host port 8080 to container port 80 before the container's own iptables rules see the packet.

docker exec app1 iptables -A INPUT -p tcp --dport 80 -j DROP

Verify that it’s blocked

From an external machine, curl http://222.100.100.100:8080 should not work.

Step 4. Whitelist IPs

Then, we whitelist the trusted source IP:

docker exec app1 iptables -I INPUT -p tcp --dport 80 --source 222.200.200.200 -j ACCEPT

Notice the -I flag, which inserts the rule at the top of the INPUT chain, ensuring the whitelist entries take precedence over the DROP rule we added in Step 3.

curl http://222.100.100.100:8080

This will work only from the whitelisted IP. If you can access it from any other source, something was misconfigured.
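If you need to allow a whole range rather than a single address (as in the multi-tenant use case above), the same rule accepts CIDR notation. A sketch, assuming a hypothetical trusted range of 222.200.200.0/24:

# Whitelist an entire CIDR block (substitute your own trusted range)
docker exec app1 iptables -I INPUT -p tcp --dport 80 --source 222.200.200.0/24 -j ACCEPT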

Step 5. Remove Whitelist IPs

To remove a whitelist, you can retrieve a list of all your rules:

docker exec app1 iptables -S

and then copy/paste the rule into a -D command, which deletes it:

docker exec app1 iptables -D INPUT -p tcp --dport 80 --source 222.200.200.200 -j ACCEPT
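If the chain gets into a confusing state, you can also flush the container's INPUT chain entirely and start over. Note that this removes every rule in it, including the DROP rule from Step 3:

# Flushes all rules in the container's INPUT chain
docker exec app1 iptables -F INPUT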

Step 6. Testing

You can use https://www.geoscreenshot.com to test HTTP access from many IPs.

23 Jul

Tutorial: Authenticating with WordPress REST API

I’ve used the WordPress REST API to run batch content migrations and ran into frustrations authenticating my API requests. By default, read-only operations are publicly accessible, but write operations require authentication. The official docs only show cookie-based authentication, so I will demonstrate two additional methods of authenticating with the REST API.

Pre-requisites

  • WordPress installed with Apache
  • Access to WP CLI and .htaccess

This tutorial assumes that WordPress is running on http://localhost/

Method 1: Basic Auth Plugin

  1. Verify that the REST API is running; the request below should return a 200 response.

curl -i http://localhost/wp-json/wp/v2/posts

2. Install and activate the latest version:

wp plugin install https://github.com/WP-API/Basic-Auth/archive/master.zip --activate

3. Modify .htaccess
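The plugin needs Apache to pass the Authorization header through to PHP. A minimal sketch for .htaccess, using the same rewrite rule that appears in the import tutorial further down; adjust to your Apache setup:

RewriteEngine On
RewriteRule ^index\.php$ - [E=HTTP_AUTHORIZATION:%{HTTP:Authorization},L]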

4. Test
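A quick check with Basic Auth credentials; a successful setup returns your user object instead of a 401 (username and password are placeholders):

curl -u username:password -i http://localhost/wp-json/wp/v2/users/me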

Method 2: JWT Auth Plugin

  1. Install and activate the plugin:

wp plugin install jwt-authentication-for-wp-rest-api --activate

2. Modify .htaccess
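Per the plugin's README at the time of writing, it needs a signing secret defined in wp-config.php plus the Authorization header passed through. A sketch of both, using WP-CLI for the constant (the secret value is a placeholder):

# Define the signing secret as a constant in wp-config.php
wp config set JWT_AUTH_SECRET_KEY 'replace-with-a-long-random-string' --type=constant

And in .htaccess:

RewriteEngine On
RewriteCond %{HTTP:Authorization} ^(.*)
RewriteRule ^(.*) - [E=HTTP_AUTHORIZATION:%1]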

3. Test
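Exchange your credentials for a token at the plugin's token endpoint, then send it as a Bearer header (endpoint path per the plugin's README; username/password are placeholders):

# Request a token
curl -X POST -d 'username=username&password=password' http://localhost/wp-json/jwt-auth/v1/token

# Use the token on an authenticated request
curl -H 'Authorization: Bearer <token>' http://localhost/wp-json/wp/v2/users/me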

Integrating with node-wpapi

  1. Add a helper function
  2. Store the JWT in the Authorization header (a sketch of both follows)
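A minimal sketch of both steps. It assumes a fetch implementation is available (Node 18+, or the node-fetch package) and that your wpapi version supports setHeaders; names like createClient are hypothetical:

const WPAPI = require('wpapi');

// Helper: exchange credentials for a JWT, then attach it to every wpapi request
async function createClient(endpoint, username, password) {
  // Assumes the JWT Auth plugin's token endpoint from Method 2
  const res = await fetch(`${endpoint}/jwt-auth/v1/token`, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ username, password })
  });
  if (!res.ok) throw new Error(`Token request failed: ${res.status}`);
  const { token } = await res.json();

  // Store the JWT in the Authorization header for all subsequent requests
  const wp = new WPAPI({ endpoint });
  wp.setHeaders('Authorization', `Bearer ${token}`);
  return wp;
}

// Usage
createClient('http://localhost/wp-json', 'username', 'password')
  .then(wp => wp.users().me())
  .then(user => console.log('Authenticated as', user.name), console.error);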
18 Apr

Tutorial: Importing Content to WordPress

Cover image: migrating drones, taken from https://www.slashgear.com/researcher-demonstrate-a-flock-of-autonomous-flying-drones-28318932/

I have had to migrate content from static platforms (static content in code, older CMSes…) to WordPress on several occasions. Some of the annoying issues were:

  1. Unsanitized HTML and extracting content from the shell
  2. Adding template tags and breadcrumbs
  3. Manual copy/paste labor

The Solution

  1. Use JavaScript to sanitize the DOM
  2. Write RegExp rules
  3. Scrape the pages via Node.js and import them directly into WP using the JSON API

The Stack

  • Node 6+ for cleaner JS
  • JSDOM for headless scraping (npm module)
  • WordPress JSON API (WP plugin)
  • WordPress Client (npm module)

Steps

  1. List out the URLs of the existing pages you want to migrate. If there is significant variance in the DOM structure between these pages, group them by template so they are easier to process. If you would like to add extra data (a page title override, custom fields…), include it as additional columns.

Export the list to a CSV. You should have a source.csv with something like:

url, title, description, template, category, meta1...
https://en.wikipedia.org/wiki/Baldwin_Street, Baldwin Street\, Dunedin, A short suburban road in Dunedin, New Zealand, reputedly the world's steepest street., Asia and Oceania
...

2. Get WP Ready

  1. On your WP installation, install and activate the WP REST API plugin (https://wordpress.org/plugins/rest-api/)
  2. Upload and unzip the Basic Auth plugin (https://github.com/eventespresso/Basic-Auth); it is not in the plugin repo at the time of this writing.
  3. Since we use Basic Auth, create a temporary user/pass that can be discarded after the import.
  4. Test that the API works: navigate to {baseurl}/wp-json. You should see a JSON response with your site’s info.
  5. Add the following to .htaccess to enable Basic Auth:
RewriteRule ^index\.php$ - [E=HTTP_AUTHORIZATION:%{HTTP:Authorization},L]

Verify with cURL:

curl -u username:password -i -H 'Accept:application/json' {baseurl}/wp-json/wp/v2/users/me

It should display your user information. If this doesn’t work, check the Apache global config for restrictions on overrides (AllowOverride). And if that doesn’t work, there are other methods of authentication here: https://developer.wordpress.org/rest-api/using-the-rest-api/authentication/

The App Logic

  1. Connect with WP and Authenticate/Authorize
  2. Parse the CSV into a JS form {url:..., title:..., description:...}
  3. For each URL, scrape with JSDOM and extract custom fields
  4. Extract the body, sanitize HTML
  5. Insert the new post/page via the WP REST API

The Code

Example

For this example, I will scrape the Bhagavad Gita from Wikisource and create a page for each chapter.

env.js — Environment Config

For the sake of simplicity, I am not using OAuth but Basic HTTP Authentication. This method demands that credentials are transmitted in plain text over the wire. Use a temporary account and ensure HTTPS is enabled. DO NOT check this into source control with actual WP credentials.

// These data shouldn't be checked in.
module.exports = {
  'WP_URL': 'https://domain.com',
  'WP_USERNAME': 'test',
  'WP_PASSWORD': 'test'
}

data/source.csv — List of URLs

I will use a single-column CSV; you can pass metadata by adding more columns.

url
https://en.wikisource.org/wiki/The_Bhagavad_Gita_(Arnold_translation)/Chapter_1
https://en.wikisource.org/wiki/The_Bhagavad_Gita_(Arnold_translation)/Chapter_2
https://en.wikisource.org/wiki/The_Bhagavad_Gita_(Arnold_translation)/Chapter_3
https://en.wikisource.org/wiki/The_Bhagavad_Gita_(Arnold_translation)/Chapter_4
https://en.wikisource.org/wiki/The_Bhagavad_Gita_(Arnold_translation)/Chapter_5
https://en.wikisource.org/wiki/The_Bhagavad_Gita_(Arnold_translation)/Chapter_6
https://en.wikisource.org/wiki/The_Bhagavad_Gita_(Arnold_translation)/Chapter_7
https://en.wikisource.org/wiki/The_Bhagavad_Gita_(Arnold_translation)/Chapter_8

lib/Read.js — Reads CSV Files

For most use cases, this crude CSV parser would suffice:

const fs = require('fs');
try {
  let raw = fs.readFileSync('./data/source.csv', 'utf8')
  let parsed = raw.split("\n")   // Rows
    .map(r => r.split(",")       // Fields
    .map(f => f.trim()))         // Trailing chars (\r\n)
} catch (e) {
  console.error(e);
}

But synchronous processing does not scale well (say, to 1,000,000 rows), so we’ll use streams, which are more robust. The fast-csv module has built-in support for Node streams. The following code is a starter for a scalable solution:

const csv = require('fast-csv'),
  fs = require('fs');

class List {
  constructor(filePath, limit = 500) {
    this.filePath = filePath || null;
    this.limit = limit;
    this.data = [];
    this.stream = null;
  }
  read() {
    return new Promise((resolve, reject) => {
      if (!(this.filePath && fs.existsSync(this.filePath))) {
        return reject('File does not exist');
      }
      // TODO: implement scalable streaming.
      this.stream = fs.createReadStream(this.filePath);
      this.stream.pipe(csv()).on("data", (raw) => {
        if (this.data.length > this.limit) {
          console.log("Read", "Limit exceeded");
          return this.stream.destroy();
        }
        this.data.push(raw);
      }).on("end", () => {
        resolve(this.data)
      });
    })
  }
}
module.exports = {
  List
};

testRead.js — Checkpoint: Verify the CSV is read

const { List } = require ('./lib/read');
let file = new List('./data/source.csv');
file.read().then(console.log, console.error);

Run testRead.js; you should see a 2D array of your CSV.

lib/API.js — WP API Wrapper

This file wraps the wpapi npm module to handle authentication and expose only the functions we need: new post and new page.

/*
 * Wrapper around WP-API
 */
const env = require('../env');
const WPAPI = require('wpapi');

class API {
  constructor() {
    this.wp = null;
    this.user = null;
  }

  addPost(title, content, category, meta, type = 'posts', status = 'draft') {
    return new Promise((resolve, reject) => {
      this.wp.posts().create({
        title,
        content,
        status
      }).then(function(response) {
        resolve(response.id);
      }, reject);
    });
  }

  addPage(title, content, category, meta, type = 'posts', status = 'draft') {
    return new Promise((resolve, reject) => {
      this.wp.pages().create({
        title,
        content,
        status
      }).then(function(response) {
        resolve(response.id);
      }, reject);
    });
  }

  initialize() {
    return new Promise((resolve, reject) => {
      if (!this.wp) {
        let config = {
          endpoint: `${env.WP_URL}/wp-json`,
          username: env.WP_USERNAME,
          password: env.WP_PASSWORD,
          auth: true
        }

        this.wp = new WPAPI(config)

        // Verify that it authenticated
        this.wp.users().me().then((user) => {
          this.user = user;
          console.log('API', 'authenticated as', user.name);
          resolve(user);
        }, (error) => reject(error))
      } else {
        reject("API already initialized");
      }
    });
  }
}

module.exports = { API };

testAPI.js — Checkpoint: Verify the WP connection

const { API } = require('./lib/api');
let api = new API();
api.initialize().then(console.log, console.error);

Run testAPI.js; you should see a JSON object with your user details.

lib/Scrape.js — Headless Webpage Scraper

This module wraps JSDOM for extensibility. You can swap it out for other libraries (e.g. Cheerio, X-Ray, Phantom…). The fnProcess argument to the constructor expects a function that takes a window object as input and returns parsed JSON. We include jQuery for convenience.

const jsdom = require('jsdom');

class Scrape {
  constructor(url, fnProcess = null, libs = []) {
    this.url = url || null;
    this.libs = ["http://code.jquery.com/jquery.js", ...libs];
    this.fnProcess = (typeof fnProcess === 'function') ? fnProcess : function(window) {
      return window.document.body.innerHTML;
    }
    this.output = null;
  }
  scrape() {
    return new Promise((resolve, reject) => {
      jsdom.env(this.url, this.libs, (err, window) => {
        if (err) {
          return reject(err);
        }
        this.output = this.fnProcess(window);
        resolve(this.output);
      });
    });
  }
}
module.exports = {
  Scrape
}

testScrape.js — Checkpoint: it should scrape example.org

const { Scrape } = require('./lib/scrape');
let page = new Scrape('http://example.org/', function (window) {
  return { title: window.document.title, body: window.jQuery('p').text() }
})
page.scrape().then(console.log, console.error);

Run testScrape.js; you should see a JSON object with the title and paragraph text from example.org.

index.js — The Glue

Now that we’ve tested these components individually, it is time to glue them together. async is a popular library for managing control flow in Node applications. This is the code version of the logic outlined above.

The Scrape Function

Scrape the fields we want from WikiSource:

// Scrape function to be executed in DOM
const fnScrape = function(window) {

  // From
  // The Bhagavad Gita (Arnold translation)/Chapter 1
  // To
  // Chapter 1

  let $ = window.jQuery;

  let title = $('#header_section_text').text().replace(/["()]/g, ""),
    body = $('.poem').text()
  return {
    title,
    body
  };
}

I tested and fine-tuned this in Chrome DevTools. You should run it against your source URLs to make sure you account for page variations; later, it runs inside the fake (jsdom) browser context.
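For a quick manual check, a sketch of what to run in the DevTools console on one of the source URLs, after pasting the fnScrape definition (this assumes the page exposes jQuery on window, which MediaWiki pages generally do):

// Preview the extracted fields for the current page
let preview = fnScrape(window);
console.log(preview.title, preview.body.length);
// copy(preview) // DevTools-only utility: copies the result to the clipboard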

The entire file:

const async = require('async');
const { List } = require('./lib/read'),
  { Scrape } = require('./lib/scrape'),
  { API } = require('./lib/api');

const csvFilePath = './data/source.csv',
  LIMIT_PARALLEL = 5;

// Step 1 - Init WP
let api = new API();

// Step 2 - Read CSV
const readTheFile = function() {
  let file = new List(csvFilePath);
  console.log('Reading file...');
  return file.read();
};

// Step 3 - Process multiple URLs
const processPages = function(data) {
  data.shift(); // CSV header
  console.log('Processing', data.length, 'pages');
  async.forEachLimit(data, LIMIT_PARALLEL, processSingle, (err) => {
    if (err) {
      return console.error(err);
    }
    console.log("Done!");
  });
};

// Step 4 - Get a JSON version of a URL
const scrapePage = function(url) {
  return new Promise((resolve, reject) => {
    if (url.indexOf('http') !== 0) {
      return reject('Invalid URL');
    }
    let page = new Scrape(url, fnScrape);
    page.scrape().then((data) => {
      console.log(">> >> Scraped data", data.body.length);
      resolve(data);
    }, reject);
  });
};

// Scrape function to be executed in DOM
const fnScrape = function(window) {

  // From
  // The Bhagavad Gita (Arnold translation)/Chapter 1
  // To
  // Chapter 1

  let $ = window.jQuery;
  let title = $('#header_section_text').text().replace(/["()]/g, ""),
    body = $('.poem').text()
  return {
    title,
    body
  };
}

// Process a single row: scrape the URL and add the page to WordPress
const processSingle = function(data, cb) {
  let [url] = data;
  console.log(">> Processing ", url);
  scrapePage(url).then((data) => {
    // Step 5 - Add page to WordPress
    api.addPage(data.title, data.body).then((wpId) => {
      console.log(">> Processed ", wpId);
      cb();
    }, cb)
  }, cb);
}

// Kick start the process
api.initialize()
  .then(readTheFile, console.error)
  .then(processPages, console.error);
console.log('WP Auth...');

Output

...
>> Processed 140
>> Processing https://en.wikisource.org/wiki/The_Bhagavad_Gita_(Arnold_translation)/Chapter_12
>> >> Scraped data 12634
>> Processed 141
>> Processing https://en.wikisource.org/wiki/The_Bhagavad_Gita_(Arnold_translation)/Chapter_13
>> Processed 142
>> Processing https://en.wikisource.org/wiki/The_Bhagavad_Gita_(Arnold_translation)/Chapter_14
>> Processed 143
>> Processing https://en.wikisource.org/wiki/The_Bhagavad_Gita_(Arnold_translation)/Chapter_15
>> >> Scraped data 3005
>> >> Scraped data 3706
>> >> Scraped data 5297
>> >> Scraped data 4039
>> Processed 144
>> Processing https://en.wikisource.org/wiki/The_Bhagavad_Gita_(Arnold_translation)/Chapter_16
>> >> Scraped data 3835
>> Processed 145
>> Processing https://en.wikisource.org/wiki/The_Bhagavad_Gita_(Arnold_translation)/Chapter_17
>> >> Scraped data 3781
>> Processed 146
>> Processing https://en.wikisource.org/wiki/The_Bhagavad_Gita_(Arnold_translation)/Chapter_18
>> >> Scraped data 11816
>> Processed 147
>> Processed 148
>> Processed 149
>> Processed 150
>> Processed 151
Done!

Check your WP for the new content.
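You can also verify over the same REST API. The pages are created as drafts, so the request needs the same credentials (the values below come from the sample env.js):

# Lists the five most recent draft pages created by the import
curl -u test:test -H 'Accept: application/json' 'https://domain.com/wp-json/wp/v2/pages?status=draft&per_page=5'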

GitHub: Scrape to WP

16 Apr

Tutorial: Scraping Infinite List Pagination

Cover image: an example of infinite scrolling, taken from https://dev-blog.apollodata.com/pagination-and-infinite-scrolling-in-apollo-client-59ff064aac61

TL;DR

Scraping pages that use the infinite scroll pattern can be challenging. This guide shows one approach to tackling the problem.

Intro

Web scraping is a popular (sometimes controversial) option for fetching structured data from web sites that don’t offer a public API.

In the case of traditional web applications, server-side rendered HTML can be fetched using HTTP clients (e.g. cURL, Wget or HTTP libraries) and deconstructed using a DOM parser. Pagination is generally handled by following links or incrementing GET parameters, and that logic can be replicated at scale. Due to the low CPU consumption and lightweight payload (initial render HTML) of such scrapers, these apps can be scraped at high performance and low cost.

Modern web apps that fetch data dynamically on the client side typically make paginated AJAX requests to a public API endpoint. In such scenarios, replaying those HTTP calls (identified, for example, via DevTools) makes the task trivial. In most cases, this is the preferred approach.

However, some web apps require authenticated sessions, use alternative protocols (WebSockets) or rely on nonced API calls that are challenging to replicate. In these cases, you can run an actual browser (Selenium, PhantomJS, Chrome Headless) and scrape the DOM from the console to get the desired results. By automating user behavior this way, it is possible to automate complex flows with good reliability (full web-standards support, low risk of detection).


An Example

For this example, I will use Quora's search result page for "What is the meaning of life?". It should yield enough results for our purposes. The end result will be a JSON array of the following data for each entry:

  • Title
  • Excerpt
  • Link

Note: This is strictly for educational purposes only, please respect Quora’s TOS regarding scrapers (https://www.quora.com/about/tos)

This is what the page looks like: entries are automatically appended as you scroll down.

It looks like the page makes fragmented AJAX requests, with new calls added as you scroll. For the purpose of this article (and general laziness), I will assume the requests are nonced and can't easily be reproduced on the server side.

The Strategy

  1. Navigate to a search result page with an actual browser
  2. Identify the selectors for the desired recurring DOM elements
  3. Loop through visible elements
  4. Scrape the data into an ECMAScript Set
  5. Empty the screen contents and scroll the full viewport
  6. Repeat 3–5 until there are no more elements
  7. Serialize the results as a JSON

Parts

  1. Identify Selectors and Extract Entries
  2. Emulate scroll behavior and lazy load list
  3. Loop until all entries are fetched and return JSON
  4. Complete Script
  5. Headless Automation

Helpers

First, alias the console methods. This is useful when automating via a headless browser because we can later override these aliases with custom logger functions in the script context (Node.js, Python, Lua, etc.) instead of the browser console.

const _log = console.info,
  _warn = console.warn,
  _error = console.error,
  _time = console.time,
  _timeEnd = console.timeEnd;

// Current page counter (incremented as we paginate)
let page = 1;

// Global Set to store all entries (prevents dupes)
let threads = new Set();

// Pause between pagination, fine-tune according to load times
const PAUSE = 4000;

Part 1. Identify Selectors and Extract Entries

At the time of writing, I was able to infer these selectors from this URL: https://www.quora.com/search?q=meaning%20of%20life%3F&type=answer . Since most feeds / lazy loaded lists follow a similar DOM structure, you may be able to re-use this script by simply modifying the selectors.

// Class for Individual Thread
const C_THREAD = '.pagedlist_item:not(.pagedlist_hidden)';

// Class for threads marked for deletion on subsequent loop
const C_THREAD_TO_REMOVE = '.pagedlist_item:not(.pagedlist_hidden) .TO_REMOVE';

// Class for Title
const C_THREAD_TITLE = '.title';

// Class for Description
const C_THREAD_DESCRIPTION = '.search_result_snippet .search_result_snippet .rendered_qtext';

// Class for ID
const C_THREAD_ID = '.question_link';

// DOM attribute for link
const A_THREAD_URL = 'href';

// DOM attribute for ID
const A_THREAD_ID = 'id';

Scrape a single entry

// Accepts a parent DOM element and extracts the title and URL
function scrapeSingleThread(elThread) {
  try {
    const elTitle = elThread.querySelector(C_THREAD_TITLE),
      elLink = elThread.querySelector(C_THREAD_ID),
      elDescription = elThread.querySelector(C_THREAD_DESCRIPTION);
    if (elTitle) {
      const title = elTitle.innerText.trim(),
        description = elDescription.innerText.trim(),
        id = elLink.getAttribute(A_THREAD_ID),
        url = elLink.getAttribute(A_THREAD_URL);

      threads.add({
        title,
        description,
        url,
        id
      });
    }
  } catch (e) {
    _error("Error capturing individual thread", e);
  }
}

Scrape all visible threads. Loops through each thread and parses the details. It returns the thread count.

// Get all threads in the visible context
function scrapeThreads() {
  _log("Scraping page %d", page);
  const visibleThreads = document.querySelectorAll(C_THREAD);
  if (visibleThreads.length > 0) {
    _log("Scraping page %d... found %d threads", page, visibleThreads.length);
    Array.from(visibleThreads).forEach(scrapeSingleThread);
  } else {
    _warn("Scraping page %d... found no threads", page);
  }
  // Return the number of visible threads
  return visibleThreads.length;
}

Execute the two snippets above in your browser console. If you then call scrapeThreads() at this stage, it should return a count and the global Set should start to populate.

Part 2. Emulate scroll behavior and lazy load list

We can use JS to scroll to the bottom of the screen. This function is executed after every successful `scrapeThreads`

// Scrolls to the bottom of the viewport
function loadMore() {
  _log("Load more... page %d", page);
  window.scrollTo(0, document.body.scrollHeight);
}

Clear the DOM of entries that have already been processed:

// Clears the list between pagination to preserve memory
// Otherwise, browser starts to lag after about 1000 threads
function clearList() {
  _log("Clearing list page %d", page);
  const toRemove = `${C_THREAD_TO_REMOVE}_${(page - 1)}`,
    toMark = `${C_THREAD_TO_REMOVE}_${(page)}`;
  try {
    // Remove threads previously marked for removal
    document.querySelectorAll(toRemove)
      .forEach(e => e.parentNode.removeChild(e));

    // Mark visible threads for removal on next iteration
    document.querySelectorAll(C_THREAD)
      .forEach(e => e.className = toMark.replace(/\./g, ''));
  } catch (e) {
    _error("Unable to remove elements", e.message)
  }
}

clearList() is called before every loadMore(). This helps us control the DOM memory usage (in the case of 1000s of pages) and also eliminates the need to keep a cursor.

Part 3. Loop until all entries are fetched and return JSON

The flow of the script is tied together here. loop() calls itself until the visible threads are exhausted. (The resolve and reject calls below come from the Promise wrapper shown in the complete script in Part 4.)

// Recursive loop that ends when there are no more threads
function loop() {
  _log("Looping... %d entries added", threads.size);
  if (scrapeThreads()) {
    try {
      clearList();
      loadMore();
      page++;
      setTimeout(loop, PAUSE)
    } catch (e) {
      reject(e);
    }
  } else {
    _timeEnd("Scrape");
    resolve(Array.from(threads));
  }
}

Part 4. Complete Script

You can run and tweak this script in your browser console. It returns a promise that resolves to a JS array of entry objects.

(function() {
  return new Promise((resolve, reject) => {
    // Class for Individual Thread
    const C_THREAD = '.pagedlist_item:not(.pagedlist_hidden)';
    // Class for threads marked for deletion on subsequent loop
    const C_THREAD_TO_REMOVE = '.pagedlist_item:not(.pagedlist_hidden) .TO_REMOVE';
    // Class for Title
    const C_THREAD_TITLE = '.title';
    // Class for Description
    const C_THREAD_DESCRIPTION = '.search_result_snippet .search_result_snippet .rendered_qtext';
    // Class for ID
    const C_THREAD_ID = '.question_link';
    // DOM attribute for link
    const A_THREAD_URL = 'href';
    // DOM attribute for ID
    const A_THREAD_ID = 'id';

    const _log = console.info,
      _warn = console.warn,
      _error = console.error,
      _time = console.time,
      _timeEnd = console.timeEnd;

    _time("Scrape");

    let page = 1;

    // Global Set to store all entries (eliminates dupes)
    let threads = new Set();

    // Pause between pagination
    const PAUSE = 4000;

    // Accepts a parent DOM element and extracts the title and URL
    function scrapeSingleThread(elThread) {
      try {
        const elTitle = elThread.querySelector(C_THREAD_TITLE),
          elLink = elThread.querySelector(C_THREAD_ID),
          elDescription = elThread.querySelector(C_THREAD_DESCRIPTION);
        if (elTitle) {
          const title = elTitle.innerText.trim(),
            description = elDescription.innerText.trim(),
            id = elLink.getAttribute(A_THREAD_ID),
            url = elLink.getAttribute(A_THREAD_URL);

          threads.add({
            title,
            description,
            url,
            id
          });
        }
      } catch (e) {
        _error("Error capturing individual thread", e);
      }
    }

    // Get all threads in the visible context
    function scrapeThreads() {
      _log("Scraping page %d", page);
      const visibleThreads = document.querySelectorAll(C_THREAD);
      if (visibleThreads.length > 0) {
        _log("Scraping page %d... found %d threads", page, visibleThreads.length);
        Array.from(visibleThreads).forEach(scrapeSingleThread);
      } else {
        _warn("Scraping page %d... found no threads", page);
      }
      // Return the number of visible threads
      return visibleThreads.length;
    }

    // Clears the list between pagination to preserve memory
    // Otherwise, browser starts to lag after about 1000 threads
    function clearList() {
      _log("Clearing list page %d", page);
      const toRemove = `${C_THREAD_TO_REMOVE}_${(page - 1)}`,
        toMark = `${C_THREAD_TO_REMOVE}_${(page)}`;
      try {
        // Remove threads previously marked for removal
        document.querySelectorAll(toRemove)
          .forEach(e => e.parentNode.removeChild(e));

        // Mark visible threads for removal on next iteration
        document.querySelectorAll(C_THREAD)
          .forEach(e => e.className = toMark.replace(/\./g, ''));
      } catch (e) {
        _error("Unable to remove elements", e.message)
      }
    }

    // Scrolls to the bottom of the viewport
    function loadMore() {
      _log("Load more... page %d", page);
      window.scrollTo(0, document.body.scrollHeight);
    }

    // Recursive loop that ends when there are no more threads
    function loop() {
      _log("Looping... %d entries added", threads.size);
      if (scrapeThreads()) {
        try {
          clearList();
          loadMore();
          page++;
          setTimeout(loop, PAUSE)
        } catch (e) {
          reject(e);
        }
      } else {
        _timeEnd("Scrape");
        resolve(Array.from(threads));
      }
    }
    loop();
  });
})().then(console.log)

Part 5. Headless Automation

Since the script runs in the browser context, it should work with any modern browser automation framework that allows custom JS execution. For this example, I will use Puppeteer (headless Chrome) with Node.js 8.

Save the script as a Node module, script.js, in CommonJS format:

module.exports = function() {
//...script
}

Install Puppeteer (npm install puppeteer) and:

const puppeteer = require('puppeteer');
const script = require('./script');
const { writeFileSync } = require("fs");

function save(raw) {
  writeFileSync('results.json', JSON.stringify(raw));
}

const URL = 'https://www.quora.com/search?q=meaning%20of%20life&type=answer';

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  page.on('console', msg => console.log(msg.text()));
  await page.goto(URL);
  const threads = await page.evaluate(script);
  save(threads);
  await browser.close();
})();

The script should produce an output similar to this:

[  
{
"title":"Does life have a purpose or not? If not, does that give us the chance to make up any purpose we choose?",
"description":"Dad to son \"Son, do you know that I have been thinking about the meaning of life since I was a little kid of your age.\" His son keeps on licking his ice cream. … \"And you kno...",
"url":"/Does-life-have-a-purpose-or-not-If-not-does-that-give-us-the-chance-to-make-up-any-purpose-we-choose",
"id":"__w2_JaoJDz0_link"
},
{
"title":"What is the meaning of life?",
"description":"We don't know. We can't know. But... … Every religion and every philosophy builds itself around attempting to answer this question. And they do it on faith because life d...",
"url":"/What-is-the-meaning-of-life-66",
"id":"__w2_Qov8B7u_link"
},...
]

Complete Code: