Backend Web-Scraping with Kubernetes, Puppeteer, Node.js, GraphQL, RabbitMQ and MongoDB — getting news headlines hourly

Danny Simantov
Published in The Startup
Apr 28, 2020 · 5 min read


Given that today’s global uncertainty is considered among the murkiest in modern history, it’s easy to guess that it strongly affects the mood of the masses, with mass media deeply shaping how we experience the world and ourselves. Curious about the frequency and dynamics of rapidly changing news headlines, I created a basic system that consistently and automatically web-scrapes the major news sites in Israel (Ynet, Walla and Israel Hayom) and stores the headlines as objects in a non-relational DB, which may later serve behavioral/statistical analysis, machine-learning models, and so on.

TL;DR — exploring the architecture and components:

Kubernetes: Container-orchestration platform that manages our cluster.

RabbitMQ: Message broker based on AMQP. It will help our components communicate with one another.

Puppeteer: Node.js library with the awesome capability to communicate with Chrome/Chromium via the DevTools protocol. Puppeteer will be our web-scraper.

GraphQL: Query language that follows client/server principles.

MongoDB: Non-relational DB that will store our news headlines as objects.

Before we begin, let’s use the ‘kubectl get pods’ command (the Kubernetes CLI) to verify that everything is up and well:

kubectl get pods --all-namespaces 
NAMESPACE NAME STATUS
newscraping graphql-server Running
newscraping headlines-scraper Running
newscraping headlines-scraping-activator Running
newscraping jenkins-prod-54b9c4f5c4-pz6x5 Running
newscraping mongodb-headlines-5db784bfb4-5gdx8 Running
newscraping rabbitmq-headlines-scraper-rabbitmq-ha-0 Running
newscraping rabbitmq-headlines-scraper-rabbitmq-ha-1 Running
newscraping rabbitmq-headlines-scraper-rabbitmq-ha-2 Running

To launch the scheduled and automated web-scraping process, let’s have a look at our scraping-scheduler. It’s a Node.js service that runs in a Docker container and is managed as a pod in Kubernetes. Any component communicating with the broker may act as a producer or a consumer; this time, the scraping-scheduler acts as a producer. We use the ‘await’ mechanism to perform the async task of sending the ‘start-scraping’ message to the RabbitMQ broker, along the lines of the sketch below:
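A minimal sketch of such a producer, assuming the amqplib client and illustrative broker URL and queue names, might look like this:

import amqp from 'amqplib';
import schedule from 'node-schedule';

const BROKER_URL = 'amqp://rabbitmq-headlines';   // assumed in-cluster service name
const SCRAPING_QUEUE = 'scraping-queue';

async function sendStartScraping(): Promise<void> {
  // Connect to the broker, make sure the queue exists and publish the message.
  const connection = await amqp.connect(BROKER_URL);
  const channel = await connection.createChannel();
  await channel.assertQueue(SCRAPING_QUEUE, { durable: true });
  channel.sendToQueue(SCRAPING_QUEUE, Buffer.from('start-scraping'));
  await channel.close();
  await connection.close();
}

// Hourly schedule. Note: node-schedule's six-field cron puts seconds first,
// so '0 1 * * * *' fires exactly once per hour (at minute 1), while the
// broader '* 1 * * * *' pattern would match every second of that minute.
schedule.scheduleJob('0 1 * * * *', () => {
  sendStartScraping().catch(console.error);
});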

Since we want our news headlines to be extracted hourly, the ‘* 1 * * * *’ pattern is defined and passed to the scheduleJob function using the node-schedule library (in node-schedule’s six-field cron format the leading field is seconds, so a pattern like ‘0 1 * * * *’ fires exactly once per hour). Using this library in that way guarantees that the AMQP message ‘start-scraping’ will be sent hourly as long as the Node.js service is up and we haven’t terminated the job intentionally in the code.

After firing the message to our RabbitMQ broker, we can visually track it on the management dashboard by looking at the scraping-queue chart:

Now our Puppeteer service, which listens to the scraping-queue and acts as a consumer, immediately receives the message:
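For reference, a minimal consumer along these lines, again assuming amqplib and the same illustrative names (runScraping stands in for the actual Puppeteer entry point), might look like this:

import amqp from 'amqplib';

const BROKER_URL = 'amqp://rabbitmq-headlines';   // assumed in-cluster service name
const SCRAPING_QUEUE = 'scraping-queue';

async function runScraping(): Promise<void> {
  // Placeholder for the Puppeteer flow that extracts and stores the headlines.
}

async function startConsumer(): Promise<void> {
  const connection = await amqp.connect(BROKER_URL);
  const channel = await connection.createChannel();
  await channel.assertQueue(SCRAPING_QUEUE, { durable: true });

  // Wait for 'start-scraping' messages and kick off the scraping flow.
  await channel.consume(SCRAPING_QUEUE, async (msg) => {
    if (msg && msg.content.toString() === 'start-scraping') {
      await runScraping();
      channel.ack(msg);
    }
  });
}

startConsumer().catch(console.error);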

Using TypeScript and the Page Object Model along with Puppeteer, we can verify that each page object has the relevant functions for extracting the data. We will do that with an interface, which will define basic page methods as well as an ‘extractNewsHeadline’ method. Overall, this will largely improve our code’s consistency and accuracy:

export interface NewsPageInterface {
  // Navigate to the page that holds the main headline.
  getNewsHeadline(): void;
  // Extract the headline data to be stored.
  extractNewsHeadline(): void;
  // Expose the URL of the news site the page object represents.
  getURL(): void;
}
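As an illustration, a concrete page object for one of the sites might implement the interface roughly like this; the class name, import path and selector are made up for the example, only the interface comes from the project:

import { Page } from 'puppeteer';
import { NewsPageInterface } from './NewsPageInterface';   // the interface shown above (assumed path)

export class WallaHomePage implements NewsPageInterface {
  constructor(private page: Page) {}

  getURL(): string {
    return 'https://www.walla.co.il';
  }

  async getNewsHeadline(): Promise<void> {
    // Navigate to the home page where the main headline lives.
    await this.page.goto(this.getURL(), { waitUntil: 'networkidle2' });
  }

  async extractNewsHeadline(): Promise<string> {
    // Read the text of an assumed headline selector.
    return this.page.$eval('h1.main-headline', (el) => el.textContent ?? '');
  }
}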

With the data extracted and parsed in our Puppeteer service, we will use a GraphQL mutation to update the headline schema with the newly extracted headlines, as appears in the SDL (schema definition language) declaration:

const addHeadline = gql`
  mutation($newHeadline: HeadlineInput) {
    addHeadline(newHeadline: $newHeadline) {
      _id
    }
  }
`;

Using Puppeteer along with the apollo-client library, we perform the mutation above to add each extracted headline to the MongoDB database through the GraphQL server. The essential practice in GraphQL is to work according to the schema, so we define our input for the ‘addHeadline’ mutation according to the input type ‘HeadlineInput’, as defined in our GraphQL server code:

input HeadlineInput {
  siteId: ID!
  title: String!
  content: String!
  date: Date!
}
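A rough sketch of that client-side call might look like the following, written against the current @apollo/client package; the GraphQL endpoint URL and the way the headline fields are assembled are assumptions for the example (a fetch implementation is assumed to be available, e.g. Node 18+):

import { ApolloClient, InMemoryCache, HttpLink, gql } from '@apollo/client/core';

// Assumed in-cluster address of the GraphQL server.
const client = new ApolloClient({
  link: new HttpLink({ uri: 'http://graphql-server:4000/graphql' }),
  cache: new InMemoryCache(),
});

// The same mutation shown above.
const addHeadline = gql`
  mutation($newHeadline: HeadlineInput) {
    addHeadline(newHeadline: $newHeadline) {
      _id
    }
  }
`;

// Send one extracted headline; the field values are illustrative.
async function storeHeadline(siteId: string, title: string, content: string): Promise<void> {
  const result = await client.mutate({
    mutation: addHeadline,
    variables: {
      newHeadline: { siteId, title, content, date: new Date().toISOString() },
    },
  });
  console.log('Mutation result:', JSON.stringify(result.data));
}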

The headlines from our source sites (Ynet, Walla, Israel Hayom) will be stored as three different objects in our schema, each having an ObjectID field, a date field, a headline title, headline content and a site id field that refers to the related site object. Performing our mutation for the first source, we get a response from the GraphQL server notifying us that the schema was updated successfully and returning the id of the headline object:

(4619) -> Mutation result: {
  "data": {
    "addHeadline": {
      "_id": "5ea4da2f466fcb8052b34540",
      "__typename": "Headline"
    }
  }
}
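On the server side, the addHeadline mutation is backed by a resolver that inserts the document into MongoDB and returns its id; that code isn’t part of the snippets above, so here is a minimal sketch assuming the official mongodb driver and a headlines collection:

import { Collection, ObjectId } from 'mongodb';

interface HeadlineDoc {
  _id?: ObjectId;
  siteId: string;
  title: string;
  content: string;
  date: Date;
}

// Assumed resolver map; 'headlines' is the MongoDB collection the server writes to.
export const makeResolvers = (headlines: Collection<HeadlineDoc>) => ({
  Mutation: {
    addHeadline: async (
      _parent: unknown,
      args: { newHeadline: { siteId: string; title: string; content: string; date: string } }
    ) => {
      const doc: HeadlineDoc = { ...args.newHeadline, date: new Date(args.newHeadline.date) };
      const { insertedId } = await headlines.insertOne(doc);
      // The client mutation above only asks for _id.
      return { ...doc, _id: insertedId.toHexString() };
    },
  },
});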

The process kept going until all three headlines were extracted and stored. Now we’re good! We have received confirmation that the first half of the workflow, scraping and storing the headlines, was successful. In the next stage, our Puppeteer scraping container transforms from consumer to producer, sending a scraping-confirmation message through the RabbitMQ broker, intended for the scraping-callback-queue:

Now the scraping-scheduler, transforming from producer to consumer, receives the message:

The current scheduled run is done, and the scheduler waits approximately an hour before sending the next ‘start-scraping’ message to the broker.
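The callback leg mirrors the first one with the roles reversed; a compact sketch, again with illustrative broker and queue names:

import amqp from 'amqplib';

const BROKER_URL = 'amqp://rabbitmq-headlines';
const CALLBACK_QUEUE = 'scraping-callback-queue';

// Puppeteer side: publish the confirmation once all headlines are stored.
export async function confirmScraping(): Promise<void> {
  const connection = await amqp.connect(BROKER_URL);
  const channel = await connection.createChannel();
  await channel.assertQueue(CALLBACK_QUEUE, { durable: true });
  channel.sendToQueue(CALLBACK_QUEUE, Buffer.from('scraping-confirmation'));
  await channel.close();
  await connection.close();
}

// Scheduler side: consume the confirmation and acknowledge it.
export async function waitForConfirmation(): Promise<void> {
  const connection = await amqp.connect(BROKER_URL);
  const channel = await connection.createChannel();
  await channel.assertQueue(CALLBACK_QUEUE, { durable: true });
  await channel.consume(CALLBACK_QUEUE, (msg) => {
    if (msg) {
      console.log('Scraping confirmed:', msg.content.toString());
      channel.ack(msg);
    }
  });
}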

GraphQL Playground is an amazing, cozy real-time web interface. Among its capabilities, it allows us to perform queries and mutations. For a great finale, let’s use it to verify our mutation by fetching the latest headline object from walla.co.il.

First, we send the getHeadlineById query together with the ObjectID that we received in the result from Puppeteer:
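With the schema above, such a query would look roughly like this (the getHeadlineById argument name and the returned fields are assumptions based on the HeadlineInput type):

query {
  getHeadlineById(_id: "5ea4da2f466fcb8052b34540") {
    _id
    siteId
    title
    content
    date
  }
}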

Now, executing the query, we expect the web-scraped data to be returned as the result, and ta-da! It really shows up.

To explore web-scraping with Puppeteer a bit more, you can use the Page Object Model sample I developed on GitHub: https://github.com/AutomatedOwl/puppeteer-pom-example

For further discussion about scalable web-scraping in a micro-services environment, feel free to reach out to me.
