# How to build an API that enriches and aggregates RSS feeds

This tutorial is a practical, non-trivial, real-world example of how you can use the Joyce platform.
We'll go step by step through the creation of a complete flow of data integration and the serving of a final API.
# Goal

We want to build an API that aggregates several RSS feeds and enriches them by extracting topics and a categorization from each incoming link. Our sources will be IT news sites:
- http://feeds.arstechnica.com/arstechnica/index/
- https://www.engadget.com/rss.xml
- https://hnrss.org/newest?points=100
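
If you want to peek at what one of these feeds actually contains before wiring anything into Joyce, a quick sketch like this will do (it uses the third-party feedparser package, purely for exploration, nothing Joyce-specific):

```python
# Quick exploration of one of the source feeds (not part of the Joyce setup).
# Requires: pip install feedparser
import feedparser

feed = feedparser.parse("https://www.engadget.com/rss.xml")
for entry in feed.entries[:5]:
    # Each entry carries at least a title and a link; other fields vary by feed.
    print(entry.title, "->", entry.link)
```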
# Setup

You'll need to have Docker installed; we'll be using docker-compose to start up a minimal installation of Joyce.
Let's begin:
# Modeling our sources

We start by modeling our data with a Joyce schema and by configuring how to pull data from the RSS feeds, with a Kafka Connect RSS source connector configuration embedded inside the schema.
As you can read in the connector's documentation, it produces JSON with this form:
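
The original sample isn't reproduced here, but as a rough illustration (shown as a Python dict; the field names are an approximation of the kafka-connect-rss output, so double-check them against the real messages in your topics), each message looks more or less like this:

```python
# Illustrative approximation of a message produced by the RSS source connector.
# Field names and nesting are an assumption; verify against the real topic content.
sample_rss_message = {
    "feed": {
        "title": "Engadget",
        "url": "https://www.engadget.com/rss.xml",
    },
    "title": "Some article title",
    "link": "https://www.engadget.com/some-article.html",
    "content": "A short excerpt or description coming from the feed",
    "author": "Jane Doe",
    "date": "2021-06-01T10:00:00Z",
}
```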
It's trivial to write a schema that reshapes this content; save it to news.yaml:
Then send it to the import-gateway:
You'll see this output:
# Kafka Connectors

A lot already happened: not only has the schema been saved, but given the connector configuration, a Kafka Connect task with that specific configuration has been created and started.
This means that, if we didn't do anything wrong in the schema transformation, content is already being pumped through Joyce and correctly transformed.
Head over to AKHQ and look at the content of the joyce_content topic: you'll see messages from the RSS feed already processed.
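
If you prefer code to the AKHQ UI, a minimal consumer sketch works too (the broker address below is an assumption that depends on how your compose file maps Kafka's port, so adjust it):

```python
# Minimal sketch: read the processed documents from the joyce_content topic.
# Requires: pip install kafka-python
# The broker address is an assumption; adjust it to your docker-compose setup.
import json
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "joyce_content",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    consumer_timeout_ms=5000,
)
for message in consumer:
    print(json.dumps(json.loads(message.value), indent=2))
```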
The Import Gateway exposes several endpoints to control and check the status of the connectors within a schema.
Now pause the connector, because we want to enrich the schema with more complex transformations:
# Transformation Handlers

The result of the transformation is good, but we want to enrich what arrives from the RSS feeds. We're missing a few things to have a nice API:
- a categorization of the content.
- an image for the article, if we can get one.
- a summary more relevant than what arrives from the feed.
How can we do it? We'll be using the power of Joyce transformation handlers, in particular $script and $rest.
Let's see how.
# Categorization of the content

We need to extract topics from the article text. There are tons of ways to do it with NLP libraries and custom code, but we'll go the short way and use a service that exposes this capability through an API.
Head over to https://www.textrazor.com/signup and sign up for a free account; you'll obtain an API token, which is all we need to use their service.
Have a look at their REST API documentation and try this call to extract topics from a random link:
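
In Python, the call boils down to a single POST request; here's a sketch (replace YOUR_API_KEY with your TextRazor token; the article URL is just a placeholder):

```python
# Extract topics from an article URL with the TextRazor API.
# Requires: pip install requests
import requests

TEXTRAZOR_KEY = "YOUR_API_KEY"  # the token you got when signing up

response = requests.post(
    "https://api.textrazor.com",
    headers={"x-textrazor-key": TEXTRAZOR_KEY},
    data={
        "extractors": "topics",
        "url": "https://www.engadget.com/some-article.html",  # any article link
    },
)
response.raise_for_status()
# Topics live under response -> topics, each with a label and a relevance score.
for topic in response.json()["response"]["topics"][:10]:
    print(topic["label"], topic["score"])
```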
We are ready to enrich our imported model with topics by making this call inside the schema with the $rest handler. Add this field to your schema:
What are we doing here?
We're adding a topics field, an array of strings, and we populate it with a $rest handler that calls the TextRazor API and extracts the topic labels. We use some filtering in the JSON path expression to keep only the more relevant extracted topics (see the json-path docs).
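
To make that filtering concrete, here is the same selection written in plain Python (the 0.6 score threshold is only an example value, not necessarily the one used in the schema):

```python
# Equivalent of a filtered json-path such as $.response.topics[?(@.score > 0.6)].label,
# written in plain Python for clarity. The 0.6 threshold is illustrative.
def relevant_topic_labels(textrazor_response, threshold=0.6):
    topics = textrazor_response.get("response", {}).get("topics", [])
    return [topic["label"] for topic in topics if topic.get("score", 0) > threshold]
```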
Save the YAML and update the schema:
Before restarting the connector, we're going to test the transformation with a dry run, using a JSON document you can retrieve from the joyce_import topic.
We are happy with the output: topics are extracted nicely.
# Enrich with Open Graph data

We want to enrich the results further with some more info: a summary and an image for the article. How can we do it?
Almost every news site includes Open Graph tags; we can grab and use them to obtain what we want.
Joyce ships with a $script handler that gives you the ability to use a scripting language for transformations; currently you can write scripts in python, javascript, or groovy.
We'll go with python.
We need to fetch the HTML from the URL, parse it, and pick out the metadata tags we need.
With some python/regex kung-fu we can write a small script to do that. Add this property to the schema:
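
The script itself isn't reproduced here, but its core logic is small enough to sketch in plain Python (standard library only, with deliberately simple regexes that assume the property attribute comes before content):

```python
# Sketch of the enrichment logic: fetch the article page and pull out the
# og:image and og:description meta tags with a couple of regexes.
# Note: the pattern assumes property="..." appears before content="...".
import re
import urllib.request

OG_TAG = r'<meta[^>]+property=["\']og:{name}["\'][^>]+content=["\']([^"\']+)["\']'

def open_graph_data(url):
    with urllib.request.urlopen(url, timeout=10) as response:
        html = response.read().decode("utf-8", errors="ignore")
    data = {}
    for name in ("image", "description"):
        match = re.search(OG_TAG.format(name=name), html, re.IGNORECASE)
        if match:
            data[name] = match.group(1)
    return data
```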
Save the schema and try another dry run; you should get a result like this:
Now we can resume operation of the Kafka connector:
# Expose the REST API

Now we need to tell joyce-rest about the schema in order to expose it.
Edit docker-compose.yaml and add this to the environment variables of the joyce-rest service:
and this volume:
Then save this JSON to schemas.json:
Finally, restart joyce-rest.
You can now call your shiny new API:
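
The exact URL depends on how joyce-rest is exposed in your docker-compose file and on your schema's name; purely as an illustration (host, port, and path below are placeholders, not documented Joyce routes):

```python
# Illustrative only: the host, port, and path are placeholders; use whatever
# your docker-compose file exposes for joyce-rest and your schema's endpoint.
import requests

JOYCE_REST_URL = "http://localhost:8080/news"  # placeholder, adjust to your setup

response = requests.get(JOYCE_REST_URL)
response.raise_for_status()
# Print whatever the endpoint returns; the exact shape depends on joyce-rest.
print(response.json())
```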
Yay, News!
# Add other sources

Time now to add another RSS source. Add this element to the $metadata.extra.connectors array:
Save the schema and send it to the import gateway; soon enough, calling the REST API again, you should see articles from Engadget too.
Your schema should now look something like this:
# Conclusion

You know how to add the third source, right?
How powerful is it to have the input source, data modeling, and transformations in a single, declarative file?
It's the only thing you need to version: no code, just configuration, and you have enriched news from multiple RSS sources.