What even is n8n?
Some sort of low-code / no-code automation pipeline thing. A friend of mine uses it at work and I saw a NetworkChuck video about it too.
I’m mildly interested, and I’ve been waiting for the opportunity to use it.
It does market itself as a tool to perform tasks that shouldn’t exist, like user management (literally their first example). Have a proper identity provider for those things. But I get the potential.
It did absolutely explode because of the AI hype train. So that’s probably its strong point or just a marketing scheme to get more people to use a tool that basically just manages flow of data.
The “quick little project”
…
Yes, it’s just a web scraping project. It has to be something simple that I have already done using another technology, so I can truly weigh its pros and cons.
Besides, n8n markets itself as a workflow automation tool. What is that if not: “collect data”, “parse data”, “perform action”.
I’m going to collect HTML pages, parse and re-collect if necessary, and skip the “perform action” step entirely because I don’t actually have anything to do with this data.
[!note] Even though it’s open-source, there’s telemetry, which is creepy as hell.
Making HTTP requests
This is as simple as it gets: create an “HTTP Request Node” and put the URL in there.
Parsing a DOM
Very cool, now I want to filter out elements and stuff.
One way of doing this is using Regex. For reasons I hope you understand, I’ll avoid that. It’d be fine if I were doing a small match of some sort, but what we really want here is proper parsing.
Approach number 2: Use JavaScript’s DOMParser() inside a “Code Node”. Very elegant… if it worked…
The issue is that the code being executed here is not “browser JavaScript”, it is Node.js code. That means there is absolutely no DOMParser() or DOM, for that matter. So what can we do?
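A quick way to see the difference for yourself is to check which DOM-related globals the runtime exposes. This is only a minimal sketch: `domGlobals` is a helper name I made up, and I'm assuming the Code Node behaves like plain Node.js here.

```javascript
// Report which DOM-related globals exist in the current JavaScript runtime.
// A browser defines both; plain Node.js is expected to define neither.
function domGlobals() {
  return {
    hasDOMParser: typeof DOMParser !== "undefined",
    hasDocument: typeof document !== "undefined",
  };
}

console.log(domGlobals());
```

Using `typeof` instead of touching `DOMParser` directly avoids a `ReferenceError` when the global is missing.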
There are quite a few packages that we can use to parse a DOM, one of which is the jsdom package.
So a Code Node with something like:
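const jsdom = require("jsdom");
// JSDOM wants the HTML string, not the item array from $input.all().
// Assumption: the HTTP Request node put the response body in the "data" field.
const dom = new jsdom.JSDOM($input.first().json.data);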
Should be enough to get started, right?
Cannot find module 'jsdom' [line 1]
… Of course! We didn’t install it! How can we do that?
Using external JS modules
A typical n8n development deployment with Docker Compose (and literally the one I’m using) looks something like this:
services:
  n8n:
    container_name: n8n
    image: docker.n8n.io/n8nio/n8n
    restart: always
    ports:
      - "5678:5678"
    environment:
      - N8N_ENFORCE_SETTINGS_FILE_PERMISSIONS=true
      - N8N_RUNNERS_ENABLED=true
      - N8N_SECURE_COOKIE=false
      - N8N_HOST=${SUBDOMAIN}.${DOMAIN_NAME}
      - N8N_PORT=5678
      - GENERIC_TIMEZONE=${GENERIC_TIMEZONE}
      - TZ=${GENERIC_TIMEZONE}
    volumes:
      - n8n_data:/home/node/.n8n
      - ./local-files:/files
volumes:
  n8n_data:
The key here is the image. It ships its own Node.js runtime, and that is where the packages have to live. What we can try to do is the following:
- Get inside the container:
docker exec --user root -it n8n sh
- Install jsdom:
npm i -g jsdom

We can then try to use the jsdom module in our Code Node!
Aaaaaand it didn’t work… Turns out you need to set the environment variable NODE_FUNCTION_ALLOW_EXTERNAL=jsdom, according to the docs.
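To make the idea concrete, here is a conceptual model of what such an allow-list does with a comma-separated value. It is emphatically not n8n's actual implementation: `isModuleAllowed` is a hypothetical helper, and the only thing taken from the docs is that the variable holds a comma-separated list of module names.

```javascript
// Conceptual model only, NOT n8n's real code.
// allowList mimics the value of NODE_FUNCTION_ALLOW_EXTERNAL,
// e.g. "jsdom" or "jsdom,lodash".
function isModuleAllowed(moduleName, allowList = "") {
  const allowed = allowList
    .split(",")
    .map((name) => name.trim())
    .filter(Boolean); // drop empty entries such as "" from an unset variable
  return allowed.includes(moduleName);
}

console.log(isModuleAllowed("jsdom", "jsdom, lodash"));
console.log(isModuleAllowed("jsdom", ""));
```

The point is simply that installing a package is not enough on its own: the module also has to be named in the variable before a Code Node may `require` it.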
We can then make it work, and also make the module install permanent, by creating a Dockerfile with the following contents:
FROM docker.n8n.io/n8nio/n8n
USER root
RUN npm install -g jsdom
USER node
And changing our compose.yaml to:
services:
  n8n:
    container_name: n8n
    # image: docker.n8n.io/n8nio/n8n
    build:
      context: .
      dockerfile: Dockerfile
    restart: always
    ports:
      - "5678:5678"
    environment:
      - NODE_FUNCTION_ALLOW_EXTERNAL=jsdom
      - N8N_ENFORCE_SETTINGS_FILE_PERMISSIONS=true
      - N8N_RUNNERS_ENABLED=true
      - N8N_SECURE_COOKIE=false
      - N8N_HOST=${SUBDOMAIN}.${DOMAIN_NAME}
      - N8N_PORT=5678
      - GENERIC_TIMEZONE=${GENERIC_TIMEZONE}
      - TZ=${GENERIC_TIMEZONE}
    volumes:
      - n8n_data:/home/node/.n8n
      - ./local-files:/files
volumes:
  n8n_data:
Now when we start our app with docker compose up --build -d and execute the code again:
Hell nah.
Now doing it properly
Very nice, we can get what we want. But that image manipulation there bothers me. I generally resort to such methods only when everything else fails.
[!note] And since the initial jsdom idea already failed I’m not even gonna bother.
You see, n8n is meant for this sort of task (one of its “marketing points” is the “low-codeness” it offers). I’d be pretty surprised if there were no way to parse the HTML output of an HTTP Request Node “natively”. So let’s look for a way to do this internally: HTML Nodes!
There’s a particular HTML Node precisely meant to extract HTML, so we’re using that.
All we need is to extract the list of hrefs from the .sitemap unordered list, so we can continue. Look how convenient the HTML Node is:
Very weird thinking “low-code” / “no-code”, feels unnatural.
Iterating over results
Now, for each of them we have to do a similar task: run an HTTP request, parse the result. In this case though, we’re looking for “Brazil” in the breadcrumbs, to ensure we only query events from Brazil.
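For comparison, this is roughly what the “Brazil in the breadcrumbs” check would look like written out in a Code Node. It is a minimal sketch under assumptions: the `breadcrumbs` and `url` fields, the `keepCountry` helper, and the sample data are all made up for illustration, and the real label on the site may be spelled differently.

```javascript
// Keep only the items whose breadcrumb trail contains the given country.
// Hypothetical item shape: { json: { url, breadcrumbs: ["Home", "Brazil", ...] } }
function keepCountry(items, country = "Brazil") {
  const wanted = country.trim().toLowerCase();
  return items.filter((item) =>
    (item.json.breadcrumbs ?? []).some(
      (crumb) => crumb.trim().toLowerCase() === wanted
    )
  );
}

// Invented sample data, only to show the shape going in and out.
const sample = [
  { json: { url: "/a", breadcrumbs: ["Home", "Brazil", "São Paulo"] } },
  { json: { url: "/b", breadcrumbs: ["Home", "Portugal", "Lisboa"] } },
  { json: { url: "/c" } }, // no breadcrumbs extracted at all
];

console.log(keepCountry(sample).map((item) => item.json.url));
```

In an actual Code Node the last step would be returning the filtered items rather than logging them.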
Hey, wait a second, we only made the first of 26 queries… the /cidades/a of all 26 alphabet letters… goddamn.
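One way to cover all 26 pages would be to generate the list of URLs up front and feed it to the HTTP Request Node. A minimal sketch, where `letterPages` is a hypothetical helper and `https://example.com` is a placeholder base URL (the real host is not shown here):

```javascript
// Build one n8n-style item per letter, a through z.
// baseUrl is a placeholder; only the /cidades/<letter> path comes from the site.
function letterPages(baseUrl) {
  const letters = Array.from({ length: 26 }, (_, i) =>
    String.fromCharCode("a".charCodeAt(0) + i)
  );
  return letters.map((letter) => ({
    json: { url: `${baseUrl}/cidades/${letter}` },
  }));
}

const pages = letterPages("https://example.com");
console.log(pages.length, pages[0].json.url, pages[25].json.url);
```

Each item has the `{ json: ... }` shape that Code Nodes pass along, so the next node can read the URL from the `url` field.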
Conclusions
Could I have done all of this in a tiny fraction of the time by BeautifulSouping my way around? Absolutely. But I’m sure the power of this tool shines more in the AI integrations than in the actual data processing.
However, it’s definitely interesting that there are tools that allow you to pipeline and process data with little to no coding experience. This may actually allow some people to perform automation tasks themselves instead of asking you to do them.
The best kind of automation is not needing to do anything.
Maybe I’ll find a more complex project eventually where this tool makes more sense. But this was a very good “hello, world” with it.