Let’s say you want to learn concepts of DevOps and related fields. You might’ve come across some people showing their own home labs, and how they used it to learn. But maybe it’s not possible for you to get more hardware, for any reason (I couldn’t).
So what do you do? Well, the next best thing: virtualization. This might require a somewhat powerful workstation, but in general we can keep the VMs very lightweight. 8GB of RAM or more is ideal.
That’s what I’ll call a “lab” from now on: a collection of resources for you to do research and learn with.
Of course, if you're learning a frontend library, for instance, your lab is simply your own workstation, but this post is focused on DevOps.
To learn Kubernetes, for instance, I feel that deploying an actual cluster is more interesting than playing with minikube a bit. This is the goal of creating virtual networks and virtual machines for our labs: simulating a real environment.
TLDR
I want to build a playground with virtual machines that is easy to deploy/use/destroy on my own workstation.
You don't need extra machines to have something you can call a "home lab". A couple of VMs or even containers will do just fine for your first projects.
The goal
I was going to use VirtualBox for ease of use, but then I changed my mind and decided to just stick with libvirt. Why? Well, it's simple (and it's declarative + a bit more CLI-friendly, which is nice).
My goal for this changed a few times. I've decided to make this blog post about the actual thing I was building: a virtual lab manager.
Ideally the workflow will be something like:
- Define the lab (images, VMs and networks) in a YAML file.
- Run some sort of tofu apply or ./manage up or something.
- Use the VMs. Learn something. I don't know. You decide.
Literally Infrastructure as Code, but for a local playground.
Don't be scared: you can follow along even if you have little idea what this is; I'll try to help. Although if you really want to use this to study, I'd recommend taking a look at the tools we're using too.
In essence, I want a way to define VMs and networks in the simplest format possible (probably YAML in this case) and manage all of them with a single command, where "managing" means starting, stopping, building and destroying.
You see, libvirt's way of managing this is close. It uses XML instead of YAML, which could easily be converted, and it only lets you manage one resource at a time (that is, I can only run virsh <resource> <action> on one resource at a time).
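For comparison, doing everything by hand through virsh means one command per resource, over and over, roughly like this:

virsh net-start clients
virsh net-start servers
virsh start router
virsh start client
# ...and then the reverse, resource by resource, when tearing things down
virsh shutdown router
virsh net-destroy clients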
A script would suffice, but since we're playing with DevOps, and what we want is obviously IaC, let's try to use OpenTofu. I need a way to show potential employers that I'm proficient with it anyway.
Using the libvirt terraform provider
Configuration of the provider is trivial. I also tweaked the tfstate path to be a dotfile, so it doesn't get in my way when I'm running commands around the main files.
terraform {
  backend "local" {
    path = "./.terraform.tfstate"
  }
  required_providers {
    libvirt = {
      source  = "dmacvicar/libvirt"
      version = "0.8.3"
    }
  }
}

provider "libvirt" {
  uri = "qemu:///system"
}
Now we have to figure out how to actually define a network and the VMs.
To be fair there's not much to it besides reading the docs, so I won't really go into detail here. I'll just show below the OpenTofu setup we will try to achieve with the YAML file later:
terraform {
  backend "local" {
    path = "./.terraform.tfstate"
  }
  required_providers {
    libvirt = {
      source  = "dmacvicar/libvirt"
      version = "0.8.3"
    }
  }
}

provider "libvirt" {
  uri = "qemu:///system"
}

resource "libvirt_network" "bridge_net" {
  name   = "bridge"
  mode   = "bridge"
  bridge = "br0"
}

resource "libvirt_network" "clients_net" {
  name = "clients"
  mode = "none"
  dns { enabled = false }
  dhcp { enabled = false }
}

resource "libvirt_network" "servers_net" {
  name = "servers"
  mode = "none"
  dns { enabled = false }
  dhcp { enabled = false }
}

data "template_file" "router_user" {
  template = file("./cloud-init/user.cfg")
  vars = {
    hostname = "router"
  }
}
data "template_file" "client_user" {
  template = file("./cloud-init/user.cfg")
  vars = {
    hostname = "client"
  }
}
data "template_file" "server_user" {
  template = file("./cloud-init/user.cfg")
  vars = {
    hostname = "server"
  }
}

resource "libvirt_cloudinit_disk" "router_cloudinit" {
  name      = "router-cloudinit.iso"
  user_data = data.template_file.router_user.rendered
  pool      = "default"
}
resource "libvirt_cloudinit_disk" "client_cloudinit" {
  name      = "client-cloudinit.iso"
  user_data = data.template_file.client_user.rendered
  pool      = "default"
}
66resource "libvirt_cloudinit_disk" "server_cloudinit" {
67 name = "client-cloudinit.iso"
68 user_data = data.template_file.server_user.rendered
69 pool = "default"
70}

resource "libvirt_volume" "router_image" {
  name   = "router.qcow2"
  pool   = "default"
  source = "https://cloud.debian.org/images/..."
  format = "qcow2"
}
resource "libvirt_volume" "client_image" {
  name   = "client.qcow2"
  pool   = "default"
  source = "https://cloud.debian.org/images/..."
  format = "qcow2"
}
resource "libvirt_volume" "server_image" {
  name   = "server.qcow2"
  pool   = "default"
  source = "https://cloud.debian.org/images/..."
  format = "qcow2"
}

resource "libvirt_domain" "router_vm" {
  name      = "router"
  memory    = 512
  vcpu      = 1
  autostart = false
  network_interface {
    network_id = libvirt_network.bridge_net.id
  }
  network_interface {
    network_id = libvirt_network.clients_net.id
  }
  network_interface {
    network_id = libvirt_network.servers_net.id
  }
  cloudinit = libvirt_cloudinit_disk.router_cloudinit.id
  disk {
    volume_id = libvirt_volume.router_image.id
    scsi      = true
  }
  boot_device {
    dev = ["hd"]
  }
  graphics {
    type        = "vnc"
    listen_type = "address"
  }
}
resource "libvirt_domain" "client_vm" {
  name      = "client"
  memory    = 512
  vcpu      = 1
  autostart = false
  network_interface {
    network_id = libvirt_network.clients_net.id
  }
  cloudinit = libvirt_cloudinit_disk.client_cloudinit.id
  disk {
    volume_id = libvirt_volume.client_image.id
    scsi      = true
  }
  boot_device {
    dev = ["hd"]
  }
  graphics {
    type        = "vnc"
    listen_type = "address"
  }
}
resource "libvirt_domain" "server_vm" {
  name      = "server"
  memory    = 512
  vcpu      = 1
  autostart = false
  network_interface {
    network_id = libvirt_network.servers_net.id
  }
  cloudinit = libvirt_cloudinit_disk.server_cloudinit.id
  disk {
    volume_id = libvirt_volume.server_image.id
    scsi      = true
  }
  boot_device {
    dev = ["hd"]
  }
  graphics {
    type        = "vnc"
    listen_type = "address"
  }
}
I'm not sure this is 100% correct. I had to rewrite it because I forgot to write about it =/
This is suboptimal: we can define variables or locals and use loops, so we only write a single block per resource type and loop through the data. By doing that we make it very easy to plug in a YAML file.
# provider setup...

variable "lab_path" {
  type = string
}

locals {
  config = {
    # ... the actual lab definition here
    # yamldecode(file("${path.module}/${var.lab_path}.yaml"))
  }
}

resource "libvirt_network" "network" {
  for_each = { for network in local.config.networks : network.name => network }

  name = each.value.name
  mode = each.value.mode

  bridge = each.value.mode == "bridge" ? each.value.bridge : null

  # addresses = each.value.mode == "bridge" ? null : each.value.addresses
  dynamic "dns" {
    for_each = each.value.mode == "bridge" ? [] : [each.value]
    content {
      enabled = each.value.dns.enabled
    }
  }
  dynamic "dhcp" {
    for_each = each.value.mode == "bridge" ? [] : [each.value]
    content {
      enabled = each.value.dhcp.enabled
    }
  }
}

data "template_file" "user_data" {
  for_each = { for vm in local.config.vms : vm.name => vm }
  # FIXME: not sure I like this
  template = file("labs/${each.value.cloud-init.user_data}")
  vars = {
    hostname = each.value.name
  }
}
resource "libvirt_cloudinit_disk" "cloudinit" {
  for_each  = { for vm in local.config.vms : vm.name => vm }
  name      = "${each.value.name}-cloudinit.iso"
  user_data = data.template_file.user_data[each.value.name].rendered
  pool      = "default"
}

# To pre-download remote images
resource "libvirt_volume" "base_image" {
  for_each = { for baseimg in local.config.images : baseimg.name => baseimg }

  name   = each.value.name
  pool   = "default"
  source = each.value.source
  format = "qcow2"
}

# TODO: proper image configuration (local images, other pools, etc)
resource "libvirt_volume" "image" {
  for_each = { for vm in local.config.vms : vm.name => vm }

  name = "${each.value.name}.qcow2"
  pool = "default"
  # source = each.value.image.source
  base_volume_id = libvirt_volume.base_image[each.value.image].id
  format         = "qcow2"
}

resource "libvirt_domain" "vm" {
  for_each = { for vm in local.config.vms : vm.name => vm }

  name        = each.value.name
  description = each.value.description
  memory      = each.value.memory
  vcpu        = each.value.vcpu
  autostart   = false

  dynamic "network_interface" {
    for_each = each.value.networks
    content {
      network_id = libvirt_network.network[network_interface.value].id
    }
  }

  cloudinit = libvirt_cloudinit_disk.cloudinit[each.value.name].id

  # Boot disk (TODO: other disks)
  disk {
    volume_id = libvirt_volume.image[each.value.name].id
    scsi      = true
  }

  boot_device {
    dev = ["hd"]
  }

  graphics {
    type        = "vnc"
    listen_type = "address"
  }
}
Now, with this main.tf file we can run both tofu apply to spin up the lab and tofu destroy to destroy it.
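To be explicit about what that looks like on the command line (labs/example here is just a stand-in for whatever lab file lab_path ends up pointing at):

tofu init                                  # fetch the dmacvicar/libvirt provider
tofu apply -var 'lab_path=labs/example'    # create the networks, volumes and VMs
tofu destroy -var 'lab_path=labs/example'  # tear it all down again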
Extracting information from a YAML file
Cool, now we need to define information about the lab in an external YAML file and import it into the OpenTofu config.
We made it very easy on ourselves. Since we're already using locals, we can simply change this:
locals {
  config = {
    # ... the actual lab definition here
  }
}
To this:
locals {
  config = yamldecode(file("${path.module}/${var.lab_path}.yaml"))
}
I even omitted the config itself, because I skipped this step and just read the file directly.
Improving remote-image efficiency
You might also notice that I added an “images” field to the config. That’s reflected in the config like this:
images:
  - name: "debian12"
    source: "https://cloud.debian.org/images/cloud/bookworm/..."

networks:
  - name: "bridge"
    mode: "bridge"
    bridge: "br0"
  # ...

vms:
  - name: "router"
    description: "Router for the simulated network"
    networks: ["bridge", "servers", "clients"]
    vcpu: 1
    memory: 512
    image: debian12
    cloud-init:
      user_data: ./cloud-init/user.cfg
  # ...
This was done to mitigate a pretty funny behavior of the libvirt provider: when you use remote cloud images on multiple VMs with the same source, each VM seems to trigger its own separate download.
This way we download it once, call it a “base image”, and make the clones from it.
In the future I might add an option to "keep it cached", so even if tofu destroy is run the base image is still present, speeding up the lab startup quite a bit.
Cloud-init
I forgot to talk about it, but it’s also important:
I'm using cloud images here, and they're my priority, because with them I can simply have a "default" Linux distro on-the-fly and configure the base system to my needs with cloud-init, plus extra scripts or Ansible roles if I'm feeling fancy.
One day I’ll give proper local image support (I’ll need it if I decide to do the “base image caching” I talked about above), but one day is not today.
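I never showed the user.cfg template itself, so here's a minimal sketch of what such a cloud-config could look like (not my exact file; the ${hostname} placeholder is filled in by the template_file data source, and the user name and SSH key are placeholders you'd swap for your own):

#cloud-config
hostname: ${hostname}
manage_etc_hosts: true

users:
  - name: lab                        # placeholder user
    sudo: ALL=(ALL) NOPASSWD:ALL
    shell: /bin/bash
    ssh_authorized_keys:
      - ssh-ed25519 AAAA...          # your public key here

package_update: true
packages:
  - qemu-guest-agent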
Ansible to manage running status
Great, we managed to build and destroy the VMs. But what if we want to save work for later? One approach is to simply deactivate the VMs and networks, and then activate them when we need them again.
We also might eventually want to store a snapshot of a certain point in time. As of right now I do not need such functionality, so I’ll let future me deal with it.
We can manage the "running state" of the lab using Ansible. We can even use it to call the tofu apply command, so it can manage the entire state of the deployment.
That's the goal: ansible-playbook state.yaml [-e 'status=undefined|running|stopped'], and then use the lab.
First of all, we need a way to tofu apply the lab. For that we can use the community.general.terraform Ansible module, which applies the config based on the project_path and the state (planned/present/absent).
So the general structure of the playbook will be as follows:
- name: Control labs' VMs and Networks states
  hosts: localhost
  vars:
    binary_path: /usr/bin/tofu
    lab_path: labs/base
    status: running # running | stopped | undefined
  tasks:
    - name: Apply OpenTofu config
      # ...
      when: status == "running" or status == "stopped"
    - name: Start VMs and Networks
      # ...
      when: status == "running"
    - name: Stop VMs and Networks
      # ...
      when: status == "stopped"
    - name: Destroy OpenTofu config
      # ...
      when: status == "undefined"
I also need to set binary_path for the module, as I'm using OpenTofu, not Terraform, because I'm not really a fan of the license changes.
Easy enough, now ansible-playbook state.yaml runs tofu apply for me, wherever the lab may be. I can even use the CLI itself to manage the running status of the lab by passing -e status=stopped to the command, for example.
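So the day-to-day usage boils down to something like:

ansible-playbook state.yaml                        # status defaults to "running"
ansible-playbook state.yaml -e 'status=stopped'
ansible-playbook state.yaml -e 'status=undefined'  # tofu destroy + cleanup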
However, we can't use the lab data in Ansible yet; we're not actually reading it anywhere! To achieve that we'll modify the setup a bit: read the lab data file inside Ansible, and pass that data as variables into the OpenTofu config.
# From this
locals {
  config = yamldecode(file("${path.module}/${var.lab_path}.yaml"))
}

# To this
variable "lab_data" {
  type = any
}
And then just update all local.config references to be var.lab_data, aka :s/local.config/var.lab_data/g.
Then, in Ansible we can read the YAML file using slurp and set_fact like so:
# ...
    - name: Read lab config
      ansible.builtin.slurp:
        src: "{{ lab_path }}.yaml"
      register: lab_file
    - name: Interpret remote file content as yaml
      ansible.builtin.set_fact:
        lab_data: '{{ lab_file.content | b64decode | from_yaml }}'
# ...
There’s probably 100 ways of doing this. This is the first one I tried and it worked nicely.
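For example, ansible.builtin.include_vars could load the YAML straight into a variable in a single task; slurp + set_fact just happens to be what I tried first:

    - name: Read lab config (alternative)
      ansible.builtin.include_vars:
        file: "{{ lab_path }}.yaml"
        name: lab_data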
We can finally pass it to the OpenTofu config by setting complex_vars to true and passing lab_data as we defined before.
# ...
    - name: Apply OpenTofu config
      community.general.terraform:
        binary_path: '{{ binary_path }}'
        project_path: ./
        state: present
        force_init: true
        complex_vars: true
        variables:
          lab_path: '{{ lab_path }}'
          lab_data: '{{ lab_data }}'
      when: status == "running" or status == "stopped"
# ...
Finally! The ansible-playbook state.yaml -e status=running command works the same as before!
Aaaand the last part: managing the running status of the virtual machines and networks. For that there's a neat collection called… you guessed it, community.libvirt. Man, I love Ansible.
There are three modules: virt for VMs, virt_net for networks and virt_pool for… pools? They need to be activated by me?
Huh… didn't know that.
Oh, of course, I'm just dumb. Reading the docs for 15 seconds gives me answers. I thought pools were managed completely behind the scenes, but apparently they're not. The more you know.
In any case, I only use the default one for now, and it autostarts by default, so I'm not messing with it. Since there are two resources we want to manage and two modules, we need two tasks. The layout is as follows:
# ...
    - name: Start Virtual Networks
      community.libvirt.virt_net:
        state: active
        name: "{{ item.name }}"
      loop: '{{ lab_data.networks }}'
      when: status == "running"
    - name: Start Virtual Machines
      community.libvirt.virt:
        state: running
        name: "{{ item.name }}"
      loop: '{{ lab_data.vms }}'
      when: status == "running"
# ...
And then for stopping it's basically the same idea: flip the states (shutdown for the VMs, inactive for the networks) and the "when" condition to status == "stopped". You get it.
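Spelled out, the stop tasks look roughly like this, mirroring the start tasks:

# ...
    - name: Stop Virtual Machines
      community.libvirt.virt:
        state: shutdown
        name: "{{ item.name }}"
      loop: '{{ lab_data.vms }}'
      when: status == "stopped"
    - name: Stop Virtual Networks
      community.libvirt.virt_net:
        state: inactive
        name: "{{ item.name }}"
      loop: '{{ lab_data.networks }}'
      when: status == "stopped"
# ...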
Note that if the VMs or networks are not active and we try to set the lab status to undefined, they will not actually be undefined (removed from libvirt).
That's why we need the extra "Undefine Virtual Networks" and "Undefine Virtual Machines" tasks at the end of the playbook.
Currently it does not check whether the VM even exists, so it prints out some harmless VMNotFound errors.
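Those undefine tasks aren't shown above, but with the same collection they can look roughly like this (the harmless errors come from undefining a VM that was never created in the first place):

# ...
    - name: Undefine Virtual Machines
      community.libvirt.virt:
        command: undefine
        name: "{{ item.name }}"
      loop: '{{ lab_data.vms }}'
      when: status == "undefined"
    - name: Undefine Virtual Networks
      community.libvirt.virt_net:
        state: absent
        name: "{{ item.name }}"
      loop: '{{ lab_data.networks }}'
      when: status == "undefined"
# ...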
The directory layout
With this specific setup we can actually have a single main.tf file; after all, it's only used to define and destroy the lab, and the lab definition itself is passed in as a variable now that we're writing the lab data in YAML files in a more concise way.
This also means we don't need an entire directory per lab, so we can get away with a directory layout similar to this:
.
├── configure_bridge.sh
├── labs
│ ├── cloud-init
│ │ └── user.cfg
│ ├── example.yaml
│ └── tmp.yaml
├── main.tf
├── README.md
└── state.yaml
I honestly don't even know if it can get much simpler (given the requirement of using OpenTofu + libvirt + cloud-init). It's simple and easy enough for me.
The bridge problem
Not an actual problem, really, but since my goal is to make this lab manager as easy to use as possible, I'd really like people who don't even know what a bridge is to be able to use it. I'm building what I wish I had when I started learning DevOps: a playground, easy to use, resettable, yada yada yada.
There's also the fact that this will not work with Wi-Fi. There are workarounds with ebtables, but I'm not willing to go down that path.
The point is: the bridge is currently managed outside of the ansible playbook. This is fine when the VMs are not using a bridged connection, but when they are the bridge has to already exist.
I took a brief look at some Ansible community modules, and apparently we could make this work with the nmcli module. I don't like NetworkManager, don't use it, and won't.
So for now we’ll do it via a bash script, which does require the manual work of getting the physical NIC name right (for now):
#!/usr/bin/env bash

# Adjust to your actual NIC name (check with `ip link`)
PHYSICAL_NAME='enp42s0'
BRIDGE_NAME='br0'

# Remove any leftover bridge from a previous run (harmless if it doesn't exist)
ip link del "$BRIDGE_NAME" 2>/dev/null

# Create the bridge and enslave the physical NIC to it
ip link add name "$BRIDGE_NAME" type bridge
dhcpcd -k "$PHYSICAL_NAME" # release the DHCP lease held by the NIC
ip link set "$PHYSICAL_NAME" down
ip link set "$PHYSICAL_NAME" master "$BRIDGE_NAME"
ip link set "$PHYSICAL_NAME" up
ip link set "$BRIDGE_NAME" up

# The bridge (not the NIC) now holds the IP address
dhcpcd "$BRIDGE_NAME"

echo "Bridge '$BRIDGE_NAME' configured for interface '$PHYSICAL_NAME'."
What we can do then is simply remove the bridge name from the lab configurations altogether and assume the bridge is named br0, which is the one created by the script itself.
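In the tofu config that's probably just a one-liner in the network resource, hard-coding the name instead of reading it from the YAML (a sketch, not what's committed):

bridge = each.value.mode == "bridge" ? "br0" : null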
Eventually I might also want to "disable" the bridge setup (we set up a bridge for the lab, so it sort of makes sense to undo it at some point). But since it keeps the ethernet connection working normally, there's really no need for now.
I haven't done this yet, but I plan to run this script from Ansible too, if I don't figure out something nicer.
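If I do wire it in, it would probably be as simple as a script task, something like this (untested sketch):

    - name: Configure the host bridge
      ansible.builtin.script: ./configure_bridge.sh
      become: true
      when: status == "running"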
The final setup
I now have a pretty useful tool to teach (and learn) networking.
As an example, let’s see my “networking guide” lab:
images:
  - name: "debian12"
    source: "https://cloud.debian.org/images/cloud/bookworm/latest/debian-12-generic-amd64.qcow2"

networks:
  - name: "bridge"
    mode: "bridge"
  - name: "servers"
    mode: "none"
    # addresses: ["192.168.169.0/24"]
    dns:
      enabled: false
    dhcp:
      enabled: false
  - name: "clients"
    mode: "none"
    # addresses: ["192.168.200.0/24"]
    dns:
      enabled: false
    dhcp:
      enabled: false

vms:
  - name: "router"
    description: "Router for the simulated network"
    networks: ["bridge", "servers", "clients"]
    vcpu: 1
    memory: 512
    image: debian12
    cloud-init:
      user_data: ./cloud-init/user.cfg
  - name: "dns"
    description: "DNS server for the simulated network"
    networks: ["servers"]
    vcpu: 1
    memory: 512
    image: debian12
    cloud-init:
      user_data: ./cloud-init/user.cfg
  - name: "dhcp"
    description: "DHCP server for the simulated network"
    networks: ["servers"]
    vcpu: 1
    memory: 512
    image: debian12
    cloud-init:
      user_data: ./cloud-init/user.cfg
  - name: "client"
    description: "Mock client for the simulated network"
    networks: ["clients"]
    vcpu: 1
    memory: 512
    image: debian12
    cloud-init:
      user_data: ./cloud-init/user.cfg
Which might be best described by this image:
Essentially, I'll use this to explain in practice how some networks are generally operated: a firewalled router handles external access, and routes traffic between the internal networks too.
Before running the lab I can check if there is anything there:
marcusvrp@ghost ~/.../projects/virt-labs % virsh list --all
Id Name State
--------------------
marcusvrp@ghost ~/.../projects/virt-labs % virsh net-list --all
Name State Autostart Persistent
----------------------------------------
marcusvrp@ghost ~/.../projects/virt-labs %
No networks, no VMs, cool. Now we start it with ansible-playbook state.yaml -e 'status=running'. Time to check again.
marcusvrp@ghost ~/.../projects/virt-labs % virsh list --all
Id Name State
------------------------
1 client running
2 dhcp running
3 dns running
4 router running
marcusvrp@ghost ~/.../projects/virt-labs % virsh net-list --all
Name State Autostart Persistent
--------------------------------------------
bridge active no yes
clients active no yes
servers active no yes
marcusvrp@ghost ~/.../projects/virt-labs %
Amazing. I'll write a test file on each VM with echo $HOSTNAME > /root/hello, and then run ansible-playbook state.yaml -e 'status=stopped'. Time to check once more.
marcusvrp@ghost ~/.../projects/virt-labs % virsh list --all
Id Name State
-------------------------
- client shut off
- dhcp shut off
- dns shut off
- router shut off
marcusvrp@ghost ~/.../projects/virt-labs % virsh net-list --all
Name State Autostart Persistent
----------------------------------------------
bridge inactive no yes
clients inactive no yes
servers inactive no yes
marcusvrp@ghost ~/.../projects/virt-labs %
All stopped, nice. Let's restart with ansible-playbook state.yaml -e 'status=running'. Checking again, this is the one that could go wrong:
# oh no
It had to fail. Of course. Turns out the OpenTofu config does not like it when the networks aren't active and we try to configure the VMs, which is exactly what I hoped wouldn't happen.
After manually starting everything, the /root/hello files are all there with the right contents. So at least I wasn't messing up the disks.
Man… I feel lazy right now. I'm going to simply keep the networks running for now. It's even better this way: I can use the libvirt provider itself for that.
With that all the basics work: defining, starting, stopping and undefining virtual labs!
Was this optimal in any way? Absolutely not!
I’ve been using my old laptop with Proxmox for a while now, so this little project was more of a “can I do it?” instead of “I need it”.
I did wish I had something like this before, but I just managed libvirt VMs manually. Not only that, I have a pretty big playground where I study anyway, so I never really needed to build something like this.
Am I going to use this? I doubt it. Maybe if I adapt it a bit for Proxmox, since I’m managing my terraform files manually as of right now.
All that is left to do is use them for learning something more useful.
Appreciate ya if you read all this.