Let's make a virtual home lab!

In this blog I'll walk you through the steps I took to create simulated networks I call "labs".

Reading Time: ~18 min

Let’s say you want to learn DevOps and related fields. You might’ve come across people showing off their home labs and how they used them to learn. But maybe it’s not possible for you to get more hardware, for whatever reason (I couldn’t).

So what do you do? Well, the next best thing: virtualization. This might require a somewhat more powerful workstation, but in general we can keep the VMs very lightweight; 8GB of RAM or more is ideal.

That’s what I’ll call a “lab” from now on: a collection of resources for you to do research and learn with.

Of course, if you’re learning a frontend library, for instance, your lab is simply your own workstation, but this post is focused on DevOps.

To learn Kubernetes, for instance, I feel that deploying an actual cluster is more interesting than playing with minikube for a bit. That’s the goal of creating virtual networks and virtual machines for our labs: simulating a real environment.

TLDR

I want to build a playground with virtual machines that is easy to deploy/use/destroy on my own workstation.

You don’t need extra machines to call your “home lab”. A couple of VMs or even containers will do just fine for your first projects.

The goal

I was going to use VirtualBox for ease of use, but I changed my mind and decided to just stick with libvirt. Why? Well, it’s simple (and it’s declarative + a bit more CLI-friendly, which is nice).

My goal for this changed a few times. I’ve decided to make this blog about the actual thing I was building: a virtual lab manager.

Ideally the workflow will be something like:

  1. Define the lab (images, VMs and networks) in a YAML file.
  2. Run some sort of tofu apply or ./manage up or something.
  3. Use the VMs. Learn something. I don’t know. You decide.

Literally Infrastructure as Code, but for a local playground.

Don’t be scared: you can follow along even if you have little idea what any of this is, I’ll try to help. Although if you really want to use this to study, I’d recommend taking a look at the tools we’re using too.

In essence, I want a way to define VMs and networks in the simplest format possible (probably YAML in this case), and manage all of them with a single command, where “managing” means starting, stopping, building and destroying.

You see, libvirt’s way of managing this is close. It uses XML instead of YAML, which could easily be converted, but it only lets you manage one resource at a time (that is, I can only run virsh <action> <resource> against one resource at a time).
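
For a sense of scale, bringing even a tiny lab up by hand means a pile of virsh calls, one resource at a time (the XML file names below are hypothetical):

# Each network and VM needs its own define + start...
virsh net-define clients.xml   # hypothetical network XML
virsh net-start clients
virsh define router.xml        # hypothetical domain XML
virsh start router
# ...repeated for every resource, plus the reverse (destroy/undefine) to tear it all down.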

A script would suffice, but since we’re playing with DevOps, and what we want is obviously IaC, let’s try to use OpenTofu. I need a way to show potential employers that I’m proficient with it anyways.

Using the libvirt terraform provider

Configuring the provider is trivial. I changed terraform’s tfstate path to a dotfile so it doesn’t get in my way when I’m running commands around the main files.

terraform {
  backend "local" {
    path = "./.terraform.tfstate"
  }
  required_providers {
    libvirt = {
      source = "dmacvicar/libvirt"
      version = "0.8.3"
    }
  }
}

provider "libvirt" {
  uri = "qemu:///system"
}
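
Before any apply, tofu init needs to run once so the provider plugin gets downloaded:

tofu init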

Now we have to figure out how to actually define the networks and the VMs.

To be fair there’s not much to it besides reading the docs, so I won’t go into detail here. I’ll just show below the OpenTofu setup we will later try to generate from the YAML file:

terraform {
  backend "local" {
    path = "./.terraform.tfstate"
  }
  required_providers {
    libvirt = {
      source = "dmacvicar/libvirt"
      version = "0.8.3"
    }
  }
}

provider "libvirt" {
  uri = "qemu:///system"
}

resource "libvirt_network" "bridge_net" {
  name   = "bridge"
  mode   = "bridge"
  bridge = "br0"
}

resource "libvirt_network" "clients_net" {
  name   = "clients"
  mode   = "none"
  dns { enabled = false }
  dhcp { enabled = false }
}

resource "libvirt_network" "servers_net" {
  name   = "servers"
  mode   = "none"
  dns { enabled = false }
  dhcp { enabled = false }
}

data "template_file" "router_user" {
  template = file("./cloud-init/user.cfg")
  vars = {
    hostname: "router"
  }
}
data "template_file" "client_user" {
  template = file("./cloud-init/user.cfg")
  vars = {
    hostname: "client"
  }
}
data "template_file" "server_user" {
  template = file("./cloud-init/user.cfg")
  vars = {
    hostname: "server"
  }
}

resource "libvirt_cloudinit_disk" "router_cloudinit" {
  name           = "router-cloudinit.iso"
  user_data      = data.template_file.router_user.rendered
  pool           = "default"
}
resource "libvirt_cloudinit_disk" "client_cloudinit" {
  name           = "client-cloudinit.iso"
  user_data      = data.template_file.client_user.rendered
  pool           = "default"
}
resource "libvirt_cloudinit_disk" "server_cloudinit" {
  name           = "server-cloudinit.iso"
  user_data      = data.template_file.server_user.rendered
  pool           = "default"
}

resource "libvirt_volume" "router_image" {
  name = "router.qcow2"
  pool = "default"
  source = "https://cloud.debian.org/images/..."
  format = "qcow2"
}
resource "libvirt_volume" "client_image" {
  name = "client.qcow2"
  pool = "default"
  source = "https://cloud.debian.org/images/..."
  format = "qcow2"
}
resource "libvirt_volume" "server_image" {
  name = "server.qcow2"
  pool = "default"
  source = "https://cloud.debian.org/images/..."
  format = "qcow2"
}

resource "libvirt_domain" "router_vm" {
  name        = "router"
  memory      = 512
  vcpu        = 1
  autostart   = false
  network_interface {
    network_id = libvirt_network.bridge_net.id
  }
  network_interface {
    network_id = libvirt_network.clients_net.id
  }
  network_interface {
    network_id = libvirt_network.servers_net.id
  }
  cloudinit = libvirt_cloudinit_disk.router_cloudinit.id
  disk {
    volume_id = libvirt_volume.router_image.id
    scsi = true
  }
  boot_device {
    dev = [ "hd" ]
  }
  graphics {
    type = "vnc"
    listen_type = "address"
  }
}
resource "libvirt_domain" "client_vm" {
  name        = "client"
  memory      = 512
  vcpu        = 1
  autostart   = false
  network_interface {
    network_id = libvirt_network.clients_net.id
  }
  cloudinit = libvirt_cloudinit_disk.client_cloudinit.id
  disk {
    volume_id = libvirt_volume.client_image.id
    scsi = true
  }
  boot_device {
    dev = [ "hd" ]
  }
  graphics {
    type = "vnc"
    listen_type = "address"
  }
}
resource "libvirt_domain" "server_vm" {
  name        = "server"
  memory      = 512
  vcpu        = 1
  autostart   = false
  network_interface {
    network_id = libvirt_network.servers_net.id
  }
  cloudinit = libvirt_cloudinit_disk.server_cloudinit.id
  disk {
    volume_id = libvirt_volume.server_image.id
    scsi = true
  }
  boot_device {
    dev = [ "hd" ]
  }
  graphics {
    type = "vnc"
    listen_type = "address"
  }
}

I’m not sure this is 100% correct. I had to rewrite it because I forgot to write about it =/

This is suboptimal: we can define variables or locals and write only a single block for each resource type, looping over the config inside it. By doing that we make it very easy to drive everything from a YAML file.

# provider setup...

variable "lab_path" {
  type = string
}

locals {
  config = {
    # ... the actual lab definition here
    # yamldecode(file("${path.module}/${var.lab_path}.yaml"))
  }
}

resource "libvirt_network" "network" {
  for_each = { for network in local.config.networks : network.name => network }

  name      = each.value.name
  mode      = each.value.mode

  bridge    = each.value.mode == "bridge" ? each.value.bridge : null

  # addresses = each.value.mode == "bridge" ? null : each.value.addresses
  dynamic "dns" {
    for_each = each.value.mode == "bridge" ? [] : [each.value]
    content {
      enabled = each.value.dns.enabled
    }
  }
  dynamic "dhcp" {
    for_each = each.value.mode == "bridge" ? [] : [each.value]
    content {
      enabled = each.value.dhcp.enabled
    }
  }
}

data "template_file" "user_data" {
  for_each = { for vm in local.config.vms : vm.name => vm }
  # FIXME: not sure I like this
  template = file("labs/${each.value.cloud-init.user_data}")
  vars = {
    hostname: each.value.name
  }
}
resource "libvirt_cloudinit_disk" "cloudinit" {
  for_each = { for vm in local.config.vms : vm.name => vm }
  name           = "${each.value.name}-cloudinit.iso"
  user_data      = data.template_file.user_data[each.value.name].rendered
  pool           = "default"
}

# To pre-download remote images
resource "libvirt_volume" "base_image" {
  for_each = { for baseimg in local.config.images : baseimg.name => baseimg }

  name = each.value.name
  pool = "default"
  source = each.value.source
  format = "qcow2"
}

# TODO: proper image configuration (local images, other pools, etc)
resource "libvirt_volume" "image" {
  for_each = { for vm in local.config.vms : vm.name => vm }

  name = "${each.value.name}.qcow2"
  pool = "default"
  # source = each.value.image.source
  base_volume_id = libvirt_volume.base_image[each.value.image].id
  format = "qcow2"
}

resource "libvirt_domain" "vm" {
  for_each = { for vm in local.config.vms : vm.name => vm }

  name        = each.value.name
  description = each.value.description
  memory      = each.value.memory
  vcpu        = each.value.vcpu
  autostart   = false

  dynamic "network_interface" {
    for_each = each.value.networks
    content {
      network_id = libvirt_network.network[network_interface.value].id
    }
  }

  cloudinit = libvirt_cloudinit_disk.cloudinit[each.value.name].id

  # Boot disk (TODO: other disks)
  disk {
    volume_id = libvirt_volume.image[each.value.name].id
    scsi = true
  }

  boot_device {
    dev = [ "hd" ]
  }

  graphics {
    type = "vnc"
    listen_type = "address"
  }
}

Now, with this main.tf file we can run both tofu apply to spin up the lab and tofu destroy to destroy it.
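
Since the config takes a lab_path variable (consumed by the yamldecode call we’ll enable next), in practice that looks roughly like this, with labs/example as a placeholder path:

tofu apply -var 'lab_path=labs/example'
tofu destroy -var 'lab_path=labs/example'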

Extracting information from a YAML file

Cool, now we need to define the lab in an external YAML file and import it into the OpenTofu config.

We’ve made it very easy for ourselves. Since we’re already using locals, we can simply change this:

locals {
  config = {
    # ... the actual lab definition here
  }
}

To this:

locals {
  config = yamldecode(file("${path.module}/${var.lab_path}.yaml"))
}

I even omitted the config itself above, because I skipped this step and went straight to using the file.

Improving remote-image efficiency

You might also have noticed that I added an “images” field. In the lab file it looks like this:

images:
  - name: "debian12"
    source: "https://cloud.debian.org/images/cloud/bookworm/..."

networks:
  - name: "bridge"
    mode: "bridge"
    bridge: "br0"
  # ...

vms:
  - name: "router"
    description: "Router for the simulated network"
    networks: ["bridge", "servers", "clients"]
    vcpu: 1
    memory: 512
    image: debian12
    cloud-init:
      user_data: ./cloud-init/user.cfg
    # ...

This was done to mitigate a pretty funny behavior of the libvirt provider: when multiple VMs use remote cloud images with the same source, each one seems to trigger its own separate download.

This way we download it once, call it a “base image”, and clone the VM disks from it.

In the future I might add an option to “keep it cached”, so even if tofu destroy is run the base image is still present, speeding up the lab startup quite a bit.

Cloud-init

I forgot to talk about it, but it’s also important:

I’m using cloud images here (they’re my priority) because with them I can have a “default” Linux distro on-the-fly and configure the base system to my needs with cloud-init, plus extra scripts or ansible roles if I’m feeling fancy.
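
I won’t show my exact user.cfg, but as a rough sketch (the user name and key are placeholders; ${hostname} is the variable filled in by the template data source), it’s a plain #cloud-config file along these lines:

#cloud-config
hostname: ${hostname}
manage_etc_hosts: true
users:
  - name: lab                     # placeholder user
    shell: /bin/bash
    sudo: ALL=(ALL) NOPASSWD:ALL
    ssh_authorized_keys:
      - ssh-ed25519 AAAA...       # your public key here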

One day I’ll add proper local image support (I’ll need it if I decide to do the “base image caching” I talked about above), but one day is not today.

Ansible to manage running status

Great, we managed to build and destroy the VMs. But what if we want to save work for later? One approach is to simply deactivate the VMs and networks, and then activate them when we need them again.

We also might eventually want to store a snapshot of a certain point in time. As of right now I do not need such functionality, so I’ll let future me deal with it.

We can manage the “running state” of the lab using ansible. We can even use it to call the tofu apply command, so it can manage the entire state of the deployment.

That’s the goal: ansible-playbook state.yaml [-e 'status=undefined|running|stopped'], and then use the lab.

First of all, we need a way to tofu apply the lab. For that we can use the community.general.terraform module, which applies a configuration by pointing it at project_path and setting the state (planned/present/absent).

So the general structure of the playbook will be as follows:

- name: Control labs' VMs and Networks states
  hosts: localhost
  vars:
    binary_path: /usr/bin/tofu
    lab_path: labs/base
    status: running  # running | stopped | undefined
  tasks:
  - name: Apply OpenTofu config
    # ...
    when: status == "running" or status == "stopped"
  - name: Start VMs and Networks
    # ...
    when: status == "running"
  - name: Stop VMs and Networks
    # ...
    when: status == "stopped"
  - name: Destroy OpenTofu config
    # ...
    when: status == "undefined"

I also need to set binary_path for the module, as I’m using OpenTofu, not Terraform, because I’m not really a fan of the license changes.

Easy enough, now ansible-playbook state.yaml runs tofu apply for me, wherever the lab may be. I can even use the CLI itself to manage the running status of the lab by passing -e status=stopped to the command, for example.

However, we can’t yet use the lab data in ansible; we’re not actually reading it anywhere! To achieve that we’ll change our setup a bit: read the lab data file inside ansible, and pass that data as variables into the terraform config.

Let’s modify the setup a bit.

# From this
locals {
  config = yamldecode(file("${path.module}/${var.lab_path}.yaml"))
}
# To this
variable "lab_data" {
  type = any
}

And then just update all local.config references to be var.lab_data, aka :s/local.config/var.lab_data/g.

Then, in ansible we can read the YAML file using slurp and set_fact like so:

# ...
  - name: Read lab config
    ansible.builtin.slurp:
      src: "{{ lab_path }}.yaml"
    register: lab_file

  - name: Interpret remote file content as yaml
    ansible.builtin.set_fact:
      lab_data: '{{ lab_file.content | b64decode | from_yaml }}'
# ...

There’s probably 100 ways of doing this. This is the first one I tried and it worked nicely.
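
(For the record, ansible.builtin.include_vars looks like it would do the same thing in a single task. I haven’t switched to it, but something like this should work:)

  - name: Read lab config
    ansible.builtin.include_vars:
      file: "{{ lab_path }}.yaml"
      name: lab_data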

We can finally pass it to the opentofu config by setting complex_vars to true and passing lab_data as we defined before.

# ...
  - name: Apply OpenTofu config
    community.general.terraform:
      binary_path: '{{ binary_path }}'
      project_path: ./
      state: present
      force_init: true
      complex_vars: true
      variables:
        lab_path: '{{ lab_path }}'
        lab_data: '{{ lab_data }}'
    when: status == "running" or status == "stopped"
# ...

Finally! The ansible-playbook state.yaml -e status=running command works the same as before!
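
For completeness, the “Destroy OpenTofu config” task from the skeleton is essentially the same module call with state: absent; a sketch:

  - name: Destroy OpenTofu config
    community.general.terraform:
      binary_path: '{{ binary_path }}'
      project_path: ./
      state: absent
      force_init: true
      complex_vars: true
      variables:
        lab_path: '{{ lab_path }}'
        lab_data: '{{ lab_data }}'
    when: status == "undefined"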

Aaaand the last part: managing the running status of the virtual machines and networks. For that there’s a neat collection called… you guessed it community.libvirt. Man I love ansible.

There are three modules: virt for VMs, virt_net for networks and virt_pool for… pools? Those need to be activated by me?

Huh… didn’t know that.

Oh, of course, I’m just dumb. Reading some docs for 15 seconds gives me answers. I thought it was managed completely behind the scenes, but apparently it’s not. The more you know.

In any case, I only use the default pool for now, and it autostarts by default, so I’m not messing with it. Since there are two resource types we want to manage and two modules, we need two tasks. The layout is as follows:

# ...
  - name: Start Virtual Networks
    community.libvirt.virt_net:
      state: running
      name: "{{ item.name }}"
    loop: '{{ lab_data.networks }}'
    when: status == "running"

  - name: Start Virtual Machines
    community.libvirt.virt:
      state: active
      name: "{{ item.name }}"
    loop: '{{ lab_data.vms }}'
    when: status == "running"
# ...

And for stopping it’s the same: just change the state accordingly and the “when” condition to status == "stopped". You get it.

Note that if the VMs or networks are not active and we try to set the lab status to undefined, they will not actually be undefined (removed from libvirt).

That’s why we need the extra “Undefine Virtual Networks” and “Undefine Virtual Machines” tasks at the end of the playbook.

Currently it does not check whether the VM even exists, so it prints out some harmless VMNotFound errors.
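
Roughly, those undefine tasks can be written with the modules’ undefine command; a sketch, with ignore_errors standing in for a proper existence check (which is where those harmless errors come from):

  - name: Undefine Virtual Networks
    community.libvirt.virt_net:
      command: undefine
      name: "{{ item.name }}"
    loop: '{{ lab_data.networks }}'
    ignore_errors: true
    when: status == "undefined"

  - name: Undefine Virtual Machines
    community.libvirt.virt:
      command: undefine
      name: "{{ item.name }}"
    loop: '{{ lab_data.vms }}'
    ignore_errors: true
    when: status == "undefined"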

The directory layout

With this specific setup we can actually get away with a single main.tf file: after all, it’s only used to define and destroy the lab, and the lab itself is passed in as a variable now that the lab data lives in YAML files.

This also means we don’t need an entire directory per lab, so we can have a directory layout similar to this:

.
├── configure_bridge.sh
├── labs
│   ├── cloud-init
│   │   └── user.cfg
│   ├── example.yaml
│   └── tmp.yaml
├── main.tf
├── README.md
└── state.yaml

I honestly don’t even know if it can get much simpler (given the requirement of “opentofu + libvirt + cloud-init”). It’s simple and easy enough for me.

The bridge problem

Not an actual problem, really, but since my goal is also to make this lab manager as easy to use as possible, I’d really like people who don’t even know what a bridge is to be able to use it. I’m building what I wish I had when I started learning DevOps: a playground that’s easy to use, resettable, yada yada yada.

There’s also the fact that bridging will not work over Wi-Fi. There are workarounds with ebtables, but I’m not willing to go there.

The point is: the bridge is currently managed outside of the ansible playbook. This is fine when the VMs are not using a bridged connection, but when they are, the bridge has to already exist.

I took a brief look at some ansible community modules, and apparently we could make this work with the nmcli module. I don’t like NetworkManager, don’t use it, and won’t.

So we’ll do it via a bash script, which does require the manual work of getting the physical NIC name right (for now):

#!/usr/bin/env bash
# Turns the wired NIC into a bridge port and moves the DHCP lease to the bridge.
# Run as root, and adjust PHYSICAL_NAME to match your NIC.

PHYSICAL_NAME='enp42s0'
BRIDGE_NAME='br0'

# Remove a leftover bridge from a previous run (errors harmlessly on the first run)
ip link del "$BRIDGE_NAME"

ip link add name "$BRIDGE_NAME" type bridge

# Release the lease on the physical NIC and enslave it to the bridge
dhcpcd -k "$PHYSICAL_NAME"
ip link set "$PHYSICAL_NAME" down
ip link set "$PHYSICAL_NAME" master "$BRIDGE_NAME"

ip link set "$PHYSICAL_NAME" up
ip link set "$BRIDGE_NAME" up

# Request an address on the bridge instead
dhcpcd "$BRIDGE_NAME"

echo "Bridge '$BRIDGE_NAME' configured for interface '$PHYSICAL_NAME'."

What we can do then is simply remove the bridge name from the lab configurations altogether and assume the bridge is named br0, the one created by the script itself.

Eventually I might also want to “disable” the bridge setup (we set up a bridge for the lab, so it sort of makes sense to undo it at some point). But since it keeps the ethernet connection working normally, there’s really no need for now.

I haven’t done this yet, but I plan to run this script from ansible too, if I don’t figure out something nicer.
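
If I do wire it in, a single task along these lines would probably be enough (assuming the script keeps living at the repo root and we’re fine with it re-running on every playbook run):

  - name: Configure the host bridge
    ansible.builtin.script: ./configure_bridge.sh
    become: true
    when: status == "running"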

The final setup

I now have a pretty useful tool for teaching (and learning) networking.

As an example, let’s see my “networking guide” lab:

images:
  - name: "debian12"
    source: "https://cloud.debian.org/images/cloud/bookworm/latest/debian-12-generic-amd64.qcow2"

networks:
  - name: "bridge"
    mode: "bridge"

  - name: "servers"
    mode: "none"
    # addresses: ["192.168.169.0/24"]
    dns:
      enabled: false
    dhcp:
      enabled: false

  - name: "clients"
    mode: "none"
    # addresses: ["192.168.200.0/24"]
    dns:
      enabled: false
    dhcp:
      enabled: false

vms:
  - name: "router"
    description: "Router for the simulated network"
    networks: ["bridge", "servers", "clients"]
    vcpu: 1
    memory: 512
    image: debian12
    cloud-init:
      user_data: ./cloud-init/user.cfg

  - name: "dns"
    description: "DNS server for the simulated network"
    networks: ["servers"]
    vcpu: 1
    memory: 512
    image: debian12
    cloud-init:
      user_data: ./cloud-init/user.cfg
  - name: "dhcp"
    description: "DHCP server for the simulated network"
    networks: ["servers"]
    vcpu: 1
    memory: 512
    image: debian12
    cloud-init:
      user_data: ./cloud-init/user.cfg

  - name: "client"
    description: "Mock client for the simulated network"
    networks: ["clients"]
    vcpu: 1
    memory: 512
    image: debian12
    cloud-init:
      user_data: ./cloud-init/user.cfg

Which might be best described by this image:

Drawing of the network topology

Essentially, I’ll use this to explain in practice how some networks are generally operated: with a firewalled router handling external access (and communication between the internal networks too).

Before running the lab I can check if there is anything there:

marcusvrp@ghost ~/.../projects/virt-labs % virsh list --all
 Id   Name   State
--------------------

marcusvrp@ghost ~/.../projects/virt-labs % virsh net-list --all
 Name   State   Autostart   Persistent
----------------------------------------

marcusvrp@ghost ~/.../projects/virt-labs %

No networks, no VMs, cool. Now we start it with ansible-playbook state.yaml -e 'status=running'. Time to check again.

marcusvrp@ghost ~/.../projects/virt-labs % virsh list --all
 Id   Name     State
------------------------
 1    client   running
 2    dhcp     running
 3    dns      running
 4    router   running

marcusvrp@ghost ~/.../projects/virt-labs % virsh net-list --all
 Name      State    Autostart   Persistent
--------------------------------------------
 bridge    active   no          yes
 clients   active   no          yes
 servers   active   no          yes

marcusvrp@ghost ~/.../projects/virt-labs %

Amazing. I’ll write a test file on each VM with echo $HOSTNAME > /root/hello, and then run ansible-playbook state.yaml -e 'status=stopped'. Time to check once more.

marcusvrp@ghost ~/.../projects/virt-labs % virsh list --all
 Id   Name     State
-------------------------
 -    client   shut off
 -    dhcp     shut off
 -    dns      shut off
 -    router   shut off

marcusvrp@ghost ~/.../projects/virt-labs % virsh net-list --all
 Name      State      Autostart   Persistent
----------------------------------------------
 bridge    inactive   no          yes
 clients   inactive   no          yes
 servers   inactive   no          yes

marcusvrp@ghost ~/.../projects/virt-labs %

All stopped, nice. Let’s restart with ansible-playbook state.yaml -e 'status=running'. Checking again, this is the one that could go wrong:

# oh no

It had to fail. Of course. Turns out the opentofu config does not like it when the networks aren’t active and we try to configure the VMs, which is exactly what I hoped wouldn’t happen.

After starting things manually, the /root/hello files are all there with the right contents, so at least I wasn’t messing up the disks.

Man… I feel lazy right now. I’m simply going to keep the networks running for now. It’s even better this way: I can use the libvirt provider itself for this.
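
One way to do that through the provider is the autostart argument on libvirt_network, so the networks come up with libvirtd and I don’t have to manage them from ansible. A sketch (not necessarily what ends up committed):

resource "libvirt_network" "network" {
  for_each  = { for network in var.lab_data.networks : network.name => network }

  # ...same arguments as before...
  autostart = true # let libvirt bring the networks up on its own
}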

With that all the basics work: defining, starting, stopping and undefining virtual labs!

Was this optimal in any way? Absolutely not!

I’ve been using my old laptop with Proxmox for a while now, so this little project was more of a “can I do it?” instead of “I need it”.

I did wish I had something like this before, but back then I just managed libvirt VMs manually. Not only that, I already have a pretty big playground where I study anyway, so I never really needed to build something like this.

Am I going to use this? I doubt it. Maybe if I adapt it a bit for Proxmox, since I’m managing my terraform files manually as of right now.

All that is left to do is use them for learning something more useful.

Appreciate ya if you read all this.