Forking Overleaf because no one else will

Like many things I ended up writing here before I actually deployed my website, I’ve already done what I’m writing about. I forked Overleaf because we needed it to support the OpenID Connect auth protocol (it’s only a 3-year-old open issue/feature request) and they won’t implement it.

What bothers me about this is how simple it was. I was avoiding forking it because I thought I’d have to mess with a lot of code I don’t know and learn a bunch of the codebase, and I didn’t want to waste too much time on that.

Turns out, after understanding the auth code, in 20 minutes you know exactly how you can do it.

If there’s one thing I got from this, though, it’s that I actually feel ready to start contributing to OSS more.

I’ve avoided it for some time now because I feel a bit anxious contributing to and messing with other people’s code, even when I feel I understand the codebase properly.

Matter of fact, one of the projects I want to contribute to the most is the Linux kernel (you might understand why the anxiety for this one at least). I’ve even had to write a driver for myself once. But that’s a story for another time.

One thing about updating this fork was how much time it actually took. Especially because the package-lock.json had to be updated, this took quite a few minutes. No wonder people avoid forking, this takes forever. For reference, I cloned the original repo and simply ran cd development && ./bin/dev, and the image building process took over 15 minutes (I have a pretty nice internet connection and a definitely decent PC, wtf?).

Before we begin let’s explain what problem we’re solving here.

ShareLaTeX

For a bit of context: I’m part of the sysadmin/devops team at C3SL at the Department of Informatics in the Federal University of Paraná (Brazil mentioned? 0.0). C3SL handles a lot of the infrastructure of the department.

We make some services available for students of our department.

Some of these services are:

A remote-booted Linux Mint image for people with an account to log in to our laboratories.
A Moodle instance.
An NFS-exported /home directory for every user.
A GitLab instance.
A High Performance Computing cluster.
A BigBlueButton instance.
A Jitsi instance.
…
And of course: a ShareLaTeX instance.

You might notice that it’s all FOSS/OSS, which makes sense once you see that C3SL stands for “Center of Scientific Computation and Free (as in freedom) Software”.

Anyways: ShareLaTeX is an old version of Overleaf. We still use it because our user database is an LDAP db on the machine with the /homes, and our main authentication method for years now has been Kerberos. SSH uses Kerberos, GitLab used Kerberos, etc.

The problem arises when stuff starts dropping support for it, so we just stop upgrading services.

Overleaf is the successor to ShareLaTeX and has more features.

OpenID Connect (& Keycloak)

By default, Overleaf’s Community Edition only supports “internal logins”, that is: an administrator creates an account manually for anyone that wants to use the service. It’s also possible to enable self-register, but that’s not what we want.

So what should we do? Run a script to sync the user databases? Dude nah.

We are trying to improve our infra and the way we handle things. One of the improvements we’re trying is new authentication methods in order to support stuff properly. The particular way we’re trying to improve is by using a centralized identity and access management service: Keycloak.

Using that we plan to solve a few issues of disconnected user databases, and centralize auth. One cool thing is that by using this solution not only will the same login work on all the services, but it may also be cached, so you log in once and then every other service’s login is a click away.

In any case: Overleaf does not support that.

Its Enterprise Edition supports SAML, which Keycloak also does, but we’re not using the Enterprise Edition. It makes zero sense to pay for it simply to have auth.

Another big reason for forking, which I’ll discuss in detail later, is that Overleaf added a compilation time limit of 20 seconds to their free plan. Needless to say, a bunch of our projects exceed that, so even using Overleaf’s own server is out of the question.

So what do we do?

Fork

The plan is simple: add an extra login route that uses an OpenID Connect callback. Create the user if it does not exist. And proceed with a normal login.

“In and out, 20 minute adventure.”

Bonus points for making both login options available, in case we need to give external access to someone.

It’s kinda annoying that their bin/dev script uses docker-compose. As far as I can tell, that was deprecated a while ago in favor of docker compose.

Taking a very quick look at the codebase, especially the services/web/app/infrastructure/Server.mjs file, you see that Overleaf uses passport to handle logins.

That is great, because we can just use passport-openidconnect to provide the auth strategy and be done with this in actually a few minutes. This literally couldn’t be easier. Actually it could, they could just have native support for it.

Let’s start with the strategy code:

if (process.env['OPENID_ENABLED'] === 'true') {
  const crypto = require('crypto');
  // had to rush to make this fork, so externally managed users will get a random password
  async function genPass(length) {
      return new Promise((res, rej) => {
          crypto.randomBytes(length, (err, buffer) => {
              if (err)
                  return rej(err);
              const randomString = buffer.toString('base64').slice(0, length);
              res(randomString);
          });
      });
  }
  passport.use(new OpenIDConnectStrategy(
    {
      // TODO: throw error on configuration missing
      issuer: process.env['OPENID_ISSUER'],
      authorizationURL: process.env['OPENID_AUTHORIZATION_URL'],
      tokenURL: process.env['OPENID_TOKEN_URL'],
      userInfoURL: process.env['OPENID_USERINFO_URL'],
      clientID: process.env['OPENID_CLIENT_ID'],
      clientSecret: process.env['OPENID_CLIENT_SECRET'],
      callbackURL: '/oidc/redirect',
      scope: [ 'profile' ]
    },
    async function verify(_issuer, profile, cb) {
      // Errors thrown inside an async verify callback would otherwise become
      // unhandled rejections, so hand them to passport through cb instead
      try {
        // TODO: proper profile to userDetails mapper configuration
        const names = profile.displayName.split(' ');
        // first half of the names (rounded up) is the first name, the rest is the last name
        const mid = Math.ceil(names.length / 2);
        const userDetails = {
          id: profile.id,
          email: profile.username + "@inf.ufpr.br",
          first_name: names.slice(0, mid).join(' '),
          last_name: names.slice(mid).join(' '),
          password: await genPass(64)
        };
        // wonder if this is right...
        let user = await UserGetter.promises.getUserByAnyEmail(userDetails.email);
        if (!user)
          user = await UserRegistrationHandler.registerNewUser(userDetails);
        return cb(null, user);
      } catch (err) {
        return cb(err);
      }
    }
  ))
}

We check if the strategy is enabled in the environment variables, and add a strategy if it is. We generate a random password for the user.

A better solution would be to straight up not allow other forms of login by setting a “federated” flag, or something. But this will do. That’s similar behavior to our GitLab, in which you can still reset your “internal password” and log in via GitLab auth itself in case our auth service is down.

It would probably be ideal to leave the verify function, which maps the Keycloak identity to the Overleaf user, to be implemented by people who use the fork. We’ll tackle that on another day. The email being username + '@inf.ufpr.br' is internal stuff.
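As a rough idea of what such a mapper could look like once it is pulled out of verify, here is a minimal sketch. The function name mapProfileToUserDetails and the emailDomain option are made up for illustration, they are not part of Overleaf or of the fork. It assumes the profile carries id, username and displayName, the same fields the strategy code above reads.

```javascript
// Hypothetical mapper: turns a passport profile into the fields used to
// register a user. Not part of Overleaf; names are illustrative only.
function mapProfileToUserDetails(profile, { emailDomain }) {
  // Fall back to the username when the provider sends no display name
  const fullName = profile.displayName || profile.username;
  const names = fullName.trim().split(/\s+/);
  // First half of the names (rounded up) becomes the first name
  const mid = Math.ceil(names.length / 2);
  return {
    id: profile.id,
    email: `${profile.username}@${emailDomain}`,
    first_name: names.slice(0, mid).join(' '),
    last_name: names.slice(mid).join(' '),
  };
}

// Usage: the domain would come from configuration instead of being hardcoded
const details = mapProfileToUserDetails(
  { id: '42', username: 'jdoe', displayName: 'John Ronald Doe' },
  { emailDomain: 'inf.ufpr.br' }
);
console.log(details.email, '|', details.first_name, '|', details.last_name);
```

People using the fork could then swap this one function for their own mapping without touching the strategy setup.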

The most annoying part of the project is getting the proper package-lock.json file working. Since we added an extra dependency and the Dockerfiles use npm ci we need to get that sorted. One way to get it working nicely is to generate it inside the container itself, and then copy it out. It’s either that or debugging why something about node-gyp fails catastrophically.

In any case, there is some error handling missing there, mainly related to the environment variable extractions.
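The missing checks could look roughly like the sketch below. The helper readOpenIDConfig is hypothetical (it is not in the fork). It only uses the OPENID_* variable names from the strategy code above, and fails loudly at startup instead of passing undefined values to the strategy.

```javascript
// Variables the strategy reads; all of them are required once OpenID is enabled
const REQUIRED_OPENID_VARS = [
  'OPENID_ISSUER',
  'OPENID_AUTHORIZATION_URL',
  'OPENID_TOKEN_URL',
  'OPENID_USERINFO_URL',
  'OPENID_CLIENT_ID',
  'OPENID_CLIENT_SECRET',
];

// Hypothetical helper: returns null when OpenID is disabled, the strategy
// options when everything is set, and throws listing what is missing otherwise.
function readOpenIDConfig(env) {
  if (env.OPENID_ENABLED !== 'true') return null;
  const missing = REQUIRED_OPENID_VARS.filter((name) => !env[name]);
  if (missing.length > 0) {
    throw new Error(`OpenID is enabled but these variables are missing: ${missing.join(', ')}`);
  }
  return {
    issuer: env.OPENID_ISSUER,
    authorizationURL: env.OPENID_AUTHORIZATION_URL,
    tokenURL: env.OPENID_TOKEN_URL,
    userInfoURL: env.OPENID_USERINFO_URL,
    clientID: env.OPENID_CLIENT_ID,
    clientSecret: env.OPENID_CLIENT_SECRET,
  };
}

// Usage: in the server this would be called as readOpenIDConfig(process.env)
try {
  readOpenIDConfig({ OPENID_ENABLED: 'true', OPENID_ISSUER: 'https://idp.example' });
} catch (err) {
  console.log(err.message);
}
```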

After that it’s a matter of creating the login routes:

oidcLogin(req, res, next) {
  passport.authenticate('openidconnect')(req, res, next)
},

oidcRedirect(req, res, next) {
  // With a custom callback, passport hands the result to us and skips its own
  // login and redirects, so the options below only apply if the callback is removed
  passport.authenticate('openidconnect', {
    keepSessionInfo: true,
    successReturnToOrRedirect: '/',
    failureRedirect: '/login'
  }, async function(err, user, _info) {
    // check for errors first: when err is set there is no user either
    if (err)
      return next(err);
    if (!user)
      return res.status(401).json({redir:'/'});
    try {
      // I have no idea what these do, I just copied the passportLogin below pretty much
      await Modules.promises.hooks.fire('saasLogin', { email: user.email }, req);
      await AuthenticationController.promises.finishLogin(user, req, res);
    } catch (e) {
      return next(e)
    }
  })(req,res,next)
},

You can take a look at how OpenID Connect works, but basically we need one function to call to tell the auth provider we want to authenticate and another to handle the response (the redirect).
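To make those two halves concrete, here is a simplified sketch of the authorization code flow as seen from the app. It is not what passport-openidconnect does internally: the function names and URLs are made up, and the token exchange and ID token validation that follow the redirect are left out entirely.

```javascript
// Step 1 (the "login" route): build the URL we send the browser to.
// The state value ties the later response to this login attempt.
function buildAuthorizationUrl({ authorizationURL, clientID, redirectURI, scope, state }) {
  const url = new URL(authorizationURL);
  url.searchParams.set('response_type', 'code');
  url.searchParams.set('client_id', clientID);
  url.searchParams.set('redirect_uri', redirectURI);
  // OpenID Connect requests always carry the "openid" scope
  url.searchParams.set('scope', ['openid', ...scope].join(' '));
  url.searchParams.set('state', state);
  return url.toString();
}

// Step 2 (the "redirect" route): the provider sends the browser back with
// either an error or an authorization code in the query string.
function parseCallback(callbackUrl, expectedState) {
  const params = new URL(callbackUrl).searchParams;
  if (params.has('error')) {
    throw new Error(`provider returned error: ${params.get('error')}`);
  }
  if (params.get('state') !== expectedState) {
    throw new Error('state mismatch');
  }
  const code = params.get('code');
  if (!code) {
    throw new Error('missing authorization code');
  }
  // The code would then be exchanged for tokens at the token endpoint
  return code;
}

// Usage with placeholder URLs
const loginUrl = buildAuthorizationUrl({
  authorizationURL: 'https://idp.example/auth',
  clientID: 'overleaf',
  redirectURI: 'https://latex.example/oidc/redirect',
  scope: ['profile'],
  state: 'abc123',
});
console.log(loginUrl);
console.log(parseCallback('https://latex.example/oidc/redirect?code=xyz&state=abc123', 'abc123'));
```

In the fork both steps are handled by passport.authenticate('openidconnect'), which is why the two route handlers are so short.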

After that we just mess with the web/app/views/user/login.pug to set the login UI and we’re done.

The “proper” way: overleaf-toolkit

Now that we’ve got the Overleaf fork itself working, it’s time to take a look at how Overleaf tells users to actually run a local instance: overleaf-toolkit.

It’s basically a pre-built container. We just use it as base, copy the relevant files to it, update dependencies and it’s done.

You can see how simple it is in my repo.

Am I going to get sued?

Wouldn’t really make any sense at all. OpenID Connect is not a feature even in the Enterprise Edition.

Not only that. There is obviously zero commercial use. Plus the fact that it’s only accessible to people we give accounts to (mostly students and professors from our department).

But talking about the Enterprise Edition brings up an interesting point.

A note on self-hosting and FOSS

You saw it. It’s ridiculous how simple that fork was. So why don’t they do it?

Because they’re selling a service.

The important part of Overleaf is not the software itself, it’s the overleaf.com website, or the premium/plus/whatever plan upgrade proprietary subscription thing.

How else could they make money? It’s the type of thing that people love to use but very few contribute to. Hell, even we (in the academic area), who quite literally depend on a collaborative LaTeX editor, were not willing to pay for the service itself. Given, of course, that we have the means to fork and self-host this thing, which people from other departments might not have.

In any case: the 20 second compilation limit for the free plan wasn’t there before. Isn’t this familiar? An OSS project that makes the default option gradually worse in order to promote their paid plan.

Are they far from changing licenses and going downhill?

Because let’s be honest, there are quite a few companies like HashiCorp.

But they wrote [here] that they wouldn’t do that.

Man, even Firefox changed from “we’re not selling your data” to “we’re responsibly selling your data”. And that’s also for a valid reason of sorts: how can they survive otherwise?

Pretty much everyone around me uses Firefox, literally the majority. So it’s kinda counterintuitive to think that Firefox is actually a very small minority in the browser space, and doesn’t have nearly as much funding as other proprietary browsers.

So that’s their last resort. Start “betraying” their own users because it’s either that or the project starts dying. That’s the fate of many large projects. A “CLI kanban board” you can write in a day is not the same thing as a privacy-respecting alternative browser. Large OSS projects with few resources are fated to resort to alternative ways of making money, or just die.

So what’s the alternative?

Forking and self-hosting, it seems.

If the code is publicly available and you have the means and resources to do so, honestly, do it. That way the code you’re running is pretty much yours. Tune it to your liking, if that’s what you want.

Or just don’t use it. In Overleaf’s case, it might not be the same thing, but using a local LaTeX compiler and sharing the source via git is a very valid way of doing things. How often do you end up editing the exact same content to the point you start changing what each other is writing at the same time? Merge conflicts will be minimal, I (almost) guarantee.

You know what? You can even ask an AI to help with this: automate the compilation and stuff in the repo.

In any case, that’s how it’s going to go. Open Source Projects only really survive generally because bigger companies fund them. Or some poor soul(s) dedicate a chunk of their life to keeping the internet running.

The Asahi Linux project is an example. You can search it: the contributions were at their highest at the very beginning of the project, and then they gradually decreased even though the number of features and the cost (not necessarily the monetary one) of maintenance went up.

Can we do something about it?

I… don’t know.

Honestly, I’d love to spend all my days working solely on open source projects that a ton of people use. However, I’d also love to have food on my table, and sometimes there is conflict between these two wishes.

There is the option of working on projects on weekends and stuff. But man you have to love that to spend a chunk of your free time doing something that generally is unappreciated. Matter of fact, probably the number one skill of a good OSS maintainer is the ability to say “no” a lot. There are a bunch of people who got extremely annoyed or straight up quit OSS. Doing that is hard.

And again, let’s use Firefox as an example of a large project: do you really believe Firefox would be the same if only a few people worked on it on (some) weekends?

I’ve worked on a number of projects now, but by far the ones I enjoyed the most were simple things that people actually used and enjoyed. This overleaf fork is one of them, students and professors were happy about it, and so was I.

You can always say “we should contribute more to OSS” and similar things, but actually doing it is what matters.

One thing that might help: contribute to the codebase. Fuck fixing typos on documentation, do something useful for the project you want to help. Hell, want to fix documentation? Look for ambiguous or unclear text. That requires you to understand the code to begin with, to have used it, and to know how to teach people to use it. Sometimes the documentation is lacking.

By all means, pick libraries you actually use and study the codebase. At the very least you become more proficient with something you’ve been using. That’s a skill in and of itself: opening source code that’s not yours and understanding it. Develop that skill.

I mean it. Found a bug? Study it and the codebase first (it might be a “feature”), report it and suggest a solution!

Don’t rush with it. If you want to contribute to the Linux kernel, for instance, and you open the source and don’t understand shit, it’s because you’re not ready. Study a bit more. Another skill that you’ll end up developing is talking to other people. There is absolutely a community that will welcome you and help you get started, no matter what.

As someone who neglected the “talking to other people” skill I can say this for sure: It’s just as important as writing good code (whatever that means), and sometimes more.

And honestly, contributing code is already farther than most would go to begin with. And if that’s done right you probably take just a bit of weight off of the maintainers’ shoulders.

So that’s my conclusion: Contribute to codebases, that way you help OSS projects and develop your skills as a developer yourself.

Gotta work on my “contribution anxiety”. I already have a list of projects I know well enough to contribute to but am too shy to do so.