
GitHub Actions: Ephemeral self-hosted runners and new webhooks for auto-scaling

192 points | 3 years ago | github.blog
sascha_sl 3 years ago

This feature was delayed every month after May.

And yet it is still half-baked. We prepared for this with internally shared docs and the branch built in private for a while, but still had to roll back yesterday because the scheduler reverted to putting jobs wherever it pleased (including on ephemeral runners that already had a job) and randomly canceled large sets of jobs too.

I have been of the opinion that investing in GH Actions at this stage is purely sunk cost (at my org), and I'm not moving until the team behind this thing ships something that doesn't break half the time. These have been seriously frustrating months, because no amount of working around this messy code[1], made of 5 layers of MS-style .NET (seriously, deleting a directory goes 5 layers deep in the call stack), will ever produce a stable product. They don't even know their own code base that well: when they first attempted ephemeral runners with `--once`, it turned out the thing they produced could never work (because the server-side scheduler loves pipelining jobs to machines and failing miserably when these disappear, of the "job times out after 20 minutes of waiting" type).

[1]: https://github.com/actions/runner

silasb 3 years ago

Has your team considered looking into buildkite? We love the flexibility it gives us. Being able to dynamically build pipelines is a very nice feature that not many others have (at least that I could tell when I was researching).

sascha_sl 3 years ago

We would like to run other things (probably concourse or argo), but this decision was made way further up to justify picking GitHub as provider. There might also be a Microsoft volume discount involved.

If we hard reject actions, we’ll probably end up with the prior status quo: Jenkins.

mrclean8586 3 years ago

Product manager on the GitHub Actions team reporting in. We're sorry to hear about this issue with the rollout of ephemeral runners. Our engineering team is aware of this issue and is heavily prioritizing the investigation and fix.

We'd love to look into your specific case if you want to shoot me an email: thejoebourneidentity@github.com

sascha_sl 3 years ago

One, our githubcustomers/ contact is already forwarding anything / setting up a call with the team as needed, and two, that is not a twitter profile I'd ever send a DM to in a professional context, considering you retweet a lot of people diametrically opposed to my existence.

mrclean8586 3 years ago

I will check in with our githubcustomers group to help accelerate. I can also directly inform our engineering team if you're open to sending me your information at thejoebourneidentity@github.com

Either way, we're looking into this issue and I'll post an update here once we've learned more.

p8952 3 years ago
newman314 3 years ago

One issue that I've been dealing with over the last 48hrs is that pushing Docker images to GHCR has been randomly failing with 403 errors.

AFAIK, there has been no communication/acknowledgement of this as an issue. It makes it hard to decide to pick GHCR as a registry of choice.

thomasmcfarlane 3 years ago

We recently faced this. If you are using the docker/login action, I'd give that a check: it turned out it was logging us out by default at the end of each job, resulting in race conditions when running multiple runners on the same machine (sharing the same Docker daemon).

Simple fix was to add `logout: false` to the action options.

e_proxus 3 years ago

I really wish the runner agent was written in something more portable than .NET. That choice feels like something purely political because they’re owned by Microsoft. I doubt an independent organization would have chosen it over other excellent choices such as Go, Rust etc.

Currently hosting the runner on e.g. FreeBSD or custom embedded systems is not supported (or even possible).

sascha_sl 3 years ago

It's not because they're owned by Microsoft, at least not in the way you think.

It's because GitHub Actions is rebranded Azure Pipelines.

That a team at GitHub has been given a pile of Microsoft authored code is honestly much more concerning. They don't seem to understand it in its entirety either.

easton 3 years ago

I didn't want to believe it, but then I did a search through their repo:

```
else if (value.Contains("Microsoft.Azure.DevOps"))
{
    m_typeName = value.Replace("Microsoft.Azure.DevOps", "GitHub");
    m_typeName = m_typeName.Substring(0, m_typeName.IndexOf(",")) + ", Sdk";
}
```
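
For readers who don't read C#, here is a rough Python rendering of what that branch does to a type name. This is only a sketch: the function name `rewrite_type_name` and the sample input are made up for illustration, and the snippet above starts at an `else if`, so whatever the earlier branches do is not shown here (the sketch simply returns the value unchanged in that case).

```python
def rewrite_type_name(value: str) -> str:
    """Mimic the quoted branch: rebrand the namespace, then point at "Sdk"."""
    marker = "Microsoft.Azure.DevOps"
    if marker not in value:
        # Assumption: other branches exist in the real code; not modeled here.
        return value
    renamed = value.replace(marker, "GitHub")
    # Keep everything before the first comma and append ", Sdk".
    # Like the C# original, this fails if there is no comma at all.
    return renamed[: renamed.index(",")] + ", Sdk"


# Hypothetical type name, not taken from the repository.
sample = "Microsoft.Azure.DevOps.Foo.Bar, Microsoft.Azure.DevOps.Foo"
print(rewrite_type_name(sample))
```

In other words, the string before the comma gets its prefix swapped to `GitHub`, and whatever followed the comma is replaced with `Sdk`.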
jsmith45 3 years ago

You can also compare the source code structure to the Azure DevOps Pipeline agent's to tell pretty easily that the github runner is a fork that has been edited to process the somewhat different format of the actions YAML.

https://github.com/microsoft/azure-pipelines-agent/

runeks 3 years ago

Which repo?

CardiBFanatic23 3 years ago

Most of the people who worked on Azure Pipelines got transferred to GitHub to work on Actions. You can even see the same names of developers contributing code to both the GitHub runner and the Azure DevOps agent.

xvilka 3 years ago

Then it's probably faster and easier to rewrite it in Go from scratch.

jbergstroem 3 years ago

There is an ongoing project built on top of act (the "local" github runner) that accomplishes this: https://github.com/ChristopherHX/github-act-runner

IshKebab 3 years ago

It runs on Windows, Linux and Mac surely? How much more portable do you need?

jclulow 3 years ago

There are other operating systems and CPU architectures. It's a boon for open source projects to be able to have CI on all the BSDs, and illumos, and Plan 9, and even weirder things.

rubicks 3 years ago

Too little and too late. Meanwhile, I'm over here with gitlab self-hosted runners that "dispatch" ephemeral runners. I can tweak scaling limits and the whole contraption runs seamlessly on the AWS ec2 instances of my choosing.

My company just completed the migration from github to gitlab and, while it's not perfect, there's a lot to like on gitlab.

danpalmer 3 years ago

I think it's all just a matter of team preferences and for many this is less "too little too late" and more "yet another great release" when compared to the other tools they're using.

I personally find Actions to be a far better product than GitLab CI and we're moving all our CI from a mix of Circle/Jenkins to Actions.

147 3 years ago

What do you like about Actions more than GitLab CI? Github Actions just feels much less mature and I keep running into issues.

danpalmer 3 years ago

It's funny because that's pretty much my experience with GitLab CI. Actions is certainly a younger product, I do feel that, but in terms of the design, how the pieces fit together, and what it feels like to develop for it, it all feels much more mature than GitLab CI to me. GitLab always felt like a Travis clone that was hacked to look more like CircleCI.

19h 3 years ago

The absolutely most annoying issue with GitHub Runners is the fact that they run 1 job .. at a time ... per server.

You can only imagine our follow-up meetings about the fact that we had a fleet of 15 c5a.2xlarge instances and still half of the developers were waiting up to 20 minutes for an instance to go online.

The worst part? The jobs don't clean up -- probably to allow for caching. We ran into disk space issues regularly enough for it to force us to make the spot instances commit harakiri after 2 days.

GitHub Actions are a cool concept and we'll probably stick with them. But their quality is just bad. There's that .NET runner and it feels like it's so massively different from anything GitHub-like you could imagine .. almost as if it's a whitelabel program they licensed or like it's the result of 4 weeks of contract work. Simply bad.

nicoburns 3 years ago

Can't you run more than one runner per server? My understanding was you start one up, give it a directory to work in, and it'll register itself with the central server and start processing jobs. I thought you could just run more instances if you wanted parallelism.

judge2020 3 years ago

> The jobs don't clean up -- probably to allow for caching.

Ya, this helps with a specific build cache scenario I use. A workaround if you want it to clean up is to put `rm -rf "${{ github.workspace }}"` at the end of your workflow.
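
If you'd rather run the cleanup as a script step (for instance on a Windows runner where `rm` isn't available), something like the following might work. It's a minimal sketch, not anything the runner ships: the helper name `clean_workspace` is made up, and unlike the `rm -rf` above it empties the directory instead of deleting the directory itself, on the assumption that later jobs on the same runner expect the workspace path to still exist.

```python
import shutil
from pathlib import Path


def clean_workspace(workspace: str) -> int:
    """Delete everything inside `workspace`; return the number of top-level entries removed."""
    if not workspace:
        raise ValueError("no workspace path given")
    root = Path(workspace).resolve()
    # Refuse obviously dangerous targets such as the filesystem root.
    if root == Path(root.anchor) or not root.is_dir():
        raise ValueError(f"refusing to clean {root}")
    removed = 0
    for entry in root.iterdir():
        if entry.is_dir() and not entry.is_symlink():
            shutil.rmtree(entry)  # real directory: remove recursively
        else:
            entry.unlink()  # file or symlink: remove just the entry
        removed += 1
    return removed
```

In a workflow step you would pass it the `GITHUB_WORKSPACE` environment variable, e.g. `clean_workspace(os.environ["GITHUB_WORKSPACE"])`, and probably mark the step `if: always()` so it also runs after a failed build.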

wcdolphin 3 years ago

If anyone has experience using self hosted GH Actions at scale, I’d love to buy you a virtual coffee and hear about pros/cons for a parallelized CI flow currently running in Circle. Main motivation for switching would be simplification of tooling and increasing performance with better cache reuse and running within AWS for faster network access to ECR.

147 3 years ago

Reach out to me and I’ll be happy to talk to you about my experiences with GitHub actions.

molszanski 3 years ago

At https://packhelp.com we use Github Actions for ~50 devs. I can share my experience, reach me.

Also, we have some useful Ansible scripts that I might share if there is interest

newman314 3 years ago

I would be interested in seeing the Ansible scripts if you are amenable to it. Thanks!

molszanski 3 years ago

Ok, will talk to the person who can release them. Shoot me an email in a week

abszeph 3 years ago

I’d also love to chat more. DM coming your way!

twistedpair 3 years ago

We moved all our builds (e.g. 100+) to GH Actions. We've been using GH Actions since it was a daily tar ball drop in a private GH Slack channel in Q3 2019.

Happy to answer any questions.

The biggest challenge has been the many GH Actions service outages/impacts. We're working on moving to self hosted runners to mitigate this.

koalalorenzo 3 years ago

I wish there was an official Helm Chart for k8s, like GitLab CI/CD Runner has, and not the kind that sits there and does not scale, but the kind that spins up workers on demand without taking too many resources while idle.

I wish GitHub copied that feature from GitLab too!

growse 3 years ago

It's not official, but there are K8s / github actions runner deployments: https://github.com/actions-runner-controller/actions-runner-...

I've been playing about with this and it seems to work quite well. Startup latency is quite high, and it's one pod-per-job (I think), but seems pretty flexible.

twistedpair 3 years ago

I've been eyeing this for a while. My biggest hangup is that CI/CD is a major attack (e.g. supply chain) vector. If you use CI/CD for deploys, then a lot of highly privileged creds are in play.

I'd really prefer if GH made and managed the K8s operator (e.g. the most popular infra provisioning tool) themselves.

thinkafterbef 3 years ago

The feature pull request has been there for over a year[1], it’s nice that it’s released!

Incoming shameless plug: if you don’t want to handle hosting runners but still want to reap the benefits of having proper hardware (close to the metal), check out BuildJet for GitHub Actions[2] - 2x the speed for half the price. Easy to install and easy to revert.

[1] https://github.com/actions/runner/pull/660

[2] https://buildjet.com/for-github-actions

hardwaresofton 3 years ago

And a more shameless second plug, I run SurplusCI which does the same thing for GitHub and GitLab with a few other platforms on the horizon.

I can say we're less than half the price, because we focus solely on dedicated hardware and dedicated compute. We're working on pay-for-what-you-use as we speak, and this issue finally getting resolved has generated work for me this weekend.

[0]: https://surplusci.com

growse 3 years ago

BuildJet seems to be KVM-based, so does the job still run in a VM?

Does it support nested KVM, e.g. for running Android espresso / emulator tests?

thinkafterbef 3 years ago

Yes, the job runs in a KVM VM. Nested KVM is supported on the hypervisor, but KVM is not enabled by default in the guest OS, because we run a guest kernel for faster boot times. We will offer an option to enable the kvm kernel module in the future.

veidr 3 years ago

Wow that looks like exactly what I need. We recently moved to GHA and while it is nice in many ways, my main complaint is that unlike our previous (AWS CodeBuild/CodePipeline) setup, we can't just pay more to get more powerful instances to run CI.

Looking into setting up self-hosted runners has been on my todo list since the first day of using GHA; will definitely check out your service soon.

noptd 3 years ago

Ephemeral runner support has been highly anticipated for our organization - I'm excited to see it go live!

However, GitHub Enterprise admins may want to take caution - some users have reported that the changes are not currently compatible https://github.com/actions/runner/pull/660

smcleod 3 years ago

Github Enterprise is a license / plan on github.com - I suspect you're talking about people running the Github Enterprise self-hosted VM?

noptd 3 years ago

Yeah, that's correct.

albertom94 3 years ago

FWIW my team jumped the gun and encountered that issue on our GHE instance as well (v3.1).

vyrotek 3 years ago

We're pretty happy with Azure DevOps on our team.

But, these competing offerings between Azure and GitHub have been really confusing to follow. Especially since folks are pointing out that GitHub Actions is partly Azure DevOps under the hood. It just seems like a complicated branding play because some people will refuse to use an Azure service but will gladly use a GitHub service still owned by Microsoft?

WorldMaker 3 years ago

Azure DevOps Pipelines is "stable"/"mature" and not seeing anywhere near as much active investment: most of the team supposedly moved directly over to Github Actions and that seems to be where all the new investment work is going.

Azure Codespaces was rebranded at the 11th hour before launch to Github Codespaces and moved almost entirely to the Github org and Azure DevOps was never given access unlike original announced plans under the Azure brand.

Rumors have been swirling for a while now (including when bharry, the VP whose kingdom was Azure DevOps, retired three years ago) that Azure DevOps is on the slow decline to some sort of chopping block and Microsoft will replace it entirely with Github eventually. There are rumors that even "deeply private" teams you wouldn't expect to move from Azure DevOps to Github internally at Microsoft have already migrated. (Certainly a lot of well known Windows Developers have much more active "Activity Indicators" on Github these days, and that can't necessarily be entirely accounted for by all the known public repos like Calculator, Terminal, etc and public facing samples projects, nor by the fact that all of their documentation repos have obviously moved to Github.)

It would be wonderful to get an actual definitive and official statement from Microsoft, even if "eventually" when a migration will happen is still "years away" (which is presumably why they are afraid to give a statement yet, if it's still too far down the roadmap). That would make it easier today for some of us to start making cases to our teams that migrating voluntarily today to Github would be good for us. (Make the debate more than just "I want Codespaces" or "I want Github's dependency scanners" but also "Microsoft suggests it".)

slaughtr 3 years ago

You don't need to create an Azure account to use Github Actions. It's not really refusing to use the service as much as using the streamlined one right in front of you.

twistedpair 3 years ago

What about the OSX runners? Those run in MacStadium, not Azure.

xvilka 3 years ago

The biggest problem with GitHub Actions is that you can't restart just one job[1]; it always restarts all jobs in the workflow. And this bug has not been fixed for quite a while. Travis CI and Appveyor both allow that, of course.

[1] https://github.com/actions/runner/issues/432

twistedpair 3 years ago

And, if you restart a job, no new entry is made for the job history, so it just overwrites the job history on rerun.

This is likely because they modeled the jobs as one-per-job-def-and-commit, so they don't have a UX to show two.

This is a security blind spot, since you can do something naughty in a job, then rerun it and the logs are no longer accessible.

hardwaresofton 3 years ago

It's a bit of an old drum to beat on but just want to note that GitLab has supported this (and provides docs for running on EC2, Fargate, k8s and other platforms like LXD[0][1][2][3]) for a very long time, and the CI system there is quite robust.

I've seen my fair share of CI systems (AppVeyor, CircleCI, GitLab, GitHub, TC, Jenkins, etc) and I'd argue that the GitLab CI is the best of all the ones I've seen:

- Great syntax (it's YAML like most others but somewhat easy to organize well with great documentation)

- Fantastic documentation

- Unparalleled flexibility

- Unsurprising operation (things generally work as you'd expect)

- The ability to clear your build runner cache (Just ran into the inability to do this with CircleCI again today)

That said, competition is a good thing, so in general I'm glad to see this finally supported by GHA and will dig into it over the weekend. GHA is making a lot of really good sustainable moves in the space and keeping the field open (their marketplace is the best) so I'm all for it.

I run SurplusCI[4], which does what you'd think (runs these runners in VMs), so getting on-demand runners working is a bit top-of-mind. Right now I only offer dedicated runners, which are cheaper but of course aren't as cheap as on-demand (depending on usage).

Speaking of competition, just learned of a competitor here on HN in BuildJet[5], so if you don't want to manage your own runners check them out as well, unlike SurplusCI they actually offer to-the-minute on-demand runners, and the onboarding process looks way easier.

[EDIT] - Just to say, the list above is absolutely NOT the full list of platforms GitLab Runner supports -- it's pretty insane how many directions the community and GL have gone in. The Docker Machine integration (they maintain a fork) actually means you could run your single-use-machines on Scaleway or Hetzner easily as well, no need to muss or fuss with ASGs or k8s.

[0]: https://docs.gitlab.com/runner/configuration/runner_autoscal...

[1]: https://docs.gitlab.com/runner/configuration/runner_autoscal...

[2]: https://docs.gitlab.com/runner/executors/kubernetes.html

[3]: https://docs.gitlab.com/runner/executors/custom_examples/lxd...

[4]: https://surplusci.com

[5]: https://buildjet.com/for-github-actions

nicois 3 years ago

So this is a big step forward in terms of avoiding the race condition where CI runners would accept new jobs during scale-in operations. But how do you ensure you only spawn new ephemeral runners as jobs become available? The webhook provides part of the answer, but do we need to use something like redis to ensure exactly one runner per queued job is started?
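
To make the question concrete, here is one way the "exactly one runner per queued job" bookkeeping could look. It is a sketch under several assumptions: it reacts to `workflow_job` webhook deliveries whose `action` is `queued`, checks the `X-Hub-Signature-256` header value, and uses an in-process set where a real deployment with more than one receiver would need an atomic claim in a shared store (Redis being one option). `RunnerDispatcher` and the `launch` callback are hypothetical names, and actually starting a runner is left out.

```python
import hashlib
import hmac
import json
from typing import Callable, List, Set


def signature_is_valid(secret: bytes, body: bytes, header: str) -> bool:
    """Compare the X-Hub-Signature-256 header value against our own HMAC of the body."""
    expected = "sha256=" + hmac.new(secret, body, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, header)


class RunnerDispatcher:
    """Start at most one runner per queued job id (single-process sketch)."""

    def __init__(self, launch: Callable[[int, List[str]], None]) -> None:
        self._launch = launch           # hypothetical: spins up one ephemeral runner
        self._claimed: Set[int] = set()  # stand-in for a shared, atomic store

    def handle(self, body: bytes) -> bool:
        """Process one webhook delivery; return True if a runner was started."""
        event = json.loads(body)
        job = event.get("workflow_job")
        if event.get("action") != "queued" or not job:
            return False                # only newly queued jobs need a runner
        job_id = job["id"]
        if job_id in self._claimed:
            return False                # duplicate or redelivered event
        self._claimed.add(job_id)
        self._launch(job_id, job.get("labels", []))
        return True
```

Note that this only bounds how many runners get started. As far as I can tell nothing ties a freshly started runner to the specific job that triggered it, so the scheduler may still hand that runner a different queued job with matching labels.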

sascha_sl 3 years ago

It'd be nice if we had an API for all jobs that still need a runner, but I don't think that'll happen.

anonymousDan 3 years ago

Can someone tell me if GHA also supports non-ephemeral self hosted runners, and if so whether they work reliably? Any good resources for getting up and running with it quickly?

jffry 3 years ago

GitHub's own documentation about GHA has a nav section titled "Hosting your own runners".

Here's a link: https://duckduckgo.com/?q=github+actions+self+hosted+runners

I haven't used it myself so I can't speak to how well it works.

RyJones 3 years ago

I use them with Hyperledger on a couple Mac Minis. They work fine.

NiekvdMaas 3 years ago

This is great news. The only part missing is official docker support for the runner (I'm using an unofficial solution right now) and/or Alpine support.

mcintyre1994 3 years ago

The autoscaling piece is cool! One of the things that impressed me most about Gitlab CI was how easily we could get runners autoscaling in our own AWS environment. We'd run tiny instances as the actual runner, and they'd spin up bulky instances for different jobs with none of those running when nobody was working. It sounds like this might give a building block to build that in Github Actions.

elamje 3 years ago

I wonder if/when GitHub is going to start offering a Heroku-like service or full IaaS. It seems like an incredible opportunity to slap GitHub's branding on top of a subset of Azure's infrastructure and try to beat Heroku or AWS.

smcleod 3 years ago

The (previous) lack of ephemeral runners was one of my few gripes with GitHub Actions, great to see it's been released!

hirako2000 3 years ago

My main gripe with Github is that it's pushing more tie-in features to own the development experience even further. Github started as a community development hub, and is now trying to swallow us all, owning each bit of the development process to then own the market.

I don't have any affection for aws or gcp either, their attempt to dominate as de facto infrastructure and software provider is scary.

We don't need github actions. Spinning up machines that run cook books that can do anything, even at scale is ultimately more flexible and platform agnostic. If that's time consuming to make it work at scale, providers dedicated to that are out there.

sascha_sl 3 years ago

It's an option, nothing more.

My main gripe with it is that it has the same effect as MS Teams: Execs see that a new product enters the market, with a vendor they already have agreements with, and it's either bundled in for free or relatively cheap. Being the right solution for the job has already lost at that point.

spockz 3 years ago

This phenomenon is also referred to as “best of suite” over “best of breed”.