Reliability | 2i2c

Protecting our hubs against the CopyFail kernel exploit

Mon, 04 May 2026 00:00:00 +0000

The recently disclosed CopyFail Linux kernel zero-day (CVE-2026-31431) opens up a way for code running inside a container to break out onto the underlying node. We took a close look at our hubs to check whether they were exposed, found that they are likely not at risk, and added another layer of protection just in case.

Are 2i2c’s hubs at risk? #

No - based on our testing and mitigation efforts, our hubs are not vulnerable to CopyFail.

Why do we think we’re not at risk? #

We tried to reproduce the exploit on a staging hub by following the public Kubernetes proof-of-concept on both AWS and EKS, and the exploit was unable to break out of the container.
Existing JupyterHub hardening on Kubernetes from jupyterhub/kubespawner#545 (originally added by Yuvi in 2021 in response to a different security issue) had already significantly reduced our risk exposure, and the exposure of anyone else running Z2JH (the standard way to deploy JupyterHub on Kubernetes).
As an extra layer of protection, we deployed copyfail-ebpf-k8s across our hubs in 2i2c-org/infrastructure#8227. It blocks the specific kernel feature that CopyFail depends on. See the project’s explanation for how that works.
We’ve upgraded all GKE clusters to use a patched image in 2i2c-org/infrastructure#8230.

What else did we look into #

Deckhouse’s mitigation was too platform-specific for us.
OVHcloud’s modprobe blocking likely won’t work on Amazon Linux 2023, since the relevant module is built into the kernel image.
AL2023 security advisories - no patched AL2023 image is available yet, so we can’t rely on a kernel-level fix from AWS for now.

Acknowledgements #

Huge thanks to Georgiana for the deep dive into the exploit and whether we’re exposed here.
Thanks to Yuvi for the PR that reduces JupyterHub’s exposure to this back in 2021!
Thanks to iwanhae for the eBPF daemonset we deployed in Kubernetes, and to JupyterHub for the upstream kubespawner hardening that lowered our exposure.
Thanks to our collaborators at NASA VEDA for the ongoing conversations about hub security.

Upgrading community infrastructure to Kubernetes 1.34 and JupyterHub 4.3.3

Wed, 08 Apr 2026 00:00:00 +0000

We’ve completed a major round of infrastructure upgrades across all 2i2c-managed hubs - every hub is now running Kubernetes 1.34 and Z2JH helm chart 4.3.3.

Running up-to-date versions of both Kubernetes and the JupyterHub helm chart ensures that our communities get the best support and reliability, both in terms of features and security.

A new approach to infrastructure upgrades: upgrading in rounds #

This was the first time we rolled out JupyterHub helm chart upgrades in rounds rather than all at once. By upgrading a subset of hubs at a time, we could identify and fix issues in isolation before they affected the broader network. This made the process safer and more predictable.
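
To make the idea concrete, here is a minimal sketch of a rounds-based rollout. It is not our deployer code: the hub names, the `round_size` parameter, and the `upgrade` callable are all hypothetical, and the real process involves human review between rounds.

```python
from typing import Callable, List


def split_into_rounds(hubs: List[str], round_size: int) -> List[List[str]]:
    """Split the hub list into consecutive rounds of at most round_size hubs."""
    if round_size < 1:
        raise ValueError("round_size must be at least 1")
    return [hubs[i:i + round_size] for i in range(0, len(hubs), round_size)]


def upgrade_in_rounds(
    rounds: List[List[str]], upgrade: Callable[[str], bool]
) -> List[str]:
    """Upgrade round by round and stop before the next round if any hub failed.

    Returns the hubs that were upgraded successfully.
    """
    done: List[str] = []
    for hubs in rounds:
        results = {hub: upgrade(hub) for hub in hubs}
        done.extend(hub for hub, ok in results.items() if ok)
        if not all(results.values()):
            break  # fix the issue in isolation before touching later rounds
    return done


# Hypothetical hub names, purely for illustration.
rounds = split_into_rounds(["staging", "hub-a", "hub-b", "hub-c", "hub-d"], 2)
# Pretend that upgrading "hub-b" fails.
upgraded = upgrade_in_rounds(rounds, upgrade=lambda hub: hub != "hub-b")
print(rounds)
print(upgraded)
```

In this example the failure in the second round means the third round is never attempted, which is the property that keeps a problem from spreading to the broader network.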

We’re planning to perform these kinds of upgrades on a regular schedule for our member communities. Around every 6 months we’ll create an issue to make sure nothing falls through the cracks (here’s an example config for creating our reminder issues).

Check out our process docs for multi-hub upgrades for more information.

Learn more #

Check out these pages for what kinds of improvements we’ve brought into our clusters / hubs with these latest updates.

Acknowledgements #

Thanks to Georgiana Dolocan for leading this upgrade effort and establishing the rounds-based approach.
Thanks to Chris Holdgraf for adapting and editing Georgiana’s notes into a blog post.

Improving our community hub reliability and stability in Q4 2025

Tue, 16 Dec 2025 00:00:00 +0000

This year we’ve prioritized making the cloud safe to try for our member communities. This has driven work in monitoring, alerting, and automating infrastructure so that we resolve small problems before they become big problems. In the last quarter of 2025, we wrapped up this effort by testing the following hypothesis:

We can reduce P1 incidents if we shorten the time to act on current alerts and learnings from prior incidents.

Here’s what we accomplished and what we learned.

What we accomplished #

In short: we’re now much more confident in the stability of community infrastructure. Here’s a snapshot of our new incident dashboard, which shows high-level trends for the stability of our infrastructure:

See the real-time status of our community hubs at status.2i2c.org

We improved infrastructure reliability for our communities #

We made several technology and team process improvements that led to these benefits for our communities:

We are now more likely to catch outages before a community reports them to us.
We are now less likely to have an outage happen more than once, or affect more than one community, because we consistently fix the issues that cause outages.

We saw a consistent drop in critical alerts that required immediate response:

For August and September we had an average of 7 outages/month (6 from alerts, 1 from community)
In October, November, and December we had an average of 3 outages/month (9 in October, 0 in November, 1 in December, with only one of these being reported by a community)
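
For readers who want to check the arithmetic, the averages above can be reproduced with a few lines of Python. The monthly counts come from the text. August and September are only reported as an average, so that figure is used as given.

```python
from statistics import mean

# Monthly outage counts quoted above for October, November, and December.
q4_outages = [9, 0, 1]
q4_average = mean(q4_outages)

# August and September are only given as an average in the text.
aug_sep_average = 7

change = (q4_average - aug_sep_average) / aug_sep_average
print(f"Oct-Dec average: {q4_average:.1f} outages/month")
print(f"Change vs Aug-Sep: {change:.0%}")
```

Note that October accounts for nine of the ten outages in the later period, so the month-by-month trend is steeper than the quarterly average suggests.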

We became more efficient, responsive, and focused #

We also got several team benefits from this work:

We get fewer interruptions and distractions from deeper work.
We have clear assignment policies to make it clear who is responsible for acting in response to alerts.
We avoid invisible work from falling down rabbit-holes when responding to outages.
We decreased the stress and pressure of doing upgrades, making them easier to split into sprint items and more likely to get done consistently.

The improvements we made #

Infrastructure improvements #

Created a status page for all 2i2c community hubs, giving our team and communities visibility into the status of our infrastructure.
Created an alert that triggers when two servers fail to start consecutively in a 30-minute time window.
Improved deployment infrastructure so that we can roll out sub-chart upgrades to individual clusters, allowing us to roll out major changes in batches.
Removed our “configurator” application from community hubs, because it was causing more confusion than it was resolving.
Allowed servers to start even when users hit their storage quotas.
Provided a number of upgrades to Kubernetes and the support services that we run alongside each community hub.
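
The alert on consecutive server start failures listed above could follow logic like the sketch below. This is an illustration in plain Python, not our actual alerting rule: the `(timestamp, succeeded)` event shape and the function name are assumptions made for the example.

```python
from datetime import datetime, timedelta
from typing import List, Tuple

WINDOW = timedelta(minutes=30)


def should_alert(events: List[Tuple[datetime, bool]]) -> bool:
    """Return True if two back-to-back server starts failed within WINDOW.

    events is a list of (timestamp, succeeded) pairs, in any order.
    """
    ordered = sorted(events, key=lambda event: event[0])
    for (t_prev, ok_prev), (t_next, ok_next) in zip(ordered, ordered[1:]):
        both_failed = not ok_prev and not ok_next
        if both_failed and t_next - t_prev <= WINDOW:
            return True
    return False


start = datetime(2025, 12, 1, 10, 0)
print(should_alert([(start, False), (start + timedelta(minutes=20), False)]))
```

Requiring two consecutive failures, rather than one, is what keeps a single flaky start from paging anyone.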

Process improvements #

Made a team commitment to prioritize issues from incident reports and other stability-related problems.
Defined incident escalation policies using the status page to calibrate the urgency of our response to the severity of incidents.
Defined “on-call” procedures so our team knows when and how to be more responsive to outages.
Time-boxed our alert response process to avoid accidentally falling down rabbit holes for non-urgent problems.
Created a more reliable process for responding to incidents and writing incident reports.

Looking forward #

After this push around infrastructure reliability, we’re significantly more confident in the stability and transparency of our community hub infrastructure. This will deliver better service for our member communities and free up more of our time to engage with them instead of fighting infrastructure fires.

We will continue to improve our infrastructure, and have a better foundation to do so incrementally in the coming quarters. Here are a few things we’d still like to improve:

We still need to improve how reliably we complete follow-up actions from incidents (e.g., writing incident reports). When a process doesn’t fit into planning & scoping ceremonies, we struggle to follow it consistently.
We’d like to improve our testing framework for major upgrades across all hubs (e.g., Kubernetes version upgrades) to catch bugs before communities do.

Learn More #

Faster reporting of user home directory sizes

Tue, 09 Dec 2025 00:00:00 +0000

Storage quotas help users avoid running out of space unexpectedly and give administrators visibility into capacity planning. However, storage usage can change rapidly, and it’s important to have up-to-date information so that administrators know whether they are close to hitting limits.

We’ve improved how quickly hub administrators can see user home directory sizes across our JupyterHubs. This makes monitoring more responsive and adds quota limit visibility that wasn’t possible before.

Using `jupyterhub-home-nfs` for near-instant disk usage metrics #

Our existing storage monitoring tool, prometheus-dirsize-exporter, deliberately runs slowly to avoid excessive disk I/O. This meant home directory metrics could be hours out of date on systems with many users or large directories. Plus, there was no way to report user quota limits at all.

Our home directory storage is managed by jupyterhub-home-nfs, which enforces per-user quotas. It could also expose usage and limit information as Prometheus metrics using data from the underlying filesystem quota system. Because this information is already tracked by the filesystem, it’s available immediately without scanning individual files.

We made two key improvements:

Make disk usage reporting almost instantaneous. We made jupyterhub-home-nfs export total_size_bytes and hard_limit_bytes metrics to Prometheus for near-instant reporting. We used the same metric names and namespace as prometheus-dirsize-exporter for compatibility. See 2i2c-org/jupyterhub-home-nfs#76
Allow this to be used upstream in JupyterHub Grafana Dashboards so that it can support both types of disk usage reporting. This means users of the upstream JupyterHub Grafana dashboards get the same useful view about home directory usage, regardless of whether the metric comes from prometheus-dirsize-exporter or jupyterhub-home-nfs. See 2i2c-org/prometheus-dirsize-exporter#29

These changes were deployed across all our communities, so administrators can now access current home directory information within minutes regardless of directory size.
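
As a rough illustration of why shared metric names matter, the sketch below combines per-directory values from both sources and prefers the quota-backed numbers when both report the same metric. The data shapes, the `merge_sources` helper, and the `entries_count` metric are assumptions for the example. This is not how the Grafana dashboards are implemented.

```python
from typing import Dict

Samples = Dict[str, Dict[str, float]]  # directory -> {metric name: value}


def merge_sources(home_nfs: Samples, dirsize: Samples) -> Samples:
    """Combine per-directory metrics, preferring the quota-backed source."""
    merged: Samples = {}
    for directory in sorted(set(home_nfs) | set(dirsize)):
        # Start from the slower scanner, then overlay the fresher values.
        merged[directory] = {
            **dirsize.get(directory, {}),
            **home_nfs.get(directory, {}),
        }
    return merged


# Hypothetical values, for illustration only.
home_nfs = {"alice": {"total_size_bytes": 5e9, "hard_limit_bytes": 10e9}}
dirsize = {
    "alice": {"total_size_bytes": 4e9, "entries_count": 1200},
    "bob": {"total_size_bytes": 1e9},
}
merged = merge_sources(home_nfs, dirsize)
print(merged)
```

Because both exporters use the name `total_size_bytes`, a consumer can treat them interchangeably and only needs special handling for metrics that one source lacks, such as `hard_limit_bytes`.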

Home Directory Usage dashboard showing total size metrics from jupyterhub-home-nfs and other data from prometheus-dirsize-exporter

Try it out #

2i2c member organizations can try this out now. If you have access to your hub’s Grafana instance, you can see these new metrics in the Home Directory Usage dashboard:

Open your hub’s Grafana dashboard.
Go to Dashboards -> JupyterHub Default Dashboards -> Home Directory Usage.
Check the table for up-to-date total size and quota limit values.

For more details, see our docs on filesystem and disk dashboards.

Coming next #

We’d like to build on this work to enable alerting when individual users near their disk quotas. This will make it easier to more reliably track user disk usage across a community. See this issue for tracking: 2i2c-org/infrastructure#7166
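
One possible shape for that check, sketched in Python: compare each user's `total_size_bytes` against their `hard_limit_bytes` and flag anyone above a threshold. The 90% threshold and the function name are assumptions. The alert itself has not been designed yet.

```python
from typing import Dict, List


def users_near_quota(
    total_size_bytes: Dict[str, float],
    hard_limit_bytes: Dict[str, float],
    threshold: float = 0.9,  # assumed value, not a decided policy
) -> List[str]:
    """Return users whose usage is at or above threshold of their quota."""
    near = []
    for user, used in total_size_bytes.items():
        limit = hard_limit_bytes.get(user)
        # Skip users with no quota recorded (or a zero limit).
        if limit and used / limit >= threshold:
            near.append(user)
    return sorted(near)


# Hypothetical usage figures, for illustration only.
usage = {"alice": 9.5e9, "bob": 2e9, "carol": 1e9}
limits = {"alice": 10e9, "bob": 10e9}
print(users_near_quota(usage, limits))
```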

Acknowledgements #

This was a directed contribution supported by NASA VEDA to enable more proactive monitoring and alerting for hub administrators.

Fixing the mybinder.org usage analytics archive

Tue, 14 Oct 2025 00:00:00 +0000

The analytics archive at archive.analytics.mybinder.org powers the mybinder.org usage dashboards and provides a daily-published dataset that researchers and communities use to understand how Binder is being used across different domains and scientific communities.

While updating our quarterly Binder impact report, we discovered the archive index page had stopped updating. The analytics publisher was writing index files to temporary storage before uploading to Google Cloud Storage, but for some reason the upload step stopped working. We deployed a fix that eliminates the temporary files entirely - the code now generates the HTML index as a string in memory and uploads directly.
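
The general pattern looks something like the sketch below. It is not the code from the fix: the function names, the file names, and the `upload` callable are hypothetical, and the uploader is passed in so the example runs without cloud credentials.

```python
import html
from typing import Callable, Iterable


def render_index(filenames: Iterable[str]) -> str:
    """Build the archive index page as a single HTML string."""
    items = "\n".join(
        f'    <li><a href="{html.escape(name, quote=True)}">{html.escape(name)}</a></li>'
        for name in sorted(filenames)
    )
    return f"<!DOCTYPE html>\n<html>\n<body>\n  <ul>\n{items}\n  </ul>\n</body>\n</html>\n"


def publish_index(
    filenames: Iterable[str], upload: Callable[[str, str, str], None]
) -> None:
    """Render the index and hand it straight to the uploader, with no temp file."""
    upload("index.html", render_index(filenames), "text/html")


# Stand-in uploader that records what would have been sent.
uploaded = []
publish_index(
    ["events-2025-10-02.jsonl", "events-2025-10-01.jsonl"],
    upload=lambda name, body, content_type: uploaded.append((name, body, content_type)),
)
print(uploaded[0][1])
```

With the `google-cloud-storage` client, the `upload` callable could wrap `Blob.upload_from_string`, which accepts the content and a `content_type`. Keeping everything in memory removes the intermediate file as a possible point of failure.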

The mybinder.org analytics archive shows a list of daily usage reports that anybody can download.

Fortunately, we didn’t lose any data! Thanks to some smart design decisions, the daily analytics files were being collected properly the entire time; only the index page listing them was broken. You can find the full archive here.

Learn more #

Pull request with the fix
mybinder.org usage dashboards
The binder-data/ repository is where we aggregate and publish archive data to be more accessible.
Our quarterly impact report from mybinder.org

Acknowledgements #

Thanks to the JupyterHub community for their collaboration on mybinder.org infrastructure.

Combating TCP scanning on mybinder.org with the tcpflowkiller

Wed, 08 Oct 2025 00:00:00 +0000

We’ve deployed a new tool to mybinder.org that automatically detects and stops port scanning activity, helping us maintain service reliability while being responsible citizens of the internet.

Port scanning is a common part of network-based exploits, and many server hosts prohibit this activity (including Hetzner, where the 2i2c mybinder.org infrastructure lives). We developed a little tool called tcpflowkiller as part of the cryptnono project (our anti-abuse set of tools for hosted JupyterHub and Binder infrastructure) to automatically kill processes that exhibit port scanning behavior. This reduces the likelihood of triggering our server host’s abuse policies and helps keep mybinder.org running reliably.
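
To give a feel for what "port scanning behavior" can mean, here is a simplified heuristic in plain Python: flag any process that tries to reach an unusually large number of distinct destinations. This is only a sketch and is not how tcpflowkiller is implemented. The connection tuple shape, the function name, and the threshold of 50 are all assumptions.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple

# (pid, destination host, destination port) for each outgoing connection attempt
Connection = Tuple[int, str, int]


def find_scanners(
    connections: Iterable[Connection], max_targets: int = 50
) -> Set[int]:
    """Return pids that contacted more than max_targets distinct host/port pairs."""
    targets: Dict[int, Set[Tuple[str, int]]] = defaultdict(set)
    for pid, host, port in connections:
        targets[pid].add((host, port))
    return {pid for pid, seen in targets.items() if len(seen) > max_targets}


# pid 1 sweeps 100 ports on one host; pid 2 repeatedly talks to a single service.
sweep = [(1, "192.0.2.10", port) for port in range(1, 101)]
normal = [(2, "192.0.2.20", 443) for _ in range(100)]
print(find_scanners(sweep + normal))
```

Counting distinct destinations rather than total connections is what separates a sweep from a busy but legitimate client.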

Why this matters #

As providers of public compute, it’s our responsibility to make sure people can’t use our infrastructure to abuse others. This is part of being responsible citizens of the internet. It also saves us time in dealing with outages because cloud providers (understandably) block access when they suspect there is abuse.

Hetzner and similar hosts have many benefits (including significant cost savings), and tools like tcpflowkiller help keep hubs and binders running smoothly on such hosts, which have different abuse policies than the big commercial cloud providers.

AWS and other cloud providers have proprietary ways to combat abuse (like AWS GuardDuty). We could have spent our time investing in developing rules there. Instead, contributing to cryptnono helps provide the same set of features in a cloud-agnostic way, in line with our principles of supporting open infrastructure that gives communities control over their infrastructure.

This tool has now been deployed to mybinder.org, and we’ll monitor its effectiveness over time. We may roll this out to 2i2c public BinderHubs in the future based on patterns we observe.

Learn more #

Acknowledgements #

Thanks to GESIS for their continued support of mybinder.org and to Raniere Silva for collaborating on this deployment with us.

Demonstrating our infrastructure's reliability with a hub status page for our communities

Tue, 23 Sep 2025 00:00:00 +0000

One of 2i2c’s goals is to make the cloud safe for science. A big part of this is making the black box of commercial cloud infrastructure more predictable and reliable for our member communities, across our network of community hubs that all operate autonomously.

Give us feedback! Click here to provide feedback that will help us make this more impactful.

To that end, we’ve created a status page for 2i2c’s network of community hubs. This is a source of truth to provide a high-level picture of the stability of our infrastructure, let a community know if their hub is experiencing a problem, and to give us a heads up when things aren’t working as expected. You can check it out at:

👉 status.2i2c.org

The 2i2c Status Page gives communities a high-level view of the uptime for our entire network of community hubs.

While we make status more visible, we’re also streamlining our incident response processes in order to more quickly respond to outages when they occur (ideally, before a community has even noticed!).

There are still plenty of improvements we’d like to make: for example, we’re focusing on major outages right now, but would like to extend some level of reporting for degraded service, like unexpectedly slow start times.

Learn more #

👉 The status page
👉 The status page documentation
👉 Our new process for incident response
👉 Follow an in-progress initiative to improve the reliability of our infrastructure

Solving classes of problems, rather than just an instance of a problem (with an example)

Mon, 09 Jun 2025 00:00:00 +0000

The Problem #

Two of the communities we serve (NMFS Openscapes and CryoCloud) reported issues with starting GPU nodes on their hubs. Upon investigation, I discovered that the cluster autoscaler had suddenly stopped recognizing that GPUs were available in the cluster at all, and hence wasn’t provisioning the nodes. A restart of the cluster-autoscaler pod fixed the issue for both these communities.

An incomplete solution #

But is that the end of the story? Not if we want to provide reliable long term infrastructure to communities with minimal toil on the part of 2i2c engineers!

One of the engineering principles I’m trying to have us more intentionally and structurally embody is the idea that we don’t just fix individual instances of problems, but whole classes of problems. Fixing the immediate issue is not enough - we need to understand what class of issues was manifesting itself in this particular fashion, and fix that.

What was the class of issues we could fix here? #

Digging in, I realized that our version of cluster-autoscaler was a little behind the latest release. I presumed this was a bug in cluster-autoscaler (given that a restart fixed it, implying a bug in how it tracks state). To me, the class of problem here is that we were not rolling out releases to our “supporting infrastructure” fast enough. Perhaps if we were on the most recent cluster-autoscaler release, this issue would have never happened.

Additionally, this failure to scale up was reported to us by the community rather than by an automated alert. We should change that too!

Structured solutions #

We follow a two week sprint cycle, and I love the (hard won) structure it provides us. I don’t want to arbitrarily start doing work that upsets prior committed work from that structure. However, we also treat support requests seriously and try to work them into the sprint. So I timeboxed myself for one hour, and saw what I could accomplish. Turns out, a lot!

I upgraded all our support components, tested them, and rolled them out to all our communities! This included upgrading Grafana, Prometheus, nginx-ingress as well as the cluster-autoscaler. This also restarted the cluster-autoscaler across our clusters, fixing this issue for other communities (if any had it).
I re-enabled the automatic once-a-month PR for upgrading these support components. We had switched to doing them on a manual sprint cadence, but clearly that was not fast enough nor automated enough. We will instead work these into the sprint once the bot opens the PR. Credit to Erik Sundell for initially setting this up.
I created an issue to track the alert creation, and put it in our sprint backlog.
(In an additional fifteen minute timebox) I wrote this blog post, to communicate out both to the affected communities and others what we have done.

By timeboxing myself, I didn’t upset our sprint cadence and was able to continue doing other work I had committed to in the sprint, while also fixing this class of issues to the best of my ability.

Moving forward #

While we have been implicitly trying to solve whole classes of issues rather than individual instances of an issue as a team for a while, I want us to explicitly do it from now on. Communicating this out to our communities is an important part of that, as is internal team training. This blog post is the former, and we are continually working on the latter :)

Acknowledgements #

Thanks to the Openscapes and CryoCloud communities for working with us closely on infrastructure to identify improvements like this.

Simplifying and speeding up Binder builds with BuildKit

Mon, 03 Mar 2025 00:00:00 +0000

Chris and Yuvi wrote a post on the Jupyter blog about a recent experiment to significantly reduce the cost of running a node on the mybinder.org federation.

Acknowledgements #

Project Pythia provides support for some of our work with the Binder project.
Thanks to JupyterHub for working with us to get this new node deployed for mybinder.org.

Tech update: Multiple JupyterHubs, multiple clusters, one repository.

Tue, 19 Apr 2022 00:00:00 +0000

2i2c manages the configuration and deployment of multiple Kubernetes clusters and JupyterHubs from a single open infrastructure repository. This is a challenging problem, as it requires us to centralize information about a number of independent cloud services, and deploy them in an efficient and reliable manner. Our initial attempt at this had a number of inefficiencies, and we recently completed an overhaul of its configuration and deployment infrastructure.

This post is a short description of what we did and the benefit that it had. It covers the technical details and provides links to more information about our deployment setup. We hope that it helps other organizations make similar improvements to their own infrastructure.

Our problem #

2i2c’s problem is similar to that of many large organizations that have independent sub-communities within them. We must centralize the operation and configuration of JupyterHubs in order to boost our efficiency in developing and operating them, but must also treat these hubs independently because their user communities are not necessarily related, and because we want communities to be able to replicate their infrastructure on their own.

A year ago, we built the first version of our deployment infrastructure at github.com/2i2c-org/infrastructure. Over the last year of operation, we identified a number of major shortcomings:

Within a Kubernetes cluster, we deployed hubs sequentially, not in parallel. This grew out of a common practice of Canary deployments that allowed us to test changes on a staging hub before rolling them out to a production hub.
We used a single configuration file for all hubs within a cluster, which led to confusion and difficulty in identifying a hub-specific configuration.
Moreover, any change to a hub within a cluster caused a re-deploy of all hubs on that cluster. This is because we did not know whether a given change touched cluster-wide configuration or hub-specific configuration.

Our goal #

So, we spent several weeks discussing a plan to resolve these major problems - here were our goals:

We should be able to upgrade a specific hub alone, by inspecting which configuration files have been added or modified.
Production hubs should be upgraded in parallel when they are effectively run independently.
We should use staging hubs as “canary” deployments and not continue upgrading production hubs if the staging hub fails.

An overview of our changes #

To accomplish this, we needed to identify which hub required an upgrade based on file additions/modifications. This took a lot of discussion and iteration on design, and so we share it below in the hopes that it is helpful to others!

Improvements to our code and structure #

We made a few major changes to the infrastructure repository to facilitate the deployment logic described above. Here are the major changes we implemented:

We separated each hub’s configuration into its own file, or set of files. For example, here is 2i2c’s staging hub configuration.
We created a separate cluster.yaml file that holds the canonical list of hubs deployed to that cluster and the configuration file(s) associated with each one. For example, here is 2i2c’s GKE cluster configuration, which contains a reference to the previously mentioned staging hub.
We updated our deployer module to do the following things:
- Inspect the list of files modified in a Pull Request.
- From this list, calculate the name of a hub that required an upgrade, and the name of its respective cluster.
- Trigger a GitHub Actions workflow that deploys changes in parallel for each cluster/hub pair.
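
The file-to-hub mapping in the deployer steps above can be sketched as follows. This is a simplified illustration rather than the deployer's actual code: it assumes a `config/clusters/<cluster>/<hub>.values.yaml` layout and infers the hub from the file name, whereas the real deployer consults each cluster's cluster.yaml to find which files belong to which hub.

```python
from pathlib import PurePosixPath
from typing import Iterable, Set, Tuple

SUFFIX = ".values.yaml"


def hubs_to_upgrade(changed_files: Iterable[str]) -> Set[Tuple[str, str]]:
    """Map changed config files to (cluster, hub) pairs that need an upgrade."""
    pairs = set()
    for path in changed_files:
        parts = PurePosixPath(path).parts
        # Assumed layout: config/clusters/<cluster>/<hub>.values.yaml
        is_hub_config = (
            len(parts) == 4
            and parts[:2] == ("config", "clusters")
            and parts[3].endswith(SUFFIX)
        )
        if is_hub_config:
            pairs.add((parts[2], parts[3][: -len(SUFFIX)]))
    return pairs


# Hypothetical list of files modified in a Pull Request.
changed = [
    "config/clusters/2i2c/staging.values.yaml",
    "config/clusters/2i2c/cluster.yaml",
    "README.md",
]
print(hubs_to_upgrade(changed))
```

Returning a set of pairs means that several changed files for the same hub still produce a single deployment job.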

In addition to these structural and code changes, we also developed new GitHub Actions workflows that control the entire process.

A GitHub Actions workflow for upgrading our JupyterHubs #

We defined a new GitHub Actions workflow that carries out the logic described above. These are all defined in this deploy-hubs.yaml configuration file. Here are the major jobs in this workflow, and what each does:

generate-jobs: Generate a list of clusters/hubs that must be upgraded, given the files that are changed in a Pull Request.
- Evaluate an input list of added/modified files in a PR
- Decide if the added/modified files warrant an upgrade of a hub
- Generate a list of hubs and clusters that require upgrades, and some extra details:
  - Does the support chart that is deployed to the cluster also need an upgrade?
  - Does a staging hub on this cluster require an upgrade?
This produces two outputs to be used in subsequent steps:
- A human-readable table including information on why a given deployment requires an upgrade (using the excellent Rich library).
- JSON outputs that can be interpreted by GitHub Actions as sets of matrix jobs to run.
Our staging and support hub job matrix tells GitHub Actions to deploy staging and support upgrades that act as canaries and stop production deploys if they fail.
upgrade-support-and-staging: Update the support and staging Helm charts on each cluster. These are “shared infrastructure” Helm charts that control services that are shared across all hubs.
- Accepts the JSON list described above to determine what to do next
- Parallelises over clusters
- Upgrades the support chart of each if required
- Upgrades a staging hub for the cluster if required (for canary deployments, this is always required if at least one production hub is to be upgraded on the cluster)
filter-generate-jobs: Allows us to treat the support / staging hubs as canary deployments for all the production hubs on a cluster.
- If a staging/support hub deploy fails, removes any jobs for the corresponding cluster.
- Allows production deploys to continue on other clusters.
Our production hub job matrix tells GitHub Actions which hubs to update with new changes. These are triggered if a cluster’s staging/support job does not fail.
upgrade-prod-hubs: Deploy updates to each production hub.
- Accepts the JSON list described above to determine what to do next
- Parallelises over each production hub that requires an upgrade
- Deploys the relevant changes to that hub
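
As a rough sketch of the canary filtering described in the jobs above, the snippet below drops production jobs for any cluster whose staging/support deploy failed and serialises the remainder as JSON for a job matrix. The job keys (`cluster_name`, `hub_name`), the function names, and the `{"include": [...]}` wrapper are assumptions for illustration. The exact JSON our workflow emits may differ.

```python
import json
from typing import Dict, List, Set

Job = Dict[str, str]


def filter_prod_jobs(prod_jobs: List[Job], failed_clusters: Set[str]) -> List[Job]:
    """Drop production jobs whose cluster's staging/support deploy failed."""
    return [job for job in prod_jobs if job["cluster_name"] not in failed_clusters]


def to_matrix_json(jobs: List[Job]) -> str:
    """Serialise jobs in an {"include": [...]} shape that a job matrix can consume."""
    return json.dumps({"include": jobs})


# Hypothetical jobs: the canary deploy on cluster-a failed, cluster-b succeeded.
prod_jobs = [
    {"cluster_name": "cluster-a", "hub_name": "prod"},
    {"cluster_name": "cluster-b", "hub_name": "prod"},
]
remaining = filter_prod_jobs(prod_jobs, failed_clusters={"cluster-a"})
print(to_matrix_json(remaining))
```

Filtering per cluster, rather than aborting the whole workflow, is what lets production deploys continue on clusters whose canaries passed.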

Concluding Remarks #

We think that this is a nice balance of infrastructure complexity and flexibility. It allows us to separate the configuration of each hub and cluster, which makes each more maintainable by us, and is more aligned with a community’s Right to Replicate their infrastructure. It allows us to remove the interdependence of deploy jobs that do not need to be dependent, which makes our deploys more efficient. Finally, it allows us to make targeted deploys more effectively, which reduces the amount of toil and unnecessary waiting associated with each change. (It also reduces our carbon footprint by reducing unnecessary GitHub Action time).

We hope that this is a useful resource for others to follow if they also maintain JupyterHubs for multiple communities. If you have any ideas of how we could further improve this infrastructure, please reach out on GitHub! If you know of a community that would like 2i2c to manage a hub for them, please send us an email.

Acknowledgements: The infrastructure described in this post was developed by the 2i2c engineering team, and this post was edited by Chris Holdgraf.
