Story of failure: Attempts at self-hosting a Bitwarden enterprise organization
We recently attempted to set up a self-hosted Bitwarden instance on OpenShift for our enterprise organization. As we didn’t find any blog posts talking about this and ran into a few issues, we wanted to share our experience – even though, in the end, we decided against going forward with a self-hosted Bitwarden solution and chose Bitwarden’s SaaS solution instead.
Introduction and Requirements
Back in 2008, when AWS’s EC2 was a rosy-cheeked 2-year-old and the Trollface meme was first created, Puzzle’s in-house password manager Cryptopus was born. This open-source project has been internally used and maintained since then.
Over the years, open-source, web-based password managers such as Bitwarden (first released on August 10, 2016) have become more popular for both private and business use. In the same time span, Puzzle has grown from 35 members in 2008 to 138 members in 2024, making the maintenance of Cryptopus increasingly complex and eventually causing it to fall behind in terms of features. Since the founding of Puzzle’s security branch in 2020, which became the /security division in 2022, one outstanding task has been to replace our legacy, in-house password manager Cryptopus.
We therefore took a look at the landscape of available password managers, with the following wishlist in mind:
- open-source
- self-hostable on OpenShift and therefore runnable without privileged containers
- supports SSO
- can be used simultaneously by multiple users and supports teams
- (ideally) with an officially supported deployment method
- If self-hosting isn’t an option, the data should at least be in Switzerland or the EU
After comparing open-source solutions, especially their ease of installation on OpenShift, we quickly came to the conclusion that none of them would work out of the box with OpenShift. Since Bitwarden was the most mature solution, we decided to explore building our own deployment based on existing solutions.
Challenges
The first step was to evaluate the official Bitwarden Helm chart. After all, that would be the officially supported way to deploy it. The initial, naïve deployment attempt resulted in failure – and the reasons for that could be found in the README. The chart simply deploys the same containers that are used to run Bitwarden on Docker, and these containers require extra capabilities that are not included in the purposefully restricted profile that OpenShift uses.
By default, OpenShift provides containers with a minimal set of capabilities. In this particular case, the entry points for the official Bitwarden images expect to be run as a privileged user.
This left us with two options: either increase the capabilities of the container (which was vetoed by the infrastructure team), or examine where the permissions were actually required and then adjust the Helm chart.
An additional challenge we faced was that the Helm chart expected a persistent volume storage class with the ReadWriteMany (RWX) access mode. Our infrastructure for production does not support this. It is unfortunate that Bitwarden does not support a more suitable cloud-ready solution (e.g., S3-compatible object stores). Most storage solutions only support ReadWriteOnce, and generally with good reason. It would remove a significant compatibility barrier if Bitwarden switched.
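To make the access-mode mismatch concrete, here is the kind of claim the chart’s persistence templates boil down to – the resource names, storage class, and size below are ours, purely for illustration:

```yaml
# Illustrative only: a PVC requesting the ReadWriteMany access mode, which our
# production storage classes cannot provide; ReadWriteOnce is all we can offer.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: bitwarden-shared-data
spec:
  accessModes:
    - ReadWriteMany                        # shared read-write across nodes/pods
  storageClassName: rwx-capable-storage    # hypothetical storage class name
  resources:
    requests:
      storage: 10Gi
```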
Another complication was the database, Microsoft SQL Server, as we have no expertise (nor interest) in operating it, which makes using it for such an important service a difficult proposition.
To summarize, the initial challenges:
- Container Privileges
- No ReadWriteMany Storage Class
- No Microsoft SQL Server
Solutions
No Microsoft SQL Server
While looking for solutions to incorporate an alternative relational database management system, we found the Bitwarden unified deployment image. This alternative Bitwarden image is still in beta, but it promised to enable the user to “Utilize different database solutions such as MSSQL, PostgreSQL, SQLite, and MySQL/MariaDB”, enabling us to use our in-house managed PostgreSQL. And it did, despite still being in beta (which made us wonder why it still was, but we dismissed any gnawing doubts).
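For reference, pointing the unified image at an external PostgreSQL comes down to a handful of environment variables. A minimal sketch based on the unified image’s documented BW_DB_* settings; the hostname, database name, and the bitwarden-db Secret are placeholders of ours:

```yaml
# Sketch: env entries on the unified container (host, database, and Secret
# names are placeholders).
env:
  - name: BW_DB_PROVIDER
    value: "postgresql"
  - name: BW_DB_SERVER
    value: "postgres.example.internal"
  - name: BW_DB_DATABASE
    value: "bitwarden"
  - name: BW_DB_USERNAME
    valueFrom:
      secretKeyRef:
        name: bitwarden-db
        key: username
  - name: BW_DB_PASSWORD
    valueFrom:
      secretKeyRef:
        name: bitwarden-db
        key: password
```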
ReadWriteMany Storage Class
The unified deployment image (which allows us to use our in-house PostgreSQL) solved the ReadWriteMany storage class problem as well, as all the services that previously ran as separate pods are now deployed in a single container, which meant that ReadWriteOnce would work fine. To illustrate, here’s an overview of which volumes are actually shared within the container and which services mount them:
| | “dataprotection” | “licenses” | “attachments” | “applogs” |
|---|---|---|---|---|
| Pod “admin” | ✅ | ✅ | ✅ | |
| Pod “api” | ✅ | ✅ | ✅ | ✅ |
| Pod “attachments” | ✅ | | | |
| Pod “events” | ✅ | | | |
| Pod “icons” | ✅ | | | |
| Pod “identity” | ✅ | ✅ | ✅ | |
| Pod “notifications” | ✅ | | | |
| Pod “scim” | ✅ | ✅ | ✅ | ✅ |
| Pod “sso” | ✅ | ✅ | ✅ | |
| Pod “web” | | | | |
From a seasoned cloud engineer’s point of view, file sharing between services should not be necessary for an application with a microservice architecture. This couples the services much too tightly. A better solution for this type of shared storage would be to natively support some type of object storage, e.g., S3-compatible buckets or a database with a blob data type. Ideally both.
Container Privileges
After looking at the container and its entrypoint, we realized that the extra capabilities were primarily used for creating folders and setting permissions on files and folders. In the Helm chart, the folder creation and corresponding owner change happen in the entry point script. Changing the owner of a file or folder is always a privileged operation, so the entry point won’t work as a non-privileged user.
As a brief aside, in OpenShift the default settings will result in a randomly generated UID, but the user will have the GID set to 0. That can then be used to predictably set permissions. Below, you can see an example of how that might look.
```
$ id
uid=1002720000(1002720000) gid=0(root) groups=0(root),1002720000
```
On a positive note, the entrypoint script will always ultimately relinquish the elevated rights and start the Bitwarden services as an unprivileged user. That means that Bitwarden itself does not expect elevated rights. Hence, if we can get OpenShift to set up the files and folders for us, we should be able to run the containers.
That solved the issue of having a privileged container, but we found a new, more specific challenge.
New Challenge Unlocked: File System Permissions
If an application needs to store some data in files on the file system, persistent or ephemeral, the way to go is to mount a volume (backed by a PersistentVolume or an ephemeral volume, respectively) at the corresponding mount point and make sure the container user can access it. OpenShift achieves the latter by setting file and folder permissions as the volume is being mounted.
Quoting the documentation:
«By default, Kubernetes recursively changes ownership and permissions for the contents of each volume to match the fsGroup specified in a Pod’s securityContext when that volume is mounted. […]»
Utilizing this, we could now create a persistent volume with the ReadWriteOnce (RWO) access mode and some ephemeral volumes, for the logs and potential caching. We then mounted individual subPaths of these volumes into the container. This allowed us to control precisely which folders existed and to set the correct permissions. The entrypoint script was now largely obsolete. All we needed now was to copy the command to launch the services into the deployment manifest.
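A trimmed-down sketch of what that looked like in the pod spec – volume names, mount paths, and the image tag are illustrative, not a verbatim copy of our manifests:

```yaml
# OpenShift assigns an fsGroup from the project's allocated range (which is also
# one of the container user's supplemental groups), Kubernetes chowns each
# volume to that group on mount, and the subPath mounts ensure that exactly the
# folders Bitwarden expects exist with the right permissions – no privileged
# entrypoint required.
spec:
  containers:
    - name: bitwarden
      image: bitwarden/self-host:beta       # the unified deployment image
      command: ["/launch-bitwarden.sh"]     # hypothetical placeholder for the service launch command copied from the entrypoint script
      volumeMounts:
        - name: data
          mountPath: /etc/bitwarden/attachments
          subPath: attachments
        - name: data
          mountPath: /etc/bitwarden/licenses
          subPath: licenses
        - name: logs
          mountPath: /var/log/bitwarden
          subPath: applogs
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: bitwarden-data           # a plain ReadWriteOnce PVC
    - name: logs
      emptyDir: {}                          # ephemeral volume for logs and caches
```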
The (seemingly) last thing standing between us and Bitwarden running in our OpenShift cluster was a set of strange errors that didn’t make sense to us, so we experimented with setting certain environment variables to several different, plausible values and disabling others. These choices were based on what we saw in the official Helm chart or the sources. What seemingly worked, though, was simply unsetting them all. That silenced the errors and Bitwarden launched.
Extra Bonus Challenge: Time Zone Shenanigans
With Bitwarden up and running, we checked its features to see if everything worked as expected.
This included testing Bitwarden’s Send feature with an expiration time of one hour. “Send” is Bitwarden terminology for an expiring link to some chunk of sensitive text or a file.
However, any Send with an expiration time of one hour or less would immediately count as expired. Since we are in the GMT+1 time zone, we suspected that a local date time was being saved as a UTC timestamp. To fix this, we tried setting the time zone in the container in various ways. The log timestamps started looking good and the time zone settings were clearly taking effect, but our problem persisted.
In the end, the mismatch was introduced by our managed Postgres database. Our databases were set to the local time zone, but Bitwarden expected UTC. We fixed that with an init container that executed psql with a SET TIMEZONE TO 'UTC'; command. Not the most beautiful solution, perhaps, but simple, and it worked.
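A sketch of such an init container, with host, database name, and credentials Secret as placeholders. The sketch uses ALTER DATABASE ... SET timezone, which persists the setting for future connections; a plain session-level SET would only affect the init container’s own psql session:

```yaml
initContainers:
  - name: set-db-timezone
    image: postgres:16                      # any image with a psql client
    command:
      - psql
      - --host=postgres.example.internal    # placeholder hostname
      - --username=bitwarden
      - --dbname=bitwarden
      - --command=ALTER DATABASE bitwarden SET timezone TO 'UTC';
    env:
      - name: PGPASSWORD                    # psql reads the password from here
        valueFrom:
          secretKeyRef:
            name: bitwarden-db              # placeholder Secret
            key: password
```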
Like most problems in integration hell, this took us weeks to figure out and barely any time to fix.
Undocumented assumptions and enterprise features
However, there were other problems we discovered that were much more critical than Sends expiring too quickly.
Bitwarden distinguishes between decryption and authentication. Authentication, as always, is about identifying a user. It was clear that we would use the SSO option, connecting Bitwarden to our existing Keycloak instance. This way, our users would be protected by strong, FIDO-based authentication and not have to manage a separate account. To keep the UX smooth, we were also planning to utilize the Trusted Device Encryption (TDE) that Bitwarden offers. TDE is one of three ways that Bitwarden offers for the decryption step, with the other two being the Key Connector and a master password.
If you’re using a master password to log in, it’s an obvious choice for decryption, but for the convenience of our users, we did not want them to need a master password. They’d have to store it somewhere securely, too, and we’d rather not force them into having a second password management solution. In general, we are trying to reduce the reliance on passwords as much as we can.
The Key Connector needs to be approved by Bitwarden, introduces an extra service, and depends on having another database for the keys. That’s several more dependencies that we wanted to avoid if we could.
Therefore, we decided to use Trusted Device Encryption. The short summary of TDE is that any device (mobile app, browser, desktop app etc.) that a user logs in to, gets access to a key which is then used to decrypt the vault. The very first device has to be approved by an admin, but further devices can be approved by the user themselves, as long as the first device is of the right type – in this case, a mobile app. A new device can request approval from the admins, or ask for approval from another logged-in device, which sends a notification to the approving device.
Our setup seemingly worked, especially after we fixed the timezone issue, up to the point of attempting to add a new device. Our Android-based phones would show the expected notification and allow us to confirm or deny the login attempt. However, both actions (i.e. Accept and Refuse the login attempt) resulted in the same outcome: Nothing. The requesting device remained stuck.
The logs were unhelpful, logging almost no details about what was going on. There was an error related to registering push notifications that seemed promising, but the logs only noted that a 400 Bad Request response was received from the Bitwarden Push Relay. The push notification seemed to go through fine regardless, and a lot of other requests surrounding the attempt were logged as successful.
While troubleshooting, we also started using iOS devices – just to see if there was a difference in behavior. Indeed, there was, and it was that they worked even worse. The initial admin approval simply did not work. They would get to the step of requesting admin approval, the admin would see the request in the admin console, approve it… and the device would simply not do anything further.
At this point, we had contacted Bitwarden support and been escalated beyond the first tier of support. Our contact did not have a lot of new information, but did suggest that perhaps the identity certificate was related to the problem with the iOS devices.
The identity certificate is not mentioned in the documentation at all, so we simply copied the approach that the official Helm chart takes and generated a self-signed, long-lived certificate. The certificate can’t be reloaded dynamically and wasn’t the certificate that would be used for TLS and in browsers, so this seemed like a reasonable approach. However, apparently it could be helpful to use a certificate that would be trusted on the devices (this might be a requirement from iOS devices).
For testing purposes, we switched the certificate out for a certificate generated by Let’s Encrypt, rather than figuring out how we would distribute a long-lived certificate to all our members and their devices. This helped! Apple devices were no longer stuck, and we had achieved parity between Apple and all other devices. However, we still could not use our devices to approve new devices, and we weren’t willing to force our admins to approve each individual device for all our users. A one-time effort for the initial device per user (and the smattering of people who would first log in on a device that couldn’t approve more devices) would be manageable, but not this.
At this point, we were on our own. Bitwarden support, while friendly, didn’t have any more information and was willing to blame the unified deployment and Postgres. We couldn’t imagine any reason for the latter, and didn’t understand why the former would be a problem.
So we decided to spend one more burst of effort on this and dug in.
Code diving
Bitwarden being (mostly) open source enabled us to clone the source code of the server and really search through it all. We found the settings to increase log levels and set those. (We aren’t sure whether the enableDevLogging setting did anything, but it is included for completeness’ sake.)
```
globalSettings__enableDevLogging=true
logging__console__loglevel__default=Trace
logging__console__loglevel__system=Trace
logging__console__loglevel__microsoft=Trace
logging__console__includeScopes=True
```
The increased log levels gave us a few more breadcrumbs, but nothing too exciting.
Eventually, we stumbled across the globalSettings__baseServiceUri__internalX set of settings. They weren’t mentioned in the documentation, but some of the setup provided for self-hosting would set them, and they were referenced in code we were looking at. They seemed to just duplicate the globalSettings__baseServiceUri__X set of settings, with several of the examples using the same values for both. Remember that we disabled some settings in the beginning (in “New Challenge Unlocked: File System Permissions”) to get rid of some pesky errors and get Bitwarden to start? Here they were again.
So we set them again, using the same values as for the non-internal variables. Bitwarden broke again. We were no longer able to log in at all, let alone use TDE (remember, that’s what we were trying to fix at the start of this!). So we started reading the code that references these variables and what they could be set to.
We found that one of the defaults would set them to <name>:5000. That probably makes sense in the original Helm chart where nginx is configured appropriately, but wouldn’t work for the unified deployment. We could set them to what seemed like the closest equivalent: localhost:500x, with each service on its own port.
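A sketch of what we ended up with – the setting names come from the globalSettings__baseServiceUri__internalX family mentioned above, while the exact port assigned to each service is illustrative:

```yaml
# Each service inside the unified container listens on its own localhost port,
# so the internal URIs point back at the container itself (ports illustrative).
env:
  - name: globalSettings__baseServiceUri__internalAdmin
    value: "http://localhost:5000"
  - name: globalSettings__baseServiceUri__internalApi
    value: "http://localhost:5001"
  - name: globalSettings__baseServiceUri__internalIdentity
    value: "http://localhost:5005"
  - name: globalSettings__baseServiceUri__internalNotifications
    value: "http://localhost:5006"
  - name: globalSettings__baseServiceUri__internalSso
    value: "http://localhost:5007"
```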
That looked better. SSO and TDE still didn’t work, but at least with master passwords (which our admin accounts still had), we could log in again.
We went back to our notes and realized that we had also removed three other values from our new setup:
- globalSettings__internalIdentityKey
- globalSettings__oidcIdentityClientKey
- globalSettings__duo__aKey
They were randomly generated, so we had assumed that we would either set them to something specific once we needed them or that they wouldn’t be necessary if we didn’t use certain features, like the Duo integration. Since Bitwarden started and did not throw any errors, we felt safe in that assumption.
We could see in the code that the internalIdentityKey was being used along with one of the internalX URIs to change some behavior, so we decided to do what the Helm chart does and randomly generated 64 alphanumeric characters. That caused a new error to show up in the log:
Client secret exceeds maximum length.
Mildly perturbed, we dug through the code to see if we could find a limitation on the length of these keys. We didn’t find anything directly, but we did find a likely library and some other traces in the code that suggested a maximum length of 30 characters. We truncated the secrets to 29 characters (just in case) and tried again.
It worked. Logging in worked, TDE worked, Bitwarden was happy. We obviously had very mixed feelings, though we were proud of having figured it out and gotten it working.
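For reference, here is one way the three values can be wired in – the Secret name is a placeholder, and the essential constraint is simply that each generated value stays below the 30-character limit:

```yaml
env:
  - name: globalSettings__internalIdentityKey
    valueFrom:
      secretKeyRef:
        name: bitwarden-keys        # placeholder Secret holding three random
        key: internalIdentityKey    # alphanumeric values of at most 29 characters
  - name: globalSettings__oidcIdentityClientKey
    valueFrom:
      secretKeyRef:
        name: bitwarden-keys
        key: oidcIdentityClientKey
  - name: globalSettings__duo__aKey
    valueFrom:
      secretKeyRef:
        name: bitwarden-keys
        key: duoAKey
```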
Lesser problems
There were other issues that were less critical but which we also weren’t able to solve.
We wanted to use the domain verification feature. This is an SSO-related feature that nicely streamlines the user experience with SSO. Rather than having to enter their email address followed by an identifier for which SSO instance should be used, users with an email address on a verified domain are sent directly to the correct login.
On the SaaS instance, it makes some degree of sense that Bitwarden needs a way to distinguish the various SSO integrations, but this flow shouldn’t exist at all on the self-hosted instance. There’s only a single SSO solution configured at any one time, so any user that chooses to log in with SSO can simply be sent there. At least using domain verification, we could bypass this flow.
However, there is no way to do this for two instances (staging and prod, or even a hot standby for disaster recovery purposes): each asks for its own random verification value, but under the same record name, so you can’t set both, and they cannot be taught to look for a different value. On top of that, our instances would occasionally simply lose the domain verification. Neither of the two ways in which Bitwarden logs things recorded anything in those cases. If we actively deleted the domain in the GUI, deactivating the domain verification, that would show up, both in the event logs (an audit log only accessible in the web UI) and the regular application logs. In the cases where it disappeared, the logs wouldn’t show anything.
Bitwarden is also very chatty in terms of emails being sent to new users, and there is no way to reduce that. The moment that you synchronize your users into Bitwarden, they will all get emails about it, even if you’re not ready yet and, e.g., still have to migrate data and set up permissions. If email isn’t configured, Bitwarden will refuse logins, so be prepared to provide an email account that pretends to send emails but in fact sends them all to /dev/null during such a migration period.
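One way to provide such a black-hole mailbox during a migration window – a sketch, not necessarily what we ran – is a throwaway SMTP sink such as MailHog inside the cluster, with Bitwarden’s SMTP settings pointed at it:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: mail-sink
spec:
  replicas: 1
  selector:
    matchLabels:
      app: mail-sink
  template:
    metadata:
      labels:
        app: mail-sink
    spec:
      containers:
        - name: mailhog
          image: mailhog/mailhog        # accepts any mail, delivers nothing
          ports:
            - containerPort: 1025       # SMTP – point Bitwarden's SMTP host/port here
            - containerPort: 8025       # web UI to inspect what would have been sent
```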
In general, migrating is tricky to test, as there is also no documented way to run a staging or even development environment without having another license and another organization. If we wanted to run a staging environment that realistically represented our users and data, we would flat out double our costs. Having a hot standby for disaster recovery purposes would seem to be the same.
That meant that we created a test organization that only had 5 seats, just enough for the security team to log in and test. We wouldn’t be able to test the directory sync that way, but at least we got a mostly-running instance. Ultimately, this didn’t help us much. The staging environment was internal-only, and that seemed to cause some issues with push notifications, so we did all of our testing on the intended-to-be-prod environment that was on a route reachable from the web.
Synchronizing users and groups with Directory Sync also revealed that the documentation refers to features that weren’t actually enabled in the current version of the application. We had to manually edit the configuration file to include groups from our LDAP and then had to avoid opening the configuration tab in the UI, as that would overwrite it again. Still, the UI nicely displays the results of a dry run, so that was helpful – until it just mysteriously failed on an actual run.
We suspect that in this case, the code to connect to Postgres is not fully mature yet, as we were able to “import” all users just fine by just creating them all individually instead of using the import call that Directory Connector uses. Similarly, it proved impossible to update Collections to grant Groups access – but doing the inverse, adding Collections to Groups, worked fine. And both worked fine through the web app, it was just the specific API call that failed. Clearly, only specific code paths dealing with the database in specific orders threw errors, but overall, Bitwarden was working, which made us think that it was just because the support of non-MSSQL databases is still beta and not fully battle-tested.
Retrospective
We spent a lot of time slowly figuring out assumptions that the Bitwarden setup makes, but which weren’t included in the documentation. Most of those problems were trivial in the grand scheme of things, like the database timezone, but they took up time that we should have spent figuring out the actually hard problems. We haven’t even attempted to run a second deployment for fail-over purposes, because after all the small things, we got stuck on essential functionality simply not working as documented.
Some of the issues were pretty clearly because the self-hosting options are not being maintained well. Sure, we misidentified the errors we got early on and simply unset some variables – but the Helm chart set them to broken values, so it wouldn’t have helped with understanding or debugging if we had left them set. We got into a less broken state by leaving them out, in fact, with fewer errors. Things were still broken, but no errors showed up, and most things worked.
Again, broader and more detailed documentation, paired with more (and more descriptive) error messages, would have helped a lot.
As we wrote above, we have some experience-based answers for those questions now, but we often can’t give a well-founded answer. The identity certificate must be trusted on the end devices, but what is it actually used for? We found some of the uses, but we don’t know if those are all of them. That would require a full audit of the code base.
At the same time, we shouldn’t need to know why for all of these issues; we should just know what we need to configure to run Bitwarden successfully. That requires the documentation to be expanded and include more of the configuration, rather than assuming that the provided Helm chart or Docker Compose file sets things up correctly. Especially as they didn’t actually set things up correctly. It’s likely that the self-hosting options were simply never tested beyond the most simple setup: a single user or perhaps a family, using master passwords and Android devices. Testing was also complicated by the fact that admins and owners are exempt from some policies, so some behavior was always going to be different for us than it was going to be for our users.
Support was friendly but ultimately couldn’t help a lot and tended to blame the unified deployment and Postgres, with no justification as to why those would have the effects we observed. Raising an issue in the community forum got us a ChatGPT response that didn’t actually engage with the problem. Reporting an issue on GitHub was also frustrating, as it first got closed based on a misunderstanding and was only actually properly triaged and accepted after we presented detailed info, including a screen recording.
Where did this leave us?
So, where did this leave us? Going back to our initial wishlist and checking our Bitwarden solution:
| Requirements | Achieved |
|---|---|
| Data stays in CH | ✅ |
| Open-Source | ✅ |
| SSO-enabled | ✅ |
| Collaboration-focused | ✅ |
| Hostable on OpenShift | 🏗️ |
| No privileged containers | 🏗️ |
| Official deployment method | 🏗️ |
| Works | ❌ |
We did not manage to solve all the issues we encountered. We had a running, self-hosted Bitwarden instance that, unfortunately, did not support all the features we were looking for. Getting even this far was far more effort than we expected, and we hadn’t even set up a failover instance; we had to add data in one way and not another, and we hadn’t been able to use Directory Connector successfully. At least we had tested our backups once during this process and found that we could successfully restore Bitwarden. Overall, this ordeal did leave us with a bit of a sour taste in our mouths.
Epilogue: Bitwarden support finally has answers
At that point, Bitwarden support escalated one of our issues even further, and we finally got official word:
The unified deployment image isn’t just a beta; it’s not meant to support enterprise deployments at all.
It might not be compatible with all features going forward, and Bitwarden would not spend effort in making TDE work, for example. We weren’t just trying to make a beta work and provide potentially useful feedback; we were trying to force something to work that was never intended to work.
That left us with the option of either going back and just using the Docker images (with privileges and all) or switching tracks and going with Bitwarden SaaS – something else entirely!
Quickly looking over competing products, we decided that we were comfortable enough with Bitwarden’s code quality and that deploying anything else would come with its own troubles, so we would switch to Bitwarden’s SaaS solution, hosted in the EU.
We’ve been using that for about two months now, and it has worked smoothly and stably for us.