tl;dr OP is right, don't be jerks and prepare the update properly. It's YOUR data ... or YOUR head if the data is someone else's.
>> If you're sitting on the toilet and you're bored of playing mobile games, here is the full post ...
The way the OP describes the process might get one thinking that FreeNAS is a sh!tty product which explodes into radioactive particles, taking the whole neighborhood with it, upon system update. The reality is (as
@ornias already noted) that the described process more or less matches
the usual approach when talking about enterprise systems. (Which Free/TrueNAS actually is ... unless you pick the BETA or initial release - see
THIS)
And this is not just some "i read some crap so i will just copy/paste it" ... not at all. I work for a company that takes care of various (inter)national companies. I did major upgrades/migrations of systems which are crucial for company operations (you don't want your trucks/logistics getting stuck on the way, or the whole warehouse going dark. Or how about all of your terminals in every single store you have across the country not being able to accept payments?). There is no room for ANY mistakes, and some of the projects are planned for several months before pulling the trigger. Yet sometimes shit hits the fan no matter how many preventive measures you took.
So if I drill down to the OP's points and remove the relation to "FreeNAS", I can easily match them to what I do every time I am about to update something on a production system...
Read the release notes carefully, along with the Guide for the new version, section 1.1.
Oh yeah, every (fcking) time: RTFM!!! ... When I was a rookie I did something on a non-production system twice without fully reading the release notes/installation docs. It worked fine. The third time I screwed up the system and we had to restore the system DB. Yeah, been there as well.
Prepare an action plan to mitigate all the gotchas in your specific installation.
I don't like project managers, their Excel sheets, their status meetings to which I keep being invited, their questions when they won't understand the techie answers anyway, ... etc. BUT I respect the need for a proper project plan. So yeah, even a couple of bullets written on paper/in Notepad that you actually look at IS a project plan. You can easily spot a gap and mitigate a potential issue.
Actually I did that HERE when I was about to update my FreeNAS 9.10 to 11.2.
Have printed screenshots of all relevant configuration screens, in case your configuration backup doesn't work, and you need to re-enter everything again by hand. Make sure these are current.
Okay "
printed" i don't like as i see this as a waste of paper (i am not some ecologic activist but i just don't understand the reason of printing something what i can easily view/show on my PC/Notebook. I saw ppl printing a fcking email conversation and distributing the copies on meeting ... USE THE FCKING BEAMER BEHIND YOU ... you moron! And forward the email to all participants if necessary). Anyway ... just make a screenshots. Or be sure you can export the configuration
and you're able to read it outside the system! Some things tends to break/reset to default upon update (in general, not related to FreeNAS) so it is better to have screenshots so you could easily re-do the config w/o spending whole night on it (like when you initially set it up 2 years ago...)
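If you want to go one step further than screenshots, you can stash a copy of the config itself. Here is a minimal Python sketch of that idea; the config DB path /data/freenas-v1.db matches FreeNAS 9/11 as far as I know, and the destination dataset is just a made-up example - adjust for your box:

```python
#!/usr/bin/env python3
# Minimal sketch: keep a timestamped copy of the FreeNAS config DB
# before an update. The /data/freenas-v1.db path matches FreeNAS 9/11;
# the destination dataset is a made-up example.
import shutil
import sqlite3
from datetime import datetime
from pathlib import Path

CONFIG_DB = Path("/data/freenas-v1.db")       # assumed config location
DEST_DIR = Path("/mnt/tank/backups/config")   # hypothetical dataset

def backup_config() -> Path:
    DEST_DIR.mkdir(parents=True, exist_ok=True)
    stamp = datetime.now().strftime("%Y%m%d-%H%M%S")
    dest = DEST_DIR / f"freenas-v1-{stamp}.db"
    shutil.copy2(CONFIG_DB, dest)
    # Sanity check: the copy must open as a valid SQLite database
    con = sqlite3.connect(str(dest))
    con.execute("PRAGMA integrity_check;")
    con.close()
    return dest

if __name__ == "__main__":
    print(f"Config saved to {backup_config()}")
```

The point of the integrity check is simple: a config copy you cannot open is as useful as no copy at all.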
Backup your pool, in case you need to reconstitute it.
Backup, backup, backup ... yes, we do backups of everything. Hell, I even copy certain directories somewhere else before specific actions, even though I know there are full FS backups in place for all of the filesystems on that host.
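For a ZFS pool the usual recipe is a recursive snapshot plus zfs send. A rough sketch of what I mean (the pool name and target path are made up, and for a big pool you would rather pipe this over ssh to a second ZFS box than into a file):

```python
#!/usr/bin/env python3
# Minimal sketch of a pre-update pool backup: recursive snapshot,
# then stream it into a file. "tank" and the target path are made up.
import subprocess
from datetime import datetime

POOL = "tank"                                     # hypothetical pool name
TARGET = "/mnt/usb-backup/tank-pre-update.zfs"    # hypothetical target file

snap = f"{POOL}@pre-update-{datetime.now():%Y%m%d}"
subprocess.run(["zfs", "snapshot", "-r", snap], check=True)

# Full recursive replication stream of the snapshot into one file;
# for big pools you'd pipe this over ssh to a second ZFS box instead.
with open(TARGET, "wb") as out:
    subprocess.run(["zfs", "send", "-R", snap], stdout=out, check=True)
```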
Test your backups to make sure they work.
Oh yeah, we do regular DB and FS restore tests to validate the integrity of the backups. A few times I even requested a test system into which I could restore a recent backup so I could do a test upgrade there. Note that we usually have more than one non-prod system with the same software constellation as the production system, so we actively test the update there before we go for the prod system. Yet this is not 100% bulletproof, and things like the data in the DB or its size, or maybe a cluster setup, can make things different...
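A backup you never tried to restore is just a hope. Staying with the zfs send example above, a minimal restore test could receive the stream into a scratch dataset and spot-check a few checksums against the live pool (all names here are hypothetical):

```python
#!/usr/bin/env python3
# Minimal sketch of a restore test: receive the saved stream into a
# scratch dataset and spot-check a few file checksums against the
# live pool. All names are hypothetical.
import hashlib
import subprocess
from pathlib import Path

STREAM = "/mnt/usb-backup/tank-pre-update.zfs"  # stream from the backup step
SCRATCH = "scratch/restore-test"                # hypothetical scratch dataset

with open(STREAM, "rb") as src:
    subprocess.run(["zfs", "receive", "-F", SCRATCH], stdin=src, check=True)

def sha256(path: Path) -> str:
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

# The same relative file must hash identically on both sides
for rel in ("important/data.db", "movies/index.txt"):   # sample files
    live = sha256(Path("/mnt/tank") / rel)
    restored = sha256(Path("/mnt") / SCRATCH / rel)
    assert live == restored, f"Mismatch on {rel} - backup is NOT trustworthy!"
print("Spot checks passed.")
```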
Review the steps needed to reboot back into your current version from the Guide, section 2.5.5.
There is ALWAYS a "fallback" plan and a "point of no return". If that gets crossed and something is not working ... weeellll ... it is restore timeeee!
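On FreeNAS the first fallback line is usually the boot environment: the updater creates a new one, and the old one stays bootable. A sketch of what falling back looks like, assuming the pre-update BE is named "11.2-U8" (that name is hypothetical):

```python
#!/usr/bin/env python3
# Minimal fallback sketch: list the boot environments and point the
# next boot back at the pre-update one. "beadm" is the FreeBSD/FreeNAS
# boot-environment tool; the BE name "11.2-U8" is hypothetical.
import subprocess

# Show what we can fall back to
subprocess.run(["beadm", "list"], check=True)

# Re-activate the old, known-good environment for the next boot
subprocess.run(["beadm", "activate", "11.2-U8"], check=True)

# ... then reboot into it (commented out on purpose):
# subprocess.run(["reboot"], check=True)
```

Of course this only helps with the OS side; if the pool itself got upgraded to new feature flags past the point of no return, it really is restore time.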
All this prep work may seem like overkill, but if something goes wrong, you'll have a plan to address it, and as a last resort, revert back to the last known working configuration.
Yes and no ... it depends on how important the system/data are. I can wreck some sandbox as many times as I want, and I can always build it back from scratch or restore it from backup. With a QA/pre-prod system I cannot do that, and IF there are issues I need to fix/revert them. At least I am not under that huge time pressure from the client side. With prod systems ... as I said, there is no room for stupid mistakes (yet they happen...).
Remember, a system upgrade is a high-risk event. Many things can go wrong, and an ounce of prevention is better than a pound of cure.
We don't even have to talk about upgrades (major changes). Even a tiny patch can wreak havoc. I recall a situation from a few years ago when we were doing regular patching of kernels (not the OS but the application). We did several systems - all OK. Then we did two non-prod systems with the same constellation (on Monday and Wednesday). The client confirmed that everything was OK and we had a GO for the production system during the weekend. We did it, and we had confirmation from two people on the client side that everything was fine: system stable, data consistent, communication working, etc. ... our checklist was all green, and the one on the other side as well. We called it a day around 11 PM and went to bed. The sh*tstorm started on Monday morning when people actually came to the office and put load on the system. First it went slow, then it crashed (more than once). I was called in for an immediate check, followed by a war-room meeting with high management asking WTF was going on. We investigated, collected data, and immediately rolled back to the previous version. Sadly (luckily?!) we found data corruption, so we had to restore a ~15 TB database.
All of the collected data was sent to vendor support, as from our side everything had been done properly. Well, we had hit a very nasty bug which nobody had faced yet. The version passed the vendor's release tests, caused no issues across a dozen other clients who updated as well, and also passed our test updates. Yet we were the winners... A specific DB version, combined with specific data within the DB and actions performed by end users, caused unexpected behavior which wrecked the system within a few minutes and caused data loss.
The vendor immediately pulled the patch release, and a VeryHigh-priority announcement was sent to all clients that there was a risk of data loss, with further details about mitigation.
See ... this is similar to the nasty bug in 11.3-U2 where you could actually lose data. Just for the record: 11.3-U2 was released 7.4.2020, the bug was reported 17.4.2020, and 11.3-U2.1 was released 22.4.2020, so 5 days later. I am not sure whether 11.3-U2 was pulled from the update servers, but the issue was heavily discussed on the forums in the relevant section. Personally, I watch the "Installation and Updates" subforum closely when I am about to update my system, and I check the recent posts/threads a second before I click "Restart and apply update".
So yeah, even minor updates/patches can cause mayhem... and I am not even speaking about ransomware infecting 3500 workstations and a few hundred servers (maybe you've heard about HYDRO last year?).
Sorry for the quite long post, I just wanted to share my view here. A decade ago I would have just hit the button and gone for a beer or something. I grew up over the years ... And moreover, my wife would most likely kill me if I lost her collection of black-and-white movies from the early days of Czech cinematography...
--------------
//EDIT:
I never did any kind of this prep on Windows or Mac OS in 30 years and never lost any data or had to "roll back" or these kind of things.
Hence my disappointment. Clearly I just had wrong expectations about the mission of FreeNAS.
See, and I have a different experience, AND Windows is more of a minor OS in our environment (it is mostly AIX and SLES). Yet I saw Windows hosts BSODing after regular patching the moment the MSSQL services started. I saw a native Windows cluster silently screwed up because the network card arrangement switched places for NO bloody reason! Sadly nobody noticed (or even thought that this could happen from a simple patchset which had nothing to do with networking in the first place...). We found out the hard way when one of the hosts was shut down by the clusterware due to a HW issue. All of the systems were moved to the other cluster node prior to the emergency stop, yet some of them failed to start. Well, our Windows admins found that the NICs were messed up. It was fixed quickly, but the retrospective investigation into WHY it happened led to the one patchset which had messed it up...
So yes, even the MAJOR players on the enterprise playing field (Microsoft, IBM, Oracle, SAP, SUSE, ...) make mistakes. So pardon me when I get triggered because someone shares his disappointment with FreeNAS over the fact that he should have done a few precautionary tasks prior to updating the system...