Difference between revisions of "Hardware"

6,294 bytes added ,  20:35, 27 June 2020
m (Formatted NVMe low level formatting)
(10 intermediate revisions by the same user not shown)
Line 149: Line 149:


As of 2017, solid state storage devices using NAND flash with PCI-E interfaces are widely available on the market. They are predominantly enterprise drives that use an NVMe interface, which has lower overhead than the ATA used in SATA or the SCSI used in SAS. There is also an interface known as M.2 that is primarily used by consumer SSDs, although not necessarily limited to them. It can provide electrical connectivity for multiple buses, such as SATA, PCI-E and USB. M.2 SSDs appear to use either SATA or NVMe.
== NVMe low level formatting ==
Many NVMe SSDs support both 512-byte sectors and 4096-byte sectors. They often ship with 512-byte sectors, which are less performant than 4096-byte sectors. Some also support metadata for T10/DIF CRC to try to improve reliability, although this is unnecessary with ZFS.
For best performance, NVMe drives should be [https://filers.blogspot.com/2018/12/how-to-format-nvme-drive.html formatted] to use 4096-byte sectors without metadata before being given to ZFS, unless they indicate that 512-byte sectors are as performant as 4096-byte sectors, which is unlikely. Lower numbers in the Rel_Perf column of the Supported LBA Sizes table from `smartctl -a /dev/$device_namespace` (for example `smartctl -a /dev/nvme1n1`) indicate higher performance low level formats, with 0 being the best. The current format is marked by a plus sign in the Fmt column.
You may format a drive using `nvme format /dev/nvme1n1 -l $ID`. The $ID corresponds to the Id field value from the Supported LBA Sizes SMART information.
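The selection described above can be sketched in Python. This is only an illustration under stated assumptions: the sample text mimics the Supported LBA Sizes table that `smartctl -a` typically prints for an NVMe namespace, and the helper names `parse_lba_formats` and `pick_format` are hypothetical. The script only prints a suggested command and never runs it, since formatting a namespace destroys the data on it.

```python
import re

# Illustrative sample only; real output from your drive will differ.
SAMPLE = """\
Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 +     512       0         2
 1 -    4096       0         1
 2 -    4096       8         1
"""

# One table row: Id, Fmt marker (+ means current), data size, metadata size, Rel_Perf.
ROW = re.compile(r"^\s*(\d+)\s+([+-])\s+(\d+)\s+(\d+)\s+(\d+)\s*$")


def parse_lba_formats(text):
    """Return one dict per row of the Supported LBA Sizes table."""
    formats = []
    for line in text.splitlines():
        m = ROW.match(line)
        if m:
            ident, mark, data, meta, perf = m.groups()
            formats.append({
                "id": int(ident),
                "current": mark == "+",
                "data": int(data),
                "metadata": int(meta),
                "rel_perf": int(perf),
            })
    return formats


def pick_format(formats):
    """Pick the metadata-free format with the lowest Rel_Perf value.

    Ties are broken in favour of the larger sector size.
    """
    candidates = [f for f in formats if f["metadata"] == 0]
    if not candidates:
        return None
    return min(candidates, key=lambda f: (f["rel_perf"], -f["data"]))


if __name__ == "__main__":
    best = pick_format(parse_lba_formats(SAMPLE))
    if best is not None and not best["current"]:
        # /dev/nvme1n1 is the example namespace used in the text above.
        print(f"suggested: nvme format /dev/nvme1n1 -l {best['id']}")
```

To try it on a real drive, replace `SAMPLE` with the captured output of `smartctl -a` for the namespace in question and review the suggestion by hand before formatting anything.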
== Power Failure Protection ==


Line 167: Line 176:
* Samsung XS1715
* Toshiba ZD6300
* Seagate Nytro 5000 M.2 (XP1920LE30002 tested; '''''read notes below before buying''''')
** Inexpensive 22110 M.2 enterprise drive using consumer MLC that is optimized for read-mostly workloads.
** The [https://www.seagate.com/www-content/support-content/enterprise-storage/solid-state-drives/nytro-5000/_shared/docs/nytro-5000-mp2-pm-100810195d.pdf manual] for this drive specifies airflow requirements. If the drive does not receive sufficient airflow from case fans, it will overheat at idle. Its thermal throttling will severely degrade performance, such that write throughput will be limited to 1/10 of the specification and read latencies will reach several hundred milliseconds. Under continuous load, the device will continue to become hotter until it suffers a "degraded reliability" event where all data on at least one NVMe namespace is lost. The NVMe namespace is then unusable until a secure erase is done. Even with sufficient airflow under normal circumstances, data loss is possible under load following the failure of fans in an enterprise environment. Anyone deploying this into production in an enterprise environment should be mindful of this failure mode.
** Those who wish to use this drive in a low airflow situation can work around this failure mode by placing a passive heatsink such as [https://smile.amazon.com/gp/product/B07BDKN3XV this] on the NAND flash controller. It is the chip under the sticker closest to the capacitors. This was tested by placing the heatsink over the sticker (as removing it was considered undesirable). The heatsink will prevent the drive from overheating to the point of data loss, but it will not fully alleviate the overheating situation under load without active airflow. A scrub will cause it to overheat after a few hundred gigabytes are read. However, the thermal throttling will quickly cool the drive from 76 degrees Celsius to 74 degrees Celsius, restoring performance.
*** It might be possible to use the heatsink in an enterprise environment to provide protection against data loss following fan failures. However, this was not evaluated. Furthermore, operating temperatures for consumer NAND flash should be at or above 40 degrees Celsius for long term data integrity. Therefore, the use of a heatsink to provide protection against data loss following fan failures in an enterprise environment should be evaluated before deploying drives into production to ensure that the drive is not overcooled.
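A rough way to keep an eye on the temperature range discussed in the notes above is sketched below in Python. It assumes that `smartctl -a` prints a line of the form "Temperature: NN Celsius" for the drive, which may not hold for every smartmontools version or device. The two thresholds are taken from the notes (throttling observed at 76 degrees Celsius, 40 degrees Celsius as the lower bound for consumer NAND) and are not vendor limits. The function names are hypothetical.

```python
import re

# Thresholds taken from the notes above, not from a datasheet.
THROTTLE_C = 76       # throttling was observed at this temperature
MIN_RETENTION_C = 40  # consumer NAND should run at or above this

# Assumed line format, e.g. "Temperature:              45 Celsius".
TEMP_LINE = re.compile(r"^Temperature:\s+(\d+)\s+Celsius", re.MULTILINE)


def read_temperature_c(smart_text):
    """Extract the composite temperature, or None if the line is absent."""
    m = TEMP_LINE.search(smart_text)
    return int(m.group(1)) if m else None


def classify(temp_c):
    """Map a temperature onto the three ranges described in the notes."""
    if temp_c >= THROTTLE_C:
        return "throttling likely"
    if temp_c < MIN_RETENTION_C:
        return "possibly overcooled"
    return "ok"
```

Feeding the captured SMART output to `read_temperature_c` and the result to `classify` during a scrub gives a quick indication of whether a heatsink or fan change moved the drive into the intended range.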


=== SAS drives with power failure protection ===
Line 227: Line 241:
Ensuring that computers are properly grounded is highly recommended. There have been cases in user homes where machines experienced random failures when plugged into power receptacles that had open grounds (i.e. no ground wire at all). This can cause random failures on any computer system, whether it uses ZFS or not.


Power should also be relatively stable. Large dips in voltages from brownouts are preferably avoided through the use of UPS units or line conditioners. Systems subject to unstable power that do not shut down outright can exhibit undefined behavior. PSUs with longer hold-up times should be able to provide partial protection against this, but hold-up times are often undocumented and are not a substitute for a UPS or line conditioner.
 
== PWR_OK signal ==
 
PSUs are supposed to deassert a PWR_OK signal to indicate that provided voltages are no longer within the rated specification. This should force an immediate shutdown. However, the system clock of a developer workstation was observed to deviate significantly from the expected value during a series of ~1 second brownouts. This machine did not use a UPS at the time, but the PWR_OK mechanism should have protected against this. The observation of the PWR_OK signal failing to force a shutdown, with adverse consequences (to the system clock in this case), suggests that the PWR_OK mechanism is not a strict guarantee.
 
== PSU Hold-up Times ==
 
A PSU hold-up time is the amount of time that a PSU can continue to output power at maximum output within standard voltage tolerances following the loss of input power. This is important for supporting UPS units because [https://www.sunpower-uk.com/glossary/what-is-transfer-time/ the transfer time] taken by a standard UPS to supply power from its battery can leave machines without power for "5-12 ms". [https://paginas.fe.up.pt/~asousa/pc-info/atxps09_atx_pc_pow_supply.pdf Intel's ATX Power Supply design guide] specifies a hold-up time of 17 milliseconds at maximum continuous output. The hold-up time is an inverse function of how much power is being output by the PSU, with lower power output increasing hold-up times.
 
Capacitor aging in PSUs will lower the hold-up time below what it was when new, which could cause reliability issues as the equipment ages. Machines using substandard PSUs with hold-up times below the specification therefore require higher-end UPS units to ensure that the transfer time does not exceed the hold-up time. A hold-up time below the transfer time during a transfer to battery power can cause undefined behavior should the machine not entirely power off.
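The comparison between hold-up time and transfer time can be sketched in Python as follows. The text only says that hold-up time is an inverse function of output power; treating it as strictly inversely proportional to load is a simplifying assumption made here, and both function names are hypothetical. The 17 ms and 12 ms figures in the usage note come from the sources cited above.

```python
def estimate_hold_up_ms(rated_hold_up_ms, max_output_watts, load_watts):
    """Estimate hold-up time at a given load.

    Assumption (not from the PSU specification): hold-up time scales
    inversely with output power, so half the load doubles the time.
    """
    if load_watts <= 0 or max_output_watts <= 0:
        raise ValueError("power values must be positive")
    return rated_hold_up_ms * max_output_watts / load_watts


def transfer_is_covered(hold_up_ms, transfer_ms):
    """True when the PSU can ride through the UPS transfer to battery."""
    return hold_up_ms > transfer_ms
```

For example, a PSU that meets the 17 ms figure at its maximum output covers a 12 ms transfer, while an aged or substandard unit with only 10 ms of hold-up at the same load does not. Real hold-up times should be measured or obtained from the manufacturer rather than estimated this way.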
 
If in doubt, use a double conversion UPS unit. Double conversion UPS units always run off the battery, such that the transfer time is 0. The exception is high efficiency models that are hybrids between standard UPS units and double conversion UPS units, although these are reported to have much lower transfer times than standard UPS units. You could also contact your PSU manufacturer for the hold-up time specification, but if reliability for years is a requirement, you should use a higher-end UPS with a low transfer time.
 
Note that double conversion units are at most 94% efficient unless they support a high efficiency mode, which adds latency to the time to transition to battery power.
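To put the efficiency figure in perspective, the conversion loss for a given load can be worked out with a few lines of Python. This is plain arithmetic on the 94% figure quoted above; the function name is hypothetical. Because 94% is an upper bound on efficiency, the result is a lower bound on the loss.

```python
def double_conversion_loss_watts(load_watts, efficiency=0.94):
    """Power lost in the UPS while delivering load_watts to the machine."""
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in (0, 1]")
    # Input power needed is load / efficiency; the difference is lost as heat.
    return load_watts / efficiency - load_watts
```

A 300 W load on a 94% efficient unit draws roughly 319 W from the wall, so about 19 W is lost continuously.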