NAN Archive

This page is currently under development; please excuse any issues.


== Overview ==
While no system can guarantee perfect security or trust, NAN goes to great lengths to safeguard data in the NAN Archive.


The NAN Archive consists of:
* A Postgres database holding all metadata records across the NAN portal and NDTS
* Network-attached storage (NAS) for all files associated with datasets
* Disaster recovery storage containing immutable dataset backups


== NAN Postgres Database ==
* Hosted as a virtual machine (VM) with a virtual datastore
* Replicated in near real-time to a second VM on a different physical server in a physically separate datacenter
** The secondary VM uses a distinct NAS system for added resilience
** In case of failure, the replica can be promoted to primary within minutes (see the sketch after this list)
* Hourly backups to a separate NAS system
** In the unlikely event both the primary and replica fail, the database can be rebuilt from backups, though recovery may take several hours
* All NAS systems supporting the database and backups:
** Are continuously monitored
** Provide high data durability to prevent loss from disk or node failures
** Are under vendor hardware and software support
** Take snapshots at least daily, retained for weeks, which can also be used to recover the database VM to within the snapshot time-frame
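
To illustrate the failover step, here is a minimal Python sketch of a replica health check using psycopg2 and standard Postgres monitoring functions (pg_is_in_recovery, pg_last_xact_replay_timestamp). The connection string and lag threshold are hypothetical placeholders, and this is a generic streaming-replication pattern rather than NAN's actual tooling.

<syntaxhighlight lang="python">
# Minimal sketch: confirm a Postgres streaming replica is healthy before
# promoting it to primary. Host name and threshold are hypothetical.
import psycopg2

REPLICA_DSN = "host=replica.example.org dbname=nan user=monitor"  # hypothetical
MAX_LAG_SECONDS = 30  # hypothetical alert threshold

def replica_status(dsn: str) -> tuple[bool, float]:
    """Return (is_replica, replay_lag_seconds) for the given server."""
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        # True while the server is still replaying WAL from the primary.
        cur.execute("SELECT pg_is_in_recovery()")
        in_recovery = cur.fetchone()[0]
        # Seconds since the last replayed transaction (NULL if none yet).
        cur.execute(
            "SELECT EXTRACT(EPOCH FROM now() - pg_last_xact_replay_timestamp())"
        )
        lag = cur.fetchone()[0]
    return in_recovery, float(lag or 0.0)

if __name__ == "__main__":
    is_replica, lag = replica_status(REPLICA_DSN)
    if not is_replica:
        print("Server is not in recovery: already promoted or misconfigured")
    elif lag > MAX_LAG_SECONDS:
        print(f"Replica lagging {lag:.0f}s; promoting now could lose writes")
    else:
        print(f"Replica healthy (lag {lag:.1f}s); promote with pg_ctl promote")
</syntaxhighlight>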


=== Metadata Provenance Tracking ===
* Changes to dataset metadata are captured in immutable audit tables in the NAN database
* Complete change history is preserved to ensure traceability (a sketch of this pattern follows below)
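
One common way to get this kind of append-only audit trail in Postgres is a trigger that copies every changed row into a history table that ordinary roles can insert into but never update or delete. The sketch below is illustrative only: the dataset_metadata table and its columns are hypothetical stand-ins, not NAN's actual schema.

<syntaxhighlight lang="python">
# Illustrative append-only audit table maintained by a trigger.
# Table and column names are hypothetical, not NAN's real schema.
import psycopg2

AUDIT_DDL = """
CREATE TABLE IF NOT EXISTS dataset_metadata_audit (
    audit_id   bigserial   PRIMARY KEY,
    changed_at timestamptz NOT NULL DEFAULT now(),
    changed_by text        NOT NULL DEFAULT current_user,
    operation  text        NOT NULL,
    old_record jsonb       NOT NULL
);

CREATE OR REPLACE FUNCTION record_metadata_change() RETURNS trigger AS $$
BEGIN
    -- Preserve the pre-change row; the audit table is insert-only.
    INSERT INTO dataset_metadata_audit (operation, old_record)
    VALUES (TG_OP, to_jsonb(OLD));
    IF TG_OP = 'DELETE' THEN RETURN OLD; END IF;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

DROP TRIGGER IF EXISTS dataset_metadata_history ON dataset_metadata;
CREATE TRIGGER dataset_metadata_history
    BEFORE UPDATE OR DELETE ON dataset_metadata
    FOR EACH ROW EXECUTE FUNCTION record_metadata_change();

-- Make the history effectively immutable for ordinary roles.
REVOKE UPDATE, DELETE, TRUNCATE ON dataset_metadata_audit FROM PUBLIC;
"""

with psycopg2.connect("dbname=nan") as conn:  # hypothetical DSN
    with conn.cursor() as cur:
        cur.execute(AUDIT_DDL)
</syntaxhighlight>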


== Data Storage ==


=== Primary Storage: Dell PowerScale (Isilon) A3000 ===
* Four A3000 nodes, each with 400 TB raw capacity (1.6 PB total)
* Erasure coding for data protection reduces usable capacity to ~900 TB (see the sketch after this list)
* The A3000 uses a distributed, fully symmetric OneFS clustered architecture that can scale to 252 nodes (up to 100 PB)
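
To make the capacity figures concrete, the short sketch below works out the protection overhead they imply. The 9+7 data/parity split is a hypothetical example chosen to match the stated numbers, not the actual OneFS protection level NAN runs.

<syntaxhighlight lang="python">
# Capacity arithmetic implied by the figures above; the data/parity
# split shown is illustrative, not NAN's actual protection level.
RAW_TB_PER_NODE = 400
NODES = 4

raw_tb = RAW_TB_PER_NODE * NODES  # 1600 TB = 1.6 PB raw
usable_tb = 900                   # stated usable capacity

print(f"Storage efficiency: {usable_tb / raw_tb:.0%}")  # ~56%

# An erasure-coding layout with d data and p parity stripes stores
# d / (d + p) of raw capacity as usable data. A hypothetical 9+7
# layout gives 9/16 ≈ 56%, consistent with the figure above.
d, p = 9, 7
print(f"{d}+{p} erasure coding efficiency: {d / (d + p):.0%}")
</syntaxhighlight>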
 
=== Disaster Recovery Storage: HP Scality RING ===
* Scality RING is a software platform with a distributed, peer-to-peer architecture that spreads data and metadata across multiple nodes and datacenters for high availability with no single point of failure
** NAN uses the UConn HPC facility's system, which is GeoSpread across four separate datacenters: two in Farmington, CT and two in Storrs, CT
* 14×9s data durability through erasure coding, replication, and self-healing
* A WORM (Write-Once-Read-Many) S3 bucket provides (see the sketch after this list):
** Protection from accidental or malicious deletion
** WORM leases that auto-renew quarterly
** No file removal by users, system administrators, or the hardware vendor performing maintenance
** Resilience against ransomware, since files cannot be modified or encrypted by a malicious actor
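
For illustration, here is how an object can be written with WORM semantics through the S3 Object Lock API from Python with boto3; Scality RING exposes an S3-compatible interface, but the endpoint, bucket, key, and 90-day retention below are hypothetical placeholders rather than NAN's actual pipeline.

<syntaxhighlight lang="python">
# Illustrative WORM upload via S3 Object Lock (boto3). Endpoint,
# bucket, key, and retention window are hypothetical placeholders.
from datetime import datetime, timedelta, timezone

import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="https://ring.example.edu",  # hypothetical S3-compatible endpoint
)

retain_until = datetime.now(timezone.utc) + timedelta(days=90)  # ~quarterly lease

with open("spectrum.tar", "rb") as body:
    s3.put_object(
        Bucket="nan-disaster-recovery",     # hypothetical; Object Lock must be enabled
        Key="datasets/12345/spectrum.tar",  # hypothetical object key
        Body=body,
        # COMPLIANCE mode: no one, including administrators, can delete
        # or overwrite this object version until the retention date.
        ObjectLockMode="COMPLIANCE",
        ObjectLockRetainUntilDate=retain_until,
    )

# A later "lease renewal" extends retention on the same version
# (compliance retention can be lengthened but never shortened):
# s3.put_object_retention(Bucket=..., Key=...,
#     Retention={"Mode": "COMPLIANCE", "RetainUntilDate": new_date})
</syntaxhighlight>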


=== Landing Zone: Qumulo NAS ===
* NDTS transfers data from Gateway computers to the Landing Zone, a disk partition hosted on a Qumulo NAS
* The Landing Zone is replicated in real time to a second Qumulo NAS
* Data is removed from the Landing Zone only after copies have been verified on both the primary and disaster recovery storage (see the sketch after this list)
* Thus, from the moment data arrives in the datacenter, at least two independent copies exist at all times
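
As a sketch of that verify-before-delete rule, the Python below compares checksums of the landing-zone file against the copies on primary and disaster recovery storage before removing the original. The paths and the choice of SHA-256 are hypothetical, not NAN's actual pipeline.

<syntaxhighlight lang="python">
# Minimal verify-before-delete sketch: the landing-zone copy is removed
# only after byte-identical copies are confirmed on both the primary
# and disaster recovery storage. All paths are hypothetical.
import hashlib
from pathlib import Path

def sha256_of(path: Path) -> str:
    """Stream the file through SHA-256 to avoid loading it into memory."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def release_from_landing_zone(landing: Path, primary: Path, dr: Path) -> bool:
    """Delete the landing-zone file only if both replicas verify."""
    want = sha256_of(landing)
    if sha256_of(primary) == want and sha256_of(dr) == want:
        landing.unlink()  # safe: two verified independent copies remain
        return True
    return False  # keep the landing-zone copy and flag for retry

# Hypothetical usage:
# release_from_landing_zone(Path("/landing/ds1.tar"),
#                           Path("/primary/ds1.tar"),
#                           Path("/dr/ds1.tar"))
</syntaxhighlight>
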
== Access Control & Monitoring ==
* Access is restricted using role-based Active Directory (AD) groups
* All logins require key-based SSH
* All code changes are Git-versioned with rollback support
* CrowdStrike Falcon agent monitors systems for suspicious activity
* Server logs are centrally aggregated and retained for audit
* Together, these measures support operational integrity and compliance with security protocols


== NSF Trusted CI Center of Excellence ==
* NAN participates as an NSF Trusted CI Center of Excellence
* Undergoes periodic third-party reviews
* Aligns its policies with research cyberinfrastructure best practices
