NAN Archive

Overview

While no system can guarantee perfect security or trust, NAN goes to great lengths to safeguard data in the NAN Archive.

The NAN Archive consists of:

  • A Postgres database holding all metadata records across the NAN portal and NDTS
  • Network-attached storage (NAS) for all files associated with datasets
  • Disaster recovery storage containing immutable dataset backups

NAN Postgres Database

  • Hosted as a virtual machine (VM) with a virtual datastore
  • Replicated in near real time to a second VM on a different physical server in a separate datacenter
    • The secondary VM uses a distinct NAS system for added resilience
    • In case of failure, the replica can be promoted to primary within minutes (see the monitoring sketch after this list)
  • Hourly backups to a separate NAS system
    • In the unlikely event both the primary and replica fail, the system can be restored from backups (recovery may take several hours)
  • All NAS systems supporting the database and backups:
    • Are continuously monitored
    • Feature high data durability
    • Are under vendor hardware/software support
    • Employ daily (or more frequent) snapshots retained for weeks, allowing recovery of VMs and recent states
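
The near real time replication above is the kind of behavior PostgreSQL streaming replication provides. The sketch below is a minimal, hypothetical monitoring check against the primary, assuming psycopg2 and the standard pg_stat_replication view; it is not NAN's actual tooling, and the hostname, database, and role are placeholders.

  import psycopg2

  # Minimal sketch: report streaming-replication lag as seen from the primary.
  # Host, database, and user are hypothetical placeholders, not NAN's real configuration.
  conn = psycopg2.connect(host="nan-db-primary.example.edu", dbname="nan", user="monitor")
  with conn.cursor() as cur:
      cur.execute("""
          SELECT application_name,
                 pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS replay_lag_bytes
          FROM pg_stat_replication;
      """)
      for name, lag_bytes in cur.fetchall():
          # A real monitor would alert (and, if the primary were lost, trigger
          # promotion of the replica, e.g. via pg_ctl promote) based on this value.
          print(f"{name}: {lag_bytes} bytes behind")
  conn.close()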

Metadata Provenance Tracking

  • Changes to dataset metadata are stored in immutable, append-only audit tables; a minimal sketch follows this list
  • Complete change history is preserved to ensure traceability
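
In PostgreSQL, this pattern is commonly implemented with a trigger that copies every change into an audit table from which updates and deletes are revoked. The sketch below is illustrative only; the table, column, and trigger names are hypothetical and do not reflect NAN's actual schema.

  import psycopg2

  # Illustrative only: a trigger-populated, append-only audit table.
  # Table and column names are hypothetical, not NAN's actual schema.
  DDL = """
  -- Hypothetical stand-in for the real metadata table.
  CREATE TABLE IF NOT EXISTS dataset_metadata (
      dataset_id  bigint PRIMARY KEY,
      title       text,
      updated_at  timestamptz DEFAULT now()
  );

  CREATE TABLE IF NOT EXISTS dataset_metadata_audit (
      audit_id    bigserial    PRIMARY KEY,
      dataset_id  bigint       NOT NULL,
      changed_at  timestamptz  NOT NULL DEFAULT now(),
      changed_by  text         NOT NULL DEFAULT current_user,
      old_row     jsonb,
      new_row     jsonb
  );

  CREATE OR REPLACE FUNCTION log_metadata_change() RETURNS trigger AS $$
  BEGIN
      INSERT INTO dataset_metadata_audit (dataset_id, old_row, new_row)
      VALUES (NEW.dataset_id, to_jsonb(OLD), to_jsonb(NEW));
      RETURN NEW;
  END;
  $$ LANGUAGE plpgsql;

  CREATE TRIGGER dataset_metadata_audit_trg
      AFTER UPDATE ON dataset_metadata
      FOR EACH ROW EXECUTE FUNCTION log_metadata_change();

  -- Append-only: ordinary roles can never rewrite or remove audit rows.
  REVOKE UPDATE, DELETE ON dataset_metadata_audit FROM PUBLIC;
  """

  with psycopg2.connect(dbname="nan") as conn, conn.cursor() as cur:
      cur.execute(DDL)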

Data Storage

Primary Storage: Dell PowerScale (Isilon) A3000

  • Four A3000 nodes, each with 400 TB raw capacity (1.6 PB total)
  • Erasure-coding overhead reduces usable capacity to ~900 TB (see the arithmetic after this list)
  • OneFS clustered architecture scales to 252 nodes (up to 100 PB)
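
As a back-of-the-envelope check of those figures, assuming roughly 44% of raw space goes to erasure-coding protection and filesystem reserve (the actual OneFS protection level is not stated here):

  # Rough capacity check; the 44% overhead is an assumption, not the
  # cluster's configured OneFS protection level.
  nodes = 4
  raw_per_node_tb = 400
  raw_tb = nodes * raw_per_node_tb         # 1600 TB = 1.6 PB raw
  overhead = 0.44                          # assumed protection + reserve overhead
  usable_tb = raw_tb * (1 - overhead)
  print(f"~{usable_tb:.0f} TB usable")     # ~896 TB, consistent with the ~900 TB above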

Disaster Recovery Storage: HP Scality RING

  • Distributed, peer-to-peer architecture spanning four datacenters in Farmington and Storrs, CT
  • Fourteen-nines (99.999999999999%) data durability via erasure coding, replication, and self-healing
  • WORM (Write-Once-Read-Many) S3 bucket (see the upload sketch after this list) ensures:
    • Protection from accidental/malicious deletion
    • Automatic lease renewal
    • No file deletions by users, admins, or vendors
    • Resilience against ransomware encryption attempts
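
Scality RING exposes an S3-compatible API, so the WORM behavior can be pictured as an S3 Object Lock upload. The boto3 sketch below is illustrative: the endpoint, bucket name, object key, and 90-day retention window are hypothetical stand-ins for NAN's renewable lease, and the bucket must have been created with Object Lock enabled.

  import boto3
  from datetime import datetime, timedelta, timezone

  # Hypothetical endpoint, bucket, key, and retention window; NAN's actual
  # lease-renewal mechanism is managed server-side and is not shown here.
  s3 = boto3.client("s3", endpoint_url="https://ring.nan.example.edu")

  retain_until = datetime.now(timezone.utc) + timedelta(days=90)
  with open("dataset_12345.tar", "rb") as body:
      s3.put_object(
          Bucket="nan-disaster-recovery",
          Key="datasets/12345/dataset_12345.tar",
          Body=body,
          ObjectLockMode="COMPLIANCE",             # cannot be weakened or removed, even by admins
          ObjectLockRetainUntilDate=retain_until,  # a renewed lease would push this date forward
      )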

Landing Zone: Qumulo NAS

  • Data from NDTS Gateways arrives at the Landing Zone
  • Hosted on Qumulo NAS and replicated in real time to a second NAS
  • Data is deleted from the Landing Zone only after its transfer to both primary and disaster recovery storage has been verified (see the checksum sketch after this list)
  • At minimum, two independent copies exist from the moment of data arrival
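
One way to picture the verify-then-delete rule is a checksum comparison across all three copies before the Landing Zone copy is removed. The paths below are hypothetical mount points, not NAN's real layout:

  import hashlib
  from pathlib import Path

  def sha256(path: Path) -> str:
      """Stream a file and return its SHA-256 hex digest."""
      digest = hashlib.sha256()
      with path.open("rb") as f:
          for chunk in iter(lambda: f.read(1024 * 1024), b""):
              digest.update(chunk)
      return digest.hexdigest()

  # Hypothetical mount points for the three copies of a dataset.
  landing = Path("/landing-zone/dataset_12345.tar")
  primary = Path("/powerscale/archive/dataset_12345.tar")
  dr_copy = Path("/scality-mirror/dataset_12345.tar")

  # Remove the Landing Zone copy only once both other copies verify.
  if sha256(landing) == sha256(primary) == sha256(dr_copy):
      landing.unlink()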

Access Control & Monitoring

  • Access is restricted using role-based Active Directory (AD) groups
  • All logins require key-based SSH (illustrated after this list)
  • All code changes are Git-versioned with rollback support
  • CrowdStrike Falcon agent monitors systems for suspicious activity
  • Server logs are centrally aggregated and retained for audit
  • These measures ensure operational integrity and compliance with security protocols
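
Key-based SSH means a login presents a private key instead of a password. A minimal client-side illustration with paramiko, using a hypothetical host, account, and key path:

  import paramiko

  # Hypothetical host, account, and key path; password authentication is never offered.
  client = paramiko.SSHClient()
  client.load_system_host_keys()
  client.connect(
      hostname="archive.nan.example.edu",
      username="nan_operator",
      key_filename="/home/nan_operator/.ssh/id_ed25519",
  )
  _, stdout, _ = client.exec_command("uptime")
  print(stdout.read().decode())
  client.close()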

NSF Trusted CI Center of Excellence

  • NAN participates as an NSF Trusted CI Center of Excellence
  • Undergoes periodic third-party reviews
  • Aligns policies with research cyberinfrastructure best practices