NDTS Daemon Operation: Difference between revisions

From Network for Advanced NMR
Created page with "= Running and Monitoring the Daemon = This page explains how to control the **data-transport-daemon** service, verify connectivity, and interpret the logs produced on each spectrometer workstation. == '''Starting, Stopping, and Checking Status''' == <pre> # Start the daemon sudo /sbin/service data-transport-daemon start # Stop the daemon sudo /sbin/service data-transport-daemon stop # Restart (reloads configuration) sudo /sbin/service data-transport-daemon restart #..."
 
No edit summary
 
(23 intermediate revisions by the same user not shown)
Line 1: Line 1:
= Running and Monitoring the Daemon =
{{NDTS_Navbox}}


This page explains how to control the **data-transport-daemon** service, verify connectivity, and interpret the logs produced on each spectrometer workstation.
== Overview ==
This page explains how to control the '''data-transport-daemon''' service, verify connectivity, and interpret the daemon’s log and audit files on every spectrometer workstation.


== '''Starting, Stopping, and Checking Status''' ==
== TopSpin version prior to 4.x ==
The daemon reads experiment start and stop times from the TopSpin account file. TopSpin accounting is not enabled by default, so it must be enabled in each workstation user's account before NAN can harvest datasets automatically. Follow these [[Enabling TopSpin Accounting|instructions]] to enable accounting.
 
== Service Control ==
<pre>
<pre>
# Start the daemon
# Start the daemon
Line 17: Line 21:
sudo /sbin/service data-transport-daemon status
sudo /sbin/service data-transport-daemon status
</pre>
</pre>
*The daemon will refuse to start if an instance is already running on the workstation.*
*Note, the daemon will not start again if another instance is already running
 
== Heartbeats and Connectivity ==
 
On a regular basis (by default, every '''10&nbsp;minutes'''), each NDTS daemon sends a '''heartbeat''' message to the Gateway. These messages serve as a continuous health check and confirm that the daemon is active and communicating. Each heartbeat contains a set of diagnostic and identity information used for system monitoring and troubleshooting and includes:
* Workstation hostname
* Current local datetime on the workstation
* Workstation IP address
* Currently selected NMRhub user
* Daemon version
* Facility and spectrometer identifiers
* Operating system details
* Uptime and system load metrics


== '''Heartbeat and Connectivity''' ==
The Gateway receives the heartbeat, appends its own identifying information (including Gateway UUID and timestamps), and forwards the full message to the NDTS Receiver. These heartbeats are recorded in the NAN Repository and are viewable by Facility Managers via the virtual NAN Operations Center (vNOC).
* By default the daemon sends a **heartbeat** to the Gateway every **10 minutes**. 
* The Gateway forwards that heartbeat to the NDTS Receiver, where it is logged in the NAN Repository and surfaced in vNOC.


=== Slack Notifications ===
=== Slack Notifications ===
When a heartbeat is missed, the Receiver posts alerts to the facility’s Slack channel.
 
When heartbeats stop, the Receiver alerts the facility via the associated Slack channel.


{| class="wikitable"
{| class="wikitable"
! Condition !! Time-out !! Action !! Slack message
|+ Automated actions based on heartbeat status
! Condition
! Time-out
! Receiver Action
! Slack Message
|-
| First missed heartbeat
| &gt; 20 min
| Mark workstation '''offline'''
| ''offline''
|-
|-
| Missed heartbeat &gt; 20 min || ≈ 20 min || Daemon marked '''offline''' || “*offline*” message (repeats once)
| Still missing at next poll
| + 8 min
| Repeat ''offline'' (max 3)
| ''offline''
|-
|-
| Heartbeat resumes || – || Daemon marked '''online''' || “*online*” message
| Heartbeat resumes
| –  
| Mark workstation '''online'''
| ''online''
|}
|}


Channels are named:
In practical terms, if you subscribe to the Slack channel for your facility you will know within 20 minutes if a daemon went off-line, but you will see a maximum of three ''offline'' messages to not flood the Slack channel with the same ''offline'' message over and over. A single, ''online'', message will appear in the Slack channel when heartbeats resume.


* <code>ccrc-ndts-notifications</code>
Slack channels (one per facility):
* <code>nmrfam-ndts-notifications</code>
* <code>ccrc-ndts-notifications</code>
* <code>nmrfam-ndts-notifications</code>
* <code>uchc-ndts-notifications</code>
* <code>uchc-ndts-notifications</code>


== '''Version Information''' ==
== Logging ==
* On daemon start-up, the version is written to the log file (see below).
The daemon writes to three files. (A) The file <code>running_workstation_version-X.Y.Z</code> records the current workstation version. (B) The <code>ndtd_audit.txt</code> file records actions performed by the daemon, such as when each dataset completes and its harvesting status. (C) The <code>nan-dtdaemon.log</code> file records the status changes the daemon relies on to harvest datasets and associate them with the correct NAN users.
* A file named **<code>/opt/nan-dtdaemon/running_workstation_version-X.Y.Z</code>** is created, timestamped with the start time.
 
=== (A) Version Tracking ===
* When the Daemon starts it writes the active version number in two places ...
  <pre>/opt/nan-dtdaemon/running_workstation_version-X.Y.Z</pre>which records the version number and a timestamp of when the daemon started, and
  <pre>/opt/nan-dtdaemon/logs/nan-dtdaemon.log</pre>
which write an INFO message stating the version number and which is time-stamped
 
=== (B) Experiment Transfer Audit File ===
Each time the NDTS daemon processes a completed experiment, it logs a detailed audit entry to the file 
<pre>/opt/nan-dtdaemon/logs/ndtd_audit.txt</pre>This audit trail provides traceable records of all harvesting actions performed by the daemon on a given spectrometer workstation
 
Each line in the audit file includes the following fields:
 
# Timestamp when the log entry was written
# Workstation OS-level username (Linux/Windows)
# Selected NMRhub username (or <code>unselected</code> if none was selected)
# Experiment start time
# Experiment end time
# Full path to the experiment data directory
# Version number of the NDTS daemon
# Action performed by the daemon
 
==== Possible Actions Logged ====
{| class="wikitable"
|+
! Action !! Meaning
|-
| <code>sent</code> || Experiment was successfully transferred to the Gateway
|-
| <code>spooled</code> || Experiment was queued locally for later transfer
|-
| <code>sent-spooled</code> || Experiment was sent from a previously spooled location
|-
| <code>skipped-trivial</code> || Experiment was ignored because it was deemed trivial (e.g., shim or calibration)
|-
| <code>skipped-disabled</code> || Experiment was skipped due to daemon or GUI configuration disabling harvesting
|}
 
=== (C) Daemon Logging ===
Detailed logging of the daemon for detecting things such as workstation users changing, TopSpin starting, which users version of TopSpin is controlling the spectrometer, when a different NAN user is selected, which files to monitor for experiment completions, and others. The logs are saved to


== '''Experiment Transfer Audit''' ==
<pre>/opt/nan-dtdaemon/logs/nan-dtdaemon.log</pre>
Every processed experiment adds one line to 
<pre>/opt/nan-dtdaemon/logs/ndtd_audit.txt</pre>


Fields:
==== Log Levels ====
Each line begins with a level tag.  The level is controlled by the
<code>log_level</code> parameter in the configuration file <code>ndtd_configuration.dat</code>.


# Timestamp  # Workstation user  # NMRhub user (or ‘‘unselected’’)  # Start & End time 
{| class="wikitable"
# Path to data  # Daemon version  # Action 
! Level !! Verbosity !! Typical Use
(sent | spooled | sent-spooled | skipped-trivial | skipped-disabled)
|-
| fatal || Highest-priority, least frequent || Events that make the daemon shut down and cannot be auto-recovered
|-
| error || Critical problems || Failures that stop normal operation but daemon continues running
|-
| warning || Important but non-fatal issues || Conditions worth attention; daemon recovers automatically
|-
| info || Default || Unusual or noteworthy events; normal operations generate very little output
|-
| debug || Diagnostic detail || Ongoing list of major operations; log grows steadily
|-
| trace || Maximum detail || Every internal step; use only for short troubleshooting sessions
|}


== '''Daemon Logs''' ==
==== Log File Example ====
* Main log: <pre>/opt/nan-dtdaemon/logs/nan-dtdaemon.log</pre>
The fragment below is reproduced verbatim from the PDF (pp. 14-15):
* **log_level** is set in <code>ndtd_configuration.dat</code> 
  (fatal &lt; error &lt; warning &lt; info &lt; debug &lt; trace).


Example start-up excerpt (level INFO):
<pre>
<pre>
Thu Sep 28 13:17:03 2023 LOG_START Started dtd logger.
Thu Sep 28 13:17:03 2023 LOG_START Started dtd logger.
Line 65: Line 144:
Thu Sep 28 13:17:03 2023 INFO *** This is a Topspin Workstation ***
Thu Sep 28 13:17:03 2023 INFO *** This is a Topspin Workstation ***
Thu Sep 28 13:17:03 2023 INFO Ndtd Control Processor listening.
Thu Sep 28 13:17:03 2023 INFO Ndtd Control Processor listening.
Thu Sep 28 13:17:03 2023 INFO Entering polling loop...
Thu Sep 28 13:17:03 2023 INFO Workstation user has changed!
Thu Sep 28 13:17:03 2023 INFO workstation user is nmradmin
Thu Sep 28 13:17:03 2023 INFO User nmradmin is included in NAN data collection!
Thu Sep 28 13:17:03 2023 INFO Harvesting setting for user nmradmin is on
Thu Sep 28 13:17:03 2023 INFO Topspin program has been detected and is running.
Thu Sep 28 13:17:03 2023 INFO Setting directory to watch to /opt/topspin4.2.0/prog/curdir/nmradmin/shmem
</pre>
</pre>


== '''Troubleshooting Checklist''' ==
===== Analysis of the Example =====
{| class="wikitable"
# '''"LOG_START"''' — is the first line indicating the new instance of the daemon and the data and time that it started.
! Symptom !! Check
# '''"Workstation version is 1.0.15”''' — indicates the daemon is running and the version number  
|-
# '''“*** This is a Topspin Workstation ***”''' — indicates that this is a Topspin workstation  
| No new data in NAN | • <code>service data-transport-daemon status</code>  
# '''“Ndtd Control Processor listening.”''' — indicates that the daemon is listening for incoming control commands 
• Heartbeat timestamp in vNOC  
# '''“Entering polling loop…”''' — indicates that the daemon has entered the acquisition polling loop 
• Gateway log for incoming files
# '''"Workstation user has changed!"''' and next three lines — indicates that the workstation user has changed to nmradmin, that nmradmin is configured to harvest data, and that the harvesting setting is on
|-
# '''“Topspin program has been detected and is running.”''' — daemon detected that the Topspin acquisition directory running 
| Slack “offline” alerts | Workstation powered off? Network drop? Firewall blocking port 60195?
# '''“Setting directory to watch to …”''' — shows the location of the Topspin directory which the daemon will watch for file modifications that indicate the start and end of an acquisition
|-
| Log file grows rapidly | <code>log_level trace</code> left enabled → reset to '''info'''
|-
| Experiments marked ‘‘spooled’’ only | Gateway unreachable → verify IP/port and gateway service status
|}

Latest revision as of 20:05, 9 June 2025

Overview

This page explains how to control the data-transport-daemon service, verify connectivity, and interpret the daemon’s log and audit files on every spectrometer workstation.

TopSpin version prior to 4.x

The daemon reads experiment start and stop times from the TopSpin account file. TopSpin accounting is not enabled by default, so it must be enabled in each workstation user's account before NAN can harvest datasets automatically. Follow these instructions to enable accounting.

Service Control

# Start the daemon
sudo /sbin/service data-transport-daemon start

# Stop the daemon
sudo /sbin/service data-transport-daemon stop

# Restart (reloads configuration)
sudo /sbin/service data-transport-daemon restart

# Check status
sudo /sbin/service data-transport-daemon status
  • Note: the daemon will not start if another instance is already running on the workstation.
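
Because a second instance will not start, a wrapper can check the status before attempting a start. The following is a minimal sketch, not part of the NDTS distribution. It assumes that the status command exits with code 0 when the daemon is running (the usual convention for init scripts, not confirmed here for this service). SERVICE_CMD and ensure_daemon_running are hypothetical names introduced for illustration.

```shell
#!/bin/sh
# Sketch: start the daemon only when "status" reports that it is not running.
# ASSUMPTION: "status" exits 0 when the daemon is running; this follows the
# common init-script convention and is not confirmed for this service.
# SERVICE_CMD is a hypothetical override so the logic can be tried with a stub.
SERVICE_CMD="${SERVICE_CMD:-sudo /sbin/service}"
DAEMON_NAME="data-transport-daemon"

ensure_daemon_running() {
    # $SERVICE_CMD is intentionally unquoted so "sudo /sbin/service" splits
    # into a command and its argument.
    if $SERVICE_CMD "$DAEMON_NAME" status >/dev/null 2>&1; then
        echo "already running"
    elif $SERVICE_CMD "$DAEMON_NAME" start >/dev/null 2>&1; then
        echo "started"
    else
        echo "start failed"
        return 1
    fi
}
```

Calling ensure_daemon_running after sourcing the file prints one of the three status strings, and the function returns non-zero only when the start attempt fails.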

Heartbeats and Connectivity

By default, each NDTS daemon sends a heartbeat message to the Gateway every 10 minutes. Heartbeats serve as a continuous health check, confirming that the daemon is active and communicating. Each heartbeat carries diagnostic and identity information used for system monitoring and troubleshooting:

  • Workstation hostname
  • Current local datetime on the workstation
  • Workstation IP address
  • Currently selected NMRhub user
  • Daemon version
  • Facility and spectrometer identifiers
  • Operating system details
  • Uptime and system load metrics

The Gateway receives the heartbeat, appends its own identifying information (including Gateway UUID and timestamps), and forwards the full message to the NDTS Receiver. These heartbeats are recorded in the NAN Repository and are viewable by Facility Managers via the virtual NAN Operations Center (vNOC).

Slack Notifications

When heartbeats stop, the Receiver alerts the facility via the associated Slack channel.

Automated actions based on heartbeat status

Condition | Time-out | Receiver Action | Slack Message
First missed heartbeat | > 20 min | Mark workstation offline | offline
Still missing at next poll | + 8 min | Repeat offline (max 3) | offline
Heartbeat resumes | – | Mark workstation online | online

In practical terms, if you subscribe to your facility's Slack channel, you will learn that a daemon has gone offline roughly 20 minutes after its heartbeats stop. To avoid flooding the channel with identical alerts, at most three offline messages are posted. A single online message appears when heartbeats resume.
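
The alert schedule can be expressed as simple arithmetic. The sketch below encodes one reading of the table and is not taken from the Receiver's code: it assumes the first offline message is posted once the silence exceeds 20 minutes, a repeat follows after each further 8 minutes, and three messages are posted in total. The function name offline_alerts_posted is hypothetical.

```shell
#!/bin/sh
# Sketch: number of "offline" Slack messages expected after a given silence.
# ASSUMPTIONS (one reading of the table, not confirmed by the Receiver code):
#   - first alert once the silence exceeds 20 minutes
#   - one repeat after each further 8 minutes
#   - three alerts in total
offline_alerts_posted() {
    silent_min=$1    # whole minutes since the last heartbeat was received
    if [ "$silent_min" -le 20 ]; then
        echo 0
        return
    fi
    count=$(( 1 + (silent_min - 20) / 8 ))
    if [ "$count" -gt 3 ]; then
        count=3
    fi
    echo "$count"
}
```

Under these assumptions, a workstation that has been silent for 30 minutes has triggered two offline messages, and any silence of 36 minutes or longer has triggered all three.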

Slack channels (one per facility):

  • ccrc-ndts-notifications
  • nmrfam-ndts-notifications
  • uchc-ndts-notifications

Logging

The daemon writes to three files. (A) The file running_workstation_version-X.Y.Z records the current workstation version. (B) The ndtd_audit.txt file records actions performed by the daemon, such as when each dataset completes and its harvesting status. (C) The nan-dtdaemon.log file records the status changes the daemon relies on to harvest datasets and associate them with the correct NAN users.

(A) Version Tracking

When the daemon starts, it records the active version number in two places:

  • /opt/nan-dtdaemon/running_workstation_version-X.Y.Z
    This file records the version number and a timestamp of when the daemon started.
  • /opt/nan-dtdaemon/logs/nan-dtdaemon.log
    The daemon writes a time-stamped INFO message stating the version number.
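
The version can be read back from the marker file name. This is a minimal sketch that relies only on the file-name pattern given above. The function name running_versions and its optional directory argument are hypothetical, and because this page does not say whether markers from earlier versions are removed, the function prints every marker it finds.

```shell
#!/bin/sh
# Sketch: print the X.Y.Z part of each running_workstation_version-X.Y.Z marker.
# The directory argument is a hypothetical convenience; the default is the
# install path named on this page.
running_versions() {
    dir="${1:-/opt/nan-dtdaemon}"
    for f in "$dir"/running_workstation_version-*; do
        # Skip the unexpanded pattern when no marker file exists.
        [ -e "$f" ] || continue
        # Strip everything up to and including the fixed prefix.
        echo "${f##*running_workstation_version-}"
    done
}
```

Running running_versions with no argument inspects /opt/nan-dtdaemon. An empty result means no marker file was found there.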

(B) Experiment Transfer Audit File

Each time the NDTS daemon processes a completed experiment, it writes a detailed audit entry to the file

/opt/nan-dtdaemon/logs/ndtd_audit.txt

This audit trail provides a traceable record of all harvesting actions performed by the daemon on a given spectrometer workstation.

Each line in the audit file includes the following fields:

  1. Timestamp when the log entry was written
  2. Workstation OS-level username (Linux/Windows)
  3. Selected NMRhub username (or unselected if none was selected)
  4. Experiment start time
  5. Experiment end time
  6. Full path to the experiment data directory
  7. Version number of the NDTS daemon
  8. Action performed by the daemon

Possible Actions Logged

Action | Meaning
sent | Experiment was successfully transferred to the Gateway
spooled | Experiment was queued locally for later transfer
sent-spooled | Experiment was sent from a previously spooled location
skipped-trivial | Experiment was ignored because it was deemed trivial (e.g., shim or calibration)
skipped-disabled | Experiment was skipped due to daemon or GUI configuration disabling harvesting
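
A quick tally of the actions gives an overview of harvesting on a workstation. The sketch below is illustrative only. This page does not specify the field delimiter of the audit file, so the code assumes only that the action is the last item on each line (it is listed as the final field) and matches the five documented action names at the end of the line. The function name audit_action_counts is hypothetical, and grep -o is a common extension rather than a POSIX option.

```shell
#!/bin/sh
# Sketch: count audit entries per action.
# ASSUMPTION: the action is the last item on each line; the field delimiter
# is not documented, so no delimiter is assumed.
audit_action_counts() {
    file="${1:-/opt/nan-dtdaemon/logs/ndtd_audit.txt}"
    # Extract a documented action name anchored at the end of the line,
    # drop trailing whitespace, then count each distinct action.
    grep -oE '(sent-spooled|skipped-trivial|skipped-disabled|spooled|sent)[[:space:]]*$' "$file" \
        | sed 's/[[:space:]]*$//' \
        | sort \
        | uniq -c \
        | awk '{ print $2, $1 }'
}
```

Each output line holds an action name followed by its count. A large spooled count with few sent-spooled entries would suggest that queued experiments are not yet reaching the Gateway.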

(C) Daemon Logging

The daemon log records detailed events such as workstation user changes, TopSpin start-up, which user's instance of TopSpin is controlling the spectrometer, the selection of a different NAN user, and which files are being monitored for experiment completion. The log is saved to

/opt/nan-dtdaemon/logs/nan-dtdaemon.log

Log Levels

Each line begins with a level tag. The level is controlled by the log_level parameter in the configuration file ndtd_configuration.dat.

Level | Verbosity | Typical Use
fatal | Highest-priority, least frequent | Events that make the daemon shut down and cannot be auto-recovered
error | Critical problems | Failures that stop normal operation while the daemon continues running
warning | Important but non-fatal issues | Conditions worth attention; daemon recovers automatically
info | Default | Unusual or noteworthy events; normal operations generate very little output
debug | Diagnostic detail | Ongoing list of major operations; log grows steadily
trace | Maximum detail | Every internal step; use only for short troubleshooting sessions
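
To review only the entries that need attention, the log can be filtered by level. This is a rough sketch under two assumptions that this page does not confirm: the level tag is the sixth whitespace-separated field, following the five-field timestamp seen in the start-up excerpt on this page, and the tags for the three most severe levels are the level names from the table in upper case (only INFO and LOG_START appear in the documented excerpt). The function name log_problems is hypothetical.

```shell
#!/bin/sh
# Sketch: print only warning, error, and fatal entries from the daemon log.
# ASSUMPTIONS: the level tag is field 6 (after a five-field timestamp), and
# the tags are WARNING, ERROR, and FATAL; neither is confirmed by this page.
log_problems() {
    file="${1:-/opt/nan-dtdaemon/logs/nan-dtdaemon.log}"
    awk 'toupper($6) ~ /^(WARNING|ERROR|FATAL)$/' "$file"
}
```

If the daemon uses different tag spellings, such as WARN, the pattern needs to be adjusted accordingly.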

Log File Example

The fragment below is reproduced verbatim from the PDF (pp. 14-15):

Thu Sep 28 13:17:03 2023 LOG_START Started dtd logger.
Thu Sep 28 13:17:03 2023 INFO NDTD Workstation version is 1.0.15
Thu Sep 28 13:17:03 2023 INFO *** This is a Topspin Workstation ***
Thu Sep 28 13:17:03 2023 INFO Ndtd Control Processor listening.
Thu Sep 28 13:17:03 2023 INFO Entering polling loop...
Thu Sep 28 13:17:03 2023 INFO Workstation user has changed!
Thu Sep 28 13:17:03 2023 INFO workstation user is nmradmin
Thu Sep 28 13:17:03 2023 INFO User nmradmin is included in NAN data collection!
Thu Sep 28 13:17:03 2023 INFO Harvesting setting for user nmradmin is on
Thu Sep 28 13:17:03 2023 INFO Topspin program has been detected and is running.
Thu Sep 28 13:17:03 2023 INFO Setting directory to watch to /opt/topspin4.2.0/prog/curdir/nmradmin/shmem

Analysis of the Example

  1. "LOG_START": the first line written by a new instance of the daemon, showing the date and time that it started.
  2. "NDTD Workstation version is 1.0.15": indicates that the daemon is running and reports its version number.
  3. "*** This is a Topspin Workstation ***": indicates that this is a Topspin workstation.
  4. "Ndtd Control Processor listening.": indicates that the daemon is listening for incoming control commands.
  5. "Entering polling loop...": indicates that the daemon has entered the acquisition polling loop.
  6. "Workstation user has changed!" and the next three lines: indicate that the workstation user has changed to nmradmin, that nmradmin is included in NAN data collection, and that the harvesting setting for nmradmin is on.
  7. "Topspin program has been detected and is running.": indicates that the daemon has detected a running Topspin program.
  8. "Setting directory to watch to …": shows the Topspin directory that the daemon will watch for file modifications that indicate the start and end of an acquisition.
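
The two most useful values in a start-up sequence, the daemon version and the watched directory, can be pulled from the log with sed. The sketch below matches the literal message text shown in the excerpt above. The function names last_logged_version and last_watch_dir are hypothetical, and the message wording may differ in other daemon versions.

```shell
#!/bin/sh
# Sketch: report the most recent version and watch directory found in the log.
# The match strings are copied from the documented excerpt; other daemon
# versions may word these messages differently.
NDTD_LOG_DEFAULT="/opt/nan-dtdaemon/logs/nan-dtdaemon.log"

last_logged_version() {
    file="${1:-$NDTD_LOG_DEFAULT}"
    sed -n 's/.*NDTD Workstation version is \([^[:space:]]*\).*/\1/p' "$file" | tail -n 1
}

last_watch_dir() {
    file="${1:-$NDTD_LOG_DEFAULT}"
    sed -n 's/.*Setting directory to watch to \(.*\)$/\1/p' "$file" | tail -n 1
}
```

Both functions print nothing when the log contains no matching line, which by itself can indicate that the daemon never completed start-up.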