{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# NCAR JupyterHub Large Data Example Notebook\n",
"\n",
"_Note: If you do not have access to the NCAR machine, please look at the\n",
"AWS-LENS example notebook instead._\n",
"\n",
"This notebook demonstrates how to compare large datasets on glade with ldcpy. In\n",
"particular, we will look at data from CESM-LENS1 project\n",
"(http://www.cesm.ucar.edu/projects/community-projects/LENS/data-sets.html). In\n",
"doing so, we will start a DASK client from Jupyter. This notebook is meant to be\n",
"run on NCAR's JupyterHub (https://jupyterhub.hpc.ucar.edu). We will use a subset of\n",
"the CESM-LENS1 data on glade is located in\n",
"/glade/campaign/cisl/asap/ldcpy_sample_data/lens.\n",
"\n",
"We assume that you have a copy of the ldcpy code on NCAR's glade filesystem,\n",
"obtained via:
`git clone https://github.com/NCAR/ldcpy.git`\n",
"\n",
"When you launch a NCAR JupyterHub session, you will need to indicate a\n",
"machine and then you will need your charge account. You can\n",
"then launch the session and navigate to this notebook.\n",
"\n",
"NCAR's JupyterHub documentation:
\n",
"https://www2.cisl.ucar.edu/resources/jupyterhub-ncar\n",
"\n",
"Here's another good resource for using NCAR's JupyterHub:
\n",
"https://ncar-hackathons.github.io/jupyterlab-tutorial/jhub.html)\n",
"\n",
"**You can run your notebook with the \"NPL 2023a\" kernel (choose from the\n",
"dropdown in the upper left.)**\n",
"\n",
"Note that the compressed data that we are using was generated for this paper:\n",
"\n",
"Allison H. Baker, Dorit M. Hammerling, Sheri A. Mickelson, Haiying Xu, Martin B.\n",
"Stolpe, Phillipe Naveau, Ben Sanderson, Imme Ebert-Uphoff, Savini Samarasinghe,\n",
"Francesco De Simone, Francesco Carbone, Christian N. Gencarelli, John M. Dennis,\n",
"Jennifer E. Kay, and Peter Lindstrom, “Evaluating Lossy Data Compression on\n",
"Climate Simulation Data within a Large Ensemble.” Geoscientific Model\n",
"Development, 9, pp. 4381-4403, 2016\n",
"(https://gmd.copernicus.org/articles/9/4381/2016/)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Setup\n",
"\n",
"Let's set up our environment. First, make sure that you are using an appropriate kernel (like NPL 2023a). Then you will need to modify the path below to indicate\n",
"where you have cloned ldcpy. \n",
"\n",
"If you want to use the dask dashboard, then the dask.config link must be set\n",
"below (modify for your path in your browser).\n"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"# Add ldcpy root to system path (MODIFY FOR YOUR LDCPY CODE LOCATION IF NOT IN the kernel)\n",
"import sys\n",
"\n",
"# sys.path.insert(0, '/glade/u/home/abaker/repos/my_ldcpy')\n",
"sys.path.insert(0, '../../../')\n",
"\n",
"import ldcpy\n",
"\n",
"# Display output of plots directly in Notebook\n",
"%matplotlib inline\n",
"\n",
"# Automatically reload module if it is editted\n",
"%reload_ext autoreload\n",
"%autoreload 2\n",
"\n",
"# silence warnings\n",
"import warnings\n",
"\n",
"warnings.filterwarnings(\"ignore\")"
]
},
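{
"cell_type": "markdown",
"metadata": {},
"source": [
"The dashboard link mentioned in the Setup text follows the pattern of the dashboard address that appears in this notebook's output. As a rough sketch (the template constant and the `dashboard_link` helper are hypothetical names, not part of ldcpy or Dask), the link for a given user could be built like this:\n",
"\n",
"```python\n",
"# Hypothetical template, modeled on the dashboard URL shown in this notebook's output\n",
"DASHBOARD_TEMPLATE = 'https://jupyterhub.hpc.ucar.edu/stable/user/{user}/proxy/{port}/status'\n",
"\n",
"\n",
"def dashboard_link(user, port=8787):\n",
"    # Fill in the JupyterHub user name and the port the dashboard is proxied on\n",
"    return DASHBOARD_TEMPLATE.format(user=user, port=port)\n",
"\n",
"\n",
"print(dashboard_link('abaker'))\n",
"```\n",
"\n",
"The resulting string is what you would hand to `dask.config.set` (under the `distributed.dashboard.link` key, assuming your Dask version uses that name); substitute your own user name."
]
},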
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Connect to DASK distributed cluster :\n",
"\n",
"Since we use the PBS Pro scheduler at NCAR, we will use the PBSCluster scheduler from dask-jobqueue. Initialization is similar to a LocalCluster, but with unique parameters specific to creating batch jobs.\n",
"Helpful info about using DASK at NCAR: https://github.com/NCAR/Xarray-Dask-ESDS-2024/blob/main/notebooks/02-dask-intro.ipynb"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"import dask\n",
"from dask_jobqueue import PBSCluster"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You'll need to customize the parameters of the PBSCluster template for the resources that will be assigned to each batch job.\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"cluster = PBSCluster(\n",
" # Basic job directives\n",
" job_name='ldcpy-largedata',\n",
" queue='casper',\n",
" walltime='60:00',\n",
" # Make sure you change the project code if running this notebook!!\n",
" account='NTDD0004',\n",
" log_directory='dask-logs',\n",
" # These settings impact the resources assigned to the job\n",
" cores=1,\n",
" memory='10GiB',\n",
" resource_spec='select=1:ncpus=1:mem=10GB',\n",
" # These settings define the resources assigned to a worker\n",
" processes=1,\n",
" # This controls where Dask will write data to disk if memory is exhausted\n",
" local_directory='/local_scratch/pbs.$PBS_JOBID/dask/spill',\n",
" # This specifies which network interface the cluster will use\n",
" interface='ext',\n",
")"
]
},
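{
"cell_type": "markdown",
"metadata": {},
"source": [
"When customizing the template above, `cores`, `memory`, and `resource_spec` describe the same request twice: once for Dask and once for PBS. A minimal sketch of keeping them in sync is shown here; `pbs_worker_resources` is a hypothetical helper written for this illustration, not part of dask-jobqueue or ldcpy.\n",
"\n",
"```python\n",
"def pbs_worker_resources(ncpus=1, mem_gb=10):\n",
"    # Build matching Dask and PBS settings from one pair of numbers\n",
"    if ncpus < 1 or mem_gb < 1:\n",
"        raise ValueError('ncpus and mem_gb must be positive')\n",
"    return {\n",
"        # Total cores Dask may use in each batch job\n",
"        'cores': ncpus,\n",
"        # Total memory Dask may use in each batch job\n",
"        'memory': f'{mem_gb}GiB',\n",
"        # What PBS is asked to reserve for each batch job\n",
"        'resource_spec': f'select=1:ncpus={ncpus}:mem={mem_gb}GB',\n",
"    }\n",
"\n",
"\n",
"settings = pbs_worker_resources(ncpus=1, mem_gb=10)\n",
"print(settings['resource_spec'])\n",
"```\n",
"\n",
"The returned dictionary could then be unpacked into the call, for example `PBSCluster(**settings, queue='casper', ...)`, so that a change to the memory or core count only needs to be made in one place."
]
},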
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Check your job script:"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"#!/usr/bin/env bash\n",
"\n",
"#PBS -N ldcpy-largedata\n",
"#PBS -q casper\n",
"#PBS -A NTDD0004\n",
"#PBS -l select=1:ncpus=1:mem=10GB\n",
"#PBS -l walltime=60:00\n",
"#PBS -e dask-logs/\n",
"#PBS -o dask-logs/\n",
"\n",
"/glade/u/apps/opt/conda/envs/npl-2024a/bin/python -m distributed.cli.dask_worker tcp://128.117.208.118:36515 --nthreads 1 --memory-limit 10.00GiB --name dummy-name --nanny --death-timeout 60 --local-directory /local_scratch/pbs.$PBS_JOBID/dask/spill --interface ext\n",
"\n"
]
}
],
"source": [
"print(cluster.job_script())"
]
},
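{
"cell_type": "markdown",
"metadata": {},
"source": [
"Rather than reading the job script by eye, the `#PBS` directives can be pulled out and checked programmatically. The sketch below uses a hypothetical helper, `pbs_directives`, and a sample string copied from the job script printed above; it is an illustration rather than a dask-jobqueue feature.\n",
"\n",
"```python\n",
"def pbs_directives(job_script):\n",
"    # Collect '#PBS <flag> <value>' lines; a flag such as -l can repeat, so keep lists\n",
"    directives = {}\n",
"    for line in job_script.splitlines():\n",
"        if line.startswith('#PBS '):\n",
"            parts = line.split(maxsplit=2)\n",
"            flag = parts[1]\n",
"            value = parts[2] if len(parts) > 2 else ''\n",
"            directives.setdefault(flag, []).append(value)\n",
"    return directives\n",
"\n",
"\n",
"# Sample text copied from the job script output in this notebook\n",
"sample = '''#!/usr/bin/env bash\n",
"\n",
"#PBS -N ldcpy-largedata\n",
"#PBS -q casper\n",
"#PBS -A NTDD0004\n",
"#PBS -l select=1:ncpus=1:mem=10GB\n",
"#PBS -l walltime=60:00\n",
"'''\n",
"\n",
"found = pbs_directives(sample)\n",
"print(found['-A'])\n",
"print(found['-l'])\n",
"```\n",
"\n",
"In a live session you could call `pbs_directives(cluster.job_script())` instead of using the sample string, for instance to confirm that the project code under `-A` is your own before any jobs are submitted."
]
},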
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We've created a cluster using PBSCluster(), and now we need Dask to provide an object called the Client for interacting with the cluster and workers. "
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
Client-cee87cd0-97cd-11ef-851e-ac1f6bab1e16
\n", "| Connection method: Cluster object | \n", "Cluster type: dask_jobqueue.PBSCluster | \n", " \n", "
| \n", " Dashboard: https://jupyterhub.hpc.ucar.edu/stable/user/abaker/proxy/8787/status\n", " | \n", "\n", " |
07d66607
\n", "| \n", " Dashboard: https://jupyterhub.hpc.ucar.edu/stable/user/abaker/proxy/8787/status\n", " | \n", "\n", " Workers: 0\n", " | \n", "
| \n", " Total threads: 0\n", " | \n", "\n", " Total memory: 0 B\n", " | \n", "
Scheduler-a041bbd3-14db-4d2b-a3d7-6856b1332ee9
\n", "| \n", " Comm: tcp://128.117.208.118:36515\n", " | \n", "\n", " Workers: 0\n", " | \n", "
| \n", " Dashboard: https://jupyterhub.hpc.ucar.edu/stable/user/abaker/proxy/8787/status\n", " | \n", "\n", " Total threads: 0\n", " | \n", "
| \n", " Started: Just now\n", " | \n", "\n", " Total memory: 0 B\n", " | \n", "
<xarray.Dataset>\n",
"Dimensions: (collection: 2, time: 1032, lat: 192, lon: 288)\n",
"Coordinates:\n",
" * lat (lat) float64 -90.0 -89.06 -88.12 -87.17 ... 88.12 89.06 90.0\n",
" * lon (lon) float64 0.0 1.25 2.5 3.75 5.0 ... 355.0 356.2 357.5 358.8\n",
" * time (time) object 1920-02-01 00:00:00 ... 2006-01-01 00:00:00\n",
" cell_area (lat, collection, lon) float64 dask.array<chunksize=(192, 1, 288), meta=np.ndarray>\n",
" * collection (collection) <U5 'orig' 'lossy'\n",
"Data variables:\n",
" PS (collection, time, lat, lon) float32 dask.array<chunksize=(1, 500, 192, 288), meta=np.ndarray>\n",
"Attributes: (12/16)\n",
" Conventions: CF-1.0\n",
" source: CAM\n",
" case: b.e11.B20TRC5CNBDRD.f09_g16.031\n",
" title: UNSET\n",
" logname: mickelso\n",
" host: ys0219\n",
" ... ...\n",
" history: Tue Nov 3 13:51:10 2020: ncks -L 5 PS.monthly.192001-2...\n",
" NCO: netCDF Operators version 4.7.9 (Homepage = http://nco.s...\n",
" cell_measures: area: cell_area\n",
" data_type: cam-fv\n",
" file_size: {'orig': 127663040, 'lossy': 39865015}\n",
" weighted: True| \n", " | orig | \n", "lossy | \n", "
|---|---|---|
| mean | \n", "98509 | \n", "98493 | \n", "
| variance | \n", "8.4256e+07 | \n", "8.425e+07 | \n", "
| standard deviation | \n", "9179.1 | \n", "9178.8 | \n", "
| min value | \n", "51967 | \n", "51952 | \n", "
| min (abs) nonzero value | \n", "51967 | \n", "51952 | \n", "
| max value | \n", "1.0299e+05 | \n", "1.0298e+05 | \n", "
| probability positive | \n", "1 | \n", "1 | \n", "
| number of zeros | \n", "0 | \n", "0 | \n", "
| 99% real information cutoff bit | \n", "19 | \n", "19 | \n", "
| spatial autocorr - latitude | \n", "0.98434 | \n", "0.98434 | \n", "
| spatial autocorr - longitude | \n", "0.99136 | \n", "0.99136 | \n", "
| entropy estimate | \n", "0.40644 | \n", "0.11999 | \n", "
| \n", " | lossy | \n", "
|---|---|
| max abs diff | \n", "31.992 | \n", "
| min abs diff | \n", "0 | \n", "
| mean abs diff | \n", "15.906 | \n", "
| mean squared diff | \n", "253.01 | \n", "
| root mean squared diff | \n", "18.388 | \n", "
| normalized root mean squared diff | \n", "0.0003587 | \n", "
| normalized max pointwise error | \n", "0.00062698 | \n", "
| pearson correlation coefficient | \n", "1 | \n", "
| ks p-value | \n", "1.6583e-05 | \n", "
| spatial relative error(% > 0.0001) | \n", "69.085 | \n", "
| max spatial relative error | \n", "0.00048247 | \n", "
| DSSIM | \n", "0.91512 | \n", "
| file size ratio | \n", "3.2 | \n", "
<xarray.Dataset>\n",
"Dimensions: (collection: 2, time: 25100, lat: 192, lon: 288)\n",
"Coordinates:\n",
" * lat (lat) float64 -90.0 -89.06 -88.12 -87.17 ... 88.12 89.06 90.0\n",
" * lon (lon) float64 0.0 1.25 2.5 3.75 5.0 ... 355.0 356.2 357.5 358.8\n",
" * time (time) object 1920-01-01 00:00:00 ... 1988-10-07 00:00:00\n",
" cell_area (lat, collection, lon) float64 dask.array<chunksize=(192, 1, 288), meta=np.ndarray>\n",
" * collection (collection) <U5 'orig' 'lossy'\n",
"Data variables:\n",
" TS (collection, time, lat, lon) float32 dask.array<chunksize=(1, 500, 192, 288), meta=np.ndarray>\n",
"Attributes: (12/16)\n",
" Conventions: CF-1.0\n",
" source: CAM\n",
" case: b.e11.B20TRC5CNBDRD.f09_g16.031\n",
" title: UNSET\n",
" logname: mickelso\n",
" host: ys0219\n",
" ... ...\n",
" history: Tue Nov 3 13:56:03 2020: ncks -L 5 TS.daily.19200101-2...\n",
" NCO: netCDF Operators version 4.7.9 (Homepage = http://nco.s...\n",
" cell_measures: area: cell_area\n",
" data_type: cam-fv\n",
" file_size: {'orig': 3962086636, 'lossy': 1330827000}\n",
" weighted: True| \n", " | orig | \n", "lossy | \n", "
|---|---|---|
| mean | \n", "284.49 | \n", "284.43 | \n", "
| variance | \n", "533.99 | \n", "533.44 | \n", "
| standard deviation | \n", "23.108 | \n", "23.096 | \n", "
| min value | \n", "216.73 | \n", "216.69 | \n", "
| min (abs) nonzero value | \n", "216.73 | \n", "216.69 | \n", "
| max value | \n", "315.58 | \n", "315.5 | \n", "
| probability positive | \n", "1 | \n", "1 | \n", "
| number of zeros | \n", "0 | \n", "0 | \n", "
| 99% real information cutoff bit | \n", "18 | \n", "18 | \n", "
| spatial autocorr - latitude | \n", "0.99392 | \n", "0.99392 | \n", "
| spatial autocorr - longitude | \n", "0.9968 | \n", "0.9968 | \n", "
| entropy estimate | \n", "0.41487 | \n", "0.13675 | \n", "
| \n", " | lossy | \n", "
|---|---|
| max abs diff | \n", "0.12497 | \n", "
| min abs diff | \n", "0 | \n", "
| mean abs diff | \n", "0.059427 | \n", "
| mean squared diff | \n", "0.0035316 | \n", "
| root mean squared diff | \n", "0.069462 | \n", "
| normalized root mean squared diff | \n", "0.0006603 | \n", "
| normalized max pointwise error | \n", "0.0012642 | \n", "
| pearson correlation coefficient | \n", "1 | \n", "
| ks p-value | \n", "0.36817 | \n", "
| spatial relative error(% > 0.0001) | \n", "73.293 | \n", "
| max spatial relative error | \n", "0.00048733 | \n", "
| DSSIM | \n", "0.97883 | \n", "
| file size ratio | \n", "2.98 | \n", "
<xarray.Dataset>\n",
"Dimensions: (collection: 2, time: 25100, lat: 192, lon: 288)\n",
"Coordinates:\n",
" * lat (lat) float64 -90.0 -89.06 -88.12 -87.17 ... 88.12 89.06 90.0\n",
" * lon (lon) float64 0.0 1.25 2.5 3.75 5.0 ... 355.0 356.2 357.5 358.8\n",
" * time (time) object 2006-01-01 00:00:00 ... 2074-10-07 00:00:00\n",
" cell_area (lat, collection, lon) float64 dask.array<chunksize=(192, 1, 288), meta=np.ndarray>\n",
" * collection (collection) <U5 'orig' 'lossy'\n",
"Data variables:\n",
" PRECT (collection, time, lat, lon) float32 dask.array<chunksize=(1, 500, 192, 288), meta=np.ndarray>\n",
"Attributes: (12/16)\n",
" Conventions: CF-1.0\n",
" source: CAM\n",
" case: b.e11.BRCP85C5CNBDRD.f09_g16.031\n",
" title: UNSET\n",
" logname: mickelso\n",
" host: ys1023\n",
" ... ...\n",
" history: Tue Nov 3 14:13:51 2020: ncks -L 5 PRECT.daily.2006010...\n",
" NCO: netCDF Operators version 4.7.9 (Homepage = http://nco.s...\n",
" cell_measures: area: cell_area\n",
" data_type: cam-fv\n",
" file_size: {'orig': 4909326733, 'lossy': 3446890300}\n",
" weighted: True| Workers | 36 |
|---|---|
| Cores | 144 |
| Memory | 435.96 GB |
Dashboard: https://jupyterhub.ucar.edu/ch/user/abaker/proxy/8787/status
\n" } }, "89582c55d64d4db59236c8eb6defbcf6": { "model_module": "@jupyter-widgets/base", "model_module_version": "1.2.0", "model_name": "LayoutModel", "state": {} }, "8b3f06d82fd148d88f7347284fcb61b9": { "model_module": "@jupyter-widgets/controls", "model_module_version": "1.5.0", "model_name": "DescriptionStyleModel", "state": { "description_width": "" } }, "8bb91036d39246f8b534742f472d7a32": { "model_module": "@jupyter-widgets/base", "model_module_version": "1.2.0", "model_name": "LayoutModel", "state": {} }, "97703ce7fd5e4c57b146c27420e5a869": { "model_module": "@jupyter-widgets/controls", "model_module_version": "1.5.0", "model_name": "HBoxModel", "state": { "children": [ "IPY_MODEL_fa056ccf1dbf4184bc4b871018f7df52", "IPY_MODEL_1cafd188512946a4975b7434cf5bdb12", "IPY_MODEL_e7babefc52714907929776fabf4a22f4" ], "layout": "IPY_MODEL_89582c55d64d4db59236c8eb6defbcf6" } }, "9e07e779e4cd48b38f47086361e44e53": { "model_module": "@jupyter-widgets/base", "model_module_version": "1.2.0", "model_name": "LayoutModel", "state": { "min_width": "500px" } }, "9e9f4be928974750b15a44abeebc2a1e": { "model_module": "@jupyter-widgets/controls", "model_module_version": "1.5.0", "model_name": "ButtonStyleModel", "state": {} }, "a704d5489e39484eaf8b55d13e7837f0": { "model_module": "@jupyter-widgets/controls", "model_module_version": "1.5.0", "model_name": "DescriptionStyleModel", "state": { "description_width": "" } }, "becabee6fa304181a83e10f4c0d8385f": { "model_module": "@jupyter-widgets/base", "model_module_version": "1.2.0", "model_name": "LayoutModel", "state": {} }, "c3e7bfc2d0d7429297137158b66c61c4": { "model_module": "@jupyter-widgets/controls", "model_module_version": "1.5.0", "model_name": "HBoxModel", "state": { "children": [ "IPY_MODEL_30037a1881674a2490a391b2f998071a", "IPY_MODEL_79b42978946a4f87959327d6debf30fa" ], "layout": "IPY_MODEL_ca98fb05915c41deb329f8527ad3ef1d" } }, "ca98fb05915c41deb329f8527ad3ef1d": { "model_module": "@jupyter-widgets/base", 
"model_module_version": "1.2.0", "model_name": "LayoutModel", "state": {} }, "d5d8be56dddf4735a420f059498e38f2": { "model_module": "@jupyter-widgets/controls", "model_module_version": "1.5.0", "model_name": "DescriptionStyleModel", "state": { "description_width": "" } }, "d8632483be53473c8f1c6c9da3bc5c3b": { "model_module": "@jupyter-widgets/controls", "model_module_version": "1.5.0", "model_name": "AccordionModel", "state": { "_titles": { "0": "Manual Scaling", "1": "Adaptive Scaling" }, "children": [ "IPY_MODEL_c3e7bfc2d0d7429297137158b66c61c4", "IPY_MODEL_97703ce7fd5e4c57b146c27420e5a869" ], "layout": "IPY_MODEL_9e07e779e4cd48b38f47086361e44e53" } }, "e49a487f126e4335a5a0a4e7a54a1949": { "model_module": "@jupyter-widgets/controls", "model_module_version": "1.5.0", "model_name": "HBoxModel", "state": { "children": [ "IPY_MODEL_6829bddcadc34e25a81ef7e6aa62e887", "IPY_MODEL_d8632483be53473c8f1c6c9da3bc5c3b" ], "layout": "IPY_MODEL_becabee6fa304181a83e10f4c0d8385f" } }, "e7babefc52714907929776fabf4a22f4": { "model_module": "@jupyter-widgets/controls", "model_module_version": "1.5.0", "model_name": "ButtonModel", "state": { "description": "Adapt", "layout": "IPY_MODEL_16af2f817e22463a810b23b4d8e220de", "style": "IPY_MODEL_140e167494514e18a48f42003447bc03" } }, "f4cf333f343144e09978f157443c170d": { "model_module": "@jupyter-widgets/controls", "model_module_version": "1.5.0", "model_name": "DescriptionStyleModel", "state": { "description_width": "" } }, "f9f8af4e296247fb9cd613b1abc75a3b": { "model_module": "@jupyter-widgets/controls", "model_module_version": "1.5.0", "model_name": "VBoxModel", "state": { "children": [ "IPY_MODEL_3caec2a2244a43f1a039d12e523920b7", "IPY_MODEL_e49a487f126e4335a5a0a4e7a54a1949", "IPY_MODEL_7f1de88de25c401eb699365e5a2e4c97" ], "layout": "IPY_MODEL_5b30e8e6e42c403a8b0ccf4fce880e78" } }, "fa056ccf1dbf4184bc4b871018f7df52": { "model_module": "@jupyter-widgets/controls", "model_module_version": "1.5.0", "model_name": "IntTextModel", "state": { 
"description": "Minimum", "layout": "IPY_MODEL_16af2f817e22463a810b23b4d8e220de", "step": 1, "style": "IPY_MODEL_f4cf333f343144e09978f157443c170d" } } }, "version_major": 2, "version_minor": 0 } } }, "nbformat": 4, "nbformat_minor": 4 }