809 lines
31 KiB
Plaintext
Vendored
809 lines
31 KiB
Plaintext
Vendored
{
|
|
"cells": [
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"# Tutorial: PINA and PyTorch Lightning, training tips and visualizations \n",
|
|
"\n",
|
|
"[](https://colab.research.google.com/github/mathLab/PINA/blob/master/tutorials/tutorial11/tutorial.ipynb)\n",
|
|
"\n",
|
|
"In this tutorial, we will delve deeper into the functionality of the `Trainer` class, which serves as the cornerstone for training **PINA** [Solvers](https://mathlab.github.io/PINA/_rst/_code.html#solvers). \n",
|
|
"\n",
|
|
"The `Trainer` class offers a plethora of features aimed at improving model accuracy, reducing training time and memory usage, facilitating logging visualization, and more thanks to the amazing job done by the PyTorch Lightning team!\n",
|
|
"\n",
|
|
"Our leading example will revolve around solving the `SimpleODE` problem, as outlined in the [*Introduction to PINA for Physics Informed Neural Networks training*](https://github.com/mathLab/PINA/blob/master/tutorials/tutorial1/tutorial.ipynb). If you haven't already explored it, we highly recommend doing so before diving into this tutorial.\n",
|
|
"\n",
|
|
"Let's start by importing useful modules, define the `SimpleODE` problem and the `PINN` solver."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 1,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"## routine needed to run the notebook on Google Colab\n",
|
|
"try:\n",
|
|
" import google.colab\n",
|
|
" IN_COLAB = True\n",
|
|
"except:\n",
|
|
" IN_COLAB = False\n",
|
|
"if IN_COLAB:\n",
|
|
" !pip install \"pina-mathlab\"\n",
|
|
"\n",
|
|
"import torch\n",
|
|
"\n",
|
|
"from pina import Condition, Trainer\n",
|
|
"from pina.solver import PINN\n",
|
|
"from pina.model import FeedForward\n",
|
|
"from pina.problem import SpatialProblem\n",
|
|
"from pina.operator import grad\n",
|
|
"from pina.domain import CartesianDomain\n",
|
|
"from pina.equation import Equation, FixedValue\n",
|
|
"\n",
|
|
"class SimpleODE(SpatialProblem):\n",
|
|
"\n",
|
|
" output_variables = ['u']\n",
|
|
" spatial_domain = CartesianDomain({'x': [0, 1]})\n",
|
|
"\n",
|
|
" # defining the ode equation\n",
|
|
" def ode_equation(input_, output_):\n",
|
|
" u_x = grad(output_, input_, components=['u'], d=['x'])\n",
|
|
" u = output_.extract(['u'])\n",
|
|
" return u_x - u\n",
|
|
"\n",
|
|
" # conditions to hold\n",
|
|
" conditions = {\n",
|
|
" 'bound_cond': Condition(domain=CartesianDomain({'x': 0.}), equation=FixedValue(1)), # We fix initial condition to value 1\n",
|
|
" 'phys_cond': Condition(domain=CartesianDomain({'x': [0, 1]}), equation=Equation(ode_equation)), # We wrap the python equation using Equation\n",
|
|
" }\n",
|
|
"\n",
|
|
" # defining the true solution\n",
|
|
" def truth_solution(self, pts):\n",
|
|
" return torch.exp(pts.extract(['x']))\n",
|
|
" \n",
|
|
"\n",
|
|
"# sampling for training\n",
|
|
"problem = SimpleODE()\n",
|
|
"problem.discretise_domain(1, 'random', domains=['bound_cond'])\n",
|
|
"problem.discretise_domain(20, 'lh', domains=['phys_cond'])\n",
|
|
"\n",
|
|
"# build the model\n",
|
|
"model = FeedForward(\n",
|
|
" layers=[10, 10],\n",
|
|
" func=torch.nn.Tanh,\n",
|
|
" output_dimensions=len(problem.output_variables),\n",
|
|
" input_dimensions=len(problem.input_variables)\n",
|
|
")\n",
|
|
"\n",
|
|
"# create the PINN object\n",
|
|
"pinn = PINN(problem, model)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"Till now we just followed the extact step of the previous tutorials. The `Trainer` object\n",
|
|
"can be initialized by simiply passing the `PINN` solver"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 2,
|
|
"metadata": {},
|
|
"outputs": [
|
|
{
|
|
"name": "stderr",
|
|
"output_type": "stream",
|
|
"text": [
|
|
"GPU available: False, used: False\n",
|
|
"TPU available: False, using: 0 TPU cores\n",
|
|
"HPU available: False, using: 0 HPUs\n"
|
|
]
|
|
}
|
|
],
|
|
"source": [
|
|
"trainer = Trainer(solver=pinn)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Trainer Accelerator\n",
|
|
"\n",
|
|
"When creating the trainer, **by defualt** the `Trainer` will choose the most performing `accelerator` for training which is available in your system, ranked as follow:\n",
|
|
"1. [TPU](https://cloud.google.com/tpu/docs/intro-to-tpu)\n",
|
|
"2. [IPU](https://www.graphcore.ai/products/ipu)\n",
|
|
"3. [HPU](https://habana.ai/)\n",
|
|
"4. [GPU](https://www.intel.com/content/www/us/en/products/docs/processors/what-is-a-gpu.html#:~:text=What%20does%20GPU%20stand%20for,video%20editing%2C%20and%20gaming%20applications) or [MPS](https://developer.apple.com/metal/pytorch/)\n",
|
|
"5. CPU"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"For setting manually the `accelerator` run:\n",
|
|
"\n",
|
|
"* `accelerator = {'gpu', 'cpu', 'hpu', 'mps', 'cpu', 'ipu'}` sets the accelerator to a specific one"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 3,
|
|
"metadata": {},
|
|
"outputs": [
|
|
{
|
|
"name": "stderr",
|
|
"output_type": "stream",
|
|
"text": [
|
|
"GPU available: False, used: False\n",
|
|
"TPU available: False, using: 0 TPU cores\n",
|
|
"HPU available: False, using: 0 HPUs\n"
|
|
]
|
|
}
|
|
],
|
|
"source": [
|
|
"trainer = Trainer(solver=pinn,\n",
|
|
" accelerator='cpu')"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"as you can see, even if in the used system `GPU` is available, it is not used since we set `accelerator='cpu'`."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Trainer Logging\n",
|
|
"\n",
|
|
"In **PINA** you can log metrics in different ways. The simplest approach is to use the `MetricTraker` class from `pina.callbacks` as seen in the [*Introduction to PINA for Physics Informed Neural Networks training*](https://github.com/mathLab/PINA/blob/master/tutorials/tutorial1/tutorial.ipynb) tutorial.\n",
|
|
"\n",
|
|
"However, expecially when we need to train multiple times to get an average of the loss across multiple runs, `pytorch_lightning.loggers` might be useful. Here we will use `TensorBoardLogger` (more on [logging](https://lightning.ai/docs/pytorch/stable/extensions/logging.html) here), but you can choose the one you prefer (or make your own one).\n",
|
|
"\n",
|
|
"We will now import `TensorBoardLogger`, do three runs of training and then visualize the results. Notice we set `enable_model_summary=False` to avoid model summary specifications (e.g. number of parameters), set it to true if needed.\n"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 4,
|
|
"metadata": {},
|
|
"outputs": [
|
|
{
|
|
"name": "stderr",
|
|
"output_type": "stream",
|
|
"text": [
|
|
"INFO:pytorch_lightning.utilities.rank_zero:GPU available: False, used: False\n",
|
|
"INFO:pytorch_lightning.utilities.rank_zero:TPU available: False, using: 0 TPU cores\n",
|
|
"INFO:pytorch_lightning.utilities.rank_zero:HPU available: False, using: 0 HPUs\n",
|
|
"/home/matte_b/.local/lib/python3.12/site-packages/lightning/pytorch/loops/utilities.py:73: `max_epochs` was not set. Setting it to 1000 epochs. To train without an epoch limit, set `max_epochs=-1`.\n"
|
|
]
|
|
},
|
|
{
|
|
"name": "stdout",
|
|
"output_type": "stream",
|
|
"text": [
|
|
"Epoch 999: 100%|██████████| 1/1 [00:00<00:00, 27.90it/s, v_num=51, val_loss=0.000392, bound_cond_loss=5.29e-6, phys_cond_loss=0.000459, train_loss=0.000465] "
|
|
]
|
|
},
|
|
{
|
|
"name": "stderr",
|
|
"output_type": "stream",
|
|
"text": [
|
|
"INFO:pytorch_lightning.utilities.rank_zero:`Trainer.fit` stopped: `max_epochs=1000` reached.\n"
|
|
]
|
|
},
|
|
{
|
|
"name": "stdout",
|
|
"output_type": "stream",
|
|
"text": [
|
|
"Epoch 999: 100%|██████████| 1/1 [00:00<00:00, 23.95it/s, v_num=51, val_loss=0.000392, bound_cond_loss=5.29e-6, phys_cond_loss=0.000459, train_loss=0.000465]\n"
|
|
]
|
|
},
|
|
{
|
|
"name": "stderr",
|
|
"output_type": "stream",
|
|
"text": [
|
|
"INFO:pytorch_lightning.utilities.rank_zero:GPU available: False, used: False\n",
|
|
"INFO:pytorch_lightning.utilities.rank_zero:TPU available: False, using: 0 TPU cores\n",
|
|
"INFO:pytorch_lightning.utilities.rank_zero:HPU available: False, using: 0 HPUs\n",
|
|
"/home/matte_b/.local/lib/python3.12/site-packages/lightning/pytorch/loops/utilities.py:73: `max_epochs` was not set. Setting it to 1000 epochs. To train without an epoch limit, set `max_epochs=-1`.\n"
|
|
]
|
|
},
|
|
{
|
|
"name": "stdout",
|
|
"output_type": "stream",
|
|
"text": [
|
|
"Epoch 999: 100%|██████████| 1/1 [00:00<00:00, 28.54it/s, v_num=52, val_loss=0.00267, bound_cond_loss=2.42e-5, phys_cond_loss=0.00144, train_loss=0.00146] "
|
|
]
|
|
},
|
|
{
|
|
"name": "stderr",
|
|
"output_type": "stream",
|
|
"text": [
|
|
"INFO:pytorch_lightning.utilities.rank_zero:`Trainer.fit` stopped: `max_epochs=1000` reached.\n"
|
|
]
|
|
},
|
|
{
|
|
"name": "stdout",
|
|
"output_type": "stream",
|
|
"text": [
|
|
"Epoch 999: 100%|██████████| 1/1 [00:00<00:00, 24.59it/s, v_num=52, val_loss=0.00267, bound_cond_loss=2.42e-5, phys_cond_loss=0.00144, train_loss=0.00146]\n"
|
|
]
|
|
},
|
|
{
|
|
"name": "stderr",
|
|
"output_type": "stream",
|
|
"text": [
|
|
"INFO:pytorch_lightning.utilities.rank_zero:GPU available: False, used: False\n",
|
|
"INFO:pytorch_lightning.utilities.rank_zero:TPU available: False, using: 0 TPU cores\n",
|
|
"INFO:pytorch_lightning.utilities.rank_zero:HPU available: False, using: 0 HPUs\n",
|
|
"/home/matte_b/.local/lib/python3.12/site-packages/lightning/pytorch/loops/utilities.py:73: `max_epochs` was not set. Setting it to 1000 epochs. To train without an epoch limit, set `max_epochs=-1`.\n"
|
|
]
|
|
},
|
|
{
|
|
"name": "stdout",
|
|
"output_type": "stream",
|
|
"text": [
|
|
"Epoch 999: 100%|██████████| 1/1 [00:00<00:00, 29.41it/s, v_num=53, val_loss=0.00363, bound_cond_loss=1.02e-5, phys_cond_loss=0.000846, train_loss=0.000856] "
|
|
]
|
|
},
|
|
{
|
|
"name": "stderr",
|
|
"output_type": "stream",
|
|
"text": [
|
|
"INFO:pytorch_lightning.utilities.rank_zero:`Trainer.fit` stopped: `max_epochs=1000` reached.\n"
|
|
]
|
|
},
|
|
{
|
|
"name": "stdout",
|
|
"output_type": "stream",
|
|
"text": [
|
|
"Epoch 999: 100%|██████████| 1/1 [00:00<00:00, 24.69it/s, v_num=53, val_loss=0.00363, bound_cond_loss=1.02e-5, phys_cond_loss=0.000846, train_loss=0.000856]\n"
|
|
]
|
|
}
|
|
],
|
|
"source": [
|
|
"from pytorch_lightning.loggers import TensorBoardLogger\n",
|
|
"\n",
|
|
"# three run of training, by default it trains for 1000 epochs\n",
|
|
"# we reinitialize the model each time otherwise the same parameters will be optimized\n",
|
|
"for _ in range(3):\n",
|
|
" model = FeedForward(\n",
|
|
" layers=[10, 10],\n",
|
|
" func=torch.nn.Tanh,\n",
|
|
" output_dimensions=len(problem.output_variables),\n",
|
|
" input_dimensions=len(problem.input_variables)\n",
|
|
" )\n",
|
|
" pinn = PINN(problem, model)\n",
|
|
" trainer = Trainer(solver=pinn,\n",
|
|
" accelerator='cpu',\n",
|
|
" logger=TensorBoardLogger(save_dir='simpleode'),\n",
|
|
" enable_model_summary=False)\n",
|
|
" trainer.train()"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"We can now visualize the logs by simply running `tensorboard --logdir=simpleode/` on terminal, you should obtain a webpage as the one shown below:"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"<p align=\\\"center\\\">\n",
|
|
"<img src=\"logging.png\" alt=\\\"Logging API\\\" width=\\\"400\\\"/>\n",
|
|
"</p>"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"as you can see, by default, **PINA** logs the losses which are shown in the progress bar, as well as the number of epochs. You can always insert more loggings by either defining a **callback** ([more on callbacks](https://lightning.ai/docs/pytorch/stable/extensions/callbacks.html)), or inheriting the solver and modify the programs with different **hooks** ([more on hooks](https://lightning.ai/docs/pytorch/stable/common/lightning_module.html#hooks))."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Trainer Callbacks"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"Whenever we need to access certain steps of the training for logging, do static modifications (i.e. not changing the `Solver`) or updating `Problem` hyperparameters (static variables), we can use `Callabacks`. Notice that `Callbacks` allow you to add arbitrary self-contained programs to your training. At specific points during the flow of execution (hooks), the Callback interface allows you to design programs that encapsulate a full set of functionality. It de-couples functionality that does not need to be in **PINA** `Solver`s.\n",
|
|
"Lightning has a callback system to execute them when needed. Callbacks should capture NON-ESSENTIAL logic that is NOT required for your lightning module to run.\n",
|
|
"\n",
|
|
"The following are best practices when using/designing callbacks.\n",
|
|
"\n",
|
|
"* Callbacks should be isolated in their functionality.\n",
|
|
"* Your callback should not rely on the behavior of other callbacks in order to work properly.\n",
|
|
"* Do not manually call methods from the callback.\n",
|
|
"* Directly calling methods (eg. on_validation_end) is strongly discouraged.\n",
|
|
"* Whenever possible, your callbacks should not depend on the order in which they are executed.\n",
|
|
"\n",
|
|
"We will try now to implement a naive version of `MetricTraker` to show how callbacks work. Notice that this is a very easy application of callbacks, fortunately in **PINA** we already provide more advanced callbacks in `pina.callbacks`.\n",
|
|
"\n",
|
|
"<!-- Suppose we want to log the accuracy on some validation poit -->"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 5,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"from lightning.pytorch.callbacks import Callback\n",
|
|
"from lightning.pytorch.callbacks import EarlyStopping\n",
|
|
"import torch\n",
|
|
"\n",
|
|
"# define a simple callback\n",
|
|
"class NaiveMetricTracker(Callback):\n",
|
|
" def __init__(self):\n",
|
|
" self.saved_metrics = []\n",
|
|
"\n",
|
|
" def on_train_epoch_end(self, trainer, __): # function called at the end of each epoch\n",
|
|
" self.saved_metrics.append(\n",
|
|
" {key: value for key, value in trainer.logged_metrics.items()}\n",
|
|
" )"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"Let's see the results when applyed to the `SimpleODE` problem. You can define callbacks when initializing the `Trainer` by the `callbacks` argument, which expects a list of callbacks. "
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 6,
|
|
"metadata": {},
|
|
"outputs": [
|
|
{
|
|
"name": "stderr",
|
|
"output_type": "stream",
|
|
"text": [
|
|
"INFO:pytorch_lightning.utilities.rank_zero:GPU available: False, used: False\n",
|
|
"INFO:pytorch_lightning.utilities.rank_zero:TPU available: False, using: 0 TPU cores\n",
|
|
"INFO:pytorch_lightning.utilities.rank_zero:HPU available: False, using: 0 HPUs\n",
|
|
"/home/matte_b/.local/lib/python3.12/site-packages/lightning/pytorch/loops/utilities.py:73: `max_epochs` was not set. Setting it to 1000 epochs. To train without an epoch limit, set `max_epochs=-1`.\n"
|
|
]
|
|
},
|
|
{
|
|
"name": "stdout",
|
|
"output_type": "stream",
|
|
"text": [
|
|
"Epoch 999: 100%|██████████| 1/1 [00:00<00:00, 28.21it/s, v_num=18, val_loss=0.000348, bound_cond_loss=7.54e-5, phys_cond_loss=0.000956, train_loss=0.00103] "
|
|
]
|
|
},
|
|
{
|
|
"name": "stderr",
|
|
"output_type": "stream",
|
|
"text": [
|
|
"INFO:pytorch_lightning.utilities.rank_zero:`Trainer.fit` stopped: `max_epochs=1000` reached.\n"
|
|
]
|
|
},
|
|
{
|
|
"name": "stdout",
|
|
"output_type": "stream",
|
|
"text": [
|
|
"Epoch 999: 100%|██████████| 1/1 [00:00<00:00, 23.99it/s, v_num=18, val_loss=0.000348, bound_cond_loss=7.54e-5, phys_cond_loss=0.000956, train_loss=0.00103]\n"
|
|
]
|
|
}
|
|
],
|
|
"source": [
|
|
"model = FeedForward(\n",
|
|
" layers=[10, 10],\n",
|
|
" func=torch.nn.Tanh,\n",
|
|
" output_dimensions=len(problem.output_variables),\n",
|
|
" input_dimensions=len(problem.input_variables)\n",
|
|
" )\n",
|
|
"pinn = PINN(problem, model)\n",
|
|
"trainer = Trainer(solver=pinn,\n",
|
|
" accelerator='cpu',\n",
|
|
" enable_model_summary=False,\n",
|
|
" callbacks=[NaiveMetricTracker()]) # adding a callbacks\n",
|
|
"trainer.train()"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"We can easily access the data by calling `trainer.callbacks[0].saved_metrics` (notice the zero representing the first callback in the list given at initialization)."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 7,
|
|
"metadata": {},
|
|
"outputs": [
|
|
{
|
|
"data": {
|
|
"text/plain": [
|
|
"[{'val_loss': tensor(1.0595),\n",
|
|
" 'bound_cond_loss': tensor(1.0607),\n",
|
|
" 'phys_cond_loss': tensor(0.0043),\n",
|
|
" 'train_loss': tensor(1.0650)},\n",
|
|
" {'val_loss': tensor(1.0503),\n",
|
|
" 'bound_cond_loss': tensor(1.0522),\n",
|
|
" 'phys_cond_loss': tensor(0.0038),\n",
|
|
" 'train_loss': tensor(1.0560)},\n",
|
|
" {'val_loss': tensor(1.0412),\n",
|
|
" 'bound_cond_loss': tensor(1.0439),\n",
|
|
" 'phys_cond_loss': tensor(0.0033),\n",
|
|
" 'train_loss': tensor(1.0471)}]"
|
|
]
|
|
},
|
|
"execution_count": 7,
|
|
"metadata": {},
|
|
"output_type": "execute_result"
|
|
}
|
|
],
|
|
"source": [
|
|
"trainer.callbacks[0].saved_metrics[:3] # only the first three epochs"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"PyTorch Lightning also has some built in `Callbacks` which can be used in **PINA**, [here an extensive list](https://lightning.ai/docs/pytorch/stable/extensions/callbacks.html#built-in-callbacks). \n",
|
|
"\n",
|
|
"We can for example try the `EarlyStopping` routine, which automatically stops the training when a specific metric converged (here the `train_loss`). In order to let the training keep going forever set `max_epochs=-1`."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [
|
|
{
|
|
"name": "stderr",
|
|
"output_type": "stream",
|
|
"text": [
|
|
"INFO:pytorch_lightning.utilities.rank_zero:GPU available: False, used: False\n",
|
|
"INFO:pytorch_lightning.utilities.rank_zero:TPU available: False, using: 0 TPU cores\n",
|
|
"INFO:pytorch_lightning.utilities.rank_zero:HPU available: False, using: 0 HPUs\n"
|
|
]
|
|
},
|
|
{
|
|
"name": "stdout",
|
|
"output_type": "stream",
|
|
"text": [
|
|
"Epoch 6468: 100%|██████████| 1/1 [00:00<00:00, 19.10it/s, v_num=19, val_loss=0.000129, bound_cond_loss=1.57e-8, phys_cond_loss=3.01e-6, train_loss=3.02e-6] \n"
|
|
]
|
|
}
|
|
],
|
|
"source": [
|
|
"# ~5 mins\n",
|
|
"\n",
|
|
"model = FeedForward(\n",
|
|
" layers=[10, 10],\n",
|
|
" func=torch.nn.Tanh,\n",
|
|
" output_dimensions=len(problem.output_variables),\n",
|
|
" input_dimensions=len(problem.input_variables)\n",
|
|
" )\n",
|
|
"pinn = PINN(problem, model)\n",
|
|
"trainer = Trainer(solver=pinn,\n",
|
|
" accelerator='cpu',\n",
|
|
" max_epochs = -1,\n",
|
|
" enable_model_summary=False,\n",
|
|
" callbacks=[EarlyStopping('train_loss')]) # adding a callbacks\n",
|
|
"trainer.train()"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"As we can see the model automatically stop when the logging metric stopped improving!"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Trainer Tips to Boost Accuracy, Save Memory and Speed Up Training\n",
|
|
"\n",
|
|
"Untill now we have seen how to choose the right `accelerator`, how to log and visualize the results, and how to interface with the program in order to add specific parts of code at specific points by `callbacks`.\n",
|
|
"Now, we well focus on how boost your training by saving memory and speeding it up, while mantaining the same or even better degree of accuracy!\n",
|
|
"\n",
|
|
"\n",
|
|
"There are several built in methods developed in PyTorch Lightning which can be applied straight forward in **PINA**, here we report some:\n",
|
|
"\n",
|
|
"* [Stochastic Weight Averaging](https://pytorch.org/blog/pytorch-1.6-now-includes-stochastic-weight-averaging/) to boost accuracy\n",
|
|
"* [Gradient Clippling](https://deepgram.com/ai-glossary/gradient-clipping) to reduce computational time (and improve accuracy)\n",
|
|
"* [Gradient Accumulation](https://lightning.ai/docs/pytorch/stable/common/optimization.html#id3) to save memory consumption \n",
|
|
"* [Mixed Precision Training](https://lightning.ai/docs/pytorch/stable/common/optimization.html#id3) to save memory consumption \n",
|
|
"\n",
|
|
"We will just demonstrate how to use the first two, and see the results compared to a standard training.\n",
|
|
"We use the [`Timer`](https://lightning.ai/docs/pytorch/stable/api/lightning.pytorch.callbacks.Timer.html#lightning.pytorch.callbacks.Timer) callback from `pytorch_lightning.callbacks` to take the times. Let's start by training a simple model without any optimization (train for 2000 epochs)."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 9,
|
|
"metadata": {},
|
|
"outputs": [
|
|
{
|
|
"name": "stderr",
|
|
"output_type": "stream",
|
|
"text": [
|
|
"Seed set to 42\n",
|
|
"INFO:pytorch_lightning.utilities.rank_zero:GPU available: False, used: False\n",
|
|
"INFO:pytorch_lightning.utilities.rank_zero:TPU available: False, using: 0 TPU cores\n",
|
|
"INFO:pytorch_lightning.utilities.rank_zero:HPU available: False, using: 0 HPUs\n"
|
|
]
|
|
},
|
|
{
|
|
"name": "stdout",
|
|
"output_type": "stream",
|
|
"text": [
|
|
"Epoch 1999: 100%|██████████| 1/1 [00:00<00:00, 22.76it/s, v_num=20, val_loss=4.61e-5, bound_cond_loss=1.22e-6, phys_cond_loss=0.000171, train_loss=0.000172] "
|
|
]
|
|
},
|
|
{
|
|
"name": "stderr",
|
|
"output_type": "stream",
|
|
"text": [
|
|
"INFO:pytorch_lightning.utilities.rank_zero:`Trainer.fit` stopped: `max_epochs=2000` reached.\n"
|
|
]
|
|
},
|
|
{
|
|
"name": "stdout",
|
|
"output_type": "stream",
|
|
"text": [
|
|
"Epoch 1999: 100%|██████████| 1/1 [00:00<00:00, 19.28it/s, v_num=20, val_loss=4.61e-5, bound_cond_loss=1.22e-6, phys_cond_loss=0.000171, train_loss=0.000172]\n",
|
|
"Total training time 92.35361 s\n"
|
|
]
|
|
}
|
|
],
|
|
"source": [
|
|
"from lightning.pytorch.callbacks import Timer\n",
|
|
"from lightning.pytorch import seed_everything\n",
|
|
"\n",
|
|
"# setting the seed for reproducibility\n",
|
|
"seed_everything(42, workers=True)\n",
|
|
"\n",
|
|
"model = FeedForward(\n",
|
|
" layers=[10, 10],\n",
|
|
" func=torch.nn.Tanh,\n",
|
|
" output_dimensions=len(problem.output_variables),\n",
|
|
" input_dimensions=len(problem.input_variables)\n",
|
|
" )\n",
|
|
"\n",
|
|
"pinn = PINN(problem, model)\n",
|
|
"trainer = Trainer(solver=pinn,\n",
|
|
" accelerator='cpu',\n",
|
|
" deterministic=True, # setting deterministic=True ensure reproducibility when a seed is imposed\n",
|
|
" max_epochs = 2000,\n",
|
|
" enable_model_summary=False,\n",
|
|
" callbacks=[Timer()]) # adding a callbacks\n",
|
|
"trainer.train()\n",
|
|
"print(f'Total training time {trainer.callbacks[0].time_elapsed(\"train\"):.5f} s')"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"Now we do the same but with StochasticWeightAveraging"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 10,
|
|
"metadata": {},
|
|
"outputs": [
|
|
{
|
|
"name": "stderr",
|
|
"output_type": "stream",
|
|
"text": [
|
|
"Seed set to 42\n",
|
|
"INFO:pytorch_lightning.utilities.rank_zero:GPU available: False, used: False\n",
|
|
"INFO:pytorch_lightning.utilities.rank_zero:TPU available: False, using: 0 TPU cores\n",
|
|
"INFO:pytorch_lightning.utilities.rank_zero:HPU available: False, using: 0 HPUs\n"
|
|
]
|
|
},
|
|
{
|
|
"name": "stdout",
|
|
"output_type": "stream",
|
|
"text": [
|
|
"Epoch 1598: 100%|██████████| 1/1 [00:00<00:00, 30.61it/s, v_num=21, val_loss=4.54e-5, bound_cond_loss=5.03e-6, phys_cond_loss=0.000247, train_loss=0.000252] "
|
|
]
|
|
},
|
|
{
|
|
"name": "stderr",
|
|
"output_type": "stream",
|
|
"text": [
|
|
"INFO:pytorch_lightning.utilities.rank_zero:Swapping scheduler `ConstantLR` for `SWALR`\n"
|
|
]
|
|
},
|
|
{
|
|
"name": "stdout",
|
|
"output_type": "stream",
|
|
"text": [
|
|
"Epoch 1999: 100%|██████████| 1/1 [00:00<00:00, 28.21it/s, v_num=21, val_loss=3.45e-5, bound_cond_loss=2.41e-7, phys_cond_loss=9.02e-5, train_loss=9.04e-5] "
|
|
]
|
|
},
|
|
{
|
|
"name": "stderr",
|
|
"output_type": "stream",
|
|
"text": [
|
|
"INFO:pytorch_lightning.utilities.rank_zero:`Trainer.fit` stopped: `max_epochs=2000` reached.\n"
|
|
]
|
|
},
|
|
{
|
|
"name": "stdout",
|
|
"output_type": "stream",
|
|
"text": [
|
|
"Epoch 1999: 100%|██████████| 1/1 [00:00<00:00, 22.97it/s, v_num=21, val_loss=3.45e-5, bound_cond_loss=2.41e-7, phys_cond_loss=9.02e-5, train_loss=9.04e-5]\n",
|
|
"Total training time 86.25178 s\n"
|
|
]
|
|
}
|
|
],
|
|
"source": [
|
|
"from lightning.pytorch.callbacks import StochasticWeightAveraging\n",
|
|
"\n",
|
|
"# setting the seed for reproducibility\n",
|
|
"seed_everything(42, workers=True)\n",
|
|
"\n",
|
|
"model = FeedForward(\n",
|
|
" layers=[10, 10],\n",
|
|
" func=torch.nn.Tanh,\n",
|
|
" output_dimensions=len(problem.output_variables),\n",
|
|
" input_dimensions=len(problem.input_variables)\n",
|
|
" )\n",
|
|
"pinn = PINN(problem, model)\n",
|
|
"trainer = Trainer(solver=pinn,\n",
|
|
" accelerator='cpu',\n",
|
|
" deterministic=True,\n",
|
|
" max_epochs = 2000,\n",
|
|
" enable_model_summary=False,\n",
|
|
" callbacks=[Timer(),\n",
|
|
" StochasticWeightAveraging(swa_lrs=0.005)]) # adding StochasticWeightAveraging callbacks\n",
|
|
"trainer.train()\n",
|
|
"print(f'Total training time {trainer.callbacks[0].time_elapsed(\"train\"):.5f} s')"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"As you can see, the training time does not change at all! Notice that around epoch `1600`\n",
|
|
"the scheduler is switched from the defalut one `ConstantLR` to the Stochastic Weight Average Learning Rate (`SWALR`).\n",
|
|
"This is because by default `StochasticWeightAveraging` will be activated after `int(swa_epoch_start * max_epochs)` with `swa_epoch_start=0.7` by default. Finally, the final `mean_loss` is lower when `StochasticWeightAveraging` is used.\n",
|
|
"\n",
|
|
"We will now now do the same but clippling the gradient to be relatively small."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 11,
|
|
"metadata": {},
|
|
"outputs": [
|
|
{
|
|
"name": "stderr",
|
|
"output_type": "stream",
|
|
"text": [
|
|
"Seed set to 42\n",
|
|
"INFO:pytorch_lightning.utilities.rank_zero:GPU available: False, used: False\n",
|
|
"INFO:pytorch_lightning.utilities.rank_zero:TPU available: False, using: 0 TPU cores\n",
|
|
"INFO:pytorch_lightning.utilities.rank_zero:HPU available: False, using: 0 HPUs\n"
|
|
]
|
|
},
|
|
{
|
|
"name": "stdout",
|
|
"output_type": "stream",
|
|
"text": [
|
|
"Epoch 1598: 100%|██████████| 1/1 [00:00<00:00, 27.78it/s, v_num=22, val_loss=1.52e-5, bound_cond_loss=5.29e-8, phys_cond_loss=4.07e-5, train_loss=4.08e-5] "
|
|
]
|
|
},
|
|
{
|
|
"name": "stderr",
|
|
"output_type": "stream",
|
|
"text": [
|
|
"INFO:pytorch_lightning.utilities.rank_zero:Swapping scheduler `ConstantLR` for `SWALR`\n"
|
|
]
|
|
},
|
|
{
|
|
"name": "stdout",
|
|
"output_type": "stream",
|
|
"text": [
|
|
"Epoch 1999: 100%|██████████| 1/1 [00:00<00:00, 28.11it/s, v_num=22, val_loss=0.000427, bound_cond_loss=0.000311, phys_cond_loss=0.000849, train_loss=0.00116] "
|
|
]
|
|
},
|
|
{
|
|
"name": "stderr",
|
|
"output_type": "stream",
|
|
"text": [
|
|
"INFO:pytorch_lightning.utilities.rank_zero:`Trainer.fit` stopped: `max_epochs=2000` reached.\n"
|
|
]
|
|
},
|
|
{
|
|
"name": "stdout",
|
|
"output_type": "stream",
|
|
"text": [
|
|
"Epoch 1999: 100%|██████████| 1/1 [00:00<00:00, 22.88it/s, v_num=22, val_loss=0.000427, bound_cond_loss=0.000311, phys_cond_loss=0.000849, train_loss=0.00116]\n",
|
|
"Total training time 87.27789 s\n"
|
|
]
|
|
}
|
|
],
|
|
"source": [
|
|
"# setting the seed for reproducibility\n",
|
|
"seed_everything(42, workers=True)\n",
|
|
"\n",
|
|
"model = FeedForward(\n",
|
|
" layers=[10, 10],\n",
|
|
" func=torch.nn.Tanh,\n",
|
|
" output_dimensions=len(problem.output_variables),\n",
|
|
" input_dimensions=len(problem.input_variables)\n",
|
|
" )\n",
|
|
"pinn = PINN(problem, model)\n",
|
|
"trainer = Trainer(solver=pinn,\n",
|
|
" accelerator='cpu',\n",
|
|
" max_epochs = 2000,\n",
|
|
" enable_model_summary=False,\n",
|
|
" gradient_clip_val=0.1, # clipping the gradient\n",
|
|
" callbacks=[Timer(),\n",
|
|
" StochasticWeightAveraging(swa_lrs=0.005)])\n",
|
|
"trainer.train()\n",
|
|
"print(f'Total training time {trainer.callbacks[0].time_elapsed(\"train\"):.5f} s')"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"As we can see we by applying gradient clipping we were able to even obtain lower error!\n",
|
|
"\n",
|
|
"## What's next?\n",
|
|
"\n",
|
|
"Now you know how to use efficiently the `Trainer` class **PINA**! There are multiple directions you can go now:\n",
|
|
"\n",
|
|
"1. Explore training times on different devices (e.g.) `TPU` \n",
|
|
"\n",
|
|
"2. Try to reduce memory cost by mixed precision training and gradient accumulation (especially useful when training Neural Operators)\n",
|
|
"\n",
|
|
"3. Benchmark `Trainer` speed for different precisions."
|
|
]
|
|
}
|
|
],
|
|
"metadata": {
|
|
"kernelspec": {
|
|
"display_name": "Python 3",
|
|
"language": "python",
|
|
"name": "python3"
|
|
},
|
|
"language_info": {
|
|
"codemirror_mode": {
|
|
"name": "ipython",
|
|
"version": 3
|
|
},
|
|
"file_extension": ".py",
|
|
"mimetype": "text/x-python",
|
|
"name": "python",
|
|
"nbconvert_exporter": "python",
|
|
"pygments_lexer": "ipython3",
|
|
"version": "3.12.3"
|
|
}
|
|
},
|
|
"nbformat": 4,
|
|
"nbformat_minor": 2
|
|
}
|