Physics-Informed ML (PINN / SciML) | AI & ML for Electrochemists

1. Why Physics-Informed?

Battery systems are governed by well-established physical principles—mass conservation, charge neutrality, diffusion, electrochemical kinetics. Traditional "black-box" neural networks ignore these constraints and must attempt to rediscover them from scratch using vast amounts of data. This approach suffers from three chronic problems that Physics-Informed Machine Learning (SciML/PINN) directly addresses.

1.1 Overcoming Data Inefficiency

High-quality aging and performance datasets for batteries are expensive and time-consuming to acquire. By encoding known physics into the loss function, we provide a powerful form of regularization. This allows the model to learn from sparse data points, as the physics "fills in the gaps." As a result, PINNs can achieve high accuracy with a fraction of the data required by purely data-driven models.

[Learning Curve Plot: A graph with 'Number of Data Points' on a log x-axis and 'Prediction Error' on the y-axis. Three curves are shown: 1. A 'Pure NN' curve starts with high error and decreases slowly. 2. A 'PINN' curve starts much lower and decreases quickly to a low error plateau. 3. A 'FEM/Solver' curve is a flat line, representing high accuracy but being data-independent.]

1.2 Ensuring Physical Plausibility

A purely data-driven model, trained on a limited operational range, can make wildly unphysical predictions when extrapolating. For example, it might predict a battery's capacity increasing during a high-current discharge, violating the laws of thermodynamics. PINNs solve this by constraining the entire solution space. The physics residual in the loss function acts as a "guardrail," ensuring that any prediction, even in unseen domains, must conform to the governing equations.

[Physical Constraint Space Diagram: A 2D space representing all possible solutions. A large, amorphous cloud represents the "Solution Space of a Pure NN." Inside it, a smaller, well-defined convex shape is labeled "Physically Plausible Solutions (Constrained by PINN)." Data points are scattered within the smaller shape.]

1.3 Enhancing Interpretability

Standard neural networks are often "black boxes," making it difficult to understand their internal reasoning. In contrast, a PINN learns a continuous surrogate model of the system's state variables. This means we can probe the trained model to visualize and extract hidden states that are difficult or impossible to measure experimentally, such as the Li-ion concentration profile across an electrode or the evolution of SEI layer thickness over time. This transforms the model from a simple predictor into a tool for scientific discovery.

2. PINN Framework — Nuts & Bolts

The core of a PINN is a standard feed-forward neural network, but its training and application are fundamentally different from traditional deep learning. It's designed not just to fit data, but to obey the laws of physics.

2.1 The Neural Network as a Function Approximator

Thanks to the Universal Approximation Theorem, a sufficiently deep neural network can approximate any smooth, continuous function. In a PINN, the network \(\text{NN}_\theta(t, \mathbf{x})\) is trained to be a surrogate for the solution of a PDE, \(u(t, \mathbf{x})\). It takes the independent variables (time \(t\), spatial coordinates \(\mathbf{x}\)) as inputs and outputs the predicted value of the solution \(u_\theta\). This approach is mesh-free and provides a solution that is continuous and differentiable everywhere.

2.2 Enforcing Physics via Collocation Points

How do we make the network obey a physical law? We define the law as a differential equation, and its residual, \(r(t, \mathbf{x})\), which should be zero for an exact solution.

\[r(t, \mathbf{x}) := \mathcal{N}[u_\theta(t, \mathbf{x})] \quad (\text{e.g., } \mathcal{N}[u] = \frac{\partial u}{\partial t} - D \frac{\partial^2 u}{\partial x^2})\]

We then sample thousands of collocation points across the entire spatio-temporal domain. These are points where we don't have data, but where we know the physics must hold. The physics loss, \(\mathcal{L}_{\text{PDE}}\), is the mean squared error of the residual at these points, driving it towards zero everywhere.

Collocation Sampling Strategies

The choice of sampling strategy for collocation points is crucial for stable training. While a uniform grid is simple, it can be inefficient. Quasi-random sequences provide much better domain coverage for the same number of points, preventing alignment with coordinate axes and ensuring a more uniform exploration of the solution space.

[Two-panel comparison of collocation points in a 2D square domain. Left: 'Uniform Grid Sampling' shows points in a rigid grid pattern. Right: 'Quasi-random Sobol Sampling' shows points that are more evenly spread out, with no clear linear patterns.]

Advanced methods like Residual-Based Adaptive Refinement (RAR) go a step further, iteratively adding new collocation points in regions where the PDE residual is currently highest, focusing the network's attention where it's struggling the most.

2.3 Handling Boundary and Initial Conditions

Boundary (BC) and Initial (IC) conditions are critical for a unique PDE solution. PINNs can enforce them in two main ways:

Soft Constraints (Penalty Method)

This is the most common approach. We treat the BC/IC as another loss term, penalizing the model for deviating from the required values at the domain boundaries. For a boundary condition \(u(t, x_b) = g(t)\), the loss is:

\[\mathcal{L}_{\text{BC}} = \frac{1}{N_b} \sum_{i=1}^{N_b} \| u_\theta(t_i, x_{b,i}) - g(t_i) \|^2\]

This method is flexible but relies on proper weighting in the composite loss function.

Hard Constraints (By Construction)

A more elegant approach is to design the network's output to satisfy the BC/IC by construction. This is done by multiplying the raw network output by a carefully chosen function that is zero at the boundaries. For example, to enforce a Dirichlet BC \(u(t,0)=A\) and \(u(t,1)=B\) on a domain \(x \in [0,1]\), we can define a transformed output \(\hat{u}_\theta\):

\[ \hat{u}_\theta(t,x) = (1-x)A + xB + x(1-x)\text{NN}_\theta(t,x) \]

This formulation guarantees that \(\hat{u}_\theta(t,0)=A\) and \(\hat{u}_\theta(t,1)=B\) regardless of the output of \(\text{NN}_\theta\), removing the need for a boundary loss term entirely.

[Two-panel 3D surface plot. Left: 'Raw NN Output' shows a surface that does not respect the boundary values. Right: 'Hard-Constraint Output' shows the same surface transformed, now perfectly matching the required values at the boundaries x=0 and x=1.]

# PyTorch pseudocode for a hard-constraint transformation
def forward(self, t, x):
    nn_output = self.network(torch.cat([t, x], dim=1))
    # Enforce u(t,0)=A and u(t,1)=B
    A, B = 0.0, 1.0 
    transformed_output = (1 - x) * A + x * B + x * (1 - x) * nn_output
    return transformed_output

2.4 The Scope of Solvable PDEs

The PINN framework is highly general and can be applied to a wide range of differential equations encountered in science and engineering.

PDE Type	Example Name	Equation	Application Area
1D, Time-dependent	Heat/Diffusion Eq.	\(\frac{\partial u}{\partial t} = \alpha \frac{\partial^2 u}{\partial x^2}\)	Li-ion diffusion in 1D
2D, Steady-state	Poisson's Eq.	\(\frac{\partial^2 u}{\partial x^2} + \frac{\partial^2 u}{\partial y^2} = f(x,y)\)	Electrostatics, heat distribution
2D, Time-dependent	Wave Eq.	\(\frac{\partial^2 u}{\partial t^2} = c^2 \left(\frac{\partial^2 u}{\partial x^2} + \frac{\partial^2 u}{\partial y^2}\right)\)	Acoustics, electromagnetics
Higher-order	Navier-Stokes Eq.	\(\frac{\partial \mathbf{v}}{\partial t} + (\mathbf{v} \cdot \nabla)\mathbf{v} = -\frac{1}{\rho}\nabla p + \nu \nabla^2 \mathbf{v}\)	Fluid dynamics in electrolytes

3. Composite Loss Design

A PINN's objective is a multi-task loss, typically a weighted sum of terms for the data, the PDE residual, and the boundary/initial conditions:

\[\mathcal{L}=\lambda_{d}\,\mathcal{L}_{\text{data}}+\lambda_{p}\,\mathcal{L}_{\text{PDE}}+\lambda_{b}\,\mathcal{L}_{\text{BC}}+\lambda_{i}\,\mathcal{L}_{\text{IC}}\]

The balance between these terms is the most critical aspect of successful PINN training. If the PDE loss dominates too early, the model might ignore the data and converge to a trivial solution (e.g., zero). If the data loss dominates, the model may overfit and violate physics. Choosing the weights (\(\lambda\)) is a key challenge.

3.1 Static Weighting Strategies

Manual Tuning: The simplest approach, but requires extensive trial and error. It's often difficult to find a single set of weights that works well throughout the entire training process.
Scheduled Annealing: A more structured approach where weights change over time. A common strategy is to start with a high weight on the data term (\(\lambda_d\)) to anchor the model to observations, and then gradually increase the weight on the physics term (\(\lambda_p\)) to enforce the physical laws.

[A timeline graph showing loss weight scheduling. The x-axis is 'Training Epochs'. Two lines are plotted: '\(\lambda_{data}\)' starts high and decreases. '\(\lambda_{physics}\)' starts low and increases, crossing over the data weight midway through training.]

3.2 Adaptive Weighting Strategies

Modern approaches automate the balancing act by dynamically adjusting the weights during training based on the behavior of the gradients.

Gradient-Normalization (GradNorm)

The core idea is to prevent any single loss term from producing overwhelmingly large gradients that dominate the training updates. GradNorm dynamically adjusts the weights \(\lambda_k\) to keep the gradient norms for each loss term \(\mathcal{L}_k\) on a similar scale.

\[ \text{Goal: } \|\nabla_\theta (\lambda_k \mathcal{L}_k)\| \approx \text{Average Gradient Norm} \]

This ensures a more balanced "tug-of-war" between the different objectives, leading to more stable training.

# Pseudocode for one step of Gradient Normalization
for each loss term L_k:
    grad_norm_k = compute_gradient_norm(L_k, model.parameters())
    
avg_grad_norm = average(all grad_norm_k)

for each loss term L_k:
    loss_ratio = avg_grad_norm / grad_norm_k
    # Update the weight for this loss term (with some learning rate alpha)
    lambda_k = lambda_k * (loss_ratio ** alpha)

Neural Tangent Kernel (NTK) Weighting

A more advanced technique based on the insight that different loss terms can cause the network to learn at very different speeds. The Neural Tangent Kernel (NTK) can be used to estimate these learning rates. NTK-based weighting adjusts the \(\lambda\) values so that all loss terms contribute roughly equally to the training dynamics, ensuring that the model doesn't get "stuck" learning only the "easiest" part of the problem.

\[ \lambda_k(t+1) = \lambda_k(t) \exp \left( \frac{\text{diag}(\hat{\Theta})}{\text{diag}(\hat{\Theta}_k)} - 1 \right) \]

Here, \(\hat{\Theta}\) is the trace of the NTK for the total loss and \(\hat{\Theta}_k\) is for the individual loss term. This method essentially re-weights the losses so that their effective learning rates are equalized.

[Animated GIF concept: A 2D plot of loss curves. The x-axis is 'Epochs', y-axis is 'Loss Value'. Two curves, 'Data Loss' and 'Physics Loss', are shown. A slider controls '\(\lambda_{phys}\) / \(\lambda_{data}\)'. As the slider moves, the animation shows how one loss curve flattens while the other drops sharply, illustrating the trade-off.]

4. Electrochemical Equations to Encode

While simple ODEs can be encoded, the real power of PINNs in electrochemistry comes from their ability to solve systems of coupled, non-linear PDEs that describe battery behavior. The Doyle-Fuller-Newman (DFN) model, also known as the Pseudo-2D (P2D) model, is the canonical example.

4.1 The Doyle-Fuller-Newman (DFN) Model Framework

The DFN model is not a single equation, but a system of coupled equations describing ion transport and reaction kinetics across the different components of a Li-ion cell: the negative electrode (anode), separator, and positive electrode (cathode).

[Coupled Equation Flowchart for DFN Model. Shows three main blocks: 'Solid Phase', 'Electrolyte Phase', and 'Interface'. Arrows indicate coupling: Electrolyte concentration \(c_e\) and solid potential \(\phi_s\) feed into 'Interface (Butler-Volmer)'. The output of Butler-Volmer, \(j_{int}\), acts as a source/sink term for the 'Electrolyte Phase' and as a boundary condition for the 'Solid Phase'.]

A Minimal DFN Model for PINNs

A PINN for a DFN model would typically solve for four key state variables: \(c_s(t,x,r)\), \(c_e(t,x)\), \(\phi_s(t,x)\), and \(\phi_e(t,x)\). The governing equations are enforced as residuals in the loss function:

Component	Physics	Governing Equation
Solid Phase (Anode/Cathode)	Li-ion diffusion in active material particles (spherical coordinates).	\(\frac{\partial c_s}{\partial t} = \frac{1}{r^2} \frac{\partial}{\partial r} \left( r^2 D_s \frac{\partial c_s}{\partial r} \right)\)
Electrolyte Phase	Li-ion transport via diffusion and migration in the electrolyte.	\(\epsilon_e \frac{\partial c_e}{\partial t} = \frac{\partial}{\partial x} \left( D_e^{\text{eff}} \frac{\partial c_e}{\partial x} \right) + \frac{a_s(1-t_+^0)}{F} j_{\text{int}}\)
Interface (Pore Walls)	Electrochemical reaction kinetics.	\(j_{\text{int}} = i_0 \left( \exp\left(\frac{\alpha_a F \eta}{RT}\right) - \exp\left(-\frac{\alpha_c F \eta}{RT}\right) \right)\)
Potential Fields	Charge conservation (Ohm's law) in solid and electrolyte phases.	\(\nabla \cdot (\sigma_s^{\text{eff}} \nabla \phi_s) = -a_s F j_{\text{int}}\) \(\nabla \cdot (\kappa_e^{\text{eff}} \nabla \phi_e + \dots) = a_s F j_{\text{int}}\)

4.2 Mapping Physics to the Neural Network

To implement this, we define the inputs and outputs of our neural network and the variables it needs to predict.

Variable & Unit Definitions

Key Variables:

\(c_s\): Li-ion concentration in solid (mol·m⁻³)
\(c_e\): Li-ion concentration in electrolyte (mol·m⁻³)
\(\phi_s\): Solid phase potential (V)
\(\phi_e\): Electrolyte phase potential (V)
\(j_{\text{int}}\): Volumetric transfer current density (A·m⁻³)
\(\eta\): Overpotential (V)
\(D_s, D_e\): Diffusion coefficients (m²·s⁻¹)

Network Input-Output Mapping

A single neural network can be trained to predict all state variables simultaneously. The independent variables of the system are the inputs to the network, and the state variables are the outputs.

Role	Variables	Description
Network Inputs	\(t, x, r\)	Time, spatial position across the cell, and radial position within a particle.
Network Outputs	\(c_s, c_e, \phi_s, \phi_e\)	The four primary state variables the PINN learns to approximate.
Derived Quantities	\(j_{\text{int}}, \eta\), etc.	Calculated from the network outputs and their gradients using the physical equations (e.g., \(\eta = \phi_s - \phi_e - U\)).

5. Hybrid Architectures in SciML

For very stiff or multiscale problems where pure PINNs can struggle, hybrid architectures offer robust and efficient alternatives by combining the strengths of traditional numerical solvers and neural networks.

5.1 Grey-Box Models: Learning the Unknown Physics

In many systems, we have reliable models for some physical processes but not for others. A grey-box model uses a traditional numerical solver (e.g., Finite Volume Method) for the well-understood parts and inserts a neural network to learn a difficult-to-model "closure term."

Example: Learning Battery Kinetics

Consider modeling a porous electrode. The macro-scale diffusion and migration in the electrolyte can be handled efficiently by a standard FVM solver. However, the interfacial reaction kinetics (the Butler-Volmer equation) might be complex, non-ideal, or dependent on unknown degradation states. In a grey-box approach, the FVM solver would compute the concentration and potential fields at each time step and pass them to a neural network. The NN then predicts the local reaction rate (\(j_{int}\)), which is fed back into the solver as a source term for the next time step.

[Grey-box Pipeline Diagram: A loop is shown. A block labeled 'FVM Solver' computes \(c_e, \phi_e, \phi_s\). An arrow points from this to a block labeled 'Neural Network Kinetics', which takes these values as input and outputs a learned reaction rate, \(j_{NN}\). An arrow points back from the NN to the FVM solver, indicating \(j_{NN}\) is used as a source term.]

This approach leverages the stability and accuracy of numerical solvers while using the flexibility of neural networks to capture complex, data-driven phenomena where first-principles models are inadequate.

5.2 Neural ODEs: Learning the Dynamics

For systems described by Ordinary Differential Equations (ODEs), particularly in time-series analysis, Neural ODEs are a powerful choice. Instead of modeling the state \(x(t)\) directly, a neural network \(f_\theta\) is used to learn its derivative, \(dx/dt = f_\theta(x,t)\). This network is then embedded within a high-quality adaptive ODE solver (like Runge-Kutta) which handles the time integration. This avoids the need for time-based collocation points and can be more robust for stiff or complex temporal dynamics.

5.3 Operator Learners: Learning the Solution Operator

Operator learners like Fourier Neural Operator (FNO) and DeepONet represent a paradigm shift. Instead of solving a single problem instance, they learn the entire solution operator \(\mathcal{G}\) that maps from a space of input functions (e.g., initial conditions, boundary conditions) to the solution function: \(u = \mathcal{G}(a)\).

After a long, offline training phase on thousands of simulation examples, a trained operator can perform inference for a *new* input function almost instantly (\(\ll 1\) second). This makes them incredibly powerful for building fast surrogate models for design optimization, uncertainty quantification, and control applications where many forward solves are needed.

5.4 Performance Trade-offs: Speed vs. Accuracy

The choice of architecture involves a trade-off between inference speed, training cost, and data requirements.

[Speed-Accuracy Trade-off Scatter Plot. X-axis: 'Inference Time (log scale)', Y-axis: 'Prediction Accuracy (%)'. - Top-left: 'Traditional Solver (e.g., COMSOL)' has highest accuracy but very long inference time. - Top-right: 'Operator Learner (FNO)' has high accuracy and extremely fast inference time (microseconds). - Middle: 'PINN' has good accuracy and moderate inference time (milliseconds). The plot illustrates that FNO is ideal for rapid repeated queries after a one-time, high training cost, while PINNs are better for solving specific inverse problems or when training data is sparse.]

6. Training Strategies & AD

What: Training a PINN involves minimizing a composite loss function using gradient descent.

Why: Standard training can be unstable; specialized strategies for sampling and optimization are crucial for success.

How: Use Automatic Differentiation to compute PDE residuals, sample collocation points wisely, and use modern optimizers.

[Training Pipeline Flowchart: Input (t,x) → Neural Network → Output (u) → Automatic Differentiation (calculates ∂u/∂t, ∂²u/∂x²) → Form PDE Residual → Calculate Composite Loss → Optimizer (Adam) updates Network Weights.]

6.1 Automatic Differentiation (AD)

AD is the engine that makes PINNs possible. It's a technique used by frameworks like PyTorch and JAX to compute exact derivatives of any function, no matter how complex.

① The framework builds a computational graph of all operations. → ② It applies the chain rule backwards through this graph (reverse-mode AD). → ③ This yields the exact gradient of the output (e.g., \(u_\theta\)) with respect to any input (e.g., \(t, x\)) or parameter (\(\theta\)), allowing us to define PDE residuals without manual derivation.

6.2 Collocation Sampling

Smart sampling of collocation points is vital for stable and efficient training. The goal is to focus the network's attention where it's needed most.

[Mistake vs. Correction (2-panel plot). Left: 'Poor Sampling' shows collocation points on a coarse, uniform grid, with a heatmap of high PDE residual in between the points. Right: 'Good Sampling' (Sobol + RAR) shows points clustered in the high-residual area, leading to a uniform, low residual across the domain.]

# Generating Sobol sequence points is easy
from scipy.stats import qmc

sampler = qmc.Sobol(d=2, scramble=True)
sample = sampler.random_base2(m=10) # 2^10 = 1024 points
# ... scale sample to your domain ...
# For full code, see github.com/user/repo

6.3 Domain Decomposition

For problems with different physical domains or sharp interfaces (e.g., anode/separator/cathode), a single PINN can struggle. Domain decomposition is a powerful strategy to handle this.

① Train a separate neural network for each subdomain. → ② Enforce the primary physics (PDEs) within each subdomain as usual. → ③ Add extra loss terms that enforce continuity and flux conservation at the interfaces between the subdomains, ensuring a smooth and physically correct global solution.

[Domain Decomposition Diagram: Three blocks labeled 'NN Anode', 'NN Separator', 'NN Cathode'. Arrows show continuity constraints (e.g., \(\phi_{anode} = \phi_{sep}\)) and flux constraints (e.g., \(J_{anode} = J_{sep}\)) at the interfaces.]

6.4 Optimizer & Hyperparameter Tips

The final piece of the puzzle is the optimization algorithm and its settings.

Hyperparameter	Typical Range	Guideline
Optimizer	Adam, L-BFGS	Start with Adam for global search, then fine-tune with L-BFGS.
Learning Rate	1e-3 to 1e-4	Use a scheduler (e.g., exponential decay) to reduce over time.
\(\lambda_{PDE}\), \(\lambda_{data}\)	0.1 to 100	Use adaptive methods (Ch. 3) or anneal during training.
Batch Size	256 to 4096	Larger batches give more stable gradients.

Pre-Training Checklist

All inputs/outputs are non-dimensionalized and scaled to \(\sim \mathcal{O}(1)\).
Collocation points cover the entire domain, preferably with a quasi-random sequence.
The network architecture is sufficiently deep/wide for the problem's complexity.
Loss weights (\(\lambda\)) are set, either manually or with an adaptive scheme.
The learning rate scheduler is configured.

Further Reading:

PINN training stability depends on two pillars: proper scaling and balanced sampling.

7. Case Study: Hybrid Model for Battery Prognosis

This case study examines a state-of-the-art hybrid model that combines a Variational Autoencoder (VAE) with a PINN to predict battery State-of-Health (SoH) and Remaining Useful Life (RUL).

7.1 The Challenge: Predicting Battery Lifespan

Accurately predicting how a battery will degrade is extremely difficult. Every battery is slightly different due to manufacturing variations, and their aging paths are highly sensitive to usage patterns. Traditional models struggle because they either require vast amounts of run-to-failure data (which is expensive) or they fail to capture the stochastic, cell-to-cell variability.

7.2 The Approach: A VAE-PINN Hybrid

This research introduces a powerful hybrid architecture to tackle the problem by leveraging the strengths of both generative and physics-informed models.

① The VAE (Data Interpreter): A Variational Autoencoder first takes high-dimensional battery data (like a voltage curve from a single cycle) and compresses it into a low-dimensional latent space. This space represents the fundamental "health state" of the battery, effectively filtering out noise and capturing the unique signature of each cell.
② The PINN (Physics Enforcer): A Physics-Informed Neural Network then operates on this latent health space. It learns the trajectory of the health state over time, but its learning is constrained by a known physical degradation model (e.g., an empirical ODE for capacity fade). The PINN ensures that the predicted degradation path is physically plausible.

[Architecture of the VAE-PINN. An input voltage curve V(t) is fed into a VAE Encoder, which outputs a latent vector 'z'. The PINN takes 'z' and cycle number 'N' as input, and predicts the next latent state 'z_N+1'. A VAE Decoder then reconstructs the full voltage curve from 'z_N+1'. The PINN's evolution is constrained by a physics loss.]

In essence, the VAE handles the "what" (what is the current health?), and the PINN handles the "how" (how does this health evolve according to physics?).

7.3 Key Findings and Impact

The results of this hybrid approach demonstrate a significant leap in battery prognosis:

Exceptional Data Efficiency: The model accurately predicted the RUL of batteries using data from only the first 100 cycles, outperforming traditional data-driven models that require much more data.
Robustness to Variation: By learning a probabilistic latent space, the VAE component effectively handled the inherent cell-to-cell variations, leading to more reliable predictions across a fleet of batteries.
Accurate Long-Term Forecasting: The physics constraints from the PINN component prevented the model from making unrealistic long-term predictions, a common failure mode for pure ML models. The model could accurately forecast capacity fade far into the future.

[Key result plot from the paper: Predicted vs. Actual Remaining Useful Life (RUL). The points for the VAE-PINN model lie very close to the y=x line, indicating high accuracy, even for predictions made from early-cycle data. Other models show much larger scatter.]

7.4 Source

He, J., He, S., Zhang, S. et al. Variational autoencoder-enhanced physics-informed neural networks for battery state-of-health and remaining useful life prediction. Nat Commun 15, 4088 (2024). https://doi.org/10.1038/s41467-024-48779-z

8. Lab: Modeling Battery Degradation

Goal: We have sparse measurements of a battery's capacity over its cycle life. We will build a PINN to learn a continuous capacity degradation function, \(C(t)\), constrained by a simple physical law: \( \frac{dC}{dt} = -kC \), where \(k\) is an unknown degradation rate constant that the model will also learn as a parameter.

9. Practical Tips & Debug Checklist

Non-dimensionalize Everything: Before training, scale all variables (time, space, concentrations) to be of a similar order of magnitude (\(\sim \mathcal{O}(1)\)). This is the single most important trick for stable training.
Check Your Gradients: If training is unstable, manually inspect the magnitude of the gradients from each loss term. If one is orders of magnitude larger than the others, it will dominate. Adjust loss weights accordingly.
Verify on a Known Problem: Before tackling an unknown system, always verify your PINN implementation on a problem with a known analytical solution to ensure the code is correct.
Use Gradient Checkpointing: For large models or high-order derivatives, use gradient checkpointing (e.g., `torch.utils.checkpoint`) to trade compute for a significant reduction in memory usage.

10. Key References & Toolkits

Raissi, Perdikaris, & Karniadakis, "Physics-informed neural networks...", J. Comp. Phys. 2019.
Karniadakis et al., "Physics-informed machine learning", Nature Reviews Physics 2021.
Lu L. et al., "DeepXDE: A deep learning library for solving differential equations", SIAM Review 2021.
Toolkits: DeepXDE, NVIDIA SimNet, SciML.ai (Julia), SciANN.