
What's New in PyTorch Lightning 1.5?
Earlier this month, PyTorch Lightning 1.5 became available, and it's a big update! It introduces support for LightningLite, Fault-tolerant Training, Loop Customization, Lightning Tutorials, LightningCLI V2, RichProgressBar, CheckpointIO Plugin, Trainer Strategy flag, and more!
What is PyTorch Lightning?
PyTorch Lightning provides a high-level interface for PyTorch, a popular deep learning framework, and is ideal for anyone doing high-performance AI research. It allows you to scale your models without extra bloat and boilerplate, and is definitely something you should take a look at if you use PyTorch. If you haven't had a chance to check out PyTorch Lightning yet, you can find a good overview and how to get started in our other blog posts:
- Introduction to PyTorch Lightning
- PyTorch Lightning Tutorial #1: Getting Started
- PyTorch Lightning Tutorial #2: Using TorchMetrics and Lightning Flash
Interested in a deep learning workstation?
Learn more about Exxact AI workstations starting at $3,700
Highlights in Version 1.5
With over 60 contributors working on features, bugfixes, and documentation improvements, version 1.5 is the Lightning team's biggest release to date. Read on for the highlights of this version.
Fault-tolerant Training
Fault-tolerant Training is a new internal mechanism that enables PyTorch Lightning to recover from a hardware or software failure. This is particularly interesting when training in the cloud on preemptible instances, which can shut down at any time. When a Lightning experiment exits unexpectedly, a temporary checkpoint is saved that contains the exact state of all loops and the model. With this new experimental feature, you can restore your training mid-epoch on the exact batch and continue training as if it had never been interrupted.
PL_FAULT_TOLERANT_TRAINING=1 python train.py
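If you prefer not to set the variable on the command line, a minimal sketch (assuming the flag is read when the Trainer is constructed) is to export it from the training script itself:
import os
from pytorch_lightning import Trainer

# opt in to the experimental fault-tolerant mode before creating the Trainer
os.environ["PL_FAULT_TOLERANT_TRAINING"] = "1"

trainer = Trainer(max_epochs=10)
trainer.fit(model)  # `model` is your LightningModule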
LightningLite
LightningLite enables pure PyTorch users to scale their existing code to any kind of hardware while retaining full control over their own loops and optimization logic.
With just a few lines of code and no large refactoring, you get support for multi-device and multi-node training, different accelerators (CPU, GPU, TPU), native automatic mixed precision (half and bfloat16), and double precision, and no special launcher is required. Check out the documentation here.
import torch
import torch.nn.functional as F
from torch import optim
from pytorch_lightning.lite import LightningLite


class Lite(LightningLite):
    def run(self):
        # Let Lite set up your dataloader(s)
        train_loader = self.setup_dataloaders(torch.utils.data.DataLoader(...))

        model = Net()  # Net is your own torch.nn.Module; .to() not needed
        optimizer = optim.Adam(model.parameters())
        # Let Lite set up your model and optimizer
        model, optimizer = self.setup(model, optimizer)

        for epoch in range(5):
            for data, target in train_loader:
                optimizer.zero_grad()
                output = model(data)  # data is already on the device
                loss = F.nll_loss(output, target)
                self.backward(loss)  # instead of loss.backward()
                optimizer.step()


Lite(accelerator="gpu", devices="auto").run()
Loop Customization
With the new Lightning Loop API in v1.5, you can write your own training loops for any kind of research from active learning to recommendation systems.
The new Loop API lets advanced users swap out the default gradient descent optimization loop at the core of Lightning with a different optimization paradigm. This is part of an effort to make Lightning the simplest, most flexible framework to take any kind of deep learning research to production.
Read a comprehensive introduction to loops here.
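To give a feel for the interface, here is a hedged sketch rather than the official recipe: it assumes the Loop base class (imported from pytorch_lightning.loops.base in the 1.5 layout) expects a done property plus reset and advance methods, and the class and attribute names below are made up for illustration.
from pytorch_lightning.loops.base import Loop


class SimpleEpochLoop(Loop):
    def __init__(self, dataloader, model, optimizer):
        super().__init__()
        self.dataloader = dataloader
        self.model = model
        self.optimizer = optimizer
        self.batch_idx = 0

    @property
    def done(self):
        # stop after one pass over the dataloader
        return self.batch_idx >= len(self.dataloader)

    def reset(self):
        self.batch_idx = 0
        self._iterator = iter(self.dataloader)

    def advance(self, *args, **kwargs):
        # one unit of work: a single optimization step on the next batch
        batch = next(self._iterator)
        loss = self.model.training_step(batch, self.batch_idx)
        loss.backward()
        self.optimizer.step()
        self.optimizer.zero_grad()
        self.batch_idx += 1
Attaching such a loop to a Trainer, for example by replacing one of the default fit loops, is covered in the loops documentation linked above.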
New Rich Progress Bar
The Lightning team integrated Rich and created a new and improved progress bar for Lightning. Try it out:
pip install rich
from pytorch_lightning import Trainer
from pytorch_lightning.callbacks import RichProgressBar

trainer = Trainer(callbacks=[RichProgressBar()])
New Trainer Arguments: Strategy and Devices
With the new strategy and devices arguments in the Trainer, it is now easier to switch from one type of hardware to another.
| Before | After |
|---|---|
| Trainer(accelerator="ddp", gpus=2) | Trainer(accelerator="gpu", devices=2, strategy="ddp") |
| Trainer(accelerator="ddp_cpu", num_processes=2) | Trainer(accelerator="cpu", devices=2, strategy="ddp") |
| Trainer(accelerator="tpu_spawn", tpu_cores=8) | Trainer(accelerator="tpu", devices=8) |
The new devices argument is now agnostic to all accelerators, but the previous arguments gpus, tpu_cores, ipus are still available and work the same as before. In addition, it is now also possible to set devices="auto" or accelerator="auto" to select the best accelerator available on the hardware.
from pytorch_lightning import Trainer

trainer = Trainer(accelerator="auto", devices="auto")
LightningCLI V2
This release adds support for running not just Trainer.fit but any of the Trainer entry points.
python script.py fit
python script.py test
LightningCLI now supports registries for callbacks, optimizers, learning rate schedulers, LightningModules and LightningDataModules. This greatly improves the command line experience as only the class names and arguments are required as follows:
python script.py \
--trainer.callbacks=EarlyStopping \
--trainer.callbacks.patience=5 \
--trainer.callbacks=LearningRateMonitor \
--trainer.callbacks.logging_interval=epoch \
--optimizer=Adam \
--optimizer.lr=0.01 \
--lr_scheduler=OneCycleLR \
--lr_scheduler.anneal_strategy=linear
They've also added support for a manual mode where the CLI takes care of the instantiation but you have control over the Trainer calls:
cli = LightningCLI(MyModel, run=False)
cli.trainer.fit(cli.model)
CheckpointIO Plugins
As part of their commitment to extensibility, they've abstracted the checkpointing logic into a CheckpointIO plugin. This enables users to adapt Lightning to their own infrastructure.
from pytorch_lightning.plugins import CheckpointIO


class CustomCheckpointIO(CheckpointIO):
    def save_checkpoint(self, checkpoint, path):
        # put all logic related to saving a checkpoint here
        ...

    def load_checkpoint(self, path):
        # put all logic related to loading a checkpoint here
        ...

    def remove_checkpoint(self, path):
        # put all logic related to deleting a checkpoint here
        ...
BFloat16 Support
PyTorch 1.10 introduces native Automatic Mixed Precision (AMP) support for torch.bfloat16 on CPU (was already supported for TPUs), enabling higher performance compared with torch.float16. Switch to bfloat16 training by setting the argument:
from pytorch_lightning import Trainer

trainer = Trainer(precision="bf16")
Enable Auto Parameters Tying
It is quite common to share parameters within a model. However, TPUs do not retain shared parameters once the model has been moved to the device. Lightning now supports automatic detection and re-assignment of shared parameters to alleviate this problem on TPUs.
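To make the problem concrete, here is a small, hedged sketch of what "shared parameters" means in practice; the module below (names made up for illustration) ties its input embedding and its output projection to the same weight tensor, which is exactly the kind of sharing Lightning now re-establishes after the model is moved to a TPU:
import torch.nn as nn
from pytorch_lightning import LightningModule


class TiedLanguageModel(LightningModule):
    def __init__(self, vocab_size=10000, hidden=256):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, hidden)
        self.decoder = nn.Linear(hidden, vocab_size, bias=False)
        # the embedding and the output projection share one parameter tensor
        self.decoder.weight = self.embedding.weight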
Infinite Training
Infinite training is now supported by setting Trainer(max_epochs=-1) for an unlimited number of epochs, or Trainer(max_steps=-1) for an endless epoch.
Note: you will want to avoid logging with on_epoch=True in case of max_steps=-1.
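For example, a sketch of an endless run; stopping is then up to you, for instance via a callback:
from pytorch_lightning import Trainer

# run a single, never-ending epoch; remember to avoid on_epoch=True logging in this mode
trainer = Trainer(max_steps=-1)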
DeepSpeed Stage 1
DeepSpeed is a deep learning training optimization library that provides the means to train massive billion-parameter models at scale. Lightning now also supports the DeepSpeed ZeRO Stage 1 protocol, which partitions your optimizer states across your GPUs to reduce memory usage.
from pytorch_lightning import Trainer

trainer = Trainer(gpus=4, strategy="deepspeed_stage_1", precision=16)
trainer.fit(model)
For even more memory savings and model sharding advice, check out Stages 2 and 3 as well in the multi-GPU docs.
Gradient Clipping Customization
By overriding the LightningModule.configure_gradient_clipping hook, you can customize gradient clipping to your needs:
# Perform gradient clipping on gradients associated with discriminator (optimizer_idx=1) in GAN
def configure_gradient_clipping(
    self,
    optimizer,
    optimizer_idx,
    gradient_clip_val,
    gradient_clip_algorithm,
):
    if optimizer_idx == 1:
        # Lightning will handle the gradient clipping
        self.clip_gradients(
            optimizer,
            gradient_clip_val=gradient_clip_val,
            gradient_clip_algorithm=gradient_clip_algorithm,
        )
This means you can now implement state-of-the-art clipping algorithms with Lightning.
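As a hedged illustration of going beyond the built-in behaviour (the generator attribute below is hypothetical), the same hook can skip self.clip_gradients entirely and apply a hand-rolled rule to just one sub-network:
import torch


def configure_gradient_clipping(self, optimizer, optimizer_idx, gradient_clip_val, gradient_clip_algorithm):
    # hypothetical custom rule: cap the gradient norm of the generator only
    if optimizer_idx == 0:
        torch.nn.utils.clip_grad_norm_(self.generator.parameters(), max_norm=1.0)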
Determinism
Lightning 1.5 adds support for torch.use_deterministic_algorithms. Read more about how it works here. You can enable it by setting:
from pytorch_lightning import Trainer

trainer = Trainer(deterministic=True)
Anomaly Detection
Lightning makes it easier to debug your code, so support has been added for torch.autograd.set_detect_anomaly. With this, PyTorch detects numerical anomalies like NaN or inf during the forward and backward passes. Read more about anomaly detection here.
from pytorch_lightning import Trainer

trainer = Trainer(detect_anomaly=True)
DDP Debugging Improvements
Are you having a hard time debugging DDP on your remote machine? Now you can debug DDP locally on the CPU:
trainer = Trainer(accelerator="cpu", strategy="ddp", devices=2)
When everything works, switch back to GPU by changing only the accelerator. Check the documentation for more useful debugging tricks.
Note that this will not provide any speed benefits.
ModelSummary Callback
The ModelSummary callback generates a summary of all layers in a LightningModule. It currently works together with the new RichProgressBar callback.
from pytorch_lightning import Trainer
from pytorch_lightning.callbacks import ModelSummary

trainer = Trainer(callbacks=[ModelSummary(max_depth=1)])
New Hooks
An on_exception Callback hook has been added which allows the user to perform custom exception handling.
from pytorch_lightning.callbacks import Callback


class MyCallback(Callback):
    def on_exception(self, trainer, pl_module, exception):
        # whatever you want!
        ...
Experimental Feature - Inter Batch Parallelism
The inter-batch parallelism feature aims to hide the latency of the host-to-device copy of input batches behind computationally intensive operations. In some use cases, it can provide a training speed-up. This feature is experimental and subject to change, so it is opt-in through an environment variable.
PL_INTER_BATCH_PARALLELISM=1 python train.py
Experimental Feature - Training Step With DataLoader Iterator
If your training_step signature takes a dataloader_iter argument, Lightning will pass the dataloader iterator to it directly instead of a pre-fetched batch. This can be useful for recommendation engine optimization.
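A hedged sketch of what this can look like; it assumes next() on the iterator yields a batch and that fetching is entirely up to your code, so treat the exact signature as illustrative:
from pytorch_lightning import LightningModule


class RecSysModel(LightningModule):
    def training_step(self, dataloader_iter):
        # Lightning hands over the raw iterator; fetch (or pre-fetch, or skip) batches yourself
        batch = next(dataloader_iter)
        loss = self.compute_loss(batch)  # compute_loss is a hypothetical helper
        return loss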
Experimental Feature - Meta Module
PyTorch 1.10 introduces meta tensors, which are tensors without any data attached. Building on this, PyTorch Lightning provides an init_meta_context context manager and a materialize_module function to handle large sharded models.
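Here is a hedged sketch of how the two utilities are meant to fit together (MyHugeModel is a placeholder for your own module, and the import path is the one assumed for 1.5):
from pytorch_lightning.utilities.meta import init_meta_context, materialize_module

with init_meta_context():
    # parameters are created on the "meta" device, so no real memory is allocated yet
    model = MyHugeModel()

# allocate and initialize the actual parameters once the model is really needed
materialize_module(model)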
Full Changelog
Added
- Added support for monitoring the learning rate without schedulers in `LearningRateMonitor` (#9786)
- Added registration of `ShardedTensor` state dict hooks in `LightningModule.__init__` if the PyTorch version supports `ShardedTensor` (#8944)
- Added error handling including calling of `on_keyboard_interrupt()` and `on_exception()` for all entrypoints (fit, validate, test, predict) (#8819)
- Added a flavor of `training_step` that takes `dataloader_iter` as an argument (#8807)
- Added a `state_key` property to the `Callback` base class (#6886)
- Added progress tracking to loops:
  - Integrated `TrainingEpochLoop.total_batch_idx` (#8598)
  - Added `BatchProgress` and integrated `TrainingEpochLoop.is_last_batch` (#9657)
  - Avoid optional `Tracker` attributes (#9320)
  - Reset `current` progress counters when restarting an epoch loop that had already finished (#9371)
  - Call `reset_on_restart` in the loop's `reset` hook instead of when loading a checkpoint (#9561)
  - Use `completed` over `processed` in `reset_on_restart` (#9656)
  - Renamed `reset_on_epoch` to `reset_on_run` (#9658)
- Added `batch_size` and `rank_zero_only` arguments for `log_dict` to match `log` (#8628)
- Added a check for unique GPU ids (#8666)
- Added `ResultCollection` state_dict to the Loop `state_dict` and added support for distributed reload (#8641)
- Added DeepSpeed collate checkpoint utility function (#8701)
- Added a `handles_accumulate_grad_batches` property to the training type plugins (#8856)
- Added a warning to `WandbLogger` when reusing a wandb run (#8714)
- Added `log_graph` argument for `watch` method of `WandbLogger` (#8662)
- `LightningCLI` additions:
  - Added `LightningCLI(run=False|True)` to choose whether to run a `Trainer` subcommand (#8751)
  - Added support to call any trainer function from the `LightningCLI` via subcommands (#7508)
  - Allow easy trainer re-instantiation (#7508)
  - Automatically register all optimizers and learning rate schedulers (#9565)
  - Allow registering custom optimizers and learning rate schedulers without subclassing the CLI (#9565)
  - Support shorthand notation to instantiate optimizers and learning rate schedulers (#9565)
  - Support passing lists of callbacks via command line (#8815)
  - Support shorthand notation to instantiate models (#9588)
  - Support shorthand notation to instantiate datamodules (#10011)
  - Added `multifile` option to `LightningCLI` to enable/disable config saving to preserve multiple files structure (#9073)
- Fault-tolerant training:
  - Added `FastForwardSampler` and `CaptureIterableDataset` injection to data loading utilities (#8366)
  - Added `DataFetcher` to control fetching flow (#8890)
  - Added `SharedCycleIteratorState` to prevent infinite loop (#8889)
  - Added `CaptureMapDataset` for state management in map-style datasets (#8891)
  - Added Fault Tolerant Training to `DataFetcher` (#8891)
  - Replaced old prefetch iterator with new `DataFetcher` in training loop (#8953)
  - Added partial support for global random state fault-tolerance in map-style datasets (#8950)
  - Converted state to tuple explicitly when setting Python random state (#9401)
  - Added support for restarting an optimizer loop (multiple optimizers) (#9537)
  - Added support for restarting within Evaluation Loop (#9563)
  - Added mechanism to detect that a signal has been sent so the Trainer can gracefully exit (#9566)
  - Added support for skipping ahead to validation during the auto-restart of fitting (#9681)
  - Added support for auto-restart if a fault-tolerant checkpoint is available (#9722)
- Checkpoint saving and loading extensibility:
  - Added `CheckpointIO` plugin to expose checkpoint IO from training type plugin (#8743)
  - Refactored `CheckpointConnector` to offload validation logic to the `CheckpointIO` plugin (#9045)
  - Added `remove_checkpoint` to `CheckpointIO` plugin by moving the responsibility out of the `ModelCheckpoint` callback (#9373)
  - Added `XLACheckpointIO` plugin (#9972)
- Loop customization:
  - Added `Closure` and `AbstractClosure` classes (#8642)
  - Refactored `TrainingBatchLoop` and extracted `OptimizerLoop`, splitting off automatic optimization into its own loop (#9191)
  - Removed `TrainingBatchLoop.backward()`; manual optimization now calls directly into `Accelerator.backward()` and automatic optimization handles backward in new `OptimizerLoop` (#9265)
  - Extracted `ManualOptimization` logic from `TrainingBatchLoop` into its own separate loop class (#9266)
  - Added `OutputResult` and `ManualResult` classes (#9437, #9424)
  - Marked `OptimizerLoop.backward` as protected (#9514)
  - Marked `FitLoop.should_accumulate` as protected (#9515)
  - Marked several methods in `PredictionLoop` as protected: `on_predict_start`, `on_predict_epoch_end`, `on_predict_end`, `on_predict_model_eval` (#9516)
  - Marked several methods in `EvaluationLoop` as protected: `get_max_batches`, `on_evaluation_model_eval`, `on_evaluation_model_train`, `on_evaluation_start`, `on_evaluation_epoch_start`, `on_evaluation_epoch_end`, `on_evaluation_end`, `reload_evaluation_dataloaders` (#9516)
  - Marked several methods in `EvaluationEpochLoop` as protected: `on_evaluation_batch_start`, `evaluation_step`, `evaluation_step_end` (#9516)
  - Added `yielding_training_step` example (#9983)
- Added support for saving and loading state of multiple callbacks of the same type (#7187)
- Added DeepSpeed Stage 1 support (#8974)
- Added Python `dataclass` support for `LightningDataModule` (#8272)
- Added sanitization of tensors when they get logged as hyperparameters in `TensorBoardLogger` (#9031)
- Added `InterBatchParallelDataFetcher` (#9020)
- Added `DataLoaderIterDataFetcher` (#9020)
- Added `DataFetcher` within Fit / Evaluation Loop (#9047)
- Added a friendly error message when DDP attempts to spawn new distributed processes with rank > 0 (#9005)
- Added Rich integration:
- Added input validation logic for precision (#9080)
- Added support for CPU AMP autocast (#9084)
- Added `on_exception` callback hook (#9183)
- Added a warning to DeepSpeed when inferring batch size (#9221)
- Added `ModelSummary` callback (#9344)
- Added `log_images`, `log_text` and `log_table` to `WandbLogger` (#9545)
- Added `PL_RECONCILE_PROCESS` environment variable to enable process reconciliation regardless of cluster environment settings (#9389)
- Added `get_device_stats` to the Accelerator interface and added its implementation for GPU and TPU (#9586)
- Added a warning when an unknown key is encountered in the optimizer configuration, and when `OneCycleLR` is used with `"interval": "epoch"` (#9666)
- Added `DeviceStatsMonitor` callback (#9712)
- Added `enable_progress_bar` to the Trainer constructor (#9664)
- Added `pl_legacy_patch` load utility for loading old checkpoints that have pickled legacy Lightning attributes (#9166)
- Added support for `torch.use_deterministic_algorithms` (#9121)
- Added automatic parameters tying for TPUs (#9525)
- Added support for `torch.autograd.set_detect_anomaly` through `Trainer` constructor argument `detect_anomaly` (#9848)
- Added `enable_model_summary` flag to Trainer (#9699)
- Added `strategy` argument to Trainer (#8597)
- Added `init_meta_context`, `materialize_module` utilities (#9920)
- Added `TPUPrecisionPlugin` (#10020)
- Added `torch.bfloat16` support:
- Added `kfold` example for loop customization (#9965)
- LightningLite:
  - Added `PrecisionPlugin.forward_context`, making it the default implementation for all `{train,val,test,predict}_step_context()` methods (#9988)
  - Added `DDPSpawnPlugin.spawn()` for spawning new processes of a given function (#10018, #10022)
  - Added `TrainingTypePlugin.{_setup_model, _setup_optimizer}` methods (#9994, #10064)
  - Implemented `DataParallelPlugin._setup_model` (#10010)
  - Implemented `DeepSpeedPlugin._setup_model_and_optimizers` (#10009, #10064)
  - Implemented `{DDPShardedPlugin,DDPShardedSpawnPlugin}._setup_model_and_optimizers` (#10028, #10064)
  - Added optional `model` argument to the `optimizer_step` methods in accelerators and plugins (#10023)
  - Updated precision attributes in `DeepSpeedPlugin` (#10164)
  - Added the ability to return a result from rank 0 in `DDPSpawnPlugin.spawn` (#10162)
  - Added `pytorch_lightning.lite` package (#10175)
  - Added `LightningLite` documentation (#10043)
  - Added `LightningLite` examples (#9987)
  - Make the `_LiteDataLoader` an iterator and add supports for custom dataloader (#10279)
- Added `use_omegaconf` argument to `save_hparams_to_yaml` plugin (#9170)
- Added `ckpt_path` argument for `Trainer.fit()` (#10061)
- Added `auto_device_count` method to `Accelerators` (#10222)
- Added support for `devices="auto"` (#10264)
- Added a `filename` argument in `ModelCheckpoint.format_checkpoint_name` (#9818)
- Added support for empty `gpus` list to run on CPU (#10246)
- Added a warning if multiple batch sizes are found from ambiguous batch (#10247)
Changed
- Trainer now raises a `MisconfigurationException` when its methods are called with `ckpt_path="best"` but a checkpoint callback isn't configured (#9841)
- Setting `Trainer(accelerator="ddp_cpu")` now does not spawn a subprocess if `num_processes` is kept `1` along with `num_nodes > 1` (#9603)
- Module imports are now catching `ModuleNotFoundError` instead of `ImportError` (#9867)
- `pytorch_lightning.loggers.neptune.NeptuneLogger` is now consistent with the new neptune-client API; the old neptune-client API is supported by `NeptuneClient` from the neptune-contrib repo (#6867)
- Parsing of `enums` type hyperparameters to be saved in the `hparams.yaml` file by TensorBoard and CSV loggers has been fixed and made in line with how OmegaConf parses it (#9170)
- Parsing of the `gpus` Trainer argument has changed: `gpus="n"` (str) no longer selects the GPU index n and instead selects the first n devices (#8770)
- `iteration_count` and other index attributes in the loops have been replaced with progress dataclasses (#8477)
- The `trainer.lightning_module` reference is now properly set at the very beginning of a run (#8536)
- The model weights now get loaded in all cases when the checkpoint path gets provided in validate/test/predict, regardless of whether the model instance is provided or not (#8352)
- The `Trainer` functions `reset_{train,val,test,predict}_dataloader`, `reset_train_val_dataloaders`, and `request_dataloader` `model` argument is now optional (#8536)
- Saved checkpoints will no longer use the type of a `Callback` as the key to avoid issues with unpickling (#6886)
- Improved string conversion for `ResultCollection` (#8622)
- `LightningCLI` changes:
  - `LightningCLI.init_parser` now returns the parser instance (#8721)
  - `LightningCLI.add_core_arguments_to_parser`, `LightningCLI.parse_arguments` now take a `parser` argument (#8721)
  - `LightningCLI.instantiate_trainer` now takes a config and a list of callbacks (#8721)
  - Split `LightningCLI.add_core_arguments_to_parser` into `LightningCLI.add_default_arguments_to_parser` + `LightningCLI.add_core_arguments_to_parser` (#8721)
- The accelerator and training type plugin `setup` hooks no longer have a `model` argument (#8536)
- The accelerator and training type plugin `update_global_step` hook has been removed (#8856)
- The coverage of `self.log`-ing in any `LightningModule` or `Callback` hook has been improved (#8498)
- `self.log`-ing without a `Trainer` reference now raises a warning instead of an exception (#9733)
- Removed restrictions in the Trainer that loggers can only log from rank 0; the existing logger behavior has not changed (#8608)
- `Trainer.request_dataloader` now takes a `RunningStage` enum instance (#8858)
- Changed `rank_zero_warn` to `NotImplementedError` in the `{train, val, test, predict}_dataloader` hooks that `Lightning(Data)Module` uses (#9161)
- Moved `block_ddp_sync_behaviour` out of `TrainingBatchLoop` to loop utilities (#9192)
- Executing the `optimizer_closure` is now required when overriding the `optimizer_step` hook (#9360)
- Changed logging of `LightningModule` and `LightningDataModule` hyperparameters to raise an exception only if there are colliding keys with different values (#9496)
- `seed_everything` now fails when an invalid seed value is passed instead of selecting a random seed (#8787)
- The Trainer now calls `TrainingTypePlugin` collective APIs directly instead of going through the Accelerator reference (#9677, #9901)
- The tuner now uses a unique filename to save a temporary checkpoint (#9682)
- Changed `HorovodPlugin.all_gather` to return a `torch.Tensor` instead of a list (#9696)
- Changed Trainer connectors to be protected attributes:
  - Configuration Validator (#9779)
- The `current_epoch` and `global_step` attributes now get restored irrespective of the Trainer task (#9413)
- Trainer now raises an exception when requesting `amp_level` with native `amp_backend` (#9755)
- Update the logic to check for accumulation steps with deepspeed (#9826)
- `pytorch_lightning.utilities.grads.grad_norm` now raises an exception if parameter `norm_type <= 0` (#9765)
- Updated error message for interactive incompatible plugins (#9896)
- Moved the `optimizer_step` and `clip_gradients` hook from the `Accelerator` and `TrainingTypePlugin` into the `PrecisionPlugin` (#10143, #10029)
- `NativeMixedPrecisionPlugin` and its subclasses now take an optional `GradScaler` instance (#10055)
- Trainer is now raising a `MisconfigurationException` instead of a warning if `Trainer.{validate/test}` is missing required methods (#10016)
- Changed default value of the `max_steps` Trainer argument from `None` to -1 (#9460)
- LightningModule now raises an error when calling `log(on_step=False, on_epoch=False)` (#10227)
- Quantization aware training observers are now disabled by default during validating/testing/predicting stages (#8540)
- Raised `MisconfigurationException` when total length of `dataloader` across ranks is zero, and give warning when total length is non-zero, but only local rank length is zero (#9827)
- Changed the model size calculation using `ByteCounter` (#10123)
- Enabled `on_load_checkpoint` for `LightningDataModule` for all `trainer_fn` (#10238)
- Allowed separate config files for parameters with class type when LightningCLI is in `subclass_mode=False` (#10286)
Deprecated
- Deprecated Trainer argument `terminate_on_nan` in favor of `detect_anomaly` (#9175)
- Deprecated `Trainer.terminate_on_nan` public attribute access (#9849)
- Deprecated `LightningModule.summarize()` in favor of `pytorch_lightning.utilities.model_summary.summarize()` (#8513)
- Deprecated `LightningModule.model_size` (#8343)
- Deprecated `DataModule` properties: `train_transforms`, `val_transforms`, `test_transforms`, `size`, `dims` (#8851)
- Deprecated `add_to_queue`, `get_from_queue` from `LightningModule` in favor of corresponding methods in the `DDPSpawnPlugin` (#9118)
- Deprecated `LightningModule.get_progress_bar_dict` and `Trainer.progress_bar_dict` in favor of `pytorch_lightning.callbacks.progress.base.get_standard_metrics` and `ProgressBarBase.get_metrics` (#8985)
- Deprecated `prepare_data_per_node` flag on Trainer and set it as a property of `DataHooks`, accessible in the `LightningModule` and `LightningDataModule` (#8958)
- Deprecated the `TestTubeLogger` (#9065)
- Deprecated `on_{train/val/test/predict}_dataloader()` from `LightningModule` and `LightningDataModule` (#9098)
- Deprecated `on_keyboard_interrupt` callback hook in favor of new `on_exception` hook (#9260)
- Deprecated passing `process_position` to the `Trainer` constructor in favor of adding the `ProgressBar` callback with `process_position` directly to the list of callbacks (#9222)
- Deprecated passing `flush_logs_every_n_steps` as a Trainer argument, instead pass it to the logger init if supported (#9366)
- Deprecated `LightningLoggerBase.close`, `LoggerCollection.close` in favor of `LightningLoggerBase.finalize`, `LoggerCollection.finalize` (#9422)
- Deprecated passing `progress_bar_refresh_rate` to the `Trainer` constructor in favor of adding the `ProgressBar` callback with `refresh_rate` directly to the list of callbacks, or passing `enable_progress_bar=False` to disable the progress bar (#9616)
- Deprecated `LightningDistributed` and moved the broadcast logic to `DDPPlugin` and `DDPSpawnPlugin` directly (#9691)
- Deprecated passing `stochastic_weight_avg` to the `Trainer` constructor in favor of adding the `StochasticWeightAveraging` callback directly to the list of callbacks (#8989)
- Deprecated Accelerator collective API `barrier`, `broadcast`, and `all_gather` in favor of calling the `TrainingTypePlugin` collective API directly (#9677)
- Deprecated `checkpoint_callback` from the `Trainer` constructor in favor of `enable_checkpointing` (#9754)
- Deprecated the `LightningModule.on_post_move_to_device` method (#9525)
- Deprecated `pytorch_lightning.core.decorators.parameter_validation` in favor of `pytorch_lightning.utilities.parameter_tying.set_shared_parameters` (#9525)
- Deprecated passing `weights_summary` to the `Trainer` constructor in favor of adding the `ModelSummary` callback with `max_depth` directly to the list of callbacks (#9699)
- Deprecated `log_gpu_memory`, `gpu_metrics`, and util funcs in favor of `DeviceStatsMonitor` callback (#9921)
- Deprecated `GPUStatsMonitor` and `XLAStatsMonitor` in favor of `DeviceStatsMonitor` callback (#9924)
- Deprecated setting `Trainer(max_steps=None)`; to turn off the limit, set `Trainer(max_steps=-1)` (default) (#9460)
- Deprecated access to the `AcceleratorConnector.is_slurm_managing_tasks` attribute and marked it as protected (#10101)
- Deprecated access to the `AcceleratorConnector.configure_slurm_ddp` method and marked it as protected (#10101)
- Deprecated passing `resume_from_checkpoint` to the `Trainer` constructor in favor of `trainer.fit(ckpt_path=)` (#10061)
- Deprecated `ClusterEnvironment.creates_children()` in favor of `ClusterEnvironment.creates_processes_externally` (property) (#10106)
- Deprecated `PrecisionPlugin.master_params()` in favor of `PrecisionPlugin.main_params()` (#10105)
- Deprecated `lr_sch_names` from `LearningRateMonitor` (#10066)
- Deprecated `ProgressBar` callback in favor of `TQDMProgressBar` (#10134)
Removed
- Removed deprecated `metrics` (#8586)
- Removed the deprecated `outputs` argument in both the `LightningModule.on_train_epoch_end` and `Callback.on_train_epoch_end` hooks (#8587)
- Removed the deprecated `TrainerLoggingMixin` class (#8609)
- Removed the deprecated `TrainerTrainingTricksMixin` class (#8679)
- Removed the deprecated `optimizer_idx` from `training_step` as an accepted argument in manual optimization (#8576)
- Removed support for the deprecated `on_save_checkpoint` signature. The hook now takes a `checkpoint` positional parameter (#8697)
- Removed support for the deprecated `on_load_checkpoint` signature. The hook now takes a `pl_module` positional parameter (#8697)
- Removed the deprecated `save_function` property in `ModelCheckpoint` (#8680)
- Removed the deprecated `model` argument from `ModelCheckpoint.save_checkpoint` (#8688)
- Removed the deprecated `sync_step` argument from `WandbLogger` (#8763)
- Removed the deprecated `Trainer.truncated_bptt_steps` in favor of `LightningModule.truncated_bptt_steps` (#8826)
- Removed `LightningModule.write_predictions` and `LightningModule.write_predictions_dict` (#8850)
- Removed `on_reset_*_dataloader` hooks in TrainingType Plugins and Accelerators (#8858)
- Removed deprecated `GradInformation` module in favor of `pytorch_lightning.utilities.grads` (#8831)
- Removed `TrainingTypePlugin.on_save` and `Accelerator.on_save` (#9023)
- Removed `{Accelerator,TrainingTypePlugin,PrecisionPlugin}.post_optimizer_step` (#9746)
- Removed deprecated `connect_precision_plugin` and `connect_training_type_plugin` from `Accelerator` (#9019)
- Removed `on_train_epoch_end` from `Accelerator` (#9035)
- Removed `InterBatchProcessor` in favor of `DataLoaderIterDataFetcher` (#9052)
- Removed `Plugin` in `base_plugin.py` in favor of accessing `TrainingTypePlugin` and `PrecisionPlugin` directly instead (#9066)
- Removed `teardown` from `ParallelPlugin` (#8943)
- Removed deprecated `profiled_functions` argument from `PyTorchProfiler` (#9178)
- Removed deprecated `pytorch_lighting.utilities.argparse_utils` module (#9166)
- Removed deprecated property `Trainer.running_sanity_check` in favor of `Trainer.sanity_checking` (#9209)
- Removed deprecated `BaseProfiler.output_filename` arg from it and its descendants in favor of `dirpath` and `filename` (#9214)
- Removed deprecated property `ModelCheckpoint.period` in favor of `ModelCheckpoint.every_n_epochs` (#9213)
- Removed deprecated `auto_move_data` decorator (#9231)
- Removed deprecated property `LightningModule.datamodule` in favor of `Trainer.datamodule` (#9233)
- Removed deprecated properties `DeepSpeedPlugin.cpu_offload*` in favor of `offload_optimizer`, `offload_parameters` and `pin_memory` (#9244)
- Removed deprecated property `AcceleratorConnector.is_using_torchelastic` in favor of `TorchElasticEnvironment.is_using_torchelastic()` (#9729)
- Removed `pytorch_lightning.utilities.debugging.InternalDebugger` (#9680)
- Removed `call_configure_sharded_model_hook` property from `Accelerator` and `TrainingTypePlugin` (#9612)
- Removed `TrainerProperties` mixin and moved property definitions directly into `Trainer` (#9495)
- Removed a redundant warning with `ModelCheckpoint(monitor=None)` callback (#9875)
- Remove `epoch` from `trainer.logged_metrics` (#9904)
- Removed `should_rank_save_checkpoint` property from Trainer (#9433)
- Remove deprecated `distributed_backend` from `Trainer` (#10017)
- Removed `process_idx` from the `{DDPSpawnPlugin,TPUSpawnPlugin}.new_process` methods (#10022)
- Removed automatic patching of `{train,val,test,predict}_dataloader()` on the `LightningModule` (#9764)
- Removed `pytorch_lightning.trainer.connectors.OptimizerConnector` (#10120)
Fixed
- Fixed ImageNet evaluation in example (#10179)
- Fixed an issue with logger outputs not being finalized correctly after prediction runs (#8685)
- Fixed `move_metrics_to_cpu` moving the loss to CPU while training on device (#9308)
- Fixed incorrect main progress bar indicator when resuming training mid-epoch (#9310)
- Fixed an issue with freeing memory of datafetchers during teardown (#9387)
- Fixed a bug where the training step output needed to be `deepcopy`-ed (#9349)
- Fixed an issue with freeing memory allocated by the data iterators in `Loop.on_run_end` (#9386, #9915)
- Fixed `BasePredictionWriter` not returning the batch indices in a non-distributed setting (#9432)
- Fixed an error when running in XLA environments with no TPU attached (#9572)
- Fixed check on torchmetrics logged whose `compute()` output is a multielement tensor (#9582)
- Fixed gradient accumulation for `DDPShardedPlugin` (#9122)
- Fixed missing DeepSpeed distributed call (#9540)
- Fixed an issue with wrapped LightningModule during evaluation; the LightningModule no longer gets wrapped with data-parallel modules when not fitting in `DDPPlugin`, `DDPSpawnPlugin`, `DDPShardedPlugin`, `DDPSpawnShardedPlugin` (#9096)
- Fixed `trainer.accumulate_grad_batches` to be an int on init. The default value for it is now `None` inside Trainer (#9652)
- Fixed `broadcast` in `DDPPlugin` and `DDPSpawnPlugin` to respect the `src` input (#9691)
- Fixed `self.log(on_epoch=True, reduce_fx=sum)` for the `on_batch_start` and `on_train_batch_start` hooks (#9791)
- Fixed `self.log(on_epoch=True)` for the `on_batch_start` and `on_train_batch_start` hooks (#9780)
- Fixed restoring training state during `Trainer.fit` only (#9413)
- Fixed DeepSpeed and Lightning both calling the scheduler (#9788)
- Fixed missing arguments when saving hyperparameters from the parent class but not from the child class (#9800)
- Fixed DeepSpeed GPU device IDs (#9847)
- Reset `val_dataloader` in `tuner/batch_size_scaling` (#9857)
- Fixed use of `LightningCLI` in `computer_vision_fine_tuning.py` example (#9934)
- Fixed issue with non-init dataclass fields in `apply_to_collection` (#9963)
- Reset `val_dataloader` in `tuner/batch_size_scaling` for binsearch (#9975)
- Fixed logic to check for spawn in dataloader `TrainerDataLoadingMixin._worker_check` (#9902)
- Fixed `train_dataloader` getting loaded twice when resuming from a checkpoint during `Trainer.fit()` (#9671)
- Fixed `LearningRateMonitor` logging with multiple param groups optimizer with no scheduler (#10044)
- Fixed undesired side effects being caused by `Trainer` patching dataloader methods on the `LightningModule` (#9764)
- Fixed gradients not being unscaled when clipping or logging the gradient norm (#9287)
- Fixed `on_before_optimizer_step` getting called before the optimizer closure (including backward) has run (#10167)
- Fixed monitor value in `ModelCheckpoint` getting moved to the wrong device in a special case where it becomes NaN (#10118)
- Fixed creation of `dirpath` in `BaseProfiler` if it doesn't exist (#10073)
- Fixed incorrect handling of sigterm (#10189)
- Fixed bug where `log(on_step=True, on_epoch=True, sync_dist=True)` wouldn't reduce the value on step (#10227)
- Fixed an issue with `pl.utilities.seed.reset_seed` converting the `PL_SEED_WORKERS` environment variable to `bool` (#10099)
- Fixed iterating over a logger collection when `fast_dev_run > 0` (#10232)
- Fixed `batch_size` in `ResultCollection` not being reset to 1 on epoch end (#10242)
- Fixed `distrib_type` not being set when training plugin instances are being passed to the Trainer (#10251)
Have any questions?
Contact Exxact Today

PyTorch Lightning 1.5 Released
What's New in PyTorch Lightning 1.5?
Earlier this month, PyTorch Lightning 1.5 became available, and it's a big update! It introduces support for LightningLite, Fault-tolerant Training, Loop Customization, Lightning Tutorials, LightningCLI V2, RichProgressBar, CheckpointIO Plugin, Trainer Strategy flag, and more!
What is PyTorch Lightning?
PyTorch Lightning provides a high-level interface for PyTorch, a popular deep learning framework, and is ideal for anyone doing high-performance AI research. It allows you to scale your models without extra bloat and boilerplate, and is definitely something you should take a look at if you use PyTorch. If you haven't had a chance to check out PyTorch Lightning yet, you an find a good overview and how to get started in our other blog posts:
- Introduction to PyTorch Lightning
- PyTorch Lightning Tutorial #1: Getting Started
- PyTorch Lightning Tutorial #2: Using TorchMetrics and Lightning Flash
Interested in a deep learning workstation?
Learn more about Exxact AI workstations starting at $3,700
Highlights in Version 1.5
With over 60 contributors working on features, bugfixes and documentation improvements, version 1.5 was their biggest release to date. Read on for highlights of this version.
Fault-tolerant Training
Fault-tolerant Training is a new internal mechanism that enables PyTorch Lightning to recover from a hardware or software failure. This is particularly interesting while training in the cloud with preemptive instances which can shutdown at any time. Once a Lightning experiment unexpectedly exits, a temporary checkpoint is saved that contains the exact state of all loops and the model. With this new experimental feature, you will be able to restore your training mid-epoch on the exact batch and continue training as if it never got interrupted.
PL_FAULT_TOLERANT_TRAINING=1 python train.py
LightningLite
LightningLite enables pure PyTorch users to scale their existing code to any kind of hardware while retaining full control over their own loops and optimization logic.
With just a few lines of code and no large refactoring, you get support for multi-device, multi-node, running on different accelerators (CPU, GPU, TPU), native automatic mixed precision (half and bfloat16), and double precision, in just a few seconds. And no special launcher required! Check out the documentation here.
class Lite(LightningLite):
def run(self):
# Let Lite setup your dataloader(s)
train_loader = self.setup_dataloaders(torch.utils.data.DataLoader(...))
model = Net() # .to() not needed
optimizer = optim.Adam(model.parameters())
# Let Lite setup your model and optimizer
model, optimizer = self.setup(model, optimizer)
for epoch in range(5):
for data, target in train_loader:
optimizer.zero_grad()
output = model(data) # data is already on the device
loss = F.nll_loss(output, target)
self.backward(loss) # instead of loss.backward()
optimizer.step()
Lite(accelerator="gpu", devices="auto").run()
Loop Customization
With the new Lightning Loop API in v1.5, you can write your own training loops for any kind of research from active learning to recommendation systems.
The new Loop API lets advanced users swap out the default gradient descent optimization loop at the core of Lightning with a different optimization paradigm. This is part of an effort to make Lightning the simplest, most flexible framework to take any kind of deep learning research to production.
Read a comprehensive introduction to loops here.
New Rich Progress Bar
They integrated with Rich and created a new and improved progress bar for Lightning. Try it out:
pip install rich
from pytorch_lightning import Trainer from pytorch_lightning.callbacks import RichProgressBar trainer = Trainer(callbacks=[RichProgressBar()])
New Trainer Arguments: Strategy and Devices
With the new strategy and devices arguments in the Trainer, it is now easer to switch from one hardware to another.
| Before | After |
|---|---|
Trainer(accelerator="ddp", gpus=2) | Trainer(accelerator="gpu", devices=2, strategy="ddp") |
Trainer(accelerator="ddp_cpu", num_processes=2) | Trainer(accelerator="cpu", devices=2, strategy="ddp") |
Trainer(accelerator="tpu_spawn", tpu_cores=8) | Trainer(accelerator="tpu", devices=8) |
The new devices argument is now agnostic to all accelerators, but the previous arguments gpus, tpu_cores, ipus are still available and work the same as before. In addition, it is now also possible to set devices="auto" or accelerator="auto" to select the best accelerator available on the hardware.
from pytorch_lightning import Trainer trainer = Trainer(accelerator="auto", devices="auto")
LightningCLI V2
This release adds support for running not just Trainer.fit but any of the Trainer entry points.
python script.py fit python script.py test
LightningCLI now supports registries for callbacks, optimizers, learning rate schedulers, LightningModules and LightningDataModules. This greatly improves the command line experience as only the class names and arguments are required as follows:
python script.py \
--trainer.callbacks=EarlyStopping \
--trainer.callbacks.patience=5 \
--trainer.callbacks.LearningRateMonitor \
--trainer.callbacks.logging_interval=epoch \
--optimizer=Adam \
--optimizer.lr=0.01 \
--lr_scheduler=OneCycleLR \
--lr_scheduler=anneal_strategy=linear
They've also added support for a manual mode where the CLI takes care of the instantiation but you have control over the Trainer calls:
cli = LightningCLI(MyModel, run=False) cli.trainer.fit(cli.model)
CheckpointIO Plugins
As part of their commitment to extensibility, they've abstracted the checkpointing logic into a CheckpointIO plugin. This enables users to adapt Lightning to their own infrastructure.
from pytorch_lightning.plugins import CheckpointIO
class CustomCheckpointIO(CheckpointIO):
def save_checkpoint(self, checkpoint, path):
# put all logic related to saving a checkpoint here
def load_checkpoint(self, path):
# put all logic related to loading a checkpoint here
def remove_checkpoint(self, path):
# put all logic related to deleting a checkpoint hereBFloat16 Support
PyTorch 1.10 introduces native Automatic Mixed Precision (AMP) support for torch.bfloat16 on CPU (was already supported for TPUs), enabling higher performance compared with torch.float16. Switch to bfloat16 training by setting the argument:
from pytorch_lightning import Trainer trainer = Trainer(precision="bf16")
Enable Auto Parameters Tying
It is pretty common to share parameters within a model. However, TPUs don't retain shared parameters once moved on the devices. Lightning now supports automatic detection and re-assignment to alleviate this problem from TPUs.
Infinite Training
Infinite training is now supported by setting Trainer(max_epochs=-1) for an unlimited number of epochs, or Trainer(max_steps=-1) for an endless epoch.
Note: you will want to avoid logging with
on_epoch=Truein case ofmax_steps=-1.
DeepSpeed Stage 1
DeepSpeed is a deep learning training optimization library, providing the means to train massive billion parameter models at scale. Lightning now also supports the DeepSpeed ZeRO Stage 1 protocol that partitions your optimizer states across your GPUs to reduce memory.
from pytorch_lightning import Trainer trainer = Trainer(gpus=4, strategy="deepspeed_stage_1", precision=16) trainer.fit(model)
For even more memory savings and model sharding advice, check out stage 2 & 3 as well in their multi-GPU docs.
Gradient Clipping Customization
By overriding the LightningModule.configure_gradient_clipping hook, you can customize gradient clipping to your needs:
# Perform gradient clipping on gradients associated with discriminator (optimizer_idx=1) in GAN
def configure_gradient_clipping(
self,
optimizer,
optimizer_idx,
gradient_clip_val,
gradient_clip_algorithm
):
if optimizer_idx == 1:
# Lightning will handle the gradient clipping
self.clip_gradients(
optimizer,
gradient_clip_val=gradient_clip_val,
gradient_clip_algorithm=gradient_clip_algorithm
)
This means you can now implement state-of-the-art clipping algorithms with Lightning.
Determinism
Added support for torch.use_deterministic_algorithms. Read more about how it works here. You can enable it by setting:
from pytorch_lightning import Trainer trainer = Trainer(deterministic=True)
Anomaly Detection
Lightning makes it easier to debug your code, so we've added support for torch.set_detect_anomaly. With this, PyTorch detects numerical anomalies like NaN or inf during forward and backward. Read more about anomaly detection here.
from pytorch_lightning import Trainer trainer = Trainer(detect_anomaly=True)
DDP Debugging Improvements
Are you having a hard time debugging DDP on your remote machine? Now you can debug DDP locally on the CPU:
trainer = Trainer(accelerator="cpu", strategy="ddp", devices=2)
When everything works, switch back to GPU by changing only the accelerator. Check the documentation for more useful debugging tricks.
Note that this will not provide any speed benefits.
ModelSummary Callback
Generates a summary of all layers in a LightningModule. This currently works with the new RichProgressBar callback.
from pytorch_lightning import Trainer from pytorch_lightning.callbacks import ModelSummary trainer = Trainer(callbacks=[ModelSummary(max_depth=1)])
New Hooks
An on_exception Callback hook has been added which allows the user to perform custom exception handling.
class MyCallback(Callback):
def on_exception(self, trainer, pl_module, exception):
# whatever you want!
...Experimental Feature - Inter Batch Parallelism
The inter-batch parallelism feature aims at hiding the latency of host-to-device copy of input batches behind computationally intensive operations. In some use case, it can provide training speed up. This feature is experimental and subject to change, hence opt-in through an environment variable.
PL_INTER_BATCH_PARALLELISM=1 python train.py
Experimental Feature - Training Step With DataLoader Iterator
If your training_step signature takes a dataloader_iter, Lightning would pass it directly. This can be useful for recommendation engine optimization.
Experimental Feature - Meta Module
PyTorch 1.10 introduces the meta tensors, tensors without the data. In this continuation, PyTorch Lightning provides an init_meta_context context manager and materialize_module function to handle large sharded models.
Full Changelog
Added
- Added support for monitoring the learning rate without schedulers in
LearningRateMonitor(#9786) - Added registration of
ShardedTensorstate dict hooks inLightningModule.__init__if the PyTorch version supportsShardedTensor(#8944) - Added error handling including calling of
on_keyboard_interrupt()andon_exception()for all entrypoints (fit, validate, test, predict) (#8819) - Added a flavor of
training_stepthat takesdataloader_iteras an argument (#8807) - Added a
state_keyproperty to theCallbackbase class (#6886) - Added progress tracking to loops:
- Integrated
TrainingEpochLoop.total_batch_idx(#8598) - Added
BatchProgressand integratedTrainingEpochLoop.is_last_batch(#9657) - Avoid optional
Trackerattributes (#9320) - Reset
currentprogress counters when restarting an epoch loop that had already finished (#9371) - Call
reset_on_restartin the loop'sresethook instead of when loading a checkpoint (#9561) - Use
completedoverprocessedinreset_on_restart(#9656) - Renamed
reset_on_epochtoreset_on_run(#9658)
- Integrated
- Added
batch_sizeandrank_zero_onlyarguments forlog_dictto matchlog(#8628) - Added a check for unique GPU ids (#8666)
- Added
ResultCollectionstate_dict to the Loopstate_dictand added support for distributed reload (#8641) - Added DeepSpeed collate checkpoint utility function (#8701)
- Added a
handles_accumulate_grad_batchesproperty to the training type plugins (#8856) - Added a warning to
WandbLoggerwhen reusing a wandb run (#8714) - Added
log_graphargument forwatchmethod ofWandbLogger(#8662) LightningCLIadditions:- Added
LightningCLI(run=False|True)to choose whether to run aTrainersubcommand (#8751) - Added support to call any trainer function from the
LightningCLIvia subcommands (#7508) - Allow easy trainer re-instantiation (#7508)
- Automatically register all optimizers and learning rate schedulers (#9565)
- Allow registering custom optimizers and learning rate schedulers without subclassing the CLI (#9565)
- Support shorthand notation to instantiate optimizers and learning rate schedulers (#9565)
- Support passing lists of callbacks via command line (#8815)
- Support shorthand notation to instantiate models (#9588)
- Support shorthand notation to instantiate datamodules (#10011)
- Added
multifileoption toLightningCLIto enable/disable config saving to preserve multiple files structure (#9073)
- Added
- Fault-tolerant training:
- Added
FastForwardSamplerandCaptureIterableDatasetinjection to data loading utilities (#8366) - Added
DataFetcherto control fetching flow (#8890) - Added
SharedCycleIteratorStateto prevent infinite loop (#8889) - Added
CaptureMapDatasetfor state management in map-style datasets (#8891) - Added Fault Tolerant Training to
DataFetcher(#8891) - Replaced old prefetch iterator with new
DataFetcherin training loop (#8953) - Added partial support for global random state fault-tolerance in map-style datasets (#8950)
- Converted state to tuple explicitly when setting Python random state (#9401)
- Added support for restarting an optimizer loop (multiple optimizers) (#9537)
- Added support for restarting within Evaluation Loop (#9563)
- Added mechanism to detect that a signal has been sent so the Trainer can gracefully exit (#9566)
- Added support for skipping ahead to validation during the auto-restart of fitting (#9681)
- Added support for auto-restart if a fault-tolerant checkpoint is available (#9722)
- Added
- Checkpoint saving and loading extensibility:
- Added
CheckpointIOplugin to expose checkpoint IO from training type plugin (#8743) - Refactored
CheckpointConnectorto offload validation logic to theCheckpointIOplugin (#9045) - Added
remove_checkpointtoCheckpointIOplugin by moving the responsibility out of theModelCheckpointcallback (#9373) - Added
XLACheckpointIOplugin (#9972)
- Added
- Loop customization:
- Added
ClosureandAbstractClosureclasses (#8642) - Refactored
TrainingBatchLoopand extractedOptimizerLoop, splitting off automatic optimization into its own loop (#9191) - Removed
TrainingBatchLoop.backward(); manual optimization now calls directly intoAccelerator.backward()and automatic optimization handles backward in newOptimizerLoop(#9265) - Extracted
ManualOptimizationlogic fromTrainingBatchLoopinto its own separate loop class (#9266) - Added
OutputResultandManualResultclasses (#9437, #9424) - Marked
OptimizerLoop.backwardas protected (#9514) - Marked
FitLoop.should_accumulateas protected (#9515) - Marked several methods in
PredictionLoopas protected:on_predict_start,on_predict_epoch_end,on_predict_end,on_predict_model_eval(#9516) - Marked several methods in
EvaluationLoopas protected:get_max_batches,on_evaluation_model_eval,on_evaluation_model_train,on_evaluation_start,on_evaluation_epoch_start,on_evaluation_epoch_end,on_evaluation_end,reload_evaluation_dataloaders(#9516) - Marked several methods in
EvaluationEpochLoopas protected:on_evaluation_batch_start,evaluation_step,evaluation_step_end(#9516) - Added
yielding_training_stepexample (#9983)
- Added
- Added support for saving and loading state of multiple callbacks of the same type (#7187)
- Added DeepSpeed Stage 1 support (#8974)
- Added
Python dataclasssupport forLightningDataModule(#8272) - Added sanitization of tensors when they get logged as hyperparameters in
TensorBoardLogger(#9031) - Added
InterBatchParallelDataFetcher(#9020) - Added
DataLoaderIterDataFetcher(#9020) - Added
DataFetcherwithinFit / EvaluationLoop (#9047) - Added a friendly error message when DDP attempts to spawn new distributed processes with rank > 0 (#9005)
- Added Rich integration:
- Added input validation logic for precision (#9080)
- Added support for CPU AMP autocast (#9084)
- Added
on_exceptioncallback hook (#9183) - Added a warning to DeepSpeed when inferring batch size (#9221)
- Added
ModelSummarycallback (#9344) - Added
log_images,log_textandlog_tabletoWandbLogger(#9545) - Added
PL_RECONCILE_PROCESSenvironment variable to enable process reconciliation regardless of cluster environment settings (#9389) - Added
get_device_statsto the Accelerator interface and added its implementation for GPU and TPU (#9586) - Added a warning when an unknown key is encountered in the optimizer configuration, and when
OneCycleLRis used with"interval": "epoch"(#9666) - Added
DeviceStatsMonitorcallback (#9712) - Added
enable_progress_barto the Trainer constructor (#9664) - Added
pl_legacy_patchload utility for loading old checkpoints that have pickled legacy Lightning attributes (#9166) - Added support for
torch.use_deterministic_algorithms(#9121) - Added automatic parameters tying for TPUs (#9525)
- Added support for
torch.autograd.set_detect_anomalythroughTrainerconstructor argumentdetect_anomaly(#9848) - Added
enable_model_summaryflag to Trainer (#9699) - Added
strategyargument to Trainer (#8597) - Added
init_meta_context,materialize_moduleutilities (#9920) - Added
TPUPrecisionPlugin(#10020) - Added
torch.bfloat16support: - Added
kfoldexample for loop customization (#9965) - LightningLite:
- Added
PrecisionPlugin.forward_context, making it the default implementation for all{train,val,test,predict}_step_context()methods (#9988) - Added
DDPSpawnPlugin.spawn()for spawning new processes of a given function (#10018, #10022) - Added
TrainingTypePlugin.{_setup_model, _setup_optimizer}methods (#9994, #10064) - Implemented
DataParallelPlugin._setup_model(#10010) - Implemented
DeepSpeedPlugin._setup_model_and_optimizers(#10009, #10064) - Implemented
{DDPShardedPlugin,DDPShardedSpawnPlugin}._setup_model_and_optimizers(#10028, #10064) - Added optional
modelargument to theoptimizer_stepmethods in accelerators and plugins (#10023) - Updated precision attributes in
DeepSpeedPlugin(#10164) - Added the ability to return a result from rank 0 in
DDPSpawnPlugin.spawn(#10162) - Added
pytorch_lightning.litepackage (#10175) - Added
LightningLitedocumentation (#10043) - Added
LightningLiteexamples (#9987) - Make the
_LiteDataLoaderan iterator and add supports for custom dataloader (#10279)
- Added
- Added
use_omegaconfargument tosave_hparams_to_yamlplugin (#9170) - Added
ckpt_pathargument forTrainer.fit()(#10061) - Added
auto_device_countmethod toAccelerators(#10222) - Added support for
devices="auto"(#10264) - Added a
filenameargument inModelCheckpoint.format_checkpoint_name(#9818) - Added support for empty
gpuslist to run on CPU (#10246) - Added a warning if multiple batch sizes are found from ambiguous batch (#10247)
Changed
- Trainer now raises a
MisconfigurationExceptionwhen its methods are called withckpt_path="best"but a checkpoint callback isn't configured (#9841) - Setting
Trainer(accelerator="ddp_cpu")now does not spawn a subprocess ifnum_processesis kept1along withnum_nodes > 1(#9603) - Module imports are now catching
ModuleNotFoundErrorinstead ofImportError(#9867) pytorch_lightning.loggers.neptune.NeptuneLoggeris now consistent with the new neptune-client API; the old neptune-client API is supported byNeptuneClientfrom the neptune-contrib repo (#6867)- Parsing of
enumstype hyperparameters to be saved in thehaprams.yamlfile by TensorBoard and CSV loggers has been fixed and made in line with how OmegaConf parses it (#9170) - Parsing of the
gpusTrainer argument has changed:gpus="n"(str) no longer selects the GPU index n and instead selects the first n devices (#8770) iteration_countand other index attributes in the loops has been replaced with progress dataclasses (#8477)- The
trainer.lightning_modulereference is now properly set at the very beginning of a run (#8536) - The model weights now get loaded in all cases when the checkpoint path gets provided in validate/test/predict, regardless of whether the model instance is provided or not (#8352)
- The
Trainerfunctionsreset_{train,val,test,predict}_dataloader,reset_train_val_dataloaders, andrequest_dataloadermodelargument is now optional (#8536) - Saved checkpoints will no longer use the type of a
Callbackas the key to avoid issues with unpickling (#6886) - Improved string conversion for
ResultCollection(#8622) LightningCLIchanges:LightningCLI.init_parsernow returns the parser instance (#8721)LightningCLI.add_core_arguments_to_parser,LightningCLI.parse_argumentsnow take aparserargument (#8721)LightningCLI.instantiate_trainernow takes a config and a list of callbacks (#8721)- Split
LightningCLI.add_core_arguments_to_parserintoLightningCLI.add_default_arguments_to_parser+LightningCLI.add_core_arguments_to_parser(#8721)
- The accelerator and training type plugin setup hooks no longer have a model argument (#8536)
- The accelerator and training type plugin update_global_step hook has been removed (#8856)
- The coverage of self.log-ing in any LightningModule or Callback hook has been improved (#8498)
- self.log-ing without a Trainer reference now raises a warning instead of an exception (#9733)
- Removed restrictions in the Trainer that loggers can only log from rank 0; the existing logger behavior has not changed (#8608)
- Trainer.request_dataloader now takes a RunningStage enum instance (#8858)
- Changed rank_zero_warn to NotImplementedError in the {train, val, test, predict}_dataloader hooks that Lightning(Data)Module uses (#9161)
- Moved block_ddp_sync_behaviour out of TrainingBatchLoop to the loop utilities (#9192)
- Executing the optimizer_closure is now required when overriding the optimizer_step hook (#9360); see the sketch after this list
- Changed logging of LightningModule and LightningDataModule hyperparameters to raise an exception only if there are colliding keys with different values (#9496)
- seed_everything now fails when an invalid seed value is passed instead of selecting a random seed (#8787)
- The Trainer now calls the TrainingTypePlugin collective APIs directly instead of going through the Accelerator reference (#9677, #9901)
- The tuner now uses a unique filename to save a temporary checkpoint (#9682)
- Changed HorovodPlugin.all_gather to return a torch.Tensor instead of a list (#9696)
- Changed Trainer connectors to be protected attributes:
  - Configuration Validator (#9779)
- The current_epoch and global_step attributes now get restored irrespective of the Trainer task (#9413)
- Trainer now raises an exception when requesting amp_level with the native amp_backend (#9755)
- Updated the logic to check for accumulation steps with DeepSpeed (#9826)
- pytorch_lightning.utilities.grads.grad_norm now raises an exception if the parameter norm_type <= 0 (#9765)
- Updated the error message for interactive-incompatible plugins (#9896)
- Moved the optimizer_step and clip_gradients hooks from the Accelerator and TrainingTypePlugin into the PrecisionPlugin (#10143, #10029)
- NativeMixedPrecisionPlugin and its subclasses now take an optional GradScaler instance (#10055)
- Trainer now raises a MisconfigurationException instead of a warning if Trainer.{validate/test} is missing required methods (#10016)
- Changed the default value of the max_steps Trainer argument from None to -1 (#9460)
- LightningModule now raises an error when calling log(on_step=False, on_epoch=False) (#10227)
- Quantization-aware training observers are now disabled by default during the validating/testing/predicting stages (#8540)
- Raised a MisconfigurationException when the total length of the dataloader across ranks is zero, and a warning when the total length is non-zero but only the local rank's length is zero (#9827)
- Changed the model size calculation to use ByteCounter (#10123)
- Enabled on_load_checkpoint for LightningDataModule for all trainer_fn (#10238)
- Allowed separate config files for parameters with class type when LightningCLI is in subclass_mode=False (#10286)
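The most impactful behavior change in the list above is that a custom optimizer_step override must now execute the closure it receives, because that closure runs training_step and the backward pass. Below is a minimal sketch of such an override; the hook signature follows the LightningModule.optimizer_step API as documented around 1.5, while the toy linear layer, loss, and learning rate are purely illustrative.

import torch
from torch import nn
import torch.nn.functional as F
from pytorch_lightning import LightningModule


class ManualStepModule(LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = nn.Linear(32, 2)  # illustrative toy model

    def training_step(self, batch, batch_idx):
        x, y = batch
        return F.cross_entropy(self.layer(x), y)

    # Since 1.5, an optimizer_step override must run the closure it receives,
    # because the closure wraps training_step and the backward pass (#9360).
    def optimizer_step(
        self,
        epoch,
        batch_idx,
        optimizer,
        optimizer_idx,
        optimizer_closure,
        on_tpu=False,
        using_native_amp=False,
        using_lbfgs=False,
    ):
        # Passing the closure to step() executes it and then applies the update.
        optimizer.step(closure=optimizer_closure)

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.01)

Forgetting to run the closure (for example, calling optimizer.step() with no arguments) would mean no gradients are computed for that batch, which is why Lightning now enforces it.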
Deprecated
- Deprecated the Trainer argument terminate_on_nan in favor of detect_anomaly (#9175)
- Deprecated Trainer.terminate_on_nan public attribute access (#9849)
- Deprecated LightningModule.summarize() in favor of pytorch_lightning.utilities.model_summary.summarize() (#8513)
- Deprecated LightningModule.model_size (#8343)
- Deprecated the DataModule properties: train_transforms, val_transforms, test_transforms, size, dims (#8851)
- Deprecated add_to_queue and get_from_queue from LightningModule in favor of the corresponding methods in DDPSpawnPlugin (#9118)
- Deprecated LightningModule.get_progress_bar_dict and Trainer.progress_bar_dict in favor of pytorch_lightning.callbacks.progress.base.get_standard_metrics and ProgressBarBase.get_metrics (#8985)
- Deprecated the prepare_data_per_node flag on Trainer and set it as a property of DataHooks, accessible in the LightningModule and LightningDataModule (#8958)
- Deprecated the TestTubeLogger (#9065)
- Deprecated on_{train/val/test/predict}_dataloader() from LightningModule and LightningDataModule (#9098)
- Deprecated the on_keyboard_interrupt callback hook in favor of the new on_exception hook (#9260)
- Deprecated passing process_position to the Trainer constructor in favor of adding the ProgressBar callback with process_position directly to the list of callbacks (#9222)
- Deprecated passing flush_logs_every_n_steps as a Trainer argument; instead, pass it to the logger init if supported (#9366)
- Deprecated LightningLoggerBase.close and LoggerCollection.close in favor of LightningLoggerBase.finalize and LoggerCollection.finalize (#9422)
- Deprecated passing progress_bar_refresh_rate to the Trainer constructor in favor of adding the ProgressBar callback with refresh_rate directly to the list of callbacks, or passing enable_progress_bar=False to disable the progress bar (#9616); this and the other flag-to-callback migrations are illustrated in the sketch after this list
- Deprecated LightningDistributed and moved the broadcast logic to DDPPlugin and DDPSpawnPlugin directly (#9691)
- Deprecated passing stochastic_weight_avg to the Trainer constructor in favor of adding the StochasticWeightAveraging callback directly to the list of callbacks (#8989)
- Deprecated the Accelerator collective API barrier, broadcast, and all_gather in favor of calling the TrainingTypePlugin collective API directly (#9677)
- Deprecated checkpoint_callback from the Trainer constructor in favor of enable_checkpointing (#9754)
- Deprecated the LightningModule.on_post_move_to_device method (#9525)
- Deprecated pytorch_lightning.core.decorators.parameter_validation in favor of pytorch_lightning.utilities.parameter_tying.set_shared_parameters (#9525)
- Deprecated passing weights_summary to the Trainer constructor in favor of adding the ModelSummary callback with max_depth directly to the list of callbacks (#9699)
- Deprecated log_gpu_memory, gpu_metrics, and related util funcs in favor of the DeviceStatsMonitor callback (#9921)
- Deprecated GPUStatsMonitor and XLAStatsMonitor in favor of the DeviceStatsMonitor callback (#9924)
- Deprecated setting Trainer(max_steps=None); to turn off the limit, set Trainer(max_steps=-1) (the new default) (#9460)
- Deprecated access to the AcceleratorConnector.is_slurm_managing_tasks attribute and marked it as protected (#10101)
- Deprecated access to the AcceleratorConnector.configure_slurm_ddp method and marked it as protected (#10101)
- Deprecated passing resume_from_checkpoint to the Trainer constructor in favor of trainer.fit(ckpt_path=) (#10061)
- Deprecated ClusterEnvironment.creates_children() in favor of the ClusterEnvironment.creates_processes_externally property (#10106)
- Deprecated PrecisionPlugin.master_params() in favor of PrecisionPlugin.main_params() (#10105)
- Deprecated lr_sch_names from LearningRateMonitor (#10066)
- Deprecated the ProgressBar callback in favor of TQDMProgressBar (#10134)
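Many of these deprecations follow the same pattern: a Trainer constructor flag is replaced by a callback passed through callbacks=..., and resume_from_checkpoint moves to trainer.fit(ckpt_path=...). The snippet below is a minimal migration sketch, not an exhaustive recipe; the refresh rate, max_depth, SWA learning rate, and checkpoint path are placeholder values.

from pytorch_lightning import Trainer
from pytorch_lightning.callbacks import (
    ModelSummary,
    StochasticWeightAveraging,
    TQDMProgressBar,
)

# Before 1.5 (now deprecated):
# trainer = Trainer(
#     weights_summary="top",
#     progress_bar_refresh_rate=20,
#     stochastic_weight_avg=True,
#     resume_from_checkpoint="last.ckpt",
# )

trainer = Trainer(
    callbacks=[
        ModelSummary(max_depth=1),                # replaces weights_summary
        TQDMProgressBar(refresh_rate=20),         # replaces progress_bar_refresh_rate
        StochasticWeightAveraging(swa_lrs=1e-2),  # replaces stochastic_weight_avg=True
    ]
)

# The checkpoint to resume from is now passed per call instead of at construction:
# trainer.fit(model, ckpt_path="last.ckpt")

Keeping this behavior in callbacks rather than Trainer flags means the same configuration object can be reused, extended, or swapped out without touching the Trainer signature.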
Removed
- Removed deprecated metrics (#8586)
- Removed the deprecated outputs argument in both the LightningModule.on_train_epoch_end and Callback.on_train_epoch_end hooks (#8587); see the migration sketch after this list
- Removed the deprecated TrainerLoggingMixin class (#8609)
- Removed the deprecated TrainerTrainingTricksMixin class (#8679)
- Removed the deprecated optimizer_idx from training_step as an accepted argument in manual optimization (#8576)
- Removed support for the deprecated on_save_checkpoint signature; the hook now takes a checkpoint positional parameter (#8697)
- Removed support for the deprecated on_load_checkpoint signature; the hook now takes a pl_module positional parameter (#8697)
- Removed the deprecated save_function property in ModelCheckpoint (#8680)
- Removed the deprecated model argument from ModelCheckpoint.save_checkpoint (#8688)
- Removed the deprecated sync_step argument from WandbLogger (#8763)
- Removed the deprecated Trainer.truncated_bptt_steps in favor of LightningModule.truncated_bptt_steps (#8826)
- Removed LightningModule.write_predictions and LightningModule.write_predictions_dict (#8850)
- Removed the on_reset_*_dataloader hooks in the training type plugins and accelerators (#8858)
- Removed the deprecated GradInformation module in favor of pytorch_lightning.utilities.grads (#8831)
- Removed TrainingTypePlugin.on_save and Accelerator.on_save (#9023)
- Removed {Accelerator,TrainingTypePlugin,PrecisionPlugin}.post_optimizer_step (#9746)
- Removed the deprecated connect_precision_plugin and connect_training_type_plugin from Accelerator (#9019)
- Removed on_train_epoch_end from Accelerator (#9035)
- Removed InterBatchProcessor in favor of DataLoaderIterDataFetcher (#9052)
- Removed Plugin in base_plugin.py in favor of accessing TrainingTypePlugin and PrecisionPlugin directly (#9066)
- Removed teardown from ParallelPlugin (#8943)
- Removed the deprecated profiled_functions argument from PyTorchProfiler (#9178)
- Removed the deprecated pytorch_lightning.utilities.argparse_utils module (#9166)
- Removed the deprecated property Trainer.running_sanity_check in favor of Trainer.sanity_checking (#9209)
- Removed the deprecated BaseProfiler.output_filename argument from it and its descendants in favor of dirpath and filename (#9214)
- Removed the deprecated property ModelCheckpoint.period in favor of ModelCheckpoint.every_n_epochs (#9213)
- Removed the deprecated auto_move_data decorator (#9231)
- Removed the deprecated property LightningModule.datamodule in favor of Trainer.datamodule (#9233)
- Removed the deprecated properties DeepSpeedPlugin.cpu_offload* in favor of offload_optimizer, offload_parameters, and pin_memory (#9244)
- Removed the deprecated property AcceleratorConnector.is_using_torchelastic in favor of TorchElasticEnvironment.is_using_torchelastic() (#9729)
- Removed pytorch_lightning.utilities.debugging.InternalDebugger (#9680)
- Removed the call_configure_sharded_model_hook property from Accelerator and TrainingTypePlugin (#9612)
- Removed the TrainerProperties mixin and moved the property definitions directly into Trainer (#9495)
- Removed a redundant warning with the ModelCheckpoint(monitor=None) callback (#9875)
- Removed epoch from trainer.logged_metrics (#9904)
- Removed the should_rank_save_checkpoint property from Trainer (#9433)
- Removed the deprecated distributed_backend argument from Trainer (#10017)
- Removed process_idx from the {DDPSpawnPlugin,TPUSpawnPlugin}.new_process methods (#10022)
- Removed the automatic patching of {train,val,test,predict}_dataloader() on the LightningModule (#9764)
- Removed pytorch_lightning.trainer.connectors.OptimizerConnector (#10120)
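The removal of the outputs argument from on_train_epoch_end is the one most likely to touch user code. One common migration pattern, shown here only as a hedged sketch and not as the official recipe, is to cache whatever you need from training_step on the module and aggregate it in the hook; the attribute name, metric name, and toy model are all placeholders.

import torch
from torch import nn
import torch.nn.functional as F
from pytorch_lightning import LightningModule


class EpochAggregatingModule(LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = nn.Linear(32, 2)   # illustrative toy model
        self.train_losses = []          # manual cache replaces the removed `outputs` argument

    def training_step(self, batch, batch_idx):
        x, y = batch
        loss = F.cross_entropy(self.layer(x), y)
        self.train_losses.append(loss.detach())
        return loss

    def on_train_epoch_end(self):  # note: no `outputs` parameter anymore
        epoch_loss = torch.stack(self.train_losses).mean()
        self.log("train_loss_epoch", epoch_loss)
        self.train_losses.clear()

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)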
Fixed
- Fixed ImageNet evaluation in example (#10179)
- Fixed an issue with logger outputs not being finalized correctly after prediction runs (#8685)
- Fixed move_metrics_to_cpu moving the loss to the CPU while training on device (#9308)
- Fixed an incorrect main progress bar indicator when resuming training mid-epoch (#9310)
- Fixed an issue with freeing memory of datafetchers during teardown (#9387)
- Fixed a bug where the training step output needed to be deepcopy-ed (#9349)
- Fixed an issue with freeing memory allocated by the data iterators in Loop.on_run_end (#9386, #9915)
- Fixed BasePredictionWriter not returning the batch indices in a non-distributed setting (#9432)
- Fixed an error when running in XLA environments with no TPU attached (#9572)
- Fixed the check on logged torchmetrics whose compute() output is a multi-element tensor (#9582)
- Fixed gradient accumulation for DDPShardedPlugin (#9122)
- Fixed a missing DeepSpeed distributed call (#9540)
- Fixed an issue with the wrapped LightningModule during evaluation; the LightningModule no longer gets wrapped with data-parallel modules when not fitting in DDPPlugin, DDPSpawnPlugin, DDPShardedPlugin, and DDPSpawnShardedPlugin (#9096)
- Fixed trainer.accumulate_grad_batches to be an int on init; the default value for it is now None inside the Trainer (#9652)
- Fixed broadcast in DDPPlugin and DDPSpawnPlugin to respect the src input (#9691)
- Fixed self.log(on_epoch=True, reduce_fx=sum) for the on_batch_start and on_train_batch_start hooks (#9791)
- Fixed self.log(on_epoch=True) for the on_batch_start and on_train_batch_start hooks (#9780)
- Fixed restoring training state during Trainer.fit only (#9413)
- Fixed DeepSpeed and Lightning both calling the scheduler (#9788)
- Fixed missing arguments when saving hyperparameters from the parent class but not from the child class (#9800)
- Fixed DeepSpeed GPU device IDs (#9847)
- Reset val_dataloader in tuner/batch_size_scaling (#9857)
- Fixed the use of LightningCLI in the computer_vision_fine_tuning.py example (#9934)
- Fixed an issue with non-init dataclass fields in apply_to_collection (#9963)
- Reset val_dataloader in tuner/batch_size_scaling for binsearch (#9975)
- Fixed the logic to check for spawn in the dataloader TrainerDataLoadingMixin._worker_check (#9902)
- Fixed train_dataloader getting loaded twice when resuming from a checkpoint during Trainer.fit() (#9671)
- Fixed LearningRateMonitor logging with a multiple-param-group optimizer with no scheduler (#10044)
- Fixed undesired side effects caused by the Trainer patching dataloader methods on the LightningModule (#9764)
- Fixed gradients not being unscaled when clipping or logging the gradient norm (#9287)
- Fixed on_before_optimizer_step getting called before the optimizer closure (including backward) has run (#10167)
- Fixed the monitor value in ModelCheckpoint getting moved to the wrong device in a special case where it becomes NaN (#10118)
- Fixed creation of dirpath in BaseProfiler if it doesn't exist (#10073)
- Fixed incorrect handling of SIGTERM (#10189)
- Fixed a bug where log(on_step=True, on_epoch=True, sync_dist=True) wouldn't reduce the value on step (#10227)
- Fixed an issue with pl.utilities.seed.reset_seed converting the PL_SEED_WORKERS environment variable to bool (#10099)
- Fixed iterating over a logger collection when fast_dev_run > 0 (#10232)
- Fixed batch_size in ResultCollection not being reset to 1 on epoch end (#10242)
- Fixed distrib_type not being set when training plugin instances are passed to the Trainer (#10251)
Have any questions?
Contact Exxact Today


