Logging using a torchmetrics object that returns a dictionary
Hi everyone, some of the metrics in torchmetrics return a dictionary when I call their compute() method. For example, torchmetrics.SQuAD() returns {'exact_match': tensor(0., device='cuda:0'), 'f1':...
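One way to handle a dict-valued metric is to log each entry as its own scalar, e.g. via `LightningModule.log_dict`. A minimal framework-free sketch of the key-prefixing step (the helper name is hypothetical; only the `self.log_dict` call mentioned in the comment is Lightning API):

```python
# Hypothetical helper: flatten a metric dict so each entry can be logged
# as a separate scalar. In a LightningModule you could then write:
#     self.log_dict(prefix_metric_dict("val", self.squad.compute()))
# so that 'val/exact_match' and 'val/f1' appear as individual metrics.

def prefix_metric_dict(prefix, metrics):
    """Return a new dict whose keys are '<prefix>/<name>'."""
    return {f"{prefix}/{name}": value for name, value in metrics.items()}

print(prefix_metric_dict("val", {"exact_match": 0.0, "f1": 0.5}))
# {'val/exact_match': 0.0, 'val/f1': 0.5}
```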
Run_training_epoch duration increases with more epochs
Reposting this discussion question here because I read in another discussion that Lightning wants to move from Discussions to this forum: I have a LightningModule, DataModule, and Trainer that I am...
[CLI] How to Pass Arguments to Initialize an Object in L.LightningModule?
I want to use Lightning CLI to pass arguments to initialize a LightningModule and some objects inside it (e.g., an nn.Module). Lightning CLI provides some helpful features that allow me to create and...
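Lightning CLI (built on jsonargparse) can instantiate nested objects from `class_path`/`init_args` pairs in the config. A sketch of what such a YAML config might look like, assuming a hypothetical `my_project.MyModel` whose constructor takes a `backbone: torch.nn.Module` argument:

```yaml
# Hypothetical config.yaml: my_project.MyModel and its `backbone`
# parameter are placeholder names; the class_path/init_args structure
# is the documented Lightning CLI / jsonargparse convention.
model:
  class_path: my_project.MyModel
  init_args:
    learning_rate: 0.001
    backbone:
      class_path: torch.nn.Conv2d
      init_args:
        in_channels: 1
        out_channels: 16
        kernel_size: 3
```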
How to get the checkpoint without saving it?
When I train a LightningModule using a Trainer, how do I get the checkpoint object (which is presumably a Python dict) without saving it to disk? 2 posts - 2 participants
Improving poor training efficiency on A100 40 GB
Hi all! First, thank you for the amazing framework and blog. I am training falcon-7b on a custom dataset, with the following hyperparameters: batch_size = 2, aggregate_batch = 4, epochs = 10, train set size =...
View ArticleWhat's wrong with the pytorch lighting doc
I can’t see any details of each option in the navigation menu. It’s hard to learn from this document without navigation buttons like on_train_start, on_save_checkpoint, etc. 5 posts - 2 participants...
How to access the returned values of *_step()
Hey guys, I have a question about overriding LightningModule: the newest version removed the *_epoch_end functions but kept the return values in all the *_step() functions. Now how do I access the returned values...
How do I get the metric in on_validation_epoch_end()?
def validation_step(self, batch, batch_idx, dataloader_idx=None): I calculate the metric here: metric = XXXX. def on_validation_epoch_end(self): I would like to get the metric here. 3 posts - 2...
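Since the `*_epoch_end` hooks were removed in Lightning 2.0, the usual pattern is to collect per-step values in a list attribute during the epoch and read (then clear) them in `on_validation_epoch_end`. A framework-free sketch of that bookkeeping, so the aggregation logic is visible without a Trainer (the class and method names here are placeholders; in a real LightningModule the hooks would be `validation_step` and `on_validation_epoch_end`):

```python
# Framework-free sketch of the Lightning 2.0 pattern: append per-step
# values during the epoch, then reduce and clear them at epoch end.

class EpochMetricBuffer:
    def __init__(self):
        self.values = []              # filled once per validation step

    def step(self, metric_value):
        self.values.append(metric_value)

    def epoch_end(self):
        # reduce to one number per epoch, then clear for the next epoch
        mean = sum(self.values) / len(self.values)
        self.values.clear()
        return mean

buf = EpochMetricBuffer()
for v in (0.2, 0.4, 0.6):             # pretend these come from steps
    buf.step(v)
epoch_metric = buf.epoch_end()        # single value for the whole epoch
```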
Autograd issue
Hey folks! I am having an issue where code I am executing throws an error related to autograd, I suppose. I have defined a forward step as follows: def step(self, batch, mode): anc, pos = batch...
`self.lr_schedulers().optimizer` and `self.optimizers()` return different...
I’m training a GPT2 network and my configure_optimizers() is as follows: def configure_optimizers(self): opt = optim.Adam(self.model.parameters(), self.lr) # logging.info() total_steps = \...
LightningModule isn't loading checkpoint from the path as per documentation
Hi. I'm trying to use the methodology below to load my checkpoint; however, it throws IsADirectoryError when I pass the ckpt path as shown in the documentation. Here's my code: def main(): <some data...
How to correctly initialize latent vector parameters that have size dependent...
Hi, may I ask how to correctly create a set of latents for each sample in the training dataset? I.e., suppose you would like to have optimizable latent codes for each frame. The total...
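A common way to get one optimizable latent code per sample is an `nn.Embedding` table indexed by sample (frame) index: the table is registered as a parameter, so the codes are optimized, checkpointed, and moved between devices together with the module. A sketch, assuming PyTorch is installed and `num_frames`/`latent_dim` are placeholders:

```python
import torch
from torch import nn

# Sketch: one optimizable latent code per training sample, stored as an
# nn.Embedding so the codes are ordinary module parameters.
class PerFrameLatents(nn.Module):
    def __init__(self, num_frames: int, latent_dim: int):
        super().__init__()
        self.codes = nn.Embedding(num_frames, latent_dim)

    def forward(self, frame_idx: torch.Tensor) -> torch.Tensor:
        # frame_idx: LongTensor of sample indices from the batch
        return self.codes(frame_idx)

latents = PerFrameLatents(num_frames=100, latent_dim=8)
z = latents(torch.tensor([0, 5, 99]))   # one (latent_dim,) code per index
```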
Custom model definition is not included in checkpoint hyper_parameters
Hi, I have the following dummy LightningModule: class MyLightningModule(LightningModule): def __init__( self, param_1: torch.nn.Module = torch.nn.Conv2d(1, 1, 1), param_2: torch.nn.Module =...
save_hyperparameters and OptimizerCallable
If I have an OptimizerCallable argument in my model's constructor, using save_hyperparameters just gives python/name:jsonargparse._typehints.partial_instance rather than the arguments used to build the...
Disabling autocast for certain modules
Hi, I was wondering: what is the way in Lightning to disable mixed precision for certain submodules? Is there a way to do this through callbacks? Thanks. 2 posts - 2 participants
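At the PyTorch level this can be done by wrapping a submodule's forward in a `torch.autocast(..., enabled=False)` region, which opts that submodule out of any surrounding mixed-precision context. A sketch, assuming PyTorch is installed; it uses `device_type="cpu"` so it runs anywhere, but on GPU you would pass `"cuda"` (the `FullPrecision` wrapper name is hypothetical):

```python
import torch
from torch import nn

# Sketch: force a submodule to run in full precision even inside a
# mixed-precision (autocast) region.
class FullPrecision(nn.Module):
    def __init__(self, inner: nn.Module, device_type: str = "cpu"):
        super().__init__()
        self.inner = inner
        self.device_type = device_type

    def forward(self, x):
        with torch.autocast(self.device_type, enabled=False):
            # autocast may hand us a low-precision tensor; cast back up
            return self.inner(x.float())

layer = FullPrecision(nn.Linear(4, 4))
with torch.autocast("cpu", dtype=torch.bfloat16):
    out = layer(torch.randn(2, 4))
# out stays float32 despite the surrounding bfloat16 autocast region
```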
Size mismatch for model
Hi! I load a checkpoint from a model with head size = 1599 into the same model with head size = 59. I set strict=False, but got the error: Traceback (most recent call last): File...
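`strict=False` only tolerates missing or unexpected keys; tensors whose *shapes* differ still raise a size-mismatch error. A common workaround is to drop the incompatible entries from the state dict by hand before loading. A sketch, assuming PyTorch is installed (the `load_matching` helper name is hypothetical):

```python
import torch
from torch import nn

# Sketch: keep only checkpoint tensors whose shape matches the target
# model, load those with strict=False, and report what was skipped.
def load_matching(model: nn.Module, ckpt_state: dict) -> list:
    own = model.state_dict()
    compatible = {k: v for k, v in ckpt_state.items()
                  if k in own and v.shape == own[k].shape}
    skipped = [k for k in ckpt_state if k not in compatible]
    model.load_state_dict(compatible, strict=False)
    return skipped  # these keys keep their freshly initialized values

# e.g. a checkpoint with head size 1599 loaded into a head of size 59:
src = nn.Linear(16, 1599)
dst = nn.Linear(16, 59)
skipped = load_matching(dst, src.state_dict())  # ['weight', 'bias']
```

The skipped layers (here the whole mismatched head) then need to be retrained or re-initialized deliberately.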
Where should I load the model checkpoint when using configure_model?
When I load the model checkpoint in configure_model, the following error occurs. It seems an empty model is created; where should I load the model checkpoint? size mismatch for...
Load checkpoint with dynamically created model
Hi, in the LightningModule docs, the setup hook is described as a way to dynamically build a model (instead of instantiating it in __init__). See the example here. However, when I load a...
ERROR:root:Attempting to deserialize object on a CUDA device but...
Dear all, I trained a model that came from Hugging Face; the training works and saves the checkpoint. But when I try to load the model on a PC without CUDA, I obtain the error: ERROR:root:Attempting...
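This error typically means a checkpoint saved from GPU tensors is being deserialized on a CUDA-less machine; `torch.load`'s `map_location` argument remaps the tensors onto the CPU at load time. A sketch, assuming PyTorch is installed (`path` is a placeholder):

```python
import torch

# Sketch: map_location="cpu" remaps CUDA-saved tensors onto the CPU,
# avoiding "Attempting to deserialize object on a CUDA device" on
# machines without a GPU.
def load_on_cpu(path: str) -> dict:
    return torch.load(path, map_location=torch.device("cpu"))
```

Lightning's `LightningModule.load_from_checkpoint` accepts the same `map_location` argument, so `MyModel.load_from_checkpoint(path, map_location="cpu")` is the equivalent at the module level.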
Logging one value per epoch?
Reading the documentation and following the examples, there doesn’t seem to be a way to log just one value per epoch. This is insane, because when you’re trying to figure out a model architecture,...
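For reference, Lightning's `self.log(name, value, on_step=False, on_epoch=True)` does accumulate per-step values and emit a single reduced value per epoch. A framework-free sketch of that reduction (the `RunningMean` class is hypothetical; it mimics the default mean reduction without storing every value):

```python
# Framework-free sketch of what self.log(..., on_step=False,
# on_epoch=True) does: accumulate during the epoch, emit one reduced
# value at epoch end, then reset for the next epoch.
class RunningMean:
    def __init__(self):
        self.total = 0.0
        self.count = 0

    def update(self, value: float):
        self.total += value
        self.count += 1

    def compute_and_reset(self) -> float:
        mean = self.total / self.count
        self.total, self.count = 0.0, 0
        return mean

loss_epoch = RunningMean()
for loss in (1.0, 0.5, 0.3):                  # per-step losses
    loss_epoch.update(loss)
epoch_value = loss_epoch.compute_and_reset()  # one value per epoch
```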