ValueError: too many values to unpack (expected 3)
For learning purposes, I am trying to create a simple fine-tuning example using T5 and Lightning: import pandas as pd df = pd.DataFrame({ "text": ["O Brasil é um país localizado na América do Sul.", "A...
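For reference, here is a minimal sketch of what a T5 fine-tuning LightningModule can look like. The batch format (a tokenizer-produced dict with input_ids, attention_mask and labels), the model name and the learning rate are assumptions rather than the poster's actual code; the comment in training_step points at the usual source of this unpacking error.

```python
import lightning as L
import torch
from transformers import T5ForConditionalGeneration


class T5FineTuner(L.LightningModule):
    def __init__(self, model_name: str = "t5-small", lr: float = 1e-4):
        super().__init__()
        self.save_hyperparameters()
        self.model = T5ForConditionalGeneration.from_pretrained(model_name)

    def training_step(self, batch, batch_idx):
        # Unpacking by key avoids the "too many values to unpack" failure mode
        # that a positional unpack (x, y, z = batch) hits when the collate
        # function returns a different number of items than expected.
        outputs = self.model(
            input_ids=batch["input_ids"],
            attention_mask=batch["attention_mask"],
            labels=batch["labels"],
        )
        self.log("train_loss", outputs.loss)
        return outputs.loss

    def configure_optimizers(self):
        return torch.optim.AdamW(self.parameters(), lr=self.hparams.lr)
```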
Question about recovering a nested model from checkpoint
I have a nested model: class MovieScoreTask(pl.LightningModule): def __init__(self, base_model: nn.Module, learning_rate: float): super().__init__() self.save_hyperparameters() # self.example_input_array...
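One common pattern for this situation (a sketch under assumptions, not the poster's full code): exclude the nn.Module argument from save_hyperparameters() and pass a fresh instance back in through load_from_checkpoint(). The forward body and optimizer below are illustrative.

```python
import lightning.pytorch as pl
import torch
from torch import nn


class MovieScoreTask(pl.LightningModule):
    def __init__(self, base_model: nn.Module, learning_rate: float):
        super().__init__()
        # nn.Module arguments do not belong in hparams; keep only the
        # scalar hyperparameters in the checkpoint.
        self.save_hyperparameters(ignore=["base_model"])
        self.base_model = base_model

    def forward(self, x):
        return self.base_model(x)

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=self.hparams.learning_rate)


# At restore time, rebuild the backbone and pass it in explicitly:
# task = MovieScoreTask.load_from_checkpoint("path/to.ckpt", base_model=build_backbone())
```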
Metrics not logged properly in PyTorch Lightning
Logging is not working as expected. The console shows the following values: v_num: z3_3, val_loss: 3.105, val_kappa: 0.34, val_accuracy: 0.295, train_loss: 2.436, train_kappa: nan, train_accuracy: 0.0...
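For comparison, a minimal sketch of the usual torchmetrics + self.log pattern; the metric choices, the tiny linear model and the six-class setup are illustrative assumptions, not the poster's code.

```python
import lightning as L
import torch
import torchmetrics


class LitClassifier(L.LightningModule):
    def __init__(self, num_classes: int = 6):
        super().__init__()
        self.model = torch.nn.Linear(32, num_classes)
        self.train_acc = torchmetrics.Accuracy(task="multiclass", num_classes=num_classes)
        self.train_kappa = torchmetrics.CohenKappa(task="multiclass", num_classes=num_classes)

    def training_step(self, batch, batch_idx):
        x, y = batch
        logits = self.model(x)
        loss = torch.nn.functional.cross_entropy(logits, y)
        # Update the metric states, then log the metric objects; on_epoch=True
        # aggregates correctly across batches and devices.
        self.train_acc(logits, y)
        self.train_kappa(logits, y)
        self.log("train_loss", loss, prog_bar=True)
        self.log("train_accuracy", self.train_acc, on_step=False, on_epoch=True, prog_bar=True)
        self.log("train_kappa", self.train_kappa, on_step=False, on_epoch=True, prog_bar=True)
        return loss

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)
```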
Mixed precision training (how to appropriately scale the manual gradient...
I’m working with mixed precision training. My loss conceptually has two components: loss1 and loss2. I call self.manual_backward(loss1, retain_graph=True). This fills gradients for all params. For...
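A sketch of how two loss terms are typically handled with manual optimization; the tiny model and the second loss term are placeholders. As far as I understand, self.manual_backward routes through Lightning's precision plugin, so the loss scaling under mixed precision is applied for you.

```python
import lightning as L
import torch


class TwoLossModule(L.LightningModule):
    def __init__(self):
        super().__init__()
        self.automatic_optimization = False
        self.net = torch.nn.Linear(16, 1)

    def training_step(self, batch, batch_idx):
        opt = self.optimizers()
        x, y = batch
        pred = self.net(x)
        loss1 = torch.nn.functional.mse_loss(pred, y)
        loss2 = pred.abs().mean()  # placeholder second loss term

        opt.zero_grad()
        # retain_graph=True keeps the graph alive for the second backward pass.
        self.manual_backward(loss1, retain_graph=True)
        self.manual_backward(loss2)
        opt.step()

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=1e-2)
```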
RuntimeError: one of the variables needed for gradient computation has been...
My first forward pass goes smoothly, but then I encounter this runtime error: Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass RuntimeError: one of the...
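This is not specific to the post, but a small self-contained illustration of the error class: an in-place edit of a tensor that the backward pass still needs (commented out below) triggers it, and torch.autograd.set_detect_anomaly can help locate the offending operation.

```python
import torch

x = torch.randn(4, requires_grad=True)
y = torch.relu(x)
# y.add_(1.0)   # in-place edit of a tensor saved for backward -> this RuntimeError
y = y + 1.0     # out-of-place version keeps the autograd graph intact
y.sum().backward()

# When the culprit is hard to find, anomaly detection makes the traceback
# point at the forward-pass operation that produced the modified tensor:
with torch.autograd.set_detect_anomaly(True):
    out = torch.relu(torch.randn(4, requires_grad=True)).sum()
    out.backward()
```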
How can I remove metric parameters from the model?
Hi, I have a problem: Lightning saves my metric parameters in the checkpoint, which means plain PyTorch cannot load the weights directly. How can I exclude them? Below is my code: class IAT_enhancement(L.LightningModule):...
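One workaround sketch, assuming the metric lives as an attribute on the module: strip the metric entries from the checkpoint's state_dict in on_save_checkpoint so the remaining keys match the bare PyTorch model. The network and the SSIM metric below are placeholders.

```python
import lightning as L
import torch
from torchmetrics.image import StructuralSimilarityIndexMeasure


class IAT_enhancement(L.LightningModule):
    def __init__(self):
        super().__init__()
        self.net = torch.nn.Linear(8, 8)
        # Stand-in for whatever metrics the real module holds.
        self.val_ssim = StructuralSimilarityIndexMeasure()

    def on_save_checkpoint(self, checkpoint):
        # Drop every entry that belongs to a metric attribute, so the
        # remaining keys line up with the bare PyTorch model.
        metric_prefixes = ("val_ssim.",)
        checkpoint["state_dict"] = {
            k: v
            for k, v in checkpoint["state_dict"].items()
            if not k.startswith(metric_prefixes)
        }
```

Alternatively, the plain PyTorch side can call load_state_dict(state_dict, strict=False) and simply ignore the extra metric keys.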
Confusion about load_from_checkpoint() and save_hyperparameters()
According to Saving and loading checkpoints (basic) — PyTorch Lightning 2.1.3 documentation, there is a model like this: class Encoder(L.LightningModule): ... class Decoder(L.LightningModule): ......
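A hedged reconstruction of the pattern that docs page describes (the class bodies here are illustrative, not the exact documentation code): submodules passed to __init__ are excluded from the saved hyperparameters and supplied again at load time.

```python
import lightning as L
import torch
from torch import nn


class Encoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.l1 = nn.Linear(28 * 28, 64)

    def forward(self, x):
        return self.l1(x)


class Decoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.l1 = nn.Linear(64, 28 * 28)

    def forward(self, x):
        return self.l1(x)


class LitAutoEncoder(L.LightningModule):
    def __init__(self, encoder, decoder):
        super().__init__()
        # Keep nn.Module arguments out of hparams; only picklable values
        # belong in the checkpoint's hyperparameters.
        self.save_hyperparameters(ignore=["encoder", "decoder"])
        self.encoder = encoder
        self.decoder = decoder

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)


# The ignored arguments must be provided again when restoring:
# model = LitAutoEncoder.load_from_checkpoint(
#     "path/to.ckpt", encoder=Encoder(), decoder=Decoder()
# )
```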
Save and restore persisted DataLoader states from checkpoint
Hi! I am working on a project to save and restore persisted DataLoader states from a checkpoint, especially when working with the vanilla PyTorch DataLoader. Can you provide suggestions on how to implement that?...
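One possible approach, sketched under the assumption that the relevant loader state can be reduced to a few values tracked explicitly: a LightningDataModule's state_dict()/load_state_dict() hooks are, to my knowledge, written into and restored from Lightning checkpoints. The seed-based "state" below is purely illustrative; a vanilla DataLoader exposes no state of its own.

```python
import lightning as L
import torch
from torch.utils.data import DataLoader, TensorDataset


class MyDataModule(L.LightningDataModule):
    def __init__(self):
        super().__init__()
        self.shuffle_seed = 0  # stand-in for "DataLoader state"

    def train_dataloader(self):
        dataset = TensorDataset(torch.randn(256, 8), torch.randn(256, 1))
        generator = torch.Generator().manual_seed(self.shuffle_seed)
        return DataLoader(dataset, batch_size=32, shuffle=True, generator=generator)

    def state_dict(self):
        # Whatever is returned here ends up in the checkpoint.
        return {"shuffle_seed": self.shuffle_seed}

    def load_state_dict(self, state_dict):
        self.shuffle_seed = state_dict["shuffle_seed"]
```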
How to interactively run inference with a model in a Jupyter notebook created...
example: RAD-MMM/tts_main.py at main · NVIDIA/RAD-MMM · GitHub
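Not specific to RAD-MMM, but a sketch of the usual notebook workflow: rebuild the LightningModule from its checkpoint and call it directly, outside the Trainer. The module name, checkpoint path and input shape are placeholders.

```python
import torch

from my_project import MyLightningModule  # hypothetical module from the training script

model = MyLightningModule.load_from_checkpoint("path/to/checkpoint.ckpt")
model.eval()

with torch.no_grad():
    prediction = model(torch.randn(1, 16))  # input shape depends on the real model
```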
Do I need to detach when using self.logger.experiment.add_scalars?
I am aware that when we use self.log("train_loss", loss), for instance, the loss tensor is automatically detached to avoid a CPU RAM leak. However, if I am logging something else through the method...
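For the direct-SummaryWriter case (assuming the TensorBoardLogger, with illustrative names throughout), passing plain Python numbers or detached tensors to add_scalars avoids keeping the autograd graph alive:

```python
import lightning as L
import torch


class LitModel(L.LightningModule):
    def __init__(self):
        super().__init__()
        self.net = torch.nn.Linear(8, 1)

    def training_step(self, batch, batch_idx):
        x, y = batch
        loss = torch.nn.functional.mse_loss(self.net(x), y)
        self.log("train_loss", loss)  # detached internally by Lightning

        # Direct TensorBoard call bypasses Lightning: convert to a plain float.
        self.logger.experiment.add_scalars(
            "losses", {"train": loss.detach().item()}, self.global_step
        )
        return loss

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=1e-2)
```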
Skip instances during training
Hi, I am using the LightningModule to train a neural network across many instances/GPUs. However, the data is imbalanced (I cannot change this), so I want to skip over some instances during training...
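One way to do this with automatic optimization is to return None from training_step, which makes Lightning skip that batch; the skip condition below is a placeholder, not a recommendation for this dataset.

```python
import lightning as L
import torch


class SkippingModule(L.LightningModule):
    def __init__(self):
        super().__init__()
        self.net = torch.nn.Linear(8, 1)

    def training_step(self, batch, batch_idx):
        x, y = batch
        if y.sum() == 0:  # placeholder for a real "skip this instance" rule
            return None   # Lightning skips backward/step for this batch
        loss = torch.nn.functional.mse_loss(self.net(x), y)
        return loss

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=1e-2)
```

Note that with DDP, skipping different batches on different ranks can make the processes fall out of step, so filtering at the sampler or dataset level may be the safer option for multi-GPU training.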
LightningModule.train_dataloader()
How do the hooks of the LightningModule interact with the hooks of the LightningDataModule? Does one override the other? Previously, I was able to call the LightningDataModule.train_dataloader()...
Passes the sanity check but hits CUDA OUT OF MEMORY in the validation loop
Hi, when I run the training code, it passes the sanity check and uses about 15 GB of the 24 GB of GPU memory. But when the code reaches the validation loop, I get a CUDA OUT OF MEMORY error (it was fine in the training loop; my...
Save torchmetrics plots after logging them in LightningModule
Hello, I am using a LightningModule and a Trainer, and I'm using multiple Metrics from torchmetrics; some are native metrics from the library and some are customized Metric objects. I'm only interested...
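A sketch using torchmetrics' plotting support (available in torchmetrics 1.x, to my knowledge); the model, metric and file name are illustrative. The figure returned by Metric.plot() is a regular matplotlib figure, so it can be saved to disk or handed to a logger.

```python
import lightning as L
import torch
import torchmetrics


class PlottingModule(L.LightningModule):
    def __init__(self):
        super().__init__()
        self.net = torch.nn.Linear(8, 2)
        self.val_acc = torchmetrics.Accuracy(task="multiclass", num_classes=2)

    def validation_step(self, batch, batch_idx):
        x, y = batch
        self.val_acc(self.net(x), y)

    def on_validation_epoch_end(self):
        # plot() renders the currently accumulated value as a matplotlib figure.
        fig, ax = self.val_acc.plot()
        fig.savefig(f"val_acc_epoch_{self.current_epoch}.png")
        self.val_acc.reset()
```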
Fine-tuning using LLaMA models
Hello, my code was working with the T5 model for fine-tuning: # train.py import os import torch import datasets from transformers import T5ForConditionalGeneration, T5Tokenizer import lightning as L...
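For comparison, a sketch of the pieces that typically change when moving from T5 to a LLaMA-style model: a causal-LM class instead of the seq2seq one, and labels derived from the input ids. The checkpoint name, learning rate and batch format are assumptions, and gated LLaMA checkpoints also require Hugging Face authentication.

```python
import lightning as L
import torch
from transformers import AutoModelForCausalLM


class LlamaFineTuner(L.LightningModule):
    def __init__(self, model_name: str = "meta-llama/Llama-2-7b-hf", lr: float = 2e-5):
        super().__init__()
        self.save_hyperparameters()
        self.model = AutoModelForCausalLM.from_pretrained(model_name)

    def training_step(self, batch, batch_idx):
        # For a causal LM the labels are the input ids themselves (the model
        # shifts them internally), not a separate target sequence as with T5.
        outputs = self.model(
            input_ids=batch["input_ids"],
            attention_mask=batch["attention_mask"],
            labels=batch["input_ids"],
        )
        self.log("train_loss", outputs.loss)
        return outputs.loss

    def configure_optimizers(self):
        return torch.optim.AdamW(self.parameters(), lr=self.hparams.lr)
```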
DLRM run failed in torchrec+lightning
Model: recipes/torchrecipes/rec at main · facebookresearch/recipes · GitHub. Error: dlrm_main/0 [0]:[rank0]: Traceback (most recent call last): dlrm_main/0 [0]:[rank0]: File...