PyTorch DistributedDataParallel: fixing worse results in multi-GPU training
Configuring data shuffling in DDP
When using DDP, pass a sampler to the DataLoader: torch.utils.data.distributed.DistributedSampler(dataset, num_replicas=None, rank=None, shuffle=True, seed=0, drop_last=False). shuffle defaults to True, but note how PyTorch implements DistributedSampler:
def __iter__(self) -> Iterator[T_co]:
    if self.shuffle:
        # deterministically shuffle based on epoch and seed
        g = torch.Generator()
        g.manual_seed(self.seed + self.epoch)
        indices = torch.randperm(len(self.dataset), generator=g).tolist()  # type: ignore
    else:
        indices = list(range(len(self.dataset)))  # type: ignore
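The seed used to generate the random indices depends on the current epoch. If the epoch is never updated, every epoch reuses the same permutation, so you have to call set_epoch yourself during training to get a real shuffle: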
for epoch in range(start_epoch, n_epochs):
    if is_distributed:
        sampler.set_epoch(epoch)
    train(loader)
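The effect of set_epoch can be checked without launching any processes. The sketch below assumes a toy 10-element list as the dataset and passes num_replicas and rank explicitly (hypothetical values chosen for illustration), so no process group has to be initialized.

```python
from torch.utils.data.distributed import DistributedSampler

# Toy dataset (assumption for illustration): the sampler only needs len().
dataset = list(range(10))

# Explicit num_replicas/rank means no initialized process group is required.
sampler = DistributedSampler(dataset, num_replicas=2, rank=0, shuffle=True, seed=0)

first_pass = list(sampler)
second_pass = list(sampler)   # epoch is still 0, so the permutation repeats

sampler.set_epoch(1)          # the generator is now seeded with seed + 1
third_pass = list(sampler)

print(first_pass == second_pass)  # same order when set_epoch is never called
print(third_pass)                 # indices drawn from the epoch-1 permutation
```

Because the permutation depends only on seed and epoch, every rank computes the same permutation and takes its own slice of it, which is why set_epoch has to be called with the same value on all processes.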
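Why results get worse when DDP increases the batch size
Large batch size (with DDP the effective batch size is the per-process batch_size times the number of processes):
Theoretical advantage:
The influence of noise in the data may shrink, which may make it easier to approach the optimum.
Drawbacks and problems:
It lowers the variance of the gradient. (In theory, for convex optimization problems, lower gradient variance gives better optimization. In practice, however, Keskar et al. showed that increasing the batch size leads to worse generalization.)
For non-convex problems the loss function has many local optima. The noise of a small batch size can help the optimizer escape a local optimum, whereas a large batch size may get stuck in one.
Solutions:
Increase the learning rate. This can backfire: a very large learning rate right from the start of training may keep the model from converging (https://arxiv.org/abs/1609.04836).
Use warmup (https://arxiv.org/abs/1706.02677).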
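The second paper above (Goyal et al.) pairs warmup with a linear scaling rule: when the global batch size is multiplied by k, the learning rate is multiplied by k as well. A minimal sketch of that arithmetic follows; the function name and the reference values are assumptions for illustration, not settings from this article.

```python
def scaled_learning_rate(base_lr, base_batch_size, per_gpu_batch_size, world_size):
    """Linear scaling rule: the lr grows in proportion to the global batch size."""
    global_batch_size = per_gpu_batch_size * world_size
    return base_lr * global_batch_size / base_batch_size


# Reference setting (assumed): lr 0.1 was tuned for a global batch of 256.
same_batch = scaled_learning_rate(0.1, 256, per_gpu_batch_size=32, world_size=8)
bigger_batch = scaled_learning_rate(0.1, 256, per_gpu_batch_size=128, world_size=8)

print(same_batch)    # 8 GPUs x 32 samples = 256, so the lr is unchanged
print(bigger_batch)  # 8 GPUs x 128 samples = 1024, so the lr is 4x larger
```

The scaled value is the target that warmup ramps up to, rather than the rate used from the first step.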
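Warmup
Using a large learning rate from the very start of training may keep it from converging. The idea of warmup is to start with a small learning rate, raise it gradually until it reaches the base learning rate, and then continue with another decay schedule (for example CosineAnnealingLR).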
# copied from https://github.com/ildoonet/pytorch-gradual-warmup-lr/blob/master/warmup_scheduler/scheduler.py
from torch.optim.lr_scheduler import _LRScheduler
from torch.optim.lr_scheduler import ReduceLROnPlateau


class GradualWarmupScheduler(_LRScheduler):
    """ Gradually warm-up(increasing) learning rate in optimizer.
    Proposed in 'Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour'.

    Args:
        optimizer (Optimizer): Wrapped optimizer.
        multiplier: target learning rate = base lr * multiplier if multiplier > 1.0. if multiplier = 1.0, lr starts from 0 and ends up with the base_lr.
        total_epoch: target learning rate is reached at total_epoch, gradually
        after_scheduler: after target_epoch, use this scheduler(eg. ReduceLROnPlateau)
    """

    def __init__(self, optimizer, multiplier, total_epoch, after_scheduler=None):
        self.multiplier = multiplier
        if self.multiplier < 1.:
            raise ValueError('multiplier should be greater than or equal to 1.')
        self.total_epoch = total_epoch
        self.after_scheduler = after_scheduler
        self.finished = False
        super(GradualWarmupScheduler, self).__init__(optimizer)

    def get_lr(self):
        if self.last_epoch > self.total_epoch:
            if self.after_scheduler:
                if not self.finished:
                    self.after_scheduler.base_lrs = [base_lr * self.multiplier for base_lr in self.base_lrs]
                    self.finished = True
                return self.after_scheduler.get_last_lr()
            return [base_lr * self.multiplier for base_lr in self.base_lrs]

        if self.multiplier == 1.0:
            return [base_lr * (float(self.last_epoch) / self.total_epoch) for base_lr in self.base_lrs]
        else:
            return [base_lr * ((self.multiplier - 1.) * self.last_epoch / self.total_epoch + 1.) for base_lr in self.base_lrs]

    def step_ReduceLROnPlateau(self, metrics, epoch=None):
        if epoch is None:
            epoch = self.last_epoch + 1
        # ReduceLROnPlateau is called at the end of epoch, whereas others are called at beginning
        self.last_epoch = epoch if epoch != 0 else 1
        if self.last_epoch <= self.total_epoch:
            warmup_lr = [base_lr * ((self.multiplier - 1.) * self.last_epoch / self.total_epoch + 1.) for base_lr in self.base_lrs]
            for param_group, lr in zip(self.optimizer.param_groups, warmup_lr):
                param_group['lr'] = lr
        else:
            if epoch is None:
                self.after_scheduler.step(metrics, None)
            else:
                self.after_scheduler.step(metrics, epoch - self.total_epoch)

    def step(self, epoch=None, metrics=None):
        if type(self.after_scheduler) != ReduceLROnPlateau:
            if self.finished and self.after_scheduler:
                if epoch is None:
                    self.after_scheduler.step(None)
                else:
                    self.after_scheduler.step(epoch - self.total_epoch)
                self._last_lr = self.after_scheduler.get_last_lr()
            else:
                return super(GradualWarmupScheduler, self).step(epoch)
        else:
            self.step_ReduceLROnPlateau(metrics, epoch)
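To see what schedule this produces, the warmup branch of get_lr can be isolated as a plain function. This is a sketch that mirrors only the formulas above for a single parameter group and ignores after_scheduler; the function name and the sample values are assumptions for illustration.

```python
def warmup_lr(base_lr, multiplier, epoch, total_epoch):
    """Learning rate at a given epoch, following the get_lr formulas above."""
    if epoch > total_epoch:
        # Warmup is over: stay at the target rate (no after_scheduler here).
        return base_lr * multiplier
    if multiplier == 1.0:
        # Ramp linearly from 0 up to base_lr.
        return base_lr * epoch / total_epoch
    # Ramp linearly from base_lr up to base_lr * multiplier.
    return base_lr * ((multiplier - 1.0) * epoch / total_epoch + 1.0)


# Sample values (assumed): base lr 0.1, target 8x, reached after 5 epochs.
for epoch in range(7):
    print(epoch, warmup_lr(base_lr=0.1, multiplier=8.0, epoch=epoch, total_epoch=5))
```

With multiplier=1.0 the ramp starts at 0, so the very first epoch trains with a zero learning rate; with multiplier>1.0 it starts at base_lr instead.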
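Pitfalls of distributed multi-GPU training with DistributedDataParallel
I spent the last few days looking into multi-GPU training. I expected it to be easy, but ran into many pitfalls and worked through them one at a time. Distributed training generally comes in two flavors: single machine with multiple GPUs, and multiple machines with multiple GPUs.
There are two main ways to implement it:
1. DataParallel: parameter-server style, with one GPU acting as the reducer. It is extremely simple to use: one line of code.
DataParallel follows the parameter-server approach, and its load imbalance is fairly severe. With larger models (for example bert-large), the reducer GPU can use 3-4 GB more memory than the others.
2. DistributedDataParallel: the officially recommended newer option (DDP). It uses the all-reduce algorithm and was designed mainly for multi-machine, multi-GPU training, but it works on a single machine as well.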
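Why train in a distributed way?
Multiple GPUs can be used, so training runs faster overall.
A larger batch size becomes possible.
Some distributed setups achieve better results.
The rest of this article covers:
Single machine, multiple GPUs with DataParallel (most common, simplest)
Single machine, multiple GPUs with DistributedDataParallel (more advanced), and multiple machines, multiple GPUs with DistributedDataParallel (most advanced)
How to launch training
Saving and loading models
Notes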
I. Single machine, multiple GPUs (DataParallel)
import torch
from torch.nn import DataParallel

device = torch.device("cuda")  # or: torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
model = MyModel()
model = model.to(device)
model = DataParallel(model)  # or: DataParallel(model, device_ids=[0, 1, 2, 3])
This is fairly simple; only one extra line is needed: model = DataParallel(model)
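II. Multiple machines or a single machine, multiple GPUs (DistributedDataParallel)
Read the Notes section before changing your code, to avoid puzzling bugs. The modified training code looks like this:
opt.local_rank has to be parsed earlier in the script; see the Notes section further down.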
import torch
import torch.distributed as dist
from torch.utils.data.distributed import DistributedSampler

# Initialize the process group
dist_backend = 'nccl'
print('args.local_rank: ', opt.local_rank)
torch.cuda.set_device(opt.local_rank)
dist.init_process_group(backend=dist_backend)

device = torch.device('cuda', opt.local_rank)
model = yourModel()  # your own model
model.to(device)     # move to this process's GPU before wrapping
if torch.cuda.device_count() > 1:
    print("Let's use", torch.cuda.device_count(), "GPUs!")
# Wrap the model; each process drives exactly one GPU
model = torch.nn.parallel.DistributedDataParallel(model,
                                                  device_ids=[opt.local_rank],
                                                  output_device=opt.local_rank)

dataset = ListDataset(train_path, augment=True, multiscale=opt.multiscale_training,
                      img_size=opt.img_size, normalized_labels=True)  # your own data-loading code
# Use the global rank so the split is also correct across several nodes
datasampler = DistributedSampler(dataset, num_replicas=dist.get_world_size(), rank=dist.get_rank())
dataloader = torch.utils.data.DataLoader(
    dataset,
    batch_size=opt.batch_size,
    shuffle=False,  # shuffling is handled by the sampler
    num_workers=opt.n_cpu,
    pin_memory=True,
    collate_fn=dataset.collate_fn,
    sampler=datasampler,
)  # just add the sampler argument to the original DataLoader call

# ..... inside the training loop, move each batch to the GPU
imgs = imgs.to(device)
targets = targets.to(device)
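What the sampler argument buys you is that each process sees a different slice of the dataset. The sketch below illustrates this with an 8-element toy list and two hypothetical ranks created in a single process (both are assumptions for illustration); shuffle=False keeps the split easy to read.

```python
from torch.utils.data.distributed import DistributedSampler

# Toy dataset (assumption for illustration): the sampler only needs len().
dataset = list(range(8))

# One sampler per hypothetical rank; explicit num_replicas/rank avoids
# the need for an initialized process group.
shards = [
    list(DistributedSampler(dataset, num_replicas=2, rank=rank, shuffle=False))
    for rank in range(2)
]

print(shards[0])  # indices handled by rank 0
print(shards[1])  # indices handled by rank 1
```

When the dataset length is not divisible by the number of replicas and drop_last is False, the sampler pads the index list with repeated samples so that every rank gets the same number of items.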
III. How to launch training
1. With DataParallel
Run the script as usual:
python3 train.py
2. With DistributedDataParallel
Training has to be started through torch.distributed.launch. The usual case is a single node:
CUDA_VISIBLE_DEVICES=0,1 python3 -m torch.distributed.launch --nproc_per_node=2 train.py
CUDA_VISIBLE_DEVICES selects which GPUs to use, and --nproc_per_node is the number of processes (one per GPU) on each node. Normally you use as many processes as you have GPUs.
Multiple nodes:
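python3 -m torch.distributed.launch --nproc_per_node=NUM_GPUS_YOU_HAVE --nnodes=2 --node_rank=0 train.py  # two nodes, run on node 0
# a real multi-node run also needs --master_addr and --master_port, and the same command on the other node with --node_rank=1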
If training starts successfully, one message is printed per GPU.
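IV. Saving and loading models
Methods a and b below are paired: a model saved with method a has to be loaded with method a.
1. Saving
a. Save the parameters only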
torch.save(model.module.state_dict(), path)
b. Save the parameters together with the network
torch.save(model.module,path)
2. Loading
a. Loading pretrained weights for multi-GPU training:
model = Yourmodel()
if opt.pretrained_weights:
    if opt.pretrained_weights.endswith(".pth"):
        model.load_state_dict(torch.load(opt.pretrained_weights))
    else:
        model.load_darknet_weights(opt.pretrained_weights)
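When loading on a single GPU, tell torch.load which device to load onto with map_location. The checkpoint remembers the GPU it was saved from (0, 1, ...), and if that device does not exist on the current machine you get: RuntimeError: Attempting to deserialize object on CUDA device 1 but torch.cuda.device_count() is 1. Please use torch.load with map_location to map your storages to an existing device. Change 'cuda:0' to whatever fits your setup: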
model = Yourmodel()
if opt.pretrained_weights:
    if opt.pretrained_weights.endswith(".pth"):
        model.load_state_dict(torch.load(opt.pretrained_weights, map_location="cuda:0"))
    else:
        model.load_darknet_weights(opt.pretrained_weights)
b. Loading on a single GPU:
Here too, specify the device to load the model onto.
model = torch.load(opt.weights_path, map_location="cuda:0")
Loading a pretrained model saved with method b for multi-GPU training is something I have not managed to get working yet.
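A common reason loading fails is that the checkpoint was saved from the wrapped model (model.state_dict() rather than model.module.state_dict()), in which case every key carries a "module." prefix. The helper below is a hypothetical sketch for that case, not part of PyTorch; the sample keys are stand-ins for a real state_dict.

```python
from collections import OrderedDict


def strip_module_prefix(state_dict, prefix="module."):
    """Return a copy of state_dict with the DataParallel/DDP key prefix removed."""
    cleaned = OrderedDict()
    for key, value in state_dict.items():
        new_key = key[len(prefix):] if key.startswith(prefix) else key
        cleaned[new_key] = value
    return cleaned


# Stand-in values (assumption); a real state_dict maps these keys to tensors.
wrapped_keys = OrderedDict([("module.conv1.weight", 1), ("module.conv1.bias", 2)])
print(list(strip_module_prefix(wrapped_keys)))
```

In a training script this would be used as model.load_state_dict(strip_module_prefix(torch.load(path, map_location="cpu"))) on the unwrapped model.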
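V. Notes
1. Add .module after model
After building the model, wrap it with one of the parallel classes and move the model and its parameters to the GPU. Note that to modify a submodule or read an attribute of the wrapped model you have to go through model.module, otherwise an error is raised. For example: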
model.img_size has to become model.module.img_size
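The following sketch shows the attribute lookup on a small hypothetical model (TinyModel and its img_size value are assumptions for illustration). Only attribute access is exercised, so no forward pass is involved.

```python
import torch.nn as nn


class TinyModel(nn.Module):  # hypothetical stand-in for the real network
    def __init__(self, img_size=416):
        super().__init__()
        self.img_size = img_size
        self.fc = nn.Linear(4, 2)

    def forward(self, x):
        return self.fc(x)


wrapped = nn.DataParallel(TinyModel())

print(hasattr(wrapped, "img_size"))  # the wrapper does not expose custom attributes
print(wrapped.module.img_size)       # the original model lives under .module
```

DistributedDataParallel keeps the original model under .module in the same way.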
2. .cuda() versus .to(device)
device is whatever you set yourself. If a .cuda() call fails, switch to the matching .to(device) call for each of the following:
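model (e.g. model.to(device))
input (e.g. input = input.to(device); older code wrapped it as Variable(input).to(device), which is unnecessary since PyTorch 0.4)
target (same as input, e.g. target = target.to(device))
nn.CrossEntropyLoss() (e.g. criterion = nn.CrossEntropyLoss().to(device))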
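Put together, the four items above look roughly like this. The sketch uses a toy linear model and random data (both assumptions for illustration) and falls back to the CPU when no GPU is available, so it runs anywhere.

```python
import torch
import torch.nn as nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = nn.Linear(8, 3).to(device)              # toy model (assumption)
criterion = nn.CrossEntropyLoss().to(device)

inputs = torch.randn(4, 8).to(device)           # plain tensors, no Variable wrapper
targets = torch.randint(0, 3, (4,)).to(device)  # class indices in [0, 3)

loss = criterion(model(inputs), targets)
print(loss.item())
```

Keeping a single device variable and passing it everywhere avoids mixing tensors from different GPUs, which is the usual cause of device-mismatch errors.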
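3. The args.local_rank argument
When training is started through torch.distributed.launch, the launcher passes a --local_rank argument to each process, so the training script has to parse it. You can also call torch.distributed.get_rank() after initializing the process group; note that it returns the global rank, which equals the local rank only on a single node.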
parser.add_argument("--local_rank", type=int, default=-1, help="local rank of this process, passed in by torch.distributed.launch")
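Depending on the PyTorch version and launcher, the local rank may arrive as --local_rank, as --local-rank, or only through the LOCAL_RANK environment variable (which is what torchrun sets). A defensive sketch that accepts all three is shown below; the function name parse_local_rank is an assumption for illustration.

```python
import argparse
import os


def parse_local_rank(argv=None):
    """Read the local rank from the command line, else from LOCAL_RANK, else -1."""
    parser = argparse.ArgumentParser()
    # Both spellings map to args.local_rank; the default comes from the environment.
    parser.add_argument("--local_rank", "--local-rank", type=int,
                        default=int(os.environ.get("LOCAL_RANK", -1)),
                        help="local rank of this process on its node")
    # parse_known_args ignores the script's other options in this sketch.
    args, _ = parser.parse_known_args(argv)
    return args.local_rank


print(parse_local_rank(["--local_rank=1"]))
```

In a real script the option would simply be added to the existing parser instead of living in a separate function.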
The above is based on personal experience; I hope it serves as a useful reference.