
Configuring data shuffling in DDP

With DDP, the DataLoader must be given a sampler argument: torch.utils.data.distributed.DistributedSampler(dataset, num_replicas=None, rank=None, shuffle=True, seed=0, drop_last=False). shuffle defaults to True, but look at how PyTorch's DistributedSampler implements it:

# Excerpt from torch.utils.data.distributed.DistributedSampler
def __iter__(self) -> Iterator[T_co]:
    if self.shuffle:
        # deterministically shuffle based on epoch and seed
        g = torch.Generator()
        g.manual_seed(self.seed + self.epoch)
        indices = torch.randperm(len(self.dataset), generator=g).tolist()  # type: ignore
    else:
        indices = list(range(len(self.dataset)))  # type: ignore

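The seed used to generate the random indices depends on the current epoch. If the epoch is never updated, every epoch reuses the same permutation, so you must set the epoch manually during training to get a real shuffle: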

for epoch in range(start_epoch, n_epochs):
    if is_distributed:
        sampler.set_epoch(epoch)
    train(loader)
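The effect of set_epoch can be checked without launching any distributed job. The sketch below is only an illustration: it assumes a stand-in dataset (a plain list) and passes num_replicas and rank explicitly so that no process group is required.

```python
from torch.utils.data.distributed import DistributedSampler

# Any object with __len__ works as a stand-in dataset for index sampling.
dataset = list(range(100))

# num_replicas and rank are given explicitly, so no process group is needed.
sampler = DistributedSampler(dataset, num_replicas=2, rank=0, shuffle=True, seed=0)

first_pass = list(sampler)    # epoch defaults to 0
second_pass = list(sampler)   # still epoch 0: the same order again

sampler.set_epoch(1)
third_pass = list(sampler)    # seed + epoch changed: a new permutation

print(first_pass == second_pass)
print(first_pass == third_pass)
```

Without the set_epoch call, the first two passes are identical, which is exactly the "shuffle that never reshuffles" described above.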

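Why results get worse when DDP increases the batch size

Large batch size:

Theoretical advantage:

The influence of noise in the data may shrink, which can make it easier to approach the optimum.

Drawbacks and problems:

It lowers the variance of the gradient. (In theory, for convex optimization, lower gradient variance gives better optimization. In practice, however, Keskar et al. showed that increasing the batch size leads to worse generalization.)

For non-convex optimization, the loss function has multiple local optima. With a small batch size, the noise may help the optimizer jump out of a local optimum, whereas with a large batch size it may get stuck in one.

Solutions:

Increase learning_rate. This can cause its own problem: using a large learning_rate from the very start of training may keep the model from converging (https://arxiv.org/abs/1609.04836).

Use warmup (https://arxiv.org/abs/1706.02677).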
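The warmup paper cited above (1706.02677) pairs warmup with a linear scaling rule: when the global batch size is multiplied by k, the learning rate is multiplied by k as well. The sketch below shows only that arithmetic. The function name and the numbers are illustrative assumptions, not taken from this post.

```python
def scaled_lr(base_lr: float, base_batch_size: int,
              per_gpu_batch_size: int, world_size: int) -> float:
    """Linear scaling rule: lr grows in proportion to the global batch size."""
    global_batch_size = per_gpu_batch_size * world_size
    return base_lr * global_batch_size / base_batch_size

# Hypothetical setup: lr=0.1 was tuned for batch size 256,
# and now 8 GPUs each process 128 samples per step.
lr = scaled_lr(0.1, 256, 128, 8)
print(lr)
```

With DDP the batch_size given to the DataLoader is per process, so the global batch size is that value times the number of processes.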

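Warmup

Using a large learning_rate at the very start of training can keep training from converging. The idea of warmup is to start with a small learning rate, increase it gradually as training proceeds until it reaches the base learning_rate, and then continue with some other decay schedule (for example CosineAnnealingLR).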

# copied from https://github.com/ildoonet/pytorch-gradual-warmup-lr/blob/master/warmup_scheduler/scheduler.py
from torch.optim.lr_scheduler import _LRScheduler
from torch.optim.lr_scheduler import ReduceLROnPlateau


class GradualWarmupScheduler(_LRScheduler):
    """ Gradually warm-up(increasing) learning rate in optimizer.
    Proposed in 'Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour'.
    Args:
        optimizer (Optimizer): Wrapped optimizer.
        multiplier: target learning rate = base lr * multiplier if multiplier > 1.0. if multiplier = 1.0, lr starts from 0 and ends up with the base_lr.
        total_epoch: target learning rate is reached at total_epoch, gradually
        after_scheduler: after target_epoch, use this scheduler(eg. ReduceLROnPlateau)
    """

    def __init__(self, optimizer, multiplier, total_epoch, after_scheduler=None):
        self.multiplier = multiplier
        if self.multiplier < 1.:
            raise ValueError('multiplier should be greater than or equal to 1.')
        self.total_epoch = total_epoch
        self.after_scheduler = after_scheduler
        self.finished = False
        super(GradualWarmupScheduler, self).__init__(optimizer)

    def get_lr(self):
        if self.last_epoch > self.total_epoch:
            # Warmup is over: hand control to after_scheduler if there is one
            if self.after_scheduler:
                if not self.finished:
                    self.after_scheduler.base_lrs = [base_lr * self.multiplier for base_lr in self.base_lrs]
                    self.finished = True
                return self.after_scheduler.get_last_lr()
            return [base_lr * self.multiplier for base_lr in self.base_lrs]
        if self.multiplier == 1.0:
            # Ramp linearly from 0 up to base_lr
            return [base_lr * (float(self.last_epoch) / self.total_epoch) for base_lr in self.base_lrs]
        else:
            # Ramp linearly from base_lr up to base_lr * multiplier
            return [base_lr * ((self.multiplier - 1.) * self.last_epoch / self.total_epoch + 1.) for base_lr in self.base_lrs]

    def step_ReduceLROnPlateau(self, metrics, epoch=None):
        if epoch is None:
            epoch = self.last_epoch + 1
        self.last_epoch = epoch if epoch != 0 else 1  # ReduceLROnPlateau is called at the end of epoch, whereas others are called at beginning
        if self.last_epoch <= self.total_epoch:
            warmup_lr = [base_lr * ((self.multiplier - 1.) * self.last_epoch / self.total_epoch + 1.) for base_lr in self.base_lrs]
            for param_group, lr in zip(self.optimizer.param_groups, warmup_lr):
                param_group['lr'] = lr
        else:
            if epoch is None:
                self.after_scheduler.step(metrics, None)
            else:
                self.after_scheduler.step(metrics, epoch - self.total_epoch)

    def step(self, epoch=None, metrics=None):
        if type(self.after_scheduler) != ReduceLROnPlateau:
            if self.finished and self.after_scheduler:
                if epoch is None:
                    self.after_scheduler.step(None)
                else:
                    self.after_scheduler.step(epoch - self.total_epoch)
                self._last_lr = self.after_scheduler.get_last_lr()
            else:
                return super(GradualWarmupScheduler, self).step(epoch)
        else:
            self.step_ReduceLROnPlateau(metrics, epoch)
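To see the shape of a warmup-then-decay schedule without any optimizer, the same idea can be written as a plain function of the epoch. This is a minimal sketch, not the scheduler class above: the function name, the cosine decay to zero, and the sample values are assumptions chosen for illustration.

```python
import math

def warmup_cosine_lr(epoch: int, base_lr: float,
                     warmup_epochs: int, total_epochs: int) -> float:
    """Learning rate for a given epoch: linear warmup, then cosine decay."""
    if epoch < warmup_epochs:
        # Linear ramp from 0 up to base_lr, like the multiplier == 1.0 case.
        return base_lr * epoch / warmup_epochs
    # Cosine decay from base_lr towards 0 over the remaining epochs.
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

schedule = [warmup_cosine_lr(e, base_lr=0.4, warmup_epochs=5, total_epochs=30)
            for e in range(30)]
for epoch in (0, 1, 5, 29):
    print(epoch, round(schedule[epoch], 4))
```

The learning rate rises during the first five epochs, peaks at base_lr when warmup ends, and falls afterwards.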

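Pitfalls of distributed multi-GPU training with DistributedDataParallel

I spent some time over the past few days looking into multi-GPU training. I expected it to be easy, but there turned out to be many pitfalls, which I worked through one at a time. Distributed training generally comes in two flavors: a single machine with multiple GPUs, and multiple machines with multiple GPUs.

There are two main ways to implement it:

1. DataParallel: Parameter Server mode, with one GPU acting as the reducer. It is also extremely simple to use: one line of code.

DataParallel is based on the Parameter Server algorithm, and its load imbalance is fairly severe. With larger models (bert-large, for example), the reducer GPU can use 3-4 GB more memory than the others.

2. DistributedDataParallel: the newer DDP, which is what the official documentation recommends. It uses the all-reduce algorithm. It was designed mainly for multi-machine, multi-GPU training, but it also works on a single machine.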
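The arithmetic behind all-reduce gradient averaging can be simulated in a single process. The toy below is only a sketch under stated assumptions (a one-parameter model y = w * x, two simulated ranks with equal-sized shards, helper names invented here). It does not use torch.distributed at all.

```python
def local_gradient(w, shard):
    """Gradient of the mean squared error for y = w * x on one shard."""
    return sum(2.0 * (w * x - y) * x for x, y in shard) / len(shard)

def all_reduce_mean(values):
    """Simulated all-reduce: every rank ends up holding the same average."""
    mean = sum(values) / len(values)
    return [mean for _ in values]

data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
shards = [data[0::2], data[1::2]]      # two simulated ranks, equal shard sizes
w = 0.5

per_rank = [local_gradient(w, shard) for shard in shards]
synced = all_reduce_mean(per_rank)     # the gradient every rank applies
full_batch = local_gradient(w, data)   # single-process reference

print(per_rank, synced, full_batch)
```

Because every rank ends up with the same averaged gradient, no single GPU has to play the reducer role, which is where the DataParallel imbalance described above comes from.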

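Why use distributed training?

Multiple GPUs can be used, so training runs faster overall.

A larger batch size becomes possible.

In some cases distributed training gives better results.

The rest of this post covers:

Single machine, multiple GPUs with DataParallel (the most common and the simplest)

Single machine, multiple GPUs with DistributedDataParallel (more advanced), and multiple machines, multiple GPUs with DistributedDataParallel (the most advanced)

How to launch training

Saving and loading models

Caveats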

I. Single machine, multiple GPUs (DataParallel)

import torch
from torch.nn import DataParallel

device = torch.device("cuda")
# or: device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

model = MyModel()  # your own model class
model = model.to(device)
model = DataParallel(model)
# or: model = DataParallel(model, device_ids=[0, 1, 2, 3])

This is simple: only one extra line is needed, model = DataParallel(model).

II. Multiple machines or a single machine, multiple GPUs (DistributedDataParallel)

It is best to read the caveats section before changing your code, to avoid puzzling bugs. Modify the training code as follows.

opt.local_rank has to be parsed as a command-line argument earlier in the script; see the caveats section below.

import torch
import torch.distributed as dist
from torch.utils.data.distributed import DistributedSampler

# Initialize the process group
dist_backend = 'nccl'
print('args.local_rank: ', opt.local_rank)
torch.cuda.set_device(opt.local_rank)
dist.init_process_group(backend=dist_backend)

device = torch.device('cuda', opt.local_rank)
model = yourModel()  # your own model
model.to(device)  # move to this process's GPU before wrapping
if torch.cuda.device_count() > 1:
    print("Let's use", torch.cuda.device_count(), "GPUs!")
    # Wrap the model; each process drives the one GPU given by local_rank
    model = torch.nn.parallel.DistributedDataParallel(
        model, device_ids=[opt.local_rank], output_device=opt.local_rank)

dataset = ListDataset(train_path, augment=True, multiscale=opt.multiscale_training, img_size=opt.img_size, normalized_labels=True)  # your own data-loading code
# rank must be the global rank; local_rank only equals it on a single node
datasampler = DistributedSampler(dataset, num_replicas=dist.get_world_size(), rank=dist.get_rank())

dataloader = torch.utils.data.DataLoader(
    dataset,
    batch_size=opt.batch_size,
    shuffle=False,  # shuffling is handled by the sampler
    num_workers=opt.n_cpu,
    pin_memory=True,
    collate_fn=dataset.collate_fn,
    sampler=datasampler,
)  # just add the sampler argument to the original DataLoader call

.....

During training, move the data to the GPU:
imgs = imgs.to(device)
targets = targets.to(device)
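What the sampler argument buys you is that each process sees its own slice of the dataset. The sketch below illustrates this in one process, assuming a stand-in dataset of 10 items and a hypothetical world size of 4. num_replicas and rank are passed explicitly so no process group is needed.

```python
from torch.utils.data.distributed import DistributedSampler

dataset = list(range(10))   # stand-in dataset; only its length matters here
world_size = 4              # hypothetical number of processes

shards = []
for rank in range(world_size):
    # Explicit num_replicas/rank, so no process group has to be initialized.
    sampler = DistributedSampler(dataset, num_replicas=world_size,
                                 rank=rank, shuffle=False)
    shards.append(list(sampler))
    print(rank, shards[-1])

covered = set(i for shard in shards for i in shard)
print(covered == set(range(len(dataset))))
```

With the default drop_last=False, the sampler pads by repeating a few indices so that every rank receives the same number of samples. That is why the shards together can contain slightly more items than the dataset.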

III. How to launch training

1. DataParallel

Just run training as usual:

python3 train.py

2. DistributedDataParallel

Training must be started through torch.distributed.launch. The usual case is a single node:

CUDA_VISIBLE_DEVICES=0,1 python3 -m torch.distributed.launch --nproc_per_node=2 train.py

Here CUDA_VISIBLE_DEVICES selects which GPUs to use by index, and --nproc_per_node is the number of processes (one per GPU) on each node. Normally you use as many as you have GPUs.

Multiple nodes:

python3 -m torch.distributed.launch --nproc_per_node=NUM_GPUS_YOU_HAVE --nnodes=2 --node_rank=0
# two nodes; this is the command for node 0 (append --master_addr, --master_port and your training script)

If training starts successfully, one message is printed per GPU in use. (Figure in the original post: the per-GPU startup messages.)

IV. Saving and loading models

Options a and b below correspond across saving and loading: if you save with a, load with a.

1. Saving

a. Save only the parameters

torch.save(model.module.state_dict(), path)

b. Save the parameters together with the network

torch.save(model.module, path)

2. Loading

a. Loading pretrained weights for multi-GPU training:

model = Yourmodel()
if opt.pretrained_weights:
    if opt.pretrained_weights.endswith(".pth"):
        model.load_state_dict(torch.load(opt.pretrained_weights))
    else:
        model.load_darknet_weights(opt.pretrained_weights)

Loading on a single GPU: you have to pass map_location so that the weights are loaded onto a device that exists on the current machine. A checkpoint saved from GPU 1, for example, otherwise fails on a single-GPU machine with "RuntimeError: Attempting to deserialize object on CUDA device 1 but torch.cuda.device_count() is 1. Please use torch.load with map_location to map your storages to an existing device". Change 'cuda:0' to suit your own setup:

model = Yourmodel()
if opt.pretrained_weights:
    if opt.pretrained_weights.endswith(".pth"):
        model.load_state_dict(torch.load(opt.pretrained_weights, map_location="cuda:0"))
    else:
        model.load_darknet_weights(opt.pretrained_weights)

b. Loading on a single GPU:

Here too, specify the device to load the model onto.

model = torch.load(opt.weights_path, map_location="cuda:0")

I have not yet gotten method b to work for loading a pretrained model on multiple GPUs.
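Method a can be tried end to end on a machine without any GPU. The sketch below is a stand-in, not the post's training code: it assumes a tiny nn.Linear model and an in-memory buffer instead of a checkpoint file, and it maps to "cpu" rather than "cuda:0" so that it runs anywhere.

```python
import io

import torch
import torch.nn as nn

model = nn.Linear(4, 2)

# Stand-in for a file on disk; torch.save and torch.load accept file-like objects.
buffer = io.BytesIO()
torch.save(model.state_dict(), buffer)   # with DDP: model.module.state_dict()
buffer.seek(0)

# map_location remaps every stored tensor onto a device that exists here.
state_dict = torch.load(buffer, map_location="cpu")

restored = nn.Linear(4, 2)
restored.load_state_dict(state_dict)
print(torch.equal(model.weight, restored.weight))
```

Saving only the state_dict keeps the checkpoint independent of how the model was wrapped or which GPU it lived on during training.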

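V. Caveats

1. Add .module after model

Once you have the network, you wrap it with the parallel wrapper and move the network and its parameters to the GPU. Note that if you need to modify a submodule of the network or read one of the model's attributes, you must go through .module on the wrapped model, otherwise an error is raised. For example:

model.img_size    must become    model.module.img_size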
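One way to avoid sprinkling .module through the code is a small helper that returns the underlying network whether or not it is wrapped. This is a sketch: unwrap_model and TinyNet are hypothetical names introduced here, and DataParallel is used for the demonstration only because it can be constructed without a process group.

```python
import torch.nn as nn
from torch.nn.parallel import DataParallel, DistributedDataParallel

def unwrap_model(model: nn.Module) -> nn.Module:
    """Return the underlying network whether or not it is wrapped."""
    if isinstance(model, (DataParallel, DistributedDataParallel)):
        return model.module
    return model

class TinyNet(nn.Module):   # hypothetical model with a custom attribute
    def __init__(self):
        super().__init__()
        self.img_size = 416
        self.fc = nn.Linear(4, 2)

plain = TinyNet()
wrapped = DataParallel(plain)   # DistributedDataParallel exposes .module the same way

print(unwrap_model(plain).img_size, unwrap_model(wrapped).img_size)
```

The same helper is convenient when saving: torch.save(unwrap_model(model).state_dict(), path) works for wrapped and unwrapped models alike.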

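2. Problems with .cuda or .to(device)

device is whatever you set up yourself. If a .cuda call fails, replace it with a move to the corresponding device for each of the following:

model (e.g. model.to(device))

input (e.g. input = input.to(device); wrapping in Variable is only needed on PyTorch versions before 0.4)

target (same as input, e.g. target = target.to(device))

nn.CrossEntropyLoss() (e.g. criterion = nn.CrossEntropyLoss().to(device))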
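Put together, the four moves listed above look like the following. This is a minimal sketch with a made-up linear model and random data, falling back to the CPU when no GPU is available so that it runs anywhere.

```python
import torch
import torch.nn as nn

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

model = nn.Linear(8, 3).to(device)
criterion = nn.CrossEntropyLoss().to(device)

inputs = torch.randn(5, 8).to(device)            # batch of 5 samples
targets = torch.randint(0, 3, (5,)).to(device)   # class indices in [0, 3)

# Everything lives on the same device, so the forward pass and loss just work.
loss = criterion(model(inputs), targets)
print(loss.device)
```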

3. The args.local_rank argument

When training is started through torch.distributed.launch, the launcher passes a --local_rank argument to each process, so the training script has to parse it. The global rank of the process can also be obtained with torch.distributed.get_rank().

parser.add_argument("--local_rank", type=int, default=-1, help="local rank of this process on its node, set by torch.distributed.launch")
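Newer PyTorch releases also ship torchrun, which exports the local rank through the LOCAL_RANK environment variable instead of a command-line flag. A script can support both. The helper below is a sketch: get_local_rank is a name invented here, and accepting the hyphenated --local-rank spelling is a defensive assumption for PyTorch versions that pass the flag that way.

```python
import argparse
import os

def get_local_rank(argv=None) -> int:
    """Read the local rank from --local_rank, falling back to LOCAL_RANK."""
    parser = argparse.ArgumentParser()
    # Accept both spellings; argparse stores either one as args.local_rank.
    parser.add_argument("--local_rank", "--local-rank", type=int, default=-1)
    args, _unknown = parser.parse_known_args(argv)
    if args.local_rank >= 0:
        return args.local_rank
    # torchrun exports LOCAL_RANK instead of passing a command-line flag.
    return int(os.environ.get("LOCAL_RANK", -1))

print(get_local_rank(["--local_rank=1"]))
```

A return value of -1 means the script was started without a launcher, which can be used to fall back to ordinary single-process training.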

The above is based on my own experience; I hope it is a useful reference.
