
How to Do Parallel Computing with PyTorch on CentOS

Parallel computing with PyTorch on CentOS typically involves one of the following approaches:

1. Data Parallelism

Data parallelism is the most common form of parallel computing: the data is split into mini-batches, which are then processed in parallel across multiple GPUs.

Steps:

  1. Install PyTorch: make sure you have a GPU-enabled (CUDA) build of PyTorch installed (a quick way to verify this is shown at the end of this section).

    pip install torch torchvision torchaudio
    
  2. Write the code:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F
    import torch.optim as optim
    from torch.utils.data import DataLoader
    from torchvision import datasets, transforms
    
    # Define the model
    class Net(nn.Module):
        def __init__(self):
            super(Net, self).__init__()
            self.conv1 = nn.Conv2d(1, 10, kernel_size=5)
            self.conv2 = nn.Conv2d(10, 20, kernel_size=5)
            self.conv2_drop = nn.Dropout2d()
            self.fc1 = nn.Linear(320, 50)
            self.fc2 = nn.Linear(50, 10)
    
        def forward(self, x):
            x = torch.relu(torch.max_pool2d(self.conv1(x), 2))
            x = torch.relu(torch.max_pool2d(self.conv2_drop(self.conv2(x)), 2))
            x = x.view(-1, 320)
            x = torch.relu(self.fc1(x))
            x = F.dropout(x, training=self.training)
            x = self.fc2(x)
            return torch.log_softmax(x, dim=1)
    
    # Initialize the model and move it to the GPU(s)
    model = Net()
    if torch.cuda.device_count() > 1:
        print(f"Let's use {torch.cuda.device_count()} GPUs!")
        model = nn.DataParallel(model)

    model.cuda()
    
    # Load the data
    transform = transforms.Compose([transforms.ToTensor(), transforms.Normalize((0.1307,), (0.3081,))])
    trainset = datasets.MNIST('~/.pytorch/MNIST_data/', download=True, train=True, transform=transform)
    trainloader = DataLoader(trainset, batch_size=64, shuffle=True)
    
    # Define the loss function and optimizer
    # (Net outputs log-probabilities via log_softmax, so use NLLLoss rather than CrossEntropyLoss)
    criterion = nn.NLLLoss()
    optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.5)
    
    # Train the model
    for epoch in range(10):
        running_loss = 0
        for i, data in enumerate(trainloader, 0):
            inputs, labels = data
            inputs, labels = inputs.cuda(), labels.cuda()
    
            optimizer.zero_grad()
    
            outputs = model(inputs)
            loss = criterion(outputs, labels)
            loss.backward()
            optimizer.step()
    
            running_loss += loss.item()
            if i % 100 == 99:    # print every 100 mini-batches
                print(f'[{epoch + 1}, {i + 1:5d}] loss: {running_loss / 100:.3f}')
                running_loss = 0.0
    
    print('Finished Training')
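
Before running the example above, you can quickly confirm that the CUDA build of PyTorch is installed and that it can see your GPUs. A minimal check (no assumptions beyond a standard PyTorch install):

    import torch

    print(torch.__version__)                  # installed PyTorch version
    print(torch.cuda.is_available())          # True if a usable GPU is visible
    print(torch.cuda.device_count())          # number of visible GPUs
    if torch.cuda.is_available():
        print(torch.cuda.get_device_name(0))  # name of the first GPU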
    

2. Model Parallelism

Model parallelism is used when a model is too large to fit into the memory of a single GPU. Different parts of the model are placed on different GPUs.

Steps:

  1. Write the code:
    import torch
    import torch.nn as nn
    import torch.nn.functional as F
    
    class ModelParallelNet(nn.Module):
        def __init__(self):
            super(ModelParallelNet, self).__init__()
            # Place block1 and fc1 on GPU 0, block2 and fc2 on GPU 1
            self.block1 = nn.Sequential(
                nn.Conv2d(1, 10, kernel_size=5).cuda(0),
                nn.ReLU().cuda(0),
                nn.MaxPool2d(kernel_size=2, stride=2).cuda(0)
            )
            self.block2 = nn.Sequential(
                nn.Conv2d(10, 20, kernel_size=5).cuda(1),
                nn.ReLU().cuda(1),
                nn.MaxPool2d(kernel_size=2, stride=2).cuda(1)
            )
            self.fc1 = nn.Linear(320, 50).cuda(0)
            self.fc2 = nn.Linear(50, 10).cuda(1)
    
        def forward(self, x):
            x = x.cuda(0)                        # input starts on GPU 0
            x = self.block1(x)
            x = x.cuda(1)                        # move activations to GPU 1 for block2
            x = self.block2(x)
            x = torch.flatten(x, 1).cuda(0)      # back to GPU 0 for fc1
            x = self.fc1(x)
            x = torch.relu(x)
            x = F.dropout(x, training=self.training)
            x = x.cuda(1)                        # and over to GPU 1 for fc2
            x = self.fc2(x)
            return torch.log_softmax(x, dim=1)
    
    model = ModelParallelNet()  # requires at least two visible GPUs
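
A quick way to exercise this model is to push a dummy MNIST-sized batch through it. This is a minimal sketch assuming at least two GPUs are visible; note that the output lives on cuda:1, so any labels used for a loss must be moved there as well:

    # Hypothetical smoke test for the model-parallel network above
    x = torch.randn(64, 1, 28, 28)    # dummy batch; forward() moves it to GPU 0 itself
    out = model(x)
    print(out.shape, out.device)      # torch.Size([64, 10]) cuda:1
    # when computing a loss, move the targets to cuda:1 too, e.g. criterion(out, labels.cuda(1))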
    

3. Distributed Parallelism

Distributed parallelism targets multiple GPUs across multiple machines. It is implemented with the torch.distributed package.

Steps:

  1. Set the environment variables:

    export MASTER_ADDR='localhost'
    export MASTER_PORT='12345'
    
  2. Write the code:

    import torch
    import torch.distributed
    import torch.multiprocessing as mp
    import torch.nn as nn
    import torch.optim as optim
    from torch.nn.parallel import DistributedDataParallel as DDP
    from torch.utils.data import DataLoader, DistributedSampler
    from torchvision import datasets, transforms
    
    def main(rank, world_size):
        torch.manual_seed(1234)
        torch.cuda.set_device(rank)
    
        # Initialize the process group; pass rank and world_size explicitly since
        # only MASTER_ADDR and MASTER_PORT are provided via the environment
        torch.distributed.init_process_group(backend='nccl', init_method='env://',
                                             rank=rank, world_size=world_size)
    
        # Define the model (the Net class from the data-parallelism example) and move it to this rank's GPU
        model = Net().to(rank)
        ddp_model = DDP(model, device_ids=[rank])
    
        # Load the data
        transform = transforms.Compose([transforms.ToTensor(), transforms.Normalize((0.1307,), (0.3081,))])
        dataset = datasets.MNIST('~/.pytorch/MNIST_data/', download=True, train=True, transform=transform)
        sampler = DistributedSampler(dataset, num_replicas=world_size, rank=rank)
        trainloader = DataLoader(dataset, batch_size=64, sampler=sampler)
    
        # Define the loss function and optimizer (NLLLoss because Net outputs log-probabilities)
        criterion = nn.NLLLoss().to(rank)
        optimizer = optim.SGD(ddp_model.parameters(), lr=0.01)
    
        # Train the model
        for epoch in range(10):
            sampler.set_epoch(epoch)
            running_loss = 0
            for i, data in enumerate(trainloader, 0):
                inputs, labels = data
                inputs, labels = inputs.to(rank), labels.to(rank)
    
                optimizer.zero_grad()
    
                outputs = ddp_model(inputs)
                loss = criterion(outputs, labels)
                loss.backward()
                optimizer.step()
    
                running_loss += loss.item()
                if i % 100 == 99:    # print every 100 mini-batches
                    print(f'[{epoch + 1}, {i + 1:5d}] loss: {running_loss / 100:.3f}')
                    running_loss = 0.0

        print('Finished Training')

        # Clean up the process group
        torch.distributed.destroy_process_group()
    
    if __name__ == '__main__':
        world_size = torch.cuda.device_count()
        mp.spawn(main, args=(world_size,), nprocs=world_size, join=True)
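
To launch, set the environment variables from step 1 and run the script once; mp.spawn then starts one process per visible GPU. A minimal single-node launch sketch (the filename train_ddp.py is hypothetical; multi-node runs would additionally need a reachable MASTER_ADDR and globally consistent rank/world_size values):

    export MASTER_ADDR='localhost'
    export MASTER_PORT='12345'
    python train_ddp.py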
    

Summary

  • Data parallelism: multiple GPUs on a single machine.
  • Model parallelism: models too large to fit into a single GPU's memory.
  • Distributed parallelism: multiple GPUs across multiple machines.

Choose the parallel computing method that best fits your specific needs and hardware configuration.
