distributed-training

We would like to forward a particular 'key' column, which is part of the features, so that it appears alongside the predictions. This lets us identify which set of features a particular prediction belongs to. Here is an example of the predictions output when using tensorflow.contrib.estimator.multi_class_head:

{"classes": ["0", "1", "2", "3", "4", "5", "6", "7", "8", "9"],
 "scores": [0.068196

I have the same hardware environment and the same network, but I could not reproduce your results; mine are almost half of yours. Are there any best practices or experience you can share? Thanks very much! For BytePS with 1 instance and 8 GPUs, I get similar test results.

Simple mistakes trigger unclear error messages in the ALBERT example, namely:

Absence of the unpacked data for trainer (currently triggers requests.exceptions.HTTPError: 404 Client Error: Not Found for url: https://huggingface.co/api/models/data/tokenizer)
Running all peers in --client_mode (currently triggers AllReduce failed: could not find a group)
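
The two mistakes above could be caught before training starts. The following is a minimal sketch of such pre-flight checks, not the project's actual code: the function names `check_tokenizer_path` and `check_client_modes`, their arguments, and the wording of the messages are all hypothetical, chosen only to illustrate what a clearer error might look like.

```python
from pathlib import Path


def check_tokenizer_path(path):
    """Hypothetical helper: fail early if the unpacked data is missing."""
    p = Path(path)
    if not p.is_dir():
        raise FileNotFoundError(
            f"Tokenizer directory '{p}' not found. "
            "Unpack the training data before starting the trainer."
        )
    return p


def check_client_modes(client_modes):
    """Hypothetical helper: reject a run where every peer is in client mode.

    `client_modes` is assumed to be a list of booleans, one per peer.
    """
    if client_modes and all(client_modes):
        raise ValueError(
            "All peers are running with --client_mode; "
            "start at least one peer without it."
        )
```

A trainer script could call these checks before loading the tokenizer or joining the swarm, so the user sees a direct message instead of an HTTP 404 or a failed AllReduce.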

It would be great to

Does HyperGBM's make_experiment return the best model?
How does its parameter tuning work? That is, what is its search space (e.g. for XGBoost)?

torchtext (as of 0.4.0) adopts torch.utils.data.DataLoader, and the older iterator interface is deprecated. Ensure AdaptDL's AdaptiveDataLoader supports this new torchtext interface for data loading, and port the example transformer code to the new interface. Then, adaptdl.data.iterator can be deprecated/removed.

Here are 67 public repositories matching this topic...

PaddlePaddle / Paddle

rwightman / pytorch-image-models

tensorflow / adanet

bytedance / byteps

determined-ai / determined

tensorlayer / hyperpose

learning-at-home / hivemind

DataCanvasIO / HyperGBM

hpcaitech / ColossalAI

awslabs / deeplearning-cfn

petuum / adaptdl

lsds / KungFu

dougsouza / pytorch-sync-batchnorm-example

DeNA / HandyRL

maudzung / YOLO3D-YOLOv4-PyTorch

wenwei202 / terngrad

synxlin / deep-gradient-compression

pytorch / torchx

INET-RC / GeoMX

ZJU-OpenKS / OpenKS

richardkxu / distributed-pytorch

PKU-DAIR / Hetu

bindog / pytorch-model-parallel

bryanyzhu / Video-Tutorial-CVPR2020

bytedance / ps-lite

awslabs / dynamic-training-with-apache-mxnet-on-aws

pinpoint-apm / pinpoint-node-agent

aws-samples / TensorFlow-in-SageMaker-workshop

aws-samples / amazon-sagemaker-protein-classification

Azure / DistributedDeepLearning
