我是分布式TensorFlow的新手。我在这里找到了这个分布式MNIST测试: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/tools/dist_test/python/mnist_replica.py
但我不知道如何运行它。我使用了以下脚本:
python distributed_mnist.py --num_workers=3 --num_parameter_servers=1 --worker_index=0 --worker_grpc_url="grpc://tf-worker0:2222"\
& python distributed_mnist.py --num_workers=3 --num_parameter_servers=1 --worker_index=1 --worker_grpc_url="grpc://tf-worker1:2222"\
& python distributed_mnist.py --num_workers=3 --num_parameter_servers=1 --worker_index=2 --worker_grpc_url="grpc://tf-worker2:2222"
我刚刚发现这些参数缺失了,所以我将它们传递给程序。这是发生的情况:
I tensorflow/stream_executor/dso_loader.cc:105] successfully opened CUDA library libcublas.so locally
I tensorflow/stream_executor/dso_loader.cc:105] successfully opened CUDA library libcublas.so locally
I tensorflow/stream_executor/dso_loader.cc:105] successfully opened CUDA library libcudnn.so locally
I tensorflow/stream_executor/dso_loader.cc:105] successfully opened CUDA library libcudnn.so locally
I tensorflow/stream_executor/dso_loader.cc:105] successfully opened CUDA library libcufft.so locally
I tensorflow/stream_executor/dso_loader.cc:105] successfully opened CUDA library libcufft.so locally
I tensorflow/stream_executor/dso_loader.cc:105] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:105] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:105] successfully opened CUDA library libcurand.so locally
I tensorflow/stream_executor/dso_loader.cc:105] successfully opened CUDA library libcurand.so locally
I tensorflow/stream_executor/dso_loader.cc:105] successfully opened CUDA library libcublas.so locally
I tensorflow/stream_executor/dso_loader.cc:105] successfully opened CUDA library libcudnn.so locally
I tensorflow/stream_executor/dso_loader.cc:105] successfully opened CUDA library libcufft.so locally
I tensorflow/stream_executor/dso_loader.cc:105] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:105] successfully opened CUDA library libcurand.so locally
Extracting /tmp/mnist-data/train-images-idx3-ubyte.gz
Extracting /tmp/mnist-data/train-images-idx3-ubyte.gz
Extracting /tmp/mnist-data/train-images-idx3-ubyte.gz
Extracting /tmp/mnist-data/train-labels-idx1-ubyte.gz
Extracting /tmp/mnist-data/t10k-images-idx3-ubyte.gz
Extracting /tmp/mnist-data/train-labels-idx1-ubyte.gz
Extracting /tmp/mnist-data/train-labels-idx1-ubyte.gz
Extracting /tmp/mnist-data/t10k-images-idx3-ubyte.gz
Extracting /tmp/mnist-data/t10k-images-idx3-ubyte.gz
Extracting /tmp/mnist-data/t10k-labels-idx1-ubyte.gz
Extracting /tmp/mnist-data/t10k-labels-idx1-ubyte.gz
Extracting /tmp/mnist-data/t10k-labels-idx1-ubyte.gz
Worker GRPC URL: grpc://tf-worker0:2222
Worker index = 0
Number of workers = 3
Worker GRPC URL: grpc://tf-worker2:2222
Worker index = 2
Number of workers = 3
Worker GRPC URL: grpc://tf-worker1:2222
Worker index = 1
Number of workers = 3
Worker 0: Initializing session...
Worker 2: Waiting for session to be initialized...
Worker 1: Waiting for session to be initialized...
E0608 20:37:13.514249023 7501 resolve_address_posix.c:126] getaddrinfo: Name or service not known
D0608 20:37:13.514287961 7501 dns_resolver.c:189] dns resolution failed: retrying in 15 seconds
E0608 20:37:13.548052986 7502 resolve_address_posix.c:126] getaddrinfo: Name or service not known
D0608 20:37:13.548091527 7502 dns_resolver.c:189] dns resolution failed: retrying in 15 seconds
E0608 20:37:13.555449386 7503 resolve_address_posix.c:126] getaddrinfo: Name or service not known
D0608 20:37:13.555473898 7503 dns_resolver.c:189] dns resolution failed: retrying in 15 seconds
^CE0608 20:37:28.517451603 7504 resolve_address_posix.c:126] getaddrinfo: Name or service not known
D0608 20:37:28.517491102 7504 dns_resolver.c:189] dns resolution failed: retrying in 15 seconds
E0608 20:37:28.551002331 7505 resolve_address_posix.c:126] getaddrinfo: Name or service not known
D0608 20:37:28.551029795 7505 dns_resolver.c:189] dns resolution failed: retrying in 15 seconds
E0608 20:37:28.556681378 7506 resolve_address_posix.c:126] getaddrinfo: Name or service not known
D0608 20:37:28.556709728 7506 dns_resolver.c:189] dns resolution failed: retrying in 15 seconds
有人知道如何正确运行它吗?非常感谢!
tf.summary.merge_all()
。 - mrry