TensorFlow 2.0 MultiWorkerMirroredStrategy example hangs

Question

I followed the multi-worker Keras example from the official TensorFlow website:
https://www.tensorflow.org/tutorials/distribute/multi_worker_with_keras

Here is my setup:
WSL
Ubuntu 16.04.6 LTS
TensorFlow 2.0
No GPU available

I have a file called 'tfexample.py' that looks like this:

from __future__ import absolute_import, division, print_function, unicode_literals
import tensorflow_datasets as tfds
import tensorflow as tf
import json, os

tfds.disable_progress_bar()

os.environ["TF_CONFIG"] = json.dumps(
    {
        "cluster": {"worker": ["localhost:12345", "localhost:23456"]},
        "task": {"type": "worker", "index": 0},
    }
)
strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy()
BUFFER_SIZE = 10000
BATCH_SIZE = 64


def make_datasets_unbatched():
    # Scale MNIST pixel values from [0, 255] to [0., 1.]
    def scale(image, label):
        image = tf.cast(image, tf.float32)
        image /= 255
        return image, label

    datasets, info = tfds.load(name="mnist", with_info=True, as_supervised=True)

    return datasets["train"].map(scale).cache().shuffle(BUFFER_SIZE)


train_datasets = make_datasets_unbatched().batch(BATCH_SIZE)


def build_and_compile_cnn_model():
    model = tf.keras.Sequential(
        [
            tf.keras.layers.Conv2D(32, 3, activation="relu", input_shape=(28, 28, 1)),
            tf.keras.layers.MaxPooling2D(),
            tf.keras.layers.Flatten(),
            tf.keras.layers.Dense(64, activation="relu"),
            tf.keras.layers.Dense(10, activation="softmax"),
        ]
    )
    model.compile(
        loss=tf.keras.losses.sparse_categorical_crossentropy,
        optimizer=tf.keras.optimizers.SGD(learning_rate=0.001),
        metrics=["accuracy"],
    )
    return model


# single_worker_model = build_and_compile_cnn_model()
# single_worker_model.fit(x=train_datasets, epochs=3, steps_per_epoch=5)


NUM_WORKERS = 2
# Here the batch size scales up by number of workers since
# `tf.data.Dataset.batch` expects the global batch size. Previously we used 64,
# and now this becomes 128.
GLOBAL_BATCH_SIZE = 64 * NUM_WORKERS
with strategy.scope():
    # Creation of dataset, and model building/compiling need to be within
    # `strategy.scope()`.
    train_datasets = make_datasets_unbatched().batch(GLOBAL_BATCH_SIZE)
    multi_worker_model = build_and_compile_cnn_model()

# Keras' `model.fit()` trains the model with specified number of epochs and
# number of steps per epoch. Note that the numbers here are for demonstration
# purposes only and may not sufficiently produce a model with good quality.
multi_worker_model.fit(x=train_datasets, epochs=3, steps_per_epoch=5)

When I run this file with

python tfexample.py

the terminal just hangs, as shown below:

2020-02-04 17:50:23.483411: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libnvinfer.so.6'; dlerror: libnvinfer.so.6: cannot open shared object file: No such file or directory
2020-02-04 17:50:23.485194: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libnvinfer_plugin.so.6'; dlerror: libnvinfer_plugin.so.6: cannot open shared object file: No such file or directory
2020-02-04 17:50:23.485747: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:30] Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
/home/danny/.local/lib/python2.7/site-packages/requests/__init__.py:83: RequestsDependencyWarning: Old version of cryptography ([1, 2, 3]) may cause slowdown.
  warnings.warn(warning, RequestsDependencyWarning)
2020-02-04 17:50:29.013263: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory
2020-02-04 17:50:29.014152: E tensorflow/stream_executor/cuda/cuda_driver.cc:351] failed call to cuInit: UNKNOWN ERROR (303)
2020-02-04 17:50:29.014781: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (WINDOWS-6DFFM0Q): /proc/driver/nvidia/version does not exist
2020-02-04 17:50:29.015780: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2020-02-04 17:50:29.025575: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2701000000 Hz
2020-02-04 17:50:29.027050: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x66b11a0 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-02-04 17:50:29.027669: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
E0204 17:50:29.038614800   24084 socket_utils_common_posix.cc:198] check for SO_REUSEPORT: {"created":"@1580856629.038575000","description":"Protocol not available","errno":92,"file":"external/grpc/src/core/lib/iomgr/socket_utils_common_posix.cc","file_line":175,"os_error":"Protocol not available","syscall":"getsockopt(SO_REUSEPORT)"}
E0204 17:50:29.039313500   24084 socket_utils_common_posix.cc:299] setsockopt(TCP_USER_TIMEOUT) Protocol not available
2020-02-04 17:50:29.051180: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:300] Initialize GrpcChannelCache for job worker -> {0 -> localhost:12345, 1 -> localhost:23456}
2020-02-04 17:50:29.053392: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:390] Started server with target: grpc://localhost:12345

Any help will be appreciated!

amlarraz · Accepted Answer

This happens because MultiWorkerMirroredStrategy() needs every worker listed in TF_CONFIG to be running, each as its own process (typically on its own device). With only worker 0 started, it waits for the other one. If you want to run your script on your local machine, you can run each worker in a different Docker container.
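
As a rough sketch of what giving each worker its own configuration could look like, the snippet below builds one TF_CONFIG per worker from the two addresses used in the question. The helper names make_tf_config and apply_tf_config are hypothetical. It is also an assumption here, not something confirmed in this thread, that starting a second process with index 1 (in a container or otherwise) is what lets worker 0 proceed.

```python
import json
import os

# Same two addresses as in the question's TF_CONFIG.
WORKERS = ["localhost:12345", "localhost:23456"]


def make_tf_config(index, workers=WORKERS):
    """Return the TF_CONFIG JSON string for the worker at `index` (hypothetical helper)."""
    if not 0 <= index < len(workers):
        raise ValueError("index must be in [0, %d)" % len(workers))
    return json.dumps(
        {
            # Every worker sees the same cluster description...
            "cluster": {"worker": list(workers)},
            # ...but each one identifies itself with a different index.
            "task": {"type": "worker", "index": index},
        }
    )


def apply_tf_config(index):
    """Set TF_CONFIG for this process; call before creating the strategy."""
    os.environ["TF_CONFIG"] = make_tf_config(index)


# One configuration per worker process.
configs = [make_tf_config(i) for i in range(len(WORKERS))]
```

Each worker process (or container) would call apply_tf_config with its own index before constructing the strategy, so the only difference between the two runs of tfexample.py is the task index.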

Tags:

tensorflow2.0

Danny
