
Basic StopAtStepHook & MonitoredTrainingSession usage

I want to set up a distributed TensorFlow model, but I fail to understand how MonitoredTrainingSession and StopAtStepHook interact. Previously I had this setup:

for epoch in range(training_epochs):
  for i in range(total_batch-1):
    c, p, s = sess.run([cost, prediction, summary_op], feed_dict={x: batch_x, y: batch_y})

Now I have this setup (simplified):

def run_nn_model(learning_rate, log_param, optimizer, batch_size, layer_config):
  with tf.device(tf.train.replica_device_setter(
        worker_device="/job:worker/task:%d" % mytaskid,
        cluster=cluster)):

    # [variables...]

    hooks=[tf.train.StopAtStepHook(last_step=100)]
    if myjob == "ps":
        server.join()
    elif myjob == "worker":
        with tf.train.MonitoredTrainingSession(master=server.target,
                                               is_chief=(mytaskid == 0),
                                               checkpoint_dir='/tmp/train_logs',
                                               hooks=hooks) as sess:

          while not sess.should_stop():
            # for epoch in range... [see above]

Is this wrong? It throws:

RuntimeError: Run called even after should_stop requested.
Command exited with non-zero status 1

Can somebody explain to me how TensorFlow is coordinating the workers here? And how can I use the step counter to keep track of training progress? (Before, I had the handy epoch variable.)

asked Dec 29 '25 by dv3
1 Answer

Every time a sess.run is executed, the step counter is incremented. The problem here is that you are running more steps ((total_batch - 1) * training_epochs) than the number of steps specified in the hook (100): once the counter reaches last_step, should_stop() returns True, but your nested loops keep calling sess.run, which raises the RuntimeError.

What you could do, even though it is not the cleanest syntax, is define last_step = (total_batch - 1) * training_epochs, so the hook stops the session exactly when your loops finish.
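As an illustration of why that works (plain Python, not TensorFlow: the class below only mocks the hook's behaviour, and the epoch/batch numbers are made up), matching last_step to the total loop count means should_stop() flips exactly when the loops finish, and any extra run() reproduces the asker's error:

```python
# Hypothetical stand-ins for the asker's values.
training_epochs = 3
total_batch = 8

# Total sess.run calls the nested loops make: the inner loop
# runs total_batch - 1 times per epoch.
last_step = (total_batch - 1) * training_epochs  # 21


class MockMonitoredSession:
    """Mocks how StopAtStepHook coordinates with the session:
    each run() advances the step counter, should_stop() flips once
    last_step is reached, and running past it raises."""

    def __init__(self, last_step):
        self.step = 0
        self.last_step = last_step

    def should_stop(self):
        return self.step >= self.last_step

    def run(self):
        if self.should_stop():
            raise RuntimeError("Run called even after should_stop requested.")
        self.step += 1


sess = MockMonitoredSession(last_step)
while not sess.should_stop():
    sess.run()

print(sess.step)  # 21 — exactly (total_batch - 1) * training_epochs
```

With last_step set smaller than the loop count (as with last_step=100 against a longer training run, or vice versa), the inner loops would call run() after should_stop() had already turned True, which is exactly the RuntimeError above.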

answered Jan 01 '26 by Malo Marrec
