Restoring a checkpoint in distributed TensorFlow


Using a setup similar to https://github.com/tensorflow/models/tree/master/inception, the chief worker periodically saves a checkpoint file on the node it is running on. I'm running two PS tasks on two different nodes, and each node also runs two workers, so there are four workers in total, one of which is the chief.



When I restart training without any modification, the Supervisor automatically tries to restore the last checkpoint, but fails with an error that it could not find the checkpoint on the second node (the node the chief worker is not running on), because the chief never saved a checkpoint there:


W tensorflow/core/framework/op_kernel.cc:936] Not found: Unsuccessful TensorSliceReader constructor: Failed to find any matching files for /home/muneebs/tf_train/model.ckpt-275



If I copy the checkpoint directory to the second node, it restores fine. Is this a bug? Should the saver be initialized with sharded=True? If so, is that the only way? Can't we keep the checkpoint as a single file, in case the number of nodes changes later on?
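For reference, here is roughly how I would pass a sharded saver to the Supervisor if that turns out to be the right fix. This is only a minimal sketch: the variables are stand-ins for the real model, the server/cluster setup is omitted, and the log directory is the one from the error above. With sharded=True the saver writes one checkpoint shard per parameter device, so each PS saves (and can later restore) its own variables locally.

import tensorflow as tf

is_chief = True  # in the real job: FLAGS.task_index == 0

# Stand-ins for the real model and training op.
weights = tf.Variable(tf.zeros([10]), name="weights")
global_step = tf.Variable(0, name="global_step", trainable=False)
train_op = tf.assign_add(global_step, 1)

# One checkpoint shard per parameter device instead of a single file.
saver = tf.train.Saver(sharded=True)

sv = tf.train.Supervisor(is_chief=is_chief,
                         logdir="/home/muneebs/tf_train",
                         saver=saver,
                         global_step=global_step,
                         save_model_secs=600)

with sv.managed_session("") as sess:  # pass server.target in the real job
    sess.run(train_op)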




1 Answer



A distributed file system such as HDFS would help.



You can save the model checkpoint to a directory on HDFS; since every node sees the same path, the problem of restoring the checkpoint from a node where it was never written goes away.
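A rough sketch of what I mean (the namenode address and path are placeholders, and TensorFlow must be built with HDFS support for hdfs:// paths to work):

import tensorflow as tf

# Pointing the Supervisor's logdir at HDFS means every task reads and writes
# the same checkpoint directory, so restore works from any node.
logdir = "hdfs://namenode:9000/user/muneebs/tf_train"  # placeholder path

global_step = tf.Variable(0, name="global_step", trainable=False)
train_op = tf.assign_add(global_step, 1)

sv = tf.train.Supervisor(is_chief=True, logdir=logdir, global_step=global_step)
with sv.managed_session("") as sess:  # pass server.target in a real cluster
    sess.run(train_op)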



Another option is to launch the PS task and the worker with task_index=0 on the same machine, as sketched below.
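For example, with a cluster spec like this one, PS task 0 and the chief worker (task_index=0) run on the same machine, so the checkpoint the chief writes ends up on the same node as PS 0 (hostnames and ports are placeholders):

import tensorflow as tf

# ps task 0 and worker task 0 (the chief) are both placed on node1, so the
# chief's checkpoint directory lives on the same machine as PS 0.
cluster = tf.train.ClusterSpec({
    "ps": ["node1:2222", "node2:2222"],
    "worker": ["node1:2223", "node1:2224", "node2:2223", "node2:2224"],
})

# Each process then starts its own server, e.g. the chief worker:
server = tf.train.Server(cluster, job_name="worker", task_index=0)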





HDFS is not an option here. The task_index doesn't matter either, because each PS on each machine has to restore the part of the model parameters that it's supposed to manage.
– muneeb
Aug 11 '17 at 18:49





