Tuesday, January 4, 2022

[SOLVED] Can I delete events.out.tfevents.XXXXXXXXXX.computer_name files from training folder

Issue

I am training a faster_rcnn_inception model for object detection on a custom dataset. In the training directory, I found a folder called eval_0 and TensorFlow-generated events.out.tfevents.xxxxxx files.

The training directory structure is as follows:

+training_dir
    +eval_0
     -events.out.tfevents.1542309785.instance-1  1.2GB
     -events.out.tfevents.1542367255.instance-1  5.3GB
     -events.out.tfevents.1542369886.instance-1  3.6GB
     -events.out.tfevents.1542624154.instance-1  31MB
     -events.out.tfevents.1543060258.instance-1  19MB
     -events.out.tfevents.1543066775.instance-2  1.6GB
 -events.out.tfevents.1542308099.instance-1  17MB
 -events.out.tfevents.1542308928.instance-1  17MB
 -events.out.tfevents.1542366369.instance-1  17MB
 -events.out.tfevents.1542369000.instance-1  17MB
 -events.out.tfevents.1542623262.instance-1  17MB
 -events.out.tfevents.1543064936.instance-2  17MB
 -events.out.tfevents.1543065796.instance-2  17MB
 -events.out.tfevents.1543065880.instance-2  17MB
 -model.ckpt-96004.data-00000-of-00001
 -model.ckpt-96004.index
 -model.ckpt-96004.meta
 -model.ckpt-96108.data-00000-of-00001
 -model.ckpt-96108.index
 -model.ckpt-96108.meta

As I understand it, the tfevents files in the eval_0 folder are summary files from evaluation, and the tfevents files in training_dir are summary files from training.

I have interrupted the training process several times and resumed from the most recent checkpoint. I also understand that restarting the training process generates new tfevents files.

My questions are as follows:

  • Why do the training tfevents files all have the same size, while the sizes of the eval_0 tfevents files vary?

  • Why does interrupting training generate a new tfevents file in the training folder, while the same is not observed in eval_0?

  • Can I delete all tfevents files in eval_0 except the latest one? Does this affect training or the evaluation history?


Solution

tfevents files are not essential for training and can be safely removed.
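For the cleanup asked about in the question, a minimal sketch could look like the following. It assumes the file names follow the events.out.tfevents.<unix_timestamp>.<hostname> pattern seen in the listing above and treats the file with the largest timestamp as the latest one. The function names stale_event_files and prune_event_files are hypothetical helpers, not part of TensorFlow.

```python
import re
from pathlib import Path

# Assumed naming scheme: events.out.tfevents.<unix_timestamp>.<hostname>
EVENT_NAME = re.compile(r"^events\.out\.tfevents\.(\d+)\.(.+)$")


def stale_event_files(directory):
    """Return event files in `directory`, oldest first, excluding the newest."""
    found = []
    for path in Path(directory).iterdir():
        match = EVENT_NAME.match(path.name)
        if path.is_file() and match:
            # Sort key is the timestamp embedded in the file name.
            found.append((int(match.group(1)), path))
    found.sort()
    return [path for _, path in found[:-1]]


def prune_event_files(directory, dry_run=True):
    """Delete all but the newest event file.

    With dry_run=True (the default) nothing is deleted; the files that
    would be removed are only returned.
    """
    stale = stale_event_files(directory)
    for path in stale:
        if not dry_run:
            path.unlink()
    return stale
```

Calling prune_event_files("training_dir/eval_0") first and inspecting the returned list before repeating the call with dry_run=False keeps the deletion reversible until the last step. Checkpoint files such as model.ckpt-96108.index do not match the pattern and are left alone.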

In TensorFlow, tfevents files are created by FileWriters and are generally used to store summary output. Here are some common examples of how tf.summary operations are used:

  • storing a description of the tensorflow graph before training starts
  • writing a value of the loss function for every training step
  • storing a histogram of activations or weights for a layer once per epoch
  • storing an example output image of the network once per validation run
  • storing average precision (or any other metric) for the whole validation set

This information is not essential for training and can therefore be deleted. Yet it might come in handy for debugging or for studying the behavior of the model. TensorBoard is the most common tool for reading and visualizing data stored in tfevents files. The files can also be read and interpreted manually using the protobuf format and its implementations for Python, C++, and other languages.

tfevents files are written in the TFRecord format, a simple format for storing a sequence of binary records. TensorFlow always appends new events/summaries to the end of the file if the file already exists, which explains why a file grows.
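To make the record structure concrete, here is a minimal sketch of a reader that walks the TFRecord framing using only the Python standard library. It relies on the documented layout of each record: a little-endian uint64 length, a 4-byte checksum of the length, the payload, and a 4-byte checksum of the payload. The checksums are skipped rather than verified, and the payloads (serialized Event protobuf messages) are not decoded, since that needs the protobuf classes shipped with TensorFlow or TensorBoard. The names iter_tfrecords and summarize are hypothetical.

```python
import struct


def iter_tfrecords(path):
    """Yield the raw payload bytes of each record in a TFRecord file.

    Checksums are read but NOT verified in this sketch.
    """
    with open(path, "rb") as f:
        while True:
            header = f.read(12)  # uint64 length + uint32 checksum of length
            if len(header) < 12:
                return  # end of file, or a truncated header
            (length,) = struct.unpack("<Q", header[:8])
            payload = f.read(length)
            footer = f.read(4)  # uint32 checksum of the payload
            if len(payload) < length or len(footer) < 4:
                return  # truncated record, e.g. the writer was interrupted
            yield payload


def summarize(path):
    """Count the records in an event file and total their payload sizes."""
    sizes = [len(record) for record in iter_tfrecords(path)]
    return {"records": len(sizes), "payload_bytes": sum(sizes)}
```

Running summarize on two event files from the same folder gives a rough picture of why one is larger: more records, bigger records (images and histograms, for example), or both.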

Because of implementation details of the optimization routine provided with tensorflow/models/research/object_detection, training and evaluation event files behave differently. Namely, the evaluation event file is created using a FileWriter directly, which will reuse the latest existing event file in the log_dir whenever one exists. The implementation also collects a large number of summaries regularly, which increases the size of the event file during training.

For the training routine, on the other hand, the developers explicitly specify an empty list of summaries when training is done on TPU, which means that the event file is created once and never used afterwards. This behaviour can be different when training is performed on non-TPU hardware or when the summarize_gradients option is enabled for training.



Answered By - y.selivonchyk