
TensorFlow 1.3, Ubuntu 16.04. Network size: 4M

Yet during graph construction it always gives me: failed to alloc 34359738368 bytes on host: CUDA_ERROR_OUT_OF_MEMORY

Then I captured all the log output from the screen; below is a snippet:

I tensorflow/core/framework/log_memory.cc:35] LOG_MEMORY MemoryLogTensorOutput { step_id: 9749 kernel_name: "input/ParseSingleExample/Squeeze_gtc_raw/_13" tensor { dtype: DT_FLOAT shape { dim { size: 256 } dim { size: 576 } dim { size: 3 } } allocation_description { requested_bytes: 1769472 allocated_bytes: 1769472 allocator_name: "cuda_host_bfc" allocation_id: 19504 ptr: 1173065629696 } } }

I tensorflow/core/framework/log_memory.cc:35] LOG_MEMORY MemoryLogTensorOutput { step_id: 9750 kernel_name: "input/ParseSingleExample/ParseExample/ParseExample/_7" tensor { dtype: DT_FLOAT shape { dim { size: 1 } dim { size: 256 } dim { size: 576 } dim { size: 3 } } allocation_description { requested_bytes: 1769472 allocated_bytes: 1769472 allocator_name: "GPU_0_bfc" allocation_id: 19619 ptr: 1117158602752 } } }

I tensorflow/core/framework/log_memory.cc:35] LOG_MEMORY MemoryLogTensorOutput { step_id: 9750 kernel_name: "input/ParseSingleExample/Squeeze_img_raw" tensor { dtype: DT_FLOAT shape { dim { size: 256 } dim { size: 576 } dim { size: 3 } } allocation_description { requested_bytes: 1769472 allocated_bytes: 1769472 allocator_name: "GPU_0_bfc" allocation_id: 19619 ptr: 1117158602752 } } }

I tensorflow/core/framework/log_memory.cc:35] LOG_MEMORY MemoryLogTensorAllocation { step_id: -6 kernel_name: "Unknown" tensor { dtype: DT_FLOAT shape { dim { size: 256 } dim { size: 576 } dim { size: 3 } } allocation_description { requested_bytes: 1769472 allocated_bytes: 1835008 allocator_name: "cuda_host_bfc" allocation_id: 19505 has_single_reference: true ptr: 1173067399168 } } }

I tensorflow/core/framework/log_memory.cc:35] LOG_MEMORY MemoryLogTensorOutput { step_id: 9750 kernel_name: "input/ParseSingleExample/ParseExample/ParseExample/_9" tensor { dtype: DT_FLOAT shape { dim { size: 1 } dim { size: 256 } dim { size: 576 } dim { size: 3 } } allocation_description { requested_bytes: 1769472 allocated_bytes: 1769472 allocator_name: "GPU_0_bfc" allocation_id: 19620 ptr: 1117162141696 } } }

I tensorflow/core/framework/log_memory.cc:35] LOG_MEMORY MemoryLogTensorOutput { step_id: 9750 kernel_name: "input/ParseSingleExample/Squeeze_gtc_raw" tensor { dtype: DT_FLOAT shape { dim { size: 256 } dim { size: 576 } dim { size: 3 } } allocation_description { requested_bytes: 1769472 allocated_bytes: 1769472 allocator_name: "GPU_0_bfc" allocation_id: 19620 ptr: 1117162141696 } } }

I tensorflow/core/framework/log_memory.cc:35] LOG_MEMORY MemoryLogTensorAllocation { step_id: -6 kernel_name: "Unknown" tensor { dtype: DT_FLOAT shape { dim { size: 1 } dim { size: 256 } dim { size: 576 } dim { size: 3 } } allocation_description { requested_bytes: 1769472 allocated_bytes: 1769472 allocator_name: "GPU_0_bfc" allocation_id: 19621 has_single_reference: true ptr: 1117158602752 } } }

I tensorflow/core/framework/log_memory.cc:35] LOG_MEMORY MemoryLogTensorAllocation { step_id: -6 kernel_name: "Unknown" tensor { dtype: DT_FLOAT shape { dim { size: 1 } dim { size: 256 } dim { size: 576 } dim { size: 3 } } allocation_description { requested_bytes: 1769472 allocated_bytes: 1769472 allocator_name: "GPU_0_bfc" allocation_id: 19622 has_single_reference: true ptr: 1117160372224 } } }

You can see that TF keeps allocating memory for the input image (size: 256x576x3, batch size 1).

When I counted the lines containing this allocation, I got:

grep -c 'allocated_bytes: 1769472' logging.txt
97022
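As a rough sanity check on these numbers, the arithmetic below uses only values quoted in the question (the tensor shape, the failed request size, and the grep count). It assumes DT_FLOAT means 4 bytes per element. Note that `grep -c` counts matching lines, and in the snippet above several MemoryLogTensorOutput lines share an allocation_id and ptr with another line, so the count is probably an upper bound on distinct buffers rather than an exact one.

```python
# Values copied from the log lines and error message quoted in the question.
image_shape = (256, 576, 3)      # dims of the logged DT_FLOAT tensor
bytes_per_float32 = 4            # assumption: DT_FLOAT is a 4-byte float

image_bytes = bytes_per_float32
for dim in image_shape:
    image_bytes *= dim

failed_request = 34359738368     # bytes, from the CUDA_ERROR_OUT_OF_MEMORY message
matching_lines = 97022           # from grep -c on logging.txt

print(image_bytes)                    # 1769472, matches requested_bytes in the log
print(failed_request / 2**30)         # 32.0, i.e. the failed request is exactly 32 GiB
print(failed_request / image_bytes)   # number of such images that one request would hold
print(matching_lines * image_bytes / 2**30)  # GiB if every matching line were a live buffer
```

The last figure is far larger than the failed 32 GiB request, which is consistent with the buffers not all being live at the same time.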

AMAZING!

My question is: why does TF keep allocating memory for the input images? Could this be what causes the memory leak?

Thanks!

  • @YaroslavBulatov I have checked the log info as suggested in question https://stackoverflow.com/questions/45826499/tensorflow-memory-needed-doesnt-scale-with-batch-size-and-image-size – yanchao yang Aug 23 '17 at 01:34
  • @YaroslavBulatov Would you please help me dig in? – yanchao yang Aug 23 '17 at 01:35
  • I don't understand the question. Are you asking why TensorFlow tries to put the image in memory? Where else would it put the image? – Yaroslav Bulatov Aug 23 '17 at 01:42
  • @YaroslavBulatov First I got a memory leak, with TF trying to allocate 34GB of memory. Then I checked all the memory allocation operations listed in the question. You can see that during construction TF keeps allocating memory for the input image (in my case there is just 1 image, as the batch size is 1). I cannot understand why there are 97022 allocation lines, and I guess that's the reason for the 34GB memory consumption. – yanchao yang Aug 23 '17 at 01:47
  • I would look for the LOG_MEMORY line which tries to allocate the largest output. What you are seeing with 97022 allocations is normal: every intermediate result causes an allocation, and the memory is released as soon as the result is no longer needed. – Yaroslav Bulatov Aug 23 '17 at 01:50
  • @YaroslavBulatov Earlier, your expectation was that there would be an intermediate tensor that takes up 34GB of memory, yet I didn't find this in the log info. – yanchao yang Aug 23 '17 at 01:52
  • It's possible that LOG_MEMORY only records successful allocations, and the 34GB allocation didn't get recorded because it failed. You could reduce your problem to a point where it works, and look for unusually large memory allocations. There's also github.com/yaroslavvb/memory_util to make LOG_MEMORY output more readable – Yaroslav Bulatov Aug 23 '17 at 01:55
  • @YaroslavBulatov good point! will check. – yanchao yang Aug 23 '17 at 02:04
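One way to follow the suggestion in the comments (look for the largest LOG_MEMORY allocation) is to scan the captured log with a small script. This is only a sketch: `summarize` is a hypothetical helper, not part of TensorFlow or memory_util, and the regular expressions assume the log lines look like the ones quoted in the question. It also assumes that allocation_id values are counted separately per allocator, which is how the snippet appears to behave.

```python
import re
from collections import defaultdict

# Field names below are taken from the LOG_MEMORY lines quoted in the question.
RECORD = re.compile(r'LOG_MEMORY (\w+) \{')
KERNEL = re.compile(r'kernel_name: "([^"]*)"')
REQUESTED = re.compile(r'requested_bytes: (\d+)')
ALLOCATOR = re.compile(r'allocator_name: "([^"]*)"')
ALLOC_ID = re.compile(r'allocation_id: (\d+)')


def summarize(lines, top=5):
    """Hypothetical helper: return the largest records and distinct id counts.

    largest  -> list of (requested_bytes, record_type, kernel_name, allocator)
    distinct -> dict mapping allocator name to number of distinct allocation ids
    """
    records = []
    ids_by_allocator = defaultdict(set)
    for line in lines:
        kind = RECORD.search(line)
        size = REQUESTED.search(line)
        if kind is None or size is None:
            continue  # not a LOG_MEMORY line carrying an allocation description
        kernel = KERNEL.search(line)
        allocator = ALLOCATOR.search(line)
        alloc_id = ALLOC_ID.search(line)
        allocator_name = allocator.group(1) if allocator else "?"
        records.append((
            int(size.group(1)),
            kind.group(1),
            kernel.group(1) if kernel else "?",
            allocator_name,
        ))
        if alloc_id:
            ids_by_allocator[allocator_name].add(int(alloc_id.group(1)))
    largest = sorted(records, reverse=True)[:top]
    distinct = {name: len(ids) for name, ids in ids_by_allocator.items()}
    return largest, distinct
```

To use it on the captured log, open the file and pass the file object, for example `with open("logging.txt") as f: largest, distinct = summarize(f)`. If, as suggested above, failed allocations are not logged, the 34GB request itself will not show up here; running the same scan on a reduced problem that succeeds may still reveal which kernel requests unusually large buffers.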
