Asynchronous MPI with SysV Shared Memory

We have a large Fortran/MPI code base that uses System V shared memory segments on a node. We run on fat nodes with 32 processors, but only two or four network cards and relatively little memory per processor. The idea is therefore that we set up a shared memory segment on which each processor performs its calculation (in its own block of the SMP array). MPI is then used to handle inter-node communication, but only on the master process of each SMP group. The procedure is double buffered and has worked well for us.
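A minimal sketch of this on-node layout, written in C because the SysV IPC calls are C APIs. It is an illustration under assumptions, not the code base described above: `fill_shared_blocks` is a hypothetical name, and fork() stands in for the processes on a node, which in the real setup are MPI ranks.

```c
#ifndef _XOPEN_SOURCE
#define _XOPEN_SOURCE 700   /* expose fork() and the SysV IPC declarations */
#endif
#include <assert.h>
#include <stddef.h>
#include <sys/ipc.h>
#include <sys/shm.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical demo: nworkers processes each fill their own block of one
 * SysV shared array. Returns the sum the parent sees, or -1.0 on error. */
double fill_shared_blocks(int nworkers, int block_len)
{
    assert(nworkers > 0 && block_len > 0);
    size_t n = (size_t)nworkers * (size_t)block_len;

    int shmid = shmget(IPC_PRIVATE, n * sizeof(double), IPC_CREAT | 0600);
    if (shmid < 0) return -1.0;

    double *array = shmat(shmid, NULL, 0);
    /* Mark the segment for removal now; it lives until the last detach. */
    shmctl(shmid, IPC_RMID, NULL);
    if (array == (void *)-1) return -1.0;

    int forked = 0;
    for (int w = 0; w < nworkers; ++w) {
        pid_t pid = fork();              /* children inherit the attachment */
        if (pid < 0) break;
        if (pid == 0) {
            /* Each worker touches only its own block of the shared array. */
            double *block = array + (size_t)w * (size_t)block_len;
            for (int i = 0; i < block_len; ++i) block[i] = (double)(w + 1);
            _exit(0);
        }
        ++forked;
    }
    while (wait(NULL) > 0) { /* reap every worker */ }

    double sum = -1.0;
    if (forked == nworkers) {
        sum = 0.0;
        for (size_t i = 0; i < n; ++i) sum += array[i];
    }
    shmdt(array);
    return sum;
}
```

Calling `fill_shared_blocks(4, 8)` has four workers each write eight values into their own block, and the parent then reads the whole array without any copy.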

The problem arose when we decided to switch to asynchronous communication for a bit of latency hiding. Only a couple of processors on a node communicate over MPI, but all processors see the received array (via shared memory). The other processors therefore do not know when the communicating processor has finished unless we introduce some kind of barrier, and then why use asynchronous communication at all?

The ideal, hypothetical solution would be to place the request tags in an SMP segment and run mpi_request_get_status on the CPU that needs to know. Of course, the request tag is only registered on the communicating CPU, so that does not work! Another suggestion was to branch a thread off the communicating process and use it to run mpi_request_get_status in a loop, with the flag argument in a shared memory segment so that all other images can see it. Unfortunately, that is not an option either, since we are not able to use threading libraries.

The only viable option we have come up with seems to work, but feels like a dirty hack. We put an impossible value in the upper-bound element of the receive buffer, so that once mpi_irecv has completed the value has changed, and every processor knows when it can safely use the buffer. Is that OK? It seems it would only work reliably if the MPI implementation can be guaranteed to transfer the data in order. That almost sounds convincing, since we wrote this in Fortran and our arrays are therefore contiguous; I would imagine the access is too.
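The sentinel trick can be sketched as follows, again in C. This is a simulation under loud assumptions: `sentinel_demo` and `SENTINEL` are hypothetical names, a forked child stands in for the incoming message, and the child deliberately writes the last element last with a release store. A real mpi_irecv gives no control over the order of its stores, which is exactly the concern raised above. The `__atomic_load` and `__atomic_store` calls are GCC/Clang builtins, not standard C.

```c
#ifndef _XOPEN_SOURCE
#define _XOPEN_SOURCE 700   /* expose fork() and the SysV IPC declarations */
#endif
#include <assert.h>
#include <stddef.h>
#include <sys/ipc.h>
#include <sys/shm.h>
#include <sys/wait.h>
#include <unistd.h>

#define SENTINEL (-1.0e300)   /* hypothetical "impossible" value for the data */

/* Returns the sum read once the sentinel has been overwritten, or -1.0. */
double sentinel_demo(int n)
{
    assert(n > 1);
    int shmid = shmget(IPC_PRIVATE, (size_t)n * sizeof(double), IPC_CREAT | 0600);
    if (shmid < 0) return -1.0;
    double *buf = shmat(shmid, NULL, 0);
    shmctl(shmid, IPC_RMID, NULL);       /* freed after the last detach */
    if (buf == (void *)-1) return -1.0;

    /* Poison the upper-bound element before the "receive" is posted. */
    double poison = SENTINEL;
    __atomic_store(&buf[n - 1], &poison, __ATOMIC_RELEASE);

    pid_t pid = fork();
    if (pid < 0) { shmdt(buf); return -1.0; }
    if (pid == 0) {
        /* Stand-in for the message arriving. ASSUMPTION: the buffer is
         * filled in order and the last element is written last. */
        for (int i = 0; i < n - 1; ++i) buf[i] = 1.0;
        double last = 1.0;
        __atomic_store(&buf[n - 1], &last, __ATOMIC_RELEASE);
        _exit(0);
    }

    /* Consumer: spin until the sentinel has been overwritten. */
    double seen;
    do {
        __atomic_load(&buf[n - 1], &seen, __ATOMIC_ACQUIRE);
    } while (seen == SENTINEL);

    double sum = 0.0;
    for (int i = 0; i < n; ++i) sum += buf[i];
    waitpid(pid, NULL, 0);
    shmdt(buf);
    return sum;
}
```

The sketch is only correct because the writer publishes the last element after everything else. If the writer filled the buffer out of order, the consumer could see the sentinel gone while earlier elements were still stale.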

Any thoughts?

Thanks, Joly

Here is a pseudocode outline of the kind of thing I am doing. I don't have the code to hand at home, so I hope I haven't forgotten anything important, but I'll check when I get back to the office...

pseudo(array_arg1(:,:), array_arg2(:,:)...)

  integer,      parameter : num_buffers=2
  Complex64bit, smp       : buffer(:,:,num_buffers)
  integer                 : prev_node, next_node
  integer                 : send_tag(num_buffers), recv_tag(num_buffers)
  integer                 : current, next
  integer                 : num_nodes

  boolean                 : do_comms
  boolean,      smp       : safe(num_buffers)
  boolean,      smp       : calc_complete(num_cores_on_node,num_buffers)

  allocate_arrays(...)

  work_out_neighbours(prev_node,next_node)

  am_i_a_slave(do_comms)

  setup_ipc(buffer,...)

  setup_ipc(safe,...)

  setup_ipc(calc_complete,...)

  current = 1
  next = mod(current,num_buffers)+1

  safe=true

  calc_complete=false

  work_out_num_nodes_in_ring(num_nodes)

  do i=1,num_nodes

    if(do_comms)
      check_all_tags_and_set_safe_flags(send_tag, recv_tag, safe) # just in case anything else has finished.
      check_tags_and_wait_if_need_be(current, send_tag, recv_tag)
      safe(current)=true
    else
      wait_until_true(safe(current))
    end if

    calc_complete(my_rank,current)=false
    calc_complete(my_rank,current)=calculate_stuff(array_arg1,array_arg2..., buffer(current), bounds_on_process)
    if(not calc_complete(my_rank,current)) error("fail!")

    if(do_comms)
      check_all_tags_and_set_safe_flags(send_tag, recv_tag, safe)

      check_tags_and_wait_if_need_be(next, send_tag, recv_tag)
      recv(prev_node, buffer(next), recv_tag(next))
      safe(next)=false

      wait_until_true(all(calc_complete(:,current)))
      check_tags_and_wait_if_need_be(current, send_tag, recv_tag)
      send(next_node, buffer(current), send_tag(current))
      safe(current)=false
    end if

    work_out_new_bounds()

    current=next
    next=mod(next,num_buffers)+1

  end do
end pseudo

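"check_all_tags_and_set_safe_flags" tests the outstanding requests and updates the "safe" flags that the slaves read. The comms process then blocks in "check_tags_and_wait_if_need_be(current, send_tag, recv_tag)" (mpi_wait), while the slaves spin in "wait_until_true(safe(current))".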
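The `safe` / `calc_complete` handshake from the pseudocode can be sketched in C as below. This is a hedged illustration, not the original implementation: `handshake_demo`, `struct shared`, and the sizes are hypothetical, fork() stands in for the on-node processes, and the point where the leader sets `safe` is where the real code would have returned from mpi_wait on the receive request. The `__atomic_*_n` calls are GCC/Clang builtins.

```c
#ifndef _XOPEN_SOURCE
#define _XOPEN_SOURCE 700   /* expose fork() and the SysV IPC declarations */
#endif
#include <assert.h>
#include <sys/ipc.h>
#include <sys/shm.h>
#include <sys/wait.h>
#include <unistd.h>

enum { NWORKERS = 4, BLOCK = 8 };        /* hypothetical sizes */

struct shared {
    int    safe;                         /* written by the comms process only */
    int    calc_complete[NWORKERS];      /* one slot per worker */
    double partial[NWORKERS];            /* each worker's result */
    double buffer[NWORKERS * BLOCK];     /* the "received" data */
};

/* Spin until another process publishes the flag with a release store. */
static void wait_until_true(int *flag)
{
    while (!__atomic_load_n(flag, __ATOMIC_ACQUIRE)) { /* spin */ }
}

/* Returns the total computed by the workers, or -1.0 on error. */
double handshake_demo(void)
{
    int shmid = shmget(IPC_PRIVATE, sizeof(struct shared), IPC_CREAT | 0600);
    if (shmid < 0) return -1.0;
    struct shared *s = shmat(shmid, NULL, 0);
    shmctl(shmid, IPC_RMID, NULL);       /* freed after the last detach */
    if (s == (void *)-1) return -1.0;

    s->safe = 0;
    for (int w = 0; w < NWORKERS; ++w) s->calc_complete[w] = 0;

    int forked = 0;
    for (int w = 0; w < NWORKERS; ++w) {
        pid_t pid = fork();
        if (pid < 0) break;
        if (pid == 0) {
            wait_until_true(&s->safe);   /* buffer must not be read earlier */
            double sum = 0.0;
            for (int i = 0; i < BLOCK; ++i) sum += s->buffer[w * BLOCK + i];
            s->partial[w] = sum;
            __atomic_store_n(&s->calc_complete[w], 1, __ATOMIC_RELEASE);
            _exit(0);
        }
        ++forked;
    }

    /* Leader: stand-in for "the receive has completed" (after mpi_wait). */
    for (int i = 0; i < NWORKERS * BLOCK; ++i) s->buffer[i] = 0.5;
    __atomic_store_n(&s->safe, 1, __ATOMIC_RELEASE);

    double total = -1.0;
    if (forked == NWORKERS) {
        total = 0.0;
        for (int w = 0; w < NWORKERS; ++w) {
            wait_until_true(&s->calc_complete[w]);
            total += s->partial[w];
        }
    }
    while (wait(NULL) > 0) { /* reap every worker */ }
    shmdt(s);
    return total;
}
```

Unlike the sentinel approach, the flag here is separate from the data and is only set by the process that knows the transfer has completed, so nothing depends on the order in which the buffer was filled.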


"... - , ?"

. ; - , . , , , , , ( - ).

, , (, , ) , . , () sysv, , , (b) , , , fork() MPI_Init() - ?

, - OpenMP on- node , , . , .

, " " , MPI, , node MPI-, - "" node. node , , . MPI (Wait Barrier) on-node. MPI3 .

, , <- > w636 > - IPC, SysV, SysV, . ( ) "", ; , , , MPI, , ( , MPI , ).
