The task-flow graph generation is accomplished by the front-end in two phases. The first one depends on the algorithm studied: given a sequential implementation, the user modifies it by inserting code that will produce dynamically the set of calculations and communications defining the corresponding parallel execution. The modified program is the first phase of FAST's front-end for the specific algorithm. Running this program on some appropriate input configuration produces an architecture-independent Intermediate Representation (IR) of the parallel execution.
The Intermediate Representation is given in terms of a simple intermediate language comprised of IR-operations and Send/Receive communication primitives. Each IR-operation is an abstraction of a ``medium-grain'' group of successive numerical instructions. These groups correspond to the basic computational blocks of the algorithm. Send/Receive primitives correspond to data-dependences between IR-operations and represent the data-flow.
As an example we give a small piece of pseudo-code and present the steps
taken to derive the code producing its Intermediate Representation. This
pseudo-code computes upon a two-dimensional grid of size
and calculates values of variables at each point of the grid as a
function of variables from this point and neighboring ones. It is a
simplified version of the routine which calculates temperature and heat
variables in the SIMPLE application [6]:
for j=0 to siz_Y do
for i=0 to siz_X do
D[i,j]=s[i,j]+R[i,j]+R[i,j-1]*(1-A[i,j])
A[i,j]=R[i,j]/D[i,j]
V[i,j]=(R[i,j-1]*V[i,j-1]+
s[i,j]*H[i,j])/D[i,j]
endfor
endfor
We notice that in every grid-point (i,j), the computation above requires
the values R[i,j-1] and V[i,j-1] from the neighboring point
(i,j-1), as well as values of ``local'' variables, that is, s[i,j],
R[i,j], etc.
This computation can be represented in an intermediate language format as
follows (for every
,
such that
) :
Grid-point[i, j] :
RECEIVE from [i,j-1]: R[i,j-1], 8 bytes
RECEIVE from [i,j-1]: V[i,j-1], 8 bytes
DAV_comp: DAV_COMP
SEND to [i,j+1]: R[i,j+1], 8 bytes
SEND to [i,j+1]: V[i,j+1], 8 bytes
DAV_comp denotes an IR-operation corresponding to the basic computational block of instructions which compute D[i,j], A[i,j], and V[i,j]. DAV_COMP is a constant representing the time spent for the calculation of values D[i,j], A[i,j], and V[i,j]. Such constants are expressed in terms of seconds, or machine cycles, or instruction counts. In the current implementation of FAST we compute them manually by counting the corresponding machine instructions (e.g. three floating point multiplications, one f.p. division, etc.). If necessary, the machine instruction counts are transformed into machine cycles or seconds by taking into consideration cycles-per-instruction figures from some specific hardware platform.
The first part of FAST's front-end generates such an intermediate
description of the algorithm examined for a certain input data-set. For
instance, the following piece of C-code generates the Intermediate
Representation for our example and can be used as the first part of a FAST
simulator for this toy application.
for (i=0; i<siz_X; i++)
for (j=0; j<siz_Y; j++) {
if (j-1 > 0) {
printf("RECEIVE [%d,%d]:%d",i,j-1,siz(*R));
printf("RECEIVE [%d,%d]:%d",i,j-1,siz(*V));
}
printf("DAV_comp: %lf", DAV_COMP);
if (j+1 < siz_Y) {
printf("SEND [%d,%d]:%d",i,j+1,siz(*R));
printf("SEND [%d,%d]:%d",i,j+1,siz(*V));
}
}
The actual values that would be received, computed, and sent in a real implementation of this example are not of interest. Functional algorithm simulation focuses on extracting information about the sets of computations performed and about the sources, destinations, and sizes of messages exchanged; not about numerical values of the results.
Table 1 shows a typical subset of the Intermediate Representation as generated by FAST for SIMPLE, a CFD code,. It describes calculations and message exchanges taking place at the points of the two-dimensional grid employed by SIMPLE.
In a future version of FAST, instead of having the user manually writing the code that generates the Intermediate Representation, we could replace FAST's front-end with the front-end of a parallel programming language like JADE. The JADE compiler generates a task-flow graph according to the parallel program and the given input. This task-flow graph could then be fed to FAST's back-end. In that way the same parallel code could be used for functional algorithm simulation and actual execution.
In the second phase of the front-end a simple parser transforms the Intermediate Representation into a weighted task-flow graph which follows the Macro-Dataflow model of computation [8]. Task-nodes in the graph contain a number of Intermediate Representation primitives. Their ``boundaries'' are defined by Send and Receive primitives occurring in the IR. The tasks start executing upon receipt of all incoming data and continue to completion without interruption. Upon completion their results are forwarded to adjacent nodes. Edges correspond to Send-Receive pairs and represent the data dependencies between the nodes. The annotation of the task-flow graph is straightforward. Nodes are assigned the sum of the constants of their corresponding IR-operations. Weights assigned to the edges represent the number of bytes ``carried'' by those edges from their source to their destination nodes.
Tasks are generated by FAST with no implicit assumptions regarding partitioning and placement of data. Nevertheless, it is not difficult to take into account the case where the data-structures of the parallel execution studied are partitioned into blocks: this option corresponds to the static-partitioning approach frequently used in practice when parallelizing scientific applications. To this end, the user of FAST has the flexibility to have the system partition the data-structures into blocks structured either according to a given geometry or in compliance to some specified heuristic. Tasks belonging to the same partition will eventually be merged into a sequential thread of execution.