Compilation and Real-Time Analysis of a Synchronous Data-Flow Application on the Kalray MPPA Many-Core Processor

Matthieu Moy (from a presentation by Hamza Rihani)

Verimag (Grenoble INP) Grenoble, France

June 2017

#### Outline

- 1 Critical, Real-Time and Many-Core
- 2 Parallel code generation and analysis
- 3 Models Definition
- 4 Multicore Response Time Analysis of SDF Programs
- 5 Evaluation
- 6 Conclusion and Future Work
- 7 A new compilation team in Lyon (LIP) ?

#### Outline

#### 1 Critical, Real-Time and Many-Core

- 2 Parallel code generation and analysis
- 3 Models Definition
- 4 Multicore Response Time Analysis of SDF Programs
- 5 Evaluation
- 6 Conclusion and Future Work
- 7 A new compilation team in Lyon (LIP) ?

#### Time-critical, compute intensive applications





- Hard Real-Time: we must guarantee that task execution completes before deadline
- Compute-intensive
- Space/power bounded

#### Performance Vs Predictability





# Many-core

## Lots of simple cores

Kalray MPPA (Massively Parallel Processor Array):

- 256 cores
- No cache consistency
- No out-of-order execution
- No branch prediction
- No timing anomaly

## $\Rightarrow$ good fit for real-time?



#### High-level Data-Flow Application Model

Synchronous hypothesis: computation/communication in 0-time



| <br><u> </u> | <br>57 |
|--------------|--------|

#### High-level Data-Flow Application Model

Synchronous hypothesis: computation/communication in 0-time



#### High-level Data-Flow Application Model

Synchronous hypothesis: computation/communication in 0-time





→ Take into account all levels in Worst-Case Execution Time (WCET) analysis and programming model

#### **Context and Partners**



### Outline

#### 1 Critical, Real-Time and Many-Core

- 2 Parallel code generation and analysis
  - 3 Models Definition
  - 4 Multicore Response Time Analysis of SDF Programs
  - 5 Evaluation
- 6 Conclusion and Future Work
- 7 A new compilation team in Lyon (LIP) ?



Single-core code generation

```
int main_app(i1, i2)
{
    na = NA(i1);
    ne = NE(i2);
    nb = NB(na);
    nd = ND(na);
    nf = NF(ne);
    o = NC(nb,nd,nf);
    return o;
}
```

static non-preemptive scheduling

High level representation

Industrialized as SCADE (1993)

heavily used in avionics and nuclear



Multi/Many-core code generation



static non-preemptive scheduling

High level representation





Multi/Many-core code generation



static non-preemptive scheduling

High level representation





Multi/Many-core code generation



static non-preemptive scheduling

High level representation



#### Parallel code generation from Lustre/SCADE (pseudo-code)



#### Contributions (part of Ph.D Hamza Rihani, with Claire Maiza)

**1** Precise accounting for interference on shared resources in a many-core processor



#### Contributions (part of Ph.D Hamza Rihani, with Claire Maiza)

**1** Precise accounting for interference on shared resources in a many-core processor

P<sub>0</sub> task of interest y task of interest y task of interest 00 40 80

2 Model of a multi-level arbiter to the shared memory



## Contributions (part of Ph.D Hamza Rihani, with Claire Maiza)

**1** Precise accounting for interference on shared resources in a many-core processor

2 Model of a multi-level arbiter to the shared memory

3 Response time and release dates analysis respecting dependencies.





### Outline

1 Critical, Real-Time and Many-Core

2 Parallel code generation and analysis

#### 3 Models Definition

4 Multicore Response Time Analysis of SDF Programs

#### 5 Evaluation

6 Conclusion and Future Work

7 A new compilation team in Lyon (LIP) ?

## Architecture Model



- Kalray MPPA 256 Bostan
- 16 compute clusters + 4 I/O clusters
- Dual NoC (Network on Chip)

#### Architecture Model



#### Per cluster:

- 16 cores + 1 Resource Manager
- NoC Tx, NoC Rx, Debug Unit
- 16 shared memory banks (total size: 2 MB)

#### Architecture Model





#### Per cluster:

- 16 cores + 1 Resource Manager
- NoC Tx, NoC Rx, Debug Unit
- 16 shared memory banks (total size: 2 MB)
- Multi-level bus arbiter per memory bank







- Tasks mapping on cores
- Static non-preemptive scheduling
- Spatial Isolation

different tasks go to different memory banks

- Execution model:
  - execute in a "local" bank
  - write to a "remote" bank



- Tasks mapping on cores
- Static non-preemptive scheduling
- Spatial Isolation

different tasks go to different memory banks

- Execution model:
  - execute in a "local" bank
  - write to a "remote" bank





- Tasks mapping on cores
- Static non-preemptive scheduling
- Spatial Isolation

different tasks go to different memory banks

- Execution model:
  - execute in a "local" bank
  - write to a "remote" bank





- Tasks mapping on cores
- Static non-preemptive scheduling
- Spatial Isolation

different tasks go to different memory banks

- Execution model:
  - execute in a "local" bank
  - write to a "remote" bank



- Directed Acyclic Task Graph
- Mono-rate
- Fixed mapping and execution order



- Directed Acyclic Task Graph
- Mono-rate
- Fixed mapping and execution order
- Each task  $\tau_i$ :





- Directed Acyclic Task Graph
- Mono-rate
- Fixed mapping and execution order
- Each task  $\tau_i$ :
  - Input: Processor Demand, Memory Demand





- Directed Acyclic Task Graph
- Mono-rate
- Fixed mapping and execution order
- Each task  $\tau_i$ :
  - Input: Processor Demand, Memory Demand
  - Output: Release date  $(rel_i)$ , response time  $(R_i)$





- Directed Acyclic Task Graph
- Mono-rate
- Fixed mapping and execution order
- Each task  $\tau_i$ :
  - Input: Processor Demand, Memory Demand
  - Output: Release date  $(rel_i)$ , response time  $(R_i)$





- Directed Acyclic Task Graph
- Mono-rate
- Fixed mapping and execution order
- Each task  $\tau_i$ :
  - Input: Processor Demand, Memory Demand
  - Output: Release date  $(rel_i)$ , response time  $(R_i)$



## Outline

1 Critical, Real-Time and Many-Core

- 2 Parallel code generation and analysis
- 3 Models Definition

#### 4 Multicore Response Time Analysis of SDF Programs

#### 5 Evaluation

6 Conclusion and Future Work

7 A new compilation team in Lyon (LIP) ?

• Response Time 
$$R = PD + I^{BUS}(R)$$









• Fix-point formula  $\Rightarrow$  iterative algorithm.



- Fix-point formula  $\Rightarrow$  iterative algorithm.
- Multiple shared resources (memory banks)



- Fix-point formula  $\Rightarrow$  iterative algorithm.
- Multiple shared resources (memory banks)

$$I^{BUS}(R) = \sum_{b \in B} I^{BUS}_b(R)$$

where B: a set of memory banks



- Fix-point formula  $\Rightarrow$  iterative algorithm.
- Multiple shared resources (memory banks)

$$I^{BUS}(R) = \sum_{b \in B} I^{BUS}_b(R)$$

where B: a set of memory banks

Requires a model of the bus arbiter

Model of the MPPA Bus



 $I_b^{\text{BUS}}$ : delay from all accesses + concurrent ones



Model of the MPPA Bus















 $Lv_{1} = S_{i}^{b}$   $Lv_{2} = Lv_{1} + \sum_{y=1}^{15} \min(A_{i}^{y,b}, Lv_{1})$   $Lv_{3} = Lv_{2} + \min(A_{i}^{G2,b}, Lv_{2})$   $Lv_{4} = Lv_{4} + A_{i}^{G3,b}$ 

 $I_b^{\scriptscriptstyle BUS} = Lv_4 \times \text{Bus Delay}$ 





$$Lv_{1} = S_{i}^{b}$$

$$Lv_{2} = Lv_{1} + \sum_{y=1}^{15} \min(A_{i}^{y,b}) Lv_{1}$$

$$Lv_{3} = Lv_{2} + \min(A_{i}^{G2,b}), Lv_{2}$$

$$Lv_{4} = Lv_{4} + (A_{i}^{G3,b})$$

 $I_b^{\scriptscriptstyle BUS} = Lv_4 \times \mathsf{Bus} \mathsf{ Delay}$ 

- $I_b^{\text{BUS}}$ : delay from all accesses + concurrent ones  $S_i^b$ : number of accesses of task  $\tau_i$  to bank b $A_i^{y,b}$ : number of concurrent accesses from core y to
  - $b_i$  . Humber of concurrent accesses from core y to bank b









1 Start with initial release dates.







**1** Start with initial release dates.

2 Compute response times

•••





1 Start with initial release dates.

2 Compute response times

... ...



- 1 Start with initial release dates.
- 2 Compute response times
  - ... ... a fixed-point is reached!





- 1 Start with initial release dates.
- 2 Compute response times
  - ... ... a fixed-point is reached!
- 3 Update the release dates.





- Start with initial release dates.
- 2 Compute response times
  - ... ... a fixed-point is reached!
- 3 Update the release dates.
- 4 Repeat until no release date changes (another fixed-point iteration).





- Start with initial release dates.
- 2 Compute response times
  - ... ... a fixed-point is reached!
- 3 Update the release dates.
- 4 Repeat until no release date changes (another fixed-point iteration).







- Convergence of the  $1^{st}$  fixed-point iteration:
  - Monotonic and bounded







21 / 30



http://www-verimag.imag.fr/TR/TR-2016-1.pdf





http://www-verimag.imag.fr/TR/TR-2016-1.pdf









http://www-verimag.imag.fr/TR/TR-2016-1.pdf

## Outline

1 Critical, Real-Time and Many-Core

- 2 Parallel code generation and analysis
- 3 Models Definition
- 4 Multicore Response Time Analysis of SDF Programs

#### 5 Evaluation

6 Conclusion and Future Work

7 A new compilation team in Lyon (LIP) ?

# Evaluation: ROSACE Case Study<sup>1</sup>



• Flight management system controller

<sup>&</sup>lt;sup>1</sup> Pagetti et al., RTAS 2014

# Evaluation: ROSACE Case Study<sup>1</sup>



- Flight management system controller
- Receive from sensors and transmit to actuators

<sup>&</sup>lt;sup>1</sup> Pagetti et al., RTAS 2014

# Evaluation: ROSACE Case Study<sup>1</sup>





- Flight management system controller
- Receive from sensors and transmit to actuators
- Assumptions:

Tasks are mapped on 5 cores Debug Support Unit is disabled Context switches are over-approximated constants

<sup>&</sup>lt;sup>1</sup> Pagetti et al., RTAS 2014

# Evaluation: ROSACE Case Study<sup>1</sup>





- Flight management system controller
- Receive from sensors and transmit to actuators
- Assumptions:

Tasks are mapped on 5 cores Debug Support Unit is disabled Context switches are over-approximated constants

<sup>&</sup>lt;sup>1</sup> Pagetti et al., RTAS 2014

| Task       | Processor Demand (cycles) | Memory Demand (accesses) |  |  |
|------------|---------------------------|--------------------------|--|--|
| altitude   | 275                       | 22                       |  |  |
| az_filter  | 274                       | 22                       |  |  |
| h_filter   | 326                       | 24                       |  |  |
| va_control | 303                       | 24                       |  |  |
| va_filter  | 301                       | 23                       |  |  |
| vz_control | 320                       | 25                       |  |  |
| vz_filter  | 334                       | 25                       |  |  |

Table: Task profiles of the FMS controller

• Profile obtained from measurements

| Task       | Processor Demand (cycles) | Memory Demand (accesses) |  |  |
|------------|---------------------------|--------------------------|--|--|
| altitude   | 275                       | 22                       |  |  |
| az_filter  | 274                       | 22                       |  |  |
| h_filter   | 326                       | 24                       |  |  |
| va_control | 303                       | 24                       |  |  |
| va_filter  | 301                       | 23                       |  |  |
| vz_control | 320                       | 25                       |  |  |
| vz_filter  | 334                       | 25                       |  |  |

Table: Task profiles of the FMS controller

- Profile obtained from measurements
- Memory Demand: data and instruction cache misses + communications

| Task       | Processor Demand (cycles) | Memory Demand (accesses) |  |  |
|------------|---------------------------|--------------------------|--|--|
| altitude   | 275                       | 22                       |  |  |
| az_filter  | 274                       | 22                       |  |  |
| h_filter   | 326                       | 24                       |  |  |
| va_control | 303                       | 24                       |  |  |
| va_filter  | 301                       | 23                       |  |  |
| vz_control | 320                       | 25                       |  |  |
| vz_filter  | 334                       | 25                       |  |  |

Table: Task profiles of the FMS controller

- Profile obtained from measurements
- · Memory Demand: data and instruction cache misses + communications
- Moreover:
  - NoC Rx: writes 5 words
  - NoC Tx: reads 2 words

| Task       | Processor Demand (cycles) | Memory Demand (accesses) |  |  |
|------------|---------------------------|--------------------------|--|--|
| altitude   | 275                       | 22                       |  |  |
| az_filter  | 274                       | 22                       |  |  |
| h_filter   | 326                       | 24                       |  |  |
| va_control | 303                       | 24                       |  |  |
| va_filter  | 301                       | 23                       |  |  |
| vz_control | 320                       | 25                       |  |  |
| vz_filter  | 334                       | 25                       |  |  |

Table: Task profiles of the FMS controller

- Profile obtained from measurements
- · Memory Demand: data and instruction cache misses + communications
- Moreover:
  - NoC Rx: writes 5 words
  - *NoC Tx*: reads 2 words

Experiments: Find the smallest schedulable hyper-period



Smallest schedulable hyper-period









E5: All accesses interfere

E4, E3: We don't us the release dates

E2, E1: Our approach. We use the release dates

Taking into account the memory banks improves the analysis with a factor in [1.77,2.52]



Smallest schedulable hyper-period

|                 | E5/E1 | E5/E2 | E3/E1 | E4/E2 | E2/E1 | E4/E3 |
|-----------------|-------|-------|-------|-------|-------|-------|
| MPPA            | 4.15  | 4.12  | 1.68  | 1.29  | ~1.01 | 0.77  |
| RR              | 3.3   | 3.29  | 1.24  | 1.13  | ~1.01 | 0.91  |
| Speedup factors |       |       |       |       |       |       |











# Outline

- 1 Critical, Real-Time and Many-Core
- 2 Parallel code generation and analysis
- 3 Models Definition
- 4 Multicore Response Time Analysis of SDF Programs
- 5 Evaluation
- 6 Conclusion and Future Work
- 7 A new compilation team in Lyon (LIP) ?

# Conclusion

- Code generation and real-time analysis for many-core (Kalray MPPA 256)
   = major challenge for industry and research
- Hard Real-Time  $\Rightarrow$  simplicity, predictability  $\Rightarrow$  static, time-driven schedule
- Critical  $\Rightarrow$  traceability  $\Rightarrow$  no aggressive optimization

• Our work:

- Understand and model the precise architecture of MPPA
- Extension of Multi-Core Response Time Analysis
- Non-trivial proof of termination

• Model of the Resource Manager.



• Model of the Resource Manager.

tighter estimation of context switches and other interrupts



• Model of the Resource Manager.

tighter estimation of context switches and other interrupts

• Model of the NoC accesses.



Model of the Resource Manager.

• Model of the NoC accesses.



tighter estimation of

use the output of

Model of the Resource Manager.

- Model of the NoC accesses.
- Memory access pipelining.



tighter estimation of

 Model of the Resource Manager.
 Model of the NoC accesses.
 use the output of any NoC analysis
 current assumption: bus delay is 10 cycles

Memory access pipelining.



• Model of the Resource Manager.

Model of the NoC accesses.

current assumption: bus delay is 10 cycles

use the output of

tighter estimation of context switches and

other interrupts

- Memory access pipelining.
- Model Blocking and non-blocking accesses.



• Model of the Resource Manager.

• Model of the NoC accesses.

current assumption: bus delay is 10 cycles

use the output of

- Memory access pipelining.
- Model Blocking and non-blocking accesses.



reads are blocking writes are non-blocking

tighter estimation of context switches and

other interrupts



# Questions?

# Outline

- 1 Critical, Real-Time and Many-Core
- 2 Parallel code generation and analysis
- 3 Models Definition
- 4 Multicore Response Time Analysis of SDF Programs
- 5 Evaluation
- 6 Conclusion and Future Work
- 7 A new compilation team in Lyon (LIP) ?





#### BACKUP

Example: Fixed Priority bus arbiter, PE1 > PE0Bus access delay = 10



<sup>&</sup>lt;sup>1</sup>Altmeyer et al., RTNS 2015

Example: Fixed Priority bus arbiter, PE1 > PE0Bus access delay = 10



• Task of interest running on PE0:

 $R_0 = 10 + 3 \times 10$  (response time in isolation)

<sup>&</sup>lt;sup>1</sup>Altmeyer et al., RTNS 2015

Example: Fixed Priority bus arbiter, PE1 > PE0Bus access delay = 10



• Task of interest running on PE0:

 $R_0 = 10 + 3 \times 10 \text{ (response time in isolation)}$  $R_1 = 10 + 3 \times 10 + 2 \times 10 = 60$ 

<sup>&</sup>lt;sup>1</sup>Altmeyer et al., RTNS 2015

Example: Fixed Priority bus arbiter, PE1 > PE0Bus access delay = 10



• Task of interest running on PE0:

 $R_0 = 10 + 3 \times 10$  (response time in isolation)

 $R_1 = 10 + 3 \times 10 + 2 \times 10 = 60$ 

 $R_2 = 10 + 3 \times 10 + 2 \times 10 + 2 \times 10 = 80$ 

<sup>&</sup>lt;sup>1</sup>Altmeyer et al., RTNS 2015

Example: Fixed Priority bus arbiter, PE1 > PE0Bus access delay = 10



• Task of interest running on PE0:

 $R_0 = 10 + 3 \times 10$  (response time in isolation)

 $R_1 = 10 + 3 \times 10 + 2 \times 10 = 60$ 

 $R_2 = 10 + 3 \times 10 + 2 \times 10 + 2 \times 10 = 80$ 

 $R_3 = 10 + 3 \times 10 + 2 \times 10 + 2 \times 10 + 0 = 80$  (fixed-point)

<sup>&</sup>lt;sup>1</sup>Altmeyer et al., RTNS 2015

# The Global Picture

