Page 1 of results: 255 digital items found in 0.016 seconds

‣ Towards Compositional Hierarchical Scheduling Frameworks on Uniform Multiprocessors

Craveiro, João Pedro; Rufino, José
Source: Universidade de Lisboa  Publisher: Universidade de Lisboa
Type: Work in Progress
Published 28/12/2012  Portuguese
Search relevance: 37.628303%
In this report, we approach the problem of defining and analysing compositional hierarchical scheduling frameworks (HSFs) upon uniform multiprocessor platforms. For this we propose the uniform multiprocessor periodic resource (UMPR) model for a component interface. We extend previous work by fellow researchers (for dedicated uniform multiprocessors, and for compositional HSFs on identical multiprocessors) by providing mechanisms for the multiple aspects of compositional analysis: a sufficient test for local schedulability of sporadic task sets under global Earliest Deadline First (GEDF), and guidelines for the complex problem of selecting the virtual platform when abstracting a component. Finally, we present experimental results that provide evidence of the need for future developments within the realm of compositional HSFs on uniform multiprocessors.

‣ Evolution of an Operating System for Large-Scale Shared-Memory Multiprocessors

Scott, Michael L. (1959 - ); LeBlanc, Thomas J. ; Marsh, Brian D.
Source: University of Rochester. Computer Science Department.  Publisher: University of Rochester. Computer Science Department.
Type: Report
Portuguese
Search relevance: 27.327925%
Scalable shared-memory multiprocessors (those with non-uniform memory access times) are among the most flexible architectures for high-performance parallel computing, admitting efficient implementations of a wide range of process models, communication mechanisms, and granularities of parallelism. Such machines present opportunities for general-purpose parallel computing that cannot be exploited by existing operating systems, because the traditional approach to operating system design presents a virtual machine in which the definition of processes, communication, and grain size are outside the control of the user. Psyche is an operating system designed to enable the most effective use possible of large-scale shared-memory multiprocessors. The Psyche project is characterized by (1) a design that permits the implementation of multiple models of parallelism, both within and among applications, (2) the ability to trade protection for performance, with information sharing as the default, rather than the exception, (3) explicit, user-level control of process structure and scheduling, and (4) a kernel implementation that uses shared memory itself, and that provides users with the illusion of uniform memory access times. The PostScript here was reconstructed from old troff source...

‣ Memory Management for Large-Scale NUMA Multiprocessors

LeBlanc, Thomas J. ; Marsh, Brian D. ; Scott, Michael L. (1959 - )
Source: University of Rochester. Computer Science Department.  Publisher: University of Rochester. Computer Science Department.
Type: Report
Portuguese
Search relevance: 27.327925%
Large-scale shared-memory multiprocessors such as the BBN Butterfly and IBM RP3 introduce a new level in the memory hierarchy: multiple physical memories with different memory access times. An operating system for these NUMA (NonUniform Memory Access) multiprocessors should provide traditional virtual memory management, facilitate dynamic and widespread memory sharing, and minimize the apparent disparity between local and nonlocal memory. In addition, the implementation must be scalable to configurations with hundreds or thousands of processors. This paper describes memory management in the Psyche multiprocessor operating system, under development at the University of Rochester. The Psyche kernel manages a multi-level memory hierarchy consisting of local memory, nonlocal memory, and backing store. Local memory stores private data and serves as a cache for shared data; nonlocal memory stores shared data and serves as a disk cache. The system structure isolates the policies and mechanisms that manage different layers in the memory hierarchy, so that customized data structures and policies can be constructed for each layer. Local memory management policies are implemented using mechanisms that are independent of the architectural configuration; global policies are implemented using multiple processes that increase in number as the architecture scales. Psyche currently runs on the BBN Butterfly Plus multiprocessor. This paper describes an ambitious early design for a VM system for Psyche. Our plans for VM were scaled back substantially after implementation was about half completed...

‣ Using Processor Affinity in Loop Scheduling on Shared-Memory Multiprocessors

Markatos, Evangelos P. ; LeBlanc, Thomas J.
Source: University of Rochester. Computer Science Department.  Publisher: University of Rochester. Computer Science Department.
Type: Report
Portuguese
Search relevance: 27.506997%
Loops are the single largest source of parallelism in many applications. One way to exploit this parallelism, and thereby minimize program execution time, is to execute loop iterations in parallel on different processors. Traditional approaches to loop scheduling attempt to achieve the minimum completion time by distributing the workload as evenly as possible, while minimizing the number of synchronization operations required. In this paper we consider a third dimension to the problem of loop scheduling on shared-memory multiprocessors: communication overhead caused by accesses to non-local data. We show that traditional algorithms for loop scheduling, which ignore the location of data when assigning iterations to processors, incur a significant performance penalty on modern shared-memory multiprocessors. We propose a new loop scheduling algorithm that attempts to simultaneously balance the workload, minimize synchronization, and co-locate loop iterations with the necessary data. We compare the performance of this new algorithm to other known algorithms using five representative applications on a Silicon Graphics multiprocessor workstation, a BBN Butterfly, and a Sequent Symmetry, and show that the new algorithm offers substantial performance improvements...

‣ Alleviating Memory Contention in Matrix Computations on Large-Scale Shared-Memory Multiprocessors

Bianchini, Ricardo ; Crovella, Mark E. ; Kontothanassis, Leonidas I. ; LeBlanc, Thomas J.
Source: University of Rochester. Computer Science Department.  Publisher: University of Rochester. Computer Science Department.
Type: Report
Portuguese
Search relevance: 37.327925%
Memory contention can be a major source of overhead in large-scale shared-memory multiprocessors. Although there are many hardware solutions to the problem of memory contention, these solutions are often complex and expensive, so software solutions are an attractive alternative. This paper evaluates one particular software solution, called block-column allocation, which is very effective at reducing memory contention for a large class of SPMD (Single-Program-Multiple-Data) programs, and can be implemented easily by the compiler. We first quantify the impact of memory contention on performance by simulating the execution of several application kernels on a large-scale multiprocessor. Our simulation results confirm that memory contention is widespread on large-scale machines; our applications suggest that contention is usually caused by synchronized access to a range of addresses (rather than to a single address). We show that block-column allocation, where each range of addresses is divided into cache lines, and each cache line is allocated to a separate memory module, can nearly eliminate this source of memory contention. As our main contribution, we compare block-column allocation to row-major allocation (a common data allocation scheme) and logarithmic broadcasting (the standard software technique for alleviating memory contention). Our analysis demonstrates the clear superiority of block-column allocation over row-major allocation in the presence of memory contention. Our analysis also indicates that the choice between block-column allocation and logarithmic broadcasting is less clear...

‣ Can High Bandwidth and Latency Justify Large Cache Blocks in Scalable Multiprocessors?

LeBlanc, Thomas J. ; Bianchini, Ricardo
Source: University of Rochester. Computer Science Department.  Publisher: University of Rochester. Computer Science Department.
Type: Report
Portuguese
Search relevance: 37.506997%
An important architectural design decision affecting the performance of coherent caches in shared-memory multiprocessors is the choice of block size. There are two primary factors that influence this choice: the reference behavior of application programs and the remote access bandwidth and latency of the machine. Several studies have shown that increasing the block size can lower the miss rate and reduce the number of invalidations. However, increasing the block size can also increase the miss rate by, for example, increasing false sharing or the number of cache evictions. Large cache blocks can also generate network contention. Given that we anticipate enormous increases in both network bandwidth and latency in large-scale, shared-memory multiprocessors, the question arises as to what effect these increases will have on the choice of block size. We use analytical modeling and execution-driven simulation of parallel programs on a large-scale shared-memory machine to examine the relationship between cache block size and application performance as a function of remote access bandwidth and latency. We show that even under assumptions of high remote access bandwidth, the best application performance usually results from using cache blocks between 32 and 128 bytes in size. Using even larger blocks tends to increase the mean cost per reference...

‣ A Preliminary Evaluation of Cache-Miss-Initiated Prefetching Techniques in Scalable Multiprocessors

LeBlanc, Thomas J. ; Bianchini, Ricardo
Source: University of Rochester. Computer Science Department.  Publisher: University of Rochester. Computer Science Department.
Type: Report
Portuguese
Search relevance: 37.506997%
Prefetching is an important technique for reducing the average latency of memory accesses in scalable cache-coherent multiprocessors. Aggressive prefetching can significantly reduce the number of cache misses, but may introduce bursty network and memory traffic, and increase data sharing and cache pollution. Given that we anticipate enormous increases in both network bandwidth and latency, we examine whether aggressive prefetching triggered by a miss (cache-miss-initiated prefetching) can substantially improve the running time of parallel programs. Using execution-driven simulation of parallel programs on a scalable cache-coherent machine, we study the performance of three cache-miss-initiated prefetching techniques: large cache blocks, sequential prefetching, and hybrid prefetching. Large cache blocks (which fetch multiple words within a single block) and sequential prefetching (which fetches multiple consecutive blocks) are well-known prefetching strategies. Hybrid prefetching is a novel technique combining hardware and software support for stride-directed prefetching. Our simulation results show that large cache blocks rarely provide significant performance improvements; the improvement in the miss rate is often too small (or nonexistent) to offset a corresponding increase in the miss penalty. Our results also show that sequential and hybrid prefetching perform better than prefetching via large cache blocks...

‣ Eliminating Useless Messages in Write-Update Protocols on Scalable Multiprocessors

Bianchini, Ricardo ; LeBlanc, Thomas J. ; Veenstra, Jack E.
Source: University of Rochester. Computer Science Department.  Publisher: University of Rochester. Computer Science Department.
Type: Report
Portuguese
Search relevance: 37.327925%
Cache coherence protocols for shared-memory multiprocessors use invalidations or updates to maintain coherence across processors. Although invalidation protocols usually produce higher miss rates, update protocols typically perform worse overall. Detailed simulations of these two classes of protocol show that the excessive network traffic caused by update protocols significantly degrades performance, even with infinite bandwidth. Motivated by this observation, we categorize the coherence traffic in update-based protocols and show that, for most applications, more than 90% of all updates generated by the protocol are unnecessary. We identify application characteristics that generate useless update traffic, and compare the isolated and combined effects of several software and hardware techniques for eliminating useless updates. These techniques include dynamic and static hybrid protocols, false-sharing elimination strategies, and coalescing write buffers. Our simulations show that software caching (where coherence is managed under programmer or compiler control) and the dynamic hybrid protocol reduce useless updates the most, but coalescing write buffers produce fewer, albeit larger, coherence messages. As a result, coalescing write buffers usually produce the best running time...

‣ Memory Contention in Scalable Cache-Coherent Multiprocessors

Bianchini, Ricardo ; Crovella, Mark E. ; Kontothanassis, Leonidas I. ; LeBlanc, Thomas J.
Source: University of Rochester. Computer Science Department.  Publisher: University of Rochester. Computer Science Department.
Type: Report
Portuguese
Search relevance: 27.628303%
Effective use of large-scale multiprocessors requires the elimination of all bottlenecks that reduce processor utilization. One such bottleneck is memory contention. In this paper we show that memory contention occurs in many parallel applications, when those applications are run on large-scale shared-memory multiprocessors. In our simulations of several parallel applications on a large-scale machine, we observed that some applications exhibit near-perfect speedup on hundreds of processors when the effect of memory contention is ignored, and exhibit no speedup at all when memory contention is considered. As the number of processors is increased, many applications exhibit an increase in both the number of hot spots and in the degree of contention for each hot spot. In addition, we observed that hot spots are spread throughout memory for some applications, and that eliminating hot spots on an individual basis can cause other hot spots to worsen. These observations suggest that modern multiprocessors require some mechanism to alleviate hot-spot contention.
We evaluate the effectiveness of two different mechanisms for dealing with hot-spot contention in direct-connected, distributed-shared-memory multiprocessors: queueing requests at the memory module...

‣ Exploiting Bandwidth to Reduce Average Memory Access Time in Scalable Multiprocessors

Bianchini, Ricardo ; LeBlanc, Thomas J.
Source: University of Rochester. Computer Science Department.  Publisher: University of Rochester. Computer Science Department.
Type: Technical Report; Thesis
Portuguese
Search relevance: 37.327925%
Thesis (Ph. D.)--University of Rochester. Dept. of Computer Science, 1995. Simultaneously published in the Technical Report series.; The overhead of remote memory accesses is a major impediment to achieving good application performance on scalable shared-memory multiprocessors. This dissertation explores ways in which to exploit network and memory bandwidth in order to reduce the average cost of memory accesses. We consider scenarios in which (1) the remote access cost is dominated by contention, and (2) the hardware provides abundant bandwidth and the remote access time is dominated by the unsaturated request/access/reply sequence of operations. We introduce and evaluate two techniques for increasing the effective bandwidth available to processors, software interleaving and eager combining. We also evaluate strategies for hiding the high cost of remote accesses, including several forms of prefetching and update-based coherence protocols. We use both analytic models and detailed simulations of multiprocessor systems to quantify the effectiveness of these techniques, and to provide insight into the potential and limitations of exploiting bandwidth to reduce average memory access cost.

‣ Relative Performance of Preemption-Safe Locking and Non-Blocking Synchronization on Multiprogrammed Shared Memory Multiprocessors

Michael, Maged M. ; Scott, Michael L. (1959 - )
Source: IEEE  Publisher: IEEE
Type: Report
Portuguese
Search relevance: 37.327925%
Link to published version: http://ieeexplore.ieee.org/iel3/4440/12600/00580906.pdf?tp=&arnumber=580906&isnumber=12600; Most multiprocessors are multiprogrammed to achieve acceptable response time. Unfortunately, inopportune preemption may significantly degrade the performance of synchronized parallel applications. To address this problem, researchers have developed two principal strategies for concurrent, atomic update of shared data structures: (1) preemption-safe locking and (2) non-blocking (lock-free) algorithms. Preemption-safe locking requires kernel support. Non-blocking algorithms generally require a universal atomic primitive, and are widely regarded as inefficient. We present a comparison of the two alternative strategies, focusing on four simple but important concurrent data structures---stacks, FIFO queues, priority queues and counters---in micro-benchmarks and real applications on a 12-processor SGI Challenge multiprocessor. Our results indicate that data-structure-specific non-blocking algorithms, which exist for stacks, FIFO queues and counters, can work extremely well: not only do they outperform preemption-safe lock-based algorithms on multiprogrammed machines, they also outperform ordinary locks on dedicated machines. At the same time...

‣ Implementation of Atomic Primitives on Distributed Shared Memory Multiprocessors

Michael, Maged M. ; Scott, Michael L. (1959 - )
Source: IEEE  Publisher: IEEE
Type: Report
Portuguese
Search relevance: 37.327925%
Link to published version: http://ieeexplore.ieee.org/iel2/3040/8763/00386540.pdf?tp=&arnumber=386540&isnumber=8763; In this paper we consider several hardware implementations of the general-purpose atomic primitives fetch_and_Phi, compare_and_swap, load_linked, and store_conditional on large-scale shared-memory multiprocessors. These primitives have proven popular on small-scale bus-based machines, but have yet to become widely available on large-scale, distributed shared memory machines. We propose several alternative hardware implementations of these primitives, and then analyze the performance of these implementations for various data sharing patterns. Our results indicate that good overall performance can be obtained by implementing compare_and_swap in the cache controllers, and by providing an additional instruction to load an exclusive copy of a cache line.

‣ On the design of a high-performance adaptive router for CC-NUMA multiprocessors

Puente, V.; Gregorio, J.; Beivide, R.; Izu, M.
Source: IEEE-Inst Electrical Electronics Engineers Inc  Publisher: IEEE-Inst Electrical Electronics Engineers Inc
Type: Journal Article
Published 2003  Portuguese
Search relevance: 37.327925%
This work presents the design and evaluation of an adaptive packet router aimed at supporting CC-NUMA traffic. We exploit a simple and efficient packet injection mechanism to avoid deadlock, which leads to fully adaptive routing using only three virtual channels. In addition, we selectively use output buffers for implementing the most utilized virtual paths in order to reduce head-of-line blocking. The careful implementation of these features has resulted in a good trade-off between network performance and hardware cost. The outcome of this research is a High-Performance Adaptive Router (HPAR), which adequately balances the needs of parallel applications: minimal network latency at low loads and high throughput at heavy loads. The paper includes an evaluation process in which HPAR is compared with other adaptive routers using FIFO input buffering, with or without additional virtual channels to reduce head-of-line blocking. This evaluation considers both the VLSI costs of each router and their performance under synthetic and real application workloads. To make the comparison fair, all the routers use the same efficient deadlock avoidance mechanism. In all the experiments, HPAR exhibited the best response among all the routers tested. The throughput gains ranged from 10 percent to 40 percent with respect to its most direct rival...

‣ The Impact of Instruction-Level Parallelism on Multiprocessor Performance and Simulation Methodology

Pai, Vijay S.
Source: Rice University  Publisher: Rice University
Type: Doctoral Thesis
Portuguese
Search relevance: 27.506997%
Master's Thesis; Current microprocessors exploit high levels of instruction-level parallelism (ILP). This thesis presents the first detailed analysis of the impact of such processors on shared-memory multiprocessors. We find that ILP techniques substantially reduce CPU time in multiprocessors, but are less effective in reducing memory stall time for our applications. Consequently, despite the latency-tolerating techniques incorporated in ILP processors, memory stall time becomes a large component of execution time and parallel efficiencies are generally poorer in our ILP-based multiprocessor than in an otherwise equivalent previous-generation multiprocessor. We identify clustering independent read misses together in the processor instruction window as a key optimization to exploit the ILP features of current processors. We also use the above analysis to examine the validity of direct-execution simulators with previous-generation processor models to approximate ILP-based multiprocessors. We find that, with appropriate approximations, such simulators can reasonably characterize the behavior of applications with poor overlap of read misses. However, they can be highly inaccurate for applications with high overlap of read misses.

‣ The Impact of Exploiting Instruction-Level Parallelism on Shared-Memory Multiprocessors

Pai, Vijay S.; Ranganathan, Parthasarathy; Abdel-Shafi, Hazim; Adve, Sarita V.
Source: Rice University  Publisher: Rice University
Type: Journal Article
Portuguese
Search relevance: 37.715913%
Journal Paper; Current microprocessors incorporate techniques to aggressively exploit instruction-level parallelism (ILP). This paper evaluates the impact of such processors on the performance of shared-memory multiprocessors, both without and with the latency-hiding optimization of software prefetching. Our results show that, while ILP techniques substantially reduce CPU time in multiprocessors, they are less effective in removing memory stall time. Consequently, despite the inherent latency tolerance features of ILP processors, we find memory system performance to be a larger bottleneck and parallel efficiencies to be generally poorer in ILP-based multiprocessors than in previous-generation multiprocessors. The main reasons for these deficiencies are insufficient opportunities in the applications to overlap multiple load misses and increased contention for resources in the system. We also find that software prefetching does not change the memory-bound nature of most of our applications on our ILP multiprocessor, mainly due to a large number of late prefetches and resource contention. Our results suggest the need for additional latency hiding or reducing techniques for ILP systems, such as software clustering of load misses and producer-initiated communication.

‣ A Customized MVA Model for ILP Multiprocessors

Sorin, Daniel J.; Vernon, Mary K.; Pai, Vijay S.; Adve, Sarita V.; Wood, David A.
Source: Rice University  Publisher: Rice University
Type: Report
Portuguese
Search relevance: 37.327925%
Tech Report; This paper provides the customized MVA equations for an analytical model for evaluating architectural alternatives for shared-memory multiprocessors with processors that aggressively exploit instruction-level parallelism (ILP). Compared to simulation, the analytical model is many orders of magnitude faster to solve, yielding highly accurate system performance estimates in seconds.

‣ Fine-grain producer-initiated communication in cache-coherent multiprocessors

Abdel-Shafi, Hazim Mustafa
Source: Rice University  Publisher: Rice University
Portuguese
Search relevance: 27.715913%
Shared-memory multiprocessors are becoming increasingly popular as a high-performance, easy to program, and relatively inexpensive choice for parallel computation. However, the performance of shared-memory multiprocessors is limited by memory latency. Memory latencies are higher in multiprocessors due to physical constraints and cache coherence overheads. In addition, synchronization operations, which are necessary to ensure correctness in parallel programs, add further communication overhead in shared-memory multiprocessors. Software-controlled non-binding data prefetching is a widely used consumer-initiated mechanism to hide communication latency and is currently supported on most architectures. However, on an invalidation-based cache-coherent multiprocessor, prefetching is inapplicable or insufficient for some communication patterns such as irregular communication, fine-grain pipelined loops, and synchronization. For these cases, a combination of two fine-grain, producer-initiated primitives (referred to as remote writes) is better able to reduce the latency of communication. This work demonstrates experimentally that remote writes provide significant performance benefits in cache-coherent shared-memory multiprocessors both with and without prefetching. Further...

‣ New scalable cache coherence protocols for on-chip multiprocessors; Nuevos protocolos de coherencia escalables para multiprocesadores en chip

Gregorio Menezo, Lucía
Source: Universidad de Cantabria  Publisher: Universidad de Cantabria
Type: Doctoral Thesis
Portuguese
Search relevance: 37.327925%
ABSTRACT: This thesis analyses the problems associated with cache coherence in chip multiprocessors (CMPs) and introduces two new hardware-based coherence protocols. Both proposals aim to mitigate the cost imposed by the need for complex on-chip memory hierarchies, which seek to overcome the memory bandwidth limitation (the bandwidth wall). On the one hand, targeting multicore systems with a few tens of processors on chip, we propose LOCKE, a broadcast-based coherence protocol focused on improving the reactivity of the on-chip memory hierarchy. On the other hand, for future large-scale CMPs that will include hundreds or thousands of processors, we propose MOSAIC, a scalable hybrid broadcast-directory protocol that significantly reduces the cost of maintaining hardware coherence.

‣ Implementing multicore real-time scheduling algorithms based on task splitting using ada 2012

Andersson, Björn; Pinho, Luis Miguel
Source: Springer  Publisher: Springer
Type: Book Chapter
Published 2010  Portuguese
Search relevance: 27.506997%
Multiprocessors, particularly in the form of multicores, are becoming standard building blocks for executing reliable software. But their use for applications with hard real-time requirements is non-trivial. Well-known real-time scheduling algorithms in the uniprocessor context (Rate-Monotonic [1] or Earliest-Deadline-First [1]) do not perform well on multiprocessors. For this reason the scientific community in the area of real-time systems has produced new algorithms specifically for multiprocessors. Meanwhile, a proposal [2] exists for extending the Ada language with new basic constructs which can be used for implementing new real-time scheduling algorithms; the family of task-splitting algorithms, emphasized in the proposal [2], is one of them. Consequently, assessing whether existing task-splitting multiprocessor scheduling algorithms can be implemented with these constructs is paramount. In this paper we present a list of state-of-the-art task-splitting multiprocessor scheduling algorithms and, for each of them, we present detailed Ada code that uses the new constructs.

‣ Athapascan-0 : exploitation de la multiprogrammation légère sur grappes de multiprocesseurs

Carissimi, Alexandre da Silva
Source: Universidade Federal do Rio Grande do Sul  Publisher: Universidade Federal do Rio Grande do Sul
Type: Doctoral Thesis  Format: application/pdf
Portuguese
Search relevance: 27.715913%
The increasing efficiency of interconnection networks and the widespread availability of multiprocessor machines make it possible to build low-cost distributed-memory parallel machines: clusters of multiprocessors. These require exploiting both the fine-grain parallelism internal to a multiprocessor, offered by multithreading, and the coarse-grain parallelism between the different multiprocessors. Exploiting both kinds of parallelism simultaneously demands a method of communication between threads that do not share the same address space. This thesis addresses the problem of integrating multithreading and communication on clusters of symmetric multiprocessors (SMPs). More precisely, it concerns the evaluation and tuning of the ATHAPASCAN-0 runtime kernel on this type of architecture. ATHAPASCAN-0 is a portable runtime kernel, developed within the APACHE project (CNRS-INPG-INRIA-UJF), that combines multithreading with message-passing communication. Portability is ensured by a layered organization based on the widely adopted POSIX threads and MPI standards. ATHAPASCAN-0 extends the static network model of communicating "heavyweight" processes such as MPI...