Clusters of workstations are a popular platform for high-performance computing. For many parallel applications, efficient use of a fast interconnection network is essential for good performance. Several modern System Area Networks include programmable network interfaces that can be tailored to perform protocol tasks that otherwise would need to be done by the host processors. Finding the right trade-off between protocol processing at the host and the network interface is difficult in general. In this work, we systematically evaluate the performance of different implementations of a single, user-level communication interface. The implementations make different architectural assumptions about the reliability of the network and the capabilities of the network interface. The implementations differ accordingly in their division of protocol tasks between host software, network-interface firmware, and network hardware. Also, we investigate the effects of alternative data-transfer methods and multicast implementations, and we evaluate the influence of packet size. Using microbenchmarks, parallel-programming systems, and parallel applications, we assess the performance of the different implementations at multiple levels. We use two hardware platforms with different performance characteristics to validate our conclusions. We show how moving protocol tasks to a relatively slow network interface can yield both performance advantages and disadvantages, depending on specific characteristics of the application and the underlying parallel-programming system.