Radio telescopes typically consist of multiple receivers whose signals are cross-correlated to filter out noise. A recent trend is to correlate in software instead of custom-built hardware, taking advantage of the flexibility that software solutions offer. Examples include e-VLBI and LOFAR. However, the data rates are usually high and the processing requirements challenging. Many-core processors are promising devices to provide the required processing power. In this paper, we explain how to implement and optimize signal-processing applications on multi-core CPUs and manycore architectures, such as the Intel Core i7, NVIDIA and ATI GPUs, and the Cell/B.E. We use correlation as a running example. The correlator is a streaming, possibly real-time application, and is much more I/O intensive than applications that are typically implemented on many-core hardware today. We compare with the LOFAR production correlator on an IBM Blue Gene/P supercomputer. We discuss several important architectural problems which cause architectures to perform suboptimally, and also deal with programmability. The correlator on the Blue Gene/P achieves a superb 96% of the theoretical peak performance. We show that the processing power and memory bandwidth of current GPUs are highly imbalanced. Because of this, the correlator achieves only 16% of the peak on ATI GPUs, and 32% on NVIDIA GPUs. The Cell/B.E. processor, in contrast, achieves an excellent 92%. Many of the insights we discuss here are not only applicable to telescope correlators, are valuable when developing signal-processing applications in general. © 2006 IEEE.