Algorithm Leadership, by John Gustafson in 2007. He suggests that algorithm designers can no longer expect an architecture with a balanced flop-per-byte ratio. Instead, they should redesign their algorithms and data structures to live with the available architecture. A stunning example is that, on modern architectures, sparse data structures and sparse algorithms are an order of magnitude slower than dense data structures and dense algorithms. Crunching zeros is much cheaper than irregular data access on modern architectures.
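To make the dense-versus-sparse contrast concrete, here is a minimal sketch of the two matrix-vector products in plain Python. The function names and the small test matrix are my own illustration, not taken from Gustafson's article. The point is the access pattern: the dense version walks memory in order and multiplies the zeros, while the CSR version skips the zeros but pays for an indirect lookup through col_idx on every element.

```python
def dense_matvec(a, x):
    """Dense product: visits every entry in row order, zeros included."""
    return [sum(a_ij * x_j for a_ij, x_j in zip(row, x)) for row in a]


def to_csr(a):
    """Convert a dense matrix (list of rows) to CSR arrays."""
    values, col_idx, row_ptr = [], [], [0]
    for row in a:
        for j, v in enumerate(row):
            if v != 0:
                values.append(v)
                col_idx.append(j)
        row_ptr.append(len(values))  # end of this row in values/col_idx
    return values, col_idx, row_ptr


def csr_matvec(values, col_idx, row_ptr, x):
    """Sparse product: fewer multiplies, but x is read through col_idx."""
    y = []
    for i in range(len(row_ptr) - 1):
        acc = 0
        for k in range(row_ptr[i], row_ptr[i + 1]):
            acc += values[k] * x[col_idx[k]]  # data-dependent, irregular load
        y.append(acc)
    return y


# Hypothetical 4x4 example with 6 nonzeros out of 16 entries.
A = [[4, 0, 0, 1],
     [0, 3, 0, 0],
     [0, 0, 0, 2],
     [5, 0, 6, 0]]
x = [1, 2, 3, 4]

y_dense = dense_matvec(A, x)
y_sparse = csr_matvec(*to_csr(A), x)
```

Both routines return the same vector. The sketch says nothing about speed in Python itself; it only shows where the irregular access comes from.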
David Patterson has two recent posts in the ARM Blog:
The benefit of binary compatibility is predicted to be overshadowed by its penalty in the post-PC era.
Specialised Imaging offers cameras that shoot at one billion frames per second. Although they can capture only a limited number of frames per shot and store the images locally, designing such a system is still mind-blowing. Wikipedia has two relevant entries: high speed photography and high-speed camera.
Top 10 Predictions of FCCM. There are more misses than hits. Research is a risky business. Negative predictions are more likely to hit. Several open problems are simply too hard to solve.
Death of the RLOC?, a paper by Satnam Singh, in FPGA 2000. The conclusion is that RLOC is still way better than automatic place and route in many cases. Ten years later, he re-evaluates the death of RLOC in FPGA 2011, with a short paper titled “The RLOC is Dead - Long Live the RLOC”. According to a longer abstract of the latter paper, the RLOC is not dead yet.
Object Detection with Discriminatively Trained Part Based Models, in TPAMI 2010. The MATLAB code is available at the project website. Thanks to Tomasz Malisiewicz for the pointer.
The Intel Science and Technology Center (ISTC) for Visual Computing at Stanford (http://visual.stanford.edu/) was recently founded. It re-brands the Computer Graphics Lab (http://graphics.stanford.edu/), replacing “graphics” with “visual”. The term “visual computing”, in a general sense, includes both computer vision and computer graphics.
Multiframe Auto White Balance, in Signal Processing Letters, March 2011. An intuitive way to extend current auto white balancing (AWB).
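For readers unfamiliar with AWB, here is a rough sketch of the classic gray-world method, plus one naive way to use several frames: pool the pixels of all frames before estimating the channel gains. The pooling step is purely my assumption for illustration and is not the method of the letter. All function names are hypothetical, pixels are (R, G, B) tuples, and every channel is assumed to have a nonzero mean.

```python
def gray_world_gains(pixels):
    """Per-channel gains that pull the mean color toward neutral gray."""
    n = len(pixels)
    means = [sum(p[c] for p in pixels) / n for c in range(3)]
    gray = sum(means) / 3
    return [gray / m for m in means]  # assumes no channel mean is zero


def multiframe_gains(frames):
    """Naive multi-frame variant (assumption): pool all frames' pixels."""
    pooled = [p for frame in frames for p in frame]
    return gray_world_gains(pooled)


def apply_gains(pixels, gains):
    """Scale each channel by its gain, clipping to the 8-bit range."""
    return [tuple(min(255.0, p[c] * gains[c]) for c in range(3))
            for p in pixels]


# Two tiny hypothetical frames with different color casts.
frames = [[(120, 60, 60)], [(60, 120, 60)]]
gains = multiframe_gains(frames)
balanced = [apply_gains(frame, gains) for frame in frames]
```

With the pooled statistics, a single frame dominated by one color has less influence on the gains than it would in per-frame gray-world balancing.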
Large Displacement Optical Flow: Descriptor Matching in Variational Motion Estimation, in TPAMI March 2011. It uses the SIFT descriptor for optical flow.
Some interesting papers in Field Programmable Logic and Applications (FPL) 2010:
- Survey of New Trends in Industry for Programmable Hardware: FPGAs, MPPAs, MPSoCs, Structured ASICs, eFPGAs and New Wave of Innovation in FPGAs
- ERCBench: An Open-Source Benchmark Suite for Embedded and Reconfigurable Computing, benchmark suite available at the ERCBench website.
- OpenRCL: Low-Power High-Performance Computing with Reconfigurable Devices, from the research group of John Wawrzynek.
- A Comparison of Hardware Acceleration Interfaces in a Customizable Soft Core Processor, the peripheral extension approach exhibits a 10x improvement in performance per area compared with the pipeline extension approach.
- Early Prediction of Hardware Complexity in HLL-to-HDL Translation, built on top of LLVM. It reports better estimation than the MOLEN framework of Delft.
- Dynamically Reconfigurable Vision-Chip Architecture, the processing is limited to per-pixel local operations.
- Field Programmable Gate Array Implementation of Parts-Based Object Detection for Real Time Video Applications. It reports 300 fps for VGA images, but does not mention how to stream the images to the FPGA at such a high rate.
- General Purpose Computing with Reconfigurable Acceleration, from Delft. It discusses possible interfaces between CPU and FPGA at both the hardware and software levels. However, we do not actually need a full-blown off-chip CPU to achieve GPFPGA. The Virtex 7 or 8 will have an embedded ARM core anyway.
- Fast and Low-Memory-Bandwidth Architecture of SIFT Descriptor Generation with Scalability on Speed and Accuracy for VGA Video, running at 30 fps on a Cyclone II FPGA, utilizing less logic but more DSP blocks compared to previous work. In the introduction, it shows some nice applications based on SIFT, e.g. object recognition, visual SLAM, 3D matching and reconstruction, tracking by SIFT flow.
- IP Based Configurable SIMD Massively Parallel SoC, a framework supporting SIMD with configurable interconnections.
- Control Techniques for Coupling a Coarse-Grain Reconfigurable Array with a Generic RISC Core, from Tampere. The accelerator is largely decoupled from the control processor, which is not the case for ADRES and MOLEN.
- A Scalable, High-Performance Motion Estimation Application for a Weakly-Programmable FPGA Architecture, based on the Flexible Weakly-programmable Advanced Film Engine (FlexWAFE) architecture from TU Braunschweig.
- GPU Versus FPGA for High Productivity Computing, the Convey HC-1 is compared against the NVIDIA GTX285. The authors predict that FPGA-based high performance computers will be increasingly marginalised.
- Customized Exposed Datapath Soft-Core Design Flow with Compiler Support, based on the Transport Triggered Architecture.
Application development with the FlexWAFE real-time stream processing architecture for FPGAs, in TECS 2009.
Empowering Visual Categorization With the GPU, from the Intelligent Systems Lab Amsterdam (ISLA), in IEEE Transactions on Multimedia Feb 2011. It sets a high benchmark for future research. More importantly, the analysis is comprehensive. It also makes thorough comparisons with related work.
Review: Kd-tree Traversal Algorithms for Ray Tracing, to appear in Computer Graphics Forum. Some algorithms are motivated by the SIMD/vector architecture.
The proceedings of Microarchitecture 2010 have been released. Some interesting papers are:
- Thread Cluster Memory Scheduling: Exploiting Differences in Memory Access Behavior, a scheduling algorithm that takes both system throughput and fairness into account. The scheduler classifies threads into two types: latency-sensitive and bandwidth-sensitive.
- Task Superscalar: An Out-of-Order Task Pipeline, it opts for implicit specification of task level parallelism.
- Many-Thread Aware Prefetching Mechanisms for GPGPU Applications, improving SW/HW prefetching in the context of SIMT architecture.
- Single-Chip Heterogeneous Computing: Does the Future Include Custom Logic, FPGAs, and GPGPUs?, a grand model was built for the exploration. The model may over-simplify the architecture details.
- Improving SIMT Efficiency of Global Rendering Algorithms with Architectural Support for Dynamic Micro-Kernels, it is all about handling branches.
- Register Cache System Not for Latency Reduction Purpose, but for area and energy reduction.
- Erasing Core Boundaries for Robust and Configurable Performance, from U Mich.
- Minimal Multi-threading: Finding and Removing Redundant Instructions in Multi-threaded Processors, compacting the execution of identical control flows in SPMD.
- Understanding the Energy Consumption of Dynamic Random Access Memories, from Rambus.
- Throughput-Effective On-Chip Networks for Manycore Accelerators, utilizing the characteristics of the Bulk Synchronous Parallelism (BSP) model to design an efficient NoC.
- ReMAP: A Reconfigurable Heterogeneous Multicore Architecture, supporting point-to-point communication for pipeline parallelism and barrier synchronization.
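As a side note on the Thread Cluster Memory Scheduling entry in the list above, the two-way thread classification can be sketched roughly as follows. This is a simplified illustration under my own assumptions, not the paper's exact algorithm: threads are sorted by memory intensity (misses per kilo-instruction, here mpki), and the least intensive ones go into the latency-sensitive cluster until their combined bandwidth use would exceed a fraction cluster_thresh of the total. The function name, parameter names and sample numbers are all hypothetical.

```python
def cluster_threads(mpki, bandwidth, cluster_thresh=0.2):
    """Split thread ids into (latency_sensitive, bandwidth_sensitive).

    mpki:      thread id -> misses per kilo-instruction (memory intensity)
    bandwidth: thread id -> bandwidth used in the last interval
    """
    budget = cluster_thresh * sum(bandwidth.values())
    latency_sensitive, bandwidth_sensitive = [], []
    used = 0.0
    # Visit threads from least to most memory-intensive.
    for tid in sorted(mpki, key=mpki.get):
        fits = used + bandwidth[tid] <= budget
        if fits and not bandwidth_sensitive:
            latency_sensitive.append(tid)
            used += bandwidth[tid]
        else:
            # Once one thread overflows the budget, the rest follow it.
            bandwidth_sensitive.append(tid)
    return latency_sensitive, bandwidth_sensitive


# Hypothetical measurements for four threads.
mpki = {"a": 0.5, "b": 1.0, "c": 20.0, "d": 35.0}
bandwidth = {"a": 1.0, "b": 2.0, "c": 40.0, "d": 57.0}
clusters = cluster_threads(mpki, bandwidth)
```

A scheduler built on this split could then prioritize the latency-sensitive cluster for throughput and apply a fairness policy within the bandwidth-sensitive one, which is the general idea the entry describes.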