parallel removal of elements from an array

Unless you have a compelling reason to roll your own implementation, I recommend you just use Thrust's remove_if(). Thrust is modeled on the STL, and if your requirements for generality are similar, you will wind up writing code that looks very similar to the Thrust source code.

If the performance of Thrust isn't satisfactory, the Thrust community (including the principal authors) might have good suggestions on how to formulate your code for better performance.

Failing that, if you have a vertical application and Thrust isn't fast enough, roll a scan-based implementation as a last resort. The one-line summary of the algorithm: compute a flag per element that is the inverse of the removal predicate (1 to keep, 0 to remove), then do a parallel exclusive prefix sum ("scan") over those flags. For each element you want to keep, the corresponding element of the scan is its output index.
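To make the scan-based idea concrete, here is a minimal sequential C++ sketch of the three steps (flag, exclusive scan, scatter). It is not Thrust's implementation and it is not GPU code. The name compact_by_scan and its signature are hypothetical, chosen only for illustration, and it assumes the element type is default-constructible. On a GPU, each loop would become a parallel kernel or library call, with the scan being the only step that needs cross-thread cooperation.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper (illustration only): returns the elements of `in`
// for which remove_pred is false, preserving their original order.
template <typename T, typename Pred>
std::vector<T> compact_by_scan(const std::vector<T>& in, Pred remove_pred) {
    const std::size_t n = in.size();

    // Step 1: keep flag = inverse of the removal predicate (1 = keep, 0 = remove).
    // Each iteration is independent, so this maps to one thread per element.
    std::vector<std::size_t> keep(n);
    for (std::size_t i = 0; i < n; ++i) {
        keep[i] = remove_pred(in[i]) ? 0 : 1;
    }

    // Step 2: exclusive prefix sum of the flags.
    // idx[i] = number of kept elements before position i = output index of i.
    // Written here as a serial loop; on a GPU this is the parallel scan.
    std::vector<std::size_t> idx(n);
    std::size_t running = 0;
    for (std::size_t i = 0; i < n; ++i) {
        idx[i] = running;
        running += keep[i];
    }

    // Step 3: scatter. Every kept element writes itself to its scanned index.
    // The indices of kept elements are distinct, so the writes never collide.
    std::vector<T> out(running);  // running now holds the total kept count
    for (std::size_t i = 0; i < n; ++i) {
        if (keep[i]) {
            out[idx[i]] = in[i];
        }
    }
    return out;
}
```

For example, calling compact_by_scan on {1, 2, 3, 4, 5, 6} with a predicate that removes even values produces flags {1, 0, 1, 0, 1, 0} and scan {0, 1, 1, 2, 2, 3}, so 1, 3 and 5 land at output positions 0, 1 and 2.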