cuda - How many 8-bit operations can be performed on a 32-bit ALU of a GPU in one cycle if the IPC is 1?
Can it perform four 8-bit operations (SIMD operations) per cycle, or only one? Conventionally the higher bits are made zero: the 8-bit value is treated as a 32-bit word with the higher bits set to 0 in order to perform such an operation. Is there any hardware feature available at present in processors by which a larger number of lower-bit operations can be performed per cycle (especially in NVIDIA GPUs)?
AFAIK there aren't any arithmetic instructions on the GPU that "can be performed on a 32-bit ALU of a GPU in one cycle". The arithmetic functional units on the GPU are pipelined, resulting in latencies of around 5-25 clock cycles. A unit can have a new operation issued per clock, and can retire an operation per clock, but it cannot perform an operation "in one cycle".
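To make the latency versus throughput distinction concrete, here is a rough model. It assumes an idealized pipeline that accepts one new independent operation per clock and never stalls; the symbols L (latency in cycles) and N (number of independent operations) are introduced here for illustration only.

```latex
T(N) = L + (N - 1), \qquad \frac{T(N)}{N} \longrightarrow 1 \ \text{cycle per operation as } N \to \infty
```

For instance, with L = 10 and N = 1000 this gives T = 1009 cycles, or about 1.009 cycles per operation, even though no single operation completes in fewer than 10 cycles.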
The GPU has SIMD vector intrinsics, some of which are similar to what you are describing. The throughput of these will vary by the specific GPU type and the specific operation type.
So, for example, the throughput, on Kepler, of the vabsdiff4 SIMD intrinsic (which performs four 8-bit arithmetic operations on a 4-byte vector quantity packed into a 32-bit word) should be approximately the same as the throughput of a 32-bit integer operation (add, subtract, etc.). Other SIMD intrinsics may have lower throughputs.
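As a minimal sketch of what such a packed operation looks like in code, the following uses `__vabsdiffu4`, the unsigned variant of the vabsdiff4 family in CUDA's SIMD intrinsics. The kernel name, the host helper `absdiff4_on_gpu` and the reference function `absdiff4_reference` are hypothetical names made up for this illustration, and error checking is omitted for brevity. On architectures without native support the intrinsic may be emulated with several instructions, so this shows the semantics, not a throughput guarantee.

```cuda
#include <cuda_runtime.h>

// Host-side reference (hypothetical helper): per-byte unsigned absolute
// difference of two 32-bit words, each holding four packed 8-bit lanes.
unsigned int absdiff4_reference(unsigned int a, unsigned int b)
{
    unsigned int result = 0;
    for (int lane = 0; lane < 4; ++lane) {
        unsigned int x = (a >> (8 * lane)) & 0xFFu;
        unsigned int y = (b >> (8 * lane)) & 0xFFu;
        unsigned int d = (x > y) ? (x - y) : (y - x);
        result |= d << (8 * lane);
    }
    return result;
}

// Device side: one intrinsic call handles all four 8-bit lanes at once.
__global__ void absdiff4_kernel(unsigned int a, unsigned int b, unsigned int *out)
{
    *out = __vabsdiffu4(a, b);
}

// Hypothetical host wrapper: launches a single thread and copies the
// result back. Error checking is omitted to keep the sketch short.
unsigned int absdiff4_on_gpu(unsigned int a, unsigned int b)
{
    unsigned int *d_out = nullptr;
    unsigned int h_out = 0;
    cudaMalloc(&d_out, sizeof(unsigned int));
    absdiff4_kernel<<<1, 1>>>(a, b, d_out);
    // cudaMemcpy on the default stream waits for the kernel to finish.
    cudaMemcpy(&h_out, d_out, sizeof(unsigned int), cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    return h_out;
}
```

Calling `absdiff4_on_gpu(0x10FF0380u, 0x20010580u)` should match `absdiff4_reference` for the same inputs, since both compute |0x10-0x20|, |0xFF-0x01|, |0x03-0x05| and |0x80-0x80| lane by lane. Measuring actual throughput would require a benchmark on the specific GPU in question.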