nvprof output for gtx 460

$ nvprof --query-events
======== Available Events:
                            Name   Description
Device 0:
        Domain domain_a:
                 sm_cta_launched:  Number of thread blocks launched on a multiprocessor.

               l1_local_load_hit:  Number of cache lines that hit in L1 cache for local memory load accesses. In case of perfect coalescing this increments by 1, 2, and 4 for 32, 64 and 128 bit accesses by a warp respectively.

              l1_local_load_miss:   Number of cache lines that miss in L1 cache for local memory load accesses. In case of perfect coalescing this increments by 1, 2, and 4 for 32, 64 and 128 bit accesses by a warp respectively.

              l1_local_store_hit:  Number of cache lines that hit in L1 cache for local memory store accesses. In case of perfect coalescing this increments by 1, 2, and 4 for 32, 64 and 128 bit accesses by a warp respectively.

             l1_local_store_miss:  Number of cache lines that miss in L1 cache for local memory store accesses. In case of perfect coalescing this increments by 1, 2, and 4 for 32, 64 and 128 bit accesses by a warp respectively.

              l1_global_load_hit:  Number of cache lines that hit in L1 cache for global memory load accesses. In case of perfect coalescing this increments by 1, 2, and 4 for 32, 64 and 128 bit accesses by a warp respectively.

             l1_global_load_miss:  Number of cache lines that miss in L1 cache for global memory load accesses. In case of perfect coalescing this increments by 1, 2, and 4 for 32, 64 and 128 bit accesses by a warp respectively.

uncached_global_load_transaction:  Number of uncached global load transactions. Increments by 1 per transaction. Transaction can be 32/64/96/128B.

        global_store_transaction:  Number of global store transactions. Increments by 1 per transaction. Transaction can be 32/64/96/128B.

         l1_shared_bank_conflict:  Number of shared bank conflicts caused due to addresses for two or more shared memory requests fall in the same memory bank. Increments by N-1 and 2*(N-1) for a N-way conflict for 32 bit and 64bit shared memory accesses respectively.

       tex0_cache_sector_queries:  Number of texture cache requests. This increments by 1 for each 32-byte access.

        tex0_cache_sector_misses:  Number of texture cache misses. This increments by 1 for each 32-byte access.

       tex1_cache_sector_queries:  Number of texture cache 1 requests. This increments by 1 for each 32-byte access.

        tex1_cache_sector_misses:  Number of texture cache 1 misses. This increments by 1 for each 32-byte access.

        Domain domain_b:
           fb_subp0_read_sectors:  Number of DRAM read requests to sub partition 0, increments by 1 for 32 byte access.

           fb_subp1_read_sectors:  Number of DRAM read requests to sub partition 1, increments by 1 for 32 byte access.

          fb_subp0_write_sectors:  Number of DRAM write requests to sub partition 0, increments by 1 for 32 byte access.

          fb_subp1_write_sectors:  Number of DRAM write requests to sub partition 1, increments by 1 for 32 byte access.

    l2_subp0_write_sector_misses:  Number of write misses in slice 0 of L2 cache. This increments by 1 for each 32-byte access.

    l2_subp1_write_sector_misses:  Number of write misses in slice 1 of L2 cache. This increments by 1 for each 32-byte access.

     l2_subp0_read_sector_misses:  Number of read misses in slice 0 of L2 cache. This increments by 1 for each 32-byte access.

     l2_subp1_read_sector_misses:  Number of read misses in slice 1 of L2 cache. This increments by 1 for each 32-byte access.

   l2_subp0_write_sector_queries:  Number of write requests from L1 to slice 0 of L2 cache. This increments by 1 for each 32-byte access.

   l2_subp1_write_sector_queries:  Number of write requests from L1 to slice 1 of L2 cache. This increments by 1 for each 32-byte access.

    l2_subp0_read_sector_queries:  Number of read requests from L1 to slice 0 of L2 cache. This increments by 1 for each 32-byte access.

    l2_subp1_read_sector_queries:  Number of read requests from L1 to slice 1 of L2 cache. This increments by 1 for each 32-byte access.

l2_subp0_read_tex_sector_queries:  Number of read requests from Texture cache to slice 0 of L2 cache. This increments by 1 for each 32-byte access.

l2_subp1_read_tex_sector_queries:  Number of read requests from Texture cache to slice 1 of L2 cache. This increments by 1 for each 32-byte access.

       l2_subp0_read_hit_sectors:  Number of read requests from L1 that hit in slice 0 of L2 cache. This increments by 1 for each 32-byte access.

       l2_subp1_read_hit_sectors:  Number of read requests from L1 that hit in slice 1 of L2 cache. This increments by 1 for each 32-byte access.

   l2_subp0_read_tex_hit_sectors:  Number of read requests from Texture cache that hit in slice 0 of L2 cache. This increments by 1 for each 32-byte access.

   l2_subp1_read_tex_hit_sectors:  Number of read requests from Texture cache that hit in slice 1 of L2 cache. This increments by 1 for each 32-byte access.

l2_subp0_read_sysmem_sector_queries:  Number of system memory read requests to slice 0 of L2 cache. This increments by 1 for each 32-byte access.

l2_subp1_read_sysmem_sector_queries:  Number of system memory read requests to slice 1 of L2 cache. This increments by 1 for each 32-byte access.

l2_subp0_write_sysmem_sector_queries:  Number of system memory write requests to slice 0 of L2 cache. This increments by 1 for each 32-byte access.

l2_subp1_write_sysmem_sector_queries:  Number of system memory write requests to slice 1 of L2 cache. This increments by 1 for each 32-byte access.

l2_subp0_total_read_sector_queries:  Total read requests to slice 0 of L2 cache. This includes requests from  L1, Texture cache, system memory. This increments by 1 for each 32-byte access.

l2_subp1_total_read_sector_queries:  Total read requests to slice 1 of L2 cache. This includes requests from  L1, Texture cache, system memory. This increments by 1 for each 32-byte access.

l2_subp0_total_write_sector_queries:  Total write requests to slice 0 of L2 cache. This includes requests from  L1, Texture cache, system memory. This increments by 1 for each 32-byte access.

l2_subp1_total_write_sector_queries:  Total write requests to slice 1 of L2 cache. This includes requests from  L1, Texture cache, system memory. This increments by 1 for each 32-byte access.

        Domain domain_c:
                   gld_inst_8bit:  Total number of 8-bit global load instructions that are executed by all the threads across all thread blocks.

                  gld_inst_16bit:  Total number of 16-bit global load instructions that are executed by all the threads across all thread blocks.

                  gld_inst_32bit:  Total number of 32-bit global load instructions that are executed by all the threads across all thread blocks.

                  gld_inst_64bit:  Total number of 64-bit global load instructions that are executed by all the threads across all thread blocks.

                 gld_inst_128bit:  Total number of 128-bit global load instructions that are executed by all the threads across all thread blocks.

                   gst_inst_8bit:  Total number of 8-bit global store instructions that are executed by all the threads across all thread blocks.

                  gst_inst_16bit:  Total number of 16-bit global store instructions that are executed by all the threads across all thread blocks.

                  gst_inst_32bit:  Total number of 32-bit global store instructions that are executed by all the threads across all thread blocks.

                  gst_inst_64bit:  Total number of 64-bit global store instructions that are executed by all the threads across all thread blocks.

                 gst_inst_128bit:  Total number of 128-bit global store instructions that are executed by all the threads across all thread blocks.

        Domain domain_d:
                      local_load:  Number of executed load instructions where state space is specified as local, increments per warp on a multiprocessor.

                     local_store:  Number of executed store instructions where state space is specified as local, increments per warp on a multiprocessor.

                     gld_request:  Number of executed load instructions where the state space is not specified and hence generic addressing is used, increments per warp on a multiprocessor. It can include the load operations from global,local and share state space.

                     gst_request:  Number of executed store instructions where the state space is not specified and hence generic addressing is used, increments per warp on a multiprocessor. It can include the store operations to global,local and share state space.

                     shared_load:  Number of executed load instructions where state space is specified as shared, increments per warp on a multiprocessor.

                    shared_store:  Number of executed store instructions where state space is specified as shared, increments per warp on a multiprocessor.

                          branch:  Number of branch instructions executed per warp on a multiprocessor.

                divergent_branch:  Number of divergent branches within a warp. This counter will be incremented by one if at least one thread in a warp diverges (that is, follows a different execution path) via a conditional branch.

                  warps_launched:  Number of warps launched on a multiprocessor.

                threads_launched:  Number of threads launched on a multiprocessor.

                  inst_issued1_0:  Number of single instruction issued per cycle in pipeline 0.

                  inst_issued2_0:  Number of dual instructions issued per cycle in pipeline 0.

                  inst_issued1_1:  Number of single instruction issued per cycle in pipeline 1.

                  inst_issued2_1:  Number of dual instructions issued per cycle in pipeline 1.

                 prof_trigger_00:  User profiled generic trigger that can be inserted in any place of the code to collect the related information. Increments per warp.

                 prof_trigger_01:  User profiled generic trigger that can be inserted in any place of the code to collect the related information. Increments per warp.

                 prof_trigger_02:  User profiled generic trigger that can be inserted in any place of the code to collect the related information. Increments per warp.

                 prof_trigger_03:  User profiled generic trigger that can be inserted in any place of the code to collect the related information. Increments per warp.

                 prof_trigger_04:  User profiled generic trigger that can be inserted in any place of the code to collect the related information. Increments per warp.

                 prof_trigger_05:  User profiled generic trigger that can be inserted in any place of the code to collect the related information. Increments per warp.

                 prof_trigger_06:  User profiled generic trigger that can be inserted in any place of the code to collect the related information. Increments per warp.

                 prof_trigger_07:  User profiled generic trigger that can be inserted in any place of the code to collect the related information. Increments per warp.

                   inst_executed:  Number of instructions executed, do not include replays.

          thread_inst_executed_0:  Number of instructions executed by all threads, does not include replays. For each instruction it increments by the number of threads in the warp that execute the instruction in pipeline 0.

          thread_inst_executed_2:  Number of instructions executed by all threads, does not include replays. For each instruction it increments by the number of threads in the warp that execute the instruction in pipeline 2.

          thread_inst_executed_1:  Number of instructions executed by all threads, does not include replays. For each instruction it increments by the number of threads in the warp that execute the instruction in pipeline 1.

          thread_inst_executed_3:  Number of instructions executed by all threads, does not include replays. For each instruction it increments by the number of threads in the warp that execute the instruction in pipeline 3.

                    active_warps:  Accumulated number of active warps per cycle. For every cycle it increments by the number of active warps in the cycle which can be in the range 0 to 48.

                   active_cycles:  Number of cycles a multiprocessor has at least one active warp.

                      atom_count:  Number of warps executing atomic reduction operations for thread-to-thread communication. Increments by one if at least one thread in a warp executes the instruction

                      gred_count:  Number of warps executing reduction operations on global and shared memory. Increments by one if at least one thread in a warp executes the instruction