EelcoHoogendoorn
More pycuda nd_kernel stuff
Aug 27th, 2012

"""
some conceptual code to illustrate planned kernel syntax stuff
this is just a small stub of code, for ease of mental access for others who want to look at it or play with it
it is currently not compatible with my other messy code-base, but that will come


the intent of this code is to extend elementwise-type pycuda functionality.
specifically, i aim for a kernel factory that is ndarray-aware. this is taken to mean several things:
    runtime type and shape checking of input arrays
    macros to obtain shapes and strides of arrays within the kernel; both the compile time and runtime constants
    macros to obtain the current index being processed; an nd generalization of the 'i' in elementwise
    or even better: arbitrary dereferencing of arrays a la weave.blitz:
        arrayidentifier(x,y,z) -> transformed into an appropriate inner product over strides


this is all fairly easy to do for a fixed number of dimensions (which is what my main code base is built on);
but the obvious next step is to emit code that works for any dimension

in order to do so, we should have some clear logic to decide which axes will
be looped over (serial), and which axes will be handled by the thread launcher (parallel)

we should always launch at least 1d kernels to have a minimum of parallelism;
furthermore, the stride=1 axis should always be parallel, for coalescing
but exactly how many axes we should do in parallel is highly problem- (and somewhat hardware-) dependent
f.i., 2d deconvolution is typically best done by reading coalesced rows into shared mem,
and then looping over the other axis
a 3d kernel on older hardware must be forced to loop over the biggest stride, and so on

anyway, having decided upon a partition of parallel and serial axes:
    parallel axes should emit indexing code of the form:
        unsigned i0 = blockDim.x*blockIdx.x + threadIdx.x;
    and serial axes should (recursively) emit an outer loop of the form:
        for (unsigned in = 0; in < kernel_shape_n; in++) { $body }
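
    for illustration, with a 3d kernel and axes = (serial, parallel, parallel),
    the emitted prologue plus loop would come out roughly as (exact ordering aside):

        const unsigned i1 = blockDim.x*blockIdx.x + threadIdx.x;
        if (i1 >= kernel_shape_1) return;
        const unsigned i2 = blockDim.y*blockIdx.y + threadIdx.y;
        if (i2 >= kernel_shape_2) return;
        unsigned i0;
        for (i0 = 0; i0 < kernel_shape_0; i0++)
        {
            $body
        }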

the most basic implementation that always works (though not optimally):
simply always launch 1d threadblocks, and loop over the other axes

we should probably allow for a manual override of axis parallelism, considering the problem-dependence


another low priority thing, but one which would be cool to have:
    can we make a general shared mem mechanism? upon marking an input arg with a keyword, emit code
    to defer reads of that array to a block of shared mem, with correct padding as decided by the supplied stencil
    shared mem only really comes into play for stencil operations. a simple nd-aware stencil like an outer product
    does not stand to gain much from shared mem; it is still 'elementwise', and needs NxM writes anyway, so there is not much
    point in trying to optimize the technically unnecessary NxM reads down to N+M



what about the class hierarchy?
nd_kernel is essentially the base
elementwise is simply a 1d nd_kernel; a subclass with some constraints
stencil_kernel is also a subclass of nd_kernel. it calls upon the nd_kernel base to do most of the work
all it does itself is pass in a pre-composed body, with the optionally unrolled inner stencil loop
"""


import numpy as np

from pyparsing import *


from pycuda.driver import init, Device
init()
device = Device(1)      #device index as used on the author's machine; adjust as needed
context = device.make_context()

import pycuda.tools
import pycuda.compyte.dtypes    #explicit import; pycuda.tools pulls this in as a side effect, but better to be clear
##pycuda.compyte.dtypes._fill_dtype_registry(True)
#keep only the string keys of the dtype registry, mapping numpy dtype names to c-type names
dtype_to_ctype = {k:v for k,v in pycuda.compyte.dtypes.DTYPE_TO_NAME.iteritems() if isinstance(k, str)}


class Kernel(object):
    """wrapper for a pycuda kernel; check and bind arguments"""

    def __init__(self, arguments, kernel):
        self.arguments = arguments
        self.kernel = kernel

    def __call__(self, *args, **kwargs):
        """
        launch kernel
        we need some logic here to extract runtime variable arguments
        and pass them on in correct order to the compiled kernel
        note that this base class is still conceptual; CheckedKernel below is what actually gets used
        """
        #args is a tuple; build a fresh list, with the declared shape constants appended
        args = list(args) + [
            np.uint32(size)
                for arg in self.arguments.itervalues() if arg.is_array
                    for size in arg.shape if size is not None]

        self.kernel(*args, **kwargs)

class CheckedKernel(Kernel):

    def __call__(self, *args, **kwargs):
        size_args = []
        normal_args = []
        arg_info = self.arguments
        #perform runtime checks
        for i, (value, (identifier, arg)) in enumerate( zip(args, arg_info.iteritems())):
            if arg.is_array:
                print value.dtype.name, arg.dtype
                assert(value.dtype.name == arg.dtype)       #type check on the string representation of the dtype; is that safe?
                assert(len(arg.shape) == len(value.shape))  #assert correct number of dimensions
                for asize, vsize in zip(arg.shape, value.shape):
                    if asize:
                        assert(asize == vsize)              #check all specified dimensions
                    else:
                        size_args.append(np.uint32(vsize))  #this is a runtime argument
                normal_args.append(value.gpudata)
            else:
                try:
                    cast = getattr(np, arg.dtype)
                    normal_args.append(cast(value))
                except (AttributeError, TypeError, ValueError):
                    raise Exception("argument {i} could not be cast to {dtype}".format(i=i, dtype=arg.dtype))

        #launch configuration is hardcoded to fit the example at the bottom of this file;
        #deriving it from the kernel shape and the axis partition is still to be done
        grid = (10,1,1)
        block = (10,1,1)
        kwargs.update(dict(grid=grid, block=block))

        args = normal_args + size_args
        self.kernel(*args, **kwargs)
##        super(CheckedKernel, self)(*args, **kwargs)


class ArgumentInfo(object):
    """bundle of identifier, shape, and dtype information for a single kernel argument"""
    def __init__(self, identifier, shape, dtype):
        self.identifier = identifier
        self.shape = shape
        self.dtype = dtype

    @property
    def ctype(self): return dtype_to_ctype[self.dtype]
    @property
    def is_scalar(self): return self.shape is None
    @property
    def is_array(self): return not self.is_scalar
    @property
    def ndim(self): return len(self.shape)


def arg_parse(arguments):
    """
    take a raw argument string of the specified syntax, and transform it
    into an ordereddict of ArgumentInfo objects
    """
    np_types = ' '.join(dtype_to_ctype.keys())
    dtype = oneOf(np_types).setResultsName('dtype')

    identifier = (Word(alphanums+'_')).setResultsName('identifier') #not fully correct; also matches names starting with a numeral. whatever

    positive_integer = Word(nums)
    dimension = Or([positive_integer, Literal(':')])
    shape = nestedExpr('[', ']', delimitedList(dimension)).setResultsName('shape')

    term = Group(dtype + Optional(shape) + identifier)
    grammar = delimitedList(term)

    #convert the parsing result to ArgumentInfo objects, for reasonably efficient runtime checks, and clean downstream code
    def shape_postparse(shape):     #apply parse actions to shape args
        if shape == '': return None     #scalar argument; Optional yields the empty string if no shape was given
        shape = tuple(None if size == ':' else int(size) for size in shape[0])
        if len(shape) == 0: raise Exception('arrays must have at least one dimension')
        return shape

    from collections import OrderedDict
    arg_info = OrderedDict()
    for argument in grammar.parseString(arguments):
        arg_info[argument.identifier] = ArgumentInfo(argument.identifier, shape_postparse(argument.shape), argument.dtype)

    return arg_info

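#for illustration (a sketch; assuming 'float32' and 'uint32' are registered dtype names):
#   arg_parse("float32[:,3] image, uint32 count")
#would yield an OrderedDict along the lines of:
#   'image' -> ArgumentInfo(identifier='image', shape=(None, 3), dtype='float32')
#   'count' -> ArgumentInfo(identifier='count', shape=None, dtype='uint32')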

def build_argument_strings(arg_info):
    """
    build c code to represent all the compile time and run time info
    that we seek to pass into the kernel
    """
    #add restrict keyword to ptrs by default; my preferred syntax would be to add an ALIAS keyword to suppress this
    c_arguments = ', '.join(
        '{type}{ptr} const {restrict}{identifier}'.format(type=arg.ctype, ptr='*'*arg.is_array, restrict='__restrict__ '*arg.is_array, identifier=arg.identifier)
            for arg in arg_info.itervalues()
    )

    variable_shape_arguments = ', '.join(
        'const unsigned {identifier}_shape_{dimension}'.format(identifier=arg.identifier, dimension=dimension)
            for arg in arg_info.itervalues() if arg.is_array
                for dimension, size in enumerate(arg.shape) if size is None
    )

    constant_shape_arguments = '\n'.join(
        'const unsigned {identifier}_shape_{dimension} = {size};'.format(identifier=arg.identifier, dimension=dimension, size=size)
            for arg in arg_info.itervalues() if arg.is_array
                for dimension, size in enumerate(arg.shape) if size is not None
    )

    return c_arguments, variable_shape_arguments, constant_shape_arguments

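#for illustration, given "float64[100,100] result, uint32[:,100,100] foo", this builds roughly:
#   c_arguments:              'double* const __restrict__ result, unsigned int* const __restrict__ foo'
#   variable_shape_arguments: 'const unsigned foo_shape_0'
#   constant_shape_arguments: definitions of result_shape_0/1 and foo_shape_1/2 as compile time constants
#(the exact c-type names depend on the contents of the dtype registry)
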
def compile_kernel_shape(shape):
    """
    emit code that determines the shape of the virtual ndarray over which we evaluate the kernel
    this function hardcodes this shape into the kernel
    """
    return '\n'.join(
        'const unsigned kernel_shape_{dimension} = {size};'.format(dimension=dimension, size=size)
            for dimension, size in enumerate(shape)
        )
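
#e.g. compile_kernel_shape((100, 100)) yields the two lines:
#   const unsigned kernel_shape_0 = 100;
#   const unsigned kernel_shape_1 = 100;
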
def runtime_kernel_shape(shape):
    """
    emit code that determines the shape of the virtual ndarray over which we evaluate the kernel
    this variant emits extra arguments to be supplied upon calling of the kernel
    """
    raise NotImplementedError()


def compute_strides(arg_info):
    """
    generate code to compute the strides of each array
    I am assuming in all this that the CUDA compiler has a clue about optimizing constant expressions;
    otherwise, we might have to use C macros for this kind of thing?
    I do hope unused constants get eliminated, and so on?
    """
    def strides(identifier, shape):
        """
        inner generator, to generate a sequence of stride definitions for each array
        i haven't actually settled on a convention for what is the first and last axis
        not having such a convention, it isn't applied either. in other words, this code
        is for illustrative purposes only
        """
        stride_template = '{identifier}_stride_{dimension}'
        shape_template = '{identifier}_shape_{dimension}'
        prev = stride_template.format(identifier=identifier, dimension=len(shape)-1)
        yield 'const unsigned {identifier} = {stride};'.format(identifier=prev, stride=1)
        for i, size in reversed(list(enumerate(shape[:-1]))):
            this = stride_template.format(identifier=identifier, dimension=i)
            size = shape_template.format(identifier=identifier, dimension=i+1)
            yield 'const unsigned {this} = {prev} * {size};'.format(this=this, prev=prev, size=size)
            prev = this
        #add the total element count as well, for good measure
        size = shape_template.format(identifier=identifier, dimension=0)
        yield 'const unsigned {identifier}_size = {prev} * {size};'.format(identifier=identifier, prev=prev, size=size)

    return '\n'.join(
        stride_expr
            for identifier, arg in arg_info.iteritems() if arg.is_array
                for stride_expr in strides(identifier, arg.shape))

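#for illustration, an array declared as 'uint32[8,100,100] bar' comes out as:
#   const unsigned bar_stride_2 = 1;
#   const unsigned bar_stride_1 = bar_stride_2 * bar_shape_2;
#   const unsigned bar_stride_0 = bar_stride_1 * bar_shape_1;
#   const unsigned bar_size = bar_stride_0 * bar_shape_0;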

def replace_typing(source):
    """
    replace numpy types with c-types. this could be more efficient and intelligent...
    we do not do any semantic analysis here; just simple find and replace
    but usage is optional anyway; we are fully backwards compatible, and free to use c-types in our code
    """
    np_types = ' '.join(dtype_to_ctype.keys())
    type_grammar = oneOf(np_types)
    type_grammar.setParseAction(lambda s,l,t: dtype_to_ctype[t[0]])
    return type_grammar.transformString(source)

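#e.g. 'float64 r = 0;' is transformed into 'double r = 0;'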

def replace_shape_syntax(source, arg_info):
    """
    replace arrayidentifier.shape[ndim] syntax with C named variables
    silently fails to replace some wrong syntax, like a misspelled 'shape';
    don't worry, the cuda compiler is sure to complain about it :)
    would it be sufficient and correct to catch all instances of 'arrayidentifier.'+whatever,
    that fail to match the whole syntax?
    """
    arrayidentifier = (Word(alphanums+'_')).setResultsName('identifier') # + Optional( Word(alphanums))
    positive_integer = Word(nums)
    shape_expr = arrayidentifier + Suppress( Literal('.shape')) + nestedExpr('[',']', positive_integer).setResultsName('dimension')

    def replace(s,l,t):    #string, location, parseresults
        """if the match is correct, replace numpy syntax with c-compatible syntax"""
        identifier = t.identifier
        dimensions = t.dimension[0]
        if not len(dimensions)==1: raise Exception('only simple shape indexing is allowed')
        dimension = dimensions[0]
        try:
            arg = arg_info[identifier]
        except KeyError:
            raise ParseFatalException("array '{identifier}' is not defined".format(identifier=identifier))
        try:
            size = arg.shape[int(dimension)]
        except Exception:
            raise ParseFatalException('{identifier}.shape[{dimension}] is invalid'.format(identifier=identifier, dimension=dimension))

        return '{identifier}_shape_{dimension}'.format(identifier=identifier, dimension=dimension)
    shape_expr.setParseAction(replace)

    return shape_expr.transformString(source)

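#e.g. with 'uint32[:,100,100] foo' declared, 'foo.shape[0]' is transformed into 'foo_shape_0'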

def replace_array_syntax(source, arg_info):
    """
    replace weave.blitz style array indexing with an inner product over strides
    we could optionally insert bounds checking code here as well, as a debugging aid
    should we allow for partial indexing? not sure
    """
    arrayidentifier = oneOf(' '.join(arg_info.keys())).setResultsName('identifier')
##    arrayidentifier = (Word(alphanums+'_')).setResultsName('identifier') # + Optional( Word(alphanums))
    identifier = Word(alphanums+'_')
    positive_integer = Word(nums)
    index = Or([identifier, positive_integer])
    index_expr = arrayidentifier + nestedExpr('(',')', delimitedList( index)).setResultsName('indices')

    def replace(s,l,t):    #string, location, parseresults
        """if the match is correct, replace numpy syntax with c-compatible syntax"""
        identifier = t.identifier
        indices = t.indices[0]

        try:
            arg = arg_info[identifier]
        except KeyError:
            raise ParseFatalException("array '{identifier}' is not defined".format(identifier=identifier))

        if not len(indices)==arg.ndim:
            raise Exception("indexing '{identifier}' requires {ndim} arguments".format(identifier=identifier, ndim=arg.ndim))

        offset = '+'.join(
            '{identifier}_stride_{i}*{idx}'.format(identifier=identifier, i=i, idx=idx)
                for i,idx in enumerate(indices))
        return '{identifier}[{offset}]'.format(identifier=identifier, offset=offset)
    index_expr.setParseAction(replace)

    return index_expr.transformString(source)

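#e.g. with 'float64[100,100] result' declared, 'result(i0, i1)' is transformed into
#   'result[result_stride_0*i0+result_stride_1*i1]'
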
def replace_for_syntax(source, arg_info):
    """
    replace: 'for (id in start:stop:step)'
    with:    'for (int id=start; id<stop; id+=step)'
    rather trivial syntactic sugar indeed; just flexing my pyparsing knowledge a bit
    we could implement an unrolling mechanism here too, in case all params are known at compile time, and the loop is small?
    """

    identifier = Word(alphanums+'_')
    integer = Word(nums)
    index = Combine( Optional(Literal('-')) + Or([identifier, integer]))    #index, with optional unary minus prepended
    colon = Suppress(Literal(':'))
    range = index.setResultsName('start') + colon + index.setResultsName('stop') + Optional(Combine(colon + index), '1').setResultsName('step')
    loop_expr = Literal('for') + '(' + identifier.setResultsName('index') + Literal('in') + range + ')'

    def replace(s,l,t):
        return 'for (int {index}={start}; {index} < {stop}; {index}+={step})'.format(**dict(t))
    loop_expr.setParseAction(replace)

    return loop_expr.transformString(source)

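#e.g. 'for (j in 0:8:2)' is transformed into 'for (int j=0; j < 8; j+=2)'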

def kernel_parsing(arguments, body, axes=None, shape=None):
    """
    transform syntactic sugar into valid CUDA-C code
    """
    arg_info = arg_parse(arguments)

    if shape is None:
        shape = arg_info[arg_info.keys()[0]].shape      #if shape is None, the kernel takes the shape of the first argument
    shape = tuple(shape)

    #create various string options for the axes param? 'serial', 'parallel', 'max_parallel', 'first_serial'
    if axes is None:
        axes = np.zeros(len(shape), np.bool)
        axes[-3:] = 1       #do the last three axes in parallel (needs to be set to two, for old cards)
        axes[0] = 0         #first axis is serial, by default. having at least one looping axis can be beneficial, for shared mem or other reusable terms
        axes[-1] = 1        #unless it is also the last axis, which is always parallel; otherwise we have no coalescing
    axes = np.array(axes, np.bool)
    assert(axes[-1]==True)  #last (stride==1) axis should be parallel. maybe this should be a warning instead, but i can't think of any reason why you'd want otherwise

    print 'ORIGINAL CODE:'
    print
    print arguments
    print body
    print

    #simple nd-array indexing. probably full of bugs?
    for axis in np.flatnonzero(axes==0):    #emit loops for serial axes
        loop = """
        unsigned i{n};
        for (i{n}=0; i{n} < kernel_shape_{n}; i{n}++ )
        """.format(n=axis)
        body = loop + '{\n' + body + '\n}'

    supported_axes = 'xyz'
    if np.count_nonzero(axes==1) > len(supported_axes):
        raise Exception('requested more parallel axes than are supported by your hardware!')
    for cuda_axis, nd_axis in zip(supported_axes, np.flatnonzero(axes==1)):    #do thread/block arithmetic for parallel axes, and map from x,y,(z) to the nd-enumeration
        parallel_axes = """
        const unsigned i{n} = blockDim.{a}*blockIdx.{a} + threadIdx.{a};
        if (i{n} >= kernel_shape_{n}) return;
        """.format(n=nd_axis, a=cuda_axis)
        body = parallel_axes + body

    #determine thread blocking
    #warps of THREADS_PER_ROW; other parallel axes are balanced out
    #just like supported_axes, these constants should be pulled from the surrounding context and its associated device
    n_parallel_axes = np.count_nonzero(axes==1)
    THREADS_PER_BLOCK = 512
    THREADS_PER_ROW = 32        #would like to coalesce 32 threads at the same time
    #WORK IN PROGRESS; none of these are used yet, see the hardcoded launch config in CheckedKernel

    #parsing-based replacements
    body = replace_typing(body)
    body = replace_shape_syntax(body, arg_info)
    body = replace_array_syntax(body, arg_info)
    body = replace_for_syntax(body, arg_info)

    #macro-based substitutions
    generic_template = """
    __global__ void ${funcname}(${arguments})
    {
        ${body}
    }
    """
    from string import Template
    template = Template(generic_template)   #would be nice to have an indentation-aware template...

    c_arguments, variable_shape_arguments, constant_shape_arguments = build_argument_strings(arg_info)

    print 'TRANSFORMED CODE:'
    transform = template.substitute(
        funcname = 'nd_kernel',
        #join while guarding against an empty runtime-argument list; a naive '+' would leave a trailing comma
        arguments = ', '.join(s for s in (c_arguments, variable_shape_arguments) if s),
        body = '\n\n'.join([
                compile_kernel_shape(shape),
                constant_shape_arguments,
                compute_strides(arg_info),
                body
            ])
        )
    print transform

    def compile(source):
        from pycuda.compiler import SourceModule
        module = SourceModule(source)

        func = module.get_function('nd_kernel')

        return CheckedKernel(arg_info, func)

    return compile(transform)



funky_product = kernel_parsing(
    #the (virtual) ndarray to arrange the kernel call over; as the example demonstrates, we cannot generally deduce this from the arguments.
    #if not specified, the first argument is taken to set the kernel shape;
##    shape = (100,100),
    #specify parallel and serial axes manually; the last (stride=1) axis is made parallel here, the other serial
    #if not specified, the default axis assignment gives the same result for this case
##    axes = [0, 1],
##    axes = [1, 1],
    #note the use of numpy dtypes, and the dimension and type constraints in the argument list
    arguments = """float64[100,100] result, uint32[:,100,100] foo, uint32[8,100,100] bar, float32 scalar""",
    #note the numpythonic shape accessing, and the weave.blitz style array indexing. plus simple looping helpers,
    #and the blissful absence of boilerplate in general :)
    body = """
    float64 r = 0;
    for (i in 0:foo.shape[0])
        for (j in 0:bar.shape[0]:2)
            r += foo(i,i0,i1) * bar(j,i0,i1) * scalar;
    result(i0, i1) = r;
    """
    )

from pycuda import gpuarray
out = gpuarray.zeros((100,100), np.float64)
foo = gpuarray.zeros((3,100,100), np.uint32)
foo.fill(1)
bar = gpuarray.zeros((8,100,100), np.uint32)
bar.fill(1)


#attempt to launch this thing, and see if it actually works
funky_product(out, foo, bar, 1.1)
#seems like it!
print out