OpenCL result changes depending on result of printf? What?
I have an OpenCL kernel that crunches some numbers. This particular kernel then searches an array of 8-bit char4 vectors for a matching string of numbers. For example, the array holds 3 67 8 2 56 1 3 7 8 2 0 2 - the kernel loops over that (the actual string is 1024 digits long), searches for 1 3 7 8 2, and "returns" data letting the host program know it found a match.
As a combination learning exercise and programming experiment, I wanted to see if I could loop over an array and search for a range of values, where the array is not just char values but char4 vectors, without using a single if statement in the kernel. Two reasons:
1: After half an hour of getting compile errors, I realized you cannot do:
if(charvector[3] == searchvector[0])
because some components may match and some may not. And 2:
I'm new to OpenCL, and I've read a lot about how branches can hurt a kernel's speed. If I understand the internals of kernels correctly, some math may actually be faster than if statements. Is that the case?
Anyway... first, the kernel in question:
void search(__global uchar4 *rollsrc, __global uchar *srch, char srchlen)
{
    size_t gx = get_global_id(0);
    size_t wx = get_local_id(0);
    __private uint base = 0;
    __local uchar4 queue[8092];
    __private uint chunk = 8092 / get_local_size(0);
    __private uint ctr, start, overlap = srchlen-1;
    __private int4 srchpos = 0, srchtest = 0;
    uchar4 searchfor;
    event_t e;

    start = max((int)((get_group_id(0)*32768) - overlap), 0);
    barrier(CLK_LOCAL_MEM_FENCE);
    e = async_work_group_copy(queue, rollsrc+start, 8092, 0);
    wait_group_events(1, &e);

    for(ctr = 0; ctr < chunk+overlap; ctr++) {
        base = min((uint)((get_group_id(0) * chunk) + ctr), (uint)((n*32768)-1));
        searchfor.x = srch[max(srchpos.x, 0)];
        searchfor.y = srch[max(srchpos.y, 0)];
        searchfor.z = srch[max(srchpos.z, 0)];
        searchfor.w = srch[max(srchpos.w, 0)];
        srchpos += max((convert_int4(abs_diff(queue[base], searchfor))*-100), -100) | 1;
        srchpos = max(srchpos, 0);
        srchtest = clamp(srchpos-(srchlen-1), 0, 1) << 31;
        srch[0] |= (any(srchtest) * 255);
        // if(get_group_id(0) == 0 && get_local_id(0) == 0)
        //     printf("%u: %v4u %v4u\n", ctr, srchpos, srchtest);
    }
    barrier(CLK_LOCAL_MEM_FENCE);
}
There's some unneeded code in there; this is a copy of a previous kernel, and I haven't cleaned out the junk yet. That being said, in short and in English, here is how the math-based if statement works:
Since I need to search for a range, and I'm searching a vector, I first set a char4 vector (searchfor) so that its elements x, y, z and w are individually set to the number each one is searching for. It's done individually because x, y, z and w each hold a different stream, and the search counter - how many matches in a row we've had - is different for each member of the vector. I'm sure there's a better way to do it than what I did. Suggestions?
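To make the per-lane lookup concrete, here is a minimal scalar sketch in plain C rather than OpenCL C. The names `LANES` and `gather_wanted` are made up for illustration; the loop plays the role of the four `searchfor.x` to `searchfor.w` assignments, with each lane using its own match counter as the index into the search string.

```c
#include <stdint.h>

#define LANES 4  /* one lane per component of the uchar4 (x, y, z, w) */

/* Hypothetical helper: for each lane, pick the pattern byte that lane
 * is currently waiting for. A negative counter is treated as 0, which
 * mirrors srch[max(srchpos.x, 0)] in the kernel. */
static void gather_wanted(const uint8_t *pattern, const int pos[LANES],
                          uint8_t wanted[LANES])
{
    for (int lane = 0; lane < LANES; ++lane) {
        int idx = pos[lane] > 0 ? pos[lane] : 0;
        wanted[lane] = pattern[idx];
    }
}
```

Note that nothing here stops a counter from indexing past the end of the pattern; the caller has to guarantee that.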
Then an int4 vector, srchpos, which holds the current position in the search array for each of the 4 vector positions, gets this added to it:
max((convert_int4(abs_diff(queue[base], searchfor))*-100), -100) | 1;
What this does: take the absolute difference between the current location in the target queue (queue) and the searchfor vector set in the previous 4 lines. The vector returned has, for each member, either a positive number (not a match) or 0 (a match - no difference).
It's converted to int4 (as a uchar cannot be negative), then multiplied by -100, then run through max(x, -100). Now each element is either -100 or 0. OR it with 1 and it's either -99 or 1.
End result: srchpos either increments by 1 (a match) or is reduced by 99, resetting any previous partial-match increments. (Searches can be up to 96 characters long - there is a chance of matching 91 and then missing, so it has to be able to wipe that out.) It is then max'ed with 0 so any negative result is clamped to zero. Again - open to suggestions to make that more efficient. I realized while writing this that I could probably use saturating addition to remove some of the max statements.
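The arithmetic above can be checked one lane at a time in plain C. This is only a sketch: `imax` and `abs_diff_u8` are stand-ins for the OpenCL built-ins `max` and `abs_diff`, and `step_delta` and `advance` are hypothetical names, not part of the kernel.

```c
#include <stdint.h>

/* Scalar stand-ins for the OpenCL built-ins max() and abs_diff(). */
static int imax(int a, int b) { return a > b ? a : b; }
static int abs_diff_u8(uint8_t a, uint8_t b) { return a > b ? a - b : b - a; }

/* +1 when sample == wanted, -99 otherwise.
 * diff * -100 is 0 on a match and <= -100 on a miss; max(.., -100)
 * squashes every miss to exactly -100; OR-ing with 1 turns 0 into 1
 * and -100 into -99 (two's complement). */
static int step_delta(uint8_t sample, uint8_t wanted)
{
    return imax(abs_diff_u8(sample, wanted) * -100, -100) | 1;
}

/* Apply the delta and clamp at zero, as srchpos = max(srchpos, 0) does. */
static int advance(int pos, uint8_t sample, uint8_t wanted)
{
    return imax(pos + step_delta(sample, wanted), 0);
}
```

With these, a run of 91 matches followed by a miss goes 91 + (-99) = -8, which the clamp turns back into 0. The reset only works while the counter stays below 100, which fits the stated 96-character limit.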
The last part takes the current srchpos, which equals the number of consecutive matches, subtracts 1 less than the length of the search string, then clamps it to 0-1, ending up with either 1 (a full match) or 0. That is bit-shifted << 31, so the result is 0 or 0x80000000. That gets put into srchtest.
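A scalar sketch of that full-match test, again in plain C. `clampi` stands in for the OpenCL `clamp` built-in and `full_match_flag` is a made-up name. The shift is done on an unsigned value here because shifting 1 into the sign bit of a signed int is not well defined in standard C.

```c
#include <stdint.h>

/* Scalar stand-in for the OpenCL built-in clamp(). */
static int clampi(int x, int lo, int hi)
{
    return x < lo ? lo : (x > hi ? hi : x);
}

/* 0x80000000 when pos has reached srchlen consecutive matches, else 0.
 * pos - (srchlen - 1) is <= 0 for a partial match and >= 1 for a full
 * one, so the clamp yields exactly 0 or 1 before the shift. */
static uint32_t full_match_flag(int pos, int srchlen)
{
    uint32_t hit = (uint32_t)clampi(pos - (srchlen - 1), 0, 1);
    return hit << 31;
}
```

For a 5-character search string, a counter of 4 gives 0 and a counter of 5 gives 0x80000000, which is the most significant bit that any() looks at.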
Lastly, the first character of the search string is bitwise OR'ed with the result of any(srchtest) * 255 - it's one of the few ways (that I'm aware of) to test across a vector and return a single integer from it. (any() returns 1 if any member of the vector has its MSB set, which we set in the line above.)
End result? srch[0] is either unchanged or, in the case of a match, set to 0xff. When the kernel returns, the host can read srch back from the buffer. If the first character is 0xff, we found a match.
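Putting the steps together, the whole scheme can be emulated for a single stream in plain C. This is a sketch under assumptions, not the kernel itself: `branchless_search` is a hypothetical name, the result is returned instead of being OR'ed into srch[0], and the pattern index is clamped to the last valid position (the kernel does not do this) so the sketch never reads past the pattern.

```c
#include <stddef.h>
#include <stdint.h>

/* Scalar stand-ins for the OpenCL built-ins max(), min() and abs_diff(). */
static int imax(int a, int b) { return a > b ? a : b; }
static int imin(int a, int b) { return a < b ? a : b; }
static int abs_diff_u8(uint8_t a, uint8_t b) { return a > b ? a - b : b - a; }

/* Returns 0xFF if pattern (len >= 1 bytes) was seen in data, else 0. */
static uint8_t branchless_search(const uint8_t *data, size_t n,
                                 const uint8_t *pattern, int len)
{
    int pos = 0;        /* consecutive matches so far */
    uint8_t found = 0;  /* plays the role of the flag OR'ed into srch[0] */

    for (size_t i = 0; i < n; ++i) {
        /* Assumption: clamp the index so pos == len cannot read pattern[len]. */
        uint8_t wanted = pattern[imin(imax(pos, 0), len - 1)];

        pos += imax(abs_diff_u8(data[i], wanted) * -100, -100) | 1;
        pos = imax(pos, 0);

        int hit = imin(imax(pos - (len - 1), 0), 1);  /* clamp to 0..1 */
        found |= (uint8_t)(hit * 255);
    }
    return found;
}
```

Run on the example from the top of the post (searching 1 3 7 8 2 in 3 67 8 2 56 1 3 7 8 2 0 2) this returns 0xFF. Like the kernel, it resets the counter on a miss without re-testing the current byte against the first pattern character, so a search for 1 1 2 in 1 1 1 2 is missed.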
It probably has too many steps and can be cleaned up. It may also be less efficient than just doing 4 if checks per loop. Not sure.
But, after this massive post, here is the thing that has me pulling my hair out:
When I uncomment the 2 lines at the end that print debug information, the script works. This is the end of the output in the terminal window when I run it:
36: 0,0,0,0 0,0,0,0
37: 0,0,0,0 0,0,0,0
38: 0,0,0,0 0,0,0,0
39: 0,0,0,0 0,0,0,0
search = 613.384 ms
positive
done read loop: -1 27 41
Positive means the string was found. The -1 27 41 is the first 3 characters of the search string, the first being set to -1 (signed char on the host side).
Here's what happens when I comment out the printf debugging info:
search = 0.150 ms
negative
done read loop: 55 27 41
It did not find it. What?! How is that possible? Of course, I notice that the script execution time jumps from 0.15 ms to 600+ ms because of the printf, so I think maybe it's somehow returning and reading the data before the script ends, and the delay from the printf gives it a pause. So I add a barrier(CLK_LOCAL_MEM_FENCE); to the end, thinking that will make sure all threads are done before returning. Nope. No effect. I then add a 2 second sleep on the host side, after running the kernel, after running clFinish, and before running clEnqueueReadBuffer.
Nope! Still negative. But I put the printf back in - and it works. How is that possible? Why? Does anyone have any idea? This is the first time I've had a programming bug that baffled me to the point of pulling my hair out, because it makes absolutely zero sense. The work items are not clashing: each reads its own block, and they even have an overlap in case the search string is split across two work item blocks.
Please - save my hair - how can a printf of irrelevant data cause this to work, and removing it cause it to not?
Oh - one last fun thing: if I remove the parameters from the printf - just have it print text like "grr please work" - the kernel returns a negative, and nothing prints out. The printf is ignored.
What the heck is going on? Thanks for reading, I know this was absurdly long.
For anyone referencing this question in the future, the issue was caused by my arrays being read out of bounds. When that happens, all heck breaks loose and the results are unpredictable.
Once I fixed the work and group sizes and made sure I was not exceeding the memory bounds, it worked as expected.
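The post does not show the actual fix, so the following is only one way to guard a read of this kind, sketched in plain C. `safe_copy_count` is a hypothetical helper: given a start offset, the number of elements wanted (for example the 8092 passed to async_work_group_copy) and the total number of elements in the source buffer, it returns how many elements can be read without running off the end.

```c
#include <stddef.h>

/* Hypothetical helper: number of elements that can be read starting at
 * `start` without going past `total`, capped at `want`. */
static size_t safe_copy_count(size_t start, size_t want, size_t total)
{
    if (start >= total)
        return 0;                      /* the start itself is out of range */
    size_t avail = total - start;      /* elements left in the buffer */
    return want < avail ? want : avail;
}
```

The same kind of check applies to any computed index, such as the one used to read the local queue: it has to stay below the array length for every work-group and every loop iteration, not just the first.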