I'm revisiting the Moana Island data set and I'm making great progress; I've got almost all of the assets converted into Redshift Proxies.
The biggest problem I'm currently facing is a 3GB JSON file that defines renderable curves on the largest mountain asset. I don't know exactly how many curves are defined in this file, but based on the curve count and data size of other JSON files I think its roughly 5.2 million curves. Each point of the curve is an array with 3 items, each curve is an array of N points, and the curves are stored inside of a top level array.
json module must load the entire file into memory before operating on it. I've experienced extremely poor behavior on any JSON file over 500MB with the
json module so I am instead parsing the files with
ijson which allows for iterative reading of the JSON files as well as a much faster C backend based on YAJL.
ijson I was able to read an 11GB file that stored transform matrices for instanced assets on the beach. However, even using
ijson I cannot seem to build a spline from the curves in this 3GB file (I gave up after letting the script run for 12 hours). I have a suspicion it has more to do with the way I'm building the spline object than parsing the data. So I have some questions. Is there a performance penalty for building a single spline with millions of segments? Should I instead build millions of splines with a single segment? Or would it be better to try and split the difference and build thousands of splines with 10,000 segments each?
I've done a little performance testing with my current code and right now it takes 10 minutes 13 seconds to build a single spline out of the first 100,000 curves in the file. However, if I build just the first 10,000 curves it only takes 5 seconds.
I'm leaning heavily toward chunking the splines into 10,000 segment batches but I want to first see if my code could be further optimized, here is the relevant portion:
curves = ijson.items(json_file, 'item', buf_size=32*1024)
#curves is a generator object that returns the points for each segment successively
for i, curve in enumerate(curves):
#for performance testing I'm limiting the number of segments parsed and created
if i > num:
point_count += len(curve) #tracking the total number of points
segment_count += 1 #tracking the number of segments
spline.ResizeObject(point_count, segment_count) #resizing the spline
for id, point in enumerate(reversed(curve)):
spline.SetSegment(segment_count-1, len(curve), False)
spline.SetPoint(point_count-id-1, flipMat * c4d.Vector(*(float(p) for p in point)))