Using CGI for Synthetic Image/Scene Creation to train Computer Vision Machine Learning Models

Introduction

I've been working with a startup investigating the application of machine learning/artificial intelligence (ML/AI) computer vision models to detect activity in specialized vertical industries, as well as other highly specialized use cases in biology. Examples include:

  1. Safety practices in oil and gas drilling/refining and construction,
  2. Safety of power lines,
  3. Behavior of animals in their natural settings.

These are use cases with an array of specialized objects of interest not available in, say, ImageNet, complex behavior, and a variety of environments. Several industries have companies that design their own fully specialized equipment for which no public domain images exist at all. So, where can we obtain training data for such highly specialized use cases?

I want to share some insights gleaned while designing a solution to the challenges of training ML models for highly specialized environments. Specifically, this article discusses the challenge of obtaining training data using computer-generated imagery (CGI) techniques and tools.

Uses of CGI in ML Training

For specialized industries it may be difficult to find training sets in the public domain sufficient to train ML models to recognize esoteric equipment. Acquiring sufficiently large and diverse training data sets for Convolutional Neural Networks (CNNs) in these industries is a challenge no matter the size of the ML company (even Google can't find images of a specialized pump, for example). So, the question becomes how to acquire training datasets with enough images to train the networks.

Enter CGI. CGI technology dates back to the 1980s and really took off in the gaming and film industries in the 1990s. Its growth and maturity accelerated with the widespread adoption of GPUs, which also happen to power ML training.

In the CGI rendering world, the initial data set takes the form of a 3d polygon mesh. 3d mesh models can be brought into CGI tools like Blender and manipulated in a large variety of ways to produce the desired training data.
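To make this concrete, here is a minimal sketch of importing a mesh with Blender's Python API (bpy). The file path and object are placeholders, and the importer shown is the legacy OBJ operator available in Blender 2.8x/3.x:

    # Run inside Blender's Python environment; bpy is Blender's built-in API.
    import bpy

    # Import an OBJ mesh (Blender 2.8x/3.x; Blender 4.x replaces this
    # operator with bpy.ops.wm.obj_import).
    bpy.ops.import_scene.obj(filepath="/assets/hardhat.obj")  # placeholder path

    # The importer leaves the new object(s) selected; grab the first one.
    hardhat = bpy.context.selected_objects[0]
    print(hardhat.name, len(hardhat.data.vertices), "vertices")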

Sources of 3d Models

There are many sources of 3d models suitable for Blender. Here is a partial list:

  1. Public CAD models. Many existing industrial objects of interest will have a 3d CAD model in some format available on a public website. Sites like GrabCAD, Free3D, and TurboSquid host libraries of free and paid 3d models. Models found on these sites can be used for generalized training. Think Personal Protective Equipment (PPE) objects like hardhats, safety vests, gloves, etc. Some generic tools and equipment may also be available, like hammers and forklifts. Given the way CNN architectures learn, generic images may well suffice for training, with the trained models able to generalize to specific target examples. 3d models can also be found for people and animals, many of them already rigged for animation.
  2. Company provided CAD models. As mentioned in the introduction, highly specialized industries may design their own mechanical equipment, tools, and PPE for highly specialized tasks. Companies that do so will have full CAD models of that specialized equipment that can be imported into CGI tools like Blender where textures and materials can be applied to aid in ML training.
  3. Photogrammetric approaches. This involves creating meshes from a set of photographs taken from as many different angles as possible. The more photographs, and the more varied the angles (in all dimensions), the more accurate the 3d mesh will be. The technique produces not only the 3d mesh but also the object's texture map. Lighting conditions can greatly affect the quality of the generated mesh, so it is important to control for them. For outdoor subjects (like beaches) it is best to shoot in overcast conditions, where the lighting is diffuse and shadows are at a minimum. For industrial objects, it is best to set up a studio of some kind where lighting can be controlled. The technique works very well for smaller objects but can also be used at large scale to create a mesh for, say, a beach or an entire construction site. It used to require expensive commercial software, and the best results may still come from high-end packages, but there is now a free and open source option in Meshroom (a scripted sketch of a headless Meshroom run follows this list).
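As a sketch of what a headless photogrammetry run might look like, the snippet below drives Meshroom's batch pipeline from Python. The CLI entry point is an assumption to verify against your installation: it has been named meshroom_photogrammetry in older releases and meshroom_batch in newer ones, and the flags shown follow the newer convention:

    # Drive Meshroom's headless pipeline from Python (a sketch; verify
    # the CLI name and flags against your Meshroom release).
    import subprocess

    subprocess.run(
        [
            "meshroom_batch",            # assumed entry point; see note above
            "--input", "photos/pump/",   # directory of overlapping photos
            "--output", "meshes/pump/",  # destination for the textured mesh
        ],
        check=True,
    )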

Using CGI Models

There are (at least) four broad areas where CGI can aid in ML training:

  1. Data augmentation. Data augmentation is the process of creating multiple training images from a single image, and is mostly used to train CNNs. With traditional 2d data augmentation you are limited to modifications like scaling, rotating, and cropping; you cannot change the perspective, the lighting, the background, etc. With CGI approaches you can change all of those characteristics and more.
  2. Scene creation. Scene creation is the process whereby much larger scenes (an oil rig, a city street, a beach, etc.) are created and objects of interest are placed within them. In CGI we can also annotate the objects of interest in a form suitable for training one of the many variants of R-CNN models, saving much of the manual labor annotation normally requires.
  3. Animating behavior. You can animate the objects of interest and move them around the scene to simulate more complex behaviors. Animation can be used both to train and to validate models.
  4. Validating models. Once a model is trained, it is important to validate it against a variety of real (or, in this case, virtual) world examples. Since it is hard to find footage of, say, dangerous behavior on an oil rig (well, it should be hard, anyway), it is not always easy to ensure that the model detects what you claim it detects. Using CGI (perhaps coupled with a game engine like Unity or Unreal Engine) you can create multiple scenarios depicting the activity your model should detect.

Data Augmentation

In terms of data augmentation, here are some of the aspects of an image you can change in the CGI world (a scripted sketch combining several of these follows the list):

  1. Lighting. You can change the location, brightness, number, and type of light sources illuminating a scene. Example light source types are point, sun, spotlight, and area. With those you can achieve a wide range of lighting that changes the final rendered image. It is hard to achieve photorealistic lighting in CGI using these artificial light sources, however. To overcome their limitations, another approach is High Dynamic Range Imaging (HDRI), which produces far more realistic lighting scenarios far more simply. Instead of defining and placing discrete light sources in a 3d scene, HDRI lights the scene diffusely. Diffuse lighting is what we experience in the real world, which is why HDRI has become the standard in CGI. Using HDRI you can easily vary the time of day, season of the year, cloud cover, fog level, and so on, rather than just the properties of discrete light sources. HDRI can also provide realistic-looking backgrounds (see below). Finally, the process for creating HDRI images suitable for CGI lighting is well defined; all you need is a (high-ish end) camera (a smartphone is probably not adequate).
  2. Orientation. With a 3d model you can change the orientation of the object in all dimensions from a single source model. Changing the camera position instead produces similar results: when augmenting individual objects of interest, it matters little whether you reorient the camera or the object.
  3. Textures. Textures apply a 2d image to the 3d mesh to provide the overall look of the rendered object. The simplest texture is just an image of a single color, though more complex textures (like decals) can also be applied. To take a simple example, to create hardhats of different colors you can take the same 3d mesh and apply a red, green, blue, or yellow texture that changes the color of the original model. You can also apply a more detailed image to aid in detecting specific objects.
  4. Materials. Materials are relatively fine-grained modifications that subtly affect how light scatters off an object during the final render (the details involve how ray tracing, the heart of CGI rendering, works, and are well beyond the scope of this post). The resulting changes to the final image are likely too small to affect training in a meaningful way, but may still be useful in some specific cases.
  5. Background. Training images from ImageNet, for example, are never devoid of some sort of background, so any CGI approach will need to support variations in background as well. You can certainly provide realistic backgrounds (HDRI techniques are one easy way to achieve this). Perhaps more interesting for CNN training, though, is creating intentionally unrealistic backgrounds that force the CNN to better extract relevant features. This is an interesting area for further research, but also well beyond the scope of this post.
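Here is a minimal augmentation sketch using Blender's Python API (bpy) that varies three of the aspects above: object orientation, HDRI rotation, and texture color. It assumes a scene already containing an object named "hardhat" with a Principled BSDF material, and a world lit by an HDRI environment texture behind a Mapping node named "Mapping"; adjust the names for your own scene:

    import math
    import random
    import bpy

    obj = bpy.data.objects["hardhat"]  # assumed object name
    bsdf = obj.active_material.node_tree.nodes["Principled BSDF"]
    mapping = bpy.context.scene.world.node_tree.nodes["Mapping"]

    # RGBA tints to simulate differently colored hardhats.
    colors = [(1, 0, 0, 1), (0, 1, 0, 1), (0, 0, 1, 1), (1, 1, 0, 1)]

    for i in range(20):
        # Random orientation of the object of interest in all dimensions.
        obj.rotation_euler = [random.uniform(0, 2 * math.pi) for _ in range(3)]
        # Spin the HDRI environment to vary lighting direction and background.
        mapping.inputs["Rotation"].default_value[2] = random.uniform(0, 2 * math.pi)
        # Re-tint the base color.
        bsdf.inputs["Base Color"].default_value = random.choice(colors)

        bpy.context.scene.render.filepath = f"//renders/hardhat_{i:03d}.png"
        bpy.ops.render.render(write_still=True)

Twenty renders from one mesh, each with a different pose, lighting, and color: that is the augmentation multiplier CGI buys you.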

Scene Creation

Individual models are powerful enough on their own to aid in training CNNs. Each model becomes more powerful, however, when combined with other models and placed into larger CGI scenes suitable for R-CNN training data sets.

With the individual objects of interest available as CGI models, we need to create the CGI scenes into which they will be placed to produce the annotated images.

Here are some possibilities:

  1. Photogrammetry. Briefly touched on above, photogrammetry can be used at both small scale (individual objects of interest) and large scale. Large scale photogrammetry typically requires some sort of aerial photography, and drones are the most economical option. Commercial tools are probably your best bet here, as opposed to, say, Meshroom; Pix4D is a very good commercial option, and there are quite a few more. This approach may require a good deal of touch-up in "post production", but it provides a very good starting point and can save a lot of CGI artist time.
  2. CAD models. As with individual objects, CAD models may well already exist for a site or environment relevant to a specific industry. For example, you can find existing models for construction sites, city scenes, highways, etc., many of them at the same sites listed in the Sources of 3d Models section above. If a CAD model is the starting point, textures and other elements may have to be applied to achieve photorealistic images. This technique is already widely used in architecture, so it is quite mature.
  3. Procedural models. CGI tools also support a procedural approach to building world (and other) models. In fact, procedural modelling is quickly becoming the preferred choice in both film and gaming, and those approaches are clearly applicable to ML training. One such tool is SceneCity, an add-on for Blender. With SceneCity you simply specify parameters like area, grid size (for grid-like cities), and the mix of building types, and it generates a full 3d model into which you can place your objects (a toy sketch of the idea follows this list). This approach creates entirely fictional scenes; if you want to depict a real location like an actual city street, other approaches are needed.
  4. Artist models. This approach can be used to manually create entirely fictional scenes, but also to re-create actual locations. A skilled CGI artist can create a 3d world model in a matter of days, at most weeks, depending on the desired complexity and photorealism. While it may seem resource intensive, this approach can be viable because the model is created once and repurposed for many training scenarios.
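To illustrate the procedural idea at toy scale, the bpy sketch below generates a grid of building-like blocks from three parameters, the same parameter-driven approach tools like SceneCity apply at far greater fidelity (the scene and parameters here are invented for illustration):

    import random
    import bpy

    GRID = 6          # blocks per side
    SPACING = 8.0     # distance between block centers
    MAX_FLOORS = 10   # tallest building, in arbitrary units

    for x in range(GRID):
        for y in range(GRID):
            height = random.randint(1, MAX_FLOORS)
            # A unit "building": a cube placed so its base sits on the ground.
            bpy.ops.mesh.primitive_cube_add(
                size=4.0,
                location=(x * SPACING, y * SPACING, height / 2),
            )
            # Stretch the cube to the chosen height.
            bpy.context.active_object.scale.z = height / 4.0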

Object of Interest Placement and Annotation

The final step is to place the synthetic objects of interest into the synthetic scenes and annotate each one. The annotations are bounding boxes or polygons around the object of interest, with a label indicating what it is. Each rendered image then becomes one entry in a larger R-CNN training data set.
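Because the renderer knows exactly where every object sits, the annotations can be computed rather than drawn by hand. Here is a sketch using bpy: project the eight corners of an object's bounding box through the scene camera with world_to_camera_view, which returns coordinates normalized to [0, 1] in render space (the object name is assumed):

    import bpy
    from mathutils import Vector
    from bpy_extras.object_utils import world_to_camera_view

    scene = bpy.context.scene
    cam = scene.camera
    obj = bpy.data.objects["hardhat"]  # assumed object name

    # Project the 8 corners of the object's local bounding box into
    # normalized camera coordinates.
    coords = [
        world_to_camera_view(scene, cam, obj.matrix_world @ Vector(corner))
        for corner in obj.bound_box
    ]
    xs = [c.x for c in coords]
    ys = [c.y for c in coords]

    # world_to_camera_view puts the origin at the lower left; flip y if
    # your annotation format expects a top-left origin.
    bbox = (min(xs), 1 - max(ys), max(xs), 1 - min(ys))
    print("hardhat", bbox)

The box is axis-aligned and slightly loose (it bounds the 3d box, not the silhouette), but it comes free with every render.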

Here are some options for adding objects to a scene:

  1. Static object placement. This approach entails simply placing the objects in single, static locations for the duration of scene generation. Once the objects of interest are placed in the scene, properties of the scene can be varied to create different training images: camera properties (focal length, zoom level, etc.), lighting properties (the HDRI background, the orientation of that background, etc.), and so on.
  2. Animated object placement. Here the animation capabilities of CGI tools like Blender move objects of interest around the space programmatically (see the sketch below). Animated placement can be combined with the same scene manipulations described under static object placement to further vary the training images. In addition, a CGI model created in Blender can be exported to a game engine like Unreal Engine, where familiar game controllers can manipulate objects of interest through an interface anyone who has ever played a computer game will recognize.
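As a minimal sketch of programmatic placement in bpy, the snippet below keyframes a worker model sliding across the scene and renders one training image per frame (the object name is assumed):

    import bpy

    scene = bpy.context.scene
    worker = bpy.data.objects["worker"]  # assumed object name

    # Keyframe the object's location at frames 1..10.
    for frame, x in enumerate(range(0, 50, 5), start=1):
        scene.frame_set(frame)
        worker.location.x = float(x)
        worker.keyframe_insert(data_path="location", frame=frame)

    # Render each keyframed frame as a separate training image.
    for frame in range(1, 11):
        scene.frame_set(frame)
        scene.render.filepath = f"//renders/worker_{frame:03d}.png"
        bpy.ops.render.render(write_still=True)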

Summary

CGI is a powerful technology for generating deep learning training data sets that are hard or impossible to collect in the real world, while reducing manual labor in many cases.
