Drag the sliders to compare images!
Left: original image. Middle: recreation as pixels. Right: recreation as a sequence of Bézier curves.

Automatic Sketching and Vectorization
with Canvas-Drawer Networks.

We trained artificial agents to draw with high-level actions, like placing curves, shapes, and 3D cubes. Here's how.

Contents:
- Approximating the Rendering Program
- Vectorizing Handwritten Symbols
- Large Sketches and the Sliding Drawer
- Color? Image Segmentation?
- The Third Dimension


Interested in technical speak? Read the paper. Interested in looking inside my head? Read the research log.

Humans have an incredible ability to convert information from one domain to another. When a painter gazes upon a beautiful landscape, they can understand the trees, rivers, and clouds well enough to recreate the scene on their canvas. In recent years, computer agents have come close to human-level performance at recognizing and identifying objects. Where the gap lies, however, is in how we produce our images. While an artist makes use of quick brush strokes or pencil sketches, computer agents generally operate on the scale of individual pixels. This can lead to unfavorable artifacts and images that are hard to modify once created. What if computers could instead learn to draw their results as a sequence of high-level actions, such as strokes or shapes?

Of course, learning to convert pixels to brush strokes is easier said than done. The main issue is that while millions of images are available online, almost all of them exist only in pixel format. We can't just throw a neural network at mapping pixels to brush strokes, because we don't have any data to train on.

So let's take a step back. We don't have any paired data mapping images to strokes. But we do have plenty of painting programs that can render strokes into pixel-space. If we can train an agent to operate such a program and recreate a target image, then we're done!

Well, immediately there's a problem. Our painting program is a black box -- anything could be happening inside. For all we know, a team of highly trained monkeys could be running the whole operation. So unfortunately, we can't just calculate a gradient like we would for a typical neural network layer. We're stuck. If only we had a magical sample-based method for approximating functions...

Approximating the Rendering Program

We don't have any information on how the painting program works internally, and we don't need any. Instead, let's train a neural network that approximates this program in every situation. Then, we can easily calculate gradients through this "canvas" network to train whatever drawing network we decide on.


This is an interactive demo! Drag the points around to see how the canvas network approximates a Bézier curve.


It turns out that this relatively simple method works nicely! With a stack of convolutional layers and some ResNet connections, we can directly train a neural network to mimic a given painting program. In our case, we're using a program that takes as input a parametrized Bézier curve and outputs its corresponding pixels on a black-and-white 64x64 matrix. After sampling a whole bunch of input-output pairs (parameter vectors and pixels) from our painting program, we can optimize our canvas network with a simple L2 loss to imitate the original program as closely as possible.
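In code, the whole procedure fits in a few lines. Here's a minimal sketch in PyTorch -- the network below is a plain deconvolutional stack rather than the exact conv/ResNet architecture described above, and render_bezier is a placeholder standing in for the real (non-differentiable) painting program:

```python
import torch
import torch.nn as nn

def render_bezier(params):
    # Stand-in for the black-box painting program. In practice this calls
    # the real (non-differentiable) renderer; the placeholder returns blank
    # canvases so the sketch runs end-to-end.
    return torch.zeros(params.shape[0], 1, 64, 64)

class CanvasNet(nn.Module):
    """Approximates the renderer: curve parameters -> 64x64 pixels."""
    def __init__(self, n_params=10):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_params, 32 * 16 * 16), nn.ReLU(),
            nn.Unflatten(1, (32, 16, 16)),
            nn.ConvTranspose2d(32, 16, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(16, 1, 4, stride=2, padding=1), nn.Sigmoid(),
        )

    def forward(self, params):
        return self.net(params)  # (batch, 1, 64, 64)

canvas = CanvasNet()
opt = torch.optim.Adam(canvas.parameters(), lr=1e-3)
for step in range(10_000):
    params = torch.rand(64, 10)        # sample random curve parameters
    target = render_bezier(params)     # ground-truth pixels from the program
    loss = ((canvas(params) - target) ** 2).mean()  # simple L2 loss
    opt.zero_grad()
    loss.backward()
    opt.step()
```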

By performing this approximation, we've now got a version of our painting program that's easily differentiable. In other words, we can now optimize through our canvas network to easily find the right sequence of curves that make up a desired image. Let's see what we can do with this!

Vectorizing Handwritten Symbols

Humans draw symbols as a series of strokes. Can we get our computers to do the same?


This is an interactive demo! Draw on the left canvas and see how the network vectorizes.


Armed with our new canvas network, the first task we'll tackle is to recreate MNIST digits and Omniglot symbols as a series of Bézier curves. To achieve this, we will need to train a recurrent "drawer" network that can take in a pixel image of a symbol and output a sequence of parametrized curves that recreate it. Of course, we don't have this paired data. Instead, we pass our drawer's output through the canvas network, then optimize in pixel-space towards recreating the original image.
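Here's what that training loop might look like, continuing from the canvas-network sketch above. The drawer architecture is a hypothetical stand-in (recurrent over timesteps, as in the post, but the exact layers are my own), and compositing strokes with a pixelwise max is an assumption:

```python
class DrawerNet(nn.Module):
    """Hypothetical drawer: symbol image -> sequence of curve parameters."""
    def __init__(self, n_params=10, n_steps=8):
        super().__init__()
        self.n_steps = n_steps
        self.encoder = nn.Sequential(nn.Flatten(), nn.Linear(64 * 64, 256), nn.ReLU())
        self.lstm = nn.LSTM(256, 256, batch_first=True)
        self.head = nn.Linear(256, n_params)

    def forward(self, image):                  # image: (batch, 1, 64, 64)
        h = self.encoder(image)                # (batch, 256)
        seq = h.unsqueeze(1).repeat(1, self.n_steps, 1)
        out, _ = self.lstm(seq)
        return self.head(out)                  # (batch, n_steps, n_params)

drawer = DrawerNet()
opt = torch.optim.Adam(drawer.parameters(), lr=1e-3)
for p in canvas.parameters():                  # canvas = the network trained above
    p.requires_grad_(False)                    # keep the approximation fixed

def train_step(target):                        # target: (batch, 1, 64, 64) symbols
    curves = drawer(target)                    # predicted curve sequence
    strokes = canvas(curves.flatten(0, 1))     # render every curve via the canvas net
    strokes = strokes.view(-1, drawer.n_steps, 1, 64, 64)
    recon = strokes.max(dim=1).values          # composite strokes (assumed: pixel max)
    loss = ((recon - target) ** 2).mean()      # optimize in pixel-space
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```

Note that no curve labels ever appear: the gradient flows from the pixel loss, back through the frozen canvas network, into the drawer.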

Well, the core setup here is a general one. We have a lot of freedom in constructing our drawer and canvas networks, and in how we train them. To find the "best" setup, it's important to note that although we're optimizing towards a recreation in pixel space (via the canvas network), that recreation is only an approximation of our true goal: the real Bézier curves. We can only evaluate these qualitatively -- so let's see how we did!


Interestingly, when we increase the MNIST drawer's timestep count to 20, we begin to see curves that bend around backwards. This is likely due to a line-thickness mismatch, which the drawer tries to correct by overlapping itself. Without an LSTM layer, the drawer often repeats curves it has already placed. In general, the drawer learns a satisfactory behavior, reminiscent of human scribbles.

Large Sketches and the Sliding Drawer

Our drawer can recreate digits and small symbols. What would it take to handle large, complicated sketches? The answer lies in a simple slide.


How complicated a task can we scale these canvas-drawer methods to? As it turns out, the naive approach of upgrading the image resolution and increasing the number of timesteps didn't work -- the networks simply took too long to train. To handle high-resolution sketches, we needed a new trick.

Hope arrived in the form of a sliding drawer network. Instead of tackling the entire image at once, what if we broke it up into smaller sections and handled those individually? The intuition here stems from the convolutional kernel, which assumes that features on an image are translation independent. We're taking this one step further, and assuming that translation independence also applies to drawing behaviors.

Once again, we have a lot of freedom in how we approach this new idea in practice. How do we break up the image? What order should we handle the sections in, and should there be any overlap? The only way to tell for sure is to try them out! One of the simplest schemes is sketched below.
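For instance, a non-overlapping left-to-right, top-to-bottom slide might look like this. Everything here is an assumption for illustration: the patch size, the stride, and especially the curve parameter layout (alternating x and y pixel coordinates):

```python
def sliding_draw(drawer, image, patch=64, stride=64):
    # Slide a small drawer across a large image, shifting each patch's
    # curves into global coordinates. Assumes even parameter indices are
    # x pixel coordinates and odd indices are y pixel coordinates.
    _, _, H, W = image.shape
    curves = []
    for top in range(0, H - patch + 1, stride):
        for left in range(0, W - patch + 1, stride):
            crop = image[:, :, top:top + patch, left:left + patch]
            local = drawer(crop)               # curves in patch coordinates
            offset = torch.zeros_like(local)
            offset[..., 0::2] = left           # shift x coordinates
            offset[..., 1::2] = top            # shift y coordinates
            curves.append(local + offset)
    return torch.cat(curves, dim=1)            # one long sequence of curves
```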


An interesting result appeared when we tried including a two-layer hierarchy. Basically, we first slide across the image with a larger drawer network (128x128), followed by a smaller one (64x64). While the large drawer was not particularly accurate, the small drawer actually learned an incredibly clean drawing behavior. The larger drawer may be acting as a sort of regularizer, eliminating dense sections so the small drawer can focus on drawing precise, continuous curves.

Color? Image Segmentation?

In digital design programs, it's crucial to have a clean, vectorized representation of a diagram. Can we use our canvas-drawer networks to create these representations from pure pixels?

Floorplan pixel hint.

Our drawer's translation, as a series of rectangles!

Ground truth pixels.


In this section, let's revisit our original motivation. One of the most important benefits of producing images in a high-level space (curves, shapes, etc.) is that the representation is interpretable by both humans and computers. It's a lot easier to drag and drop bounding boxes than to redraw an entire design. Creating such representations, however, is often a long and annoying task -- an architect could spend hours converting a pencil sketch into a CAD design file. If we can automate a portion of this, it could save a lot of time.

Our task is to segment architectural floorplans into a set of room-types. We have pairs of floorplans and segmented rooms, but they're all in pixel format. With canvas-drawer networks, we can fix this! Instead of Bézier curves, our canvas network now renders a parametrized rectangle (x, y, width, height, color).
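Training data for this rectangle canvas network can be sampled the same way as before. Here's an illustrative generator; the coordinate convention and the reduction of color to a single intensity channel are assumptions to keep the sketch short:

```python
def render_rect(params, size=64):
    # params: (batch, 5) rows of (x, y, width, height, color), all in [0, 1].
    # Color is reduced to a single intensity channel for brevity.
    imgs = torch.zeros(params.shape[0], 1, size, size)
    for i, (x, y, w, h, c) in enumerate(params.tolist()):
        x0, y0 = int(x * size), int(y * size)
        x1 = min(size, x0 + max(1, int(w * size)))
        y1 = min(size, y0 + max(1, int(h * size)))
        imgs[i, 0, y0:y1, x0:x1] = c
    return imgs

params = torch.rand(64, 5)
target = render_rect(params)   # train a rectangle canvas net on these pairs, as before
```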

The drawer network has a tougher job than usual. Instead of just recreating the image it's given, we want it to translate an architectural floorplan into a corresponding room-type segmentation. The key here is that since our canvas only draws rectangles, the drawer network has to learn how to "build" a segmentation image using only a series of colored rectangles. This is great, because if the drawer does its job correctly, these rectangles are actually just the bounding boxes of various rooms -- and we now have a clean, interpretable segmentation of our floorplan.
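The only change from the earlier drawer loop is that the loss target is no longer the input image. Schematically (with compose standing in for the pixelwise-max compositing used above, and the variables as placeholders):

```python
rects = drawer(floorplan)                    # floorplan pixels in
recon = compose(canvas(rects))               # colored rectangles out
loss = ((recon - segmentation) ** 2).mean()  # match the room-type image, not the input
```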

The Third Dimension

What fantastical magic awaits in the elusive third dimension?



For our last trick, we're going 3-D. So far, we've defined our high-level actions as curves or rectangles to fit our problem. But really, we can set these actions to be anything -- as long as they can be mapped back to pixels. So let's try out some cubes! To render these, we'll make use of three orthogonal viewpoints from the front, right, and top.

Our trusty drawer network witnesses these three views. Based on them, it has to produce the corresponding corner points and colors of the cubes, without any paired data to train from. This sounds far-fetched at first, but it's really just the same problem we've been solving all along. We optimize our drawer through our canvas network -- which means minimizing the pixelwise distance between the original three viewpoints and the three viewpoints our canvas network is producing. As this distance approaches zero, our drawer network gets closer and closer to producing the true parameters of the cubes.
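One way to express that three-view objective, sketched under heavy assumptions: a hypothetical drawer that reads all three views stacked as channels, and one canvas network per viewpoint mapping cube parameters to that view's pixels.

```python
def cube_loss(drawer, canvases, views):
    # views: target images per viewpoint, e.g. {"front": t1, "right": t2, "top": t3},
    # each of shape (batch, 1, 64, 64). canvases: one canvas network per viewpoint,
    # mapping cube parameters (corner points + color) to that view's pixels.
    stacked = torch.cat([views["front"], views["right"], views["top"]], dim=1)
    cubes = drawer(stacked)                     # (batch, n_cubes, n_params)
    loss = 0.0
    for name, canvas_net in canvases.items():
        flat = canvas_net(cubes.flatten(0, 1))  # render each cube in this view
        flat = flat.view(cubes.shape[0], -1, 1, 64, 64)
        recon = flat.max(dim=1).values          # composite cubes (assumed: pixel max)
        loss = loss + ((recon - views[name]) ** 2).mean()
    return loss                                 # summed pixel distance over all 3 views
```

As this summed distance approaches zero, the drawer is forced toward cube parameters that are consistent with all three viewpoints at once.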

Conclusions and Beyond

If you've made it this far, thanks for following along! This project was done during my summer at Autodesk, by me (Kevin Frans) and Chin-Yi Cheng. We first set out to find a way to convert messy sketches into CAD design files. Along the way, we created our own messes and ended up with some pretty cool results. For a more technical look at the methods used, check out the paper. Also, I've kept a daily research log with insights on what my research process looked like and what I was thinking about. If you're prepared for incomprehensible thought flows, go check that out.