Robot Goes Wild: Delta Robot Bounces Ball using Deep Reinforcement Learning

No one knows what the right algorithm is, but it gives us hope that if we can discover some crude approximation of whatever this algorithm is and implement it on a computer, that can help us make a lot of progress — Andrew Ng

The delta robot is quite popular: it is widely used in industries such as packaging and electronics assembly, and it can even draw a portrait, like Sketchy. Its simple structure makes it easy to control for handling lightweight objects in high-speed motion. This nice little robot is something anyone can build at home!

Salute in Delta Style, Photo Credit:

Recently, I got to see some really cool tricks done with deep reinforcement learning (DRL), such as an agent playing Atari games at a superhuman level. Inspired by those videos, I wondered whether we could use a delta robot for some fun tricks that are quite difficult for humans, too. Meanwhile, at Northwestern University, I was taking COMP 469, a deep learning class, with Dr. Bryan Pardo and my pal, Yipeng Pan. After some discussion, we decided to make the robot bounce a ball like a trampoline. That’s how the story began.

The robot has 3 legs. Some geometry analysis (roboticists call it “inverse kinematics”) shows that the robot’s top platform (aka the “end-effector”, shown in grey in the illustration below) can only translate along the x, y, z directions in space and cannot rotate. This means we just need to control the 3 joints near the bottom base of this robot. In real life, the motor we mount at each joint has torque τ as a measurable output, so we choose the torque at each motor as the variable we control. This is quite a difficult task for humans, as we are not directly controlling the angle of each joint, and, more importantly, we have only two hands but 3 legs to control.

Delta Robot Analysis, Photo Credit: Dr. Matt Elwin from Northwestern University

How do we go about this? After some discussion, we agreed that we needed to build 1. an OpenAI Gym environment and 2. a deep reinforcement learner in PyTorch. For the learner, we chose the Deep Deterministic Policy Gradient (DDPG) algorithm. Here is the original paper. After implementing the project, we also wrote a report that reflects our thoughts on the process and some challenges we faced.

OpenAI Gym On PyBullet

This is a very fun part. The physics engine I use is PyBullet 3. It simulates things with relatively low CPU usage and very decent precision. My environment is ready for download. Make sure you have Python 3.6 or above; you can either build from source

git clone

or download through pip

pip3 install gym_delta_robot_trampoline

As the first step, I built the PyBullet simulator of the model. A tutorial on how to build a parallel manipulator in PyBullet is coming soon. In short, the process goes:

  1. Construct a description of the model in the Unified Robot Description Format (URDF)
  2. Load the URDF into PyBullet
  3. Create “constraints” to connect the top platform to all 3 legs. The main challenge in this process is that URDF only supports a “serial chain” robot structure, i.e., one link can only have one parent link, so the URDF alone can connect the top platform to at most one leg.

After the simulator is completed, the second step is to have it provide the information the OpenAI Gym environment needs during training through OpenAI’s pre-built API framework. The necessary information from the PyBullet simulator is: [reward, joint positions, joint velocities, ball position, ball velocity] (yes, this is a “God view”, but in real life this can be achieved with object-tracking techniques from computer vision). During each update step, the PyBullet simulator takes the joint actions [τ0, τ1, τ2] as input from OpenAI Gym.
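A hedged skeleton of that simulator/Gym interface is below. All names here are illustrative, not the package’s actual API; the real environment lives in the `gym_delta_robot_trampoline` repo:

```python
import numpy as np

class DeltaRobotTrampolineEnvSketch:
    """Illustrative skeleton of the Gym-style update loop."""

    def __init__(self):
        # 18-d state: joint angles/velocities plus ball and
        # end-effector positions/velocities.
        self.state = np.zeros(18, dtype=np.float32)

    def step(self, action):
        # action = [tau0, tau1, tau2]: motor torques from the agent,
        # clipped to an assumed bound here.
        tau = np.clip(np.asarray(action, dtype=np.float32), -1.0, 1.0)
        # The real env would now apply tau via
        # p.setJointMotorControlArray(..., controlMode=p.TORQUE_CONTROL),
        # call p.stepSimulation(), and read the new states back.
        obs, reward, done = self.state.copy(), 0.0, False
        return obs, reward, done, {}
```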

In terms of how the gym is implemented, please check out my github:

An important note for training: shape the reward function explicitly to induce the intended behavior. Oftentimes, the robot does not learn what is not reflected in the reward. To make the robot learn to bounce as many times as possible, we shaped our incremental reward as:

  • +10 if the ball rises more than a height threshold $H$ above the end-effector
  • +0 if the ball is above the platform but by less than $H$
  • -0.1 for every timestep that the robot is idle
  • -100 if the ball falls below the end-effector. Once the ball falls below the end-effector, the game is over.
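The shaping above can be transcribed almost directly. In this sketch, the value of H and the idle test are assumptions, not the exact values from our environment:

```python
def incremental_reward(ball_z, ee_z, idle, H=0.3):
    """Per-timestep reward from ball and end-effector heights.

    Returns (reward, done). H and the idle flag are assumed inputs.
    """
    # Episode ends when the ball drops below the end-effector.
    if ball_z < ee_z:
        return -100.0, True
    # +10 for clearing the height threshold, +0 otherwise.
    r = 10.0 if ball_z - ee_z >= H else 0.0
    # Small penalty for every timestep the robot stays idle.
    if idle:
        r -= 0.1
    return r, False
```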

As a side note, Gerard Maggiolino and Mahyar Abdeeteda have some awesome tutorials on how to create an OpenAI environment on PyBullet, as well as how to package it for pip installation.

Deep Deterministic Policy Gradient (DDPG)

As a natural extension of the Deep Q-Network, DDPG not only has a “critic network” that evaluates the Q(s,a) value, the total future reward of a (state, action) pair, but also an “actor network” that learns a deterministic mapping from states → actions. The action from the actor is fed into the critic; during training, we do gradient ascent on the actor’s parameters to maximize the critic’s Q value, while the critic itself is trained with loss L.

Loss L is the squared difference between the Q value from the critic network and the TD target built from the target critic network (the top target network in the illustration). The target critic network can be thought of as an older copy of the critic network, and its purpose is to provide a stable “reference” Q value for the TD target. Also, according to the original paper, it provides stability for all the models’ parameters while they converge.
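The two updates can be sketched in a few lines of PyTorch. The single-layer networks and random batch below are stand-ins purely for showing the arithmetic of the critic loss and the actor’s gradient ascent:

```python
import torch

state_dim, action_dim, gamma = 18, 3, 0.99
actor       = torch.nn.Linear(state_dim, action_dim)
actor_targ  = torch.nn.Linear(state_dim, action_dim)
critic      = torch.nn.Linear(state_dim + action_dim, 1)
critic_targ = torch.nn.Linear(state_dim + action_dim, 1)

# One sampled batch of transitions (s, a, r, s') -- random stand-ins here.
s, a  = torch.randn(32, state_dim), torch.randn(32, action_dim)
r, s2 = torch.randn(32, 1), torch.randn(32, state_dim)

# Critic loss L: squared TD error against a target built from the target nets.
with torch.no_grad():
    y = r + gamma * critic_targ(torch.cat([s2, actor_targ(s2)], dim=1))
critic_loss = torch.mean((critic(torch.cat([s, a], dim=1)) - y) ** 2)

# Actor update: gradient ascent on Q(s, mu(s)), i.e. minimize its negative.
actor_loss = -critic(torch.cat([s, actor(s)], dim=1)).mean()
```

In a full implementation, each loss is followed by an optimizer step on its own network, and the target networks are slowly blended toward the live ones.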

During my implementation, I found Fujimoto’s DDPG implementation very inspiring. My own implementation can be found at

Framework of DDPG, Source: Kezhi Wang

The actor network’s main purpose is to learn a “deterministic” policy function: given the current system state, it outputs an action value in the continuous action space. The input of the actor network is a (1, 18) vector that represents the system state, s_t: 3 joint angles, 3 joint velocities, the Cartesian ball position, ball velocity, end-effector position, and end-effector velocity. In total there are 3 hidden layers, and the output layer is a tanh layer whose output is multiplied by the maximum joint action to give a joint torque. We select our joint action to be the torque τ applied on a joint by its motor, i.e., a ∈ [τ_min, τ_max].

The input of the critic network is a (1, 21) vector, the concatenation of the actor network’s output $a$ and the system state s. The network consists of three fully connected layers with a scalar output; a ReLU follows each hidden layer.
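A sketch of both networks in PyTorch is below. The hidden width of 256 and the maximum torque value are assumptions for illustration; the real values live in our implementation:

```python
import torch
import torch.nn as nn

TAU_MAX = 2.0  # assumed maximum joint torque for scaling the tanh output

class Actor(nn.Module):
    """18-d state -> 3 torques; tanh output scaled to [-TAU_MAX, TAU_MAX]."""
    def __init__(self, state_dim=18, action_dim=3, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim), nn.Tanh())

    def forward(self, s):
        return TAU_MAX * self.net(s)

class Critic(nn.Module):
    """(state, action) -> scalar Q value; ReLU after each hidden layer."""
    def __init__(self, state_dim=18, action_dim=3, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1))

    def forward(self, s, a):
        return self.net(torch.cat([s, a], dim=1))
```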

For both the critic and the actor, since we do not expect any obvious symmetry or “deeper features” that convolutional kernels could identify, we do not use any weight-sharing structures and use linear layers instead. The optimizer of choice is Adam, given its invariance to rescaling of the gradients.

Some Practical Considerations

  • It’s very important to have enough randomness at the beginning of training. Otherwise, the policy might converge to a local minimum (for example, not doing anything for the entire episode).
  • Penalizing known local minima (e.g., not doing anything for the entire episode, or not doing anything after bouncing the ball once) cannot guarantee we avoid them, but it does help the robot bounce more times.
  • Weight decay in Adam achieves an effect similar to L2 regularization in stochastic gradient descent (SGD) and can be used to keep model parameters from growing too large. This, however, might be unnecessary when the model parameters are already really close to zero.
  • Batch size is also quite important: a larger batch is more efficient for learning. However, we do not want it so large that loading it slows the GPU down.
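One simple way to get that initial randomness is to add exploration noise to the actor’s output and decay it over training. The Gaussian scheme below is a sketch (the original DDPG paper uses Ornstein-Uhlenbeck noise), and all scale values here are assumptions:

```python
import numpy as np

def noisy_action(policy_action, step, sigma_start=0.5, sigma_end=0.05,
                 decay_steps=10_000):
    """Add Gaussian exploration noise: large early in training, small late."""
    frac = min(step / decay_steps, 1.0)
    sigma = sigma_start + frac * (sigma_end - sigma_start)
    return policy_action + np.random.normal(0.0, sigma,
                                            size=np.shape(policy_action))
```

During evaluation, the noise is switched off and the deterministic policy action is used directly.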

Results and an Open Question

After around 2 hours of training on an Nvidia GeForce GTX 1080 GPU, we achieved a model that can bounce the ball 8 times.

Delta Robot Performance After Training

However, an open problem remains: our model did not converge to a single optimal solution during training. We measured the average reward periodically during training, and it looks like this:

Divergence Observed During Training

We tried adjusting the randomness in exploration and enabling amsgrad in the Adam optimizer, and the model still would not converge. If you have seen a similar issue and have an idea of what might mitigate it, I am happy to hear from you. Otherwise, we know that there may not be a “perfect” solution, but we can always get closer to it:

No one knows what the right algorithm is, but it gives us hope that if we can discover some crude approximation of whatever this algorithm is and implement it on a computer, that can help us make a lot of progress — Andrew Ng

A singer, a gym goer, a robot enthusiast, a human being. MSc. Robotics Student at Northwestern University
