Speed Trap

Introduction

Police officers frequently use radar- and laser-based trackers to determine the velocity of passing traffic, and use this information to cite drivers who are in violation of local speed ordinances. However, this requires an individual police officer to point his or her tracker at each passing car in order to obtain velocity information. In addition, these active sensing devices are subject to detection and/or jamming.

In this paper, we propose SpeedTrap: a passive, optically-based system for tracking the speed of passing traffic, one which could easily leverage the inventory of existing traffic cameras in order to perform speed detection and traffic analysis. We discuss methods for calibrating a camera to transform pixels into real-world units for computing speed, segmenting cars from a potentially dynamically-changing background, and tracking the speed and progress of these cars as they pass down the road.


Related Work

Previous work on traffic tracking includes VISATRAM, a system that automatically monitors traffic seen through a TV camera placed above a highway. The system forms a panoramic view and an epipolar plane image for each lane of the highway. This information can then be used to count vehicles and estimate their speed.

Work has also been done on individual stages of traffic tracking and monitoring. Francois and Medioni describe a method to adaptively model a background in real time, though not specifically for traffic tracking purposes. Bevilacqua describes the importance of correctly thresholding the error images acquired by subtracting a frame from the background image, and presents an image segmentation method that takes advantage of information that passes through the threshold without being affected by high levels of noise. Since our project looked at a highly constrained scenario, we obtained good results without such a segmentation method; however, it would be useful if we wished to generalize our system.

References

- Bevilacqua, Alessandro. "Effective Object Segmentation in a Traffic Monitoring Application."

- Francois, Alexandre and Gerard Medioni. "Adaptive Color Background Modeling for Real-Time Segmentation of Video Streams."

- Zhu, Zhigang, G. Xu, B. Yang, D. Shi and X. Lin. "VISATRAM: A real-time vision system for automatic traffic monitoring."


Calibration

The input to our tracking algorithm is an image acquired from the traffic camera. In order to use the image pixels to compute real-world distances and speeds, we must calibrate the camera. We make the following assumptions about the camera and the road.

  1. The road is flat (all points lie along a single plane)

  2. The road is straight (traffic motion is parallel to some axis)

  3. Occlusions are minimized (background regions separate individual cars)

The assumptions are not rigid, but the closer the data matches the assumptions, the better the results of the algorithm.

We seek a transformation from the image space into "road space". This is a 2-dimensional coordinate system that is aligned with the plane of the road. The x-axis of road space corresponds to the direction perpendicular to traffic flow, and the y-axis of road space is parallel to traffic flow. Once a car has been placed in road space, it is trivial to determine its velocity from the change in Euclidean distance between points over time.

The transformation from screen space to road space is a projective transformation. We perform this by representing points in screen space as 2D homogeneous coordinates (three components), where (u, v) maps to (u, v, 1) in the homogeneous representation, and a homogeneous point (u', v', w) maps back to (u'/w, v'/w) in the non-homogeneous representation. This allows a single 3x3 matrix to perform both translation and projection.
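For concreteness, applying such a transformation to a single point in MATLAB looks like the following sketch (TM is an assumed name for the 3x3 matrix; the released code may organize this differently):

% Map a screen-space pixel (u, v) into road space via a 3x3 projective matrix TM.
p = TM * [u; v; 1];          % lift (u, v) to homogeneous coordinates
p = p(1:2) / p(3);           % perspective divide back to (x, y) in road space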

The projective transformation can be characterized by selecting 4 points in screen space and providing their correspondences in road space. Because our interest is in vehicle speed, only relative road-space distances are significant. Therefore, we require the user to calibrate the camera by selecting the four points of a quadrilateral in screen space that corresponds to a rectangle in road space, and to provide the dimensions of that rectangle. In our system, we have found it useful to select a rectangle one lane wide, spanning between the start and end of the dashed lane dividers.

Let us refer to the four screen-space points selected by the user as a, b, c, d. When selected in counter-clockwise order around the rectangle with width w and height h, these correspond to the relative road space points (0,0), (w,0), (w,h), (0,h). Therefore, we can state our problem as finding the best transformation matrix that maps the screen-space coordinates to their corresponding road-space coordinates.

To find this, we first use least-squares optimization to directly compute the optimal affine transformation from screen space to road space. The affine transformation is the best possible result from least squares, as the full projective transformation is a non-linear operation and the coordinate matrices are rank-deficient, so the projective entries G and H of the transformation matrix will always be 0 after least squares.

This produces a reasonable approximation; however, it does not consider the projective effects of the camera, and will not take object depth into account when computing distances. Therefore, we use this affine transformation as the initial guess for an unconstrained minimization, in which we wish to find the best projective matrix that transforms the screen-space points into the road-space points. This can be expressed with the objective function described below.

We wish to minimize the distances in both dimensions between the transformed screen-space points and the corresponding road-space points; the expanded form of this objective, after the matrix multiplications and perspective divisions, is shown below. We use the MATLAB Optimization Toolbox to minimize it, using a quasi-Newton solver with a numerically computed gradient. This was shown to be fast and reliable.
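Writing the unknown matrix entries as A through H (with the bottom-right entry fixed to 1), the screen-space points as (u_i, v_i), and the corresponding road-space targets as (x_i, y_i), the objective takes the following form (a reconstruction from the description above):

E = \sum_{i=1}^{4} \left[ \left( \frac{A u_i + B v_i + C}{G u_i + H v_i + 1} - x_i \right)^2 + \left( \frac{D u_i + E v_i + F}{G u_i + H v_i + 1} - y_i \right)^2 \right]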

[UPDATE 03/29/2011: After some people contacted me about the above, I have found that it is unreliable because of a tendency to get trapped in local minima. Instead, one can directly compute the homography between the two sets of points, and I have replaced this computation with code generously distributed by Peter Kovesi.]
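For reference, the direct computation has roughly the following shape (a standard DLT sketch with our own variable names; the released code uses Peter Kovesi's homography2d rather than this function):

% Estimate the 3x3 homography H mapping points p to points q by the direct
% linear transform (DLT). p and q are 2xN matrices of corresponding points.
function H = fit_homography(p, q)
    N = size(p, 2);
    A = zeros(2*N, 9);
    for i = 1:N
        u = p(1,i); v = p(2,i); x = q(1,i); y = q(2,i);
        A(2*i-1, :) = [u v 1  0 0 0  -x*u -x*v -x];
        A(2*i,   :) = [0 0 0  u v 1  -y*u -y*v -y];
    end
    [~, ~, V] = svd(A);              % the null vector of A gives the homography
    H = reshape(V(:, end), 3, 3)';
    H = H / H(3,3);                  % normalize so the bottom-right entry is 1
end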

We can visualize the results of the calibration by transforming the points in road space back into screen space using the inverse transformation, and plotting them on the screen. As can be seen in the figure, the affine transformation (blue quadrilateral) produces a reasonable approximation, but is limited to a parallelogram-shaped region, as the affine transformation preserves the parallelism of the rectangle's sides.

The projective transformation is able to accurately reproduce the points, as shown by the red quad. In screen space this takes a trapezoidal shape: the front and back edges remain parallel, but the parallelism of the sides is not maintained. This transformation correctly accounts for the effects of distance.

Once the camera has been calibrated, it is useful to identify the locations of the lanes in road space and screen space. The user is prompted to click on the center of each lane to be tracked. For each point (xs, ys), the system maps this to a road-space point (xr, yr) and represents the lane in road space as the line x = xr. In the figure, we have shown the line mapped back into screen space and drawn over the background image.

It is also necessary to define the beginning and ending regions of interest along each lane. This ensures that cars too far in the distance to be accurately tracked will not distort the statistics, and it allows the system to provide a starting region for tracking initialization. The user is prompted to click near the start and end of the region of interest of each lane line. The screen-space points are then transformed into road space, and the points are projected onto the lane line. This 3-step process (selecting the lane, then selecting each point) is necessary because of the difficulty of selecting points on parallel lines in the projective coordinate system. To ensure that the lanes are parallel, we select the lanes first, and to ensure that each starting and ending point lies directly on the appropriate line, we select these points separately.

Now that this calibration information has been computed by the system, the tracking phase of the system can translate pixel distance into real world distances, and use changes over time to compute speeds.


Background Detection

We locate cars in a frame by looking for pixels that differ by some minimum amount from a standard background image. Therefore the first step was to find an efficient and accurate way to generate a background image from the given frames.

We experimented with several techniques for generating a background image. These fall into three categories:

  • Histogram-based methods

  • Adaptive statistical learning

  • History-based background detection

Histogram-based methods

The techniques in this class involve maintaining a color histogram for each pixel over a given set of frames. Assuming a pixel will most frequently have its background RGB value, its color in the final background image corresponds to the most popular value in its histogram.

We implemented two versions of the histogram-based background generator. The first maintains a separate histogram for each of the color channels for each pixel, and carries out median filtering on the final background image. The second associates with each pixel a single histogram of all possible RGB combinations; also, a fraction of the histogram values from adjacent pixels is added to the current pixel's histogram values. The former method takes less memory - O(n) instead of O(n³), where n is the number of possible values per channel - but we expected the latter to be more accurate.
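A minimal sketch of the first (per-channel) variant is shown below; variable names are illustrative and the released code may differ:

% Build a background image by taking, for each pixel and channel, the most
% frequent value over all frames. frames is an H x W x 3 x F uint8 array.
h = size(frames, 1);  w = size(frames, 2);
bg = zeros(h, w, 3, 'uint8');
for c = 1:3
    for i = 1:h
        for j = 1:w
            vals = double(squeeze(frames(i, j, c, :)));
            counts = histc(vals, 0:255);      % per-pixel, per-channel histogram
            [~, idx] = max(counts);           % most popular value wins
            bg(i, j, c) = idx - 1;
        end
    end
    bg(:, :, c) = medfilt2(bg(:, :, c), [3 3]);   % median filter the final image
end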

The results were somewhat surprising, since the second method did not appear significantly less noisy than the first. On the other hand, both methods produced reasonably accurate images with only a few frames as input. The images below were acquired from an input set of 10 frames.

With a separate histogram for each color channel, some noise exists in the form of colorful dots, as shown below. Median filtering reduces the noise but does not eliminate it.

The results with a single histogram for all possible RGB combinations also contain some noise, though not as colorful as the previous image. The spillovers between histograms of adjacent pixels reduced noise to the level of the image below.

The small number of frames used as input can explain the noisy regions. Unfortunately, these methods were so slow - up to 20 seconds per frame - that using more frames was unreasonable. It also made these methods inappropriate for our project, as our goal was to build a real-time or almost-real-time system.

Adaptive statistical learning

In "Adaptive Color Background Modeling for Real-Time Segmentation of Video Streams" François and Medioni describe an algorithm for learning the background image over a set of frames. The algorithm stores the mean m and standard deviation s for the hue, saturation and intensity distributions of each pixel, over all frames up to the current one. The means of these distributions correspond to the current value of the background image.

When a new frame is read, each pixel is marked as occluding or background depending on whether its value falls within two standard deviations of the current means. The background model for the pixel is then updated according to the following equations:

m = (1 - a) m + a x

s² = max( s_min², (1 - a) s² + a (x - m)² )

Here, s_min refers to a minimum standard deviation, x is the observed value of the pixel, and a is the learning rate of the model.
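As an illustration, the per-pixel update for one new frame might look like the following vectorized MATLAB sketch (array names are our own, and the hue channel's circularity is ignored here for simplicity):

% m and s2 hold the per-pixel means and variances (H x W x 3, in HSV space),
% x = rgb2hsv(frame) is the new observation, a is the learning rate.
occluding = any(abs(x - m) > 2 * sqrt(s2), 3);      % outside 2 std devs => foreground
m  = (1 - a) .* m + a .* x;                         % update the running mean
s2 = max(s2_min, (1 - a) .* s2 + a .* (x - m).^2);  % update the variance, with a floor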

We implemented this algorithm as described above and found it to run much faster than the histogram-based methods. The images below display the results with high and low learning rates.

With a low learning rate of 0.02, it takes many frames for an accurate background image to be generated. The image below is the result after 52 frames.

With a high learning rate of 0.2, the background is generated in fewer frames, but some noisy pixels are added, as seen below. Also, the resulting background image is unstable, as it rapidly adapts, incorporating newly seen cars into the background before eventually erasing them.

We also tested a variation of this method that preempts the learning process if a pixel is identified as occluding for five consecutive frames. In this case, the mean is reset to the current observed value and the standard deviation is updated according to the new mean. The results did not greatly improve on the original algorithm:


Another variation we experimented with retains the preempting but operates in RGB instead of HSV space. This yielded more satisfactory results, as shown below.

With a low learning rate, it took longer to obtain an accurate background image. However, the preemption made intermediate images look more valid, as seen below. It took many more frames for the noisy color patches to disappear.

With a high learning rate, the background adapted quickly. The image below suggests that this worked well. However, the problem of new cars being incorporated into the background before they disappeared remained. If, for example, the final frame in the input included a car, the background image would contain traces of the car.

History-based background detection

We finally selected a history-based adaptive background detection method. This maintains a mean for each pixel. However, unlike the previous method it maintains the mean over a fixed-size history instead of over all previously seen frames. If the variance of the pixel over that history is below a conservative threshold, the pixel is considered stable. Only stable pixels are added to the background image.

This method converges to an accurate background image more rapidly than the low-learning-rate RGB statistical learning method described above. At the same time, it is much more stable than the RGB statistical learning method with a high learning rate; new cars do not affect the existing background, as they rarely remain in the same place long enough to be considered part of the background. We expect that more global, longer-term changes, such as changes in lighting, would eventually be reflected in the background.
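A minimal sketch of this update for one new frame is shown below (buffer layout and variable names are assumptions, not necessarily those of the released code):

% histbuf is an H x W x 3 x K circular buffer holding the last K frames,
% k is the slot to overwrite, bg is the current background image.
histbuf(:, :, :, k) = double(frame);
m = mean(histbuf, 4);                                  % per-pixel mean over the history
d = histbuf - repmat(m, [1 1 1 size(histbuf, 4)]);
v = max(mean(d.^2, 4), [], 3);                         % worst-channel variance per pixel
stable = v < var_threshold;                            % conservative stability test
for c = 1:3                                            % only stable pixels enter the background
    bgc = bg(:, :, c);  mc = m(:, :, c);
    bgc(stable) = mc(stable);
    bg(:, :, c) = bgc;
end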



Vehicle Tracking

We decided that the goal of the tracking system was to generate per-lane traffic statistics at real-time speeds, as would be used on a major highway to report traffic flow information. As such, it was reasonable to assume a couple of things about the input footage. First, we require that the camera be fixed in location. A camera strictly dedicated to analyzing traffic on a road would probably be mounted on a pole or overpass and pointed in one direction, making this a reasonable requirement. Second, we require an overhead view of the traffic. This second requirement avoids the need to track multiple layers of traffic. With an overhead view, the system can simply analyze one area of the frame and know that this area belongs to only one lane of traffic. With side views, cars may be occluded by other cars or foreign objects.

With these two restrictions on the input, it becomes feasible to analyze the traffic with minimal calculation. In short, we subtract the background from an input frame and create a mask based on some error threshold. We then analyze this mask to track moving objects along the road. The key observation is that almost all of the movement along the road occurs in the same direction. With this in mind, we can reduce the two-dimensional motion problem to the one-dimensional problem of tracking a vehicle within its lane.

In order to make this simplification, we define a region of interest in our input frames, one region for each traffic lane. The region of interest defines the center line of a lane as well as the beginning and end of the observation window. Once we subtract the background and generate our mask, we are left with a line of 0's and 1's, where 0's indicate background and 1's indicate the presence of a car. Then, from one frame to the next we only need to follow patches of 1's along the line in a consistent manner. By tracking the bottom of these patches and calculating the differences between them, we can determine how many pixels each car moved along the line between each frame. Using the calibration data, we can map these pixels to a virtual road space and determine the distance in feet. Simple calculation leads to a speed for a car in miles per hour, and in the aggregate case, an average speed for the lane.
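As a concrete sketch, converting the tracked motion of a car bottom between two frames into miles per hour might look like this (TM, fps, and the point names are assumptions):

% p_prev and p_curr are the [u; v] pixel positions of the car bottom in
% consecutive frames, TM maps homogeneous screen points to road space (feet),
% and fps is the camera frame rate.
r_prev = TM * [p_prev; 1];  r_prev = r_prev(1:2) / r_prev(3);
r_curr = TM * [p_curr; 1];  r_curr = r_curr(1:2) / r_curr(3);
feet_per_frame = norm(r_curr - r_prev);        % Euclidean distance in road space
mph = feet_per_frame * fps * 3600 / 5280;      % feet per frame -> miles per hour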

The implementation of the above system is straightforward. For each region of interest (each traffic lane being analyzed), we store its coordinates in pixel space, the number of cars it has seen, the average speed of the lane, and an array of car structures with an entry for each car currently in the lane. In an initialization step, we extract the background image of the line by pulling corresponding entries out of the input background image.
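For illustration, the per-lane state might be represented along these lines (field names are our own, not necessarily those of the released code):

% State kept for one region of interest (one traffic lane).
lane.line_px   = lane_pixels;   % pixel coordinates of the lane's line of interest
lane.bg_line   = bg_line;       % background colors sampled along that line
lane.num_cars  = 0;             % cars counted so far
lane.avg_speed = 0;             % running average speed of the lane (mph)
lane.cars      = struct('bottom', {}, 'speeds', {});  % cars currently being tracked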

For each lane in each frame, we extract the pixels on the line of interest for the lane and subtract our background information. We then compare the sum of squared differences with a threshold, and generate a mask based on this comparison. The threshold is a parameter of the system and depends on the noise sensitivity of the camera. We perform a median filter on this vector to remove noise in the form of small chunks of black or white that might disrupt the algorithm.
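A sketch of this per-lane mask computation for one frame, with illustrative names:

% line_px and line_bg are N x 3 matrices of RGB values sampled along the
% lane's line of interest in the current frame and in the background image.
err  = sum((double(line_px) - double(line_bg)).^2, 2);  % per-pixel squared error
mask = err > err_threshold;                             % 1 = car, 0 = background
mask = medfilt1(double(mask), 5) > 0.5;                 % median filter removes small specks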

The images below show the tracking process for 3 frames. The first 2 frames are consecutive. Adjacent to the frames are images of the lines of interest for the middle lane before thresholding, after thresholding but before the median filter, and after the median filter. White pixels indicate the presence of a car, and black pixels indicate the background.



We then examine the beginning of our line of interest and check for the case where we saw a positive error in the previous frame but see no error now. This suggests that a new car may have just completely entered the region of interest. The algorithm proceeds up from the bottom to find the next positive error and thus locate the bottom of the new car. A check is made to find the top of the positive error chunk to determine the length of the car. Cars that do not meet a minimum length requirement are rejected as noise; these are usually due to visual aberrations in the input image near the beginning of the lane. If a new car is indeed found, it is added to the cars array of the lane.

For each existing car in the lane's car array, we start at the last known position of the bottom of the car and look forward along the line to find the next positive error chunk in the current frame. We take this as the new bottom of the car and calculate the distance between the two positions in road space, using the transformation from the calibration procedure. This per-frame distance in feet is converted to miles per hour and appended to the car's existing speed information.

If a search for the new bottom of a car reaches the end of the line, the car has exited the region of interest. In this case, we update the lane statistics and calculate the average speed of the car as the average of the entries in its speed array. This average speed is used to update the average traffic speed of the lane.
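Putting these pieces together, the per-frame update for one tracked car might look like the following sketch (names are illustrative; pixels_to_mph is a hypothetical helper wrapping the road-space conversion shown earlier):

% mask is the lane's filtered 0/1 vector, with indices increasing along the
% direction of travel; car.bottom is the last known index of the car's bottom.
next = find(mask(car.bottom:end), 1, 'first');
if isempty(next)                                   % no error chunk ahead:
    lane.num_cars = lane.num_cars + 1;             % the car has exited the region
    car_avg = mean(car.speeds);                    % average its per-frame speeds
    lane.avg_speed = lane.avg_speed + (car_avg - lane.avg_speed) / lane.num_cars;
    car.done = true;                               % mark the car for removal
else
    new_bottom = car.bottom + next - 1;            % new bottom of the car
    car.speeds(end+1) = pixels_to_mph(car.bottom, new_bottom);   % hypothetical helper
    car.bottom = new_bottom;
end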


Watch a movie showing our results. [Currently (3/29/2011) not functional. I will try to get this up soon.]


Conclusions

We were impressed with the results of our system. SpeedTrap efficiently and reliably tracks vehicle speeds as cars travel down the road, and does so with a reasonable degree of accuracy. Notably, the tracking is very fast, and can handle 15 frames per second in real time. We suggest that the techniques used in our system would form a reasonable basis for a production speed-tracking system, if not for issuing citations, then at least for automated measurement of traffic conditions.

Some additional work could also be done. It might be useful to track multiple lines per lane, so that smaller vehicles, such as motorcycles, could be tracked more robustly. We could also add additional logic to track vehicles across lanes. There are some issues with robustness, such as handling tall trucks, but overall our system showed a strong degree of reliability.


Downloads

You may use this code for any purpose, either commercial or non-commercial. In a distribution of executables based on this code, attribution, though greatly welcomed, is not required. However, I ask that if you intend to distribute the source code, please include attribution to myself, Mike Burns, and Ananya Misra, and link to this page.

Important: Please DO NOT submit this work as if it were your own in any case, but in particular for any class assignment or project, or as a thesis or dissertation. I have been informed in several cases that this has occurred. This is a very serious violation of academic ethics.

SpeedTrap.zip - Code and data files for download

SpeedTrap.ppt - Class presentation

speedtrap_dataset1.mp4 - H.264-compressed frames for our testing sample

speedtrap_calibration.mat - calibration data for the sample dataset above

speedtrap_dataset2.mp4 - an additional H.264-compressed dataset


Usage

First, unzip the code into a directory, and extract the frames from the dataset into a subdirectory. To extract the frames, you can use the following, assuming that FFmpeg is installed.

mkdir speedtrap_data1
ffmpeg -i speedtrap_dataset1.mp4 speedtrap_data1/%04d.jpg

You can call the driver routine SpeedTrap.m using the following:

SpeedTrap('speedtrap_data', 0, 2000, 10, 15, .0015, 15)

It will first run the background detection routine, and then prompt you to select a quad for camera calibration. If you are using the sample dataset, click on the points as in the above image. Start at the lower left of the red box, then lower right, then upper right, then upper left. Then click on the center of each lane, from left to right, and press ENTER when done. Finally, click at the bottom and top of each lane to set the tracking range.

Note that the driver is hard-coded to assume that the width of the quad is 15 feet, and the height 40 feet. To change this, alter the commands in the driver that look like this:

world_quad_width=15;
world_quad_height=40;
[TM, lane_x, lane_start, lane_end] = CalibrateFunction(bgimage, world_quad_width, world_quad_height);

Hopefully, this should all work. Please note that you may need to play around a bit with the parameters. If so, it is probably best to run the individual steps separately as needed, rather than running everything from the beginning each time. If you are having problems with either the background detection or the camera calibration, download speedtrap_calibration.mat, put it in your speedtrap code directory, and do the following:

load speedtrap_calibration.mat
TrackCars('speedtrap_data', 0, 1000, bgimage, 10, 15, .001, 20, 0, lane_x, lane_start, lane_end, TM)

 


Copyright © 2004 Christopher DeCoro, Mike Burns, Ananya Misra