Santa Monica Spaces: Approach to Transforming the Data

If you read my last post, you may remember that there were a few issues we need to overcome before putting the data into any sort of machine learning algorithm. Namely, the data could potentially be noisy (i.e. lots of events from one meter within a few seconds); the data is unbalanced; and the raw data, which is in the form of events, is not in the ideal format: what we’d prefer to know is which meters were occupied at any given time. So, how are we going to do this with what we have?

Enter Meter Sessions

The key, as I mentioned in the previous post, is to take advantage of sessions. Each sensor_event from the Santa Monica API contains both an event_type and a session_id field, which can be used to construct sessions: periods of time during which a given parking meter was occupied. By constructing all of the sessions in our data, we can go back and query our data set to see whether or not each parking meter was occupied at a given time. To show how this works, I’ve prepared some graphs with dummy data below to illustrate the concept on a smaller scale.

Example Scenario

In this example, assume we have two parking meters in our town that send out data in the form of events (in a similar manner to Santa Monica’s meters). Whenever somebody drives over or leaves Meter 1, we receive that information and it is stored in our database. Same goes for Meter 2. Let’s say that we decided to take a look at a 12-hour snapshot of this event data. That data might look something like this (NOTE: the data in this example is simplified for the sake of illustration):

Raw Event Data
[
	{"event_id": 11, "event_time": 1, "event_type": "SS", "meter_id": 2, "session_id": 9},
	{"event_id": 12, "event_time": 2, "event_type": "SE", "meter_id": 1, "session_id": 8},
	{"event_id": 13, "event_time": 4, "event_type": "SS", "meter_id": 1, "session_id": 10},
	{"event_id": 14, "event_time": 5, "event_type": "SE", "meter_id": 1, "session_id": 10},
	{"event_id": 15, "event_time": 7, "event_type": "SE", "meter_id": 2, "session_id": 9},
	{"event_id": 16, "event_time": 8, "event_type": "SS", "meter_id": 1, "session_id": 11},
	{"event_id": 17, "event_time": 9, "event_type": "SS", "meter_id": 2, "session_id": 12},
	{"event_id": 18, "event_time": 10, "event_type": "SE", "meter_id": 1, "session_id": 11}
]

As you can see, we have an array of JSON objects, each of which represents an event. Inside we find a unique event_id, which allows us to find this particular event amongst a sea of others; an event_time, which tells us exactly when the event occurred; the event_type, which identifies whether the event represents somebody entering the space ("SS") or leaving the space ("SE"); meter_id, which lets us know which of our two meters sent this event; and session_id, which connects two events together.
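If you’d like to poke at these events in Python, here is a minimal sketch of one way to load them into typed records. The SensorEvent class is a name made up for this illustration (it is not part of the Santa Monica API), and event_time is simply the hour offset used in this toy example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SensorEvent:
    """Hypothetical record type mirroring the fields of the sample events."""
    event_id: int
    event_time: int   # hours into the snapshot (toy example only)
    event_type: str   # "SS" = session start, "SE" = session end
    meter_id: int
    session_id: int


raw_events = [
    {"event_id": 11, "event_time": 1, "event_type": "SS", "meter_id": 2, "session_id": 9},
    {"event_id": 12, "event_time": 2, "event_type": "SE", "meter_id": 1, "session_id": 8},
    {"event_id": 13, "event_time": 4, "event_type": "SS", "meter_id": 1, "session_id": 10},
    {"event_id": 14, "event_time": 5, "event_type": "SE", "meter_id": 1, "session_id": 10},
    {"event_id": 15, "event_time": 7, "event_type": "SE", "meter_id": 2, "session_id": 9},
    {"event_id": 16, "event_time": 8, "event_type": "SS", "meter_id": 1, "session_id": 11},
    {"event_id": 17, "event_time": 9, "event_type": "SS", "meter_id": 2, "session_id": 12},
    {"event_id": 18, "event_time": 10, "event_type": "SE", "meter_id": 1, "session_id": 11},
]

# Turn each dictionary into a SensorEvent so fields are accessed by name.
events = [SensorEvent(**raw) for raw in raw_events]

first = events[0]
print(first.meter_id, first.event_type, first.event_time)
```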

Even though this sample data set is small, it’s already hard to get a good handle on what exactly is going on. Let’s start doing some simple visualizations and try to get a better picture, step by step.

Step 0: Organize Events By Meter

First graph of data. All events are seen as dots, and the only difference between them we can discern is which meter they came from.

The first thing we do is separate the events by parking meter. Here, I’ve graphed the parking meter ID on the y-axis so that each meter has room to organize its own events along the x-axis (time).

Using this, it’s much easier to see that Meter 1 (in blue) has five events, while Meter 2 (in red) has three events in this 12-hour snapshot. This by itself isn’t particularly useful. At best, we get an idea of how busy the meters are in relation to one another, but we don’t have any idea when either meter is occupied or available. Let’s apply the event_type property to our chart and see what things look like.
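For anyone following along in code, the grouping in this step might look something like the sketch below. It only uses the sample events from earlier; the variable names are my own.

```python
from collections import defaultdict

events = [
    {"event_id": 11, "event_time": 1, "event_type": "SS", "meter_id": 2, "session_id": 9},
    {"event_id": 12, "event_time": 2, "event_type": "SE", "meter_id": 1, "session_id": 8},
    {"event_id": 13, "event_time": 4, "event_type": "SS", "meter_id": 1, "session_id": 10},
    {"event_id": 14, "event_time": 5, "event_type": "SE", "meter_id": 1, "session_id": 10},
    {"event_id": 15, "event_time": 7, "event_type": "SE", "meter_id": 2, "session_id": 9},
    {"event_id": 16, "event_time": 8, "event_type": "SS", "meter_id": 1, "session_id": 11},
    {"event_id": 17, "event_time": 9, "event_type": "SS", "meter_id": 2, "session_id": 12},
    {"event_id": 18, "event_time": 10, "event_type": "SE", "meter_id": 1, "session_id": 11},
]

# Bucket the events by meter, keeping each bucket in chronological order.
events_by_meter = defaultdict(list)
for event in sorted(events, key=lambda e: e["event_time"]):
    events_by_meter[event["meter_id"]].append(event)

# Count how many events each meter produced in the snapshot.
counts = {meter: len(meter_events) for meter, meter_events in events_by_meter.items()}
print(counts)
```

The counts match what the chart shows: five events for Meter 1 and three for Meter 2.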

Step 1: Identify Event Type (Start or End Event)

First transformation of the data now identifies some events as 'Start Events' and others as 'End Events'

Now that we’ve marked which events are start events ("SS") and which are end events ("SE"), we can say a little more about the data. For example, a car pulled into Meter 2 at 1:00 and left at 7:00. At Meter 1, a car left at 2:00, and another pulled in at 4:00. It’s starting to become clear how these events are connected together, but let’s actually connect them together by their session_id.

Step 2: Connect Events by Session

Each 'start event' pairs up with one 'end event' to create horizontal line segments.

Things are starting to fall into place. We’ve connected start events and end events, creating the sessions shown above. A session, as stated before, is a period of time where a meter is occupied. The one thing we are inferring from our event data is the extent of the two sessions that go off the edge of the graph: Meter 1, session 8, and Meter 2, session 12. Because we don’t have matching events for those sessions, we have to make some assumptions:

  • If the first event we see from a parking meter is an end event, then there must have been a start event before our observation period, and thus the session was active from at least the start of our observation period.
  • If the last event we see from a parking meter is a start event, then the session must either still be going on, or the end event was after our observation period. In either case, that session must have been active at least until the end of our observation period.

When working with the full dataset, we’re going to avoid making any such assumptions by truncating our data on either side. When you have many months of data, cutting off a few hours will not make a huge difference, and it’s best to keep the data as unbiased as we can.
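Here is a rough sketch of how the pairing in Step 2 could be done for the toy data. The build_sessions function is a hypothetical helper, not code from the project. As a simplification, it treats any session with a missing start or end event as open-ended and marks the missing side with an infinite bound, meaning "extends past the edge of what we observed".

```python
import math

events = [
    {"event_id": 11, "event_time": 1, "event_type": "SS", "meter_id": 2, "session_id": 9},
    {"event_id": 12, "event_time": 2, "event_type": "SE", "meter_id": 1, "session_id": 8},
    {"event_id": 13, "event_time": 4, "event_type": "SS", "meter_id": 1, "session_id": 10},
    {"event_id": 14, "event_time": 5, "event_type": "SE", "meter_id": 1, "session_id": 10},
    {"event_id": 15, "event_time": 7, "event_type": "SE", "meter_id": 2, "session_id": 9},
    {"event_id": 16, "event_time": 8, "event_type": "SS", "meter_id": 1, "session_id": 11},
    {"event_id": 17, "event_time": 9, "event_type": "SS", "meter_id": 2, "session_id": 12},
    {"event_id": 18, "event_time": 10, "event_type": "SE", "meter_id": 1, "session_id": 11},
]


def build_sessions(events):
    """Pair "SS" and "SE" events that share a session_id (hypothetical helper)."""
    sessions = {}
    for event in events:
        session = sessions.setdefault(
            event["session_id"],
            {"meter_id": event["meter_id"], "start": None, "end": None},
        )
        if event["event_type"] == "SS":
            session["start"] = event["event_time"]
        elif event["event_type"] == "SE":
            session["end"] = event["event_time"]

    # Assumption: a missing start means the session began before the
    # snapshot; a missing end means it was still active when the snapshot ended.
    for session in sessions.values():
        if session["start"] is None:
            session["start"] = -math.inf
        if session["end"] is None:
            session["end"] = math.inf
    return sessions


sessions = build_sessions(events)
for session_id, session in sorted(sessions.items()):
    print(session_id, session)
```

With many months of real data, the open-ended sessions could simply be dropped instead, which is the truncation described above.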

This graph gives us a nice visual indication of how long each car stayed at a meter: any time that falls on a line is occupied. All we have to do to convert this session data into balanced data is to pick a time increment and sample both meters at the same times with that increment. For this example, let’s make the time increment 30 minutes (way too large for the real application, but easier to see here). Below is a graph with vertical grid lines marking half-hour slices:

The same chart as before, but with evenly spaced vertical lines slicing the chart into 30-minute increments.

Great! Now all we have to do is sample.

Step 3: Convert Session Data into Availability Data

There are markers at regular time intervals for both parking meters- the markers are either a red 'x' or a green circle.

Here is what our final data looks like graphed. Green circles represent a meter being available at a given time, and red x’s represent a meter being occupied. These are the data points that we’ll be able to feed into something like a logistic regression or a neural network!

<side note>

You may notice that a meter is being marked “available” at the same time end events occur. That is a particularity that is being used for this example due to the huge time increments we’re using (in the actual data set, each event_time is given to the second). In order to account for this, I reasoned it was fair to say that once an end event occurred, the parking space was immediately available. Although this is less likely to occur frequently in the real data, I will continue to use a non-inclusive right bound on the sessions. Put in other terms, I will say that a parking space is occupied if a session is active at that time, and an active session will be defined by:

the time being
a) greater than or equal to the start event’s time (as a UNIX timestamp)
b) strictly less than the end event’s time (as a UNIX timestamp)
where the two events share the same session_id
</side note>
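In code, the rule from the side note boils down to a half-open interval check. The sketch below is only an illustration: is_occupied is a made-up helper, the sessions are the five from the toy example written as (meter_id, start, end) tuples, and infinite bounds stand in for the two sessions that run off the edge of the graph.

```python
import math

# Toy sessions as (meter_id, start, end); times are hours into the snapshot.
sessions = [
    (1, -math.inf, 2),   # session 8: started before the snapshot
    (1, 4, 5),           # session 10
    (1, 8, 10),          # session 11
    (2, 1, 7),           # session 9
    (2, 9, math.inf),    # session 12: still active when the snapshot ended
]


def is_occupied(meter_id, t, sessions):
    """True if some session on this meter satisfies start <= t < end."""
    return any(
        meter == meter_id and start <= t < end
        for meter, start, end in sessions
    )


print(is_occupied(2, 1, sessions))  # at the start event: occupied
print(is_occupied(2, 7, sessions))  # at the end event: already available
```

The left bound is inclusive and the right bound is not, so a meter counts as occupied at the moment a car arrives and as available at the moment it leaves.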

To see how the final data compares to the raw data we started with, here’s one possible approach to storing it:

Balanced Availability Data
[
	{"time": 0, "meter_1_available": false, "meter_2_available": true},
	{"time": 0.5, "meter_1_available": false, "meter_2_available": true},
	{"time": 1, "meter_1_available": false, "meter_2_available": false},
	{"time": 1.5, "meter_1_available": false, "meter_2_available": false},
	{"time": 2, "meter_1_available": true, "meter_2_available": false},
	{"time": 2.5, "meter_1_available": true, "meter_2_available": false},
	{"time": 3, "meter_1_available": true, "meter_2_available": false},
	{"time": 3.5, "meter_1_available": true, "meter_2_available": false},
	{"time": 4, "meter_1_available": false, "meter_2_available": false},
	{"time": 4.5, "meter_1_available": false, "meter_2_available": false},
	{"time": 5, "meter_1_available": true, "meter_2_available": false},
	{"time": 5.5, "meter_1_available": true, "meter_2_available": false},
	{"time": 6, "meter_1_available": true, "meter_2_available": false},
	{"time": 6.5, "meter_1_available": true, "meter_2_available": false},
	{"time": 7, "meter_1_available": true, "meter_2_available": true},
	{"time": 7.5, "meter_1_available": true, "meter_2_available": true},
	{"time": 8, "meter_1_available": false, "meter_2_available": true},
	{"time": 8.5, "meter_1_available": false, "meter_2_available": true},
	{"time": 9, "meter_1_available": false, "meter_2_available": false},
	{"time": 9.5, "meter_1_available": false, "meter_2_available": false},
	{"time": 10, "meter_1_available": true, "meter_2_available": false},
	{"time": 10.5, "meter_1_available": true, "meter_2_available": false},
	{"time": 11, "meter_1_available": true, "meter_2_available": false},
	{"time": 11.5, "meter_1_available": true, "meter_2_available": false},
	{"time": 12, "meter_1_available": true, "meter_2_available": false}
]
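A table like the one above could be generated by sampling the sessions on a fixed grid. This is a minimal sketch under a few assumptions: sample_availability is a hypothetical function name, the sessions are hard-coded from the toy example as (meter_id, start, end) tuples, and the two sessions without a matching event get infinite bounds.

```python
import math

sessions = [
    (1, -math.inf, 2),
    (1, 4, 5),
    (1, 8, 10),
    (2, 1, 7),
    (2, 9, math.inf),
]


def sample_availability(sessions, meter_ids, start, end, step):
    """Sample every meter at start, start + step, ..., end (inclusive)."""
    n_steps = int(round((end - start) / step))
    rows = []
    for i in range(n_steps + 1):
        t = start + i * step
        row = {"time": t}
        for meter_id in meter_ids:
            # Occupied if any session on this meter has s_start <= t < s_end.
            occupied = any(
                meter == meter_id and s_start <= t < s_end
                for meter, s_start, s_end in sessions
            )
            row[f"meter_{meter_id}_available"] = not occupied
        rows.append(row)
    return rows


rows = sample_availability(sessions, meter_ids=[1, 2], start=0, end=12, step=0.5)
print(len(rows))
print(rows[0])
```

Computing each time as start + i * step, rather than repeatedly adding step, keeps floating point error from piling up when the increment is something less tidy than half an hour.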

This is fantastic! Let’s go over how this approach solves the various problems with the data we had at the start:

  • Unbalanced Data: By definition, this data is now balanced. For each time slice (0:00, 0:30, 1:00, etc.), we have an availability value for both meters. Additionally, the sampling rate, or how we sliced up our session data, is constant.
  • Not the Ideal Format: We now have data that instantly says whether or not a meter was available at a particular time. Since we will be training our machine learning algorithms on that metric, it’s vital we nail that piece down.
  • Noisy Data: This one was harder to showcase with this dummy data, but this approach alleviates much of the noise. This is because our final dataset doesn’t care how many events happened in a short period of time. If a session lasts for three seconds, it is unlikely to fall on a sample time and affect our dataset. To clean it even further, we can try out some measures that take length of session into account and weed out the pesky sessions that last an unusually brief period of time. Luckily, by creating the session data in Step 2, we’ll be able to easily go through and see which sessions are abnormally short (or long, for that matter).
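As a sketch of the "weed out short sessions" idea from the last bullet, something like the following could work. Everything here is an assumption for illustration: the session records and their UNIX timestamps are made up, split_by_duration is a hypothetical helper, and the 30-second threshold is a placeholder rather than a value tuned on the real data.

```python
MIN_SESSION_SECONDS = 30  # hypothetical threshold, not tuned on real data

# Made-up sessions with UNIX timestamps, purely for illustration.
sessions = [
    {"session_id": 101, "meter_id": 1, "start": 1_400_000_000, "end": 1_400_000_003},
    {"session_id": 102, "meter_id": 1, "start": 1_400_000_100, "end": 1_400_002_800},
    {"session_id": 103, "meter_id": 2, "start": 1_400_000_200, "end": 1_400_000_210},
]


def split_by_duration(sessions, min_seconds):
    """Separate sessions lasting at least min_seconds from shorter ones."""
    kept, dropped = [], []
    for session in sessions:
        duration = session["end"] - session["start"]
        if duration >= min_seconds:
            kept.append(session)
        else:
            dropped.append(session)
    return kept, dropped


kept, dropped = split_by_duration(sessions, MIN_SESSION_SECONDS)
print([s["session_id"] for s in kept])
print([s["session_id"] for s in dropped])
```

Returning the dropped sessions alongside the kept ones makes it easy to eyeball what is being thrown away before committing to a threshold.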

To top it off, we did more than just fix the issues we had with the original data. We’ve also made some additional improvements without even trying:

  • MORE DATA: By slicing up our sessions, we multiplied the number of data points we had by roughly SIX (8 events became 25 time slices × 2 meters = 50 availability values). And that was slicing with a time interval of half an hour. Think of how much data we can extract by reducing it to five minutes, or even one minute! The nice thing is that this isn’t false or fabricated data; we’re just extracting more from the data that was already there!
  • COMPACT DATA: Our data is going to be super compact. The above data can easily be stored as a comma-separated values (CSV) file, and even when we have gigabytes of data, it will compress down immensely because the majority of the file is commas and the words “true” and “false”.
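To give a feel for the CSV idea, here is a small sketch that writes the first few availability rows using Python’s built-in csv module. The to_csv helper is a name invented for this example, and it lowercases booleans so the output uses the words “true” and “false”.

```python
import csv
import io

# The first three rows of the balanced availability data.
rows = [
    {"time": 0, "meter_1_available": False, "meter_2_available": True},
    {"time": 0.5, "meter_1_available": False, "meter_2_available": True},
    {"time": 1, "meter_1_available": False, "meter_2_available": False},
]


def to_csv(rows):
    """Serialize availability rows to CSV text (hypothetical helper)."""
    fieldnames = list(rows[0].keys())
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(fieldnames)
    for row in rows:
        writer.writerow([
            # Write booleans as lowercase "true"/"false"; leave times as-is.
            str(row[name]).lower() if isinstance(row[name], bool) else row[name]
            for name in fieldnames
        ])
    return buffer.getvalue()


csv_text = to_csv(rows)
print(csv_text)
```

Each row repeats the same handful of tokens, which is exactly the kind of redundancy that general-purpose compressors handle well.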

I love happy accidents.

Alright, that’s the end of this long post. I think it’ll be a week or so before the next Santa Monica Spaces update (I’ll have some other features to keep the content flowing), but next time I’ll be showing off my code progress and maybe finally get some code out on my GitHub! Speaking of which, you can see the IPython Notebook used to produce the graphs in this post here. I’m still learning the matplotlib library, so if you have any suggestions to improve my visuals, leave me a comment below! Peace out, data nerds.

Next up: Enough Talk- Where is Santa Monica Spaces Now?
