Aside: Back from Israel and A New Small Program

Happy New Year!

Alright! Back from Taglit/Birthright a little more tan, a little more mensch-y. Here’s what I’ve been up to in the digital (though occasionally also spiritual) world:

  • Helped out with some documentation on the TensorFlow repo. I’m really enjoying the little bits of collaboration with the Google folks that this has opened up. One thing that could be improved is a little more information about which components Google is actively working toward releasing. After taking notes on their white paper, I found several important features that exist in Google’s internal implementation but have yet to be open-sourced. I think an explicit list of these features would help guide the open source community toward new problems (and maybe, just maybe, stop people from asking about the same things over and over again. It’s coming, guys! Google wants TensorFlow to succeed)
  • A pull request I made in October for the Node.js linear algebra package finally got merged! The maintainer hadn’t touched the project for over a year, but now it seems like there’s been more life. Sadly, my ambitions of creating a simple JavaScript feed-forward neural network executor have been set off course, as learning and using TensorFlow seems to be a better use of time (along with several other projects). Hopefully the new feature is useful to someone!
  • Continued to make some updates on the previously mentioned TensorFlow white paper notes

Additionally, I’ve got a little Python project that I hope will improve people’s lives! For the moment, it’s called:

AnchorHub

Check out the repo here. Basically, it should make intra- and inter-document anchor links in Markdown more sensible to use on GitHub. Headers in Markdown files are automatically given anchor tags on GitHub, but those anchors can get verbose and/or difficult to use. One might end up uploading a file, copying the anchor URL (using the chain icon next to the header), pasting that URL back into the document, and uploading a second time. That’s a waste of time AND it clutters up the commit log. Here are the basics of what AnchorHub does:

What you write:
# This is a header that I would like to make an id for {#head}
[This link points to that header!](#head)

What AnchorHub outputs:
# This is a header that I would like to make an id for
[This link points to that header!](#this-is-a-header-that-i-would-like-to-make-an-id-for)

It works automatically within an entire directory tree, so you can link across documents the same way you might use id tags in HTML. There’s still plenty of work to be done, but it is actually usable right now! Command-line arguments coming up next after fixing a last second bug. Then onto pip-packaging!
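For the curious, GitHub builds these anchors by slugifying the header text: lowercase it, strip punctuation, and swap spaces for hyphens. Here’s a rough sketch of that conversion in Python (an approximation for illustration only, not AnchorHub’s actual implementation; github_anchor is a made-up name):

import re

def github_anchor(header_text):
    # Rough approximation of GitHub's header-to-anchor rule:
    # lowercase, drop punctuation, replace spaces with hyphens.
    text = header_text.strip().lower()
    text = re.sub(r"[^\w\- ]", "", text)  # remove punctuation
    return text.replace(" ", "-")

print(github_anchor("This is a header that I would like to make an id for"))
# -> this-is-a-header-that-i-would-like-to-make-an-id-for

AnchorHub’s job is then to rewrite the {#head}-style tags you write into links that match those generated anchors.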

Read More

TensorFlow White Pages Notes – On GitHub

Hey all!

A lot of content in a short post. I just released my TensorFlow white paper notes on GitHub. It’s been a much larger project than I originally thought, but I hope it will be useful to people. The notes go from the very beginning of the paper all the way through the conclusions, highlighting the important information from each section.

Here are the top features of the notes:

  • Notes broken down and organized section by section, as well as subsection by subsection
  • Relevant links to documentation, resources, and references throughout
  • SVG versions of figures/graphs from the paper

If you haven’t checked out the TensorFlow white paper yet, I highly recommend it. There’s no replacement for reading the original, but I hope these notes are a worthy supplement alongside the actual paper.

Next up: Using TensorFlow on transformed Santa Monica Parking data

Read More

Santa Monica Parking Meter API: “street_address” Field Is More Useful Than It Appears at First Glance

“Approximate Street Address” Doesn’t Do It Justice

Whew! It’s been a while since the last update- don’t worry, I’ve been hard at work learning TensorFlow (and I’ve even contributed to its documentation a touch), and I’ll have a fairly large post later this week. In the meantime, I thought I’d share something I’ve discovered about one of the Santa Monica Parking API’s fields that I had previously shrugged off as unhelpful.

I was looking through some of my parking data and decided to print some information for all parking meters, ordered by meter_id, when I noticed something interesting:

[Image: printout of parking meter records ordered by meter_id, showing repeated street_address values]

Multiple meters were given the same street_address field. On further inspection, I also noticed that the address numbers in the list ended in either 0 or 1. I couldn’t think of anything better to do than plot some of them on a map and see what I came up with.

First, I picked two groups of meters that had similar addresses. In this example, “00 Pico Blvd” and “01 Pico Blvd”.

[Image: meter records with street_address “00 Pico Blvd”]

[Image: meter records with street_address “01 Pico Blvd”]

Here’s the “01 Pico Blvd” coordinates mapped:
[Image: map with pins for the “01 Pico Blvd” meters]

And then with “00 Pico Blvd” added in:
[Image: map with the “00 Pico Blvd” meters added]

The street_address is a label for their block! I tested this out with several groups of meters and found the block-by-block grouping consistent. That makes me comfortable saying this:

street_address Groups Parking Meters Together by Block

There’s your TLDR. Two reasons I’m sharing this today:

  1. As of writing, that information is not conveyed in the API
  2. It’s going to save a huge amount of effort when people inevitably want to group these meters together block-by-block

Before realizing this, I was thinking of various ways to use a combination of meter_id and GPS coordinates to group these meters together without doing it manually, but this field provides a very natural way to group them! Hooray for the data being even better than first thought!
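As a quick illustration of how natural the grouping becomes, here’s a small sketch that buckets meters by street_address (the field names match the API; the meters list below is made-up placeholder data):

from collections import defaultdict

# Made-up sample of meter records; the street_address values mirror the block-style labels from the API.
meters = [
    {"meter_id": 101, "street_address": "00 Pico Blvd"},
    {"meter_id": 102, "street_address": "00 Pico Blvd"},
    {"meter_id": 201, "street_address": "01 Pico Blvd"},
]

blocks = defaultdict(list)
for meter in meters:
    blocks[meter["street_address"]].append(meter["meter_id"])

for block, meter_ids in sorted(blocks.items()):
    print(block, meter_ids)
# 00 Pico Blvd [101, 102]
# 01 Pico Blvd [201]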

Read More

TensorFlow: Google’s Latest Machine Learning Software is Open Sourced!

Yes. Another ML Library.

But this one is different! The hacker community is wickedly excited about it, and you should be too! TensorFlow was released by Google today, and it looks to be a really exciting step forward for open source machine learning, and perhaps even for the entire computational mathematics community.

What is it? From TensorFlow’s introduction, it is “an open source library for numerical computation using data flow graphs”!

What is “an open source library for numerical computation using data flow graphs”?

Sounds like a mouthful, but “data flow graphs” are just a more encompassing term for the kind of modeling neural networks use. And the library is described that way for a reason- TensorFlow is designed not only to provide flexible, highly optimized neural networks, but also to perform any sort of computation that can be organized in a similar graph-like structure.

<aside>

More on data flow graphs

These graphs are composed of two primary components, nodes and edges.

[Image: data flow graph illustrating nodes and edges. Original chart property of Google.]

Nodes are the squares, circles, or ellipses on charts such as the one shown here. They represent any sort of mathematical operation or function. In a neural network, these are your activation functions (like a sigmoid function).

Edges are the connections between the nodes. As you can see, they are directional: data flows from the output of one node into the input of the next node (or several nodes) through these edges. Edges represent the “tensors”, or multi-dimensional arrays, that flow from the output of one node to the input of the next.

Compare that with a typical neural network model, and you can see how a neural network is just a specialized version of a data flow graph. Back to TensorFlow!

</aside>
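To make the nodes-and-edges idea concrete, here’s about the smallest data flow graph you can build with the Python API from the initial release (a minimal sketch based on the launch-era documentation; treat the exact calls as approximate):

import tensorflow as tf

# Build a tiny data flow graph: each op is a node, and the tensors
# passed between them are the edges.
a = tf.constant([[1.0, 2.0]])      # 1x2 matrix
b = tf.constant([[3.0], [4.0]])    # 2x1 matrix
product = tf.matmul(a, b)          # node whose incoming edges are the tensors a and b

# Nothing is computed until the graph is run inside a session.
with tf.Session() as sess:
    print(sess.run(product))       # [[ 11.]]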

So what exactly is there to get excited about with TensorFlow? There are a jillion machine learning libraries out there, so how does this one stick out from the crowd (other than being created by Google)? Well, a fair amount, actually. Here are some of the things I’m most excited about:

  • Easier Transition from Research to Production: Something that has always been troublesome in machine learning, especially with neural networks, is taking the model crafted in research and applying it to a real production setting. Much research is done using Python, R, or MATLAB (with accompanying libraries), which allows for faster iterations through the design and testing phases. Before, that code would hardly be touched once the model moved to production, as it needed to be reimplemented in a faster language, such as C++ or Java. Because of the way TensorFlow is designed, we should be able to take what we have and bring it directly to production with minimal, if any, code changes.

  • Flexibility: This is both a great thing and something to keep in mind. TensorFlow is not a neural network library- it is a data flow graph library. This makes it capable of handling much more nuanced and hand-modeled graphs, but it will require more finagling. While it doesn’t appear too difficult to create a simple neural network now, I expect that there will be some higher-level libraries built on top of TensorFlow to make it extremely easy.

  • Automatic CPU/GPU Integration: This might be the most exciting one for me. GPUs, or graphics processing units, have enabled much faster learning (especially for neural networks), and taking advantage of them is crucial to having the power to create robust models. The problem, however, is that most machine learning libraries out there don’t have GPU support, and those that do are either hard to use or much less flexible. For example, scikit-learn, one of the most popular libraries for machine learning, while extremely useful for testing out ideas, has no plans for GPU support in the near future. TensorFlow promises to bring both flexibility and power by taking advantage of all of your computing resources (a tiny sketch of device placement follows this list).
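On that last point, assigning work to a particular device is advertised as a one-liner. Here’s roughly what that looks like, based on the early “Using GPUs” tutorial (a sketch; details may differ on your setup):

import tensorflow as tf

# Pin these ops to the first GPU; without the tf.device block,
# TensorFlow places them automatically.
with tf.device("/gpu:0"):
    a = tf.constant([[1.0, 2.0], [3.0, 4.0]])
    b = tf.constant([[1.0, 1.0], [0.0, 1.0]])
    c = tf.matmul(a, b)

# log_device_placement prints which device each op actually ran on.
with tf.Session(config=tf.ConfigProto(log_device_placement=True)) as sess:
    print(sess.run(c))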

I’ll be digging into this more over the next few weeks! Check out Google Research’s blog post if you’re interested in reading more about it.

Read More

Santa Monica Spaces: Approach to Transforming the Data

If you read my last post, you may remember that there were a couple of issues that needed to be overcome before putting the data into any sort of machine learning algorithm. Namely: the data could potentially be noisy (i.e. lots of events from one meter within a few seconds); the data is unbalanced; and the raw data, which is in the form of events, is not in the ideal format- what we’d prefer to know is which meters were occupied at any given time. So, how are we going to do this with what we have?

Enter Meter Sessions

The key, as I mentioned in the previous post, is taking advantage of sessions. Each sensor_event from the Santa Monica API contains both an event_type and a session_id field, which can be used to construct sessions: a period of time during which a given parking meter was occupied. By constructing all of the sessions in our data, we can go back and query our data set to see whether or not each parking meter was occupied at a given time. To show how this works, I’ve prepared some graphs with dummy data below to illustrate the concept on a smaller scale.

Example Scenario

In this example, assume we have two parking meters in our town that send out data in the form of events (in a similar manner to Santa Monica’s meters). Whenever somebody pulls into or leaves the space at Meter 1, we receive that information and it is stored in our database. Same goes for Meter 2. Let’s say that we decided to take a look at a 12-hour snapshot of this event data- that data might look something like this (NOTE: the data in this example is simplified for the sake of illustration):

Raw Event Data:
[
	{"event_id": 11, "event_time": 1, "event_type": "SS", "meter_id": 2, "session_id": 9},
	{"event_id": 12, "event_time": 2, "event_type": "SE", "meter_id": 1, "session_id": 8},
	{"event_id": 13, "event_time": 4, "event_type": "SS", "meter_id": 1, "session_id": 10},
	{"event_id": 14, "event_time": 5, "event_type": "SE", "meter_id": 1, "session_id": 10},
	{"event_id": 15, "event_time": 7, "event_type": "SE", "meter_id": 2, "session_id": 9},
	{"event_id": 16, "event_time": 8, "event_type": "SS", "meter_id": 1, "session_id": 11},
	{"event_id": 17, "event_time": 9, "event_type": "SS", "meter_id": 2, "session_id": 12},
	{"event_id": 18, "event_time": 10, "event_type": "SE", "meter_id": 1, "session_id": 11},
]

As you can see, we have an array of JavaScript objects, each of which represents an event. Inside we find a unique event_id, which allows us to find this particular event amongst a sea of others; an event_time, which tells us exactly when the event occurred; the event_type, which identifies if the event represents somebody entering the space ("SS") or leaving the space ("SE"); meter_id, which lets us know which of our two meters sent this event; and session_id, which connects two events together.

Even though this sample data set is small, it’s already hard to get a good handle on what exactly is going on. Let’s start doing some simple visualizations and try to get a better picture, step by step.

Step 0: Organize Events By Meter

[Image: first graph of the data. All events are shown as dots; the only visible difference between them is which meter they came from.]

The first thing we do is separate the events by parking meter. Here, I’ve graphed the parking meter ID on the y-axis so that each meter has room to organize its own events along the x-axis (time).

Using this, it’s much easier to see that Meter 1 (in blue) has five events, while Meter 2 (in red) has three events in this 12-hour snapshot. This by itself isn’t particularly useful. At best, we get an idea of how busy the meters are in relation to one another, but we don’t have any idea when either meter is occupied or available. Let’s apply the event_type property to our chart and see what things look like.
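In code, this first pass is just bucketing the raw events by meter_id. Here’s a sketch (it assumes events is the list of dicts from the “Raw Event Data” block above):

from collections import defaultdict

# events: the list of event dicts shown in the Raw Event Data block
events_by_meter = defaultdict(list)
for event in events:
    events_by_meter[event["meter_id"]].append(event)

for meter_id, meter_events in sorted(events_by_meter.items()):
    print(meter_id, [e["event_time"] for e in meter_events])
# 1 [2, 4, 5, 8, 10]
# 2 [1, 7, 9]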

Step 1: Identify Event Type (Start or End Event)

[Image: first transformation of the data, with some events now labeled as ‘Start Events’ and others as ‘End Events’.]

Now that we’ve marked which events are start events ("SS") and which are end events ("SE"), we can say a little more about the data. For example, a car pulled into Meter 2 at 1:00 and left at 7:00. At Meter 1, a car left at 2:00, and another pulled in at 4:00. It’s starting to become clear how these events are connected, so let’s actually link them together by their session_id.

Step 2: Connect Events by Session

[Image: each ‘start event’ paired with one ‘end event’, forming horizontal line segments.]

Things are starting to fall into place. We’ve connected start events and end events, creating the sessions shown above. A session, as stated before, is a period of time where a meter is occupied. The one thing we have to infer from our event data is the two sessions that run off the edge of the graph: Meter 1, session 8, and Meter 2, session 12. Because we don’t have matching events for those sessions, we have to make some assumptions:

  • If the first event we see from a parking meter is an end event, then there must have been a start event before our observation period, and thus the session was active from at least the start of our observation period.
  • If the last event we see from a parking meter is a start event, then the session must either still be going on, or its end event came after our observation period. In either case, that session must have been active at least until the end of our observation period.

In the full application of the dataset, we’re going to avoid making any such assumptions by truncating our data on either side – when you have many months of data, cutting off a few hours will not make a huge difference, and it’s best to keep the data as unbiased as we can.
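Here’s a sketch of how the sessions might be built in code: pair start and end events by session_id, and clamp the two unmatched sessions to the edges of the observation window (again assuming the events list from above; OBSERVATION_START and OBSERVATION_END are made-up names for the window bounds):

OBSERVATION_START, OBSERVATION_END = 0, 12

# Collect the start ("SS") and end ("SE") times for each session_id.
session_bounds = {}
for event in events:
    bounds = session_bounds.setdefault(
        event["session_id"],
        {"meter_id": event["meter_id"], "start": None, "end": None})
    if event["event_type"] == "SS":
        bounds["start"] = event["event_time"]
    else:
        bounds["end"] = event["event_time"]

# A missing start or end event gets clamped to the observation window.
sessions = []
for session_id, bounds in session_bounds.items():
    sessions.append({
        "meter_id": bounds["meter_id"],
        "session_id": session_id,
        "start": bounds["start"] if bounds["start"] is not None else OBSERVATION_START,
        "end": bounds["end"] if bounds["end"] is not None else OBSERVATION_END,
    })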

The session graph above gives us a nice visual indication of how long each car stayed at a meter- any time that falls on a line is occupied. All we have to do to convert this session data into balanced data is pick a time increment and sample both meters at the same times with that increment. For this example, let’s make the time increment 30 minutes (way too large for the real application, but easier to see here)- below is a graph with vertical gridlines marking half-hour slices:

[Image: the same chart as before, with evenly spaced vertical lines slicing it into 30-minute increments.]

Great! Now all we have to do is do our sampling.

Step 3: Convert Session Data into Availability Data

[Image: markers at regular time intervals for both parking meters; each marker is either a red ‘x’ or a green circle.]

Here is what our final data looks like graphed. Green circles represent a meter being available at a given time, and red x’s represent a meter being occupied. These are the data points that we’ll be able to feed into something like a logistic regression or a neural network!

<side note>

You may notice that a meter is marked “available” at the same time its end event occurs. That’s an artifact of the huge time increments we’re using for this example (in the actual data set, each event_time is given to the second). To account for this, I reasoned it was fair to say that once an end event occurs, the parking space is immediately available. Although this situation is much less likely to come up in the real data, I will continue to use a non-inclusive right bound on the sessions. Put in other terms, I will say that a parking space is occupied if a session is active at that time, and an active session will be defined by:

the sampled time being:
a) greater than or equal to the session’s start event time (as a UNIX timestamp), and
b) strictly less than the session’s end event time (as a UNIX timestamp),
where the two events share the same session_id
</side note>
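Putting that half-open rule into code, the sampling step might look something like this (a sketch that assumes the sessions list from the Step 2 sketch and the toy 0.5-hour increment):

def occupied(meter_id, t, sessions):
    # A meter is occupied at time t if any of its sessions satisfies start <= t < end.
    return any(s["meter_id"] == meter_id and s["start"] <= t < s["end"]
               for s in sessions)

availability = []
t = 0.0
while t <= 12.0:
    availability.append({
        "time": t,
        "meter_1_available": not occupied(1, t, sessions),
        "meter_2_available": not occupied(2, t, sessions),
    })
    t += 0.5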

To get a look at how this compares with the raw data we started with, here’s one possible approach to storing the transformed data:

Balanced Availability Data:
[
	{"time": 0, "meter_1_available": false, "meter_2_available": true}
	{"time": 0.5, "meter_1_available": false, "meter_2_available": true}
	{"time": 1, "meter_1_available": false, "meter_2_available": false}
	{"time": 1.5, "meter_1_available": false, "meter_2_available": false}
	{"time": 2, "meter_1_available": true, "meter_2_available": false}
	{"time": 2.5, "meter_1_available": true, "meter_2_available": false}
	{"time": 3, "meter_1_available": true, "meter_2_available": false}
	{"time": 3.5, "meter_1_available": true, "meter_2_available": false}
	{"time": 4, "meter_1_available": false, "meter_2_available": false}
	{"time": 4.5, "meter_1_available": false, "meter_2_available": false}
	{"time": 5, "meter_1_available": true, "meter_2_available": false}
	{"time": 5.5, "meter_1_available": true, "meter_2_available": false}
	{"time": 6, "meter_1_available": true, "meter_2_available": false}
	{"time": 6.5, "meter_1_available": true, "meter_2_available": false}
	{"time": 7, "meter_1_available": true, "meter_2_available": true}
	{"time": 7.5, "meter_1_available": true, "meter_2_available": true}
	{"time": 8, "meter_1_available": false, "meter_2_available": true}
	{"time": 8.5, "meter_1_available": false, "meter_2_available": true}
	{"time": 9, "meter_1_available": false, "meter_2_available": false}
	{"time": 9.5, "meter_1_available": false, "meter_2_available": false}
	{"time": 10, "meter_1_available": true, "meter_2_available": false}
	{"time": 10.5, "meter_1_available": true, "meter_2_available": false}
	{"time": 11, "meter_1_available": true, "meter_2_available": false}
	{"time": 11.5, "meter_1_available": true, "meter_2_available": false}
	{"time": 12, "meter_1_available": true, "meter_2_available": false}
]

This is fantastic! Let’s go over how this approach solves the various problems with the data we had at the start:

  • Unbalanced Data: By definition, this data is now balanced. For each time slice (0:00, 0:30, 1:00, etc.), we have an availability value for both meters. Additionally, the sampling rate, or how we sliced up our session data, is constant.
  • Not Best Data: We now have data that instantly says whether or not a meter was available at a particular time. Since we will be training our machine learning algorithms on that metric, it’s vital we nail that piece down.
  • Noisy data: This one was harder to showcase with this dummy data, but this approach alleviates much of the noise, because our final dataset doesn’t care how many events happened in a short period of time. If a session lasts for three seconds, it is unlikely to show up in or affect our dataset. To clean things even further, we can try out some measures that take session length into account and weed out the pesky sessions that last an unusually brief period of time (a small sketch of that follows this list). Luckily, by creating the session data in Step 2, we’ll be able to easily go through and see which sessions are abnormally short (or long, for that matter).
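As a quick example of that last point, screening out suspiciously short sessions is a one-liner once the session data exists (a sketch; the 60-second threshold is an arbitrary placeholder, and it assumes the start/end values are real timestamps rather than the toy hours used above):

MIN_SESSION_SECONDS = 60  # arbitrary threshold for this sketch

cleaned_sessions = [s for s in sessions
                    if s["end"] - s["start"] >= MIN_SESSION_SECONDS]
print("dropped", len(sessions) - len(cleaned_sessions), "suspiciously short sessions")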

To top it off, we did more than just fix the issues we had with the original data; we’ve also made some additional improvements without even intentionally trying to do so:

  • MORE DATA: By slicing up our sessions, we multiplied the number of data points we had by SIX. And that was slicing with a time interval of half an hour- think of how much data we can extract by reducing it to five minutes, or even one minute! The nice thing is that this isn’t false or fabricated data; we’re just extracting more from the data that was already there!
  • COMPACT DATA: Our data is going to be super compact. The above data can easily be stored as a comma-separated values (CSV) file (a quick sketch of that follows this list), and even when we have gigabytes of data, it will compress down immensely, since the majority of the file is commas, spaces, and the words “true” and “false”.
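For what it’s worth, dumping that availability data to CSV is about as simple as it gets (a sketch assuming the availability list from the earlier sampling sketch; the filename is a placeholder):

import csv

with open("availability.csv", "w") as f:
    writer = csv.DictWriter(f, fieldnames=["time", "meter_1_available", "meter_2_available"])
    writer.writeheader()
    writer.writerows(availability)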

I love happy accidents.

Alright, that’s the end of this long post. I think it’ll be a week or so before the next Santa Monica Spaces update (I’ll have some other features to keep the content flowing), but next time I’ll be showing off my code progress and maybe finally get some code out on my GitHub! Speaking of which, you can see the IPython Notebook used to produce the graphs in this post here. I’m still learning the matplotlib library, so if you have any suggestions to improve my visuals, leave me a comment below! Peace out, data nerds.

Next up: Enough Talk- Where is Santa Monica Spaces Now?

Read More