Matthew Beckler's Home Page

NCAA Basketball Tournament Data

In which I obtain data from the NCAA Division I Men's Basketball Tournament and try to use machine learning techniques to excel at March Madness.

Introduction: There is an excellent series of websites with tons of statistical data from many sporting events, including the NCAA Division I Men's Basketball Tournament. I captured the results of past tournaments and saved them in an easy-to-parse plaintext file format.

Data Capture: The original website does not offer the data in plaintext form, so I wrote a simple page scraper in Python using the lxml library:

#!/usr/bin/env python
#
# Script to fetch and parse NCAA tournament results from
# http://www.databasesports.com/ncaab/tourney.htm?yr=XXXX
#
# Matthew Beckler - matthew at mbeckler dot org
# For more details, visit http://www.mbeckler.org/ncaa_tournament/

from lxml import etree
import time

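def parse_ncaa(url):
    # Fetch the page and parse it with lxml's HTML parser
    parser = etree.HTMLParser()
    tree = etree.parse(url, parser)

    data = {}
    # The first word of the page's <h1> text is the tournament year
    data["year"] = int(tree.xpath("//h1/text()")[0].split()[0])
    # The numbers live in the text of table cells with valign="middle"
    data["data"] = tree.xpath("//td[@valign=\"middle\"]/text()")
    # Strip surrounding spaces and parentheses, then convert to integers
    data["data"] = [ int(text.strip(" ()")) for text in data["data"] ]
    return data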

def main():
    start = 1985
    end = 2009
    header = """# NCAA Division 1 Mens Basketball Championship Tournament Data %d - %d
# http://www.databasesports.com/ncaab/
#
# For format information, please see website below.
# Screen scraped and compiled by:
# Matthew Beckler <matthew at mbeckler dot org>
# http://www.mbeckler.org/ncaa_tournament/
""" % (start, end)
    filename = "ncaa_data_%d_%d.txt" % (start, end)
    print("Writing data to \"%s\"..." % filename)
    fid = open(filename, "w")
    fid.write(header)
    for year in range(start, end + 1):
        url = "http://www.databasesports.com/ncaab/tourney.htm?yr=%d" % year
        data = parse_ncaa(url)
        fid.write("%d" % data["year"])
        # Pick out the 126 score cells: 30 per regional bracket (four groups),
        # then 6 for the Final Four and championship games
        for i in (1, 3, 4, 5, 6, 8, 10, 11, 13, 15, 16, 17, 19, 21, 22, 24, 26, 27, 28, 29, 31,
                  33, 34, 36, 38, 39, 40, 42, 44, 45,
                  47, 49, 50, 51, 52, 54, 56, 57, 59, 61, 62, 63, 65, 67, 68, 70, 72, 73, 74,
                  75, 77, 79, 80, 82, 84, 85, 86, 88, 90, 91,
                  93, 95, 96, 97, 98, 100, 102, 103, 105, 107, 108, 109, 111, 113, 114, 116,
                  118, 119, 120, 121, 123, 125, 126, 128, 130, 131, 132, 134, 136, 137,
                  139, 141, 142, 143, 144, 146, 148, 149, 151, 153, 154, 155, 157, 159, 160,
                  162, 164, 165, 166, 167, 169, 171, 172, 174, 176, 177, 178, 180, 182, 183,
                  185, 187, 188, 190, 192, 193):
            fid.write(" %d" % data["data"][i])
        fid.write("\n")
        #print(data["year"], data["data"])
        print(year)
        time.sleep(1)
    fid.close()


if __name__ == "__main__":
    main()

I was able to get data for tournament years 1985 through 2009. Before 1985, the tournament had a different structure.
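
The cleanup step in parse_ncaa relies on str.strip() accepting a set of characters to remove from both ends of a string. Here is a minimal sketch of just that step. The helper name clean_cell and the sample cell strings are made up for illustration and are not copied from the real page:

```python
def clean_cell(text):
    """Strip surrounding spaces and parentheses, then convert to an integer."""
    return int(text.strip(" ()"))


# Hypothetical cell strings (assumed for illustration, not real page content)
sample_cells = [" 75 ", "(1)", " (16) ", "68"]
cleaned = [clean_cell(cell) for cell in sample_cells]
print(cleaned)
```

Note that strip() only removes characters from the ends, so a cell containing anything other than digits, spaces, and parentheses would still make int() raise a ValueError.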

Data Format: Each year's data is a series of integer scores. The scores are grouped into the four regional brackets, in the order Midwest, West, East, South, and the Final Four and championship games come at the end. In the image below, the score for each team in each regional game is identified by an index number in blue. Comparing the two scores of a game tells us which team won it. All four regional brackets have the same structure, so the same index layout applies to each one, shifted by 30 per bracket: Midwest starts at index 0, West at 30, East at 60, and South at 90. The final six values (120 through 125) cover the Final Four and championship games.
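
As a rough sketch of the offset-by-30 layout, the snippet below splits one year's 126 scores into the four regional brackets plus the six final values. The function and constant names are my own for this example, and the placeholder row just holds each value's own index rather than real scores. Which index belongs to which game inside a bracket is defined by the bracket image, so this sketch does not try to pair up opponents:

```python
REGIONS = ("Midwest", "West", "East", "South")  # order used in the data file
SCORES_PER_REGION = 30  # 16 + 8 + 4 + 2 team scores over four regional rounds
FINAL_SCORES = 6        # two Final Four games plus the championship game


def split_row(scores):
    """Split one year's scores into per-region lists and the final six values."""
    expected = len(REGIONS) * SCORES_PER_REGION + FINAL_SCORES
    if len(scores) != expected:
        raise ValueError("expected %d scores, got %d" % (expected, len(scores)))
    regions = {}
    for n, name in enumerate(REGIONS):
        start = n * SCORES_PER_REGION
        regions[name] = scores[start:start + SCORES_PER_REGION]
    finals = scores[len(REGIONS) * SCORES_PER_REGION:]
    return regions, finals


# Placeholder row: every value is its own index, so the offsets are easy to see
row = list(range(126))
regions, finals = split_row(row)
print(regions["West"][0], regions["South"][0], finals)
```

With the placeholder row, the first West value is 30 and the first South value is 90, which matches the per-bracket shift described above.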

Download: The data is available in the following plaintext format, with one line per year (the year followed by 126 scores):

year 000 001 002 ... 120 121 122 123 124 125

ncaa_data_1985_2009.txt
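
Reading the file back in takes only a few lines. This is a minimal sketch, assuming the file contains nothing but "#" comment lines (the header written by the scraper) and one whitespace-separated line per year. The function name is my own, and the sample text is made up and shortened to three scores per line so it fits here:

```python
def parse_tournament_text(text):
    """Return a dict mapping each year to its list of integer scores."""
    results = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and header comments
        fields = [int(field) for field in line.split()]
        results[fields[0]] = fields[1:]
    return results


# Made-up, shortened sample (real lines carry 126 scores after the year)
sample = """# header comment line
1985 61 60 75
1986 80 77 59
"""
data = parse_tournament_text(sample)
print(sorted(data))
```

To load the real file, pass the contents of ncaa_data_1985_2009.txt to the same function, for example via open(filename).read().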

Results: Nothing yet, but hey, I'm a busy guy... Also, it's too late to generate an awesome bracket for this year's tournament, anyway.