Matthew Beckler's Home Page



NCAA Basketball Tournament Data

In which I obtain data from the NCAA Division 1 Men's Basketball Tournament and try to use machine learning techniques to excel at March Madness.


Introduction: There is an excellent series of websites with tons of statistical data from many sporting events, including the NCAA Division 1 Men's Basketball Tournament. I captured the data from past tournaments and saved it in an easy-to-parse plaintext file format.

Data Capture: The original website does not provide the data in plaintext form, so I wrote a simple page scraper in Python using the lxml library:

#!/usr/bin/env python
#
# Script to fetch and parse NCAA tournament results from
# http://www.databasesports.com/ncaab/tourney.htm?yr=XXXX
#
# Matthew Beckler - matthew at mbeckler dot org
# For more details, visit http://www.mbeckler.org/ncaa_tournament/

from lxml import etree
import time

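def parse_ncaa(url):
    # Fetch the page and parse it with lxml's forgiving HTML parser.
    parser = etree.HTMLParser()
    tree = etree.parse(url, parser)

    data = {}
    # The first word of the first <h1> on the page is the tournament year.
    data["year"] = int(tree.xpath("//h1/text()")[0].split()[0])
    # Collect the text of every <td valign="middle"> cell, in page order.
    data["data"] = tree.xpath("//td[@valign=\"middle\"]/text()")
    # Strip surrounding spaces and parentheses, then convert each cell to an integer.
    data["data"] = [ int(text.strip(" ()")) for text in data["data"] ]
    return data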

def main():
    start = 1985
    end = 2009
    header = """# NCAA Division 1 Mens Basketball Championship Tournament Data %d - %d
# http://www.databasesports.com/ncaab/
#
# For format information, please see website below.
# Screen scraped and compiled by:
# Matthew Beckler <matthew at mbeckler dot org>
# http://www.mbeckler.org/ncaa_tournament/
""" % (start, end)
    filename = "ncaa_data_%d_%d.txt" % (start, end)
    print("Writing data to \"%s\"..." % filename)
    fid = open(filename, "w")
    fid.write(header)
    for year in range(start, end + 1):
        url = "http://www.databasesports.com/ncaab/tourney.htm?yr=%d" % year
        data = parse_ncaa(url)
        fid.write("%d" % data["year"])
        for i in (1, 3, 4, 5, 6, 8, 10, 11, 13, 15, 16, 17, 19, 21, 22, 24, 26, 27, 28, 29, 31, \
                  33, 34, 36, 38, 39, 40, 42, 44, 45, \
                  47, 49, 50, 51, 52, 54, 56, 57, 59, 61, 62, 63, 65, 67, 68, 70, 72, 73, 74, \
                  75, 77, 79, 80, 82, 84, 85, 86, 88, 90, 91, \
                  93, 95, 96, 97, 98, 100, 102, 103, 105, 107, 108, 109, 111, 113, 114, 116, \
                  118, 119, 120, 121, 123, 125, 126, 128, 130, 131, 132, 134, 136, 137, \
                  139, 141, 142, 143, 144, 146, 148, 149, 151, 153, 154, 155, 157, 159, 160, \
                  162, 164, 165, 166, 167, 169, 171, 172, 174, 176, 177, 178, 180, 182, 183, \
                  185, 187, 188, 190, 192, 193):
            fid.write(" %d" % data["data"][i])
        fid.write("\n")
        #print data["year"], data["data"]
        print(year)
        time.sleep(1)
    fid.close()


if __name__ == "__main__":
    main()

I was able to get data for tournament years 1985 through 2009. Before 1985, the tournament had fewer than 64 teams and therefore a different bracket structure.

Data Format: The data is provided as a series of integer scores. The scores are grouped into the four regional brackets, which all share the same structure, and the Final Four and championship games are handled at the end. In the image below, the score for each team in each regional game is identified by an index number in blue. Comparing the two scores of a game tells us which team won it. Since the four regional brackets are the same, each one uses the same index layout, shifted by 30 from the previous bracket. The order of brackets is Midwest, West, East, South.
[Image: NCAA bracket with the score index numbers marked in blue]
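
As a rough sketch of how the per-bracket offset works, the snippet below slices one year's scores into the four regions plus the final games. It assumes the blue index numbers map directly onto positions in the 126-value data line (000 through 125), so each region takes 30 values and the last 6 belong to the Final Four and championship. The names `REGIONS`, `SCORES_PER_REGION`, and `split_scores` are made up for this illustration, and the layout of games inside a region is not modeled here since it depends on the bracket image.

```python
REGIONS = ("Midwest", "West", "East", "South")  # bracket order given above
SCORES_PER_REGION = 30  # index offset between consecutive regional brackets
FINAL_SCORES = 6  # Final Four (2 games) plus championship, 2 scores each


def split_scores(scores):
    """Split one year's scores into per-region lists plus the final six."""
    expected = SCORES_PER_REGION * len(REGIONS) + FINAL_SCORES
    if len(scores) != expected:
        raise ValueError("expected %d scores, got %d" % (expected, len(scores)))
    regions = {}
    for n, name in enumerate(REGIONS):
        start = n * SCORES_PER_REGION
        regions[name] = scores[start:start + SCORES_PER_REGION]
    # Everything after the four regional blocks is the Final Four and championship.
    finals = scores[SCORES_PER_REGION * len(REGIONS):]
    return regions, finals


# Placeholder values 0..125 stand in for real scores so the slicing is visible.
regions, finals = split_scores(list(range(126)))
print(regions["West"][0], finals)
```

With the placeholder input, each value equals its own index, so it is easy to check that the West bracket starts at index 30 and the final six values start at index 120.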

Download: Here is the data in the following plaintext format, with one line per year:

year 000 001 002 ... 120 121 122 123 124 125

ncaa_data_1985_2009.txt
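
For reading the file back in, something like the following should be enough. It is only a minimal sketch: `parse_tournament_lines` is a hypothetical helper name, and it assumes the file contains nothing but the `#`-prefixed header lines written by the scraper above and one `year score score ...` line per tournament. The demo line uses synthetic placeholder values, not real scores.

```python
def parse_tournament_lines(lines):
    """Return {year: [scores]} from lines in the 'year 000 001 ...' format."""
    data = {}
    for line in lines:
        line = line.strip()
        # Skip blank lines and the '#'-prefixed header written by the scraper.
        if not line or line.startswith("#"):
            continue
        fields = [int(token) for token in line.split()]
        data[fields[0]] = fields[1:]
    return data


# Synthetic demo: a comment line, then a year followed by placeholders 0..125.
sample = ["# comment line", "1985 " + " ".join(str(i) for i in range(126))]
data = parse_tournament_lines(sample)
print(len(data[1985]))

# With the real file (assumed to be in the working directory):
# with open("ncaa_data_1985_2009.txt") as fid:
#     data = parse_tournament_lines(fid)
```

Passing an open file object works because iterating over a file yields its lines, which is all the function needs.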

Results: Nothing yet, but hey, I'm a busy guy... Also, it's too late to generate an awesome bracket for this year's tournament, anyway.



Copyright © 2004 - 2024, Matthew L. Beckler, CC BY-SA 3.0
Last modified: 2010-03-21 03:00:53 PM (EDT)