In which I obtain data from the NCAA Division 1 Men's Basketball Tournament and try to use machine learning techniques to excel at March Madness.
Introduction: There is an excellent series of websites with tons of statistical data from many sporting events, including the NCAA Division 1 Men's Basketball Tournament. I captured the data from past tournaments and saved it in an easy-to-parse plaintext file format.
Data Capture: The original website does not provide the data in plaintext form, so I had to write a simple page scraper in Python using the lxml library:
#!/usr/bin/env python
#
# Script to fetch and parse NCAA tournament results from
# http://www.databasesports.com/ncaab/tourney.htm?yr=XXXX
#
# Matthew Beckler - matthew at mbeckler dot org
# For more details, visit http://www.mbeckler.org/ncaa_tournament/

from lxml import etree
import time

def parse_ncaa(url):
    # Parse the page as HTML and pull out the year plus every numeric table cell
    parser = etree.HTMLParser()
    tree = etree.parse(url, parser)
    data = {}
    data["year"] = int(tree.xpath("//h1/text()")[0].split()[0])
    data["data"] = tree.xpath("//td[@valign=\"middle\"]/text()")
    # Strip surrounding spaces and parentheses before converting to int
    data["data"] = [int(text.strip(" ()")) for text in data["data"]]
    return data

def main():
    start = 1985
    end = 2009
    header = """# NCAA Division 1 Mens Basketball Championship Tournament Data %d - %d
# http://www.databasesports.com/ncaab/
#
# For format information, please see website below.
# Screen scraped and compiled by:
# Matthew Beckler <matthew at mbeckler dot org>
# http://www.mbeckler.org/ncaa_tournament/
""" % (start, end)

    filename = "ncaa_data_%d_%d.txt" % (start, end)
    print("Writing data to \"%s\"..." % filename)
    with open(filename, "w") as fid:
        fid.write(header)
        for year in range(start, end + 1):
            url = "http://www.databasesports.com/ncaab/tourney.htm?yr=%d" % year
            data = parse_ncaa(url)
            fid.write("%d" % data["year"])
            # Indices into the scraped cell list that are written out, in output order
            for i in (1, 3, 4, 5, 6, 8, 10, 11, 13, 15, 16, 17, 19, 21, 22, 24, 26, 27, 28, 29, 31,
                      33, 34, 36, 38, 39, 40, 42, 44, 45,
                      47, 49, 50, 51, 52, 54, 56, 57, 59, 61, 62, 63, 65, 67, 68, 70, 72, 73, 74,
                      75, 77, 79, 80, 82, 84, 85, 86, 88, 90, 91,
                      93, 95, 96, 97, 98, 100, 102, 103, 105, 107, 108, 109, 111, 113, 114, 116,
                      118, 119, 120, 121, 123, 125, 126, 128, 130, 131, 132, 134, 136, 137,
                      139, 141, 142, 143, 144, 146, 148, 149, 151, 153, 154, 155, 157, 159, 160,
                      162, 164, 165, 166, 167, 169, 171, 172, 174, 176, 177, 178, 180, 182, 183,
                      185, 187, 188, 190, 192, 193):
                fid.write(" %d" % data["data"][i])
            fid.write("\n")
            print(year)
            # Pause between requests to be polite to the server
            time.sleep(1)

if __name__ == "__main__":
    main()
I was able to get data for tournament years 1985 through 2009. Before 1985, the tournament had a different structure.
Data Format: The data is provided as a series of integer scores. It is split into the four regional brackets, which all share the same structure, followed by the Final Four and championship games at the end. In the image below, the score for each team in each regional game is identified by an index number in blue. Comparing the two scores of a game tells us which team won. Because the four regional brackets are laid out identically, a given score sits at the same position within each bracket, and each successive bracket's indices are shifted by 30. The order of brackets is Midwest, West, East, South.
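The offset rule can be sketched in a few lines of Python. This is only an illustration: the names `BRACKETS`, `SCORES_PER_BRACKET`, `global_index`, and `split_scores` are made up for this example, and it assumes the position within a bracket is counted from 0 to 29. It deliberately says nothing about which positions belong to the same game, since that pairing is defined by the bracket image.

```python
# Illustrative sketch; every name here is hypothetical, not part of the scraper.
BRACKETS = ("Midwest", "West", "East", "South")  # bracket order given in the text
SCORES_PER_BRACKET = 30  # 15 games per regional, two scores per game
TOTAL_SCORES = 126       # 4 * 30 regional scores + 6 Final Four/championship scores


def global_index(bracket, local_index):
    """Map a 0-based position within one regional bracket (assumed 0..29)
    to its position in a full year's list of scores."""
    if not 0 <= local_index < SCORES_PER_BRACKET:
        raise ValueError("local index must be in 0..29")
    return BRACKETS.index(bracket) * SCORES_PER_BRACKET + local_index


def split_scores(scores):
    """Split one year's 126 scores into per-bracket lists plus the final six."""
    if len(scores) != TOTAL_SCORES:
        raise ValueError("expected %d scores, got %d" % (TOTAL_SCORES, len(scores)))
    regions = {}
    for n, name in enumerate(BRACKETS):
        start = n * SCORES_PER_BRACKET
        regions[name] = scores[start:start + SCORES_PER_BRACKET]
    finals = scores[len(BRACKETS) * SCORES_PER_BRACKET:]
    return regions, finals
```

Keeping the bracket order in a single tuple means the shift of 30 is computed in one place rather than repeated for each region.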
Download: Here is the data in the following plaintext format, with one line per year:
year 000 001 002 ... 120 121 122 123 124 125
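Reading the file back in is straightforward. The sketch below assumes the layout the scraper writes: comment lines starting with `#` at the top, then one line per year holding the year followed by 126 whitespace-separated integer scores. The function name `parse_tournament_lines` is just something picked for this example.

```python
def parse_tournament_lines(lines):
    """Return {year: [126 scores]} from an iterable of text lines."""
    results = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and the '#' comment header
        fields = [int(token) for token in line.split()]
        year, scores = fields[0], fields[1:]
        if len(scores) != 126:
            raise ValueError("year %d: expected 126 scores, got %d" % (year, len(scores)))
        results[year] = scores
    return results


# Usage, assuming the file name produced by the scraper:
# with open("ncaa_data_1985_2009.txt") as fid:
#     data = parse_tournament_lines(fid)
```

Taking an iterable of lines rather than a file name keeps the function easy to try out on a few hand-written lines before pointing it at the real file.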
Results: Nothing yet, but hey, I'm a busy guy... Also, it's too late to generate an awesome bracket for this year's tournament, anyway.
Copyright © 2004 - 2025, Matthew L. Beckler, CC BY-SA 3.0
Last modified: 2010-03-21 03:00:53 PM (EDT)