=============================================================================== This is the UCI Repository Of Machine Learning Databases 3 March 1990 ics.uci.edu: /usr2/spool/ftp/pub/machine-learning-databases Site Librarian: David W. Aha (aha@ics.uci.edu) 48 databases (5884K plus 1 offline database of unknown size) =============================================================================== Included in this directory are data sets that have been or can be used to evaluate learning algorithms. Each data file (*.data) consists of individual records described in terms of attribute-value pairs. See the corresponding *.names file for voluminous documentation. (Some files _generate_ databases; they do not have *.data files.) The contents of this repository can be remotely copied to other network sites via ftp to ics.uci.edu. Both the userid and password are "anonymous". Notes: 1. We're always looking for additional databases. Please send yours, with documentation. Thanks. Current documentation requirements are located in file DOC-REQUIREMENTS. Complaints and suggestions for improvements are welcome anytime. Presently, all databases except 4 with unusual formats have the following format: 1 instance per line, no spaces, commas separate attribute values, and missing values are denoted by "?". Exceptions: audioogy, labor-negotiations, spectrometer, university, and the undocumented databases. 2. There is also the "undocumented" sub-directory which contains six databases that require attention before being incorporated into the repository. You are welcome to access them. 3. Ivan Bratko has asked me to restrict the access on the databases he donated from the Ljubljana Oncology Institute. These databases, under the breast-cancer, lymphography, and primary-tumor directories, are unreadable to you. However, we are allowed to share them with academic institutions upon request. If used, these databases (like several others) require providing proper citations be made in published articles that use them. The citation requirements can be found in each database's corresponding documentation file. 4. Finally, I'm maintaining a list of CORRESPONDENTS and TRANSACTIONS. Perhaps someone on your site is listed among the CORRESPONDENTS and can provide you with some of these databases and related information. TRANSACTIONS is a log of my correspondence with others. David W. Aha Repository Librarian ---------------------------------------------------------------------- Brief Overview of Databases: Quick Listing: 1. annealing 2. audiology 3. autos 4. breast-cancer (restricted access) 5-6. chess-end-games 7. cpu-performance 8. echocardiogram 9. glass 10. hayes-roth 11-14. heart-disease 15. hepatitis 16. iris 17. labor-negotiations 18-19. led-display-creator 20. lymphography (restricted access) 21. mushroom 22. primary-tumor (restricted access) 23. shuttle-landing-control 24-25. soybean 26. spectrometer 27-34. thyroid-disease 35. university 36. voting-records 37-38. waveform domain 39-47. Undocumented databases: sub-directory undocumented 1. Bradshaw's flare data 2. Pat Langley's data generator 3. David Lewis's information retrieval (IR) data collection (offline) 4. Mike Pazzani's economic sanctions database 5. Ross Quinlan's latest version of the thyroid database 6. Philippe Collard's database on cloud cover images 7. Mary McLeish & Matt Cecile's database on horse colic 8. Paul O'Rorke's database containing theorems from Principia Mathematica 9. John Gennari's program for creating structured objects ("animals") 48. Nine small EBL domain theories and examples in sub-directory ebl Quick Summaries of Each Database: 1. Annealing data (unknown source) -- Documentation: On everything except database statistics -- Background information on this database: unknown -- Many missing attribute values 2. Audiology data (Baylor College) -- Documentation: On everything except database statistics -- Non-standardized attributes (differs between instances) -- All attributes are nominally-valued 3. Automobile data (1985 Ward's Automotive Yearbook) -- Documentation: On everything except statistics and class distribution -- Good mix of numeric and nominal-valued attributes -- More than 1 attribute can be used as a class attribute in this database 4. Breast cancer database (Ljubljana Oncology Institute) -- Documentation: On everything except database statistics -- Well-used database -- 286 instances, 2 classes, 9 attributes + the class attribute 5-6. Chess endgames data creator 1. king-rook-vs-king-knight -- Documentation: limited (nothing on class distribution, statistics) -- This concerns king-knight versus king-rook end games -- The database creator is coded in Common Lisp 2. king-rook-vs-king-pawn -- Documentation: sufficient -- This concerns king-rook versus king-pawn end games -- Originally described by Alen Shapiro 7. Computer hardware described in terms of its cycle time, memory size, etc. and classified in terms of their relative performance capabilities (CACM 4/87) -- Documentation: complete -- Contains integer-valued concept labels -- All attributes are integer-valued 8. Echocardiogram database (Reed Institute, Miami) -- Documentation: sufficient -- 13 numeric-valued attributes -- Binary classification: patient either alive or dead after survival period 9. Glass Identification database (USA Forensic Science Service) -- Documentation: completed -- 6 types of glass -- Defined in terms of their oxide content (i.e. Na, Fe, K, etc) -- All attributes are numeric-valued 10. Hayes-Roth and Hayes-Roth's database -- Described in their 1977 paper -- Topic: human subjects study 11-14. Heart Disease databases (Sources listed below) -- Documentation: extensive, but statistics and missing attribute information not yet furnished (perhaps later) -- 4 databases: Cleveland, Hungary, Switzerland, and the VA Long Beach -- 13 of the 75 attributes were used for prediction in 2 separate tests, each of which achieved approximately 75%-80% classification accuracy -- The chosen 13 attributes are all continuously valued 15. Hepatitis database (G.Gong: CMU) -- Documentation: incomplete -- 155 instances with 20 attributes each; 2 classes -- Mostly Boolean or numeric-valued attribute types 16. Iris Plant database (Fisher, 1936) -- Documentation: complete -- 3 classes, 4 numeric attributes, 150 instances -- 1 class is linearly separable from the other 2, but the other 2 are not linearly separable from each other (simple database) 17. Labor relations database (Collective Bargaining Review) -- Documentation: no statistics -- Please see the labor directory for more information 18-19. LED display domains (Classification and Regression Trees book) -- Documentation: sufficient, but missing statistical information -- All attributes are Boolean-valued -- Two versions: 7 and 24 attributes -- Optimal Baye's rate known for the 10% probability of noise problem -- Several ML researchers have used this domain for testing noise tolerancy -- We provide here 2 C programs for generating sample databases 20. Lymphography database (Ljubljana Oncology Institute) -- Documentation: incomplete -- CITATION REQUIREMENT: Please use (see the documentation file) -- 148 instances; 19 attributes; 4 classes; no missing data values 21. Mushrooms in terms of their physical characteristics and classified as poisonous or edible (Audobon Society Field Guide) -- Documentation: complete, but missing statistical information -- All attributes are nominal-valued -- Large database: 8124 instances (2480 missing values for attribute #12) 22. Primary Tumor database (Ljubljana Oncology Institute) -- Documentation: incomplete -- CITATION REQUIREMENT: Please use (see the documentation file) -- 339 instances; 18 attributes; 22 classes; lots of missing data values 23. Shuttle Landing Control database -- tiny, 15-instance database with 7 attributes per instance; 2 classes -- appears to be well-known in the decision-tree community 24-25. Soybean data (Michalski) -- Documentation: Only the statistics is missing -- (2 sizes) -- Michalski's famous soybean disease databases 26. Low resolution spectrometer data (IRAS data -- NASA Ames Research Center) -- Documentation: no statistics nor class distribution given -- LARGE database...and this is only 531 of the instances -- 98 attributes per instance (all numeric) -- Contact NASA-Ames Research Center for more information 27-34. Thyroid patient records classified into disjoint disease classes (Garavan Institute) -- Documentation: as given by Ross Quinlan -- 6 databases from the Garavan Institute in Sydney, Australia -- Approximately the following for each database: -- 2800 training (data) instances and 972 test instances -- plenty of missing data -- 29 or so attributes, either Boolean or continuously-valued -- 2 additional databases, also from Ross Quinlan, are also here -- hypothyroid.data and sick-euthyroid.data -- Quinlan believes that these databases have been corrupted -- Their format is highly similar to the other databases 35. University data (Lebowitz) -- Documentation: scant; we've left it in its original (LISP-readable) form -- 285 instances, including some duplicates -- At least one attribute, academic-emphasis, can have multiple values per instance -- The user is encouraged to pursue the Lebowitz reference for more information on the database 36. Congressional voting records classified into Republican or Democrat (1984 United Stated Congressional Voting Records) -- Documentation: completed -- All attributes are Boolean valued; plenty of missing values; 2 classes -- Also, their is a 2nd, undocumented database containing 1986 voting records here. (will be) 37-38. Waveform data generator (Classification and Regression Trees book) -- Documentation: no statistics -- CART book's waveform domains -- 21 and 40 continuous attributes respectively -- difficult concepts to learn, but known Bayes optimal classification rate of 86% accuracy 39-47. Undocumented databases: see the sub-directory named undocumented 1. Bradshaw's flare data 2. Pat Langley's data generator 3. David Lewis's information retrieval (IR) data collection (offline) 4. Mike Pazzani's economic sanctions database 5. Ross Quinlan's latest version of the thyroid database 6. Philippe Collard's database on cloud cover images 7. Mary McLeish & Matt Cecile's database on horse colic 8. Paul O'Rorke's database containing theorems from Principia Mathematica 9. John Gennari's program for creating structured objects ("animals") 48. Nine simple small EBL domain theories and examples in sub-directory ebl 1. cup 2. deductive.assumable (contains three domain theories) 3. emotion 4. ice 5. pople 6. safe-to-stack 7. suicide