# 12 Bibliography

[1]

D. W. Aha. Incremental, instance-based learning of independent and graded concept descriptions. In Sixth International Machine Learning Workshop, pages 387–391, San Mateo, CA, 1989. Morgan Kaufmann.

[2]

D. W. Aha. A Study of Instance-Based Algorithms for Supervised Learning Tasks: Mathematical, Empirical and Psychological Observations. PhD thesis, University of California, Irvine, Department of Information and Computer Science, 1990.

[3]

D. W. Aha. Editorial of special issue on lazy learning. Artificial Intelligence Review, 11(1–5):1–6, 1997.

[4]

H. Akaike. Fitting autoregressive models for prediction. Annals of the Institute of Statistical Mathematics, 21:243–247, 1969.

[5]

D. M. Allen. The relationship between variable selection and data augmentation and a method for prediction. Technometrics, 16:125–127, 1974.

[6]

B. D. O. Anderson and M. Deistler. Identifiability in dynamic errors-in-variables models. Journal of Time Series Analysis, 5:1–13, 1984.

[7]

C. G. Atkeson, A. W. Moore, and S. Schaal. Locally weighted learning. Artificial Intelligence Review, 11(1–5):11–73, 1997.

[8]

R. Babuska. Fuzzy Modeling and Identification. PhD thesis, Technische Universiteit Delft, 1996.

[9]

R. Babuska and H. B. Verbruggen. Fuzzy set methods for local modelling and identification. In R. Murray-Smith and T. A. Johansen, editors, Multiple Model Approaches to Modeling and Control, pages 75–100. Taylor and Francis, 1997.

[10]

A. R. Barron. Predicted squared error: a criterion for automatic model selection. In S. J. Farlow, editor, Self-Organizing Methods in Modeling, volume 54, pages 87–103, New York, 1984. Marcel Dekker.

[11]

W. G. Baxt. Improving the accuracy of an artificial neural network using multiple differently trained networks. Neural Computation, 4:772–780, 1992.

[12]

M. G. Bello. Enhanced training algorithms, and integrated training/architecture selection for multilayer perceptron networks. IEEE Transactions on Neural Networks, 3(6):864–875, 1992.

[13]

H. Bersini and G. Bontempi. Fuzzy models viewed as multi-expert networks. In IFSA ’97 (7th International Fuzzy Systems Association World Congress, Prague), pages 354–359, Prague, 1997. Academia.

[14]

H. Bersini and G. Bontempi. Now comes the time to defuzzify the neuro-fuzzy models. Fuzzy Sets and Systems, 90(2):161–170, 1997.

[15]

H. Bersini, G. Bontempi, and C. Decaestecker. Comparing RBF and fuzzy inference systems on theoretical and practical basis. In F. Fogelman-Soulié and P. Gallinari, editors, ICANN ’95, International Conference on Artificial Neural Networks, pages 169–174, 1995.

[16]

J. C. Bezdek. Fuzzy Mathematics in Pattern Classification. PhD thesis, Applied Math. Center, Cornell University, Ithaca, 1973.

[17]

J. C. Bezdek and J. C. Dunn. Optimal fuzzy partition: a heuristic for estimating the parameters in a mixture of normal distributions. IEEE Transactions on Computers, 24:835–838, 1975.

[18]

J. C. Bezdek, C. Coray, R. Gunderson, and J. Watson. Detection and characterization of cluster substructure. SIAM Journal on Applied Mathematics, 40(2):339–372, 1981.

[19]

M. Birattari, G. Bontempi, and H. Bersini. Lazy learning meets the recursive least-squares algorithm. In M. S. Kearns, S. A. Solla, and D. A. Cohn, editors, NIPS 11, pages 375–381, Cambridge, 1999. MIT Press.

[20]

C. M. Bishop. Neural Networks for Statistical Pattern Recognition. Oxford University Press, Oxford, UK, 1994.

[21]

G. Bontempi and H. Bersini. Identification of a sensor model with hybrid neuro-fuzzy methods. In A. B. Bulsari and S. Kallio, editors, Neural Networks in Engineering Systems (Proceedings of the 1997 International Conference on Engineering Applications of Neural Networks, EANN ’97, Stockholm, Sweden), pages 325–328, 1997.

[22]

G. Bontempi, M. Birattari, and H. Bersini. Lazy learning for modeling and control design. International Journal of Control, 72(7/8):643–658, 1999.

[23]

G. Bontempi, M. Birattari, and H. Bersini. A model selection approach for local learning. Artificial Intelligence Communications, 121(1), 2000.

[24]

L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone. Classification and Regression Trees. Wadsworth International Group, Belmont, CA, 1984.

[25]

D. Broomhead and D. Lowe. Multivariable functional interpolation and adaptive networks. Complex Systems, 2:321–355, 1988.

[26]

V. Cherkassky and F. Mulier. Learning from Data: Concepts, Theory, and Methods. Wiley, New York, 1998.

[27]

W. S. Cleveland. Robust locally weighted regression and smoothing scatterplots. Journal of the American Statistical Association, 74:829–836, 1979.

[28]

W. S. Cleveland and S. J. Devlin. Locally weighted regression: an approach to regression analysis by local fitting. Journal of the American Statistical Association, 83:596–610, 1988.

[29]

W. S. Cleveland and C. Loader. Smoothing by local regression: Principles and methods. Computational Statistics, 11, 1995.

[30]

T. Cover and P. Hart. Nearest neighbor pattern classification. IEEE Transactions on Information Theory, 13(1):21–27, 1967.

[31]

P. Craven and G. Wahba. Smoothing noisy data with spline functions: Estimating the correct degree of smoothing by the method of generalized cross-validation. Numer. Math., 31:377–403, 1979.

[32]

G. Cybenko. Just-in-time learning and estimation. In S. Bittanti and G. Picci, editors, Identification, Adaptation, Learning. The Science of Learning Models from data, NATO ASI Series, pages 423–434. Springer, 1996.

[33]

P. Dalgaard. Introductory Statistics with R. Springer, 2002.

[34]

A. Dean and D. Voss. Design and Analysis of Experiments. Springer Verlag, New York, NY, USA, 1999.

[35]

A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39(1):1–38, 1977.

[36]

L. Devroye, L. Györfi, and G. Lugosi. A Probabilistic Theory of Pattern Recognition. Springer Verlag, 1996.

[37]

N. R. Draper and H. Smith. Applied Regression Analysis. John Wiley and Sons, New York, 1981.

[38]

R. O. Duda and P. E. Hart. Pattern Classification and Scene Analysis. Wiley, 1976.

[39]

B. Efron. Bootstrap methods: Another look at the jackknife. Annals of Statistics, pages 1–26, 1979.

[40]

B. Efron. The Jackknife, the Bootstrap and Other Resampling Plans. SIAM, 1982. Monograph 38.

[41]

B. Efron and R. J. Tibshirani. An Introduction to the Bootstrap. Chapman and Hall, New York, NY, 1993.

[42]

B. Efron and R. J. Tibshirani. Cross-validation and the bootstrap: estimating the error rate of a prediction rule. Technical report, Stanford University, 1995.

[43]

J. Fan and I. Gijbels. Variable bandwidth and local linear regression smoothers. The Annals of Statistics, 20(4):2008–2036, 1992.

[44]

J. Fan and I. Gijbels. Adaptive order polynomial fitting: bandwidth robustification and bias reduction. J. Comp. Graph. Statist., 4:213–227, 1995.

[45]

J. Fan and I. Gijbels. Local Polynomial Modelling and Its Applications. Chapman and Hall, 1996.

[46]

U. Fayyad, G. Piatetsky-Shapiro, and P. Smyth. The KDD process for extracting useful knowledge from volumes of data. Communications of the ACM, 39(11):27–34, November 1996.

[47]

V. Fedorov. Theory of Optimal Experiments. Academic Press, 1972.

[48]

F. Fleuret. Fast binary feature selection with conditional mutual information. Journal of Machine Learning Research, 5:1531–1555, 2004.

[49]

Y. Freund and R. E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1):119–139, 1997.

[50]

J. H. Friedman. Flexible metric nearest neighbor classification. Technical report, Stanford University, 1994.

[51]

A. Gelman. Bayesian Data Analysis. Chapman and Hall, 2004.

[52]

Z. Ghahramani and M. I. Jordan. Supervised learning from incomplete data via an EM approach. In J. D. Cowan, G. T. Tesauro, and J. Alspector, editors, Advances in Neural Information Processing Systems, volume 6, pages 120–127, San Mateo, CA, 1994. Morgan Kaufmann.

[53]

D. E. Goldberg. Genetic Algorithms in Search, Optimization and Machine Learning. Addison-Wesley, Reading, MA, 1989.

[54]

T. R. Golub, D. K. Slonim, P. Tamayo, C. Huard, and M. Gaasenbeek. Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring. Science, 286:531–537, 1999.

[55]

D. E. Gustafson and W. C. Kessel. Fuzzy clustering with a fuzzy covariance matrix. In Proc. IEEE CDC, pages 761–776, San Diego, CA, USA, 1979.

[56]

I. Guyon and A. Elisseeff. An introduction to variable and feature selection. Journal of Machine Learning Research, 3:1157–1182, 2003.

[57]

I. Guyon, S. Gunn, M. Nikravesh, and L. A. Zadeh. Feature Extraction: Foundations and Applications. Springer-Verlag New York, Inc., 2006.

[58]

D. J. Hand. Discrimination and Classification. John Wiley, New York, 1981.

[59]

W. Härdle and J. S. Marron. Fast and simple scatterplot smoothing. Comp. Statist. Data Anal., 20:1–17, 1995.

[60]

T. Hastie and R. Tibshirani. Generalized Additive Models. Chapman and Hall, London, UK, 1990.

[61]

T. Hastie and R. Tibshirani. Discriminant adaptive nearest neighbor classification. IEEE Transactions on Pattern Analysis and Machine Intelligence, 18(6):607–615, 1996.

[62]

T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning. Springer, 2001.

[63]

J. S. U. Hjorth. Computer Intensive Statistical Methods. Chapman and Hall, 1994.

[64]

W. Hoeffding. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58:13–30, 1963.

[65]

P. J. Huber. Robust Statistics. Wiley, New York, 1981.

[66]

A. K. Jain, R. C. Dubes, and C. Chen. Bootstrap techniques for error estimation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 5:628–633, 1987.

[67]

J.-S. R. Jang. ANFIS: Adaptive-network-based fuzzy inference systems. IEEE Transactions on Systems, Man, and Cybernetics, 23(3):665–685, 1993.

[68]

J.-S. R. Jang, C. T. Sun, and E. Mizutani. Neuro-Fuzzy and Soft Computing. Matlab Curriculum Series. Prentice Hall, 1997.

[69]

E. T. Jaynes. Probability Theory: The Logic of Science. Cambridge University Press, 2003.

[70]

T. A. Johansen and B. A. Foss. Constructing NARMAX models using ARMAX models. International Journal of Control, 58:1125–1153, 1993.

[71]

M. C. Jones, J. S. Marron, and S. J. Sheather. A brief survey of bandwidth selection for density estimation. Journal of the American Statistical Association, 90, 1995.

[72]

V. Y. Katkovnik. Linear and nonlinear methods of nonparametric regression analysis. Soviet Automatic Control, 5:25–34, 1979.

[73]

S. Kirkpatrick, C. D. Gelatt, and M. P. Vecchi. Optimization by simulated annealing. Science, 220(4598):671–680, 1983.

[74]

R. Kohavi. A study of cross-validation and bootstrap for accuracy estimation and model selection. In Proceedings of IJCAI-95, 1995. available at http://robotics.stanford.edu/users/ronnyk/ronnyk-bib.html.

[75]

R. Kohavi and G. H. John. Wrappers for feature subset selection. Artificial Intelligence, 97(1-2):273–324, 1997.

[76]

A. N. Kolmogorov. Foundations of Probability. Berlin, 1933.

[77]

J. Kolodner. Case-Based Reasoning. Morgan Kaufmann, 1993.

[78]

R. J. A. Little and D. B. Rubin. Statistical Analysis with Missing Data. Wiley, 2002.

[79]

L. Ljung. System Identification: Theory for the User. Prentice-Hall, Englewood Cliffs, NJ, 1987.

[80]

D. G. Luenberger. Linear and Nonlinear Programming. Addison Wesley, Reading, MA, 1984.

[81]

C. L. Mallows. Discussion of a paper of Beaton and Tukey. Technometrics, 16:187–188, 1974.

[82]

C. L. Mallows. Some comments on $C_p$. Technometrics, 15:661, 1973.

[83]

O. Maron and A. Moore. The racing algorithm: Model selection for lazy learners. Artificial Intelligence Review, 11(1–5):193–225, 1997.

[84]

J. Moody. The effective number of parameters: An analysis of generalization and regularization in nonlinear learning systems. In J. E. Moody, S. J. Hanson, and R. P. Lippmann, editors, Advances in Neural Information Processing Systems, volume 4, pages 847–854, Palo Alto, 1992. Morgan Kaufmann.

[85]

J. Moody and C. J. Darken. Fast learning in networks of locally-tuned processing units. Neural Computation, 1(2):281–294, 1989.

[86]

A. W. Moore, D. J. Hill, and M. P. Johnson. An empirical investigation of brute force to choose features, smoothers and function approximators. In S. Janson, S. Judd, and T. Petsche, editors, Computational Learning Theory and Natural Learning Systems, volume 3. MIT Press, Cambridge, MA, 1992.

[87]

R. Murray-Smith. A local model network approach to nonlinear modelling. PhD thesis, Department of Computer Science, University of Strathclyde, Strathclyde, UK, 1994.

[88]

R. Murray-Smith and T. A. Johansen. Local learning in local model networks. In R. Murray-Smith and T. A. Johansen, editors, Multiple Model Approaches to Modeling and Control, chapter 7, pages 185–210. Taylor and Francis, 1997.

[89]

R. H. Myers. Classical and Modern Regression with Applications. PWS-KENT Publishing Company, Boston, MA, second edition, 1994.

[90]

E. Nadaraya. On estimating regression. Theory of Prob. and Appl., 9:141–142, 1964.

[91]

A. Papoulis. Probability, Random Variables, and Stochastic Processes. McGraw Hill, 1991.

[92]

E. Parzen. On estimation of a probability density function and mode. Annals of Mathematical Statistics, 33:1065–1076, 1962.

[93]

J. Pearl. Causality. Cambridge University Press, 2000.

[94]

H. Peng, F. Long, and C. Ding. Feature selection based on mutual information: criteria of max-dependency, max-relevance, and min-redundancy. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27, 2005.

[95]

M. P. Perrone and L. N. Cooper. When networks disagree: Ensemble methods for hybrid neural networks. In R. J. Mammone, editor, Artificial Neural Networks for Speech and Vision, pages 126–142. Chapman and Hall, 1993.

[96]

D. Plaut, S. Nowlan, and G. E. Hinton. Experiments on learning by back propagation. Technical Report CMU-CS-86-126, Department of Computer Science, Carnegie Mellon University, Pittsburgh, PA, 1986.

[97]

M. J. D. Powell. Algorithms for Approximation, chapter Radial Basis Functions for multivariable interpolation: a review, pages 143–167. Clarendon Press, Oxford, 1987.

[98]

W. H. Press, S. A. Teukolsky, W. T. Vetterling, and B. P. Flannery. Numerical Recipes in C: The Art of Scientific Computing. Cambridge University Press, 1992. Second ed.

[99]

J. R. Quinlan. Induction of decision trees. Machine Learning, 1:81–106, 1986.

[100]

J. R. Quinlan. Simplifying decision trees. International Journal of Man-Machine Studies, 27:221–234, 1987.

[101]

R Development Core Team. R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria, 2004. ISBN 3-900051-07-0.

[102]

J. Rissanen. Modeling by shortest data description. Automatica, 14:465–471, 1978.

[103]

M. Rosenblatt. Remarks on some nonparametric estimates of a density function. Annals of Mathematical Statistics, 27:832–837, 1956.

[104]

D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning representations by back-propagating errors. Nature, 323(9):533–536, 1986.

[105]

Y. Saeys, I. Inza, and P. Larrañaga. A review of feature selection techniques in bioinformatics. Bioinformatics, 23:2507–2517, 2007.

[106]

R. E. Schapire. Nonlinear Estimation and Classification, chapter The boosting approach to machine learning: An overview. Springer.

[107]

D. W. Scott. Multivariate Density Estimation. Wiley, New York, 1992.

[108]

C. Stanfill and D. Waltz. Toward memory-based reasoning. Communications of the ACM, 29(12):1213–1228, 1986.

[109]

C. Stone. Consistent nonparametric regression. The Annals of Statistics, 5:595–645, 1977.

[110]

M. Stone. Cross-validatory choice and assessment of statistical predictions. Journal of the Royal Statistical Society B, 36(1):111–147, 1974.

[111]

M. Stone. An asymptotic equivalence of choice of models by cross-validation and Akaike’s criterion. Journal of the Royal Statistical Society, Series B, 39:44–47, 1977.

[112]

T. Takagi and M. Sugeno. Fuzzy identification of systems and its applications to modeling and control. IEEE Transactions on Systems, Man, and Cybernetics, 15(1):116–132, 1985.

[113]

H. Tijms. Understanding Probability. Cambridge University Press, 2004.

[114]

V. N. Vapnik. Principles of risk minimization for learning theory. In Advances in Neural Information Processing Systems, volume 4, Denver, CO, 1992.

[115]

V. N. Vapnik. The Nature of Statistical Learning Theory. Springer, New York, NY, 1995.

[116]

V. N. Vapnik. Statistical Learning Theory. Springer, 1998.

[117]

W. N. Venables and D. M. Smith. An Introduction to R. Network Theory, 2002.

[118]

T. P. Vogl, J. K. Mangis, A. K. Rigler, W. T. Zink, and D. L. Alkon. Accelerating the convergence of the back-propagation method. Biological Cybernetics, 59:257–263, 1988.

[119]

L. Wasserman. All of Statistics. Springer, 2004.

[120]

G. Watson. Smooth regression analysis. Sankhyā, Series A, 26:359–372, 1964.

[121]

S. M. Weiss and C. A. Kulikowski. Computer Systems That Learn. Morgan Kaufmann, San Mateo, California, 1991.

[122]

D. H. Wolpert and R. Kohavi. Bias plus variance decomposition for zero-one loss functions. In Proceedings of the 13th International Conference on Machine Learning, pages 275–283, 1996.

[123]

L. A. Zadeh. Fuzzy sets. Information and Control, 8:338–353, 1965.