Study on rural landscape design strategies integrating computer vision and deep learning: an analysis based on human perception and visual aesthetics

Background, issues, and research objectives

Over the past decade, advances in artificial intelligence—particularly computer vision and deep learning—have been increasingly leveraged in urban landscape and environmental design to quantify spatial aesthetics and capture human visual preferences. By extracting low- to high-level image features, these methods offer objective, data-driven insights into how people perceive and respond to their surroundings. Yet, most existing work has concentrated on urban streetscapes or natural parks, with little systematic investigation of rural landscapes—environments distinguished by intricate spatial structures and rich cultural diversity. In particular, model interpretability, feedback mechanisms, and the capacity to generalize across different cultural contexts remain underexplored.

Chinese rural landscapes represent a quintessential form of cultural heritage, serving vital social, ecological, and cultural functions¹. Beyond supporting agricultural production and ecosystem services, they embody vernacular traditions that have evolved over centuries. Despite their importance for sustainable development and cultural continuity, rural revitalization projects in China have often suffered from landscape homogenization and planning oversights^2,3. Several rural revitalization projects in China exemplify the issue of urban-centric design logic. For example, the Longtan Water Township in Chengdu replicates Jiangnan-style waterscapes and royal garden forms, yet lacks grounding in local Bashu cultural elements⁴. The Hexianfang Folk Culture Village was developed as a themed tourism site, but many commercial stalls closed shortly after opening due to weak local integration and over-replication⁵. Similarly, in Longtou Village in Shaanxi, traditional cave dwellings have been replaced with stylized “new rural” architectures aimed at media visibility rather than cultural continuity⁶. These cases reflect broader tendencies to prioritize symbolic replication and aesthetic spectacle over ecological and vernacular integrity.

To address these gaps, this study integrates computer vision, deep learning, and human visual perception theory to develop a quantifiable, interpretable scoring model for assessing visual comfort in rural landscapes. Our framework extracts both deep semantic features and structural descriptors from landscape imagery, then translates these into a human-understandable feedback loop for design optimization. By bridging computational evaluation with subjective perception, we aim to overcome current limitations in rural landscape assessment and provide practitioners with a robust, data-driven tool for perception-oriented spatial design.

Literature review

Traditional studies of landscape aesthetics have relied primarily on field surveys, expert ratings, and formalist analyses. While these approaches played an important role in early planning practice, they are widely acknowledged to be limited by subjectivity, weak comparability, and small sample sizes^7,8,9,10. In addition, such methods struggle to capture subtle structural variations dynamically, reducing their applicability to large-scale or multi-temporal contexts^11,12. With the advance of artificial intelligence, computer vision and deep learning have provided new tools for the quantitative study of landscape aesthetics. Convolutional neural networks (CNNs), in particular, can extract high-level features from images to approximate human perception^13,14. For instance, Havinga and colleagues employed social media images and deep learning models to evaluate the aesthetic quality of British landscapes, demonstrating the efficiency of image-based aesthetic scoring and spatial mapping¹⁵. Similarly, Law and colleagues applied CNNs to classify urban images for design purposes¹⁴. Collectively, these studies signal a paradigm shift in landscape perception research from subjective, expert-driven assessment toward data-driven, model-oriented analysis.

Rural landscapes, which combine ecological, cultural, and livelihood functions, have increasingly attracted attention in aesthetic research. Rural areas often preserve ecological textures, embody local cultural symbols, and have become major objects of aesthetic experience through tourism^16,17. However, compared with urban landscapes, rural spaces present greater complexity in spatial structure, scale organization, and functional hybridity, which makes direct transfer of conventional quantitative models difficult.

Recent empirical studies have increasingly adopted computer vision and deep learning to model landscape aesthetics, ranging from street-level imagery to rural UAV photography. These works demonstrate the feasibility of automated prediction and large-scale mapping, yet they often emphasize technical accuracy over interpretability or design applicability. In addition to these empirical studies, Ma and Furuya conducted a systematic review of computer vision methods in landscape research, identifying semantic segmentation, object detection, and generative adversarial network (GAN)-based image synthesis as emerging techniques in perceptual analysis¹⁸. These models allow for the extraction of high-level visual features—such as color harmony, texture diversity, and spatial configuration—that are closely aligned with human aesthetic judgment. Beyond these technical advances, alternative paradigms have also been proposed. Symbolic visual analysis links visual and structural patterns to culturally coded meanings, embedding semantics into preference modeling. GAN-based landscape generation enables the simulation and testing of virtual interventions prior to implementation. Participatory modeling frameworks directly engage stakeholders, integrating community perspectives to complement computational predictions and bridging the gap between algorithmic evaluation and lived experience.

Recent studies have sought to introduce computer vision techniques to improve efficiency and objectivity. For example, Shen and colleagues used semantic segmentation to identify rural roadside green-view indices, farmland density, and building enclosure, developing preference models that demonstrate the effect of structural features on visual experience¹⁹. Gong and colleagues developed a perception system that integrates subjective labels with objective visual features for rural street scenes, enabling large-scale automated assessment of visual quality²⁰. Wang combined DenseNet with Internet of Things (IoT) technologies to optimize visual representations of rural tourism images, enhancing classification and visitor experience²¹. While informative, these studies largely emphasize technical enablement and predictive output rather than translating results into actionable strategies for rural landscape design.

The distinctive characteristics of Chinese rural landscapes further require context-specific modeling and design logic. First, Chinese rural spaces integrate agricultural production, daily life, and traditional culture, producing heterogeneous and multi-scalar spatial patterns²². Second, local culture and collective memory strongly shape aesthetic preferences, embedding ethical, spiritual, and ritual values into judgments of landscape quality²³. Third, Chinese villages remain in transition, where new and old built environments intersect to generate atypical forms that are poorly captured by urban-based models²⁰. Developing an integrated framework that reflects the internal structural logic and perceptual preferences of Chinese rural landscapes is therefore essential.

In summary, although computer vision and deep learning are increasingly applied in landscape aesthetics, significant gaps remain in the rural Chinese context. Current models face at least three limitations: (1) training data are predominantly drawn from Western cities or natural parks, with limited cultural or contextual adaptability²⁴; (2) outputs are often restricted to image-level aesthetic scores, lacking mechanisms to translate “visual comfort” into spatial design strategies²⁵; and (3) models generally lack interpretability, limiting their acceptance by designers and decision-makers. While explainable AI techniques such as SHAP have proven effective in other domains²⁶, their application in landscape aesthetics remains rare. To address these gaps, this study develops a feedback framework linking perceptual scoring, structural features, and design strategies, thereby providing methodological support for applying computer vision and deep learning to rural landscape design.

Theoretical framework

This study is grounded in a multidisciplinary theoretical foundation that integrates computational modeling, environmental psychology, and landscape design theory, as illustrated in the conceptual framework shown in Fig. 1. The first-tier theoretical framework is based on computational visual perception models, particularly the principles of computer vision and deep learning. Convolutional neural networks (CNNs) have demonstrated strong capabilities in extracting high-level aesthetic and semantic features from both natural and built environments. A place-based database provides the semantic basis for scene classification, which is essential for categorizing rural landscape types based on visual input²⁷.

The model is fine-tuned using a custom rural scene dataset that includes both self-collected and publicly available images. It is trained to perform multi-class classification of rural landscape types using cross-entropy loss, following standard image classification protocols²⁸. Prior to model training, standard image preprocessing and augmentation techniques are applied to enhance the model’s robustness and generalization, consistent with the principles of visual representation learning.
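The cross-entropy objective mentioned above can be illustrated with a minimal sketch. It is not the study's training code: the class names and the logit values are hypothetical, chosen only to show how the loss rewards a confident correct prediction and penalizes a wrong one.

```python
import math

def softmax(logits):
    """Convert raw class scores into probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, target_index):
    """Negative log-probability that the model assigns to the true class."""
    probs = softmax(logits)
    return -math.log(probs[target_index])

# Hypothetical rural scene classes, for illustration only.
CLASSES = ["farmland", "village", "waterside", "woodland"]

# Made-up network outputs for a single image; the largest logit is "farmland".
logits = [2.0, 0.5, -1.0, 0.1]

loss_if_farmland = cross_entropy(logits, CLASSES.index("farmland"))
loss_if_waterside = cross_entropy(logits, CLASSES.index("waterside"))
print(f"loss when true class is farmland:  {loss_if_farmland:.3f}")
print(f"loss when true class is waterside: {loss_if_waterside:.3f}")
```

The loss is small when the true label matches the class with the largest logit and grows as the model places probability mass elsewhere, which is the signal that fine-tuning minimizes over the training set.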

The second tier of the framework is grounded in theories of environmental aesthetics and affective computing, linking perceived visual attributes with psychological and emotional responses. At this stage, the study employs the Neural Image Assessment (NIMA) model, which utilizes a deep CNN to predict the distribution of human aesthetic scores, framing visual evaluation as a regression task trained using mean squared error (MSE) loss²⁹. In addition to the deep learning outputs, the analysis incorporates traditional image quality metrics—such as contrast, sharpness, and color harmony—enabling the integration of low-level visual indicators with the high-level perceptual features extracted by the CNN. This fusion aligns with established findings in visual comfort modeling, which suggest that both perceptual cues and semantic content influence human preferences³⁰.
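As a rough illustration of the scoring step described above, the sketch below collapses a predicted score distribution into a single mean score and compares it with annotated scores using MSE. It assumes ten rating buckets (1 to 10) and assumes the MSE is taken between mean scores; the text does not specify either detail, and all numbers are invented.

```python
def mean_score(distribution):
    """Expected rating of a probability distribution over buckets 1..N."""
    return sum((i + 1) * p for i, p in enumerate(distribution))

def mse(predicted, target):
    """Mean squared error between two equally long lists of scores."""
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(predicted)

# Invented predicted distributions for two images (each sums to 1).
predicted_distributions = [
    [0.0, 0.0, 0.1, 0.2, 0.4, 0.2, 0.1, 0.0, 0.0, 0.0],  # peaked around 5
    [0.1] * 10,                                          # uniform over 1..10
]
# Invented human mean ratings for the same two images.
human_scores = [5.0, 6.5]

predicted_scores = [mean_score(d) for d in predicted_distributions]
loss = mse(predicted_scores, human_scores)
print(predicted_scores, loss)
```

Working with the full distribution rather than a single number keeps information about rater disagreement, which a mean score alone would discard.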

To situate the computational results within human perceptual theory, this study draws upon Kaplan’s preference matrix, a classic framework of environmental psychology³¹. The matrix identifies four key predictors of visual preference (coherence, complexity, legibility, and mystery), which are employed in this research to interpret the relationship between CNN-derived visual categories and perceived visual comfort. Furthermore, Berlyne’s arousal theory (1971) is utilized to explain how aesthetic pleasure is modulated by the complexity and novelty of stimuli, offering a theoretical foundation for associating specific image types, such as expansive farmland versus densely built villages, with emotional resonance³².

Moreover, the computational scoring framework is anchored in recent empirical evidence. For instance, Firestone & Sun (2022) demonstrate that aesthetic preference peaks at moderate structural complexity, following an information-theoretic Goldilocks principle, which reinforces Berlyne’s inverted-U concept for edge density and novelty³³. Shayestefar et al. (2022) applied Kaplan’s preference predictors (coherence, legibility, complexity, and mystery) in urban park contexts through discrete choice experiments, showing that each predictor significantly guides users’ landscape preference³⁴. Weinberger et al. (2021) found that coherence and fascination (akin to mystery/legibility) are dominant dimensions in both built and natural environments, aligning with our semantic feature correlations³⁵. Additionally, Lin et al. (2021) revealed that images with higher mystery offer greater attention restoration potential, suggesting that mystery-related features in our model serve functional perceptual roles³⁶. Finally, Hussain & Li (2023) confirmed that aesthetic experiences facilitate learning by enhancing complexity integration, supporting the cognitive relevance of our feature-based scoring approach³⁷. Together, these post-2020 studies empirically ground our use of Berlyne’s and Kaplan’s dimensions, making the model a psychologically interpretable architecture aligned with human perceptual and affective heuristics rather than a mere optimization tool.

Building upon this psychologically grounded framework, the study further integrates a data-driven feedback mechanism into the process of rural landscape design optimization. This transformation from perceptual modeling to practical application aligns with the paradigm of evidence-based design, where environmental interventions are guided by quantifiable perceptual and emotional responses. The proposed system empowers landscape designers to identify problematic visual patterns—such as overly dense edges or low naturalness—and simulate alternative design scenarios that enhance visual comfort. By translating abstract model outputs into design-relevant insights, this approach supports a more human-centered, responsive, and culturally attuned rural planning process.

Terminology and definitions

To enhance conceptual clarity and ensure consistency throughout the manuscript, the following key terms are defined:

Visual comfort: Refers to the subjective degree of visual pleasure or ease experienced by viewers when observing a landscape scene. It is used as a proxy for aesthetic preference in this study^31,32.
Green ratio: The proportion of green pixels in an image, including trees, grass, and other vegetation. It is calculated using color-based segmentation techniques³⁸.
Sky openness: The ratio of visible sky area in an image, derived through semantic segmentation. It reflects the degree of spatial openness perceived by the viewer³⁹.
Edge density: The amount of edge information within an image, calculated using Canny edge detection. It serves as a proxy for visual complexity or structural clutter⁴⁰.
Deep feature score (DFS): A high-dimensional semantic feature vector extracted from a trained ResNet101 model. These features are used as inputs for the regression model to predict visual comfort scores.
Multi-perceptual scores (MPS): A composite set of perceptual ratings that includes dimensions such as aesthetic appeal, naturalness, intimacy, usability, and resilience.
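
The three structural descriptors defined above (green ratio, sky openness, edge density) can be sketched on toy data as follows. This is an illustration under stated assumptions, not the study's pipeline: the "green" rule (green channel dominates red and blue) is a simplistic stand-in for color-based segmentation, the label mask is assumed to come from some semantic segmentation model, and a plain intensity-difference threshold replaces Canny edge detection so the example needs no external libraries. The threshold value and all pixel data are hypothetical.

```python
def green_ratio(rgb_image):
    """Share of pixels whose green channel exceeds both red and blue."""
    pixels = [px for row in rgb_image for px in row]
    green = sum(1 for r, g, b in pixels if g > r and g > b)
    return green / len(pixels)

def sky_openness(label_mask, sky_label="sky"):
    """Share of pixels labeled as sky in a per-pixel class mask."""
    labels = [label for row in label_mask for label in row]
    return labels.count(sky_label) / len(labels)

def edge_density(gray_image, threshold=50):
    """Share of pixels with a large intensity jump to the right or below.

    Simplified stand-in for Canny: no smoothing, no non-maximum
    suppression, no hysteresis. The threshold is a hypothetical value.
    """
    height, width = len(gray_image), len(gray_image[0])
    edges = 0
    for y in range(height):
        for x in range(width):
            dx = abs(gray_image[y][x] - gray_image[y][x + 1]) if x + 1 < width else 0
            dy = abs(gray_image[y][x] - gray_image[y + 1][x]) if y + 1 < height else 0
            if max(dx, dy) > threshold:
                edges += 1
    return edges / (height * width)

# Toy 2x2 RGB image: two vegetation-like pixels, one sky-like, one wall-like.
rgb = [[(30, 160, 40), (120, 180, 240)],
       [(30, 150, 35), (110, 100, 90)]]
# Toy segmentation output for the same 2x2 image (labels are hypothetical).
mask = [["vegetation", "sky"],
        ["vegetation", "building"]]
# Toy 3x3 grayscale image with one vertical boundary between columns 2 and 3.
gray = [[0, 0, 255],
        [0, 0, 255],
        [0, 0, 255]]

print(green_ratio(rgb), sky_openness(mask), edge_density(gray))
```

In practice each descriptor would be computed on full-resolution photographs and the resulting values passed, together with the deep features, to the regression model that predicts visual comfort.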