Driving style detection is an essential real-world requirement in diverse contexts, such as traffic safety, car insurance, and fuel consumption optimization. However, existing methods either rely on handcrafted features or fail to explore deep spatial-temporal features from multi-modal sensing signals. In this paper, we propose DSDCLA, a novel attention-based hybrid framework that combines a convolutional neural network (CNN) with long short-term memory (LSTM), to address these problems. Specifically, DSDCLA first applies a CNN and self-attention to extract local spatial features from multi-modal driving sequences. It then uses an LSTM and multi-head attention to explore the long-term temporal relationships between timesteps. As a result, DSDCLA can identify driving style from both short- and long-term spatial-temporal features. Furthermore, we design three variants with different levels of fusion, which show the benefit of each selected component and improve interpretability. We extensively evaluate DSDCLA on two public real-world datasets, and the experimental results show that it outperforms current state-of-the-art methods, achieving F1-scores of 97.03% and 97.65%. Numerous ablation studies and visualizations indicate the effectiveness of the model and the importance of multi-level attention fusion for identifying driving style across timesteps.
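To make the pipeline described above concrete, the following is a minimal forward-pass sketch in NumPy of a convolution, self-attention, LSTM, and multi-head attention stack. It is an illustration under stated assumptions, not the DSDCLA implementation: the layer sizes, the number of sensor channels and classes, the mean pooling before the classifier, the random untrained weights, and all function names are hypothetical choices made for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def conv1d_relu(x, w, b):
    """'Same'-padded 1-D convolution over time. x: (T, C_in), w: (K, C_in, C_out), K odd."""
    k = w.shape[0]
    xp = np.pad(x, ((k // 2, k // 2), (0, 0)))
    out = np.stack([np.tensordot(xp[t:t + k], w, axes=([0, 1], [0, 1]))
                    for t in range(x.shape[0])])
    return np.maximum(out + b, 0.0)

def multi_head_attention(x, wq, wk, wv, wo, heads):
    """Scaled dot-product self-attention over timesteps. x: (T, D)."""
    t, d = x.shape
    dh = d // heads
    # Project, then split the feature axis into heads: (heads, T, dh).
    q, k, v = ((x @ w).reshape(t, heads, dh).transpose(1, 0, 2) for w in (wq, wk, wv))
    weights = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(dh))  # (heads, T, T)
    ctx = (weights @ v).transpose(1, 0, 2).reshape(t, d)       # merge heads back
    return ctx @ wo, weights

def lstm(x, w, u, b):
    """Single-layer LSTM. x: (T, D), w: (D, 4H), u: (H, 4H), b: (4H,). Returns (T, H)."""
    h = np.zeros(u.shape[0])
    c = np.zeros(u.shape[0])
    outs = []
    for x_t in x:
        i, f, g, o = np.split(x_t @ w + h @ u + b, 4)
        c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
        h = sigmoid(o) * np.tanh(c)
        outs.append(h)
    return np.stack(outs)

def init(shape, scale=0.1):
    return rng.normal(0.0, scale, size=shape)

def make_params(c_in, d, h, n_classes, k=3):
    # Random, untrained weights; all sizes are hypothetical.
    return {
        "conv_w": init((k, c_in, d)), "conv_b": np.zeros(d),
        "sa": [init((d, d)) for _ in range(4)],
        "lstm_w": init((d, 4 * h)), "lstm_u": init((h, 4 * h)), "lstm_b": np.zeros(4 * h),
        "mha": [init((h, h)) for _ in range(4)],
        "out_w": init((h, n_classes)), "out_b": np.zeros(n_classes),
    }

def forward(x, p, heads=4):
    feat = conv1d_relu(x, p["conv_w"], p["conv_b"])                 # local features (T, D)
    feat, _ = multi_head_attention(feat, *p["sa"], heads=1)         # single-head self-attention
    seq = lstm(feat, p["lstm_w"], p["lstm_u"], p["lstm_b"])         # temporal features (T, H)
    seq, attn = multi_head_attention(seq, *p["mha"], heads=heads)   # multi-head attention
    logits = seq.mean(axis=0) @ p["out_w"] + p["out_b"]             # assumed mean pooling
    return softmax(logits), attn

# One window of 64 timesteps with 9 hypothetical sensor channels and 3 hypothetical classes.
params = make_params(c_in=9, d=16, h=16, n_classes=3)
window = rng.normal(size=(64, 9))
probs, attn = forward(window, params)
```

Here `probs` holds the class probabilities for one window, and `attn` holds one timestep-by-timestep weight matrix per head, which is the kind of quantity an attention visualization would plot. A trained model would learn these weights from labeled driving sequences rather than sampling them at random.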