EnDeep4mC: A dual-adaptive feature encoding framework in deep ensembles for predicting DNA N4-methylcytosine sites
A deep learning ensemble predictor for DNA 4mC (N4-methylcytosine) modification sites. It combines CNN, BiLSTM, and Transformer base models with advanced feature engineering.

ℹ️ Variable-Length Sequence Processing

EnDeep4mC processes DNA sequences from 20 to 100,000 bp using the following pipeline:

1. Length Standardization: all sequences are standardized to 41 bp windows through appropriate padding or segmentation.
2. Cytosine Filtering: only windows with a cytosine (C) at the center position are analyzed for 4mC prediction.
3. Sliding Window Analysis: long sequences are scanned with overlapping 41 bp windows at a 1 bp step size.
4. Ensemble Prediction: each 41 bp window is scored by the CNN, BiLSTM, and Transformer ensemble.

Note: For sequences longer than 41 bp, per-window predictions are averaged at each cytosine position to provide robust methylation likelihood scores.
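The window extraction and cytosine filtering steps above can be sketched as follows. This is a minimal illustration, not the released code: `predict_per_cytosine` and `dummy_score` are hypothetical names, and the dummy scorer merely stands in for the real CNN/BiLSTM/Transformer ensemble.

```python
def predict_per_cytosine(seq, score_window, win=41, step=1):
    """Score every 41 bp window whose center base is a cytosine (C);
    returns {cytosine_position: methylation_score}."""
    half = win // 2
    results = {}
    for start in range(0, len(seq) - win + 1, step):
        window = seq[start:start + win]
        if window[half] == "C":              # step 2: cytosine filtering
            results[start + half] = score_window(window)
    return results

# Dummy scorer standing in for the deep ensemble (illustration only).
dummy_score = lambda window: window.count("CG") / len(window)

seq = "ATCGATCGATCGATCGATCGCATCGATCGATCGATCGATCGATCG"  # 45 bp example
scores = predict_per_cytosine(seq, dummy_score)        # keys: center-C positions
```

With a 1 bp step, each cytosine deep enough inside the sequence is the center of exactly one window; positions closer than 20 bp to either end have no complete window and are skipped.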

Model Architecture Details

Table 1. Architecture and Hyperparameters of Base Deep Learning Models

| Component | CNN | BiLSTM | Transformer |
|---|---|---|---|
| Input | (1, feature_dim) | (1, feature_dim) | (None, feature_dim) |
| Layer 1 | Conv1D(256, 1) + BN | BiLSTM(128) | MultiHead(8 heads) |
| Layer 2 | SepConv1D(128, 3) + Pool | BN | FFN(512) + LayerNorm |
| Layer 3 | Conv1D(64, 1) | BiLSTM(64) | 2 encoder layers |
| Pooling | GlobalMaxPool | – | GlobalAvgPool |
| Dense | Dense(128) | Dense(64) | Dense(128) |
| Output | Dense(1, sigmoid) | Dense(1, sigmoid) | Dense(1, sigmoid) |
| Regularization | L2(0.001), Dropout(0.3) | L2(0.001), Dropout(0.2), RecurrentDropout(0.1) | L2(0.001), Dropout(0.1) |
| Optimizer | Adam(lr=0.001, clip=1.0) | Adam(lr=0.001, clip=1.0) | Adam(lr=0.001) |
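The CNN column of Table 1 can be sketched in Keras roughly as below. This is a reconstruction from the table, not the authors' released code: the activation functions, pooling size, and the reading of "clip=1.0" as `clipnorm` are assumptions.

```python
from tensorflow.keras import layers, models, optimizers, regularizers

def build_cnn(feature_dim, l2=1e-3):
    """CNN base model sketched from Table 1 (hypothetical reconstruction)."""
    model = models.Sequential([
        layers.Input(shape=(1, feature_dim)),
        layers.Conv1D(256, 1, activation="relu",
                      kernel_regularizer=regularizers.l2(l2)),  # Layer 1 + BN
        layers.BatchNormalization(),
        layers.SeparableConv1D(128, 3, padding="same",
                               activation="relu"),              # Layer 2 + Pool
        layers.MaxPooling1D(pool_size=1),
        layers.Conv1D(64, 1, activation="relu"),                # Layer 3
        layers.GlobalMaxPooling1D(),
        layers.Dense(128, activation="relu"),
        layers.Dropout(0.3),
        layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer=optimizers.Adam(learning_rate=1e-3, clipnorm=1.0),
                  loss="binary_crossentropy", metrics=["accuracy"])
    return model
```

The BiLSTM and Transformer branches follow the same pattern with their respective columns; all three end in a single sigmoid unit so their probabilities can be stacked by the ensemble layer.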

Table 2. Configuration of Ensemble Learning Framework

| Component | XGBoost Configuration | LightGBM Configuration | Meta-Learner Configuration |
|---|---|---|---|
| Model Type | XGBClassifier | LGBMClassifier | LogisticRegression |
| Number of Trees | n_estimators=500 | n_estimators=300 | – |
| Learning Rate | 0.05 | 0.05 | – |
| Depth | max_depth=7 | num_leaves=63 | – |
| Regularization | gamma=0.1, subsample=0.8 | reg_alpha=0.2, reg_lambda=0.2 | C=0.6, l1_ratio=0.5 |
| Others | colsample_bytree=0.8 | min_child_samples=20 | penalty='elasticnet', solver='saga' |
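A sketch of the meta-learner configuration from Table 2. To keep the example self-contained, the XGBoost/LightGBM base learners are replaced by simulated out-of-fold probabilities; in the real framework those columns would come from the configured `XGBClassifier` and `LGBMClassifier`.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Simulated out-of-fold base-model probabilities (two columns, one per
# base learner); real values would come from XGBClassifier/LGBMClassifier.
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=200)
base_probs = np.clip(y[:, None] * 0.6 + rng.random((200, 2)) * 0.4, 0.0, 1.0)

# Meta-learner exactly as configured in Table 2.
meta = LogisticRegression(penalty="elasticnet", solver="saga",
                          C=0.6, l1_ratio=0.5, max_iter=5000)
meta.fit(base_probs, y)
probs = meta.predict_proba(base_probs)[:, 1]   # final 4mC likelihoods
```

Elastic-net regularization (`l1_ratio=0.5`) requires the `saga` solver in scikit-learn, which is why both appear together in the table.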

Model Architecture Diagram

Integrated Deep Learning Architecture with Dual-Adaptive Encoding System

Example FASTA Format

>Sample_Sequence_1 (41bp)
ATCGATCGATCGATCGATCGCATCGATCGATCGATCGATCG
>Sample_Sequence_2 (50bp)
CGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCG
>Sample_Sequence_3 (30bp)
ATCGATCGATCGATCGATCGATCGATCGAT

Variable-Length Examples: The webserver accepts sequences between 20 and 100,000 bp. Short sequences are standardized to 41 bp, and long sequences are analyzed using sliding windows.
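A minimal sketch of reading the FASTA input and standardizing a short sequence to a 41 bp window. The centered 'N'-padding scheme shown here is an assumed example; the server's actual padding strategy may differ.

```python
def parse_fasta(text):
    """Minimal FASTA parser: returns {header: sequence}."""
    records, header, chunks = {}, None, []
    for line in text.strip().splitlines():
        if line.startswith(">"):
            if header is not None:
                records[header] = "".join(chunks)
            header, chunks = line[1:].strip(), []
        else:
            chunks.append(line.strip())
    if header is not None:
        records[header] = "".join(chunks)
    return records

def pad_to_window(seq, win=41, pad_char="N"):
    """Center a short sequence in a 41 bp window, padding both ends
    (padding scheme is an assumption for illustration)."""
    if len(seq) >= win:
        return seq
    total = win - len(seq)
    left = total // 2
    return pad_char * left + seq + pad_char * (total - left)

fasta = """>Sample_Sequence_3 (30bp)
ATCGATCGATCGATCGATCGATCGATCGAT"""
seqs = parse_fasta(fasta)
window = pad_to_window(seqs["Sample_Sequence_3 (30bp)"])  # 41 bp, 'N'-padded
```

Sequences longer than 41 bp are left intact here; they would be handled by the sliding-window analysis described above.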
