UNIVERSIDADE DE CAXIAS DO SUL 

LIFE SCIENCES KNOWLEDGE AREA

INSTITUTE OF BIOTECHNOLOGY

GRADUATE PROGRAM IN BIOTECHNOLOGY

 
Promoter sequence characterization through the analysis of 

enthalpy, entropy, stability and base-pair stacking values 

 
Gustavo Sganzerla Martinez 

 
Caxias do Sul 

2018 

 
 

GUSTAVO SGANZERLA MARTINEZ 

 
Promoter sequences classification through the analysis of 

enthalpy, entropy, stability and base-pair stacking values 

 
Dissertation submitted to the Graduate Program in Biotechnology at the
Universidade de Caxias do Sul as a requirement for obtaining the degree of Master in
Biotechnology

Advisor: Prof. Dr. Scheila de Ávila e Silva

DISSERTATION APPROVED ON OCTOBER 19, 2018.

______________________________________________________________________

Advisor: Prof. Dr. Scheila de Ávila e Silva

______________________________________________________________________ 

Prof. Dr. Sérgio Echeverrigaray 

______________________________________________________________________ 

Prof. Dr. Julio Collado-Vides 

______________________________________________________________________ 

Prof. Dr. Luis Fernando Saraiva Macedo Timmers  

 
Caxias do Sul 

2018 


 

International Cataloging-in-Publication Data
(CIP) Universidade de Caxias do Sul

UCS Library System - Technical Processing

Cataloging at source prepared by the librarian

Carolina Machado Quadros - CRB 10/2236

 
M385p Martinez, Gustavo Sganzerla 
Promoter sequences classification through the analysis of enthalpy, 

entropy, stability and base-pair stacking values / Gustavo Sganzerla 
Martinez. – 2018. 

xiii, 79 leaves : il. ; 30 cm 

Dissertation (Masters) - University of Caxias do Sul, Graduate 
Biotechnology Program, 2018. 

Advisor: Scheila de Ávila e Silva. 

1. Enzymes. 2. RNA polymerases. 3. Enthalpy. 4. Entropy. I. Silva,
Scheila de Ávila e, advisor. II. Title.

UDC 2nd ed.: 604.4:577.15


 

Acknowledgements 

 
I am deeply thankful to:

all of my family members, who supported me through every step of this dissertation,
especially my mother Teresinha, my grandparents Darcy and Nair, and my aunt and uncle Jane
and Marcos, along with their children Lorenzo and Eduarda;

all the staff of Inclass school, my workplace, including my boss Carina and our
secretary Marilis, who understood and supported me whenever a step required my
absence on workdays;

my friends, who had the patience to understand my absences at key moments,
especially João Pedro, Felipe, Eduardo and Lucas; the whole crew from our online gaming
community, who were patient enough to understand that I had obligations other than
playing games; and my beloved classmates from undergraduate school, who were present
at several steps and helped me;

the examining board from UCS, the honored professors Daniel and Sérgio, who enriched
this dissertation with helpful advice;

my advisor, professor Scheila, who was a key person in helping me and giving me the
right guidance in decision making throughout the whole process;

all the members of PPGBIO at UCS, and the secretary Lucimara, who was always ready to assist.

 
 

The world is indeed full of peril, and in it there are many dark places; but still there is 

much that is fair, and though in all lands love is now mingled with grief, it grows 

perhaps the greater 

J.R.R. Tolkien

 
 

Table of Contents 
LIST OF FIGURES
LIST OF TABLES
LIST OF ABBREVIATIONS
1 INTRODUCTION
2 AIMS AND OBJECTIVES
2.1 SPECIFIC AIMS AND OBJECTIVES
3 THEORETICAL REFERENCES
3.1 GENE EXPRESSION AND ITS REGULATION
3.1.1 PROMOTER SEQUENCES, TRANSCRIPTION AND THE RNAP ENZYME
3.2 PHYSICAL FEATURES IN PROMOTER SEQUENCES
3.2.1 BASE PAIR CONSERVATION
3.2.2 STABILITY
3.2.3 STRESS INDUCED DNA DUPLEX DESTABILIZATION (SIDD)
3.2.4 DNA CURVATURE AND BENDABILITY
3.2.5 BASE PAIR STACKING
3.2.6 ENTROPY, IRREVERSIBLE PROCESS AND ENTROPY VARIATION WITHIN THE DNA MOLECULE
3.2.7 ENTHALPY
3.2.8 ENTROPIC AND ENTHALPIC CONTRIBUTIONS TO DNA STABILIZATION
3.3 ARTIFICIAL INTELLIGENCE AND MACHINE LEARNING
3.3.1 CLUSTERING AND THE K MEANS ALGORITHM
4 MATERIALS AND METHODS
4.1 PROMOTER DATA ACQUISITION
4.2 DATA TRANSFORMATION
4.3 K-MEANS CLUSTERING
5 RESULTS AND DISCUSSION
5.1 ARTICLE 1 – COMPARISON OF ENTROPY, ENTHALPY, STABILITY AND BASE-PAIR STACKING PROFILES OF E. COLI IN PROMOTER SEQUENCES RECOGNIZED BY DIFFERENT σ FACTORS
5.2 ARTICLE 2 – A LOOKTHROUGH TO CLUSTERS CONTAINING ENTROPY, ENTHALPY, BASE-PAIR STACKING AND STABILITY PROFILES OF E. COLI PROMOTER SEQUENCES IDENTIFIED BY DIFFERENT σ FACTORS
5.3 AN ASSESSMENT REGARDING INTERSECTED PROMOTER SEQUENCES RECOGNIZED BY DIFFERENT σ FACTORS IN THE MOST POPULATED ENTHALPY, BASE-PAIR STACKING AND STABILITY CLUSTERS
5.3.1 INTERSECTION ANALYSIS OF THE MOST POPULATED CLUSTERS IN ORDER TO VERIFY SIMILARITIES
6 CONCLUSIONS
7 REFERENCES

 
 

LIST OF FIGURES 

 
Figure 3.1: Molecular biology central dogma (Adapted from KREBS, GOLDSTEIN and KILPATRICK, 2014)
Figure 3.2: Orthogonal model from molecular biology central dogma (Adapted from LIU et al., 2018)
Figure 3.3: Transcription process (Adapted from KREBS, GOLDSTEIN and KILPATRICK, 2014)
Figure 3.4: Different functions of the σ factor in E. coli. σ70 promoters (Adapted from PAGET and HELMANN, 2003; KREBS, GOLDSTEIN and KILPATRICK, 2014)
Figure 3.5: Bacterial transcription process (Adapted from REECE et al., 2014)
Figure 3.6: Presence of GC pairs in E. coli genome (Adapted from MEYSMAN et al., 2014)
Figure 3.7: Nucleotides after -12 consensus in σ54 promoters (Adapted from BARRIOS, VALDERRAMA and MORETT, 1999)
Figure 3.8: Dinucleotide presence in positions -14/-15 and -16/-17 relative to the TSS (BURR et al., 2000)
Figure 3.9: Analysis of the insertion/deletion rate of dinucleotides grouped in different segments through the promoter sequence (Adapted from EZER, ZABER and ADRYAN, 2014)
Figure 3.10: Hydrogen bonds on nucleic base pairs (WATSON and CRICK, 1953)
Figure 3.11: Stability profile in the E. coli promoter region (RANGANNAN and BANSAL, 2007)
Figure 3.12: SIDD profiles (Adapted from WANG and BENHAM, 2006)
Figure 3.13: DNA curvature profiles from E. coli, comparing the DNA segments with the moment some specific genes (gray region) (OLIVARES-ZAVALETA, JÁUREGUI and MERINO, 2006)
Figure 3.14: Bendability profile of the promoter region compared to other genomic regions (Adapted from MEYSMAN et al., 2014)
Figure 3.15: Base stacking energy profile (MEYSMAN et al., 2014)
Figure 3.16: Irreversible process (YOUNG and FREEDMAN, 2016)
Figure 3.17: Portrayal of the possible distribution of a gas in a closed environment (Adapted from MAIA and BIANCHI, 2007)
Figure 3.18: Data pre (a) and post (b) clustering (JAIN, 2010)
Figure 3.19: Different recognition patterns in the same cluster (CONNEL and JAIN, 2002)
Figure 3.20: K-means illustrated (JAIN, 2010)
Figure 4.1: Processes workflow in this dissertation
Figure 5.1: Intersection representation
Figure 5.2: Enthalpy, stacking and stability profiles in intersected promoter sequences recognized by different σ factors

  
  

LIST OF TABLES 
 

Table 3.1: Description of the RNAP's subunits in E. coli (LEWIN, 2008)
Table 3.2: Stability values for DNA base pairs (SANTALUCIA and HICKS, 2004)
Table 3.3: DNA nucleoside stacking values (ORNSTEIN et al., 1978)
Table 3.4: Entropy values for DNA base pairs (Adapted from SANTALUCIA and HICKS, 2004)
Table 3.5: Enthalpy values of DNA base pairs (SANTALUCIA and HICKS, 2004)
Table 3.6: Description of the steps performed by the K-means algorithm (Adapted from JAIN and DUBES, 1998)
Table 3.7: Comparison between the clustering tools UCLUST and CD-HIT (Adapted from EDGAR, 2010)
Table 4.1: Distribution of examples through the research
Table 4.2: K in stability
Table 4.3: New K values for entropy, enthalpy, base-pair stacking and stability values
Table 5.1: Clustering outcomes

 
 

LIST OF ABBREVIATIONS 

 
UCLUST UCLUST algorithm 

A  Adenine nucleotide 

AI  Artificial intelligence 

BACPP Bacterial promoter prediction 

BLAST Basic local alignment search tool 

BTSS  BTSS finder promoter tool 

C  Cytosine nucleotide 

CD-HIT CD-HIT clustering and comparing tool 

CNN Prom CNN promoter prediction tool 

Dace  DACE clustering algorithm 

DNA  Deoxyribonucleic acid  

FASTA FASTA file format 

G  Guanine nucleotide 

H  Enthalpy 

IT   Information technology 

mRNA Messenger ribonucleic acid 

P  Pressure 

pN  PicoNewton 

RegulonDB Regulon database 

RNA  Ribonucleic acid 

RNAP  Ribonucleic acid polymerase enzyme 


 

SIDD  Stress-induced DNA duplex destabilization

T  Thymine nucleotide 

TSS  Transcription start site

U  Heat 

UBLAST UBLAST algorithm 

V  Volume 

Δs  Entropy variation 

 
 

Abstract 

 
Recognition of the promoter sequence by the RNA polymerase (RNAP) enzyme is a key step in
gene transcription. The promoter is located a few base pairs upstream of the coding region.
An in-depth study of the role of promoter sequences may provide a stronger foundation for
understanding how genes are expressed under different conditions and may yield biological
rules for use in computational techniques such as data clustering. In some respects, a cell
behaves like a man-made machine: its processes make the best possible use of energy sources
without producing too much heat. Accordingly, some physical concepts applied to machines,
such as entropy and enthalpy variation, may also be applied to cells. This dissertation
assesses the role of four physical properties of DNA (entropy, enthalpy, base-pair stacking
and stability) in the characterization of Escherichia coli (E. coli) promoter sequences. To
do so, a clustering technique was used to group promoter sequences according to these
features. From the clustering results, a profile of the physical properties of DNA in
promoter sequences can be drawn and biological inferences made from it. Currently, few
promoter identification tools use combined profiles of enthalpy, entropy, base-pair
stacking and stability. This work reports a strong correlation between enthalpy, stability
and base-pair stacking, in which each combination of these features behaves differently in
promoter sequences recognized by different sigma factors. According to the literature,
promoter sequences are known to differ from other genomic sequences; the results presented
here enable a wider comprehension of the differences among promoters themselves. The
physical profile tends to differ according to the sigma factor associated with RNA
polymerase recognition, and these results may therefore be a valuable contribution to
bioinformatics.

Keywords: Gene transcription; enthalpy; entropy; base-pair stacking; stability.


 

1 INTRODUCTION 
Technological advances in several fields, such as biology, provide a huge amount of data to
the scientific community. This creates the need for computing techniques capable of
predicting, identifying and classifying these data in order to produce biological
inferences.

The DNA molecule contains coding regions whose expression is controlled by regulatory
elements. According to Krebs, Goldstein and Kilpatrick (2014), the study of these elements
aids in understanding gene function in different species and how genes respond to
environmental changes. In addition, understanding gene function provides the scientific
grounds to develop new drugs and to understand the biological mechanisms related to
diseases, for instance.

One of these regulatory elements is the promoter sequence. It is found upstream of the
coding region and is recognized by the DNA-dependent RNA polymerase (RNAP) holoenzyme,
which triggers the start of the transcription process. The nucleotide composition of
promoter regions includes some segments with a certain level of conservation, which helps
their recognition by the RNAP. However, this biological pattern is degenerate, which
hinders its computational analysis. Moreover, several works describe features other than
similarity in nucleotide composition. Some physical features are conserved in promoter
regions and can be used for their identification. These features, namely i) stability;
ii) base-pair stacking; iii) entropy; and iv) enthalpy, aid in RNAP recognition.

The stability of base pairs refers to the amount of free energy involved in a base-pair
interaction. Base-pair stacking is a value associated with the interaction between adjacent
base pairs in the duplex. Entropy and enthalpy are physical and structural features of the
DNA molecule. It is believed that certain molecular tasks performed only in the promoter
region can change the entropy and enthalpy values, making it possible to differentiate
promoters from non-promoters based on these values. Several tools identify promoters based
only on similarities in their nucleotide composition; this computing task can be improved
and made more precise with the use of physical and structural features of the DNA.
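The three thermodynamic quantities named above are linked by the Gibbs free energy relation. The block below is a general textbook identity, not a formula taken from this dissertation; reading the free-energy change ΔG as the "stability" value, and T as the absolute temperature, are assumptions made here for illustration.

```latex
\[
\Delta G = \Delta H - T\,\Delta S
\]
```

Under this reading, a more negative ΔG corresponds to a more stable duplex, with the enthalpy change ΔH and the entropy change ΔS contributing in opposite directions as the temperature varies.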

The first step of this dissertation is to analyze how enthalpy, entropy, base-pair stacking
and stability behave at different sites within the promoter sequence. Once these results
are obtained, it will be possible to discuss, for each group of promoters, the impact that
these physical features have on the transcription process. A clustering analysis is also
proposed to identify and group promoter sequences based on their physical features. A
detailed cluster analysis may yield valuable results: depending on the outcomes, it can be
stated whether or not promoter sequences are being captured by the clustering algorithm on
the basis of their physical and structural properties.

 
 

2 AIMS AND OBJECTIVES 
The aim of this dissertation is to present a profile of how enthalpy, entropy, base-pair
stacking and stability behave inside promoter regions. Interpreting these profiles allows a
better understanding of what happens within promoter regions and makes it possible to
differentiate bacterial promoters recognized by different RNAP σ factors.

 
2.1 SPECIFIC AIMS AND OBJECTIVES 

The specific goals of this dissertation are:

• To analyze promoter sequences in terms of their DNA entropy, enthalpy, base-pair
stacking and stability in different RNAP sigma groups;

• To cluster promoter sequences according to their physical aspects;

• To provide an overview of the physical aspects of promoter sequences;

• To use the knowledge gathered in the assessment of physical features to produce
biological inferences;

• To enable the understanding of the physical aspects of promoter sequences to be used in
in silico analysis.

 
 

3 THEORETICAL REFERENCES 
This section reviews the literature on which this dissertation is based. It brings together
biology, physics and computer science. These three fields show synergy, and the material
ranges from early biological and physical concepts to state-of-the-art computing techniques
in artificial intelligence. The section is organized as follows: i) concepts of molecular
biology; ii) a profile of the physical characteristics of promoter sequences; iii) a review
of enthalpy and entropy from the thermodynamic point of view; and iv) the information
technology (IT) perspective, through the clustering technique, and how IT aids other areas
such as biology.

 
3.1 GENE EXPRESSION AND ITS REGULATION 

The deoxyribonucleic acid (DNA) sequence is described in molecular biology as a repository
of the information needed to build RNA and, through it, in most cases, a protein with a
function related to cell structure, regulation or catalysis (DE ROBERTIS, 2003). Several
mechanisms ensure that the correct gene is expressed at the right moment. These mechanisms
are defined as regulators of gene expression. Through them, a cell, tissue or organism is
able to increase, decrease, start or halt the production of RNA, proteins and other final
gene products according to metabolic demand. When these control mechanisms succeed, they
grant the organism, regardless of its complexity (unicellular or multicellular), the
ability to have its needs met (DE ROBERTIS, 2003; SANDERS and BOWMAN, 2014; BROWNING and
BUSBY, 2016).

Figure 3.1 represents the central dogma of molecular biology, proposed by Watson and Crick
in 1953 (WATSON and CRICK, 1953), which tracks the flow of genetic information. The first
step is DNA replication, which occurs when the organism needs to produce more DNA
molecules from a DNA template. The step in which DNA carries the information to form
ribonucleic acid (RNA) is transcription: the DNA information is transferred to an
intermediary molecule, the RNA. In the last step portrayed in Figure 3.1, translation, the
RNA is used to produce protein molecules, serving an evolutionary response and granting
survivability to the organism itself (CASES et al. 2003; MCADAMS et al. 2004; KREBS,
GOLDSTEIN and KILPATRICK, 2014).

 
Figure 3.1: Molecular biology central dogma (adapted from KREBS, GOLDSTEIN and 
KILPATRICK, 2014) 

 
Liu et al. (2018) proposed an orthogonal model to analyze the central dogma of molecular
biology. The model, shown in Figure 3.2, compares the dogma with computer software that
must run on different platforms. Any large, specific change to the software can affect its
compatibility with other systems. Thus, in order to run on other platforms and be
compatible with other systems, software must minimize the nuances that are specific to a
single organism. This is the definition of an orthogonal system presented by the authors:
the components of the system (DNA, RNA, proteins) interact among themselves to achieve a
specific goal without interrupting, or being interrupted by, native cellular functions. On
the other hand, universal rules should be avoided in the biological sciences; it is
commonly said that the main rule of biology is that there are no rules. Exceptions can be
found to basically every fundamental principle (KOONIN, 2012; LIU et al., 2018).


 

Figure 3.2: Orthogonal model from molecular biology central dogma (Adapted from LIU 
et al., 2018) 

 
The DNA molecule is composed of: i) coding regions, which contain the information for the
final gene products; and ii) regulatory elements, which act as controllers, ensuring that
processes start and finish at the right place and time. The study of these regulatory
elements aids in understanding gene function in different species and how an organism
meets its needs when facing environmental changes. Additionally, knowledge of gene
function provides the scientific grounds to develop new drugs and to understand the
biological mechanisms related to diseases (KREBS, GOLDSTEIN and KILPATRICK, 2014).

3.1.1 PROMOTER SEQUENCES, TRANSCRIPTION AND THE RNAP ENZYME 

Transcription is one of the main steps in the regulation of gene expression in any
organism. Bacteria live in soils, colonize plants and infect animal tissues, and are
exposed to extreme environmental changes in factors such as heat, humidity and acidity,
which can affect cell survival if there is no suitable metabolic response. For this reason,
bacteria rely on regulation systems that optimize the metabolic response, giving the cell
the ability to make the right decisions about which nutrient should have its production
prioritized and which environmental changes should be considered (CASES et al., 2004;
MCADAMS et al. 2004; CASES and LORENZO, 2005).

Transcription produces RNA, a molecule whose sequence is almost identical to that of the
DNA coding strand (with uracil in place of thymine). The coding strand runs in the 5’-3’
direction and is complementary to the template strand, which runs 3’-5’ and serves as the
model for RNA synthesis. RNA synthesis is catalyzed by the enzyme DNA-dependent RNA
polymerase (RNAP), which is detailed later in this section. The transcription process
shown in Figure 3.3 begins when the RNAP identifies a promoter region upstream of the
gene. Starting from this position, the RNAP moves along the gene, performing RNA synthesis
until it finds a terminator sequence, at which point it releases the DNA molecule and
transcription ends. Sequences that precede the transcription start site (TSS) are called
upstream, and sequences that follow it, including the coding region, are called
downstream. Sequences are generally written so that transcription advances from left
(upstream) to right (downstream), which corresponds to writing the messenger RNA (mRNA) in
the 5’-3’ direction (KREBS, GOLDSTEIN and KILPATRICK, 2014).
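The relationship between the coding strand, the template strand and the mRNA can be illustrated with a minimal Python sketch. The function names and the example sequence are hypothetical and chosen only for illustration; the sketch assumes an unambiguous DNA alphabet (A, C, G, T) and ignores promoters, terminators and any RNA processing.

```python
# Minimal sketch (illustrative only): strands are plain strings over A, C, G, T.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def template_strand(coding: str) -> str:
    """Return the template strand, read 3'->5', aligned position by position
    with a coding strand written 5'->3'."""
    return coding.upper().translate(COMPLEMENT)


def transcribe(coding: str) -> str:
    """Return the mRNA (5'->3'): same sequence as the coding strand,
    with uracil (U) in place of thymine (T)."""
    return coding.upper().replace("T", "U")


if __name__ == "__main__":
    coding = "ATGGCT"  # hypothetical example sequence
    print("coding   5'-" + coding + "-3'")
    print("template 3'-" + template_strand(coding) + "-5'")
    print("mRNA     5'-" + transcribe(coding) + "-3'")
```

The mRNA is derived here directly from the coding strand because the two differ only in the T/U substitution; in the cell, the RNAP reads the template strand to obtain the same result.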

 
Figure 3.3: Transcription process (adapted from KREBS, GOLDSTEIN and 
KILPATRICK, 2014) 

 
 

A key element in the transcription process is the promoter sequence. As shown in Figure
3.3, it tells the RNAP where to start the mRNA synthesis. This element is a segment of DNA
that precedes the coding region. The first step in the transcription process is the
identification of the promoter sequence by the RNAP (KIM et al., 2005; ABEEL et al.,
2009).

Another key factor related to bacterial promoters is the action of the RNAP enzyme. The
RNAP is a protein complex whose function is to search specifically for the promoter
region; only after this recognition does the transcription process proceed. RNAP plays a
similar role across living organisms, which makes it an important element of transcription
in all of them. In fact, this enzyme has undergone few alterations during evolution. The
RNAP has six highly conserved subunits: two α subunits, one β subunit, one β’ subunit, one
ω subunit and one σ subunit. The σ subunit directs the bacterial RNAP to specific binding
sites on the DNA, matching the environmental needs of the organism. Table 3.1 lists the
subunits of the RNAP with their specific functions.

Table 3.1: Description of the RNAP's subunits in E. coli (LEWIN, 2008) 

RNAP subunit    Function
α               Binding of regulatory elements
β               Formation of phosphodiester bonds
β’              Binding to the template strand
σ               Promoter identification and initiation of transcription
ω               Strengthening of the association between the subunits

 
The σ factor mediates the interaction between the RNAP and the promoter. Each σ factor can
start the transcription of different genes and groups of genes, being associated with
promoters that regulate the expression of a group of genes during a specific cellular
moment. Each σ subunit is named after its molecular weight in kDa (σ24, σ28, σ32, σ38, σ54
and σ70). One feature involved in the interaction between promoter and RNAP is the
consensus sequences. These are groups of nucleotides that show a level of base-pair
conservation, and they are just one of the recognition cues used by the RNAP. Any
nucleotide change in the consensus sequences can affect the efficiency and speed of the
RNAP. As displayed in Figure 3.4, these nucleotide sequences can be found in several
promoters, creating a biological identification pattern. Thus, when recognizing a
promoter, the RNAP looks for these regions, which reinforces the idea of base conservation
and the formation of consensus sequences (KREBS, GOLDSTEIN and KILPATRICK, 2014).

Bacterial promoters are known to have a level of nucleotide conservation. This helps the
RNAP when it searches for promoters, since the nucleotides tend to follow a certain
biological pattern. In E. coli promoters recognized by σ70, two regions show a higher
level of nucleotide conservation. As indicated in Figure 3.4, these consensus regions are
located at positions -10 and -35 relative to the TSS, and their sequences are TATAAT and
TTGACA, respectively. Figure 3.4 shows promoters recognized by σ70, the σ factor
responsible for starting the transcription of a large number of genes; for this reason it
is known as the housekeeping σ factor, whereas σ24, σ28, σ32, σ38 and σ54 are known as
alternative σ factors (HAUGUEN, ROSS and GOUSE, 2008; LEWIN, 2008; DE ÁVILA E SILVA et
al., 2011, BABU, 2013).
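Because the consensus regions are conserved but degenerate, a sequence is usually compared with the consensus by counting mismatches rather than requiring an exact match. The sketch below illustrates that idea only; it is not the method used in this dissertation or in any promoter prediction tool. The function names and the test sequence are hypothetical, and the scan ignores the expected positions of the hexamers and the spacing between them.

```python
# Minimal sketch (illustrative only): scan a sequence for the window closest
# to each sigma70 consensus hexamer named in the text.
MINUS_10 = "TATAAT"
MINUS_35 = "TTGACA"


def mismatches(window: str, consensus: str) -> int:
    """Count the positions at which two equal-length strings differ."""
    return sum(a != b for a, b in zip(window, consensus))


def best_match(sequence: str, consensus: str) -> tuple[int, int]:
    """Return (start index, mismatch count) of the window in `sequence`
    that is closest to `consensus`; the leftmost window wins ties."""
    k = len(consensus)
    scored = (
        (start, mismatches(sequence[start:start + k], consensus))
        for start in range(len(sequence) - k + 1)
    )
    return min(scored, key=lambda item: item[1])


if __name__ == "__main__":
    # Hypothetical sequence: an exact -35 hexamer, a 17-nt spacer and a
    # -10-like hexamer carrying one substitution (TATACT instead of TATAAT).
    seq = "GG" + MINUS_35 + "G" * 17 + "TATACT" + "GG"
    print(best_match(seq, MINUS_35))
    print(best_match(seq, MINUS_10))
```

A mismatch count of zero means the window equals the consensus; higher counts indicate a more degenerate site, which is the situation that makes purely sequence-based recognition difficult.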

Two main moments can be highlighted in the interaction between the RNAP and the promoter.
Transcription starts with the association between the RNAP and the promoter sequence,
forming a closed complex. At this stage the DNA remains untouched and is protected by
catalytic sites, whose function is to prevent any alteration in the DNA that could cause
unwanted mutations. The closed complex is then converted into an open complex, in which
the DNA is partially unwound, and RNA synthesis begins. Lastly, the σ subunit of the RNAP
detaches from the DNA, and the process ends when a terminator region is found (LEHNINGER,
2000; LEWIN, 2008).

 
 

Figure 3.4: Different functions of the σ factor in E. coli. σ70 promoters (Adapted from 
PAGET and HELMANN, 2003; KREBS, GOLDSTEIN and KILPATRICK, 2014) 

 
As depicted in Figure 3.5, the first step of the transcription process is initiation, when
the promoter is recognized by the RNAP and RNA synthesis begins. Next comes the elongation
stage, in which the so-called DNA bubble is created and moves along the DNA strand while
RNA is synthesized. Finally, in termination, a terminator region is found, the RNAP
detaches from the DNA, the RNA transcript is released and the bubble closes.


 

Figure 3.5: Bacterial Transcription process (Adapted from REECE et al., 2014). 

 
Thus, three main stages can be distinguished in bacterial transcription: i) initiation, in
which the RNAP binds to the promoter; ii) elongation, in which the RNA is formed; and iii)
termination. To start the process, the RNAP must associate with a σ factor, becoming a
holoenzyme that binds the promoter. Next, approximately fourteen DNA nucleotides are
melted in the upstream region towards the TSS, forming an open complex. When about
fourteen RNA nucleotides have been synthesized, the σ factor is released and an elongation
complex is formed, which synthesizes the RNA molecule one nucleotide at a time. Lastly,
when a terminator element is found, the RNAP dissociates, allowing another round of
transcription to start (YARNELL and ROBERTS, 1999; BURGUESS and ANTHONY, 2001;
SKORDALAKES and BERGER, 2003; MURAKAMI and DARST, 2003; KAPANIDIS et al., 2006; COOK and
DEHASET, 2007; MA et al., 2016).

So far, the literature has shown that promoter sequences can be distinguished from other genomic sequences. However, the consensus regions, one of the features used by the RNAP to identify promoters, are not absolutely conserved in terms of sequence. The next section explores how physical features can be assessed to classify and identify promoters, enhancing the classification performed by in silico tools.

3.2 PHYSICAL FEATURES IN PROMOTER SEQUENCES 

As discussed in the previous section, several critical factors distinguish promoters from coding regions, and these factors aid RNAP targeting. A set of relevant physical features of the promoter region is explored in detail throughout this section, providing the theoretical foundation for this work. Computationally, identifying and classifying promoters can be a hard task, partly because some bacterial promoters overlap the coding region (RANGANNAN and BANSAL, 2007). This demands a detailed analysis both upstream and downstream of the TSS. Computational analysis addresses this challenge by scanning the DNA molecule and extracting more than a single physical feature and/or a given nucleotide sequence (RYASIK et al., 2018).

Physical features of the promoter region can distinguish it from non-promoter regions. The features explored here are: stability, curvature, bendability, entropy, enthalpy and nucleotide composition. The levels at which these features appear make promoter regions unique when compared to other regions. It is worth mentioning that, according to the literature, sequence analysis alone sometimes shows no conservation in the promoter region, although some functional properties remain conserved. These traits are believed to make in silico promoter identification possible, since a biological function, recognition by the RNAP, is linked to the existence of these features (KANHERE and BANSAL, 2005).

3.2.1 BASE PAIR CONSERVATION 

Taking the TSS as the reference point after which transcription produces mRNA, some patterns can be found in the presence of certain base pairs, in addition to the energetic viability discussed in section 3.2.2 and the consensus explored in section 3.1.1. KOZOBAY-AVRAHAM et al. (2008) indicated that AT nucleotides are more common in intergenic regions, where promoters are found. The authors also concluded that the beginning of a gene is very rich in GC content, and that this GC content starts to decrease at the very end of the gene, where the coding region gives way to intergenic sequences such as promoters and terminators.

The identification of patterns in base pair presence in specific DNA segments does not stop here. Figure 3.6 shows the profile of the nitrogenous bases found around the TSS. The GC content found in the E. coli examples is 42.32% upstream of the TSS and 50.79% downstream of it. This difference in GC levels is attributed to the promoter region being constantly opened by the RNAP. Opening the DNA duplex in the promoter region has an energetic cost that depends on the number of hydrogen bonds; it therefore makes sense, physically and biologically, that the region to be opened contains base pairs that are easier to separate.

 
Figure 3.6: Presence of GC content in E. coli promoter sequences (Adapted from 
MEYSMAN et al., 2014) 
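
The upstream/downstream comparison described above can be reproduced for any sequence with a known TSS position. The following is only a minimal sketch: the function names gc_content and gc_around_tss, as well as the toy sequence, are hypothetical and are not taken from the dataset or tools used in this work.

```python
def gc_content(seq: str) -> float:
    """Return the percentage of G and C bases in seq (0.0 for an empty string)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return 100.0 * sum(base in "GC" for base in seq) / len(seq)


def gc_around_tss(seq: str, tss_index: int) -> tuple[float, float]:
    """Split seq at the TSS and return (upstream GC %, downstream GC %).

    tss_index is the 0-based position of the TSS base. Counting the TSS
    itself as part of the downstream region is an assumption of this sketch.
    """
    return gc_content(seq[:tss_index]), gc_content(seq[tss_index:])


# Toy example (hypothetical sequence): AT-rich upstream, GC-rich downstream.
toy = "TATAATATTA" + "GCGCATGCCG"
upstream, downstream = gc_around_tss(toy, tss_index=10)
print(f"upstream GC = {upstream:.1f}%, downstream GC = {downstream:.1f}%")
```

Applied to a set of aligned promoters, averaging the two percentages over all sequences would give figures comparable to the upstream and downstream values quoted in the text.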

 
Other studies have focused on the presence of certain bases along the promoter region. In one of these, Barrios, Valderrama and Morett (1999) showed that bacterial promoters recognized by RNAP σ54 have their consensus elements at positions -12 and -24, unlike sequences recognized by σ70, whose consensus elements occur at positions -10 and -35 upstream of the TSS. The authors analyzed 84 σ54-dependent promoters and concluded that in a significant number of them the TSS is a purine, with mRNA transcription starting precisely 12 nucleotides away from the conserved element. Figure 3.7 shows the distribution of this data for the first nucleotide upstream of the TSS (BARRIOS, VALDERRAMA and MORETT, 1999).

 
 

Figure 3.7: Nucleotides after -12 consensus in σ54 promoters (Adapted from BARRIOS, 
VALDERRAMA and MORETT, 1999). 

 
Another study was conducted by Burr et al. (2000) on E. coli promoters of genes transcribed with σ70. Evidence of promoter activity was found at the -14/-15 upstream positions. It is worth mentioning that, according to the literature, the two motifs that help RNAP recognition for this σ factor are located at -10 and -35 upstream of the TSS. The authors found a hierarchy in promoter activity depending on the dinucleotide at these positions, and reported that the presence of TG aids the transition from the closed to the open complex. When the analysis is directed to the next two nucleotides, at positions -16/-17, TG dinucleotides were found again. Figure 3.8 shows a histogram of the dinucleotides at the two positions mentioned, -14/-15 and -16/-17. Using 300 E. coli promoters, the authors concluded that right next to the consensus motif there is another identifier that enhances RNAP recognition (BURR et al., 2000).

 
Figure 3.8: Dinucleotide presence in positions -14/-15 and -16/-17 upstream of the TSS (BURR et al., 2000)

 
 

Another analysis of dinucleotide presence in E. coli was carried out by Ezer, Zabet and Adryan (2014); one of its goals was to examine the evolutionary aspect of E. coli promoters. The authors identified high conservation at the locations where the RNAP binds to the DNA, called binding sites, by analyzing the base pair insertion/deletion rate in different regions of the promoter sequence. The evolutionary traits were examined in the following groups: i) between the TSS and the first binding site; ii) inside the binding sites; iii) between binding sites separated by short spacers; iv) between binding sites separated by more than 100 bp; and v) between the last binding site and the terminator sequence (EZER, ZABET and ADRYAN, 2014).

The results of this study are shown in Figure 3.9, in which the evolutionary conservation of the spots where the RNAP binds to the DNA can be seen. This evolutionary conservation is attributed to the fundamental role RNA plays in all living organisms. The same holds for the enzyme responsible for RNA synthesis, which suggests that the architecture of the RNAP does not differ greatly among Archaea, Bacteria and Eukarya (WERNER and GROHMANN, 2011; EZER, ZABET and ADRYAN, 2014).

 
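Figure 3.9: Analysis of the insertion/deletion rate of dinucleotides grouped in different segments through the promoter sequence (Adapted from EZER, ZABET and ADRYAN, 2014). Vertical axis: insertion-deletion rate (0 to 0.01); categories: BS, <1000bp spacers, TSS to 1st BS, after last BS, >1000bp spacers.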

 
3.2.2 STABILITY 

The stability of the DNA molecule depends directly on its nucleotide sequence. Base pairs between adenine and thymine (A and T) are held together by two hydrogen bonds, whereas base pairs between guanine and cytosine (G and C) are held together by three, as shown in Figure 3.10. The transcription process involves opening the DNA duplex into an open complex; hence, energy must be spent to separate the base pairs so that they can be transcribed into mRNA. For this process to be energetically viable, it is reasonable that the DNA segment to be opened is the one with the lowest stability between its base pairs, that is, with the weakest bonds (KANHERE and BANSAL, 2005; RAMPRAKASH and SCHWARZ, 2007; DE ÁVILA E SILVA and ECHEVERRIGARAY, 2011).

 
Figure 3.10: Hydrogen bonds on nucleic base pairs (WATSON and CRICK, 1953) 

 
Figure 3.11 shows the energetic profile of a DNA segment covering the upstream and downstream regions. The profile corresponds to 611 E. coli promoters, with the TSS located at position 0. The image shows three stability peaks, matching the consensus regions at -10, -35 and -50, the spots where the RNAP identifies promoters (RANGANNAN and BANSAL, 2007).


 

Figure 3.11: Stability profile in E. coli’s promoter region (RANGANNAN and 
BANSAL, 2007) 

Table 3.2 shows the stability values for each nucleotide duplex present in DNA sequences. These values make it possible to highlight the stability of the promoter region, since a whole genome can be scanned through its stability values in order to locate promoters (SANTALUCIA and HICKS, 2004).

Table 3.2: Stability values for DNA base pairs (SANTALUCIA and HICKS, 2004)

Nucleotide duplex   Stability value (kcal/mol-bp)   Nucleotide duplex   Stability value (kcal/mol-bp)
AA                  -1.00                           TG                  -1.44
AT                  -0.88                           GT                  -1.44
TA                  -0.58                           TC                  -1.28
AG                  -1.30                           CT                  -1.28
GA                  -1.30                           CC                  -1.84
TT                  -1.00                           CG                  -2.17
AC                  -1.45                           GC                  -2.24
CA                  -1.45                           GG                  -1.84
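
To illustrate how a stability profile such as the one in Figure 3.11 can be derived from Table 3.2, the sketch below converts a sequence into its overlapping dinucleotide values and smooths them with a sliding window. It is an illustration under stated assumptions only: the helper names, the window size and the toy sequence are hypothetical, and the smoothing scheme is not necessarily the one used by the cited authors.

```python
# Dinucleotide stability values copied from Table 3.2 (kcal/mol-bp).
STABILITY = {
    "AA": -1.00, "AT": -0.88, "TA": -0.58, "AG": -1.30,
    "GA": -1.30, "TT": -1.00, "AC": -1.45, "CA": -1.45,
    "TG": -1.44, "GT": -1.44, "TC": -1.28, "CT": -1.28,
    "CC": -1.84, "CG": -2.17, "GC": -2.24, "GG": -1.84,
}


def dinucleotide_values(seq, table):
    """Map every overlapping dinucleotide of seq to its value in table."""
    seq = seq.upper()
    return [table[seq[i:i + 2]] for i in range(len(seq) - 1)]


def sliding_mean(values, window):
    """Average values over a sliding window (hypothetical smoothing choice)."""
    if window < 1 or window > len(values):
        raise ValueError("window must be between 1 and len(values)")
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]


# Toy sequence (hypothetical): AT-rich start followed by a GC-rich end.
steps = dinucleotide_values("TATAATGCGC", STABILITY)
profile = sliding_mean(steps, window=3)
print([round(v, 2) for v in profile])
```

In this toy case the AT-rich start yields less negative (less stable) window averages than the GC-rich end, which is the kind of contrast a promoter scan would look for.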

  
 

3.2.3 STRESS INDUCED DNA DUPLEX DESTABILIZATION (SIDD) 

Both the initiation and the regulation of transcription involve untwisting the DNA duplex, and this strand separation must be controlled. In simple terms, according to Benham, Wang and Noordewier (2006), this feature involves a relaxation of the bonds between the base pairs which, where separation is frequent, decreases the energy needed to keep opening the DNA strand. This process occurs through superhelical stresses imposed on the duplex. This feature does not depend on the primary structure of the DNA strand nor on the stability values. In this process, there is a balance between the energy spent in separating the strands of the specific base pairs to form an open complex and the energy gained from the partial relaxation of the superhelical stress. This gain provides the energy that drives the SIDD process and keeps the DNA strand open while the mRNA is being written. In the E. coli genome, promoter sequences are known to present a higher SIDD level. Some of the non-coding regions containing promoters are unstable, while coding regions are more stable under the stress imposed by negative superhelicity. Variations in the superhelical level of a promoter can have several effects on the final product coded by the gene, one of which is the variation in SIDD (WANG, BENHAM and NOORDEWIER, 2004; WANG and BENHAM, 2006; DE AVILA E SILVA and ECHEVERRIGARAY, 2012).

Figure 3.12 shows the destabilization energy G(x) needed for the base pairs of the DNA strand to remain open. Spots where the destabilization level is high have low G(x) values. As shown in Figure 3.12, there are four sets of sequences: i) promoter sequences located immediately upstream of the TSS; ii) coding sequences starting at the TSS and extending up to 1001 base pairs in the direction of mRNA transcription; iii) intergenic regions that contain no promoters; and iv) a random set of sequences. The authors (WANG, BENHAM and NOORDEWIER) found that conserved SIDD sites tend to avoid coding regions, whereas in intergenic regions well-documented promoter sequences show a higher SIDD level.


 

Figure 3.12: SIDD profiles (Adapted from WANG and BENHAM, 2006)

 
3.2.4 DNA CURVATURE AND BENDABILITY 

The term DNA curvature refers to the ability of the DNA to bend without the aid of external forces. Variations in the linear trajectory of the DNA give the strand its curved shape. Curvature levels can therefore serve as a feature to distinguish positions across the whole genome. Perez-Martin, Rojo and Lorenzo (1994) described DNA curvature as a feature that helps transcription initiation from the moment the RNAP binds to the DNA. Accordingly, a difference can be found when comparing the curvature values of promoter regions and coding sequences (OLIVARES-ZAVALETA, JÁUREGUI and MERINO, 2006; KOZOBAY-AVRAHAM et al., 2008).

Figure 3.13 shows curvature profiles of E. coli DNA. The gray region represents the curvature values in regulatory regions and the activation moment of specific genes in E. coli. Compared to other DNA regions, the regulatory elements are distinct, supporting the idea that the RNAP can use curvature to initiate transcription.


 

Figure 3.13: DNA curvature profiles from E. coli, comparing the DNA segments with the activation moment of some specific genes (gray region) (OLIVARES-ZAVALETA, JÁUREGUI and MERINO, 2006).

Bendability is another physical feature of the DNA molecule. It takes distinct values in promoter regions because of the bending the DNA undergoes when binding to the RNAP. As shown in Figure 3.14, the stiffness of promoter regions differs from that of other genomic sequences (MEYSMAN et al., 2014).


 

Figure 3.14: Bendability profile of the promoter region compared to other genomic regions (Adapted from MEYSMAN et al., 2014)

 
3.2.5 BASE PAIR STACKING 

Stacking interaction between two adjacent base pairs is an essential force component responsible for DNA stabilization and gene regulation (ZHANG et al., 2015). SATTIN et al. (2004) indicated how this feature works: owing to the twisted structure of the DNA strand, the force needed to unstack GC base pairs is greater than the force needed to unstack AT base pairs. This is attributed to the number of hydrogen bonds found in each pairing (KANHERE and BANSAL, 2005). ZHANG et al. (2015) also quantified the binding strength of GC and AT base pairs, showing that unstacking the former costs 20.0 piconewton (pN) and the latter 14.0 pN. This indicates a weaker binding between the base pairs most often found in promoter sequences, which are AT (MEYSMAN et al., 2014).

One major component for understanding the molecular details of gene expression is the thermodynamic stability of double-stranded DNA. This stability is determined by the interactions between the nucleic acid base pairs. DNA is known for its helical structure, which is stabilized by the hydrogen bonds together with the aforementioned stacking interactions (ZHANG et al., 2015; HASE and ZACHARIAS, 2016). Figure 3.15 displays the mean base stacking energy in E. coli sequences. Table 3.3 shows the stacking value for each pair of nucleosides (ORNSTEIN et al., 1978): the TA step has a stacking value of -3.82, the highest in Table 3.3, while the GC step has the lowest value, -14.59, in agreement with the data presented by ZHANG et al. (2015).

 
Figure 3.15: Base stacking energy profile (MEYSMAN et al., 2014) 

 
Table 3.3: DNA nucleoside stacking values (ORNSTEIN et al., 1978)

Nucleotide duplex   Stacking value (kcal/mol-bp)   Nucleotide duplex   Stacking value (kcal/mol-bp)
AA                  -5.37                          TG                  -6.57
AT                  -6.57                          GT                  -10.51
TA                  -3.82                          TC                  -9.81
AG                  -6.78                          CT                  -6.78
GA                  -9.81                          CC                  -8.26
TT                  -5.37                          CG                  -9.69
AC                  -10.51                         GC                  -14.59
CA                  -6.57                          GG                  -8.26

  
3.2.6 ENTROPY, IRREVERSIBLE PROCESS AND ENTROPY VARIATION 

WITHIN THE DNA MOLECULE 

According to the second law of thermodynamics, whenever energy undergoes any process that transforms it, part of it becomes unavailable for use. Consequently, all naturally occurring phenomena are irreversible processes that happen in one direction only. Figure 3.16 illustrates this: the ice cube will eventually melt, and the ice will never give heat to the water, so it is impossible for the water to become hotter while the ice becomes colder (YOUNG and FREEDMAN, 2016).

 
Figure 3.16: Irreversible process (adapted from YOUNG and FREEDMAN, 2016)

 
In nature, events can only happen in a single sequence, which is determined by what physicists call the arrow of time. This is what makes systems such as the one in Figure 3.16 one-directional thermodynamic systems in which, according to thermodynamics, the ice will not give heat to the water (GASPAR, 2000; YOUNG and FREEDMAN, 2016).

The role played by entropy in this scenario is explored next. The concept presented here must be kept in mind: in every physical system, heat is exchanged between the components in a single direction, and the same description applies to the environment found in a cell (DAVIES, RIEPER and TUSZYNSKI, 2013).

The main principle of the second law of thermodynamics states that no thermal system is able to fully convert its energy into work. Every thermal process produces an amount of unused, dissipated energy. For irreversible processes, this energetic variation can be measured through what physicists define as the entropy variation (ΔS). Entropy is thus treated as a thermodynamic quantity that measures the degree of irreversibility of a system and, since by the second law heat never fully becomes work, the entropy of an isolated system tends to increase. In other words, entropy can be used to describe the portion of energy that remains unused (HALLIDAY et al., 2006).

Any natural phenomenon tends to have its entropy raised. Among the innumerable configurations a system can assume, the least organized is always the most probable and the most natural to occur. This point of view gives rise to a notion that physicists consider inadequate: that entropy simply measures the level of disorder of the universe. The idea needs further explanation. Given the single direction that processes follow, they always tend to move from a more organized to a less organized state; still, it is simplistic to classify entropy only as a measure of disorder without considering the energetic background explored above (HALLIDAY et al., 2006; DAVIES et al., 2013).

Figure 3.17 depicts the rising level of disorder in a system: as the system evolves, a less organized state is always more probable, which is characterized by an increase in entropy. The figure shows the possible distributions a gas can reach in its environment. The gas is not confined to a particular place; it spreads evenly through the whole system. Thus, the entropy in situation D is higher than in situations A, B and C, and the state shown in situation D becomes more probable as time passes (GASPAR, 2000; HALLIDAY et al., 2006; DE LIMA, 2007; YOUNG and FREEDMAN, 2016).


 

Figure 3.17: Portrayal of the possible distribution of a gas in a closed environment 
(Adapted from MAIA and BIANCHI, 2007) 
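
The statement that the evenly spread state is the most probable one can be checked by counting configurations. The sketch below is a toy model and not the exact arrangement drawn in Figure 3.17: it assumes a box split into two halves and N distinguishable particles, and it uses the standard statistical relation S = k_B ln W, which is introduced here only for illustration. All function names are hypothetical.

```python
import math

K_B = 1.380649e-23  # Boltzmann constant in J/K


def multiplicity(n_total: int, n_left: int) -> int:
    """Number of ways to place n_left of n_total particles in the left half."""
    return math.comb(n_total, n_left)


def boltzmann_entropy(n_total: int, n_left: int) -> float:
    """Entropy S = k_B * ln(W) of the macrostate with n_left particles on the left."""
    return K_B * math.log(multiplicity(n_total, n_left))


N = 4  # toy number of gas particles (assumption of this sketch)
for n_left in range(N + 1):
    w = multiplicity(N, n_left)
    s = boltzmann_entropy(N, n_left)
    print(f"{n_left} left / {N - n_left} right: W = {w}, S = {s:.3e} J/K")
```

With four particles, the even split has the largest number of configurations and therefore the largest entropy, while the state with all particles on one side has a single configuration and zero entropy in this model.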

Living beings, or living systems, perform a large number of complex processes, all synchronized to maintain the system's biological functions. The information repository that stores the steps required to perform these processes is the DNA. The DNA has its own alphabet of four letters, the nitrogenous bases, whose arrangement dictates the actions that keep the organism alive. Like other systems, the DNA requires input information (e.g. a chemical gradient), which is processed to generate outputs in the system (e.g. a protein, DNA). To keep functioning, living organisms need a stable energy supply, which is converted into useful work and keeps the system at a constant physiological temperature. Therefore, the energy production of a system is essential for its survival (DAVIES et al., 2013; MULLIGAN et al., 2015).

A cell exchanges matter and heat with its surroundings, which defines it as an open system. In thermodynamic terms, a cell is similar to a machine, as explored by Davies et al. (2013). A cell must obey the laws of physics, and its ΔS can be measured in four distinct contexts: i) chemical bonds leading to cell aggregation; ii) mass transport into and out of the cell; iii) heat generation due to cellular metabolism; and iv) information stored in the genetic code. Thus, the entropy level of DNA sequences will always be the highest possible (HERZEL et al., 1994; DAVIES et al., 2013). Table 3.4 shows the entropy values for Watson-Crick base pairs in the 5'-3' direction (WATSON and CRICK, 1953). Santalucia and Hicks (2004) presented the thermodynamic values for base pairs; the parameters used to obtain these numbers were derived from linear regressions over 108 example sequences.


 

Table 3.4: Entropy values for DNA base pairs (Adapted from SANTALUCIA and HICKS, 2004)

Nucleotide duplex   Entropy value (cal/mol-K)   Nucleotide duplex   Entropy value (cal/mol-K)
AA                  -21.3                       TG                  -22.4
AT                  -20.4                       GT                  -22.4
TA                  -21.3                       TC                  -22.4
AG                  -21.0                       CT                  -22.2
GA                  -21.0                       CG                  -27.2
TT                  -21.3                       GC                  -24.4
AC                  -22.7                       CC                  -19.9
CA                  -22.7                       GG                  -19.0

 
It is therefore evident that entropy is a physical quantity present in living organisms, which makes it possible to calculate the entropy values of the DNA of any living organism.

3.2.7 ENTHALPY 

Enthalpy is defined as the heat content of a system under little or no pressure variation. Mathematically, enthalpy can be defined as H = U + Pv, where U is the internal energy, P is the pressure and v is the volume of the system. As U, P and v are state functions, the enthalpy H is also a state function, so a variation in enthalpy between a start and an end point takes the whole system into another state. Thus, in a system at constant pressure, the enthalpy variation corresponds to the amount of heat added to or removed from the system. Whether heat is added or removed follows from the comparison between products and reactants: i) in an endothermic process heat is absorbed, and the enthalpy of the products is higher than that of the reactants; ii) in an exothermic process heat is released, making the enthalpy variation negative (ATKINS and DE PAULA, 2012).
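
The link between enthalpy and heat at constant pressure follows from the definition in a few steps. The derivation below is a standard textbook sketch, assuming that only pressure-volume work is performed; the symbol q for heat is introduced here and does not appear in the original text.

```latex
\begin{aligned}
H &= U + Pv \\
dH &= dU + P\,dv + v\,dP \\
dU &= \delta q - P\,dv \quad \text{(first law, pressure-volume work only)} \\
dH &= \delta q + v\,dP \\
\Delta H &= q_P \quad \text{(constant pressure, } dP = 0\text{)}
\end{aligned}
```

Under these assumptions, a positive ΔH corresponds to heat absorbed by the system (endothermic) and a negative ΔH to heat released (exothermic).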


 

Studies have found that the enthalpy level of promoter sequences tends to differ from that of other genomic sequences. According to the literature, a GC base pair, which is less frequent in promoters, needs an enthalpy of 10.75 ± 1.43 kcal/mol-bp for its stabilization, while an AT base pair needs 7.4 ± 0.7 kcal/mol-bp. At first sight, it would make sense for promoter sequences to have a different enthalpy value because of their low GC content. However, other studies have shown that the forces involved in the binding between the protein (RNAP) and the DNA strand are not simple: they need to be weak enough to allow the protein to scan the DNA easily and, at the same time, strong enough to sustain longer-lived complexes (PRIVALOV and CRANE-ROBINSON, 2018).

As with the entropy values, the enthalpy values of DNA base pairs were calculated by Santalucia and Hicks (2004). The values are shown in Table 3.5.

Table 3.5: Enthalpy values of DNA base pairs (SANTALUCIA and HICKS, 2004)

Nucleotide duplex   Enthalpy value (kcal/mol-bp)   Nucleotide duplex   Enthalpy value (kcal/mol-bp)
AA                  -7.6                           TG                  -8.4
AT                  -7.2                           GT                  -8.4
TA                  -7.2                           TC                  -7.8
AG                  -8.2                           CT                  -7.8
GA                  -8.2                           CC                  -8.0
TT                  -7.6                           CG                  -10.6
AC                  -8.5                           GC                  -10.6
CA                  -8.5                           GG                  -8.0
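
The enthalpy and entropy values can be combined into a stability (free energy) value through the standard Gibbs relation ΔG = ΔH - TΔS. The sketch below is an illustration under two assumptions that are not stated in the text: the entropy values of Table 3.4 are taken to be in cal/(mol·K), as is usual for nearest-neighbor parameters, and the reference temperature is taken as 37 °C. The function name free_energy is hypothetical.

```python
T_KELVIN = 310.15  # 37 degrees Celsius, assumed reference temperature

# (enthalpy in kcal/mol, entropy in cal/(mol*K)) for three duplexes,
# copied from Tables 3.5 and 3.4 respectively.
NEAREST_NEIGHBOR = {
    "AA": (-7.6, -21.3),
    "AT": (-7.2, -20.4),
    "CG": (-10.6, -27.2),
}


def free_energy(delta_h: float, delta_s: float, temperature: float = T_KELVIN) -> float:
    """Gibbs free energy in kcal/mol from dH (kcal/mol) and dS (cal/(mol*K))."""
    # Divide by 1000 to convert the entropy term from cal to kcal.
    return delta_h - temperature * delta_s / 1000.0


for duplex, (dh, ds) in NEAREST_NEIGHBOR.items():
    print(f"{duplex}: dG = {free_energy(dh, ds):.2f} kcal/mol")
```

For these three duplexes the computed values fall within about 0.01 kcal/mol of the stability values listed for AA, AT and CG in Table 3.2, which suggests that the three tables are mutually consistent for these entries.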

 
3.2.8 ENTROPIC AND ENTHALPIC CONTRIBUTIONS TO DNA STABILIZATION

Section 3.2.1 of this dissertation described the difference in AT and GC content of regulatory sequences. The prevalence of AT is also linked to the thermodynamic quantities of the DNA double strand. The enthalpic contribution of AT base pairs is somewhat larger than that of GC base pairs. Earlier studies (MARMUR and DOTY, 1962) held that the thermostability of the DNA double helix increases with GC content because of the extra hydrogen bond of the GC pair, shown in Figure 3.10 of section 3.2.2. According to the literature, early studies showed that, owing to their GC content, promoters would have enthalpy and entropy values different from those of other genomic sequences. These studies reported that the average enthalpic contribution to the stabilization of an AT base pair is 7.4 ± 0.7 kJ/mol-bp and that of a GC base pair is 10.75 ± 1.43 kJ/mol-bp. Regarding the entropy needed for base pair stabilization, AT requires 20.55 ± 2.39 kJ/mol-bp and GC 27.24 ± 3.58 kJ/mol-bp. Promoters are known to be poor in GC content, and GC affects DNA stabilization through both the enthalpic and the entropic contributions. The high AT levels of promoters are associated with water that binds to AT; this more complex system increases the entropy (PRIVALOV and CRANE-ROBINSON, 2018).

Entropy and enthalpy are both thermodynamic features of the DNA, and the two are closely related. Nevertheless, they behave quite differently. Entropy refers to the system as a whole, as stated in section 3.2.6; here, the system composed of water, enzyme and DNA has high entropy in its interactions. Water is a major ingredient that permeates the hydrogen bonds and affects the final entropy measurements; in other words, the system is disorganized because of the number of its components. As the RNAP moves along the strand and water binds to the DNA, the final product is a more organized system. Enthalpy, on the other hand, does not refer to all the components of this system; it is affected only by the hydrogen bonds between the paired nucleotides. This indicates that entropy is a measurement of the system, while enthalpy is used to examine particular components (ZU, ZHI and LENG, 2012; MORGUNOVA et al., 2018; PRIVALOV and CRANE-ROBINSON, 2018)

3.3 ARTIFICIAL INTELLIGENCE AND MACHINE LEARNING 

Recent advances in technology have given science an exponential growth in the availability of genomic data. With the data available, the need arises for ways to work with such a large volume. Computer science is useful here, as it provides reliable tools for identifying, recording and charting all the sequenced genomic data. With increasingly trustworthy tools, bioinformatics plays a key role in unscrambling and deciphering genomic, transcriptomic and proteomic datasets. Bioinformatics can therefore be regarded as a necessary field of study integrated with the biological sciences (RANGANNAN and BANSAL, 2007).

Since the end of the 19th century, with the industrial revolution, science has developed and machines have been used to replace manual work. In the 20th century, with the rise of computers and information technology, machines that had previously only replaced human physical effort started to do the work of the human brain. This led to the emergence of artificial intelligence (AI). In the 21st century, the advent of cloud and mobile computing has changed people's lifestyles, signalling a new era for computer science. AI is defined as a variety of human behaviors, such as perception, memory, emotion, judgment, reasoning, recognition, comprehension, design, thinking and creation, all of which can be carried out artificially by a machine, system or network (LI and DU, 2017).

AI can thus be defined as machines designed to perform automated activities that require intelligence, such as decision making, learning and problem solving. Many AI applications try to simulate human intelligence by reproducing its neural associations in an algorithm. The scope of AI is wide; one of its functions is to identify patterns in big datasets, a common problem in the biological sciences (RUSSELL and NORVIG, 2003; CHRISTIAN, 2013).

Within the AI field, agents perform an important set of roles. One of the purposes of AI is to design agent software, which implements functions that map percepts from its environment into actions. The agent runs on a computing device called the platform. There are agents with different purposes, ranging from model-based, goal-based and utility-based agents to agents designed to learn. Learning allows an agent to operate in unknown surroundings and, over time, become proficient in dealing with its environment (RUSSELL and NORVIG, 2003).

Learning, as shown in this section, can be used as an AI technique to find patterns in a large amount of fuzzy data, a task that would take a human brain a tremendous amount of time compared to a computer. The next section focuses on the specific AI technique used in this work: clustering.

 
 

3.3.1 CLUSTERING AND THE K MEANS ALGORITHM 

Clustering is an AI technique that can be employed to solve problems of the most diverse nature, such as image recognition, biological applications, grouping documents by similar topics, combining climate data and geographic analysis. The recent boom in the availability of information gives the scientific community a huge amount of data which, when divided into its respective groups, can provide scientists with new interpretations and points of view that were hidden until then.

The goal of data clustering is well defined: to find patterns that facilitate the natural grouping of a dataset. According to the Merriam-Webster dictionary, cluster analysis is a technique for classifying data that seeks to determine whether the individuals of a population belong to different groups. Clustering is also able to recognize features in the data that are not perceived at first sight. Clustering techniques deal with different ways of grouping data in an n-dimensional space, so that every element in a group has some link to the other members of its group. Clustering techniques are regarded as discovery tools, which give the user a better understanding of the data and its structure. As demonstrated in Figure 3.18 (a), the input data does not show any apparent similarity. In the next step, Figure 3.18 (b), the data is arranged in distinct groups whose members share common resemblances (DUBES and JAIN, 1976; MERRIAM-WEBSTER DICTIONARY, 2016; JAIN, 2010).

 
Figure 3.18: Data pre(a) and post(b) clustering (JAIN, 2010) 

 
 

Other studies have tried to define the main purposes of clustering. Jain (2010) proposed three main goals: i) underlying structure: to gain insight into the data, generate hypotheses, detect anomalies and identify salient features; ii) natural classification: to identify the degree of similarity between individuals; iii) compression: to organize the data and summarize it according to its specific clusters.

Connel and Jain (2002) used clustering to identify subclasses in online handwriting. In this study, different users wrote the same character in different ways. According to the literature, when the variance of an element is high, the efficiency of the clustering rises. This is common in the real world, where the same data item is not always presented identically and can appear in different forms. As shown in Figure 3.19, members of the same cluster can be presented in distinct ways. It is up to the clustering algorithm to find out that the same data is being displayed differently. The better the algorithm, the higher its recognition precision (CONNEL and JAIN, 2002; JAIN, 2010).

 
Figure 3.19: Different recognition patterns in the same cluster (CONNEL and JAIN, 
2002) 

 
Clustering algorithms can be divided into two groups: hierarchical and partitional. Hierarchical algorithms recursively find nested clusters in one of two ways: i) agglomeratively, where the algorithm starts with each data point in its own cluster and successively merges the most similar pair of clusters until a cluster hierarchy is formed; ii) divisively, where the algorithm starts with all data points in one cluster and recursively divides each cluster into smaller clusters.


 

Partitional algorithms, on the other hand, find all clusters simultaneously. Hierarchical algorithms work on an n x n matrix, where n is the number of objects to be clustered, whereas partitional algorithms make use of an n x d matrix, in which the n objects are embedded in a d-dimensional space (JAIN, 2010).

The most common partitional clustering algorithm is K-means, which was proposed more than 50 years ago and is still widely used due to its simplicity, efficiency and empirical success. Given a dataset of n points in a d-dimensional space, the aim is to divide it into a set of k clusters. K-means partitions the data so that the squared error between the arithmetic mean of each cluster (the cluster center) and the points inside that cluster is minimized. The algorithm starts with an initial partition into k clusters and assigns patterns to clusters based on the cluster centers; ideally, every element in a cluster should be as close as possible to the center of its cluster. Because the squared error always decreases as the number of clusters rises, K-means can only minimize the squared error for a fixed number of clusters. The steps performed by K-means are listed in Table 3.6 (JAIN and DUBES, 1998; JAIN, 2010).

Table 3.6: Description of the steps performed by the K-means algorithm (Adapted from 
JAIN and DUBES, 1998) 

Step  Description

1     Select an initial partition with k clusters.
      Repeat steps 2 and 3 until the error is minimal.

2     Generate a new partition by assigning each pattern to its closest
      cluster center.

3     Calculate new cluster centers.
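
The steps in Table 3.6 can be sketched in a few lines of Python. This is only a minimal illustration, not the implementation used in this dissertation: the function names, the choice of starting from explicitly given initial centers (rather than an initial partition) and the toy points are all assumptions made for the example.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_point(points):
    """Arithmetic mean (center) of a non-empty list of points."""
    return [sum(coords) / len(points) for coords in zip(*points)]


def k_means(points, initial_centers, max_iter=100):
    """Hypothetical helper: returns (labels, centers, squared_error)."""
    centers = [list(c) for c in initial_centers]   # step 1: initial centers
    k = len(centers)
    labels = None
    for _ in range(max_iter):
        # Step 2: assign each point to its closest cluster center.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centers[c]))
            for p in points
        ]
        if new_labels == labels:                   # partition is stable
            break
        labels = new_labels
        # Step 3: recompute each center as the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:                            # keep old center if empty
                centers[c] = mean_point(members)
    sse = sum(squared_distance(p, centers[lab]) for p, lab in zip(points, labels))
    return labels, centers, sse


# Toy two-dimensional data with two visible groups (illustrative only).
points = [[0.0, 0.1], [0.1, 0.0], [0.2, 0.1],
          [0.9, 1.0], [1.0, 0.9], [0.8, 1.0]]
labels, centers, sse = k_means(points, [points[0], points[3]])
print(labels, round(sse, 4))
```

The loop stops as soon as the assignment no longer changes, which is the point at which the squared error can no longer decrease for the fixed k.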

 
Figure 3.20 illustrates K-means on a two-dimensional dataset with three clusters: (a) the input data; (b) three initially selected cluster centers; (c) and (d) intermediate iterations updating the cluster centers; and (e) the final clustering obtained by K-means (JAIN, 2010).


 

Figure 3.20: K-means illustrated (JAIN, 2010) 

 
Edgar (2010) compared grouping tools in order to find a good way to search genetic sequences. The study evaluated UBLAST, a new alternative to BLAST, used alongside a clustering algorithm, UCLUST. For a given query, the database sequences are ordered by the number of words they share with the query. This takes advantage of the fact that similar sequences tend to have short words in common: if a match exists in the database, it is highly probable that it is found among the first candidates, and this probability drops quickly as the number of failed match attempts increases. This leads to a faster search, with fewer candidate matches being analyzed (EDGAR, 2004; EDGAR, 2010).

The comparison between the tools showed that UCLUST produced better-quality clusters: the similarity obtained by CD-HIT was lower in every tested case. Table 3.7 compares the two clustering tools, UCLUST and CD-HIT. UBLAST and UCLUST introduce a new paradigm for robustly grouping biological data, and their use decreases the resources needed to classify large-scale sequence sets (EDGAR, 2010).

 
 

Table 3.7: Comparison between the clustering tools UCLUST and CD-HIT (Adapted from
EDGAR, 2010).

Algorithm  Identity (%)  Size  Similarity (%)  Time (min)  Memory (Mb)

UCLUST     85            536   91.5            1.44        36
CD-HIT     95            343   88.9            570         349
UCLUST     90            230   96              1.8         40
CD-HIT     90            175   92.1            62          349
UCLUST     95            73    97.7            134         55
CD-HIT     95            68    95.9            61.15       349
UCLUST     99            11    99.5            789         165
CD-HIT     99            15    99.1            123         411

Identity is the clustering limit; Size, the average size of each cluster (higher is better); Similarity is the 
average identity between a cluster member and its representative sequence (higher is better); Time is the 
CPU time; Memory refers to the RAM amount used by the software. 

 
 

4 MATERIALS AND METHODS 
 This section describes how this dissertation was conducted, covering every tool and step used to produce the results.

 
4.1 PROMOTER DATA ACQUISITION

The data used in this study consist of promoter regions from six different groups, defined by the σ factor through which RNAP recognizes them. All sequences were retrieved from the biological database RegulonDB (GAMMA-CASTRO et al., 2015). Table 4.1 shows how many promoter examples are available for each of the six sigma factors involved in bacterial gene expression. The examples are DNA sequences in the 5’-3’ direction with 81 nucleotides, the same way RegulonDB displays its promoter sequences.

Table 4.1: Distribution of the E. coli promoter sequences used in this research.

σ factor Number of sequences 

24  508 

28  133 

32  299 

38  157 

54  83 

70 1869 

  
4.2 DATA TRANSFORMATION 

 The examples in Table 4.1 were then loaded and converted into numerical values corresponding to the different DNA physical features: entropy, enthalpy, base-pair stacking and stability. The conversion was performed by a Python script designed to process the examples of each σ group automatically. The script read a file with all the examples from the σ groups shown in Table 4.1 and transformed each 81-nucleotide sequence into its corresponding values. The values for entropy, enthalpy, stacking and stability are shown in section 3, under the sub-section of each physical feature.
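
A conversion of this kind could look like the sketch below. It is not the original script: the function name `to_profile` is hypothetical, and the sketch assumes that every pair of adjacent nucleotides (overlapping dinucleotides) is looked up in the feature table, so that a sequence of length L yields L - 1 values. The stability values are the ones listed in Table 2 of Article 1.

```python
# Stability value of each dinucleotide duplex (values from Table 2, Article 1).
STABILITY = {
    "AA": -1.00, "AT": -0.88, "TA": -0.58, "AG": -1.30,
    "GA": -1.30, "TT": -1.00, "CC": -1.84, "GC": -2.27,
    "AC": -1.45, "CA": -1.45, "TG": -1.44, "GT": -1.44,
    "TC": -1.28, "CT": -1.28, "CG": -2.24, "GG": -1.84,
}


def to_profile(sequence, table):
    """Map each overlapping dinucleotide of `sequence` to its tabulated value.

    Assumption: a sliding window of two nucleotides with step 1.
    """
    seq = sequence.upper()
    return [table[seq[i:i + 2]] for i in range(len(seq) - 1)]


profile = to_profile("TATAAT", STABILITY)
print(profile)
```

The same function can be reused for entropy, enthalpy and base-pair stacking by passing a different lookup table.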


 

Tables 3.2, 3.3, 3.4 and 3.5 in section 3 show that the physical features are expressed on different scales. To remove this difference and bring all results onto the same scale, a normalization algorithm developed by the authors was used to transform the data into the interval 0-1. The normalization used the formula $ndata = \frac{x - min}{max - min}$ (JUSZCZAK et al., 2002), where ndata is the normalized value in the 0-1 range; x is the value being normalized; min is the lowest value found in the whole dataset; and max is the highest value found in the dataset.
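
As a minimal sketch of this 0-1 normalization (the function name is hypothetical, and returning 0.0 for a constant dataset is an assumption made here to avoid a division by zero):

```python
def min_max_normalize(values):
    """Rescale `values` to the 0-1 range with (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: a constant dataset is mapped to 0.0.
        return [0.0 for _ in values]
    return [(x - lo) / (hi - lo) for x in values]


# Three stability values taken from Table 2 of Article 1 (GC, AC, TA).
normalized = min_max_normalize([-2.27, -1.45, -0.58])
print(normalized)
```

The lowest value in the dataset maps to 0 and the highest to 1, so features measured on different scales become directly comparable.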

 
4.3 K-MEANS CLUSTERING 

 
 Once the data were ready for the promoter sequence characterization, further steps were needed to produce the discussed results. These steps are divided into two sections, each exploring a different use of the K-means algorithm.

 The first approach, presented in section 5.2, uses K=2, the minimum value K can assume in data clustering (BHOLOWALIA and KUMAR, 2014). In this section, the goal was to distinguish, in terms of physical aspects, promoters associated with housekeeping genes, recognized by RNAP σ70, from promoters of alternative genes, recognized by the RNAP σ24, σ28, σ32, σ38 and σ54 factors.

 The second use of K-means, described in Results section 5.3, sought to cluster the data. For this, the K value had to be determined, since there are more than two groups of promoters to be clustered. Previous work by Dal’Alba et al. (2018, unpublished data) clustered stability values and identified optimal K values (Table 4.2).

Table 4.2 – K in stability (DAL’ALBA et al., 2018) 

σ factor K value 

24 7 

28 3 

32 5 


 

38 3 

54 2 

70 3 

However, when these K values were used to cluster the other physical properties of the promoter sequences, some clusters contained very few sequences, which indicates that the K value was not optimal (JAIN, 2010; BHOLOWALIA and KUMAR, 2014). To produce clusters with higher purity, the K value was reset following the Elbow Method, a validation technique for determining the appropriate number of clusters in a given dataset. The method is based on the percentage of variance explained as a function of the number of clusters: the first clusters add a lot of information about the profile of the data, but at some point the gain from adding another cluster drops significantly. The procedure starts with K=2 and increases K by 1 at each step while monitoring the cost; the K at which the improvement starts dropping is taken as the true K for the dataset (BHOLOWALIA and KUMAR, 2014). Table 4.3 presents the recalculated K values.

Table 4.3 – New K values for entropy, enthalpy, base-pair stacking and stability values 

σ factor Entropy Enthalpy Stacking Stability

24 4 4 7 7 

28 3 3 3 3 

32 4 5 5 5 

38 2 3 3 3 

54 2 2 2 2 

70 3 3 3 3 
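
One possible way to automate the Elbow choice is sketched below. It is not the procedure used to obtain Table 4.3: the function name is hypothetical, the squared-error values are made up purely for illustration, and locating the elbow as the K with the largest drop in marginal gain is an assumed heuristic among several possible ones.

```python
def elbow_k(sse_by_k):
    """Return the K after which the gain from adding a cluster drops the most.

    `sse_by_k` maps each tested K to the squared error of its clustering.
    """
    ks = sorted(sse_by_k)
    best_k, best_bend = None, float("-inf")
    for prev, k, nxt in zip(ks, ks[1:], ks[2:]):
        gain_before = sse_by_k[prev] - sse_by_k[k]   # gain from reaching K
        gain_after = sse_by_k[k] - sse_by_k[nxt]     # gain from going past K
        bend = gain_before - gain_after
        if bend > best_bend:
            best_k, best_bend = k, bend
    return best_k


# Illustrative (made-up) squared-error values for K = 2..6.
sse = {2: 100.0, 3: 40.0, 4: 35.0, 5: 32.0, 6: 30.0}
chosen = elbow_k(sse)
print(chosen)
```

With this heuristic the smallest and largest tested K can never be selected, because both neighbouring gains are needed; the squared error for K=1 would have to be included for K=2 to be a candidate.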

   
Once the new clusters were ready, the next step was to perform a series of intersections between these clusters with a straightforward goal: to determine which promoters share the same physical properties. To perform the intersections, a Python script was developed that checked all the promoters in the most populated cluster of each physical feature. If the same promoter was present in the most populated cluster for enthalpy, entropy, base-pair stacking and stability, the promoter sequence was considered to show a higher level of conservation in its physical aspects, and its physical profile was considered worth examining.
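
The intersection step can be illustrated with Python sets. The sketch below is not the original script: the function names, the promoter identifiers (`p1`, `p2`, `p3`) and the cluster labels are hypothetical placeholders.

```python
from collections import Counter


def most_populated(labels_by_promoter):
    """Return the promoters that fall in the largest cluster of one feature."""
    counts = Counter(labels_by_promoter.values())
    top_label, _ = counts.most_common(1)[0]
    return {p for p, lab in labels_by_promoter.items() if lab == top_label}


def shared_promoters(assignments):
    """Intersect the most populated clusters of all physical features."""
    groups = [most_populated(labels) for labels in assignments.values()]
    return set.intersection(*groups)


# Hypothetical cluster labels per promoter, one dictionary per feature.
assignments = {
    "enthalpy":  {"p1": 0, "p2": 0, "p3": 1},
    "entropy":   {"p1": 1, "p2": 1, "p3": 1},
    "stacking":  {"p1": 0, "p2": 1, "p3": 0},
    "stability": {"p1": 2, "p2": 2, "p3": 0},
}
shared = shared_promoters(assignments)
print(shared)
```

Only promoters that belong to the largest cluster of every feature survive the intersection, which matches the criterion described above.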


 

The workflow in Figure 4.1 summarizes the key steps leading to the discussion section. First, the data are acquired from the RegulonDB database and converted into their entropy, enthalpy, base-pair stacking and stability values. The results are then split in two. The first part is a global biological analysis of the physical profile of promoter sequences recognized by different RNAP σ factors. The second part consists of data clustering with the K-means algorithm in two settings: in the first, K is set to 2 in an attempt to distinguish between housekeeping and alternative RNAP σ factors; in the second, the Elbow approach is used to set K, yielding a physical profile for each σ factor.

 
Figure 4.1: Process workflow in this dissertation.  

 
 

5 RESULTS AND DISCUSSION 
 The results of this dissertation are organized in two articles. The first one proposes a biological analysis of the physical features of promoter sequences. Comparing the physical aspects of promoter sequences recognized by different σ groups enables a better comprehension of how promoter sequences behave in comparison with another metric used to classify promoters: the presence of consensus motifs.

 The second article, yet to be submitted for publication, applies the clustering technique to the same physical aspects presented in Article 1. This second paper aims to look deeper into promoter sequences that share similar values for their physical features. The sequences were submitted to an intersection assessment to determine which cluster contained the largest number of promoter sequences; each cluster was then analyzed separately, and the examples present in all features were extracted.

 The third section of the results deals with an intersection analysis of the most populated clusters. The main step here was to select the clusters from the enthalpy, stability and base-pair stacking profiles in the different sigma groups and intersect these promoters, making it possible to assess the profile of all the promoters that shared the same features.

 
 

5.1 ARTICLE 1 – COMPARISON OF ENTROPY, ENTHALPY, STABILITY 

AND BASE-PAIR STACKING PROFILES OF E. COLI IN PROMOTER 

SEQUENCES RECOGNIZED BY DIFFERENT σ FACTORS 

 
COMPARISON OF ENTROPY, ENTHALPY, STABILITY AND BASE-PAIR 

STACKING PROFILES OF E. COLI IN PROMOTER SEQUENCES RECOGNIZED 

BY DIFFERENT σ FACTORS 

Gustavo Sganzerla Martinez1*, Scheila de Ávila e Silva1 

University of Caxias do Sul - Biotechnology Institute  

Rua Francisco Getúlio Vargas, 1130, Bairro Petrópolis, Caxias do Sul, RS – Brazil, 

95070-560 

 
 

ABSTRACT 

Gene transcription in bacteria relies on the presence of promoter sequences. These are conserved DNA sequences that precede the coding region and tell the RNAP enzyme where to start producing RNA. Different sigma factors, the RNAP subunits that recognize promoters, can start the transcription of genes with different functionalities. However, the somewhat conserved promoter regions show variations that hinder their identification by simple nucleotide sequence analysis, since the conservation of the nucleotide content of this DNA site is not absolute for all promoter sequences. In view of this, physical aspects of DNA, such as enthalpy, entropy, stability and base-pair stacking, can aid promoter prediction. In this paper, we analyze these measurements to help in understanding promoter sequences. Few promoter recognition and identification tools make use of combined physical aspects of the DNA. Our results indicate that the tested physical aspects are intertwined and that each group of promoter sequences behaves differently from a physical point of view. The results produced here may aid bacterial promoter recognition by providing this area with a stronger set of biological inferences.

Key-words: bacteria promoter; DNA structural properties; gene transcription 

  
 

1 INTRODUCTION 

The transcription process is a key factor during gene expression. This is no different in bacteria, which are sensitive to environmental changes and require a regulatory system that prioritizes which stress conditions should be responded to, ensuring the survival of the cell [1, 2, 3, 4].

A key element in the transcription process is the promoter sequence. In a simple definition, a promoter sequence is a DNA segment placed upstream of the coding region [5]. When analyzing a genome, an important step for researchers is to search for genes first and then look for the promoter sequence that precedes the gene in question. In this way, more can be understood about the transcription of the gene [6].

During transcription, an important actor is the enzyme RNA polymerase (RNAP). This protein is formed by different polypeptide subunits, among which is the sigma (σ) subunit, responsible for identifying and attaching to specific DNA sequences called promoters. Different σ factors can start the transcription of different gene groups by associating with different promoters, thereby regulating the expression of a group of genes [3].

There are several critical factors that can tell promoters apart from other DNA regions and help RNAP targeting. One example is the consensus motifs located around positions -10 and -35 upstream of the gene. The study of these consensus regions has shown that promoters are similar but not identical: they contain sites with some level of conservation, but these regions are not all the same [3]. Additionally, in some bacterial promoters the promoter overlaps the coding region, demanding a detailed analysis up- and downstream of the site where RNA begins to have its nucleotides inserted, the transcription start site (TSS). To overcome this challenge and improve in silico analysis, approaches can be designed that look at the DNA molecule and extract more than a simple nucleotide composition analysis linked to the consensus regions [7].

Despite many efforts to identify bacterial promoters computationally, this is still a challenge due to the wide variety of profiles found in these regulatory sequences. When identifying a promoter, several features need to be considered to enhance the sensitivity of the tool. It is biologically known that promoter regions tend to be conserved when compared to coding regions. Some tools have been developed over time to aid in this field, among them BTSS Finder [9], CNN promoter [11] and BacPP [8].

RNAP uses several features to identify promoters. These are physical aspects of DNA such as entropy, enthalpy, stability, base-pair stacking, stress-induced DNA duplex destabilization, and DNA curvature and bendability. Those used in this paper are entropy, enthalpy, stability and base-pair stacking. Entropy and enthalpy are thermodynamic quantities related to the energy of a system, including the amount of energy that is not converted into work. Living beings perform a number of processes to maintain their biological functions, all of which require energy. Thus, the organism needs a stable energy supply to keep functioning. Chemical bonds, mass transport into and out of the cell, heat generation and the information stored in the genetic code are all examples of processes that cause entropy/enthalpy variation in a cell. The stability of base pairs refers to the amount of free energy involved in a base-pair interaction. Base-pair stacking refers to the interaction between adjacent base pairs in the DNA molecule [11, 12, 13].

Studying the physical properties of a sequence gives a broad view of how enthalpy, entropy, stability and base-pair stacking correlate with each other and sheds light on promoter sequences [14]. Understanding how the physical aspects of DNA behave in promoter sequences may give the scientific community a better understanding of the molecular structure itself. In this context, this paper relies on these physical features to distinguish promoter sequences recognized by different sigma groups. According to the literature, entropy, enthalpy, base-pair stacking and stability behave differently inside the promoter region than in other DNA sites, and a deeper comprehension of these features may aid promoter recognition tools. These differences can be used as parameters in promoter identification and prediction tools.

2 MATERIALS AND METHODS 

The data used in this paper were promoter regions from six different groups retrieved from the biological database RegulonDB [15]. Table 1 shows the number of examples for each of the six sigma factors of gram-negative bacteria.


 

Table 1: E. coli promoter sequences (A, T, G, C), displayed in the 5’-3’ direction.

σ factor Number of sequences 
24 508 
28 133 
32 299 
38 157 
54 83 
70 1869 

 
All the examples shown in Table 1 were converted into values for the four physical features tested in this paper: entropy, enthalpy, base-pair stacking and stability (Table 2), considering the 16 possible dinucleotide duplexes. Once the sequences had been converted into each DNA physical feature, the data were normalized because of their difference in scale, in order to analyze how these features correlate with each other. The normalization technique chosen was 0-1 normalization.

Table 2 – Entropy, enthalpy, stacking and stability values in the DNA [11, 16]. 

Nucleotide duplex Entropy Enthalpy Base-pair stacking Stability

AA -21.3 -7.6 -5.37 -1 
AT -20.4 -7.2 -6.57 -0.88 
TA -21.3 -7.2 -3.82 -0.58 
AG -21 -8.2 -6.78 -1.3 
GA -21 -8.2 -9.81 -1.3 
TT -21.3 -7.6 -5.37 -1 
CC -19.9 -8 -8.26 -1.84 
GC -24.4 -10.6 -9.69 -2.27 
AC -22.7 -8.5 -10.51 -1.45 
CA -22.7 -8.5 -6.57 -1.45 
TG -22.4 -8.4 -6.57 -1.44 
GT -22.4 -8.4 -10.51 -1.44 
TC -22.4 -7.8 -9.81 -1.28 
CT -22.2 -7.8 -6.78 -1.28 
CG -27.2 -10.6 -14.59 -2.24 
GG -19.9 -8 -8.28 -1.84 

 
The final step was to use the tool Weblogo [17] to analyze the conserved nucleotides at each position along the 81-nucleotide promoter sequence. The goal of using this tool is to find, among a set of n examples, the most common nucleotide at each position.
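
The idea of finding the most common nucleotide per position can be sketched as follows. This is a simplification for illustration only: Weblogo itself draws sequence logos rather than returning a consensus string, and the function names and toy sequences below are hypothetical.

```python
from collections import Counter


def position_counts(sequences):
    """Count the nucleotides observed at each position of aligned sequences."""
    length = len(sequences[0])
    if any(len(s) != length for s in sequences):
        raise ValueError("all sequences must have the same length")
    return [Counter(s[i] for s in sequences) for i in range(length)]


def consensus(sequences):
    """Most common nucleotide at each position, joined into one string."""
    return "".join(c.most_common(1)[0][0] for c in position_counts(sequences))


# Toy aligned sequences (not taken from RegulonDB).
toy = ["TATAAT", "TATGAT", "TACAAT", "GATAAT"]
result = consensus(toy)
print(result)
```

For the real data, each σ group would contribute its own list of 81-nucleotide sequences.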

 
 

3 RESULTS AND DISCUSSION 

 The results involved the analysis of promoter sequences from six different σ groups of E. coli, and the main goal was to assess how entropy, enthalpy, base-pair stacking and stability behave in each of the groups. It is already known that there are different consensus regions in the promoter sequences recognized by each σ factor. These consensus regions help the σ subunit of the RNAP enzyme to find promoter sequences and bind to the DNA. They are somewhat conserved around the -10 and -35 positions. Apart from sequence similarity [3], there are other factors that RNAP looks for when its σ subunit searches for promoter sequences: the local and distinctive physical features of the promoter regions.

It is worth mentioning that a cell performs many processes to survive, and these processes, which involve components such as mitochondria, can cause thermal variation in the cell and its environment. Thus, in thermodynamic terms, a cell is not so different from a machine, in which energetic balance is always sought [14].

In the results displayed in Figure 1, it is possible to see that around position -10, represented by 49 on the x axis, there is a variation in the lines for three features. This means that enthalpy, stacking and stability are meaningful features for distinguishing promoters of different sigma groups. As the figure also shows, the values differ in every group, which indicates that the groups themselves are not alike: the fact that they are all promoters does not mean that their physical features behave in the same way.

The σ24 promoters showed noisy results (Figure 1 (A)), and the lines did not overlap. It is important to point out that almost 80% of the σ24 promoters listed in RegulonDB were predicted in silico and have not been confirmed by biological experiments [18].

The other σ promoters, σ28 (B), σ32 (C), σ38 (D), σ54 (E) and σ70 (F), show a clear and similar peak around -10 and -35 in their series, and all of them presented a similar variation in enthalpy, stacking and stability. These peaks match the consensus motifs reported in the literature for E. coli promoters [3]. The variations around the motifs indicate that there are structural/physical differences at the binding sites where the RNAP connects to the DNA molecule. This suggests that the σ subunit also looks for the distinctive set of physical profiles that promoter sequences have around the -10 and -35 regions, rather than relying only on the consensus regions that each σ factor recognizes in E. coli promoters.

Figure 1 – Each physical feature and its averages per position in the sigma groups 

 
This figure represents the average value per position of entropy, enthalpy, base-pair stacking and stability for all promoters tested in this paper, listed in Table 1 under Materials and Methods. (A) indicates promoters recognized by σ24, (B) σ28, (C) σ32, (D) σ38, (E) σ54 and (F) σ70.

The online tool Weblogo [17] was used to check the nucleotide composition at each position of the 81-nucleotide promoter sequences presented in Table 1. The results are: i) σ24 promoters have a prevalence of A from the 25th to the 28th nucleotide and a predominance of A/T from the 49th to the 56th; ii) among the 133 σ28 promoters, a TGCAAT sequence surrounds the -10 position, while the -35 region is composed of A/T nucleotides; iii) in σ32 promoters, around the degenerate -35 and -10 regions [19], the first shows a slight prevalence of T nucleotides and the second a higher C presence; iv) σ38 examples lean towards A/T nucleotides around -10 and show a subtle A/T predominance around -35; v) σ54 examples present only a low G content at -10 and no predominance at all at -35; vi) finally, σ70 promoters present a high T composition around -35 and the classic TATAAT motif at -10. The Weblogo results are displayed in Figure 2.

The stability and stacking series shown in Figure 1 indicate that, in most cases, these features behave differently when approaching the -10 and -35 consensus regions. The literature asserts that the overall stability of promoter sequences tends to be lower than that of coding genomic sequences because promoters are constantly opened. The promoter is a region that is opened by RNAP during the transcription process so that the nucleotides can be transcribed into mRNA. For this process to be energetically viable, it is reasonable that the promoter has a lower stability than other genomic regions. A/T base pairs have two hydrogen bonds, whereas G/C base pairs have three, which leads promoter regions to present a higher A/T content [12, 20, 8].

Analyzing the entropy and enthalpy values, it is clear that in promoter sequences the entropy shows no alteration around the consensus regions. Early studies [13] have shown that, due to their GC content, promoter sequences would have different enthalpy and entropy values in comparison with other genomic sequences. The authors [13] showed that the average enthalpic contribution to the stabilization of an AT base pair is 31±3 kJ/mol-bp and that of a GC base pair is 45±6 kJ/mol-bp. Regarding the entropy necessary for base-pair stabilization, the reported values are 86±10 kJ/mol-bp for AT and 114±15 kJ/mol-bp for GC. Promoters are known to be poor in GC content, and the GC amount affects DNA stabilization in both the enthalpic and the entropic contribution. During the binding of RNAP to promoter sequences, other elements take part in the RNAP-DNA interaction. According to the literature, the more complex a system is, the higher its entropy. During the connection between RNAP and DNA, the other elements involved, such as water molecules and minerals, raise the complexity of the system and therefore its entropy. This higher complexity brought by AT nucleotides binding to water molecules explains the higher entropy levels shown in Figure 1, which are almost all near the maximum value of 1 in our graph [13, 21, 22].

 
 

Figure 2 – Weblogos for the six σ groups tested in this paper 

 
 

Figure 1 also shows that the entropy levels carried close to no alteration around the consensus regions, which raises a question: how does entropy differ from the other tested DNA features? Entropy and enthalpy are similar features, both related to the thermodynamics of DNA, but the two measurements behave differently. The literature asserts that entropy refers to the system as a whole. A system is present in the DNA-enzyme-water interaction: water is a major component that takes part in the hydrogen bonds and directly affects entropy levels. This whole system is not organized, and the entropy tends to decrease as the RNAP moves along the DNA strand, with water connecting to the DNA. Enthalpy, on the other hand, does not refer to all the components involved in this system; as a standalone characteristic, it is affected only by the hydrogen bonds found in the nucleotide pairing. This leads us to the remark that entropy is a system-level measurement: although it corresponds to DNA thermodynamics, it cannot be measured alone, because it is highly affected by the other components of the system [13, 21].

Regarding base-pair stacking as a feature to distinguish promoter sequences of different σ groups, our results in Figure 1 show that base-pair stacking overlaps with stability. Zhang et al. (2015) used an atomic force microscope to evaluate the base-pair hydrogen bond strength and the base-pair stacking force in DNA strands. The authors reported a binding strength of 20 piconewtons (pN) for GC base pairs and 14 pN for AT base pairs, and estimated the stacking force between adjacent base pairs at 2 pN. The higher binding strength of GC pairs is consistent with promoter regions in which the GC amount does not exceed the AT content, which makes promoters weaker sequences in terms of their bonds and adjacent base pairs [12, 23, 24]. This close connection between base-pair stacking strength and DNA stability explains the superposition displayed in Figure 1.

There may be a leaning towards AT nucleotides in the composition of promoter sequences, as some sigma groups show in Table 3. It would make sense for promoters to have more AT than GC nucleotides because of the amount of force involved in these base pairs. However, the authors of [13] have shown that the forces involved in the bond between the protein (RNAP) and the DNA strand are not simple: they need to be weak enough to allow the protein to scan the DNA easily and, simultaneously, strong enough for longer-lived connections. Studies have shown that the enthalpy of a sequence follows increases in its stability; as Figure 1 depicts, the enthalpy line follows the stability line in all cases. This is due to the enthalpy being associated with the bond between RNAP and the DNA strand [12, 13].

The results shown in Figure 1 indicate a strong correlation between the enthalpy, base-pair stacking and stability profiles for five of the six σ groups tested. This correlation is explained by the physical connection between these three features [8, 12, 13, 24]. Promoter sequences recognized by different σ groups show that these features behave differently in each group, and the values in Figure 1 vary accordingly. Even so, all groups presented some variation around the consensus motifs, which aids RNAP recognition.

Table 3 shows the correlation between the four physical features tested. Spearman's rank correlation test was used because the data do not follow a normal distribution. All pairs of features show some degree of correlation.

Table 3 – Correlations

            Enthalpy   Entropy   Stability   Stacking
Enthalpy    1.0        .336**    .815**      .821**
Entropy     .336**     1.0       .391**      .310**
Stability   .815**     .391**    1.0         .707**
Stacking    .821**     .310**    .707**      1.0

*. The correlation is significant at the 0.05 level (two-tailed).
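For readers unfamiliar with the test, Spearman's coefficient is the Pearson correlation computed on the ranks of the observations, with tied values sharing the mean of their ranks. The following is a minimal sketch of that calculation in plain Python; it is not the software used to produce Table 3, and the profile values at the end are invented for illustration.

```python
import math


def average_ranks(values):
    """Return 1-based ranks; tied values receive the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("inputs must have the same length (at least 2)")
    return pearson(average_ranks(x), average_ranks(y))


# Invented per-position values, for illustration only.
enthalpy = [-8.4, -7.9, -9.1, -7.2, -8.0, -8.8]
stability = [-1.5, -1.2, -1.9, -1.0, -1.4, -1.6]
print(f"Spearman rho = {spearman(enthalpy, stability):.3f}")
```

Because only ranks enter the calculation, the coefficient does not assume normally distributed data, which is the reason given above for choosing this test.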

Table 4 shows a principal component analysis, in which three components explain 93.3% of the variance in our data. Figure 3 shows the component plot in rotated space, which confirms what Figure 1 indicated: enthalpy, base-pair stacking and stability are intertwined, while entropy is not.

Table 4 – Principal component analysis

            Component
            1         2         3
Enthalpy    .250      .304      -.081
Entropy     -.045     -.202     1.055
Stability   -.648     1.204     -.114
Stacking    1.181     -.689     -.019

 
Figure 3 – Component plot in rotated space
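The share of variance attributed to each principal component corresponds to the eigenvalues of the correlation matrix, each divided by their sum. The following is a minimal sketch of that step, assuming the Spearman coefficients of Table 3 can stand in for the correlation matrix. The text does not state which matrix, rotation or software produced Table 4, so the shares printed here are illustrative and are not expected to reproduce the 93.3% figure. The Jacobi eigenvalue routine is a generic textbook method chosen to keep the example free of external libraries.

```python
import math

# Coefficients copied from Table 3, in the order
# enthalpy, entropy, stability, stacking (assumed input matrix).
TABLE_3 = [
    [1.000, 0.336, 0.815, 0.821],
    [0.336, 1.000, 0.391, 0.310],
    [0.815, 0.391, 1.000, 0.707],
    [0.821, 0.310, 0.707, 1.000],
]


def jacobi_eigenvalues(matrix, max_sweeps=100, tol=1e-18):
    """Eigenvalues of a symmetric matrix by cyclic Jacobi rotations."""
    n = len(matrix)
    a = [row[:] for row in matrix]
    for _ in range(max_sweeps):
        off = sum(a[i][j] ** 2 for i in range(n) for j in range(n) if i != j)
        if off < tol:
            break
        for p in range(n - 1):
            for q in range(p + 1, n):
                if abs(a[p][q]) < 1e-15:
                    continue
                # Rotation angle that zeroes the (p, q) entry.
                diff = a[q][q] - a[p][p]
                if diff == 0:
                    theta = math.pi / 4
                else:
                    theta = 0.5 * math.atan(2 * a[p][q] / diff)
                c, s = math.cos(theta), math.sin(theta)
                for k in range(n):  # update columns p and q
                    akp, akq = a[k][p], a[k][q]
                    a[k][p] = c * akp - s * akq
                    a[k][q] = s * akp + c * akq
                for k in range(n):  # update rows p and q
                    apk, aqk = a[p][k], a[q][k]
                    a[p][k] = c * apk - s * aqk
                    a[q][k] = s * apk + c * aqk
    return sorted((a[i][i] for i in range(n)), reverse=True)


def explained_variance(matrix):
    """Fraction of total variance carried by each component."""
    eigenvalues = jacobi_eigenvalues(matrix)
    total = sum(eigenvalues)
    return [value / total for value in eigenvalues]


running = 0.0
for index, share in enumerate(explained_variance(TABLE_3), start=1):
    running += share
    print(f"component {index}: {share:.1%} (cumulative {running:.1%})")
```

Since the three strongly correlated features load together, most of the variance is expected to concentrate in the first component, with entropy contributing largely on its own.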

 
4 CONCLUSIONS 

A better understanding of the relationship between chemical bonds and DNA can help develop better tools for predicting how strongly two molecules will interact from their structure alone. The results presented in Figure 1 show that, except for σ24, each sigma group behaves differently in terms of its physical features. This makes it possible to associate promoters with their sigma factor. Entropy is the only feature that could not be used to support this idea.

It is already known that promoter prediction is not a simple task because of the several deformations, mutations and overlaps found in these regulatory sequences. Tools that rely only on sequence similarity may be outdated and limited. The results gathered



here can inform in silico promoter prediction and characterization by providing the bioinformatics field with a deeper analysis of the physical features of the promoters tested in this paper. In a future study, more segments of a genome may be compared with the promoter sequence to analyze the entropy, enthalpy, base-pair stacking and stability levels outside the promoter sequence, and these biological inferences may be combined with artificial intelligence techniques.

 
5 REFERENCES 

[1] Cases, I., de Lorenzo, V. & Ouzounis, C. A. (2003). Transcription regulation and 

environmental adaptation in bacteria. Trends Microbiol. 11, 248–253. 

[2] McAdams, H. H., Srinivasan, B. & Arkin, A. P. (2004). The evolution of genetic 

regulatory systems in bacteria. Nature Rev. Genet. 5, 169–178. 

[3] Krebs, J. E., Goldstein, E. S. & Kilpatrick, S. T. (2014). Lewin’s GENES XI. Jones & Bartlett Publ.

[4] Barnard, A.,