CoV2D Browser™

Parikh vectors
1GUQ_1 5SQL_1 9DUT_1 Letter Amino acid
19 3 65 Q Glutamine
17 13 123 G Glycine
14 6 70 H Histidine
26 5 91 P Proline
11 7 89 Y Tyrosine
33 15 117 A Alanine
6 3 41 C Cysteine
6 6 166 I Isoleucine
8 2 53 M Methionine
19 9 119 D Aspartic acid
23 6 116 E Glutamic acid
31 19 241 L Leucine
13 13 117 K Lysine
23 6 118 T Threonine
22 22 143 V Valine
21 2 129 R Arginine
15 5 86 F Phenylalanine
19 11 177 S Serine
10 0 25 W Tryptophan
12 16 97 N Asparagine
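
A Parikh vector records how many times each letter occurs in a word. The following is a minimal Python sketch of that count, not necessarily how the table above was produced; the names `AMINO_ACIDS` and `parikh_vector` are illustrative, and the letter order is simply copied from the table rows.

```python
from collections import Counter

# One-letter codes in the row order of the table above (an illustrative choice).
AMINO_ACIDS = "QGHPYACIMDELKTVRFSWN"

def parikh_vector(seq):
    """Map each amino-acid letter to its number of occurrences in seq."""
    counts = Counter(seq)
    return {aa: counts.get(aa, 0) for aa in AMINO_ACIDS}

# First ten residues of the 1GUQ_1 sequence, used as a small example.
vector = parikh_vector("MTQFNPVDHP")
print(vector["P"], vector["W"])
```

Letters that do not occur get an explicit 0, matching the 0 entry for Tryptophan in the 5SQL_1 column.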

>1GUQ_1|Chains A, B, C, D|GALACTOSE-1-PHOSPHATE URIDYLYLTRANSFERASE|Escherichia coli (562)
>5SQL_1|Chains A, B|Non-structural protein 3|Severe acute respiratory syndrome coronavirus 2 (2697049)
>9DUT_1|Chain A|RNA-directed RNA polymerase L|Measles virus strain Edmonston-B (70146)
Protein code \(c\) LZ-complexity \(\mathrm{LZ}(w)\) Length \(n=|w|\) \(\frac{\mathrm{LZ}(w)}{n /\log_{20} n}\) \(p_w(1)\) \(p_w(2)\) \(p_w(3)\) Sequence \(w=f(c)\)
1GUQ , Knot 151 348 0.84 40 219 337
MTQFNPVDHPHRRYNPLTGQWILVSPHRAKRPWQGAQETPAKQVLPAHDPDCFLCAGNVRVTGDKNPDYTGTYVFTNDFAALMSDTPDAPESHDPLMRCQSARGTSRVICFSPDHSKTLPELSVAALTEIVKTWQEQTAELGKTYPWVQVFENKGAAMGCSNPHPGGQIWANSFLPNEAEREDRLQKEYFAEQKSPMLVDYVQRELADGSRTVVETEHWLAVVPYWAAWPFETLLLPKAHVLRITDLTDAQRSDLALALKKLTSRYDNLFQCSFPYSMGWHGAPFNGEENQHWQLHAHFYPPLLRSATVRKFMVGYEMLAETQRDLTAEQAAERLRAVSDIHFRESGV
5SQL , Knot 80 169 0.81 38 119 162
SMVNSFSGYLKLTDNVYIKNADIVEEAKKVKPTVVVNAANVYLKHGGGVAGALNKATNNAMQVESDDYIATNGPLKVGGSCVLSGHNLAKHCLHVVGPNVNKGEDIQLLKSAYENFNQHEVLLAPLLSAGIFGADPIHSLRVCVDTVRTNVYLAVFDKNLYDKLVSSFL
9DUT , Knot 731 2183 0.85 40 377 1758
MDSLSVNQILYPEVHLDSPIVTNKIVAILEYARVPHAYSLEDPTLCQNIKHRLKNGFSNQMIINNVEVGNVIKSKLRSYPAHSHIPYPNCNQDLFNIEDKESTRKIRELLKKGNSLYSKVSDKVFQCLRDTNSRLGLGSELREDIKEKVINLGVYMHSSQWFEPFLFWFTVKTEMRSVIKSQTHTCHRRRHTPVFFTGSSVELLISRDLVAIISKESQHVYYLTFELVLMYCDVIEGRLMTETAMTIDARYTELLGRVRYMWKLIDGFFPALGNPTYQIVAMLEPLSLAYLQLRDITVELRGAFLNHCFTEIHDVLDQNGFSDEGTYHELIEALDYIFITDDIHLTGEIFSFFRSFGHPRLEAVTAAENVRKYMNQPKVIVYETLMKGHAIFCGIIINGYRDRHGGSWPPLTLPLHAADTIRNAQASGDGLTHEQCVDNWKSFAGVKFGCFMPLSLDSDLTMYLKDKALAALQREWDSVYPKEFLRYDPPKGTGSRRLVDVFLNDSSFDPYDVIMYVVSGAYLHDPEFNLSYSLKEKEIKETGRLFAKMTYKMRACQVIAENLISNGIGKYFKDNGMAKDEHDLTKALHTLAVSGVPKDLKESHRGGPVLKTYSRSPVHTSTRNVRAAKGFIGFPQVIRQDQDTDHPENMEAYETVSAFITTDLKKYCLNWRYETISLFAQRLNEIYGLPSFFQWLHKRLETSVLYVSDPHCPPDLDAHIPLYKVPNDQIFIKYPMGGIEGYCQKLWTISTIPYLYLAAYESGVRIASLVQGDNQTIAVTKRVPSTWPYNLKKREAARVTRDYFVILRQRLHDIGHHLKANETIVSSHFFVYSKGIYYDGLLVSQSLKSIARCVFWSETIVDETRAACSNIATTMAKSIERGYDRYLAYSLNVLKVIQQILISLGFTINSTMTRDVVIPLLTNNDLLIRMALLPAPIGGMNYLNMSRLFVRNIGDPVTSSIADLKRMILASLMPEETLHQVMTQQPGDSSFLDWASDPYSANLVCVQSITRLLKNITARFVLIHSPNPMLKGLFHDDSKEEDEGLAAFLMDRHIIVPRAAHEILDHSVTGARESIAGMLDTTKGLIRASMRKGGLTSRVITRLSNYDYEQFRAGMVLLTGRKRNVLIDKESCSVQLARALRSHMWARLARGRPIYGLEVPDVLESMRGHLIRRHETCVICECGSVNYGWFFVPSGCQLDDIDKETSSLRVPYIGSTTDERTDMKLAFVRAPSRSLRSAVRIATVYSWAYGDDDSSWNEAWLLARQRANVSLEELRVITPISTSTNLAHRLRDRSTQVKYSGTSLVRVARYTTISNDNLSFVISDKKVDTNFIYQQGMLLGLGVLETLFRLEKDTGSSNTVLHLHVETDCCVIPMIDHPRIPSSRKLELRAELCTNPLIYDNAPLIDRDATRLYTQSHRRHLVEFVTWSTPQLYHILAKSTALSMIDLVTKFEKDHMNEISALIGDDDINSFITEFLLIEPRLFTIYLGQCAAINWAFDVHYHRPSGKYQMGELLSSFLSRMSKGVFKVLVNALSHPKIYKKFWHCGIIEPIHGPSLDAQNLHTTVCNMVYTCYMTYLDLLLNEELEEFTFLLCESDEDVVPDRFDNIQAKHLCVLADLYCQPGTCPPIQGLRPVEKCAVLTDHIKAEAMLSPAGSSWNINPIIVDHYSCSLTYLRRGSIKQIRLRVDPGFIFDALAEVNVSQPKIGSNNISNMSIKAFRPPHDDVAKLLKDINTSKHNLPISGGNLANYEIHAFRRIGLNSSACYKAVEISTLIRRCLEPGEDGLFLGEGSGSMLITYKEILKLSKCFYNSGVSANSRSGQRELAPYPSEVGLVEHRMGVGNIVKVLFNGRPEVTWVGSVDCFNFIVSNIPTSSVGFIHSDIETLPDKDTIEKLEELAAILSMALLLGKIGSILVIKLMPFSGDFVQGFISYVGSHYREVNLVYPRYSNFISTESYLVMTDLKANRLMNPEKIKQQIIESSVRTSPGLIGHILSIKQLSC
IQAIVGDAVSRGDINPTLKKLTPIEQVLINCGLAINGPKLCKELIHHDVASGQDGLLNSILILYRELARFKDNQRSQQGMFHAYPVLVSSRQRELISRITRKFWGHILLYSGNRKLINKFIQNLKSGYLILDLHQNIFVKNLSKSEKQIIMTGGLKREWVFKVTVKETKEWYKLVGYSALIKD
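
The normalized column \(\frac{\mathrm{LZ}(w)}{n/\log_{20} n}\) can be recomputed from the LZ-complexity and length columns. The sketch below does only that arithmetic; the function name `normalized_lz` and the `alphabet_size` parameter are illustrative, and the input numbers are copied from the table above.

```python
import math

def normalized_lz(lz, n, alphabet_size=20):
    """Return LZ(w) / (n / log_b n) for a word of length n over b letters."""
    return lz / (n / math.log(n, alphabet_size))

# (LZ-complexity, length) pairs copied from the table above.
rows = {"1GUQ": (151, 348), "5SQL": (80, 169), "9DUT": (731, 2183)}
for code, (lz, n) in rows.items():
    print(code, f"{normalized_lz(lz, n):.3f}")
```

The results are approximately 0.848, 0.811 and 0.859, which agree with the tabulated 0.84, 0.81 and 0.85 if those values are truncated (rather than rounded) to two decimals.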

Let \(P_w(n)\) be the set of distinct subwords (intervals) of length \(n\) in a word \(w\). Let \(p_w(n)\) be the cardinality of \(P_w(n)\). Let \(f(c)\) be the FASTA sequence with 4-symbol Protein Data Bank code \(c\).
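
As an illustration of these definitions, here is a short Python sketch that collects the distinct length-\(k\) subwords of a word and counts them. The function names `subwords` and `subword_count` are illustrative. The sketch counts distinct subwords directly, so for \(k=1\) it returns at most 20 on a protein sequence; the tabulated \(p_w(1)\) values may follow a different convention.

```python
def subwords(w, k):
    """Return the set of distinct contiguous subwords of length k in w."""
    return {w[i:i + k] for i in range(len(w) - k + 1)}

def subword_count(w, k):
    """Return the number of distinct length-k subwords of w."""
    return len(subwords(w, k))

# Small example: ABAB has two distinct subwords of length 2, AB and BA.
print(sorted(subwords("ABAB", 2)), subword_count("ABAB", 2))
```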

\(|P_{f(1GUQ_1)}(2) \setminus P_{f(5SQL_1)}(2)|=136\), \(|P_{f(5SQL_1)}(2) \setminus P_{f(1GUQ_1)}(2)|=36\). Let \( Z_k(x,y)=|P_x(k)\setminus P_y(k)|+|P_y(k)\setminus P_x(k)| \) be an LZ76-style (set of subwords) Jaccard distance numerator for \(P(k)\).

Hydrophobic-polar version of Sequence 1:
100101100100000110101111010010011011000110011110010011011010101000100010011000111110001011000011100001010001101010000011010111100110010000101100011101100011111000101110111001110010000010000110000111100100011010001100001111110111111001111010110100100100001111100100000011000110011101111010000010101010111100101001111001110000010100110010110010100011
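
The hydrophobic-polar string can be reproduced by mapping each residue to 1 or 0. The page does not state which residues count as hydrophobic; the set used in this sketch is an assumption inferred by aligning the first 60 residues of Sequence 1 with the first 60 bits above. The names `HYDROPHOBIC` and `hp_encode` are illustrative.

```python
# Assumption: residues encoded as 1, inferred from the start of Sequence 1.
# All other residues (including C and Y) are encoded as 0.
HYDROPHOBIC = set("AFGILMPVW")

def hp_encode(seq):
    """Encode a protein sequence as a 0/1 hydrophobic-polar string."""
    return "".join("1" if aa in HYDROPHOBIC else "0" for aa in seq)

# First 30 residues of the 1GUQ_1 sequence.
print(hp_encode("MTQFNPVDHPHRRYNPLTGQWILVSPHRAK"))
```

Under this assumed mapping, the output matches the first 30 bits of the string above.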
Pair \(Z_2\) Length of longest common subword
1GUQ_1,5SQL_1 172 3
1GUQ_1,9DUT_1 172 5
5SQL_1,9DUT_1 262 4
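
A sketch of how the two columns of this table could be computed follows. \(Z_k\) is the size of the symmetric difference of the two subword sets; for the pair 1GUQ_1, 5SQL_1 this is 136 + 36 = 172, in agreement with the table. The small values in the last column (3 to 5) suggest that it reports the longest common contiguous subword, and the sketch assumes that reading. All function names are illustrative.

```python
def subwords(w, k):
    """Return the set of distinct contiguous subwords of length k in w."""
    return {w[i:i + k] for i in range(len(w) - k + 1)}

def z_distance(x, y, k):
    """Z_k(x, y) = |P_x(k) \\ P_y(k)| + |P_y(k) \\ P_x(k)|."""
    px, py = subwords(x, k), subwords(y, k)
    return len(px - py) + len(py - px)

def longest_common_subword_length(x, y):
    """Length of the longest contiguous word occurring in both x and y."""
    best = 0
    prev = [0] * (len(y) + 1)
    for i in range(1, len(x) + 1):
        cur = [0] * (len(y) + 1)
        for j in range(1, len(y) + 1):
            if x[i - 1] == y[j - 1]:
                # Extend the common subword ending at x[i-2], y[j-2].
                cur[j] = prev[j - 1] + 1
                best = max(best, cur[j])
        prev = cur
    return best

print(z_distance("ABAB", "BABC", 2), longest_common_subword_length("ABAB", "BABC"))
```

The dynamic program uses time proportional to the product of the two lengths, which is adequate for sequences of a few thousand residues.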

Newick tree

 
[
	9DUT_1:11.92,
	[
		1GUQ_1:86,5SQL_1:86
	]:31.92
]

Let d be the Otu–Sayood distance \(d\).
Let d1 be the Otu–Sayood distance \(d_1\). (This makes the 4TYN sequence AAAAAA a close match...)
Roughly speaking, the expected distance is \((0.85)(0.8)(\frac{517 }{\log_{20} 517}-\frac{169}{\log_{20}169})\approx 101\).
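
The arithmetic in this estimate can be checked directly. In the sketch below, 517 = 348 + 169 is the combined length of the two query sequences, and the factors 0.85 and 0.8 are taken from the formula as given; the function name `expected_phrases` is illustrative.

```python
import math

def expected_phrases(n, alphabet_size=20):
    """Return n / log_b n, the scale used to normalize LZ-complexity."""
    return n / math.log(n, alphabet_size)

# Lengths from the table: 348 (1GUQ_1) and 169 (5SQL_1).
combined, shorter = 348 + 169, 169
estimate = 0.85 * 0.8 * (expected_phrases(combined) - expected_phrases(shorter))
print(round(estimate, 1))
```

The value comes out at about 101.5, consistent with the figure of 101 quoted above.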
Status Protein1 Protein2 d d1/2
Query variables 1GUQ_1 5SQL_1 132 94.5
Was not able to put for d
Was not able to put for d1
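
For orientation, the Otu–Sayood distances are built from the complexity \(c\) of each sequence and of the two concatenations: \(d=\max\{c(xy)-c(x),\,c(yx)-c(y)\}\) and \(d_1=(c(xy)-c(x))+(c(yx)-c(y))\). The sketch below uses a straightforward LZ76 (exhaustive history) phrase count as \(c\). Whether the page uses exactly this variant is an assumption, so the sketch is not guaranteed to reproduce the tabulated values; the names `lz76` and `otu_sayood` are illustrative.

```python
def lz76(s):
    """Count phrases in the LZ76 exhaustive-history parsing of s."""
    i, count, n = 0, 0, len(s)
    while i < n:
        length = 1
        # Grow the phrase while it already occurs starting at an earlier
        # position (the earlier copy may overlap the phrase itself).
        while i + length <= n and s[i:i + length] in s[:i + length - 1]:
            length += 1
        count += 1
        i += length
    return count

def otu_sayood(x, y, c=lz76):
    """Return (d, d1) for sequences x and y under complexity function c."""
    a = c(x + y) - c(x)
    b = c(y + x) - c(y)
    return max(a, b), a + b

print(lz76("0001101001000101"), lz76("AAAAAA"), otu_sayood("ABAB", "ABAB"))
```

A highly repetitive word such as AAAAAA parses into only two phrases, so appending it to another sequence adds very little complexity in that direction.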

In notation analogous to [Theorem 16, Kjos-Hanssen, Niraula and Yoon (2022)],
\[ \delta= \alpha \mathrm{min} + (1-\alpha) \mathrm{max}= \begin{cases} d &\alpha=0,\\ d_1/2 &\alpha=1/2 \end{cases} \]
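
The interpolation can be checked numerically against the query row above. This sketch assumes that "min" and "max" refer to the smaller and larger of two one-sided quantities; the value 57 is not shown on the page but is derived from the tabulated \(d=132\) and \(d_1/2=94.5\) as \(2\cdot 94.5-132\). The function name `delta` is illustrative.

```python
def delta(a, b, alpha):
    """Return alpha * min(a, b) + (1 - alpha) * max(a, b)."""
    return alpha * min(a, b) + (1 - alpha) * max(a, b)

# 132 is the tabulated d; 57 is derived from d1/2 = 94.5 (an inferred value).
larger, smaller = 132, 57
print(delta(smaller, larger, 0), delta(smaller, larger, 0.5))
```

With \(\alpha=0\) the result is the maximum, 132, and with \(\alpha=1/2\) it is the average, 94.5, matching the two columns of the query row.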