Downloading data from SIM
In this notebook we use PySUS to download and preprocess mortality data from SIM, Brazil's Mortality Information System (Sistema de Informações sobre Mortalidade).
[1]:
from pysus.online_data import SIM, parquets_to_dataframe
from pysus.preprocessing.decoders import decodifica_idade_SIM, translate_variables_SIM
[2]:
# Download the 2019 SIM records for the state of Alagoas (AL) and load them into a DataFrame
df = parquets_to_dataframe(SIM.download('AL', 2019))
df
[2]:
ORIGEM | TIPOBITO | DTOBITO | HORAOBITO | NATURAL | CODMUNNATU | DTNASC | IDADE | SEXO | RACACOR | ... | FONTES | TPRESGINFO | TPNIVELINV | NUDIASINF | DTCADINF | MORTEPARTO | DTCONCASO | FONTESINF | ALTCAUSA | CONTADOR | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 2 | 07022019 | 2145 | 827 | 270910 | 23041944 | 474 | 1 | 1 | ... | 444 | |||||||||
1 | 1 | 2 | 08022019 | 0600 | 827 | 270020 | 10011933 | 486 | 1 | 1 | ... | 445 | |||||||||
2 | 1 | 2 | 27012019 | 1630 | 823 | 230730 | 10051930 | 488 | 2 | 1 | ... | 768 | |||||||||
3 | 1 | 2 | 14012019 | 2030 | 827 | 270400 | 03111929 | 489 | 2 | 4 | ... | 769 | |||||||||
4 | 1 | 2 | 17022019 | 0927 | 827 | 270000 | 10091935 | 483 | 1 | 4 | ... | 803 | |||||||||
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
20282 | 1 | 2 | 25102019 | 0920 | 08111942 | 476 | 2 | 4 | ... | 1377883 | |||||||||||
20283 | 1 | 2 | 26102019 | 1810 | 827 | 270000 | 07071968 | 451 | 2 | 4 | ... | 1377884 | |||||||||
20284 | 1 | 2 | 26102019 | 1650 | 827 | 270000 | 15121997 | 421 | 1 | 4 | ... | 1377885 | |||||||||
20285 | 1 | 2 | 25102019 | 1450 | 02041940 | 479 | 1 | 4 | ... | 1377886 | |||||||||||
20286 | 1 | 2 | 24102019 | 0300 | 20102019 | 204 | 2 | ... | SXXSXX | 24122019 | 3 | 18122019 | 2 | 1377887 |
20287 rows × 87 columns
[3]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20287 entries, 0 to 20286
Data columns (total 87 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 ORIGEM 20287 non-null string
1 TIPOBITO 20287 non-null string
2 DTOBITO 20287 non-null string
3 HORAOBITO 20287 non-null string
4 NATURAL 20287 non-null string
5 CODMUNNATU 20287 non-null string
6 DTNASC 20287 non-null string
7 IDADE 20287 non-null string
8 SEXO 20287 non-null Int64
9 RACACOR 20287 non-null string
10 ESTCIV 20287 non-null string
11 ESC 20287 non-null string
12 ESC2010 20287 non-null string
13 SERIESCFAL 20287 non-null string
14 OCUP 20287 non-null string
15 CODMUNRES 20287 non-null Int64
16 LOCOCOR 20287 non-null string
17 CODESTAB 20287 non-null string
18 ESTABDESCR 20287 non-null string
19 CODMUNOCOR 20287 non-null string
20 IDADEMAE 20287 non-null string
21 ESCMAE 20287 non-null string
22 ESCMAE2010 20287 non-null string
23 SERIESCMAE 20287 non-null string
24 OCUPMAE 20287 non-null string
25 QTDFILVIVO 20287 non-null string
26 QTDFILMORT 20287 non-null string
27 GRAVIDEZ 20287 non-null string
28 SEMAGESTAC 20287 non-null string
29 GESTACAO 20287 non-null string
30 PARTO 20287 non-null string
31 OBITOPARTO 20287 non-null string
32 PESO 20287 non-null string
33 TPMORTEOCO 20287 non-null string
34 OBITOGRAV 20287 non-null string
35 OBITOPUERP 20287 non-null string
36 ASSISTMED 20287 non-null string
37 EXAME 20287 non-null string
38 CIRURGIA 20287 non-null string
39 NECROPSIA 20287 non-null string
40 LINHAA 20287 non-null string
41 LINHAB 20287 non-null string
42 LINHAC 20287 non-null string
43 LINHAD 20287 non-null string
44 LINHAII 20287 non-null string
45 CAUSABAS 20287 non-null string
46 CB_PRE 20287 non-null string
47 COMUNSVOIM 20287 non-null string
48 DTATESTADO 20287 non-null string
49 CIRCOBITO 20287 non-null string
50 ACIDTRAB 20287 non-null string
51 FONTE 20287 non-null string
52 NUMEROLOTE 20287 non-null string
53 TPPOS 20287 non-null string
54 DTINVESTIG 20287 non-null string
55 CAUSABAS_O 20287 non-null string
56 DTCADASTRO 20287 non-null string
57 ATESTANTE 20287 non-null string
58 STCODIFICA 20287 non-null string
59 CODIFICADO 20287 non-null string
60 VERSAOSIST 20287 non-null string
61 VERSAOSCB 20287 non-null string
62 FONTEINV 20287 non-null string
63 DTRECEBIM 20287 non-null string
64 ATESTADO 20287 non-null string
65 DTRECORIGA 20287 non-null string
66 CAUSAMAT 20287 non-null string
67 ESCMAEAGR1 20287 non-null string
68 ESCFALAGR1 20287 non-null string
69 STDOEPIDEM 20287 non-null string
70 STDONOVA 20287 non-null string
71 DIFDATA 20287 non-null string
72 NUDIASOBCO 20287 non-null string
73 NUDIASOBIN 20287 non-null string
74 DTCADINV 20287 non-null string
75 TPOBITOCOR 20287 non-null string
76 DTCONINV 20287 non-null string
77 FONTES 20287 non-null string
78 TPRESGINFO 20287 non-null string
79 TPNIVELINV 20287 non-null string
80 NUDIASINF 20287 non-null string
81 DTCADINF 20287 non-null string
82 MORTEPARTO 20287 non-null string
83 DTCONCASO 20287 non-null string
84 FONTESINF 20287 non-null string
85 ALTCAUSA 20287 non-null string
86 CONTADOR 20287 non-null string
dtypes: Int64(2), string(85)
memory usage: 13.5 MB
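The `info()` output above shows that date columns such as DTOBITO and DTNASC are stored as strings, and the displayed values (for example `07022019`) look like day, month and year with no separators. The sketch below shows one way such a value could be turned into a proper date, assuming the layout is always DDMMYYYY. The helper name `parse_sim_date` is hypothetical and is not part of PySUS.

```python
from datetime import date, datetime
from typing import Optional


def parse_sim_date(raw: str) -> Optional[date]:
    """Parse a DDMMYYYY string; return None for blank or malformed values.

    Assumption: SIM date fields use the DDMMYYYY layout seen in the
    table above (e.g. "07022019").
    """
    text = raw.strip()
    # Reject blanks and anything that is not exactly eight digits.
    if len(text) != 8 or not text.isdigit():
        return None
    try:
        return datetime.strptime(text, "%d%m%Y").date()
    except ValueError:
        # Eight digits that do not form a real calendar date.
        return None


print(parse_sim_date("07022019"))
print(parse_sim_date(""))
```

Assuming `df` is the DataFrame loaded above, something like `df["DTOBITO"].map(parse_sim_date)` should apply the helper to a whole column. Returning `None` instead of raising keeps blank fields from interrupting the conversion.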
Humanizing some of the encoded variables
[4]:
# Replace coded values (e.g. SEXO, RACACOR) with readable labels and add the IDADE_ANOS column
df2 = translate_variables_SIM(df)
df2
2023-04-10 08:33:13.185 | DEBUG | pysus.online_data.SIM:get_municipios:180 - Stablishing connection with ftp.datasus.gov.br.
220 Microsoft FTP Service
2023-04-10 08:33:13.209 | DEBUG | pysus.online_data.SIM:get_municipios:184 - Changing FTP work dir to: /dissemin/publicos/SIM/CID10/TABELAS
2023-04-10 08:33:13.210 | INFO | pysus.online_data.SIM:get_municipios:194 - Local parquet file found at /home/luabida/pysus/SIM_CADMUN_.parquet
/home/luabida/Projetos/InfoDengue/PySUS/pysus/preprocessing/decoders.py:122: FutureWarning: The series.append method is deprecated and will be removed from pandas in a future version. Use pandas.concat instead.
return df["MUNCODDV"].append(df["MUNCOD"]).astype("int64").values
[4]:
ORIGEM | TIPOBITO | DTOBITO | HORAOBITO | NATURAL | CODMUNNATU | DTNASC | IDADE | SEXO | RACACOR | ... | TPRESGINFO | TPNIVELINV | NUDIASINF | DTCADINF | MORTEPARTO | DTCONCASO | FONTESINF | ALTCAUSA | CONTADOR | IDADE_ANOS | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 2 | 07022019 | 2145 | 827 | 270910 | 23041944 | 474 | Masculino | Branca | ... | 444 | 74.000000 | ||||||||
1 | 1 | 2 | 08022019 | 0600 | 827 | 270020 | 10011933 | 486 | Masculino | Branca | ... | 445 | 86.000000 | ||||||||
2 | 1 | 2 | 27012019 | 1630 | 823 | 230730 | 10051930 | 488 | Feminino | Branca | ... | 768 | 88.000000 | ||||||||
3 | 1 | 2 | 14012019 | 2030 | 827 | 270400 | 03111929 | 489 | Feminino | Parda | ... | 769 | 89.000000 | ||||||||
4 | 1 | 2 | 17022019 | 0927 | 827 | 270000 | 10091935 | 483 | Masculino | Parda | ... | 803 | 83.000000 | ||||||||
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
20282 | 1 | 2 | 25102019 | 0920 | 08111942 | 476 | Feminino | Parda | ... | 1377883 | 76.000000 | ||||||||||
20283 | 1 | 2 | 26102019 | 1810 | 827 | 270000 | 07071968 | 451 | Feminino | Parda | ... | 1377884 | 51.000000 | ||||||||
20284 | 1 | 2 | 26102019 | 1650 | 827 | 270000 | 15121997 | 421 | Masculino | Parda | ... | 1377885 | 21.000000 | ||||||||
20285 | 1 | 2 | 25102019 | 1450 | 02041940 | 479 | Masculino | Parda | ... | 1377886 | 79.000000 | ||||||||||
20286 | 1 | 2 | 24102019 | 0300 | 20102019 | 204 | Feminino | NA | ... | 24122019 | 3 | 18122019 | 2 | 1377887 | 0.010959 |
20287 rows × 88 columns
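The new IDADE_ANOS column is derived from the three-character IDADE code. In the rows shown above, `474` becomes 74 years and `204` becomes 0.010959 years (4 days divided by 365), which suggests that the first digit encodes the unit and the last two digits the amount. The sketch below illustrates that decoding. It is not the PySUS implementation: the function name `decode_sim_age_years` is hypothetical, and only unit codes 2 (days) and 4 (years) are confirmed by the output above; the other codes are assumptions about the SIM layout.

```python
import math


def decode_sim_age_years(code: str) -> float:
    """Convert a 3-character SIM IDADE code to an age in years.

    Confirmed by the table above: "4NN" is NN years, "2NN" is NN days.
    Assumed (not shown in this notebook): "1NN" is hours, "3NN" is
    months of 30 days, "5NN" is 100 + NN years.
    """
    text = code.strip()
    if len(text) != 3 or not text.isdigit():
        return math.nan  # blank or malformed code
    unit, amount = text[0], int(text[1:])
    if unit == "4":
        return float(amount)
    if unit == "5":
        return 100.0 + amount
    if unit == "3":
        return amount * 30 / 365
    if unit == "2":
        return amount / 365
    if unit == "1":
        return amount / 24 / 365
    return math.nan  # unit not handled in this sketch


print(decode_sim_age_years("474"))
print(decode_sim_age_years("204"))
```

Codes that cannot be decoded map to NaN here, which would be consistent with IDADE_ANOS holding fewer non-null values than the other columns.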
[5]:
df2.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20287 entries, 0 to 20286
Data columns (total 88 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 ORIGEM 20287 non-null string
1 TIPOBITO 20287 non-null string
2 DTOBITO 20287 non-null string
3 HORAOBITO 20287 non-null string
4 NATURAL 20287 non-null string
5 CODMUNNATU 20287 non-null string
6 DTNASC 20287 non-null string
7 IDADE 20287 non-null string
8 SEXO 20287 non-null category
9 RACACOR 20287 non-null category
10 ESTCIV 20287 non-null string
11 ESC 20287 non-null string
12 ESC2010 20287 non-null string
13 SERIESCFAL 20287 non-null string
14 OCUP 20287 non-null string
15 CODMUNRES 20287 non-null category
16 LOCOCOR 20287 non-null string
17 CODESTAB 20287 non-null string
18 ESTABDESCR 20287 non-null string
19 CODMUNOCOR 20287 non-null string
20 IDADEMAE 20287 non-null string
21 ESCMAE 20287 non-null string
22 ESCMAE2010 20287 non-null string
23 SERIESCMAE 20287 non-null string
24 OCUPMAE 20287 non-null string
25 QTDFILVIVO 20287 non-null string
26 QTDFILMORT 20287 non-null string
27 GRAVIDEZ 20287 non-null string
28 SEMAGESTAC 20287 non-null string
29 GESTACAO 20287 non-null string
30 PARTO 20287 non-null string
31 OBITOPARTO 20287 non-null string
32 PESO 20287 non-null string
33 TPMORTEOCO 20287 non-null string
34 OBITOGRAV 20287 non-null string
35 OBITOPUERP 20287 non-null string
36 ASSISTMED 20287 non-null string
37 EXAME 20287 non-null string
38 CIRURGIA 20287 non-null string
39 NECROPSIA 20287 non-null string
40 LINHAA 20287 non-null string
41 LINHAB 20287 non-null string
42 LINHAC 20287 non-null string
43 LINHAD 20287 non-null string
44 LINHAII 20287 non-null string
45 CAUSABAS 20287 non-null string
46 CB_PRE 20287 non-null string
47 COMUNSVOIM 20287 non-null string
48 DTATESTADO 20287 non-null string
49 CIRCOBITO 20287 non-null string
50 ACIDTRAB 20287 non-null string
51 FONTE 20287 non-null string
52 NUMEROLOTE 20287 non-null string
53 TPPOS 20287 non-null string
54 DTINVESTIG 20287 non-null string
55 CAUSABAS_O 20287 non-null string
56 DTCADASTRO 20287 non-null string
57 ATESTANTE 20287 non-null string
58 STCODIFICA 20287 non-null string
59 CODIFICADO 20287 non-null string
60 VERSAOSIST 20287 non-null string
61 VERSAOSCB 20287 non-null string
62 FONTEINV 20287 non-null string
63 DTRECEBIM 20287 non-null string
64 ATESTADO 20287 non-null string
65 DTRECORIGA 20287 non-null string
66 CAUSAMAT 20287 non-null string
67 ESCMAEAGR1 20287 non-null string
68 ESCFALAGR1 20287 non-null string
69 STDOEPIDEM 20287 non-null string
70 STDONOVA 20287 non-null string
71 DIFDATA 20287 non-null string
72 NUDIASOBCO 20287 non-null string
73 NUDIASOBIN 20287 non-null string
74 DTCADINV 20287 non-null string
75 TPOBITOCOR 20287 non-null string
76 DTCONINV 20287 non-null string
77 FONTES 20287 non-null string
78 TPRESGINFO 20287 non-null string
79 TPNIVELINV 20287 non-null string
80 NUDIASINF 20287 non-null string
81 DTCADINF 20287 non-null string
82 MORTEPARTO 20287 non-null string
83 DTCONCASO 20287 non-null string
84 FONTESINF 20287 non-null string
85 ALTCAUSA 20287 non-null string
86 CONTADOR 20287 non-null string
87 IDADE_ANOS 20286 non-null float64
dtypes: category(3), float64(1), string(84)
memory usage: 13.2 MB